Notes on Yost Chapter 14: 18 April 2003
Auditory perception and sound source determination pp. 207 – 225.

The basic problem of this chapter is how the listener segregates and aggregates the multiple frequencies that strike the ear into separate and distinct sources of sound. This is obviously a more difficult problem than the one faced by the visual system (though the determination of objects by their reflection of light is more difficult than it would appear, in spite of there being a two-dimensional receptor surface and a third dimension supplied by the two eyes). The problem of course is that all of the frequencies play out in two dimensions (frequency and amplitude) entirely interleaved on a two-dimensional surface, or so it would seem (as in Figure 14.1).

Yost provides seven potential cues to segregation and grouping: 1) Spectral separation; 2) Spectral profile; 3) Harmonicity; 4) Spatial separation; 5) Temporal separation; 6) Temporal onset and offset; 7) Temporal modulation. Perhaps these could be grouped into three subsets, of similarity in (A) timing, in (B) location, and in (C) pitch; and perhaps another could be added, which would be (D) continuity in pitch or time.

For Yost spectral separation is simply that if we hear two different frequencies then they could be assigned in principle each to a different source. He writes (page 209) that frequency separation is not a strong contributor to source identification for real world complex sounds, as shown in Figure 14.1: we do not put the top two tones together as separate from the two bottom tones. Bregman (who has written a lot on complex auditory processing, and is cited by Yost in the supplement to Chapter 14) has shown that temporal properties of stimuli have a strong effect on whether different frequencies are heard together or not. He has a demonstration (Bregman and Pinker 1978) that goes something like this: Tone "a" is presented, and is alternated with the simultaneous occurrence of "b and c", with a very brief break between them. This sounds like the alternation of a pure tone "a" with a complex tone "bc", which of course it is: nothing surprising. But the next part of the demonstration depends on "a" and "b" being maintained as separate and usually discriminably different tones, but placed increasingly closer together on the frequency dimension. Now the pattern is heard as a repeating tone "a" (so "a" and "b" have been merged), which occurs at twice the rate of the repeating tone "c". So "b" is apparently heard as being "a". I think Figure 14.5 fits here as well, in which alternating tones are heard as continuing percepts of two separate sources, rather than a single alternating tone (which would be heard if there were a glide from one to the other). The general point of these demonstrations is that even the perception of simple tones can be affected by a broader context. Bregman thinks that the shared temporal features of the two tones are the critical factor.
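
To make the demonstration concrete, here is a rough Python (numpy) sketch of a Bregman-and-Pinker-style stimulus; the particular frequencies, durations, and gaps are my own arbitrary choices, not values from Yost or Bregman. Moving "b" close to "a" in frequency is what produces the a-b stream running at twice the rate of "c".

    import numpy as np

    sr = 44100              # sample rate (Hz); arbitrary
    dur = 0.1               # duration of each tone burst (s)
    gap = 0.02              # silent gap between bursts (s)

    def burst(freqs):
        """One tone burst containing the sum of the given frequencies, with 5 ms ramps."""
        t = np.arange(int(sr * dur)) / sr
        y = sum(np.sin(2 * np.pi * f * t) for f in freqs) / len(freqs)
        ramp = np.minimum(1.0, np.minimum(t, t[::-1]) / 0.005)
        return y * ramp

    def sequence(freq_a, freq_b, freq_c, cycles=10):
        """Tone a alternating with the simultaneous pair b + c."""
        silence = np.zeros(int(sr * gap))
        cycle = np.concatenate([burst([freq_a]), silence, burst([freq_b, freq_c]), silence])
        return np.tile(cycle, cycles)

    far   = sequence(1000, 2000, 2500)   # a and b far apart: heard as "a" alternating with "bc"
    close = sequence(1000, 1050, 2500)   # a and b close: an a-b stream at twice the rate of c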

Spectral profile is a concept largely the invention of Dave Green (a distinguished hearing scientist, once at Harvard, now retired from Florida). He shows that in a very complex sound mixture having as many as 10 separate tones we can detect one tone that increases in relative level from one presentation to the next, under some conditions as easily as if that tone were presented by itself, even though the overall level of the whole complex goes up or down from presentation to presentation. This indicates that we can easily monitor relative levels in a complex, so that the one tone that is different in level "pops out" from the rest (see Figures 14.2, 14.3, and 14.4). This idea of "popping out" is central to most of this work, which shows that common manipulations of one kind or another applied to the group of stimuli serve to isolate a particular signal and move it away from the masking background. Here Yost is using our ability to pick up on relative levels to imagine that we can tell the difference between sound sources on the basis of their relative spectra – you could imagine that this is a very simplified version of our being able to tell the difference between different people by the balance of different frequencies in their voices. So in Figure 14.2 you can see that the spectra along the diagonal from upper left to lower right are the same, though the overall level changes. Figure 14.3 is easy to see as the basic condition used by Dave Green in his study of profile analysis, but the outcome of Figure 14.4 is difficult to understand.
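
As a concrete illustration, here is a minimal Python sketch of a profile-analysis trial of the sort Green used; the log-spaced 11-component complex, the 2 dB increment, and the ±10 dB level rove are my own illustrative choices (only the 1 kHz standard and the 60 dB starting level come from the discussion below).

    import numpy as np

    sr = 44100
    dur = 0.5

    def complex_tone(freqs, offsets_db, overall_db):
        """Sum of sinusoids; each component's level is overall_db plus its offset (dB, arbitrary reference)."""
        t = np.arange(int(sr * dur)) / sr
        y = np.zeros_like(t)
        for f, off in zip(freqs, offsets_db):
            y += 10 ** ((overall_db + off) / 20.0) * np.sin(2 * np.pi * f * t)
        return y

    freqs = np.geomspace(200, 5000, 11)       # 11 log-spaced components; the middle one is 1000 Hz

    # Overall level is roved on every interval, so only the relative profile is informative.
    standard = complex_tone(freqs, [0.0] * 11, 60 + np.random.uniform(-10, 10))

    offsets = [0.0] * 11
    offsets[5] = 2.0                          # increment only the middle (1 kHz) component
    signal = complex_tone(freqs, offsets, 60 + np.random.uniform(-10, 10))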

The left-hand ordinate is not easy to understand, as it indicates that the threshold for detecting an increment in a 1 kHz tone is about –21 dB, whereas in Figure 10.7 (and it is 10.7, not 10.8 as Yost indicates) the value for increment detection in a 60 dB stimulus (which this was) is just slightly less than +1 dB. So what is going on? Apparently Dave Green chose a transformation that made his effect look very big: he expressed the increment threshold in log units of the relative increment in pressure. If you start with the right-hand ordinate then the threshold increment is about 0.8 dB, which means that we can just detect the difference between a 60 dB 1000 Hz signal and a 1000 Hz signal set at 60.8 dB. So then we work back and forth through the logarithmic relationships: to figure out the pressure in a 60 dB signal we write out 20log(Ps/Pr) = 60, so log(Ps/Pr) = 3, so (Ps/Pr) = 1000; then to figure out the pressure in the noticeably different signal we write out 20log(Pd/Pr) = 60.8, so log(Pd/Pr) = 3.04, so (Pd/Pr) = 1096, and the increment in pressure must be about 10%. Then in order to express this 10% increment in dB values we write 20log(.1) = –20, and, voila, this is just about the –21 dB seen for the asterisk in Figure 14.4. Somehow this seems more than a little misleading!

But after this things get more interesting. What happens to our ability to detect an increment in the 1000 Hz signal when there are 4 tones surrounding the comparison tone? Perhaps not surprisingly, the addition of more stimuli makes it difficult to pick out the critical comparison sinusoid. The threshold goes up, to about –8 dB on the left ordinate of Figure 14.4: but the antilog of (–8/20) = antilog(–0.4) = about 0.40, which gives us the relative increase in pressure (a 40% increase), so the level of the just-detected stimulus in the 5-tone complex is 20log(1400) = 62.9 dB, a difference of about 2.9 dB over the standard; this means that a 60 dB stimulus would have to be increased to about 63 dB in order to be detected. If we went back to Figure 10.7 the new value would be (about) +3 instead of (about) +1, which is close to the right-hand ordinate in Figure 14.4. The roughly 2 dB increase maybe doesn't seem so impressive, compared to the other way of doing the calculation, which yields about 13 dB.

But the real point of this experimental result is to understand why performance improves when there are 11 sinusoids surrounding the center standard. The presentation on pages 211 and 212 is not transparently clear, but the suggestion is something like this: the broader the bandwidth, up to a point, the more readily the overall spectral content can be encoded, and the more readily a slight shift in the profile can be detected, as long as the several tones do not interfere with each other by being very close together. He suggests that when the number of tones continues to increase, the tones become too close together to resolve, and so the difficulty of detecting the level of the target increases. Perhaps the important feature of this demonstration is that this process may be at work when we sort through the subtle spectral shifts that accompany different sound locations in 3-dimensional space. Remember the notches in the spectrum that shift according to how a complex noise hits the pinna. I do not think this implication has been put to a direct test, which could involve, perhaps, varying the number of separable components in the signal.
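
Here is the same arithmetic as a few lines of Python, just to check the back-and-forth between the two ordinates of Figure 14.4; the 0.8 dB and –8 dB values are read off the figure as described above, and everything else follows from the definition dB = 20log(Ps/Pr).

    import math

    def pressure_ratio(db):                 # dB -> Ps/Pr
        return 10 ** (db / 20.0)

    def db_of_ratio(ratio):                 # Ps/Pr -> dB
        return 20.0 * math.log10(ratio)

    standard_p = pressure_ratio(60.0)                          # = 1000

    # Tone-alone case: a 0.8 dB increment threshold (right-hand ordinate)
    delta = (pressure_ratio(60.8) - standard_p) / standard_p   # about 0.10, i.e. a 10% pressure increment
    print(db_of_ratio(delta))                                  # about -20 dB, cf. -21 dB in Figure 14.4

    # Five-tone case: -8 dB on the left-hand ordinate
    delta5 = pressure_ratio(-8.0)                              # about 0.40, a 40% pressure increment
    print(db_of_ratio(standard_p * (1 + delta5)) - 60.0)       # about 2.9 dB over the 60 dB standard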

Yost then turns briefly to consider the idea that harmonicity may serve to segregate sound objects: he concludes that it apparently does not work as a segregating device, though one might think it should – a musical instrument (or the human voice) often consists of a set of harmonics, and for this reason could be thought to be heard as a unitary sound. Yost gives an example in which the listener is presented with two complex tones with different fundamentals but interleaved components (maybe 200, 400, 600, etc. presented together with 333, 666, 999, etc.). Perhaps it is one source, perhaps two or more: but how would you know how to group them? Well, most everyone agrees, group the stuff that comes and goes together, so it might become a single very complex source. The idea that virtual pitch (of the missing fundamental) is the same as "real pitch" unfortunately breaks down here, and this six-tone complex does not sound like two tones (i.e., 200 and 333 – or how about the 6th and 10th harmonics of a 33.3 Hz tone!), but a discordant noise. It must be (one might think) that the complicated inter-pulse intervals between all of these tones cannot be simply broken out by the nervous system.
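
A tiny Python sketch of the interleaved-harmonics example (the 200 Hz and 333 Hz fundamentals are from Yost's example; the three-harmonic limit and one-second duration are my simplifications):

    import numpy as np

    sr = 44100
    t = np.arange(sr) / sr                   # one second

    def harmonics(f0, n=3):
        """First n harmonics of fundamental f0, equal amplitude."""
        return sum(np.sin(2 * np.pi * f0 * k * t) for k in range(1, n + 1)) / n

    # 200 + 400 + 600 Hz mixed with 333 + 666 + 999 Hz: heard as a discordant whole,
    # not as a 200 Hz pitch plus a 333 Hz pitch.
    mixture = 0.5 * (harmonics(200) + harmonics(333))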

However, it is certainly important that if these two complex patterns are presented separately to the two ears then two pitches can be heard. This suggests the hypothesis that virtual pitch is analyzed somewhere before the binaural level of the auditory system (the cochlear nucleus perhaps). There are also data indicating that two pitches can be heard if the two complexes begin at different times. Yost gives another interesting example, of "mistuning" one of the harmonics of a complex tone by more than about 8%: when this happens it stands out against the others as a separate pitch. This, he suggests elsewhere, may make it possible to discriminate among different voices. It is also a case in which harmonicity does serve to organize objects.
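
A minimal sketch of the mistuned-harmonic case (the roughly 8% mistuning is from the text; the 200 Hz fundamental, six components, and the choice to mistune the 4th harmonic are my own):

    import numpy as np

    sr = 44100
    t = np.arange(sr) / sr
    f0 = 200
    freqs = [f0 * k for k in range(1, 7)]    # 200, 400, ..., 1200 Hz
    freqs[3] *= 1.08                         # mistune the 4th harmonic: 800 -> 864 Hz (about 8%)
    complex_tone = sum(np.sin(2 * np.pi * f * t) for f in freqs) / len(freqs)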

Spatial separation clearly works to segregate sound objects, as in the binaural masking level difference of Chapter 12. Sounds from different places are almost by definition different sources (unless they are echoes, which is another story). And note that when they are heard as separate objects they do not mask each other. This seems to be a very important rule, one that is seen in the other examples of this chapter. However, Yost suggests that spatial location provides only a weak way of segregating sound objects (page 213), suggesting that first we make a distinction about what an object is, and then use spatial cues to say where it is: the "what" comes first as a separate function. So we readily resolve instruments in a monaural recording.

Simultaneous onsets and offsets are similarly powerful indications of a single source, and types of onset are very informative in other ways as well – fast versus slow onsets can identify different musical instruments, for example. So temporal separation becomes a means of segregating objects, as part of the basis for the perceptual effect seen in Figure 14.5, in the demonstration that alternation of two pitches yields streaming or fusion depending on the rate of alternation. Temporal pattern is also relevant to the effects shown in Figure 14.6 and Figure 14.7, in the little section about "informational masking." In the experiment that led to Figure 14.6 a 10-note melody is repeated either exactly or with one note shifted in frequency, and the subject is asked whether the two melodies are the same or different. When the melody is changed from trial to trial then discrimination of the frequency of the single tone is terrible, especially for notes at the beginning of the sequence. But this must be an auditory memory problem, it would seem, because when the melody is the same from trial to trial the discrimination is very good, perhaps not much different from listening to just a single tone (compare Figure 10.6, for example). The meaning of this demonstration as given in Yost's last paragraph is a bit obscure in this context; but I do think that it suggests that for familiar patterns (familiar voices perhaps?) subtle spectral shifts can be detected that would not be apparent in unfamiliar patterns; and these might carry information about phonemes and words, etc. Then another strange masking situation is described on page 216, a task in which a tonal signal is masked by different tones, as few as two and as many as 100 other tones. There seems to be a lot more masking for 2 tones than for many tones, and this suggests that uncertainty in the masker may be very important in determining the degree of masking. (Also confusing is the typo of "master" for "masker" several times in the left column of page 216.) Perhaps we can think of a Scharf-like attentional explanation of this "roving" masker effect.
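
Here is a rough Python sketch of the melody task behind Figure 14.6, as I understand it; the note range, durations, and the 5% shift are my own guesses, not the actual stimulus values.

    import numpy as np

    sr = 44100
    rng = np.random.default_rng(1)

    def melody(freqs, note_dur=0.1):
        """Concatenate pure-tone notes at the given frequencies."""
        t = np.arange(int(sr * note_dur)) / sr
        return np.concatenate([np.sin(2 * np.pi * f * t) for f in freqs])

    notes = rng.uniform(300, 3000, 10)       # a new random 10-note "melody" each trial
    shifted = notes.copy()
    shifted[0] *= 1.05                       # shift one note (the first, where performance is worst)

    standard   = melody(notes)
    comparison = melody(shifted)             # task: are the two sequences the same or different?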

Common temporal modulation is a powerful cue to identifying sound objects, though it may be best seen as a subtype of onsets and offsets rather than, as Yost has done, as a separate category of stimulation. Yost has two examples, both involving masking, and both pretty interesting. In the first, adapted from an experiment reported by Hall, Haggard, and Fernandes (1984), and shown in Figure 14.7, he describes (A) a tonal signal masked by a narrow band of noise (the target band) that is amplitude modulated at a rate determined by its bandwidth, and the threshold for the signal is raised by 40 dB. Then in the next part (B) he introduces another band of noise far from the signal, which is itself modulated but at another rate. Because this second band of noise is very far away from the signal it should not provide any energy in the critical band of the signal and, therefore, the degree of masking is still (about) 40 dB: no surprises. But then finally, in (C), the same two noise bands are modulated at the same rate, and now masking is disrupted, so that the degree of masking drops by about 10 dB: this is a big effect! Why does this work? The masking noise still adds the same amount to the critical band, of course, so there is no reason to think that there should be any masking relief. First Yost presents a common explanation, that the tone can be heard in the low-level dips of the modulated masker. But this would happen in condition A as well as C, and the argument that the dips in the second noise allow us to better detect the dips in the first noise seems rather lame. My loose assumption is that the two masking bands are now grouped as the same object on the basis of their common modulation pattern, and this grouping relieves their masking effect on the uncorrelated signal. This seems to be the hypothesis favored by Hall et al., and is suggested to be an important cue used in the "cocktail party effect" in addition to binaural localization. But in fact this effect, which is called "comodulation masking release," is not well understood, save that it is clear that a simple theory based on activity in critical bands at the level of the basilar membrane is not going to provide an explanation for CMR. Instead a very high order of unmasking seems called for, so that sounds assigned to different sources on the basis of common amplitude modulation do not mask each other, just as sounds from different places do not mask each other.
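
A sketch of the three CMR conditions in Python; I have idealized the envelopes as 10 Hz and 17 Hz sinusoids and picked the band centers and widths myself, so this is only meant to show the logic of conditions A, B, and C, not to reproduce the Hall, Haggard, and Fernandes stimuli.

    import numpy as np

    sr = 44100
    t = np.arange(int(sr * 0.5)) / sr
    rng = np.random.default_rng(0)

    def narrowband_noise(center, bw, n=20):
        """Crude narrowband noise: n random sinusoids within +/- bw/2 of the center frequency."""
        freqs = rng.uniform(center - bw / 2, center + bw / 2, n)
        phases = rng.uniform(0, 2 * np.pi, n)
        return sum(np.sin(2 * np.pi * f * t + p) for f, p in zip(freqs, phases)) / n

    same_env  = 1 + 0.8 * np.sin(2 * np.pi * 10 * t)       # shared envelope
    other_env = 1 + 0.8 * np.sin(2 * np.pi * 17 * t)       # independent envelope

    signal      = 0.1 * np.sin(2 * np.pi * 1000 * t)       # tone to be detected
    target_band = same_env  * narrowband_noise(1000, 100)  # on-frequency masker
    flank_indep = other_env * narrowband_noise(3000, 100)  # remote band, different modulation
    flank_comod = same_env  * narrowband_noise(3000, 100)  # remote band, same modulation

    condition_A = signal + target_band                     # baseline: about 40 dB of masking
    condition_B = signal + target_band + flank_indep       # remote band: masking unchanged
    condition_C = signal + target_band + flank_comod       # comodulated: masking released by about 10 dB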

Yost must like the effect because it is similar to a type of experiment he does himself. In Figure 14.8 he presents evidence from one of his own very complicated experiments on a phenomenon called "modulation detection interference." First (A) the listener has to detect a change in the depth of amplitude modulation of a single tone (here 4 kHz), and a threshold is determined. Then (B) an unmodulated tone (here 1 kHz) is added to the task, and it is shown that the threshold for detecting a change in modulation depth of the 4 kHz tone is unaffected (1 kHz is too far away to mask 4 kHz). Then (C) the 1 kHz tone is modulated at the same rate and phase as the 4 kHz tone, and now the threshold for detecting a change in modulation depth of the 4000 Hz stimulus goes way up: the modulation of 1 kHz interferes with the detection of a change in modulation of 4 kHz. What could be the reason for this effect? Perhaps both frequencies are fused into a single object because of their common modulation rate, so that a change in one of them cannot be separately discriminated. So in (D) Yost modulated the stimuli at different rates, and found that the threshold dropped back to more or less its original level; well, maybe about _ of the effect goes away. So the theoretical argument is that having common temporal features is sufficient to bind stimuli together across a wide range and separation of frequencies. This is the other side of the Hall et al. experiment: here, giving the target and the added tone similar modulation rates increases the interference, whereas there, giving the two masker bands a common modulation released the signal from masking. On the other hand, it could be imagined that the critical feature is not tonal frequency but modulation frequency, and that some cells are sensitive to modulation frequency regardless of their tonal best frequency. But in any event, it extends listening across spectral components, something no doubt important at higher levels of the nervous system.
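
And a companion sketch of the four MDI conditions; the 4 kHz and 1 kHz carriers come from Yost's figure, but the 10 Hz and 37 Hz modulation rates and the 0.5 modulation depth are my own stand-in values.

    import numpy as np

    sr = 44100
    t = np.arange(int(sr * 0.5)) / sr

    def sam_tone(carrier, rate, depth):
        """Sinusoidally amplitude-modulated tone."""
        return (1 + depth * np.sin(2 * np.pi * rate * t)) * np.sin(2 * np.pi * carrier * t)

    target = sam_tone(4000, 10, 0.5)                   # the listener tracks this tone's modulation depth

    cond_A = target                                    # target alone: baseline threshold
    cond_B = target + np.sin(2 * np.pi * 1000 * t)     # unmodulated 1 kHz added: threshold unaffected
    cond_C = target + sam_tone(1000, 10, 0.5)          # 1 kHz modulated at the same rate: threshold rises
    cond_D = target + sam_tone(1000, 37, 0.5)          # different rate: much of the effect goes away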

Then lastly we turn to speech, about which Yost says very little, save for: the motor gestures that are responsible for phonemes (the place along the vocal tract that modifies the vibration produced by the vocal folds, and the manner in which the vibration is modified); the notion of the phoneme as the fundamental unit of speech; the acoustic (spectral and temporal) properties of speech as shown in a speech spectrogram; the notion of formant transitions that indicate a shift in the vocal tract; and some thoughts about how one might study the cues important in speech perception. This is unfortunately slim pickings with which to end the chapter. But looking back over the work we have seen along the way, we have covered a number of investigations concerned with speech perception and auditory function: neural coding for speech sounds, for example, in the stability of phase locking to formants as opposed to rate-intensity functions with their asymptotic loss of discrimination; the effects of changes in temporal encoding, masking, and frequency analysis; and some effects of brain damage to auditory regions. Audition is critically important for speech perception, but, most obviously, there is more to speech analysis than is provided solely by audition.