----- Experience in your own room the magical nature of stereo sound -----

Recording & Rendering

Recording what we hear - A progression

QUESTION:
If a tree falls in a dark forest and no one is there to hear it, does it make any sound?

ANSWER:
No!

The falling tree sets huge numbers of air particles into oscillatory motion. They push on other air particles and cause a chain reaction that propagates away from the tree at the speed of sound. In this process mechanical energy is transformed into heat as the wave hits other objects, is reflected, diffused and absorbed.

If a person is in range of the air particle disturbance, then a few particles hit the left and right ear drum. This is registered in the brain and perceived as sound.
For evolutionary reasons it is important to recognize the nature of a sound source. The detailed shape of the external ear, i.e. the pinna and the ear canal, changes the strength of the sound wave at the ear drum depending upon the frequency of oscillation and the direction from which the air particles arrive. This is further enhanced by the sound shadowing of the head between the ears. The separation of the two ears causes a delay between the particles arriving at each ear drum when the source is not located in the median plane, the vertical plane that bisects the body. Thus, turning the head sideways or up and down changes the air particle strength at the ear drums.
The brain has evolved to process spectral, temporal and directional cues to form a mental picture of the origin of a sound, its direction, distance, size and nature. This is further enhanced by visual and tactile cues, and certainly by learning and memory.
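
The delay between the two ear drums described above can be estimated with a rough sketch. It assumes a simple straight-line path model that ignores diffraction around the head, an acoustic path length between the ears of 8 inches (the value given later for D₁), and a speed of sound of 343 m/s. The function name and constants are hypothetical and only serve the illustration.

```python
import math

SPEED_OF_SOUND = 343.0    # m/s, assumed (air at room temperature)
EAR_SPACING = 8 * 0.0254  # m, assumed acoustic path length between the ears

def interaural_delay(azimuth_deg, spacing=EAR_SPACING, c=SPEED_OF_SOUND):
    """Arrival-time difference in seconds for a distant source.

    azimuth_deg is measured from the median plane (0 = straight ahead).
    Simplified model: path difference = spacing * sin(azimuth).
    """
    return spacing * math.sin(math.radians(azimuth_deg)) / c

for az in (0, 30, 90):
    print(f"{az:3d} deg -> {interaural_delay(az) * 1e6:6.0f} microseconds")
```

Under these assumptions a source in the median plane gives no delay, and a source directly to one side gives roughly 0.6 ms, which is the order of magnitude the brain works with for directional cues.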

If a person or brain B1 observes and perceives a tree falling in the forest (B) and this sensation is to be transmitted to a second person B2 in a different location and at a different time, then several pieces of information have to be recorded. They would include the ear drum signals of B1 and the body motion V in response to the sound wave and any structure borne vibration. The ear drum signals would have to be applied to the ear drums of B2, and the motional signals from B1 would have to impart the same motion to B2. There are inherent errors in this transmission. For example, the outer ear of B1 does not match the outer ear of B2, but the brain of B2 is only experienced in processing sounds with its own ears. Also, the details in the neural wiring of brain B2 are likely to be sufficiently different from those of B1 that an exact duplication of perception is ultimately not possible.

Regardless of the seeming futility we try the next best approach which is binaural recording and playback. Here the person B1 is replaced by a dummy head with torso D. Microphones replace the ear drums. The outer ears and the shape of the head are an average of human forms as are the surface textures. The recorded dummy head signals should be applied to the ear drums of a listener B3. This is not without problems.

Ear canal phones (D) have wearer specific frequency response errors, can become physically uncomfortable and may isolate the person too much from the surroundings.
Circum-aural headphones (E) tend to have ear cup resonances. They are lessened with supra-aural headphones, but those may suffer from insufficient isolation from ambient sounds.
In all three cases there are individual outer ear dependent frequency response aberrations and for optimum performance they need to be individually equalized.

Binaural sound reproduction can be tonally and spatially very realistic except for localization in the frontal hemisphere. There it suffers from in-head localization. The soundstage is usually not perceived as being outside and in front of the head. I have been told that out-of-head localization can be learned, but have not spent enough time to find out if that also works for me. The in-head soundstage follows any head movement rather than being stationary. This provides a completely unnatural cue to the brain. It can be avoided by tracking the movement of the head and adjusting each ear signal according to the head's position relative to the soundstage. Video game consoles sometimes use this technique and in combination with a visual image they can give a realistic spatial rendering.

So the simplified transmission system (C) above can reproduce a certain number of cues sufficiently closely to create a fairly good illusion of a real event, but it lacks frontal localization and body related tactile inputs compared to (B).

If the dummy head recording is played back over loudspeakers in a reflection free room (G), then two new problems become apparent. The frequency spectrum of the recording had been modified by the external ear of the dummy head D and is modified again by the external ear of the listener B3. This causes a sound coloration that can be avoided by applying the inverse of the head-related transfer function for a given loudspeaker and listener setup to the microphone signals of D. In addition, the signal from the left loudspeaker impinges upon the left ear and the right ear. Similarly, the right speaker sends signals to both right and left ears. This crosstalk between the ears can be cancelled for a precisely fixed loudspeaker and listener setup. Compensating signals are added electronically to the left and right loudspeaker signals so that L_L cancels at the left ear the contribution from R_L. Correspondingly, R_R compensates for L_R. Alternatively, a wall can be placed between the speakers that extends forward to the head of B3. It physically blocks the crosstalk signals L_R and R_L. Both solutions confine the listener's head to a small region for the cancellation to be effective, and they do work well under anechoic playback conditions. The required setup conditions are hardly met in typical living rooms (H), where a multiplicity of loudspeaker sound reflections easily destroys the acoustic balancing act. While the brain tries to compensate, listening eventually becomes tiring.
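
The electronic crosstalk cancellation described above can be illustrated at a single frequency. This is only a sketch under strong assumptions: a perfectly symmetric, anechoic setup, one complex gain per acoustic path, and hypothetical path values. A practical canceller would need measured head-related transfer functions over the full frequency range, and the function names here are invented for the example.

```python
import cmath

def crosstalk_canceller(h_same, h_cross):
    """Return the 2x2 matrix that inverts a symmetric speaker-to-ear matrix.

    h_same:  complex gain from a speaker to the ear on its own side
    h_cross: complex gain from a speaker to the opposite ear (crosstalk path)
    """
    det = h_same * h_same - h_cross * h_cross
    return ((h_same / det, -h_cross / det),
            (-h_cross / det, h_same / det))

def apply(matrix, left, right):
    """Multiply a 2x2 matrix by the column vector (left, right)."""
    return (matrix[0][0] * left + matrix[0][1] * right,
            matrix[1][0] * left + matrix[1][1] * right)

# Hypothetical values at 1 kHz: the crosstalk path is about 6 dB weaker
# and arrives 0.25 ms later than the same-side path.
freq = 1000.0
h_same = 1.0 + 0j
h_cross = 0.5 * cmath.exp(-2j * cmath.pi * freq * 0.25e-3)

paths = ((h_same, h_cross), (h_cross, h_same))
canceller = crosstalk_canceller(h_same, h_cross)

# Dummy head signal present in the left channel only.
speaker_l, speaker_r = apply(canceller, 1.0, 0.0)
ear_l, ear_r = apply(paths, speaker_l, speaker_r)
print(abs(ear_l), abs(ear_r))
```

In this idealized case the left ear receives the full signal and the right ear receives essentially nothing. Any head movement or room reflection changes the real path gains away from the assumed ones, which is why the cancellation only holds in a small region under anechoic conditions.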

The dummy head is sometimes replaced by a sphere microphone. It is an 8" rigid sphere with omni-directional microphone capsules at the location of the ears, but with no pinna or ear canal.

Sphere recordings, though, are not the complete answer for commercially viable recordings. They tend to pick up too much of the reverberant sound field even when placed close to the performers. They depend upon adequate acoustics of the recording venue since they capture a spatial impression quite realistically.

Basically, though, accurate or realistic recordings have not been possible because of the absence of an accurate playback standard that could confirm such an achievement. With only loudspeakers of greatly variable performance to choose from - and interacting unpredictably in their unavoidably unique acoustic surroundings - recording practices substituted exaggerated clarity as a suitable and achievable goal.

That emphasis has continued to prejudice and influence recording techniques to this day. Therefore it is common practice to use a multitude of microphones to highlight individual performers or instrument groups. Recordings are done in studios for full control over the reverberant sound. Lost in the process is any sense of a coherent acoustic space, of the venue acoustics, in the recording. Instead, the space in which the sound occurs is often chopped up into isolated lumps.

Typical loudspeakers in typical room setups are not capable of reproducing a full spatial impression even if the cues for it are embedded in the recording. This is due to the room reflections, which in turn are a function of the polar response of the loudspeakers and their placement in the room. Thus what has been done to the spatial aspect in the recording process goes largely unnoticed during playback. Even the recording/mixing/mastering engineer was probably not aware of the consequences of his decisions because the typical monitor loudspeakers are not up to the task of telling him. Loudspeakers must be either full range omni-directional or dipolar, be placed away from reflecting surfaces, and be placed symmetrically to the room boundaries. In that configuration the brain can disassociate the room reflected sound from the loudspeaker direct sound and fully use the spatial cues in the direct sound to form spatial impressions and localization of phantom sources.

It has been observed that quite realistic recordings can be made by simply placing small omni-directional microphone capsules on the frame of eye glasses just in front of the pinna. Recordings in a concert hall (I) with live audience exhibit clarity and spatial realism, even when the seat is far from the orchestra. But they sound distant because the reverberant sound dominates so strongly and because of the extraneous noises from nearby audience members. If anything, though, this adds to the naturalness. I have been impressed by how the live direct sound from the orchestra seems to dominate the perception while the hall sound is merely heard as providing envelopment, even when my seat is in row Y and far from the stage. It appears that a large amount of decorrelation of reverberant sound from direct sound is taking place in the brain. This seems to be similar to what can happen in a living room with proper loudspeakers and setup.

The microphones on the head are obviously similar to a sphere microphone. They have the same advantages and drawbacks. They do not replicate exactly what is perceived in the brain of a listener at that location. They do give a realistic spatial impression but lack the perceived intimacy of the live event.

It seems that two microphones cannot record a spatial impression and intimacy in the right proportions at the same time. Thus the idea emerged of separating these two tasks, and with it the expectation that a coherent recording could be accomplished with four microphones.

The microphone setup (K) consists of two main microphones L_M and R_M and two ambient microphones L_A and R_A. The main microphones are super-cardioids (Schoeps MK 41) for well controlled off-axis frequency response. They are separated by D₁ to duplicate the acoustic path length between the ears. They are angled to cover the width of the sound source while being placed further from it than is usual. The overall aim is to record an audience perspective of the source. The ambient microphones are omni-directional (Schoeps MK 2S). They are placed sufficiently far behind the main microphones, D₂, and apart from each other, D₃, to be decorrelated from each other. Typical dimensions could be D₁ = 8 inches, angle a = 110 degrees, D₂ = 30 feet, D₃ = 40 feet.
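
To get a feel for the typical dimensions above, they can be converted into sound travel times. This is a simple arithmetic sketch that assumes a speed of sound of 343 m/s; the helper name is hypothetical. The travel time over D₂ equals the arrival delay at the ambient microphones only for a wavefront moving straight from the main microphones toward them.

```python
SPEED_OF_SOUND = 343.0  # m/s, assumed
INCH = 0.0254           # m
FOOT = 0.3048           # m

def travel_time_ms(distance_m, c=SPEED_OF_SOUND):
    """Time in milliseconds for sound to travel distance_m."""
    return 1000.0 * distance_m / c

d1 = 8 * INCH    # spacing of the main microphones
d2 = 30 * FOOT   # ambient microphones behind the main microphones
d3 = 40 * FOOT   # spacing between the ambient microphones

for name, d in (("D1", d1), ("D2", d2), ("D3", d3)):
    print(f"{name}: {d:6.3f} m, {travel_time_ms(d):5.2f} ms")
```

With these assumptions D₁ corresponds to about 0.6 ms, comparable to the delay between the ears, while D₂ and D₃ correspond to tens of milliseconds, which gives a sense of the temporal separation between the main and ambient signals.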

The main and ambient microphone outputs are summed in each channel (L). The ambient output level is adjusted to provide a realistic balance when played back over ORION++ loudspeakers.
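
The summation in each channel can be sketched as a sample-by-sample addition with an adjustable ambient gain. The gain of -6 dB and the sample values below are purely hypothetical; the text only states that the ambient level is adjusted by ear for a realistic balance.

```python
def mix_channel(main, ambient, ambient_gain_db):
    """Sum one main and one ambient microphone signal, sample by sample."""
    gain = 10.0 ** (ambient_gain_db / 20.0)  # convert dB to a linear factor
    return [m + gain * a for m, a in zip(main, ambient)]

# Hypothetical sample values standing in for L_M and L_A.
l_main = [0.50, -0.25, 0.10]
l_ambient = [0.20, 0.20, -0.40]

left = mix_channel(l_main, l_ambient, ambient_gain_db=-6.0)
print(left)
```

The right channel would be formed the same way from R_M and R_A, with the same gain so that the balance between the channels is preserved.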

The 4-microphone configuration is being tested at this time by Don Barringer. Initial recordings of organ and chorus in Washington National Cathedral gave convincing evidence of the correctness of the configuration. The main microphones were 80 feet from the organ.

Recording and Environment

All acoustical events take place in an environment. Those environments can be very different, like the relatively open space of the falling tree, the large and closed space of a concert hall, a restaurant or your bathroom. We perceive the specifics of the environments from the multitude of reflections of air particles that occur when they strike boundaries in their path of propagation. Reflections are always delayed relative to the time that air particle motion takes to propagate directly from source to receiver. Thus any acoustic event that is perceived as sound has at least two elements to it: the direct signal and the reflections from the environment. Our brain is thoroughly used to processing that mix of information, and we may become aware of the two elements by paying attention.
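
That a reflection always arrives after the direct sound follows from geometry: the reflected path is never shorter than the direct one. A minimal sketch for a single flat wall, using a mirror image of the source, is shown here. The positions, the function name and the speed of sound of 343 m/s are assumptions for illustration only.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed

def reflection_delay_ms(source, receiver, wall_x=0.0, c=SPEED_OF_SOUND):
    """Delay of one wall reflection relative to the direct sound, in ms.

    source and receiver are (x, y) positions in meters on the same side
    of a flat wall along the line x = wall_x.
    """
    # Mirror the source in the wall; the reflected path length equals
    # the straight-line distance from this image to the receiver.
    image = (2.0 * wall_x - source[0], source[1])
    direct = math.hypot(receiver[0] - source[0], receiver[1] - source[1])
    reflected = math.hypot(receiver[0] - image[0], receiver[1] - image[1])
    return 1000.0 * (reflected - direct) / c

delay = reflection_delay_ms(source=(1.0, 0.0), receiver=(1.0, 3.0))
print(f"reflection arrives {delay:.2f} ms after the direct sound")
```

For a source and a listener each 1 m from the wall and 3 m apart, the reflection trails the direct sound by a little under 2 ms. Real environments add many such reflections from many surfaces.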

The 4-microphone setup captures the direct and reflected signal spectra with temporal separation by placing the microphones into different parts of the acoustic space. Thus, when played back over two loudspeakers, stronger cues are presented to the brain about the acoustic source and its environment than a two microphone setup could provide, which captures both elements simultaneously and in difficult to control proportions. The ratio of direct to ambient reflected signals is very different for a musician in the orchestra, the conductor in front of the orchestra or a member of the audience behind the conductor in the concert hall. Underlying any recording is a decision as to which perspective to present to the loudspeaker listener. Is it the musician's perspective, the conductor's, the audience's or none of the above and instead some artificial perspective that serves a particular purpose? The 4-microphone setup appears to be well suited to capture the audience perspective and might be called the "D+R Stereo" technique for capturing direct and reflected sounds separately.

Sound stream segregation

In natural hearing the evolutionary and adaptive processor between the ears is capable of segregating a direct sound stream from a reflected sound stream and from a structure borne vibration. We are able to perceive the direction from which sound is coming, the distance of the source and even its size primarily from the direct sound. The reflected sound can enhance or detract from this information and we can perceptually remove it, if it is not relevant to the source information. This works in a wide range of acoustic environments.

For example, think of having a conversation with another person at a cocktail party. At the same time there are many conversations going on around you. This can make it difficult to understand your partner. The sound stream that is coming from her to your left and right ears tends to fuse with the many sound streams that are happening around you and are also impinging on your ears. You may step closer to her to increase the volume of this sound stream relative to the other streams which then become to you the noisy background for your conversation. On the other hand you may be also interested in the conversation between X and Y over in the corner of the room. As you focus in your mind on that conversation you pick up segments of that specific sound stream and make sense of them, and all this while you are having a conversation here. Your brain is multi-tasking.

I can say from personal experience that this is a difficult process if you are not intimately familiar with the language that is spoken around you. My native language is German and it took me years of exposure to English to be able to process sound in the acquired language as easily as in my native language. Certainly, language cognition is one element of the "cocktail party effect". Other elements are the timing differences between left and right ear signals due to the direction from which a sound stream arrives, the envelope modulation depth of the X-Y sound stream relative to the background sound streams or noise, and the timing of the X-Y stream segments relative to your conversation. It is difficult to hear while you are talking.

The venue in which the cocktail party takes place also has an effect on the ease of conversation. If it is a large hall with highly reflective surfaces and long reverberation time, then distant conversations lose envelope modulation depth and are difficult to understand even though the volume level in the hall may not be that high. The long reverberation fuses distant streams. If the room is small and there are many people, then the modulation depth of a more distant sound stream becomes low even when its volume is high and thus fewer segments are heard and recognized.
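
The loss of envelope modulation depth can be illustrated with a toy calculation. It assumes that the envelope is treated as an intensity and that the other conversations or the reverberation act as a steady added level, which is a strong simplification. The definition of modulation depth used here, (max - min) / (max + min), and all numbers are assumptions for the sketch.

```python
import math

def modulation_depth(envelope):
    """Modulation depth of an intensity envelope: (max - min) / (max + min)."""
    hi, lo = max(envelope), min(envelope)
    return (hi - lo) / (hi + lo)

# Hypothetical speech-like envelope with 80 % modulation around a mean of 1.
speech = [1.0 + 0.8 * math.cos(2.0 * math.pi * n / 16) for n in range(16)]

# The same stream with a steady background of equal mean intensity added.
with_background = [s + 1.0 for s in speech]

print(modulation_depth(speech), modulation_depth(with_background))
```

Under these assumptions a background of equal mean intensity halves the modulation depth, even though the combined level is higher than that of the stream alone.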

When a recording is made, the microphones capture the direct sound stream from the sound sources and the reflected sound stream from the recording venue.

When the recording is played back in a room, then the listener is exposed again to two sound streams, the direct sound from the two loudspeakers and the room reflected sound. Given that the loudspeakers have uniform directivity, we apparently can perceptually segregate the two streams to a large degree. This was recognized in the ORION and PLUTO comparison.

Embedded in the direct sound from the two loudspeakers are the direct sound stream that the microphones received and the recording venue's reflected stream. Thus during playback the processor between the ears is asked to deal with four streams of acoustic information: the direct loudspeaker sound and its reflection in the listening room, and the direct microphone signal and its reflection in the recording venue.

The proposed four microphone technique captures direct and reflected sound streams in the recording venue with time and intensity separation between them. Mixing the two streams in optimal proportion before playback should give the listener's brain stronger cues for constructing an illusion about the original sound sources in their acoustic space, similar to what a person in the audience in that space would hear live. But, just as language familiarity is helpful in the cocktail party effect, so familiarity with live acoustic events helps in recognizing and appreciating spatial characteristics in a recording. Most recording techniques aim for clarity first. Spatiality is secondary and often synthesized, which is readily recognized over ORION or PLUTO.

With four microphones both clarity and spatiality should be captured. This is not an attempt at surround sound. The sound stage in the listening room will always be behind the loudspeakers and in that sense it is a spatial distortion relative to the recording situation. It is an easily accepted distortion because of familiar elements in it.