The Sound of HD Conferencing

The videoconferencing solutions available today are vastly improved compared to those available just a few short years ago. Videoconferencing, in fact, is moving toward a meeting experience that is ever more "realistic" and approaches that of a true in-person, face-to-face encounter. Many of the recent improvements, and much of the industry hype, center on "high definition" (HD), which provides greatly improved video. But what about the audio?

Many vendors have taken liberties with the term "HD" in a variety of ways, including marketing support for HD IP telephones (with audio response to 7 kHz), HD videoconferencing systems (14 kHz), and multimedia sound systems (22 kHz). In the PC world, Intel has been promoting "HD audio" for consumer applications, which is capable of delivering up to eight channels at 192 kHz/32-bit quality. The earlier Intel AC'97 specification supports only six channels at 48 kHz/20-bit. So the term HD is used loosely, and can be somewhat confusing.
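To put the two Intel figures in perspective, the uncompressed data rate of a PCM stream is simply channels times sample rate times bits per sample. Note that the Intel numbers are sampling rates, whereas the 7, 14, and 22 kHz figures quoted for conferencing gear are audio bandwidths, so the two sets of numbers are not directly comparable. The short Python sketch below works through the arithmetic; the function name is our own and is used only for illustration.

```python
def pcm_bit_rate(channels: int, sample_rate_hz: int, bits_per_sample: int) -> int:
    """Uncompressed PCM bit rate in bits per second."""
    return channels * sample_rate_hz * bits_per_sample


# Maximum configurations quoted in the text above.
hd_audio = pcm_bit_rate(channels=8, sample_rate_hz=192_000, bits_per_sample=32)
ac97 = pcm_bit_rate(channels=6, sample_rate_hz=48_000, bits_per_sample=20)

print(f"Intel HD Audio maximum: {hd_audio / 1e6:.3f} Mb/s")  # 49.152 Mb/s
print(f"AC'97 maximum:          {ac97 / 1e6:.3f} Mb/s")      # 5.760 Mb/s
```

Rates of this size are fine inside a PC, but they are far beyond what a conferencing link would devote to audio, which is one reason the two worlds use different codecs.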

In videoconferencing, a variety of audio standards are used, depending on the available bandwidth and the inter-system compatibility. Videoconferencing "codecs" (compression/decompression algorithms) are generally designed to maintain audio-video quality while minimizing delay (the most troublesome contributor to an unnatural communications experience) and minimizing bandwidth (which is always a precious commodity in communications). Hence the audio codecs for a videoconferencing system are generally not the same as those used in consumer devices.

Research since the 1980s has confirmed that audio quality is the key determinant of videoconferencing quality. [See sidebar, "Does Video Look Better with Sound?"] Videoconferencing systems once supported only narrowband 3 kHz audio, which produced a hollow, almost "tinny" sound. Newer systems support 7 kHz, 14 kHz, and even 22 kHz wideband audio. These expanded frequency ranges provide a richer sound (similar to how FM radio provides a fuller sound compared to AM radio) and a better meeting experience. More importantly, however, the new systems deliver additional audio information that makes speech more intelligible, thereby increasing meeting effectiveness and reducing the "meeting fatigue" that occurs when participants have to strain to understand the talker.

Most videoconferencing events involve people speaking into one or perhaps two microphones, so there is no real need for 5-channel surround sound, and there are unlikely to be extremely low bass tones or extremely high frequencies, as would be the case with music. In fact, the concept of HD videoconferencing has focused on supporting HD video resolution, with most vendors selecting some wideband audio algorithm (often proprietary) because HD users expect an improved audio experience. However, when HD systems from different vendors try to connect, the audio codec negotiated is typically ITU G.722 (7 kHz bandwidth) since this is a common standard.
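The fallback to G.722 can be pictured as a simple capability match: each endpoint lists the codecs it supports in order of preference, and the call uses the first entry both sides share. The Python sketch below is only an illustration of that idea. The function name and the "VendorA-Wideband" and "VendorB-Wideband" codec names are hypothetical, and real systems exchange capabilities through their signaling protocols rather than through a function call like this.

```python
from typing import Optional, Sequence


def negotiate_audio_codec(local_prefs: Sequence[str],
                          remote_caps: Sequence[str]) -> Optional[str]:
    """Return the first codec in local preference order that the remote also supports."""
    remote = set(remote_caps)
    for codec in local_prefs:
        if codec in remote:
            return codec
    return None  # no common audio codec


# Hypothetical capability lists: each vendor prefers its own proprietary
# wideband codec, then falls back to standard codecs.
vendor_a = ["VendorA-Wideband", "G.722", "G.711"]
vendor_b = ["VendorB-Wideband", "G.722", "G.711"]

print(negotiate_audio_codec(vendor_a, vendor_a))  # VendorA-Wideband
print(negotiate_audio_codec(vendor_a, vendor_b))  # G.722
```

Between two systems from the same vendor the proprietary wideband codec wins; across vendors the best shared entry is the 7 kHz standard, which is the behavior described above.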

Understanding the challenges of sound quality in a videoconference starts with understanding audio compression. Capturing voice or other audio signals for transmission in a digital format, as is the case for a videoconference, requires filtering, processing, and digitizing the signals, and then compressing the results. There are many ways to process audio signals, each representing a tradeoff between multiple performance and efficiency parameters. The compression-decompression algorithms used for videoconferencing and voice over IP networks have been designed to optimize around three basic parameters:

- Frequency Response (or Input Bandwidth): High bandwidth systems reproduce more of the fundamental frequencies and overtones of the original signal, and also require input and output devices (microphones and loudspeakers) capable of handling these signals. As an example, the common PSTN telephone network, with its roots in analog transmission, is designed to handle input signals up to just over 3 kHz. This limited frequency range is marginally acceptable for voice, but not for music, where many critical frequencies above 3 kHz are present.

- Compression Rate: Some audio codecs can compress an input signal by a factor of ten or more. Obviously, the higher the compression rate, the smaller the output bit stream, leaving more network bandwidth available for video (or other data). Generally, the higher the compression rate, the lower the quality of the resultant audio signal.

- Delay: Delay (or latency) is a crucial determinant of videoconferencing quality and of user satisfaction. With a high-latency communications environment, such as experienced with a satellite-based phone or video call where the delay is caused by the network, two-way conversation becomes very awkward and unnatural. The latency causes people to trip over each other in conversation or to pause frequently to see if the other side is speaking. Latency can also be introduced by the codec hardware and software architecture. High-latency codecs can produce excellent quality sound with high compression rates, but are more suitable for archiving and retrieval applications where the delay is not noticeable. Some codecs such as the AAC algorithm are available in LD or "low delay" variations.
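The three parameters interact, and a little arithmetic makes the tradeoff concrete. The Python sketch below is a rough illustration under stated assumptions: a 16 kHz sampling rate (enough for roughly 7 kHz audio), 16-bit mono samples, and a codec that works on 320-sample frames. None of these values comes from a particular codec, and the function names are our own.

```python
def raw_bit_rate(sample_rate_hz: int, bits_per_sample: int, channels: int = 1) -> int:
    """Bit rate of the uncompressed digitized signal, in bits per second."""
    return sample_rate_hz * bits_per_sample * channels


def compressed_bit_rate(raw_bps: float, compression_factor: float) -> float:
    """Bit rate after compressing by the given factor."""
    return raw_bps / compression_factor


def min_frame_delay_ms(frame_samples: int, sample_rate_hz: int) -> float:
    """Time needed to collect one full frame before encoding can begin."""
    return 1000.0 * frame_samples / sample_rate_hz


SAMPLE_RATE_HZ = 16_000  # assumption: covers roughly 7 kHz of audio bandwidth
BITS_PER_SAMPLE = 16     # assumption
FRAME_SAMPLES = 320      # assumption: hypothetical frame size

raw = raw_bit_rate(SAMPLE_RATE_HZ, BITS_PER_SAMPLE)
print(raw)                                               # 256000 b/s uncompressed
print(compressed_bit_rate(raw, 10))                      # 25600.0 b/s at a factor of ten
print(min_frame_delay_ms(FRAME_SAMPLES, SAMPLE_RATE_HZ)) # 20.0 ms before any processing
```

A larger frame generally gives an encoder more signal to analyze, which helps compression, but every extra sample in the frame adds directly to the delay. That is the tension low-delay codec variants are designed to manage.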

Speech intelligibility (how easily speech is correctly understood), or rather the lack of it, is a primary contributor to "meeting fatigue." If, during a conference call, you have to work hard to understand the remote participants, have trouble determining who is talking at any given time, or have to speak in an unnatural cadence to accommodate system delays, your brain will tire quickly.

A 2003 white paper published by Polycom's Jeff Rodman describes the five key elements that lead to a user's perception of speech quality:

1. Bandwidth is the frequency range of audio signals that is carried to the listener. Telephones, which are limited (by filters installed on the network) to a frequency range of 300 Hz to 3.3 kHz, carry only 20 percent of the frequencies present in typical human speech. By comparison, AM radio extends to about 5 kHz, FM radio spans 30 Hz to 15 kHz, and common CD audio encodes signals from 20 Hz to 20 kHz. Modern audio systems designed with digital communications in mind, including IP telephony, speakerphones, and videoconferencing, now support up to 22 kHz frequency response.

While the human voice has most of its power below 7 or 8 kHz, the human ear can hear all the way up to 15-20 kHz, varying from person to person with age and other factors. Consonants are a key factor in speech articulation and recognition, as they separate words like "mold" from "bold" or "sailing" from "failing." While most of the energy in English vowels lies below 3 kHz, the sound energy in consonants is predominantly in frequencies above 3.3 kHz. For example, the sound that distinguishes the "s" in "sailing" from the "f" in "failing" occurs between 4 kHz and 14 kHz (depending upon the person speaking those words). When these frequencies are removed, whether by loudspeakers, microphones, or the codec engine, no cue remains (other than context and background knowledge) as to which word has been said.

2. Reverberation is the total soundfield in a room that remains after a sound source is silenced. Reverberation is affected by the physical characteristics of the room (reflective walls, etc.), room size, microphone type and pick-up pattern, and the orientation between the talker and the microphone. If the microphone is not pointed at the talker or is more distant, a greater proportion of the sound picked up by the microphone will be reverberation instead of direct speech, and the end result will be a decrease in intelligibility.

3. Amplitude refers to how loud the talker sounds to the listener. A quiet talker is more difficult to understand than a loud one, all things being equal.

4. Interaction is the ability of two or more participants to interact naturally with each other in a telephone conference. It is essential that one talker be able to interrupt another without disturbing the flow of conversation, or the dialogue will feel stilted and unnatural. Interaction, or interactivity, is enabled by two parameters: low delay systems (algorithms, hardware, networks) and full duplex (simultaneous send and receive) capabilities.

5. Noise refers to the proportion of ambient noise that is picked up along with speech. Room noise, such as air conditioning and projector fan noise, can be easily heard by microphones, and such noise can play a significant role in decreasing the intelligibility of speech.
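Of the five elements, bandwidth is the easiest to quantify. The Python sketch below takes the frequency ranges quoted under element 1 and checks how much of the 4 kHz to 14 kHz region, where the "s" versus "f" cue is said to lie, each system passes. It is a simplification: it measures plain overlap in hertz, not perceived intelligibility, and the lower band edges marked as assumed are not given in the text.

```python
from typing import Tuple

Band = Tuple[float, float]  # (low_hz, high_hz)


def band_overlap_fraction(system_band: Band, cue_band: Band) -> float:
    """Fraction of cue_band (by width in Hz) that falls inside system_band."""
    sys_lo, sys_hi = system_band
    cue_lo, cue_hi = cue_band
    overlap = max(0.0, min(sys_hi, cue_hi) - max(sys_lo, cue_lo))
    return overlap / (cue_hi - cue_lo)


CUE_BAND: Band = (4_000.0, 14_000.0)  # "s" vs. "f" cue region from the text

systems = {
    "Telephone": (300.0, 3_300.0),
    "AM radio": (0.0, 5_000.0),          # lower edge assumed
    "7 kHz wideband": (0.0, 7_000.0),    # lower edge assumed
    "14 kHz wideband": (0.0, 14_000.0),  # lower edge assumed
    "FM radio": (30.0, 15_000.0),
}

for name, band in systems.items():
    print(f"{name:16s} {band_overlap_fraction(band, CUE_BAND):.0%}")
```

By this crude measure the telephone band passes none of the cue region, 7 kHz audio passes less than a third of it, and 14 kHz audio passes all of it, which is consistent with the argument that the benefit of wideband audio is intelligibility rather than mere richness.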

Recently, two audio enhancements have been introduced by videoconferencing vendors. One of these is stereo sound, a concept with which most people are very familiar. Since the early days of the industry, videoconferencing systems have mixed all microphone inputs (after echo canceling) with any line inputs (not echo canceled) into a single audio signal prior to processing. Hence, most echo cancellers have been designed with this single channel in mind, and vendors have placed their emphasis on trying to maintain full duplex operations.
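The traditional single-channel approach can be sketched in a few lines of Python. This is an illustration only: samples are assumed to be floats in the range -1.0 to 1.0, the function name is our own, and echo cancellation (which real systems apply to the microphone inputs before mixing) is left out entirely.

```python
from typing import List, Sequence


def mix_to_mono(inputs: Sequence[Sequence[float]], limit: float = 1.0) -> List[float]:
    """Sum equal-length sample sequences into one channel, clipping to +/- limit."""
    if not inputs:
        return []
    length = len(inputs[0])
    if any(len(channel) != length for channel in inputs):
        raise ValueError("all inputs must have the same number of samples")
    mixed = []
    for samples in zip(*inputs):
        total = sum(samples)
        mixed.append(max(-limit, min(limit, total)))  # hard clip on overload
    return mixed


mic_1 = [0.5, 0.25, -0.5]
mic_2 = [0.25, 0.25, -0.75]
line_in = [0.0, 0.25, 0.0]

print(mix_to_mono([mic_1, mic_2, line_in]))  # [0.75, 0.75, -1.0]
```

Once the inputs are summed, nothing in the output indicates which microphone a given sound came from. That lost information is what stereo and spatial techniques aim to preserve.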

Several new videoconferencing systems now include support for AAC compression (made extremely popular by the Apple iPod). Besides supporting wideband response (up to 22 kHz), AAC includes native support for stereo sound. Some of the video systems on the market today support stereo only on their line inputs (typically used for VCR or DVD sources), not on their microphone inputs. With stereo microphone support, the videoconferencing system can provide excellent separation when multiple people are speaking at the same time - voices are separated by left and right channel so they are clearly heard in one ear or the other and not combined in one monaural wave of noise.

Related to stereo, but not exactly the same, is another audio enhancement dubbed "spatial audio." With spatial audio (not yet standardized), multiple microphones are used with spatial processing software to capture position-relevant audio. This information is relayed to the other side, where it is used in the audio output system (with multiple speakers) to give localization cues to the remote audience. This is very handy, for example, when five people are sitting around a conference room at the far end, because it helps match the location of the voice to the location of the talker within the video image, assuming the camera position is matched to the microphone position.
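Because spatial audio is not standardized, vendor implementations are not described here. As a stand-in, the Python sketch below uses ordinary constant-power stereo panning, a widely used technique, to show how a talker's horizontal position could be turned into left and right loudspeaker gains. The position scale (-1.0 for far left, +1.0 for far right) and the function name are assumptions made for this example.

```python
import math
from typing import Tuple


def constant_power_pan(position: float) -> Tuple[float, float]:
    """Map position in [-1.0 (left), +1.0 (right)] to (left_gain, right_gain).

    The gains satisfy left**2 + right**2 == 1, so perceived loudness stays
    roughly constant as the source moves across the image.
    """
    if not -1.0 <= position <= 1.0:
        raise ValueError("position must be between -1.0 and 1.0")
    angle = (position + 1.0) * math.pi / 4.0  # 0 at far left, pi/2 at far right
    return math.cos(angle), math.sin(angle)


# Five talkers spread evenly across the far-end video image.
for seat, position in enumerate([-1.0, -0.5, 0.0, 0.5, 1.0], start=1):
    left, right = constant_power_pan(position)
    print(f"talker {seat}: left={left:.2f} right={right:.2f}")
```

A talker in the center of the picture is reproduced equally in both loudspeakers, while one at the edge comes almost entirely from the nearer loudspeaker, giving the listener a cue that matches the video.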

While the move to high definition video is likely to excite many videoconferencing users, the evolution to wideband (14 kHz) audio is certain to have a more dramatic effect on more users in a shorter period of time. In fact, analysts expect audio performance to be a strong determinant of videoconferencing user satisfaction moving forward.

At this time, most vendors seem to favor AAC-LD audio to accompany their HD video future, but these implementations are generally not interoperable. ITU G.722.1.C, by contrast, is a published standard that is supported by multiple vendors and is interoperable. Independent testing by France Telecom in March 2005 found G.722.1.C to have a higher quality in speech applications than AAC-LD.

Videoconferencing users can expect to see a steady stream of audio improvements, including intelligent muting to eliminate the sounds of typing and other background noises, audio error concealment to reduce the effects of network traffic congestion, and new algorithms to provide higher quality wideband voice in channels of 64 kb/s or narrower.

Andrew W. Davis is managing partner at Wainhouse Research, LLC, a Brookline, MA-based firm providing market research, business planning, and marketing services for vendors, service providers, and end users of rich media conferencing and collaboration solutions.

The AVNetwork staff are storytellers focused on the professional audiovisual and technology industry. Their mission is to keep readers up-to-date on the latest AV/IT industry and product news, emerging trends, and inspiring installations.