Homer Dudley’s Speech Synthesisers, “The Vocoder” (1940) & “Voder” (1939)
The Siemens system was used by many European experimental composers throughout the 1950s and 1960s, including Mauricio Kagel, Bengt Hambraeus, Milko Kelemen and the director of the Munich Studio für elektronische Musik, Josef Anton Riedl.
A vocoder (pronounced /ˈvoʊkoʊdər/, a combination of the words voice and encoder) is an analysis/synthesis system, mostly used for speech. In the encoder, the input is passed through a multiband filter, each band is passed through an envelope follower, and the control signals from the envelope followers are communicated to the decoder. The decoder applies these (amplitude) control signals to the corresponding filters in the (re)synthesizer.
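The encoder and decoder chain described above can be sketched in plain Python. This is only an illustration under stated assumptions: the band centres, filter Q, envelope cutoff and test signals are arbitrary choices, not values from Dudley's design, and the band-pass filter uses the commonly published biquad form rather than anything specific to the original hardware.

```python
import math

def bandpass(signal, center_hz, sample_rate, q=4.0):
    """Second-order band-pass filter (biquad with unity gain at the centre)."""
    w0 = 2.0 * math.pi * center_hz / sample_rate
    alpha = math.sin(w0) / (2.0 * q)
    b0, b2 = alpha, -alpha
    a0, a1, a2 = 1.0 + alpha, -2.0 * math.cos(w0), 1.0 - alpha
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in signal:
        y = (b0 * x + b2 * x2 - a1 * y1 - a2 * y2) / a0
        x2, x1 = x1, x
        y2, y1 = y1, y
        out.append(y)
    return out

def envelope(signal, sample_rate, cutoff_hz=50.0):
    """Envelope follower: rectify, then smooth with a one-pole low-pass."""
    coeff = math.exp(-2.0 * math.pi * cutoff_hz / sample_rate)
    level = 0.0
    out = []
    for x in signal:
        level = coeff * level + (1.0 - coeff) * abs(x)
        out.append(level)
    return out

def encode(modulator, bands_hz, sample_rate):
    """Encoder: one amplitude control signal per frequency band."""
    return [envelope(bandpass(modulator, f, sample_rate), sample_rate)
            for f in bands_hz]

def decode(controls, carrier, bands_hz, sample_rate):
    """Decoder: scale each carrier band by its control signal and sum."""
    out = [0.0] * len(carrier)
    for control, f in zip(controls, bands_hz):
        band = bandpass(carrier, f, sample_rate)
        for i, (gain, sample) in enumerate(zip(control, band)):
            out[i] += gain * sample
    return out

# Illustrative signals (assumed, not from the text).
SAMPLE_RATE = 8000
BANDS_HZ = [250.0, 500.0, 1000.0, 2000.0]
N = 800
# Modulator: a 500 Hz tone that stops halfway through.
modulator = [math.sin(2.0 * math.pi * 500.0 * i / SAMPLE_RATE) if i < N // 2 else 0.0
             for i in range(N)]
# Carrier: a 110 Hz sawtooth, rich in harmonics.
carrier = [2.0 * ((110.0 * i / SAMPLE_RATE) % 1.0) - 1.0 for i in range(N)]

controls = encode(modulator, BANDS_HZ, SAMPLE_RATE)
output = decode(controls, carrier, BANDS_HZ, SAMPLE_RATE)
```

Only `controls` would need to be transmitted; the decoder rebuilds a signal from its own carrier, which is why both ends must share the same band layout.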
It was originally developed in the 1930s as a speech coder for telecommunications, the idea being to code speech for transmission. Its primary use in this role is secure radio communication, where voice has to be encrypted before it is transmitted. The advantage of this method of “encryption” is that the speech signal itself is never sent, only the envelopes of the band-pass filters. The receiving unit must be set up with the same channel configuration to resynthesize a version of the original signal spectrum. The vocoder, in both hardware and software form, has also been used extensively as an electronic musical instrument.
“At the 1939 World’s Fair a machine called a Voder was shown. A girl stroked its keys and it emitted recognizable speech. No human vocal cords entered into the procedure at any point; the keys simply combined some electronically produced vibrations and passed these on to a loud-speaker.”
(“As We May Think” by Vannevar Bush, 1945.)
From: The Dance Music Manual by Rick Snoman:
One final effect that’s particularly useful if the vocalist is incapable of singing in key is the vocoder. Of all the vocal effects, these are not only the most instantly recognizable, but are also the most susceptible to changes in fashion. The robotic voices and talking synth effects they generate can be incredibly clichéd unless they’re used both carefully and creatively, but the way in which they operate opens up a whole host of creative opportunities. Fundamentally, vocoders are simple in design and allow you to use one sound – usually your voice (known as the modulator) – to control the tonal characteristics of a second sound (known as the carrier), which is usually a synthesizer’s sustained timbre. However, as simple as this may initially appear, actually producing musically usable results is a little more difficult, since simply dialling up a synth preset and talking, or singing, over it will more often than not produce unusable results. Indeed, to use a vocoder in a musically useful way, it’s important to have a good understanding of exactly how they work and to do this we need to begin by examining human speech.
A vocoder works on the principle that we can divide the human voice into a number of distinct frequency bands. For instance, plosive sounds such as ‘p’ or ‘b’ consist mostly of low frequencies, ‘s’ or ‘t’ sounds consist mostly of high frequencies, vowels consist mostly of mid-range frequencies and so forth. When a vocal signal enters the vocoder, a spectral analyser measures the signal’s properties and subsequently uses a number of filters to divide the signal into a number of different frequency bands. Once divided, each frequency band is sent to an envelope follower, which produces a series of control voltages based on the frequency content and volume of the vocal part. This exact same principle is also used on the carrier signal and these are tuned to the same frequency bands as the modulator’s input.
However, rather than generate a series of control voltages, they are connected to a series of voltage-controlled amplifiers. Thus, as you speak into the microphone the subsequent frequencies and volume act upon the carrier’s voltage-controlled amplifiers, which either attenuate or amplify the carrier signal, in effect superimposing your voice onto the instrument’s timbre. Consequently, since the vocoder analyses the spectral content and not the pitch of the modulator, it isn’t necessary to sing in tune as it wouldn’t make any difference. From this, we can also determine that the more filters that are contained in the vocoder’s bank, the more accurately it will be able to analyse and divide the modulating signal, and if this happens to be a voice, it will be much more comprehensible.
Typically, a vocoder should have a minimum of six frequency bands to make speech understandable, but it’s important to note that the number of bands available isn’t the only factor when using a vocoder on vocals. The intelligibility of natural speech is centred between 2.5 and 5 kHz; higher or lower than this and we find it difficult to determine what’s being said. This means that when using a vocoder, the carrier signal must be rich in harmonics around these frequencies, since if it’s any higher or lower then some frequencies of speech may be missed altogether. To prevent this, it’s prudent to use a couple of shelving filters to remove all frequencies below 2 kHz and above 5 kHz before feeding them into the vocoder. Similarly, for best results the carrier signal’s sustain portion should remain fairly constant to help maintain some intelligibility. For instance, if the sustain portion is subject to an LFO modulating the pitch or filter, the frequency content will be subject to a cyclic change that may push it in and out of the boundaries of speech, resulting in some words being comprehensible while others become unintelligible.
Plus, it should also go without saying that if you plan on using your voice to act as a modulator it’s essential that what you have to say, or sing, is intelligible in the first place. This means you should ensure that all the words are pronounced coherently and clearly. More importantly, vocal tracks will unquestionably change in amplitude throughout the phrases and this will create huge differences in the control voltages generated by the vocoder. As a result, the VCA levels imposed onto the carrier signal follow this change in level, producing an uneven vocoded effect, which can distort the results. Subsequently, it’s an idea to compress the vocals before they enter the vocoder and, if the carrier wave uses an LFO to modulate the volume, compress this too. The settings to use will depend entirely on the vocals themselves and the impact you want them to have in the mix (bear in mind that dynamics can affect the emotional impact), but as a very general starting point set the ratio on both carrier and modulator to 3:1 with a fast attack and release, and then reduce the threshold so that the quietest parts only just register on the gain reduction meter. Additionally, remember that it isn’t just vocals that will trigger the vocoder: breath noises, rumble from the microphone stand and any extraneous background noises will also trigger it. Thus, along with a compressor you should also consider employing a noise gate to remove the possibility of any superfluous noises being introduced.
With both carrier and modulator under control there’s a much better chance of producing a musically useful effect and the first stop for any vocoder is to recreate the robotic voice. To produce this effect, the vocoder needs to be used as an insert effect, not send, as all of the vocal line should go through the vocoder. Once this modulator is entering the vocoder you’ll need to program a suitable carrier wave.
Obviously, it’s the tone of this carrier wave that will produce the overall effect, and two sawtooth waves detuned from each other by _ and _4 with a short attack, decay and release but a very long sustain should provide the required timbre. If, however, this makes the vocals appear a little too bright, sharp, thin or ‘edgy’ it may be worthwhile replacing one of the sawtooth waves with a square or sine wave to add some bottom-end weight. Though this effect is undoubtedly great fun for the first couple of minutes, after the typical ‘Luke, I am your father’ it can wear thin and if used as is in a dance track it will probably sound a little too clichéd, so it’s worthwhile experimenting further.
Unsurprisingly, much of the experimentation with a vocoder comes from modulating the carrier wave in one way or another and the simplest place to start is by adjusting the pitch in time with the vocals. This can be accomplished easily in any audio/MIDI sequencer by programming a series of MIDI notes to play out to the carrier synth, in effect creating a vocal melody. Similarly, an arpeggio sequence used as a carrier wave can create a strange gated, pitch-shifting effect, while an LFO modulating the pitch can create an unusual cyclic pitch-shifted vocal effect. Filter cut-off and resonance can also impart an interesting effect on vocals and in many sequencers this can be automated so that it slowly opens during the verses, creating a build-up to a chorus section. Also, note that the carrier does not necessarily have to be created with saw waves, and a sine wave played around C3 or C4 can be used to recreate a more tonally natural vocal melody that will have some peculiarity surrounding it.
Note: Vocoders do not always have to be used on vocals and you can produce great results by using them to impose one instrument onto another. For instance, using a drum loop as a modulator and a pad as the carrier, the pad will create a gating effect between the kicks of the loop.
Alternatively, using the pad as a modulator and the drums as the carrier wave, the drum loops will turn into a loop created by a pad! Ultimately these have only been simple suggestions to point you in a more creative direction and you should be willing to try out any effects you can lay your hands on to hear the effect they can have on a vocal. Bear in mind that due to the very nature of dance music it’s always open to experimentation and it’s much better to initiate a new trend than simply follow one set by another artist.
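The 3:1 compression and noise gating suggested in the excerpt above can be pictured with a small static level calculation. This is a rough sketch, not how any particular compressor is built: the threshold values are hypothetical, levels are in decibels, and attack and release behaviour is ignored entirely.

```python
def compress_db(level_db, threshold_db, ratio=3.0):
    """Static compressor curve: above the threshold, level rises at 1/ratio."""
    if level_db <= threshold_db:
        return level_db
    return threshold_db + (level_db - threshold_db) / ratio

def gate_db(level_db, gate_threshold_db, floor_db=-120.0):
    """Hard noise gate: anything below the gate threshold is muted."""
    return level_db if level_db >= gate_threshold_db else floor_db

# Hypothetical vocal levels: a loud phrase, a quiet phrase, breath noise.
THRESHOLD_DB = -30.0
GATE_DB = -60.0
for name, level in [("loud", -6.0), ("quiet", -24.0), ("breath", -70.0)]:
    processed = compress_db(gate_db(level, GATE_DB), THRESHOLD_DB)
    print(name, level, "->", processed)
```

With these assumed settings the 18 dB gap between the loud and quiet phrases shrinks to 6 dB, so the control signals derived from the modulator vary far less, while the breath noise falls below the gate and never reaches the vocoder.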
From the album liner notes written by D.H. VanLenten:
“This recording contains samples of synthesized speech – speech artificially constructed from the basic building blocks of the English language. A machine which produces synthesized speech is called, fittingly, a talking machine. There are many possible kinds of speech synthesizers or talking machines. Instead of building and testing a variety of them, scientists at Bell Telephone Laboratories simulate their behavior with a high-speed, general purpose computer. The computer is instructed (programmed) to accept in sequence on punched cards the names of the speech sounds which make up an English sentence. It then processes this information, in accordance with the linguistic rules governing the English language, and produces an output analogous to the output of the talking machine it is programmed to simulate. The talking machine simulated by the computer in this recording would normally be operated by continuously feeding it a set of nine control signals. The signals correspond to voice pitch, voice loudness, lip opening and other speech variables. When every instant of sound is specified, and every variable accounted for, such a machine produces human-sounding speech.
Setting up the computer to simulate this talking machine requires two sets of instructions or, more precisely, a two-part computer program. One part of the computer program performs the actual sound making function – it imitates the ‘talking’ of a talking machine. The second part consists of rules for combining individual speech sounds into connected speech, and for producing the nine control signals that activate the talking machine. Scientists at Bell Telephone Laboratories have developed a computer program that permits them to feed the names of speech sounds into the computer on punched cards. They also have devised a phonetic code using the letters of the alphabet. At present, it is made up of 22 consonant and 12 vowel sounds:
CONSONANTS: P – B – T – D – K – G – M – N – NG (as in sing) – F – V – S – Z – SH (as in she) – ZH (as in azure) – H – W – R – L – Y – TH (as in thin) – DH (as in then)
VOWELS: EE (as in bee) – I (as in ill) – AY (as in rate) – E (as in end) – AE (as in add) – AH (as in ah) – AW (as in jaw) – (as in go) – OO (as in foot) – UU (as in food) – UH (as in up) – ER (as in her)
Each speech sound is specified on a separate punched card. When a sequence of cards is fed into the computer, it ‘operates’ on the information – following the rules set up in the second part of its program – to produce the nine control signals that activate the talking machine program. For example, if the sequence of cards, H – EE – S – AW – DH – UH – K – AE – T, is fed into the computer, the machine will say ‘He saw the cat,’ in flat monotones. Proper inflection and phrasing are achieved by specifying on each card the changes in pitch and timing natural to human speech.
By specifying the pitch of the sounds, it also is possible to make the computer sing. In two of the samples recorded, the computer first sings a familiar tune and then, singing the same song, is accompanied by music played by another computer. The ‘speech’ of the simulated talking machine comes out of the computer as tiny magnetized spots on half-inch magnetic tape. The tape is fed to another machine which converts the spots to a tape suitable for playing on an ordinary tape recorder.
The first eight and very last samples of synthesized speech on this recording are part of a research program aimed, principally, at formulating a minimum set of rules for making plausible English speech. The ninth and tenth selections were produced by analyzing a person’s speech and re-constructing it synthetically on a computer. The objective of this program is to duplicate the sounds and transitions made by a human speaker, including his accent and dialect.
Knowledge developed through such research programs may be useful in devising new techniques for transmitting speech more efficiently over communications systems. In the near future, for example, a person may be able to type on a keyboard and cause a typing machine thousands of miles away to speak for him. There is also the possibility that talking machines may be built for people who are unable to speak.”
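The card sequence described in the liner notes can be modelled as a simple lookup against the phonetic code. This is a hedged sketch only: `parse_cards` is a hypothetical helper, the symbol for the vowel “as in go” is missing from the listing above and is therefore left out, and the pitch and timing fields carried on each card are not modelled.

```python
CONSONANTS = {
    "P", "B", "T", "D", "K", "G", "M", "N", "NG", "F", "V",
    "S", "Z", "SH", "ZH", "H", "W", "R", "L", "Y", "TH", "DH",
}
# Eleven of the twelve vowels; the symbol for "as in go" is not given above.
VOWELS = {"EE", "I", "AY", "E", "AE", "AH", "AW", "OO", "UU", "UH", "ER"}

def parse_cards(sequence):
    """Split a dash-separated card sequence and classify each speech sound."""
    cards = []
    for symbol in sequence.replace("–", "-").split("-"):
        symbol = symbol.strip().upper()
        if symbol in CONSONANTS:
            kind = "consonant"
        elif symbol in VOWELS:
            kind = "vowel"
        else:
            raise ValueError(f"unknown speech sound: {symbol!r}")
        cards.append((symbol, kind))
    return cards

# The example sentence from the liner notes: "He saw the cat".
cards = parse_cards("H-EE-S-AW-DH-UH-K-AE-T")
for symbol, kind in cards:
    print(symbol, kind)
```

In the system described, each such card would then be turned by the second part of the program into the nine control signals; that rule set is not given in the notes and is not attempted here.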
This ingenious device, designed by Hermann von Helmholtz (1821-1894), was the very first sound synthesizer: a tool for studying and artificially recreating musical tones and the sounds of human speech.
Suppose I sing the word ‘car’ and then on the same note sing ‘we’. The two vowel sounds will be similar in so far as they have the same pitch, yet they have a clearly distinct sound quality, or timbre. What is it that accounts for this difference, and the timbres of musical sounds in general? Helmholtz set out to answer this very question in the mid nineteenth century, building on the work of the Dutch scientist Franz Donders (1818-1889).
Helmholtz showed that the timbre of musical notes, and vowel sounds, is a result of their complexity: just as seemingly-pure white light actually contains all the colors of the rainbow, clearly defined musical notes are composed of many different tones. If you play the A above middle C on an organ, for example, the sound you hear has a clearly defined “fundamental” pitch of 440Hz. But the sound does not only contain a simple “fundamental” vibration at 440Hz, but also a “harmonic series” of whole number multiples of this frequency called “overtones” (e.g., 880Hz, 1320Hz, 1760Hz, etc.). Helmholtz proved, using his synthesizer, that it is this combination of overtones at varying levels of intensity that gives musical tones, and vowel sounds, their particular sound quality, or timbre.
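The harmonic series mentioned above is just the fundamental multiplied by successive whole numbers, which a few lines of Python make concrete. The function name is illustrative only.

```python
def harmonic_series(fundamental_hz, count):
    """Fundamental plus overtones: whole-number multiples of the fundamental."""
    return [fundamental_hz * n for n in range(1, count + 1)]

# The A above middle C and its first three overtones.
print(harmonic_series(440, 4))  # [440, 880, 1320, 1760]
```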
How the synthesizer works
Helmholtz’s apparatus uses tuning forks, renowned for their very pure tone, to generate a fundamental frequency and the first six overtones which may then be combined in varying proportions. The tuning forks are made to vibrate using electromagnets and the sound of each fork may be amplified by means of a Helmholtz resonator with adjustable shutter operated mechanically by a keyboard.
By varying the relative intensities of the overtones, Helmholtz was able to simulate sounds of various timbres and, in particular, recreate and understand the nature of the vowel sounds of human speech and singing. Vowel sounds are created by the resonances of the vocal tract, with each vowel defined by two or three resonant frequencies known as formants. When we say or sing ‘a’ (as in ‘had’), for instance, the vocal tract amplifies frequencies close to 800Hz, 1800Hz and 2400Hz amongst others. When we require a different vowel sound, the muscles of the throat and mouth change the shape of the vocal tract, producing a different set of resonances.
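The idea of mixing a fundamental and six overtones in proportions shaped by the formants can be sketched as additive synthesis. This is an illustration under assumptions, not Helmholtz's procedure: the 200 Hz fundamental, the 150 Hz resonance width and the simple bell-shaped gain curve are all arbitrary choices; only the three formant frequencies come from the text.

```python
import math

FORMANTS_HZ = (800.0, 1800.0, 2400.0)  # 'a' as in 'had', from the text

def formant_gain(freq_hz, formants=FORMANTS_HZ, bandwidth_hz=150.0):
    """Gain for one partial: a bell-shaped peak around each formant (assumed shape)."""
    return sum(1.0 / (1.0 + ((freq_hz - f) / bandwidth_hz) ** 2) for f in formants)

def vowel_partials(fundamental_hz, count=7):
    """(frequency, amplitude) pairs for the fundamental and its overtones."""
    return [(fundamental_hz * n, formant_gain(fundamental_hz * n))
            for n in range(1, count + 1)]

def synthesize(partials, duration_s=0.05, sample_rate=8000):
    """Additive synthesis: sum one sine wave per partial."""
    n_samples = round(duration_s * sample_rate)
    return [sum(amp * math.sin(2.0 * math.pi * freq * i / sample_rate)
                for freq, amp in partials)
            for i in range(n_samples)]

# Seven partials on an assumed 200 Hz fundamental: 200, 400, ..., 1400 Hz.
partials = vowel_partials(200.0)
samples = synthesize(partials)
strongest_hz = max(partials, key=lambda p: p[1])[0]
print(strongest_hz)
```

With these assumptions the partial at 800 Hz ends up loudest, since it sits on the first formant. Seven partials on a 200 Hz fundamental stop at 1400 Hz, so the two higher formants are only reached by raising the fundamental or the partial count.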