Audio

From EMC Electronic Music Coders

Overview

This document is intended to act as a teaching tutorial for sound terminology, theory and practice, across multiple disciplines, but focusing on acoustics, psychoacoustics, environmental acoustics, electroacoustics, speech acoustics, audiology, noise and soundscape studies. In many cases, we draw comparisons between these disciplines and attempt to explain their basic models and how they differ, beginning with the Introductory module.

INTRODUCTION: Sound is .....

A survey of basic concepts in each discipline

1: Sound-Medium Interface

The Tutorial is divided into a number of modules which are designed to cover a particular topic similar to a lab-based class or a set of studio demos. They are divided into an Acoustic set and an Electroacoustic set. Subtopics in each module can be accessed separately by a link in the series A, B, C, etc.

Interdisciplinary Thematic Search Engine

The subject matter of this document is organized according to various themes, the first five of which are traced through various subdisciplines, each of which treats the theme differently. The relevant terms for each theme and each discipline are grouped together. The themes are:

Analytical Dimensions of Sound

Magnitude

Vibration

Levels of Acoustic Interaction

Sound - Medium Interface

Sound - Environment Interaction

Sound - Sound Interaction

Specific Subdisciplines

Audiology and Hearing Loss

Noise Measurement Systems

Electroacoustic and Tape Studio Terms

Linguistics and Speech Acoustics

Communications Theory

The principal discipline which is the "home" for each term is indicated by an icon, as follows:

Acoustics, Psychoacoustics, Soundscape, Noise, Electroacoustics, Linguistics, Audiology, Music

Terms that are found in more than one discipline are indicated as follows: Acoustics / Electroacoustics

Components of Electronic Instrument

If we want to play some music, sound has to be generated somehow, right? Then the first family of modules that we are going to tackle is that of the sound sources: oscillators, noise sources, and samplers, mainly.

Oscillators

Oscillators are those modules that generate a pitched tone. Their frequency content varies depending on the waveform that is generated. Historically, only simple waveforms were generated, according to the electronic knowledge available. Typical waveforms are thus triangular, rectangular, sawtooth, and sinusoidal. These are all very simple waveforms that can be obtained with a few discrete components. From simple designs come simple spectra: their shape is very straight and unnatural, thus requiring additional processing to obtain pleasant sounds. Their spectral properties are discussed in Section 2.8, after the basic concepts related to frequency-domain analysis have been introduced (Figure 1.5).

Figure 1.4: Connection between two modules. The output of Module 1 is connected to the input of Module 2 by a TS jack. This way, the input voltage of Module 2 follows the output voltage of Module 1, and thus the signal is conveyed from Module 1 to Module 2.

Figure 1.5: Typical synthesizer oscillator waveforms include (from left to right) sawtooth, triangular, and rectangular shapes.

Oscillators usually have at least one controllable parameter: the pitch (i.e. the fundamental frequency they emit). Oscillators also offer control over some spectral properties. For example, rectangular waveform oscillators may allow pulse width modulation (PWM) (i.e. changing the duty cycle Δ, discussed later). Another important feature of oscillators is the synchronization to another signal.
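As a rough illustration of the waveforms and of the duty-cycle control mentioned above, the following Python sketch evaluates naive sawtooth, triangular, rectangular, and sinusoidal shapes from a normalized phase. The function names and the `duty` parameter are assumptions made for this example only. The shapes are not band-limited, so a practical digital oscillator would also need some form of anti-aliasing, which this sketch ignores.

```python
import math


def phase(t, f0):
    """Normalized phase in [0, 1) of an oscillator of frequency f0 (Hz) at time t (s)."""
    return (t * f0) % 1.0


def sawtooth(t, f0):
    """Ramp from -1 up to (almost) +1 once per period."""
    return 2.0 * phase(t, f0) - 1.0


def triangle(t, f0):
    """Rise from -1 to +1 in the first half period, fall back in the second half."""
    return 1.0 - 4.0 * abs(phase(t, f0) - 0.5)


def rectangular(t, f0, duty=0.5):
    """+1 for the fraction `duty` of each period, -1 for the rest (hypothetical PWM control)."""
    return 1.0 if phase(t, f0) < duty else -1.0


def sine(t, f0):
    """Pure tone at frequency f0."""
    return math.sin(2.0 * math.pi * f0 * t)
```

Sweeping `duty` over time while calling `rectangular` is one simple way to picture pulse width modulation: the pitch stays the same while the balance between the high and low portions of each cycle, and hence the spectrum, changes.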


Synchronization to an external input (a master oscillator) is available on many oscillator designs. So-called hard sync allows an external rising edge to reset the waveform of the slave oscillator and is a very popular effect to apply to oscillators. The reset implies a sudden transient in the waveform that alters the spectrum, introducing high-frequency content. Other effects known as weak sync and soft sync have different implementations. Generally, with soft sync, the oscillator reverses direction at the rising edge of the external signal. Finally, weak sync is similar to hard sync, but the reset is applied only if the waveform is close to the beginning or ending of its natural cycle. It must be noted, however, that there is no consensus on the use of the last two terms, and different synthesizers have different behaviors. All these synchronization effects require the slave and the master to run at different periods.

More complex oscillators have other ways to alter the spectrum of a simple waveform (e.g. by using waveshaping). Since there are specific modules that perform waveshaping, we shall discuss them later. Oscillators may allow frequency modulation (i.e. roughly speaking, controlling the pitch with a high-frequency signal). Frequency modulation is the basis for FM synthesis techniques, and can be either linear or logarithmic (linear FM is the preferred one for timbre sculpting following the path traced by John Chowning and Yamaha DX7’s sound designers). To conclude, tone generation may be obtained from modules not originally conceived for this aim, such as an envelope generator (discussed later) triggered at an extremely high rate.
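To make the hard sync mechanism described above more concrete, here is a minimal Python sketch with two phase accumulators. It is only an illustration under simplifying assumptions: the start of each master cycle stands in for the external rising edge, the slave is a naive sawtooth, and the names `hard_sync_saw`, `f_master`, and `f_slave` are invented for this example. No anti-aliasing is attempted, although the abrupt reset is exactly the kind of transient that aliases in a digital implementation.

```python
def hard_sync_saw(f_master, f_slave, sample_rate, n_samples):
    """Naive sawtooth slave oscillator that is reset at every new master cycle."""
    master_phase = 0.0
    slave_phase = 0.0
    out = []
    for _ in range(n_samples):
        out.append(2.0 * slave_phase - 1.0)      # sawtooth in [-1, 1)
        master_phase += f_master / sample_rate   # advance both phase accumulators
        slave_phase += f_slave / sample_rate
        if master_phase >= 1.0:
            master_phase -= 1.0
            slave_phase = 0.0                    # hard sync: restart the slave waveform
        elif slave_phase >= 1.0:
            slave_phase -= 1.0                   # free-running wrap of the slave
    return out
```

With `f_master=1000`, `f_slave=1500`, and `sample_rate=8000`, the output repeats every eight samples (the master period), while the slave's truncated second ramp inside each period is what adds the extra spectral content.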


Noise sources

Noise sources also belong to the tone generators family. These have no pitch, since noise is a broadband signal, but may allow the selection of the noise coloration (i.e. the slope of the spectral rolloff), something we shall discuss in the next chapter. Noise sources are very useful to create percussive sounds, to create drones, or to add character to pitched sounds.

Finally, the recent introduction of digital modules allows for samplers to be housed in a Eurorack module. Samplers are usually capable of recording tones from an input or recalling recordings from a memory (e.g. an SD card) and triggering their playback. Other all-in-one modules are available that provide advanced tone generation techniques, such as modal synthesis, FM synthesis, formant synthesis, and so on. These are also based on digital architecture with powerful microcontrollers or digital signal processors (DSPs).

1.3.2 Timbre Modification and Spectral Processing

As discussed, most tone generators produce very static sounds that need to be colored, altered, or emphasized. Timbre modification modules can be divided into at least four classes: filters, waveshapers, modulation effects, and vocoders.

Filtering devices are well known to engineers and have played a major role in electrical and communication engineering since the inception of these two fields. They are devices that operate in the frequency domain to attenuate or boost certain frequency components. Common filters are the low-pass, band-pass, and high-pass type. Important filters in equalization applications are the peak, notch, and shelving filters. Musical filters are rarely discussed in engineering textbooks, since engineering requirements are different from musical requirements. Among the latter, we have a low implementation cost, a predetermined spectral roll-off (e.g. 12 or 24 dB/oct), and the possibility to introduce a resonance at the cutoff frequency, possibly leading to self-sustained oscillation.


While engineering textbooks consider filters as linear devices, most analog musical filters can be operated in a way that leads to nonlinear behavior, requiring specific knowledge to model them in the digital domain.
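As a point of reference for the linear case, the sketch below implements a textbook one-pole low-pass filter in Python. It is not a model of any analog musical filter: it has no resonance and its roll-off is about 6 dB/oct, so two such stages in series would be needed to approach a 12 dB/oct slope. The function name and the coefficient formula `a = exp(-2π·fc/fs)` are common conventions chosen for this example, not something taken from a specific module.

```python
import math


def one_pole_lowpass(x, cutoff_hz, sample_rate):
    """Linear one-pole low-pass: y[n] = (1 - a) * x[n] + a * y[n - 1]."""
    a = math.exp(-2.0 * math.pi * cutoff_hz / sample_rate)  # feedback coefficient, 0 < a < 1
    y = 0.0
    out = []
    for sample in x:
        y = (1.0 - a) * sample + a * y  # smooth the input towards its recent average
        out.append(y)
    return out
```

Feeding a constant input lets the output settle at that constant (unit gain at DC), while a signal alternating between +1 and -1 at every sample is strongly attenuated, which is the low-pass behavior in its simplest form.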


Waveshaping devices

Waveshaping devices have been extensively adopted by synthesizer developers such as Don Buchla and others in the West Coast tradition to create distinctive sound palettes. A waveshaper introduces new spectral components by distorting the waveform in the time domain. A common form of waveshaper is the foldback circuit, which folds the signal back whenever it exceeds a desired threshold. Other processing circuits that are familiar to guitar players are distortion and clipping circuits. Waveshaping in the digital domain requires a lot of attention in order to reduce undesired artifacts (aliasing).

Other effects used in modular synthesizers are so-called modulation effects, most of which are based on delay lines: chorus, phaser, flanger, echo and delay, reverb, etc. Effects can be of any sort and are not limited to spectral processing or coloration, so the list can go on.

Vocoders have had a large impact on the history of electronic music and the genres it has blended with. They also played a major role in the movie industry to shape robot voices. Several variations exist; however, the main idea behind them is to modify the spectrum of a first sound source with a second one that provides spectral information. An example is the use of a human voice to shape a synthesized tone, giving it a speech-like character. This configuration is very popular.

Figure 1.6: A CRB Voco-Strings, exhibited at the temporary Museum of the Italian Synthesizer in 2018, in Macerata, Italy. This keyboard was manufactured in 1979–1982. It was a string machine with vocoder and chorus, designed and produced not more than 3 km from where I wrote most of this book. Photo courtesy of Acusmatiq MATME. Owner: Riccardo Pietroni.
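The foldback waveshaper mentioned in this section can be sketched as a memoryless function applied to each sample. The Python snippet below is one possible formulation, assuming a symmetric threshold and ideal reflection at the threshold; the name `foldback` and its `threshold` parameter are illustrative. It makes no attempt to reduce aliasing.

```python
def foldback(x, threshold=1.0):
    """Reflect x back into [-threshold, threshold] each time it crosses the threshold."""
    if threshold <= 0.0:
        raise ValueError("threshold must be positive")
    period = 4.0 * threshold
    y = (x + threshold) % period      # position along one up-down cycle, in [0, 4 * threshold)
    if y > 2.0 * threshold:
        y = period - y                # descending half: mirror it back
    return y - threshold
```

For example, with a threshold of 1.0 an input of 1.5 is folded to 0.5 and an input of 3.0 ends up at -1.0, so increasing the input gain keeps adding folds and, with them, new harmonics.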


Envelope, Dynamics, Articulation

Another notable family of effects includes all the envelope, dynamics, and articulation devices. Voltage-controlled amplifiers (VCAs) are meant to apply a time-varying gain to a signal in order to shape its amplitude in time and create a dynamic contour. They can be controlled by a high-frequency signal, introducing amplitude modulation (AM), but more often they are controlled by envelope generators (EGs). These are tools that respond to a trigger or gate signal to generate a voltage that rises and decays, determining the temporal evolution of a note or any other musical event. Most envelope generation schemes follow the so-called ADSR scheme, depicted in Figure 1.7, where the evolution is described by four parameters, namely three times (attack, decay, and release) and one level (sustain):

• A: The attack time. This parameter is expressed as a time in [s] or [ms] or as a percentage of a maximum attack time (i.e. 1–100).

• D: The decay time. The time to reach a steady-state level (usually the sustain, or zero when no sustain is provided by the EG), also expressed as a time ([s], [ms]) or a percentage of a maximum decay time (1–100).

• S: The sustain level. The steady-state level to be reached when the decay phase ends. This is usually expressed as a percentage of the peak level that is reached in the attack phase (1–100).

• R: The release time. The time to reach zero after the musical event ends (e.g. note off event). This is also expressed in [s], [ms], or percentage of a maximum release time (1–100).

Subsets of this scheme, such as AR, with no sustain phase, can still be obtained by ADSR. An EG generates an envelope signal, which is used as an operand in a product with the actual signal to shape. It is important to distinguish between an EG and a VCA; however, sometimes both functionalities are combined in one device or module. Envelope generators are also used to control other aspects of sound production, from the pitch of the oscillator to the cutoff of a filter (Figure 1.8).

Similarly, low-frequency oscillators (LFOs) are used to control any of these parameters. LFOs are very similar to oscillators, but with a frequency of oscillation that sits below the audible range or slightly overlaps with its lower part. They are used to modulate other parameters. If they modulate the pitch of an oscillator, they are performing vibrato. If they modulate the amplitude of a tone through a VCA, they are performing tremolo. Finally, if they are used to shape the timbre of a sound (e.g. by modulating the cutoff of a filter), they are performing what is sometimes called wobble.

Figure 1.7: A linear envelope generated according to the ADSR scheme.

Other tools for articulation are slew limiters, which smooth step-like transitions of a control voltage. A typical use is the smoothing of a keyboard control voltage that provides a glide or portamento effect by prolonging the transition from one pitch value to another.
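The envelope and articulation tools of this section can be summarized in a short Python sketch: a linear ADSR envelope, a VCA that multiplies a signal by that envelope, and a slew limiter for glide. All names are invented for the example, times are in seconds, and the sustain level is given as a fraction of the peak (0 to 1) rather than as a percentage. Real envelope generators often use exponential rather than linear segments, so this is only a simplified picture.

```python
def gate_on_level(t, attack, decay, sustain):
    """Envelope level t seconds after the gate opens, while the gate is held."""
    if t < attack:
        return t / attack                                     # attack: rise from 0 to the peak (1.0)
    if t < attack + decay:
        return 1.0 - (1.0 - sustain) * (t - attack) / decay   # decay: fall to the sustain level
    return sustain                                            # sustain: steady-state level


def adsr(t, gate_time, attack, decay, sustain, release):
    """Linear ADSR level at time t for a gate that is open during [0, gate_time)."""
    if t < 0.0:
        return 0.0
    if t < gate_time:
        return gate_on_level(t, attack, decay, sustain)
    start = gate_on_level(gate_time, attack, decay, sustain)  # level reached at note off
    elapsed = t - gate_time
    if elapsed >= release:
        return 0.0
    return start * (1.0 - elapsed / release)                  # release: fall linearly to zero


def vca(signal, envelope):
    """Apply a time-varying gain: sample-by-sample product of signal and envelope."""
    return [x * g for x, g in zip(signal, envelope)]


def slew_limit(cv, max_step):
    """Limit how much a control value may change from one sample to the next."""
    out = []
    y = cv[0] if cv else 0.0
    for target in cv:
        delta = max(-max_step, min(max_step, target - y))     # clamp the per-sample change
        y += delta
        out.append(y)
    return out
```

Setting `sustain` to zero and keeping the gate short approximates the AR subset mentioned above, and feeding a stepped pitch control value through `slew_limit` turns each jump into a ramp, which is the portamento effect.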


A somewhat related type of module is the sample and hold (S&H). This module does the inverse of a slew limiter by taking the value at given time instants and holding it for some time, giving rise to a step-like output. The operation of an S&H device is mathematically known as a zero-order hold filter. An S&H device requires an input signal and depends on a clock that sends triggering pulses. When these are received, the S&H outputs the instantaneous input signal value and holds it until a new trigger arrives. Its output is inherently step-like and can be used to control a range of other modules.

1.3.4 “Fire at Will,” or in Short: Sequencers

Step sequencers are another family of modules that allow you to control the performance. Sequencers specifically had, and still have, a distinctive role in the making of electronic music, thanks to their machine-like precision and their obsessive repetition on standard time signatures. Sequencers are made of an array or a matrix of steps, each representing equally spaced time divisions. For drum machines, each step stores a binary piece of information: fire/do not fire. The sequencer cycles repeatedly along the steps and fires whenever one of them is armed. We may call this a binary sequencer. For synthesizers, each step has one or more control voltage values associated, selectable through knobs or sliders. These can be employed to control any of the synth parameters, most notably the pitch, which is altered cyclically, following the values read at each step. Sequencers may also include both a control voltage and a binary switch, the latter for arming the step. Skipping some steps allows creating pauses in the sequence.

Figure 1.8: Advanced envelope generation schemes may go beyond the ADSR scheme. The panel of a Viscount-Oberheim OB12 is shown, featuring an initial delay (DL) and a double decay (D1, D2) in addition to the usual controls.

Sequencers are usually controlled by a master clock at metronome rate (e.g. 120 bpm), and at each clock pulse a new step is selected for output, sending the value or values stored in that step. This allows, for example, storing musical phrases if the value controls the pitch of a VCO, or storing time-synchronized modulations if the value controls other timbre-related devices. Typical sequencers consist of an array of 8 or 16 steps, used in electronic dance music (EDM) genres to store a musical phrase or a drumming sequence of one or two bars with time signature 4/4. The modular market, however, provides all sorts of weird sequencers that allow for generative music, polyrhythmic composition, and so on. Binary sequencers are used for drum machines to indicate whether a part of the drum should fire or not. Several rows are required, one for each drum part.

Although the Roland TR-808 is widely recognized as one of the first drum machines that could be programmed using a step sequencer, the first drum machine ever to host a step sequencer was the Eko Computer Rhythm, produced in 1972 and developed by Italian engineers Aldo Paci, Giuseppe Censori, and Urbano Mancinelli. This sci-fi wonder has six rows of 16 lit switches, one per step. Each row can play up to two selectable drum parts (Figure 1.9).
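Both the sample and hold and the step sequencer described in this section are driven by clock pulses, so they can be sketched together. The Python functions below are a simplified illustration: triggers and gates are plain booleans, one list entry corresponds to one sample or one clock pulse, and the names `sample_and_hold`, `step_duration`, and `run_sequencer` (as well as the `steps_per_beat` parameter) are assumptions made for this example.

```python
def sample_and_hold(signal, triggers, initial=0.0):
    """Copy the input when a trigger arrives, otherwise keep the last captured value."""
    held = initial
    out = []
    for x, trig in zip(signal, triggers):
        if trig:
            held = x          # capture the instantaneous input value
        out.append(held)      # hold it until the next trigger
    return out


def step_duration(bpm, steps_per_beat=4):
    """Seconds per step; with 4 steps per beat, 16 steps span one bar of 4/4."""
    return 60.0 / bpm / steps_per_beat


def run_sequencer(gates, cvs, n_pulses):
    """Cycle through the steps, one per clock pulse, returning (step, gate, cv) tuples."""
    events = []
    for pulse in range(n_pulses):
        step = pulse % len(gates)                    # wrap around at the end of the row
        events.append((step, gates[step], cvs[step]))
    return events
```

A binary (drum) sequencer would use only the `gates` row, one row per drum part, whereas a melodic sequence would route the `cvs` values to an oscillator pitch input and use the gates to arm or skip steps.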


Figure 1.9: The Eko Computer Rhythm, the first drum machine ever to be programmed with a step sequencer. It was devised and engineered not more than 30 km away from where this book was written. Photo courtesy of Acusmatiq MATME. Owner: Paolo Bragaglia. Restored by Marco Molendi.


Utility Modules

There are, finally, a terrific number of utility modules that, despite their simplicity, have a high value for patching. Attenuators and attenuverters, mixers, multiples, mutes, and multiplexers and demultiplexers are very important tools to operate on signals. A brief definition is given for each one of these:

• Attenuators and attenuverters. An attenuator is a passive or active circuit that just attenuates the signal using a potentiometer. In the digital domain, this is equivalent to multiplying a signal by any value in the range [0, 1]. Attenuverters, additionally, are able to invert the signal, as if multiplying the signal by a number in the range [−1, 1]. Please note that inversion of a sinusoidal signal (and, more generally, of any waveform with half-wave symmetry) is equivalent to a phase shift of 180° or π.

• Mixers. These modules allow you to sum signals together. They may be passive, providing just an electrical sum of the input voltages, or active, and they may have faders to control the gain of each input channel. Of course, in VCV Rack, there will be no difference between active and passive; we will just be summing discrete-time signals.

• Multiples. It is often useful to duplicate a signal. Multiples are made for this. They copy one input signal to several outputs. In Rack, this is not always required, since cables can be stacked from outputs, allowing duplication without requiring a multiple. However, they can still be useful to make a patch tidy.

• Mutes. It is sometimes useful to mute a signal, especially during a performance. Mutes are just switches that either let the signal flow from an input to an output or block it.

• Multiplexers and demultiplexers. These modules allow for complex routing of signals. A multiplexer, or mux, has multiple inputs and one output, and a knob to select which input to route to the output. A demultiplexer, or demux, on the contrary, has one input and multiple outputs. In this case, the knob selects where to route the input signal. Mux and demux devices only allow one signal to pass at a time.

Interface and control modules are also available to control a performance with external tools or add expressiveness. MIDI-to-CV modules are necessary to transform Musical Instrument Digital Interface (MIDI) messages into a CV. Theremin-like antennas and metal plates are used as input devices, while piezoelectric transducers are used to capture vibrations and touch, to be processed by other modules.
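Since the utility modules listed above reduce to very simple arithmetic in the digital domain, they can be sketched in a few lines of Python operating on one sample at a time. The function names are hypothetical, and the two routing helpers are named after what they do (many inputs to one output, one input to many outputs) rather than after a particular module.

```python
def attenuate(x, gain):
    """Attenuator: scale by a gain in [0, 1]."""
    if not 0.0 <= gain <= 1.0:
        raise ValueError("attenuator gain must lie in [0, 1]")
    return gain * x


def attenuvert(x, gain):
    """Attenuverter: scale by a gain in [-1, 1]; negative gains invert the signal."""
    if not -1.0 <= gain <= 1.0:
        raise ValueError("attenuverter gain must lie in [-1, 1]")
    return gain * x


def mix(inputs, gains=None):
    """Mixer: (optionally weighted) sum of the input samples."""
    if gains is None:
        gains = [1.0] * len(inputs)
    return sum(g * x for g, x in zip(gains, inputs))


def multiple(x, n_outputs):
    """Multiple: copy one input to several outputs."""
    return [x] * n_outputs


def mute(x, muted):
    """Mute: pass the signal or replace it with silence."""
    return 0.0 if muted else x


def select_input(inputs, index):
    """Many-to-one routing: pass only the selected input to the single output."""
    return inputs[index]


def route_to_output(x, index, n_outputs):
    """One-to-many routing: send the input to the selected output, silence elsewhere."""
    outputs = [0.0] * n_outputs
    outputs[index] = x
    return outputs
```

In an actual module these operations would run once per sample inside the processing callback; here they are kept as plain functions so each can be tried in isolation.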

Elements of Signal Processing for Synthesis

Continuous-Time Signals

TIP: Analog synthesizers and effects work with either voltage or current signals. Like any physical quantity, these are continuous-time signals and their amplitude can take any real value: this is what we call an analog signal. Analog synthesizers do produce analog signals, and thus we need to introduce this class of signals. A signal is defined as a function or a quantity that conveys some information related to a physical system. The information resides in the variation of that quantity in a specific domain. For instance, the voltage across a microphone capsule conveys information regarding the acoustic pressure applied to it, and hence about the environment surrounding it.

From a mathematical standpoint, we represent signals as functions of one or more independent variables. The independent variable is the one we indicate between parentheses (e.g. when we write $y = f(x)$, the independent variable is $x$). In other domains, such as image processing, there are usually two independent variables, the vertical and horizontal axes of the picture. However, most audio signals are represented in the time domain (i.e. in the form $f(t)$, with $t$ being the time variable).

Table 2.1: Notable continuous-time signals of interest in sound synthesis

Name | Mathematical description
Sine | $\sin(2\pi f t)$
Cosine | $\cos(2\pi f t)$
Decaying exponential | $e^{-\alpha t}$, $\alpha > 0$ (another, less common form is $a^{t}$, $0 < a < 1$)
Sawtooth | $t \bmod T$
White noise | Zero-mean, random (aleatory) signal


Time can be defined as either continuous or discrete. Physical signals are all continuous-time signals; however, discretizing the time variable allows for efficient signal processing, as we shall see later. Let us define an analog signal as a signal with continuous time and continuous amplitude: $s = f(t) : \mathbb{R} \to \mathbb{R}$


The notation indicates that the variable $t$ belongs to the real set and maps into a value that is a function of $t$ and equally belongs to the real set. The independent variable is taken from a set (in this case, $\mathbb{R}$) that is called the domain, while the dependent variable is taken from a set that is called the codomain. For any time instant $t$, $s(t)$ takes a known value.


Most physical signals, however, have finite length, and this holds true for musical signals as well, otherwise recording engineers would have to be immortal, which is one of the few qualities they still miss. For a finite continuous-time signal that lives in the interval $[T_1, T_2]$, we define it in this shorter time span: $s = f(t) : [T_1, T_2] \to \mathbb{R}$


A class of useful signals is reported in Table 2.1.

Discrete-Time Signals

TIP: Discrete-time signals are crucial to understand the theory behind DSP. However, they differ from digital signals, as we shall see later. They represent an intermediate step from the real world to the digital world where computation takes place.

Let us start with a question: Why do we need discrete-time signals? The short answer is that computers do not have infinite computational resources. I will elaborate on this further. You need to know that:

1. Computers crunch numbers.

2. These numbers are represented as a finite sequence of binary digits. Why finite? Think about it: Would you be able to calculate the sum of two numbers with an infinite number of digits after the decimal point? It would take you an infinite amount of time. Computers are not better than you, just a little faster. Small values are thus rounded to reduce the number of digits required to represent them. They are thus said to be finite-precision (quantized) numbers. Similarly, you cannot express arbitrarily large numbers with a finite number of digits. This is also very intuitive: If your pocket calculator only has five digits, you cannot express a number larger than 99,999, right?

3. Computing is done in a short but not null time slice. For this reason, the less data we feed to a computer, the less time it takes to provide the result. A continuous-time signal has infinitely many values between any two time instants, even very close ones. This means that it would take an infinite amount of time to process even a very small slice of signal. To make computation feasible, therefore, we need to take snapshots of the signal (sampling) at regular intervals. This is an approximation of the signal, but a good one, if we set the sampling interval according to certain laws, which we shall review later.

In this chapter, we shall discuss mainly the third point (i.e. sampling). Quantization (the second point) is a secondary issue for DSP beginners, and is left for further reading. Here, I just want to point out that if you are not familiar with the term “quantization,” it is exactly the same thing you do when measuring the length of your synth to buy a new shelf for it. You take a reference, the measuring tape, and compare the length of the synth to it. Then you approximate the measure to a finite number of digits (e.g. up to the millimeter). Knowing the length with a precision up to the nanometer is not only impractical by eye, but is also useless and hard to write, store, and communicate. Quantization has only marginal interest in this book, but a few hints on numerical precision are given in Section 2.13. Let us discuss sampling.
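The two operations introduced so far, sampling and quantization, can be illustrated with a few lines of Python. This is only a sketch: `sample` takes snapshots of a function of time at a regular interval, and `quantize` rounds a value to the nearest multiple of a step, in the same way the measuring tape rounds to the millimeter. Both names and the example values (a 1 Hz sine, a length of 0.4837 m) are made up for the illustration.

```python
import math


def sample(f, sample_rate, n_samples):
    """Take n_samples snapshots of the continuous-time function f at t = n / sample_rate."""
    return [f(n / sample_rate) for n in range(n_samples)]


def quantize(x, step):
    """Round x to the nearest multiple of step (finite-precision amplitude)."""
    return round(x / step) * step


def sine_1hz(t):
    """A 1 Hz sine used as the 'continuous-time' signal in this example."""
    return math.sin(2.0 * math.pi * t)


snapshots = sample(sine_1hz, sample_rate=4, n_samples=4)   # roughly 0, 1, 0, -1
shelf_length = quantize(0.4837, 0.001)                      # length in metres, to the millimetre
```

Sampling acts on the time axis and quantization on the amplitude axis, so either one can be applied without the other.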

As we have hinted, the result of the sampling process is a discrete-time signal (i.e. a signal that exists only at specific time instants). Let us now familiarize ourselves with discrete-time signals. These signals are functions, like continuous-time signals, with the independent variable not belonging to the real set. It may, for example, belong to the integer set $\mathbb{Z}$ (i.e. to the set of all positive and negative integer numbers). While real signals are defined for any time instant $t$, discrete signals are defined only at equally spaced time instants $n$ belonging to $\mathbb{Z}$. This set is less populated than $\mathbb{R}$ because it misses values between integer values. There are, for instance, infinitely many values between instants $n = 1$ and $n = 2$ in the real set that the integer set does not have. However, do not forget that even $\mathbb{Z}$ is an infinite set, meaning that theoretically our signal can go on forever.

Getting more formal, we define a discrete-time signal as $s = f[n] : \mathbb{Z} \to \mathbb{R}$, where we adopt a notation usual for DSP books, where discrete-time signals are denoted by the square brackets and the use of the variable $n$ instead of $t$. As you may notice, the signal is still a real-valued signal. As we have discussed previously, another required step is quantization. Signals with their amplitude quantized do not belong to $\mathbb{R}$ anymore. Amplitude quantization is a step that is independent of time discretization. Indeed, we could have continuous-time signals with quantized amplitude (although it is not very usual). Most of the inherent beauty and convenience of digital signal processing is related to the properties introduced by time discretization, while amplitude quantization has only a few side effects that require some attention during the implementation phase. We should state clearly that digital signals are discretized both in time and in amplitude. However, for simplicity, we shall now focus on signals that have continuous amplitude.

A discrete-time signal is a sequence of ordered numbers and is generally represented as shown in Figure 2.1. Any such signal can be decomposed into single pulses, known as Dirac pulses. A Dirac pulse sequence with unitary amplitude is shown in Figure 2.2, and is defined as: $\delta[n] = 1$ if $n = 0$, and $\delta[n] = 0$ otherwise.

Figure 2.1: An arbitrary discrete-time signal.
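The decomposition into single pulses can be written as $s[n] = \sum_k s[k]\,\delta[n-k]$, where $\delta[n]$ is the unit pulse that equals 1 at $n = 0$ and 0 elsewhere. The short Python sketch below checks this identity on a finite sequence; the function names and the sample values are chosen only for illustration, and the sequence is assumed to be zero outside the stored samples.

```python
def dirac(n):
    """Unit pulse: 1 at n = 0, 0 at every other integer n."""
    return 1.0 if n == 0 else 0.0


def reconstruct(samples, n):
    """Evaluate sum_k s[k] * delta[n - k], i.e. the signal rebuilt from shifted, scaled pulses."""
    return sum(s_k * dirac(n - k) for k, s_k in enumerate(samples))


signal = [0.5, -1.0, 0.25]
rebuilt = [reconstruct(signal, n) for n in range(len(signal))]   # equals the original samples
```

Each term of the sum is a pulse shifted to position $k$ and scaled by the sample value $s[k]$, so only one term is non-zero for any given $n$.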


CHAPTER 1 Modular Synthesis: Theory

The scope of this book is twofold: while focusing on modular synthesizers, a very fascinating and active topic, it tries to bootstrap the reader into the broader field of music-related DSP coding, without the complexity introduced by popular DAW plugin formats. As such, an introduction to modular synthesis cannot be neglected, since Rack is heavily based on the modular synthesis paradigm. More specifically, it faithfully emulates Eurorack mechanical and electric standards. Rack, as the name suggests, opens up as an empty rack where the user can place modules. Modules are the basic building blocks that provide all sorts of functionalities. The power of the modular paradigm comes from the cooperation of small, simple units. Indeed, modules are interconnected at will by cables that transmit signals from one to another.

Although most common hardware modules are analog electronic devices, I encourage the reader to remove any preconception about analog and digital. These are just two domains where differential equations can be implemented to produce or affect sound. With Rack, you can add software modules to a hardware setup (Section 11.3 will tell you how), emulate complex analog systems (pointers to key articles and books will be provided in Section 11.1), or implement state-of-the-art numerical algorithms of any sort, such as discrete wavelet transform, non-negative matrix factorization, and whatnot (Section 11.2 will give you some ideas). To keep it simple, this book will mainly cover topics related to oscillators, filters, envelopes, and sequencers, including some easy virtual analog algorithms, and provide some hints to push your plugins further.

1.1 Why Modular Synthesis?

Why do we need modularity for sound generation?

Most classical and contemporary music is structured using modules, motives, and patterns. Most generating devices can be divided into modules either physically or conceptually. I know, you cannot split a violin in pieces and expect them to emit sound singularly. But you can still analytically divide the string, the bridge, and the resonant body, and generate sound by emulating their individual properties and the interconnection of the three.

Modularity is the direct consequence of analytical thinking. Describing the whole by dividing it into simpler components is a strategy adopted in most scientific and engineering areas. Unfortunately, studying the interconnection between the parts is often left “for future works.” Sometimes it does no harm, but sometimes it leaves out the largest part of the issue.

I like to think about the modular sound generation paradigm as the definitive playground to learn about the concept of separation and unity and about the complexity of nonlinear dynamical systems. Even though each module in its own right is well understood by its engineer or developer, the whole system often fails to be analytically tractable, especially when feedback paths are employed. This is where the fun starts for us humans (a little bit less fun if we are also in charge of analyzing it using mathematical tools).

1.2 An Historical Perspective

1.2.1 The Early Electronic and Electroacoustic Music Studios

As electronic music emerged in the twentieth century, a large part of the experimental process involved with it was tied to the first available electronic devices, which were often built as rackmount panels or heavy cabinets of metal and wood. The 1950s are probably the turning point decade in the making of electronic music. Theoretical bases had been set in the first half of the century by composers and engineers. The invention of the positive feedback oscillator (using vacuum tubes) dates back to 1912–1914 (done independently by several researchers and engineers), while the mathematical formalization of a stability criterion for feedback systems was later devised by Heinrich Barkhausen in 1921. The field-effect transistor was patented in 1925 (in Canada, later patented in the US as patent no. 1,745,175), although it was not yet feasible to manufacture because of its complexity. After World War II, engineering had evolved wildly in several fields, following, alas, the war’s technological investment, with offspring such as control theory and cybernetics. Purely electronic musical instruments were already available far before the end of the war (e.g. the Theremin in 1920 and the Ondes Martenot in 1928), but still relegated to the role of classical musical instruments. Noise had been adopted as a key concept in music since 1913 with Luigi Russolo’s futurist manifesto L’arte dei rumori, and atonality had been developed by composers such as Arnold Schoenberg and Anton Webern. In the aftermath of World War II, an evolution was ready to start.

A part of this evolution was driven by European electronic and electroacoustic music studios, most notably the WDR (Westdeutscher Rundfunk) in Cologne, the RTF (Radiodiffusion-Télévision Française) studio in Paris, and the Centro di Fonologia at the RAI (Radiotelevisione Italiana) studios in Milan.

The Cologne studio was born in the years 1951–1953. Werner Meyer-Eppler was among the founders, and he brought his expertise as a lecturer at the University of Bonn in electronic sound production (Elektrische Klangerzeugung) into the studio. Karlheinz Stockhausen was involved from 1953, playing a key role in the development of the studio. He opposed the use of keyboard instruments of the time (e.g. the Melochord and the Monochord, introduced in the studio by Meyer-Eppler) and turned to a technician of the broadcast facility, Fritz Enkel, to obtain simple electronic devices such as sine wave generators.

The Paris studio was employed by the Groupe de Recherches Musicales, which featured pioneers Pierre Schaeffer, Pierre Henry, and Jacques Poullin, and was mostly based on Schaeffer’s concepts of concrete music, and therefore mostly oriented to tape works and electroacoustic works.

Finally, the studio in Milan was born officially in 1955 and run by composers Luciano Berio and Bruno Maderna and technician Marino Zuccheri. Most of the equipment was designed and assembled by Dr. Alfredo Lietti (1919–1998), depicted in Figure 1.1 in front of his creations. It is interesting to note that Lietti started his career as a radio communication technician, showing again how communication technology had an impact on the development of electronic music devices. The Studio di Fonologia at its best boasted third-octave and octave filter banks, other bandpass filters, noise and tone generators, ring modulators, amplitude modulators, a frequency shifter, an echo chamber, a plate reverb, a mix desk, tape recorders, various other devices, and the famous nine sine oscillators (Novati and Dack, 2012). The nine oscillators are often mentioned as an example of how so few and simple devices were at the heart of a revolutionary musical practice (Donati and Paccetti, 2002), with the emblematic statement “Avevamo nove oscillatori” (we only had nine oscillators).

Figure 1.1: c. 1960, Alfredo Lietti at the Studio di Fonologia Musicale, RAI, Milan. Lietti is shown standing in front of rackmount devices he developed for the studio, mostly in the years 1955–1956. A cord patch bay is clearly visible on the right. Credits: Archivio NoMus, Fondo Lietti.



CHAPTER 2 Elements of Signal Processing for Synthesis

Modular synthesis is all about manipulating signals to craft an ephemeral but unique form of art. Most modules can be described mathematically, and in this chapter we are going to deal with a few basic aspects of signal processing that are required to develop modules. The chapter will do so in an intuitive way, to help readers that are passionate about synthesizers and computer music to get into digital signal processing without the effort of reading engineering textbooks. Some equations will be necessary and particularly useful for undergraduate or postgraduate students who require some more detail. The math uses the discrete-time notation as much as possible, since the rest of the book will deal with discrete-time signals. Furthermore, operations on discrete-time series are simpler to grasp, in that it helps avoid integral and differential equations, replacing these two operators with sums and differences. In my experience with students from non-engineering faculties, this makes some critical points better understood.

One of my aims in writing the book has been to help enthusiasts get into the world of synthesizer coding, and this chapter is necessary reading. The section regarding the frequency domain is the toughest; for the rest, there is only some high school maths. For the eager reader, the bible of digital signal processing is Digital Signal Processing (Oppenheim and Schafer, 2009). Engineers and scholars with experience in the field must forgive me for the overly simplified approach. For all readers, this chapter sets the notation that is used in the following chapters.

This book does not cover acoustic and psychoacoustic principles, which can be studied in specialized textbooks (e.g. see Howard and Angus, 2017), and it does not repeat the tedious account of sound, pitch, loudness, and timbre, which is as far from my view of musical signals as tonal theories are from modular synthesizer composition theories.

2.1 Continuous-Time Signals

TIP: Analog synthesizers and effects work with either voltage or current signals. As any physical quantity, these are continuous-time signals and their amplitude can take any real value – this is what we call an analog signal. Analog synthesizers do produce analog signals, and thus we need to introduce this class of signals.

A signal is defined as a function or a quantity that conveys some information related to a physical system. The information resides in the variation of that quantity in a specific domain. For instance, the voltage across a microphone capsule conveys information regarding the acoustic pressure applied to it, and hence regarding the environment surrounding it.

From a mathematical standpoint, we represent signals as functions of one or more independent variables. The independent variable is the one we indicate in parentheses (e.g. when we write $y = f(x)$, the independent variable is $x$). In other domains, such as image processing, there are usually two independent variables, the vertical and horizontal axes of the picture. However, most audio signals are represented in the time domain (i.e. in the form $x(t)$, with $t$ being the time variable).

Time can be defined as either continuous or discrete. Physical signals are all continuous-time signals; however, discretizing the time variable allows for efficient signal processing, as we shall see later.

Let us define an analog signal as a signal with continuous time and continuous amplitude:

$$x(t): \mathbb{R} \rightarrow \mathbb{R}$$

The notation indicates that the variable $t$ belongs to the real set $\mathbb{R}$ and maps into a value $x(t)$ that is a function of $t$ and equally belongs to the real set. The independent variable is taken from a set (in this case, $\mathbb{R}$) that is called the domain, while the dependent variable is taken from a set that is called the codomain. For any time instant $t$, $x(t)$ takes a known value.

Most physical signals, however, have finite length, and this holds true for musical signals as well; otherwise recording engineers would have to be immortal, which is one of the few qualities they still lack. For a finite continuous-time signal that lives in the interval $[T_1, T_2]$, we define it in this shorter time span:

$$x(t): [T_1, T_2] \rightarrow \mathbb{R}$$

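A class of useful signals is reported in Table 2.1.

Table 2.1: Notable continuous-time signals of interest in sound synthesis

Name | Mathematical description
Sine | $x(t) = \sin(2\pi f t)$
Cosine | $x(t) = \cos(2\pi f t)$
Decaying exponential | $x(t) = e^{-\alpha t}$, with $\alpha > 0$. Another less common form is $x(t) = a^{t}$, with $0 < a < 1$
Sawtooth | $x(t) = t/T - \lfloor t/T \rfloor$, with period $T$
White noise | Zero mean, aleatory signal

2.2 Discrete-Time Signals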

TIP: Discrete-time signals are crucial to understand the theory behind DSP. However, they differ from digital signals, as we shall see later. They represent an intermediate step from the real world to the digital world where computation takes place.

Let us start with a question: Why do we need discrete-time signals? The short answer is that computers do not have infinite computational resources. I will elaborate on this further. You need to know that:

   Computers crunch numbers.
   These numbers are represented as a finite sequence of binary digits. Why finite? Think about it: Would you be able to calculate the sum of two numbers with an infinite number of digits after the decimal point? It would take you an infinite amount of time. Computers are not better than you, just a little faster. Small values are thus rounded to reduce the number of digits required to represent them. They are thus said to be finite-precision (quantized) numbers. Similarly, you cannot express arbitrarily large numbers with a finite number of digits. This is also very intuitive: If your pocket calculator only has five digits, you cannot express a number larger than 99,999, right?
   Computing takes a short but nonzero amount of time. For this reason, the less data we feed to a computer, the less time it takes to provide the result. A continuous-time signal has infinitely many values between any two time instants, even very close ones. This means that it would take an infinite amount of time to process even a very small slice of signal. To make computation feasible, therefore, we need to take snapshots of the signal (sampling) at regular intervals. This is an approximation of the signal, but a good one, if we set the sampling interval according to certain laws, which we shall review later.

In this chapter, we shall discuss mainly the third point (i.e. sampling). Quantization (the second point) is a secondary issue for DSP beginners, and is left for further reading. Here, I just want to point out that if you are not familiar with the term “quantization,” it is exactly the same thing you do when measuring the length of your synth to buy a new shelf for it. You take a reference, the measuring tape, and compare the length of the synth to it. Then you approximate the measure to a finite number of digits (e.g. up to the millimeter). Knowing the length with a precision up to the nanometer is not only impractical to measure by eye, but is also useless and hard to write, store, and communicate. Quantization has only marginal interest in this book, but a few hints on numerical precision are given in Section 2.13.
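The two approximations discussed above can be tried out in a few lines. The following is only a minimal sketch: the function names, the 8 Hz sampling rate, and the quantization step of 0.25 are illustrative assumptions, not values taken from the text.

```python
import math

def sample_sine(freq_hz, sample_rate_hz, num_samples):
    """Take snapshots of sin(2*pi*f*t) at the regular instants t = n / sample_rate."""
    return [math.sin(2.0 * math.pi * freq_hz * n / sample_rate_hz)
            for n in range(num_samples)]

def quantize(value, step):
    """Round a value to the nearest multiple of `step` (uniform quantization)."""
    return step * round(value / step)

# Assumed example values: a 1 Hz sine sampled 8 times per second.
samples = sample_sine(freq_hz=1.0, sample_rate_hz=8.0, num_samples=8)

# Amplitude quantization is a separate step, applied after sampling.
quantized = [quantize(s, step=0.25) for s in samples]

# The rounding error never exceeds half a quantization step.
worst_error = max(abs(q - s) for q, s in zip(quantized, samples))
```

Sampling reduces the signal to a finite list of values, while quantization limits the precision of each value; the two steps are independent of each other.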

Let us discuss sampling. As we have hinted, the result of the sampling process is a discrete-time signal (i.e. a signal that exists only at specific time instants). Let us now familiarize ourselves with discrete-time signals. These signals are functions, like continuous-time signals, but their independent variable does not belong to the real set. It may, for example, belong to the integer set $\mathbb{Z}$ (i.e. to the set of all positive and negative integer numbers). While real signals are defined for any time instant $t$, discrete signals are defined only at equally spaced time instants $n$ belonging to $\mathbb{Z}$. This set is less populated than $\mathbb{R}$ because it misses the values between integer values. There are, for instance, infinite values between instants $n$ and $n + 1$ in the real set that the integer set does not have. However, do not forget that even $\mathbb{Z}$ is an infinite set, meaning that theoretically our signal can go on forever.

Getting more formal, we define a discrete-time signal as $x[n]: \mathbb{Z} \rightarrow \mathbb{R}$, where we adopt a notation usual in DSP books: discrete-time signals are denoted by square brackets and by the variable $n$ instead of $t$. As you may notice, the signal is still a real-valued signal. As we have discussed previously, another required step is quantization. Signals with their amplitude quantized do not belong to $\mathbb{R}$ anymore. Amplitude quantization is a step that is independent of time discretization. Indeed, we could have continuous-time signals with quantized amplitude (although it is not very usual). Most of the inherent beauty and convenience of digital signal processing is related to the properties introduced by time discretization, while amplitude quantization has only a few side effects that require some attention during the implementation phase. We should state clearly that digital signals are discretized both in time and in amplitude. However, for simplicity, we shall now focus on signals that have continuous amplitude.

A discrete-time signal is a sequence of ordered numbers and is generally represented as shown in Figure 2.1. Any such signal can be decomposed into single pulses, known as Dirac pulses. A Dirac pulse sequence with unitary amplitude is shown in Figure 2.2, and is defined as:

$$\delta[n] = \begin{cases} 1 & n = 0 \\ 0 & n \neq 0 \end{cases} \tag{2.1}$$

Figure 2.1: An arbitrary discrete-time signal.

Figure 2.2: The Dirac sequence $\delta[n]$ (a) and the sequence $A\delta[n - T]$ (b).

A Dirac pulse of amplitude $A$ shifted in time by $T$ samples is denoted as:

$$A\delta[n - T] \tag{2.2}$$

By shifting in time multiple Dirac pulses and properly weighting each one of them (i.e. multiplying by a different real coefficient), we obtain any arbitrary signal, such as the one in Figure 2.1. This process is depicted in Figure 2.3. The Dirac pulse is thus a sort of elementary particle in this quantum game we call DSP. If you have enough patience, you can sum infinite pulses and obtain an infinite-length discrete-time signal. If you are lazy, you can stop after $N$ samples and obtain a finite-length discrete-time signal $x[n]$, with $n = 0, \ldots, N - 1$. If you are lucky, you can give a mathematical definition of a signal and let it build up indefinitely for you. This is the case of a sinusoidal signal, which is infinite-length, but it can be described by a finite set of samples (i.e. one period).

Figure 2.3: The signal in Figure 2.1 can be seen as the sum of shifted and weighted copies of the Dirac delta.
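The decomposition into shifted and weighted pulses can be checked numerically. This is a minimal sketch; the sample values in `signal` are arbitrary and the helper names are hypothetical.

```python
def dirac(n):
    """Unit-amplitude Dirac sequence: 1 at n = 0, 0 elsewhere."""
    return 1.0 if n == 0 else 0.0

def shifted_pulse(amplitude, shift, n):
    """Value at instant n of a pulse of given amplitude shifted by `shift` samples."""
    return amplitude * dirac(n - shift)

# Arbitrary example signal (assumed values, not the one in Figure 2.1).
signal = [0.5, -1.0, 2.0, 0.0, 1.5]

# Rebuild every sample as the sum of one weighted, shifted pulse per sample.
rebuilt = [sum(shifted_pulse(a, T, n) for T, a in enumerate(signal))
           for n in range(len(signal))]
```

At each instant only one pulse is nonzero, so the sum returns exactly the original sample.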

Table 2.2 reports notable discrete-time sequences that are going to be of help in progressing through the book.

Table 2.2: Notable discrete-time sequences and their mathematical notation

Name | Mathematical description
Unitary step | $u[n] = 1$ for $n \geq 0$, $0$ otherwise
Discrete-time ramp | $r[n] = n$ for $n \geq 0$, $0$ otherwise
Discrete-time sine | $x[n] = \sin(2\pi f n)$
Discrete-time decaying exponential | $x[n] = a^{n}$ for $n \geq 0$, with $0 < a < 1$

2.3 Discrete-Time Systems

TIP: Discrete-time systems must not be confused with digital systems. As with discrete-time signals, they represent an intermediate step between continuous-time systems and digital systems. They have a conceptual importance, but we do not find many of them in reality. One notable exception is the Bucket-Brigade delay. This kind of delay, used in analog effects for chorus, flangers, and other modulation effects, samples the continuous-time input signal at a frequency imposed by an external clock. This will serve as a notable example in this section.

How do we deal with signals? How do we process them? With systems!

Systems are all around us. A room is an acoustic system that spreads the voice of a speaker. A loudspeaker is an electroacoustic system that transforms an electrical signal into an acoustic pressure wave. A fuzz guitar pedal is a nonlinear system that outputs a distorted version of an input signal. A guitar string is a damped system that responds to a strike by oscillating with its own modes. The list can go on forever, but we need to focus on those systems that are of interest to us. First of all, we are interested in discrete-time systems, while the above lists only continuous-time systems. Discrete-time systems can emulate continuous-time systems or can implement something new. Discrete-time systems are usually implemented as algorithms running on a processing unit. They are depicted in the form of unidirectional signal-flow graphs, such as the one seen in Figure 2.4, and they are usually implemented in code.

Figure 2.4: Unidirectional graph for an LTI system, composed of a summation point, unitary delays ($z^{-1}$), and products by a scalar (gains), depicted with a triangle. The graph implements the difference equation

.

Discrete-time systems can:

   have one or more inputs and one or more outputs;
   be stable or unstable;
   be linear or nonlinear;
   have memory or be static; and
   be time-variant or time-invariant.

Many more kinds of discrete-time systems exist, but we shall focus on the above ones. In digital signal processing textbooks, you usually find a lot on linear time-invariant (LTI) systems. These are fundamental to understanding systems theory, control theory, acoustics, physical modeling, circuit theory, and so on. Unfortunately, many useful systems are nonlinear and time-variant and often have many inputs and/or outputs. To describe such systems, most theoretical approaches try to look at them in the light of LTI theory and address their deviations with different approaches. For this reason, we shall follow the textbook approach and prepare some background regarding LTI systems. Nonlinear systems are much more complex to understand and model. These include many acoustical systems (e.g. hammer-string interaction, a clarinet reed, magnetic response to displacement in pickups and loudspeaker coils) and most electronic circuits (amplifiers, waveshapers, etc.). However, to deal with these, a book is not enough, let alone a single chapter! A few words on nonlinear systems will be spent in Sections 2.11 and 8.3. Further resources will be given so that you can continue your reading and start discrete-time modeling of nonlinear systems.

Now it is time to get into the mathematical description of LTI systems and their properties. A single-input, single-output discrete-time system is an operator $T\{\cdot\}$ that transforms an input signal $x[n]$ into an output signal $y[n]$:

$$y[n] = T\{x[n]\} \tag{2.3}$$

Such a system is depicted in a signal-flow graph, as shown in Figure 2.5.

Figure 2.5: Flow graph representation of a discrete-time system.

There are three basic building blocks for LTI systems:

   scalar multiplication;
   summation points; and
   delay elements.

The delay operation requires an explanation. Consider the following equation:

$$y[n] = x[n - L] \tag{2.4}$$

with the time index $n$ running possibly from minus infinity to infinity and $L$ being an integer. Equation 2.4 describes a delaying system introducing a delay of length $L$ samples. You can prove to yourself with pen and paper that this system delays a signal, by fixing the length $L$ and showing that for each input value the output is shifted by $L$ samples to the right. The delay will often be denoted as $z^{-L}$, for reasons that are clear to the expert reader but less apparent to other readers. This is related to the Z-transform (the discrete-time equivalent of the Laplace transform), which is not covered in this book for simplicity. Indeed, the Z-transform is an important tool for the analysis and design of LTI systems, but for this hands-on book it is not strictly necessary, and as usual I recommend reading a specialized DSP textbook.
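The pen-and-paper exercise suggested above can also be done in code. This is a minimal sketch for a finite-length signal, assuming a non-negative delay and that the signal is zero before its first sample; the function name is hypothetical.

```python
def delay(x, L):
    """Delay a finite signal by L >= 0 samples: y[n] = x[n - L].

    Samples that would come from before the start of x are taken as zero.
    """
    return [x[n - L] if n - L >= 0 else 0.0 for n in range(len(x))]

x = [1.0, 2.0, 3.0, 4.0]
y = delay(x, 2)   # every input value appears two samples later
```

The output keeps the length of the input, so the last $L$ input values fall outside the observed window.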

Let us now take a break and discuss the Bucket-Brigade delay (BBD). This kind of device consists of a cascade of $L$ one-sample delay cells ($z^{-1}$), creating a so-called delay line ($z^{-L}$). Each of these cells is a capacitor, able to store a charge, and each cell is separated by means of transistors acting as a gate. These gates are activated at once by a periodic clock, thus allowing the charge of each cell to be transferred, or poured, to the next one. In this way, the charge stored in the first cell travels through all the cells, reaching the last one after $L$ clock cycles, thus delayed by $L$ clock cycles. You can clearly see the analogy with a bucket of water that is passed from hand to hand in a line of firefighters having to extinguish a fire. Figure 2.6 may help you understand this concept. This kind of device acts as a discrete-time system; the stored values, however, are physical quantities, and thus no digital processing happens. Still, being a discrete-time system, it is subject to the sampling theorem (discussed later) and its implications.

Figure 2.6: Simplified diagram of a Bucket-Brigade delay. Charge storage cells are separated by switches, acting as gates that periodically open, letting the signals flow from left to right. When the switches are open (a), no signal is flowing, letting some time pass. When the switches close (b), each cell transmits its value to the next one. At the same time, the input signal is sampled, or frozen, and its current value is stored in the first cell. It then travels the whole delay line, through the cells, one by one.
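The bucket-passing mechanism can be mimicked with a short simulation. This is only a behavioral sketch under simplifying assumptions: charges are plain floating-point numbers, the transfer between cells is lossless, and the class and method names are hypothetical.

```python
class BucketBrigade:
    """Behavioral model of a delay line made of a cascade of storage cells."""

    def __init__(self, num_cells):
        # All cells start empty (zero charge).
        self.cells = [0.0] * num_cells

    def tick(self, input_value):
        """One clock cycle: read the last cell, pour every cell into the
        next one, and store the sampled input in the first cell."""
        output = self.cells[-1]
        self.cells = [input_value] + self.cells[:-1]
        return output

bbd = BucketBrigade(num_cells=3)
outputs = [bbd.tick(v) for v in [1.0, 2.0, 3.0, 4.0, 5.0]]
```

With three cells, each input value reaches the output three clock cycles after it was sampled, matching the delay of $L$ clock cycles described above.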

Back to the theory, we have now discussed three basic operations: sum, product, and delay. By composing these three operations, one can obtain all sorts of LTI systems, characterized by the linearity and the time-invariance properties. Let us examine these two properties.

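Linearity is related to the principle of superposition. Let $y_1[n] = T\{x_1[n]\}$ be the response of a system to input $x_1[n]$, and similarly $y_2[n] = T\{x_2[n]\}$ be the response to input stimulus $x_2[n]$. A system is linear if, and only if, Equations 2.5 and 2.6 hold true:

$$T\{x_1[n] + x_2[n]\} = y_1[n] + y_2[n] \tag{2.5}$$

$$T\{a\,x_1[n]\} = a\,y_1[n] \tag{2.6}$$

where $a$ is a constant. These properties can be combined into the following equation, stating the principle of superposition:

$$T\{a\,x_1[n] + b\,x_2[n]\} = a\,y_1[n] + b\,y_2[n] \tag{2.7}$$

for arbitrary constants $a$ and $b$.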

In a few words, a system is linear whenever the output to each different stimulus is independent of other additive stimuli and their amplitudes.

LTI systems are only composed of sums, scalar products, and delays. Any other operation would violate the linearity property. Take, for instance, a simple system that, instead of a scalar product, multiplies two signals, here the input by itself. This is described by the following difference equation:

$$y[n] = x^2[n] \tag{2.8}$$

This simple system is memoryless (the output is not related to a previous input) and nonlinear, as exponentiation is not a linear function. It is straightforward that $(x_1[n] + x_2[n])^2 \neq x_1^2[n] + x_2^2[n]$, except for trivial cases with at least one of the two inputs equal to zero.
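The superposition principle can be probed numerically. This is a minimal sketch: the two example systems, the test signals, and the constants are assumptions chosen for illustration. Passing the check for one pair of inputs does not prove linearity, whereas failing it once is enough to show nonlinearity.

```python
def gain_system(x, g=0.5):
    """A scalar product: y[n] = g * x[n]."""
    return [g * v for v in x]

def squarer(x):
    """The memoryless nonlinear system y[n] = x[n] * x[n]."""
    return [v * v for v in x]

def obeys_superposition(system, x1, x2, a, b, tol=1e-12):
    """Compare T{a*x1 + b*x2} with a*T{x1} + b*T{x2} sample by sample."""
    combined = system([a * u + b * v for u, v in zip(x1, x2)])
    separate = [a * u + b * v for u, v in zip(system(x1), system(x2))]
    return all(abs(c - s) <= tol for c, s in zip(combined, separate))

x1 = [1.0, -2.0, 0.5]
x2 = [0.25, 4.0, -1.0]

gain_is_linear = obeys_superposition(gain_system, x1, x2, a=2.0, b=-3.0)
squarer_is_linear = obeys_superposition(squarer, x1, x2, a=2.0, b=-3.0)
```

The gain passes the check, while the squarer fails it because of the cross-product term between the two inputs.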

Another interesting property of LTI systems is time-invariance. A system is time-invariant if it always operates in the same way regardless of the moment when a stimulus is provided, or, more rigorously, if:

$$y[n] = T\{x[n]\} \;\Rightarrow\; y[n - L] = T\{x[n - L]\} \quad \text{for any integer } L \tag{2.9}$$

Time-variant systems do change their behavior depending on the moment we are looking at them.

Think of an analog synthesizer filter affected by changes in temperature and humidity of the room, and thus behaving differently depending on the time of the day. We can safely say that such a system is a (slowly) time-variant one. But even more important, it can be operated by a human and changes its response while the user rotates its knobs. It is thus (quickly) time-variant while the user modifies its parameters. A synthesizer filter, therefore, is only an approximation of an LTI system: it is time-variant. We can neglect slow variations due to the environment (digital filters are not affected by this – that’s one of their nice properties), and assume time-invariance while they are not touched, but we cannot neglect the effect of human intervention. This, in turn, calls for a check on stability while the filter is manipulated.
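Time-invariance can be probed in a similar spirit: delaying the input and then processing it should give the same result as processing first and delaying afterwards. The sketch below uses two assumed toy systems, a fixed gain and a gain that ramps up with the sample index (loosely imitating a knob being turned); all names are hypothetical.

```python
def shift(x, L):
    """Delay x by L samples, padding with zeros and keeping the length.
    The test input should end with at least L zeros so nothing is lost."""
    return [0.0] * L + x[:len(x) - L]

def fixed_gain(x):
    """Time-invariant system: the same gain at every instant."""
    return [0.5 * v for v in x]

def ramping_gain(x):
    """Time-variant system: the gain depends on the sample index n."""
    return [0.1 * n * v for n, v in enumerate(x)]

def is_time_invariant(system, x, L, tol=1e-12):
    """Compare T{x[n - L]} with y[n - L] for one input and one delay."""
    delayed_then_processed = system(shift(x, L))
    processed_then_delayed = shift(system(x), L)
    return all(abs(p - q) <= tol
               for p, q in zip(delayed_then_processed, processed_then_delayed))

x = [1.0, 2.0, 3.0, 0.0, 0.0]
fixed_ok = is_time_invariant(fixed_gain, x, L=2)
ramping_ok = is_time_invariant(ramping_gain, x, L=2)
```

As with linearity, a single successful check is only evidence, while a single failure is conclusive.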

Causality is another property, stating that the output of a system depends only on the current and previous inputs. An anti-causal system can be treated in mathematical terms but has no real-time implementation in the real world since nobody knows the future (yet).

Anti-causal systems can be implemented offline (i.e. when the entire signal is known). A simple example is the reverse playback of a track. You can invert the time axis and play the track from the end to the beginning because it has been entirely recorded. Another example is a reverse reverb effect. Real-time implementations of this effect rely on storing a short portion of the input signal, reversing it, and then passing it through a reverb. This is offline processing as well, because the signal is first stored and then reversed.

Stability is another important property of systems. A so-called bounded-input, bounded-output (BIBO) stable system is a system whose output never goes to infinity when finite signals are applied:

$$|x[n]| \leq B_x < \infty \;\Rightarrow\; |y[n]| \leq B_y < \infty \quad \forall n \tag{2.10}$$

This means that if the input signal is bounded (both in the positive and negative ranges) by a value $B_x$ for its whole length, then the output of the system will be equally bounded by another maximum value $B_y$, and both the bounding values are not infinite. In other words, the output will never go to infinity, no matter what the input is, excluding those cases where the input is infinite (in such cases, we can accept the output to go to infinity).

Unstable systems are really bad for your ears. An example of an unstable system is the feedback connection of a microphone, an amplifier, and a loudspeaker. Under certain conditions (the Barkhausen stability criterion), the system becomes unstable, growing its output with time and creating a so-called Larsen effect. Fortunately, no Larsen grows up to infinity, because no public address system is able to generate infinite sound pressure levels. It can hurt, though. In Gabrielli et al. (2014), a virtual acoustic feedback algorithm to emulate guitar howling without hurting your ears is proposed and some more information on the topic is given.
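A toy model gives a feeling for the difference between a stable and an unstable loop. The sketch below is not a model of a real microphone-amplifier-loudspeaker chain: it assumes a single feedback path with a one-sample delay and a constant loop gain, and the names are hypothetical.

```python
def feedback_loop(x, loop_gain):
    """y[n] = x[n] + loop_gain * y[n-1]: the output is fed back to the input."""
    y, previous = [], 0.0
    for value in x:
        previous = value + loop_gain * previous
        y.append(previous)
    return y

# A single bounded pulse followed by silence.
impulse = [1.0] + [0.0] * 19

stable = feedback_loop(impulse, loop_gain=0.5)    # the pulse dies away
unstable = feedback_loop(impulse, loop_gain=1.5)  # the pulse keeps growing
```

With a loop gain below one, the bounded input produces a bounded, decaying output; with a loop gain above one, the output grows at every pass through the loop, which is the behavior a BIBO-stable system must never show.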

One reason why we are focusing our attention on LTI systems is the fact that they are quite common and they are tractable with relatively simple math. Furthermore, they are uniquely defined by a property called impulse response. The impulse response is the output of the system after a Dirac pulse is applied to its input. In other words, it is the sequence:

$$h[n] = T\{\delta[n]\} \tag{2.11}$$

An impulse response may be of finite or infinite length. This distinction is very important and leads to two classes of LTI systems: infinite impulse response (IIR) systems and finite impulse response (FIR) systems. Both are very important in the music signal processing field.

Knowing the impulse response of a system allows us to compute the output as:

$$y[n] = \sum_{k=-\infty}^{\infty} h[k]\,x[n - k] \tag{2.12}$$

Or, if the impulse response has finite length $M$:

$$y[n] = \sum_{k=0}^{M-1} h[k]\,x[n - k] \tag{2.13}$$

Those who are familiar with math and signal processing will recognize that the sum in Equation 2.12 is the convolution between the input and the impulse response:

$$y[n] = x[n] * h[n] \tag{2.14}$$
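For finite-length sequences, the convolution sum translates almost literally into two nested loops. This is a minimal, unoptimized sketch; the function name and the example values are assumptions.

```python
def convolve(x, h):
    """Convolution of two finite sequences: y[n] = sum over k of h[k] * x[n - k]."""
    y = [0.0] * (len(x) + len(h) - 1)
    for n in range(len(y)):
        for k in range(len(h)):
            # Skip the terms where x[n - k] falls outside the input signal.
            if 0 <= n - k < len(x):
                y[n] += h[k] * x[n - k]
    return y

x = [1.0, 2.0, 3.0]     # input signal
h = [1.0, 0.5]          # impulse response of a short FIR system
y = convolve(x, h)
```

Each output sample costs one product per impulse response sample, which is why long impulse responses become expensive.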

where the convolution is denoted by the symbol $*$. As you may notice, the convolution sum can be quite expensive if the impulse response is long: for each output sample, you have to evaluate up to infinitely many products! Fortunately, there are ways to deal with IIR systems that do not require infinite operations per sample (indeed, most musical filters are IIR, but very cheap). Conversely, FIR systems with a very long impulse response are hard to deal with (partitioning the convolution can help by exploiting parallel processing units). In this case, dealing with signals in the frequency domain provides some help, as the convolution between $x[n]$ and $h[n]$ can be seen in the frequency domain as the product of their Fourier transforms (this will be shown in Section 2.4.3). Of particular use are those LTI systems whose impulse response can be formulated in terms of difference equations of the type:

$$y[n] = \sum_{k=0}^{M} b_k\,x[n - k] - \sum_{k=1}^{N} a_k\,y[n - k] \tag{2.15}$$
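As a sketch of how a difference equation of this type turns into code, consider the smallest recursive case, $y[n] = b_0 x[n] - a_1 y[n-1]$. The sign convention of the feedback term, the coefficient values, and the function names are assumptions made for illustration. The recursive form is compared against a direct convolution with a stored portion of the impulse response.

```python
def one_pole(x, b0, a1):
    """Recursive form: y[n] = b0 * x[n] - a1 * y[n-1].
    One stored value replaces the whole past history."""
    y, previous = [], 0.0
    for value in x:
        previous = b0 * value - a1 * previous
        y.append(previous)
    return y

def truncated_convolution(x, h):
    """Direct form: y[n] = sum over k of h[k] * x[n - k], using the stored part of h."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]

b0, a1 = 0.5, -0.5

# First 16 samples of the (infinitely long) impulse response.
impulse = [1.0] + [0.0] * 15
h = one_pole(impulse, b0, a1)

x = [1.0, -1.0, 0.5, 0.0, 2.0, 0.0, 0.0, 1.0]
recursive = one_pole(x, b0, a1)
direct = truncated_convolution(x, h)
```

Both computations give the same output for this short input, but the recursive form needs two products per sample regardless of how long the impulse response is.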

These LTI systems, most notably filters, are easy to implement and analyze. As you can see, they are based on three operations only: sums, products, and delays. The presence of delay is not obvious from Equation 2.15, but consider that to recall the past elements ($x[n-1]$, $y[n-1]$, etc.), some sort of delay mechanism is required. This requires that the present input is stored and recalled after $k$ samples. Delays are obtained by storing the values in memory. Summing, multiplying, writing a value to memory, reading a value from memory: these operations are implemented in all modern processors by means of dedicated instructions. Factoring the LTI impulse response as in Equation 2.15 helps make IIR filters cheap: there is no need to perform an infinite convolution, because storing the previous outputs retains the memory of all the past history.

2.4 The Frequency Domain

Before going deeper and explaining how to transform a continuous-time signal into a discrete-time signal, and vice versa, we need to introduce a new domain: the frequency domain.

The physical concept of frequency itself is very simple. The frequency is the number of occurrences per unit of time of a given phenomenon (e.g. the repetition of a cycle of a sine wave or the zero crossing of a generic signal). This is generally called temporal frequency and denoted with the letter $f$. The measurement unit for the temporal frequency is the Hertz, Hz (with capital H, please!), in memory of Heinrich Hertz. The reciprocal of the frequency is the period $T$ (i.e. the time required to complete one cycle at the given frequency), so that $T = 1/f$.

While audio engineers are used to describing the frequency in Hz, because they deal with acoustic signals, in signal processing we are more used to angular frequency, denoted with the Greek letter $\omega$ and measured in radians per second (rad/s). For those more used to angular degrees, $2\pi$ rad $= 360°$ and $\pi$ rad $= 180°$. The angular frequency is related to the temporal frequency, but it measures the angular displacement per unit of time. If we take, for example, a spinning wheel, the angular frequency measures the angle swept per second, while the temporal frequency measures the number of complete rotations per second. In other words, if the wheel completes a turn in a second, $f = 1$ Hz and $\omega = 2\pi$ rad/s. The analytical relation between temporal frequency, angular frequency, and period is thus:

$$\omega = 2\pi f = \frac{2\pi}{T} \tag{2.16}$$
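The relations between temporal frequency, angular frequency, and period are simple enough to wrap in three helper functions. This is a minimal sketch; the function names and the 440 Hz example are assumptions.

```python
import math

def angular_from_hz(f_hz):
    """Angular frequency in rad/s from temporal frequency in Hz."""
    return 2.0 * math.pi * f_hz

def hz_from_angular(omega):
    """Temporal frequency in Hz from angular frequency in rad/s."""
    return omega / (2.0 * math.pi)

def period_from_hz(f_hz):
    """Period in seconds: the time needed to complete one cycle."""
    return 1.0 / f_hz

# A wheel completing one turn per second sweeps a full circle every second.
omega_wheel = angular_from_hz(1.0)

# Assumed example: a 440 Hz tone.
omega_tone = angular_from_hz(440.0)
period_tone = period_from_hz(440.0)
```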

If the concept of frequency is clear, we can now apply this to signal processing. Follow me with a little patience as I point out some concepts and then connect them together. In signal processing, it is very common to use transforms. These are mathematical operators that allow you to observe the signal under a different light (i.e. they change the shape of the input signal and take it to a new domain, without affecting its informative content). The domain is the realm where the signal lives. Without going too abstract, let us take the mathematical expression:

$$x = \sin(2\pi f t) \tag{2.17}$$

If $t \in \mathbb{R}$, then the resulting signal $x$ lives in the continuous-time domain. If $t \in \mathbb{Z}$, then $x$ lives in the discrete-time domain. We already know of these two domains. We are soon going to define a third domain, the frequency domain. We shall see that in this new domain, the sine signal of Equation 2.17 has a completely new shape (a line!) but still retains the same meaning: a sinusoidal wave oscillating at frequency $f$.

There is a well-known meme from the Internet showing that a solid cylinder may look like a circle or a square if projected on a wall with light coming from the side or from the bottom, stating that observing one phenomenon from different perspectives can yield different results, yet the truth is a complex mixture of both observations (see Figure 2.7). Similarly, in our field, the same phenomenon (a signal) can be observed and projected in different domains. In each domain, the signal looks different, because each domain describes the signal in a different way, yet both domains are speaking of the same signal. In the present case, one domain is time and the other is frequency. Often the signal is generated in the time domain, but signals can be synthesized in the frequency domain as well. There are also techniques to observe the signal in both domains, projecting the signal in one of the known time-frequency domains (more in Section 2.4.6), similar to the 3D image of the cylinder in Figure 2.7. Mixed time-frequency representations are often neglected by sound engineers, but are very useful to grasp the properties of a signal at a glance, similar to what the human ear does.

Figure 2.7: The projection of a cylinder on two walls from orthogonal perspectives. The object appears with different shapes depending on the point of view. Similarly, any signal can be projected in the time and the frequency domain, obtaining different representations of the same entity.

We shall first discuss the frequency domain and how to take a time-domain signal into the frequency domain.

2.4.1 Discrete Fourier Series

Let us consider a discrete-time signal. We have just discovered that we can see any discrete-time signal as a sum of Dirac pulses, each one with its weight and time shift. The Dirac pulse is our elementary brick to build a discrete-time signal. It is very intuitive to see how from the sum of these bricks (Figure 2.3) we build the signal in Figure 2.1. Let us now consider a different perspective. What if we consider the signal in Figure 2.1 as composed of a sum of sinusoidal signals? Figure 2.8 shows how this happens. It can be shown that this process is general, and a large class of signals (under the appropriate conditions) can be seen as consisting of elementary sinusoidal signals, each with its weight, phase, and frequency. This analysis is done through different formulations of the so-called Fourier transform, depending on the kind of input signal we consider. We are going to start with a periodic discrete-time signal.

Figure 2.8: The signal of Figure 2.1 is periodic, and can thus be decomposed into a sum of periodic signals (sines).

Let us take a periodic signal $\tilde{x}[n]$. We shall take an ideal sawtooth signal, for the joy of our synthesizer-fanatic reader. This signal is created by repeating linear ramps, increasing the dependent variable over time, and resetting it to zero after an integer number of samples $N$. This signal takes the same value again after every $N$ samples, and can thus be defined as periodic:

$$\tilde{x}[n] = \tilde{x}[n + kN], \quad k \in \mathbb{Z} \tag{2.18}$$

As we know, the frequency of this signal is the reciprocal of the period (i.e. $\omega_F = 2\pi / N$ when expressed as an angular frequency). In this section, we will refer to the frequency as the angular frequency. The fundamental frequency is usually denoted with a 0 subscript, but for a cleaner notation we are sticking with the $F$ subscript for the following lines.

Given a periodic signal with period $N$, there are a small number of sinusoidal signals that have a period equal to $N$ or to an integer divisor of $N$. What if we write the periodic signal as a sum of those signals? Let us take, for example, cosines as these basic components. Each cosine has an angular frequency that is an integer multiple of the fundamental angular frequency $\omega_F$. Each cosine will have its own amplitude and phase. The presence of a constant term $a_0$ is required to take into account an offset, or bias, or DC component (i.e. a component at null frequency; it is not oscillating, but it fulfills Equation 2.18). This is the real-form discrete Fourier series (DFS):

$$\tilde{x}[n] = a_0 + \sum_{k=1}^{\lfloor N/2 \rfloor} A_k \cos(k\omega_F n + \phi_k) \tag{2.19}$$

Equation 2.19 tells us that any periodic signal in the discrete-time domain can be seen as a sum of a finite number of cosines at frequencies that are multiples of the fundamental frequency, each weighted properly by an amplitude coefficient $A_k$ and time-shifted by a phase coefficient $\phi_k$. The cosine components are the harmonic partials, or harmonics, of the original signal, and they represent the spectrum of the signal.
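The claim that a periodic signal is a finite sum of cosines can be verified on the sawtooth. The sketch below assumes an odd period of $N = 9$ samples (which avoids a special case for the component at half the sampling rate) and computes the amplitudes and phases by projecting the signal onto cosines and sines; the variable names and this particular way of obtaining the coefficients are illustrative assumptions.

```python
import math

N = 9                                   # period in samples (odd, assumed)
sawtooth = [n / N for n in range(N)]    # one period of a rising ramp
omega_F = 2.0 * math.pi / N             # fundamental angular frequency

# Constant (DC) term: the mean value over one period.
a0 = sum(sawtooth) / N

amplitudes, phases = [], []
for k in range(1, N // 2 + 1):
    # Projections of the signal onto the k-th cosine and sine.
    a_k = 2.0 / N * sum(x * math.cos(k * omega_F * n) for n, x in enumerate(sawtooth))
    b_k = 2.0 / N * sum(x * math.sin(k * omega_F * n) for n, x in enumerate(sawtooth))
    # Amplitude and phase of the single cosine A * cos(k * omega_F * n + phi).
    amplitudes.append(math.hypot(a_k, b_k))
    phases.append(math.atan2(-b_k, a_k))

def resynthesize(n):
    """Sum of the constant term and the harmonic cosines at instant n."""
    return a0 + sum(A * math.cos(k * omega_F * n + phi)
                    for k, (A, phi) in enumerate(zip(amplitudes, phases), start=1))

# Two periods, to show that the sum is periodic like the original signal.
rebuilt = [resynthesize(n) for n in range(2 * N)]
```

Four cosines plus the constant term are enough to rebuild all nine samples of the period, and the sum repeats itself with the same period.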

We can further develop our math to introduce a more convenient notation. Bear with me and Mr. Jean Baptiste Joseph Fourier for a while. It can be shown that a cosine having frequency and phase components in its argument can be seen as the sum of a sine and a cosine of the same frequency and with no phase term in their argument, as shown in Equation 2.20. In other words, the added sinusoidal term takes into account the phase shift given by $\phi$. After all, intuitively, you already know that cosine and sine are the same thing, just shifted by a phase offset:

$$\cos(\omega n + \phi) = \cos(\phi)\cos(\omega n) - \sin(\phi)\sin(\omega n) \tag{2.20}$$

We can plug Equation 2.20 into Equation 2.19, obtaining a new formulation of the Fourier theorem that employs sines and cosines and requires the same number of coefficients ( instead of , ) to store information regarding

(2.21)

Now, if you dare to play with complex numbers, in signal processing theory we have another mathematical notation that helps to develop the theory further. If you want to stop here, this is OK, but you won’t be able to follow the remainder of the section regarding Fourier theory. You can jump to Section 2.4.3, and you should be able to understand almost everything.

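We start from Euler’s equation:

$$e^{j\theta} = \cos(\theta) + j \sin(\theta) \tag{2.22}$$

where $j$ is the imaginary unit.[10] The term $e^{j\theta}$ is called a complex exponential.

By reworking Equation 2.22, we obtain the following equalities:

$$\cos(\theta) = \frac{e^{j\theta} + e^{-j\theta}}{2}, \qquad \sin(\theta) = \frac{e^{j\theta} - e^{-j\theta}}{2j} \tag{2.23}$$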

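Now we shall apply these expressions to the second formulation of the DFS. If we take $\theta = k \omega_F n$, we can substitute the sine and cosine expressions of Equation 2.23 in Equation 2.21 (with $a_k$ and $b_k$ its cosine and sine coefficients), yielding:

$$\tilde{x}[n] = a_0 + \sum_{k=1}^{\lfloor N/2 \rfloor} \left[ a_k \frac{e^{jk\omega_F n} + e^{-jk\omega_F n}}{2} + b_k \frac{e^{jk\omega_F n} - e^{-jk\omega_F n}}{2j} \right]$$

This can be rewritten by separating the two sums (recall that $1/j = -j$) as:

$$\tilde{x}[n] = a_0 + \sum_{k=1}^{\lfloor N/2 \rfloor} \frac{a_k - j b_k}{2}\, e^{jk\omega_F n} + \sum_{k=1}^{\lfloor N/2 \rfloor} \frac{a_k + j b_k}{2}\, e^{-jk\omega_F n}$$

The last step consists of transforming all the constant terms $a_0$, $(a_k - j b_k)/2$ and $(a_k + j b_k)/2$ into one single vector $\tilde{X}[k]$ that shall be our frequency representation of the signal:

$$\tilde{x}[n] = \sum_{k=-\lfloor N/2 \rfloor}^{\lfloor N/2 \rfloor} \tilde{X}[k]\, e^{jk\omega_F n} \tag{2.24}$$

where

$$\tilde{X}[k] = \begin{cases} a_0 & k = 0 \\ \dfrac{a_k - j b_k}{2} & k > 0 \\ \dfrac{a_{|k|} + j b_{|k|}}{2} & k < 0 \end{cases} \tag{2.25}$$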

This little trick not only makes Equation 2.21 more compact, but also instructs us how to construct a new discrete signal $\tilde{X}[k]$ that bears all the information related to $\tilde{x}[n]$.

Several key concepts can be drawn by observing how $\tilde{X}[k]$ is constructed:

   $\tilde{X}[k]$ is complex valued, except for $\tilde{X}[0]$.
   $\tilde{X}[0]$ is the offset component, now seen as a null-frequency component, or DC term.
   $\tilde{X}[k]$ is composed of terms related to negative frequency components ($k < 0$) and positive frequency components ($k > 0$).
   Negative frequencies have no physical interpretation, but nonetheless are very important in DSP.
   Negative frequency components have identical coefficients, besides the sign of the imaginary part (i.e. they are complex conjugates: $\tilde{X}[-k] = \tilde{X}^{*}[k]$).
   Roughly speaking, negative frequency components can be neglected when we observe a signal.[11]

To conclude, we are now able to construct a frequency representation of a time-domain signal for periodic signals. The process is depicted in Figure 2.9.
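As a rough illustration of the real-form DFS, the following Python sketch extracts an offset plus one amplitude and one phase per harmonic from a single period of a ramp, then rebuilds the periodic signal as a sum of cosines. The function names (`dfs_real_form`, `dfs_synthesize`) are invented for this example, and the analysis step borrows the transform formula that is only introduced in Section 2.4.2, so take it as a sketch under those assumptions rather than a reference implementation.

```python
import cmath
import math


def dfs_real_form(period):
    """Return (a0, amps, phases) describing one period as offset + cosines."""
    N = len(period)
    w_f = 2 * math.pi / N  # fundamental angular frequency (radians/sample)
    # Complex coefficients scaled by 1/N (analysis formula, assumed here).
    X = [
        sum(x * cmath.exp(-1j * k * w_f * n) for n, x in enumerate(period)) / N
        for k in range(N // 2 + 1)
    ]
    a0 = X[0].real
    amps, phases = [], []
    for k in range(1, N // 2 + 1):
        # For even N, the k = N/2 term has no conjugate twin: do not double it.
        scale = 1.0 if (N % 2 == 0 and k == N // 2) else 2.0
        amps.append(scale * abs(X[k]))
        phases.append(cmath.phase(X[k]))
    return a0, amps, phases


def dfs_synthesize(a0, amps, phases, N, n):
    """Evaluate the sum of cosines at sample index n."""
    w_f = 2 * math.pi / N
    return a0 + sum(
        A * math.cos(k * w_f * n + p)
        for k, (A, p) in enumerate(zip(amps, phases), start=1)
    )


N = 16
saw = [n / N for n in range(N)]  # one period of a rising ramp
a0, amps, phases = dfs_real_form(saw)
# Rebuild three periods: the sum of cosines is periodic by construction.
rebuilt = [dfs_synthesize(a0, amps, phases, N, n) for n in range(3 * N)]
```

The rebuilt sequence repeats the ramp every N samples, and the offset `a0` is simply the mean of the period.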

Figure 2.9: A signal in time (left), decomposed as a sum of sinusoidal signals (waves in the back). The frequency and amplitudes of the sinusoidal signals are projected into a frequency domain signal (right). The projection metaphor of Figure 2.7 is now put in context.

2.4.2 Discrete Fourier Transform

Theoretically speaking, periodic signals are a very important class of signals in music. However, periodic signals stay exactly the same for their whole life (i.e. from minus infinity to infinity). The closest to such a signal I can think of is a Hammond tonewheel that spins forever. But eventually the AC mains will black out for a while (at least at my institution it happens quite frequently). A broader class of signals of interest is that of non-periodic and/or non-stationary signals (i.e. changing over time). To describe non-periodic but stationary signals, we need to depart from the DFS and develop the discrete Fourier transform.[12]

We shall here extend the concepts developed for the DFS to non-periodic signals of finite duration (i.e. those signals that are non-zero only for a finite period of time). Let us take a signal $x[n]$, which is null outside a finite interval of the time index n (e.g. a short ramp), as in Figure 2.10a.

Figure 2.10: A finite duration signal (a) is converted into a periodic signal by juxtaposing replicas (b). The periodic version of $x[n]$ can now be analyzed with the DFS.

By applying a neat trick, our efforts for obtaining the DFS will not be wasted. Indeed, if we make $x[n]$ periodic by replicating it infinitely many times, we can apply the DFS to the periodic version of the signal, $\tilde{x}[n]$, which we mark with a tilde, as before, to highlight that it has been made periodic. The periodic signal can be treated using the DFS as shown before. The aperiodic signal is again seen as a sum of complex exponentials (Equation 2.26). The number of the exponentials is equal to the finite length of the signal. This time, the complex exponentials are called partials and are not necessarily harmonic. The set of the complex-valued coefficients $X[k]$ is again called a spectrum. The spectrum is a discrete signal with values (often called bins) defined only for the finite set of frequencies k:

$$x[n] = \frac{1}{N} \sum_{k=0}^{N-1} X[k]\, e^{j 2\pi k n / N} \tag{2.26}$$

Please note that in Equation 2.26, we do not refer to a fundamental frequency $\omega_F$, since $x[n]$ is not periodic, but we made the argument of the exponential explicit, where $\omega_k = 2\pi k / N$, with $k = 0, \ldots, N-1$, are the frequencies of the complex exponentials and N can be selected arbitrarily, as we shall later detail.

Equation 2.26 not only states that a discrete-time signal can be seen as a sum of elementary components. It also tells how to transform a spectrum $X[k]$ into a time-domain signal. But wait a minute: we do not yet know how to obtain the spectrum! For tutorial reasons, we focused on understanding the Fourier series and its derivatives. However, we need, in practice, to know how to transform a signal from the time domain to the frequency domain. The discrete Fourier transform (DFT) is a mathematical operator that takes a real[13] discrete-time signal of length N into a complex discrete signal of length N that is a function of frequency:

$$\mathcal{F}: x[n] \mapsto X[k] \tag{2.27}$$

The DFT is invertible, meaning that the signal in the frequency domain can be transformed back to obtain the original signal in the time domain without degradation of the signal (i.e. $\mathcal{F}^{-1}\{\mathcal{F}\{x[n]\}\} = x[n]$). The direct and inverse DFT are done through Equations 2.28 and 2.29:

$$X[k] = \sum_{n=0}^{N-1} x[n]\, e^{-j 2\pi k n / N} \tag{2.28}$$

$$x[n] = \frac{1}{N} \sum_{k=0}^{N-1} X[k]\, e^{j 2\pi k n / N} \tag{2.29}$$

The equations are very similar. The added term $1/N$ is required to preserve the energy. As seen above, $X[k]$ is a complex-valued signal. How to interpret this? The real and imaginary part of each frequency bin do not alone tell us anything of practical use. To gather useful information, we need to process further the complex-valued spectrum to obtain the magnitude spectrum and the phase spectrum. The magnitude spectrum is obtained by calculating the modulus (or norm) of each complex value, while the phase is obtained by the argument function:

$$|X[k]| = \sqrt{\Re\{X[k]\}^2 + \Im\{X[k]\}^2} \tag{2.30}$$

$$\angle X[k] = \operatorname{atan2}\big(\Im\{X[k]\}, \Re\{X[k]\}\big) \tag{2.31}$$

The magnitude and phase spectra provide different kinds of information. The first is most commonly used in audio engineering because it provides information that we directly understand (i.e. the energy of the signal at each frequency component, something our ears are very sensitive to). The phase spectrum, on the other hand, tells the phase of the complex exponentials along the frequency domain. It has great importance in the analysis of LTI systems.
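To make the direct and inverse transforms more tangible, here is a minimal Python sketch using only the standard library. It assumes the common convention in which the direct DFT is unscaled and the inverse carries the 1/N factor; the helper names `dft` and `idft` are ours, and the double loop is written for readability, not speed.

```python
import cmath
import math


def dft(x):
    """Direct DFT: N complex bins from N samples (no scaling)."""
    N = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
        for k in range(N)
    ]


def idft(X):
    """Inverse DFT: the 1/N factor undoes the gain of the direct transform."""
    N = len(X)
    return [
        sum(X[k] * cmath.exp(2j * math.pi * k * n / N) for k in range(N)) / N
        for n in range(N)
    ]


N = 32
# Two cycles of a cosine with amplitude 0.5 and a phase of pi/3.
x = [0.5 * math.cos(2 * math.pi * 2 * n / N + math.pi / 3) for n in range(N)]

X = dft(x)
magnitude = [abs(v) for v in X]        # modulus of each bin
phase = [cmath.phase(v) for v in X]    # argument of each bin
x_back = [v.real for v in idft(X)]     # imaginary parts are rounding noise
```

In this example the energy of the cosine is split between bin 2 and its mirror bin N−2, each with magnitude 0.5·N/2, and the phase of bin 2 equals the phase of the cosine. The phase of the remaining (near-zero) bins is numerical noise and should not be interpreted.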

It is important to note that the length of the spectrum is equal to the length of the signal in the time domain.

Since the number of DFT points, or DFT bins (i.e. the number of frequencies), is equal to the length N of the signal, we incur a practical issue: very short signals have very few DFT bins (i.e. our knowledge of the signal in frequency is quite rough). To gather more insight, there is a trick called zero-padding. It consists of adding zeros at the end or at the beginning of the signal. This artificially increases the signal length to $M > N$ points, thus increasing the number of DFT bins. Zero-padding is also employed to round the length of the signal up to the closest power of 2 that is larger than the signal length, which makes it suitable for efficient implementations of the DFT (a few words regarding computational complexity will come later). Applying zero-padding (i.e. calculating the M-point DFT) does not alter the shape of the spectrum; it just adds M−N points by interpolating the values of the N-point DFT.
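A small Python sketch can show that zero-padding only interpolates the spectrum. It assumes zeros appended at the end and a padded length M that is an integer multiple of N, so that every original bin lands exactly on a bin of the longer DFT; the `dft` helper is a plain, unoptimized implementation written for this example.

```python
import cmath
import math


def dft(x):
    """Plain O(N^2) DFT, for illustration only."""
    N = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
        for k in range(N)
    ]


x = [1.0, 2.0, 0.0, -1.0]              # N = 4 samples
M = 16                                 # padded length, a multiple of N
padded = x + [0.0] * (M - len(x))      # zeros appended at the end

X_short = dft(x)                       # 4 bins
X_long = dft(padded)                   # 16 bins
step = M // len(x)                     # every 4th long bin is an original bin
```

Here `X_long[k * step]` coincides with `X_short[k]`, while the bins in between are new, interpolated points on the same underlying curve.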

So far, we have discovered:

   Any finite-length discrete-time signal can be described in terms of weighted complex exponentials (i.e. sum of sinusoids). The weights are called the spectrum of the signal. The complex exponentials are equally spaced in frequency.
   From the weights of these exponentials, we can obtain the signal back in the time domain.
   Both $x[n]$ and $X[k]$ have the same length N.

Let us now review the DFT of notable signals.

2.4.3 Properties of the Discrete Fourier Transform

Linearity:

$$\mathcal{F}\{x[n] + y[n]\} = X[k] + Y[k] \tag{2.32}$$

meaning that the DFT of the sum of two signals is equal to the sum of the two DFTs. This property has a lot of practical and theoretical consequences.

Time scaling:

$$\mathcal{F}\{x(at)\} = \frac{1}{|a|}\, X\!\left(\frac{\omega}{a}\right) \tag{2.33}$$

One of the consequences of this property is of interest to sound designers: it implies the pitch shifting of a signal that is played back faster (a > 1) or slower (a < 1).

Periodicity:

$$X[k + N] = X[k] \tag{2.34}$$

The spectrum is replicated in frequency (this is relevant to the problem of aliasing).

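Convolution:

$$\mathcal{F}\{x[n] * y[n]\} = X[k] \cdot Y[k] \tag{2.35}$$

This means that the product of two DFTs is equal to the DFT of the convolution between the two signals (for finite-length DFTs, the convolution is circular). On par with that, it is also true that the product of two signals in the time domain is equivalent, up to a scaling factor, to the convolution of their spectra in the frequency domain (i.e. $\mathcal{F}\{x[n] \cdot y[n]\} = \frac{1}{N} X[k] * Y[k]$).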
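The convolution property can be checked numerically with a few lines of Python. The sketch below assumes signals of equal length N and uses circular convolution, which is the form of convolution that matches a product of N-point DFTs; `dft` and `circular_convolve` are illustrative helpers, not library functions.

```python
import cmath
import math


def dft(x):
    """Plain O(N^2) DFT, for illustration only."""
    N = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
        for k in range(N)
    ]


def circular_convolve(x, y):
    """Circular convolution: indices wrap around modulo N."""
    N = len(x)
    return [sum(x[m] * y[(n - m) % N] for m in range(N)) for n in range(N)]


x = [1.0, 2.0, 3.0, 4.0]
y = [0.5, -1.0, 0.0, 0.25]

lhs = dft(circular_convolve(x, y))               # DFT of the convolution
rhs = [a * b for a, b in zip(dft(x), dft(y))]    # product of the two DFTs
```

Both sides agree bin by bin, up to rounding errors.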

Parseval’s theorem:

$$\sum_{n=0}^{N-1} |x[n]|^2 = \frac{1}{N} \sum_{k=0}^{N-1} |X[k]|^2$$

meaning that the energy stays the same whether we evaluate it in the time domain (left) or in the frequency domain (right).

2.4.4 Again on LTI Systems: The Frequency Response

By now, we know that a signal is completely described by its DFT $X[k]$. The impulse response of a system, $h[n]$, is a signal, a very important one, because it contains all the information related to its system. But what happens if we transform it into the frequency domain? We obtain another very important property of the system, which conveys all the information regarding the system, just translated into the frequency domain. This is the so-called frequency response:

$$H[k] = \mathcal{F}\{h[n]\} \tag{2.36}$$

The frequency response is even more useful than the impulse response for musical applications. If you want to design a filter, for example, you can design its desired spectrum and transform it into the time domain through the inverse DFT.[14] The frequency response of two systems can be compared to determine which one best suits your frequency specifications.

But there is one more use for the frequency response: you can use it to process a signal. Processing a signal with an LTI system implies the convolution between the signal and the system impulse response. This can be expensive in some cases. Computational savings can be obtained by transforming both into the frequency domain, multiplying them, and transforming the product back into the time domain. This is possible thanks to the convolution property, reported in Section 2.4.3. Processing the signal through the LTI system in such a way may reduce the computational cost, provided that our latency constraints allow us to buffer the input samples for the DFT. The larger this buffer, the larger the computational savings. Why so? Keep on reading through the next section.
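The following Python sketch filters a short signal in both ways and compares the results. It assumes that both the signal and the impulse response are zero-padded to the length of their linear convolution, so that the wrap-around of the circular convolution does no harm. The helper names are ours, and the naive `dft` used here is itself slow: the savings discussed above only materialize with a fast transform.

```python
import cmath
import math


def dft(x, sign=-1):
    """Plain O(N^2) transform; sign=+1 gives the unscaled inverse kernel."""
    N = len(x)
    return [
        sum(x[n] * cmath.exp(sign * 2j * math.pi * k * n / N) for n in range(N))
        for k in range(N)
    ]


def direct_convolve(x, h):
    """Time-domain (linear) convolution."""
    y = [0.0] * (len(x) + len(h) - 1)
    for n, xv in enumerate(x):
        for m, hv in enumerate(h):
            y[n + m] += xv * hv
    return y


def frequency_domain_convolve(x, h):
    """Multiply the two spectra, then go back to the time domain."""
    M = len(x) + len(h) - 1                     # length of the linear convolution
    X = dft(x + [0.0] * (M - len(x)))
    H = dft(h + [0.0] * (M - len(h)))           # frequency response of the system
    Y = [a * b for a, b in zip(X, H)]
    return [v.real / M for v in dft(Y, sign=+1)]  # inverse DFT with 1/M scaling


x = [1.0, 0.5, -0.25, 0.0, 2.0, -1.0, 0.75, 0.1]
h = [0.5, 0.3, 0.2]                             # a short smoothing impulse response

y_time = direct_convolve(x, h)
y_freq = frequency_domain_convolve(x, h)
```

The two outputs match sample by sample, which is the convolution property at work.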

Studio engineers and music technology practitioners usually confuse the terms “frequency response” and “magnitude frequency response.” The frequency response obtained by computing the DFT of the impulse response is a complex-valued signal, not very informative for us humans. Computing its magnitude provides the sort of curves we are more used to seeing for evaluating loudspeaker quality or the spectral balance of a mix. However, such curves miss the phase information, which can be computed as well, as the argument of the frequency response.

2.4.5 Computational Complexity and the Fast Fourier Transform

The computational complexity of a DSP algorithm is the number of operations (generally sums and products) that are required for each input sample. It does not necessarily tell how fast the algorithm will run on your computer. There are many complex factors related to the implementation of an algorithm, such as the available computational resources (memory, cache, registers, mathematical instructions, parallel processing units, etc.) or the coding strategy, that are important to the fast execution of a piece of code, but in general we first evaluate the computational cost of an algorithm and try to reduce it on paper, and then see how it performs in the real world.

For the DFT, consider the following equation:

$$X[k] = \sum_{n=0}^{N-1} x[n]\, W_N^{kn} \tag{2.37}$$

where $W_N = e^{-j 2\pi / N}$.

For a sequence of length N, Equation 2.37 requires $N^2$ complex products and $N(N-1)$ complex sums. Summing two complex numbers requires summing the respective real and imaginary parts, and thus it requires two real sums. The product of two complex numbers instead requires four real products because it is expressed as $(a + jb)(c + jd) = (ac - bd) + j(ad + bc)$.

The computational cost is thus of the order of $N^2$, because for increasing N, the cost of computing the products prevails over that of the sums.[15]

The computational cost of the DFT can be improved by exploiting some of its properties to factor the number of operations. One famous algorithm for a fast Fourier transform (FFT) was devised by Cooley and Tukey (1965). From that time, the FFT acronym has been widely misused instead of DFT. It should be noted that the DFT is the transform, while the FFT is just a fast implementation of the DFT, and there is no difference between the two but the computational cost. The computational savings of the FFT are obtained by noting that it can be written as:

$$X[k] = \sum_{m=0}^{N/2-1} x[2m]\, W_{N/2}^{km} + W_N^{k} \sum_{m=0}^{N/2-1} x[2m+1]\, W_{N/2}^{km} \tag{2.38}$$

thus reducing the DFT to two smaller DFTs of length $N/2$ (with $W_N = e^{-j 2\pi / N}$). The coefficients computed by these two DFTs are used as partial results for the computation of two of the final DFT bins. If $N/2$ is even, these two DFTs can be split in two smaller ones each. This procedure can be iterated. If N is a power of 2, the procedure can be repeated up to a last iteration where we have several DFTs of length 2, called butterflies. If, for example, $N = 2^M$, there are M stages and for each stage only N/2 butterflies need be computed. Since each butterfly has one complex product and two complex sums, we have in total $(N/2)\log_2 N$ complex products and $N \log_2 N$ complex sums, yielding an order of $N \log_2 N$ operations. To evaluate the advantage of the FFT, let us consider N = 1024 bins. The DFT requires approximately $10^6$ operations, while the FFT requires approximately $10^4$ operations: a saving of 100 times (i.e. two orders of magnitude)!

2.4.6 Short-Time Fourier Transform

For long signals, it is hard to compute a single DFT, even when using FFT algorithms that reduce the computational cost. A different approach consists of taking slices of the signal and analyzing them with a DFT. This is particularly useful for a different class of signals, those that are non-stationary. Those signals change their properties (e.g. their frequency content) with time, and thus slicing them into small pieces allows you to evaluate how the frequency content changes. The slices will be of the same length. Usually, before applying the DFT, we also apply windowing to the slice, which smooths the corners at the beginning and at the end of the slice. In this context, a windowed signal is a slice of the signal multiplied sample by sample with a window of the same length. The window may have a smooth transition at the beginning and end, and unitary amplitude in the middle (see Figure 2.11b), so that windowing applies a short fade-in and fade-out to the signal to be analyzed. It can be shown that multiplying a signal with a window alters its frequency content. Please note that even if you do not apply windowing explicitly, slicing a signal is still equivalent to multiplying the original signal by a rectangular window (Figure 2.11a). Thus, altering the signal is unavoidable in any case.[16]

Figure 2.11: A rectangular window (a) and a Hanning window (b).

The shape of the window determines its mathematical properties, and thus its effect on the windowed signal. Indeed, windows are formulated in order to maximize certain criteria. In general, there is no optimal window, and selection depends on the task.
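The slicing-and-windowing procedure can be sketched in Python as follows. The frame length, the hop size (here equal to the frame length, i.e. no overlap), the Hanning window formula and the test signal are all choices made for this example; `stft_magnitudes` is a hypothetical helper, not a library routine.

```python
import cmath
import math


def dft(x):
    """Plain O(N^2) DFT, for illustration only."""
    N = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
        for k in range(N)
    ]


def hanning(length):
    """Raised-cosine window: zero at both ends, one in the middle."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]


def stft_magnitudes(x, frame_len, hop):
    """Slice x into frames, window each frame, return one magnitude spectrum per frame."""
    window = hanning(frame_len)
    frames = []
    for start in range(0, len(x) - frame_len + 1, hop):
        windowed = [x[start + n] * window[n] for n in range(frame_len)]
        frames.append([abs(v) for v in dft(windowed)])
    return frames


L = 64
# Non-stationary test signal: a sine at bin 4, followed by a sine at bin 12.
x = ([math.sin(2 * math.pi * 4 * n / L) for n in range(256)]
     + [math.sin(2 * math.pi * 12 * n / L) for n in range(256)])

frames = stft_magnitudes(x, frame_len=L, hop=L)
# Strongest positive-frequency bin of each frame.
peak_bins = [max(range(L // 2), key=lambda k: f[k]) for f in frames]
```

A single DFT of the whole signal would show both sines at once; the frame-by-frame analysis instead reveals that the peak moves from bin 4 to bin 12 halfway through. The window also spreads each sine over the neighboring bins, which is the alteration of the frequency content mentioned above.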

To recap what we have discovered in this section:

   Any infinite-length discrete-time signal, or any finite-length signal that is too long to be practical to analyze in its entirety, can be windowed to obtain a finite-length signal, in order to apply the DFT. If the signal is non-stationary, the DFT will highlight the spectral information of the portion of signal we have windowed.
   The operation of windowing alters the content of the signal.

2.5 Once Again on LTI Systems: Filters

After discussing LTI systems and the frequency domain, we can describe filters. Filters are LTI systems that are designed to modify the frequency content of an input signal by cutting or boosting selected components. Common types of filters, defined by their frequency mask, are:

   Low-pass filters (LPFs). These ideally cancel all content above a so-called cutoff frequency. In reality, there is a transition band where the frequency components are progressively attenuated. The steepness of the filter in the transition band is also called roll-off. The roll-off may end, reaching a floor that is non-zero. This region, called stop-band, has a very high, but not infinite, attenuation. Synthesizer filters usually have a roll-off of 12 or 24 dB/oct, and generally have a small resonant bell at the cutoff frequency.
   High-pass filters (HPFs). These ideally cancel those frequency components below a cutoff frequency. They are thus considered the dual of low-pass filters. Also, HPFs have a roll-off and may have a resonance at the cutoff frequency.
   Band-pass filters (BPFs). These are filters that only select a certain frequency band by attenuating both the low- and the high-frequency content. They are defined by their bandwidth (or its inverse, the quality factor, Q) and their central frequency. They have two transition bands and two stop bands.
   Shelving filters. These are filters that cut or boost the content in the bass or treble range, and leave the rest unmodified. They are used in equalizers to treat the two extremes of the audible range. They are defined by the cutoff, the roll-off, and the gain.
   Peaking filters. These cut or boost a specific bandwidth, and leave the rest unmodified. They are defined by the bandwidth (or the quality factor, Q), the central frequency, and the gain. They are used in parametric equalizers.
   Comb filters. These filters apply a comb-like pattern that alters (cuts or boosts) the spectrum at periodic frequency intervals. They can have feedback (boost) or feedforward (cut) configurations.
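To give a feeling for the last item of the list, here is a minimal Python sketch of the two comb configurations. It assumes the usual textbook difference equations, which the text above does not spell out: a delayed copy of the input added to the input (feedforward), and a delayed copy of the output added to the input (feedback). The function names, the delay of 4 samples and the gain of 0.5 are arbitrary choices for the example.

```python
def feedforward_comb(x, delay, gain):
    """y[n] = x[n] + gain * x[n - delay] (assumed form, zero initial state)."""
    return [
        x[n] + gain * (x[n - delay] if n >= delay else 0.0)
        for n in range(len(x))
    ]


def feedback_comb(x, delay, gain):
    """y[n] = x[n] + gain * y[n - delay] (assumed form, zero initial state)."""
    y = []
    for n in range(len(x)):
        fed_back = y[n - delay] if n >= delay else 0.0
        y.append(x[n] + gain * fed_back)
    return y


impulse = [1.0] + [0.0] * 11
ff_response = feedforward_comb(impulse, delay=4, gain=0.5)  # one single echo
fb_response = feedback_comb(impulse, delay=4, gain=0.5)     # a decaying series of echoes
```

The feedforward comb has a finite impulse response (one echo), while the feedback comb keeps recirculating the signal, so its echoes decay geometrically and, with a gain below one in magnitude, never quite reach zero.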

Another class of filters, widely employed in DAWs, the so-called brickwall filters, are just low-pass or high-pass filters with a very steep roll-off. They are used in audio track mastering applications to remove any content above or below a certain cutoff frequency.

Most synthesizer filters are of the IIR type and have a gentle roll-off (Figure 2.12).

Figure 2.12: Different types of filters: (a) low-pass, band-pass, and high-pass, with pass band and stop band as indicated in the figure, and transition bands in between; (b) shelving and peaking filters; (c) notch and band-stop filters; (d) feedforward comb; and (e) feedback comb.

2.6 Special LTI Systems: Discrete-Time Differentiator and Integrator

TIP: This is the 101 for virtual analog modeling of electronic circuits. Read carefully if you want to get started on this topic.

Later it will be useful to obtain the discrete-time derivative of a signal. A bit of high school math will refresh the concept of derivative. The derivative of a curve is the slope of the curve at a given point. A line of the form

$$y = mx + q \tag{2.39}$$

has slope $m$ and offset q. A line never changes slope, so its derivative is $m$ for all values of x. With the line, we compute the slope as:

$$m = \frac{\Delta y}{\Delta x} \tag{2.40}$$

where the two quantities $\Delta x = x_1 - x_0$ and $\Delta y = y_1 - y_0$ are the differences between the x coordinates of any two points of the line and between their y coordinates. A line never changes its derivative, and thus any two points are good. For generic signals, the slope can change with time, and thus for each instant an approximation of the derivative is obtained using Equation 2.40 and considering two very close points. In the case of discrete-time signals, the choice of the points is pretty obvious:[17] there are no two closer points than two consecutive samples $x[n]$ and $x[n-1]$. The difference equation for the first-order backward differentiator is thus expressed as:

$$y[n] = \frac{x[n] - x[n-1]}{T_s} \tag{2.41}$$

The quantity $T_s$ is the time corresponding to one sample. When transferring a continuous-time problem to the discrete-time domain, therefore, $T_s = 1/F_s$. Otherwise, the term can be expressed in terms of samples, thus $T_s = 1$. Many DSP books adopt this rule, while physical modeling texts and numerical analysis books use the former to retain the relation between timescales in the two domains.

The frequency response of the ideal differentiator is a ramp rising by 6 dB per octave, meaning that for each doubling of the frequency there is a doubling of the output value. It also means that any constant term (e.g. an offset in the signal) is canceled because it lies at null frequency, as we would expect from a differentiator, which calculates only the difference between pairs of values. It must be said, for completeness, that the digital differentiator of Equation 2.41 slightly deviates at high frequencies from the behavior of an ideal differentiator. Nonetheless, it is a widely used approximation of the differentiation operator in the discrete-time domain. More elaborate differentiation schemes exist, but they are considerably more complex and bear additional issues, and thus they have very limited application to music signal processing.
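A short Python sketch of the backward differentiator follows. It assumes a sample before the start of the signal equal to zero and, by default, a time step of one sample; the `gain` function evaluates the magnitude response of the difference `x[n] - x[n-1]` at a given angular frequency (in radians per sample), and both names are ours.

```python
import cmath
import math


def backward_difference(x, T=1.0):
    """y[n] = (x[n] - x[n-1]) / T, assuming x[-1] = 0."""
    y, previous = [], 0.0
    for value in x:
        y.append((value - previous) / T)
        previous = value
    return y


def gain(omega):
    """Magnitude response of the one-sample difference at frequency omega."""
    return abs(1 - cmath.exp(-1j * omega))


# A line with slope 3 and offset 5: the offset disappears, the slope remains.
line = [3.0 * n + 5.0 for n in range(10)]
slope = backward_difference(line)

low_ratio = gain(0.02) / gain(0.01)              # one octave, low frequency
high_ratio = gain(math.pi) / gain(math.pi / 2)   # one octave, near Nyquist
```

After the first sample (which only reflects the assumed zero initial state), the output is constantly 3. At low frequencies the gain doubles for each octave (6 dB/oct), while in the top octave it grows by a factor of about 1.41 only, which is the high-frequency deviation from the ideal differentiator.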

A very different case is that of digital integrators, where several approximations exist and are selected depending on the use case. Let us first consider the integration operation. In the continuous-time domain, the integral is the area under the curve corresponding to a signal. In analog electronics, this is done by using an operational amplifier (op-amp) with a feedback capacitor, allowing you to accumulate the signal amplitude over time. Similarly, in the discrete-time domain, it is sufficient to indefinitely accumulate the value of the incoming signal. Similar to the digital differentiator, the digital integrator is just an approximation of the ideal integrator. Two extremely simple forms exist, the forward and the backward Euler (or rectangular) integrators, described by Equations 2.42 and 2.43:

$$y[n] = y[n-1] + T_s\, x[n-1] \tag{2.42}$$

$$y[n] = y[n-1] + T_s\, x[n] \tag{2.43}$$

The forward Euler integrator requires two memory elements (one for $y[n-1]$ and one for $x[n-1]$) but features otherwise similar characteristics. If you want proof that the difference equations (Equations 2.42 and 2.43) implement an integrator, consider this: integrating a curve, or a function, by definition, implies evaluating the underlying area. Our input samples tell us the shape of the curve at successive discrete points. We can take many rectangles that approximate the area between each two consecutive points of that curve. Figure 2.13, for example, shows two rectangles approximating the area under the curve (black line), of which we know three points. The area of these two rectangles slightly underestimates the real value of the area under the curve. Intuitively, by fitting this curve with more rectangles (i.e. reducing their width and increasing the number of points, that is, the sampling rate) the error reduces.

Figure 2.13: A discrete-time feedback integrator applied to a signal. The hatched area results from the integration process, while the remaining white area under the signal (solid bold line) is the approximation error. The shorter the time step, the lower the error.

A computer can thus calculate an approximation of the area by summing those rectangles, as follows: the distance between two consecutive samples is the sampling interval $T_s$. This gives us the width of each rectangle. We only have to decide the height of the rectangle. For the rectangle between samples n and n+1, we can take either x[n] or x[n+1], the former for the forward Euler integrator and the latter for the backward integrator. The integrator used in Figure 2.13 is the forward Euler: at time step n, the height of the newly added rectangle is taken from x[n−1], the earlier of the two samples that delimit it.

To obtain the area in real time, as the new samples come in, we don’t have to sum all the rectangles at each time step. We can just accumulate the area of one rectangle at a time and store the value in the variable $y[n]$. At the next step, we will add the area of a new rectangle to the cumulative value of all previous ones, and so forth.
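The running-sum idea can be sketched in Python as below. The two functions differ only in which sample gives the height of the newest rectangle; zero initial conditions and the test signal (a unit-slope ramp over one second) are assumptions of this example.

```python
def forward_euler(x, T):
    """Accumulate rectangles whose height is the previous input sample."""
    y, area, previous_x = [], 0.0, 0.0   # two memory elements: area, previous_x
    for value in x:
        area += T * previous_x
        y.append(area)
        previous_x = value
    return y


def backward_euler(x, T):
    """Accumulate rectangles whose height is the current input sample."""
    y, area = [], 0.0                    # one memory element: area
    for value in x:
        area += T * value
        y.append(area)
    return y


T = 0.01
ramp = [n * T for n in range(101)]       # x(t) = t for t in [0, 1]

area_forward = forward_euler(ramp, T)[-1]    # rectangles stay below the curve
area_backward = backward_euler(ramp, T)[-1]  # rectangles stick out above the curve
```

The exact area under the ramp is 0.5: the forward Euler integrator returns 0.495 and the backward one 0.505, so the two bracket the true value, and both errors shrink as the time step is reduced.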

Other integrators exist that have superior performance but higher computational cost. In this context, we are not interested in developing the theory further, but the reader can refer to Lindquist (1989) for further details.

2.7 Analog to Digital and Back

Discrete-time signals can be directly generated by computers, processed, and plotted on a screen. However, if we deal with music, we want to record them and listen to them!

These two operations are done through two different processes, the analog-to-digital and the digital-to-analog conversions, which are the inverse of each other. Before putting it in mathematical terms, let us provide some background and clarify the terms used in this section. The analog-to-digital conversion of a signal comprises two fundamental steps: sampling and quantization, in the correct order (Figure 2.14). The first step yields a discrete-time signal, while the second transforms it into a proper digital signal (i.e. a signal described with a discrete and finite set of numbers). This signal is said to have finite precision. If this process is not properly done, it leads to aliasing, an ugly beast that we shall discuss in Section 2.9.

Figure 2.14: The analog-to-digital conversion mechanism (A/D or ADC).

The step of sampling is conducted by taking quick snapshots of a signal at equally spaced time intervals. Let $x(t)$ be a continuous-time signal that has a finite bandwidth (i.e. its spectrum is zero, or at least almost inaudible, above a certain frequency $F_{\max}$). To sample this signal, it can theoretically be multiplied (modulated) by a train of Dirac pulses:

$$\delta_{T_s}(t) = \sum_{n=-\infty}^{\infty} \delta(t - nT_s) \tag{2.44}$$

where the time interval $T_s$ is the inverse of the sampling frequency (or sampling rate, or sample rate) $F_s$. The operation is thus:

$$x_s(t) = x(t) \cdot \delta_{T_s}(t) \tag{2.45}$$

The result of this operation is a signal that is zero everywhere except at the instants $nT_s$. These instantaneous values are stored, using a suitable method, in the vector $x[n]$.

Is there any particular choice for the sampling frequency $F_s$? Yes, there is! Let us take a look at $x_s(t)$ in the frequency domain. By evaluating the continuous-time Fourier transform of $x_s(t)$ and applying its properties, it can be shown that the sampled signal is periodic in the frequency domain, as shown in Figure 2.15. The product of $x(t)$ and $\delta_{T_s}(t)$ is equivalent to the convolution of their spectra in the frequency domain (per the convolution property seen in Section 2.4.3). The spectrum of $x(t)$ is thus replicated in both the positive and negative frequency axis, and each replica has a distance from the next that is equal to $F_s$. From this observation, we can deduce that there are values of $F_s$ that are too low and will make the replicas overlap. Is this a problem? Yes, it is! Not convinced? Let us examine what happens next when we convert back the discrete signal into a continuous-time signal.

Figure 2.15: The first stage of the sampling process consists of multiplying $x(t)$ by $\delta_{T_s}(t)$. The product of two signals in the time domain corresponds to the convolution between their Fourier transforms in the frequency domain. The figure shows the two magnitude spectra and the result of the frequency-domain convolution, consisting of replicas of the spectrum of $x(t)$.

This process is called reconstruction of the continuous-time signal. It creates a continuous-time signal by filling the gaps. First, a continuous time-domain signal is obtained from $x[n]$ by multiplying it with a train of Dirac pulses:

$$x_s(t) = \sum_{n=-\infty}^{\infty} x[n]\, \delta(t - nT_s) \tag{2.46}$$

yielding a signal that is zero except at the instants $nT_s$.

Then a low-pass filter is applied that interpolates between the gaps. Such an ideal low-pass filter is called a reconstruction filter. Intuitively, the purpose of this filter is to filter out all the replicas seen in Figure 2.15 and keep only the one centered around zero. To get confirmation of this idea, let us consider that in the time domain, being a low-pass filter, it reacts slowly to abrupt changes, and it will thus smooth the instantaneous peaks of $x_s(t)$ (remember that it is zero except at the instants $nT_s$) and connect them.

Ideally, this filter should have a flat frequency response up to the cutoff frequency, and then drop instantly to zero. What do you think is the best filter cutoff to separate the spectral replicas of Figure 2.15? Surely $F_s/2$ (i.e. the midpoint between two replicas). The shape of this filter is thus a rectangle with a corner at $F_s/2$. The filter is an LTI system, and thus the output of the filter will be the convolution between $x_s(t)$ and the filter impulse response $h_r(t)$:

$$x_r(t) = x_s(t) * h_r(t) \tag{2.47}$$

At this point, the filtered signal $x_r(t)$ is the reconstructed signal we were looking for, and will be exactly the same as the input signal $x(t)$. Perfect reconstruction is possible if (and only if) the replicas are at a distance $F_s$ that is large enough to allow the reconstruction filter to cancel all other replicas and their tails. If the replicas do overlap, the filter will let parts of the replicas in, causing aliasing.

What we have reported so far is the essence of a pillar theorem, the Nyquist sampling theorem, stating that the sampling rate must be chosen to be at least twice the bandwidth of the input signal in order to uniquely determine $x(t)$ from the samples $x[n]$. If this condition is true, then the sampling and reconstruction processes do not affect the signal at all.

In formal terms, the Nyquist theorem[18] states that if $x(t)$ is band-limited:

$$X(F) = 0 \quad \text{for } |F| \geq F_{\max}$$

then $x(t)$ is uniquely determined by its samples $x[n]$ if (and only if)

$$F_s \geq 2 F_{\max} \tag{2.48}$$

$F_s/2$ is also called the Nyquist frequency, and is considered the upper frequency limit for a signal sampled at a given $F_s$.

If Equation 2.48 is not respected, spurious content is added to the original signal that is usually perceived as distortion, or, properly, aliasing distortion.
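The effect can be seen with a few lines of Python. The sketch assumes a sampling rate of 1 kHz and a 900 Hz cosine, well above the 500 Hz Nyquist frequency; the numbers are arbitrary and only serve the example.

```python
import math

fs = 1000.0               # sampling rate, in Hz
f_high = 900.0            # violates the sampling condition (Nyquist is 500 Hz)
f_alias = fs - f_high     # 100 Hz: where the content folds back to

high = [math.cos(2 * math.pi * f_high * n / fs) for n in range(50)]
low = [math.cos(2 * math.pi * f_alias * n / fs) for n in range(50)]

largest_difference = max(abs(a - b) for a, b in zip(high, low))
```

The two sets of samples are identical up to rounding errors: once sampled, the 900 Hz cosine cannot be told apart from a 100 Hz one, and this is the tone the reconstruction stage will produce.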

A question may now arise: If we are not lucky and our signal is not band-limited, how do we deal with it? The only answer is: we band-limit it![19] Any real-world signal that needs to be recorded and sampled is first low-pass filtered with a so-called anti-aliasing filter, having cutoff at the Nyquist frequency, thus eliminating any content above it.

A final topic related to filtering: we have seen that the reconstruction filter has its cutoff at the Nyquist frequency, like the anti-aliasing filter, in order to guarantee perfect reconstruction.[20] The ideal reconstruction filter is a low-pass filter with cutoff frequency at the Nyquist frequency and vertical slope; in other words, it would look like a rectangle in the frequency domain. From Table 2.3, we observe that its inverse Fourier transform (i.e. the impulse response of this filter) is a symmetrical pulse called sinc that goes to zero at infinity, and thus has infinite length. Such an infinite impulse response is hard to obtain. In general, ideal low-pass filters can only be approximated (e.g. by truncating the impulse response up to a certain, finite, length). This makes the reconstruction process not perfect, but trade-offs can be drawn to obtain excellent acoustic results.
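As an illustration of the truncation trade-off, the Python sketch below estimates the value of a sampled sine halfway between two samples by summing sinc pulses centered on the neighboring samples. The truncation to 200 samples per side, the 440 Hz test tone and the helper names are assumptions of this example, and time is expressed in (fractional) sample indices for simplicity.

```python
import math


def sinc(t):
    """Normalized sinc: sin(pi t) / (pi t), with sinc(0) = 1."""
    return 1.0 if t == 0.0 else math.sin(math.pi * t) / (math.pi * t)


def reconstruct(samples, position, half_width):
    """Estimate the signal at a fractional sample index with a truncated sinc."""
    center = int(round(position))
    first = max(0, center - half_width)
    last = min(len(samples) - 1, center + half_width)
    return sum(samples[n] * sinc(position - n) for n in range(first, last + 1))


fs = 8000.0
f0 = 440.0
samples = [math.sin(2 * math.pi * f0 * n / fs) for n in range(2000)]

position = 1000.5                                 # halfway between two samples
estimate = reconstruct(samples, position, half_width=200)
exact = math.sin(2 * math.pi * f0 * position / fs)

on_sample = reconstruct(samples, 1000.0, half_width=200)
```

At a sample instant the sum returns the stored sample, since all the other sinc pulses cross zero there. Between samples the truncated sum is only an approximation: the residual error is small and shrinks as more terms are kept.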

Table 2.3: DFT of notable signals. Columns: Signal, Time Domain, Frequency Domain (the plots of the original table are not reproduced here). Rows:

   Dirac delta
   Cosine
   Rectangular pulse or window
   Pulse train

Note: The figures are only for didactical purposes. They do not consider the effect of sampling that replicates the shown spectra with period Fs. Aliasing will occur for non-band-limited signals.

2.8 Spectral Content of Typical Oscillator Waveforms

Prior to discussing aliasing, we need to discuss what the spectra of typical oscillator waveforms look like. We shall discuss the following waveforms, easily found on most synthesizers:

   sawtooth;
   triangular; and
   rectangular.

Of course, there are lots of variations thereof, but it is enough to understand the properties of these ones.

Sawtooth signals are usually obtained by a rising ramp that grows linearly and is reset at the end of the period using the modulo operation. Mathematically, it is described as:

$$x_{\text{saw}}(t) = 2\, \frac{t \bmod T}{T} - 1 \tag{2.49}$$

where $T$ is the period of the signal. The ramp is divided by T and doubled so that, once shifted by −1 to have zero mean, it stays in the range from −1 to 1.

The sawtooth spectrum has even and odd harmonics, and the amplitude of each harmonic depends on its frequency with a 1-over-f relation (i.e. with a spectral roll-off of 6 dB/oct). Such a rich spectrum is related to the abrupt jump from 1 to −1. It must be noted that a reversed ramp (i.e. falling) has similar properties.
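The 1-over-f behavior can be verified numerically with a small Python sketch. It builds one period of a rising ramp in the range from −1 to 1 with the modulo operation and evaluates only the first few DFT bins, which correspond to the first harmonics; the period of 1024 samples and the helper `dft_bin` are choices made for this example.

```python
import cmath
import math


def dft_bin(x, k):
    """Single DFT bin, i.e. the k-th harmonic when x holds exactly one period."""
    N = len(x)
    return sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))


N = 1024                                           # samples per period
saw = [2.0 * (n % N) / N - 1.0 for n in range(N)]  # rising ramp from -1 to 1

mags = [abs(dft_bin(saw, k)) for k in range(9)]    # index k is the k-th harmonic
drop_db = 20 * math.log10(mags[1] / mags[2])       # level drop over one octave
```

Both even and odd harmonics are present, the second harmonic has about half the amplitude of the first and the fourth about a quarter, and the drop over one octave is about 6 dB.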

Triangle waves are, on the other hand, composed of rising and falling ramps, and thus they have no discontinuity. The spectrum of a triangular wave is thus softer than that of a sawtooth wave. Its mathematical description can be derived from that of the sawtooth, by applying an absolute value to obtain the ascending and descending ramps:

$$\mathrm{triangle}(t) = 2\,\left|\mathrm{sawtooth}(t)\right| - 1 \qquad (2.50)$$

The triangle wave has only odd harmonics, and its decay is steeper than that of the sawtooth (12 dB/oct).
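As a minimal sketch of the two descriptions above, the functions below compute a naive (non-band-limited) sawtooth from the modulo operation and a triangle from the absolute value of that sawtooth. The function names are illustrative, and the scaling to the range from −1 to 1 is an assumption made for this example.

```cpp
#include <cmath>

// Naive sawtooth: a ramp rising from -1 to 1 over one period.
// t is the time and period is T, both in seconds (t >= 0, period > 0).
float naiveSawtooth(float t, float period) {
    float phase = std::fmod(t, period) / period;  // 0 <= phase < 1
    return 2.0f * phase - 1.0f;
}

// Naive triangle: the absolute value folds the ramp into a falling and a
// rising segment; scaling and shifting restore the range from -1 to 1.
float naiveTriangle(float t, float period) {
    return 2.0f * std::fabs(naiveSawtooth(t, period)) - 1.0f;
}
```

Both functions are evaluated at arbitrary time instants, which is exactly the kind of direct evaluation of a non-band-limited function that produces aliasing when used at audio rate.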

The triangle wave has a discontinuity in its first derivative. This can be observed at the end of the ramps: a descending ramp has a negative slope, and thus a negative derivative, while an ascending ramp has a positive slope, and thus a positive derivative. What does the triangle derivative look like? It looks like a square wave! Since the relation between a signal and its derivative, in the frequency domain, is an increase proportional to the frequency, we can easily conclude that the square wave has the same harmonics as the triangle (odd), but with a slower spectral decay (6 dB/oct). The square wave thus sounds brighter than the triangle wave. If we look at the time domain, we can justify the brighter timbre by observing that it has abrupt steps from 1 to −1 that were not present in the triangle wave. The square waveform can thus be described as:

(2.51)

where we approximated the time derivative with a backward difference as in Equation 2.41. At this point, an important distinction must be made between the square and rectangular waveforms. A square wave is a rectangular waveform with a 50% duty cycle. The duty cycle is the ratio between the time the wave is up and the period T, as shown in Figure 2.16. This alteration of the wave symmetry also affects the spectral content, and thus the timbre. Remember from Table 2.3 that the DFT of a rectangular window has its zeros at frequencies that depend on the length of the window. Considering that a square wave is a periodic rectangular window, it can be seen as the convolution of a train of Dirac pulses with period T and a rectangular window with length equal to half the period T. In the frequency domain, this is equal to the product of the spectra of the window and the pulse train. It can then be shown that the square wave has zeros at all even harmonics. However, if the duty cycle changes, the ratio between the period and the window length changes as well, shifting the zeros in frequency and allowing the even harmonics to rise. In this case, the convolution of the Dirac train and the rectangular window will result in lobes that may or may not cancel harmonics, but will certainly affect the amplitude of the harmonics, as seen in Figure 2.17. Going toward the limit, with the duty cycle getting close to zero, the result is a train of Dirac pulses, and thus the spectrum gets closer to that of a Dirac pulse train, with the first lobe extending to infinity, and thus all harmonics having the same amplitude. This, however, does not occur in practice.

Figure 2.16: Pulse width modulation (PWM) consists of the alteration of the duty cycle (i.e. the ratio between the time the wave is up and the period T).

Figure 2.17: A rectangular wave with duty cycle different from 50% has a spectrum that results from the convolution (in time) or the product (in frequency) of the window (a sinc function, dashed lines) and a train of Dirac pulses (producing the harmonics, solid lines).
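The dependence of the harmonic amplitudes on the duty cycle can be checked numerically. The sketch below uses the Fourier series of an ideal rectangular wave swinging between −1 and 1, whose k-th harmonic has amplitude 4|sin(πkd)|/(πk) for a duty cycle d. The function name is illustrative and the bipolar amplitude range is an assumption of this example.

```cpp
#include <cmath>

// Amplitude of the k-th harmonic (k >= 1) of an ideal rectangular wave
// swinging between -1 and 1, for a duty cycle strictly between 0 and 1.
double rectHarmonicAmplitude(int k, double dutyCycle) {
    const double pi = 3.14159265358979323846;
    return 4.0 / (pi * k) * std::fabs(std::sin(pi * k * dutyCycle));
}
```

With a duty cycle of 0.5 the sine term vanishes for every even k, which is the square-wave case. With a duty cycle of 0.25 the second harmonic is present, while every fourth harmonic is cancelled instead.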

Table 2.4: Typical oscillator waveforms and related spectra (panels: sawtooth waveform; triangle waveform; square waveform with 50% duty cycle).

It should be noted that practical oscillator designs often generate slight variations of the theoretical waveform. An example is provided by the Minimoog Voyager sawtooth that was studied in Pekonen et al. (2011), which shows it to be smoother than the ideal sawtooth wave. A lot of oscillator designs differ from the ideal waveforms described in this section, and a good virtual analog emulation should take this difference into consideration.

2.9 Understanding Aliasing

Since this book is mainly about generating signals inside the computer, we should sit and reflect on this point: If aliasing is only caused by an improper sampling process, why should we care about aliasing in a virtual modular software, where no signal is recorded or sampled from an audio card? Stop and think for a second before moving on.

The answer is: Yes, we still do care about aliasing! In this setting, we do care even more than a studio engineer that records audio signals by digital means. Sound engineers have anti-aliasing filters in their analog-to-digital converters that make any input signal band-limited before sampling. Unfortunately, this is not the case for us software synth freaks, because generating a discrete-time signal employing a mathematical non-band-limited function is equivalent to sampling that non-band-limited function without an anti-aliasing filter. In other words, generating a signal (e.g. an ideal sawtooth) in software is equivalent to taking the ideal, non-band-limited continuous-time sawtooth from Plato’s Hyperuranion and sampling it. There is no workaround to this, and no anti-aliasing filter can be imposed before this sampling takes place.

Aliasing is one of the most important issues in virtual synthesizers, together with computational cost, and often these two issues conflict with each other: to reduce aliasing, you have to increase the computational cost of the algorithm. The presence of aliasing affects the quality of the signal. Analog synthesizers were not subject to aliasing at all. One notable exception, again, is the Bucket-Brigade delay. BBD circuits are discrete-time systems, and thus subject to the Nyquist sampling theorem. As such, they can generate aliasing. But any other analog gear had no issues of this kind. With the advent of virtual analog in the 1990s, solutions had to be found to generate waveforms without aliasing on the available signal processors of the time. Nowadays, x86 and ARM processors allow a whole lot of flexibility for the developer and improved audio quality for the user.

Common oscillator waveforms (sawtooth, square, and triangle waves) are not band-limited, as they have a harmonic content that decays indefinitely at a rate of typically 6 or 12 dB/oct, and thus they extend well above the 20 kHz limit (ask your dog). This does not pose problems in a recording setting. Any analog-to-digital converter applies an anti-aliasing filter to suppress any possible component above the Nyquist frequency, thus limiting the bandwidth of the signal. Any digital audio content, therefore, if recorded and sampled properly, does not exhibit noticeable aliasing. Unfortunately, this is not the case for signals generated in software, because, as we said, generating a discrete-time signal employing a non-band-limited function is equivalent to sampling that non-band-limited function (i.e. freezing its theoretical behavior at discrete points in time, exactly as the sampling process does).

The outcome of an improper sampling process or of the discretization of a non-band-limited signal is the leak of spurious content into the audible range, which is undesired (unless you are producing dubstep, but that is another story). With periodic signals, such as a sawtooth wave sweeping up in frequency, the visible and audible effect of aliasing is that harmonics approaching the Nyquist frequency are mirrored and reflected back into the range between 0 Hz and the Nyquist frequency. As aliasing gets even more severe, other replicas of the spectrum enter this range, and the visible effect is the mirroring of (already mirrored) harmonics at the left bound of the range, i.e. at 0 Hz.
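The mirroring described above can be expressed as a small folding function. This is a sketch rather than a method from the text: the function name is hypothetical, and it only predicts where a single sinusoidal component lands after sampling without band-limiting.

```cpp
#include <cmath>

// Frequency (in Hz) at which a sinusoid of frequency f appears after being
// sampled at sampleRate without band-limiting. The result always lies
// between 0 Hz and the Nyquist frequency (sampleRate / 2).
double aliasedFrequency(double f, double sampleRate) {
    double r = std::fmod(std::fabs(f), sampleRate);      // fold into [0, Fs)
    return (r > sampleRate / 2.0) ? sampleRate - r : r;  // mirror at Nyquist
}
```

For instance, at a sampling rate of 48 kHz, the sixth harmonic of a 5 kHz sawtooth (30 kHz) is mirrored at the Nyquist frequency and lands at 18 kHz, while a component at 50 kHz is mirrored again at 0 Hz and lands at 2 kHz.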

After aliasing has occurred, there is generally no way to repair it, since the proper spectrum and the overlapping spectrum are embedded together.

Fortunately, the scientific literature is rich with methods to overcome aliasing in virtual synthesizers. In Sections 8.2 and 8.3, a couple of methods are presented to deal with aliasing in oscillators and waveshapers, while other methods are referenced in Section 11.1 for further reading.

Let us examine a signal in the frequency domain. In Section 2.7, we described the sampling process and discovered that it necessarily generates aliases of the original spectrum, but these can be separated by a low-pass filter. The necessary condition is that these are well separated. However, if the original spectrum extends over the Nyquist frequency, the aliases start overlapping with each other, as shown in Figure 2.18. Zooming in, and considering a periodic signal, as in Figure 2.19, with equally spaced harmonics, you can see that after the reconstruction filter, the harmonics of the right alias get in the way. This looks as if the harmonics of the original signal were mirrored. The aliasing components mirrored at the Nyquist frequency come, in reality, from the right alias, while those mirrored at 0 Hz come from the left alias.

Figure 2.18: Overlapping of aliases, occurring due to an insufficient sampling frequency for a given broadband signal. The ideal reconstruction filter is shown (dashed, from $-F_s/2$ to $F_s/2$).

Figure 2.19: The effect of aliasing on the spectrum of a periodic signal. The ideal spectrum of a continuous-time periodic signal is shown. The signal is not band-limited (e.g. a sawtooth). When the signal is sampled without band-limiting, the aliases of the original spectrum get in the way. The effect looks like a mirroring of components over the Nyquist frequency (highlighted with a triangle) and a further mirroring at 0 Hz (highlighted with a diamond), producing the zigzag (dashed line). Please note that this “mirroring” in reality comes from the right and left aliases.

Let us consider aliasing in the time domain and take a sinusoidal signal. If the frequency of the sine is much lower than the sampling rate (Figure 2.20a), each period of the sine is described by a large number of discrete samples. If the frequency of the sine gets as high as the Nyquist frequency (Figure 2.20b), we have only two samples to describe a period, but that is enough (under ideal circumstances). When, however, the frequency of the sine gets even higher, the sampling process is “deceived.” This is clear after the reconstruction stage (i.e. after the reconstruction filter tries to fill in the gaps between the samples). Intuitively, since the filter is a low-pass with cutoff at the Nyquist frequency, the only thing it can do is fill the gaps (or join the dots) to describe a sine at a frequency below Nyquist (Figure 2.20c). As you can see, when aliasing occurs, a signal is fooling the sampling process and is taken for another signal.

Figure 2.20: Effect of aliasing for sinusoidal signals. (a) A continuous-time sine of frequency well below the Nyquist frequency is shown to be sampled at periodic intervals, denoted by the vertical grid, with the sampled values highlighted by the circles. (b) A sine at exactly the Nyquist frequency is still reconstructed correctly with two points per period. (c) A continuous-time sine at a frequency above the Nyquist frequency is mistaken for a sine of lower frequency, showing the effect of aliasing. The dashed curve shows the reconstruction that is done after sampling, very similar to the signal in (a). In other words, the sine is “mirrored.”

2.10 Filters: The Practical Side

Common types of filters were discussed in Section 2.5. Filters are implemented in the discrete domain by unidirectional signal-flow graphs implementing the difference equation that yields the impulse response of the desired filter. A conceptual flow for a filter implementation is the following:

   Define a frequency response in the frequency domain using a mask.
   Find a suitable difference equation that implements the filter. These are available from textbooks.
   Find a suitable design strategy to compute the actual coefficients that yield the desired frequency response.
   Implement the filter as a flow graph.
   Translate the flow graph into code (e.g. C/C++).
   Test the filter (e.g. with a Dirac pulse) to verify that the Fourier transform of the impulse response is approximately the desired one.

In this book, we will not cover filter design techniques, but provide a couple of examples easily implemented in Rack, referring the reader to specific books. A very important filter topology is the biquad filter, or second-order section (SOS). This has application in equalization, modal synthesis, and in the implementation of several IIR filters of higher order, by cascading several SOS. The discrete-time difference equation of the biquad follows:

$$y[n] = b_0\,x[n] + b_1\,x[n-1] + b_2\,x[n-2] - a_1\,y[n-1] - a_2\,y[n-2] \qquad (2.52)$$

From its difference equation, it is clear that five coefficients should be computed (we are not covering here how) and four variables are required to store the previous two inputs and previous two outputs. The computational cost of this filter is five products and four sums. The difference equation can be translated into the signal-flow graph shown in Figure 2.21a. This is a realization of the filter that is called direct form 1 (DF1) because it directly implements the difference equation (Equation 2.52).

Figure 2.21: Signal-flow graph of a biquad filter in its direct form 1 (DF1) realization (a) and in its direct form 2 (DF2) realization (b).

Without going into deep detail, we shall mention here that other implementations of the same difference equation exist. The direct form 2 (DF2), for example, obtains the same difference equation but saves two memory storage locations, as shown by Figure 2.21b. In other words, it is equivalent but cheaper. Other realizations exist that obtain the same frequency response but have other pros and cons regarding their numerical stability and quantization noise.

It should be clear to the reader that the same difference equation yields different kinds of filters, depending on the coefficients’ design. As a consequence, you can develop code for a generic second-order filter and change its response (low-pass, high-pass, etc.) or cutoff frequency just by changing its coefficients. If we want to do this in C++, we can thus define a class similar to the following:

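       class myFilter {
       private:
           float b0 = 1.f, b1 = 0.f, b2 = 0.f, a1 = 0.f, a2 = 0.f; // coefficients (pass-through until set)
           float xMem1 = 0.f, xMem2 = 0.f, yMem1 = 0.f, yMem2 = 0.f; // memory elements (previous inputs and outputs)

       public:
           void setCoefficients(int type, float cutoff);
           float process(float in);
       };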

In this class skeleton, we have defined a public function, setCoefficients, that can be called to compute the coefficients according to some design strategy (e.g. by giving a type and a cutoff). The coefficients cannot be directly modified by other classes. This protects them from unwanted errors or bugs that may break the filter integrity (even small changes to the coefficients can cause instability). The process function is called to process an input sample and returns an output sample. An implementation of the DF1 SOS would be:


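       // DF1 realization of the second-order section
       float myFilter::process(float in) {
           float sums;
           float out;

           sums = b2 * xMem2 - a2 * yMem2;
           sums += b1 * xMem1 - a1 * yMem1;
           out = b0 * in + sums;
           xMem2 = xMem1; // propagate the first memory elements to the second ones
           yMem2 = yMem1;
           xMem1 = in;    // store the current input and output
           yMem1 = out;
           return out;
       }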


As you can see, the input and output values are stored in the first memory elements, while the first memory elements are propagated to the second memory elements.
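The text leaves the coefficient design open, so the following is only one possible way to fill that gap and to carry out the impulse test from the workflow above. It assumes the low-pass formulas of the widely used Audio EQ Cookbook by R. Bristow-Johnson, which is not the design strategy of this text. The names BiquadCoeffs, designLowpass and impulseResponse are hypothetical.

```cpp
#include <cmath>
#include <vector>

// Illustrative coefficient set for a second-order section, normalized so
// that the leading feedback coefficient is 1.
struct BiquadCoeffs {
    float b0, b1, b2, a1, a2;
};

// Low-pass design following the Audio EQ Cookbook formulas (an assumption
// of this sketch). q sets the resonance; about 0.707 gives a flat response.
BiquadCoeffs designLowpass(float cutoffHz, float sampleRate, float q) {
    const double pi = 3.14159265358979323846;
    const double w0 = 2.0 * pi * cutoffHz / sampleRate;
    const double alpha = std::sin(w0) / (2.0 * q);
    const double cosw0 = std::cos(w0);
    const double a0 = 1.0 + alpha;  // normalization term
    BiquadCoeffs c;
    c.b0 = static_cast<float>((1.0 - cosw0) / 2.0 / a0);
    c.b1 = static_cast<float>((1.0 - cosw0) / a0);
    c.b2 = c.b0;
    c.a1 = static_cast<float>(-2.0 * cosw0 / a0);
    c.a2 = static_cast<float>((1.0 - alpha) / a0);
    return c;
}

// Feeds a unit impulse through the DF1 difference equation and returns the
// first numSamples values of the impulse response.
std::vector<float> impulseResponse(const BiquadCoeffs& c, int numSamples) {
    std::vector<float> h(numSamples);
    float x1 = 0.f, x2 = 0.f, y1 = 0.f, y2 = 0.f;
    for (int n = 0; n < numSamples; ++n) {
        float in = (n == 0) ? 1.f : 0.f;
        float out = c.b0 * in + c.b1 * x1 + c.b2 * x2 - c.a1 * y1 - c.a2 * y2;
        x2 = x1; x1 = in;   // update input history
        y2 = y1; y1 = out;  // update output history
        h[n] = out;
    }
    return h;
}
```

A quick sanity check for a low-pass filter is that the samples of its impulse response sum to approximately 1, which is the gain at 0 Hz. A full verification would take the Fourier transform of the returned response and compare it with the desired mask.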

Please note that, usually, audio processing is done with buffers to reduce the overhead of calling the process function once for each sample. A more common process function would thus take an input buffer pointer, an output buffer pointer, and the length of the buffers:


       void process(float *in, float *out, int length);

As we shall see, this is not the case with VCV Rack, which works sample by sample. The reasons behind this will be clarified in Section 4.1.1, and it greatly simplifies the task of coding.

2.11 Nonlinear Processing

Although the theory developed up to this point is mainly devoted to linear systems, some examples of nonlinear processing are provided due to their relevance in the audio processing field.

2.11.1 Waveshaping

Waveshaping is the process of modifying the appearance of a wave in the time domain in order to alter its spectral content. This is a nonlinear process, and as such it generates novel partials in the signal, hence the timbre modification. The effect of the nonlinearity is evaluated by using a pure sine as input and evaluating the number and level of new partials added to the output. The two most important waveshaping effects are distortion effects and wavefolding, or foldback.

For foldback, we have an entire section that discusses its implementation in Rack (see Section 8.3). We shall now spend a few words on the basics of waveshaping and distortion. When waveshaping is defined as a static nonlinear function, the signal is supposed to enter the nonlinear function and read the corresponding value. If x is the value of the input signal at a given time instant, the output of the nonlinear function is just $y = f(x)$. Figure 2.22 shows this with a generic example.

Figure 2.22: A static nonlinear function affecting an input signal. As the term “waveshaper” implies, the output wave is modified in its shape by the nonlinear function. It should be clear to the reader that changing the amplitude of the input signal has drastic effects. In this case, a gain factor that reduces the input signal amplitude below the knees of $f(x)$ will leave the output wave unaltered.

As you can see, the waveshaping (i.e. the distortion applied to the signal) simply follows the application of the mapping from x to y. In Figure 2.23, we show how a sine wave is affected by a rectifier. A rectifier is a circuit that computes the absolute value of the input, and it can be implemented by diodes in the hardware realm. The input sine “enters” the nonlinear function, and for each input value we compute the output value by visually inspecting the nonlinear function. As an example, a negative input value is mapped to the positive output value of the same magnitude, and similarly for other selected points.

Figure 2.23: Rectification of a sine wave by computing the function $y = |x|$ (absolute value).

Since the wave shape is very important in determining the signal timbre, any factor affecting the shape is important. As an example, adding a small offset to the input signal, or inversely to the nonlinear function, affects the timbre of the signal. Let us take the rectifier and add an offset. This affects the input sine wave, as shown in Figure 2.24. The nonlinear function is now $y = |x + a|$, where $a$ is the offset. You can easily show yourself, by sketching on paper, that the output is equivalent to adding an offset to the sine input and using the previous nonlinearity $y = |x|$.

Figure 2.24: Rectification of a sine wave by computing the function $y = |x + a|$. The same result is obtained by offsetting the input waveform by $a$ before feeding it to the nonlinearity.

A static nonlinearity is said to be memoryless. A system that is composed of one or more static nonlinear functions and one or more linear systems (filters) is said to be nonlinear and dynamical, since the memory elements (in the discrete-time domain, the filter states) make the behavior of the system dependent not only on the input, but also on the previous values (its history). We are not going to deal with these systems due to the complexity of the topic. The reader should know, however, that there are several ways to model these systems. One approach is to use a series of functions (e.g. Volterra series) (Schetzen, 1980).21 Another one is to use the so-called Hammerstein, Wiener, or Hammerstein-Wiener models. These models are composed of the cascade of a linear filter and a nonlinear function (Wiener model), or a nonlinear function and a linear filter (Hammerstein model), or the cascade of a filter, a nonlinearity, and another filter (Hammerstein-Wiener model) (Narendra and Gallman, 1966; Wiener, 1942; Oliver, 2001). Clearly, the position of the linear filters is crucial when there is a nonlinearity in the cascade. Two linear filters can be put in any order thanks to the commutative property of linear systems, but when there is a nonlinear component in the cascade this does not hold true anymore. There are many techniques to estimate the parameters of both the linear parts and the nonlinear function, in order to match the behavior of a desired nonlinear system (Wills et al., 2013); these, however, require some expertise in the field.

Back to our static nonlinearities, let us discuss distortion. Distortion implies saturating a signal when it gets close to some threshold value. The simplest way to do this is clipping the signal to the threshold value when this is reached or surpassed. The clipping function is thus defined as:

$$y = \begin{cases} \theta, & x \ge \theta \\ x, & -\theta < x < \theta \\ -\theta, & x \le -\theta \end{cases} \qquad (2.53)$$

where $\theta > 0$ is a threshold value. You surely have heard of clipping in a digital signal path, such as the one in a digital audio workstation (DAW). Undesired clipping happens in a DAW (or in any other digital path) when the audio level is so high that the available digits are insufficient to represent values beyond a certain threshold. This happens easily with integer representations, while floating-point representations are much less prone to this issue.22 Clipping has a very harsh sound and most of the time it is undesired. Nicer forms of distortion are still S-shaped, but with smooth corners. An example is a soft saturation nonlinearity of the form:

(2.54)

Another saturating nonlinearity is:

(2.55)

Other distortion functions may be described by polynomials such as the sum of the terms:

(2.56)

This equation generates a signal with a second harmonic whose amplitude is set by the coefficient of the quadratic term, a third harmonic whose amplitude is set by the coefficient of the cubic term, and so forth, up to the nth harmonic. It is important to note that even harmonics and odd harmonics sound radically different to the human ear. The presence of even harmonics is generally considered to add warmth to the sound. A nonlinear function that is even only introduces even harmonics:

(2.57)

An odd function only introduces odd harmonics:

(2.58)
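The claim about even and odd functions can be verified numerically by passing one period of a sine through a nonlinearity and measuring single harmonics. The sketch below does this with a one-bin DFT. The function name and the 1024-point resolution are arbitrary choices for this example, and x² and x³ are used only as the simplest even and odd functions.

```cpp
#include <cmath>
#include <functional>

// Amplitude of the k-th harmonic produced when one period of a
// unit-amplitude sine is passed through the static nonlinearity f.
// Valid for 0 < k < N / 2, with N the number of points per period.
double harmonicAmplitude(const std::function<double(double)>& f, int k) {
    const double pi = 3.14159265358979323846;
    const int N = 1024;
    double re = 0.0, im = 0.0;
    for (int n = 0; n < N; ++n) {
        const double y = f(std::sin(2.0 * pi * n / N));  // shaped sample
        re += y * std::cos(2.0 * pi * k * n / N);        // correlate with the
        im -= y * std::sin(2.0 * pi * k * n / N);        // k-th harmonic
    }
    return 2.0 * std::sqrt(re * re + im * im) / N;
}
```

Squaring the sine yields a component at twice the input frequency and nothing at the first or third harmonic (plus a constant offset), while cubing it yields only the first and third harmonics.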

There are a number of well-known distortion functions, and many are described by functions that are hard to compute in real time. In such cases, lookup tables (LUTs) may be employed, which help to reduce the cost. This technique is described in Section 10.2.1.

An important parameter to modify the timbre in distortion effects is the input gain, or drive. This gain greatly affects the timbre. Consider the clipping function of Equation 2.53, shown in Figure 2.25a. When the input level is low, the output is exactly the same as the input. However, by raising the input gain, the signal reaches the threshold and gets clipped, resulting in the clipping distortion. Other distortion functions behave similarly, with just a smoother introduction of the distortion effect. The smaller the signal, the more linear the behavior (indeed, the function in Figure 2.25b is almost a line in the surroundings of the origin). Again, if we put the drive gain after the nonlinear function, its effect is a mere amplification and it does not affect the timbre of the signal that comes out of the nonlinear function.
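A minimal sketch of the drive stage follows, with the gain placed before a hard clipper and before a hyperbolic tangent. The function names are illustrative, and the threshold is assumed to be positive and symmetric around zero.

```cpp
#include <algorithm>
#include <cmath>

// Hard clipping to the range [-threshold, threshold] (threshold > 0).
float hardClip(float x, float threshold) {
    return std::max(-threshold, std::min(threshold, x));
}

// The drive is applied before the nonlinearity, so it decides how far the
// signal is pushed into the clipped region.
float driveThenClip(float x, float drive, float threshold) {
    return hardClip(drive * x, threshold);
}

// Soft saturation through the hyperbolic tangent, again with the gain
// applied before the nonlinearity.
float driveThenTanh(float x, float drive) {
    return std::tanh(drive * x);
}
```

With a drive of 1 and an input of 0.5, the clipper with unit threshold returns the input unchanged. With a drive of 4 the same input is clipped to 1. For very small inputs the hyperbolic tangent returns almost exactly the input, which is the near-linear region around the origin.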

Figure 2.25: Several distortion curves. The clipping function of Equation 2.53 (a), the hyperbolic tangent function (b), and the saturation of Equation 2.55 (c).

2.11.2 Amplitude and Ring Modulation

Two common forms of nonlinear processing used in the heyday of analog signal processing are the amplitude and ring modulation techniques. While amplitude modulation (AM) rings a bell to most people because it was the first transmission technology for radio broadcast, and is still used in some countries, the name ring modulation (RM) is of interest only to synthesizer fans. In essence, they are very similar, and we shall shortly discover why.

Amplitude modulation as a generic term is the process of imposing the amplitude of a modulating signal onto a carrier signal. Mathematically, this is simply obtained by multiplying the two. With electronic circuits, this is much more difficult to obtain. The multiplication operation in the analog domain is achieved by a so-called mixer.23 One way to obtain amplitude modulation is to use diode circuits, in particular a diode ring, hence the name ring modulation. The use of diode ring circuits allowed for a form of amplitude modulation that is often called balanced, and it differs from the unbalanced amplitude modulation for the absence of the carrier in the output. To be more rigorous, the diode ring implements a double-sideband suppressed-carrier AM transmission (DSB-SC AM), while conventional AM contains the carrier (DSB AM). We say “conventional” with reference to DSB AM because it has been widely adopted for radio and TV broadcast. The reason for that is the presence of the carrier frequency in the modulated signal spectrum, which helps the receiver to recover the modulating signal. This allows you to design extremely simple circuits for the receiver. By the way, the envelope follower of Section 7.3 models a simple diode-capacitor circuit to recover the envelope from a modulated signal that is also perfectly suited to demodulate DSB AM signals.

Let us now introduce some math to discuss the spectrum of the modulated signal. The carrier signal $c(t)$ is by design a high-frequency sinusoidal signal. In radio broadcast, it has a frequency of the order of tens or hundreds of kHz, while in synthesizer applications it is in the audible range. The modulating signal $m(t)$ can be anything from the human voice to music, and so on.

The DSB-SC, aka RM, simply multiplies the two signals:

$$y(t) = m(t)\,c(t) \qquad (2.59)$$

Such a process is also called heterodyning.24 The result in the frequency domain is the shift of the original spectrum, considering the positive and negative frequencies, as shown in Figure 2.26. It can be easily noticed that the modulation of the carrier wave with itself yields a signal at double the frequency and a signal at DC, since $\cos^2(\omega t) = \tfrac{1}{2} + \tfrac{1}{2}\cos(2\omega t)$. This trick has been used by several effect units to generate an octaver effect. If the input signal, however, is not a pure sine, many more partials will be generated.
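The octave-doubling trick can be reproduced with a plain multiplication. The sketch below ring-modulates two sinusoids sample by sample. The function name and parameters are illustrative, and sines are used instead of cosines, which only changes the sign of the double-frequency term.

```cpp
#include <cmath>
#include <vector>

// Ring modulation (DSB-SC AM) of two sinusoids. Returns numSamples samples
// of sin(2*pi*fm*t) * sin(2*pi*fc*t), with t = n / fs.
std::vector<double> ringModulate(double fm, double fc, double fs,
                                 int numSamples) {
    const double pi = 3.14159265358979323846;
    std::vector<double> y(numSamples);
    for (int n = 0; n < numSamples; ++n) {
        const double t = n / fs;
        y[n] = std::sin(2.0 * pi * fm * t) * std::sin(2.0 * pi * fc * t);
    }
    return y;
}
```

When both frequencies are equal, the output is a constant of 0.5 plus a sinusoid at twice the frequency with amplitude 0.5. With two different frequencies, the output contains only their sum and difference.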

Figure 2.26: The continuous-time Fourier transforms of the modulating and carrier signals, and their product, producing a DSB-SC amplitude modulation, also called ring modulation.

The conventional AM differs, as mentioned above, in the presence of the carrier. This is introduced as follows:

(2.60)

where, besides the carrier, we introduced the modulation index to weight the modulating signal differently than the carrier. The result is shown in Figure 2.27.

Figure 2.27: Fourier transform of the conventional AM scheme. In this case, the modulating signal is depicted as a band-pass signal to emphasize the presence of the carrier in the output spectrum.

It must be noted that DSB-SC AM is better suited to musical applications as the presence of the carrier signal may not be aesthetically pleasing. From now on, when referring to amplitude modulation, we shall always refer to DSB-SC AM.

In subtractive synthesis, it is common to shape the temporal envelope of a signal with an envelope produced by an envelope generator. This is obtained by a VCA module. In mathematical terms, the VCA is a simple mixer25 as in ring modulation, with the notable difference that envelope signals should never be negative (or should be clipped to zero when they go below zero). When, instead of an EG, an LFO with sub-20 Hz frequency is used, the result is a tremolo effect. In this case, the processing does not give rise to new frequency components. The modulating signal is slowly time-varying and can be considered locally26 linear.

2.11.3 Frequency Modulation

Another form of radio and TV broadcast technique is frequency modulation (FM). This signal processing technique is also well known to musicians since the technique has been widely exploited on synthesizers, keyboards, and computer sound cards for decades.

The first studies of FM in sound synthesis were conducted by John Chowning, composer and professor at Stanford University. In the 1960s, he started studying frequency modulation for sound synthesis with the rationale that such a simple modulation process between signals in the audible range produces complex and inharmonic27 timbres, useful to emulate metallic tones such as those of bells, brasses, and so on (Chowning, 1973). He was a percussionist, and was indeed interested in the applications of this technique to his compositions. Among these, Stria (1977), commissioned by IRCAM, Paris, is one of the most recognized. At the time, FM synthesis was implemented in digital computers.28

In the 1970s, Japanese company Yamaha visited Chowning at Stanford University and started developing the technology for a digital synthesizer based on FM synthesis. The technology was patented by Chowning in 1974 and rights were sold to Yamaha. In the 1980s, the technology was mature enough, and in 1983 the company started selling the most successful FM synthesizer in history, the DX7, with more than 200,000 units sold until production ceased in 1989. Many competing companies wanted to produce an FM synthesizer at the time, but the technology was patented. However, it can be shown that the phase of a signal can be modulated, obtaining similar effects. Phase modulation (PM) was thus used by other companies, leading to identical results, as we shall see later.

The FM technology per se is quite simple and computationally efficient. It consists of modifying the frequency of a carrier signal, say a cosine, by a variable term, that is, the instantaneous value taken by a modulating signal. If the carrier wave has a constant frequency $f_c$, after applying frequency modulation we obtain an instantaneous frequency:

$$f(t) = f_c + k\,m(t) \qquad (2.61)$$

where $m(t)$ is the modulating signal and k is a constant that determines the amount of frequency skew. At this point, it is important to note that there is a strong relationship between the instantaneous phase and frequency of a periodic signal: the instantaneous frequency is the time derivative of the phase, divided by $2\pi$. Thus, Equation 2.61 can be rewritten as:

$$\frac{1}{2\pi}\,\frac{d\theta(t)}{dt} = f_c + k\,m(t) \qquad (2.62)$$

where $\theta(t)$ is the instantaneous phase of the signal. We can thus consider changing the phase of the cosine, instead of its frequency, to obtain the same effect given by Equation 2.61, leading to the following equation:

$$y(t) = \cos\left(2\pi f_c t + \varphi(t)\right) \qquad (2.63)$$

where the phase term $\varphi(t)$ is a modulating signal. If it assumes a constant value, the carrier wave is a simple cosine wave and there is no modulation. However, if the phase term is time-varying, we obtain phase modulation (PM). It is thus very important to note that FM and PM are essentially the same thing, and from Equation 2.62 we can state that frequency modulation is the same as a phase modulation where the modulating signal has been first integrated. Similarly, phase modulation is equivalent to frequency modulation where the input signal has been first differentiated.

Let us now consider the outcome of frequency or phase modulation. For simplicity, we shall use the phase modulation notation. Let the modulating signal be

. The result is a phase modulation of the form:

(2.64)

Considering that the sine is the integral of a cosine, the result of Equation 2.64 is also the same as a frequency modulation with

as the modulating signal.

In the frequency domain, it is not very easy to determine how the spectrum will look. The three impacting factors are . When the modulation index is low (i.e.

), the result is approximately:

(2.65)

which states, in other words, that the result is similar to an amplitude modulation with the presence of the carrier (cosine) and the amplitude modulation between the carrier (sine) and the modulating signal. This is called narrowband FM. With increasing modulation index, the bandwidth increases, as does the complexity of the timbre. One rule of thumb is Carson’s rule, which states that approximately 98% of the total energy is spread in the band:

(2.66)

where W is the bandwidth of the modulating signal. This rule is useful for broadcasting, to determine the channel spacing, but does not say much about the actual spectrum of the resulting signal, an important parameter to determine its sonic result. Chowning studied the relations between the three factors above, and particularly the ratio $f_c / f_m$ between the carrier and the modulating frequency, showing that if the ratio is a rational number,29 the spectrum is harmonic and the fundamental frequency can be computed as $f_0 = f_c / N_1 = f_m / N_2$ if the ratio can be normalized to a ratio $N_1 / N_2$ of integers with no common factors.

Equation 2.63 describes what is generally called an operator,30 and may or may not have a modulating input. Operators can be summed, modulated (one operator acts as input to a second operator), or fed back (one operator feeds its input with its output), adding complexity to the sound. Furthermore, operators may not always be sinusoidal. For this reason, sound design is not easy on such a synthesizer. The celebrated DX7 timbres, for instance, were obtained in a trial-and-error fashion, and FM synthesizers are not meant for timbre manipulation during a performance as subtractive synthesizers are.
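The two rules of thumb mentioned above can be sketched in a few lines. This assumes Carson's rule in the usual form $B \approx 2(I + 1)W$ and integer carrier and modulating frequencies, in which case the fundamental of the harmonic spectrum is their greatest common divisor. The function names are illustrative.

```cpp
// Carson's rule: approximate bandwidth (Hz) for modulation index `index`
// and modulating-signal bandwidth `w` (Hz). Assumes B = 2 (I + 1) W.
inline double carsonBandwidth(double index, double w) {
    return 2.0 * (index + 1.0) * w;
}

// Fundamental of a harmonic FM spectrum for positive integer carrier and
// modulating frequencies (Hz). With fc/fm = N1/N2 in lowest terms,
// f0 = fc/N1 = fm/N2, which is the greatest common divisor of fc and fm
// (computed here with Euclid's algorithm).
inline long fmFundamental(long fc, long fm) {
    while (fm != 0) {
        const long r = fc % fm;
        fc = fm;
        fm = r;
    }
    return fc;
}
```

For example, a 300 Hz carrier modulated at 200 Hz has the ratio 3/2, so the fundamental is 300/3 = 200/2 = 100 Hz.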

To conclude the section, it must be noted that there are two popular forms of frequency modulation. The one described up to this point is dubbed linear FM. Many traditional synthesizers, however, exploit exponential FM as a side effect (it comes for free) of their oscillator design. Let us consider a voltage-controlled oscillator (VCO) that uses the V/oct paradigm. Its pitch is determined as:

$$f = f_{ref} \cdot 2^{v_{in}} \qquad (2.67)$$

where the reference frequency $f_{ref}$ may be the frequency of a given base note, and the input voltage $v_{in}$ is the CV coming from the keyboard. If the keyboard sends, for example, a voltage value of 2 V, then the obtained tone has $f = 2^2 f_{ref} = 4 f_{ref}$, two octaves higher than the reference frequency. If, however, we connect a sinusoidal oscillator to the input of the VCO, we modulate the frequency of the VCO with an exponential behavior. The modulation index can be implemented as a gain at the input of the VCO. Please note that if the input signal is a low-frequency signal (e.g. an LFO signal), the result is a vibrato effect.
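The V/oct law and the exponential modulation it produces can be sketched as follows, assuming the pitch law $f = f_{ref} \cdot 2^{v_{in}}$. The names `voltPerOctave` and `expFmFrequency`, and the choice of a sine as the modulating oscillator, are illustrative.

```cpp
#include <cmath>

// V/oct pitch law: every additional volt doubles the frequency.
inline double voltPerOctave(double fRef, double volts) {
    return fRef * std::exp2(volts);
}

// Exponential FM: a sinusoidal modulator, scaled by a gain that acts as the
// modulation index, is added to the pitch CV before the exponential law.
// fMod is the modulator frequency (Hz), t the time in seconds.
inline double expFmFrequency(double fRef, double cv, double gain,
                             double fMod, double t) {
    const double twoPi = 6.283185307179586;
    return voltPerOctave(fRef, cv + gain * std::sin(twoPi * fMod * t));
}
```

With a 110 Hz reference, a CV of 2 V gives 440 Hz. With `fMod` of a few hertz, the returned frequency wobbles around that value, which is the vibrato case.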

Exponential FM has some drawbacks with respect to linear FM, such as dependence of the pitch on the modulation index, and is not really used for the creation of synthesis engines. It is rather left as a modulation option on subtractive synthesizers. Indeed, exponential FM is used mostly with an LFO. Being slowly time-varying, frequency modulation with a sub-20 Hz signal can be considered locally linear,31 and there are no novel frequency components in the output, but rather a vibrato effect.

2.12 Random Signals

TIP: Random signals are key elements in modular synthesizers: white noise is a random signal, and is widely used in sound design practice. Generative music is also an important topic and relies on random, probabilistic, pseudo-periodic generation of note events. In this section, we shall introduce some intuitive theory related to random signals, avoiding probability and statistical theoretical frameworks, fascinating but complex.

First of all, what do we mean by random signals?

In general, a discrete-time random sequence is a sequence with samples generated from a stochastic process (i.e. a process that for every time index $n$ outputs a value which cannot be predicted).32 As an example, the tossing of a coin at regular time intervals is a stochastic process that produces random binary values. Measuring the impedance of a number of loudspeakers taken from the end of line of a production facility yields random impedance values that slightly deviate from the expected value. Both processes are stochastic and generate random values. However, the first has a binary outcome and is related to time, while the second produces real values and is not inherently related to time (we suppose all loudspeakers are measured at once).

Important stochastic processes are those related to the Brownian motion of particles, because they play an important role in the generation of noise in electronic circuits or in radio communication, and – as we have discussed in our historical perspective in Chapter 1 – these are the fields that have led to the development of the signal and circuit theory that is behind synthesizers.

Since current is given by the flow of electrons and these are always randomly moving from atom to atom,33 there will always be current fluctuations in circuits that produce noise, even if they are not connected to a power supply. When no power supply is connected, the electrons move randomly toward one end of the conductor with the same probability as to the opposite end, and thus in the long run the current is always zero. However, at times when, by chance, more electrons are moving toward one side rather than the other, there will be a slight current flow. At a later moment, those electrons will change their mind and many will turn back, resulting in a current flow of opposite sign. As we said, averaging over a certain time frame, the mean current is zero; however, looking at a short timescale (and with a sensitive instrument), the current has a lot of random fluctuations that can be described statistically. The study of thermal noise is important in electronic and communication circuits design. In audio engineering, for example, amplifiers will amplify thermal noise if not properly designed and cascaded, resulting in poor sound quality. The properties of the noise are important as well, especially for those kinds of noise that do not have equal amplitude over the frequency range.

Back to the theory: if we consider the evolution of the random value in time, we obtain a random signal. With analog circuits, there are tons of different stochastic processes that we can observe to extract a random signal. In a computing environment, however, we cannot toss coins, and it is not convenient at all to dedicate an analog-to-digital converter to sample random noise just to generate a random signal. We have specific algorithms instead that we can run on a computer, each one with its own statistical properties, and each following a different distribution.

The distribution of a process is, roughly speaking, the number of times each value appears if we make the process last for an infinite amount of time. Each value thus has a frequency of occurrence (not to be confused with the temporal or angular frequency discussed in Section 2.4). In this context, the frequency of occurrence is interpreted as the number of occurrences of that value in a time frame.

Let us consider an 8-bit unsigned integer variable (unsigned char). This takes values from 0 to 255. Let us create a sequence of ten values generated by a random number generator that I will disclose later. We can visualize the distribution of the signal by counting how many times each of the 256 values appears. This is called a histogram. The ten-value sequence and its histogram are shown in Figure 2.28a. As you can see, the values span the whole range 0–255, with unpredictable changes. Since we have a very small number of generated values, most values have zero occurrences, some have one occurrence, and one value has two occurrences, but this is just by chance. If we repeat the generation of numbers, we get a totally different picture (Figure 2.28b). These are, in technical terms, two realizations of the same stochastic process. Even though the two sequences and their histograms are not similar at all, they have been generated using exactly the same algorithm. Why do they look different? Because we did not perform the experiment properly. We need to perform it on a larger run. Figure 2.29 shows two different runs of the same experiment with 10,000 generated values. As you can see, all values tend to be equally frequent, with random fluctuations from value to value. As the number of generated values increases, the histogram shows that these fluctuations get smaller and smaller relative to the counts. With 1 million generated values, this gets clearer (Figure 2.30), and with the number of samples approaching infinity the distribution gets totally flat.

Figure 2.28: Two different sequences of random numbers generated from the same algorithm (top) and their histograms (bottom).

Figure 2.29: Histograms of two random sequences of 10,000 numbers each. Although different, all values tend to have similar occurrences.

Figure 2.30: Histogram of a sequence of 1 million random numbers. All values tend to have the same occurrences.

What we have investigated so far is the distribution of the values in the range 0–255. As we have seen, this distribution tends to be uniform (i.e. all values have the same probability of occurring). The algorithm used to generate this distribution is an algorithm meant to specifically generate a uniform distribution. Such an algorithm can be thought of as an extraction of a lottery number from a raffle box. At each extraction, the number is put back into the box. The numbers obtained by such an algorithm follow the uniform distribution, since at any time every number has the same chance of being extracted.
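The histogram experiment described above can be reproduced with a short sketch. The generator used here, `std::mt19937` with `std::uniform_int_distribution`, is an assumption made for illustration and is not necessarily the algorithm used for the figures. The function name `uniformHistogram` is likewise hypothetical.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <random>

// Count how often each 8-bit value (0-255) occurs in `count` draws from a
// uniform pseudorandom generator seeded with `seed`.
inline std::array<std::size_t, 256> uniformHistogram(std::size_t count,
                                                     std::uint32_t seed) {
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> dist(0, 255);
    std::array<std::size_t, 256> bins{};  // all bins start at zero
    for (std::size_t i = 0; i < count; ++i) {
        ++bins[static_cast<std::size_t>(dist(rng))];
    }
    return bins;
}
```

Calling it with `count` equal to 10, 10,000 and 1,000,000 shows the same trend as the figures: the bins are mostly empty at first and become nearly equal as the run gets longer.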

Another notable distribution is the so-called normal distribution or Gaussian distribution. The loudspeaker samples mentioned above may follow such a distribution, with all loudspeakers having a slight deviation from the expected impedance value (the desired one). The statistical properties of such a population are studied in terms of average (or mean) $\mu$ and standard deviation $\sigma$. If we model the impedance as a random variable following a normal distribution, we can describe its behavior in terms of mean and standard deviation and visualize the distribution as a bell-like function (a Gaussian) that has its peak at the mean value and has width dependent on the standard deviation. The standard deviation is interpreted as how much the samples deviate from the mean value. If the loudspeakers of the previous example are manufactured properly, the mean is very close to the desired impedance value and the standard deviation is quite small, meaning that very few are far from the desired value. Another way to express the width of the bell is the variance $\sigma^2$, which is the square of the standard deviation. To calculate the mean and standard deviation from a finite population of $N$ values $x_i$ (or, equally, from discrete samples of a signal), you can use the following formulae:

$$\mu = \frac{1}{N} \sum_{i=1}^{N} x_i \qquad (2.68)$$

$$\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2} \qquad (2.69)$$
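A direct translation of the two formulae into code might look like the sketch below. It assumes the population form of the standard deviation (division by N rather than N − 1) and a non-empty input. The function names are illustrative.

```cpp
#include <cmath>
#include <vector>

// Arithmetic mean of a non-empty set of values.
inline double mean(const std::vector<double>& x) {
    double sum = 0.0;
    for (double v : x) sum += v;
    return sum / static_cast<double>(x.size());
}

// Population standard deviation: square root of the mean squared
// deviation from the mean.
inline double standardDeviation(const std::vector<double>& x) {
    const double mu = mean(x);
    double sumSq = 0.0;
    for (double v : x) sumSq += (v - mu) * (v - mu);
    return std::sqrt(sumSq / static_cast<double>(x.size()));
}
```

For the eight values 2, 4, 4, 4, 5, 5, 7, 9 the mean is 5 and the standard deviation is 2.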

Two examples of normal distribution are shown in Figure 2.31.

Figure 2.31: Comparison of two normal distributions. The mean values are indicated by the diamond, and are, respectively, 248 and 255. The broader curve has larger variance, while the taller curve has lower variance. If these populations come from two loudspeaker production lines and the desired impedance is 250 Ω, which one of the two would you choose?

Going back from loudspeakers to signals, both the normal and uniform random distributions are useful in signal processing, depending on the application. Both uniform and normal random signals are classified as white noise sources, since the spectral content is flat in both cases. The difference between the two is in the distribution of the values (i.e. the amplitude of a signal), with Gaussian noise being more concentrated around the mean.

When we interpret white34 noise sources in the frequency domain, we have to be careful. These are said to be white, meaning that they have equal energy on all the frequency bands of interest; however, if we calculate the DFT of a window of white noise signal, we will not see a flat spectrum as that of a Dirac pulse. Similar to the frequency of occurrence, which tends to be flat if we take a lot of random samples, the DFT of a white noise window tends to be flat if we take a large number of samples. Similarly, we can average the frequency spectra obtained by several windows of white noise to see that it tends to get flat. Random signals are treated in frequency by replacing the concept of spectrum with the concept of power spectral density. We will not need to develop this theory for the purposes of this book; however, if the reader is interested, he or she can refer to Oppenheim and Schafer (2009).

White noise is not colored (i.e. it is flat in frequency). Pink noise is an extremely important form of colored noise. It is very common to find white and pink noise sources in synthesizers. Pink noise is also named 1-over-f noise, or $1/f$ noise, because its power spectral density is inversely proportional to the frequency. In other words, its power decays in frequency by −10 dB/decade, or about −3 dB/octave. For this reason, it is darker than white noise and more usable for musical application. Integrating white noise imposes a steeper slope of −6 dB/octave (−20 dB/decade), which yields a power spectral density proportional to $1/f^2$, usually called brown or red noise; pink noise requires a gentler filter, with a slope of about −3 dB/octave.
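The integration of white noise mentioned above can be sketched as a running sum. The small leak (a coefficient slightly below 1) is an assumption added here so that the sum stays bounded; above the leak's corner frequency the spectrum falls at roughly −6 dB/octave. The generator and the function name `integratedNoise` are illustrative.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Leaky integration of uniform white noise in [-1, 1).
// leak must satisfy 0 <= leak < 1; values close to 1 integrate more strongly.
inline std::vector<float> integratedNoise(std::size_t length, float leak,
                                          std::uint32_t seed) {
    std::mt19937 rng(seed);
    std::uniform_real_distribution<float> white(-1.0f, 1.0f);
    std::vector<float> out(length);
    float state = 0.0f;
    for (std::size_t n = 0; n < length; ++n) {
        state = leak * state + white(rng);  // running sum with a small leak
        out[n] = state;
    }
    return out;
}
```

Since the input never exceeds 1 in magnitude, the output magnitude cannot exceed 1 / (1 − leak), so the result should be scaled down before being sent to an audio output.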

Figure 2.32: Power spectral density of white noise (a) and pink noise (b). The latter exhibits a spectral decay; for an ideal $1/f$ power spectral density this is 3 dB/oct (or 10 dB/decade).

Having provided a short, intuitive description of random signals, a few words should be dedicated to discrete-time random number generation algorithms. Without the intent of being exhaustive, we shall just introduce a couple of key concepts of random number generators. While natural processes such as thermal noise are totally unpredictable, random number generators are, as any other software algorithm, made by sequences of machine instructions. As such, these algorithms are deterministic (i.e. algorithms that always produce the same outcome given the same initial conditions). This makes them somewhat predictable. For this reason, they are said to be pseudorandom, as they appear random on a local scale, but they are generated by a deterministic algorithm.

Thinking in musical terms, we can produce white noise by triggering a white noise sample. However, this sample is finite, and requires looping to make it last longer. Further repetitions of the sample will be predictable as we can store the first playback and compare the new incoming values to the ones we stored, spotting how the next sample will look. The periodicity of the noise sample can be spotted by the human ear if the period is not too long.

There are better ways to produce random numbers that are based on complex algorithms; however, these algorithms are subject to the same problem of periodicity, with the only added value that they do not require memory to store samples, and thus can be extremely long (possibly longer than your lifetime) without requiring storage resources. Most of these algorithms are based on simple arithmetic and bit operations, making them computationally cheap and quick. The C code snippet below calculates a random 0/1 value at each iteration and can serve as a taster. The 16-bit variable rval can be accessed bitwise. This is a very simple example, and thus has a short period. Indeed, if you send it to your sound card, you will spot the periodicity, especially at high sampling rates, because the period gets shorter in time.

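       uint16_t rval = 0xACE1u; // assumed non-zero seed; uint16_t needs <stdint.h>
       for (;;) {
           // XOR the tap bits 0, 1, 11 and 13 of the shift register
           uint16_t fout = (rval ^ (rval >> 1) ^ (rval >> 11) ^ (rval >> 13)) & 1u;
           rval = (uint16_t)((rval << 1) | fout); // shift and create a feedback loop
           // use the random value fout for your scopes
       }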
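To experiment with the shift-register generator above, it can help to wrap one iteration in a function and count how many steps pass before the register returns to a previous state. The sketch below is one way to do so. The names `lfsrStep` and `lfsrPeriod` and the warm-up length are assumptions made for illustration.

```cpp
#include <cstdint>

// One iteration of the generator: returns the new 0/1 value and updates
// the 16-bit register in place (taps on bits 0, 1, 11 and 13).
inline unsigned lfsrStep(std::uint16_t& rval) {
    const unsigned fout =
        (rval ^ (rval >> 1) ^ (rval >> 11) ^ (rval >> 13)) & 1u;
    rval = static_cast<std::uint16_t>((rval << 1) | fout);
    return fout;
}

// Number of steps after which the register state repeats. A short warm-up
// is run first, because the two top bits only delay older values and may
// not match the rest of the seed.
inline unsigned lfsrPeriod(std::uint16_t seed) {
    std::uint16_t state = seed;
    for (int i = 0; i < 16; ++i) lfsrStep(state);
    const std::uint16_t start = state;
    unsigned count = 0;
    do {
        lfsrStep(state);
        ++count;
    } while (state != start && count < 65536u);
    return count;
}
```

Dividing the measured period by the sampling rate gives the duration of one repetition in seconds, which is why the repetition becomes easier to hear at higher sampling rates. Two runs started from the same register value produce exactly the same bits.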

Random number generators are initialized by a random seed (i.e. an initial value – often the current timestamp – that changes from time to time), ensuring the algorithm is not starting every time with the same conditions. If we start the algorithm twice with the same seed, the output will be exactly the same. In general, this is more important for cryptography than for our musical generation of noise, but keep in mind, especially during debugging, that random sequences will stay the same if we use the same seed each time, making an experiment based on random values reproducible.

2.13 Numerical Issues and Hints on Coding

Before concluding the boring part of the book, I want to write down a few remarks related to coding for math in C++. The topics we are going to discuss here are:

   computational cost of basic operators;
   issues with division; and
   variable types and effects on numerical precision.

First of all, a few things must be said regarding numerical formats. A lot of DSP literature of the past dealt with fixed-point processing, when floating-point hardware was more expensive. Fixed-point processing makes use of integer numbers to represent fractional numbers using a convention decided by the developer. Fixed-point arithmetic requires a lot of care and is subject to issues of numerical precision and quantization (a topic that we can neglect in this book). Since VCV Rack runs on x86 processors, there is a lot of floating-point power available. We will not even discuss fixed-point arithmetic here, and after all, with floating-point numbers, developing is easier and quicker.

Two types of floating-point variables are available: single precision (32-bit) and double precision (64-bit). The ubiquitous floating-point format is described by IEEE 754, nowadays implemented in most architectures, from personal computers, to smartphones and advanced microcontrollers. The standard describes both single- and double-precision formats. Single-precision floats represent numbers using an 8-bit exponent, a 23-bit mantissa, and a sign bit. They are thus able to span a range from about $-3.4 \times 10^{38}$ to $3.4 \times 10^{38}$. Impressive, uh? They do so by using a nonuniform spacing, using smaller gaps between adjacent numbers in the low range (toward zero) and larger gaps when going toward infinity. Numbers smaller in magnitude than about $1.18 \times 10^{-38}$ can be represented as well but require special handling. These are called denormals, and can slow down computing performance by a factor of as much as 100, so they should be avoided.
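The limits of the single-precision format can be inspected with the standard library, as in the sketch below: `std::numeric_limits<float>::max()` is the largest finite value, `min()` is the smallest normal positive value, and `std::fpclassify` reports whether a number is a denormal (called subnormal in the standard). The helpers `isDenormal` and `flushDenormal` are illustrative names, and replacing denormals by zero is only one possible way of avoiding them.

```cpp
#include <cmath>
#include <limits>

// True if x is a denormal (subnormal) number.
inline bool isDenormal(float x) {
    return std::fpclassify(x) == FP_SUBNORMAL;
}

// Replace denormal values by zero and leave every other value untouched.
inline float flushDenormal(float x) {
    return isDenormal(x) ? 0.0f : x;
}

// Smallest normal positive float; anything smaller in magnitude
// (except zero) is a denormal.
inline float smallestNormal() {
    return std::numeric_limits<float>::min();
}
```

A typical place for such a check is the state variable of a recursive filter, which decays toward zero when the input goes silent.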

Double-precision floating-point numbers are not really required for normal operations in audio DSP; however, there are cases when the double precision helps to avoid noise and other issues. There is a small case study in this book; however, most of the time, single-precision floating-point numbers will be used.

The most common mathematical operators we are going to use in the book are sum, difference, product, division, and conditional operators, as well as modulo. Current x86 processor architectures implement these instructions in several flavors and for both integer and floating-point operands. Each type of instruction takes a different number of clock cycles, depending on its complexity and the type of operands. From the technical documentation of a recent and popular x86 processor (Fog, 2018), one can notice that a sum between two integer operands takes one cycle while the sum of two floating-point operands takes three cycles. Similarly, the product of two integer operands takes three cycles, while with floating-point operands it takes five. The division between two integer operands requires at least 23 cycles, while for floating-point operands only 14 to 16 are required. The integer division instruction also calculates the remainder (modulo). As you can see, divisions take up to an order of magnitude longer. For this reason, divisions should be reduced as much as possible, especially in those parts of the code that are executed more often (e.g. at each sample). Sometimes divisions can be precalculated and the result can be employed for the execution of the algorithm. In some cases, this is done automatically by the C++ compiler if optimizations are enabled. For instance, to reduce the amplitude of a signal by a factor of 10, both the following are possible:


   signal = signal / 10;
   signal = signal * 0.1;

With integer operands, an optimizing compiler replaces a division by a constant with cheaper multiply-and-shift instructions. With floating-point operands, it switches from the first form to the second only when the result is guaranteed to be identical (e.g. when dividing by a power of 2) or when relaxed floating-point options such as -ffast-math are enabled, because 0.1 cannot be represented exactly. Writing the multiplication explicitly is therefore the safer choice. When both operands can change at runtime, some strategies are still viable. Here is an example: if the amplitude of a signal should be reduced by a user-selectable factor, we can call a function only when the user inputs a new value:


   normalizationFactor = 1.0f / userFactor;

and then use normalizationFactor in the code executed for each sample:


   signal = signal * normalizationFactor;

This will save resources because the division is done only once, while a product is done during normal execution of the signal processing routine.
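Put together, the two steps above could be organized as in the following sketch, where the division is confined to a setter and the per-sample routine only multiplies. The struct `Attenuator` and its method names are hypothetical, and the user factor is assumed to be non-zero.

```cpp
// Hypothetical gain stage that divides a signal by a user-selectable factor.
struct Attenuator {
    float normalizationFactor = 1.0f;

    // Called only when the user inputs a new value: one division.
    // userFactor is assumed to be non-zero.
    void setUserFactor(float userFactor) {
        normalizationFactor = 1.0f / userFactor;
    }

    // Called for every sample: one multiplication, no division.
    float process(float signal) const {
        return signal * normalizationFactor;
    }
};
```

After `setUserFactor(4.0f)`, calling `process(2.0f)` returns 0.5, the same value as the division 2 / 4.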

Another issue with division is the possibility of dividing by zero. In floating-point arithmetic, dividing a non-zero number by zero yields positive or negative infinity, while dividing zero by zero yields the value “not a number” (NaN). Infinities easily turn into NaNs in later operations (e.g. when two infinities are subtracted), and a NaN propagates through every calculation that follows. Dividing an integer by zero is undefined behavior in C++ and typically crashes the program. NaN must be avoided and filtered away at any cost, and division by zero should be avoided by proper coding techniques, such as adding a small non-null number in the denominator of a division. If you develop a filter that may in some circumstances produce NaNs, you should add a check on its output that resets its states and clears the output from these values.
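Both precautions can be sketched briefly. The guarded division assumes a non-negative denominator (for example an energy or a magnitude), and the one-pole filter is a hypothetical example of a recursive structure that resets itself when its state stops being a finite number. The names `safeDivide` and `OnePole`, and the default offset, are illustrative.

```cpp
#include <cmath>

// Division guarded by a small positive offset in the denominator.
// Assumes den is non-negative.
inline float safeDivide(float num, float den, float eps = 1e-9f) {
    return num / (den + eps);
}

// Hypothetical one-pole low-pass filter that recovers from invalid values.
struct OnePole {
    float state = 0.0f;
    float coeff = 0.5f;  // smoothing coefficient, 0 < coeff <= 1

    float process(float in) {
        state += coeff * (in - state);
        if (!std::isfinite(state)) {  // catches NaN and +/- infinity
            state = 0.0f;             // reset the filter state
        }
        return state;
    }
};
```

Without the reset, a single NaN at the input would remain in the filter state forever, because every later output depends on it.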

Calculating the modulo is less expensive when the second operand is a power of 2. In this case, bit operations can be used, requiring fewer clock cycles. For instance, to reset an increasing variable when it reaches a threshold value that is a power of 2 (say 16), it is sufficient to write:


   var++;
   var = var & 0xF;

where “0x” is used to denote the hexadecimal notation. 0xF is decimal 15. When var is, say, 10 (0xA, or in binary format 1010), a binary AND with 0xF (binary 1111) will give back number 10. When var is 16 (0x10, binary 10000), the binary AND will give 0.
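The same idea can be written for any power-of-2 size, as in the sketch below: the mask is the size minus one. The function names are illustrative, and the trick is valid only when the size really is a power of 2, which the second helper checks.

```cpp
#include <cstdint>

// Advance a counter and wrap it at `size`, which must be a power of 2.
inline std::uint32_t wrapIncrement(std::uint32_t var, std::uint32_t size) {
    return (var + 1u) & (size - 1u);
}

// True if size is a non-zero power of 2, i.e. the mask trick is valid.
inline bool isPowerOfTwo(std::uint32_t size) {
    return size != 0u && (size & (size - 1u)) == 0u;
}
```

With a size of 16, a counter at 9 advances to 10, and a counter at 15 wraps to 0, exactly as the modulo operation would do.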

As for double- and single-precision execution times, in general, double-precision instructions execute slower than single-precision floating-point instructions, but on x86 processors the difference is not very large. To obtain maximum speedup, in any case, it is useful to specify the format of a number to the compiler so that unnecessary operations do not take place. In the following lines, for example:


   float var1, var2;
   var2 = 2.1 * var1;

the literal 2.1 will be understood by the compiler as a double-precision number. The variable var1 will thus be converted to double precision, the product will be done as a double-precision product, and finally the result will be converted to single precision for the assignment to var2. A lot of unnecessary steps! The solution is to simply state that the literal should be a single-precision floating-point value (unless the other operand is a double and you know that the operation should have a double precision):


   var2 = 2.1f * var1;

The “f” suffix will do that. If you are curious about how compilers transform your code, you can perform comparisons using online tools that are called compiler explorers or interactive compilers. These are nice tools to see how your code will probably be compiled, whether out of curiosity or because you are an expert coder looking for extreme optimizations.
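The type that the compiler assigns to each product can be checked at compile time, as in the following sketch. The function names are illustrative, and `static_assert` with `decltype` is used here only as a convenient way of showing the types.

```cpp
#include <type_traits>

inline float scaleSingle(float var1) {
    // The float literal keeps the whole product in single precision.
    static_assert(std::is_same<decltype(2.1f * var1), float>::value,
                  "float * float stays float");
    return 2.1f * var1;
}

inline float scaleViaDouble(float var1) {
    // The unsuffixed literal is a double, so var1 is converted first and
    // the product is computed in double precision.
    static_assert(std::is_same<decltype(2.1 * var1), double>::value,
                  "double * float is computed as double");
    return static_cast<float>(2.1 * var1);
}
```

Both functions compile, which confirms the two types. The second one makes the extra conversion back to single precision explicit.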

To conclude, let us briefly talk about conditional statements. The use of conditional clauses (if-else, ‘?’ operator) is not recommended in those parts of the code that can be highly optimized by the compiler, such as a for loop. When no conditional statements are present in the loop, the compiler can unroll the loop and parallelize it so that it can be executed in a batch fashion. This can be done because the compiler knows that the code will always execute in the same way. However, if a conditional statement is present, the compiler cannot know beforehand how the code will execute, whether it will jump and where, and thus the code may not be optimized as effectively. In DSP applications, the main reason for using for loops is the fact that the audio is processed in buffers. As we shall see later, in VCV Rack, the audio is processed sample by sample, requiring less use of loops.

2.14 Concluding Remarks

In this chapter, we have examined a lot of different aspects of signals and systems. Important aspects covered along this chapter include the following:

   Continuous-time signals are not computable because they are infinite sets of values and their amplitude belongs to the real set. Discrete-time signals can be computed after discretization of their amplitude value (quantization).
   Discrete-time signals are easy to understand and make some jobs quick. As an example, integral and derivative operations become sum and difference operators, while amplitude modulation, which requires complex electronic circuits, is easily obtained by multiplication.
   Discrete-time systems are used to process discrete-time signals. These systems can be linear and nonlinear. Both are important in musical applications. A review of these systems is done to contextualize the rest of the book.
   The Fourier theory has been reviewed, starting from DFS and getting to the DFT and the STFT. Looking at a signal in the frequency domain allows you to observe new properties of a signal and also allows you to perform some operations faster (e.g. convolution).
   The FFT is just an algorithm to compute the DFT, although these terms are used equivalently in audio processing parlance.
   The sampling theorem is discussed, together with its implications on aliasing and audio quality.
   Ideal periodic signals generated by typical synthesizer oscillators have been discussed.
   Random signals have been introduced with some practical examples. These signals are dealt with differently than other common signals. White noise is an example of a random signal.
   The frequency of occurrence of a random value should not be confused with the angular frequency of a signal.
   Only pseudorandom signals can be generated by computers, but they are sufficient for our applications.
   A few useful tips related to math and C++ are reported.

Exercises and Questions

   Do you produce or record music? If so, is your DAW linear or not? Are there input ranges where we can consider the system linear and ranges where we cannot consider it as linear?
   We know that a distortion effect is non-linear and a filter is linear. What happens if we combine them in a cascade? Does the result change if we put the distortion before or after the filter?
   Suppose you have a spectrum analyzer. You attach a sawtooth oscillator to it and observe that its spectral content reaches as high as 24 kHz. If you connect the oscillator to a digital effect with a sampling rate of 44.1 kHz, and feed the digital effect output to the spectrum analyzer, what will you likely observe? What if the digital effect has a sampling rate of 48 kHz?
   Table 2.3 shows the DFTs of notable signals; however, it does not consider the effect of the sampling. Are there non-bandlimited signals in the table? If so, what should the DFTs look like in reality?
   Can you devise a nonlinear function that transforms an input sine wave into a square wave? Can you see how an added offset changes it into a rectangle wave with variable duty cycle?

Notes

1 As a student, it bothered me that a positive time shift of T is obtained by subtracting T, but you can easily realize why it is so.

2 Sometimes it is fun to draw connections between signal processing and other scientific fields. Interesting relations between time and frequency exist that can only be explained with quantum mechanics, and lead to a revised version of Heisenberg’s uncertainty principle, where time and frequency are complementary variables (Gabor, 1946).

3 Nowadays, even inexpensive 8-bit microcontrollers such as Atmel 8-bit AVR, used in Arduino boards, have multiply instructions.

4 Spoiler alert: We shall see later that the whole spectrum that we can describe in a discrete-time setting runs from 0 to 2π, and the interesting part of it goes from 0 to π, the upper half being a replica of the lower half.

5 Another tip about scientific intersections: signal processing is a branch of engineering that draws a lot of theory from mathematics and geometry. Transforms are also studied by functional analysis and geometry, because they can be seen in a more abstract way as means to reshape signals, vectors, and manifolds by taking them from a space into another. In signal processing, we are interested in finding good transforms for practical applications or topologies that allow us to determine distances between signals and find how similar they are, but we heavily rely on the genius of mathematicians who are able to treat these high-level concepts without even needing practical examples to visualize them!

6 Different from the transforms that we shall deal with, the human ear maps the input signal into a two-dimensional space with time and frequency as independent variables, where the frequency is approximately logarithmically spaced. Furthermore, the ear has a nonlinear behavior. Psychoacoustic studies give us very complex models of the transform implied by the human ear. Here, we are interested in a simpler and more elegant representation that is guaranteed by the Fourier series and the Fourier transform.

7 We shall see that there are slightly different interpretations of the frequency domain, depending on the transform that is used or whether the input signal is a continuous- or discrete-time one.

8 Following the convention of other textbooks, we shall highlight that a signal is periodic by using the tilde, if this property is relevant for our discussion.

9 Partials may also be inharmonic, as we shall see with non-periodic signals.

10 Engineering textbooks usually employ the letter j for the imaginary unit, while high school textbooks usually employ the letter i.

11 The negative frequency components are redundant for us humans, but not for the machine. In general, we cannot just throw these coefficients away.

12 Non-stationary signals cannot be described in the framework of the Fourier transform, but require mixed time-frequency representations such as the short-time Fourier transform (STFT), the wavelet transform, and so on. Historically, one of the first comments regarding the limitations of the continuous-time Fourier transform can be found in Gabor (1946) or Gabor (1947).

13 For the sake of completeness: the DFT also accepts complex signals, but audio signals are always real.

14 Do not try this at home! Read a book on FIR filter design by windowing first.

15 The exact number is real products and

real sums.

16 Unless the original signal is zero everywhere outside the taken slice.

17 Although different approximation schemes exist.

18 It is very interesting to take a look at the original paper from Nyquist (1928), where the scientist developed his theory to find the upper speed limit for transmitting pulses in telegraph lines. These were all but digital. Shannon (1949) provided a proof for the Nyquist theorem, integrating it in his broader information theory, later leading to a revolution in digital communications.

19 More on this in the next section.

20 It actually is the same as the anti-aliasing filter. Symmetries are very common in math.

21 I cannot avoid mentioning the famous Italian mathematician Vito Volterra. He was born in my region, but his fortune lay in having escaped early in his childhood this land of farmers and ill-minded synth manufacturers.

22 The maximum value obtained with a signed integer is $2^{N-1} - 1$. This is 32,767 for N = 16 bits, or around 8.4 million for N = 24 bits. Floating-point numbers can reach up to about $3.4 \times 10^{38}$ in single precision.

23 Not to be confused with an audio mixer (i.e. a mixing console). In electronic circuits and communication technology, a mixer is just a multiplier.

24 Incidentally, it should be mentioned that the Theremin exploits the same principle to control the pitch of the tone.

25 Again, here, mixer is not intended as an audio mixing console.

26 The modulating signal can be considered constant for a certain time frame.

27 Inharmonic means that the partials of the sound are not integer multiples of the fundamental, and are thus not harmonic.

28 The original research by Chowning was conducted by implementing FM as a dedicated software on a PDP-1 computer and as a MUSIC V program.

29 The two frequencies are integers.

30 Actually, Equation 2.63 describes a PM operator. An FM operator requires the integral term, resulting in a somewhat more complex equation:

. Yamaha operators also feature an EG and a digitally controlled amplifier, but for simplicity we shall not discuss this.

31 The modulating signal is approximately constant for a certain time frame.

32 Similar considerations can be done for a continuous-time random signal, but dealing with discrete-time signals is easier.

33 Unless we are at absolute zero.

34 The term “white” comes from optics, where white light was soon identified as a radiation with equal energy in all the visible frequency spectrum.