Lab 7 – Voice recognition with a linear time invariant system

The goal of this lab is to create a linear time invariant system that implements online recognition of a spoken digit.

• Click here to download the assignment.

• Due on March 4 by 5pm.

Comparison with average spectrum

We begin by acquiring our training set, prerecording the word ‘one’ and the word ‘two’. To this end we record $K$ waveforms $y_i$ for the spoken word ‘one’ and $K$ waveforms $z_i$ for the spoken word ‘two’. We normalize the respective signals to have unit energy and compute their DFTs to get
\begin{equation}
Y_i = \ccalF\left( \frac{y_i}{\|y_i\|} \right)
\quad \text{and} \quad
Z_i = \ccalF\left( \frac{z_i}{\|z_i\|} \right).
\end{equation}
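As a minimal sketch of this normalization and transform step, the Python snippet below scales each waveform to unit energy and takes its DFT with NumPy. The synthetic random arrays stand in for real recordings. The names `unit_energy_dft`, `K`, and `N`, and the choice of the unitary (`norm="ortho"`) DFT, are assumptions made for illustration and are not taken from the lab's code.

```python
import numpy as np

def unit_energy_dft(signal):
    """Normalize a waveform to unit energy and return its DFT."""
    signal = np.asarray(signal, dtype=float)
    energy_norm = np.linalg.norm(signal)
    if energy_norm == 0:
        raise ValueError("cannot normalize an all-zero signal")
    # norm="ortho" selects the unitary DFT, which preserves energy (assumption).
    return np.fft.fft(signal / energy_norm, norm="ortho")

# Synthetic stand-ins for K recordings of length N (hypothetical values).
rng = np.random.default_rng(0)
K, N = 3, 64
recordings_one = rng.standard_normal((K, N))   # plays the role of y_1, ..., y_K

# One row per training sample: Y[i] is the DFT of the normalized y_i.
Y = np.array([unit_energy_dft(y) for y in recordings_one])
```

With the unitary DFT, each row of `Y` has unit energy, matching the unit energy of the normalized waveform.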

With the training sets acquired, we observe a new signal $x$ and compare its DFT $X=\ccalF(x)$ with the DFTs contained in the training sets $\ccalY=\{Y_i\}_{i=1}^{K}$ and $\ccalZ=\{Z_i\}_{i=1}^{K}$. To that end, we define the score function $q(X_{1},X_{2})$, which compares the DFTs $X_{1}(k)$ and $X_{2}(k)$ of two signals $x_{1}$ and $x_{2}$ of length $N$, as
\begin{equation}
q(X_{1},X_{2}) = \sum_{k} \vert X_{1}(k) \vert^{2} \cdot \vert X_{2}(k) \vert^{2}
\end{equation}
for $k$ in any contiguous set of $N$ integers. In this lab, we are asked to compare against the average spectrum of each training set, obtained as follows
\begin{equation}
\bbarY = \frac{1}{K} \sum_{i=1}^{K} \vert Y_i \vert
\quad \text{and} \quad
\bbarZ = \frac{1}{K} \sum_{i=1}^{K} \vert Z_i \vert.
\end{equation}
Interpret $H_{y}=\bbarY$ and $H_{z}=\bbarZ$ as frequency responses of filters $h_{y}=\ccalF^{-1}(H_{y})$ and $h_{z} = \ccalF^{-1}(H_{z})$.
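The definitions above can be sketched in a few lines of NumPy: the average spectrum, the score $q$, and the impulse response obtained by inverting the average spectrum. The function names, the synthetic training data, and the use of the unitary DFT are illustrative assumptions. The sketch also assumes real-valued recordings, so that the magnitude spectra are even in $k$ and the resulting filter is real.

```python
import numpy as np

def average_spectrum(dfts):
    """Average of the DFT magnitudes |Y_i| over the K training samples (rows)."""
    return np.mean(np.abs(dfts), axis=0)

def score(X1, X2):
    """Score q(X1, X2) = sum_k |X1(k)|^2 |X2(k)|^2."""
    return float(np.sum(np.abs(X1) ** 2 * np.abs(X2) ** 2))

def impulse_response(H):
    """Inverse unitary DFT of a real, even frequency response H."""
    # For real recordings |Y_i(k)| is even in k, so h is real up to round-off.
    return np.fft.ifft(H, norm="ortho").real

# Synthetic stand-ins for K unit-energy training waveforms of length N (assumption).
rng = np.random.default_rng(1)
K, N = 4, 32
y = rng.standard_normal((K, N))
y /= np.linalg.norm(y, axis=1, keepdims=True)
Y = np.fft.fft(y, axis=1, norm="ortho")

H_y = average_spectrum(Y)      # frequency response, the average spectrum of the 'one' set
h_y = impulse_response(H_y)    # corresponding filter h_y
q_first = score(Y[0], H_y)     # score of the first training sample against the average
```

Because the filter is built once from the training set, it can be computed offline and reused for every new signal.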

To use these impulse responses to compute the score functions $q(X,\bbarY)$ and $q(X,\bbarZ)$ without using the DFT of the signal $x$, we first need to interpret the score function $q$. Defining $H_2=|X_2|$ and $X_3=X_1H_2$, we can relate the score $q$ to the norm of $X_3$ as follows,
\begin{align}
\|X_3\|^2 &= \sum_{k=0}^{N-1} |X_3(k)|^2 \\
&= \sum_{k=0}^{N-1} |X_1(k) H_2(k)|^2 \\
&= \sum_{k=0}^{N-1} \big| X_1(k)\, |X_2(k)| \big|^2 \\
&= \sum_{k=0}^{N-1} |X_1(k)|^2 \, |X_2(k)|^2 = q(X_1,X_2).
\end{align}

In other words, the score function is nothing more than the squared norm of a product of DFTs. This lets us implement the score functions $q(X,\bbarY)$ and $q(X,\bbarZ)$ in the time domain. By Parseval’s energy conservation theorem, the energy of a signal in the time domain is equal to the energy of its spectrum, so
\begin{align}
\|X_3\|^2 &= \|\ccalF^{-1}(X_1 H_2)\|^2 \\
&= \|\ccalF^{-1}(X_1) * \ccalF^{-1}(H_2)\|^2 \\
&= \|x_1 * h_2\|^2.
\end{align}
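We conclude that the score function can be obtained without resorting to the frequency domain by exploiting the duality between convolution of signals in time and multiplication of their spectra,
\begin{equation}
q(X_1,X_2) = \|x_1 * h_2\|^2.
\end{equation}
Here $*$ denotes circular convolution, and the equality holds up to a constant factor set by the DFT normalization, which does not affect the comparison between the two scores.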
We can thus implement the score functions $q(X,\bbarY)$ and $q(X,\bbarZ)$ by computing the norms $\|x * h_y\|^2$ and $\|x * h_z\|^2$.
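The equivalence between the frequency-domain score and the time-domain convolution norm can be checked numerically. The sketch below is an illustration on synthetic data and is not the lab's `conv_comparison` class. It assumes the unitary DFT (`norm="ortho"`), under which the squared norm of the circular convolution equals $N$ times the score, and the helper name `circular_convolve` is hypothetical.

```python
import numpy as np

def circular_convolve(x, h):
    """Circular convolution c[n] = sum_m x[m] h[(n - m) mod N], computed in time."""
    N = len(x)
    n = np.arange(N)
    idx = (n[:, None] - n[None, :]) % N   # idx[n, m] = (n - m) mod N
    return h[idx] @ x

# Synthetic test signal and a synthetic template that defines the filter (assumptions).
rng = np.random.default_rng(2)
N = 32
x = rng.standard_normal(N)
template = rng.standard_normal(N)

# Frequency response H = |DFT(template)| and its impulse response h.
H = np.abs(np.fft.fft(template, norm="ortho"))
h = np.fft.ifft(H, norm="ortho").real

# Score computed in frequency, and the same score computed purely in time.
q_freq = np.sum(np.abs(np.fft.fft(x, norm="ortho")) ** 2 * H ** 2)
q_time = np.sum(circular_convolve(x, h) ** 2) / N   # divide by N for the unitary DFT
```

The two values agree to numerical precision. Since the factor $N$ is common to both scores, it can be dropped when only the comparison between them matters.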

Online Implementation

For this part of the lab we exploit the fact that, because we never resort to the spectrum of the signal, the comparison can actually run online. In other words, since all our calculations are carried out on the time representation of the signals, we can record and evaluate in parallel. To implement this part of the lab we take chunks of 1 s and continuously compute the norm of the convolution of each chunk with $h_y$ and $h_z$. We then record, evaluate, rinse and repeat! Non-stop!
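The record-and-evaluate loop can be sketched as follows. This is an illustration under stated assumptions and is not the lab's `online_comparison` class: a prerecorded array stands in for the microphone stream, pure tones stand in for the two spoken words, and the names `stream_chunks`, `classify_chunk`, and `filter_from_template` are hypothetical. The filters are built offline with the DFT, while each incoming chunk is scored using time-domain convolution only.

```python
import numpy as np

def circular_convolve(x, h):
    """Circular convolution of two equal-length real signals, computed in time."""
    N = len(x)
    full = np.convolve(x, h)      # linear convolution, length 2N - 1
    out = full[:N].copy()
    out[:N - 1] += full[N:]       # wrap the tail back onto the start
    return out

def filter_from_template(template):
    """Offline step: filter whose frequency response is the template's magnitude spectrum."""
    H = np.abs(np.fft.fft(template / np.linalg.norm(template), norm="ortho"))
    return np.fft.ifft(H, norm="ortho").real

def classify_chunk(chunk, h_y, h_z):
    """Compare the energies of chunk * h_y and chunk * h_z and pick the larger."""
    score_one = np.sum(circular_convolve(chunk, h_y) ** 2)
    score_two = np.sum(circular_convolve(chunk, h_z) ** 2)
    return ("one" if score_one >= score_two else "two"), score_one, score_two

def stream_chunks(samples, chunk_len):
    """Yield consecutive fixed-length chunks, standing in for a live recording."""
    for start in range(0, len(samples) - chunk_len + 1, chunk_len):
        yield samples[start:start + chunk_len]

# Synthetic stand-ins: two tones play the role of the words 'one' and 'two'.
N = 64                                   # samples per chunk (hypothetical)
n = np.arange(N)
tone_one = np.cos(2 * np.pi * 4 * n / N)
tone_two = np.cos(2 * np.pi * 12 * n / N)
h_y = filter_from_template(tone_one)
h_z = filter_from_template(tone_two)

stream = np.concatenate([tone_one, tone_two])
labels = [classify_chunk(chunk, h_y, h_z)[0] for chunk in stream_chunks(stream, N)]
```

On this synthetic stream the first chunk is labeled `"one"` and the second `"two"`. In a real setup the generator would be replaced by whatever recording routine supplies one-second chunks.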

The code described here can be downloaded as a folder that contains the following six files:

  • $\p{record\}$: The class $\p{record\_sound}$ records a sound and outputs a wavelet recording.
  • $\p{training\}$: The class $\p{training\_set}$ records the training set.
  • $\p{test\}$: The class $\p{test\_set}$ records the test set.
  • $\p{}$: The class $\p{idft}$ computes the Inverse Discrete Fourier Transform of the spectrum of a signal.
  • $\p{conv\}$: The class $\p{conv\_comparison}$ computes the score function $q$ using the norm square of the convolution.
  • $\p{online\}$: The class $\p{online\_comparison}$ computes the score function $q$ using the norm square of the convolution in an online scheme.