Understanding WaveNet architecture
WaveNet is a deep autoregressive generative model that takes raw audio as input and produces human-like speech, and it brought new dynamics to speech synthesis. The name combines two ideas: the wavelet transform and neural networks.

As stated in the original paper, raw human speech is typically stored as a sequence of 16-bit integers, which gives \(2^{16}\) (65,536) possible quantization values per sample. Modelling these values directly with a softmax output would require 65,536 neurons, making the model computationally expensive. The sequence therefore needs to be reduced to 8 bits. This is done with the \(\mu\)-law transformation,

\(F(x) = \operatorname{sign}(x) \, \frac{\ln(1 + \mu \lvert x \rvert)}{\ln(1 + \mu)}\), \(-1 \le x \le 1\),

where \(\mu = 255\) and \(x\) denotes the input sample; the transformed signal is then quantized to 256 values.

The first step of audio preprocessing is thus converting the input waveform into quantized values within a fixed integer range. The integer amplitudes are then one-hot encoded, and the one-hot encoded samples are passed through a causal convolution.
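The preprocessing pipeline above can be sketched in NumPy. This is a minimal illustration, not the original implementation: the function names (`mu_law_encode`, `one_hot`, `causal_conv1d`) and the toy sine-wave input are assumptions for the example, and the causal convolution is written as an explicit loop for clarity rather than speed.

```python
import numpy as np

def mu_law_encode(x, mu=255):
    # mu-law companding: map x in [-1, 1] through F(x), then
    # quantize the result to mu + 1 = 256 integer levels.
    fx = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((fx + 1) / 2 * mu + 0.5).astype(np.int64)  # integers in [0, 255]

def one_hot(q, levels=256):
    # Turn integer amplitudes into one-hot rows of shape (T, levels).
    out = np.zeros((len(q), levels), dtype=np.float32)
    out[np.arange(len(q)), q] = 1.0
    return out

def causal_conv1d(x, w):
    # x: (T, C_in), w: (K, C_in, C_out). Left-padding by K - 1 ensures
    # the output at time t depends only on inputs at times <= t.
    k = w.shape[0]
    xp = np.pad(x, ((k - 1, 0), (0, 0)))
    y = np.zeros((x.shape[0], w.shape[2]), dtype=np.float32)
    for t in range(x.shape[0]):
        y[t] = np.einsum('kc,kco->o', xp[t:t + k], w)
    return y

# Toy waveform standing in for raw audio (assumed, for illustration only).
audio = np.sin(np.linspace(0, 8 * np.pi, 400)).astype(np.float32)
q = mu_law_encode(audio)                                 # (400,) ints in [0, 255]
h = one_hot(q)                                           # (400, 256)
w = (np.random.randn(2, 256, 32) * 0.01).astype(np.float32)
y = causal_conv1d(h, w)                                  # (400, 32), causal
```

Note that a zero-amplitude sample maps to the middle of the range (level 128), and the left-only padding is exactly what distinguishes a causal convolution from an ordinary centered one.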