Brussels / 2 & 3 February 2019


DeepSpeech: A Journey to <10% Word Error Rate STT

What DeepSpeech is and how you can use it today

Deep Speech is an end-to-end trainable, character-level, deep recurrent neural network (RNN). In less buzzwordy terms: it’s a deep neural network with recurrent layers that takes audio features as input and directly outputs characters, i.e. the transcription of the audio. It can be trained from scratch using supervised learning, without any external “sources of intelligence” such as a grapheme-to-phoneme converter or forced alignment on the input.
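To make "outputs characters directly" concrete, here is a minimal sketch of how per-frame character probabilities can be turned into a transcript. It assumes a CTC-style output with a blank symbol and uses greedy (best-path) decoding, which is a simplification chosen for illustration and not necessarily what the engine does. The names `greedy_decode`, `BLANK` and the toy alphabet are hypothetical.

```python
# Hypothetical toy setup: "_" stands for the CTC blank symbol (an assumption).
BLANK = "_"


def greedy_decode(frame_probs, alphabet):
    """Pick the most likely symbol per frame, merge repeats, drop blanks."""
    # Most likely symbol for every audio frame.
    best = [alphabet[max(range(len(p)), key=p.__getitem__)] for p in frame_probs]

    transcript = []
    previous = None
    for symbol in best:
        # Emit a character only when it differs from the previous frame
        # and is not the blank symbol.
        if symbol != previous and symbol != BLANK:
            transcript.append(symbol)
        previous = symbol
    return "".join(transcript)


if __name__ == "__main__":
    alphabet = ["_", "a", "c", "t"]
    # One probability row per frame; the argmax sequence is c c _ a t t.
    frames = [
        [0.1, 0.1, 0.7, 0.1],
        [0.1, 0.1, 0.7, 0.1],
        [0.7, 0.1, 0.1, 0.1],
        [0.1, 0.7, 0.1, 0.1],
        [0.1, 0.1, 0.1, 0.7],
        [0.1, 0.1, 0.1, 0.7],
    ]
    print(greedy_decode(frames, alphabet))
```

The blank symbol is what lets the network emit a doubled letter: two identical characters are kept apart only if a blank frame sits between them.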

One of our major goals from the beginning was to achieve a word error rate (WER) of under 10% in the transcriptions. Our word error rate on LibriSpeech’s test-clean set is now 6.5%, which not only meets that initial goal but also gets us close to human-level performance.
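For readers unfamiliar with the metric, word error rate is commonly computed as the word-level edit distance between the reference transcript and the recognizer output, divided by the number of reference words. The sketch below follows that standard definition; the function name `word_error_rate` is ours, and the project's own evaluation scripts may normalize text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Under this definition, a WER of 6.5% means an average of 6.5 word-level errors per 100 reference words.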

In this talk we will cover our journey: how we started the project, which models we evaluated and tuned, the design choices we made, and how we achieved near-human accuracy. We will cover technical details along with a code demonstration, and show how you can use the engine with the models we trained, or create and train your own model.

This is a breakdown of the talk in lightning format (if I had 10 minutes to do it):

Timeline (10 minutes)

First 2 minutes

  • What is Text To Speech
  • Where it is used on the web today
  • Approaches used by Google, Microsoft and most online TTS engines

Next 1 minute

  • Problems in TTS
  • Cite Sami Lemmetty's thesis on the classical problems of creating a TTS system

Next 0.5 minute

  • What is DeepSpeech
  • What an open-source engine enables us to do

Core Architecture (4.5 Minutes)

  • Explain the architecture overview (softmax layer, feedforward layer, bidirectional RNN layer and input features) - 1 slide for each
  • Explain the CTC algorithm we used: path and label probabilities
  • Define the language model and loss function
  • Compare performance results (with a live demo running while I talk, so as not to consume more time)
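As a preview of the "path and label probabilities" point, the sketch below shows the standard CTC idea on a toy example: a path is one symbol per frame, its probability is the product of the per-frame probabilities, and the probability of a label is the sum over every path that collapses to it. The brute-force version enumerates all paths; the forward version is the usual dynamic-programming shortcut. Function names and the choice of index 0 for the blank are assumptions made for illustration, not the project's code.

```python
from itertools import product

BLANK = 0  # assumed index of the CTC blank symbol


def collapse(path):
    """Merge repeated symbols, then drop blanks."""
    out, previous = [], None
    for symbol in path:
        if symbol != previous and symbol != BLANK:
            out.append(symbol)
        previous = symbol
    return tuple(out)


def label_prob_brute_force(probs, label):
    """Sum the probabilities of all paths that collapse to `label`."""
    num_symbols = len(probs[0])
    total = 0.0
    for path in product(range(num_symbols), repeat=len(probs)):
        if collapse(path) == tuple(label):
            path_prob = 1.0
            for t, symbol in enumerate(path):
                path_prob *= probs[t][symbol]
            total += path_prob
    return total


def label_prob_forward(probs, label):
    """Same quantity via the CTC forward recursion."""
    # Interleave blanks: label (a, b) becomes (blank, a, blank, b, blank).
    ext = [BLANK]
    for symbol in label:
        ext += [symbol, BLANK]
    size = len(ext)

    # A path may start with a blank or with the first label symbol.
    alpha = [0.0] * size
    alpha[0] = probs[0][ext[0]]
    if size > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, len(probs)):
        new = [0.0] * size
        for s in range(size):
            acc = alpha[s]                      # stay on the same position
            if s >= 1:
                acc += alpha[s - 1]             # advance by one position
            # Skipping a blank is allowed only between two different symbols.
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                acc += alpha[s - 2]
            new[s] = acc * probs[t][ext[s]]
        alpha = new

    # A valid path ends on the last symbol or on the trailing blank.
    return alpha[-1] + (alpha[-2] if size > 1 else 0.0)


if __name__ == "__main__":
    # Two frames, two symbols (blank and one character with index 1).
    probs = [[0.4, 0.6], [0.3, 0.7]]
    print(label_prob_brute_force(probs, [1]), label_prob_forward(probs, [1]))
```

The CTC training loss for one utterance is the negative logarithm of this label probability, which is why the forward recursion matters: the number of paths grows exponentially with the number of frames, while the recursion is linear in it.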

Talk about CommonVoice: How we collected the voice corpora (1 minute)

  • Show three slides while I explain the three different stages of building the corpora

Future work (1 minute)

  • How hyper-parameter tuning and network quantization can help
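To illustrate what network quantization means in practice, here is a minimal sketch of 8-bit affine quantization of a list of weights: each float is mapped to an integer in 0..255 and can be approximately recovered from the stored scale and offset. This is a generic textbook scheme under our own assumptions, not the quantization method the project plans to use; `quantize` and `dequantize` are hypothetical helper names.

```python
def quantize(weights, num_bits=8):
    """Map floats to integers in [0, 2**num_bits - 1] with a linear scale."""
    q_max = 2 ** num_bits - 1
    low, high = min(weights), max(weights)
    # Guard against a constant weight list, where the range would be zero.
    scale = (high - low) / q_max if high > low else 1.0
    quantized = [round((w - low) / scale) for w in weights]
    return quantized, scale, low


def dequantize(quantized, scale, low):
    """Recover approximate float weights from the integer codes."""
    return [q * scale + low for q in quantized]


if __name__ == "__main__":
    weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
    codes, scale, low = quantize(weights)
    restored = dequantize(codes, scale, low)
    # The reconstruction error per weight is at most about half a step.
    print(codes, max(abs(w - r) for w, r in zip(weights, restored)))
```

Storing one byte per weight instead of a 32-bit float reduces the weight storage to roughly a quarter, at the cost of a small rounding error bounded by half the quantization step.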

This gives an overview of the whole talk and what to expect from it. In the 45-minute version, all of these points will be discussed in more detail, together with a code demo.


Rabimba Karanjai