## Speech Model Tests

Sample test data has been provided for speech-related models in TensorFlow Lite
to help users working with speech models verify and test their models.

For the hotword, speaker-id, and automatic speech recognition sample models, the
architecture assumes that the models receive their input from a speech
pre-processing module. The speech pre-processing module receives the audio
signal, applies typical signal processing algorithms such as FFT and spectral
subtraction, and ultimately produces a log-mel filterbank (the log of the
triangular mel filters applied to the power spectra) as features for the
encoder neural network. The text-to-speech model assumes that the inputs are
linguistic features describing characteristics of phonemes, syllables, words,
phrases, and sentences. The outputs are acoustic features including
mel-cepstral coefficients, log fundamental frequency, and band aperiodicity.
The pre-processing modules for these models are not provided in the open source
version of TensorFlow Lite.
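
Although the production front end is not open sourced, the sketch below
illustrates the general shape of a log-mel filterbank computation in plain
NumPy. The frame size, FFT size, sample rate, and filter count are illustrative
assumptions only and do not necessarily match the features these models were
trained on.

```python
# Rough sketch of a log-mel filterbank front end (NOT the production
# pre-processing module). All parameters below are illustrative assumptions.
import numpy as np


def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def log_mel_features(frame, sample_rate=16000, num_filters=40, fft_size=512):
    """Computes log-mel filterbank energies for one audio frame."""
    # Power spectrum of the windowed frame.
    windowed = frame * np.hanning(len(frame))
    power = np.abs(np.fft.rfft(windowed, n=fft_size)) ** 2

    # Triangular mel filters spaced evenly on the mel scale.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                             num_filters + 2)
    bins = np.floor(
        (fft_size + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((num_filters, fft_size // 2 + 1))
    for i in range(num_filters):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):
            fbank[i, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[i, k] = (right - k) / max(right - center, 1)

    # Log of the filter energies, floored to avoid log(0).
    return np.log(np.maximum(fbank.dot(power), 1e-10))
```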

The following sections describe the architecture of the sample models at a high
level:

### Hotword Model

The hotword model is the neural network model we use for keyphrase/hotword
spotting (i.e. "okgoogle" detection). It is the entry point for voice
interaction (e.g. the Google search app on Android devices, Google Home, etc.).
The speech hotword model block diagram is shown in the figure below. It has an
input size of 40 (float), an output size of 7 (float), one Svdf layer, and four
fully connected layers, with the corresponding parameters shown in the figure
below.

![hotword_model](hotword.svg "Hotword model")
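
As a quick sanity check, the downloadable hotword model can be exercised with
the current `tf.lite.Interpreter` Python API. The sketch below is a minimal
example, not one of the official test benches; the input and output shapes are
taken from the model file at runtime, and the file name matches the download
link later on this page.

```python
# Minimal sketch: run one frame of front-end features through the hotword
# model. Input/output shapes are assumptions based on the model description
# above (input size 40, output size 7).
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(
    model_path="speech_hotword_model_rank1_2017_11_14.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# One frame of log-mel features from the speech front end (zeros here).
features = np.zeros(input_details[0]["shape"], dtype=np.float32)
interpreter.set_tensor(input_details[0]["index"], features)
interpreter.invoke()

# Seven output scores, one per hotword class.
scores = interpreter.get_tensor(output_details[0]["index"])
print(scores)
```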

### Speaker-id Model

The speaker-id model is the neural network model we use for speaker
verification. It runs after the hotword triggers. The speech speaker-id model
block diagram is shown in the figure below. It has an input size of 80 (float),
an output size of 64 (float), three LSTM layers, and one fully connected layer,
with the corresponding parameters shown in the figure below.

![speakerid_model](speakerid.svg "Speaker-id model")

### Text-to-speech (TTS) Model

The text-to-speech model is the neural network model used to generate speech
from text. The speech text-to-speech model block diagram is shown in the figure
below. It has an input size of 334 (float), an output size of 196 (float), two
fully connected layers, three LSTM layers, and one recurrent layer, with the
corresponding parameters as shown in the figure.

![tts_model](tts.svg "TTS model")

### Automatic Speech Recognizer (ASR) Acoustic Model (AM)

The acoustic model for automatic speech recognition is the neural network model
for matching phonemes to the input audio features. It generates posterior
probabilities of phonemes from speech frontend features (log-mel filterbanks).
It has an input size of 320 (float), an output size of 42 (float), five LSTM
layers, and one fully connected layer with a Softmax activation function, with
the corresponding parameters as shown in the figure.

![asr_am_model](asr_am.svg "ASR AM model")

### Automatic Speech Recognizer (ASR) Language Model (LM)

The language model for automatic speech recognition is the neural network model
for predicting the probability of a word given the previous words in a
sentence. It generates posterior probabilities of the next word from a sequence
of words. The words are encoded as indices in a fixed-size dictionary.
The model has two inputs, both of size one (integer): the current word index
and the next word index. It has an output of size one (float): the log
probability. It consists of three embedding layers and three LSTM layers,
followed by a multiplication, a fully connected layer, and an addition.
The corresponding parameters are shown in the figure.

![asr_lm_model](asr_lm.svg "ASR LM model")
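
Given that interface, one possible way to score a sentence is to feed
consecutive (current word, next word) index pairs and accumulate the returned
log probabilities. The sketch below is purely illustrative: the model file
name, the word indices, the integer dtype, and the ordering of the two input
tensors are all assumptions, not details confirmed by this page.

```python
# Hypothetical sketch: score a word sequence with the ASR LM by summing
# the log probability returned for each (current word, next word) pair.
# File name, word indices, dtype and input ordering are assumptions.
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="speech_asr_lm_model.tflite")
interpreter.allocate_tensors()
inputs = interpreter.get_input_details()
output = interpreter.get_output_details()[0]

word_ids = [13, 271, 828, 4]  # indices into the fixed-size dictionary
total_log_prob = 0.0
for current, following in zip(word_ids[:-1], word_ids[1:]):
    interpreter.set_tensor(inputs[0]["index"],
                           np.array([current], dtype=np.int32))
    interpreter.set_tensor(inputs[1]["index"],
                           np.array([following], dtype=np.int32))
    interpreter.invoke()
    total_log_prob += float(interpreter.get_tensor(output["index"])[0])

print("log P(sentence) =", total_log_prob)
```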

### Endpointer Model

The endpointer model is the neural network model for predicting the end of
speech in an utterance. More precisely, it generates posterior probabilities of
various events that allow detection of the speech start and end events.
It has an input size of 40 (float), which are speech frontend features
(log-mel filterbanks), and an output size of four, corresponding to:
speech, intermediate non-speech, initial non-speech, and final non-speech.
The model consists of a convolutional layer, followed by a fully-connected
layer, two LSTM layers, and two additional fully-connected layers.
The corresponding parameters are shown in the figure.

![endpointer_model](endpointer.svg "Endpointer model")
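
The four posteriors can be turned into start/end-of-speech decisions by simple
smoothing and thresholding. The sketch below is one illustrative heuristic and
is not the logic used in production; the class order follows the list above,
and the threshold and frame count are arbitrary assumptions.

```python
# Illustrative only: declare end of speech once the "final non-speech"
# posterior stays above a threshold for enough consecutive frames.
# Threshold and frame count are arbitrary assumptions.
import numpy as np

SPEECH, INTERMEDIATE_NS, INITIAL_NS, FINAL_NS = range(4)


def end_of_speech(posteriors, threshold=0.8, min_frames=20):
    """posteriors: [num_frames, 4] array of per-frame class probabilities."""
    consecutive = 0
    for frame in posteriors:
        if frame[FINAL_NS] > threshold:
            consecutive += 1
            if consecutive >= min_frames:
                return True
        else:
            consecutive = 0
    return False
```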

## Speech models test input/output generation

As mentioned above, the inputs to the models are generated from a
pre-processing module (the output of a log-mel filterbank, or linguistic
features), and the expected outputs are generated by running the equivalent
TensorFlow model with the same inputs.
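
The C++ test benches linked below follow that pattern: they feed the recorded
features to the TFLite model and compare its output against the expected
output from the TensorFlow model. A rough Python equivalent is sketched here;
the file names, input layout, and tolerance are illustrative assumptions.

```python
# Rough sketch of the test pattern: run recorded features through the
# TFLite model and compare against expected outputs from the TensorFlow
# model. File names, shapes, and tolerance are illustrative assumptions.
import numpy as np
import tensorflow as tf

features = np.load("speech_test_input.npy")      # [num_frames, input_size]
expected = np.load("speech_test_expected.npy")   # [num_frames, output_size]

interpreter = tf.lite.Interpreter(model_path="speech_model.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

for frame, want in zip(features, expected):
    interpreter.set_tensor(inp["index"],
                           frame[np.newaxis, :].astype(np.float32))
    interpreter.invoke()
    got = interpreter.get_tensor(out["index"])[0]
    np.testing.assert_allclose(got, want, atol=1e-4)
```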

## Link to the open source code

### Models:

[Speech hotword model (Svdf
rank=1)](https://storage.googleapis.com/download.tensorflow.org/models/tflite/speech_hotword_model_rank1_2017_11_14.tflite)

[Speech hotword model (Svdf
rank=2)](https://storage.googleapis.com/download.tensorflow.org/models/tflite/speech_hotword_model_rank2_2017_11_14.tflite)

[Speaker-id
model](https://storage.googleapis.com/download.tensorflow.org/models/tflite/speech_speakerid_model_2017_11_14.tflite)

[TTS
model](https://storage.googleapis.com/download.tensorflow.org/models/tflite/speech_tts_model_2017_11_14.tflite)

[ASR AM
model](https://storage.googleapis.com/download.tensorflow.org/models/tflite/speech_terse_am_model_2017_11_14.tflite)

### Test benches

[Speech hotword model
test](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/lite/models/speech_hotword_model_test.cc)

[Speaker-id model
test](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/lite/models/speech_speakerid_model_test.cc)

[TTS model
test](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/lite/models/speech_tts_model_test.cc)

[ASR AM model
test](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/lite/models/speech_asr_am_model_test.cc)

[ASR LM model
test](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/lite/models/speech_asr_lm_model_test.cc)

[Endpointer model
test](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/lite/models/speech_endpointer_model_test.cc)

## Android Support

The models have been tested on Android phones, using the following tests:

[Hotword](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/lite/android/BUILD?rcl=172930882&l=25)

[Speaker-id](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/lite/android/BUILD?rcl=172930882&l=36)