## Speech Model Tests

Sample test data has been provided for speech related models in TensorFlow Lite
to help users working with speech models verify and test their models.

For the hotword, speaker-id and automatic speech recognition sample models, the
architecture assumes that the models receive their input from a speech
pre-processing module. The speech pre-processing module receives the audio
signal, applies typical signal processing algorithms such as FFT and spectral
subtraction, and ultimately produces a log-mel filterbank (the log of the
triangular mel filters applied to the power spectra) as features for the
encoder neural network. The text-to-speech model assumes that the inputs are
linguistic features describing characteristics of phonemes, syllables, words,
phrases, and sentences. The outputs are acoustic features including
mel-cepstral coefficients, log fundamental frequency, and band aperiodicity.
The pre-processing modules for these models are not provided in the open source
version of TensorFlow Lite.

The following sections describe the architecture of the sample models at a high
level:

### Hotword Model

The hotword model is the neural network model we use for keyphrase/hotword
spotting (i.e. "okgoogle" detection). It is the entry point for voice
interaction (e.g. the Google search app on Android devices, Google Home, etc.).
The speech hotword model block diagram is shown in the figure below. It has an
input size of 40 (float), an output size of 7 (float), one SVDF layer, and four
fully connected layers with the corresponding parameters as shown in the figure
below.

![hotword_model](hotword.svg "Hotword model")

### Speaker-id Model

The speaker-id model is the neural network model we use for speaker
verification. It runs after the hotword triggers. The speech speaker-id model
block diagram is shown in the figure below.
It has an input size of 80 (float), an output size of 64 (float), three LSTM
layers, and one fully connected layer with the corresponding parameters as
shown in the figure below.

![speakerid_model](speakerid.svg "Speaker-id model")

### Text-to-speech (TTS) Model

The text-to-speech model is the neural network model used to generate speech
from text. The speech text-to-speech model block diagram is shown in the figure
below. It has an input size of 334 (float), an output size of 196 (float), two
fully connected layers, three LSTM layers, and one recurrent layer with the
corresponding parameters as shown in the figure.

![tts_model](tts.svg "TTS model")

### Automatic Speech Recognizer (ASR) Acoustic Model (AM)

The acoustic model for automatic speech recognition is the neural network model
for matching phonemes to the input audio features. It generates posterior
probabilities of phonemes from speech frontend features (log-mel filterbanks).
It has an input size of 320 (float), an output size of 42 (float), five LSTM
layers and one fully connected layer with a Softmax activation function, with
the corresponding parameters as shown in the figure.

![asr_am_model](asr_am.svg "ASR AM model")

### Automatic Speech Recognizer (ASR) Language Model (LM)

The language model for automatic speech recognition is the neural network model
for predicting the probability of a word given the previous words in a
sentence. It generates posterior probabilities of the next word from a sequence
of words. The words are encoded as indices in a fixed-size dictionary.
The model has two inputs, both of size one (integer): the current word index
and the next word index. It has an output of size one (float): the log
probability. It consists of three embedding layers and three LSTM layers,
followed by a multiplication, a fully connected layer, and an addition.
The corresponding parameters are as shown in the figure.
![asr_lm_model](asr_lm.svg "ASR LM model")

### Endpointer Model

The endpointer model is the neural network model for predicting the end of
speech in an utterance. More precisely, it generates posterior probabilities of
various events that allow detection of speech start and end events.
It has an input size of 40 (float), which are speech frontend features
(log-mel filterbanks), and an output size of four corresponding to:
speech, intermediate non-speech, initial non-speech, and final non-speech.
The model consists of a convolutional layer, followed by a fully connected
layer, two LSTM layers, and two additional fully connected layers.
The corresponding parameters are as shown in the figure.

![endpointer_model](endpointer.svg "Endpointer model")

## Speech models test input/output generation

As mentioned above, the inputs to the models are generated from a
pre-processing module (the output of a log-mel filterbank, or linguistic
features), and the expected outputs are generated by running the equivalent
TensorFlow model on the same inputs.
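The log-mel front end described above can be sketched in a few lines of NumPy.
This is purely illustrative and is not the (unreleased) pre-processing module:
the sample rate, window size, hop, and FFT length are assumptions, chosen only
so that the 40-channel mel filterbank matches the 40-dimensional input of the
hotword and endpointer models.

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel scale.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_filterbank(audio, sample_rate=16000, frame_len=400,
                       hop=160, n_fft=512, n_mels=40):
    """Frame the signal, take power spectra, apply triangular mel
    filters, and return log-mel features of shape (frames, n_mels)."""
    # Frame the audio with a Hann window.
    n_frames = 1 + (len(audio) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([audio[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    # Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # Triangular mel filterbank: band edges spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    # Log of the filterbank energies, floored to avoid log(0).
    return np.log(np.maximum(power @ fbank.T, 1e-10))

# One second of a 440 Hz tone yields 98 frames of 40-dim features.
audio = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
feats = log_mel_filterbank(audio)
print(feats.shape)  # (98, 40)
```

Each row of `feats` would correspond to one 40-float input vector of the kind
the hotword and endpointer models consume.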
## Link to the open source code

### Models:

[Speech hotword model (SVDF
rank=1)](https://storage.googleapis.com/download.tensorflow.org/models/tflite/speech_hotword_model_rank1_2017_11_14.tflite)

[Speech hotword model (SVDF
rank=2)](https://storage.googleapis.com/download.tensorflow.org/models/tflite/speech_hotword_model_rank2_2017_11_14.tflite)

[Speaker-id
model](https://storage.googleapis.com/download.tensorflow.org/models/tflite/speech_speakerid_model_2017_11_14.tflite)

[TTS
model](https://storage.googleapis.com/download.tensorflow.org/models/tflite/speech_tts_model_2017_11_14.tflite)

[ASR AM
model](https://storage.googleapis.com/download.tensorflow.org/models/tflite/speech_terse_am_model_2017_11_14.tflite)

### Test benches

[Speech hotword model
test](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/lite/models/speech_hotword_model_test.cc)

[Speaker-id model
test](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/lite/models/speech_speakerid_model_test.cc)

[TTS model
test](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/lite/models/speech_tts_model_test.cc)

[ASR AM model
test](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/lite/models/speech_asr_am_model_test.cc)

[ASR LM model
test](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/lite/models/speech_asr_lm_model_test.cc)

[Endpointer model
test](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/lite/models/speech_endpointer_model_test.cc)

## Android Support

The models have been tested on Android phones, using the following tests:

[Hotword](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/lite/android/BUILD?rcl=172930882&l=25)

[Speaker-id](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/lite/android/BUILD?rcl=172930882&l=36)
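The test benches linked above share a common pattern: feed the recorded
frontend features to the model frame by frame and compare each output frame
against the golden output of the equivalent TensorFlow model, within a small
tolerance. A minimal NumPy sketch of that comparison logic follows; the
tolerance value and the synthetic data are assumptions for illustration, not
the values used by the actual C++ tests.

```python
import numpy as np

def outputs_match(actual, expected, abs_tol=1e-5):
    """Element-wise, frame-by-frame comparison of model outputs against
    golden values (the absolute tolerance here is an assumption)."""
    actual = np.asarray(actual, dtype=np.float32)
    expected = np.asarray(expected, dtype=np.float32)
    if actual.shape != expected.shape:
        return False
    # Check every frame; report the first mismatch to ease debugging.
    frame_ok = np.all(np.abs(actual - expected) <= abs_tol, axis=-1)
    if not frame_ok.all():
        first_bad = int(np.argmax(~frame_ok))
        print(f"first mismatch at frame {first_bad}")
        return False
    return True

# Synthetic stand-ins for model output and golden data: 98 frames of
# 7 posteriors, matching the hotword model's output size of 7 (float).
golden = np.random.RandomState(0).rand(98, 7).astype(np.float32)
assert outputs_match(golden + 1e-6, golden)      # within tolerance
assert not outputs_match(golden + 1e-3, golden)  # outside tolerance
```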