Coqui STT in Unity

Speech recognition (Speech To Text, STT) can be added to Unity using Coqui STT. This tutorial shows how to do that. Many thanks to @kbabilinski for doing most of the hard work porting DeepSpeech to Unity.

Create a new Unity project.

In Unity go to Edit->Project Settings->Player->Other Settings
Set Configuration->API Compatibility Level to .NET 4.x
Tick the box “Allow unsafe code”.

Create a folder structure as shown below:

Move the SampleScene in the root Scenes folder to the CoquiSTT/Scenes folder, then delete the Scenes folder in the root.

Get the Coqui STT source. Go to the git repository below and click on Code->Download Zip

Unzip the source and go to this directory:

Delete the file STTClient.csproj

Select all contents of this folder and drag into the STTClient folder inside Unity.

Currently there is a bug in the source which causes a number of these compile errors:
‘Stream’ is an ambiguous reference between ‘STTClient.Models.Stream’ and ‘System.IO.Stream’

To fix these compile errors, double click on each compile error to open the source file at the error location.
Change each Stream reference to STTClient.Models.Stream

For example, change this:

unsafe void FreeStream(Stream stream);

To this:

unsafe void FreeStream(STTClient.Models.Stream stream);

Save the file and let Unity compile it. Click on another compile error and repeat the process until all compile errors are resolved.

Next is to download the models. Go to git releases:

Download these files:


Because Unity does not show the extension in the Assets folder explorer, it is a bit hard to tell which file is which, so rename these files like this:

coqui-stt-0.9.3-models.tflite rename to coqui-stt-tflite.tflite
coqui-stt-0.9.3-models.scorer rename to coqui-stt-scorer.scorer

Note that pbmm files are not supported anymore in Coqui STT.

Place the tflite and scorer file in the Unity folder CoquiSTT/models/
If you have custom language and acoustic models you can place them here too.

Unzip the native_client.tflite.Windows.tar.xz file. Then look for the file and rename it to libstt.dll

Place the libstt.dll file in the Unity folder CoquiSTT/Plugins/win64/

Repeat this process for any other (available) platforms you want to support. Download the appropriate native_client package for each platform, then change the .so extension to the one required by that platform. For example, the extension for iOS must be .bundle, while for Android it must remain .so. Place each library file in the appropriate platform folder inside Unity.

In Unity, select the file libstt, then go to the Inspector and untick the box Windows x86.
Click Apply. Do the same for all appropriate platforms.

Download the Unity scripts here:

These files originally come from a DeepSpeech to Unity port made by @kbabilinski. They have been slightly modified to make them work with Coqui STT and contain a few other small improvements.

Place the files ContinuousVoiceRecorder.cs and SpeechTextToText.cs in the Unity folder CoquiSTT/Scripts

The ContinuousVoiceRecorder script feeds the audio into Coqui STT in real time and processes the intermediate result. The SpeechTextToText script detects the user's voice and processes the audio after the user stops talking.
Both examples can auto detect if the user is speaking using a volume threshold.
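
The volume threshold idea can be sketched as follows. This is a Python illustration and not the C# code from the two scripts; the function names are my own, and the threshold value of 500 is an arbitrary assumption which needs tuning for your microphone:

```python
import math


def rms(samples):
    """Root mean square of a block of 16 bit PCM samples (ints from -32768 to 32767)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def is_speaking(samples, threshold=500.0):
    """Return True if the block is louder than the (hypothetical) threshold."""
    return rms(samples) > threshold
```

In practice the check runs on every short block of microphone samples, and recording stops after a number of consecutive quiet blocks.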

Create an empty game object in the Hierarchy and call it VoiceRecorder.

Drag and drop the ContinuousVoiceRecorder and SpeechTextToText script onto the VoiceRecorder game object.

Select the VoiceRecorder game object, then go to the Inspector and disable one of the two scripts. Only one should run at a time.

Go to Hierarchy->VoiceRecorder->Inspector->Speech Text To Text (script)->Tflite File Name and enter the file name: coqui-stt-tflite.tflite

Go to Hierarchy->VoiceRecorder->Inspector->Speech Text To Text (script)->Scorer File Name and enter the file name: coqui-stt-scorer.scorer

Do the same with the “Continuous Voice Recorder (Script)” in the Inspector.

Check that your microphone works.

Press Play and say something. The transcription should appear in the Console as a Debug.Log message.

Here is a unitypackage minus the tflite and scorer files (to save space). You will still need to modify the project settings after importing:


Speech recognition with Coqui STT on Windows

Speech recognition and Flight Simulators

Recently, a few airplane addons for consumer PC based flight simulators have reached a level of detail which makes them suitable for home based practice for certain scenarios. There is one big shortcoming of this use case though: multi crew operations. In the real world, most commercial aircraft are flown by two pilots. The interaction between the two pilots is vast and strictly determined using Standard Operating Procedures (SOPs). For example, the pilot who is flying (PF) will instruct the other pilot (Pilot Monitoring, PM) “Gear Down”. PM will then put the gear lever down. Another example of heavy interaction between pilots is how checklists are read. PF will ask for “Landing Checklist”. PM will then respond with “Cabin Crew”. PF will then say “Advised”, followed by PM saying “Auto Thrust”, and PF responding “Speed”, etc.

Situations like this require speech recognition (Speech To Text, STT). Speech recognition has existed for a fairly long time but recognition quality for a relatively small amount of custom short phrases full of jargon has historically been very poor.

For a few years now, commercial AI-based STT solutions which accept custom phrases have been available (such as Microsoft Azure), but they are not suitable for third-party app deployment. These services are cloud based (slow, and they need an internet connection) and charge by data amount or time, making it difficult to create a suitable end-user pricing system. In general, these cloud-based systems are also quite expensive.

Luckily there are now open source solutions available which are free, fast, and can run locally. Their accuracy and customization abilities rival paid counterparts. One such software package is DeepSpeech from Mozilla, but this project was recently scaled down. Fortunately, the core developers of DeepSpeech forked the project and created Coqui STT (and Coqui TTS), which is now in active development.

Out of the box, the supplied model for Coqui STT is not suitable for a flight simulator environment because of the jargon used and phrases being out of context. For example, “gear up” might be recognized as “get up” and “flap one” might be recognized as “let one” (which is more common in everyday use). There is no need to train a new acoustic model (which needs a large database of sound files with transcriptions and a huge amount of processing time). Instead, a small custom language model (scorer) will be created to improve accuracy without requiring much processing time. A language model is basically a text file with custom sentences such as “flap one”, “gear down”, etc. To further improve accuracy, data from custom sound files can be incorporated into the model, if needed.

The other required part is converting Text To Speech (TTS). This is also very useful to simulate interaction with a synthetic ATC (Air Traffic Control) system. TTS has been possible for a long time. Most solutions sound quite robotic but Coqui has a TTS solution which sounds surprisingly natural. This will be discussed in another blog post.

The documentation for Coqui STT is somewhat outdated at this moment, so I created a tutorial on how to get this up and running.

Coqui is mainly Linux based but both training and deployment can run on Windows using WSL (Windows Subsystem for Linux). WSL is a virtual Linux environment inside of Windows. It also allows you to use Windows-based code editors like VS Code or Visual Studio. Additionally, WSL requires fewer resources than a virtual machine and is free. WSL only works on x64 processors (or ARM).

Note that some users have reported issues with specific firewall applications blocking internet access in WSL.

Coqui STT tutorial

If you have an nVidia Pascal GPU or later, you must install a special driver before installing WSL2:
Do not install any nvidia driver from a Linux terminal.

To install WSL2 (with Ubuntu 20.04 LTS), follow this guide:
Or this guide:

Don’t forget to reboot after you install WSL2.

-If you start Ubuntu (via Start menu) and you get the error message “wsl 2 requires an update to its kernel component” then run Windows Update and try again.
-PowerShell must run with administrator permissions.
-There is no need to install Windows Terminal.
-If you get an error relating to virtual disk, compression, and encryption, you have to disable disk compression for a certain folder. To fix this error, right click on this folder in Explorer: C:\Users\User\AppData\Local\Packages\ConicalGroupLimited…
Then un-tick the box at: Properties->General->Advanced->Compress contents to save disk space
Also un-tick Encrypt contents. Restart the PC and try starting Ubuntu again (via the start menu)
-To open a new Linux terminal, go to Windows search and type “ubuntu” to open the Ubuntu app.
-You can paste text into the Linux terminal by right clicking in the terminal window. If that doesn’t work, click on the Ubuntu Icon on the terminal window then tick the box in Properties->Options->Use Ctrl+Shift+C…
-In Windows, the Linux folder is located here: < \\wsl$ >
To open the Linux folder in Windows Explorer, type the above directory in the Path edit box. The Linux folder is only visible if a Linux terminal is running.
-When files are added or removed in the Linux environment, the change does not always show up in Windows Explorer. To refresh, navigate out of the directory, then navigate back.
-Currently, Coqui STT only works with Python 3.6 and TensorFlow 1.15.4

If the command "wsl --install" doesn't work, first install all windows updates. Then try "wsl --install -d Ubuntu".

All commands are typed into the Linux terminal.

To make formatting clear, each command is separated by a new line. Long commands appear without a new line.

Install python:

sudo apt-get update

sudo apt-get upgrade

sudo apt-get -y purge python3.8

sudo apt-get -y autoremove

sudo apt install software-properties-common

sudo add-apt-repository ppa:deadsnakes/ppa

sudo apt-get update

sudo apt-get install python3.6-venv

python3.6 -m venv venv-stt

Next, activate the virtual environment. All STT commands have to be executed in the virtual environment. After installing, you only have to execute this line to start the virtual environment in a new terminal:

source $HOME/venv-stt/bin/activate

-Once the virtual environment is active, the terminal will change to something like this:
(venv-stt) username@USER-PC

Install pip and STT:

python3.6 -m pip install -U pip

python3.6 -m pip install stt

Or for GPU support you can run this:

python3.6 -m pip install stt-gpu

Download the tflite acoustic models:

curl -LO
curl -LO

-If the links above don’t work then you can find the files here:
-The pbmm model is not supported anymore, only the tflite model is supported.
-The scorer file is a generic language model and will not work well in a flight simulator environment. A custom language model has to be generated. More on that later.
-There is also a quantized tflite file available. This model is both smaller and faster than the unquantized model but it is also less accurate.

Custom language model

To create a custom language model, follow the steps below.

Clone the Coqui STT Git Repo:

git clone

Install dependencies:

pip install progressbar

pip install progressbar2

sudo apt-get update

sudo apt-get install build-essential libboost-all-dev cmake zlib1g-dev libbz2-dev liblzma-dev

sudo apt-get install libboost-all-dev libeigen3-dev

Get KenLM repository and build:

git clone

cd kenlm

mkdir build

cd build

sudo apt-get -y install cmake

sudo apt-get install build-essential

cmake ..

make -j 4

sudo make install

cd ~

Create a language model text file:

Use a text file with one sentence per line. Use UTF-8 encoding. The text should not contain any markup language. Remove any punctuation, but you can keep the apostrophe as a character. Numbers should be written in full (i.e. as a cardinal): eight rather than 8.

A sample file with WAV, CSV, and a language model file can be found here:

Place the language model in a file called language_model.txt and place it in the home/username directory (replace username with your username). In Windows Explorer, type < \\wsl$ > in the path bar to see the Linux drive. The full directory where you should place the file is this:

Example language model file format:

flap one
flap two
flap three
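
The clean-up rules above can be applied with a small script. This is a minimal sketch, assuming an English model with only the letters a to z, the space, and the apostrophe; the function names are my own. It does not spell out numbers for you but stops on any line which still contains a digit:

```python
import re


def normalize_sentence(line):
    """Lowercase a sentence, drop punctuation except apostrophes, collapse spaces."""
    line = line.lower().replace("’", "'")
    line = re.sub(r"[^a-z' ]+", " ", line)
    return " ".join(line.split())


def build_language_model(lines):
    """Normalize all lines, skip empty ones, and reject lines containing digits."""
    result = []
    for line in lines:
        if re.search(r"\d", line):
            raise ValueError("Write numbers in full: " + line.strip())
        cleaned = normalize_sentence(line)
        if cleaned:
            result.append(cleaned)
    return result
```

Write the returned lines to language_model.txt, one sentence per line, using UTF-8 encoding.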

Create the required binary and vocab files from the language model:

python3.6 STT/data/lm/ --input_txt language_model.txt --output_dir . --top_k 500000 --kenlm_bins kenlm/build/bin/ --arpa_order 5 --max_arpa_memory "85%" --arpa_prune "0|0|1" --binary_a_bits 255 --binary_q_bits 8 --binary_type trie --discount_fallback

-The generate_lm command will save the new language model as two files on disk: lm.binary and vocab-500000.txt

Get the generate_scorer_package binary and extract:

curl -LO

tar -xvf native_client.tflite.Linux.tar.xz --directory STT/native_client/

Install dependencies:

pip install --upgrade pip setuptools

pip install optuna

python3.6 -m pip uninstall tensorflow

python3.6 -m pip install tensorflow==1.15.4

pip install coqui-stt-ctcdecoder

pip install coqui-stt-training

To fix “command not found”:

sudo chmod +x STT/native_client/generate_scorer_package

To fix “ not found”:

Copy this file:


Then move the file here:


Generate scorer file:

sudo STT/native_client/generate_scorer_package --alphabet STT/data/alphabet.txt --lm lm.binary --vocab vocab-500000.txt --package kenlm.scorer --default_alpha 0.5891777425167632 --default_beta 0.6619145283338659

-If you get this error: “Invalid label 0”, it probably means that the path to the alphabet file is incorrect.
-The message: “Doesn’t look like a character based (Bytes Are All You Need) model”, is not an error.
-The message: “--force_bytes_output_mode was not specified, using value infered from vocabulary contents: false”, is not an error.
-The output is kenlm.scorer

Make some WAV files with phrases which exist in the language model.

Audio file format:
Audio files should be WAV, 16 bit, 16 kHz, mono.
Place the audio and csv files in home/username/STT/data/ (you can use the example files).
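
To verify the format of a WAV file without opening a wave editor, you can use Python's standard wave module. This is a small sketch; the function name check_wav_format is my own:

```python
import wave


def check_wav_format(path):
    """Return a list of problems; an empty list means 16 bit, 16 kHz, mono."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1:
            problems.append("expected mono, found %d channels" % wav.getnchannels())
        if wav.getsampwidth() != 2:
            problems.append("expected 16 bit, found %d bit" % (wav.getsampwidth() * 8))
        if wav.getframerate() != 16000:
            problems.append("expected 16000 Hz, found %d Hz" % wav.getframerate())
    return problems
```

For example, check_wav_format("STT/data/flap_1.wav") should return an empty list for a correctly recorded file.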

Check the accuracy of the custom language model:

stt --model model.tflite --scorer kenlm.scorer --audio STT/data/flap_1.wav


stt --model model_quantized.tflite --scorer kenlm.scorer --audio STT/data/flap_1.wav

The output will be the last line in the console.

If the accuracy is not so good, a new scorer file (language model) must be generated with updated alpha and beta values. To do this, first create audio and csv files of the language model. I made a special tool for this which you can find here (source included, C#, WPF):

Download (WPF C# source included):

Name the text file sample.csv and place it in STT/data/
Do not name the sample file train.csv, dev.csv, or test.csv, otherwise the scripts won’t be able to automatically create the required database split.

CSV file format for manual editing:

The first line of the csv file needs to be exactly this: wav_filename,wav_filesize,transcript
The wav_filesize is the file size in bytes. You can get this in Windows by right clicking on the file->Properties->Size: (not size on disk). For wav_filename you can use either the path to the WAV file or just the WAV file if it is placed in the same directory as the csv file.

The transcript should be exactly the same as what is said in the audio file. It should not contain any markup language. Remove any punctuation, but you can keep the apostrophe as a character. Numbers should be written in full (i.e. as a cardinal): eight rather than 8.

Example csv file format:

flap_1.wav,50464,flap one
flap_2.wav,50712,flap two
flap_3.wav,54208,flap three
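
If you prefer a script over manual editing, the csv file can be generated with Python. This is a minimal sketch and not the tool mentioned above; the function name and the transcripts dictionary (WAV file name to transcript) are my own assumptions:

```python
import csv
import os


def write_sample_csv(wav_dir, transcripts, csv_name="sample.csv"):
    """Write wav_filename,wav_filesize,transcript rows for WAV files in wav_dir.

    transcripts maps each WAV file name to its transcript.
    """
    csv_path = os.path.join(wav_dir, csv_name)
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, lineterminator="\n")
        writer.writerow(["wav_filename", "wav_filesize", "transcript"])
        for name in sorted(transcripts):
            size = os.path.getsize(os.path.join(wav_dir, name))  # size in bytes
            writer.writerow([name, size, transcripts[name]])
    return csv_path
```

Because only the file name is written, the csv file must be in the same directory as the WAV files.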

Get a checkpoint file from the generic model and extract it:

curl -LO

tar -xvf coqui-stt-1.0.0-checkpoint.tar.gz

-Do not download and extract the checkpoints file on Windows and then transfer to the Linux folder because then lm_optimizer will fail.

Generate new alpha and beta values:

python3.6 STT/ --alphabet_config_path STT/data/alphabet.txt --scorer_path kenlm.scorer --auto_input_dataset STT/data/sample.csv --checkpoint_dir coqui-stt-1.0.0-checkpoint --n_trials 6 --n_hidden 2048 --lm_alpha_max 5 --lm_beta_max 5

-The script will try random alpha and beta values to see which gives the best result so you will get a different output every time.
-For n_trials use 2400 if you have time as that gives a more accurate output.
-The output is for example: Best params: lm_alpha=1.58401780601227 and lm_beta=1.796448020609769
-This will also output the following files (see the dataset distribution chapter for more details): train.csv, dev.csv, test.csv into STT/data/
-The WAV files corresponding to the csv files should be present also.
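
If you want to script this step, the two values can be pulled out of the output line shown above with a regular expression. This is a sketch which assumes the output line has exactly the “Best params” format from the example; the function name is my own:

```python
import re

NUMBER = r"([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)"


def parse_best_params(output):
    """Return (lm_alpha, lm_beta) from the optimizer output, or None if not found."""
    match = re.search(r"lm_alpha=" + NUMBER + r" and lm_beta=" + NUMBER, output)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))
```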

Copy the updated alpha and beta values, then run generate_scorer_package again using the updated values:

sudo STT/native_client/generate_scorer_package --alphabet STT/data/alphabet.txt --lm lm.binary --vocab vocab-500000.txt --package kenlm.scorer --default_alpha 1.58401780601227 --default_beta 1.796448020609769

Transcribe the audio files again to see if the model now performs better:

stt --model model.tflite --scorer kenlm.scorer --audio STT/data/flap_1.wav


You can further improve the recognition accuracy of a custom language model using a process called “fine-tuning”. This is done using a pre-trained generic acoustic model and custom WAV files with transcripts specific to the custom language model. This works especially well if you use WAV files from your own voice.

The following command performs fine-tuning using the default checkpoint dataset and the WAV files supplied by the CSV files:

python3.6 -m coqui_stt_training.train --checkpoint_dir coqui-stt-1.0.0-checkpoint --train_files STT/data/train.csv --dev_files STT/data/dev.csv --test_files STT/data/test.csv --n_hidden 2048 --load_cudnn true --epochs 3

-The output will be placed in --checkpoint_dir
-Remove the flag “--epochs 3” for actual training. Full training takes a long time, so keep the flag for a quick test to see if everything is working.

The three CSV files were created by the --auto_input_dataset flag when the alpha and beta values were generated. If you only have one dataset with 100% coverage (sample.csv), you can run the command below instead. Keep in mind though that --auto_input_dataset creates a random distribution, but the csv files used should be the same throughout the training process:

python3.6 -m coqui_stt_training.train --checkpoint_dir coqui-stt-1.0.0-checkpoint --auto_input_dataset STT/data/sample.csv --n_hidden 2048 --load_cudnn true --epochs 3

The training output is not suitable for deployment and has to be converted to a tflite model first:

python3 -m coqui_stt_training.export --checkpoint_dir coqui-stt-1.0.0-checkpoint --export_dir STT/data/


To finish up, generate new alpha and beta values using the train and dev csv files:

python3.6 STT/ --alphabet_config_path STT/data/alphabet.txt --scorer_path kenlm.scorer --test_files STT/data/train.csv STT/data/dev.csv --checkpoint_dir coqui-stt-1.0.0-checkpoint --n_trials 6 --n_hidden 2048 --lm_alpha_max 5 --lm_beta_max 5

-Use --n_trials 2400 if you have time as that gives more accurate values.
-The script will try random alpha and beta values to see which gives the best result so you will get a different output every time.

Now create a new scorer package using the alpha and beta values from the previous step:

sudo STT/native_client/generate_scorer_package --alphabet STT/data/alphabet.txt --lm lm.binary --vocab vocab-500000.txt --package kenlm.scorer --default_alpha 0.931289039105002 --default_beta 1.1834137581510284

The fine-tuned model is now finished, so we can test its performance and transcribe some audio using the new tflite file:

stt --model STT/data/output_graph.tflite --scorer kenlm.scorer --audio STT/data/flap_1.wav

Dataset distribution

For machine learning, three datasets are required. For an improved workflow, all WAV files and the three different csv files should be placed in the same folder. You can either use one csv file with 100% of the WAV files (sample.csv in the example in this blog) and automatically generate the three separate csv files from that (using the --auto_input_dataset flag), or you can distribute the data using your own script. If you distribute the data using your own script, the ratio of content should be more or less like this (WAV samples must be randomly taken from all the data):

70% of all WAV and corresponding csv files for training (train).
20% of all WAV and corresponding csv files for validating (dev).
10% of all WAV and corresponding csv files for testing (test).

Meaning of train, dev, and test:

Training (train): for training the model.
Validation (dev): to keep checking the performance of your model in order to know when to stop training.
Testing (test): used to check the performance once training is finished.

Do not create the dataset manually as that is too much work and is not reproducible.
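
If you want to distribute the data with your own script, a reproducible 70/20/10 split could look like the sketch below. This is not what --auto_input_dataset does internally; the function names and the seed value are my own assumptions. A fixed seed means the same sample.csv always gives the same three files:

```python
import csv
import os
import random

HEADER = ["wav_filename", "wav_filesize", "transcript"]


def split_rows(rows, seed=42):
    """Shuffle rows with a fixed seed and split them 70/20/10 (train, dev, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 70 // 100
    n_dev = len(shuffled) * 20 // 100
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_dev],
            shuffled[n_train + n_dev:])


def split_csv(sample_csv, out_dir):
    """Read sample_csv and write train.csv, dev.csv, and test.csv to out_dir."""
    with open(sample_csv, newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f))[1:]  # skip the header line
    for name, part in zip(("train", "dev", "test"), split_rows(rows)):
        path = os.path.join(out_dir, name + ".csv")
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f, lineterminator="\n")
            writer.writerow(HEADER)
            writer.writerows(part)
```

For example, split_csv("STT/data/sample.csv", "STT/data") writes the three files next to the WAV files.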

The data sets should not overlap too much. More information about overlap, overfitting, and the train, dev, and test data sets is available here:


GPU inference is not supported. The GPU can only be used for training.

-Use “--train_cudnn true” instead of “--load_cudnn true” if you have a CUDA GPU. The graphics card drivers need to be installed correctly and the GPU must be supported by Coqui STT.

-CUDA software requirements:

-When using WSL, you need a special nVidia driver:

-For Tensorflow 1.15.4 use CUDA 10.0 and cuDNN 7.4.2

-More information:


Further improvements to the recognition accuracy can be made by loading a different language model (scorer) depending on the phase of flight. For example, during taxiing different phrases are expected compared to the cruise phase. Also use a numbers-only language model when only numbers are expected to be heard.

Another way to improve command recognition is to force the language model to recognize words in the correct order. For example, let’s consider the language model below:

flap one
gear down

A misheard output could be “flap down” or “gear one”. Currently there is no way to force words to be recognized in the correct order as specified in the language model. However, there is a hack to accomplish the same thing. Words in sentences can be forced to be recognized in the correct order by removing the spaces. So the language model in the example above would be like this:

flapone
geardown

Removing the space between words will only work for a system required to recognize commands, but for our use case it will do.
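
With the spaces removed, the recognizer outputs the joined word, so the deployment code has to map it back to the original command. A minimal sketch of that lookup is below; the function names are my own and are not part of Coqui STT:

```python
def join_commands(phrases):
    """Map each space-free token to its original phrase."""
    return {phrase.replace(" ", ""): phrase for phrase in phrases}


def restore_command(transcript, lookup):
    """Return the original phrase for a joined transcript, or None if unknown."""
    return lookup.get(transcript.strip())
```

The keys of the dictionary returned by join_commands are the lines to put in the language model text file.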

With a custom language model, only words within that language model will be recognized so the solution below is only applicable if a generic language model is used.

As a last line of defense, include common transcription mistakes in the command detection logic. For example, if your code is looking for “gear down”, also execute the command for transcriptions like “git down”, “dear down”, “gear done”, etc., provided the altitude and speed of the aircraft are reasonable for lowering the gear. This is also how humans operate on instructions in a familiar environment, through context-aware assumption. For example, when you are PM and the airplane is on final approach and you hear PF say “gear don”, you know what he means and you will lower the gear. But if you are at cruising altitude and you hear “gear don”, or even the correct phrase “gear down”, then you might say “say again?”. This logic can be included in the deployment application and it should work quite well.
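
A minimal sketch of this logic is shown below. The altitude and speed limits are made-up numbers for illustration only and must be replaced with values which make sense for your aircraft; the names are my own:

```python
# Transcriptions which are accepted as "gear down" (taken from the examples above).
GEAR_DOWN_VARIANTS = {"gear down", "git down", "dear down", "gear done", "gear don"}

# Hypothetical limits, for illustration only.
MAX_GEAR_ALTITUDE_FT = 10000
MAX_GEAR_SPEED_KT = 250


def should_lower_gear(transcript, altitude_ft, speed_kt):
    """Accept the command only if it was (mis)heard and the aircraft state is plausible."""
    heard = transcript.strip().lower() in GEAR_DOWN_VARIANTS
    plausible = altitude_ft <= MAX_GEAR_ALTITUDE_FT and speed_kt <= MAX_GEAR_SPEED_KT
    return heard and plausible
```

Note that even the correct phrase is rejected at cruising altitude, which is the “say again?” case.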

Other important points (applicable both to recording for fine-tuning and to inference; inference/deployment is speech recognition applied to an audio sample):

-Check the microphone output waveform (looking for volume). If the microphone output is too loud, it can lead to clipping. When the microphone volume is too low, STT will not work well, even if the Signal to Noise Ratio is high.
-Some microphones use automatic gain. This will lead to a lot of background noise in the silent segment before speech, resulting in poor STT quality. Disable this feature if possible. Automatic gain can be easily seen in sound recording applications such as Audacity by inspecting the resulting waveform before, during, and after speech when a lot of static noise is present.
-There are many types of microphones. There are differences in direction sensitivity, general sensitivity, frequency response, and recording quality. Use the same type of microphone for both training and inference. Keep in mind though that a microphone which is too sensitive will not lead to good results in a noisy environment.
-Microphones can pick up physical vibrations just as easily as sound so make sure that the microphone is unaffected by environmental vibrations caused by fans, typing, etc.
-Have as little background noise as possible. A silent background is always better unless the speech corpus has been mostly trained on noisy backgrounds. In a flight simulator environment (or any simulator or game environment), a headset has to be used because the speaker output will interfere with the STT process.
-Do not breathe on the microphone.
-Use speech training data which matches the deployment environment (type of microphone, speaker demographics, etc).
-Speak clearly, not too fast, and articulate well.
-The audio buffer fed to the recognizer must not be clipped at the beginning or end of the recording. Clipping can happen due to inaccurate Voice Activity Detection (VAD) results or, when using Push To Talk (PTT), due to timing issues with the start and end of recording in the Audio API used. If audio buffering is applied incorrectly, it can happen that audio data from a previously spoken sentence is included in the next one. To check for this behavior, make fast successive utterances using PTT and inspect the individual waveforms.
-Some audio players like Windows Groove Music do not play short sound files correctly and make it appear the audio is cut off at the end. Keep this in mind when debugging language model files or inference code applications. It is best to play sound files in Audacity so that you can see the waveform at the same time.
-When using PTT, it is important not to press the talk button too late and not to release the talk button too early. For the button release, a small delay can be built in to solve the problem of users releasing the button a fraction too early, but for the start of the audio recording, no workaround exists.
-The end user program needs to use either threads or tasks when calling STT and audio buffering related functions in order for the UI to remain responsive. This adds code complexity and should be thoroughly tested as it can easily lead to crashes.
-When using continuous inference, some additional logic has to be added which resets the output text if it is to be used for commands, otherwise STT will just keep adding words to the output text.
-Continuous inference in a noisy environment will lead to a lot of false positives, especially if the background noise contains speech.
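
The microphone volume check from the first point can also be done with a script instead of a wave editor. This is a rough sketch for 16 bit WAV files; the function names and the two thresholds (0.1 and 0.99 of full scale) are my own assumptions and not values recommended by Coqui STT:

```python
import array
import sys
import wave


def peak_level(path):
    """Return the peak of a 16 bit WAV file as a fraction of full scale (0.0 to 1.0)."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("expected a 16 bit WAV file")
        frames = wav.readframes(wav.getnframes())
    samples = array.array("h", frames)  # signed 16 bit samples
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian
    if not samples:
        return 0.0
    return max(abs(s) for s in samples) / 32768.0


def describe_level(peak, quiet=0.1, clipped=0.99):
    """Classify a peak level using hypothetical thresholds."""
    if peak >= clipped:
        return "clipping"
    if peak < quiet:
        return "too quiet"
    return "ok"
```

A peak close to 1.0 means the recording is probably clipped, and a very low peak means the microphone volume is too low.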

When developing an end user application (for deployment/inference), it is very important to write exactly the same audio stream which is sent to the recognizer, to a WAV file as well. This way the audio can be inspected with a wave editor like Audacity. In doing so, buffer clipping issues (both in volume and start/end) can be easily detected.
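
The idea is shown below in Python as a minimal sketch (the end user application in this blog is written in C#, where the same approach applies). It assumes the buffer holds raw 16 bit mono PCM bytes at 16 kHz; the function name save_debug_wav is my own:

```python
import wave


def save_debug_wav(path, pcm_bytes, sample_rate=16000):
    """Write raw 16 bit mono PCM bytes to a WAV file for inspection."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm_bytes)
```

Call it with exactly the same bytes which are passed to the recognizer, one file per utterance, so that each recording can be checked separately.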

Below is an example of three separate recordings which show that the buffer is cut short at the end of the recording and instead is added to the next recording, caused by buggy audio buffering code on my part. In addition, the microphone volume is set too high:

Without an output like this, inference will simply fail and you have no idea why.

Below is an example of microphone automatic gain. When the recording is started the signal is low but then progressively increases to very loud as the driver automatically increases the gain causing the background noise (mostly the fan from the laptop) to be much more pronounced. After some louder audio input is detected (typing on keyboard), the gain is automatically adjusted and a new background noise base line can be seen. The automatic gain applied in the driver cannot be modified on this laptop (Dell XPS 13 9343) and is therefore not suitable for STT as it is hard for the first word to be recognized due to the excessive gain on the microphone.

Practical Usage

In order to use Coqui STT in a flight simulator and do something useful with it, we need to use real time streaming audio for continuous recognition using a custom program on Windows. How to do this with Visual Studio and C# will be discussed in another blog post.

Deployment on Windows

Transcribing voice to text using Coqui STT can be done with a standalone Windows executable.

C# WPF sample

This is a tutorial on how to get the C# WPF source sample to compile using Microsoft Visual Studio.

Get the Coqui STT source. Go to the git repository below and click on Code->Download Zip

Unzip the source and copy this folder to another location:

Double click on STT.WPF.sln in the \dotnet\STTWPF\ folder.

If you get the error “Project Target Framework not installed”, click on download.
Install all missing developer packs. Some more related errors might appear in the Error list.

After installing all missing framework developer packs, restart Visual Studio.

Select the STT.WPF build target (Debug and x64). Click start to build.

Currently there is a bug in the source which causes a number of these compile errors:
‘Stream’ is an ambiguous reference between ‘STTClient.Models.Stream’ and ‘System.IO.Stream’

To fix these compile errors, double click on each compile error to open the source file at the error location.
Change each Stream reference to STTClient.Models.Stream

For example, change this:
unsafe void FreeStream(Stream stream);
To this:
unsafe void FreeStream(STTClient.Models.Stream stream);

Next is to download the models. Go to git releases:

Download these files:


Copy the kenlm.scorer file from here: home/username/kenlm.scorer

Place the tflite and scorer file in the executable directory (dotnet\STTWPF\bin\x64\Debug)
Unzip native_client.tflite.Windows.tar.xz, then place all *.so files in the executable directory too.

Go to the Solution Explorer, right click STTClient, then select Build.

Select STT.WPF as the build target on the top bar, then click Start to build. If any additional ambiguous Stream errors pop up, fix those first as stated above.

An error will pop up: “Cannot find the model file”. Go to the line which triggers the error and change the file name to “model_quantized.tflite”.

Save and run again.

In the program, click on “Enable external” (to enable the Coqui STT scorer).
Select a microphone device, then click Record, say something, then click Stop. The transcript should show up.

Create a project from scratch

To create a project from scratch, follow the instructions below.

In Visual Studio, create a new C# WPF Application project. Name it TestSTT.
Do not select “place solution and project in the same directory”.

Get the Coqui STT source. Go to the git repository below and click on Code->Download Zip

Unzip the source and copy this folder:

Place the STTClient folder into the project folder.

Go to Solution Explorer->Right click on your Project (not the solution)->Properties->Build tab->
Select Platform target: x64
Select “Allow unsafe code”
Change the target in the task bar to x64 (Debug, x64, Start).

Go to Solution Explorer->Right click on the solution->Add->Existing project. Open the STTClient folder which you placed in the project folder before, then select the STTClient.csproj file and add it to the project.

Go to Solution Explorer->Right click on your Project (not the solution, and not STTClient)->Add->Project Reference. Then on the left side bar select Projects->Solution. Tick the box next to STTClient, then click Ok.

Go to Solution Explorer->Right click on “STTClient”->Build

Currently there is a bug in the source which causes a number of these compile errors:
‘Stream’ is an ambiguous reference between ‘STTClient.Models.Stream’ and ‘System.IO.Stream’

To fix these compile errors, double click on each compile error to open the source file at the error location.
Change each Stream reference to STTClient.Models.Stream

For example, change this:
unsafe void FreeStream(Stream stream);
To this:
unsafe void FreeStream(STTClient.Models.Stream stream);

Scroll past this error (it will be fixed after changing the stream reference):
“does not implement interface member”

Make sure all “Stream is an ambiguous reference” related errors are fixed first.

Go to Solution Explorer->Right click on “STTClient”->Build. This time it should succeed. If it complains about a missing metafile for STTClient.dll, close Visual Studio and open it again, then build again.

Next is to download the models. Go to git releases:

Download these files:


Place the tflite and kenlm.scorer files in the executable directory (TestSTT\bin\Debug\netcoreapp3.1).
Unzip native_client.tflite.Windows.tar.xz, then place all *.so files in the executable directory.

Open App.xaml.cs or MainWindow.xaml.cs and add this line:
using STTClient.Interfaces;
If this gives an error, add the STTClient reference to the project as described before.

Demo App

A WPF C# demo application which supports PTT and continuous inference is available here:

The tflite and scorer files have to be added to the executable directory.

Check the required gain in a wave editor like Audacity first. Too much gain will lead to clipping.

VAD is not supported yet.

Press F7 for PTT. The window doesn’t have to have the focus in order to receive the key press because it uses a global key hook.

The audio heard by the STT engine is sent to the /WavOutput directory if “Output debug WAV” is ticked.


Once GPU inference is supported, it will lead to better performance. But whether CPU or GPU inference is used, it is still good to keep in mind performance issues.

On a 2.2 GHz i5 processor, CPU usage is about 30% continuously if FeedAudioContent and IntermediateDecode are used in a loop. This is not acceptable in a real-time 3D rendering use case. It would be better to either use a PTT solution (only send an audio sample for recognition when needed), or use streaming VAD to detect when audio is to be decoded into speech. The disadvantage of the latter is that it has to use a silence period to detect the end of speech, which leads to both latency and recognition errors if pauses in speech are too long.
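
The latency cost of the silence period can be illustrated with a minimal Python sketch. This is not Coqui STT code: the per-frame speech flags are assumed to come from some VAD, and the names find_utterance_end, frame_ms, and silence_ms are hypothetical.

```python
def find_utterance_end(speech_flags, frame_ms=20, silence_ms=500):
    """Return the index of the frame at which end of speech is declared,
    or None if the utterance has not ended yet.

    speech_flags: per-frame booleans from a VAD (True = speech detected).
    """
    needed = silence_ms // frame_ms  # silent frames required to end the utterance
    silent_run = 0
    heard_speech = False
    for i, is_speech in enumerate(speech_flags):
        if is_speech:
            heard_speech = True
            silent_run = 0          # any speech frame resets the silence counter
        elif heard_speech:
            silent_run += 1
            if silent_run >= needed:
                return i
    return None


# 10 frames of speech followed by silence.
flags = [True] * 10 + [False] * 30
end = find_utterance_end(flags)
latency_ms = (end - 9) * 20  # time between the last speech frame and detection
```

A shorter silence_ms lowers the latency but cuts an utterance in two at a natural pause, which is the trade-off described above.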

The best solution depends on the use case. For ATC conversations, a PTT button is definitely the best option as it is realistic (PTT is used in real life too), has the lowest latency, and is the most performant. For cockpit crew interaction, it is best to offer a user-configurable choice of PTT, continuous recognition, or VAD.

When using VAD, it is important to set the microphone volume at an appropriate level. If STT requires a different volume, the wave data can be scaled in code. Note that volume detection alone is not a good way to implement VAD. More information here:

Note that using continuous speech recognition with a small language model will lead to many false positives if the background noise level is too high and contains speech fragments.
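
Scaling the wave data in code, as mentioned above, can be sketched like this in Python. The sketch assumes 16-bit signed PCM samples in a plain list. The function name apply_gain is hypothetical and not part of Coqui STT.

```python
def apply_gain(samples, gain):
    """Scale 16-bit PCM samples by `gain`, clamping to the int16 range.

    Clamping prevents integer wrap-around, but clamped peaks are still
    clipped audio, so the gain should be chosen to stay below the limits.
    """
    lo, hi = -32768, 32767
    return [max(lo, min(hi, int(round(s * gain)))) for s in samples]


quiet = [1000, -2000, 12000]
louder = apply_gain(quiet, 2.0)   # all samples still fit in 16 bits
clipped = apply_gain(quiet, 4.0)  # the last sample hits the upper limit
```

If samples reach the limits, as in the second call, the gain is too high and the recording will sound distorted to the STT engine.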

Further reading

Sociopaths in Aviation

It is normal that you can’t get along with everyone, and there is nothing wrong with that. The highly complex but logical environment we operate in is mostly dictated by SOPs and regulations, so you can have a good day at work with someone you normally wouldn’t hang out with. That is a good thing. But every once in a while you come across someone who very few people like to work with. It’s not just you. We have all seen it and it doesn’t just apply to aviation.

Luckily the danger of personality issues is a widely recognized problem in Aviation. There have been many incidents and accidents in which personality clashes were a factor. CRM classes alone cannot solve this issue because it is not possible to change someone’s personality unless that person puts a significant amount of effort into that over a long period of time. People can change their behavior but only with their own free will. But, when under pressure, the worst part of someone’s personality usually surfaces.

A person should be able to set aside his personality issues, focus on the job, and do what’s right not who’s right. But this is a theoretical notion which sometimes doesn’t match with reality. If you disagree, you haven’t been flying long enough.

We are all people with feelings and empathy, so why are some people horrible to work with? Because some people have very little empathy. They are sociopaths, and are a danger to Aviation. Let me explain. Sociopathy is a personality disorder which has many facets, but the things you see the most in pilots are a combination of:

-Lack of empathy.

-Being arrogant, the feeling of being “better” than everyone else.

-Talking about other people as if they are stupid and clueless.

-Unable to keep positive professional relationships.

-Not being fair.

-Verbally blunt, lack of tact.

-Drive to acquire ever higher positions.

-Abuse of power.


-Not taking input from others seriously. Wanting to have it their way.

In a theoretical professional environment these things shouldn’t be a problem as it should be possible to operate the aircraft safely, even if you can’t get along with someone. Unfortunately this is not the reality. When someone pisses you off enough so that it occupies your thoughts for a long time, it can become a safety issue.

You are less likely to help a person who has made a mistake but has previously mistreated you. That is human psychology and very hard to resist. Another big issue is that your thoughts are likely to be somewhere else if you just had an “event” on the social level with your colleague but are now in a high workload situation.

Social friction is especially a problem when there is a high cockpit gradient, a senior pilot flying with a far less senior pilot. The less senior pilot is even less likely to help out if his/her colleague makes a mistake. This is both due to the unwillingness to speak up, and the inner doubt this creates. I have been there when I was a cadet. It is a serious issue.

The issues don’t have to start in the airplane. If someone misbehaves sufficiently in the office, the simulator, or even during days off, the negative effects of it can be carried over into the aircraft.

I think the problem of personality disorders like sociopathy, or worse, psychopathy, in Aviation is not being taken seriously enough. Airlines and flight schools seem to favor the cleverest pilots who score the highest in the aptitude tests. Sure, there are some personality tests like questionnaires and group assignments, but these are easily faked. Because sociopaths are generally also highly intelligent, they know they have to modify their behavior temporarily in order to get hired.

So how to solve this problem? There is only one way in my opinion. Don’t engage in verbal conflict and operate the aircraft to the best of your abilities within the framework of CRM. When the situation is bad enough, ask the rostering department not to roster you with that person again, and call in sick if you have to. Someone else is not going to change, so if the situation is a safety issue, it is best to avoid it altogether.

Hopefully this sheds some light on a topic that is little discussed in public. If you have anything to add, please let me know in the comments.

A320 Descent Energy Management

When I first started flying jets, I struggled with descent energy management. I ended up too high, too low, and didn’t know when to use the speed brakes. It wasn’t until I had about 2000 hours on the jet that I finally understood the principle. But why did it take so long? I used to think I was the only one who took so long to understand this subject, but in hindsight that turned out not to be the case.

I started flying on the B737 and later transitioned to the A320. This was easy as far as energy management is concerned because these aircraft behave practically the same way. Ten years after I started flying, I started instructing on the A320. It is interesting to see flying from an instructor’s point of view, and I learned something very important: just because someone else understands something, doesn’t mean it is easy. What I observed is that all new pilots struggle with descent energy management. Every single one of them. What is even more interesting is that I sometimes see experienced captains who still don’t get it.

So I was right after all. Descent Management isn’t easy. It is very difficult. But it is not rocket science either, so why did it take so long for me to understand it, and why do some pilots never seem to understand it? I believe there are a few reasons for this. There are some instructors who themselves don’t get it. Some instructors use overcomplicated and non-intuitive methods and math, confusing the student. Most instructors have forgotten that it is a difficult subject and subsequently don’t spend enough time explaining it. Often technical details are taken for granted, but what is obvious to you might not be obvious to someone else. And finally, one of the problems is that there is no standard way of dealing with Descent Energy Management. There are many ways to do it but each instructor insists on their own way, confusing the student pilot even more.

One other problem is that the student pilot is often overloaded with new information. The learning curve is steep and Descent Energy Management is just one of many subjects to master. The first few days a fresh new pilot who just graduated from flight school (Cadet) flies on the real aircraft, the crew is given an additional pilot (safety pilot) to make sure the aircraft can land safely in case the Captain becomes incapacitated. This has mostly to do with landing skills. However, if you can’t put the aircraft on the approach with an appropriate energy state (speed and altitude), you are not going to land it in the first place. So it is important to make descent energy management a priority right from the start.

Currently very little documentation exists about Descent Energy Management on the A320. There are some documents about this subject available, some even from Airbus, but they are all either too theoretical, not practical, not written for pilots (for ATC instead), or overly simplified. So to address this issue, I decided to write a book about the subject myself. That being said, Descent Energy Management is not something you can learn from a book alone as it takes a lot of experience to fully master this subject. A well-written and easy-to-understand guide on the subject will make the learning process much easier though. And hopefully it will become the standard one day, doing away with the jungle of different methods out there.

The book focuses on what most inexperienced pilots struggle with but it contains everything there is to say about the subject. It also contains a lot of real world examples and lots of advice, especially on how to make things easier.

You can buy the eBook and paperback version here:

A320 ECAM Rendering

I finished a few more EFIS vector graphics displays. They are created in a CAD program, converted to SVG, modified in Inkscape, converted to XAML with ViewerSVG, and rendered in Unity using NoesisGUI. The complete process is described here.

Here are some screenshots:


In Unity, it looks like this:

EWD and SD

The System Display (SD) consists of two separate parts: the top graphics part and the bottom table with the TAT, SAT, ISA, etc. Using two separate parts makes maintenance easier if a design error is detected. They can be blended in code using NoesisGUI. All symbols can be animated in NoesisGUI too.

The source vector graphics files contain all symbols. Here is an example:


Path Tracing in Unity using Octane

Unity recently added support for Path Tracing using Octane so I decided to give it a try. I was in the beta program for a few months and it turned out the A320 CAD model in Unity caused quite a few problems due to the large amount of materials used. But eventually it did work.

Here is a sample render. Right click->View Image to expand. (WordPress really has to make this easier)


The images below are tone mapped in Photoshop. This looks a bit better.

Although it works pretty much out of the box, I discovered a few issues:

-All materials appear slightly more rough when rendered with Octane. Unfortunately there is no global slider available to fix this issue.

-Although the renderer is full HDR (32 bit float RGBA which is 128 bit per pixel), it requires careful tweaking of the sun intensity, exposure, gamma, sky turbidity, and tone mapping in order to avoid white highlights or the entire scene looking too dark. This is a common issue with renderers and is described in detail here. You can also work around this problem by saving the render as a 16 bit EXR and then modifying it in Photoshop, but that solution is less than ideal. An out of the box solution which is more physically inspired would be preferable.

-Currently there are a few bugs which require some workarounds. This includes GameObjects with disabled MeshRenderers still being rendered and spot lights casting shadows.

-A model designed for realtime rendering does not necessarily look good with Path Tracing. This is not the fault of the Path Tracer but due to the fact that flat geometry with detail in AO maps is not rendered. Have a look at the screws on the FCU and you can see that it lacks AO. I did not try enabling the AO maps (not sure if that is possible) but that would make other geometry look worse due to quality issues. My AO maps are just not designed for close-up renders.

Even with the current issues, it is still an easy and quick way to get a nice looking render. And it is free 🙂

The A320 CAD model is available for purchase. Contact for more information.

Fast light source rendering

Rendering light sources is typically done using individual sprites, but this can become computationally expensive pretty quickly if you use thousands of lights. A better approach is to use a single mesh and make the individual triangles face the camera in the shader. This way you can render a huge number of lights (21,844 with Unity 2017.2, or 1,431,655,765 lights with Unity 2017.3) in one draw call.

The lights don’t actually light other objects and the effect needs a good bloom shader, but the aim is to make the lights themselves look realistic.

Note that Unity’s post processing stack V1 bloom shader does not work well with SpriteLights. However, the current V2 beta (available on github) works exceptionally well, even better than Sonic Ether’s bloom shader as it has almost no flicker.

The funny thing is that there are thousands of references available on how a light affects an object, but the references on how the light itself looks can be counted on one hand. I once found a scientific paper, but that’s about it. Perhaps that is why very few people get it right. Often you see an emissive sphere with a flare sprite slapped on top of it. But that is a far cry from a physically based approach, which I will describe here.

Most lights have a lens, which makes them either highly directional like a flashlight, or horizontally directional, the result of a cylindrical Fresnel lens. This directional behavior is simulated with a phase function which shows nicely on a polar graph. Here you can see two common light radiation patterns:


The blue graph has the function 1 + cos(theta*2) where theta is the angle between the light normal and the vector from the light to the camera. The output of the function is the irradiance. Adding this to the shader gives the lights a nice angular effect.
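
As a rough illustration outside the shader, the blue function can be evaluated in Python as below. The function names are hypothetical. The second form is an assumption about how one might write it with a dot product instead of an angle, using the identity cos(2θ) = 2cos²(θ) − 1.

```python
import math

def phase(theta):
    """Relative brightness for angle theta (radians) between the light
    normal and the light-to-camera vector: 1 + cos(2 * theta)."""
    return 1.0 + math.cos(2.0 * theta)

def phase_from_dot(cos_theta):
    """Same function expressed with the dot product of the two unit
    vectors, so no trigonometric call is needed."""
    return 2.0 * cos_theta * cos_theta


on_axis = phase(0.0)              # looking straight into the light
off_axis = phase(math.pi / 4.0)   # 45 degrees off the light normal
side_on = phase(math.pi / 2.0)    # 90 degrees: the light is edge-on
```

The function is symmetric, so it returns the same value at 180 degrees as at 0 degrees. A light that should only be visible from the front needs the back lobe masked out separately.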


Next is the attenuation. Contrary to popular belief, focused lights (in the extreme case, lasers) still attenuate with the inverse square law, as described here:…distance-grows-similar-to-other-light-sources

But contrary to even popular scientific belief, lights themselves don’t behave in quite the same way, or at least not perceptually. The inverse square law states that the intensity is inversely proportional to the square of the distance. Because of this:


You see this reference all over, for example here:


Yet the light itself is brighter than bar number 4, which is about at the same distance as the light to the camera. The light itself doesn’t seem to attenuate with the inverse square law. So why is this? Turns out that in order to model high gain light sources (such as directional lights), you need to place the source location far behind the actual source location. Then you can apply the inverse square law like this:


Note that highly directional lights have a very flat attenuation curve, which can be approximated with a linear function if needed in order to save GPU cycles.
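
The idea of a virtual source behind the light can be sketched in Python as follows. The exact formula used in the shader is not reproduced here. The normalization to 1.0 at the light, the parameter name source_offset, and the linear approximation (a first order Taylor expansion) are all assumptions made for illustration.

```python
def attenuation(distance, source_offset):
    """Inverse square falloff measured from a virtual source placed
    `source_offset` behind the physical light, normalized so that the
    result is 1.0 at the light itself (distance == 0)."""
    return (source_offset / (source_offset + distance)) ** 2

def attenuation_linear(distance, source_offset):
    """Cheap linear approximation, reasonable only while the distance is
    small compared to source_offset. Clamped so it never goes negative."""
    return max(0.0, 1.0 - 2.0 * distance / source_offset)


# Same viewing distance, two different lights.
bare_bulb = attenuation(100.0, 1.0)       # source almost at the light: steep falloff
focused = attenuation(100.0, 1000.0)      # source far behind the light: flat falloff
focused_approx = attenuation_linear(100.0, 1000.0)
```

With a large offset the curve is nearly flat over the viewing range, which is why the linear form stays close to the exact value in this example.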

Some more reading about the subject here (chapter Validity of the Inverse Square Law):

One other problem is that the light will disappear if it gets too far from the camera. This is the result of the light being smaller than one pixel. That is fine for normal objects but not for lights, because even extremely distant or small lights are easily visible in real life, for example a star. It would be nice if we had a programmable rasterizer, but so far no luck. Instead, I scale the lights up when they are smaller than one pixel, so they remain the same screen size. Together with the attenuation, this gives a very realistic effect. And all of this is done in the shader so it is very fast, about 0.4 ms for 10,000 lights on a 780 Ti.
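
The sub-pixel scaling can be sketched in Python like this. It assumes a simple perspective camera with a vertical field of view. The function and parameter names are hypothetical and this is not the shader code itself.

```python
import math

def projected_pixels(size, distance, fov_deg, screen_height):
    """Approximate on-screen height in pixels of an object `size` world
    units tall at `distance` from a perspective camera."""
    view_height = 2.0 * distance * math.tan(math.radians(fov_deg) / 2.0)
    return size / view_height * screen_height

def light_scale(size, distance, fov_deg, screen_height, min_pixels=1.0):
    """Scale factor that keeps the light at least `min_pixels` tall.
    Uses max() instead of a branch, as one would in a shader."""
    pixels = projected_pixels(size, distance, fov_deg, screen_height)
    return max(1.0, min_pixels / pixels)


near = light_scale(0.2, 10.0, 90.0, 1080)    # already larger than a pixel
far = light_scale(0.2, 1000.0, 90.0, 1080)   # sub-pixel, so it is scaled up
far_pixels = projected_pixels(0.2 * far, 1000.0, 90.0, 1080)
```

Because the scaled light always covers the same screen area, its perceived brightness at long range is then controlled by the attenuation alone.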

Since I made this system for a flight simulator, I included some specific lights you find in aviation, like walking strobe lights (also done entirely in the shader):


And PAPI lights, which are a bit of a corner case. They radiate light in a split pattern like this (used by pilots to see if they are high or low on the approach):


Simulated here, also entirely in the shader.


Normally there are only 4 of these lights in a row, but here are 10,000, just for the fun of it. They have a small transition where the colors are blended (just like in reality), which you won’t find in any simulator product, even multi-million-dollar professional simulators. That’s a simple lerp() by the way.

I should also note that the shaders don’t use any conditional if-else statements but use lerp, clamp, and scaling trickery instead. So it plays nice even on low-end hardware.
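
The branch-free color blend can be sketched as follows, in Python rather than shader code. The width of the transition band, the threshold angle, and all names are assumptions for illustration. Only the lerp and clamp pattern comes from the description above.

```python
def clamp01(x):
    """Clamp x to the range 0..1 (like saturate() in a shader)."""
    return max(0.0, min(1.0, x))

def lerp(a, b, t):
    """Linear interpolation from a (t = 0) to b (t = 1)."""
    return a + (b - a) * t

def papi_color(angle_deg, threshold_deg, transition_deg=0.1):
    """RGB color of one PAPI unit seen from elevation `angle_deg`:
    red below the unit's threshold angle, white above it, with a narrow
    blended band in between. No if/else is used."""
    red = (1.0, 0.0, 0.0)
    white = (1.0, 1.0, 1.0)
    t = clamp01((angle_deg - threshold_deg) / transition_deg + 0.5)
    return tuple(lerp(r, w, t) for r, w in zip(red, white))


low = papi_color(2.0, 3.0)    # below the threshold
high = papi_color(4.0, 3.0)   # above the threshold
edge = papi_color(3.0, 3.0)   # exactly on the threshold: halfway blend
```

The clamp turns the angle difference into a blend factor that saturates outside the transition band, so the same arithmetic runs for every light regardless of where the viewer is.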

Available here (supports Built-in render pipeline, URP, and HDRP):