Build a C# Speech To Text Call Recorder

Building a C# application that records audio and converts it into text involves two main steps: capturing the sound wave from an audio device and processing that data through a speech recognition engine.

Here is a complete guide to building a functional speech-to-text call recorder using C# and .NET.

Architectural Overview

A call recorder with speech-to-text operates in three distinct phases:

[Audio Source] ──> [NAudio Capture Buffer] ──> [WAV File Writer]
                                                       │
                                                       ▼
[Readable Text] <── [Text Output] <── [Cognitive Services Speech Engine]

Capture: Intercepting the audio stream from the microphone or system output.

Storage: Writing the raw audio bytes into a standard format like WAV.

Transcription: Passing the audio data to a speech-to-text engine for processing.

Prerequisites and Setup

To build this project, you need the .NET SDK installed on your machine. Create a new C# Console Application and install the required NuGet packages for audio handling and speech recognition. Run these commands in your terminal:

    dotnet new console -n CallRecorderSTT
    cd CallRecorderSTT
    dotnet add package NAudio
    dotnet add package Microsoft.CognitiveServices.Speech

NAudio: An open-source .NET audio library used to capture audio from input devices.

Microsoft.CognitiveServices.Speech: The official SDK for Azure Speech Services, which provides highly accurate machine-learning transcriptions.

Step 1: Capturing Audio with NAudio

First, we set up the audio recorder. NAudio provides the WaveInEvent class to capture audio from the default microphone. We must specify the audio format: a 16 kHz sample rate, 16-bit depth, and a single (mono) channel work well for speech recognition. Create a file named AudioRecorder.cs:
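To get a feel for what that format costs in storage, the data rate of uncompressed PCM can be worked out from the sample rate, bit depth, and channel count. The following is a minimal sketch; PcmMath and its method names are hypothetical helpers introduced for illustration and are not part of NAudio.

```csharp
using System;

public static class PcmMath
{
    // Bytes of raw PCM produced per second:
    // sampleRate * (bitsPerSample / 8) * channels.
    public static int BytesPerSecond(int sampleRate, int bitsPerSample, int channels)
        => sampleRate * (bitsPerSample / 8) * channels;

    // Approximate size of the audio data for a recording of the given length.
    // This excludes the small WAV header that precedes the samples.
    public static long DataBytes(TimeSpan duration, int sampleRate, int bitsPerSample, int channels)
        => (long)(duration.TotalSeconds * BytesPerSecond(sampleRate, bitsPerSample, channels));
}
```

Under these assumptions, PcmMath.BytesPerSecond(16000, 16, 1) evaluates to 32000, so a ten-minute call takes roughly 19.2 MB of audio data. With the format settled, the AudioRecorder.cs file itself follows.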

    using System;
    using System.Threading;
    using NAudio.Wave;

    public class AudioRecorder
    {
        private WaveInEvent? _waveSource;
        private WaveFileWriter? _waveWriter;
        private readonly string _outputFilePath;

        // Signaled once RecordingStopped has run and the WAV file is finalized.
        private readonly ManualResetEventSlim _stopped = new ManualResetEventSlim(true);

        public AudioRecorder(string outputPath)
        {
            _outputFilePath = outputPath;
        }

        public void StartRecording()
        {
            _waveSource = new WaveInEvent
            {
                // 16 kHz, 16-bit, mono is standard for cloud speech-to-text
                WaveFormat = new WaveFormat(16000, 16, 1)
            };
            _waveSource.DataAvailable += OnDataAvailable;
            _waveSource.RecordingStopped += OnRecordingStopped;

            _waveWriter = new WaveFileWriter(_outputFilePath, _waveSource.WaveFormat);
            _stopped.Reset();
            _waveSource.StartRecording();
        }

        public void StopRecording()
        {
            _waveSource?.StopRecording();
            // RecordingStopped is raised asynchronously. Wait (with a timeout) so the
            // WAV file is closed before the caller tries to read it.
            _stopped.Wait(TimeSpan.FromSeconds(5));
        }

        private void OnDataAvailable(object? sender, WaveInEventArgs e)
        {
            if (_waveWriter != null)
            {
                _waveWriter.Write(e.Buffer, 0, e.BytesRecorded);
                _waveWriter.Flush();
            }
        }

        private void OnRecordingStopped(object? sender, StoppedEventArgs e)
        {
            // Disposing the writer finalizes the WAV header with the correct lengths.
            _waveWriter?.Dispose();
            _waveWriter = null;
            _waveSource?.Dispose();
            _waveSource = null;
            _stopped.Set();
        }
    }

Step 2: Implementing Speech to Text

Once the audio is saved to a local WAV file, we feed it into the Azure Speech SDK. You will need an Azure Speech service resource key and region to use this engine. Create a file named SpeechTranscriber.cs:
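Because the key is a secret, it is usually better kept out of source code. One possible approach, sketched below, reads both values from environment variables. The SpeechCredentials class and the variable names SPEECH_KEY and SPEECH_REGION are assumptions made for this example; the SDK does not require them.

```csharp
using System;

public static class SpeechCredentials
{
    // Reads a required setting from the environment and fails fast
    // with a clear message if it is missing or blank.
    public static string Require(string variableName)
    {
        string? value = Environment.GetEnvironmentVariable(variableName);
        if (string.IsNullOrWhiteSpace(value))
        {
            throw new InvalidOperationException(
                $"Environment variable '{variableName}' is not set.");
        }
        return value.Trim();
    }

    // SPEECH_KEY and SPEECH_REGION are assumed names chosen for this sketch.
    public static (string Key, string Region) Load()
        => (Require("SPEECH_KEY"), Require("SPEECH_REGION"));
}
```

The returned pair can then be passed wherever the key and region are needed. The SpeechTranscriber.cs file, which receives both values through its constructor, follows.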

    using System;
    using System.IO;
    using System.Text;
    using System.Threading.Tasks;
    using Microsoft.CognitiveServices.Speech;
    using Microsoft.CognitiveServices.Speech.Audio;

    public class SpeechTranscriber
    {
        private readonly string _speechKey;
        private readonly string _speechRegion;

        public SpeechTranscriber(string speechKey, string speechRegion)
        {
            _speechKey = speechKey;
            _speechRegion = speechRegion;
        }

        public async Task<string> TranscribeAudioAsync(string filePath)
        {
            if (!File.Exists(filePath))
            {
                throw new FileNotFoundException("Target audio file not found.", filePath);
            }

            var speechConfig = SpeechConfig.FromSubscription(_speechKey, _speechRegion);
            using var audioConfig = AudioConfig.FromWavFileInput(filePath);
            using var recognizer = new SpeechRecognizer(speechConfig, audioConfig);

            var stopRecognition = new TaskCompletionSource<int>(
                TaskCreationOptions.RunContinuationsAsynchronously);
            var resultText = new StringBuilder();

            // Each finalized phrase arrives as a Recognized event.
            recognizer.Recognized += (s, e) =>
            {
                if (e.Result.Reason == ResultReason.RecognizedSpeech)
                {
                    resultText.AppendLine(e.Result.Text);
                }
            };

            // Canceled fires at end of file and also on errors such as a bad key.
            recognizer.Canceled += (s, e) =>
            {
                if (e.Reason == CancellationReason.Error)
                {
                    stopRecognition.TrySetException(new InvalidOperationException(
                        $"Recognition canceled: {e.ErrorCode}: {e.ErrorDetails}"));
                }
                else
                {
                    stopRecognition.TrySetResult(0);
                }
            };

            recognizer.SessionStopped += (s, e) =>
            {
                stopRecognition.TrySetResult(0);
            };

            // Start continuous recognition for long audio files
            await recognizer.StartContinuousRecognitionAsync();

            // Wait until recognition finishes or stops
            await stopRecognition.Task;

            await recognizer.StopContinuousRecognitionAsync();
            return resultText.ToString();
        }
    }

Step 3: Putting It All Together

Now, coordinate the recorder and transcriber workflow inside the main execution file (Program.cs).

    using System;
    using System.Threading.Tasks;

    class Program
    {
        static async Task Main(string[] args)
        {
            // Replace with your actual Azure Cognitive Services credentials
            string azureKey = "YOUR_AZURE_SPEECH_KEY";
            string azureRegion = "YOUR_AZURE_REGION";
            string audioPath = "recorded_call.wav";

            var recorder = new AudioRecorder(audioPath);
            var transcriber = new SpeechTranscriber(azureKey, azureRegion);

            Console.WriteLine("Press [Enter] to START recording the call...");
            Console.ReadLine();
            recorder.StartRecording();

            Console.WriteLine("Recording... Press [Enter] to STOP recording and start transcription.");
            Console.ReadLine();
            recorder.StopRecording();

            Console.WriteLine("Recording saved. Processing transcription, please wait...");
            try
            {
                string transcript = await transcriber.TranscribeAudioAsync(audioPath);
                Console.WriteLine("--- Call Transcript ---");
                Console.WriteLine(string.IsNullOrEmpty(transcript) ? "[No speech detected]" : transcript);
                Console.WriteLine("-----------------------");
            }
            catch (Exception ex)
            {
                Console.WriteLine($"An error occurred during transcription: {ex.Message}");
            }
        }
    }

Enhancing for Real-World Use Cases

While the prototype above successfully records and transcribes a single input channel, production environments generally require two major upgrades:

Two-Channel (Stereo) Recording: Standard call recording requires capturing both the microphone (your voice) and the system audio output (the remote speaker). To do this with NAudio, you must spin up a WasapiLoopbackCapture instance for system audio alongside the WaveInEvent for the microphone, then mix the two streams into a stereo WAV file.
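The channel-combining step can be illustrated without any audio hardware. The sketch below interleaves two mono 16-bit PCM buffers into one stereo buffer, with the first input on the left channel and the second on the right. StereoInterleaver is a hypothetical helper written for this example, and it assumes both inputs already share the same sample rate and 16-bit format. Loopback capture typically delivers audio in the output device's own mix format, so a conversion or resampling step would likely be needed before this point.

```csharp
using System;

public static class StereoInterleaver
{
    // Interleaves two mono 16-bit PCM buffers into one stereo buffer:
    // left channel = first input, right channel = second input.
    // If one input is shorter, its missing samples are left as silence (zeros).
    public static byte[] Interleave(ReadOnlySpan<byte> left, ReadOnlySpan<byte> right)
    {
        const int BytesPerSample = 2;
        int leftSamples = left.Length / BytesPerSample;
        int rightSamples = right.Length / BytesPerSample;
        int frames = Math.Max(leftSamples, rightSamples);
        var output = new byte[frames * 2 * BytesPerSample];

        for (int i = 0; i < frames; i++)
        {
            int src = i * BytesPerSample;
            int dst = i * 2 * BytesPerSample;
            if (i < leftSamples)
            {
                output[dst] = left[src];
                output[dst + 1] = left[src + 1];
            }
            if (i < rightSamples)
            {
                output[dst + 2] = right[src];
                output[dst + 3] = right[src + 1];
            }
        }
        return output;
    }
}
```

The resulting buffer could be written with a WaveFileWriter whose format is new WaveFormat(16000, 16, 2). Keeping the two speakers on separate channels, rather than summing them into one, makes it easier to tell afterward who said what.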

Real-Time Streaming Transcription: Instead of writing to a file and transcribing it afterward, you can pipe the raw byte buffer from the DataAvailable event directly into an Azure PushAudioInputStream. This sends chunks of audio to the cloud over WebSockets as they are spoken, rendering text on your screen in near real-time.
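The streaming approach might look roughly like the sketch below. LiveTranscriber and RunAsync are hypothetical names chosen for this example; the sketch assumes the same 16 kHz, 16-bit, mono format used earlier and the default microphone, and it has not been tuned for production use.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;
using NAudio.Wave;

public static class LiveTranscriber
{
    public static async Task RunAsync(string speechKey, string speechRegion)
    {
        // The push stream must be told the format of the raw bytes it will receive.
        var format = AudioStreamFormat.GetWaveFormatPCM(16000, 16, 1);
        using var pushStream = AudioInputStream.CreatePushStream(format);
        using var audioConfig = AudioConfig.FromStreamInput(pushStream);

        var speechConfig = SpeechConfig.FromSubscription(speechKey, speechRegion);
        using var recognizer = new SpeechRecognizer(speechConfig, audioConfig);

        // Recognizing carries partial hypotheses; Recognized carries the final phrase.
        recognizer.Recognizing += (s, e) => Console.Write($"\r{e.Result.Text}");
        recognizer.Recognized += (s, e) =>
        {
            if (e.Result.Reason == ResultReason.RecognizedSpeech)
            {
                Console.WriteLine($"\r{e.Result.Text}");
            }
        };

        using var waveIn = new WaveInEvent { WaveFormat = new WaveFormat(16000, 16, 1) };

        // Forward each captured buffer to the recognizer as soon as it arrives.
        waveIn.DataAvailable += (s, e) => pushStream.Write(e.Buffer, e.BytesRecorded);

        await recognizer.StartContinuousRecognitionAsync();
        waveIn.StartRecording();

        Console.WriteLine("Streaming... press [Enter] to stop.");
        Console.ReadLine();

        waveIn.StopRecording();
        pushStream.Close(); // signals end of audio to the recognizer
        await recognizer.StopContinuousRecognitionAsync();
    }
}
```

Compared with the file-based flow, this trades the saved recording for lower latency. If the recording is also needed, the same DataAvailable handler could write each buffer to a WaveFileWriter as well.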
