
API How-To

WebSocket Dictation

Introduction

For real-time dictation, where the user sees transcribed text returning as they speak, nVoq's WebSocket API provides a more efficient, lower-latency solution than HTTP. The operations are similar to those in the HTTP API: authenticate, create the dictation, upload audio, and download results. The primary difference is that while the audio is uploaded in parts, text is sent back to the client concurrently. This makes for a great user experience, but the application developer has a bit more to manage in this asynchronous implementation.
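The asynchronous exchange can be sketched as a small client-side state machine. This is an illustrative sketch, not part of the API; the message method names (STARTDICTATION, TEXT) and the textDone flag follow the protocol messages shown in the steps below.

```javascript
// Minimal sketch of the client-side dictation flow.
// States: "starting" -> "streaming" -> "done".
function nextState(state, message) {
  if (state === "starting" && message.method === "STARTDICTATION") {
    // Server acknowledged the start request: begin uploading audio.
    return "streaming";
  }
  if (state === "streaming" && message.method === "TEXT") {
    // Text arrives while audio is still being uploaded; only the
    // final TEXT message carries textDone: true.
    return message.data && message.data.textDone ? "done" : "streaming";
  }
  return state;
}

// A typical message sequence drives the client to "done".
let state = "starting";
for (const msg of [
  { method: "STARTDICTATION" },
  { method: "TEXT", data: { textDone: false } },
  { method: "TEXT", data: { textDone: true } },
]) {
  state = nextState(state, msg);
}
console.log(state); // "done"
```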

Before You Begin

API User Account

If your organization has not already been in contact with our Sales team, please complete the short form on the Developer Registration Page and we will reach out to you regarding a user account and development with our APIs.

Once you have an account, you must change your password before the account can be used for API calls.

Audio Format

The nVoq API supports Ogg Vorbis, WebM, MPEG-4, and PCM-encoded audio sampled at 16 kHz. For more information, see the Audio Formats page.

Note: audio processing performs best when each audio chunk is about 300ms long.
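At 16 kHz with 16-bit mono samples, a 300 ms chunk works out to 9,600 bytes. A quick sanity check (the helper below is illustrative, not part of the API):

```javascript
// Bytes per audio chunk for 16-bit mono PCM sampled at 16 kHz:
// 16000 samples/s * 2 bytes/sample * 0.3 s = 9600 bytes.
function chunkSizeBytes(sampleRateHz, bytesPerSample, chunkMillis) {
  return Math.round(sampleRateHz * bytesPerSample * (chunkMillis / 1000));
}

console.log(chunkSizeBytes(16000, 2, 300)); // 9600
```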

External Dependencies

Most platforms do not include a WebSocket implementation, so you will need an appropriate third-party library. The samples below use Tyrus (Java) and WebSocketSharp with NAudio and Json.NET (C#); browsers provide a native WebSocket object for JavaScript.

Start your IDE

The nVoq API is a RESTful web services and WebSocket API, so it does not constrain you to any specific platform or programming language. We provide sample code below for C#, Java, and JavaScript. Follow along and run this code in your environment. If you prefer C++, Go, or some other language, that's great: just adapt the code below to your language's web services functionality and you should be good to go.

Let's Go!

Choose your programming language...

Step 1: Set Up

This section provides the starting point for a class that will contain the example implementation. To make the examples easier to read, long URLs and parameters are replaced with more legible variables. This section of code defines those variables and sets the values needed to execute the code in the sections that follow. Additionally, the asynchronous nature of the WebSocket API requires a few other key implementation details.


  

/** -------------------> REAL-TIME STREAMING FROM AN ASSIGNED AUDIO FILE <--------------------
*
* The Java example class includes the following:
*
* - extends javax.websocket.Endpoint to encapsulate the client 
*   side of the WebSocket connection
*
* - implements the javax.websocket.MessageHandler.Whole<String> interface 
*   for receiving WebSocket text messages
*
* - implements the Runnable interface to allow for uploading audio 
*   asynchronously in a separate thread
* 
*/

import java.io.*;
import java.net.*;
import java.nio.file.Paths;
import java.nio.file.Files;
import java.nio.ByteBuffer;
import java.util.*;
import org.glassfish.tyrus.client.ClientManager;
import javax.websocket.*;

public class WSProgram extends Endpoint implements MessageHandler.Whole<String>, Runnable {

   // Your username
   private String username = "yourusername";
   // Your password
   private String password = "yourpassword";
   // to use an API key, replace password with apikey
   // private String apikey = "eyJ0eXA ... iOiJKV1QiLCJh";
   // Server URL
   // "wss://test.nvoq.com" when using Test Environment
   // "wss://healthcare.nvoq.com" when using Healthcare Environment
   private String baseUrl = "wss://test.nvoq.com";
   // "audio/webm" when using WebM format
   // "audio/ogg" when using Ogg format
   // "audio/mp4" when using MPEG-4 format
   // "audio/x-wav" when using WAVE format
   private String audioContentType = "audio/webm";
   // "webm" when using WebM format
   // "ogg" when using Ogg format
   // "mp4" when using MPEG-4 format
   // "pcm-16khz" when using WAVE format
   private String audioFormat = "webm";
   // The url for the WebSocket Dictation Topics:
   //    behavioral_health
   //    cardiology
   //    chiropractic
   //    general_medicine   
   //    - aka Home Healthcare and Hospice
   //    fisher_general_medicine
   //    narrative
   //    orthopedics
   //    pathology
   //    radiology
   //    surgery
   //    veterinary
   private String url = baseUrl + "/wsapi/v2/dictation/topics/general_medicine";
   // audio file to stream to the server
   private String audioFileName = "testaudio.webm";
   // variable to hold the WebSocket Session reference
   private Session mySession = null;
   
   /*
    * This method connects to the WebSocket server.
    */
   public void connect() {
      javax.websocket.Endpoint myEndpoint = this;
      java.net.URI uri = java.net.URI.create(url);
      try {
         ClientManager.createClient().connectToServer(myEndpoint, uri);
      } catch (Exception e) {
         System.out.println("exception: " + e.toString());
      }
   }   
}
 

<!-- ------>  REAL-TIME STREAMING FROM A SELECTED AUDIO FILE  <------- -->
<!-- WebSocket JavaScript How-To.  The script below                    -->
<!-- performs a basic WebSocket dictation.                             -->
<!-- ----------------------------------------------------------------- -->
<html>
<meta charset="UTF-8">
<body>
   
   <p>nVoq WebSocket HowTo</p>
   <p>Choose the audio file to upload.</p>
   <input type="file" id="fileinput" />
   <br />
   <br />
   <div id="working">NOT STARTED YET</div>
   <textarea rows="25" cols="75" id="results">Dictation results will appear here</textarea>

   <script>

      //Message to be sent to start dictation session
      var start = {
         "apiVersion" : "1.0",
         "method" : "STARTDICTATION",
         "params" : {
            "id" : "yourUserName", //Enter the user id here
            "authorization" : "yourPassword", //Enter the password here
            //to use an API key, replace the authorization parameter with apikey
            //"apikey" : "eyJ0eXA ... iOiJKV1QiLCJh",
            "audioFormat" : {
               "encoding" : "pcm-16khz", //pcm-16khz|ogg|webm|mp4|pcm-8khz
               "sampleRate" : 16000
            },
         }
      };
      //Message to be sent that session is done
      var done = {
         "apiVersion" : "1.0",
         "method" : "AUDIODONE",
      };

      var text = "";
      var status = "WORKING";
      var connected = false;

      function readSingleFile(evt) {
         //implementation will go here...
      }

      document.getElementById('fileinput').addEventListener('change',
            readSingleFile, false);
   </script>

</body>
</html>
  

  

//-----------------------> REAL-TIME STREAMING FROM MICROPHONE <---------------------------+
//                                                                                         +
// This C# program will receive input from a microphone, send to the dictation server,     +
// receive the dictation, then display the dictation in the console window.                +
//                                                                                         +
//-----------------------------------------------------------------------------------------+

using NAudio.Wave;
using Newtonsoft.Json;
using Newtonsoft.Json.Serialization;
using System;
using System.ComponentModel;
using System.Security.Authentication;
using System.Threading;
using WebSocketSharp;

namespace CSharpWebSocketDictationSample
{
    //------------------------------------------------------------------------------------------
    // This class implements the nVoq web socket API
    // And, as a C# developer, you are the lucky winner of
    // code that reads audio from the microphone as well!
    //------------------------------------------------------------------------------------------
    class CSharpWebSocketDictationSample
    {
        ///  Program configuration 
        const string Username = "yourUsername";
        const string Password = "yourPassword";
        //to use an api key...
        //const string apikey = "eyJ0eXA ... iOiJKV1QiLCJh";
        const string ServiceUrl = "wss://test.nvoq.com/wsapi/v2/dictation/topics/general_medicine";
        const int TimeoutMillis = 10000; // Use a timeout of 10 seconds for most events

        ///  Program state variables 
        // Will be a WebSocketSharp client socket
        readonly WebSocket _webSocket;
        // NAudio will be used to read audio from the OS default microphone
        readonly WaveInEvent _waveSource;
        // In a GUI we wouldn't need waitable event objects, rather we'd update the UI/state directly from async callbacks.
        // But since this is a command-line tester, we'll set up a unique waitable event for every major step of the program.
        readonly AutoResetEvent _signalConnected = new AutoResetEvent(false);
        readonly AutoResetEvent _signalStartDictationResponseReceived = new AutoResetEvent(false);
        readonly AutoResetEvent _signalRecordingStopped = new AutoResetEvent(false);
        readonly AutoResetEvent _signalTextDoneReceived = new AutoResetEvent(false);
        // Collect some additional status information about which callbacks did what after an event fires.
        volatile bool _timedOut = false; // Reused for all waits
        volatile string _connectionResult;
        volatile string _startDictationResult;
        volatile int _countAudioBytesSent;
        volatile int _lastLogAudioBytesSent;
        
        //.....More code to follow
    }
}
  

Step 2: Start Dictation

Once the WebSocket connection is established, start the dictation.
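Whatever the language, the STARTDICTATION payload is plain JSON, so building it with a serializer rather than string concatenation avoids escaping mistakes. A minimal JavaScript sketch, using field names from the samples in this step (the helper function name is ours, not part of the API):

```javascript
// Build the STARTDICTATION message as an object and serialize it,
// rather than concatenating strings. Field names follow the
// protocol messages used throughout this guide.
function buildStartDictation(username, password, encoding) {
  return JSON.stringify({
    apiVersion: "1.1",
    method: "STARTDICTATION",
    params: {
      id: username,
      authorization: password, // or apikey: "..." instead
      audioFormat: { encoding: encoding, sampleRate: 16000 },
      returnSubscriptions: ["HYPOTHESISTEXT", "STABLETEXT"],
    },
  });
}

const parsed = JSON.parse(buildStartDictation("user", "pass", "webm"));
console.log(parsed.method);                      // "STARTDICTATION"
console.log(parsed.params.audioFormat.encoding); // "webm"
```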

  
   
   /*
    * When the connection is established, onOpen is called.  From here we add
    * this object as the message handler and start the dictation.
    */
   @Override
   public void onOpen(Session session, EndpointConfig config) {
      try {
         session.addMessageHandler(this);
         mySession = session;
         this.startDictation();
      } catch (Exception e) {
         System.out.println("exception: " + e.toString());
      }
   }
   
   /*
    * This method creates and sends the JSON formatted start dictation message 
    * to the dictation server via the WebSocket session.
    */
   private void startDictation() {

      try {

         String startDictationMessage = new String("{\n"   
               + "\"apiVersion\": \"1.1\","
               + "\"method\": \"STARTDICTATION\","
               + "\"params\": {" + "\"id\": \"" + username + "\","
               + "\"authorization\": \"" + password + "\","
               //
               // to use API key, substitute the following  line
               // for the authorization line above.
               // + "\"apikey\": \"" + apiKey + "\","
               //
               + "\"externalId\": \"WebSocket Dictation - IntelliJ Java\","
               + "\"audioFormat\": {"
                  + "\"encoding\": \"" + audioFormat + "\","
                  + "\"sampleRate\": 16000" + "},"
               + "\"snsContext\": {"
                  + "\"dictationContext\": \"\","
                  + "\"selectionOffset\": 0,"
                  + "\"selectionEndIndex\": 0" + "},"
               + "\"returnSubscriptions\":["
                  + "\"HYPOTHESISTEXT\","
                  + "\"STABLETEXT\""
               + "],"
               // + "\"commandList\": ["
               //   + "{\"commandId\":\"1\", \"commandPhrase\":\"next field\", \"weight\":1" +"},"
               //   + "{\"commandId\":\"2\", \"commandPhrase\":\"previous field\", \"weight\":1" +"},"
               //   + "{\"commandId\":\"3\", \"commandPhrase\":\"jump up\", \"weight\":1" +"},"
               //   + "{\"commandId\":\"4\", \"commandPhrase\":\"go 1 down\", \"weight\":1" +"},"
               //   + "{\"commandId\":\"5\", \"commandPhrase\":\"go 2 down\", \"weight\":1" +"},"
               //   + "{\"commandId\":\"6\", \"commandPhrase\":\"go 3 down\", \"weight\":1" +"},"
               //   + "{\"commandId\":\"7\", \"commandPhrase\":\"go to start\", \"weight\":1" +"},"
               //   + "{\"commandId\":\"8\", \"commandPhrase\":\"jump to beginning\", \"weight\":1" +"},"
               //   + "{\"commandId\":\"9\", \"commandPhrase\":\"go to beginning\", \"weight\":1" +"},"
               //   + "{\"commandId\":\"10\", \"commandPhrase\":\"go to first word\", \"weight\":1" +"},"
               //   + "{\"commandId\":\"11\", \"commandPhrase\":\"jump down\", \"weight\":1" +"},"
               //   + "{\"commandId\":\"12\", \"commandPhrase\":\"focus chronic renal failure\", \"weight\":1" +"},"
               //   + "{\"commandId\":\"13\", \"commandPhrase\":\"bottom line\", \"weight\":1.0" + "}"
               // + "],"
               + "\"additionalParams\": {"
                  + "\"clientVendor\": \"ACME Inc.\","
                  + "\"clientProduct\": \"ACME WebSocket Program\","
                  + "\"clientVersion\": \"2.1.0\""
                  // + "\"clientVendor\": \"\","
                  // + "\"clientProduct\": \"\","
                  // + "\"clientVersion\": \"\""
               + "}"
            + "}"
         + "}");
         mySession.getBasicRemote().sendText(startDictationMessage);

      } catch (Exception e) {
         System.out.println(e.toString());
      }
   }
  
    //Message to be sent to start dictation session
      var start = {
         "apiVersion" : "1.0",
         "method" : "STARTDICTATION",
         "params" : {
            "id" : "yourUserName", //Enter the user id here
            "authorization" : "yourPassword", //Enter the password here
            //for api key, replace authorization with api key
            //"apikey" : "eyJ0eXA ... iOiJKV1QiLCJh",
            "audioFormat" : {
               "encoding" : "pcm-16khz", //pcm-16khz|ogg|webm|mp4|pcm-8khz
               "sampleRate" : 16000
            },
         }
      };
      
      //Now, connect to the server, and send the start message above...
      function readSingleFile(evt) {
         //Retrieve the first (and only) File from the FileList object
         var f = evt.target.files[0];

         if (f) {
            var r = new FileReader();
            r.onload = function(e) {

               //Open a websocket session
               var ws = new WebSocket(
                     "wss://test.nvoq.com/wsapi/v2/dictation/topics/general_medicine")

               //On open of WebSocket session
               ws.onopen = function(event) {
                  //Send start message to server
                  ws.send(JSON.stringify(start));
                  //...
               };
               
               //...
  

  

// Create a new instance of our sample program object and
// call the class Main method...
static void Main(string[] args)
{
    CSharpWebSocketDictationSample program = new CSharpWebSocketDictationSample();
    program.Main();
}
  
// Constructor sets up a wav source (microphone audio) 
// and then opens connection to web socket
public CSharpWebSocketDictationSample()
{
    //NAudio simple access to microphone audio...
    _waveSource = new WaveInEvent();
    // Record and transmit audio in quarter second intervals. The interval isn't super important.
    // A smaller buffer gets you slightly more responsive text updates, and a larger buffer
    // uses slightly less CPU and bandwidth (framing overhead).
    _waveSource.BufferMilliseconds = 250;
    _waveSource.WaveFormat = new WaveFormat(16000, 1);
    _waveSource.DataAvailable += _waveSource_DataAvailable;
    _waveSource.RecordingStopped += _waveSource_RecordingStopped;

    // Construct client WebSocket and register callback functions.
    _webSocket = new WebSocket(ServiceUrl);
    _webSocket.OnClose += _webSocket_OnClose;
    _webSocket.OnError += _webSocket_OnError;
    _webSocket.OnMessage += _webSocket_OnMessage;
    _webSocket.OnOpen += _webSocket_OnOpen;
    //_webSocket.Log.Level = LogLevel.Trace; // If you need more logging about what the WebSocket library is doing
    //_webSocket.SslConfiguration.EnabledSslProtocols = SslProtocols.Tls11 | SslProtocols.Tls12;
    // ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls11 | SecurityProtocolType.Tls12;
    // Note: TLS 1.0 ("Tls") is not supported.
    _webSocket.SslConfiguration.EnabledSslProtocols = (SslProtocols)(768 | 3072); // Tls11, Tls12
}  

//Class main method.  This controls the flow of execution for the sample program.
private void Main()
{
    LogMessage("URL: " + ServiceUrl);
    LogMessage("Username: " + Username);

    Prompt("Press enter to connect WebSocket.");
    LogMessage("Attempting to connect...");
    _webSocket.ConnectAsync();
    AwaitSignal(_signalConnected);
    if (_connectionResult != "connected")
      ProgramExit("WebSocket failed to connect.");
    LogMessage("WebSocket connected.");

    Prompt("Press enter to send STARTDICTATION message.");
    SendStartDictationMessage();
    LogMessage("Waiting for server to respond to STARTDICTATION.");
    AwaitSignal(_signalStartDictationResponseReceived);
    if (_timedOut)
      ProgramExit("Timed out waiting for STARTDICTATION response from server.");
    if (_startDictationResult != "accepted")
      ProgramExit("STARTDICTATION not accepted: " + _startDictationResult);
    LogMessage("Server accepted STARTDICTATION.");

    Prompt("Press enter to start recording and transmitting audio.");
    StartAudioRecording();
    LogMessage("Recording started. You should see transcription results start to arrive from the server via TEXT messages.");

    Prompt("Press enter to stop recording audio and send AUDIODONE to server.");
    _waveSource.StopRecording();
    AwaitSignal(_signalRecordingStopped);
    if (_timedOut)
      ProgramExit("Timed out waiting for audio recording to stop.");
    LogMessage("Recording finished. Sending AUDIODONE.");

    SendAudioDoneToServer();

    LogMessage("Waiting for DONE message from server.");
    AwaitSignal(_signalTextDoneReceived);
    if (_timedOut)
      ProgramExit("Timed out waiting for DONE message from server.");

    ProgramExit("Final transcription received.");
}
  
//method creates a message to start the dictation and sends
//it to the server over the open WebSocket
private void SendStartDictationMessage()
{
    WebSocketDictationMessage msg = new WebSocketDictationMessage();
    msg.Method = "STARTDICTATION";
    msg.Params = new WebSocketDictationMessage.JobParams();
    msg.Params.Id = Username;
    msg.Params.Authorization = Password;
    // msg.Params.Apikey = apikey;
    msg.Params.AudioFormat = new WebSocketDictationMessage.AudioFormat();
    msg.Params.AudioFormat.Encoding = "pcm-16khz";
    msg.Params.AudioFormat.SampleRate = 16000;
    // {"STABLETEXT"} = request stable text
    // {"HYPOTHESISTEXT"} = request hypothesis text
    // {"STABLETEXT", "HYPOTHESISTEXT"} = request both stable text and hypothesis text
    msg.Params.ReturnSubscriptions = new string[] { "STABLETEXT" };
    SendJSONMessageToServer(msg);
}

//Utility method for sending JSON encoded messages over the open WebSocket
private void SendJSONMessageToServer(WebSocketDictationMessage msg)
{
    JsonSerializerSettings settings = new JsonSerializerSettings
    {
         DefaultValueHandling = DefaultValueHandling.Ignore,
         ContractResolver = new DefaultContractResolver()
         {
            NamingStrategy = new CamelCaseNamingStrategy()
         }
    };
    string jsonStr = JsonConvert.SerializeObject(msg, Formatting.None, settings);
    SendTextMessageToServer(jsonStr);
}

  

Step 3: Upload Audio

Each implementation below uploads audio according to the platform specifics.
If you don't have an audio file readily available, you can download one here.
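Whichever platform you use, the upload loop reduces to slicing the audio into chunks and sending each one over the socket. A JavaScript sketch (the helper is illustrative, not part of the API); note that the final chunk is clamped to the end of the buffer so no padding bytes are sent:

```javascript
// Split an audio buffer into fixed-size chunks for upload.
// The final chunk may be shorter than chunkSize; clamping with
// Math.min avoids reading past the end of the buffer.
function chunkAudio(bytes, chunkSize) {
  const chunks = [];
  for (let pos = 0; pos < bytes.length; pos += chunkSize) {
    chunks.push(bytes.slice(pos, Math.min(pos + chunkSize, bytes.length)));
  }
  return chunks;
}

const audio = new Uint8Array(25);  // stand-in for real file contents
const chunks = chunkAudio(audio, 10);
console.log(chunks.length);        // 3
console.log(chunks[2].length);     // 5
```

In a real client each chunk would be passed to the WebSocket's send method, as the samples in this step do.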


  

   /*
    * This method reads an entire sample audio file and uploads it in 10 equal
    * parts to the server. This keeps the sample focused on API usage. In a
    * real-time dictation application, the audio is typically read from the
    * microphone.
    */
   private void uploadAudio() {

      byte[] postData;
      try {
         postData = Files.readAllBytes(Paths.get(audioFileName));
         int chunkSize = postData.length / 10;

         for (int position = 0; position < postData.length; position += chunkSize) {
            // Clamp the final chunk so copyOfRange does not pad the audio with zero bytes.
            int end = Math.min(position + chunkSize, postData.length);
            byte[] buf = Arrays.copyOfRange(postData, position, end);
            mySession.getBasicRemote().sendBinary(ByteBuffer.wrap(buf));
            Thread.sleep(1000);
         }

      } catch (Exception e) {
         System.out.println(e.toString());
      }
   }
   
   /*
    * This method sends a message to the server signaling that all the audio has
    * been uploaded and it can finalize the dictation.
    */
   private void audioDone() {

      try {
         String audioDoneMessage = "{"
               + "\"apiVersion\": \"1.1\","
               + "\"method\": \"AUDIODONE\""
               + "}";
         mySession.getBasicRemote().sendText(audioDoneMessage);
      } catch (Exception e) {
         System.out.println(e.toString());
      }
   }
   
   /*
    * To illustrate the concurrent nature of WebSocket dictations, we use this
    * to kick off the audio upload process.
    */
   public void startAudioUpload() {
      Thread t = new Thread(this);
      t.start();
   }

   /*
    * Upload the audio in parallel. When finished, send an audio done message.
    */
   public void run() {
      this.uploadAudio();
      this.audioDone();
   }
  
  
  


     //...
     var r = new FileReader();
     r.onload = function(e) {

     //...

     //Send audio file to server
     ws.send(r.result);
     //...
  
  

  

  
        //Start the flow of audio from the sound card/microphone
        private void StartAudioRecording()
        {
            _waveSource.StartRecording();
        }
  
        //When audio becomes available, send it over the socket to the server
        private void _waveSource_DataAvailable(object sender, WaveInEventArgs e)
        {
            byte[] bytes = new byte[e.BytesRecorded];
            Array.Copy(e.Buffer, 0, bytes, 0, e.BytesRecorded);
            if (_webSocket.IsAlive)
            {
                _webSocket.Send(bytes);
                _countAudioBytesSent += e.BytesRecorded;
                // Only log every N seconds of audio so as not to overwhelm the screen with log messages
                int loggingIntervalInSeconds = 2;
                int audioBytesPerSecond = _waveSource.WaveFormat.AverageBytesPerSecond;
                int loggingIntervalInBytes = audioBytesPerSecond * loggingIntervalInSeconds;
                int bytesTransmittedSinceLastLog = _countAudioBytesSent - _lastLogAudioBytesSent;
                if (bytesTransmittedSinceLastLog >= loggingIntervalInBytes)
                {
                    _lastLogAudioBytesSent = _countAudioBytesSent;
                    LogMessage("Number of audio bytes transmitted so far: " + _countAudioBytesSent);
                }
            }
        }
        
        //when the audio is done, call this (in the Main() method above...)
        private void SendAudioDoneToServer()
        {
            WebSocketDictationMessage msg = new WebSocketDictationMessage();
            msg.Method = "AUDIODONE";
            SendJSONMessageToServer(msg);
        }
        
  

Step 4: Receive Results

When messages are received on the WebSocket, they are passed to the message-processing method according to the API and the handler registration in Step 2 above.
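However the messages arrive, the handler's job is the same: parse the JSON, surface errors, and watch for textDone. A hedged JavaScript sketch: the data/error/textDone field names follow the samples in this step, while the exact text field varies by subscription (the JavaScript sample reads substitutedText), so the raw data object is returned as-is.

```javascript
// Classify an incoming server message. Shapes follow the samples
// in this step: results arrive under "data", errors under "error",
// and the final result sets textDone to true.
function classifyMessage(raw) {
  const msg = JSON.parse(raw);
  if (msg.error) return { kind: "error", detail: msg.error.message };
  if (msg.data && msg.data.textDone === true) return { kind: "final", data: msg.data };
  if (msg.data) return { kind: "partial", data: msg.data };
  return { kind: "status" }; // e.g. the STARTDICTATION acknowledgement
}

console.log(classifyMessage('{"method":"TEXT","data":{"textDone":false}}').kind); // "partial"
console.log(classifyMessage('{"method":"TEXT","data":{"textDone":true}}').kind);  // "final"
```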


  
   /*
    * When we receive a message from the server, write it out to the console.
    */
   @Override
   public void onMessage(String message) {
      System.out.println("Received message: " + message);

      /*
       * Once the server confirms the dictation is started, begin
       * uploading the audio in a separate thread.
       */
      if (message.contains("STARTDICTATION")) {
         this.startAudioUpload();
      }
      /*
       * Look for the text done message.  More complete JSON message
       * handling would be appropriate for a production
       * implementation.
       */
      if (message.contains("\"textDone\":true")){
         try{
            //close the session if finished.
            mySession.close();
         }catch(Exception e){
            System.out.println("exception: " + e.toString());
         }
      }

   }

  

  //Actions to take when a message is received from the server
  ws.onmessage = function(message) {
     connected = true;
     var result = JSON.parse(message.data);
     if (typeof result.data !== "undefined") {
        //See if error is returned by server
        if (typeof result.error == "undefined") {
           document.getElementById('working').innerHTML = status;
           status += "."
           //Save substituted text value of JSON object returned by the server
           text = result.data.substitutedText;
        } else {
           //Save error message value of JSON object returned by the server
           text = "Dictation server returned error: "
                  + result.error.message;
           //Close the websocket connection
           ws.close();
        };
     };
     if (typeof result.data !== "undefined"
         && result.data.textDone === true) {
        //Close the websocket connection
        ws.close();
     }
   };
     
  

  

        //websocket api calls this method when new data is available
        private void _webSocket_OnMessage(object sender, MessageEventArgs args)
        {
            LogMessage("\n<- Server-to-Client Message: " + args.Data + "\n");

            if (args.IsBinary)
            {
                ProgramExit("Server unexpectedly sent us binary data.");
            }

            WebSocketDictationMessage msg = JsonConvert.DeserializeObject<WebSocketDictationMessage>(args.Data);
            string method = msg.Method;

            if ("TEXT" == method)
            {
                if (msg.Data.TextDone)
                    _signalTextDoneReceived.Set();
                LogMessage("*** Press enter to stop recording");
            }
            else if ("STARTDICTATION" == method)
            {
                if (msg.Error != null)
                {
                    _startDictationResult = "error." + msg.Error.Reason;
                }
                else
                {
                    _startDictationResult = "accepted";
                }
                _signalStartDictationResponseReceived.Set();
            }
            else if (msg.Error != null)
            {
                ProgramExit("Received error from server. Reason: " + msg.Error.Reason + ", Message: " + msg.Error.Message);
            }
        }
  

Full Sample Code



Below is the full sample code. Copy and paste the entire contents of the code below into your favorite editor and save it locally on your machine. Modify the URLs and username/password according to your credentials and system access. Then run the program and enjoy all the excitement of securely converting audio to text via the nVoq API platform.



If you have any questions, please reach out to support@nvoq.com.

© 2024 nVoq Inc. | Privacy Policy