Word Markers & Audio Statistics

Use word markers to match audio segments with the words they produced, for example to build corrected transcripts or to extract audio clips along with their associated text.

There are some constraints on using word markers; in particular, word markers are cleaned up (deleted) after 60 days.

There are two ways to get word markers: the New Method and the Old Method.

 

New Method

The new method supplies a list of word marker descriptions that is the same whether delivered in a WebSocket message or fetched via an HTTP request. If Stable Text is used, the markers are supplied at the same time as the Stable Text; otherwise the markers are made available at the end of the dictation.

 

In the HTTP case, the dictation details can be fetched, and the response will include a wordMarkers property.

HTTP/1.1 200 OK

Content-Type: text/xml

Transfer-Encoding: chunked

 

4035

 

<?xml version="1.0" encoding="UTF-8"?>
<dictation>
   <done>true</done>
   <id>dict-POItPoZrQYapUS0aSByH0Q</id>
   <profile>profilename</profile>
   <queuedTime>1438810593324</queuedTime>
   <statsAsProperties>
      <entry> <key>wordMarkers</key> <value>

                            [{"audioStart":0.39,"audioLength":0.32999998,"textStart":0,"textLength":2,"text":"My"},
                             {"audioStart":0.71999997,"audioLength":0.51,"textStart":3,"textLength":4,"text":"text"},
                             {"audioStart":1.26,"audioLength":0.29,"textStart":8,"textLength":2,"text":"so"},
                             {"audioStart":1.55,"audioLength":0.41,"textStart":11,"textLength":3,"text":"far"},
                             {"audioStart":2.0,"audioLength":0.48,"textStart":14,"textLength":1,"text":"."}]

      </value> </entry>
      <!-- other entries omitted for brevity -->
   </statsAsProperties>
   <statsString><!-- omitted for brevity --></statsString>
   <url>https://server/SCDictation/rest/{tenant_key}/dictations/dict-POItPoZrQYapUS0aSByH0Q</url>
   <audioName>dictation-audio-dict-POItPoZrQYapUS0aSByH0Q</audioName>
   <tenantDatabase>MyTenant_1</tenantDatabase>
   <textName>dictation-text-dict-POItPoZrQYapUS0aSByH0Q</textName>
   <canceled>false</canceled>
   <canceledTime>0</canceledTime>
   <phonemesName>dictation-phonemes-dict-POItPoZrQYapUS0aSByH0Q</phonemesName>
   <streaming>true</streaming>
   <substitutedTextName>dictation-subst-dict-POItPoZrQYapUS0aSByH0Q</substitutedTextName>
</dictation>

0

 

In the WebSocket case, the message will contain a markers field.

 

{
	"data" : {
		"kind" : "HYPOTHESISTEXT",
		"text" : "My text so far.",
		"substitutedText" : "My text so far.",
                "markers" : [{"audioStart":0.39,"audioLength":0.32999998,"textStart":0,"textLength":2,"text":"My"},
                             {"audioStart":0.71999997,"audioLength":0.51,"textStart":3,"textLength":4,"text":"text"},
                             {"audioStart":1.26,"audioLength":0.29,"textStart":8,"textLength":2,"text":"so"},
                             {"audioStart":1.55,"audioLength":0.41,"textStart":11,"textLength":3,"text":"far"},
                             {"audioStart":2.0,"audioLength":0.48,"textStart":14,"textLength":1,"text":"."}]
	}
}
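Whichever way the markers arrive, the fields can be applied directly to the transcript and audio. A minimal sketch in Java, using the example marker values above; it assumes audioStart/audioLength are in seconds and that the audio is 16-bit (2 bytes per sample) 16 kHz mono PCM, as in the old method below — treat both as assumptions, not guarantees:

```java
// Sketch: applying a word marker from the new method to its transcript/audio.
// ASSUMPTIONS: audioStart/audioLength are seconds; audio is 16-bit 16 kHz mono PCM.
public class WordMarkerDemo {
    static final int BYTES_PER_SECOND = 16000 * 2; // 16 kHz * 2 bytes per sample

    // Character span of the word inside the transcript.
    static String wordText(String transcript, int textStart, int textLength) {
        return transcript.substring(textStart, textStart + textLength);
    }

    // Byte offset into the PCM stream for a time in seconds.
    static int byteOffset(double seconds) {
        int offset = (int) Math.round(seconds * BYTES_PER_SECOND);
        return offset - (offset % 2); // keep 16-bit sample alignment
    }

    public static void main(String[] args) {
        String transcript = "My text so far.";
        // First marker from the example above: textStart 0, textLength 2, audioStart 0.39.
        System.out.println(wordText(transcript, 0, 2)); // My
        System.out.println(byteOffset(0.39));           // 12480
    }
}
```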

 

 

Word Marker Object Fields and Values

Each word marker object has five fields, as shown in the examples above: audioStart and audioLength locate the word in the audio (in seconds), textStart and textLength locate the word in the transcript (character offset and length), and text holds the word itself.

Substitutions can cause markers to merge (two or more word markers become one larger word marker).
For example, a RegEx substitution with Written Form: .* and Spoken Form: $0 (that is, "match the entire transcript and replace it with the entire transcript") still produces the entire transcript, but only one word marker, spanning the entire utterance and audio file. If a substitution is making your word markers too coarse, remove or rewrite it.

 

 

The Old Method

We are in the process of reformatting our API documentation. Please see our new documentation for this web service at https://test.nvoq.com/apidoc/dictation#operation/GetAudioProps

The old method is only available for HTTP dictation requests and is detailed below.

Two calls are used to get the markers:

GET /SCDictation/rest/{tenant_key}/dictations/{dictationIdentifier}/audio/stats
GET /SCDictation/rest/{tenant_key}/dictations/{dictationIdentifier}/text/stats

These two calls are made after checking the Dictation status and before de-registering the Dictation observer.

It is important to note that the calls return only the markers and NOT the audio. The client is responsible for having or requesting the audio before de-registering the Dictation observer.

Request

The client makes a request for audio statistics.

Audio is processed for markers at 16 kHz. If 8 kHz audio is used, you must divide the results in half, or convert the audio to 16 kHz, in order to get markers that line up properly.
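For the 8 kHz case, the "divide the results in half" rule can be sketched as a small helper; the sample-alignment step assumes 16-bit (2-byte) samples:

```java
// Sketch: adjust a marker byte offset (computed against 16 kHz audio)
// for use against 8 kHz audio, per the halving rule above.
public class MarkerRateAdjust {
    static long toEightKhzOffset(long sixteenKhzByteOffset) {
        long halved = sixteenKhzByteOffset / 2;
        return halved - (halved % 2); // stay on a 16-bit sample boundary
    }

    public static void main(String[] args) {
        System.out.println(toEightKhzOffset(34880)); // 17440
    }
}
```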

GET /SCDictation/rest/{tenant_key}/dictations/{dictationIdentifier}/audio/stats HTTP/1.1

Accept: text/xml

Accept-Language: en-us,en;q=0.5

Accept-Encoding: gzip, deflate

Authorization: Basic <auth>

Single Sign-On Request

The only thing different about SSO is the “Authorization:” header.

Authorization: Bearer <user>:<apikey> 

Response

The AudioMarkers and WordAudioMarkers values pertain to segment and word audio markers, respectively. These values are pairs of indices that represent the start and end byte offsets of each word or segment in the dictation audio.

Segment markers may not be valid (but word markers are).

 

The server returns the audio properties.

HTTP/1.1 200 OK

Content-Type: text/xml

Transfer-Encoding: chunked

 

2b7

<?xml version="1.0" encoding="UTF-8" standalone="no"?>

<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">

<properties>

<entry key="numWordAudioSegments">20</entry>

<entry key="AudioDoneTimeInMillis">1491413790594</entry>

<entry key="AudioMarkers">[0, 288000]</entry>

<entry key="TenantDatabase"><tenant_key></entry>

<entry key="numWordSegments">20</entry>

<entry key="numSegments">1</entry>

<entry key="WordAudioMarkers">[34880, 46080, 46080, 64000, 155840, 169600, 169600, 183680, 275520, 288000, 288000, 308160]</entry>

<entry key="AudioURL">/SCFileserver/audio/<tenant_key>/IKpd-HjvRvSHh--oXtF5Mw</entry>

</properties>

 

0
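The WordAudioMarkers entry is a bracketed, comma-separated list that must be split into (start, end) byte pairs. A minimal parsing sketch, using the format shown in the example response above:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: parse a WordAudioMarkers value into (start, end) byte-offset pairs.
public class AudioMarkerPairs {
    static List<long[]> parsePairs(String value) {
        String[] parts = value.replaceAll("[\\[\\]\\s]", "").split(",");
        List<long[]> pairs = new ArrayList<>();
        for (int i = 0; i + 1 < parts.length; i += 2) {
            pairs.add(new long[] { Long.parseLong(parts[i]), Long.parseLong(parts[i + 1]) });
        }
        return pairs;
    }

    public static void main(String[] args) {
        String markers = "[34880, 46080, 46080, 64000, 155840, 169600]";
        for (long[] pair : parsePairs(markers)) {
            System.out.println(pair[0] + " -> " + pair[1]);
        }
    }
}
```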

Request

The client makes a request for text properties.

GET /SCDictation/rest/{tenant_key}/dictations/{dictationIdentifier}/text/stats HTTP/1.1

Accept: text/xml

Accept-Language: en-us,en;q=0.5

Accept-Encoding: gzip, deflate

Authorization: Basic <auth>

Single Sign-On Request

The only thing different about SSO is the “Authorization:” header.

Authorization: Bearer <user>:<apikey>

Response

The TextMarkers and WordTextMarkers values pertain to segment and word text markers, respectively. These values are pairs of indices that represent the start and end of each word or segment in the dictation text.

For the Enhanced dictation server, segment markers may not be valid (but word markers are).

 

The server returns the text properties.

1dc

<?xml version="1.0" encoding="UTF-8" standalone="no"?>

<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">

<properties>

<entry key="Version">12</entry>

<entry key="TextDoneTimeInMillis">1491413799127</entry>

<entry key="DictationLength">126</entry>

<entry key="numWordSegments">6</entry>

<entry key="DictationWords">8</entry>

<entry key="Confidence">-1.0</entry>

<entry key="Transcription">This is a test of dictation.\n</entry>

<entry key="TenantDatabase"><tenant_key></entry>

<entry key="TextMarkers">[0, 156]</entry>

<entry key="TextURL">/SCDictation/rest/<tenant_key>/dictations/bMyRGK-XTc-WuJR6C06hNg/text</entry>

<entry key="numWordTextSegments">8</entry>

<entry key="SubstitutedText">This is a test of dictation.\n</entry>

<entry key="WordTextMarkers">[0, 3, 5, 6, 8, 8, 10, 13, 15, 16, 18, 26, 27, 27, 28, 28]</entry>

<entry key="numSegments">1</entry>

</properties>

 

0

 

 

From /text/stats you get string indexes into the final transcription text. The indexes come in pairs, so there will always be an even number of entries in the list. The first value of a pair indicates the start position of a marker in the text, and the second value indicates the end position of that marker in the text.
WordTextMarkers=[0, 2, 3, 3, 5, 11, 13, 14]
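As a sketch, the pairs can be mapped onto the transcript with substring calls. Judging from the examples (e.g. [0, 3] covering "This" in the /text/stats response above), the end index of each pair is inclusive, so one is added before calling substring:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: map WordTextMarkers index pairs onto the transcription text.
// The end index of each pair appears to be inclusive, hence the "+ 1".
public class TextMarkerWords {
    static List<String> words(String text, int[] markers) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < markers.length; i += 2) {
            out.add(text.substring(markers[i], markers[i + 1] + 1));
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "This is a test of dictation.\n";
        int[] markers = { 0, 3, 5, 6, 8, 8, 10, 13, 15, 16, 18, 26, 27, 27 };
        System.out.println(words(text, markers)); // [This, is, a, test, of, dictation, .]
    }
}
```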
 
From /audio/stats you get 16-bit 16kHz PCM byte counts. These values also come in pairs, and indicate the span of audio from which the corresponding text was generated.
WordAudioMarkers=[3518, 23358, 23360, 39038, 60798, 73278, 73278, 76798]
 
To work with timestamps rather than byte counts, an example method that converts a byte count into a timestamp in milliseconds:

    public static long audioByteCountToDurationInMillis(long byteCount)
    {
        double SAMPLES_PER_SECOND = 16000D;
        double sampleCount = byteCount / 2;   // 16-bit audio: 2 bytes per sample
        double timeInSeconds = sampleCount / SAMPLES_PER_SECOND;
        double timeInMillis = timeInSeconds * 1000;
        return (long) timeInMillis;
    }

 

Taken together, WordTextMarkers and WordAudioMarkers form a data structure: the nth pair in each list describes the same word. Using the example data above, the conversion method, and the corresponding transcription text "CNS: Patient is", the markers break down as follows:

           Text Start  Text End  Audio Start Byte = Timestamp  Audio End Byte = Timestamp  Text Substring
  Marker 1          0         2          3518 = 109 ms               23358 = 729 ms        CNS
  Marker 2          3         3         23360 = 730 ms               39038 = 1219 ms       :
  Marker 3          5        11         60798 = 1899 ms              73278 = 2289 ms       Patient
  Marker 4         13        14         73278 = 2289 ms              76798 = 2399 ms       is
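The pairing above can be reproduced in code. A minimal sketch that walks both lists in step, treating text end indexes as inclusive and converting byte counts the same way as the conversion method shown earlier:

```java
// Sketch: pair WordTextMarkers with WordAudioMarkers for "CNS: Patient is".
// End indexes are treated as inclusive; byte counts are converted to millis
// assuming 16-bit (2 bytes/sample) 16 kHz PCM, as in the method above.
public class MarkerTable {
    static long toMillis(long byteCount) {
        return (long) ((byteCount / 2) / 16000.0 * 1000);
    }

    public static void main(String[] args) {
        String text = "CNS: Patient is";
        int[] textMarkers = { 0, 2, 3, 3, 5, 11, 13, 14 };
        long[] audioMarkers = { 3518, 23358, 23360, 39038, 60798, 73278, 73278, 76798 };
        for (int i = 0; i + 1 < textMarkers.length; i += 2) {
            String word = text.substring(textMarkers[i], textMarkers[i + 1] + 1);
            System.out.printf("%-8s %5d ms - %5d ms%n",
                    word, toMillis(audioMarkers[i]), toMillis(audioMarkers[i + 1]));
        }
    }
}
```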