Word Markers & Audio Statistics

Use word markers to match audio segments with the words they produced, for example to build corrected transcripts or to extract audio clips along with their associated text.

There are some constraints on using word markers; in particular, word markers are cleaned up (deleted) after 60 days.

There are two ways to get word markers: the New Method and the Old Method.

 

New Method

The new method supplies a list of word marker descriptions that is the same whether delivered in a WebSocket message or fetched via an HTTP request. If Stable Text is used, the markers are supplied at the same time as the Stable Text; otherwise the markers are made available at the end of the dictation.

 

In the HTTP case, the dictation details can be fetched, and the response will include a wordMarkers property.

HTTP/1.1 200 OK

Content-Type: text/xml

Transfer-Encoding: chunked

 

4035

 

<?xml version="1.0" encoding="UTF-8"?>
<dictation>
   <done>true</done>
   <id>dict-POItPoZrQYapUS0aSByH0Q</id>
   <profile>profilename</profile>
   <queuedTime>1438810593324</queuedTime>
   <statsAsProperties>
      <entry> <key>wordMarkers</key> <value>

                            [{"audioStart":0.39,"audioLength":0.32999998,"textStart":0,"textLength":2,"text":"My"},
                             {"audioStart":0.71999997,"audioLength":0.51,"textStart":3,"textLength":4,"text":"text"},
                             {"audioStart":1.26,"audioLength":0.29,"textStart":8,"textLength":2,"text":"so"},
                             {"audioStart":1.55,"audioLength":0.41,"textStart":11,"textLength":3,"text":"far"},
                             {"audioStart":2.0,"audioLength":0.48,"textStart":14,"textLength":1,"text":"."}]

      </value> </entry>
      <!-- other entries omitted for brevity -->
   </statsAsProperties>
   <statsString><!-- omitted for brevity --></statsString>
   <url>https://server/SCDictation/rest/{tenant_key}/dictations/dict-POItPoZrQYapUS0aSByH0Q</url>
   <audioName>dictation-audio-dict-POItPoZrQYapUS0aSByH0Q</audioName>
   <tenantDatabase>MyTenant_1</tenantDatabase>
   <textName>dictation-text-dict-POItPoZrQYapUS0aSByH0Q</textName>
   <canceled>false</canceled>
   <canceledTime>0</canceledTime>
   <phonemesName>dictation-phonemes-dict-POItPoZrQYapUS0aSByH0Q</phonemesName>
   <streaming>true</streaming>
   <substitutedTextName>dictation-subst-dict-POItPoZrQYapUS0aSByH0Q</substitutedTextName>
</dictation>

0

 

In the WebSocket case, the message will contain a markers field.

 

{
	"data" : {
		"kind" : "HYPOTHESISTEXT",
		"text" : "My text so far.",
		"substitutedText" : "My text so far.",
                "markers" : [{"audioStart":0.39,"audioLength":0.32999998,"textStart":0,"textLength":2,"text":"My"},
                             {"audioStart":0.71999997,"audioLength":0.51,"textStart":3,"textLength":4,"text":"text"},
                             {"audioStart":1.26,"audioLength":0.29,"textStart":8,"textLength":2,"text":"so"},
                             {"audioStart":1.55,"audioLength":0.41,"textStart":11,"textLength":3,"text":"far"},
                             {"audioStart":2.0,"audioLength":0.48,"textStart":14,"textLength":1,"text":"."}]
	}
}
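Whichever way the markers arrive, the fields can be applied directly to the transcript and audio. A minimal sketch in Java, using the example marker values above; it assumes audioStart/audioLength are in seconds and that the audio is 16-bit (2 bytes per sample) 16 kHz mono PCM, as in the old method below — treat both as assumptions, not guarantees:

```java
// Sketch: applying a word marker from the new method to its transcript/audio.
// ASSUMPTIONS: audioStart/audioLength are seconds; audio is 16-bit 16 kHz mono PCM.
public class WordMarkerDemo {
    static final int BYTES_PER_SECOND = 16000 * 2; // 16 kHz * 2 bytes per sample

    // Character span of the word inside the transcript.
    static String wordText(String transcript, int textStart, int textLength) {
        return transcript.substring(textStart, textStart + textLength);
    }

    // Byte offset into the PCM stream for a time in seconds.
    static int byteOffset(double seconds) {
        int offset = (int) Math.round(seconds * BYTES_PER_SECOND);
        return offset - (offset % 2); // keep 16-bit sample alignment
    }

    public static void main(String[] args) {
        String transcript = "My text so far.";
        // First marker from the example above: textStart 0, textLength 2, audioStart 0.39.
        System.out.println(wordText(transcript, 0, 2)); // My
        System.out.println(byteOffset(0.39));           // 12480
    }
}
```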

 

 

Word Marker Object Fields and Values

Each word marker object has five fields, as shown in the examples above: audioStart and audioLength locate the word in the audio (in seconds), textStart and textLength locate the word in the transcript (character offset and length), and text holds the word itself.

Substitutions can cause markers to merge (two or more word markers become one larger word marker).
For example, a RegEx substitution with Written Form: .* and Spoken Form: $0 (that is, "match the entire transcript and replace it with the entire transcript") still produces the entire transcript, but only one word marker, spanning the entire utterance and audio file. If a substitution is making your word markers too coarse, remove or rewrite it.

 

 

The Old Method

We are in the process of reformatting our API documentation. Please see our new documentation for this web service at https://test.nvoq.com/apidoc/dictation#operation/GetAudioProps

The old method is only available for HTTP dictation requests and is detailed below.

Two calls are used to get the markers:

GET /SCDictation/rest/{tenant_key}/dictations/{dictationIdentifier}/audio/stats
GET /SCDictation/rest/{tenant_key}/dictations/{dictationIdentifier}/text/stats

These two calls are made after checking the Dictation status and before de-registering the Dictation observer.

It is important to note that the calls return only the markers and NOT the audio. The client is responsible for having or requesting the audio before de-registering the Dictation observer.

Request

The client makes a request for audio statistics.

Audio is processed for markers at 16 kHz. If 8 kHz audio is used, you must divide the results in half, or convert the audio to 16 kHz, in order to get markers that line up properly.
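For the 8 kHz case, the "divide the results in half" rule can be sketched as a small helper; the sample-alignment step assumes 16-bit (2-byte) samples:

```java
// Sketch: adjust a marker byte offset (computed against 16 kHz audio)
// for use against 8 kHz audio, per the halving rule above.
public class MarkerRateAdjust {
    static long toEightKhzOffset(long sixteenKhzByteOffset) {
        long halved = sixteenKhzByteOffset / 2;
        return halved - (halved % 2); // stay on a 16-bit sample boundary
    }

    public static void main(String[] args) {
        System.out.println(toEightKhzOffset(34880)); // 17440
    }
}
```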

GET /SCDictation/rest/{tenant_key}/dictations/{dictationIdentifier}/audio/stats HTTP/1.1

Accept: text/xml

Accept-Language: en-us,en;q=0.5

Accept-Encoding: gzip, deflate

Authorization: Basic <auth>

Single Sign-On Request

The only thing different about SSO is the “Authorization:” header.

Authorization: Bearer <user>:<apikey> 

Response

The AudioMarkers and WordAudioMarkers values pertain to segment and word audio markers, respectively. These values are pairs of indices that represent the start and end byte offsets of each word or segment in the dictation audio.

Segment markers may not be valid (but word markers are).

 

The server returns the audio properties.

HTTP/1.1 200 OK

Content-Type: text/xml

Transfer-Encoding: chunked

 

2b7

<?xml version="1.0" encoding="UTF-8" standalone="no"?>

<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">

<properties>

<entry key="numWordAudioSegments">20</entry>

<entry key="AudioDoneTimeInMillis">1491413790594</entry>

<entry key="AudioMarkers">[0, 288000]</entry>

<entry key="TenantDatabase"><tenant_key></entry>

<entry key="numWordSegments">20</entry>

<entry key="numSegments">1</entry>

<entry key="WordAudioMarkers">[34880, 46080, 46080, 64000, 155840, 169600, 169600, 183680, 275520, 288000, 288000, 308160]</entry>

<entry key="AudioURL">/SCFileserver/audio/<tenant_key>/IKpd-HjvRvSHh--oXtF5Mw</entry>

</properties>

 

0
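The WordAudioMarkers entry is a bracketed, comma-separated list that must be split into (start, end) byte pairs. A minimal parsing sketch, using the format shown in the example response above:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: parse a WordAudioMarkers value into (start, end) byte-offset pairs.
public class AudioMarkerPairs {
    static List<long[]> parsePairs(String value) {
        String[] parts = value.replaceAll("[\\[\\]\\s]", "").split(",");
        List<long[]> pairs = new ArrayList<>();
        for (int i = 0; i + 1 < parts.length; i += 2) {
            pairs.add(new long[] { Long.parseLong(parts[i]), Long.parseLong(parts[i + 1]) });
        }
        return pairs;
    }

    public static void main(String[] args) {
        String markers = "[34880, 46080, 46080, 64000, 155840, 169600]";
        for (long[] pair : parsePairs(markers)) {
            System.out.println(pair[0] + " -> " + pair[1]);
        }
    }
}
```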

Request

The client makes a request for text properties.

GET /SCDictation/rest/{tenant_key}/dictations/{dictationIdentifier}/text/stats HTTP/1.1

Accept: text/xml

Accept-Language: en-us,en;q=0.5

Accept-Encoding: gzip, deflate

Authorization: Basic <auth>

Single Sign-On Request

The only thing different about SSO is the “Authorization:” header.

Authorization: Bearer <user>:<apikey>

Response

The TextMarkers and WordTextMarkers values pertain to segment and word text markers, respectively. These values are pairs of indices that represent the start and end of each word or segment in the dictation text.

For the Enhanced dictation server, segment markers may not be valid (but word markers are).

 

The server returns the text properties.

1dc

<?xml version="1.0" encoding="UTF-8" standalone="no"?>

<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">

<properties>

<entry key="Version">12</entry>

<entry key="TextDoneTimeInMillis">1491413799127</entry>

<entry key="DictationLength">126</entry>

<entry key="numWordSegments">6</entry>

<entry key="DictationWords">8</entry>

<entry key="Confidence">-1.0</entry>

<entry key="Transcription">This is a test of dictation.\n</entry>

<entry key="TenantDatabase"><tenant_key></entry>

<entry key="TextMarkers">[0, 156]</entry>

<entry key="TextURL">/SCDictation/rest/<tenant_key>/dictations/bMyRGK-XTc-WuJR6C06hNg/text</entry>

<entry key="numWordTextSegments">8</entry>

<entry key="SubstitutedText">This is a test of dictation.\n</entry>

<entry key="WordTextMarkers">[0, 3, 5, 6, 8, 8, 10, 13, 15, 16, 18, 26, 27, 27, 28, 28]</entry>

<entry key="numSegments">1</entry>

</properties>

 

0

 

 

From /text/stats you get string indexes into the final transcription text. The indexes come in pairs, so there will always be an even number of entries in the list. The first value of a pair indicates the start position of a marker in the text, and the second value indicates the end position of that marker in the text.
WordTextMarkers=[0, 2, 3, 3, 5, 11, 13, 14]
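As a sketch, the pairs can be mapped onto the transcript with substring calls. Judging from the examples (e.g. [0, 3] covering "This" in the /text/stats response above), the end index of each pair is inclusive, so one is added before calling substring:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: map WordTextMarkers index pairs onto the transcription text.
// The end index of each pair appears to be inclusive, hence the "+ 1".
public class TextMarkerWords {
    static List<String> words(String text, int[] markers) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < markers.length; i += 2) {
            out.add(text.substring(markers[i], markers[i + 1] + 1));
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "This is a test of dictation.\n";
        int[] markers = { 0, 3, 5, 6, 8, 8, 10, 13, 15, 16, 18, 26, 27, 27 };
        System.out.println(words(text, markers)); // [This, is, a, test, of, dictation, .]
    }
}
```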
 
From /audio/stats you get 16-bit 16kHz PCM byte counts. These values also come in pairs, and indicate the span of audio from which the corresponding text was generated.
WordAudioMarkers=[3518, 23358, 23360, 39038, 60798, 73278, 73278, 76798]
 
To work with timestamps rather than byte counts, an example method that converts a byte count into a timestamp in milliseconds:

    public static long audioByteCountToDurationInMillis(long byteCount)
    {
        double SAMPLES_PER_SECOND = 16000D;
        double sampleCount = byteCount / 2;   // 16-bit audio: 2 bytes per sample
        double timeInSeconds = sampleCount / SAMPLES_PER_SECOND;
        double timeInMillis = timeInSeconds * 1000;
        return (long) timeInMillis;
    }

 

Taken together, WordTextMarkers and WordAudioMarkers form a data structure: the nth pair in each list describes the same word. Using the example data above, the conversion method, and the corresponding transcription text "CNS: Patient is", the markers break down as follows:

           Text Start  Text End  Audio Start Byte = Timestamp  Audio End Byte = Timestamp  Text Substring
  Marker 1          0         2          3518 = 109 ms               23358 = 729 ms        CNS
  Marker 2          3         3         23360 = 730 ms               39038 = 1219 ms       :
  Marker 3          5        11         60798 = 1899 ms              73278 = 2289 ms       Patient
  Marker 4         13        14         73278 = 2289 ms              76798 = 2399 ms       is
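The pairing above can be reproduced in code. A minimal sketch that walks both lists in step, treating text end indexes as inclusive and converting byte counts the same way as the conversion method shown earlier:

```java
// Sketch: pair WordTextMarkers with WordAudioMarkers for "CNS: Patient is".
// End indexes are treated as inclusive; byte counts are converted to millis
// assuming 16-bit (2 bytes/sample) 16 kHz PCM, as in the method above.
public class MarkerTable {
    static long toMillis(long byteCount) {
        return (long) ((byteCount / 2) / 16000.0 * 1000);
    }

    public static void main(String[] args) {
        String text = "CNS: Patient is";
        int[] textMarkers = { 0, 2, 3, 3, 5, 11, 13, 14 };
        long[] audioMarkers = { 3518, 23358, 23360, 39038, 60798, 73278, 73278, 76798 };
        for (int i = 0; i + 1 < textMarkers.length; i += 2) {
            String word = text.substring(textMarkers[i], textMarkers[i + 1] + 1);
            System.out.printf("%-8s %5d ms - %5d ms%n",
                    word, toMillis(audioMarkers[i]), toMillis(audioMarkers[i + 1]));
        }
    }
}
```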