Use word markers to match audio segments with words in order to make corrected transcripts or extract audio pieces with the associated text.
There are some constraints to using word markers.
There are two ways to get word markers: the New Method and the Old Method.
The New Method supplies a list of word marker descriptions that is the same whether delivered in a WebSocket message or fetched via an HTTP request. If Stable Text is used, the markers are delivered along with the Stable Text; otherwise, the markers become available at the end of the dictation.
In the HTTP case, the dictation details can be fetched and there will be a wordMarkers property.
HTTP/1.1 200 OK
Content-Type: text/xml
Transfer-Encoding: chunked
4035
<?xml version="1.0" encoding="UTF-8"?>
<dictation>
  <done>true</done>
  <id>dict-POItPoZrQYapUS0aSByH0Q</id>
  <profile>profilename</profile>
  <queuedTime>1438810593324</queuedTime>
  <statsAsProperties>
    <entry>
      <key>wordMarkers</key>
      <value>
        [{"audioStart":0.39,"audioLength":0.32999998,"textStart":0,"textLength":2,"text":"My"},
         {"audioStart":0.71999997,"audioLength":0.51,"textStart":3,"textLength":4,"text":"text"},
         {"audioStart":1.26,"audioLength":0.29,"textStart":8,"textLength":2,"text":"so"},
         {"audioStart":1.55,"audioLength":0.41,"textStart":11,"textLength":3,"text":"far"},
         {"audioStart":2.0,"audioLength":0.48,"textStart":14,"textLength":1,"text":"."}]
      </value>
    </entry>
    <!-- other entries omitted for brevity -->
  </statsAsProperties>
  <statsString><!-- omitted for brevity --></statsString>
  <url>https://server/SCDictation/rest/{tenant_key}/dictations/dict-POItPoZrQYapUS0aSByH0Q</url>
  <audioName>dictation-audio-dict-POItPoZrQYapUS0aSByH0Q</audioName>
  <tenantDatabase>MyTenant_1</tenantDatabase>
  <textName>dictation-text-dict-POItPoZrQYapUS0aSByH0Q</textName>
  <canceled>false</canceled>
  <canceledTime>0</canceledTime>
  <phonemesName>dictation-phonemes-dict-POItPoZrQYapUS0aSByH0Q</phonemesName>
  <streaming>true</streaming>
  <substitutedTextName>dictation-subst-dict-POItPoZrQYapUS0aSByH0Q</substitutedTextName>
</dictation>
0
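A sketch of building the corresponding HTTP request with java.net.http (Java 11+). The URL and Basic credentials here are placeholders, not values from the service:

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class DictationFetch {
    // Build the GET request for the dictation details.
    // The url and basicAuth arguments are caller-supplied placeholders.
    public static HttpRequest buildRequest(String url, String basicAuth) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("Accept", "text/xml")
                .header("Authorization", "Basic " + basicAuth)
                .GET()
                .build();
    }
}
```

The request can then be sent with `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())` and the wordMarkers entry read out of the returned XML body.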
In the WebSocket case, the message will contain a markers field.
{
  "data" : {
    "kind" : "HYPOTHESISTEXT",
    "text" : "My text so far.",
    "substitutedText" : "My text so far.",
    "markers" : [
      {"audioStart":0.39,"audioLength":0.32999998,"textStart":0,"textLength":2,"text":"My"},
      {"audioStart":0.71999997,"audioLength":0.51,"textStart":3,"textLength":4,"text":"text"},
      {"audioStart":1.26,"audioLength":0.29,"textStart":8,"textLength":2,"text":"so"},
      {"audioStart":1.55,"audioLength":0.41,"textStart":11,"textLength":3,"text":"far"},
      {"audioStart":2.0,"audioLength":0.48,"textStart":14,"textLength":1,"text":"."}
    ]
  }
}
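Each marker maps a span of transcript text to a span of audio. A minimal sketch of applying one marker (field names taken from the examples above; the class itself is illustrative, not part of the API):

```java
public class WordMarker {
    public final double audioStart;   // seconds from the start of the audio
    public final double audioLength;  // duration in seconds
    public final int textStart;       // character offset into the transcript
    public final int textLength;      // number of characters covered

    public WordMarker(double audioStart, double audioLength,
                      int textStart, int textLength) {
        this.audioStart = audioStart;
        this.audioLength = audioLength;
        this.textStart = textStart;
        this.textLength = textLength;
    }

    // The transcript characters this marker covers.
    public String sliceText(String transcript) {
        return transcript.substring(textStart, textStart + textLength);
    }

    // The end of the word in the audio, in seconds.
    public double audioEnd() {
        return audioStart + audioLength;
    }
}
```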
Note that substitutions affect marker granularity. For example, if you have a substitution such as:
Written Form: .*
Spoken Form: $0
(that is, "match the entire transcript and replace it with the entire transcript"), then you will still get the entire transcript, but only one word marker that spans the entire utterance and audio file. If a substitution is making your word markers too coarse, remove or rewrite the substitution.
The Old Method is only available for HTTP dictation requests and is detailed below.
There are two calls used to get the markers: one for audio statistics (/audio/stats) and one for text statistics (/text/stats).
These two calls are made after checking the Dictation status and before deregistering the Dictation observer.
It is important to note that the calls return only the markers, NOT the audio. The client is responsible for having or requesting the audio before deregistering the Dictation observer.
The client makes a request for audio statistics.
GET /SCDictation/rest/{tenant_key}/dictations/{dictationIdentifier}/audio/stats HTTP/1.1
Accept: text/xml
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip, deflate
Authorization: Basic <auth>
When using SSO, the only difference is the Authorization header:
Authorization: Bearer <user>:<apikey>
The AudioMarkers and WordAudioMarkers values pertain to segment and word audio markers, respectively. These values are pairs of indices that represent the start and end byte offsets of each word or segment in the dictation audio.
Segment markers may not be valid (but word markers are).
The server returns the audio properties.
HTTP/1.1 200 OK
Content-Type: text/xml
Transfer-Encoding: chunked
2b7
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
<entry key="numWordAudioSegments">20</entry>
<entry key="AudioDoneTimeInMillis">1491413790594</entry>
<entry key="AudioMarkers">[0, 288000]</entry>
<entry key="TenantDatabase"><tenant_key></entry>
<entry key="numWordSegments">20</entry>
<entry key="numSegments">1</entry>
<entry key="WordAudioMarkers">[34880, 46080, 46080, 64000, 155840, 169600, 169600, 183680, 275520, 288000, 288000, 308160]</entry>
<entry key="AudioURL">/SCFileserver/audio/<tenant_key>/IKpd-HjvRvSHh--oXtF5Mw</entry>
</properties>
0
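Because the stats body follows the standard Java properties XML DTD (note the DOCTYPE above), it can be parsed directly with java.util.Properties rather than a hand-rolled XML parser. A sketch:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

public class StatsParser {
    // Load a /audio/stats or /text/stats response body.
    // loadFromXML validates against the built-in properties DTD,
    // so no network access is needed for the DOCTYPE.
    public static Properties parse(String xmlBody) throws Exception {
        Properties props = new Properties();
        props.loadFromXML(new ByteArrayInputStream(
                xmlBody.getBytes(StandardCharsets.UTF_8)));
        return props;
    }
}
```

Individual values such as `WordAudioMarkers` then come back as strings via `getProperty` and must be split into numbers by the client.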
The client makes a request for text properties.
GET /SCDictation/rest/{tenant_key}/dictations/{dictationIdentifier}/text/stats HTTP/1.1
Accept: text/xml
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip, deflate
Authorization: Basic <auth>
When using SSO, the only difference is the Authorization header:
Authorization: Bearer <user>:<apikey>
The TextMarkers and WordTextMarkers values pertain to segment and word text markers, respectively. These values are pairs of indices that represent the start and end of each word or segment in the dictation text.
For the Enhanced dictation server, segment markers may not be valid (but word markers are).
numWordSegments — The number of words in the dictation text.
DictationWords — The number of words in the dictation text.
TextMarkers — The pairs of indices for the segments in the dictation.
numWordTextSegments — The number of words in the dictation text.
WordTextMarkers — The text indices of words in the dictation. The indices are pairs representing the beginning and end indices for each word.
numSegments — The number of segments in the dictation.
The server returns the text properties.
HTTP/1.1 200 OK
Content-Type: text/xml
Transfer-Encoding: chunked
1dc
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
<entry key="Version">12</entry>
<entry key="TextDoneTimeInMillis">1491413799127</entry>
<entry key="DictationLength">126</entry>
<entry key="numWordSegments">6</entry>
<entry key="DictationWords">8</entry>
<entry key="Confidence">-1.0</entry>
<entry key="Transcription">This is a test of dictation.\n</entry>
<entry key="TenantDatabase"><tenant_key></entry>
<entry key="TextMarkers">[0, 156]</entry>
<entry key="TextURL">/SCDictation/rest/<tenant_key>/dictations/bMyRGK-XTc-WuJR6C06hNg/text</entry>
<entry key="numWordTextSegments">8</entry>
<entry key="SubstitutedText">This is a test of dictation.\n</entry>
<entry key="WordTextMarkers">[0, 3, 5, 6, 8, 8, 10, 13, 15, 16, 18, 26, 27, 27, 28, 28]</entry>
<entry key="numSegments">1</entry>
</properties>
0
From /text/stats you get string indices into the final transcription text. The indices come in pairs, so there will always be an even number of entries in the list. The first value of a pair is the start position of a marker in the text, and the second value is the end position (inclusive) of that marker in the text.
WordTextMarkers=[0, 2, 3, 3, 5, 11, 13, 14]
From /audio/stats you get byte offsets into the 16-bit, 16 kHz PCM audio. These values also come in pairs, and indicate the range of audio from which the corresponding text was generated.
WordAudioMarkers=[3518, 23358, 23360, 39038, 60798, 73278, 73278, 76798]
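Because these values are byte offsets into the PCM stream, a word's audio can be cut out directly. A sketch, assuming the raw 16-bit 16 kHz PCM bytes are already in memory on the client:

```java
import java.util.Arrays;

public class AudioSlicer {
    // WordAudioMarkers arrive flattened as [start0, end0, start1, end1, ...].
    // Copy out the PCM bytes for the word at wordIndex.
    public static byte[] wordAudio(byte[] pcm, long[] wordAudioMarkers, int wordIndex) {
        int start = (int) wordAudioMarkers[2 * wordIndex];
        int end = (int) wordAudioMarkers[2 * wordIndex + 1];
        return Arrays.copyOfRange(pcm, start, end);
    }
}
```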
To work with timestamps rather than byte counts, the byte counts can be converted to milliseconds. For example:
public static long audioByteCountToDurationInMillis(long byteCount)
{
    final double SAMPLES_PER_SECOND = 16000D;
    double sampleCount = byteCount / 2;       // 16-bit samples are 2 bytes each
    double timeInSeconds = sampleCount / SAMPLES_PER_SECOND;
    double timeInMillis = timeInSeconds * 1000;
    return (long) timeInMillis;
}
WordTextMarkers and WordAudioMarkers pair up, entry for entry, to form one record per word. Using the example WordTextMarkers and WordAudioMarkers data above, the conversion method, and the corresponding example transcription text "CNS: Patient is", the markers combine as follows:
| | Text Start Index | Text End Index | Audio Start Byte Count (Timestamp) | Audio End Byte Count (Timestamp) | Text Substring |
|---|---|---|---|---|---|
| Marker 1 | 0 | 2 | 3518 (109 ms) | 23358 (729 ms) | CNS |
| Marker 2 | 3 | 3 | 23360 (730 ms) | 39038 (1219 ms) | : |
| Marker 3 | 5 | 11 | 60798 (1899 ms) | 73278 (2289 ms) | Patient |
| Marker 4 | 13 | 14 | 73278 (2289 ms) | 76798 (2399 ms) | is |
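The rows above can be produced programmatically by walking the two marker arrays in lockstep. A sketch (the class and output format are illustrative; note the inclusive end index, hence the `+ 1` in `substring()`):

```java
import java.util.ArrayList;
import java.util.List;

public class MarkerTable {
    // 16-bit (2-byte) samples at 16 kHz.
    static long toMillis(long byteCount) {
        return (long) ((byteCount / 2) / 16000.0 * 1000.0);
    }

    // Walk WordTextMarkers and WordAudioMarkers pairwise, producing one
    // "word @ startMs-endMs" row per marker pair.
    public static List<String> rows(String text, int[] textMarkers, long[] audioMarkers) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < textMarkers.length; i += 2) {
            String word = text.substring(textMarkers[i], textMarkers[i + 1] + 1);
            out.add(word + " @ " + toMillis(audioMarkers[i])
                    + "-" + toMillis(audioMarkers[i + 1]) + " ms");
        }
        return out;
    }
}
```

Run against the example data, this reproduces the table: "CNS" at 109-729 ms, ":" at 730-1219 ms, "Patient" at 1899-2289 ms, and "is" at 2289-2399 ms.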