You can also run the audio through multiple transcription engines and use the mismatches among their text streams to narrow down how much of the content needs manual review. This is quite similar to multi-engine voting in OCR for document images.
The principle is that the engines (hopefully) have different failure modes, so each engine's 2-3% of errors falls in different parts of the audio. The key underlying assumption is that the error events are mutually exclusive, i.e. no two engines get the same spot wrong.
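As a rough check on how much that assumption matters, here is a small sketch of the arithmetic under a weaker assumption: that the engines' errors are statistically independent rather than strictly non-overlapping. The function name `p_at_least_two_wrong` and the 3% figure are illustrative only, and real engines trained on similar data may be more correlated than this.

```python
def p_at_least_two_wrong(p: float) -> float:
    """Probability that at least 2 of 3 engines err at the same position,
    ASSUMING each engine errs independently with probability p."""
    exactly_two = 3 * p**2 * (1 - p)  # three ways to pick the two failing engines
    all_three = p**3
    return exactly_two + all_three

if __name__ == "__main__":
    # With a 3% per-engine error rate (upper end of the 2-3% range above)
    print(f"{p_at_least_two_wrong(0.03):.6f}")
```

This is an upper bound on silent errors after voting, because two engines that are both wrong only outvote the third when they also agree on the same wrong text.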
With 3 engines, you can use something like 2-of-3 stream matches to override the stream that mismatches.
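A minimal sketch of the 2-of-3 idea, assuming the engine outputs have already been aligned into token lists of equal length (alignment is the hard part in practice and is not shown). The function name `vote` and the sample sentences are hypothetical, not part of any engine's API.

```python
from collections import Counter

def vote(streams):
    """Majority-vote over pre-aligned token streams, one list per engine.

    Returns (merged, review): merged holds the winning token per position,
    or None where no strict majority exists; review lists those positions.
    """
    if len({len(s) for s in streams}) != 1:
        raise ValueError("streams must be pre-aligned to the same length")
    merged, review = [], []
    for i, tokens in enumerate(zip(*streams)):
        token, count = Counter(tokens).most_common(1)[0]
        if count * 2 > len(streams):
            merged.append(token)   # majority overrides any mismatching stream
        else:
            merged.append(None)    # no majority: leave for a human
            review.append(i)
    return merged, review

if __name__ == "__main__":
    a = "the quick brown fox jumps over".split()
    b = "the quick crown fox jumps offer".split()
    c = "the quack brown fax jumps oven".split()
    merged, review = vote([a, b, c])
    print(merged)
    print(review)
```

Only the positions in `review` need a human; everywhere else the agreeing streams decide the output.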