ASR Training

Typical Accuracy Training Timeline*

*The following timeline represents average accuracy performance and will vary from customer to customer.

What’s Needed?

There are two main components to our ASR service: a language model and an acoustic model. These models work in tandem and are fine-tuned individually so that ASR output is consistent and high quality.

Language Model: Historical Documents & Custom Words

The language model is a framework for ASR that learns from a collection of customer-specific terminology. Providing past journals, meeting minutes, closed caption files, staff member names, and other sources of verbatim text gives the model a much stronger starting foundation. The more text resources that can be provided, the better.

Acoustic Model: Audio

The acoustic model is a framework for ASR that learns from a collection of customer-specific audio characteristics. Unlike with text, we recommend specific amounts of audio when setting up a new model: an initial 50 hours of audio from past meetings, plus another 20-40 hours from live meetings over a 4-week training period.

It is important to note that major dips in ASR accuracy often stem directly from the audio quality of meetings. While numerous factors can reduce accuracy, the factors that most commonly cause issues for ASR are:

Fast/muffled/slurred speaking
Fast switching of speakers, overlapping speakers
Background noise during speaking (people talking, static, echoes, etc.)
Especially quiet speaker volume

Fortunately, these issues are preventable and can be minimized by putting best practices in place during meetings. It is very important to remind meeting members to speak clearly. We strongly recommend that customers improve meeting audio quality wherever possible.

ASR Story

While the infographic and the information above provide a great overview of what Sliq’s ASR entails, the following sections provide a more in-depth explanation of what ASR involves.

Harmony Automated Speech Recognition (Harmony ASR) is the core of the following Sliq services:

Disability-accessible closed captioning compliance
Transcription for use in Journals/Hansard

ASR offers numerous benefits: real-time closed captions, a hands-off approach, and the capacity to be trained and improved on an ongoing basis so that it is tailored to each customer’s meetings. In this document we explain what ASR requires and the steps we follow to consistently deliver highly accurate closed captions (CC) for our customers.

The Sliq Touch

ASR technology has improved in leaps and bounds, but still falls short on many of the human intricacies of speech and linguistics. Names, accents, acronyms, and informal jargon are often unfamiliar to ASR technology and can cause errors. Sliq’s ongoing ASR training looks to bridge these gaps through comparison of ASR output with professional human transcriptions, error word classification, analysis, and custom word training. It is important to keep in mind that a new ASR project begins fresh, with no custom training beforehand. This is done so that each of our customers has a dedicated, personalized ASR model that is tailored specifically for them. The training procedure at Sliq adapts to each customer, and we continue to rigorously review each customer’s ASR model until we achieve a standard of 90+% average accuracy.

The core Sliq team members who help deliver our customers’ ASR service are the following:

Operations and Deployment: setup and configuration
ASR Accuracy Transcriber: performs accuracy tests to measure the status of ASR
ASR Model Trainer: creates ASR models and performs training
ASR Performance Analyst: collates ASR data to guide strategy and improve accuracy

As ASR and transcription technology becomes more accessible via tools and add-ons, Sliq is able to stay ahead of the curve by addressing frequent and recurring constraints of third-party services. Some of the common limitations that Sliq is designed to overcome are:

File size limitations. Sliq is able to transcribe meetings lasting several hours as well as short snippets, where other services impose strict file size limits.
Overlapping speakers and background noise. Both can disrupt transcription, so Sliq uses volume thresholds to avoid picking up unnecessary noise.
Names, locations, and specific jargon. These are often misinterpreted by off-the-shelf transcription services. Sliq ASR learns these custom words over time to capture nuanced dialogue.

Accuracy Training Period

While every customer is different and requires dedicated, personalized fine-tuning, the following figures have been gathered from analysis of ongoing and past customer projects:

ASR Stage	Recommended Hours of Customer Meeting Audio	Expected Accuracy of ASR
Initial Setup	50-70 total hours	80+% accuracy
Training weeks 1-2	20-30 new meeting hours / 70-90 total hours	80-85+% accuracy
Training weeks 3-4	20-30 new meeting hours / 90-100+ total hours	85-90% accuracy
Training weeks 4+	Fine-tuning as needed	92-94+% accuracy

If a customer is able to provide more than the recommended number of hours, 90+% accuracy may be reached more quickly, but the process may still take 4 weeks because of the time scheduled for analysis and accuracy tests.
If a customer is not able to provide the recommended number of hours, more time will be dedicated to accuracy tests and re-tests using the materials already provided instead of new meetings. Because more frequent accuracy tests take the place of new audio and text, the estimated time to reach 90+% accuracy is 4-6 weeks instead.
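
As a rough illustration of the two rules above, the sketch below encodes the minimum recommended hours from the table and returns the estimated number of weeks to reach 90+% accuracy. It is only a sketch under stated assumptions: the function name, its parameters, and the choice to use the lower end of each recommended range as the cut-off are illustrative and are not part of any Sliq tool.

```python
# Minimum recommended hours, taken from the lower end of each range in the table.
# Using the lower bounds as cut-offs is an assumption made for this sketch.
MIN_INITIAL_HOURS = 50
MIN_NEW_HOURS_PER_STAGE = 20  # applies to weeks 1-2 and again to weeks 3-4


def estimate_training_weeks(initial_hours, new_hours_weeks_1_2, new_hours_weeks_3_4):
    """Return a (min_weeks, max_weeks) estimate for reaching 90+% accuracy."""
    meets_recommendation = (
        initial_hours >= MIN_INITIAL_HOURS
        and new_hours_weeks_1_2 >= MIN_NEW_HOURS_PER_STAGE
        and new_hours_weeks_3_4 >= MIN_NEW_HOURS_PER_STAGE
    )
    # Extra audio does not shorten the estimate: analysis and accuracy tests
    # are still scheduled across the 4-week period.
    return (4, 4) if meets_recommendation else (4, 6)
```

For example, a customer providing 60 initial hours and 25 new hours in each two-week stage falls under the 4-week estimate, while a customer providing 30 initial hours falls under the 4-6 week estimate.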

A training cycle begins with our team of professional transcribers creating a verbatim text record of recent live meetings. The ASR results are then compared against this human reference and analyzed. We refer to this process as an accuracy test. The accuracy test results allow us to quantify the number and severity of errors, and collating several accuracy tests produces an ASR error word bank. We then update and custom-train your ASR model using the collected errors. This entire process is a single training cycle, and the training period typically involves a total of 4 training cycles.
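
To make the accuracy test more concrete, here is a minimal sketch of how an ASR result can be compared with a human transcript. It is an illustration only, not Sliq's actual tooling. It assumes accuracy is reported as one minus the word error rate (this article does not name the exact metric), and the function names normalize, accuracy_test, and build_word_bank are hypothetical.

```python
from collections import Counter


def normalize(text):
    """Lowercase and strip surrounding punctuation so formatting is not counted as an error."""
    words = (w.strip(".,;:!?\"'()") for w in text.lower().split())
    return [w for w in words if w]


def accuracy_test(reference, hypothesis):
    """Compare a human transcript (reference) with ASR output (hypothesis).

    Returns (accuracy, missed), where accuracy is 1 - word error rate and
    missed counts the reference words that the ASR substituted or dropped.
    """
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference transcript is empty")

    # dist[i][j] = fewest word edits turning the first i reference words
    # into the first j hypothesis words (word-level edit distance).
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost,  # match or substitution
                dist[i - 1][j] + 1,         # reference word dropped by ASR
                dist[i][j - 1] + 1,         # extra word inserted by ASR
            )

    # Walk back through the table to find which reference words were missed.
    missed = Counter()
    i, j = len(ref), len(hyp)
    while i > 0 or j > 0:
        cost = 0 if (i > 0 and j > 0 and ref[i - 1] == hyp[j - 1]) else 1
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + cost:
            if cost:
                missed[ref[i - 1]] += 1  # substitution
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            missed[ref[i - 1]] += 1      # deletion
            i -= 1
        else:
            j -= 1                       # insertion, no reference word involved

    errors = dist[len(ref)][len(hyp)]
    return max(0.0, 1.0 - errors / len(ref)), missed


def build_word_bank(missed_counters):
    """Collate the missed words from several accuracy tests into one error word bank."""
    bank = Counter()
    for missed in missed_counters:
        bank.update(missed)
    return bank
```

For instance, comparing the reference "The member for Halton moved the motion." with the ASR output "the member for Holton moved motion" gives 2 errors in 7 reference words, so an accuracy of about 71%, with "halton" and "the" recorded as missed words. Words that recur in the bank across several tests, such as names, are natural candidates for custom word training.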

The process of training and updating the models can affect the ASR of live meetings that occur while training is running. For this reason, training is scheduled for the end of the week or for periods when no meetings occur. Each time the model is trained, it processes all of the information that has been fed into the system so far. As such, the training procedure can take upwards of 24 hours, depending on the amount of audio that has been provided. We work closely with our customers and monitor customer schedules to ensure training is done at appropriate times without compromising the ASR timelines.
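
The scheduling constraint described above can be sketched as a search for the earliest gap in the meeting calendar that is long enough for a training run. This is a simplified illustration, not a description of how Sliq schedules training: the function name find_training_window and its parameters are hypothetical, and meetings are assumed to be given as (start, end) datetime pairs.

```python
from datetime import datetime, timedelta


def find_training_window(meetings, duration, search_start, search_end):
    """Return the earliest (start, end) slot of length `duration` that overlaps
    no meeting between search_start and search_end, or None if there is none."""
    cursor = search_start
    for meeting_start, meeting_end in sorted(meetings):
        if meeting_start - cursor >= duration:
            break  # the gap before this meeting is long enough
        # Otherwise the earliest possible start moves to the end of this meeting.
        cursor = max(cursor, meeting_end)
    if cursor + duration <= search_end:
        return cursor, cursor + duration
    return None


# Example: meetings on Monday, Tuesday, and Thursday morning of one week.
week_meetings = [
    (datetime(2024, 1, 1, 9), datetime(2024, 1, 1, 17)),
    (datetime(2024, 1, 2, 9), datetime(2024, 1, 2, 17)),
    (datetime(2024, 1, 4, 9), datetime(2024, 1, 4, 12)),
]
window = find_training_window(
    week_meetings,
    timedelta(hours=24),  # a training run of about 24 hours
    datetime(2024, 1, 1),
    datetime(2024, 1, 8),
)
```

With this calendar, the first free 24-hour slot runs from Tuesday 17:00 to Wednesday 17:00. In practice a longer window would be reserved, since training time grows with the amount of audio provided.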
