Typical Accuracy Training Timeline* 

*This timeline represents average accuracy performance; actual results vary by customer.

What’s Needed? 

Our ASR service has two main components: a language model and an acoustic model. These models work in tandem and are fine-tuned individually to keep ASR output consistent and high quality.


Language Model: Historical Documents & Custom Words 

The language model is a framework for ASR that learns from a collection of customer-specific terminology. Providing past journals, meeting minutes, closed caption files, staff member names, and other sources of verbatim text gives the model a much stronger starting foundation. The more text resources that can be provided, the better.


Acoustic Model: Audio 

The acoustic model is a framework for ASR that learns from a collection of customer-specific audio characteristics. Unlike text, audio for a new model comes with specific recommended amounts: an initial 50 hours of audio from past meetings, followed by another 20-40 hours from live meetings over a 4-week training period.

It is important to note that major dips in ASR accuracy often stem directly from the audio quality of meetings. While numerous factors can reduce accuracy, the most common causes of ASR issues are:

  • Fast/muffled/slurred speaking 

  • Fast switching of speakers, overlapping speakers 

  • Background noise during speaking (people talking, static, echoes, etc.) 

  • Especially quiet speaker volume 

Fortunately, these issues are preventable and can be minimized by putting best practices in place during meetings. It is very important to remind meeting members to articulate clearly when speaking. We strongly recommend that customers improve meeting audio quality wherever possible.


ASR Story 

While the infographic and the information above provide a great overview of what Sliq’s ASR entails, the following sections give a more in-depth explanation of what ASR involves.

Harmony Automated Speech Recognition (Harmony ASR) is the core of the following Sliq services: 

  • Closed captioning for disability accessibility compliance

  • Transcription for use in Journals/Hansard

ASR offers numerous benefits: real-time closed captions, a hands-off approach, and the capacity to be trained and improved on an ongoing basis to tailor itself to each customer’s meetings. This document explains what ASR requires and the steps we follow to consistently deliver highly accurate closed captions (CC) for our customers.

The Sliq Touch 

ASR technology has improved by leaps and bounds, but it still falls short on many of the human intricacies of speech and linguistics. Names, accents, acronyms, informal jargon, and similar features are unknown to ASR technology and can cause errors. Sliq’s ongoing ASR training aims to bridge these gaps through comparison of ASR output against professional human transcriptions, error word classification, analysis, and custom word training. Keep in mind that a new ASR project begins fresh, with no custom training beforehand. This is done so that each of our customers has a dedicated, personalized ASR model tailored specifically to them. The training procedure at Sliq adapts to each customer, and we continue to rigorously review each customer’s ASR model until it reaches a standard of 90+% average accuracy.

The core Sliq team members who help deliver our customers’ ASR service are:

  • Operations and Deployment: setup and configuration

  • ASR Accuracy Transcriber: performs accuracy tests to measure the status of ASR

  • ASR Model Trainer: creates ASR models and performs training

  • ASR Performance Analyst: collates ASR data to guide strategy and improve accuracy

As ASR and transcription technology become more accessible via tools and add-ons, Sliq stays ahead of the curve by addressing frequent and recurring constraints of third-party services. Some of the common limitations that Sliq is designed to overcome are:

  • File size limitations. Sliq can transcribe meetings several hours long as well as short snippets, whereas other services impose strict file size limits.

  • Overlapping speakers and background noise, which can disrupt transcription. Sliq uses volume thresholds to avoid picking up unnecessary noise.

  • Names, locations, and specific jargon, which off-the-shelf transcription services often misinterpret. Sliq ASR learns these custom words over time to capture nuanced dialogue.
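
To illustrate the general idea of a volume threshold mentioned in the list above, here is a minimal Python sketch of a frame-level noise gate. It is not Sliq's implementation: the function names, the frame size, and the threshold value are all hypothetical, and the audio is assumed to be a list of samples in the range -1.0 to 1.0.

```python
import math


def rms(frame):
    """Root-mean-square level of one frame of samples (0.0 for an empty frame)."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def gate_frames(samples, frame_size, threshold):
    """Silence every frame whose RMS level falls below `threshold`.

    Hypothetical helper: frames at or above the threshold pass through
    unchanged; quieter frames are replaced with zeros.
    """
    gated = []
    for start in range(0, len(samples), frame_size):
        frame = samples[start:start + frame_size]
        if rms(frame) >= threshold:
            gated.extend(frame)
        else:
            gated.extend([0.0] * len(frame))
    return gated


# Illustrative values only: a quiet frame followed by a loud frame.
audio = [0.01, -0.02, 0.01, -0.01, 0.5, -0.6, 0.4, -0.5]
cleaned = gate_frames(audio, frame_size=4, threshold=0.1)
```

In this sketch the first four samples are treated as background noise and zeroed, while the louder second frame is kept as speech. A production system would likely use a more robust voice activity detector; the fixed threshold here is only for illustration.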

 

Accuracy Training Period 

While every customer is different and requires dedicated, personalized fine-tuning, the following metrics have been gathered from analysis of ongoing and past customer projects:

ASR Stage           | Recommended Hours of Customer Meeting Audio   | Expected Accuracy of ASR
Initial Setup       | 50-70 total hours                             | 80+% accuracy
Training weeks 1-2  | 20-30 new meeting hours / 70-90 total hours   | 80-85+% accuracy
Training weeks 3-4  | 20-30 new meeting hours / 90-100+ total hours | 85-90% accuracy
Training weeks 4+   | Fine tuning as needed                         | 92-94+% accuracy

 

  • If a customer is able to provide more than the recommended number of hours, 90+% accuracy may be reached more quickly. Note, however, that the process may still take 4 weeks because of the time scheduled for analysis and accuracy tests.

  • If a customer is not able to provide the recommended number of hours, more time will be dedicated to accuracy tests and re-tests using the materials already provided instead of new meetings. Because more frequent accuracy tests take the place of new audio and text, the estimated time to reach 90+% accuracy is 4-6 weeks.
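
The two cases above can be expressed as a small lookup. The Python sketch below is illustrative only: the function name is hypothetical, and the thresholds are an assumption taken from the lower bounds of the recommended ranges in this section (50 hours at initial setup, 20 new meeting hours in each two-week training stage).

```python
# Assumed thresholds: lower bounds of the recommended ranges in this section.
RECOMMENDED_INITIAL_HOURS = 50
RECOMMENDED_NEW_HOURS_PER_STAGE = 20


def estimate_weeks_to_target(initial_hours, new_hours_per_stage):
    """Return (min_weeks, max_weeks) to reach 90+% average accuracy.

    `new_hours_per_stage` lists the new meeting hours supplied in each
    two-week training stage. Meeting or exceeding the recommendation still
    yields 4 weeks, because analysis and accuracy tests are scheduled work.
    """
    meets_recommendation = (
        initial_hours >= RECOMMENDED_INITIAL_HOURS
        and all(h >= RECOMMENDED_NEW_HOURS_PER_STAGE for h in new_hours_per_stage)
    )
    return (4, 4) if meets_recommendation else (4, 6)


on_track = estimate_weeks_to_target(60, [25, 30])   # recommended hours met
short = estimate_weeks_to_target(35, [10, 20])      # below recommendation
```

Here `on_track` evaluates to `(4, 4)` and `short` to `(4, 6)`, mirroring the 4-week and 4-6-week estimates described above.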


A training cycle begins with our team of professional transcribers creating a verbatim text record of recent live meetings. Using this transcription, we compare and analyze the ASR results against the human reference. We refer to this process as an accuracy test. The accuracy test results let us quantify the number and severity of errors, and collating several accuracy tests produces an ASR error word bank. We then update and custom train your ASR model using the collected errors. This entire process is a single training cycle, and the training period typically involves a total of 4 training cycles.
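
As a rough illustration of an accuracy test, the Python sketch below aligns an ASR transcript against a human reference at the word level and collects the misrecognized reference words into an error word bank. This document does not specify how accuracy is scored, so the sketch assumes the common convention of accuracy = 1 - word error rate (substitutions, deletions, and insertions divided by the number of reference words). All function names and the sample sentences are hypothetical.

```python
from collections import Counter


def align(reference, hypothesis):
    """Word-level Levenshtein alignment.

    Returns a list of (operation, reference_word, hypothesis_word) tuples,
    where operation is "match", "substitution", "deletion", or "insertion".
    """
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    rows, cols = len(ref) + 1, len(hyp) + 1
    # cost[i][j] = minimum number of edits turning ref[:i] into hyp[:j]
    cost = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        cost[i][0] = i
    for j in range(cols):
        cost[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(diagonal, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Walk back from the bottom-right corner to recover the operations.
    ops, i, j = [], len(ref), len(hyp)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            op = "match" if ref[i - 1] == hyp[j - 1] else "substitution"
            ops.append((op, ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops.append(("deletion", ref[i - 1], None))
            i -= 1
        else:
            ops.append(("insertion", None, hyp[j - 1]))
            j -= 1
    return ops[::-1]


def accuracy_test(reference, hypothesis):
    """Return (accuracy, errors), assuming accuracy = 1 - word error rate."""
    ops = align(reference, hypothesis)
    errors = [op for op in ops if op[0] != "match"]
    ref_len = sum(1 for op in ops if op[0] != "insertion")
    accuracy = max(0.0, 1.0 - len(errors) / ref_len) if ref_len else 0.0
    return accuracy, errors


def build_error_word_bank(tests):
    """Count how often each reference word was missed across several tests."""
    bank = Counter()
    for reference, hypothesis in tests:
        _, errors = accuracy_test(reference, hypothesis)
        for _, ref_word, _ in errors:
            if ref_word is not None:  # insertions have no reference word
                bank[ref_word] += 1
    return bank


# Made-up example: human reference vs. ASR output.
human = "the member for ottawa south has the floor"
asr = "the member for ottawa has the flaw"
accuracy, errors = accuracy_test(human, asr)
bank = build_error_word_bank([(human, asr)])
```

In this example the ASR output drops "south" and mishears "floor", so 2 of the 8 reference words are in error and the accuracy is 0.75. Words that recur in the bank across several tests are natural candidates for custom word training.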


The process of training and updating the models can affect the ASR of live meetings that occur during training sessions. For this reason, training is scheduled for the end of the week or for periods when no meetings occur. Each time the model is trained, it processes all information that has been fed into the system so far. As such, the training procedure can take upwards of 24 hours, depending on the amount of audio that has been provided. We work closely with our customers and monitor their schedules to ensure training is done at appropriate times without compromising the ASR timelines.
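
The scheduling constraint described above can be sketched as a search for meeting-free gaps long enough to fit a training run. The Python example below is a minimal illustration, not Sliq's scheduling tool: the function name and the sample calendar are hypothetical, and the 24-hour minimum gap is an assumption based on the training duration mentioned above.

```python
from datetime import datetime, timedelta


def find_training_windows(meetings, period_start, period_end,
                          min_gap=timedelta(hours=24)):
    """Return meeting-free (start, end) windows of at least `min_gap`.

    `meetings` is a list of (start, end) datetime pairs; overlapping
    meetings are handled by tracking the latest end time seen so far.
    """
    windows = []
    cursor = period_start
    for start, end in sorted(meetings):
        if start - cursor >= min_gap:
            windows.append((cursor, start))
        cursor = max(cursor, end)
    if period_end - cursor >= min_gap:
        windows.append((cursor, period_end))
    return windows


# Made-up week of meetings (Monday 4 March to Monday 11 March 2024).
meetings = [
    (datetime(2024, 3, 4, 10), datetime(2024, 3, 4, 12)),
    (datetime(2024, 3, 6, 13), datetime(2024, 3, 6, 17)),
    (datetime(2024, 3, 7, 9), datetime(2024, 3, 7, 11)),
]
windows = find_training_windows(
    meetings, datetime(2024, 3, 4), datetime(2024, 3, 11)
)
```

For this calendar the sketch finds two candidate windows: Monday noon to Wednesday 1 pm, and Thursday 11 am through the end of the week. The second matches the end-of-week scheduling described above.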