Even in a conversation between two people, we often fail to catch every word and have to ask each other to say some things again. For computer programs, accurate speech recognition in similar situations is even more of a challenge. That is why development of an efficient speech recognition solution requires addressing the high complexity of human speech.
Nanosemantics developers have built a tool that delivers high-quality recognition. NLab Speech is a set of neural network algorithms for audio signal processing and text analysis, trained and calibrated on a large body of manually labeled speech data. At present, NLab Speech achieves an accuracy (the complement of the word error rate, 1 - WER) of over 82% on high-noise telephony data. Nanosemantics’ cloud processes audio at 6 times real-time speed, which is 40-80% faster than competing cloud services. The solution took the team over two years to develop.
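The two metrics quoted above can be made concrete with a short sketch. It assumes that the accuracy figure means 1 - WER, where WER is the word-level edit distance divided by the number of reference words, and that the speed figure means seconds of audio processed per second of computation. The function names are illustrative and are not part of NLab Speech.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the ref prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # word missing from the hypothesis (deletion)
                curr[j - 1] + 1,     # extra word in the hypothesis (insertion)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def word_accuracy(reference: str, hypothesis: str) -> float:
    """Accuracy as the complement of WER (assumed reading of the quoted figure)."""
    return 1.0 - word_error_rate(reference, hypothesis)


def speed_factor(audio_seconds: float, processing_seconds: float) -> float:
    """Seconds of audio handled per second of processing (higher is faster)."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds
```

Under these assumptions, a one-minute call transcribed in ten seconds gives a speed factor of 6, and a six-word reference with one substituted and one dropped word gives a WER of 2/6, or an accuracy of about 67%.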
Unlike a human being, the neural network in NLab Speech analyzes an audio signal as an image: each recording is mapped to its spectrogram, and the neural network then converts the spectrogram into candidate transcriptions of what is said in the recording. The best candidates are selected with the help of a language model based on word co-occurrence.
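The selection step can be illustrated with a minimal sketch. It assumes the acoustic network returns a few candidate transcriptions with log-scores and that the language model is a simple bigram model with add-one smoothing. The actual NLab Speech language model is not described in this article, so every name and the scoring formula below are hypothetical.

```python
import math
from collections import Counter


def train_bigram_counts(sentences):
    """Count single words and adjacent word pairs in a text corpus."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


def bigram_log_prob(sentence, unigrams, bigrams):
    """Log-probability of a sentence under an add-one smoothed bigram model."""
    words = ["<s>"] + sentence.lower().split()
    vocab_size = len(unigrams)
    total = 0.0
    for prev, word in zip(words, words[1:]):
        total += math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))
    return total


def pick_best(candidates, unigrams, bigrams, lm_weight=1.0):
    """candidates: list of (text, acoustic_log_score); returns the best text."""
    def combined(candidate):
        text, acoustic_score = candidate
        return acoustic_score + lm_weight * bigram_log_prob(text, unigrams, bigrams)
    return max(candidates, key=combined)[0]


corpus = ["recognize speech", "speech recognition works", "we recognize speech well"]
unigrams, bigrams = train_bigram_counts(corpus)
# The acoustic scores are made up for illustration: the network slightly
# prefers the wrong candidate, and the language model overrules it.
candidates = [("wreck a nice beach", -4.0), ("recognize speech", -4.2)]
best = pick_best(candidates, unigrams, bigrams)
```

Here `best` ends up as "recognize speech", because its word pairs occur in the corpus while those of the acoustically similar alternative do not.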
Training data is used to improve both the acoustic model and the language model; the amount of data and the quality of its markup directly affect how well each model performs. It is recommended to train the models on data from a specific commercial segment, such as a call center, so that the completeness and authenticity of the data are certain.
Aside from the challenges associated with developing the speech recognition model, there was also the routine task of thorough data preparation. In total, it took the company’s specialists two years to prepare the data used to train NLab Speech. To train acoustic models, they collected over 12 thousand hours of audio from various sources, including call centers, voice messages, audio books, and webinars.
Sets of data were also prepared to train models that perform best with user mic recordings made on smartphones or laptops. When working with audio from various sources recorded in various environments, reverberation and equalization had to be taken into account.
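One common way to account for reverberation when preparing training audio is to convolve clean recordings with a room impulse response. The article does not say how Nanosemantics handled it, so the following is only a sketch of that general technique, using a synthetic, exponentially decaying impulse response. The function names and the `rt60_seconds` parameter are assumptions made for illustration.

```python
import math
import random


def synthetic_impulse_response(sample_rate, rt60_seconds, seed=0):
    """Decaying noise burst whose amplitude falls by 60 dB after rt60_seconds."""
    rng = random.Random(seed)
    length = max(1, int(sample_rate * rt60_seconds))
    # A 60 dB amplitude drop is a factor of 1000, so exp(-decay * rt60) = 1/1000.
    decay = math.log(1000.0) / rt60_seconds
    response = [
        rng.uniform(-1.0, 1.0) * math.exp(-decay * n / sample_rate)
        for n in range(length)
    ]
    response[0] = 1.0  # the direct, unreflected sound
    return response


def convolve(signal, impulse_response):
    """Direct-form linear convolution (fine for short illustrative signals)."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, sample in enumerate(signal):
        if sample == 0.0:
            continue
        for j, tap in enumerate(impulse_response):
            out[i + j] += sample * tap
    return out


def peak_normalize(samples, peak=1.0):
    """Rescale so that the largest absolute sample equals `peak`."""
    top = max(abs(x) for x in samples)
    if top == 0.0:
        return list(samples)
    return [x * peak / top for x in samples]


def add_reverb(signal, sample_rate, rt60_seconds=0.3, seed=0):
    """Return a reverberated, peak-normalized copy of a mono signal."""
    response = synthetic_impulse_response(sample_rate, rt60_seconds, seed)
    return peak_normalize(convolve(signal, response))
```

In practice, measured impulse responses and an FFT-based convolution would be more realistic and much faster. Equalization could be simulated in a similar way by passing the signal through a filter.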
To prepare a large array of training data, Nanosemantics developed a data markup platform called NLab Marker. With the help of NLab Marker, data is transformed into a format suitable for neural network training.
“For organizations that rely on machine learning in client support, the improved quality of Nanosemantics’ ASR-based voice robots is a real life-saver. A voice assistant with advanced speech capabilities and word recognition functions replaces dozens to hundreds of call-center operators, which saves on personnel costs and speeds up client support. ASR integration will considerably facilitate and optimize work in other business areas as well. For example, healthcare professionals use voice commands to fill in documents and save time on making medical history notes. For people with disabilities, voice technologies may be a key to a better life,” says Pavel Krivozubov, Head of Robotics and Artificial Intelligence Team at Skolkovo Foundation.
As of today, the NLab Speech recognition solution from Nanosemantics is a self-sufficient technology that imitates human speech capabilities and does not rely on any third-party services in the recognition process. Fast and scalable, it runs on both CPUs and graphics cards (GPUs). NLab Speech supports both file-based and streaming speech recognition. File-based recognition only displays the final result, whereas streaming recognition displays interim results after each spoken word, which are then corrected based on what is said next in the audio, much like in Apple Siri. Additionally, our ASR (automatic speech recognition) solution works with all major communication protocols: WebSocket, gRPC and MRCP, which makes NLab Speech flexible in terms of its integration with the client. It can also break stereo recordings down into dialog lines so that you can easily use your ASR results in speech analytics systems. NLab Speech automatically checks spelling, corrects errors and adds punctuation.
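Breaking a stereo recording down into dialog lines can be pictured as follows. The sketch assumes each channel carries one speaker and has already been recognized into timestamped segments. The `Segment` type, the speaker labels and the merging rule are hypothetical and do not reflect the NLab Speech output format.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds from the beginning of the recording
    end: float
    text: str


def merge_channels(channels):
    """Interleave per-channel segments into chronologically ordered dialog lines.

    channels: dict mapping a speaker label to that channel's list of Segment.
    Returns a list of (speaker, text) tuples; consecutive segments from the
    same speaker are joined into one line.
    """
    timeline = [
        (segment.start, speaker, segment.text)
        for speaker, segments in channels.items()
        for segment in segments
    ]
    timeline.sort(key=lambda item: item[0])

    dialog = []
    for _, speaker, text in timeline:
        if dialog and dialog[-1][0] == speaker:
            dialog[-1] = (speaker, dialog[-1][1] + " " + text)
        else:
            dialog.append((speaker, text))
    return dialog


recognized = {
    "operator": [
        Segment(0.0, 1.5, "Hello, how can I help?"),
        Segment(4.0, 5.0, "Sure."),
        Segment(5.2, 6.0, "One moment."),
    ],
    "client": [Segment(1.8, 3.6, "I need my balance.")],
}
for speaker, line in merge_channels(recognized):
    print(f"{speaker}: {line}")
```

A list of speaker-attributed lines like this is the kind of input a speech analytics system can consume directly, for example to measure how long each side talks.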
“We are now one of the leaders in highly accurate voice solutions in the Russian language and we aim to outdo our competition in terms of quality. And we have what it takes: we improve language and acoustic models, as well as the punctuator neural network. We collect even more quality data to train neural networks. Additionally, to increase accuracy of speech recognition we plan to upgrade NLab Speech with audio categorization by sex, age, speech rate, pitch, volume and speaker’s emotions. Furthermore, we plan to add a categorization of places by background noise. At the same time, we are developing ASR for English, Chinese and Korean,” Nanosemantics’ CEO Stanislav Ashmanov notes.