How Microsoft Teams Will Use AI To Filter Typing, Barking And Other Video Call Noise


Last month, Microsoft announced that Teams, its competitor to Slack, Facebook Workplace, and Google Hangouts Chat, had passed 44 million daily active users. The milestone overshadowed the unveiling of a few new features coming "later this year." Most were straightforward: a raise-hand function to indicate you have something to say; offline and low-bandwidth support for reading chat messages and writing replies even with a weak or nonexistent internet connection; and an option to pop out chats into a separate window. But one feature, real-time noise suppression, stood out: Microsoft demonstrated how AI can minimize distracting background noise during a call.

We've all been there. How many times have you asked someone to mute themselves or to move away from a noisy area? Real-time noise suppression will filter out someone typing on their keyboard during a meeting, the rustle of a bag of crisps (as you can see in the video above), and a vacuum cleaner running in the background. AI removes the background noise in real time so you hear only speech on the call. But how does it work, exactly? We talked to Robert Aichner, a program manager on the Microsoft Teams group, to find out.

The use of collaboration and video conferencing tools is exploding as the coronavirus crisis forces millions of people to learn and work from home. Microsoft is positioning Teams as the solution for businesses and consumers as part of its Microsoft 365 subscription suite. The company is leveraging its machine learning expertise to make AI functionality one of its big differentiators. When it finally arrives, real-time background noise removal will be a boon for businesses and households full of distracting noises. Additionally, how Microsoft built the feature is instructive for other companies that rely on machine learning.

Stationary versus non-stationary noise

Of course, noise suppression has existed for years in Microsoft Teams, Skype, and Skype for Business. Other communication tools and video conferencing apps have some form of noise suppression as well. But that noise suppression covers stationary noise, such as a computer fan or an air conditioner running in the background. The traditional way to remove such noise is to look for pauses in speech, estimate the baseline of the noise, assume that the continuous background noise doesn't change over time, and filter it out.
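The traditional approach described above is often called spectral subtraction, and it can be sketched in a few lines of Python. This is a toy illustration of the general technique, not Microsoft's implementation; the naive DFT and the function names are my own choices.

```python
import cmath

def dft(frame):
    # naive discrete Fourier transform (fine for a sketch, too slow for production)
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]

def estimate_noise(pause_frame):
    # during a pause in speech, everything picked up is assumed to be noise
    return [abs(x) for x in dft(pause_frame)]

def spectral_subtraction(noisy_frame, noise_mag):
    # subtract the estimated noise magnitude in each frequency bin,
    # keep the noisy phase, and floor negative magnitudes at zero
    cleaned = []
    for x, n_mag in zip(dft(noisy_frame), noise_mag):
        mag = max(abs(x) - n_mag, 0.0)
        cleaned.append(cmath.rect(mag, cmath.phase(x)))
    return idft(cleaned)
```

Because the noise estimate is frozen between speech pauses, this only works when the background noise doesn't change over time, which is exactly the limitation of the traditional method.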

Going forward, Microsoft Teams will also suppress non-stationary noises, like a dog barking or somebody shutting a door. "That is not stationary," Aichner said. "You cannot estimate that in speech pauses. What machine learning now allows you to do is to create this big training set, with a lot of representative noises."


In fact, Microsoft open-sourced its training set on GitHub earlier this year "to advance the research community in that field." While the first version is publicly available, Microsoft is actively working on extending the data sets. A company spokesperson confirmed that certain noise categories in the data sets will not be filtered out on calls, including musical instruments, laughter, and singing.

Microsoft can't simply isolate the sound of human voices, because other noises occur at the same frequencies. On a spectrogram of a speech signal, unwanted noise appears both in the gaps between speech and overlapping with the speech itself. That makes it nearly impossible to filter out with traditional methods: once speech and noise overlap, you can't tell the two apart. Instead, you need a neural network trained in advance on what noise and speech each look like.

Speech recognition vs noise cancellation

To make his point, Aichner compared machine learning models for noise suppression to machine learning models for speech recognition. For speech recognition, you need to record a large corpus of users talking into a microphone and then have humans label that voice data by writing down what was said. Instead of mapping microphone input to written words, noise suppression tries to map noisy speech to clean speech.

"We train a model to understand the difference between noise and speech, and then the model is trying to just keep the speech," Aichner said. "We have training data sets. We took thousands of different speakers and more than 100 types of noise. And then what we do is we mix the clean speech without noise with the noise. So we simulate a microphone signal. And then you also give the model the clean speech as the ground truth. So you're asking the model, 'From this noisy data, please extract this clean signal, and this is how it should look.' That's how you train neural networks [with] supervised learning, where you basically have some ground truth."
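The supervised setup Aichner describes can be sketched very simply: mix clean speech with noise to simulate a microphone signal, keep the clean speech as the ground truth, and score the model by how far its output is from that target. This is a minimal illustration of the idea, not Microsoft's pipeline; the names are my own.

```python
def make_training_pair(clean_speech, noise):
    # simulate a microphone signal: the model's input is the mixture,
    # the ground truth it must learn to recover is the clean speech
    noisy = [s + n for s, n in zip(clean_speech, noise)]
    return noisy, clean_speech

def mse_loss(model_output, ground_truth):
    # a typical supervised training loss: mean squared error to the target
    return sum((o - t) ** 2 for o, t in zip(model_output, ground_truth)) / len(ground_truth)
```

During training, the optimizer adjusts the model so that `mse_loss(model(noisy), clean_speech)` shrinks; a perfect denoiser would drive it to zero.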

For speech recognition, the ground truth is what was said into the microphone. For real-time noise suppression, the ground truth is the noise-free speech. By feeding a large enough data set (in this case, hundreds of hours of data) Microsoft can effectively train its model. "It can generalize and reduce the noise even with my voice, which wasn't part of the training data," Aichner said. "In real time, when I speak and there is noise, the model is able to extract the clean speech [from it] and just send that to the remote person."


Comparing the functionality to speech recognition makes noise suppression sound much more achievable, even though it happens in real time. So why hasn't this been done before? And can Microsoft's competitors quickly recreate it? Aichner walked me through the challenges of building real-time noise suppression: finding representative data sets, building and shrinking the model, and leveraging machine learning expertise.

Representative data sets

We covered the first challenge above: representative data sets. The team spent a lot of time figuring out how to produce audio files that resemble what happens on a typical call.

They used audiobooks to represent male and female voices, because "speech characteristics differ between male and female voices." They used YouTube data sets with labels specifying that a recording includes, say, typing or music. Aichner's team then combined the speech data and noise data using a synthesizer script at different signal-to-noise ratios. By amplifying the noise, they could imitate different realistic situations that can happen on a call.
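A synthesizer script of the kind described above typically scales the noise so each mixture hits a requested signal-to-noise ratio. Here is a minimal sketch of that idea; the function name and interface are assumptions for illustration, not Microsoft's published tooling.

```python
import math

def mix_at_snr(speech, noise, snr_db):
    # scale the noise so that 10 * log10(P_speech / P_noise) == snr_db,
    # then add it to the speech to simulate a noisy microphone signal
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    target_noise_power = p_speech / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / p_noise)
    mixed = [s + gain * n for s, n in zip(speech, noise)]
    return mixed, gain
```

Sweeping `snr_db` over, say, 0, 10, and 20 dB produces mixtures ranging from very noisy to nearly clean, which is how amplifying or attenuating the noise imitates different realistic call conditions.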

But audiobooks are drastically different from conference calls. Wouldn't that affect the model, and thus the noise suppression?

"That's a good point," Aichner conceded. "Our team did make some recordings as well to make sure that we are not only training on synthetic data we generate ourselves, but that it also works on real data. But it's much harder to get those real recordings."

Privacy restrictions

Aichner's team is not allowed to look at customer data. Additionally, Microsoft has strict internal privacy guidelines. "I can't just say, 'Now I'm going to record every meeting.'"

So the team couldn't use Microsoft Teams calls. Even if they could (say, if some Microsoft employees opted into recording their meetings), someone would still have to mark down exactly when the distracting noises occurred.

"And that's why we right now have a smaller-scale effort to make sure we collect some of those real recordings with a variety of devices and speakers and so on," said Aichner. "What we then do is we make that part of the test set. So we have a test set which we believe is even more representative of real meetings. And then we see: if we train on a certain training set, how well does it work on the test set? So ideally, yes, I would love to have a training set that is all Teams recordings and has all the types of noises people hear. It's just that I can't easily get the same amount of data at the same scale as I can by taking another open source data set."

I pushed the point once more: how would an opt-in program for recording Microsoft employees using Teams impact the feature?

"You could argue that it gets better," Aichner said. "If you have more representative data, it could get even better. So I think that's a good idea to potentially in the future see if we can still improve. But I think what we've seen so far, even with just taking public data, it works very well."

Cloud versus edge

The next challenge is figuring out how to build the neural network, what the model architecture should be, and iterating. The machine learning model went through a lot of tuning, which required a lot of compute. Aichner's team naturally relied on Azure, using many GPUs. Even with all that compute, however, training a large model on a large data set can take days.

"A lot of machine learning happens in the cloud," Aichner said. "So, for speech recognition for example, you speak into the microphone, and that's sent to the cloud. The cloud has huge compute, and then you run these large models to recognize your speech. For us, since it's real-time communication, I need to process every frame. Let's say it's 10 or 20 millisecond frames. I need to process that within that time, so that I can send it to you immediately. I can't send it to the cloud, wait for the noise to be removed, and send it back."
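The real-time constraint Aichner describes maps to a simple streaming loop: chop the signal into 10 to 20 millisecond frames and denoise each one before the next arrives, rather than round-tripping audio to a server. The sample rate, frame length, and names below are illustrative assumptions, not Teams internals.

```python
SAMPLE_RATE = 16000      # 16 kHz, a common rate for speech processing
FRAME_MS = 20            # process 20 ms of audio at a time
FRAME_SAMPLES = SAMPLE_RATE * FRAME_MS // 1000   # 320 samples per frame

def stream_denoise(samples, denoise_frame):
    # cut the incoming signal into fixed-size frames and clean each one
    # immediately; denoise_frame must return within the ~20 ms budget,
    # otherwise frames pile up and the call falls behind real time
    out = []
    for i in range(0, len(samples) - FRAME_SAMPLES + 1, FRAME_SAMPLES):
        frame = samples[i:i + FRAME_SAMPLES]
        out.extend(denoise_frame(frame))
    return out
```

The key design point is that `denoise_frame` sees only the current frame (plus whatever internal state it keeps), which is why the model has to be small and fast enough to run locally.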

For speech recognition, leveraging the cloud can make sense. For real-time noise suppression, it's a non-starter. Once you have the machine learning model, you then have to shrink it to fit on the client. You need to be able to run it on a typical phone or computer. A machine learning model that only works for people with high-end machines is useless.
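One common way to shrink a model to fit on the client (a general technique, not necessarily what Teams uses) is to quantize 32-bit float weights down to 8-bit integers, roughly a 4x size reduction at a small accuracy cost. A minimal sketch:

```python
def quantize_int8(weights):
    # map each float weight to an integer in [-127, 127] using one shared scale
    scale = max(abs(w) for w in weights) / 127 or 1.0
    return [round(w / scale) for w in weights], scale

def dequantize(quantized, scale):
    # recover approximate float weights; the error per weight is
    # at most half the scale (half a quantization step)
    return [q * scale for q in quantized]
```

Real toolchains add per-layer scales, calibration, and quantization-aware training, but the storage story is the same: one byte per weight instead of four.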

Pushing processing to the edge

There's another reason the machine learning model should live on the edge rather than in the cloud: Microsoft wants to limit server use. Sometimes there isn't even a server in the equation to begin with. For one-on-one calls in Microsoft Teams, the call setup goes through a server, but the actual audio and video signal packets are sent directly between the two participants. For group calls or scheduled meetings, there is a server in the picture, but Microsoft minimizes the load on it. Doing a lot of server processing for every call adds costs, and every additional network hop adds latency. From a cost and latency standpoint, processing at the edge is more efficient.

"You want to make sure that you push as much of the compute to the endpoint of the user, because there isn't really any cost involved in that. You already have your laptop or your PC or your mobile phone, so now let's do some additional processing. As long as you're not overloading the CPU, that should be fine," Aichner said.

I pointed out that there is a cost, especially on devices that aren't plugged in: battery life. "Yeah, battery life, we are obviously paying attention to that too," he said. "We don't want you to have a much lower battery life now just because we added noise suppression. That's definitely another requirement we have when we ship. We need to make sure that we are not regressing there."

Download size and future updates

It's not just regressions the team has to consider, but also progress going forward. Because this is a machine learning model, the work never ends.

"We are trying to build something that is flexible in the future, because we are not going to stop investing in noise suppression after we release the first feature," Aichner said. "We want to make it better and better. Maybe for some noise types we are not doing as well as we should. We definitely want the ability to improve that. The Teams client will be able to download new models and improve the quality over time whenever we think we have something better."

The model itself will clock in at a few megabytes, but it won't affect the size of the client itself. He said, "That's also another requirement we have. When users download the app on the phone or on the desktop or laptop, you want to minimize the download size. You want to help people get going as quickly as possible."

Adding megabytes to that download "just for a model" isn't going to fly, Aichner said. After Microsoft Teams is installed, it will download the model later in the background. "That's what also allows us to be flexible in the future, that we could do even more, have different models."

Machine learning expertise

All of the above requires a final element: talent.

"You also need to have the machine learning expertise to know what you want to do with that data," Aichner said. "That's why we created this machine learning team in this intelligent communications group. You need experts to know what they should do with that data. What are the right models? Deep learning has a very broad meaning. There are many different types of models you can create. We have multiple centers around the world in Microsoft Research, and we have a lot of audio experts there too. We are working very closely with them because they have a lot of expertise in this deep learning space."

The data is open source and can be improved upon. Training takes a lot of compute, but any company can simply leverage a public cloud, including the leaders: Amazon Web Services, Microsoft Azure, and Google Cloud. So if another company with a video chat tool had the right machine learning talent, could it pull this off?

"The answer is probably yes, similar to how many companies have speech recognition," Aichner said. "They have a speech recognizer where there's also a lot of data involved. There's also a lot of expertise needed to build a model. The big companies do that."

Aichner believes Microsoft still has a big advantage because of its scale. "I think the value is the data," he said. "What we want to do in the future is, like you said, have a program where Microsoft employees can give us more real Teams calls, so that we have a better analysis of what our customers are really doing, what problems they are facing, and can personalize it further."
