If you've ever looked through the subtitle options on a DVD or Blu-ray disc, you've likely noticed that there's often a set of subtitles for deaf and hard-of-hearing users. Much of the sound in videos we watch isn't pure human language, and those subtitles account for that, offering text descriptions of significant audio cues. YouTube has offered automatically generated speech captioning for years. Now, Google is turning its machine-learning powers toward sound effects to bring audio-effect subtitling to its streaming video service.
As with automatic speech captioning, sound effect subtitling is starting out pretty basic, with just three labels: [LAUGHTER], [APPLAUSE], and [MUSIC]. Google explains in a blog post that while its machine-learning network is capable of recognizing many more types of sound, those three require the least contextual information. For contrast, Google engineer Sourish Chaudhuri explained that [RING] could be the ring of a bell, an alarm, or a phone.
One of the main challenges the researchers encountered was getting the system to make an educated guess when two sound effects occurred simultaneously. To work around that problem, the team added a duration rule: if a sound effect isn't detected for at least a certain period of time, it doesn't get mentioned in the subtitles.
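To make the duration rule concrete, here is a minimal sketch of how such a filter might work. It is an illustration, not Google's implementation: the function name `filter_by_duration`, the per-frame label sets, the half-second frame length, and the one-second threshold are all assumptions made for the example.

```python
from typing import List, Set, Tuple


def filter_by_duration(
    frames: List[Set[str]],
    frame_seconds: float,
    min_seconds: float,
) -> List[Tuple[str, float, float]]:
    """Return (label, start, end) for each run of consecutive frames in which
    a label is detected for at least min_seconds. Shorter runs are dropped.

    All names and parameters here are hypothetical, chosen for illustration.
    """
    events = []
    active = {}  # label -> index of the frame where its current run began

    # An empty sentinel frame closes any runs still open at the end.
    for i, labels in enumerate(list(frames) + [set()]):
        # Close runs for labels that are no longer detected.
        for label in list(active):
            if label not in labels:
                start = active.pop(label)
                duration = (i - start) * frame_seconds
                if duration >= min_seconds:
                    events.append((label, start * frame_seconds, i * frame_seconds))
        # Open runs for newly detected labels.
        for label in labels:
            active.setdefault(label, i)

    # Order by start time, then label, for stable output.
    return sorted(events, key=lambda e: (e[1], e[0]))


# Hypothetical detections, one set of labels per half-second frame.
frames = [
    {"MUSIC"},
    {"MUSIC", "LAUGHTER"},  # laughter appears for a single frame only
    {"MUSIC"},
    {"MUSIC"},
    set(),
    {"APPLAUSE"},
    {"APPLAUSE"},
]
captions = filter_by_duration(frames, frame_seconds=0.5, min_seconds=1.0)
```

In this made-up input, the half-second burst of laughter overlapping the music falls below the threshold and is left out, while the longer music and applause segments are kept with their start and end times.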
Google's blog post goes pretty deep into the weeds on the topic. If you're interested in the applications of deep learning, it's worth a look.