Cisco's Speaker Segmentation and Recognition System
This paper presents Cisco's speaker segmentation and recognition (SSR) system, which is a part of a commercial product. Cisco SSR uses speaker segmentation and speaker recognition algorithms with a crowd sourcing approach to create speaker metadata. The speaker metadata makes the enterprise videos more accessible and more navigable by itself, and by its combination with other forms of metadata such as keywords. This paper illustrates various functional blocks of SSR and a typical user interface. The paper describes the specific implementations of speaker segmentation and recognition algorithms. The paper also describes the evaluation data and protocols plus results for both speaker segmentation and speaker recognition tasks. Speaker segmentation results show that Cisco SSR performs comparable to the state-of-the-art on RT-03F data. Speaker recognition results show that a small set of user provided labels can be effectively transferred to a continuously expanding set of videos.