Speech@FIT Lecture Browser

Over the last few years, audiovisual recording of lectures and seminars has grown enormously at academic institutions around the world. The recorded material is often made available on an Internet website, usually accompanied by meta-data such as keywords, categories, title, author or links, which help users find the lecture they need. However, since lecture recordings are usually long (typically over an hour or two) and often contain a lot of interesting information, the ability to search inside the video is a crucial and much-desired feature. Otherwise, users may spend a long time watching or listening to unimportant material before they reach the really useful parts. A browser capable of searching in audio/video can therefore dramatically increase the accessibility of recorded lectures, seminars or meetings.

Up to now, a considerable number of lecture browsers have been developed; however, only a few support searching in audio/video data. Our browser lets users search what was spoken across all lectures, or restrict the search to a specific one. Each lecture can be accompanied by a smart list of automatically synchronized slides and other useful related information.

Our goals were:

  • to create a lecture browser with an intuitive and easy-to-use web-based interface.
  • to make the browser work in a standard web-browser, without a need to install any special executables and libraries.

Search in Speech

Navigating in speech brings the most significant speed-up in accessing lecture videos. In our system, speech is first transcribed off-line into word lattices by a large vocabulary continuous speech recognition (LVCSR) system and indexed. The browser then connects to a speech search engine.

The recognition system is based on state-of-the-art acoustic modeling [1] and includes context dependent hidden Markov models (CD-HMM). Feature extraction is done with vocal tract length normalization (VTLN) and heteroscedastic linear discriminant analysis (HLDA) transform, and constrained maximum likelihood linear regression (CMLLR) is used as unsupervised speaker adaptation. The models are trained on ~56 hours of high quality Czech data from SpeeCon and TEMIC databases. The language models are trained on spontaneous speech databases (~5 Mwords), written technical lecture texts (~2 Mwords) and lecture transcripts (~200 kwords) [2]. The LM used was a standard 3-gram.
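A 3-gram LM conditions each word's probability on the two preceding words. A minimal, unsmoothed sketch in Python (the real models interpolate smoothed estimates trained on the three corpora above):

```python
from collections import defaultdict

def train_trigram(tokens):
    """Count trigrams and their bigram contexts for an MLE 3-gram model."""
    tri = defaultdict(int)
    bi = defaultdict(int)
    for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
        tri[(a, b, c)] += 1
        bi[(a, b)] += 1
    return tri, bi

def trigram_prob(tri, bi, a, b, c):
    """P(c | a, b) by maximum likelihood; 0 for unseen contexts."""
    return tri[(a, b, c)] / bi[(a, b)] if bi[(a, b)] else 0.0
```

A production LM would add smoothing (e.g. Kneser-Ney) so unseen trigrams never receive zero probability.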

The browser is interfaced to the BUT speech indexing and search engine [3]. From the lattices at the output of the LVCSR system, a forward index is created and then converted to an inverted index, similarly to standard text-search applications. The search estimates a confidence for each hit by comparing the actual path with the best path in the searched lattice. Although BUT is active in combined word/sub-word approaches to speech search that eliminate the out-of-vocabulary (OOV) problem [4], the current version of the system supports search in word lattices only.
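The inversion step can be sketched as follows; the forward-index layout and field names here are assumptions for illustration, not the engine's actual data structures:

```python
from collections import defaultdict

def build_inverted_index(forward_index):
    """forward_index: {utterance_id: [(word, start, end, confidence), ...]}.
    Returns {word: [(utterance_id, start, end, confidence), ...]}."""
    inverted = defaultdict(list)
    for utt, hyps in forward_index.items():
        for word, start, end, conf in hyps:
            inverted[word].append((utt, start, end, conf))
    # Sort hits by confidence, best first, as the browser displays them
    for word in inverted:
        inverted[word].sort(key=lambda h: -h[3])
    return inverted
```

Looking up a query word is then a single dictionary access instead of a scan over all lattices.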

Synchronization of Slides

Lectures, particularly in the academic environment, are mostly based on slide presentations. Identifying when a particular slide was presented in the video is:

  • an important cue for navigating in lecture recordings.
  • necessary for providing the user with a high-quality version of the projected slide.

The system first pre-processes the video and locates the most probable area of the projected slides. Given this area, slides are extracted and unwarped to a normalized size. The extracted slides are then compared to each other and slide changes are detected. The comparison is based on an analysis of the approximated spatial distribution of slide differences: when the difference is high and the vertical and horizontal distributions of changes are more or less uniform, a slide change is detected.
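The change-detection rule above can be sketched as follows; the thresholds and the coefficient-of-variation test for uniformity are hypothetical simplifications of the actual analysis:

```python
import numpy as np

def is_slide_change(prev, curr, diff_thresh=0.1, uniform_thresh=0.5):
    """prev, curr: grayscale slide images (2-D float arrays, same shape).
    Report a change when the overall difference is large AND the row/column
    distributions of changed pixels are roughly uniform."""
    diff = np.abs(curr - prev) > 0.2          # per-pixel change mask (assumed threshold)
    if diff.mean() < diff_thresh:             # too few changed pixels: same slide
        return False
    rows = diff.mean(axis=1)                  # vertical distribution of changes
    cols = diff.mean(axis=0)                  # horizontal distribution of changes
    # A uniform distribution has a low coefficient of variation
    cv = lambda d: d.std() / (d.mean() + 1e-9)
    return cv(rows) < uniform_thresh and cv(cols) < uniform_thresh
```

The uniformity test keeps a localized change, such as the lecturer's hand over one corner, from being mistaken for a new slide.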

A sequence of extracted images of a particular slide is combined using the per-pixel median in the temporal domain. The median approach partially removes the objects moving in front of the slide projection area such as the lecturer, students, etc. Stored extracted slides serve as an index to the video and as the source of enhanced slide images provided to the user.
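The temporal median compositing can be sketched in a few lines, assuming the extracted slide images are already aligned and normalized:

```python
import numpy as np

def composite_slide(frames):
    """frames: list of aligned, normalized slide images (H x W arrays).
    The per-pixel temporal median suppresses transient occlusions,
    such as the lecturer passing in front of the projection area."""
    return np.median(np.stack(frames), axis=0)
```

As long as each pixel is unoccluded in more than half of the frames, the median recovers the clean slide value at that pixel.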

Figure 1: A scheme of slide extraction and change detection.

When the original slides are available, it is better to visualize these originals rather than the ones extracted from the video. The matching between extracted and original slides is based on dense gradient features extracted on an image pyramid (Figure 2). The approach follows the idea of SIFT descriptors [5]. The image is sub-sampled into several levels, and a regular grid divides each level into blocks. Gradients are extracted from each block and weighted by a Gaussian function to improve invariance to translation. A similar approach was successfully used in [6]. First, features are extracted from the set of original slides. Each extracted slide is also described by its set of features. Each feature then votes for the k best slides from the original set, and the most frequently voted original is chosen.
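A heavily simplified sketch of the matching: mean gradient magnitudes per block stand in for the full SIFT-like descriptors, and nearest-neighbour selection replaces the per-feature voting; grid size and pyramid depth are illustrative:

```python
import numpy as np

def dense_gradient_features(img, levels=2, grid=4):
    """Per-block mean gradient magnitudes on an image pyramid
    (no Gaussian weighting or orientation binning in this sketch)."""
    feats = []
    for _ in range(levels):
        gy, gx = np.gradient(img.astype(float))
        mag = np.hypot(gx, gy)
        h, w = mag.shape
        for i in range(grid):
            for j in range(grid):
                block = mag[i*h//grid:(i+1)*h//grid, j*w//grid:(j+1)*w//grid]
                feats.append(block.mean())
        img = img[::2, ::2]                    # subsample to the next pyramid level
    return np.array(feats)

def match_slide(extracted, original_feats):
    """Pick the original slide whose features are closest to the extracted one."""
    dists = [np.linalg.norm(extracted - o) for o in original_feats]
    return int(np.argmin(dists))
```

The real system compares many local features per slide, so a single occluded region cannot dominate the decision the way it could with one global descriptor.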

Figure 2: An example of a slide and first two levels of dense gradient features.

Web Interface

Our web-based lecture browser is powered by Lighttpd that, in connection with a module for HTTP Flash video streaming, enables users to play the video starting from any key frame. Web pages are dynamically generated using PHP, JavaScript and AJAX. Most of the data is stored in XML format for easy handling. This also applies to language dependent strings, so our browser can be easily localized to any language. For the sake of easy navigation, lectures are grouped into categories and subcategories based on the course, term and year. Besides that, a breadcrumb trail is provided at the top of each page to help users keep track of their location.

Since our lecture browser is completely web-based, in most cases users do not need to install or download anything. To play lecture video recordings, it uses the Adobe Flash Player, which is already installed on over 98% of Internet-connected computers worldwide. Each page has the same header and footer; only the middle part differs. There are three main page types: the search results page, the category page and the lecture page. The search results page is accessed via the input field for searching in audio data and shows all results matching the query across all lectures. Each category page displays a list of assigned lectures along with their basic information. Clicking on a lecture name or image brings the user to the lecture page, which is used for lecture playback. This page consists of various components organized in boxes, each of which can be collapsed or expanded by clicking anywhere on its title bar.

Browser Components

Each component performs a different function and provides users with information related to the lecture.

Video Playback

This component enables users to watch lecture video recordings. To have our browser widely accessible on many platforms, we decided to use a Flash-based solution for video playback and chose the popular and flexible JW FLV Media Player. First of all, every video recording has to be converted into the Flash video file format (FLV). Then, it is necessary to add navigation cue points that allow users to seek to a particular key frame in the FLV file. The JW FLV Media Player is used with a modified look and a default set of buttons providing control over the video playback. It lets users switch to the full-screen mode and supports subtitles through the use of a special plug-in that can handle W3C Time-text XML files.

Slides

This component displays slides as they are presented. The data is stored in an XML file that can be generated by our slide synchronization tool. Slides are usually shown in the video recording along with the lecturer. However, slides containing small-font text are often difficult or even impossible to read due to low resolution and blurring, which video compression aggravates further. To reduce this problem, this component can display sharp images of slides matching those shown in the video. It is divided into three sections. The upper one shows the current slide image; clicking on it pops up a new window showing the image in its maximum resolution. The middle one holds a clickable table of all slide thumbnails, which lets the user browse through the lecture and quickly jump to the position in the video where the desired slide appears. Multiple occurrences of a slide in the lecture are marked; this reflects situations when the lecturer returns to a previously presented slide. Finally, the lower section may provide links to presentation materials, mostly PDF or PowerPoint files.
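A sketch of how such a synchronization file might be read; the XML schema below (element and attribute names) is purely illustrative, not the tool's actual format:

```python
import xml.etree.ElementTree as ET

# Hypothetical slide-synchronization file: one entry per time a slide appears
SYNC_XML = """
<slides>
  <slide time="0" image="slide01.png"/>
  <slide time="95" image="slide02.png"/>
  <slide time="240" image="slide01.png"/>
</slides>
"""

def parse_slide_sync(xml_text):
    """Return [(start_seconds, image)] sorted by time; a repeated image
    marks a slide the lecturer returned to."""
    root = ET.fromstring(xml_text)
    return sorted((int(s.get("time")), s.get("image"))
                  for s in root.findall("slide"))
```

The browser only needs this sorted list to both highlight the current thumbnail and seek the video when a thumbnail is clicked.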

Search in Audio

The actual search is performed by our speech search engine, which can run as a non-blocking server either on the same or on a different computer. This component provides the user with an input field and a search button. A search query (a keyword or a sequence of keywords) is sent to the server to be processed. If there are any results, the server returns them formatted in XML; otherwise, it responds with an error message. The communication between the client and the server is done through AJAX. An important advantage of AJAX is that it is asynchronous, so data is transferred and processed in the background without interfering with the existing page; this allows users to continue watching the lecture. Once the results are obtained from the server, they are displayed along with their confidence scores, with the highest-confidence results shown first. Each row represents one result matching the search query and can also be accompanied by the surrounding transcript segment. The user can begin playing the video from any result. Search queries can be logged for later analysis, which helps extend the recognition vocabulary when a queried word is missing from it.
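On the client side, the returned XML is parsed and the hits ordered by confidence. The result schema below is an assumption for illustration only; the engine's actual XML format may differ:

```python
import xml.etree.ElementTree as ET

# Hypothetical search-result document returned by the engine
RESULTS_XML = """
<results query="markov">
  <hit lecture="l01" time="123.4" confidence="0.92">hidden markov models</hit>
  <hit lecture="l03" time="45.0" confidence="0.71">markov chains</hit>
</results>
"""

def parse_results(xml_text):
    """Parse hits and order them by confidence, best first."""
    root = ET.fromstring(xml_text)
    hits = [(float(h.get("confidence")), h.get("lecture"),
             float(h.get("time")), h.text) for h in root.findall("hit")]
    return sorted(hits, reverse=True)
```

Each parsed hit carries the lecture identifier and start time, which is exactly what the player needs to seek to the result.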

Speech Transcript

In our browser, this component displays a written form of what was spoken. The transcript is the 1-best output from our speech recognizer, divided into short textual segments very similar in form to subtitles. Each segment contains its starting point on the lecture time-line and the corresponding part of the transcript. The whole transcript is organized in a table in which each row belongs to one segment. In the lecture playback mode, the segments are highlighted as they are played, so even hearing-impaired users can follow the lecture to some extent. The current segment can also appear in the video as a caption. By clicking on a table row, users can begin playing the video from the starting point of the assigned segment.
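Finding which segment to highlight at a given playback time reduces to a binary search over the sorted segment start times; a minimal sketch:

```python
from bisect import bisect_right

def current_segment(starts, t):
    """starts: sorted list of segment start times (seconds).
    Return the index of the segment playing at time t, i.e. the last
    segment whose start does not exceed t, or -1 before the first segment."""
    return bisect_right(starts, t) - 1
```

This keeps highlighting cheap even for transcripts with thousands of segments, since each lookup is O(log n).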

Other Components

To date, several other components have been implemented. For example, the Web links component can guide users to resources related to the lecture. In the case of multiple speakers, a component graphically showing the lecture time-line along with the speaker segmentation helps locate all segments of the desired speaker. Furthermore, we have implemented a component containing links to related lectures; a related lecture is assigned based on keywords and category. New components can easily be added to our lecture browser through a simple interface.

Conclusions

We have developed a browser that significantly helps users navigate in lecture recordings. This is achieved not only by coupling it to the speech search engine, but also by providing a smart list of all presented slides. Furthermore, an intuitive web interface and a hierarchical categorization of lectures ensure that the desired lecture is quickly found. Because it uses a Flash player for video playback, the browser runs on many computers with different operating systems.

References

[1] J. Kopecký, O. Glembek, and M. Karafiát. Advances in Acoustic Modeling for the Recognition of Czech. In TSD ’08: Proceedings of the 11th international conference on Text, Speech and Dialogue, pages 357–363, 2008.

[2] T. Mikolov, J. Kopecký, L. Burget, O. Glembek, and J. Černocký. Neural Network Based Language Models for Highly Inflective Languages. In Proc. ICASSP 2009, Taipei, 2009.

[3] J. Černocký et al. Search in Speech for Public Security and Defense. In Proc. IEEE Workshop on Signal Processing Applications for Public Security and Forensics, 2007 (SAFE ’07), Washington D.C., 2007.

[4] I. Szöke, L. Burget, J. Černocký, and M. Fapšo. Sub-word modeling of out of vocabulary words in spoken term detection. In Proc. 2008 IEEE Workshop on Spoken Language Technology, pages 273–276, Goa, IN, 2008.

[5] D. G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision, 60:91–110, 2004.

[6] V. Beran, A. Herout, M. Hradiš, I. Řezníček, and P. Zemčík. Video Summarization at Brno University of Technology. In ACM Multimedia, 2008 ACM International Conference on Multimedia Compilation E-Proceedings, TVS’08, pages 31–34, New York, 2008.

Development team

  • Josef Žižka
  • Igor Szöke
  • Michal Fapšo
  • Vítězslav Beran
  • Jakub Kubalík
  • Kamil Chalupníček
  • Eva "Isabella" Hajtrová
  • Jana Skokanová
  • Tomáš Kašpárek

Contributors to the speech processing system:

  • Lukáš Burget
  • Petr Schwarz
  • Martin Karafiát
  • František Grézl
  • Tomáš Cipr
  • Tomáš Mikolov
  • Pavel Matějka
  • Ondřej Glembek
  • Honza Černocký