We introduce Professor Lin-shan Lee. He has been with National Taiwan University since 1982. His early work focused on the broader area of spoken language systems, particularly for the Chinese language, with a number of early breakthroughs in that language. His more recent work has focused on the fundamentals of speech recognition and on network-environment issues such as information retrieval and semantic analysis of spoken content. He is an IEEE Fellow and an ISCA Fellow. He has served on numerous boards and received a number of awards, including, recently, the Meritorious Service Award from the IEEE Signal Processing Society. So please join me in welcoming Professor Lin-shan Lee.

Hello. Can you hear me? Good. Thank you, Larry. It is my great pleasure to be here today to present to you spoken content retrieval: lattices and beyond. My name is Lin-shan Lee, from National Taiwan University.

In this talk I will first introduce the problem and some fundamentals, and then I will spend more time on some recent research examples, followed by a short demo before the conclusion.

So first, let me introduce the problem.

We are all very familiar with text content retrieval, which has been very successful: for any user query or user instruction the user enters, the desired information can be obtained very quickly, in real time, drawing on huge quantities of documents. All users like it, and it has even produced various successful industries.

Now, today we all know that all roles of text can be accomplished by voice. On the content side, we do have spoken content over the network, in various forms. On the query side, voice queries can and should be entered over handheld devices. So it is time for us to consider spoken content retrieval.

Now, this is what we have today: when we enter a text query, we get text content. In the future, both the queries and the content can be in the form of voice.

First, we may use text queries to retrieve spoken content. This case is very often referred to as spoken document retrieval, and a more specific subset of that problem is referred to as spoken term detection — in other words, detecting whether the query terms exist in the spoken content.

Of course, we can also retrieve text content using voice queries. That case is usually referred to as voice search. However, in that case the documents to be retrieved are in text form, and therefore it is out of the scope of this talk, so I am not going to spend more time talking about voice search.

Of course, we can also do the other case — retrieving spoken content using spoken queries — which is sometimes referred to as query by example. In this talk I will focus on retrieval of spoken content primarily using text queries, but sometimes we will also consider the case of spoken queries.

Now, as we all understand, if the spoken content and the spoken queries could be accurately recognized, this problem would reduce to the well-known text content retrieval problem, and there would be no difficulty. Of course, that never happens, because recognition errors are inevitable in most cases. That is the major challenge.

Today there are many handheld devices with multimedia functionalities commercially available, and there are unlimited quantities of multimedia content spread over the Internet. So we should be able to retrieve not only text content but also multimedia and spoken content. In other words, the wireless and multimedia technologies of today are creating an environment for spoken content retrieval. Let me repeat: network access is primarily text-based today, but almost all roles of text can be accomplished by voice.

Next, let me mention very briefly some fundamentals. First, we all understand that recognition always gives errors, for various reasons — spontaneous speech, OOV words, mismatched models, and so on — and that makes the problem difficult. So a good approach may be to consider lattices with multiple alternatives rather than the one-best output only. In that case we have a higher probability of including the correct words, but we also have to include more noisy words, which causes problems. Even with lattices, we still have the problem that some correct words may never be included, because they are OOV words. And using lattices implies much larger memory and computation requirements, which is another major problem.

Of course, other approaches exist for similar problems. For example, people use confusion matrices to model recognition errors and expand the query and the documents with them; people use pronunciation modeling to expand the query in that way; people also use fuzzy matching, in which the match between the query and the content does not have to be exact. These are all very good approaches, but I won't have time to say more about them, since our focus is on lattices in this talk.

Now, the first question is how we can index the lattices. Suppose we have a lattice like this. The most popular approach to indexing a lattice is to transform it into a sausage-like structure — in other words, a series of segments, where every segment includes a number of word hypotheses, each with a posterior probability. In this way, the position information for the words is readily available: word 1 is in the first segment and word 8 in the second, so word 1 can be followed by word 8, which gives a bigram, and so on. This is more compatible with existing text indexing techniques, and the required memory and computation can be reduced slightly. In addition, we may notice that in this way we add more possible paths: for example, word 3 could not be followed by word 8 in the original lattice, but here that becomes possible. Also, the noisy words can be discriminated by their posterior probabilities, because we do have scores.

In either case, we can match the query against this structure. For example, if the query contains the bigram of word 3 followed by word 5, we can check whether this bigram exists in the index; we enumerate all the possible matches, accumulate the scores, and so on.
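
As a rough illustration, here is a minimal sketch in Python, with a hypothetical toy data structure: a sausage as a list of segments, each segment a dict mapping word hypotheses to posterior probabilities. It only shows the idea of accumulating posterior-weighted evidence for a query bigram, not any particular published indexing scheme.

```python
# A "sausage" is a list of segments; each segment maps word hypotheses
# to posterior probabilities (hypothetical toy values).
sausage = [
    {"w1": 0.6, "w2": 0.3, "w3": 0.1},   # segment 1
    {"w8": 0.5, "w5": 0.4, "w9": 0.1},   # segment 2
    {"w5": 0.7, "w6": 0.3},              # segment 3
]

def bigram_score(sausage, first, second):
    """Accumulate posterior-weighted evidence for `first second`
    occurring in adjacent segments of the sausage."""
    score = 0.0
    for seg_a, seg_b in zip(sausage, sausage[1:]):
        if first in seg_a and second in seg_b:
            # Multiply posteriors of the two hypotheses; sum over positions.
            score += seg_a[first] * seg_b[second]
    return score

print(bigram_score(sausage, "w3", "w5"))  # 0.1 * 0.4 = 0.04
```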

Many approaches have been proposed for this kind of lattice indexing; I just list a few examples here. I think the most popular ones today may be the position-specific posterior lattice, or PSPL; confusion networks, or CN; and another very popular one, the weighted finite-state transducer, or WFST.

Now let me take one minute to explain the first two: the position-specific posterior lattice (PSPL) and the confusion network. Suppose this is a lattice, and these are all the possible paths through it — there are six. The PSPL, or position-specific posterior lattice, tries to locate every word in a segment based on the position of that word within a path. For example, word 10 here appears only as the fourth word of a path, so it appears in the fourth segment. The confusion network, on the other hand, tries to cluster words together based on, for example, their time spans and pronunciations. For example, words 5 and 10 may have very similar time spans and pronunciations, so they may be clustered together and appear in the second cluster here. So you may note that the different approaches give different indices.
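
To make the PSPL idea concrete, here is a minimal sketch. It assumes the lattice paths have already been enumerated, each path given as a list of words with a whole-path posterior — a simplification of the real construction, which works on the lattice directly.

```python
from collections import defaultdict

# Hypothetical toy paths standing in for the six paths of the example
# lattice: (word sequence, posterior probability of the whole path).
paths = [
    (["w1", "w8", "w5", "w10"], 0.4),
    (["w1", "w5", "w6", "w10"], 0.3),
    (["w2", "w8", "w6"], 0.3),
]

def build_pspl(paths):
    """Collect, for every path position, the posterior mass of each word
    appearing at that position (the essence of a PSPL)."""
    pspl = defaultdict(lambda: defaultdict(float))
    for words, prob in paths:
        for pos, word in enumerate(words):
            pspl[pos][word] += prob
    return {pos: dict(d) for pos, d in sorted(pspl.items())}

for pos, hyps in build_pspl(paths).items():
    print(pos, hyps)
# w10 appears only as the fourth word (position 3), so it lands in segment 4.
```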

Now, a major problem here is OOV words. As you understand, an OOV word cannot be recognized and therefore never appears in the lattice. That matters, because very often the query includes exactly such OOV words. There are many approaches to handle this problem; I think the most fundamental one is to use subword units.

Let me take this example. Suppose an OOV keyword W is composed of four subword units — every small w_i is a subword unit, for example a phoneme or a syllable — and these arcs of the lattice are also labeled with subword units. Because W is not in the vocabulary, it is not recognized here. However, if we look carefully, we notice that the word is actually there: it is hinted at at the subword level, so it can actually be matched at the subword level without ever being recognized. That is the major idea, and different approaches can be developed to handle OOV queries using subword units.
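
A minimal sketch of that idea, assuming a hypothetical lexicon that maps the query word to a subword string, and a lattice path already flattened into a subword sequence:

```python
# Hypothetical lexicon mapping query words to subword (e.g. syllable) strings.
def to_subwords(word):
    lexicon = {"W": ["w1", "w2", "w3", "w4"]}   # toy OOV keyword
    return lexicon[word]

def contains_subsequence(lattice_subwords, query_subwords):
    """Check whether the query's subword sequence occurs contiguously
    in a lattice path's subword sequence."""
    n, m = len(lattice_subwords), len(query_subwords)
    return any(lattice_subwords[i:i + m] == query_subwords
               for i in range(n - m + 1))

# A lattice path over subword units: W is "hinted at" even though the
# word-level recognizer never produced it.
path = ["a1", "w1", "w2", "w3", "w4", "b7"]
print(contains_subsequence(path, to_subwords("W")))  # True
```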

One example, for instance, is to construct the same PSPL or CN structures, but based on subword units.

Many different subword units have been used in this approach, and usually we can categorize them into two classes. The first is linguistically motivated units — for example phonemes, syllables, characters, morphemes, and so on. The other is data-driven units: in other words, units derived by some data-driven algorithm, and different algorithms produce different units — for example, some are called particles, some word fragments, some phone multigrams, and so on.

Of course, there are other, quite different approaches. If we do have the query in voice form, we can also match the query in speech directly against the content in speech, without doing recognition at all. In that case we avoid the recognition error problem, and we can even do it in an unsupervised way; we do not even need a lattice. This matching can be performed frame-based, for example with DTW (a minimal sketch follows below), segment-based — imagining the speech divided into segments — or model-based, and so on.
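
Here is a minimal sketch of frame-based matching with classical DTW, assuming the two utterances have already been converted to feature-vector sequences (e.g., MFCC frames) as NumPy arrays; the toy vectors below are placeholders.

```python
import numpy as np

def dtw_distance(x, y):
    """Classical dynamic time warping between two feature sequences
    x (n frames) and y (m frames), with Euclidean local distance."""
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(x[i - 1] - y[j - 1])
            # Allow match, insertion, and deletion steps.
            cost[i, j] = d + min(cost[i - 1, j - 1],
                                 cost[i - 1, j],
                                 cost[i, j - 1])
    return cost[n, m]

# Toy 2-dimensional "feature" sequences standing in for MFCC frames.
query = np.array([[0.0, 1.0], [1.0, 2.0], [2.0, 3.0]])
segment = np.array([[0.1, 1.1], [1.0, 2.1], [1.9, 2.8], [2.0, 3.0]])
print(dtw_distance(query, segment))
```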

But all of these approaches do not use recognition and therefore do not have lattices, so I won't spend more time on them here; I will just focus on the approaches with lattices.

Okay. With all these fundamentals covered, let me now describe some recent research examples. I have to apologize that I can only cover a small number of examples; many good examples I simply cannot cover.

Below, I will assume the retrieval process looks something like this. This is the spoken archive; after recognition based on some acoustic models, we have the lattices; then retrieval is applied on top of these lattices. By "search engine" I mean indexing the lattices and searching over the index; by "retrieval model" I mean anything in addition, for example the confusion matrices or term weighting mentioned earlier, and so on. All the following discussion is based on this framework.

graph to discuss the following

the first thing we can think about can do is to do integration and wait

and for example we can integrate different rules from recognition

from different recognition systems

from those

based on different subword units

oh in queens some of the

information and so on

In addition, a good idea may be to train those model parameters, if we have some training data available. What kind of training data is needed here? The training data we need is a set of queries, each with its associated relevant and irrelevant segments. For example, when a user entered query Q1, we got this list; the first two items are false, or irrelevant, and the next two are relevant, and so on. We need a set of data of this kind. Such data does not necessarily have to be annotated by people, because we can collect it from real click-through data. For example, if the user entered query Q1, got this list, skipped the first two items, and directly clicked the next two, we may assume that the first two are irrelevant, or false, and the next two are relevant. In this way we obtain click-through data.

When we have this data, we can do something more. For example, we can use it to train the parameters here: different weighting parameters to weight the different recognition outputs and the different subword units, together with different information such as word confidence scores or phone confusion matrices, and so on.

Let me very briefly show you two examples. In the first example, we actually use the two different indexing approaches we just mentioned, confusion networks and position-specific posterior lattices. In each case we use not only word-based indexing but also indexing based on subword units, and in each case we match unigrams, bigrams, and trigrams. So we have a total of eighteen different scores, and we try to add them together with some weighting, in order to optimize a parameter describing the retrieval performance, which is called MAP.

The MAP I mention in this talk is mean average precision, which is the area under the recall-precision curve, a performance measure frequently used for information retrieval. Of course there are many other metrics; I just have time to use this one here.
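
For reference, here is a minimal sketch of how (mean) average precision can be computed from ranked retrieval results with binary relevance labels — a standard textbook definition, shown here only to make the metric concrete:

```python
def average_precision(relevance):
    """relevance: list of 0/1 labels in ranked order.
    Average of the precision values at each relevant rank."""
    hits, precisions = 0, []
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / max(hits, 1)

def mean_average_precision(results_per_query):
    return sum(average_precision(r) for r in results_per_query) / len(results_per_query)

# Two toy queries: ranked lists with relevant (1) / irrelevant (0) items.
print(mean_average_precision([[1, 0, 1, 0], [0, 1, 1]]))
```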

Now we can try to optimize this parameter using an extended version of the support vector machine.

Here are the results: these are the MAP values for the different scores used individually, and this is the result when we integrate them all together. You see we get a net gain of about seven to eight percent in MAP, which is not bad.

Here is another example. We think it is possible to have context-dependent term weighting; in other words, the same term may have different weights depending on the context. For example, if the query is just the single term "information", then the word "information" is very important; but if the query is "speech information retrieval", then the word "information" is not so important, because the important terms are "speech" and "retrieval". In this way, the same term may have different weights in different contexts, and these weights can be trained. Here are the results: using context-dependent term weighting, we actually get some gain in MAP.

Okay, those are some examples of direct weighting. Now, what can we do next? The first thing we think about is the acoustic models: can we do something there? Just as the many experts in acoustic modeling do discriminative training of acoustic models, can we use this kind of training data to re-estimate the acoustic models? In the past, retrieval was always considered on top of the recognition output: two cascaded, independent stages, so the retrieval performance really relies on the recognition accuracy. So why don't we consider the two stages together as a whole? Then the acoustic models can be re-estimated by optimizing the retrieval performance directly, and in this way the acoustic models may be better matched to the respective data sets. We learned from MPE training and defined the objective function accordingly in this work.

Here are the results, for three different sets of acoustic models: these are speaker-independent models, these are adapted by global MLLR, and these are adapted further by MAP — here MAP means the adaptation method, maximum a posteriori, not mean average precision. The numbers shown, however, are MAP in the retrieval sense, not recognition accuracies. As you notice, we do get some improvements, but they are relatively limited, probably because we were not able to define a good enough objective function.

Another possible reason may be that different queries really have quite different characteristics, so when we put many queries together, the different queries interfere with each other in the training data. So we are thinking: why not use query-specific acoustic models? In other words, we re-estimate the acoustic models for every individual query. That means this has to be done in real time, online. Is that possible? We think yes.

Because, based on the first several utterances that the user clicks through while browsing the retrieval results, all the utterances not yet browsed can be reranked with the re-estimated acoustic models. The models can actually be updated, and the lattices rescored, very quickly. Why? Because we have only a very limited amount of training data, so the re-estimation can be very fast.

So this is the scenario. The system gives the retrieval results here, and the user clicks, browses, and rates the first several, in this way indicating whether they are relevant or irrelevant. These results are then actually fed back to re-estimate the acoustic models; we get new models, and these are used to rescore the lattices and rerank the rest of the utterances.

So what are the results? We can see that with just one iteration of model re-estimation — which makes real-time adaptation possible — we do get some improvements here.

Now, what else can we do? Well, how about the acoustic features? Yes, we can do something with the acoustic features too. For example, if an utterance is known to be relevant or irrelevant, then all the utterances similar to it are more probably relevant or irrelevant as well. So we have the same scenario: the user sees the output and clicks the first several utterances; we use those first several utterances as references, and the utterances not yet browsed are compared with them based on acoustic similarity, and then reranked. Let's see whether this is better or not.

First we need to define the similarity in terms of acoustic features. For every utterance, we define the hypothesized region as the segment of the feature-vector sequence corresponding to the arc in the lattice matching the query Q with the highest score. For example, for this utterance, this is the feature-vector sequence, this is the arc corresponding to the query, and so this is the hypothesized region; similarly, another utterance has its own sequence and hypothesized region. The similarity can then be derived based on the DTW distance between these two regions, exactly as in the DTW sketch shown earlier. In this way we can perform the scenario we just mentioned.

Here are the results, again for the three sets of acoustic models. We may notice that using acoustic similarity in this way, we get slightly better improvements compared with directly re-estimating the acoustic models.

Okay, so what else can we do? We may consider a different approach. Above, we always assumed that we need to rely on the users to give us some feedback information. Do we really need to rely on the users? No, because we can always derive relevance information automatically: we can assume the top N utterances of the first-pass retrieval results are relevant — actually, "pseudo-relevant" — and this is referred to as pseudo-relevance feedback.

Here is the scenario. The user enters a query, and the system produces the first-pass retrieval results, but these results are not shown to the user. Instead, we simply assume the top N utterances are relevant; all the rest are compared with these top N to see whether they are acoustically similar or not, the results are reranked based on that similarity, and only the reranked results are shown to the user.

Okay, here are the results. You can see that with this pseudo-relevance feedback, for the different acoustic models, we really do get slightly better improvements here.

Now, what else can we do? We can further improve the above pseudo-relevance feedback approach, for example with a graph-based approach. Remember that in the pseudo-relevance feedback approach, the top N utterances are taken as the references — we assume they are relevant — but of course they are not reliable. So why don't we consider the whole first-pass retrieval results globally, using a graph? In other words, we construct a graph for all utterances in the first-pass retrieval results: every utterance is represented as a node, and the edge weights are the acoustic similarities between utterances.

Now we may assume that an utterance strongly connected to, or very similar to, utterances with high scores should itself have a high score. For example, if X2 and X3 here have high scores, then X1 should as well; similarly, if X2 and X3 all have low scores, then X1 should have a low score too. In this way, the scores can propagate on the graph and spread among strongly connected nodes, so all the scores can be corrected. We can then rerank all the utterances in the first-pass retrieval results using these corrected scores.
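
A minimal sketch of such score propagation: one common formulation interpolates each utterance's own first-pass score with the similarity-weighted scores of its neighbors and iterates to convergence. The details here are an assumption for illustration, not necessarily the exact formulation used in the work described.

```python
import numpy as np

def propagate_scores(sim, scores, alpha=0.8, iters=50):
    """sim: (n, n) matrix of pairwise acoustic similarities.
    scores: first-pass relevance scores. Each node's score is pulled
    toward the similarity-weighted average of its neighbors."""
    # Row-normalize similarities into transition weights.
    w = sim / sim.sum(axis=1, keepdims=True)
    s = scores.copy()
    for _ in range(iters):
        s = alpha * scores + (1 - alpha) * w @ s
    return s

sim = np.array([[1.0, 0.9, 0.8],
                [0.9, 1.0, 0.7],
                [0.8, 0.7, 1.0]])
first_pass = np.array([0.2, 0.9, 0.8])     # X1 scored low, but its highly
print(propagate_scores(sim, first_pass))   # similar neighbors pull it up
```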

Here are the results, again for the three sets of acoustic models. You may notice that the graph-based approach now provides higher improvements. This is reasonable, because the graph-based approach really considers the whole first-pass retrieval results globally, rather than relying only on the top N utterances as references.

Okay, what else can we do? Of course, machine learning has been used and shown useful in some of this work. Let me show one example using support vector machines, in the pseudo-relevance feedback scenario we just mentioned. Here is the scenario again. The user enters query Q, and this is the first-pass retrieval result; it is not shown to the user. Instead, we simply take the first-pass retrieval results and consider the top N utterances as assumed relevant, taking them as positive examples; the bottom N are assumed irrelevant and taken as negative examples. We then simply extract feature vectors from them.

Then we train a support vector machine. For the rest of the utterances, we simply extract the same feature parameters and rerank them with this support vector machine, and only the reranked results are shown to the user. Please note that in this case we need to train an SVM for every query, online. Is that possible? Yes, because we have only a limited number of training examples, so the training can be very fast.
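
A minimal sketch of this per-query reranking using scikit-learn; the feature extraction is stubbed out with random vectors (in the approach described here, the features are the state-averaged supervectors explained next), and all sizes are hypothetical.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n, dim, top_n = 100, 32, 10

# Hypothetical stand-ins: feature vectors for the first-pass results,
# already sorted by first-pass relevance score.
features = rng.normal(size=(n, dim))

# Pseudo-labels: top N assumed relevant, bottom N assumed irrelevant.
X_train = np.vstack([features[:top_n], features[-top_n:]])
y_train = np.array([1] * top_n + [0] * top_n)

svm = SVC(kernel="rbf")          # tiny training set, so this is fast
svm.fit(X_train, y_train)

# Rerank the remaining utterances by their signed distance to the
# separating hyperplane.
middle = features[top_n:-top_n]
order = np.argsort(-svm.decision_function(middle))
reranked = middle[order]
```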

Now, the first thing we need to do is define how to extract the feature parameters to be used in training the SVM. Again we can use the hypothesized region we just mentioned. Suppose this is an utterance, this is the corresponding lattice, here is the query, and so this is the hypothesized region. We can divide this region into HMM states; the feature vectors within each state can be averaged into one vector, and these vectors for the different states can then be concatenated into a supervector, which serves as the feature vector for this utterance.
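
A minimal sketch of that supervector construction, assuming the frame-to-state alignment is already given as a list of frame-index ranges, one per state (toy values below):

```python
import numpy as np

def state_supervector(frames, state_spans):
    """frames: (n_frames, dim) features of the hypothesized region.
    state_spans: list of (start, end) frame ranges, one per HMM state.
    Average the frames within each state, then concatenate."""
    means = [frames[s:e].mean(axis=0) for s, e in state_spans]
    return np.concatenate(means)

frames = np.random.default_rng(1).normal(size=(30, 13))  # toy MFCC frames
spans = [(0, 10), (10, 22), (22, 30)]                    # 3-state alignment
print(state_supervector(frames, spans).shape)            # (39,)
```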

So what are the results with these feature vectors? Again we can see the results for the different acoustic models. You may notice that the SVM is now much better than using the references directly, which in turn was much better than the earlier results. Of course, I have to mention that all the results reported here are very preliminary; they were obtained in preliminary experiments only.

Now, what else can we do? All the above discussions primarily considered acoustic models and acoustic features. How about linguistic information? Yes. For example, the most straightforward information from linguistics we can use is context consistency: the same term usually has very similar contexts, while quite different contexts usually imply that the terms are quite different. So what can we do? We can do exactly the same as we did with the SVM, except that now the feature vectors represent context information. We use exactly the same scenario — for the first-pass retrieval results, we use the top N and bottom N to train the SVM — except that now we use different feature vectors.

Suppose this is an utterance with its corresponding lattice, and here is the query. We can construct a left-context vector, whose dimensionality is the lexicon size: only the words appearing in the left context of the query have their posterior probabilities as scores, and all the other words have zeros there. Similarly, we can have a right-context vector and a whole-segment context vector, and then we concatenate them together into one feature vector, with a dimensionality of three times the lexicon size.
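
A minimal sketch of these context vectors, with a hypothetical toy lexicon and the context words with their posteriors given directly rather than read off a real lattice:

```python
import numpy as np

lexicon = ["speech", "information", "retrieval", "lattice", "query"]
index = {w: i for i, w in enumerate(lexicon)}

def context_vector(posteriors):
    """posteriors: dict word -> posterior probability for words seen in
    one context region; all other lexicon entries stay zero."""
    v = np.zeros(len(lexicon))
    for word, p in posteriors.items():
        v[index[word]] = p
    return v

left = context_vector({"speech": 0.7})
right = context_vector({"retrieval": 0.6, "lattice": 0.2})
whole = context_vector({"speech": 0.7, "retrieval": 0.6})
feature = np.concatenate([left, right, whole])  # 3 * lexicon size
print(feature.shape)  # (15,)
```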

Now we can use this representation in the experiments, and here are the results, again for the three sets of acoustic models. You may notice that the context information really helps.

So what else can we do? Certainly, concept matching: in other words, we wish to match the concept rather than the literal terms. We wish the system to return utterances or documents semantically related to the query, but not necessarily including the query terms. For example, if the query is "White House of the United States", and an utterance includes "President Obama" but neither "White House" nor "United States", we wish it to be returned as well.

Many approaches have been proposed in this direction. For example, we can cluster the documents into sets, so we know which sets of documents are talking about the same concept; we can use web data to expand the query or the documents; and we can also use latent topic models. Let me show just one example of the latent topic approach. This one is very straightforward: we just use the very popular, widely used probabilistic latent semantic analysis, or PLSA, in which we simply assume a set of latent topics between a set of terms and a set of documents, with the relationships modeled by probability distributions trained with the EM algorithm.
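
To make PLSA concrete, here is a minimal EM sketch over a toy document-term count matrix — a generic textbook formulation of PLSA, not the exact implementation used in the work described:

```python
import numpy as np

def plsa(counts, n_topics=2, iters=50, seed=0):
    """counts: (n_docs, n_words) term-count matrix.
    Returns P(w|z) and P(z|d) trained with EM."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    p_w_z = rng.random((n_topics, n_words)); p_w_z /= p_w_z.sum(1, keepdims=True)
    p_z_d = rng.random((n_docs, n_topics)); p_z_d /= p_z_d.sum(1, keepdims=True)
    for _ in range(iters):
        # E-step: P(z|d,w), shape (n_docs, n_topics, n_words).
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]
        joint /= joint.sum(axis=1, keepdims=True) + 1e-12
        # M-step: reweight by the observed counts n(d,w).
        weighted = counts[:, None, :] * joint
        p_w_z = weighted.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_d = weighted.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12
    return p_w_z, p_z_d

counts = np.array([[4, 2, 0, 0],
                   [3, 3, 0, 1],
                   [0, 0, 5, 2],
                   [0, 1, 4, 3]], dtype=float)
p_w_z, p_z_d = plsa(counts)
print(np.round(p_z_d, 2))  # the documents separate into two latent topics
```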

Of course, there are many other approaches, and they are complementary.

Here is an example of work we did. We took a spoken archive and transformed it into lattices; then, for any given query, we simply use the PLSA model just mentioned, based on the latent topics, to estimate the distance between the query and the lattices, and that gives the results.

Here are some preliminary results, in terms of recall-precision curves. The three lower curves are the baseline of literal matching — simply matching words: the lowest one is based on the one-best results, and the two upper ones are based on lattices. The three curves up here are concept matching using the PLSA I just mentioned. As you can see, concept matching certainly helps a lot.

So what else can we do? Well, is user-content interaction important? We know that user-content interaction is important even for text content: when we retrieve text content, very often we also need a few iterations to get the desired information. For spoken content, this is much more difficult, because spoken content is not easily summarized on screen — it is just signals — so it is difficult for the user to browse, to scan, and to select. When the system gives a whole bunch of retrieval results, we cannot listen to every one of them and then decide which one we like. So that is a problem.

What we propose is this: first, we can try to automatically extract key terms and construct titles and summaries to help the user browse; then we try to do some semantic structuring to build a better user interface; and then we can try to have some dialogue to help the interaction between the user and the system. Let me very briefly go through some of these.

For example, key term extraction, which is very helpful in labeling the retrieval results for the user to browse. Key terms include at least two types: keywords and key phrases. A key phrase is several words together, so for key phrases we need to detect the boundary. There are many approaches to do this; here is one example. Suppose "hidden Markov model" is a key phrase. We find that "hidden" is always followed by the same word, "Markov"; "Markov" is always followed by the same word, "model"; however, "model" is followed by many different words, and that means this is the boundary of the phrase. In this way, the boundary can be detected by context statistics.
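
A minimal sketch of that boundary detection via successor variety — counting how many distinct words follow a candidate prefix in a corpus. This is a simple instance of the context-statistics idea; the toy corpus and any thresholding are assumptions.

```python
from collections import defaultdict

corpus = [
    "the hidden markov model is trained",
    "a hidden markov model can be used",
    "the hidden markov model gives scores",
]

def successors(corpus, prefix):
    """Distinct words observed immediately after `prefix`."""
    follow = defaultdict(int)
    plen = len(prefix)
    for sent in corpus:
        words = sent.split()
        for i in range(len(words) - plen):
            if words[i:i + plen] == prefix:
                follow[words[i + plen]] += 1
    return follow

# "hidden markov" is always followed by "model" (1 successor), while
# "hidden markov model" is followed by many different words -> boundary.
print(len(successors(corpus, ["hidden", "markov"])))           # 1
print(len(successors(corpus, ["hidden", "markov", "model"])))  # 3
```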

Now, with the key term candidates — either words or phrases — we can extract many features to identify which of them are really key terms. For example, prosodic features, because very often key terms are produced with longer duration, wider pitch range, and higher energy. We can also use semantic features, for example from PLSA, because key terms are usually focused on a smaller number of topics. For example, here is a distribution of topic probabilities obtained from PLSA given a term: this one looks like a key term, because it focuses on only a small number of topics (the horizontal axis is the topics), while this one doesn't look like a key term, because it is more uniformly distributed across many different topics. Of course, lexical features are also very important; those include term frequency, inverse document frequency, part-of-speech tags, and so on.

Here is the result of extracting key terms using the different sets of features — prosodic, lexical, and semantic. You can notice that each single set of features is useful; however, when we integrate them together, we get the highest result.

Now, for summarization: a lot of people in this room are doing summarization, so I'll just go through this very quickly. Suppose this is a document that includes many utterances, and we recognize them into words; every circle is a word, recognized either correctly or incorrectly. What we do is try to select a small number of utterances which are the most representative, while avoiding redundancy, and use them to form a summary. This is the so-called extractive summarization. We can even replace these utterances with the original voice, so there are no recognition errors in the result.

I'll just show one example here. Because we are selecting the most representative utterances, it is reasonable to consider that utterances whose topics are similar to the representative utterances should also be considered representative. So we can do a similar graph-based analysis: every utterance is represented as a node on the graph, and we let the scores for representativeness propagate on the graph, just as in the score-propagation sketch shown earlier. In this way we get better scores and select better utterances. These are some results; I'll skip them.

Title generation: titles are very often useful. If we construct titles for the retrieved documents or segments, they help the browsing and selection of utterances. But a title has to be very short yet readable, and it has to tell you what the document is. Here is one approach: we perform a Viterbi search over the summary, based on scores obtained from trained models, to select the terms, to order the terms, and to decide the length of the title. In this way we can get some reasonably good titles.

Semantic structuring: there can be different ways to do semantic structuring, and we don't know what the best approach is; here I'll just show one example. We can cluster the retrieved results into some kind of tree structure based on the semantic information they carry. In this way, every cluster can be labeled by a set of key terms, with those key terms indicating what the cluster is talking about, and every cluster can be further expanded into the next layer, and so on.

Here is another example of semantic structuring: every retrieved spoken document or segment can be labeled by a set of key terms, and then the relationships between the key terms can be constructed and represented as a graph, so we know what kind of information is out there.

Okay, now finally the dialogue. If we have all of this — the semantic structuring, the key terms, the summaries — in the system, and the user is here providing queries, what can we do to offer a better interaction? A dialogue may be possible, and many people here in this room are very experienced in spoken dialogue, so we wish to learn something from them.

For example, we may model this process as a Markov decision process, or MDP. In this way, we need to define some goals. The goal may be a higher task success rate, where "success" indicates that the user's information need is satisfied; we can also define the goal to be a smaller number of dialogue turns, or a smaller number of query terms entered. In this way we can define a reward function, or something similar, and then maximize the reward function with simulated users.
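
As a toy illustration of such a reward, here is a minimal sketch combining the goals just listed — task success, few dialogue turns, few query terms — with hypothetical weights; the actual formulation in such systems would be tuned, and is an assumption here.

```python
def dialogue_reward(success, n_turns, n_query_terms,
                    w_success=20.0, w_turn=1.0, w_term=0.5):
    """Reward a successful session, penalize effort: each dialogue turn
    and each query term the user had to enter costs something."""
    return ((w_success if success else 0.0)
            - w_turn * n_turns
            - w_term * n_query_terms)

# A session satisfied in 3 turns with 4 query terms entered in total.
print(dialogue_reward(True, 3, 4))   # 20 - 3 - 2 = 15.0
```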

Here is one example application scenario, for retrieving broadcast news. At every step, when the user enters a query, the system returns not only the retrieved results but also a list of key terms for the user to select from. If the user is not satisfied with the results, he looks through the key term list from the top and selects the first relevant one he sees. These key term lists can be ranked by the trained MDP policy. There are some results here, which I'll skip.

So, above I have mentioned something about key terms, summaries, titles, semantic structuring, and dialogues — that is something about user-content interaction. Of course, a lot more work is needed before we can do something really useful.

Okay, now let me take a few minutes to show you a demo. This is about course lectures; let me first go through a couple of slides. As we know, there are many course lectures available over the Internet. However, it takes a very long time for a user to listen to a complete lecture — for example, forty-five minutes — and therefore it is not easy for engineers or other learners to acquire new knowledge from those course lectures, although we wish they could.

We also understand there are lecture browsers available over the Internet. However, we have to bear in mind that the knowledge in course lectures is usually well structured: one concept follows another. So a retrieved short segment is possibly not easy to understand without enough background knowledge; also, given the retrieved segment, there is no information for the learner regarding what should come next.

So the proposed approach is to structure the course lectures by slides and key terms. We divide the course lectures by slides, derive the core content of each slide as key terms, and then construct a key term graph to represent the semantic relationships among the slides. Also, every slide is given its length and timing information within the course, a summary, its key terms, and the related key terms and related slides based on the key term graph. The retrieved spoken segments then include all this information of the slides, if you want it.

This is a system for a course on digital speech processing offered by myself at National Taiwan University; it is therefore given in Mandarin Chinese. However, all the terminologies are produced directly in English, so this is code-mixed data.

Okay, so now let me go to the demo. This is the course, and the system has been given the name "NTU Virtual Instructor". It was recorded in 2006, forty-five hours in total.

Now, suppose I heard something in a lecture about the "backward algorithm", but I don't know what that is, so I try to retrieve it. However, because I don't know the term, I can only guess at the sound of it, so I enter a phonetically similar but wrong spelling and then do the search.

Here I am searching through the Internet, on a server at National Taiwan University, so I rely on the Internet here. And you can see that here I am retrieving the voice rather than the words: the query words I typed are totally wrong, but we still retrieved a total of fifty-six results in the course.

For example, the first result here is an utterance about a second long. It is in slide number twelve of chapter four, whose title tells me this is about basic problem three for HMMs. The slide is labeled by these key terms, and when I look at them I know this is about the backward algorithm — rather than my wrong guess — and also the Baum-Welch algorithm, the forward algorithm, and so on. Note that because the utterances are represented in terms of lattices of subword units, the subword unit sequence of this "backward algorithm" is very similar to that of my query, and that is why it can be retrieved. And there are many more results like that.

Now, if I think I'd like to listen to this, I can click here to go to that slide — slide number twelve of chapter four. The title here was written by myself, so it is a human-generated title, not an automatically generated one, because every slide already has a title, and this title is "Basic Problem 3 for HMM". Here it says this slide has a total length of twenty-two minutes and fifty-seven seconds, so if I'd like to listen to it, I need to spend twenty-two minutes. In addition, this shows the position of this slide in chapter four, out of the twenty-two slides in total.

Very important here are the key terms. Only the terms on the top, in yellow, are the key terms used in this slide; those below are related key terms provided by the key term graph. In other words, the complete key term graph is not easy to show here, so instead I just list the highly related key terms below every key term.

So, for example, when I go through these, I see the key term "backward algorithm" here, and it is actually related to the "forward algorithm" and so on.

Now, if I don't understand this key term and would like to know a little more before listening to this slide, I can click here, and the system tells me that this key term, "backward algorithm", actually first appeared in an earlier slide, and also appears in later slides. So probably there is no explanation of it here, and if I really don't know about the backward algorithm, I should go back there. So I click this one, and I go to that slide.

Okay, here is that other slide — the first time the backward algorithm was mentioned — and the system also shows the key terms in that slide. For example, in that slide we again have the "backward algorithm" and the "forward algorithm"; the "forward algorithm" is in turn related to other key terms, which are related to still others, and so on.

Now let me show a second example. Suppose I'd like to enter another query, "frequency", and do the search. In this course there are a total of sixty-four results for this query. Here, the first utterance, six seconds long, appears in this slide, labeled by these key terms, and the second one is on pre-emphasis, and so on. If I'm interested in this one, I can press here to go to the slide. This is the slide on pre-emphasis, and I notice there is a summary of fifteen seconds, so I'd like to listen to this summary.

[The fifteen-second spoken summary plays, in Mandarin Chinese.]

Okay, that was the fifteen-second summary. It is in Mandarin Chinese, I'm sorry, but the English subtitles were actually produced manually, in order to show you what was said in that summary. Okay, so this is the end of the demo.

So let me come back to the PowerPoint. In conclusion, I usually divide spoken language processing over the Internet into three parts: the user interface; content analysis, including such things as key term extraction and summarization; and user-content interaction.

We notice that user interface work has been very successful, though it is usually not easy, because users usually expect the technology to perform like a human being. Content analysis and user-content interaction are also not easy; however, because the technology can handle massive quantities of content, which a human being cannot, the technology does have some advantages, I think.

Now, spoken content retrieval is the area that integrates the user interface with content analysis and user-content interaction, and therefore it may offer some interesting applications in the future.

Eventually, I'd like to say that I think this area is only in its infancy, and there is plenty of space to be developed and plenty of opportunities to be investigated in the future. I notice that many groups have been doing some work in this area — actually, many people in this room have done some work in this area — so I wish we can have more discussions and more work in the future, and hopefully we can have something much better than what we have today, just as in speech recognition we now have much better results than several years ago. So we wish we can do something more. Okay, this concludes my presentation. Thank you very much for your attention.

Well, thank you very much for a very interesting talk. One question I have for you: a lot of people in the audience are working on a related — somewhat related — problem, which is voice search. One of the issues that comes up in voice search is that sometimes you say something, it gets mis-recognized by the speech recognition, and then you repeat the query to get it right; the other sets of choices that come up may be sort of similar and overlapping. It would seem to me that this has some relation to relevance feedback, in the sense that the user is giving an additional set of information about the previous query that was dictated. I'm just wondering whether you, or the people you work with, have looked into this sort of problem — whether you have any opinions on whether you could get improvements by somehow taking the union of multiple queries in a voice search sort of task, to jointly improve results using methods similar to what you talked about in your talk.

Thank you very much; I think that is certainly a very good idea. Actually, as I mentioned in the beginning, in this talk we are not talking about voice search; but as in your experience, repeated queries may provide some good information, or correlation, about the intent of the user, so they are helpful. For example, in the dialogue work I mentioned, we actually allow the user to enter a second query and so on, and that captures the interaction, or the correlation, between the first and second queries. I think that is the only thing we have done up to this moment, but probably what you say implies much more we can do. As I mentioned, there is just too much work to be done, so we can think about how to implement what you suggest in the future.

Thank you very much for a very interesting talk. I have a detailed technical question on your SVM slide, if you can go to it — yes, the one where you take the positive examples. My question is: when you train your SVM, it seems you are only taking high-confidence examples as positive and negative examples, so you are pulling the training examples from far away from the margin. In the testing phase, if you have some difficult examples that lie close to the hyperplane, then you might probably perform poorly, right?

Yes, certainly, you are right. But that is all we can do at this moment, because we know nothing about these results: we just have the first-pass results, and all we can do is assume the top and the bottom and then construct the SVM. Of course, in the middle, close to the boundary, it is a problem. However, the SVM already provides some solution for that — the large-margin concept — so it tries to provide some margin, and there is also some allowance, the slack, right? So we just tried to follow those ideas from the SVM and see if we can do something there. — Okay, thank you.

I have a question here. It was a great talk, and there are a lot of parallels between the methods — this is really, as Michael said, a parallel between text-based web search and this problem, even beyond voice search. Some methods developed by this community would probably benefit the web search community, and vice versa. I wanted to ask you to comment on that, on your awareness of the literature there and the opportunities for cross-fertilization. In web search, query rewriting is well established; also, on the click feedback side, web search has gained a lot of benefit from distinguishing between mere clicks from users and good clicks, because clicks tend to be noisy, so there has been a lot of work in the web search community on modeling clicks — determining good clicks and weighting those more heavily for feedback. There are also basic things like editorial relevance feedback, where large groups of users determine relevance, and that is used as well. So I just wanted to ask you what your thoughts are on the opportunities for cross-fertilization between these two areas.

Yes, sure, there are a lot of possibilities to learn from the experience in voice search and web search for this kind of work; we just don't have enough time to explore all the possibilities. As you mentioned, the clicks may be divided into several categories and modeled, or something like that, and I think that could be done in the future. On the other hand, we have also learned a lot from other areas such as text retrieval — for example, relevance feedback, pseudo-relevance feedback, ranking and learning to rank; some of our ideas were borrowed from that community. So certainly, cross-area interaction is very helpful, because this area is actually interdisciplinary. At the same time, we really try to do something more from the speech side — for example, the acoustic models, the acoustic features, and so on, and the spoken dialogue — and we try to follow all those good ideas and good experiences in speech and see whether they can be used here. As I mentioned, we have plenty of space to be explored in the future. — Thank you very much.