0:00:13 We introduce Professor Lin-Shan Lee.
0:00:16 He has been with National Taiwan University since 1982.
0:00:22 His early work focused on the
0:00:24 broader area of spoken language systems, particularly focused on the Chinese language,
0:00:29 with a number of breakthroughs early on in that language.
0:00:34 His more recent work
0:00:35 has been focused on the fundamentals of speech recognition
0:00:40 and network-environment issues,
0:00:43 like information retrieval and semantic analysis
0:00:46 of spoken content.
0:00:49 He is an IEEE Fellow
0:00:51 and an ISCA Fellow.
0:00:53 He has served on numerous boards
0:00:55 and received a number of awards, including recently
0:01:00 the Meritorious Service Award from the IEEE
0:01:02 Signal Processing Society.
0:01:05 So please, let's welcome Professor Lin-Shan Lee.
0:01:24 So, you can hear me, right?
0:01:27 Good.
0:01:28 Thank you, Larry.
0:01:30 It's my great pleasure today to be here presenting to you
0:01:35 spoken content retrieval:
0:01:37 lattices and beyond. My name is Lin-Shan Lee, from National Taiwan University.
0:01:43 In this talk I'll first introduce
0:01:46 the basics,
0:01:47 the problem and some fundamentals,
0:01:50 and then I'll spend more time
0:01:52 on some recent research examples,
0:01:55 all before the conclusion.
0:01:59 So, first the introduction.
0:02:02 We are all very familiar with
0:02:04 text content retrieval, which is successful because
0:02:09 for any user query
0:02:11 or user instruction,
0:02:13 the user can easily repeat or refine it, so that
0:02:16 the desired information can be obtained
0:02:20 in real time.
0:02:24 All users like it,
0:02:25 and it has even produced various
0:02:27 successful industries.
0:02:30 Now, today we all know that all roles of text
0:02:33 can be accomplished by voice.
0:02:36 On the content side,
0:02:38 the spoken
0:02:39 content: we do have plenty of spoken content over the network, and yet
0:02:44 it is hard to retrieve.
0:02:46 On the query side, the voice query can and should be entered on
0:02:51 handheld devices.
0:02:52 So it's time for us to consider
0:02:55 spoken content retrieval.
0:02:58 Now, this is what we have today:
0:03:02 when we enter
0:03:04 a text query,
0:03:05 we get text content.
0:03:07 Now, both the queries and the content
0:03:10 can be
0:03:11 in spoken form.
0:03:15 We may use text queries
0:03:18 to retrieve spoken content,
0:03:20 or multimedia content
0:03:22 including audio.
0:03:24 For this case,
0:03:26 very often
0:03:27 the retrieval of
0:03:29 spoken content
0:03:31 is called spoken document retrieval.
0:03:33 Very often
0:03:39 a more specific subset
0:03:40 of that problem
0:03:41 is referred to as
0:03:42 spoken term detection,
0:03:44 in other words, to detect whether the
0:03:47 query term exists
0:03:48 in the spoken content.
0:03:50 Of course, we can also retrieve
0:03:53 text content using
0:03:55 voice queries.
0:03:57 That case is
0:03:59 usually referred to
0:04:02 as voice search.
0:04:05 However, in this case, because the document to be retrieved is in text form, it would
0:04:11 be out of the scope of this talk,
0:04:13 so I'm not going to spend more time talking about voice search.
0:04:19 Of course, we can also do the other case, that is, retrieve
0:04:23 spoken
0:04:25 content using spoken queries,
0:04:27 and sometimes this is referred to as
0:04:30 query by example.
0:04:32 So in this talk I'll focus on retrieval of spoken content, primarily using text queries,
0:04:40 but sometimes we can also consider the case of spoken queries.
0:04:46 Now, as we all understand, if the spoken content and queries could be accurately recognized,
0:04:52 then this problem would be reduced to the well-known text content retrieval problem;
0:04:56 there would be no problem at all.
0:04:58 Of course,
0:04:59 that never happens,
0:05:00 because we know that speech recognition
0:05:03 gives errors in most cases,
0:05:04 and that's the major problem.
0:05:08 Now, today we understand that
0:05:11 there are many handheld devices
0:05:13 with multimedia functionalities commercially available,
0:05:17 and also there are unlimited quantities of multimedia content spread over the Internet.
0:05:24 So we should be able to retrieve not only the text content but also the multimedia and spoken content.
0:05:32 In other words, the wireless and multimedia technologies today are creating an environment for spoken content retrieval.
0:05:42 So let me repeat again:
0:05:45 network access is primarily
0:05:48 text-based today,
0:05:49 but almost all roles of text can be accomplished by voice.
0:05:54 So next let me mention very briefly some fundamentals.
0:06:00 First, we all understand that recognition always gives errors,
0:06:03 for various reasons: for example,
0:06:06 spontaneous speech, OOV words, or mismatched models, and so on,
0:06:12 and that makes the problem difficult.
0:06:15 So a good approach may be to consider lattices with multiple alternatives
0:06:21 rather than the one-best output only.
0:06:24 In this case,
0:06:26 of course, we have a higher probability of including the correct words,
0:06:31 but we also have to include more noisy words, and that causes problems.
0:06:36 On the other hand, even with a lattice like this, we still have the problem that some
0:06:41 correct words may not be
0:06:43 included, because they are OOV words and so on.
0:06:47 Also, when we use lattices, that implies huge memory and computation requirements.
0:06:53 That's another major problem.
0:06:57 Of course, other approaches exist to solve similar problems.
0:07:02 For example, people use confusion matrices to model recognition errors
0:07:07 and try to expand the query and document using confusion matrices.
0:07:13 People also use
0:07:14 pronunciation modeling to try to expand the query in that way.
0:07:19 People also use, say, fuzzy matching; in other words, the matching between the query and the content does not have
0:07:26 to be exact.
0:07:29 These are all very good approaches; however, I won't have time to
0:07:33 say more about them,
0:07:35 since our focus is on lattices in this talk.
0:07:39 Now, the first question is how we can index the lattices.
0:07:44 Well, let's say a lattice looks like this.
0:07:47 Usually the most
0:07:49 popular approach to index a lattice is to transform the lattice into a sausage-like structure like this,
0:07:57 in other words, a series of segments,
0:08:00 where every segment includes a number of word hypotheses, each with a
0:08:06 posterior probability.
0:08:08 In this way the position information for the words is readily available; in other words, word one is
0:08:15 in the first segment and word eight is in the second segment, so word one
0:08:21 can be followed by word eight, and that is a bigram, and so on.
0:08:25 In this way, this is more compatible with existing text indexing techniques.
0:08:31 Also, the required memory and computation can be reduced slightly.
0:08:38 In addition, we may notice that in this way we can add more possible paths;
0:08:42 for example, word three cannot be followed by word eight in the original lattice,
0:08:47 but here this becomes possible.
0:08:50 Also, the noisy words can be discriminated by their posterior probabilities,
0:08:55 because we do have those probabilities.
0:08:58 In either case, we notice that we can match the query
0:09:03 against this lattice.
0:09:05 For example, if we have the bigram word three followed by word five in the query, and
0:09:11 this bigram exists in the lattice, that helps.
0:09:14 So we can
0:09:15 count all the possible matches,
0:09:17 and when matching we accumulate the scores and so on.
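The matching just described can be sketched in code. This is a hypothetical toy illustration; the sausage segments, word names, and posterior values are all invented, not taken from the talk:

```python
# Toy sketch of matching an n-gram query against a sausage-like index.
# Each segment maps word hypotheses to posterior probabilities.

def ngram_score(segments, query_terms):
    """Accumulate posterior scores over all positions where the query
    n-gram can be matched against consecutive segments."""
    n = len(query_terms)
    total = 0.0
    for start in range(len(segments) - n + 1):
        score = 1.0
        for i, term in enumerate(query_terms):
            p = segments[start + i].get(term, 0.0)
            if p == 0.0:
                score = 0.0
                break
            score *= p  # product of posteriors along the n-gram
        total += score
    return total

# An invented three-segment sausage with word posteriors.
sausage = [
    {"w1": 0.6, "w2": 0.4},
    {"w8": 0.7, "w3": 0.3},
    {"w5": 0.5, "w9": 0.5},
]
print(ngram_score(sausage, ["w1", "w8"]))  # 0.6 * 0.7, i.e. about 0.42
```

A real index would of course also record which utterance each segment came from, so that matched scores can be accumulated per utterance.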
0:09:22 Now, there are many approaches proposed for this kind of indexing of the lattices.
0:09:27 I'll just list a few examples here,
0:09:30 and I think that today the most popular ones may be the top two here: the position-specific posterior lattices,
0:09:39 or PSPL,
0:09:40 and confusion networks, or CN.
0:09:43 Also, another very popular one would be the weighted finite-state transducer.
0:09:49 Now let me take
0:09:51 one minute to explain the first two,
0:09:54 the position-specific posterior lattices, the PSPL, and the confusion networks.
0:10:01 Suppose this is a lattice,
0:10:03 and these are all the possible paths here,
0:10:07 with their word sequences.
0:10:11 The PSPL, or position-specific
0:10:15 posterior lattice, tries to locate every word
0:10:18 in a segment based on the position of that word in a path.
0:10:23 For example, word ten here appears only as the fourth word in a path,
0:10:29 so it appears in the fourth segment.
0:10:32 On the other hand,
0:10:33 the confusion networks
0:10:37 try to cluster words together based on, for example, time spans and word pronunciations.
0:10:45 So, for example,
0:10:46 the words word five and word ten may have very similar time spans and pronunciations, so they may be clustered together,
0:10:54 and they may appear in the
0:10:57 second cluster here.
0:10:59 So in this case, you may note that different approaches give different indices.
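As a rough sketch of how a PSPL bins words by their position within each path, here is a toy example; the paths and posterior values are invented for illustration:

```python
from collections import defaultdict

# Toy PSPL construction: each path is a (word sequence, path posterior)
# pair, and a word's bin is simply its position within the path.

def build_pspl(paths):
    pspl = defaultdict(lambda: defaultdict(float))
    for words, posterior in paths:
        for position, word in enumerate(words):
            pspl[position][word] += posterior  # sum posteriors over paths
    return {pos: dict(words) for pos, words in pspl.items()}

paths = [
    (["w1", "w8", "w5", "w10"], 0.5),
    (["w2", "w8", "w5", "w10"], 0.3),
    (["w1", "w3", "w9"], 0.2),
]
pspl = build_pspl(paths)
# "w10" appears only as the 4th word of any path, so only in bin 3,
# with posterior roughly 0.8.
print(pspl[3])
```

In a real PSPL the path posteriors come from the lattice's forward-backward scores rather than being given directly.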
0:11:05 Now, a major problem here is OOV words.
0:11:08 As you understand, an
0:11:10 OOV word cannot be recognized and therefore never appears in the lattice.
0:11:15 That's important because very often
0:11:18 the important query words,
0:11:19 the words the query includes, are OOV words.
0:11:23 However, if we look carefully,
0:11:25 there are many approaches to handle this problem; I think the most fundamental approach is to use subword units.
0:11:33 Let me take this example.
0:11:36 Suppose an OOV keyword W is composed of
0:11:40 these four subword units; every small w_i is a subword unit, for example a phoneme or a syllable or something.
0:11:48 These are also subword units, and these are nodes,
0:11:51 and these are arcs.
0:11:53 And the word here, because
0:11:55 this W is not
0:11:58 in the vocabulary, is not recognized here.
0:12:01 However, if we look carefully, we notice that
0:12:05 the word is actually here: it is hinted at the subword level,
0:12:10 so it can actually be matched
0:12:12 at the subword level without being recognized in the lattice.
0:12:16 And that's the major approach: different ways can be developed to handle this using subword units.
0:12:24 One example is to construct
0:12:27 the same
0:12:29 PSPL or CN based on subword units,
0:12:33 for example.
0:12:35 Now, many different subword units have been used in this approach,
0:12:40 and usually we can categorize them into two classes.
0:12:44 The first one is linguistically motivated units,
0:12:47 for example phonemes,
0:12:49 syllables, characters, morphemes, and so on.
0:12:52 The other one is data-driven units; in other words, they are derived
0:12:57 using some data-driven algorithms,
0:12:59 and different algorithms may
0:13:01 produce different units,
0:13:03 for example some are called particles, some are called word fragments, some phone multigrams, and so on.
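A minimal sketch of subword-level matching for an OOV query, assuming a phoneme-level index is available; the phoneme strings here are invented purely for illustration:

```python
# Toy subword matching: the word lattice cannot contain an OOV word,
# but its phoneme sequence may still appear in a phoneme-level index.

def subword_match(phoneme_index, query_phonemes):
    """True if the query's phoneme sequence occurs as a contiguous
    subsequence of the indexed phoneme sequence."""
    n = len(query_phonemes)
    return any(phoneme_index[i:i + n] == query_phonemes
               for i in range(len(phoneme_index) - n + 1))

# Invented one-best phoneme sequence of an utterance.
indexed = ["b", "ah", "r", "aa", "k", "ow", "b", "aa", "m", "ah"]
# OOV query word, looked up as phonemes rather than as a word.
query = ["ow", "b", "aa", "m", "ah"]
print(subword_match(indexed, query))  # True
```

A real system would match against subword lattices or subword PSPL/CN structures with posteriors, not just a one-best string, but the principle is the same.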
0:13:11 Of course, there are some other, different approaches.
0:13:14 If we do have the query in voice form,
0:13:17 then we can also match the query
0:13:21 in speech form
0:13:23 against the speech content directly,
0:13:25 without doing recognition.
0:13:27 In that case we can avoid the recognition error problem, and we can even do it in an
0:13:34 unsupervised way.
0:13:36 In that case we don't even need a lattice,
0:13:38 and this can be performed with, say,
0:13:41 frame-based matching, for example DTW,
0:13:45 or segment-based approaches,
0:13:48 just imagining the signal as segments,
0:13:50 or model-based approaches, and so on.
0:13:53 But all of these kinds of approaches
0:13:55 do not use recognition and therefore do not
0:13:58 have lattices,
0:13:59 so I won't spend more time talking about these approaches; I'll just focus on those with lattices.
0:14:07 Okay, so after all these fundamentals, let me describe some recent research examples.
0:14:13 I have to apologize that I can only cover a small number of examples;
0:14:18 there are many examples I just cannot cover.
0:14:22 Below, I'll assume the retrieval works something like this.
0:14:27 This is the spoken archive.
0:14:29 After recognition based on some acoustic models, we have lattices here.
0:14:33 Now the retrieval is applied on top of these lattices.
0:14:38 Here, by the search engine I mean indexing the lattices and searching over the index,
0:14:44 and by the retrieval model I mean anything in addition, for example the confusion matrices I mentioned, the weighting
0:14:53 schemes, and whatever.
0:14:55 All the following discussion is based on this
0:14:58 diagram.
0:15:01 The first thing we can think about doing is integration and weighting.
0:15:06 For example, we can integrate different scores from recognition:
0:15:10 from different recognition systems,
0:15:12 from those
0:15:14 based on different subword units,
0:15:16 including some other
0:15:18 information, and so on.
0:15:20 In addition, a good idea may be to try to train those model parameters,
0:15:25 if we have some training data available.
0:15:29 What kind of training data is needed here? Well, the kind of training data we need here is a set
0:15:34 of queries,
0:15:35 each with its associated relevant and irrelevant segments.
0:15:39 For example, when the user entered query Q1, we got a list here, and the first two
0:15:45 are false, or irrelevant, and the next two are relevant, and so on.
0:15:50 We need a set of this kind of data.
0:15:53 Such data does not necessarily have to be annotated by a person, because we can collect it
0:16:00 from real click-through data.
0:16:03 For example, if the user entered a query Q1 and got this list,
0:16:08 and then skipped the first two items and directly
0:16:12 clicked the next two,
0:16:14 we may assume that
0:16:16 the first two are irrelevant, or false,
0:16:19 and the next two are relevant.
0:16:21 In this way we can have click-through data.
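The click-through heuristic just described might be coded as follows; this is an illustrative sketch, with the skip-above-click rule as an assumption rather than the talk's exact procedure:

```python
# Toy harvesting of relevance labels from click-through data: clicked
# items are labeled relevant, items skipped above a click irrelevant.

def labels_from_clicks(result_ids, clicked_ids):
    labels = {}
    last_click = -1
    for rank, rid in enumerate(result_ids):
        if rid in clicked_ids:
            labels[rid] = 1          # clicked -> assumed relevant
            last_click = rank
    for rank, rid in enumerate(result_ids):
        if rank < last_click and rid not in clicked_ids:
            labels[rid] = 0          # skipped above a click -> irrelevant
    return labels

results = ["u1", "u2", "u3", "u4"]
# The user skipped u1 and u2 and clicked u3 and u4.
print(labels_from_clicks(results, {"u3", "u4"}))
```

Items below the last click get no label here, since the user may simply not have seen them.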
0:16:24 When we have this data, then we can do something more; for example, we can use this training data to
0:16:30 train the
0:16:31 parameters here.
0:16:34 For example, we train different weighting parameters to weight
0:16:38 the different recognition outputs and the different
0:16:41 subword units, with different information including word confidence or phone confusion matrices and so on.
0:16:50 Here let me show you very briefly two examples.
0:16:55 The first one
0:16:56 is this:
0:16:57 in this example we actually use the two different
0:17:02 indexing approaches we just mentioned, confusion networks and position-specific posterior lattices. In each case we use not only
0:17:12 the word-based
0:17:14 indexing but also those based on subword units, and in each case we use unigrams,
0:17:20 bigrams, and trigrams.
0:17:23 So we have a total of eighteen different scores,
0:17:26 and we try to add them together with some weighting
0:17:29 to optimize some parameter describing the
0:17:33 retrieval performance,
0:17:35 which is called MAP.
0:17:38 Here, the MAP
0:17:40 I mention in this talk is mean average precision,
0:17:44 which is the area under this
0:17:48 recall-precision curve,
0:17:50 and which is a performance measure frequently used for information retrieval. Of course, there are many other measures, but
0:17:57 I just have time to use this one here.
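The MAP measure mentioned here can be computed as in the following minimal sketch: average precision is accumulated at the rank of each relevant item, then averaged over queries.

```python
# Mean average precision (MAP) over ranked retrieval results.

def average_precision(ranked_relevance):
    """ranked_relevance: list of 0/1 relevance flags in ranked order."""
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    return precision_sum / hits if hits else 0.0

def mean_average_precision(per_query_relevance):
    aps = [average_precision(r) for r in per_query_relevance]
    return sum(aps) / len(aps)

# Two toy queries: [1,0,1,0] -> (1/1 + 2/3)/2; [0,1] -> 1/2.
print(mean_average_precision([[1, 0, 1, 0], [0, 1]]))  # about 0.667
```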
0:18:00 Now we can try to optimize
0:18:02 this measure using an
0:18:05 extended version of a
0:18:07 support vector machine.
0:18:11 Here are the results:
0:18:13 here are the MAP results for the different scores used individually, and this is the result
0:18:19 when we integrate them together.
0:18:21 You see we get a net gain of about seven to eight percent in MAP, which is not bad.
0:18:29 Here is another example,
0:18:31 where we think it is possible to have context-dependent term weighting.
0:18:37 In other words, the same term may have different weights depending on the context.
0:18:42 For example, if the query is "information theory", the word "information" is very important,
0:18:49 but if the query is "speech information retrieval",
0:18:52 then the word "information" is not so important, because the important terms are "speech" and "retrieval".
0:19:00 In this way, the same term may have different weights in different contexts,
0:19:04 and these weights can be trained.
0:19:07 And here are the results:
0:19:09 using context-dependent weights, we actually get some gain in MAP.
0:19:15 Okay, that was weighting.
0:19:18 Now, what can we do next?
0:19:20 Well, the first thing we think about is: how about the acoustic models?
0:19:24 Can we do something there?
0:19:26 Just as so many experts in acoustic modeling do discriminative training of the acoustic models,
0:19:32 can we use this training data to re-estimate the acoustic models?
0:19:39 In the past, retrieval was always considered on top of the recognition output;
0:19:45 they are two cascaded, independent stages,
0:19:49 and so the retrieval performance really relies on the recognition accuracy.
0:19:54 So why don't we consider these two stages together as a whole?
0:19:59 Then the acoustic models can be re-estimated by optimizing the retrieval performance here.
0:20:05 In this way the acoustic models may be better matched to the respective dataset.
0:20:11 So we learn from MPE training and try to define the objective function in this way.
0:20:22here the this the results for a different set of course it unusual "'cause" models these supports speaker independent models
0:20:30and these four adapted by global mllr and this by adapted further by
0:20:37as mmi and
0:20:38here is M E P but that is the adaptation for
0:20:43but X men a posterior probability
0:20:47and these numbers are mlp not yeah recognition accuracy
0:20:52as you notice that
0:20:53we do have some improvements
0:20:55but relatively limited
0:20:57probably because we were not able to define a good enough
0:21:02objective function
0:21:04 Another possible reason may be
0:21:07 that different queries really have quite different characteristics,
0:21:12 so when we put many queries together,
0:21:16 these different queries really interfere with each other in the training data.
0:21:20 So we are thinking: why not use query-specific acoustic models?
0:21:26 In other words, we can re-estimate the acoustic models for every query.
0:21:31 But that means this has to be done
0:21:33 in real time,
0:21:34 online.
0:21:36 Is it possible?
0:21:37 We think yes,
0:21:39 because based on the first several utterances
0:21:42 that the user clicks through
0:21:44 while browsing the retrieval results,
0:21:47 all the other utterances not yet browsed can be reranked
0:21:51 by the new acoustic models.
0:21:53 That means
0:21:54 the models can actually be updated and the lattices rescored very quickly.
0:21:59 Why? Because we have only a very limited amount of training data, so the re-estimation can be very fast.
0:22:06 So this is the
0:22:08 scenario:
0:22:10 when the system gives the retrieval results here and the user clicks,
0:22:15 browses, and rates the first several utterances,
0:22:17 indicating whether
0:22:19 they are relevant or irrelevant,
0:22:21 these results are actually fed back to re-estimate the acoustic models.
0:22:27 We get new models,
0:22:28 and these are used to rescore the lattices,
0:22:30 and that
0:22:31 is used to rerank the rest of the utterances.
0:22:36 So what are the results?
0:22:38 Well, we can see that
0:22:39 just with one iteration of model re-estimation, which makes real-time
0:22:45 adaptation possible,
0:22:47 we do have some improvements.
0:22:52 Now, what else can we do?
0:22:55 Well, how about acoustic features?
0:22:58 Well, yes, we can do something with acoustic features.
0:23:03 For example, if an utterance is known to be relevant or irrelevant,
0:23:08 then all the utterances similar to this one
0:23:12 are more probable to be relevant or irrelevant as well.
0:23:17 So in this case
0:23:19 we have the same scenario: when the user sees the output
0:23:25 and clicks the first several utterances,
0:23:29 we can use those first several utterances as references.
0:23:35 The ones not yet browsed are compared with those clicked,
0:23:38 based on acoustic similarity, and then reranked.
0:23:44 In this way,
0:23:45 let's see whether it is better or not.
0:23:49 Then we first need to define the similarity in terms of acoustic features.
0:23:54 We first
0:23:55 define, for every utterance,
0:23:58 the hypothesized region: the segment of the
0:24:02 feature vector sequence corresponding, in the lattice of
0:24:06 this utterance, to the query Q
0:24:11 on the path with the highest score. For example, for this utterance we first see the feature vector sequence, and
0:24:18 this is the corresponding arc for the query, and this is the hypothesized region.
0:24:23 Now similarly, there's another utterance,
0:24:26 with its sequence here and its hypothesized region here.
0:24:29 Then the similarity can be derived based on the DTW distance between these two regions,
0:24:36 and in this way we can perform the scenario we just mentioned.
0:24:41 And here are the results, again for the three sets of acoustic models,
0:24:46 and we may notice that in this way, using acoustic similarities,
0:24:50 we get slightly better improvements
0:24:52 compared to directly re-estimating the acoustic models.
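The DTW distance underlying this similarity can be sketched as follows; real systems compare sequences of multidimensional feature vectors, but one-dimensional "frames" are used here purely for illustration:

```python
# Toy dynamic time warping distance between two hypothesized regions.

def dtw_distance(a, b):
    inf = float("inf")
    d = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    d[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = abs(a[i - 1] - b[j - 1])          # local frame distance
            d[i][j] = cost + min(d[i - 1][j],        # insertion
                                 d[i][j - 1],        # deletion
                                 d[i - 1][j - 1])    # match
    return d[len(a)][len(b)]

region_clicked = [1.0, 2.0, 3.0, 2.0]
region_candidate = [1.0, 2.0, 2.0, 3.0, 2.0]
# DTW absorbs the extra repeated frame, so the distance is 0.0.
print(dtw_distance(region_clicked, region_candidate))
```

A similarity can then be defined as, say, the negated or inverted DTW distance.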
0:24:58 Okay, so what else can we do?
0:25:00 Well, we may consider a different approach.
0:25:03 In the above, we always assumed we needed to rely on the users
0:25:08 to give us some
0:25:10 feedback information.
0:25:12 Do we really need to rely on the users?
0:25:14 No, because
0:25:15 we can always derive relevance information automatically:
0:25:19 we can assume the top N utterances in the first-pass retrieval results are relevant,
0:25:25 or actually pseudo-relevant,
0:25:28 and this is referred to as pseudo-relevance feedback.
0:25:31 And here is the scenario:
0:25:34 the user enters a query,
0:25:36 and the system produces the first-pass retrieval results,
0:25:40 but these results are not shown to the user.
0:25:42 Instead, we just assume the top N utterances
0:25:46 are relevant,
0:25:48 and all the rest are compared with these top N
0:25:51 to see whether they are similar or not,
0:25:57 and based on the similarity
0:25:59 we rerank the results,
0:26:00 and only the reranked results are shown to the user.
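A minimal sketch of this pseudo-relevance feedback reranking, with an abstract similarity function (for instance a negated DTW distance) and invented utterance IDs:

```python
# Toy pseudo-relevance feedback: the top-N first-pass results serve as
# pseudo-relevant references; the rest are reranked by their average
# similarity to those references.

def prf_rerank(ranked_ids, similarity, top_n=2):
    pseudo_relevant = ranked_ids[:top_n]
    rest = ranked_ids[top_n:]
    def score(uid):
        return sum(similarity(uid, ref) for ref in pseudo_relevant) / top_n
    reranked = sorted(rest, key=score, reverse=True)
    return pseudo_relevant + reranked

# Invented similarity table between utterance IDs.
sim = {("u3", "u1"): 0.1, ("u3", "u2"): 0.2,
       ("u4", "u1"): 0.9, ("u4", "u2"): 0.8}
print(prf_rerank(["u1", "u2", "u3", "u4"],
                 lambda a, b: sim[(a, b)]))  # u4 moves above u3
```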
0:26:07 And here are the results.
0:26:13 You can see that with this pseudo-relevance feedback,
0:26:17 for the different acoustic models,
0:26:19 we really have
0:26:22 slightly better improvements here.
0:26:25 Now, what else can we do?
0:26:27 Well, we can further improve
0:26:29 the above pseudo-relevance feedback approach;
0:26:32 for example, we can use a graph-based approach.
0:26:38 Above, when we
0:26:40 did the pseudo-relevance feedback, we took the top N utterances as the references;
0:26:47 we assumed they were relevant.
0:26:50 But of course that is not reliable.
0:26:53 So why don't we simply
0:26:55 consider the whole first-pass retrieval results globally, using a graph?
0:27:01 In other words,
0:27:02 we can construct a graph for all utterances in the first-pass
0:27:07 retrieval results,
0:27:09 where all the utterances are taken as nodes,
0:27:13 and
0:27:14 the edge weights are the acoustic similarities between the utterances.
0:27:20 Now, we may assume that an utterance strongly connected to,
0:27:25 or very similar to, utterances with high scores
0:27:27 should have a high score as well.
0:27:32 For example, if X2 and X3 here have high scores, then X1 should too.
0:27:38 Similarly, if X2 and X3 all
0:27:41 have low scores, then X1 should have a low score.
0:27:44 In this way, the scores can propagate on the graph
0:27:49 and spread among strongly connected nodes,
0:27:52 and all the scores can be corrected.
0:27:56 So we can then rerank all the utterances in the first-pass retrieval results
0:28:03 using these corrected scores.
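The score propagation just described can be sketched as a simple iterative smoothing on the similarity graph; the interpolation factor and uniform edge weights below are illustrative assumptions, not the talk's exact formulation:

```python
# Toy score propagation: each node's score is repeatedly interpolated
# with the similarity-weighted average of its neighbors' scores.

def propagate(scores, similarity, alpha=0.5, iterations=20):
    ids = list(scores)
    s = dict(scores)
    for _ in range(iterations):
        new = {}
        for i in ids:
            total_w = sum(similarity[i][j] for j in ids if j != i)
            neighbor = sum(similarity[i][j] * s[j]
                           for j in ids if j != i) / total_w
            # Keep part of the original score, mix in the neighborhood.
            new[i] = (1 - alpha) * scores[i] + alpha * neighbor
        s = new
    return s

first_pass = {"x1": 0.2, "x2": 0.9, "x3": 0.8}
sim = {"x1": {"x2": 1.0, "x3": 1.0},
       "x2": {"x1": 1.0, "x3": 1.0},
       "x3": {"x1": 1.0, "x2": 1.0}}
result = propagate(first_pass, sim)
# x1 is strongly connected to high-scoring x2 and x3, so its score rises.
print(result["x1"] > first_pass["x1"])  # True
```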
0:28:07 And here are the results,
0:28:09 again for the three sets of acoustic models,
0:28:12 and you may notice that now the graph-based approach
0:28:16 provides higher gains
0:28:18 than all the previous results.
0:28:21 This is reasonable, because the graph-based approach really considers globally all the first-pass
0:28:28 retrieval results, rather than relying only on the top N utterances as references.
0:28:35 Okay, what else can we do?
0:28:36 Well, of course,
0:28:39 machine learning has been used and shown useful in some work,
0:28:43 so let me show one example of the use of support vector machines
0:28:47 in the scenario
0:28:48 we just mentioned, pseudo-relevance feedback.
0:28:52 And here is the scenario again: the user enters query Q,
0:28:59 and this is the first-pass retrieval result.
0:29:03 This is not shown to the user; instead, we simply take the first-pass retrieval results and consider that
0:29:11 the top N
0:29:13 utterances are assumed to be relevant and taken as positive examples,
0:29:18 and the bottom N are assumed to be irrelevant and taken as negative examples.
0:29:22 Then we simply extract some feature vectors from them,
0:29:26 and then
0:29:27 we
0:29:28 train a
0:29:30 support vector machine.
0:29:32 Now, for the rest of the utterances,
0:29:35 we simply
0:29:37 extract
0:29:38 the feature parameters, and then
0:29:41 we rerank them by this
0:29:42 support vector machine, and only the reranked results are shown to the user.
0:29:49 So in this case,
0:29:51 please note that we need to train an SVM for every query, online.
0:29:56 Is it possible? Yes, because we only have a limited amount of
0:30:00 training data, so it can be very fast.
0:30:07 The first thing we need to do is define how to extract the feature parameters to be
0:30:13 used in training the SVM.
0:30:15 Well, again we can use the hypothesized region we just mentioned.
0:30:20 Suppose this is an utterance, this is the corresponding lattice, here is the query, and so this is the
0:30:26 hypothesized region.
0:30:28 We can divide this region into HMM states
0:30:31 using state alignment,
0:30:33 and the feature vectors in one state can be averaged into one vector; then these vectors for the different states
0:30:39 can be concatenated into a supervector, and that's the feature vector for this region.
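A sketch of the per-query training and reranking loop follows. A real system would train an SVM on these supervectors; to keep the sketch dependency-free, a simple centroid-difference linear scorer stands in for the SVM here, and all the vectors and IDs are invented:

```python
# Toy per-query reranking: top-N supervectors as positives, bottom-N
# as negatives, then a linear scorer (SVM stand-in) reranks the rest.

def train_linear_scorer(positives, negatives):
    """Score direction: positive-class centroid minus negative-class centroid."""
    dim = len(positives[0])
    pos_c = [sum(v[d] for v in positives) / len(positives) for d in range(dim)]
    neg_c = [sum(v[d] for v in negatives) / len(negatives) for d in range(dim)]
    return [p - n for p, n in zip(pos_c, neg_c)]

def rerank(candidates, weights):
    score = lambda vec: sum(w * x for w, x in zip(weights, vec))
    return sorted(candidates, key=lambda c: score(c[1]), reverse=True)

# Invented 2-D "supervectors" for the top-N and bottom-N utterances.
top_n = [[1.0, 0.0], [0.9, 0.1]]
bottom_n = [[0.0, 1.0], [0.1, 0.9]]
w = train_linear_scorer(top_n, bottom_n)
rest = [("u7", [0.2, 0.8]), ("u5", [0.8, 0.2])]
print([uid for uid, _ in rerank(rest, w)])  # u5 resembles the positives
```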
0:30:46 So what are the results?
0:30:48 Again, we can
0:30:50 see the results for the different
0:30:52 acoustic models,
0:30:56 and you may notice that now the
0:30:58 SVM is much better than the reference-based approach,
0:31:01 which was much better than
0:31:03 the previous results.
0:31:08 Okay, and of course I have to mention that all the results reported here are very preliminary; they were just obtained
0:31:14 in preliminary experiments.
0:31:18 Now, what else can we do?
0:31:21 All the above discussions primarily considered acoustic models and acoustic features.
0:31:27 How about linguistic information?
0:31:30 For example,
0:31:31 the most straightforward information from linguistics we can use is context dependency,
0:31:36 or context consistency.
0:31:38 In other words, the same term usually has very similar contexts,
0:31:43 while quite different contexts usually imply that the terms are quite different.
0:31:49 What can we do? We can do exactly the same as we did
0:31:52 using the SVM,
0:31:54 except now the feature vectors represent context information.
0:31:59 So we use exactly the same
0:32:01 scenario: for the first-pass retrieval results, we use the top and bottom N to train the SVM,
0:32:07 except now we use different feature
0:32:10 vectors here.
0:32:12 Suppose this is an utterance
0:32:14 with its corresponding lattice,
0:32:17 and here is the
0:32:18 query. We can construct a left-context vector,
0:32:23 whose dimensionality is the lexicon size.
0:32:27 Only those words appearing in the left context
0:32:31 have their posterior probabilities as the scores;
0:32:34 all the other words have zeros there.
0:32:37 Similarly, we can have a right-context vector and a whole-segment context vector,
0:32:42 and then we can concatenate them together into a feature vector,
0:32:46 and this has a dimensionality of three times the lexicon size.
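The context feature vector can be sketched as follows, with a toy lexicon and invented posteriors; in the talk the sub-vectors hold word posteriors taken from the lattice:

```python
# Toy context feature vector: left, right, and whole-segment sub-vectors
# over the lexicon, filled with word posteriors and concatenated.

def context_vector(lexicon, left, right, whole):
    """left/right/whole: dicts of word -> posterior for that context."""
    def sub(context):
        return [context.get(word, 0.0) for word in lexicon]
    return sub(left) + sub(right) + sub(whole)

lexicon = ["speech", "information", "retrieval", "theory"]
vec = context_vector(lexicon,
                     left={"speech": 0.9},
                     right={"retrieval": 0.8},
                     whole={"speech": 0.9, "retrieval": 0.8})
print(len(vec))  # 12, i.e. 3 x lexicon size
```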
0:32:53 Now we can use this to do the experiments, and here are the results,
0:32:58 again for the three sets of acoustic models,
0:33:00 and you may notice that the context information really helps.
0:33:06 So what else can we do?
0:33:08 Well, certainly concept matching.
0:33:11 In other words, we wish to
0:33:13 match the concepts rather than the literal terms.
0:33:17 In other words,
0:33:18 we wish the system could return utterances or documents
0:33:23 semantically related to the query but not necessarily including the query terms.
0:33:28 For example, if the query is "White House of the United States",
0:33:32 and an utterance includes "President Obama" but not "White House" or "United States",
0:33:38 we wish it could be returned as well.
0:33:41 Well, many approaches have been proposed in this direction.
0:33:45 For example, we can cluster the documents into sets,
0:33:49 so we know
0:33:51 which sets of documents are talking about the same concepts.
0:33:54 We can use web data to expand the query or expand the documents.
0:33:59 We can also use
0:34:02 latent topic models.
0:34:04 Let me show just one example of the latent topic approach.
0:34:08 Well, this is very straightforward: we just use the very popular, widely used probabilistic latent semantic analysis, or PLSA,
0:34:18 in which we simply assume a set of latent topics
0:34:23 between a set of terms
0:34:25 and a set of documents,
0:34:27 and the relationships can be modeled by probabilistic models
0:34:31 trained with the EM algorithm.
0:34:33 Of course, there are many, many other approaches,
0:34:37 and they are complementary.
0:34:39 And here is an example of work we did
0:34:41 on a spoken archive,
0:34:45 which we transformed into lattices.
0:34:47 Then for any given query we simply use the
0:34:51 PLSA model we just mentioned, based on the latent topics, to estimate the distance between the query and the lattices,
0:34:58 and that gives the results.
0:35:00 Here are some preliminary results; these results are in terms of recall-precision curves.
0:35:08 The three lower curves
0:35:10 are the baseline of literal matching, simply matching words;
0:35:14 the lowest one is the one-best result, and the
0:35:21 two upper ones are based on lattices.
0:35:26 Now,
0:35:28 the three curves here are
0:35:30 concept matching using the PLSA I just mentioned.
0:35:34 As you can see,
0:35:35 concept matching certainly helps much.
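A minimal sketch of matching by latent-topic distributions rather than literal words; the topic distributions below are invented, standing in for what a trained PLSA model would produce:

```python
import math

# Toy concept matching: query and document are mapped to distributions
# over latent topics and compared by cosine similarity.

def topic_similarity(p, q):
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm

query_topics = [0.7, 0.2, 0.1]   # e.g. P(topic | query "White House")
doc_topics = [0.6, 0.3, 0.1]     # e.g. P(topic | utterance about Obama)
unrelated = [0.05, 0.05, 0.9]
# The semantically related document scores higher even with no shared words.
print(topic_similarity(query_topics, doc_topics) >
      topic_similarity(query_topics, unrelated))  # True
```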
0:35:39 So what else can we do?
0:35:41 Well, user-content interaction; isn't that important?
0:35:47 We know that user-content
0:35:49 interaction is important even for text content.
0:35:54 In other words, when we retrieve text content, very often we also need a few iterations to get the desired information.
0:36:02 Now, for spoken content it is much more difficult, because spoken content cannot be easily summarized on screen;
0:36:09 it is just signals,
0:36:11 so it's difficult for the user to browse,
0:36:14 to scan, and to select.
0:36:17 So when the system gives a whole bunch of
0:36:20 retrieval results, we cannot listen to every one of them and then decide which one
0:36:25 we like.
0:36:26 So that's a problem.
0:36:29 What we propose is, first, we can
0:36:31 try to select
0:36:33 key terms automatically and construct titles and summaries
0:36:37 to help browsing;
0:36:39 then we try to do some semantic structuring
0:36:43 to have a better user interface;
0:36:45 and then we can try to have some dialogue
0:36:48 to help the interaction between the user and the system.
0:36:53 So let me very briefly go through some of these.
0:36:56 For example, key term extraction,
0:36:58 which is very helpful for labeling the retrieval results and for the user to browse.
0:37:05 The key terms include at least two types: keywords and key phrases.
0:37:10 A key phrase includes several words together,
0:37:14 so for key phrases we need to detect the boundaries,
0:37:18 and there are many approaches to do this. Let me use one example: suppose "hidden Markov model" is a key phrase.
0:37:25 The word "hidden" is always followed by the same word, "Markov";
0:37:32 "Markov" is always followed by the same word, "model";
0:37:39 however, "model"
0:37:41 is followed by many different words,
0:37:44 and that means this is the boundary of the phrase.
0:37:47 In this way we know
0:37:49 the boundary can be detected from the context.
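The boundary detection just described can be sketched by counting the distinct successors of each word; this is a toy corpus, and a real system would use counts or branching entropy over a large archive:

```python
from collections import defaultdict

# Toy phrase-boundary detection: inside a key phrase each word has
# almost always the same successor; high successor variety after a word
# signals a phrase boundary.

def successor_counts(corpus):
    following = defaultdict(set)
    for sentence in corpus:
        for w, nxt in zip(sentence, sentence[1:]):
            following[w].add(nxt)
    return following

corpus = [
    ["hidden", "markov", "model", "is", "popular"],
    ["the", "hidden", "markov", "model", "works"],
    ["a", "hidden", "markov", "model", "assumes"],
]
f = successor_counts(corpus)
# "hidden" and "markov" each have one successor; "model" has three,
# so the phrase boundary falls after "model".
print(len(f["hidden"]), len(f["markov"]), len(f["model"]))  # 1 1 3
```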
0:37:54now with the key term candidates
0:37:58either words or phrases
0:38:01we can use many features
0:38:03to identify whether the candidate is truly a key term
0:38:06for example prosodic features
0:38:09because very often the key terms are produced with longer duration wider pitch range and higher energy
0:38:16we can also use
0:38:17semantic features for example from plsa because key terms are usually focused on a smaller number of topics
0:38:25for example this is the distribution of topic probabilities obtained from plsa given a candidate term
0:38:32now this one looks like a key term because it is focused on only a smaller number of
0:38:37topics the horizontal axis is topics
0:38:40and this one doesn't look like a key term because it is uniformly used in many different topics and in many documents
0:38:47of course lexical features are very important those include term frequency and inverse document frequency part of speech
0:38:54and so on
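the "concentrated on a few topics" cue can be turned into a single number, for example the entropy of the term's topic distribution; a minimal sketch (the distributions below are made up, not real plsa output):

```python
import math

def topic_entropy(topic_probs):
    """Entropy of P(topic | term): low entropy means the term is
    concentrated on a few topics, a cue that it is a key term."""
    return -sum(p * math.log(p) for p in topic_probs if p > 0)

# made-up distributions over five topics
key_term_like = [0.85, 0.10, 0.05, 0.0, 0.0]   # peaked -> likely a key term
generic_like = [0.2, 0.2, 0.2, 0.2, 0.2]       # flat   -> unlikely a key term

print(topic_entropy(key_term_like) < topic_entropy(generic_like))  # -> True
```

a classifier would then use this entropy alongside the prosodic and lexical features mentioned above.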
0:38:56here are the results of extracting key terms
0:39:00using different sets of features
0:39:03here are
0:39:07prosodic lexical and semantic features and you notice that each
0:39:12single set of features is useful
0:39:15however when we integrate them together we get the best result
0:39:21now for summarization a lot of people in this room are doing summarization so i'll just
0:39:27go through this very quickly
0:39:29suppose this is a document that includes many utterances
0:39:34and we try to recognise them into
0:39:37words every circle is a word either recognized correctly or incorrectly
0:39:43what we do is try to select a small number of utterances
0:39:47which are most representative
0:39:49and avoid redundancy
0:39:51and they are used to form a summary this is the
0:39:54so called extractive summarization we can even replace these utterances with the original voice
0:40:01so there are no recognition errors in the result
0:40:05and i'll just show one example here
0:40:08because we are selecting the most representative utterances
0:40:12it is reasonable to consider that utterances whose topics are similar to the representative utterances should also
0:40:21be considered representative
0:40:23so we can do something similar
0:40:25to graph based analysis in other words every utterance is represented as a
0:40:31node on the graph
0:40:33and then we let the scores for representativeness
0:40:36propagate on the graph
0:40:38in this way we can get better scores and select better utterances
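the propagation step can be sketched as a damped random-walk update over the utterance-similarity graph, in the spirit of pagerank or lexrank; this is a minimal illustration with made-up similarities, not the exact formulation in the talk:

```python
def propagate(init, sim, alpha=0.85, iters=50):
    """Each utterance keeps part of its initial representativeness score
    and receives the rest from similar utterances, with the similarity
    matrix row-normalized into transition probabilities."""
    n = len(init)
    norm = [[sim[i][j] / (sum(sim[i]) or 1.0) for j in range(n)]
            for i in range(n)]
    scores = init[:]
    for _ in range(iters):
        scores = [(1 - alpha) * init[i]
                  + alpha * sum(scores[j] * norm[j][i] for j in range(n))
                  for i in range(n)]
    return scores

# three utterances: 0 scores high initially, 2 is very similar to 0
init = [1.0, 0.1, 0.1]
sim = [[0.0, 0.1, 1.0],
       [0.1, 0.0, 0.1],
       [1.0, 0.1, 0.0]]
scores = propagate(init, sim)
print(scores[2] > scores[1])  # utterance 2 inherits score from 0 -> True
```

utterance 2 starts with the same low score as utterance 1 but ends up ranked higher because it sits next to the highly representative utterance 0.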
0:40:43these are some results i'll skip them
0:40:46title generation
0:40:47titles are very often useful if we
0:40:51construct titles for retrieved documents and segments
0:40:55it is useful for the browsing and selection of utterances
0:41:00but titles have to be very short yet readable
0:41:03and tell you what it is about
0:41:05here's one approach
0:41:07we perform a viterbi search over the summary
0:41:11based on the scores obtained by several models
0:41:16to select the terms to order the terms
0:41:19and to decide
0:41:21the length of the title
0:41:23in this way we can have some good titles
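the viterbi-style search described here might be sketched as a small dynamic program; the scoring functions below are toy placeholders, not the actual models in the talk:

```python
def viterbi_title(words, term_score, bigram_score, max_len=6):
    """Toy Viterbi-style DP: keep the best partial title ending in each
    word, extend one word at a time, and score a title as the sum of
    term importances plus bigram coherence; the best length wins."""
    best = {w: (term_score(w), [w]) for w in words}
    final = max(best.values())
    for _ in range(max_len - 1):
        new = {}
        for w, (s, title) in best.items():
            for nxt in words:
                if nxt in title:
                    continue
                cand = (s + term_score(nxt) + bigram_score(w, nxt),
                        title + [nxt])
                if nxt not in new or cand > new[nxt]:
                    new[nxt] = cand
        if not new:
            break
        best = new
        final = max(final, max(best.values()))
    return final[1]

# toy scores: content words help, function words hurt, one good bigram
term_score = {"speech": 2.0, "recognition": 2.0, "the": -1.0, "of": -1.0}
bigram = {("speech", "recognition"): 1.0}
title = viterbi_title(list(term_score),
                      lambda w: term_score[w],
                      lambda a, b: bigram.get((a, b), 0.0))
print(title)  # -> ['speech', 'recognition']
```

the negative scores on function words act as the length control: extending the title only pays off while the added terms are worth their cost.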
0:41:27semantic structuring there can be different ways to do semantic structuring and we don't know what's the
0:41:33best approach here i'll just use one example
0:41:36we can cluster the retrieval results into some kind of
0:41:41tree structure based on the
0:41:43semantic information for example the topics they carry
0:41:48in this way
0:41:49every cluster can be labelled by a set of key terms
0:41:53with such key terms indicating what they are talking about
0:41:58every cluster can be further
0:42:01expanded into the next layer and so on
0:42:05here is another example
0:42:10in other words every retrieved spoken document or segment can be labelled by a set of key terms
0:42:17and then the relationships among the key terms
0:42:20can be constructed and
0:42:22represented as a graph
0:42:24so we know what kind of
0:42:27information is there
0:42:32okay now finally the dialogue
0:42:35if we have all of this including semantic structuring key terms and summaries here
0:42:41on the system side
0:42:42and the user is here providing some queries so what can we do to help them
0:42:49have a better interaction
0:42:51a dialogue may be possible
0:42:53and many people here in this room are very experienced in spoken dialogue so we wish to learn
0:43:00something from that
0:43:02for example we may model this process as a markov decision process or mdp
0:43:09in this way what we can do is
0:43:13for example we need to define some goals
0:43:16the goal may be a higher task
0:43:19success rate
0:43:20where here success indicates
0:43:23the user's information need is satisfied
0:43:26we can also define a goal to be a
0:43:28small number of dialogue turns or
0:43:32a small number of query terms entered
0:43:35in this way we can define a reward function or something similar and then maximise the reward function
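a reward function of the kind just described, rewarding task success while penalizing dialogue turns and query terms, might look like this (the weights are arbitrary placeholders, not values from the talk):

```python
def dialogue_reward(success, n_turns, n_query_terms,
                    success_reward=20.0, turn_cost=1.0, term_cost=0.5):
    """Episode reward: pay out when the user's information need is
    satisfied, charge for each dialogue turn and each query term."""
    return ((success_reward if success else 0.0)
            - turn_cost * n_turns
            - term_cost * n_query_terms)

print(dialogue_reward(True, 3, 2))   # successful, short dialogue -> 16.0
print(dialogue_reward(False, 3, 2))  # same effort, no success    -> -4.0
```

an mdp policy would then be trained to maximise the expected value of this reward over dialogues.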
0:43:45and here is one example application scenario for retrieving broadcast news
0:43:50here in every step when the user enters a query the system returns not only the retrieved results but also
0:43:58a list of key terms for the user to select
0:44:02if the user is not satisfied with the results here then he
0:44:06looks through the
0:44:08key term list from the top and selects the first relevant one he sees
0:44:13and this
0:44:14key term list can be ranked
0:44:16by the mdp
0:44:18and here are some results i'll skip them
0:44:21so above i have mentioned something about key terms summaries titles semantic structuring and dialogues
0:44:28so that's something about user content interaction of course a lot more work is needed before we can do
0:44:36something really useful
0:44:38okay now let me
0:44:40take a few minutes to show the demo
0:44:52and this is a
0:44:54course lecture system
0:44:56oh okay let me go through
0:44:59the slides first
0:45:04this is
0:45:05on course lectures
0:45:07and as we know there are many course lectures available over the internet
0:45:12however it takes a very long time for a user to listen to a complete course for example forty five hours
0:45:19and therefore it is not easy for engineers or other learners to learn new knowledge from these course lectures
0:45:26and we should do something about it
0:45:29we also understand there are lecture browsers available over the internet
0:45:33however we have to bear in mind that
0:45:37the knowledge in course lectures is usually structured one concept follows another
0:45:42so a retrieved spoken segment
0:45:45may very possibly be not easy to understand without enough background knowledge
0:45:52and also given the retrieved segment
0:45:55there is no information for the learner regarding what should be learned next
0:46:00so the proposed approach is to try to structure the course lectures by slides and key terms
0:46:06we divide the course lectures by slide
0:46:09and derive the core content of the slides
0:46:13the key terms and then construct a key term graph
0:46:16representing the semantic relationships among the slides
0:46:20and also all slides are given titles lengths
0:46:24timing information in the course
0:46:26summaries key terms
0:46:28and related key terms and related slides based on
0:46:32the key term graph
0:46:34all retrieved spoken segments include all this information for the slides if you want it
0:46:41and this is a system for a course on digital speech processing offered by myself at national taiwan university so therefore it is
0:46:51given in mandarin chinese
0:46:53however all terminologies are produced directly in english so this is a code mixed dataset
0:47:00okay so now let me go to the demo
0:47:08and this is the course
0:47:10and the system we gave it the name ntu virtual instructor
0:47:14and
0:47:15it was recorded in year two thousand six a total of forty five hours
0:47:20now suppose i heard something in a lecture about backward algorithms and i don't know what that is
0:47:27so i try to retrieve it
0:47:30however because i don't know what it is
0:47:31i just guess it sounds like bright workouts
0:47:35so i enter bright workouts and then do the search
0:47:39here i'm searching through the
0:47:42internet on a server at national taiwan university so i rely on the internet here
0:47:50and we see that
0:47:51here i'm retrieving the voice rather than the words
0:47:55so the query words bright workouts are totally wrong
0:48:00but here we retrieve a total of fifty six results in the course
0:48:04and here for example in this result the first one is an utterance about a second long
0:48:11it is in the slides
0:48:12it is in slide number twelve of chapter four and that slide is about basic problem three for hmm
0:48:20and here is the title
0:48:22and the slide is labeled by these key terms backward algorithm so now i know this is about
0:48:28backward algorithms rather than bright workouts and also baum welch forward algorithm and so on
0:48:35and note that because these
0:48:38utterances are represented in terms of lattices of subword units
0:48:42the subword unit sequence of this bright workouts is very similar to this one
0:48:47and that's why i can retrieve it
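the matching idea here, that the wrong query and the true term have similar subword-unit sequences, can be illustrated with plain edit distance between two unit sequences; real systems match against lattices with acoustically weighted costs, so this is only the simplest version:

```python
def edit_distance(a, b):
    """Levenshtein distance between two subword-unit sequences,
    computed with a single rolling dp row."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # delete x
                                     dp[j - 1] + 1,      # insert y
                                     prev + (x != y))    # substitute
    return dp[-1]

def similarity(a, b):
    """Length-normalized similarity in [0, 1]."""
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))

# character sequences stand in for subword units in this illustration
print(edit_distance(list("kitten"), list("sitting")))  # -> 3
```

in the actual system the units would be phones or syllables taken from the recognition lattices, so a misrecognized query can still score high against the intended term.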
0:48:50and there are many other results and so on that go with that
0:48:55now if i think i'd like to listen to this one
0:48:59i can click here to go to that slide
0:49:02this is slide number two of chapter four
0:49:05and the title here note that this is done by myself so it is a human
0:49:11generated title not an automatically generated title
0:49:15because every slide has a title and the title is basic problem three for hmm
0:49:20and here it says this slide has a total length of twenty two minutes and fifty seven seconds so if i'd
0:49:26like to listen to this i need to have twenty two minutes
0:49:30and in addition this is the position of this slide
0:49:34in chapter four out of the twenty two slides in total
0:49:37and so on
0:49:39and very important here are the key terms
0:49:42only those terms on the top in yellow are the key terms used in this slide
0:49:48and those below here are related key terms provided by the key term graph
0:49:54in other words the key term graph is not easy to show here so instead i just list the highly
0:50:00related key terms here below every key term
0:50:03so for example when i go through here i see here the key term backward algorithm
0:50:09and its related terms
0:50:12and backward algorithm is actually related to forward algorithm and so on
0:50:16now if i don't understand this one and i'd like to know a little more before
0:50:22i listen to this slide
0:50:24i can click here and that tells me that
0:50:26this key term
0:50:28backward algorithm actually first appeared in one of the slides
0:50:32which appears earlier
0:50:33and not in the slides which come later so probably there's not enough explanation about it in this one
0:50:40and if you really don't know about the backward algorithm you should go there so i can click this one and then
0:50:45i go to
0:50:47that slide
0:50:49where it says okay yeah this is the other slide
0:50:52and this is the first time that backward algorithm was mentioned
0:50:56in that slide
0:50:58and that slide tells me here the key terms in the slide for example in that slide we also
0:51:05have backward algorithm
0:51:07and forward algorithm
0:51:09the forward algorithm is actually related to that one
0:51:12and that one is related to this
0:51:14and so on
0:51:17now let me show a second example suppose i'd like to enter another query which is frequency
0:51:24now i do the search
0:51:26and in this course there are a total of sixty four results for frequency
0:51:31here the first one
0:51:34six seconds long appears in this
0:51:36slide
0:51:39labeled by these
0:51:41key terms
0:51:42and the second one
0:51:46and so on if i'm interested in this one i can press here to go to this slide
0:51:52this is the slide on pre-emphasis
0:51:54and i notice there is a summary of fifteen seconds
0:51:58so i'd like to listen to this summary
0:52:00so okay i'm retrieving the summary from the internet
0:52:07[the fifteen second summary plays in mandarin chinese]
0:52:30okay this is the fifteen second summary it's in mandarin chinese i'm sorry but i added the english subtitles
0:52:38which were actually done
0:52:40manually in order to show you what was said in that summary
0:52:46and okay so this is the end of the demo
0:52:48so let me come back to the slides
0:52:54so in conclusion
0:52:56i usually divide
0:52:58spoken language processing over the internet into three parts
0:53:02user interface
0:53:04content analysis including such things as key term extraction or summarization and
0:53:10so on and user content interaction
0:53:14and we notice that the user interface has been very successful although not very easy usually because users usually
0:53:22expect the technology to replace human beings
0:53:26content analysis and user content interaction are not easy either however because the technology can handle
0:53:34massive quantities of content
0:53:36which a human being cannot
0:53:37so the technology does have some advantages
0:53:41now spoken content retrieval is the one which integrates user interface with content
0:53:48analysis and user content interaction and therefore may offer some interesting applications in the future
0:53:56so eventually i'd like to say that i think this
0:54:00area is only in its infancy stage
0:54:03and there is plenty of space to be
0:54:07developed and plenty of opportunities to be investigated in the future
0:54:12and i notice that many groups
0:54:14have been doing some work in this area and actually many people in this room have done
0:54:20some work in this area
0:54:22so i wish we can have more discussions and more work in the future
0:54:27and hopefully we can have something better
0:54:29much better than what we have today in the future just as
0:54:33in speech recognition we are now having much better
0:54:37results than several years ago so we wish we can do something more
0:54:42okay this concludes my presentation thank you very much for your attention
0:54:58well thank you very much for a very interesting talk
0:55:02one question i have for you is that a lot of people in the audience are working on a related
0:55:09or somewhat related problem which is voice search and one of the issues that comes up in voice search is sometimes you
0:55:17say something it gets the speech recognition wrong and then you repeat the query to get it right and the
0:55:24other sets of choices it may come up with may be sort of similar and overlapping
0:55:31it would seem to me that has some relation to relevance feedback in the sense that the user is
0:55:37sort of giving an additional set of information about the previous query that was dictated i'm just wondering
0:55:46if you or the people you work with have looked into this
0:55:50sort of problem or whether you have any opinions on whether you could get improvements by somehow
0:55:58taking the union of multiple queries
0:56:01in a voice search sort of task to jointly improve results using methods similar to what we
0:56:07heard about in your talk
0:56:10thank you very much i think that is certainly a very good idea and
0:56:14actually as i mentioned in the beginning in this talk we
0:56:19are not talking about voice search but as in your experience for example repeated queries may carry
0:56:26some good information or correlation about the intent of the user so they are helpful for example in
0:56:33what i mentioned about the dialogues we actually allow the user to enter a second query and so on and that
0:56:42uses the interaction or the correlation between the first and second queries and so i think that's the
0:56:50only thing we have done
0:56:52up to this moment but i think probably what you say implies much more we can
0:56:57do and we are i think we
0:57:00as i mentioned we just have
0:57:03too much work
0:57:04to be done and so we can think about how to implement what you say in the future
0:57:13thank you very much that was really a very interesting talk i have a detailed technical question on your svm
0:57:24if you can go back to that slide
0:57:31oh yeah that one
0:57:32for the svm you take positive examples there okay that slide yes so my question is when you train
0:57:40your svm it seems like you're only taking high confidence examples as training examples
0:57:48so in the margin you're pulling the examples from where it is far from the margin
0:57:54and in the testing phase if you have some difficult examples that are close to the hyperplane then you may
0:58:04have a hard time
0:58:06yeah certainly you're right
0:58:08but well that's all we can do at this moment because you know nothing about these results right you just
0:58:14have the first pass results here and all we can do is assume the top and bottom
0:58:21and then construct the svm of course in the middle close to the boundary it's a problem
0:58:28however svm already provides some solution for that namely the large margin concept
0:58:34so they try to provide
0:58:38somehow a large margin and also there are some
0:58:41slack allowances right so
0:58:44we just try to follow the idea from svm and
0:58:50to see if we can do something there
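the scheme being discussed, taking the top-ranked first-pass results as positive examples and the bottom-ranked as negative and then fitting a separator, can be sketched as follows; a plain perceptron stands in here for the large-margin svm, and all the numbers are toy values:

```python
def pseudo_feedback_rescore(features, scores, n_pos=2, n_neg=2, epochs=20):
    """Pseudo-relevance feedback: label the top-ranked first-pass
    results positive and the bottom-ranked negative, fit a linear
    separator (perceptron updates; the talk uses an SVM), and rescore
    every candidate by its signed distance from the separator."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    labeled = ([(features[i], 1) for i in order[:n_pos]]
               + [(features[i], -1) for i in order[-n_neg:]])
    w, b = [0.0] * len(features[0]), 0.0
    for _ in range(epochs):
        for x, y in labeled:
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
    return [sum(wi * xi for wi, xi in zip(w, x)) + b for x in features]

# toy 2-d features: the first three candidates are truly relevant
features = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1],
            [-1.0, -1.0], [-1.1, -0.9], [-0.9, -1.2]]
first_pass = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
rescored = pseudo_feedback_rescore(features, first_pass)
print(min(rescored[:3]) > max(rescored[3:]))  # -> True
```

the difficult examples near the separator are exactly the concern raised in the question; the soft-margin svm's slack variables are one way to keep noisy pseudo-labels from dominating.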
0:58:52okay thank you
0:58:58i have a question sure
0:59:00i think it was a great talk there are a lot of parallels with the
0:59:05methods here and this is really following up on what michael said
0:59:09between text based web search and this problem
0:59:15you know even beyond voice search
0:59:17some methods i think that have been developed by this community would probably benefit the web search community
0:59:24and vice versa i wanted to
0:59:27ask you to comment on that and on your awareness of the literature there and opportunities for
0:59:34cross fertilisation there in web search query rewriting
0:59:38is well established and
0:59:41also in click feedback web search has gotten a lot of benefit from distinguishing between clicks from users
0:59:49and good clicks
0:59:51because clicks tend to be noisy and so there's been a lot of work in the web search community
0:59:56about modelling clicks and you know determining good clicks and using those more heavily for feedback
1:00:03also just basic things like editorial relevance feedback where you know large groups of users
1:00:11you know
1:00:13determine the relevance and use that as feedback so i just wanted to ask you
1:00:19what your thoughts are on the
1:00:20opportunities for cross fertilisation between these two areas
1:00:24yeah sure there are a lot of possibilities to learn from the experience in web search to do this
1:00:31kind of work
1:00:32we just don't have enough time to explore the possibilities as you mentioned the clicks may be divided into
1:00:41several categories and they can be learned or something like that
1:00:46and i think that could be done in the future
1:00:49but on the other hand we also learn a lot from other areas such as text retrieval
1:00:56and then for example
1:00:59relevance feedback or pseudo relevance feedback or ranking or learning to rank much of these ideas we learned from that
1:01:10community so certainly
1:01:13cross area
1:01:14interaction is very helpful
1:01:17because these areas actually overlap
1:01:21on the other hand we really try to
1:01:24do something more from the speech area for example acoustic models
1:01:31for example the acoustic features
1:01:35and so on and for example spoken dialogue
1:01:39and we try to follow all those good ideas and good experiences in speech and see what can be used
1:01:46in this area
1:01:48and i think as i mentioned we have plenty of space to be explored in the future
1:01:56thank you very much