thanks karen, it's very nice to be here to talk about some of the low resource work we've been doing in our group. the research that i'll talk about today involves the work of several students in our group that i list here.
there's been a lot of talk just today about the low resource issue, with good descriptions, so i won't belabour the point. i think my perspective on the problem is that current speech recognition technology will benefit by incorporating more unsupervised learning ideas.
this schematic shows a range of increasingly difficult tasks that we could imagine, starting in the upper left from the conventional asr approach that has annotated resources, a pronunciation dictionary and units, to scenarios that have fewer resources associated with them, parallel annotated speech and text, independent speech and text, all the way down to just having speech. what can you do with that, as would be the case for example for an oral language like the ones we were talking about earlier this morning?
i think it's a challenging problem, but if we start to look at some of these ideas i think there will be a few benefits. first of all, i think it's a really interesting problem and we'll learn a lot just by trying to do it. second, i think it will ultimately enable speech recognition for larger numbers of languages in the world. and it has the potential, ideally, to complement existing techniques and so even benefit languages that are already quite successful with conventional techniques.
so in the time i have today i'm going to talk about two research areas that we've been exploring in our group. the first one is a speech pattern discovery method that we've applied to various problems, and this is an example of the zero resource scenario that could work potentially on any language in the world with just a body of speech, so that has a certain appeal to it, but there are no specific models that we learn from it. a newer line of research that we've been starting to do is exploring methods to learn speech units, models of those units, and pronunciations from either speech alone or when there are some limited resources available, and we're using a joint modeling framework to do this. even though it's still very early days for this work, i believe it's quite promising. so those are the two things i'll touch on, and it'll be a fairly high-level overview because i don't have a lot of time, but hopefully you'll get the idea of some of the things we're trying to do.
so the speech pattern work was actually motivated in part by humans, in the sense that people like jenny saffran have shown that just by exposing infants to a short amount of nonsense syllables, they very quickly learn where the word boundaries are and what they've seen before versus what they haven't. so we asked, why can't we try to apply some of these ideas if we have a large body of speech but we don't have, say, all the conventional paraphernalia that goes with conventional asr? maybe we have a lot of speech, so we could throw all that out and look for repeating occurrences, instances of things like words, and there might be some interesting things that we could do if we achieved that capability.
so to describe it in terms of some speech: the idea was, if we had different chunks of audio, could we find the recurring words? these are two utterances with the word schizophrenia in them; can we develop a method to establish that those two are the same thing? the approach that we took, fairly common sense i think, is that if you have a large body of audio, we would compare every piece of audio with every other piece of audio. so for those two utterances we would compute a distance matrix which would represent point-by-point distances, and the idea is that when two spectral frames are the same, you of course get a low distance, and when they are very dissimilar, a high distance.
the representations that we use have varied over time. we started off just using whitened mfccs, which work fine if it's all the same speaker. we then went to unsupervised posteriorgram representations derived from gaussian mixture models and also some dnns trained in an unsupervised way; we've done some stuff with herb's self-organizing units as well, representing them as a posteriorgram. it really doesn't matter what you use. the interesting thing is, when we all look at that picture, i think most of us can see right away that there is a diagonal, a sort of stretch of low distances, and that's where the repeating pattern is, so that's what we want to do: try and find that automatically.
and we developed a variation on dynamic time warping we call segmental dynamic time warping, which basically just consisted of striping all the way through the audio corpus so that you would eventually compare every piece with every other piece, and the warping path you were on would eventually snap onto that diagonal as you passed over that possible alignment. we call this little region of the alignment path a fragment, and the value in that region is sort of the point-by-point distance, and that's what we're trying to find.
so there are different ways to do this, but this particular illustration shows the point-by-point alignment of the two stripes, the two pieces of the utterances here, and this is the distortion as a function of the frame-by-frame distances, and here of course is where the overlapping word schizophrenia is. so you want to look along that warping path and establish some mechanism to try and find a low-distortion region. we were looking for things that were at least half a second long; it turns out the longer the constraint, the better this idea works, so something like boston red sox would work really well as an expression.
we extended it a little bit, and when we do this we produce our aligned fragments. people have modified this basic idea because it's computationally fairly intensive. we've actually done some work on approximations that are guaranteed to be admissible, but other people, like aren jansen at jhu, have done some really nice work using visual processing concepts to significantly reduce the amount of computation involved, although it turns out i think that using the segmental dtw idea is not a bad way to refine the initial matches.
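just to make the basic computation concrete, here is a minimal sketch (my own toy illustration, not our actual implementation) of computing a frame-by-frame posteriorgram distance matrix between two utterances and scanning constant-slope diagonals for low-distortion stretches of at least half a second; the real segmental dtw additionally lets the path warp around the diagonal, and the speed-ups just mentioned avoid the full quadratic comparison:

```python
import numpy as np

def posteriorgram_distance_matrix(P1, P2, eps=1e-12):
    """pointwise distance between two posteriorgram sequences.

    P1: (T1, D), P2: (T2, D); each row is a per-frame posterior distribution.
    uses -log of the inner product, a common posteriorgram distance.
    """
    return -np.log(np.clip(P1 @ P2.T, eps, None))

def diagonal_fragments(D, min_len=50, max_avg_dist=4.0):
    """scan constant-slope diagonals of the distance matrix D for stretches of
    at least min_len frames (e.g. 0.5 s at a 10 ms frame rate) whose average
    distortion is below max_avg_dist.  this is only a first pass; segmental
    dtw would additionally let the path warp around each diagonal.
    returns tuples (row_start, col_start, length, avg_dist).
    """
    T1, T2 = D.shape
    fragments = []
    for offset in range(-T1 + min_len, T2 - min_len):
        diag = np.diagonal(D, offset=offset)
        csum = np.concatenate([[0.0], np.cumsum(diag)])
        avg = (csum[min_len:] - csum[:-min_len]) / min_len    # windowed average
        for k in np.where(avg < max_avg_dist)[0]:
            i, j = (k, k + offset) if offset >= 0 else (k - offset, k)
            fragments.append((int(i), int(j), min_len, float(avg[k])))
    return fragments

# toy usage with random "posteriorgrams" (3 s and 2.5 s of 10 ms frames)
rng = np.random.default_rng(0)
P1 = rng.dirichlet(np.ones(50), size=300)
P2 = rng.dirichlet(np.ones(50), size=250)
D = posteriorgram_distance_matrix(P1, P2)
print(len(diagonal_fragments(D, min_len=50)))   # number of candidate fragments
```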
so when you do this, what happens is you end up with pairs of utterances, and the things in red are example matches, low-distortion matches that are found in your corpus. you can see that, depending on the parameters like the width that you pick (we were sort of aiming for word-level units, which is why we picked the half-second constraint), sometimes it's a word, sometimes it's multiple words, sometimes it's a fragment of a word, and sometimes it's something similar but not the same thing. this is the type of thing that you get out.
the interesting question then is, once you have all these pairwise matches for your corpus, you'd like to try to establish which things have the same underlying identity, and so you have to go to some sort of clustering notion. this is what we call speech pattern discovery: we try to represent all of these pairwise matches in a graphical structure that we can then do clustering on.
i'll describe how we defined the vertices in the graph in a second, but when you do it, each region corresponds to a vertex in the graph and the edges correspond to connections between the regions, and then you can try to do clustering.
naturally, of course, in the real world these clusters are inevitably connected in some capacity since the matches aren't perfect, but you can then apply your favourite clustering algorithm to try and find densely connected regions in the graph, and that's in fact what we did.
let me just very briefly show you one way that we did it; there are many ways to define the vertices in the graph. this illustration is sort of showing all the example pairwise matches, so each little rectangular bar corresponds to a match and the colour means it's the same match. the blue rectangle, for example, is a region where we think a word matches to something else that was said over here. okay, so different colours mean different matches.
well, if you actually look at what's going on at any point in time, it's messier than that, because you potentially have a whole lot of matches, and because each match is done independently the start times and end times are all probably going to be different. but what we did is we summarized that collection of matches by just summing up the similarities as a function of time, and what you get is something time-varying that has local maxima, and we defined those local maxima as places of interest where a lot of similarity matches were occurring.
and so we use those places to define nodes, or vertices, in our graph. once you've defined the nodes, the matched pairs that you have that overlap the nodes define the edges in your graph. so for example the blue pair went from node one to node eight, so you'd make this connection in your graph, and you can do that for all of the matches that you have that are low distortion, and so this is how you can construct your graph.
and then, as i said, you can apply a clustering algorithm to make chunks out of that, to define clusters, as sketched below.
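as a rough illustration with hypothetical match data (not our code), the graph construction might look like this: sum the match similarities over time, take local maxima of the profile as nodes, connect the nodes spanned by the two sides of each matched pair, and then cluster; here simple connected components stand in for the denser-subgraph clustering we actually used:

```python
import numpy as np
from collections import defaultdict

# hypothetical matches within one long recording: (start_a, end_a, start_b, end_b, similarity)
matches = [(100, 160, 900, 955, 0.8),
           (105, 150, 2000, 2050, 0.7),
           (905, 950, 2005, 2045, 0.9),
           (3000, 3070, 4000, 4065, 0.6)]
T = 5000                                   # total frames in the recording

# 1) sum match similarity over time to get a profile
profile = np.zeros(T)
for sa, ea, sb, eb, sim in matches:
    profile[sa:ea] += sim
    profile[sb:eb] += sim

# 2) local maxima of a smoothed profile become the graph nodes (vertices)
smooth = np.convolve(profile, np.ones(25) / 25.0, mode="same")
peaks = [t for t in range(1, T - 1)
         if smooth[t] > 0 and smooth[t] >= smooth[t - 1] and smooth[t] > smooth[t + 1]]

# 3) each low-distortion match connects the nodes its two sides overlap
def nodes_in(start, end):
    return [i for i, p in enumerate(peaks) if start <= p < end]

edges = set()
for sa, ea, sb, eb, _ in matches:
    for u in nodes_in(sa, ea):
        for v in nodes_in(sb, eb):
            edges.add((u, v))

# 4) cluster; union-find connected components stand in for the denser-subgraph
#    clustering used in practice
parent = list(range(len(peaks)))
def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x
for u, v in edges:
    parent[find(u)] = find(v)

clusters = defaultdict(list)
for i, p in enumerate(peaks):
    clusters[find(i)].append(p)
print(dict(clusters))                      # cluster id -> node times (frames)
```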
so let me show you an example on a lecture that was recorded at mit. we had four matches here at different places in the recording, and there was a nice little cluster here that i'll show you. i was hoping to play you some examples, but the audio isn't really set up, and there were four things that would have played at the same time. basically this lecturer was talking about variations of search engine optimizer, search engine optimizing, and the cluster actually spanned the word optimize, picking up the common acoustics. so this is an example of the type of thing that you get. interestingly, all of these words tended to occur near each other in the lecture, so you can actually (we've done this work, i'm not going to talk about it) do topic segmentation based on the time-varying nature of these clusters over the course of a long audio recording.
i can show you some other examples; we've tried it on different languages. this is a lebanese interview that we recorded, and it actually has two people talking: one of them is using lebanese levantine arabic and one of them is talking in msa. the algorithm doesn't care, it doesn't know, it's oblivious; it's just looking for things that look like they're the same, and so here's the cluster that it got. there's another one from a mandarin lecture that we applied it to; you get the idea, it's defining these acoustic chunks and they're sort of the same thing.
now, when you do it over a single large body of audio like a lecture, you'll get a bunch of these different clusters, and it's interesting when you look at what the underlying identity of each cluster is: you can see that a lot of the terms are important content words. in fact, in one study we were finding about eighty five percent of the top twenty tf-idf terms in lectures, so it's an indication that the clusters are finding potentially useful kinds of information. and i guess one of the motivations for this is that if there's a word that's important in a conversation or a lecture, it'll probably be said multiple times, and that gives us a chance to find it. it's not always the case, but we need that for this type of technique to work.
now, one of the things that we've done recently is, in addition to doing this within one particular document, you can look at the relationships between these unsupervised patterns across different documents, and, as was discussed earlier with topic id, we can do unsupervised topic clustering based on the relationships of these unsupervised words across different documents.
just to visualize that a little bit: each of these grey rectangles is a different document and the darker grey rectangles are speech patterns that we found in the unsupervised way, and the connections are just places where they connected to each other with a low-distortion match. say, for example, you have this type of distribution of your clusters; then you might want to say that these two documents on the right belong together because of the connections between those unsupervised terms that we found, and the three on the left are in the same class. again, this is all done unsupervised.
so to do this we tried a couple of different methods, but the one that was most successful had a latent model for topics and words, and that's just the plate notation on the right. the observed variables of course were the documents, and then there's what we call a link structure: we define a link as the connections for each interval in a document that we found, and the link structure is just the set of connections to all the other patterns in all the other documents. the latent variable for words has a certain distribution over links, and the topics have a certain distribution over words, and you can learn this model with an em-style algorithm. the interesting thing is, we did some experiments on the fisher corpus, sixty conversations spanning six different topics. we seeded this with about thirteen hundred initial clusters, and we did tell it to define six clusters, so that was kind of cheating.
but these are the resulting clusters that we found, and the interesting thing is, when you look at the underlying speech patterns that are associated with these topics, they actually make a little bit of sense, which is nice. what you find is that there are relevant words and there are irrelevant words, things that you might like to have in a stop list if you were doing this with text, so it would be nice to get rid of those. let me show you some of the other ones: the green one was on minimum wage, another one is on computers and education, and the purple one is kind of interesting. when you look at the distribution of the true underlying topic labels, some of them are pretty good, you know, the holidays, computers in education, although the latter got split into two. it's kind of intriguing to me that corporate conduct and illness were mapped into the same cluster; maybe that's telling us something. but, you know, it's really early days, and there are a lot of things that you can potentially do with these kinds of unsupervised methods that are not conventional; earlier talks showed some nice examples too, and this is another one.
i want to move on and talk about some of the newer work that we're doing, where we're trying to learn models of speech units and learn pronunciations, really trying to get rid of the dictionary, or at least learn some methods for doing without it. you know, it's interesting: we pride ourselves on the ignorance models that we've developed with hmms, modeling things we don't know about speech, and yet the dictionary is still our crutch. it's typically made by humans, and hours and hours are spent tweaking these things, getting rid of all the inconsistencies; anybody who has worked on one knows it's hard work and it takes a long time. mary can tell you the amount of effort that goes into making these dictionaries for the babel program; it's not a trivial effort. why is it we can't learn the units and learn the pronunciations automatically? we do everything else. i think it's time we looked into this.
so this is the type of thing we're trying to do: what can we do from speech alone, or maybe, if you have some text, can that help you learn pronunciations? we're trying to do both of these things now in our work. there's prior work in this area dating all the way back to the eighties, with people trying acoustic-based approaches; among more recent work, herb's self-organizing units are a good example, and there's been other work that's come out of johns hopkins that is very interesting as well.
the approach that we've been taking is motivated by something that is becoming more popular in machine learning, a bayesian framework for inference. in particular, sharon goldwater, who's now at the university of edinburgh, had a really nice paper on trying to learn word segmentation from phonetic transcriptions, so the input was symbolic, and more recent work has tried to do it from noisier phonetic transcriptions. we wanted to try and modify this model so we could learn from the speech itself, so that's what i'll talk about now, and then we've recently extended it to try and learn word pronunciations as well.
so we all know what the challenges are when you're trying to learn what the speech units are. first of all, as the last questioner said, we don't know how many units there are, maybe there are sixty-four, maybe not, and we don't know what they are and we don't know where they are. so these are a lot of unknowns we're trying to figure out. what we're trying to do is, given speech, and in this case only speech, discover the inventory of units and build a model for each of them. as i said, we're formulating this in a kind of mathematical framework that's different for the speech community, where we have a set of latent variables that include the boundaries, the units and the segments, and then there's the conventional hmm-gmm model we're using for each unit, which i'll describe shortly. in this initial work we actually wanted to learn the number of units, and to do this we represented it with what's known as a chinese restaurant process, or dirichlet process, as a prior, so there's a finite chance of generating a new unit every time you iterate through.
so let me walk through this at a high level, sort of channelling the generative story of how we're generating an utterance with this hmm-gmm mixture. basically, underlying it we have a set of K models, and for a particular set of frames one of them is selected and it generates a certain number of speech frames, and then you transition to another one which generates more frames, et cetera, as you go through the entire utterance.
so that sort of describes our model, but what are the latent variables? first of all, we don't know where the transitions are in the speech between one unit and another, so the b's will be one set of latent variables. we don't know what the labels are, the inventory of labels i have down below, so these c's in purple will also be a set of unknown variables. we don't know, of course, the parameters of our hmm-gmm model, so those will be unknown variables as well. and lastly, as i mentioned, we don't know how many units there are, so that will be an unknown too, and this last piece is what we're modeling with the dirichlet process.
so the learning procedure for this is inference via gibbs sampling. it's an iterative process where we initially select values for the boundary variables (i'll talk about that shortly, but for now think of it as an initial segmentation) and we have an initial prior distribution for the parameters. then we go through our corpus one segment at a time, where a segment is defined as the chunk of frames between boundaries, and based on the posterior distribution we sample a value: for each segment we sample the identity of the unit, the c, for that particular segment. let me say a little more about that here.
as i mentioned, this is a chinese restaurant process because there's a finite chance of defining a new unit. for those of you not familiar with that, the analogy is of people going into a chinese restaurant and trying to decide which table to sit down at, because each table can seat multiple customers. in this notation, each segment is a customer and it has to decide which table to sit at, and each table has an index which corresponds to the identity of one of the models; think of each of these tables as belonging to a different unit. what you want to have is a posterior probability of taking a particular unit label for each segment, and that is basically proportional to the likelihood that that particular segment's data was generated by that particular unit's hmm-gmm, weighted by a prior probability which just corresponds to the number of customers that were at the table, normalized by the total number of segments that you have. you'll notice that there's also a little bit of probability that is stolen away from each one in order to assign some probability to the chance that you might generate a new unit as well.
so once you have these posterior distributions set up, you sample, and that's the value of c for that particular segment at that particular iteration.
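here is a hedged sketch of that sampling step for a single segment; the per-unit log-likelihoods are assumed to come from the existing unit hmm-gmms, and the new-unit likelihood is a stand-in for the prior predictive term in the full model:

```python
import numpy as np

def sample_unit_label(unit_logliks, counts, new_unit_loglik, gamma, rng):
    """one gibbs step: sample the unit index for one segment under a CRP prior.

    unit_logliks    : log p(segment | unit k) from each existing unit's hmm-gmm
    counts          : number of segments currently seated at each unit (table)
    new_unit_loglik : log-likelihood of the segment under a brand-new unit
                      (the prior predictive in the full model)
    gamma           : concentration parameter; the mass "stolen away" for a new table
    returns an index in [0, K]; K means "create a new unit".
    """
    counts = np.asarray(counts, dtype=float)
    log_prior = np.log(np.append(counts, gamma)) - np.log(counts.sum() + gamma)
    log_post = log_prior + np.append(np.asarray(unit_logliks, dtype=float), new_unit_loglik)
    probs = np.exp(log_post - log_post.max())      # normalize in a stable way
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

# toy usage: three existing units; the segment's data strongly favours unit 1
rng = np.random.default_rng(0)
idx = sample_unit_label(unit_logliks=[-310.0, -290.0, -305.0],
                        counts=[40, 25, 10],
                        new_unit_loglik=-300.0,
                        gamma=1.0,
                        rng=rng)
print("sampled unit index:", idx)
```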
once you have a unit label for that segment, you then go through and sample the hmm parameters, and this is sort of on our home turf: it's a conventional hmm-gmm, we're using an eight-mixture model with the familiar three-state left-to-right transition structure. we assume you have to start a segment in state one and end it in state three, but for the other frames we draw samples to determine which state you're in, and once you have the state sequence we draw samples to see which mixture component you're drawing from, and then we can update the parameters based on that.
in passing, i want to say that we also have to consider different segmentations, the choices of the b's, these boundary variables. naively, every frame could be a boundary, and they take binary values: either a frame is a boundary or it's not, so zero or one. in terms of putting this into a probabilistic formulation, we again have a prior and a posterior probability. the prior is just a bernoulli trial: we flip a coin with probability alpha that it's a boundary and one minus alpha that it's not. to generate the posterior we draw a new sample for every boundary; we go through one boundary at a time, fix the state of all the other boundaries, generate the posterior distribution, and then sample whether it's a boundary or not.
and the posterior, sorry for the math, is just this: we have the prior here, and then we consider all possible units that the segments on either side of this boundary might be, so this is the likelihood that you would generate the data given that this was a boundary. then you consider the possibility that it is not a boundary, and again that's the prior, and you consider the likelihood of generating the entire merged segment, considering all possible models that you have. so those are your two posterior terms, and you sample from them to generate a new value for each boundary through the corpus.
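a minimal sketch of that boundary sampling step, in my own simplified notation, where the segment log-likelihoods are assumed to already be marginalized over all candidate units as just described:

```python
import numpy as np

def sample_boundary(loglik_left, loglik_right, loglik_merged, alpha, rng):
    """one gibbs step for a single candidate boundary b, all other boundaries fixed.

    loglik_left / loglik_right : log p(data | models) of the two segments that
                                 exist if b is a boundary, each already summed
                                 (marginalized) over all candidate unit models
    loglik_merged              : the same quantity for the single merged segment
                                 that exists if b is not a boundary
    alpha                      : bernoulli prior probability that a frame is a boundary
    returns 1 (boundary) or 0 (not a boundary).
    """
    log_p1 = np.log(alpha) + loglik_left + loglik_right        # b = 1
    log_p0 = np.log(1.0 - alpha) + loglik_merged               # b = 0
    m = max(log_p0, log_p1)                                    # stable normalization
    p1 = np.exp(log_p1 - m) / (np.exp(log_p0 - m) + np.exp(log_p1 - m))
    return int(rng.random() < p1)

# toy usage: splitting explains the data slightly better than merging here
rng = np.random.default_rng(1)
print(sample_boundary(loglik_left=-120.0, loglik_right=-95.0,
                      loglik_merged=-230.0, alpha=0.2, rng=rng))
```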
so you iterate through this, i think it was twenty thousand iterations we were doing, to generate all these parameters. the last thing i'll say, and this is near and dear to my heart as anybody who has known me for a while knows, is that i am a big believer in landmark-based things, and it turns out they can help save a lot of computation. so we're using some acoustic landmarks we developed, derived from spectral change, and the nice thing is that they are language-independent and they reduce the computation. and as i'll show you later, this is just for the initialization: once you've learned the units you can then go and train conventional models doing frame-based stuff, so this is sort of a heuristic to help you do the learning faster, but it seems to be effective.
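as an illustration of the kind of spectral-change landmark detector that could be used to propose candidate boundaries (a generic sketch, not the specific detector we built):

```python
import numpy as np

def spectral_change_landmarks(features, win=2, min_gap=3, threshold=None):
    """pick frames of large local spectral change as candidate segment boundaries.

    features : (T, D) array of per-frame features (e.g. mfccs)
    win      : half-window used to compare the past and future local averages
    min_gap  : minimum spacing, in frames, between landmarks
    """
    T = len(features)
    change = np.zeros(T)
    for t in range(win, T - win):
        past = features[t - win:t].mean(axis=0)
        future = features[t:t + win].mean(axis=0)
        change[t] = np.linalg.norm(future - past)
    if threshold is None:
        threshold = change.mean() + change.std()
    peaks = [t for t in range(1, T - 1)
             if change[t] >= threshold
             and change[t] >= change[t - 1] and change[t] > change[t + 1]]
    landmarks, last = [], -min_gap          # enforce a minimum gap between landmarks
    for t in peaks:
        if t - last >= min_gap:
            landmarks.append(t)
            last = t
    return landmarks

# toy usage: a synthetic feature stream with an abrupt spectral change at frame 50
feats = np.vstack([np.zeros((50, 13)), np.ones((50, 13))])
print(spectral_change_landmarks(feats))     # expect a landmark near frame 50
```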
i don't want to dwell on the experiments too much. in this work we were analysing the timit corpus, and we found a hundred and twenty-three units, so i'm going to have to argue that out with the earlier questioner who suggested sixty-four.
i have a similar kind of plot showing the underlying phonetic label versus the unit index, and you can see we're covering the majority of the sounds, though we're spending a little too much effort modeling silence; i think we would benefit from a good speech activity detector. the interesting thing is when you start looking at some of the sounds that have multiple units, like 'ae': we looked, and there tends to be a separate unit for particular contexts, so we are seeing some context-dependency here. 'ae' has a raised variant in a word like 'champ' versus 'cap', and we were seeing a lot of those cases where one particular unit covered the 'ae' followed by a nasal, so there is some context-dependency stuff going on.
how am i doing for time? i'm good? okay, good.
so i want to move on to the next step: ultimately we really want to get all the way to learning words. we're not there yet, but what we've tried to do is enhance the model so we could learn pronunciations from parallel speech and text data, and ideally do better than the graphone model, one hopes. okay, i'm going to go with your first answer.
now again, there's been work done in this area in the past; people like mari ostendorf in the nineties explored joint acoustic and lexicon discovery. our natural baseline for this work is the grapheme-based recognizer, the standard thing people do when there isn't a pronunciation dictionary but there is parallel text. and i'm giving away a little bit of the punchline, but the formulation we have can reduce to the grapheme-based setup if you want it to; it's one particular constrained application of it.
we're setting up an additional latent structure here
beyond what we had before
so we have all
we had the unit sequence we need to learn the boundaries
and we also need to learn the graphing the sound mappings now the experiments we
don't in english will think letter to sound if you want but the framework generalizes
to do two different languages
just to remind you where we started from: with just acoustic units alone and some distribution on the likelihood of predicting a particular unit, you can go through and generate speech frames as a sequence of these units. now, if you have a word associated with it, like the word fly, we need to introduce new variables. we're not learning word pronunciations directly; we're actually learning these grapheme-to-sound mappings, because we think they'll generalize better across a corpus. you might eventually want to do word-specific pronunciations too. so we represent that by another set of latent variables that are letter-specific, where we have specific distributions for each letter (by the way, we'll eventually try tri-grapheme, context-dependent ones, but this is mono-grapheme for now). so you have letter-specific distributions: the f might prefer this particular acoustic model and the l might prefer that particular one, et cetera; hopefully you get the idea.
so those are the letter-specific mappings, and this is the initial belief, and of course you need to couple these together so that your general belief is related to the more context-specific ones. again you have an unknown set of units, and these will also have dirichlet-process-like priors underlying them as well. so if you go through it: you have a letter, you use that particular distribution, you select a particular unit, and that unit generates your frames; when you go to a different letter there's a different distribution, so you'd likely sample a different unit and generate more frames, et cetera.
and that's sort of how it would work. so this model is now a joint model for learning the units, the grapheme-to-sound mappings, and the underlying acoustic models.
one more wrinkle we have to deal with is that there isn't necessarily a one-to-one matching between letters and units; with graphemes there is a one-to-one matching by definition, but with sounds there isn't necessarily. so we introduce another variable that gives you some flexibility as to the number of units a particular letter might map to; thinking of english here, we said zero, one, or two, and we have a little categorical distribution which predicts that likelihood, a letter-specific (or context-dependent letter-specific) distribution. so the letter x, for example, might be likely to be represented by two models that you learn.
so let me go through this very quickly, i know i'm running out of time. once you have a sequence of letters, you have to generate a sample for the number of units for each one; once you have that set of units, you have to pick an hmm acoustic model to sample from, and these are position-dependent as well, so for x, which has two, we have position-one and position-two distributions that you sample from. then, once you've sampled the particular units, you generate speech frames from the appropriate hmm-gmm model.
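to pull the generative story together, here is a toy sketch of generating an utterance for a written word under this model; every distribution here is a made-up placeholder, whereas the real model puts dirichlet-process priors on them and learns everything jointly by gibbs sampling:

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical letter-specific parameters over a tiny inventory of 5 units
letters = "fly"
n_units_probs = {c: [0.05, 0.80, 0.15] for c in letters}     # p(0, 1 or 2 units | letter)
unit_probs = {                                                # p(unit | letter, position)
    "f": [[0.70, 0.10, 0.10, 0.05, 0.05], [0.2] * 5],
    "l": [[0.10, 0.70, 0.10, 0.05, 0.05], [0.2] * 5],
    "y": [[0.05, 0.05, 0.10, 0.70, 0.10], [0.2] * 5],
}

def emit_frames(unit, rng):
    """stand-in for the per-unit 3-state hmm-gmm: emit a random-length block
    of 13-dimensional frames centred on a unit-specific mean."""
    length = int(rng.integers(5, 15))
    return rng.normal(loc=float(unit), scale=0.5, size=(length, 13))

def generate(word, rng):
    units, frames = [], []
    for letter in word:
        k = int(rng.choice(3, p=n_units_probs[letter]))       # 0, 1 or 2 units for this letter
        for pos in range(k):
            unit = int(rng.choice(5, p=unit_probs[letter][pos]))
            units.append(unit)
            frames.append(emit_frames(unit, rng))
    speech = np.vstack(frames) if frames else np.empty((0, 13))
    return units, speech

units, speech = generate("fly", rng)
print("unit sequence:", units, "| frames:", speech.shape)
```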
so the latent variables that we have to deal with, and now there are more of them, are the number of units per letter, the unit label identities as before, these letter-specific distributions, and the hmm-gmm parameters. as for the inference, i'm going to skip that; i knew i wouldn't have time to go into it.
you end up with this mapping all the way down from the letters to the segments, and now, if you look at the top part, you can generate pronunciations for the words in your lexicon in terms of these units that you learned, and of course you can train your hmm-gmm models, and guess what, now you're in business to train a conventional speech recognizer with whatever technique you like.
this is similar to what herb was talking about. for the experiments, and i know it's very ironic that here i am talking about low-resource languages and all the experiments i'm showing you are in english, it's ongoing work, trust me, we've done some experiments on a weather corpus that we've had for a while and like working on. we used an eight-hour subset to try and learn the units and the pronunciations, and then we retrained a conventional recognizer on the entire training set using these units, and we compared against our baselines: the expert dictionary is where we want to be, but certainly we would hope to do better than the graphemes. just to cut to the chase, we're in between the two right now.
so, graphemes are hard for english, everybody knows that, so it's nice that we were able to beat that, but we'd like to get to the supervised level, and i actually think there's a reasonable expectation that we can do that, because we've done some work on automatically throwing out expert pronunciations and learning new ones based on graphone models, and we can get down to about eight point three percent. so i would hope that we could drive this down below the ten percent mark.
it's still early days. it's also hard to interpret exactly what you've actually learned, because the pronunciations are in terms of these unit numbers, which is not very intuitive for me. but we can do things like look at all of these words that have a 'sh' in them and see that they tend to have the same unit, so that's kind of encouraging.
the other thing that jackie has done recently is to use moses to try and translate between the two dictionaries, the pronunciations from the expert dictionary and the pronunciations from these learned units. so here are six words; the yellow one on top is the expert pronunciation and the blue one is the learned-unit pronunciation translated into phone-like units, so they're a little more interpretable by us. i think it's on the right track, i think there's something there, and we need to do a lot more investigation, but i'm encouraged so far. so that's sort of where we are.
so let me wind up. i think it would be beneficial, and i'm really encouraged, to see more people doing unsupervised things. it's challenging, but i think we learn a lot by doing this, and i truly believe that in the end it will help us develop speech recognition capability for more of the world's languages. i'm also optimistic that these methods can potentially complement existing approaches; one of the things we're doing right now in the babel framework is looking at these acoustic matching mechanisms as a means to rescore keyword hypotheses from conventional recognizers, and maybe that can be helpful.
so i've shown you some progress we've been making in speech pattern discovery and topic clustering, and also in learning units, and i mentioned that we're looking at other languages and augmenting the framework right now to try and learn words themselves from the audio. and i guess, if i can be speculative at the very end: somebody this morning said the elephant in the room is text data. i actually think the elephant in the room is us, in fact all of humanity that ever has been and ever will be. you know, we all learned language as toddlers; no one gives us a dictionary, no one gives us text. we have other things, but we figure it out, and i think we'd be better off in the long run as a community trying to think about how to get some of those capabilities into our systems. anyway, thank you very much, and i'm done.
here are some references for anybody who's interested, and i'd be happy to talk to anybody. sorry i went over.
so that was actually just about exactly on time. i want to remind people really quickly that after this break, instead of having a full-hour panel session, we're going to have a shortened panel session preceded by a talk by the cognitive scientist i mentioned this morning who is working on human language. we have time for a few questions.
so i think this is a pretty nice approach based on the generative model framework, and the others discussed today seem more like discriminative approaches based on neural networks, and i think both are pretty important. so do you have some kind of idea about how to integrate these two kinds of approaches, how to say, fuse them into some nice framework?
well, i don't know if it's nice, but first of all i totally agree with you. i sort of view this generative stuff we're trying to do as a way to get started, and then, as herb said, i think if you can figure out the units and some pronunciations and get going, then there's potential to bring in discriminative things to sharpen the boundaries that you're learning. so that's sort of the approach we're taking: use this to generate an initial speech recognizer and then, like the landmarks, go away from it once we have an initial model. it's maybe not the best idea, but it's the best one i have at the moment.
more questions
at the beginning of the talk you said that you go for places of interest by looking at similarities, so by detecting places where the speech signal is similar to some other place, you say this is a place of interest. well, this is a bit contradictory to the information-theoretic view: when you have something that is unique, said just once, it can carry lots of information, and that could be a place of interest. could you comment on this please?
it's very true. i mean, these patterns are complicated, because they're sequences of sounds, so they're probably a little more reliable to detect, but it could be that you have a very important word that only occurs once, and this method would not be able to find it. so it's more about how reliably we think we can actually find things, and the longer a pattern is the more reliable it is, and it turns out, and that's why i mentioned the comparisons with tf-idf, that we do seem to be able to find important content words in lectures using these methods. now, a hidden thing i only mentioned a little bit is that a lot of the common words are short, and that's why having a threshold on duration is important, like half a second; you guys might have even used a second in the first paper, right? okay. a half second eliminates a lot of the very commonly occurring things. one other point to add to that is that the non-parametric bayesian stuff, the chinese restaurant process, is actually really good at generating a new category for something that occurs just once or a very small number of times, and so it's sort of a really nice framework, i think, for language generally.
are there more questions?
so one of the long-standing problems in this field is the choice of the observations themselves. have you thought about designing something that would allow us to infer better observations based on these kinds of non-parametric distributions?
i missed the part about the observations; what is it you're asking?
i don't believe that anybody in this room listens to cepstra, and that's just not the right thing. the question is what the right thing is, so could we acquire more data and analyze it so that we could actually figure out what the observations should be?
you're talking about the input representation. these were all just based on mfccs. it's a very good question; i actually think we would benefit from a better representation that naturally had more of the phone-like contrasts in the languages around the world. i think mfccs are a rather blurry description of the signal, but i don't have any brilliant answer.
i just have a question about data. i don't know what your impression is: with respect to discovering things, is more data always better? what is the state you would ideally want to have? have you thought about how much data collection is needed to really understand this, how much data you really want?
i almost don't know the answer to that question; it seems like it depends on what you're asking.
so there's a lot of data for english but not, of course, for all languages, and resources like babel will be a tremendous legacy that people can evaluate on. i don't know how much we need overall; i mean, the stuff i was talking about is at a very early stage.
okay, well, on that note let's thank jim.