[Recording begins mid-sentence.] ...Thank you all for coming and for staying to the end of the day; this is the third session after lunchtime, and I think we will run until about five or six o'clock. We even took pictures of the poster sessions, and of the sessions that had only a couple of people left at the end, so people in speech and language really are dedicated enough to stay. Thank you very much for coming.

Before we start, just a couple of words about the Speech and Language Technical Committee. As most of you know, the committee has fifty-three members, and because we have a large number of papers per member the review load is substantial. The spoken language side in particular has a separate group that focuses on language processing, and it has been growing in a rather significant way: roughly a third of the papers accepted at ICASSP are in the speech and language processing field. This year that meant on the order of seven hundred submissions and over three hundred accepted papers. If we spent even a few seconds on each one, that alone would take the whole thirty minutes, so we decided not to do that. One of our colleagues who had a big hand in preparing this overview was unfortunately not able to attend, so a lot of the input here is thanks to him. We encourage questions as we go, and I have tried to make sure we have a microphone available. So I think we are good to go.

[Next presenter] Very quickly: if anything I say is wrong, or if I miss anything, please point it out. There were over three hundred papers, and it is really hard to go through all of them and summarize each section, so thank you to the people who helped put this part of the talk together.

There are three hundred and twenty-five papers in total; most of them are on speech and about seventy-five are on language processing, at least according to how the conference assigned them to sections, which is arguable because some papers could count as both. I will cover the language processing part and part of the speech processing, but not TTS or core speech recognition. I will also cover speaker ID, including speaker verification, speaker recognition and speaker diarization, and I will touch on speech enhancement a little bit.

First, language modeling. The chart at the top right shows the number of papers in each subfield of language processing. A few things worth mentioning: we saw exponential models in the Model M style, class-based and neural-network language models, long-span models, dynamic language model adaptation, discriminative models, and many others; these are just a couple that stood out as I went through the papers.
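[Editorial aside: to make the class-based, interpolated flavor of language model mentioned above a bit more concrete, here is a toy Python sketch. It is not any particular paper's model; the corpus, the word-to-class map, the smoothing and the interpolation weight are all made up for illustration.]

```python
from collections import Counter, defaultdict
import math

# Toy corpus and a hand-made word-to-class map (both hypothetical).
corpus = [["<s>", "show", "me", "flights", "to", "boston", "</s>"],
          ["<s>", "show", "me", "fares", "to", "denver", "</s>"]]
word2class = {"<s>": "S", "</s>": "E", "show": "VERB", "me": "PRON",
              "flights": "NOUN", "fares": "NOUN", "to": "PREP",
              "boston": "CITY", "denver": "CITY"}

# Counts for p(class_i | class_{i-1}), p(word | class), and a word unigram.
class_bigram = Counter()
class_history = Counter()
word_given_class = defaultdict(Counter)
word_unigram = Counter()
for sent in corpus:
    for prev, word in zip(sent, sent[1:]):
        c_prev, c = word2class[prev], word2class[word]
        class_bigram[(c_prev, c)] += 1
        class_history[c_prev] += 1
        word_given_class[c][word] += 1
        word_unigram[word] += 1

vocab = set(word_unigram)
num_classes = len(set(word2class.values()))
total_words = sum(word_unigram.values())

def class_bigram_prob(prev_word, word):
    """p(w_i | w_{i-1}) ~= p(c_i | c_{i-1}) * p(w_i | c_i), add-one smoothed."""
    c_prev, c = word2class[prev_word], word2class[word]
    p_cc = (class_bigram[(c_prev, c)] + 1) / (class_history[c_prev] + num_classes)
    p_wc = (word_given_class[c][word] + 1) / (sum(word_given_class[c].values()) + len(vocab))
    return p_cc * p_wc

def interpolated_logprob(prev_word, word, lam=0.7):
    """Linear interpolation with a word unigram, a common baseline trick."""
    p_uni = (word_unigram[word] + 1) / (total_words + len(vocab))
    return math.log(lam * class_bigram_prob(prev_word, word) + (1 - lam) * p_uni)

print(interpolated_logprob("to", "boston"))
```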
Staying with language modeling, a number of papers dealt with computation and optimization: how to train a language model on large-scale data, distributed training, fast recurrent-network language model training, and how to manage long-span dependencies. There is also a common data set that people work on.

Next, spoken document processing. The tasks here are document summarization, classification and speaker role identification, and the approaches are fairly typical machine-learning ones. Then there is spoken language understanding, translation and semantic classification; there were roughly two sessions' worth of papers on these topics, including voice search, where different papers use quite different techniques for turning a spoken query into something you can search on. Quite a few papers used DBNs for language understanding, for example for call routing. On speech translation, papers looked at how you can tie speech recognition and translation together, and at whether word accuracy really is a good metric for speech translation, which it probably is not; there was also bilingual audio subtitle extraction, and many others I have not listed.

Paralinguistic and linguistic features were a very interesting area, because these are things you would not normally think of: emotion detection from speech and language, recognizing lexical stress, and cognitive load classification, where you try to guess how much someone is thinking while they talk. There was also work on perceptual differences for language learning and speech assessment.

Spoken term detection is another topic: given a huge audio or video file, you try to produce a list of the spoken utterances that match a query term, which you may just speak as a voice query. The approaches include dynamic time warping, subword recognition and graph-based methods, and again there is a common data set.

On dialogue, there were only about five papers at this conference, but the trend is clear: if you look back a couple of years there were many different approaches, whereas now most papers focus on the statistical approach, which has two parts. First, you track a distribution over all the possible dialogue states; second, you optimize how the system chooses its actions, so the whole thing becomes a POMDP-style problem. There were several papers on this at the conference.

Language identification: there were about six papers in one session. They try to use phonetic and prosodic features, and combinations of them, to identify the language. On the modeling side it is mostly classification; I saw papers using logistic regression on n-grams, for example, and several of them work on the same data set.
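[Editorial aside: here is a toy sketch of the phonotactic flavor of language identification described above, assuming you already have phone strings from some phone recognizer. The phone strings, labels and model settings below are invented, and a real system would train on far more data.]

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical phone-recognizer outputs (space-separated phone labels) and language labels.
phone_strings = [
    "dh ax k ae t s ae t",        # English-like
    "dh ih s ih z ax t eh s t",   # English-like
    "l e g a t o e s t a",        # Spanish-like
    "k o m o e s t a s o i",      # Spanish-like
]
labels = ["en", "en", "es", "es"]

# Phone 1- to 3-gram counts feed a simple logistic-regression classifier.
# token_pattern keeps single-character phone symbols as tokens.
model = make_pipeline(
    CountVectorizer(ngram_range=(1, 3), token_pattern=r"\S+"),
    LogisticRegression(max_iter=1000),
)
model.fit(phone_strings, labels)

print(model.predict(["e s t a l e g a"]))   # should come out Spanish on this toy data
```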
Lexicon modeling is about using machine learning to automatically generate the pronunciation for a given word; there were a couple of papers along this line, each introducing its own approach. Related to that is multilingual and multichannel processing, where the task is that your input mixes languages or channels and you still have to do recognition, indexing and search on it; the approaches are very diverse, so I have not listed them here.

Speech analysis is an area I honestly do not know much about, so I will just cover the topics. There was emotion detection, for example looking at the relationship between emotion and F0 range and detecting anger and so forth; duration modeling; and F0 and pitch frequency estimation. On the approach side I saw a couple of things that are probably not new, such as phase-locked-loop-style pitch trackers, and there is a common data set here too. Since I do not know this field well, if the session chairs are here they would be a better place to ask what I have missed.

Speech enhancement: the task is essentially to separate speech from non-speech and noise. One thing I noticed, compared to previous conferences, is more attention to music as the noise. There are many approaches here, from hidden-Markov-model-based methods to Wiener filtering, and a long list of others I do not have time to go through.

Speaker verification: overall there were forty-eight papers on this topic, including speaker diarization, which I believe is more than at the previous couple of conferences; one reason, I suspect, is the relevance of the NIST speaker recognition evaluation. Going through the papers, it is very hard to summarize, but a couple of highlights: i-vector-style approaches, probabilistic LDA, the evaluation papers from NIST, and fusion, where you take several speaker recognition systems and fuse their results. Speaker diarization, the second part here, is figuring out who spoke when in an audio stream or a meeting; the approaches split into top-down and bottom-up clustering, there is the question of which features to use, and information-bottleneck-based approaches were a fairly new thing in this field.

Finally, robustness for ASR. There are lots of papers here and I will surely miss something, so I have put them into several categories. The first is signal processing on the front end: compressed sensing, which you can apply ahead of ASR; non-negative matrix factorization, used to transform the spectrum (a toy sketch of that factorization idea follows below); and a long list of other approaches. The second is features: lots of them, including posterior-based tandem features, a couple of papers on how to use neural networks to generate those tandem features, and noise-robust feature normalization.
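[Editorial aside: a minimal sketch of the non-negative matrix factorization idea mentioned above, factorizing a magnitude spectrogram into spectral templates and per-frame activations with the classic multiplicative updates. The random matrix simply stands in for a real spectrogram, and the rank and iteration count are arbitrary; none of this is taken from a specific paper.]

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a magnitude spectrogram: freq_bins x frames, non-negative.
V = np.abs(rng.normal(size=(257, 200)))

rank = 20        # number of spectral templates (a free choice)
eps = 1e-10      # avoids division by zero

W = np.abs(rng.normal(size=(V.shape[0], rank)))   # spectral templates
H = np.abs(rng.normal(size=(rank, V.shape[1])))   # per-frame activations

# Classic multiplicative updates minimizing the Frobenius error ||V - WH||^2.
for _ in range(100):
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)

print("reconstruction error:", np.linalg.norm(V - W @ H))
```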
The third category is the models themselves; it is quite a diverse collection, so I do not think we can go through it here. Maybe after this we can put the slides somewhere for anyone who wants to take a look.

[Next presenter] Can everyone hear me? I will try to cover the papers that generally fall under large-vocabulary speech recognition, acoustic modeling and adaptation techniques. As you can see, the ASR line is large, so we tried to split it in a manner that matched well with the sessions.

Let's start with adaptation. The problem here is basically: how well can you adapt your existing models to a specific speaker or environment? The most recent trend we have been seeing is how you can enforce sparsity or structure on the transforms you learn, and how you can do better optimization. In general, the ideas that have been floating around in this field include discriminative transforms and finding something that will adapt rapidly on minimal amounts of data. You now see these techniques applied to more real-world tasks, and you are starting to see some impact from them on new kinds of data. We also saw work on rapid adaptation, and on how you can use convex optimization methods in situations where your objective function is not convex anymore. If you want to read more about these, I have listed the relevant sessions here. (A toy illustration of speaker adaptation in general, though not of these transform-based methods specifically, appears at the end of this segment.)

Now on to modeling. Acoustic modeling was split across many, many sessions, but they are all basically about statistical modeling of speech signals. The more recent trends have been along the lines of: how can I use machine learning techniques in large-vocabulary speech recognition? We all know they work on certain classes of problems, like vision and handwriting recognition, which are really difficult but have relatively small data sets, so we are now looking at how we can apply these techniques to speech problems. With that comes the task of speeding up these learning algorithms to deal with large quantities of data, and this year we saw more applications to real-world tasks, including large evaluation tasks.

Some of the key ideas, which many of you will be familiar with: we saw papers on capturing longer-range dependencies, either within an HMM framework or in other forms of decoding; on how you can use the posteriors from classifiers intelligently, maybe by feeding them into deep belief nets or maybe by using them directly in the HMM framework; and on whether you can pick acoustic units intelligently, whether for English or any other language, now that you have enough data to pick acoustic units in ways you could not before. We have also seen papers that use language, accent and dialect identification and incorporate them to improve speech recognition accuracy, so that is another bunch of areas people are working on. Finally, we saw some recent, interesting work on loss functions and boosting methodologies that improve the quality of the classifiers, the learners, in the acoustic model. This area is covered in the sessions titled "Modeling for ASR."
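[Editorial aside: the adaptation work described above is transform-based; as a simpler illustration of the general idea of pulling speaker-independent model parameters toward a small amount of target-speaker data, here is a toy relevance-MAP adaptation of GMM means. The model sizes, data and relevance factor are all made up.]

```python
import numpy as np

def map_adapt_means(means, covars, weights, X, relevance=16.0):
    """Relevance-MAP adaptation of diagonal-covariance GMM means.

    means:   (K, D) prior (speaker-independent) means
    covars:  (K, D) diagonal covariances
    weights: (K,)   mixture weights
    X:       (N, D) adaptation frames from the target speaker
    """
    K, D = means.shape
    # Posterior responsibility of each component for each frame (log domain for stability).
    log_prob = np.empty((X.shape[0], K))
    for k in range(K):
        diff = X - means[k]
        log_prob[:, k] = (np.log(weights[k])
                          - 0.5 * np.sum(np.log(2 * np.pi * covars[k]))
                          - 0.5 * np.sum(diff * diff / covars[k], axis=1))
    log_prob -= log_prob.max(axis=1, keepdims=True)
    gamma = np.exp(log_prob)
    gamma /= gamma.sum(axis=1, keepdims=True)

    n_k = gamma.sum(axis=0)                       # soft counts per component
    first_order = gamma.T @ X                     # soft data sums per component
    ml_means = first_order / np.maximum(n_k, 1e-10)[:, None]
    alpha = n_k / (n_k + relevance)               # how far to move toward the data
    return alpha[:, None] * ml_means + (1 - alpha)[:, None] * means

# Toy usage with made-up numbers.
rng = np.random.default_rng(1)
means = rng.normal(size=(4, 2))
covars = np.ones((4, 2))
weights = np.full(4, 0.25)
X = rng.normal(loc=0.5, size=(50, 2))
print(map_adapt_means(means, covars, weights, X))
```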
Moving on, a couple more sessions covered acoustic modeling topics and statistical methods that fall under the category of general ASR problems. There were some further ideas here, including complex models with long-span language modeling and acoustic modeling techniques, and we saw some applications of multiple-stream combination. One thread is using deep networks as some sort of front end and thinking about how to model the resulting posteriors; a few of these papers derived from the Johns Hopkins workshop, which is held every summer and focused on how you can use some of these posteriors in some sort of segmental framework. More recently, if you look at where the field is heading, we see a lot of sparse representations and exemplar-based methods, capturing higher-order statistics using deep belief networks, point process models, and spectro-temporal patterns; so we are seeing a wide range of novelty in this field.

Continuing on modeling, there were also discriminative techniques for ASR. The focus here was mostly on how you can use discriminative training both for the acoustic model and for adaptation. We saw some papers on training full-covariance models. If you break it down further, we also saw feature selection, for instance enforcing sparsity on the model parameters, which was interesting, and people presented different kinds of training criteria: do you use an objective function that models word error rate, or one that models something else related to the likelihood, or some other quantity?

The last session I will cover on ASR was titled large-vocabulary speech recognition. The focus there was mostly on building large systems, for example systems for the GALE evaluation in different languages, and also a few systems built on real-world tasks. The key ideas: how can you exploit large quantities of unlabeled data, which is the class of unsupervised training; better methods for lattice-based training; and the best-performing techniques and algorithms for building acoustic and language models, typically in tasks like Mandarin and Arabic, which were part of the GALE evaluation. System combination strategies also played an important role. We saw methods for selecting recognition units, particularly in languages like Mandarin and Polish, and methods to improve the quality of transcripts when you do not have manually transcribed data, so that you can improve the performance of your acoustic models during training by getting better transcripts. There was also a lot of work on decoding schemes, to better optimize memory consumption and to make things go faster. And we saw a large presence of deep belief networks all over the place, which is still somewhat new.

We also saw lots of papers on acoustic modeling itself. You can break these down into a couple of areas: one is extended features for HMMs, in addition to the traditional MFCCs and PLPs, and the other is the modeling paradigms themselves, where we saw a lot, all the way from phone recognition to LVCSR. A few things to point out: we saw energy-based features, articulatory trajectories, work on how you can include nonstationary features, and features tailored to particular languages.
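[Editorial aside: as a reference point for the "traditional MFCC and PLP" front end that these feature papers extend, here is a minimal MFCC-plus-deltas sketch, assuming the librosa package is available (its bundled example clip is fetched over the network). The 25 ms / 10 ms frame sizes are the conventional choices, not values from any specific paper.]

```python
import numpy as np
import librosa

# Any mono signal works; librosa.ex() just fetches a bundled example clip.
y, sr = librosa.load(librosa.ex("trumpet"), sr=16000)

# 13 static MFCCs with a 25 ms window and 10 ms hop, the classic starting point.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=400, hop_length=160)

# First- and second-order deltas capture local dynamics.
delta = librosa.feature.delta(mfcc)
delta2 = librosa.feature.delta(mfcc, order=2)

features = np.vstack([mfcc, delta, delta2])   # 39 x num_frames
print(features.shape)
```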
We also saw efficient parameter estimation methods that capture phonetic variability. I am not capturing everything in every session; these are just meant to get you motivated to look at the general trends and ideas, and to bring in ideas from other fields that can perhaps help acoustic modeling. We did see a lot of linear models for covariance modeling, and particularly this time we saw some work on overlapped speech detection and non-audible murmur detection, which is useful in situations such as monitoring in the public domain. The relevant sessions here are Acoustic Modeling I and II.

On speech synthesis, this is just a very brief summary. We saw a focus on basically two categories of synthesis: HMM-based synthesis, and concatenative, unit-selection-based TTS. A bunch of the work on HMM-based synthesis focused mainly on the underlying parameterization and reconstruction, and that included work on duration modeling, on how you can incorporate this technology in embedded systems, and on the impact of machine translation, meaning the number of errors the translation system makes and the fluency of its output, on speech synthesis. Parameter tying and parameter estimation for HMMs were also there. In that set of sessions we saw work on prosody prediction, on how you can do better prosody prediction and better annotation of pitch accents, and we saw new constraints being introduced for unit selection in concatenative TTS systems. The relevant sessions are listed here, but there were also a few posters in the machine learning session and in speech and audio applications that cover synthesis.

So that is basically the broad overview I have for ASR and synthesis. Since we have now seen over three hundred papers in thirty minutes, it may feel like a fire hose aimed at you, but let's see if we can get a few questions from folks.

[Moderator] Before that: in the Speech and Language Technical Committee we put out a regular newsletter covering goings-on across the speech and language area. If you work in this area, please get on our newsletter mailing list so you receive a regular copy, and we will include links there so you can download these slides if you would like a copy. So, any questions?

[Audience question] Three or four years ago the speech technical committee reorganized itself to put more emphasis on spoken language processing, as opposed to just text processing, to try to attract papers that were generally going to other venues. Has that worked? Are more of those papers coming here?

[Moderator] I think in each of the last two or three years we have had roughly between eighty and a little over a hundred submissions in spoken language, with an acceptance rate of roughly forty-two to forty-six percent.
Some of the work presented in spoken language here also goes to other venues, but it brings in more folks from that community, so to speak. I will also add that at the speech technical committee meeting we had on Wednesday, we heard a short report from the Transactions: the number of papers submitted in spoken language has increased significantly, there has been a huge increase in submissions overall, and the page count is growing accordingly; those of you who have volume ninety-nine sitting on a shelf will know what I mean. So the volume is growing, and a lot of people are coming in.

[Audience question, largely inaudible: about new trends in the kinds of data and channels people are working with.]

[Moderator] I think when people look at it, there is a lot more work now on real data: beyond broadcast news, people are working on real voice search, on videos and audio bits spread across the web. There is a lot more interplay between music and speech. We used to see people looking at speaker ID and language ID; now it is in multiple languages, with people singing and so forth, which we did not see in the past. We are also seeing voice morphing and transformation. Not at this conference, but spoken of fairly loosely, there was work where a music video by a pop artist, sung in English, was morphed into Spanish, and it sounded flawless: the performer was really singing in English and does not know Spanish, but you could not tell; it was really good. So I think you are seeing a lot of movement where some of the tools that exist for speech recognition, speaker ID, diarization and so forth are being drawn toward challenges in music, because a lot of the more realistic data that folks are getting access to has music in it; it has become a big channel. Actually, on the speech analysis side, pitch tracking where there is music present is a really tough thing to work out, and there are folks who have been working on that.

[Panel] A quick comment: I do see a lot of people starting to move toward music. One of the big challenges is the competing speaker problem: you have one person talking over another person, or words and music overlapping someone else, and you are trying to suppress that to make recognition work.

[Moderator] A question over there.

[Audience comment] I am just adding to the previous comments on signal processing, speech and recognition. Personally I view myself as a hard-core signal processing person, so regardless of whether it is speech or music, it is really the same kind of research for us, and I have quite a few colleagues, including professors in academia, working on both; they treat music, including instrumental music, as just another interesting application area.
On synthesis specifically, we do work on TTS, and in my opinion the field is not only bridging the gap between traditional concatenative, unit-selection-based synthesis and HMM-based synthesis, which some people call hybrid synthesis; it is really about treating the whole statistical and sample-based rendering as one holistic approach to the synthesis or rendering process. We have used TTS knowledge to do singing: given recorded speech material, can we make it sing as well? Yes, we have done that, and I saw quite a few researchers really working in that direction. And polyphonic pitch tracking, which is really hard, is definitely posing a challenge of interest to signal processing people, speech researchers and music researchers alike, on the analysis side as well as recognition. Again, just as it used to be only recognition and the next step was understanding, we also just had a speech synthesis session, and maybe as an advertisement from a speech synthesis researcher: to close the whole speech chain for understanding, we also need good, expressive speech synthesis. To give a quick summary, personally I see the boundary between music and speech, and between statistical modeling and sample-based approaches, really just merging into a kind of seamless model. Thanks.

[Moderator] When you think about recognition, maybe a couple of years ago the general perception was that, since you can buy commercial speech recognition products in the field, the problem is solved. In fact there are huge challenges; I think moving to much more realistic data in the field makes recognition much more challenging to do. And Frank's comments on the synthesis part are right on target: when you look at the general population and the use of dialogue systems, studies have shown that the perception of how good a dialogue system is, is to a large extent related to the quality of the synthesized voice you are interacting with; it can even hide recognition errors, which are otherwise hard to recover from, and a lot of groups have recently been looking at more robust processing approaches. Other questions or comments? We have gone over thirty minutes, but there were three hundred papers here, so we can take a few more.

Let me also make a pitch for the upcoming ASRU workshop: it is in a sunny spot, the weather is great there, and it is a great opportunity to follow up on some of the topics you have seen here. Any other comments or questions? Then let me make one last pitch. If you are interested in being involved in the Speech and Language Technical Committee, please contact one of the committee members; there are over fifty of us. If you search the web for our newsletter you will find it, and it covers a number of topics.
If you are advertising for jobs or trying to recruit folks, there is an online jobs posting section there as well. I do not know if you saw it, but in a recent newsletter I put in a little piece on what represents a grand challenge for the speech and language field. There has been a lot of talk about grand challenges in terms of energy and health care, and I would argue speech and language has its own. One of the most important aspects, when you look at society and interacting with people, is speech-to-speech translation: some of the big advancements in this area will allow people to communicate more efficiently and reduce barriers between people. So speech and language is very important and should represent one of the grand challenges. Well, if there are no more comments, we will close the session. Thank you all.