0:00:15 So thank you for the nice introduction. My name is Petr Schwarz. At the beginning I was a researcher at the Brno University of Technology. I worked in many different fields, starting with phoneme recognition, and then we gradually moved to keyword spotting, speaker identification and ASR. We also released a phoneme recognizer that many of you are probably still using.
0:00:48 But then, around 2005, something like a strange thing happened to us: we were approached by a company, and the company said, we will give you some money, but we would like to have a different license for the recognizer that is publicly available. So we said, okay, this is fine, let's do it, it will help us to finance the research. But it turned out to be quite a long experience: it took about nine months just to negotiate the license with the university. And we realised two things: that there is interest from the commercial market and it can bring us additional money, and that we need to do it in a better way. So we started a company called Phonexia.
0:01:49 I would like to talk about two main topics today: how to get speech technologies from research to the market, and then I would like to show which problems we see from the user point of view. At the beginning I will say a few words about the company, then about the use cases, the technologies that are behind our products, how we deliver the technology to the users, and then I will indicate some grand challenges.
0:02:34 People usually don't know what is in speech. But if you look at this slide, you can see that there is a lot of information about the speaker: it can be age, it can be gender, it can be the speaker identity, it can be for example the emotional state, how the speaker speaks, and so on. Then there is the content: you can detect the language, you can detect dialects, you can do keyword spotting, you can do full speech transcription, maybe the topic is interesting, you can do some data mining on top of it. But there are other things too: you can get some information about the environment the speaker is in, to whom the speaker speaks, and there can be other sounds in the background, for example whether it is a bus or a car, or animals close by. And you have a lot of information about the equipment that was used to capture or transmit the voice: it can be the device, it can be the codec, and you can also estimate the speech quality. This is important, and users can benefit from all of this information.
0:04:12 A few words about Phonexia. It was established in 2006 as a startup from Brno University of Technology. It has its seat in the Czech Republic, in Brno, just a five-minute walk from the university campus. If we speak about the users, we currently have users in more than twenty countries: government agencies, call centres, banks, telcos, service providers and others. The company has grown roughly steadily, and so far with only a small amount of external funding.
0:04:53 If we speak about the process of how to transfer technology from research to the market, there are several steps. Research plays an important role; it is usually done at universities or inside companies. The goal there is to get the best technology, but things like code quality, stability or speed of the code are not of main importance. What is important at this stage, for us and probably also for you, is to be as open as possible: comparative measurements, shared data, open-source toolkits and so on.
0:05:59 But then you need to get the technology to the users somehow, and for this you need to do the next step: you need to build a code base that is stable, that is fast, that has a unified API, documentation and proper licensing. This is exactly what we do. Then there is another step: you need to build products for the customers. You can have nice technology, you can have nice interfaces, but if you don't have a proper product you won't be able to sell it. Here the focus is on functionality, and this is done either by Phonexia or by other companies we work with.
0:06:58 Now I will mention three main use cases, or three main groups of customers; there are others, but I selected these three. The first are call centres. There are two areas where we are active: the first is quality control, how to ensure the quality of the agents in the call centre, and the second is data mining from voice data.
0:07:38 What is the quality control about? In a call centre you usually have a team leader or supervisor who supervises what the agents do: listening to calls, evaluation of operators, analysis of the results for the team, and some reporting. If there is no speech technology, usually only about three percent of the recordings is inspected, so the supervisor controls only a tiny fraction of the calls. But if you deploy speech technology, you are able to control a hundred percent of the traffic and you are able to get much better statistics. And everything is about the cost: we are able to reduce the number of supervisors and lower the operating costs.
0:08:44 It is also very important to shorten the calls. If you are able to find the problems, for example operators that are not performing well or are not well trained, there is the possibility to target the training at what is needed. And the fluctuation of people in such call centres is usually high, in the tens of percent per year, so if we are able to reduce it, that again saves some costs.
0:09:36 For the quality control use case the main technologies are diarization and statistics computed on top of it, so you are able to get important statistics like, for example, dialogue statistics, the number of speaker turns, or speech reaction times.
0:10:05 Not all call centres have good equipment, so if the two sides of the conversation are not recorded in separate audio channels, we need to do diarization.
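As an illustration of the kind of statistics meant here, the toy sketch below derives turn counts, per-speaker speech time and reaction times from diarization output. The (start, end, speaker) segment format and the chosen statistics are assumptions, not the product's actual output format.

```python
# Minimal sketch (not the product code): simple dialogue statistics computed
# from diarization output. Segment format (start_s, end_s, speaker) is assumed.

def dialogue_stats(segments):
    """segments: list of (start_s, end_s, speaker), any order."""
    segments = sorted(segments)
    turns = 0
    speech = {}                      # total speech time per speaker
    reaction_gaps = []               # silence between a turn end and the reply
    prev_spk, prev_end = None, None
    for start, end, spk in segments:
        speech[spk] = speech.get(spk, 0.0) + (end - start)
        if prev_spk is not None and spk != prev_spk:
            turns += 1
            reaction_gaps.append(max(0.0, start - prev_end))
        prev_spk, prev_end = spk, end
    avg_reaction = sum(reaction_gaps) / len(reaction_gaps) if reaction_gaps else 0.0
    return {"speaker_turns": turns,
            "speech_per_speaker_s": speech,
            "avg_reaction_time_s": avg_reaction}

# Example: a short agent/customer exchange
print(dialogue_stats([(0.0, 4.2, "agent"), (5.0, 9.5, "customer"), (9.8, 12.0, "agent")]))
```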
0:10:22 Then it is possible to deploy keyword spotting or text-based search: you have some obligatory phrases, there are words you don't want the people to say, and you would like to have some script compliance that the agents should follow in all calls. It is also possible to add full speech transcription, which is mainly used for search in this task.
0:10:55 Now about the data mining; this is the other large topic for call centres. Here again we have two subtasks. One is the detection of call centre errors or call centre overloading. Imagine that you have a call centre of a few hundred people, there is a large outage, and thousands of people start calling to complain about the service. You need to react really quickly, you need to find out what is wrong, and maybe give some information to the media at an early stage. Typically this is solved by some big screen in the call centre showing the topics that are currently being discussed and their percentages.
0:11:52 The other important subtask is adding value to business data with speech technologies, basically data mining. I don't know if you know how, for example, locations for new fast-food restaurants are selected. One approach is that the company goes to a telephone operator and asks: please, could you give us statistics of where the people that visit our fast foods are during the day? The place with the highest concentration of such people is considered a good place to open a new fast food. With speech technologies it is the same: if you have more information, for example whether the person on the phone line is male or female, whether there are more people on the line, or whether the person was interested in certain products in the past, it helps you to target the whole business and push it further. For this we usually use speech transcription and then some data mining on top of it, and of course it is possible to add further technologies such as gender identification.
0:13:28 The other big group are banks. Banks of course have call centres, so what I mentioned on the past two slides is important here as well, but there are two other large tasks. One is voice biometry: the banks need to ensure security, but on the other side they need something that is pleasant for the user and that doesn't bring too many complications. Here voice biometry is very interesting; it can be voice biometry using a passphrase, or it can be voice biometry that is done on a dialogue, using a text-independent speaker identification system. And then the other task is fraud protection. Imagine there are people calling the bank, for example a hundred times a day, each time with a faked identity, and requesting loans. It is hard to detect this if you don't have the technology, but if you have technology like speaker identification, it is really simple.
0:14:56 Now about the intelligence agencies. The situation there is usually that these agencies have a really huge amount of data, an amount so huge that they are not able to process it manually. It can come from lawful interception of telecommunication networks, from communication on the internet, and so on. They are really looking for a needle in a haystack, and for this it is possible to use a combination of technologies. So we are using a combination of technologies: language identification, gender identification, speaker diarization, keyword spotting, speech transcription, data mining tools, and also correlation with other metadata, for example from text sources. And of course the agencies are very interested in forensic speaker identification.
0:16:04 Now I will go deeper into the technologies and tell you what is important for practical deployments. Here are some of the technologies; I won't speak about all of them, but you can come and ask if you have a question.
0:16:26 About voice activity detection, I would say that this is the most important part for practical deployment. You may be surprised that this is the most important part: you can have very nice results, for example on evaluation databases, but when you look at real target data, the users are working with channels where a huge quantity of the traffic, it can be tens of percent, is not speech at all. It is some technical signal, like dialling tones, faxes and so on. If you don't have good voice activity detection built in, it is really hard to work with such channels.
0:17:22 So we are using a cascade. There is an energy-based VAD at the beginning to remove a very large portion of the silence; then technical signal removal, like tone detection and removal, fax detection, signals that appear in mobile networks, and so on; then a VAD based on F0 tracking, because speech has specific characteristics in the presence and behaviour of F0; and finally a neural-network based VAD to get a very precise segmentation.
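For illustration only, here is a toy version of the very first stage of such a cascade, a frame-level energy threshold. The frame size and threshold are arbitrary assumptions, and the real cascade adds the tone/fax removal, F0-based and neural-network stages described above.

```python
import numpy as np

# Toy energy-based first stage of a VAD cascade (illustration only).
def energy_vad(signal, sample_rate, frame_ms=25, hop_ms=10, threshold_db=-35.0):
    frame = int(sample_rate * frame_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    decisions = []
    for start in range(0, len(signal) - frame + 1, hop):
        chunk = signal[start:start + frame]
        energy_db = 10 * np.log10(np.mean(chunk ** 2) + 1e-12)
        decisions.append(energy_db > threshold_db)   # True = speech candidate
    return np.array(decisions)

# Example on synthetic audio: 1 s of faint noise followed by 1 s of a louder tone
sr = 8000
audio = np.concatenate([0.001 * np.random.randn(sr),
                        0.1 * np.sin(2 * np.pi * 150 * np.arange(sr) / sr)])
print(energy_vad(audio, sr).mean())   # fraction of frames kept as speech candidates
```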
0:18:12 As I said, it is a very important technology and there are still many challenges. The accuracy of the VAD directly affects the accuracy of the other technologies. There are sounds that are still challenging: you can have music, there are other speaker sounds, like people laughing or coughing, you have the alignment of silence, you have different technical signals. A big challenge is VAD under variable SNRs and on distorted speech. What is also very important, I think, is an automatic way of training the VAD, because we know that we can tune it to a specific channel by training a good classifier on data from that channel, but how to generalise this is still difficult. And of course there are distant microphones.
0:19:22 Now language identification. Currently we are able to recognize about fifty languages, and what is even more important is that the users can add a new language themselves. This is important especially for the intelligence community, because they will never tell you which languages are of interest to them, and they won't send you the data, so you won't be able to collect it, while they have much easier access to such data. We are using i-vector based technology, it comes with training tools, and we have a small footprint, which means that the language print is less than, say, a kilobyte per audio file.
0:20:26 This is the technology behind it; there are several stages. We have feature extraction, then collection of statistics using a UBM, usually a GMM, and the statistics are projected onto a subspace. The subspace is estimated on a large quantity of data to model the variability in the speech, so at the end we get an estimate of where we are in the subspace, the i-vector. This part is prepared by Phonexia. Then there is the other part, the classifier of languages; we use multi-class logistic regression here, and this is the part that can be trained by the user.
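A minimal sketch of that user-trainable back end, multi-class logistic regression over fixed-length vectors, is below (using scikit-learn). The i-vector dimensionality and the random vectors are placeholders; in reality they would come from the pretrained UBM and subspace extractor.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Sketch of the user-trainable back end: multi-class logistic regression over
# i-vectors. The vectors here are random placeholders for real i-vectors.
rng = np.random.default_rng(0)
dim, n_per_lang = 400, 50                     # assumed i-vector dimensionality
langs = ["english", "czech", "mandarin"]

X = np.vstack([rng.normal(loc=i, size=(n_per_lang, dim)) for i, _ in enumerate(langs)])
y = np.repeat(langs, n_per_lang)

# lbfgs solver -> multinomial (softmax) regression for the multi-class case
clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)

test_ivector = rng.normal(loc=1, size=(1, dim))   # looks like the "czech" cluster
print(clf.predict(test_ivector), clf.predict_proba(test_ivector).round(2))
```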
0:21:31 About speaker recognition: there are many tasks, like speaker verification, speaker search in a set of speakers, speaker spotting, link analysis, sometimes social network analysis. We can work in text-independent or text-dependent mode. We use an i-vector based approach, we use diarization, and what I think is important here is that we have user-side system training and calibration, which again helps people a lot.
0:22:07 The pipeline is here; the first part is the same as in the case of language identification. Then we remove everything that is not speaker variability, and we have some normalisation of the voiceprint, simply by mean subtraction, that can be done on the user side. Then we have the scoring, where we compare voiceprints. This part is pretrained by Phonexia, but we allow our users to retrain or adapt this classifier. This is very important, because it is hard to get any recordings from clients, but if you deliver such a system to clients and they are able to adapt the system themselves, the amount of data can be really small: it can be, for example, fifty speakers with just a few recordings each. We saw that on normal telephone channels we are able to get about a forty percent improvement for new deployments, and on some special channels we saw a hundred percent improvement, just with this simple trick.
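A toy illustration of that normalisation step: a domain mean is estimated on a small in-domain set and subtracted before comparison. Plain cosine scoring stands in here for the actual pretrained comparator, and all vectors are synthetic.

```python
import numpy as np

# Toy illustration of user-side adaptation by mean subtraction: voiceprints are
# shifted by a mean estimated on a small in-domain set, then compared.
# Cosine scoring stands in for the pretrained comparator used in practice.
def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

rng = np.random.default_rng(1)
dim = 200
channel = 2.0 * rng.normal(size=dim)            # shift shared by the new channel
domain = rng.normal(size=(150, dim)) + channel  # small local set for adaptation
spk_a, spk_b = rng.normal(size=dim), rng.normal(size=dim)
enroll_a = spk_a + 0.3 * rng.normal(size=dim) + channel
test_a   = spk_a + 0.3 * rng.normal(size=dim) + channel   # same speaker
test_b   = spk_b + 0.3 * rng.normal(size=dim) + channel   # different speaker

mean = domain.mean(axis=0)
for name, shift in [("raw", 0.0), ("mean-subtracted", mean)]:
    tgt = cosine(enroll_a - shift, test_a - shift)
    non = cosine(enroll_a - shift, test_b - shift)
    print(f"{name:15s} target={tgt:.2f} nontarget={non:.2f}")
```

With the shared channel shift removed, the gap between target and non-target scores widens, which is the effect the user-side adaptation is after.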
0:23:47 And of course what is also very important is calibration, in our case duration-dependent calibration. This is something you don't see much in NIST evaluations, because there the recordings are about two and a half minutes long, but if you have huge variability in duration you need to deal with it, especially with short recordings. We solve it by adapting the calibration for different durations.
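Score calibration of this kind is commonly done with logistic regression, and a duration-aware variant can add the (log) duration, and its interaction with the score, as side information. The sketch below is a generic illustration of that idea on synthetic trials, not the calibration actually shipped.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Generic duration-aware calibration sketch: map (score, score*log-dur, log-dur)
# to a calibrated target posterior. Trials are synthetic: short trials get
# noisier raw scores, which is the effect the calibration has to absorb.
rng = np.random.default_rng(2)
n = 4000
duration = rng.uniform(3, 150, size=n)                 # seconds of speech
is_target = rng.integers(0, 2, size=n)
noise = 3.0 / np.sqrt(duration)                        # short trial -> noisy score
raw = np.where(is_target == 1, 1.0, -1.0) + noise * rng.normal(size=n)

X = np.column_stack([raw, raw * np.log(duration), np.log(duration)])
cal = LogisticRegression(max_iter=1000).fit(X, is_target)

def p_target(score, dur_s):
    x = np.array([[score, score * np.log(dur_s), np.log(dur_s)]])
    return cal.predict_proba(x)[0, 1]

# The same raw score is trusted more when it comes from a long recording.
print("P(target | score=0.6,   5 s) =", round(p_target(0.6, 5.0), 2))
print("P(target | score=0.6, 120 s) =", round(p_target(0.6, 120.0), 2))
```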
0:24:25 What are the challenges in language identification and speaker identification? I think the main challenges are very short recordings; it can be even less than three seconds. Also very important for us is keeping the training capability on the user side. Why less than three seconds? If you have a deployment of speaker identification and you would like to deploy it in a bank, people don't want to speak for long; they would like to have the decision even before they finish speaking. So I would say that about ten seconds is the maximum of audio that is acceptable, and you really have about three seconds for the verification. You can do that with text-dependent systems; it is harder with text-independent systems, but in the text-independent case these steps can run in the background while the user is talking to the operator.
0:25:44 Then there is the question of how to ensure accuracy over a large number of acoustic channels and languages. The technologies are more and more channel and language independent, but there is still some dependence. What is also very important are graphical tools that help the user to visualize the information and to do the calibration, because if you don't do this for the users, they will never do it themselves. What we also see as very challenging is language identification and speaker identification over voice-over-IP networks, because there you have packets and the packets get lost, and if you have these losses, the codecs usually do something: they either insert zeros or they synthesize speech, and that is not from the speaker, it is something that was generated by the decoder. So that is also a very important topic. And of course distant microphones.
0:26:54 Now I would like to say a few words about diarization, because this is a very important technology, useful for example for call centres but also for other users. We are using two approaches. One approach is quite simple and not so accurate: it is based on clustering of i-vectors, so we basically split the audio into small chunks and do clustering of the i-vectors. Then there is the fully Bayesian approach to diarization; you might know it from the work of Fabio Valente and Patrick Kenny, and our researchers worked on it too. It is a variational Bayes approach where you don't make hard decisions during the process; you keep everything as probabilities and you make the decision only at the end. This approach is, I would say, better in accuracy, as you will see on my next slide; it is not fully online, but the memory consumption is quite small.
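The first, simpler approach can be sketched in a few lines: chop the recording into chunks, embed each chunk, and cluster the embeddings. Synthetic vectors and off-the-shelf agglomerative clustering stand in for real i-vectors and the production clustering.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Sketch of the simpler diarization approach: cut the recording into short
# chunks, represent each chunk by a fixed-length embedding (an i-vector in the
# real system; synthetic vectors here), and cluster the embeddings into speakers.
rng = np.random.default_rng(3)
chunk_len_s = 1.5
spk_means = [rng.normal(size=100), rng.normal(size=100)]
true_seq = [0, 0, 1, 0, 1, 1, 0, 1]                    # who speaks in each chunk
embeddings = np.stack([spk_means[s] + 0.4 * rng.normal(size=100) for s in true_seq])

labels = AgglomerativeClustering(n_clusters=2).fit_predict(embeddings)

for i, lab in enumerate(labels):
    start = i * chunk_len_s
    print(f"{start:4.1f}-{start + chunk_len_s:4.1f} s  speaker{lab}")
```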
0:28:26 So what are the challenges in diarization? From my point of view, diarization is still a technology that needs quite a lot of research. We saw that it is very sensitive to initialization and very sensitive to non-speech sounds. New sounds that you haven't seen in your training data need to be sorted out. For example, we asked the system to give us two speakers, but the output was that we got both speakers under one label, and under the second label there were segments with other speaker sounds, I think laughing in this case.
0:29:31 So what is important is the quality of your VAD: if you have sounds other than speech at the input, it can harm the adaptation. The system is also very sensitive to distorted speech and to the language. What we see is that with these systems you can easily reach a diarization error rate close to one percent on NIST data. But what we also saw, and at first it surprised us, is that the remaining one percent is not caused by occasional segmentation errors; rather, the system fails on whole recordings, usually two speakers with very similar voices. But this happens.
0:30:38 Okay, so I think there is a challenge here: a lot was done during the past two years, but the remaining challenges are quite big. And of course we can also speak about distant microphones, for example for the processing of meetings.
0:31:02 Now about keyword spotting. We are using two approaches. One approach is something probably all of you know, maybe from projects: LVCSR-based keyword spotting. This is very good, but it is slow and expensive to develop. The other keyword spotting that we are using is acoustic-based. The difference is that in LVCSR we usually use a large acoustic model, while here it is a simple neural-network based acoustic model; there is no language model, or only a very simple language model, but it is much cheaper to develop. In the case of LVCSR we are speaking about hundreds of hours of training data; in the case of acoustic keyword spotting we are speaking about tens of hours of acoustic data or even less.
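A toy sketch of the acoustic approach is shown below: a matrix of per-frame phone posteriors (which the small neural-network acoustic model would produce) is matched against the keyword's phone sequence by dynamic programming. The phone set and posteriors are made up.

```python
import numpy as np

# Toy acoustic keyword spotting: per-frame phone posteriors are matched against
# the keyword's phone sequence by dynamic programming with self-loops, so each
# phone may span one or more frames. No language model is involved. Real
# spotting would slide this over the recording and normalise by a background
# model; this only scores one candidate segment.
def keyword_score(log_post, keyword, phone_index):
    obs = log_post[:, [phone_index[p] for p in keyword]]   # (frames, phones)
    n_frames, k = obs.shape
    dp = np.full((n_frames, k), -np.inf)
    dp[0, 0] = obs[0, 0]
    for t in range(1, n_frames):
        for s in range(k):
            best_prev = dp[t - 1, s]                          # stay in same phone
            if s > 0:
                best_prev = max(best_prev, dp[t - 1, s - 1])  # advance to next phone
            dp[t, s] = best_prev + obs[t, s]
    return dp[-1, -1] / n_frames                              # per-frame log score

phones = {"e": 0, "h": 1, "l": 2, "o": 3}                     # made-up phone set
rng = np.random.default_rng(4)
posteriors = rng.dirichlet([0.3] * len(phones), size=12)      # 12 random frames
score = keyword_score(np.log(posteriors + 1e-9), ["h", "e", "l", "o"], phones)
print("per-frame log score for 'hello':", round(score, 2))
```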
0:32:14 For speech transcription, this is probably not that important here, as most of you are working in this field: we are using a system based on bottleneck features in combination with other features, HLDA, VTLN, a GMM-based or neural-network based system, speaker adaptation and an n-gram language model, and we usually generate confusion networks.
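Bottleneck features are simply the activations of a deliberately narrow hidden layer in a network trained on phone targets; at run time that layer's output is taken as a frame-level feature vector and combined with conventional features. The sketch below shows only the extraction mechanics, with random weights standing in for a trained network and made-up layer sizes.

```python
import numpy as np

# Minimal bottleneck-feature sketch: a small MLP with a narrow hidden layer.
# In practice the weights are trained to predict phone targets; here they are
# random, so this only shows the mechanics of extracting the bottleneck layer.
rng = np.random.default_rng(5)

def layer(in_dim, out_dim):
    return rng.normal(scale=0.1, size=(in_dim, out_dim)), np.zeros(out_dim)

W1, b1 = layer(440, 1500)      # input: e.g. stacked filter-bank frames
Wb, bb = layer(1500, 80)       # the narrow "bottleneck" layer (80 dims)
W2, b2 = layer(80, 1500)       # layers above the bottleneck are only needed
W3, b3 = layer(1500, 45)       # during training (outputs = phone targets)

def bottleneck_features(frames):
    h = np.tanh(frames @ W1 + b1)
    return np.tanh(h @ Wb + bb)          # stop here: these are the features

frames = rng.normal(size=(200, 440))     # 200 frames of stacked input features
feats = bottleneck_features(frames)
print(feats.shape)                       # (200, 80) -> combined with standard features
```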
0:32:47 What are the challenges from the deployment point of view here? Well, of course the accuracy is still important, but I would say it is not the most important challenge. The challenges are speed, lower memory consumption, how to train new systems automatically because we would like to do it for many languages, how to run hundreds of recognizers in parallel with efficient use of computing resources, and also how to do the feature normalization and speaker adaptation for any length of speech utterance. For example, if we transcribe a long talk or lecture, we try to apply as much adaptation as possible, but if you are working with very short segments, like three seconds or less, the adaptation will hurt you and you will usually see worse results. One solution is to remove the adaptation, but then the system will be less robust.
0:34:18 Now, how to sell speech transcription. What we found is that if you have just speech transcription and you want to sell this technology, it is quite hard. You need to have something on top of the technology that really presents the information to the users, because there is too much text and there are still some errors. Our experience is that users will never be happy about the accuracy of the speech recognition system: if there are errors in words, the users will mention them; if the words are correct, they start complaining about prepositions and suffixes; if that is correct, they start complaining about punctuation marks or grammar. But if you add some summarization and a good representation of how to look at the data, it will help you to sell the technology. We do it in such a way that we integrate with existing text-based data mining tools; the integration is usually done on the level of confusion networks, and we have other adapters as well.
0:35:57 This is one example of such a tool; it was developed by our partner company. You have a search engine: here you can have a very complex query, and here are the documents that were found for the query. But you need to write the query somehow, and the query can be very complex, so here is a query editor. That is one possibility, but if you want to discover topics, you can use this part: you can go from a date or time, you can browse the results, you can look at what the correlation among words is, and you can cluster the data automatically into classes. Once you have that, you can correct some data by hand and then train statistically based approaches, or you can deploy it, for example, to see how the topics evolve in time.
0:37:16 Now I have two slides on how we transfer the code. (What is the time, please? Okay, I will go through this quickly.) This is how we transfer the code to the users. I think in 2007 we decided to write our own speech core. The reason was that we wanted to have something very stable, very fast, and with proper interfaces. The speech core now has more than two thousand five hundred objects covering all of our speech processing, it is more than a million lines of source code, and it is still easily maintainable.
0:38:18 How do we approach the transfer from research? The research is usually done using standard toolkits like HTK or Kaldi. I think you all know these toolkits: this one is for HMMs, QuickNet is for neural network training, Kaldi is made by Dan Povey, and so on.
0:38:49and so and so on
0:38:52but the that diana we can the to use our code base
0:38:58and we can implementing new system and a two hour speechcorder quickly in a
0:39:05just two days the
0:39:06well final nor seen a single line of C plus court is written
0:39:12everything is don
0:39:14flew configuration file this could do this configuration file
0:39:18can
0:39:19look like this
0:39:20you have some objectivity this object so
0:39:23well this description is the map
0:39:26two
0:39:29C plus interface the user to set functions
0:39:33and then i we have some framework out to connect to be subjects to better
0:39:39so some
0:39:40you of fun we have four or the artemis entity
0:39:43but if you need a algorithm we just goal and to buy one simple chip
0:39:49for simple objects
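The configuration-driven pattern can be illustrated with a tiny, self-contained sketch: objects are instantiated by name, config parameters are routed to setter functions, and a small framework chains the objects into a pipeline. The class names, parameters and file paths here are invented for the example; this is not the actual speech-core configuration format.

```python
# Illustration of the configuration-driven pattern (not the actual speech-core
# format): objects are named in a config, parameters map to setters, and a tiny
# framework chains them together without writing new engine code.
class EnergyVAD:
    def set_threshold_db(self, v): self.th = v
    def process(self, x): return f"vad(th={self.th})->{x}"

class PhonemeRecognizer:
    def set_model(self, path): self.model = path
    def process(self, x): return f"phnrec(model={self.model})->{x}"

REGISTRY = {"EnergyVAD": EnergyVAD, "PhonemeRecognizer": PhonemeRecognizer}

CONFIG = [                                   # would normally be parsed from a file
    {"object": "EnergyVAD", "params": {"threshold_db": -35}},
    {"object": "PhonemeRecognizer", "params": {"model": "models/cz.nn"}},
]

def build_pipeline(config):
    pipeline = []
    for entry in config:
        obj = REGISTRY[entry["object"]]()
        for name, value in entry["params"].items():
            getattr(obj, f"set_{name}")(value)   # config keys map to setters
        pipeline.append(obj)
    return pipeline

data = "audio.wav"
for stage in build_pipeline(CONFIG):
    data = stage.process(data)
print(data)
```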
0:39:52 About interfaces: the customers are used to their own specific interfaces and they don't want to change their habits, so we maintain and develop a large set of interfaces: C++, Java, C#, the MRCP protocol that is used for IVR (there is a nice open-source project for that), a REST interface to build web-based services, and so on. This is our common framework for web-based solutions: we have the speech server, an application server, a database server and some clients. This is just an example of our testing client.
0:40:51 Okay, now I will just summarise, in three slides, some ongoing challenges that I see. One very important challenge is data. Training data: as a small company it is difficult for us to get the data, it is expensive, and the common approach is that you just go and buy it. So we are working on cheaper ways to do this; I think a great inspiration is Google. We already did something similar in language identification, where we are able to build the training data ourselves, and we would like to have speech recognition systems that can be deployed both for transcription of telephone speech and for broadcast.
0:42:03 One possibility that we explored was to use broadcast data for this, but not the whole content: we automatically detect the phone calls in the broadcast. This ensures a high variability of speakers, dialects and speaking styles. The language can be verified using automatic language identification, so we need only a small amount of data to bootstrap this approach, but then it is possible. The same holds for speakers: the speaker variability can be controlled by current speaker ID technology. Then we would like to transcribe some of the speech, we are thinking of crowdsourcing, and then use truly unsupervised training or adaptation. Currently we are discussing this with several companies and we would like to form a consortium for it. We have some experience from when we did this kind of project for language identification with LDC and NIST; it turned out to be very successful and today it is, let's say, mainstream in language identification. We have a pipeline for data collection and adaptation, backed by our partner company, a spin-off from BUT, and we believe that we could reduce the cost of the development of a new recognizer to a reasonable level. So if you are interested and you would like to learn more, just send me an email and we can discuss it.
0:43:58 The other challenge we see is that we have quite robust technology, but still, each deployment brings some surprises with each customer. Even though we have deployments in many countries, we never know in advance what the final accuracy of the technology will be, and whether we will need to do an adaptation; again, the projects that I mentioned on the previous slide should help with this. When we speak about the technologies, we usually claim to the customers that the technology is language-independent and channel-independent, but there is always some risk of frustration. The only possible way I see to reduce this risk is to build and evaluate these technologies on many languages and to know the results in advance, before the technology is deployed. For this, again, a data collection project can help, and we are thinking about how to extend some of these approaches to cover something like most of the spoken languages, because for language identification we already have a collection of about fifty languages and it grows quite rapidly.
0:45:32 And a final remark. What we see is that researchers focus mainly on accuracy, and most research articles describe some improvement in accuracy. But if we speak about the commercial market, I think that any improvement in speed, or anything that can reduce the cost of hardware, can help you to have a successful technology. We saw in some deployments that the hardware cost is really large; it can be fifty percent of the cost of the project, and so on. So this is everything from me, thank you for your attention, and if you have any questions, please ask.
0:46:39 Any questions?
0:46:46 So how did you do it, did you have to go for venture capital or something like that?
0:46:53 We were considering that approach too, but you know, at the beginning it is harder to get money from investors. So we started by going to customers and asking, or negotiating some contracts. We began with contracts, basically custom development, and we earned some money on the custom development; then we continued developing the technology, then we started selling the technology, and then we moved to products and so on.
0:47:32 I have a question: your solutions, are they on site or are they based on cloud services?
0:47:43 I would say actually both are possible, because we can use the technology on site, but we also have the interfaces, for example there is a REST interface that can be used for cloud deployment.
0:48:03 But do you have a lot of cloud deployments, or mostly on-premise models?
0:48:07 No, most of our current deployments are local, like on-site deployments. But there is a spin-off of the Brno University of Technology, ReplayWell, that is for example recording the lectures here, and this is already cloud based; it is used for a lot of lectures.
0:48:35 Any other questions?
0:48:44 So you started off connected with the university; do you still have, let's say, joint projects with them, and are there any issues with that in terms of the relationship?
0:48:56 We work with the university and also on government projects. We do this in different ways: we hire students, Phonexia employs some people who are also at BUT, we have some contracts, we have joint projects; it is sorted out differently case by case.
0:49:21 Alright, that's all. Thank you. Thank you.