0:00:13 I would say let's get the session started.
0:00:17 I'm very happy to introduce our next invited speaker, Marcello Federico. He is from what used to be called ITC-irst but is now, I think, the Fondazione Bruno Kessler, an independent research institute located near the University of Trento. There he is the co-director of the Human Language Technology research unit.
0:00:43 You probably know him from many of his papers. I personally know him from a summer workshop in two thousand seven at Johns Hopkins, where he and a number of people, including Philipp Koehn, wrote a lot of very useful software for machine translation that was sort of the genesis of the Moses toolkit.
0:01:02 And for those of you who don't know, Moses is to a lot of the machine translation community what HTK is to the speech recognition people.
0:01:09 It's very widely used, so that's how I got to know him, but of course he has many other accomplishments; I will not list all of them, just point out that he has been an associate editor for the ACM Transactions on Speech and Language Processing and for Foundations and Trends in Information Retrieval.
0:01:26 And he's also, I think, an officer of the corresponding ACM special interest group, which is like the ACM counterpart of the IEEE technical committees.
0:01:36 And he's going to talk to us today about something that he's also very well known for: he's been running the workshop on spoken language translation, the IWSLT, the International Workshop on Spoken Language Translation, which has been very useful in fostering a lot of collaboration and discussion on this important problem. So he's also well known for that.
0:02:01 Okay, thanks for the kind introduction.
0:02:08 So, an outline of my talk: I will introduce IWSLT, for those who do not know it, and in particular this talk will focus on the TED talk translation task that we started this year. I will introduce the research agenda behind this track, and describe how we organised an evaluation on talk translation: the language resources we provided, the evaluation conditions we set, and of course the participants that took part in the workshop, which was held recently in San Francisco.
0:02:45 I will briefly describe how we ran the subjective evaluation for machine translation, which is quite a tricky but important aspect. I'll give an overview of the results and findings of this exercise, and give some outlook on what we plan for next year, and some conclusions.
0:03:10 So, IWSLT is the International Workshop on Spoken Language Translation. It consists of an evaluation campaign, which is run before the workshop, and the scientific workshop itself. IWSLT has been running now for eight years, and the main organisers this year were FBK, the Karlsruhe Institute of Technology, and the National Institute of Information and Communications Technology (NICT) in Japan.
0:03:40 About the evaluation campaign: its features are that it is around spoken language translation, so this is something which is peculiar to IWSLT and is not covered elsewhere by other evaluations. Another aspect is that the language resources are organised and collected by the organisers and are provided for free to the participants. It's an open evaluation in the sense that we develop these benchmarks for everyone who wants to work on them, and we carry out both objective and subjective evaluations, which is not free of course for us, but it is for the participants.
0:04:29 Concerning the scientific workshop, this is used as a venue to present research papers on spoken language translation and machine translation in general, and of course it's a venue for presenting the evaluation results, and for the participants of the evaluation to present their system papers describing their systems. We also have invited talks and a discussion session.
0:04:58 If you look at the venues: we started in two thousand four in Kyoto, then moved between venues in Asia, Europe and the US over the years, and one week ago we were in San Francisco.
0:05:14 If you look at the participants over all these years, we count fifty-two different research groups that took part. Of course, not all of them took part in all the evaluations: we have, let me say, a core group of around fourteen participants that took part in at least four evaluations, and around twenty sites that participated in only one of the events. So you can figure out among them the most prominent research groups working on machine translation, but also several small groups, and companies as well.
0:05:58 The aspect of small groups is important, because we also try to propose somewhat affordable evaluation tracks, which do not require intensive computation power or large groups to be run.
0:06:16 So, these are the figures about the participants. And now an overview of the evolution of our tasks.
0:06:37 Until two thousand ten we concentrated a lot of effort on the so-called BTEC travelling domain, for which we organised separate evaluations, and just recently, this year and part of last year, we started to cover this TED talk domain.
0:07:00 Concerning the BTEC travel domain: we started in two thousand four and provided just an evaluation for text translation over this BTEC corpus, which is a collection of travel expressions collected from the phrase books that tourists, for instance, use to try to communicate abroad.
0:07:25 We started with Chinese to English and Japanese to English. In two thousand five we added a track starting from speech, but indeed we provided basically transcripts from speech recognition engines, and this was really an exercise with read speech: people read these expressions, these sentences. We covered again Chinese-English, and also English to Chinese, Japanese to English, Arabic to English and Korean to English.
0:08:03 In two thousand six we tried to launch a new task, but we returned to BTEC in two thousand seven with Arabic-English and Japanese-English; then in two thousand eight with Arabic-English, Chinese-English and Chinese-Spanish, and this time we also proposed a pivot translation task: Chinese to Spanish either directly, or going through English, so Chinese to English and then English to Spanish.
0:08:39 From there we went on with Arabic-English and Chinese-English, and we added new languages; almost every year we add new languages. That year we added Turkish, so Turkish-to-English translation, and the year after we repeated Arabic to English but added French.
0:09:03 Beside, let me say, this stream of BTEC tasks, as explained, we added some more complex tasks, always around travel expressions, and we started modelling whole dialogues in two thousand six, over Arabic-English, Chinese-English, Japanese-English and Italian-English.
0:09:33 The following year we repeated the experiment, and then we moved to human/machine-mediated dialogues, which really reflect a translation task, while the former were basically translations of monolingual dialogues between humans, where the translations were produced afterwards.
0:10:06 The language directions we worked on were English-Japanese and Chinese-Japanese, the following year Chinese and English, and again Chinese-English the year after.
0:10:20 Then, in two thousand ten, we ran for the first time an exercise, without a real evaluation, on the TED talks, and we started by providing output from speech recognition; the translation direction was English to French. The following year, this year, we provided machine translation tracks, from Arabic to English, from Chinese to English and from English to French, and the full end-to-end evaluation from speech, so providing audio files, from English to French.
0:11:02 Okay. We started with these TED talks, but this is indeed not really new stuff, at least for part of the organisers.
0:11:12 We had past work on speech recognition of lectures within the European project FAME, from two thousand one to two thousand five, and some papers on that. And it's funny that our first work on language modelling for lecture transcription was on a TED corpus, but here it's another acronym: it is a database of lectures recorded at Eurospeech ninety-three, and this database was released by LDC in two thousand two. TED there stands for Translanguage English Database. In that project we worked on this database as well, and people from Karlsruhe worked on lectures they collected on their own.
0:12:03 Concerning spoken language translation of lectures and speeches, as I mentioned, there was the European project TC-STAR, which was a big effort from two thousand four to two thousand seven, in which I see many participants here took part; we also had IBM, the Karlsruhe institute, LIMSI and UPC taking part, and there are several papers about the translation of speeches.
0:12:36 You have here a couple of examples.
0:12:40 In two thousand ten, as I stated, we started a new track in IWSLT on the TED talks, and in particular we focused on this domain of TED talk translation. So what is TED, first, for those who may not know it? TED is a non-profit organisation in the US that organises every year two conferences and hosts many, I would say, short and often brilliant talks over a variety of topics. All these talks are recorded, and there is a website maintained by TED which collects all the videos of the talks, the transcripts and also many translations, and all this material is provided under a Creative Commons license, so you can basically download it and use it.
0:13:44 If you look at the translations I mentioned, there is a community behind this that helps with the translations: there are many volunteers who provide translations. Here I show you a plot that compares, for, let me say, the most popular languages translated into, the number of talks translated up to November two thousand ten and up to November two thousand eleven.
0:14:22 You see that there are many languages for which you have around a thousand talks translated, and if you look at the right side you have two global figures. The talks are recorded in English at the conferences, and the transcribed ones were around eight hundred in two thousand ten and over a thousand in two thousand eleven, so there are about two hundred fifty to three hundred talks processed every year. The languages covered by these volunteers moved from eighty to eighty-three, the number of these volunteer translators moved from four thousand to almost seven thousand, and the number of translations provided globally, which are many more than those you can find in this plot, which covers around twenty languages, moved from twelve thousand to twenty-four thousand. So it's really a large number, and as you can see, many languages are covered for which you usually do not have many language resources available, especially in terms of parallel corpora.
0:15:42 So let's see, from the point of view of these translators, how we can describe the task behind preparing these translations of talks. Typically, the audio is partitioned, because you might have music, background noise, applause; so you detect the speech segments, you split the speech into sentences, and these are transcribed. Translation then works on these segmented transcripts, and as these transcripts serve as captions, ideally the translation units should stay focused on the single caption.
0:16:35ideally the translators should keep seem criminal synchronicity among the amount to
0:16:43among the about the captions like you see in this example so the same sentence is exactly translated the same
0:16:49way in french and italian of course you can see if you if you look a bit deeper
0:16:55that for some languages they allow for some reordering across captions team for instance german which you have these longer
0:17:03movement so you might have also movement across the captions but of course
0:17:08this is the sentence boundaries are process
0:17:15 Let us now have a look at the talks; I'll show you some videos, nothing scary like before. Here is some audio.
0:17:43 [video] I'm a performer. I'm also diagnosed bipolar. I reframe that as a positive, because the crazier I get on stage, the more entertaining I become. When I was sixteen in San Francisco, I had my breakthrough manic episode in which I thought I was Jesus Christ.
0:18:35 So, that was an example, and by the way, it is from our real test set of this year.
0:18:38 So if you try to compare this kind of content with the previous tasks, and also with the very popular news translation task, which has been covered by other evaluations, this table somehow summarises it.
0:19:03 Look at travelling, TED talks and news from the communication perspective: we move from dialogue to monologue communication. The situation is informal for the travelling task: there we have the usual tourist asking for information from people on the street; in TED talks I would say it is semiformal, and sometimes there is even some interaction with the audience, while news uses a different, formal register. The aim is informative for the travelling task and for the news, I would say: convey information, ask for information; while for TED, I would say, the aim is more persuasive: these people are, in my view, trying to convince you of something, to sell you an idea.
0:20:03 The style is also different: conversational for the travelling task, I would say entertaining in the TED talks, while formal for the news.
0:20:14 With respect to the domain, the travelling task is limited, focusing on the information requests you may have while travelling; that is the general topic. For TED talks and news it is really open: you have really a variety of possible topics.
0:20:36 With respect to the lexicon, this might be surprising: the travelling task is for sure small, the vocabulary was always around five thousand words, it doesn't go beyond that maximum. For the TED talks I would say it is medium, because during a TED talk the goal is to convey something, and they do it using a rather plain language: they use lots of colloquial expressions, and they are not looking for eloquent expressions unless they need some technical term. So it is smaller than the vocabulary that you find in news.
0:21:21 And concerning the syntax, the complexity of the sentences in terms of structure: you have a very simple structure in the travelling task, where we had a maximum average length of seven or eight words, which is very short. In news you may have very long sentences, while the TED talk sentences are typically short, around fifteen words, and also the structure is quite linear, let's say: you do not have many nested clauses.
0:22:01 Concerning the challenges that you face with this task: from the language modelling point of view, you have of course limited in-domain training data; think that the corpora are a couple of million words, which is not the usual size you expect for modelling language. Then you have variability of topics and styles: each talk is different from the others, and has its own topic and maybe also its own style.
0:22:37 For acoustic modelling: there are many speakers, and you may have speakers with different accents, for instance non-native speakers. You have different fluency, speaking rate and style, and there may also be more than one speaker in a talk. And you have to cope with noise: you have applause that may cover the speech, pauses, laughs, and also music, like before, when the guy was playing.
0:23:15 With respect to translation modelling, with this collection we can work with under-resourced languages. Arabic, for instance, I would not say is under-resourced, because LDC collected lots of data, but there are several languages for which probably very little parallel data is around. There are also distant languages, languages with very different structures, like we did this year with Chinese, and you can deal with morphologically rich languages, as they are well covered here.
0:23:54 Concerning speech translation specifically, the task as we designed it requires going from spontaneous speech to a polished text, which means that you have to provide a polished text, with capitalisation and punctuation, and that is not trivial starting from speech. Then you have tasks like the detection and annotation of non-speech events, and finally, I think the ultimate goal here would be to provide subtitling and translation in real time, while the talk is given. Of course, we did not tackle all these challenges: for two years now we have basically focused on the first challenges.
0:25:02 So, the tracks we proposed for two thousand eleven were, for the first time, an automatic speech recognition track, where we asked participants to provide transcriptions of talks, from audio to text, in English. We had a spoken language translation track, which requires automatic translation of talks from audio, or from the ASR outputs we provided, into text, from English to French; keep in mind that the talks are recorded in English.
0:25:40 And then the machine translation tracks, this time starting from text: from English to French, from Arabic to English and from Chinese to English. Notice that for the last two translation directions we basically started from the human translations and tried to translate back to the original.
0:26:12 You might think that this is not the best thing you can do, because it has been shown that some artifacts may occur, either because you write some text or because you translate some text from some other language. But from our point of view these kinds of artifacts are really not important with respect to the quality that you can achieve nowadays with machine translation, so it's better to have some data, even if not the ideal data, and to use them as they are.
0:27:01 Okay, and finally, again as in earlier editions, we proposed a system combination track, both for ASR output and for MT output, and the participants were given all the system outputs collected during the evaluation.
0:27:26 The language resources are an important aspect. For speech we did not provide data, but allowed the use of any publicly available recordings dated before the thirty-first of December two thousand ten, and that's fine because the evaluation data were collected after that date.
0:27:53 As parallel data, we provided the text parallel data of the TED talks, about two million words, for English-French, Chinese-English and Arabic-English. Then we made available the so-called MultiUN United Nations corpus, which is around two hundred million running words for English-French, Chinese-English and Arabic-English; this is, I would say, a large out-of-domain corpus. And then all the data made available by the Workshop on Machine Translation, in particular the giga parallel corpus of English-French crawled from the web, which makes up to eight hundred million words, so it's a very large parallel corpus.
0:28:45 As monolingual texts, besides the monolingual parts of the parallel data, we provided all the transcripts of the English talks, so more than those that were translated, and we also allowed a large book collection corpus, for English and French. Then we provided the dev sets for ASR, SLT and system combination; these data were collected and checked by the organisers.
0:29:24 Concerning the conditions: we decided to go for a presegmented input this time for speech recognition, which means that we provided just the segments with speech, so the segments with non-speech events were simply not considered this time.
0:29:53 The same segments were used for speech recognition, speech translation and also for machine translation, so they were perfectly aligned. The reason for this is also that it provides a much better means for system combination, as the participants provide outputs for the same segments.
0:30:12 The input was cased and punctuated for machine translation only. The output was not required to be cased and punctuated for speech recognition, but it was for the machine translation systems; and for spoken language translation too, the machine translation output had to come with punctuation and case information.
0:30:44 We ran automatic evaluations on all the tracks, and we ran human evaluation of the machine translation and spoken language translation. As for metrics, here are the metrics we used.
0:31:05 About the schedule: the timeline shows when we provided the training data and the dev data, by the end of June, and later we provided the data for system combination. So we basically asked participants to work first on the dev sets, and the runs were then put on the website for the participants working on system combination. And we had a very tight schedule in September, in which we ran one after the other the ASR evaluation, the ASR system combination, the SLT and machine translation evaluations, and finally the machine translation system combination.
0:31:54 We allowed participants to submit one primary run and multiple secondary runs. The test set references were not released, so the evaluation was done through an evaluation server, and we are going to keep this test set as a progress test set for next year. What is good is that the benchmarks are available on our website, and the evaluation server is also going to stay open, so everyone can give it a try, and the participants can work further to improve their systems.
0:32:35 As participants we had eleven teams; we had fifteen at the beginning, but four withdrew after a few months, probably, I mean for sure, because the task was more difficult than that of the previous years.
0:32:50 So we had DCU with the Centre for Next Generation Localisation; the Karlsruhe Institute of Technology in Germany; LIG of Grenoble; LIUM of Le Mans; MIT together with Air Force Research; Microsoft Research in the US; NICT, the National Institute of Information and Communications Technology in Japan; and RWTH Aachen in Germany.
0:33:33 As for the submissions we received: we had five submissions for ASR and five for SLT; English-French machine translation was the most popular track, with seven participants; then we had four each for Arabic-English and Chinese-English, and a couple of submissions for the system combination tracks.
0:33:57 So if you look at the results for ASR, here is the test set. If you look at the bottom line, we have what was the baseline of last year, which had a word error rate of around twenty-two or twenty-three percent. This year we had two significant improvements in terms of performance, and you see that the system combination also helped quite a lot, so that we moved from the best system's fifteen point four percent word error rate down to thirteen point three.
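As a reminder, the word error rate quoted in these results is just a normalised edit distance over words. This is only a minimal sketch of the computation, not the scoring tool actually used in the campaign:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length,
    computed with the standard Levenshtein edit-distance DP over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,  # substitution / match
                          d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1)        # insertion
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("try something new for thirty days",
          "try something for thirty day"))  # 2 errors / 6 reference words
```

Production ASR scoring (e.g. NIST's sclite) additionally handles text normalisation and per-speaker alignment, which this sketch omits.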
0:34:37 If you want to have a look at an excerpt of what we have seen: you see that even the best ASR transcription provided is not really flawless. We have rather good performance, but remember that the speaker was away from the microphone at the beginning, and he was not an easy speaker.
0:35:08 If you look at the performance we have over the talks, the scores provided are quite uniform; there is no talk for which you go much above twenty percent with the best system, and with the system combination you are mostly below twenty percent. Our most difficult talk was the one seventy-eight, which is around fifteen percent for
0:35:49 system combination. So let me show you the transcript for just one talk, the one eighty-three; I'll play the audio and show the corresponding transcript.
0:36:07 [video] A few years ago, I felt like I was stuck in a rut, so I decided to follow in the footsteps of the great American philosopher Morgan Spurlock and try something new for thirty days. The idea is actually pretty simple: think about something you've always wanted to add to your life, and try it for the next thirty days. It turns out thirty days is just about the right amount of time to add a new habit, or subtract a habit, like watching the news, from your life. There's a few things that I learned while doing these thirty-day challenges. The first was, instead of the months flying by, forgotten, the time was much more memorable.
0:36:50 So really, as you can see, you have a very good ASR transcription.
0:37:00 Now, this is for what concerns speech recognition.
0:37:07 Let me now tell you briefly about the subjective evaluation for MT. As you might know, we have automatic metrics for machine translation: the BLEU score is the most known one, but there are others, like NIST, METEOR, TER, word error rate, position-independent error rate. All these metrics basically try to compare, to match, the MT output against one or more reference translations.
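To make this matching idea concrete, here is a minimal, unsmoothed sentence-level sketch using BLEU as the example. Real scorers (e.g. the NIST mteval script) add smoothing and corpus-level aggregation, so this is only an illustration:

```python
import math
from collections import Counter

def bleu(candidate: str, references: list, max_n: int = 4) -> float:
    """Toy sentence-level BLEU: geometric mean of modified (clipped) n-gram
    precisions up to max_n, times a brevity penalty against the closest
    reference length. Unsmoothed, so any empty n-gram match zeroes the score."""
    cand = candidate.split()
    refs = [r.split() for r in references]
    log_prec = 0.0
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        # clip each candidate n-gram count by its maximum count in any reference
        max_ref = Counter()
        for ref in refs:
            for ng, c in Counter(tuple(ref[i:i + n])
                                 for i in range(len(ref) - n + 1)).items():
                max_ref[ng] = max(max_ref[ng], c)
        clipped = sum(min(c, max_ref[ng]) for ng, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        if clipped == 0:
            return 0.0
        log_prec += math.log(clipped / total)
    # brevity penalty: penalise candidates shorter than the closest reference
    ref_len = min((len(r) for r in refs), key=lambda rl: (abs(rl - len(cand)), rl))
    bp = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / max(len(cand), 1))
    return bp * math.exp(log_prec / max_n)
```

A perfect match against a reference scores 1.0, and the count clipping prevents gaming the metric by repeating a common word.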
0:37:44 As you might know, these metrics are far from being perfect. If you want to measure, to rank or to compare system outputs, you need to rely on subjective evaluation, which is of course more expensive and slower to carry out. This is why one runs evaluations like ours: because once in a while you need to evaluate your systems with subjective evaluations. These have typically been carried out by recruiting some experts and asking them either to judge in absolute terms the quality of a machine translation or, better and more focused if you want to rank the outputs, to say which of two translations is the better one.
0:38:39 What we did this year, with respect to previous years, is that rather than hiring expert judges, we ran the evaluation by crowdsourcing. It's not a new methodology: Chris Callison-Burch launched this approach a couple of years ago with WMT, the Workshop on Machine Translation. We also applied some new ideas about running the subjective evaluation by crowdsourcing, which are described in this paper. So let me briefly tell you what it is about.
0:39:20 Our core evaluation works on one sentence pair: we compare the outputs of just two systems, and we provide to each of, in this case, three real judges a reference translation and the outputs of the two systems. We then ask the judges to rate which is the best one: they are allowed to say that the two translations are equally good or equally bad, or to indicate which is the better translation. Like in this case: you have three judges, two of them chose system two as the best one, and one said that they are equally bad.
0:40:16 From this atomic evaluation we can say that the winner, in this case, is system two. Of course, this is just one sentence; what we can do is to repeat this evaluation for all the sentences of the test set, so for sentence one, sentence two and so on, always between system one and system two, collect all the judgements, and finally the statistics about how many wins go to system one, how many to system two, and how many ties. Looking at these statistics we can decide who the winner of this pairwise comparison is.
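The per-sentence aggregation just described, with each judge choosing system one, system two, or a tie, amounts to a simple plurality vote. A sketch (the labels and helper name are illustrative, not taken from the actual evaluation scripts):

```python
from collections import Counter

def sentence_winner(judgments: list) -> str:
    """Aggregate one sentence's judgments from several judges.
    Each judgment is 'A', 'B', or 'tie'; the plurality label wins,
    and 'tie' is returned when no single label dominates."""
    counts = Counter(judgments)
    best, best_n = counts.most_common(1)[0]
    if sum(1 for c in counts.values() if c == best_n) > 1:
        return "tie"  # no clear plurality among the judges
    return best

# Two of three judges prefer system B, one calls it a tie -> B wins.
print(sentence_winner(["B", "B", "tie"]))  # → B
```

Repeating this over every test-set sentence yields the win/tie counts for one pairwise system comparison.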
0:41:12 This comparison is run for just a pair of systems; if you have more systems taking part in the evaluation, we organise a round-robin tournament among all systems. What you see in this table is that you have all the systems along the top, and you have boxes in which you put the win and loss statistics; so we have a table which shows all the pairwise comparisons that you need to carry out, depending on the translation direction. For each of these boxes you run one of these evaluations over the full test set, and you report the number of test-set pairwise wins and losses in the table.
0:42:12 so from this machinery
0:42:16 we can extract
0:42:18 some meaningful statistics for the comparison, and we use these quite standard
0:42:24 scores
0:42:25 so the
0:42:27 first score used is "better than others",
0:42:30 where you report the percentage of test sentences that a given system won
0:42:36 against any other system;
0:42:38 so for each system we compute this,
0:42:43 as well as the other metric, which is
0:42:46 "better than or equal to others", which counts both the wins and the ties
0:42:54 collected by the system
0:42:56 and finally we have the head-to-head
0:43:00 results,
0:43:01 which count the number of test-set pairwise rankings won by the system
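The three ranking scores just described can be computed from the round-robin table. The sketch below is hypothetical (the `results` dictionary layout and function name are invented for illustration): each system pair maps to its sentence-level win/win/tie counts, and we derive "> others", ">= others", and the head-to-head matchup count.

```python
from collections import defaultdict

def ranking_scores(results):
    """results maps an ordered system pair (a, b) to (wins_a, wins_b, ties),
    counted over all test sentences of that pairwise evaluation.

    Returns, per system:
      '>others'  : % of sentence comparisons the system won,
      '>=others' : % of sentence comparisons won or tied,
      'head2head': number of full pairwise matchups won."""
    comp = defaultdict(lambda: [0, 0, 0])   # per system: wins, ties, total
    h2h = defaultdict(int)
    for (a, b), (wa, wb, ties) in results.items():
        total = wa + wb + ties
        comp[a][0] += wa; comp[a][1] += ties; comp[a][2] += total
        comp[b][0] += wb; comp[b][1] += ties; comp[b][2] += total
        # a matchup is won by whichever system has more sentence-level wins
        if wa > wb:
            h2h[a] += 1
        elif wb > wa:
            h2h[b] += 1
    return {s: {">others": 100.0 * w / n,
                ">=others": 100.0 * (w + t) / n,
                "head2head": h2h[s]}
            for s, (w, t, n) in comp.items()}
```

With three systems A, B, C this reproduces the kind of table shown on the slide: one box per pair, and the three summary columns per system.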
0:43:07 so if you look at the
0:43:09 figures of this year,
0:43:12 you can appreciate the importance of running subjective evaluations, because we report both
0:43:17 automatic metrics and subjective metrics
0:43:22 as you know they correlate well, but
0:43:25 you might have some surprises, especially with systems whose
0:43:29 scores are
0:43:30 rather close with automatic metrics;
0:43:33 for instance, you see here two systems with very close automatic metric scores, but
0:43:42 the rankings may change with the subjective evaluation
0:43:46 so
0:43:47 what we see is that, on one side,
0:43:52 we had an improvement in terms of BLEU score with respect to the previous
0:43:57 year's exercise on the same translation direction, and we had a maximum BLEU score of 16.50
0:44:03 for this SLT task; so these are results
0:44:07 of machine translation
0:44:09 starting from speech, okay,
0:44:11 and ending with text with
0:44:14 punctuation and capitalisation
0:44:19 so we basically doubled
0:44:21 the BLEU score,
0:44:22 which for sure is a significant improvement
0:44:27 moreover, for machine translation english-french we had a similar behaviour
0:44:37 so
0:44:38 the ranking given by BLEU was not
0:44:42 confirmed here: we have a slightly different ranking, but of course the correlation is good, as
0:44:50 you can verify
0:44:57 for machine translation arabic-english,
0:45:00 you see here that in this case the ranking is confirmed; you have a
0:45:07 more significant difference
0:45:09 among the systems
0:45:11 in the BLEU score,
0:45:12 so if you have such a
0:45:14 large difference
0:45:15 it's very likely that the
0:45:17 subjective ranking is confirmed
0:45:20 unfortunately, you see that system combination did not really
0:45:25 help here
0:45:28 for machine translation,
0:45:30 so
0:45:32 the combined system ended up second
0:45:41 okay
0:45:43 then for chinese-english
0:45:46 we have again the result confirmed, as for arabic-english, so the ranking of BLEU is
0:45:54 preserved
0:45:57 with some slight differences in the
0:45:59 bottom part
0:46:02 and this time
0:46:07 the system combination actually provided the
0:46:10 best result in terms of head-to-head comparisons
0:46:13 so
0:46:14 you see on the bottom line the figure four, which means that the
0:46:20 system combination won four
0:46:23 matches
0:46:24 when it was
0:46:26 compared
0:46:27 head to head
0:46:28 against the other systems
0:46:32 now briefly about the
0:46:36 output: we can compare here
0:46:41 an example from SLT,
0:46:44 which is translation from english to french. so here
0:46:49 we have again the example we saw before
0:46:55 so you might be surprised
0:46:58 about something:
0:47:00 what about
0:47:03 "san francisco"?
0:47:04 because this is translation starting from speech recognition, and you remember that
0:47:08 in the ASR output we saw before, "san francisco" was not recognised. so i was also
0:47:15 wondering about this;
0:47:17 i was suspicious, so i looked into the ASR output
0:47:22 of
0:47:24 the best system, and indeed it got "san francisco"
0:47:30 so it means that the system combination output, which reaches the lowest word error rate, was wrong in
0:47:39 recognising "san francisco", while the
0:47:42 single best system here,
0:47:44 from which the best SLT translation was produced, was right
0:47:51 the quality is reasonable; i think you can understand what's going on, but it
0:47:56 can be improved
0:47:59 it's a different story if you look at the machine translation output, so translating from
0:48:04 perfect transcripts,
0:48:05 clean transcripts,
0:48:07 from english into french: here you have a
0:48:09 rather
0:48:11 good
0:48:12 translation
0:48:17 i show you now another talk
0:48:21 which belongs to the other test set
0:48:24 [passage largely inaudible]
0:49:07 this is what we call
0:49:13 and everybody agrees with this, on the other end of the spectrum
0:49:23 [inaudible]
0:49:40 okay
0:49:42 now let's look at machine translation from arabic into english; i wanted to show you this because
0:49:48 otherwise this slide
0:49:50 would remain
0:49:52 unexplainable
0:49:55 so here is the output from one of the systems
0:50:01 it's not really special, particularly the beginning; nothing to show you
0:50:07 again
0:50:11 [unclear]
0:50:17 one word was out of vocabulary
0:50:23 but you can get an idea
0:50:25 as you know, chinese is much more difficult than arabic
0:50:29 but okay, look at this output from the best system
0:50:40 [unclear]
0:50:47 so there is another coloured word which is introduced,
0:50:54 which means "we need",
0:50:56 as far as i
0:50:57 can tell
0:51:04 again, [unclear]
0:51:08 it's a reasonable
0:51:11 output
0:51:17 now i will briefly overview what are the main findings of these evaluations. i surveyed all the system papers
0:51:25 by the participants and tried to figure out what were the optimal configurations, and maybe ideally give some guidelines to
0:51:33 future participants or researchers that would like to approach this
0:51:38 task. so if you look at the
0:51:40 asr systems, from the acoustic data perspective,
0:51:45 participants typically downloaded
0:51:49 the TED talks,
0:51:50 which can be downloaded freely,
0:51:52 and tried to automatically align the manual transcripts with the audio,
0:51:57 using straightforward procedures,
0:52:00 and got around a hundred and fifty hours, and then used these hundred and fifty hours for training acoustic models
0:52:08 some sites
0:52:10 instead used other data: lecture speech from other projects, their own data,
0:52:17 and news,
0:52:18 to get a larger amount of hours
0:52:22 about acoustic and linguistic features: participants used standard acoustic features with up to third order derivatives
0:52:32 and
0:52:33 large feature vectors,
0:52:35 reduced with LDA or HLDA
0:52:37 acoustic model training was done by the best three systems with discriminative training, with MMI and
0:52:44 minimum phone error
0:52:46 criteria
0:52:47 concerning language models, four-gram interpolated language models were employed, combining TED data with the other available data
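Interpolated language models of the kind mentioned here combine several component models with a set of weights. A minimal sketch follows; the callable-based interface is an assumption made for illustration (real systems use toolkits and estimate the weights, e.g. by EM on held-out in-domain data).

```python
def interpolate(models, weights):
    """Linearly interpolate several language models.

    `models` are callables mapping (word, history) -> probability;
    `weights` are non-negative and sum to one.  The returned callable
    is itself a language model over the same vocabulary."""
    def p(word, history):
        return sum(w * m(word, history) for m, w in zip(models, weights))
    return p

# Toy components: two (unrealistic) constant-probability models.
ted_lm = lambda word, history: 0.2     # stand-in for the in-domain TED model
news_lm = lambda word, history: 0.6    # stand-in for an out-of-domain model
mixed = interpolate([ted_lm, news_lm], [0.5, 0.5])
```

Because the weights sum to one and each component is a proper distribution, the mixture remains a proper distribution.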
0:52:56 about multi-pass decoding: almost all the participants performed multi-pass decoding,
0:53:04 using models of increasing resolution, from non-adapted to speaker-adaptively trained acoustic models,
0:53:11 from trigram to four-gram language models,
0:53:14 and they also applied different
0:53:16 acoustic models in the process to do some
0:53:21 cross adaptation
0:53:24 so these systems used different acoustic features, for instance they employed neural network based
0:53:29 features, or they used
0:53:33 different
0:53:36 lexica
0:53:40 concerning MT,
0:53:43 people worked on parallel data selection criteria: we provided a lot of
0:53:49 out-of-domain data, very large collections, like the eight hundred million words
0:53:54 of parallel data for french-english
0:53:56 you cannot use it all
0:53:58 in a system, you run out of memory, so the best you can do is
0:54:02 to extract meaningful data from it
0:54:05 and
0:54:09 they used entropy-based or alignment
0:54:10 score criteria
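The entropy-based selection criterion alluded to is typically a cross-entropy difference in the spirit of Moore and Lewis: score each candidate sentence by its cross-entropy under an in-domain LM minus that under an out-of-domain LM, and keep the lowest-scoring sentences. A toy sketch, with the LM interface invented for illustration (each LM is a callable returning per-word cross-entropy in nats):

```python
import math

def select_sentences(sentences, in_domain_lm, out_domain_lm, threshold=0.0):
    """Cross-entropy-difference data selection.

    Keeps the sentences for which H_in(s) - H_out(s) < threshold,
    i.e. those that look more in-domain than out-of-domain."""
    kept = []
    for s in sentences:
        score = in_domain_lm(s) - out_domain_lm(s)
        if score < threshold:
            kept.append(s)
    return kept
```

Lowering the threshold keeps less, more sharply in-domain data; in practice one tunes the retained fraction on a held-out set.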
0:54:14 people worked on word segmentation for the arabic-english task, on different alignment techniques,
0:54:19 on additional model features;
0:54:21 they worked on
0:54:23 adaptation
0:54:25 of translation tables and language models, by using linear interpolation, log-linear interpolation, or fill-up
0:54:31 interpolation of
0:54:32 the phrase table; discriminative training of the
0:54:36 translation model was done by microsoft research,
0:54:40 who developed topic specific translation tables
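The fill-up and linear interpolation of phrase tables mentioned above can be illustrated on toy dictionaries. Real phrase tables carry several feature scores per entry; this sketch uses a single translation probability, and the function names are invented for the example.

```python
def fill_up(in_domain, out_domain):
    """Phrase-table fill-up: keep every in-domain entry and add
    out-of-domain entries only for phrase pairs the in-domain table lacks.

    Tables map (source_phrase, target_phrase) -> probability."""
    merged = dict(out_domain)
    merged.update(in_domain)     # in-domain entries win on conflicts
    return merged

def interpolate_tables(in_domain, out_domain, lam=0.7):
    """Linear interpolation of phrase translation probabilities;
    a missing entry contributes probability zero."""
    keys = set(in_domain) | set(out_domain)
    return {k: lam * in_domain.get(k, 0.0) + (1 - lam) * out_domain.get(k, 0.0)
            for k in keys}
```

Fill-up trusts the in-domain estimates wherever they exist and only backs off to out-of-domain entries for coverage, while interpolation blends the two estimates everywhere.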
0:54:46 continuous-space
0:54:48 language models based on neural networks were used,
0:54:51 and class-based language models, by one site, to model the style of talks;
0:54:56 syntax based models based on categorial grammar were used by another
0:55:01 and then, concerning the comparison between phrase-based
0:55:05 and hierarchical phrase-based SMT, nothing definite: some labs compared them;
0:55:13 some of them found one better than the other, the others found them on par,
0:55:17 so we have mixed
0:55:18 results
0:55:23 now about IWSLT two thousand twelve, to introduce what's going on
0:55:27 for next year
0:55:30 so, about the venue:
0:55:32 it was decided to be
0:55:34 maybe in hong kong
0:55:36 in december,
0:55:37 we will see
0:55:40 and some anticipation about what we are going to plan:
0:55:42 we are going to confirm the TED talk tasks
0:55:47 so the asr track will be again on english, and this time there will also be a contrastive condition
0:55:53 without using segmentation, so you have the challenge of recognising the speech and the pauses,
0:55:59 but the primary run will be on the segmented speech
0:56:04 so english to french;
0:56:06 we are going to repeat arabic-english; we are
0:56:09 now thinking about repeating chinese-english; and we plan to add a new multilingual exercise, which
0:56:17 still has to be worked out: we want to support some
0:56:21 longer term effort on specific languages, so people could choose their own preferred language and have
0:56:26 the possibility to work repeatedly on this language, like we do for instance for italian
0:56:32 so some of our friends, for instance,
0:56:34 would like
0:56:35 to work on their own language. so we're going to provide several translation directions here, and we will provide baselines, and
0:56:42 people will be able to set up systems on these different languages
0:56:47 and we don't really care about having comparisons against each other; they will rather compare against the baseline, and we will
0:56:54 try to do some comparisons across the different languages; we have some ideas about that
0:57:00 and as we lost a lot of the smaller players, i mean
0:57:05 smaller labs, students for instance, we are introducing a new small-domain task, the olympics
0:57:13 corpus, kindly provided by NICT
0:57:18 in japan. it is a corpus of around sixty thousand sentences;
0:57:22 the domain is travelling, traffic, business, dining and sport, and it was collected for the beijing
0:57:29 olympic games
0:57:32 some conclusions:
0:57:34 the TED task is basically a subtitling translation task; we added ASR and system combination this year, and
0:57:42 the data has been publicly released, both language resources and benchmarks; you can find them on
0:57:51 the website
0:57:52 we also ran subjective evaluation
0:57:56 this year; we had eleven participants
0:57:59 in the evaluations
0:58:04 running systems on our data. i must say we saw an impressive effort in high quality research on this
0:58:11 track, and this is witnessed by the research papers you find
0:58:14 in the proceedings,
0:58:15 and significant improvements over the previous year on
0:58:19 each task
0:58:20 so there is much to take away from these TED tasks:
0:58:25 there are
0:58:28 good and interesting ideas
0:58:31 by the participants about how to cope with this problem,
0:58:36 which will be available soon:
0:58:38 the proceedings are going to be published online
0:58:41 we showed the importance of subjective evaluation,
0:58:45 run
0:58:46 with crowdsourcing,
0:58:47 though you have to further normalise these results, because they are raw ones
0:58:54 take my invitation to try
0:58:57 this task
0:58:58 and eventually join the next
0:59:00 edition
0:59:02 here are some references
0:59:04 for my talk
0:59:08 and finally some credits
0:59:12 to the people who provided the data,
0:59:14 and especially those who did the
0:59:16 setup
0:59:26 we have time for a couple of quick questions
0:59:28 before we go to the next part of the session
0:59:33 marcello, thank you very much for a very interesting overview of IWSLT
0:59:39 one of the things, i guess, that's probably very relevant for the community here is that there is
0:59:44 an ongoing debate as to
0:59:49 what's the way to improve speech-to-speech translation:
0:59:52 whether the speech people should talk to the translation people, or whether they should both carry on doing
0:59:58 their own stuff and, you know, sort of slam the two components together every once in a while, and
1:00:04 keep their distance from each other
1:00:06 i'm wondering if you had any comments about the impact of having the speech people interact
1:00:14 more or less closely with the NLP people, as far as advancing the state of the art in this area
1:00:23 [answer largely inaudible in the recording; he comments on the interaction between the speech recognition and translation work]
1:01:17 actually i had a question: maybe three years ago we had this somewhat disappointing discovery, when we were
1:01:23 doing some of the GALE
1:01:25 research, that even if the speech group managed to improve accuracy to a hundred percent,
1:01:31 the translation wasn't good enough for us to meet the objectives of the program at the time
1:01:36 and so in some sense we cut down our speech effort tremendously and put all our energies into translation,
1:01:42 and the hope was that one of these days translation would get good enough that we could start paying attention to
1:01:47 speech again
1:01:48 has the IWSLT experience been different? do people actually measure what difference it
1:01:54 would make if
1:01:56 they used the reference transcripts on the test data? have you looked at that as an evaluation question?
1:02:01 yes, we ran the evaluation
1:02:05 also with the reference
1:02:07 transcripts, and we compared the results
1:02:20 i think
1:02:22 if the word error
1:02:23 rate is, of course, around
1:02:25 five percent,
1:02:28 you start to see a difference
1:02:36 but
1:02:37 well,
1:02:38 of course
1:02:41 machine translation from speech
1:02:45 is more difficult in some
1:02:46 sense;
1:02:48 actually we get
1:02:50 very readable results,
1:02:52 so we are not just
1:02:54 spotting errors here and there
1:02:57 for some languages,
1:02:59 while,
1:03:00 as i said before, for others
1:03:04 it's far behind
1:03:06 that level
1:03:10 maybe one more
1:03:11 iteration is needed to reach
1:03:12 the goals set for the machine translation work
1:03:17 all right, now this is good to know, because i think in GALE we were seeing that there was no
1:03:20 difference even when the word error rate was fifteen to,
1:03:24 maybe not twenty, but higher than fifteen percent. so it's good to know that you're already starting to see a difference,
1:03:28 so there's a reason to
1:03:30 make the speech better
1:03:32 other questions?
1:03:34 so let's thank our speaker once again