so thank you for a nice introduction

my name is petr schwarz and

at the beginning i was a researcher at brno university of technology

i worked in many different fields starting with a phoneme recognizer


we really try to do it should speaker identification don't know asr

are you

roll the speaker is no particular isn't enough

that many of you have still using

but and down the here

two thousand five

something strange happened to us basically we were approached

by one company and

the company

said we will give you some money but we would like to have a different license

for the phoneme recognizer that was publicly available

so we said okay this is fine we will do this

it will help us

to finance the research but it turned out to be

quite a long experience it took almost

nine months just to

negotiate the license with the university

and we realised two things that there is interest from the commercial market which can be

a source of additional money

and that we need to do it in a better way

so then we started a company called phonexia

so i would like to talk about two main topics

how to transfer speech technology from research to the

market and then i would like to

show some challenges that we see from the user point of view

so at the beginning i will say a few words about the company

then about the typical use cases

the technologies that are behind our products

and also

how we transfer the technology to products and then i will indicate some

grand challenges

people usually don't know what is in speech

but if you look at this

at the

at this slide you can see that

there is information about the speaker it can be

gender it can be age it can be speaker

identity you can detect for example the emotional state

the manner in which the speaker speaks and so on then there is the content

you can detect the language you can detect dialects keywords phrases

you can do the whole speech transcription

maybe the topic is interesting

you can

do something domain incapable then the

but there are other modalities you can have some information about the

environment where the speaker is

to whom the speaker speaks you can have other sounds

what is up of user you go

very close animals

or you have a lot of information about equipment that was used

of the to get a relief what we on voice it can be the device


for example we also for ticket be transcription of huge it can be according to

you may be the test it in a speech quality


is important for user and the users can benefit from this information

about phonexia

it was established in two thousand six as a

startup from brno university of technology

it has its seat

in the czech republic in brno just a five minute walk from the university campus

if we speak about users we currently have users in more than twenty

countries

government agencies call centers banks

telecom operators broadcast service providers and others

the company is profitable and

it has needed no external funding so far

if we speak about the process how to transfer technology from research

to products and the market

there are several steps

the first important one is the research

it is usually done at universities or inside companies

and the goal is to get the

best technology so the main interest is usually the

accuracy and not so much the stability or the speed

the code is not of

main importance

and the what was important for a so it probably also for


at this stage we want to be as unlimited as possible so it means

quotative measurement


to the saba

okay so

open-source toolkit and so on

but then you need to get the technology to

the user somehow

and for this you need to do

the next step

you need to build a

code base that is stable

that is fast that has a well defined api documentation and

proper licensing

us assume so the D V D this is what is what exactly

then that is others that but

a better you know

then you need to build a product for your customers so you can have nice technology you

could have nice interfaces but if you don't have a product you won't

be able to sell

so here the focus is

functionality and this is done either by phonexia or by other companies

we never be

now i will mention

three main use cases

or three main customers there are others but i selected these three

the first are call centers of course there we are

active in two areas the first is quality control how to ensure

the quality of operators in

the call centers and then there is data mining from

voice traffic

about the quality control what is it about

in a call center

you usually have

some team leader or supervisor

that supervises the

then a but i there are so i just this better than do

but i think of course

evaluation of operators also

analysis of the results

for the team and some reporting

if there is no speech technology

usually only a small set of recordings is

inspected by the supervisor but

if you deploy speech technology you are able to control

a hundred percent of the calls so you are able to get better statistics


and everything here is about the

cost of staff

we are able to reduce the number of supervisors and to lower

the operating costs

so it is very important

to shorten the calls

if it you are able to the

you have

find problem so you really errors

some or but i just are not the us up despite the or a remote

well train the

i hear its is possibility to said that what is look training can that

usually a the

the fluctuation of people in such positions is

you know tens of

percent and so we are able to reduce this

the this amount of data and the again six of some calls

and about the technologies used for quality control the main technology is

vad and doing some

this does so i'm not easy so on


so you are able to get important statistics

like

how the dialogue starts the number of speaker turns


speech adaption times

often the call centres have

old equipment so if they have one channel for the conversation the speakers are

not in separate

channels and we need to do diarization

then it is possible to deploy keyword spotting

you have some obligatory phrases or you don't want people to

speak some words

you would like to have some

call script compliance the people should

follow the call scripts

and it is possible to add speech transcription mainly

for us

set a should

in this task

now about the data mining this is another large topic for call centers

here again we have two subtasks

one that i will mention is

call center overloading

you may imagine that you have

a call center of a few hundred people

and the there is large up

hundreds or thousands of people start calling

the call center so you need to react really quickly

you need to explore what is wrong and

maybe that some information to do initial i we are

a stage

and the japanese so you could be the this is solved by some be


in the call centers showing the topics that are just being discussed



i other important the

but you use the like the

i did value speech technologies for business of eight basically now a moan about so


indeed i don't know if you have for example how

may be done also

is looking for places to build

new

fast foods

the approach is that they go to a telephone operator and they ask

please could you

give us statistics of where the people that visit our fast foods are

during the day and

the place where there is the highest concentration is a good place

to start a new fast food

with speech technologies it is the same

if you have more information for example who is on the phone line is it a

male or female are there more people on the line or the

person was interested in some restaurants in boston

it helps you to go

the whole business certain to push to some more

for this we usually use speech transcription

and then

some data mining tools on top of it

of course it is possible to at the

so i changing if you want to the session a narcotics


the other big group are banks

banks of course have call centers so what i mentioned on the past

two slides

is important also here

but there are two other large tasks

for banks the banks need to ensure security but on the

other side

they need something that is

pleasant

for the user and that

doesn't bring

much complications

so here voice biometry is very interesting it can be voice biometry

using a key phrase or it can be voice biometry that is done

on the background using a text independent speaker identification system

and the other

task is fraud protection imagine there are people calling to the bank

for example several times a day

each time with a different identity and they are

requesting loans

it is hard to detect this if you don't have technology

but if you have technology like speaker identification it is really simple

now about the intelligence agencies

for intelligence agencies the situation is usually that these agencies have

a really huge amount of data

the amount is such that

they are not able to process it

manually this can come from

telecommunication networks from communication on the internet and so on

they are really looking for a needle

in a haystack

and for this it is possible to use a combination of technologies

so we are using a combination of technologies

language identification gender identification speaker diarization keyword spotting speech transcription

data mining tools

and also correlation with some other metadata for example from

text

and of course they are also very interested in forensic speaker identification


now i will go a bit deeper into the technologies

and tell you what is important for practical deployments

here are some of the technologies i don't want to speak about all of them you

can come and ask me if you have any question

about the voice activity detection

i would say that this is the most important part for practical deployment

you can see our process

by this is the most important part

you can have

very nice results for example on nist databases

but when you deploy at a target site

the users are working with such channels that a huge quantity of the traffic

it can be

tens of percent is not speech at all

it is some technical signal like dialling tones or faxes

and if you don't have vad

built in

it is really hard to work with such channels

so we are using an energy based vad

at the beginning to remove a very large portion of the silences

then technical signal removal like tone detection and removal

back to the spread like that are in mobile station i

signals and so on

and then we have a vad based on F zero tracking

because speech has specific characteristics

and a

specific behaviour of

F zero and then we have a neural network based vad

to get a very precise segmentation
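The first stage of the cascade described above (energy based removal of silence) can be sketched as follows. This is only a minimal illustration and not the speaker's implementation: the function names, the 160-sample frame (20 ms at an assumed 8 kHz telephone sampling rate) and the 30 dB margin are all assumptions made for the example.

```python
import math

def frame_energies_db(samples, frame_len=160):
    """Return the log energy (dB) of consecutive non-overlapping frames."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        power = sum(x * x for x in frame) / frame_len
        # A small floor keeps log10 defined for all-zero (digital silence) frames.
        energies.append(10.0 * math.log10(power + 1e-12))
    return energies

def energy_vad(samples, frame_len=160, margin_db=30.0):
    """Keep frames whose energy is within margin_db of the loudest frame.

    Hypothetical first-pass detector: later stages (technical signal
    removal, F0 tracking, a neural network) would refine these decisions.
    """
    energies = frame_energies_db(samples, frame_len)
    if not energies:
        return []
    threshold = max(energies) - margin_db
    return [e > threshold for e in energies]
```

A detector like this only discards obvious silence cheaply; it cannot tell speech from dialling tones or music, which is why the talk describes further stages after it.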

as i said it is a very important technology and there are still many challenges

so it is important

the accuracy of vad

directly affects the accuracy of the other technologies

us some sort are actually trying just you can have music


that they're other speakers sounds of like people tend to

well a four or something like this

you have a an alignment silence

and different technical signals

what is a challenge is vad at variable and low snrs

and on distorted speech

what is also important i think is an

automatic way

of training vad because we know that we can tune it to a

specific channel

by training

some good classifiers

but how to generalize this is still difficult and of course distant mics


now about the language identification

currently we are able to recognize about fifty languages

and

what is even more important

is that the user can add a new language

this is important especially for the intelligence community because they will never

tell you which languages are of interest to them

and you won't be able to collect the data

they have much easier access

to such data

we are using i-vector based technology

and it is commanded training okay and

we have first of all men which means that

the language print is a less than


"'kay" a record


a better file

this is the technology behind it there are several stages

here we have feature extraction

collection of statistics using the ubm

usually

we use a gmm and then the statistics

are projected to some subspace the subspace is estimated on a large quantity of

data to model the

variability in the speech

so at the end we get the

point estimate of where we are in the subspace

this part is prepared by phonexia

but then there is the other part

and that is the classifier of languages we use multi class logistic regression

and this can be done by the user
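The last stage mentioned above, a multi class logistic regression trained on fixed-length language prints, can be illustrated with a small sketch. It assumes the i-vectors have already been extracted and are given as plain lists of floats; the function names, learning rate and number of epochs are hypothetical, and plain stochastic gradient descent stands in for whatever training procedure is really used.

```python
import math

def softmax(scores):
    """Turn raw class scores into probabilities (numerically stable)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def class_scores(x, weights, biases):
    """Linear score per language: w_k . x + b_k."""
    return [sum(w_d * x_d for w_d, x_d in zip(w, x)) + b
            for w, b in zip(weights, biases)]

def train_language_classifier(vectors, labels, n_languages, epochs=200, lr=0.5):
    """Fit multiclass logistic regression with stochastic gradient descent."""
    dim = len(vectors[0])
    weights = [[0.0] * dim for _ in range(n_languages)]
    biases = [0.0] * n_languages
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            probs = softmax(class_scores(x, weights, biases))
            for k in range(n_languages):
                # Gradient of the cross-entropy loss w.r.t. the score of class k.
                grad = probs[k] - (1.0 if k == y else 0.0)
                biases[k] -= lr * grad
                for d in range(dim):
                    weights[k][d] -= lr * grad * x[d]
    return weights, biases

def identify_language(x, weights, biases):
    """Return the index of the most probable language for one vector."""
    probs = softmax(class_scores(x, weights, biases))
    return max(range(len(probs)), key=lambda k: probs[k])
```

Because only this small classifier has to be retrained, adding a language needs nothing more than a handful of vectors for the new class, which fits the point that the user can do it without sharing data.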

about speaker recognition

there are many tasks like speaker verification speaker

search speaker spotting link analysis


after normalization some house

sometimes social network analysis

we can work in text independent or text dependent mode

it is an i-vector based approach

we use diarization

and i think what is important here we have

user based system training and calibration

that again helps

people a lot

it is here

so the structure is the same as in the case of

language identification

but here we are focused on speaker variability

then we have some normalisation of voiceprints simply by

mean subtraction that can be done on the user side
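The mean subtraction mentioned above can be sketched in a few lines. This is an illustration under assumptions: voiceprints are taken to be plain lists of floats, the mean is estimated from a small set of recordings collected on the user side, and cosine similarity is used as a stand-in scoring function (the talk does not say which scoring is actually used).

```python
import math

def mean_vector(vectors):
    """Component-wise mean of a set of voiceprints from the target deployment."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def center(vector, mean):
    """Subtract the deployment mean so that a shared channel offset is removed."""
    return [v - m for v, m in zip(vector, mean)]

def cosine(a, b):
    """Cosine similarity, used here only as a simple illustrative score."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

With a large shared offset, uncentered voiceprints of different speakers look almost identical to a cosine score; after subtracting the mean estimated on the user's own data, the differences between speakers dominate again.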



if we have well


then we compare voiceprints with a classifier that is

pretrained by phonexia but

we allow our user to

retrain or adapt this classifier

this is very important because

it is hard to get any recordings from clients

if you deliver such a system to clients

and they are able to adapt the system the amount of data can be

really small it can be

for example fifty speakers with just a few recordings of each

well i'm

we saw that on

normal telephone channels we are able to get about forty percent improvement

on new deployments

and if it is about some

special channel

for example many directions

we saw a hundred percent improvements just

with this

simple trick

and of course what is also

important is calibration


in our case we are doing duration calibration because this

is also not seen too much in

nist because there the recordings are

about two and a half minutes long but if you have

a huge

variability in duration you need to handle this also for

short recordings

solve it

by do some up for a times

what are the challenges in language identification and speaker identification

i think that the main challenges are very short recordings so it can be

less than three seconds

and what is

very important for us is

keeping the training on the user side

why less than three seconds if you have a deployment of speaker identification

and you would like to deploy it in a bank the people don't want to speak

they would like to have

the decision even before they start speaking

so i would say that ten seconds is

the maximum for

enrollment that is accepted

and really three seconds for verification

you can do it with text dependent systems

it is harder to do it with text independent systems but in case of text

independent systems

these two steps are done on the background while the

user is talking to

the operator


then there is the question how to ensure

accuracy over a large number of acoustic channels and languages

the technologies are more and more general and

independent but there is still some dependence

what is also important are graphical tools

for the user to visualize the information and to do the calibration

if you don't do this for the user the user will never do it

himself

what we see also as very challenging is

language identification and speaker identification on voice over ip networks because

there you have packets and you have packet loss

and if you have this loss

usually the codecs are doing something they either put there zeros or

they synthesize speech

but this is not the speech of the speaker this is something

generated by the decoder

so this is also a very important topic

and of course distant mics

now i would

say a few words about diarisation because this is a very important technology

useful for example for call centres but also

for

other users

we are using two approaches one approach is really fast but not so

accurate

this approach is based on

clustering of i-vectors so we basically split the audio into small chunks and do

clustering of the i-vectors
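The chunk-and-cluster idea above can be illustrated with a small agglomerative clustering sketch. It assumes one fixed-length vector per audio chunk has already been extracted; the average-linkage cosine criterion, the function names and the fixed target number of speakers are assumptions made for the example, not a description of the actual system.

```python
import math

def cosine(a, b):
    """Cosine similarity between two chunk vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def cluster_chunks(vectors, n_speakers):
    """Greedy agglomerative clustering of chunk vectors.

    Starts with one cluster per chunk and repeatedly merges the two clusters
    with the highest average pairwise cosine similarity until n_speakers
    clusters remain. Returns one speaker label per chunk.
    """
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > n_speakers:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
                sim = sum(cosine(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)
                if best is None or sim > best[0]:
                    best = (sim, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])
        del clusters[b]
    labels = [0] * len(vectors)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels
```

Every merge here is a hard decision that is never revisited, which is the weakness that a probabilistic approach with soft assignments tries to avoid.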

but

then we have taken the

fully bayesian approach that was initially investigated by

fabio valente

and patrick kenny worked with this too

on the reset assures

quite with the text and it it'll be the

this is an approach where you don't do a hard decision

during the process you have everything in

probabilities and you are going to do the decision not at the beginning

but at the end

this approach is i would say

but if you're at the but you want see

well my next slide it's not

fully to


memory cons i'm mean and the quite small

so what are the challenges

in diarisation

from my point of view diarization

is still a technology that needs quite a lot of research

it is very sensitive to initialization

it is very sensitive to

non speech sounds

do you usually it is about the wall so you got more gaussian before

for example if there are

new sounds that you haven't seen in your training data for

example we asked

the system

to give us two speakers

but the output was such that

we got the two speakers

together

under one label and under the second label

there were segments

with other non

speech sounds

i think of a lot in this case the

so what is important is a very

good accuracy of your vad if you have

i just sounds the that the speech

it can hardly due to the adaptation

a so it's a it is but very sensitive to two so that speech and

also and then which is

what we see is that with these systems you can

easily reach a diarisation error rate

close to one percent on nist data


well what we also saw

and the

you the

first it's us

we could be that the rest of the one percent the

is that the there won't be pro by means segmentation about it's fails so

forty four for this recording good did this is the

usually with two speakers with very similar voices but this happens

okay so i think there is a challenge

there was a lot done during the past two years but

the challenge is quite the

and of course you can speak about

text and distant mikes for

the of

processing cove of

the or like for example of deviance

about keyword spotting

so

we are using two approaches one approach is something you probably all

know

probably from project

it is the lvcsr based keyword spotting

this is very good

but

it is expensive for development

the other keyword spotting that we are using is acoustic based

here the difference is that

there we usually use a larger acoustic model

here it is a simple neural network based acoustic model

there is no language model or there is a simple language model but here it is much

cheaper for development

so in case of

lvcsr we are speaking about hundreds of hours of training data

in case of acoustic keyword spotting

we are speaking about tens of hours of acoustic data or even less

for speech transcription what we are using

is probably not important as all of you are working in this

field

we are using a system based on

bottleneck features in combination with other features hlda vtln

a gmm based system or a neural network based system

speaker adaptation

an n-gram language model and we generate

usually confusion networks

what are the challenges

from the deployment point of view here

well of course the accuracy is

still important

but i would say that it is not the most important challenge

the challenges are speed

lower memory consumption

how to train new systems automatically because we would like to do it

and how to run

hundreds of recognizers in parallel

so efficient computation

and also

how to

do feature normalization and

adaptation for any length of

speech utterance

because for example if we transcribe

long sources lectures or whatever

we try to put in

as much adaptation as possible but if you are working with very short

segments like

three seconds or less

the adaptation

is hard to do and usually you will see worse results


the system

so one solution is to remove this adaptation


the system to be less robust to train

now how to sell speech transcription

what we found is that if you have speech transcription

and you want to sell this technology it is quite hard

you need to have

something on top of this technology that will really present the information to the user

because

there is too much text

and there are still some errors

our experience is that

the user

will never be happy about the accuracy of

the speech recognition system if there are errors in words the user mentions

this

and if the words are correct they start to complain about prepositions and suffixes and if

this is correct

they start to complain about punctuation marks or grammar

this is but

if you use some

form of summarization

and representation how to look at the data

it will help you

to sell the technology and we are doing it in such a way that we

integrate with existing text based data mining tools

integration is done

usually on the level

of

confusion networks

this is one of the tools

this was developed

by our partner company

so you have set session in gina

here you can have a very complex query here are

the documents that were found

the document

but you need to

write somehow the query and the query can be very complex

so is here is


but the was so it's one possible ability you but if you want to

we have described topic so you can use this but it there

or you can go from update time you can result i still they are you

can look at what is the correlation among words

and the

you can you can

take this could happen automatically two classes

well i mean you have these so you can

here are need the correct someday time

then train

statistical based approaches

or you can deploy it for example to see

how

the topics

are evolving in time

of the input

now i have

two slides on how we transfer the code

what is the time please

all cases so it each

okay so just quickly

this is how we

transfer the code i think in

two thousand seven we decided to write our speech core from scratch

the reason was that we wanted to have something very

stable very fast

and with proper interfaces

the speech core has more than two thousand five hundred objects covering the

area of speech processing

it is a more then

minimal first lines of source code and it ceases steely

still use it might enable


the approach to transfer from the research

the research is usually done using standard tools like

S T K in a car be by transcripts

i think it it's all through the nose these two kids that this is

for of

hmms reconnect the but this is for neural network training good colour be it's made

in the by then pour we

and so and so on

but then we can use our code base

and we can implement a new system into our speech core quickly in

just two days

without a single line of c plus plus code being written

everything is done

through a configuration file and this configuration file

looks like this

you have some objects and the description of each object

and this description is mapped to the

c plus plus interface of the object to its set functions

and then we have some framework to connect the objects together

so some

you of fun we have four or the artemis entity

but if we need a new algorithm we just go and write one

more simple object
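The configuration-driven design described above (objects described in a configuration file, parameters mapped to set functions, and a framework that connects the objects) can be sketched roughly as follows. Everything in this sketch is hypothetical: the `Gain` and `Framer` objects, the `set_` naming convention, the dictionary-based configuration and the simple linear chain are illustrative stand-ins, written in Python rather than the C++ used by the real speech core.

```python
class Gain:
    """Hypothetical processing object: scales every sample by a factor."""
    def __init__(self):
        self.factor = 1.0

    def set_factor(self, value):
        self.factor = float(value)

    def process(self, samples):
        return [s * self.factor for s in samples]

class Framer:
    """Hypothetical processing object: cuts a signal into fixed-length frames."""
    def __init__(self):
        self.frame_length = 160

    def set_frame_length(self, value):
        self.frame_length = int(value)

    def process(self, samples):
        n = self.frame_length
        return [samples[i:i + n] for i in range(0, len(samples) - n + 1, n)]

# Registry mapping the type names used in the configuration to classes.
OBJECT_TYPES = {"gain": Gain, "framer": Framer}

def build_pipeline(config):
    """Create objects from the configuration and call their set functions."""
    stages = []
    for entry in config:
        obj = OBJECT_TYPES[entry["type"]]()
        for name, value in entry.get("params", {}).items():
            # Each parameter name maps to a setter such as set_factor().
            getattr(obj, "set_" + name)(value)
        stages.append(obj)
    return stages

def run_pipeline(stages, data):
    """Connect the objects in order: the output of one feeds the next."""
    for stage in stages:
        data = stage.process(data)
    return data
```

With such a layout a new system is just a new configuration, for example `[{"type": "gain", "params": {"factor": 2.0}}, {"type": "framer", "params": {"frame_length": 2}}]`, and new code is only needed when an algorithm is missing from the registry.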

about interfaces

the customers are used to

their specific interfaces

they don't want to change their habits so we

then developed a

set of interfaces c plus plus java c sharp and the mrcp protocol used for

ivr for that there is a nice open source project

a rest interface to build web based services

and so on

this is the common framework

of a web based

solution

with a speech server an application server a database server and some clients

this is just an example of

our testing client

okay so now i will just summarise in

three slides

some ongoing challenges that i see now

one very important challenge is

training data as a small company it is difficult for us

to get the data it is expensive

and the common approach allows us just

to go and buy it

so we are working

on cheaper ways how to do this

i think a great inspiration is google

but we did something similar in language identification

the question is if we are able to get data that

we can use for training of

speech recognition systems that can be deployed on both

telephone speech and

broadcast

so one possibility that we explored was to use broadcast

for this

but not the whole content but

to automatically detect

phone calls in the broadcast this ensures

a high variability of speakers dialects and speaking styles

the language can be verified using automatic language identification

we need some

small amount of data to bootstrap this approach but then it is possible

there is speaker identification so the variability of the speakers can be controlled by current

speaker id technology

then we would like to

transcribe some of the speech using crowd sourcing

and then use unsupervised training for adaptation


currently we discuss this with

several companies and we would like to form a consortium for this

we have some experience from when we did this

project for language identification with ldc and nist

it turned out to be very successful and now it is like

mainstream in language identification

we have one line

up or go for adaptation

it is backed by our

partner companies the spin-offs from but and we believe that we could

reduce the cost of the development of a new recognizer

to variable and models

so if you are interested and you would like to know more just

send me an email and

we can discuss this

then another challenge we see is that

we have quite robust technology but still the deployment

is hardly bring some of somebody six of each customer list

but if the specified

if we have deployments in many countries

we never know what will be the final

accuracy of

the technology and often we need to do adaptation again so

project so that i mention on the previous slide so we have

to word this

usually if we speak about the technologies

we claim to

the customers that the technology is

language independent

channel independent but there is always some fluctuation

the only possible way i see to reduce this risk

is to evaluate these technologies

on many languages and to know the results in advance before the technology is deployed

so for this again the data collection project can help

and we are thinking about how

to extend some approaches to cover a much larger number of spoken languages

because for language identification we have a collection of about fifty languages and

it is growing rapidly

and

a final remark

what we see is that researchers

focus mainly on accuracy most of the research articles

are describing some improvement in accuracy but if we speak about the commercial market

i think that any improvement in speed

or

something that can

reduce

the cost of hardware

can help and can

help you to

have a successful technology

we saw in some projects that the hardware cost is really large it can be

fifty percent of the cost of the project and so on

so this is everything from me and thank you for your attention if you have any

questions please ask

any questions

so how did you do that didn't you have to go to venture capital

or something like that

we were considering this approach too but

you know at the beginning it is hard to get

money from venture investors

so we started in the way that we went to customers and asked

or negotiated some contracts

and we just started with contracts so basically

custom development and

we earned some money on this custom development and then we continued developing technology

and we started with some good technology and then

went to products and so on

i have a question are

your solutions on site or are they based on cloud services

most of the solutions

i would say that actually both is possible because

we can use the technology on site but we also have web based

interfaces for example there is the rest interface that can be used for

cloud deployment

but do you have a lot of cloud deployments or not

no most of our current deployments are

local like on site deployments

but we have

the spinoff at the battalion is it of technology this is to play well

that is for example recording the lecture here

this is already cloud based

i don't at the of lectures


so you started off connected with

with university do you still do you have now it's to say that you projects

that at the next cnns are an issue with that in terms of their

we only with the company a with the government

we are doing this in different ways we have students

at phonexia we have some people from the university

with some contracts we have joint projects

the sort out differently so


that's one thank you thank you