
I would like to start the third and final invited talk.

Our speaker graduated in computer science and mathematics, and since then she has been at Cambridge University, where she received her MPhil and then her PhD in statistical dialogue systems. She was then a research associate, and most recently she has become a lecturer in spoken dialogue systems in the Department of Engineering. She is also a fellow of one of the colleges at Cambridge University.

She is extremely well known, I'm sure, to everyone in this community, because she is very well published, including a number of award-winning papers, and she is co-author of one of the nominees for best paper at this SIGDIAL. After her talk, if you still want to, you can dig even deeper into her and her colleagues' research: they have two posters at the poster session this afternoon.

Please welcome Milica Gašić.

Thank you for the lovely introduction. There was really no way of getting out of this, right? SIGDIAL really is one big family, and if a family member asks you to do something, you kind of have to do it. So, thank you very much.

I will be talking about what is needed to build natural conversation, how deep learning can help us along the way, and some of the efforts we have made in the Dialogue Systems Group in Cambridge to achieve that.

I am sure that we all agree that spoken conversation, and in particular dialogue, is one of the most natural ways of exchanging information between humans. We can read a book and then be able to talk about what we have just read. Machines, on the other hand, are very good at storing huge amounts of information, but they are not so good at sharing this information with us in a natural, human-like way. As I'm sure you know, lots of companies now offer virtual personal assistants, and these handle billions of calls; but the current models are very unnatural, operate in narrow domains, and frustrate users.

So the research question that I want to address is: how do we build a continuously learning dialogue system capable of natural conversation?

Machine learning is very attractive for solving this task. If I had to summarize machine learning in just three words, these would be: data, model, and prediction. So what are they in our case? The data is simply dialogues, or some parts of dialogues, such as transcribed speech, annotated user intents, or recorded user feedback. The model is the underlying statistical model that lets us explain the data that we observe. Once we train the model, we can make predictions: what the user said, and what to say back to the user.

Traditionally, building statistical dialogue systems has assumed the following structure. A classical dialogue system consists of a speech understanding unit, a dialogue management unit, and a speech generation unit. When the user speaks, their speech is first recognized by a speech recognizer; then a semantic decoder and a state tracker produce the dialogue state, a summary of where the dialogue currently is. Based on that, the policy makes a decision about what to say back to the user, and very often there is also some kind of evaluator which rates how good this decision was. Finally, a generator produces the textual output, which is then presented to the user via a text-to-speech synthesizer. Behind all of these modules sits the ontology, a structured representation of the database that the dialogue system can talk about: this is the structured knowledge we use in goal-oriented dialogue systems.
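To make the data flow concrete, here is a minimal, hypothetical sketch of such a pipeline (all names, the toy ontology, and the keyword-spotting "understanding" are illustrative stand-ins, not an actual system):

```python
# toy pipeline: understand -> track -> decide; every stage here is a
# hand-written stand-in for what is statistical in a real system
from dataclasses import dataclass, field

@dataclass
class DialogueState:
    # per-slot evidence, e.g. {"food": {"thai": 1.0}}
    beliefs: dict = field(default_factory=dict)

def understand(utterance: str) -> dict:
    # stand-in semantic decoder: keyword spotting against a tiny ontology
    ontology = {"food": ["thai", "indian"], "area": ["north", "south"]}
    hyp = {}
    for slot, values in ontology.items():
        for v in values:
            if v in utterance.lower():
                hyp[slot] = v
    return hyp

def track(state: DialogueState, hyp: dict) -> DialogueState:
    # accumulate evidence for each recognized slot value
    for slot, value in hyp.items():
        dist = state.beliefs.setdefault(slot, {})
        dist[value] = dist.get(value, 0.0) + 1.0
    return state

def policy(state: DialogueState) -> str:
    # ask about any slot we know nothing about, otherwise recommend
    for slot in ("food", "area"):
        if slot not in state.beliefs:
            return f"request({slot})"
    return "recommend(restaurant)"

state = DialogueState()
state = track(state, understand("I want thai food"))
action = policy(state)   # the area is still unknown, so the system asks for it
```

The point is only the shape of the loop; each of these stages is a learned model in the systems discussed in this talk.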

A lot has happened in the last decade in automatic speech recognition. Some researchers go as far as saying that ASR has reached the performance of human listeners. I wouldn't say that, but I would just like to point out that there is clearly much more to be done in the rest of the structure of a spoken dialogue system.

Now, in this modular structure there is a loss of information between the modules, and the question is what we can do to alleviate it. What we can do is pass probability distributions between these modules. When you do that, the semantic decoder outputs a distribution over hypotheses rather than a single best guess, and the state tracker becomes the belief tracker module: it produces a distribution over possible dialogue states, the belief state.

But before I go further into explaining how these systems work, let's just take a step back and see how such systems are currently perceived by their users.

[Video] I have a personal assistant on my phone, and I use it for pretty much everything. I ask, "What time is Tim's wedding tomorrow?" and it says, "Here is what I found on the web." "No, PA, what time is Tim's wedding?" "Tim's wedding is at five." "At five? PA, when is Tim's wedding?" "Tim's wedding is at five." "Okay, never mind. Set an alarm for a quarter to five a.m."

You see, it takes care of pretty much everything that I could probably do myself, but it is kind of comforting to have someone to talk to. It did not, however, prevent me from spilling coffee all over my trousers just before the ceremony. "PA, I've got a problem." "Tim's wedding is in one hour and twenty-five minutes." "Tim's wedding! I've got coffee all over my trousers!" "Are there any department stores?" "Your favourite department store is..." "No, that's not it. PA, find me a shop." "I'm not sure I understand." "Do any stores...?" "That's an interesting question." "Okay, call me a taxi to the centre." "I'm sorry, I can't help with taxis." "What do you mean you can't call a taxi?" "I'm sorry, I cannot help you with that." "So how am I supposed to get to the wedding?" "Tim's wedding is in one hour and twenty-four minutes." "For the last time: Tim's wedding!" "Tim's wedding is in one hour and twenty-four minutes." "Well, thanks, PA, you're really helping me out." "It is I who should be thanking you."

So that was of course just a spoof, but it resembles what interacting with an actual virtual personal assistant is really like.

Okay, so let's go through the problems that we want to address. The most obvious one from this video is emotional awareness: this personal assistant was completely unaware of the user's emotion and state. But there are some things we need to address before that.

The first problem is that belief trackers still do not scale, and it often takes a dialogue system far too long to extract context. The second problem is that the choice of response is not very rich: the learner chooses between a very small set of actions, and if we want to build natural conversation, we need to allow our systems to choose between a wide variety of actions. And finally, systems do not adapt to different user needs. This can be interpreted in many different ways, but it is clear that we need to model the user better if we want to achieve a better dialogue system.

So let me first spend a bit of time explaining why we need to track what is going on in the dialogue at all. Here is a toy example of a dialogue system that can talk about restaurants. The user says: "I am looking for a Thai restaurant." "Thai" and "high" are very acoustically similar, so there is very likely to be a misrecognition, and we get "high restaurant" as the top hypothesis. We now extract the dialogue state, and we do that based on the ontology for our domain: which domain it is, restaurants or something else, and slot-value pairs. Here the system is sure about the choice of domain, but not so sure about the food slot, so the system asks: "What kind of food would you like?" The user answers "Thai", which again gets misrecognized as "high". If we do not track uncertainty at this point, which is essentially what happened before belief tracking, the probability of "Thai" stays very small, and the system has no option but to ask the same question again: "What kind of food would you like?" And this is what is particularly annoying to users: being asked the same question again.

Now look at what happens with belief tracking. "Thai" appears in this turn, and, if you remember, it was also hypothesized in the previous turn. So although the per-turn probability of "Thai" is very low, the overall probability becomes higher, and "Thai" now surfaces among the top hypotheses. The system then has the option of confirming, "Did you say Thai?", which is a much better action.

It would be nice to build completely uncertainty-free systems, but since we cannot, the question is: how do we manage this uncertainty?
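As a first illustration, here is the simplest possible turn-level update for a single slot, a classic hand-crafted rule that predates the neural trackers discussed later (toy numbers, purely for illustration). It shows how two weak observations of "Thai" accumulate into a stronger belief:

```python
def update(belief, slu):
    """One-slot belief update: b'(v) = p_t(v) + b(v) * p_t(none), i.e. new
    evidence plus the old belief, discounted by the probability that this
    turn provided no value for the slot."""
    p_none = 1.0 - sum(slu.values())
    new = {v: b * p_none for v, b in belief.items()}
    for v, p in slu.items():
        new[v] = new.get(v, 0.0) + p
    return new

belief = {}
belief = update(belief, {"thai": 0.4})  # turn 1: low-confidence "thai"
belief = update(belief, {"thai": 0.4})  # turn 2: the same weak evidence again
```

After two turns with per-turn confidence 0.4, the accumulated belief in "thai" is 0.64, which can be enough to trigger a confirmation rather than re-asking the same question.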

If you think about it, this looks like a very simple problem: all you are doing is matching the concepts that you have in the ontology against what the user said. But the problem is not simple, because, as we all know, there are so many ways you can refer to a particular concept in natural language. What people used to do is build a separate belief tracker for each of these concepts, and that is something which doesn't scale if you want to build natural dialogue systems.

So let me talk about scaling up belief tracking. The solution to this problem is to reuse the knowledge you have about one concept for another concept, because we cannot hope to have labeled data for every concept we would like a dialogue system to know. Real humans adapt very readily to new situations, and they reuse what they already know to do so. The key ingredients for large-scale belief tracking are semantically constrained word vectors and adequately shared parameters.

Let me first explain what we mean by semantically constrained word vectors. Normally you have some closed sets of concepts: domains like restaurants, slots like price range, and values like cheap. What you want from your word vectors is that words which are semantically similar are close in the vector space; but you also have to take into account what kind of application you have. For instance, in general English, if you talk about heads of state, "queen" and "king" are semantically similar. But in a dialogue system, if the user says "I'm looking for something in the north" and then "I'm looking for something in the south", then "north" and "south" in this context mean completely opposite things.

To deal with this limitation, a former PhD student from our group, Nikola Mrkšić, used semantic constraints, synonyms and antonyms, to refine the vector space. In the resulting space, "cheap" and "expensive" end up very far apart, while synonyms such as "cheap" and "inexpensive" end up close together. The anchors of this space are the concepts from the ontology, and around them are the ways in which a user may refer to those concepts.
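The effect of such constraints can be illustrated with cosine similarity over a few hand-made toy vectors (the numbers are invented purely for illustration; real counter-fitted vectors are learned, high-dimensional objects):

```python
import math

def cos(u, v):
    # cosine similarity between two vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# toy stand-ins for semantically constrained embeddings: synonyms pulled
# together, antonyms pushed apart (all values made up)
vec = {
    "cheap":       [0.9, 0.1, 0.0],
    "inexpensive": [0.85, 0.15, 0.05],
    "expensive":   [-0.8, 0.2, 0.1],
    "north":       [0.0, 0.9, -0.4],
    "south":       [0.0, -0.9, 0.4],
}

sim_syn = cos(vec["cheap"], vec["inexpensive"])   # high: synonyms
sim_ant = cos(vec["cheap"], vec["expensive"])     # negative: antonyms
```

In a plain distributional space, "north" and "south" would sit close together because they appear in similar contexts; the constraints are what push them apart.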

So how do we use this for scalable tracking? When you track a dialogue, you typically have three questions. The first question is: is what the system is saying referring to something we have in the ontology? The second question is: is what the user is saying referring to something we have in the ontology? And the third question is: what is the answer, given the context of the conversation so far?

Let's go through the first question. The system utterance, "How may I help you?" or anything else, is mapped via word-vector embeddings into a feature extractor. The idea is to make these feature extractors share parameters. In our case they are CNNs, but this could be any kind of feature extractor, for instance bidirectional LSTMs. There is one generic extractor for the domain, one for the slot, and one for the value. Then we have the ontology, so we have embeddings for, say, "restaurant", "name", and "price range". We calculate the similarity between what our feature extractor produced for the domain and the domain embedding, and likewise for slots and values. The same procedure is applied to the input we got from the user. To answer the third question you need context modelling, so an RNN, or a GRU, anything which has a recurrent connection, can help you keep track of the context.

What you then get is a probability for the domain; with the same procedure, a probability for a slot-value pair; and when you combine them, you get the probability that the user is talking about a particular domain, slot, and value. You do this for everything in your ontology, and that gives you the belief state for the current turn.
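A heavily simplified sketch of this scoring idea: shared machinery that turns the similarity between an utterance representation and any ontology candidate's embedding into a probability. The toy vectors and the bag-of-vectors "feature extractor" below are hypothetical stand-ins for the CNNs described above:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm or 1.0)

# made-up 3-d word vectors
word_vec = {
    "restaurant": [1.0, 0.0, 0.0], "food": [0.0, 1.0, 0.0],
    "thai": [0.0, 0.8, 0.6], "cheap": [0.0, 0.0, 1.0],
}

def featurize(utterance):
    # stand-in feature extractor: average of the known word vectors
    vecs = [word_vec[w] for w in utterance.lower().split() if w in word_vec]
    if not vecs:
        return [0.0, 0.0, 0.0]
    n = len(vecs)
    return [sum(col) / n for col in zip(*vecs)]

def candidate_prob(utterance, candidate):
    # sigmoid of similarity -> probability the candidate was mentioned;
    # the same function scores domains, slots, and values alike
    s = cosine(featurize(utterance), word_vec[candidate])
    return 1.0 / (1.0 + math.exp(-5.0 * s))

p_thai = candidate_prob("i want thai food", "thai")
p_cheap = candidate_prob("i want thai food", "cheap")   # lower than p_thai
```

Because nothing in `candidate_prob` is specific to one concept, adding a new value to the ontology needs only its embedding, not a new tracker, which is the scaling argument made above.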

So we evaluated this tracker. But how can you evaluate belief tracking? You need data with labels. In Cambridge we built a framework to create labeled datasets, and it is based on the Wizard-of-Oz setting. You have two humans: one represents the system, and has access to the database; the other represents the user, and is provided with a task to complete, the user goal. The two talk to each other, and the channel between them is actually typed text. They also annotate the state: the person playing the system annotates what the user is saying, so we directly get labeled data.

The first dataset we collected this way is very small: one thousand two hundred dialogues in a single domain, with a small number of slots and a small number of values. Recently we collected a much larger dataset, with almost ten thousand dialogues across several domains. The great thing here is that the change of domain does not only happen at the dialogue level but also at the turn level. It has much longer dialogues, many more slots, and many more values.

We compared this model to a very strong baseline, the Neural Belief Tracker, which was also developed by Nikola Mrkšić, but which does not do this knowledge sharing between different concepts. On the smaller dataset, our model outperformed the Neural Belief Tracker on every slot, as well as on the joint goal. What happens on the larger-scale dataset? There the problem is a bit more complex, because you are also tracking domains, and the Neural Belief Tracker was not able to track domains, so we compared to a variant of our model trained on each single domain. Here, too, the model with knowledge sharing outperforms the alternative. Looking at the numbers, they are generally lower overall, which shows that this dataset is much richer and more difficult to track. And although it is not standard to report such a baseline, just to show you how difficult this task is: a naive baseline gets only a small fraction of the accuracy, whereas with knowledge sharing we reach 93.2 percent.

This work was done with my student Osman Ramadan, and if you are here next week for EMNLP, you will hear more about this dataset there.

I am now going to move to the second part: the dialogue policy. What is the difference between belief tracking and policy optimisation? At a given point in a dialogue, the belief tracker accumulates everything that has happened so far, which is important for choosing what to do next; so belief tracking summarizes the past. What does the policy do? The policy asks, at this point: if I take this action, and actions here are dialogue acts, will it be the best course of action, will the user be satisfied at the end of this dialogue? So the policy has to look into the future: it is the planner.

And what is the machine learning framework that allows us to perform planning? That is reinforcement learning. In reinforcement learning we have our dialogue system interacting with a user. The system takes actions, and the user responds with observations. Based on these observations we create the state, and the user occasionally gives us a reward. When I say user here, it may be a real user or it may be a simulated user; it does not really need to be a human for the dialogue system to learn. The policy is the mapping from states to actions, and we want to find the policy which gives the largest long-term user satisfaction: the optimal policy, which exists.

Let me briefly remind you of some concepts from reinforcement learning that we will need here. The most important is the concept of the return. At a given point in the dialogue, the return is the random variable which says what the overall reward from this point onwards will be. Because it is a random variable, we can only estimate it. The expectation of the return starting from a particular belief state is the value function, and if we take the expectation starting from a particular belief state and taking a particular action, it is the Q-function. Estimating the value function, the Q-function, or the policy is equivalent in the following sense: if we find the optimal Q-function, we will also be able to find the optimal policy. In deep reinforcement learning, the value function, the Q-function, or the policy is approximated with a neural network. This is good because neural networks give us non-linear approximation, which is preferred in reinforcement learning, since most of these functions are non-linear. The downside is that the optimization at best reaches a local optimum.
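In standard notation, with belief state b, action a, per-turn reward r_t, and discount factor gamma, the quantities just introduced are:

```latex
R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}, \qquad
V^{\pi}(b) = \mathbb{E}_{\pi}\left[ R_t \mid b_t = b \right], \qquad
Q^{\pi}(b,a) = \mathbb{E}_{\pi}\left[ R_t \mid b_t = b,\ a_t = a \right]
```

and the equivalence mentioned above is that the optimal policy can be read off the optimal Q-function as $\pi^*(b) = \arg\max_a Q^*(b,a)$.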

Probably the most famous deep reinforcement learning algorithm is Deep Q-Networks, DQN. What does DQN do? It approximates the Q-function with a neural network parameterized by weights, and we take the gradient of a loss which is the squared difference between what our parameterized function estimates and a target built from the reward plus the discounted value of the next state. The problem with DQN is that it uses biased estimates, the data it trains on are correlated, and the targets are non-stationary. This is why DQN is a very unstable algorithm: it can often happen that it gives you good results, but sometimes it simply does not work at all.
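The moving parts of the DQN loss can be sketched with a linear Q-function standing in for the deep network (all weights, features, and actions here are made-up toys):

```python
def q(theta, state, action):
    # theta maps each action to a weight vector; Q is a dot product
    return sum(w * f for w, f in zip(theta[action], state))

def dqn_loss(theta, target_theta, transition, gamma=0.99):
    """Squared TD error against the bootstrapped target
    r + gamma * max_a' Q(s', a'). target_theta stands for the periodically
    frozen target network DQN uses to fight the non-stationary targets."""
    s, a, r, s_next, done = transition
    target = r if done else r + gamma * max(q(target_theta, s_next, b)
                                            for b in target_theta)
    return (target - q(theta, s, a)) ** 2

theta = {"confirm": [1.0, 0.0], "request": [0.0, 1.0]}
# one toy transition: state, action, reward, next state, terminal flag
loss = dqn_loss(theta, dict(theta),
                ([1.0, 0.0], "confirm", 1.0, [0.0, 1.0], False))
```

The instability mentioned above comes from minimizing this loss while the target itself keeps moving as theta changes.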

The alternative is to optimize the policy directly with a neural network. We assume a parametrization of the policy with some parameters, and then the gradient of the objective we want to maximize, the value of the initial state, is given by the policy gradient theorem. I do not have time to prove it here, but let me just say that it is directly used in the REINFORCE algorithm, which is on-policy. The estimate it provides is unbiased, unlike the one from DQN, but it has a very high variance, which again is not something that we prefer.
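A minimal REINFORCE sketch on a single-state problem (a bandit), just to show the update direction, grad log pi(a) times the return. Everything here is a toy stand-in, not the dialogue policy from the talk:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def reinforce_step(theta, action, ret, lr=0.1):
    # for a softmax policy: d/d theta_a log pi(action) = 1[a==action] - pi(a)
    probs = softmax(theta)
    return [t + lr * ret * ((1.0 if a == action else 0.0) - probs[a])
            for a, t in enumerate(theta)]

theta = [0.0, 0.0]
for _ in range(50):                 # action 0 keeps earning return 1.0
    theta = reinforce_step(theta, action=0, ret=1.0)
probs = softmax(theta)              # the policy shifts toward action 0
```

In a real system the return comes from sampled dialogues, which is exactly where the high variance mentioned above enters.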

To lower the variance we can use the actor-critic framework. Let me give you a diagram of what an actor-critic framework looks like. This is our user, and this is our policy optimizer: a model that has two parts. One part is the actor; this is what actually takes the belief state and produces actions. The other is the critic, which criticises the actor. So the actor takes an action, the user responds with a reward and an observation giving the next belief state, and the critic then evaluates how good our actor was.

Now, when we apply these methods to dialogue systems, to model the value function or the policy, we often find that they take too many iterations to train. So people resort to using a summary space. What does that mean? We compress our belief state into a summary state, and the policy chooses only among a handful of summary actions; then hand-crafted heuristics map each summary action back into the much larger master action space, which typically has two orders of magnitude more actions than the summary space. This is obviously not good: if we really want to build natural conversation, we do not want to depend on any kind of hand-crafted heuristics, and we want the system to choose explicitly between much richer actions.

So the problem is that too many interactions are needed, and the solution in this case is experience replay together with off-policy learning, which allows us to learn in a much larger action space. The algorithm, which is called ACER, actor-critic with experience replay, uses experience replay, estimates the Q-function off-policy, uses the Retrace algorithm to compute the targets, and also uses trust region policy optimisation. Let me briefly go through these points.

First, experience replay. As you interact with your dialogue system, you generate data, which goes into a pool. To make the most of that data, you can go through the pool again and again and replay the experience. The issue is that, at a later point, the system has learned something, and the old actions are no longer the ones it would take, so we should not treat the recorded rewards in exactly the same way. We therefore apply importance sampling: the ratio between the policy that generated the data and the policy we have right now reweights our gradient.

Now, if you use this naively for the Q-function, you inevitably have to multiply the importance sampling ratios along the whole trajectory. Multiply many small numbers and the product vanishes; multiply many large numbers and it explodes. The solution is to truncate the importance weights, and also to add a bias correction term, to acknowledge that by truncating you are introducing a bias. This is what the Retrace algorithm allows us to do.
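The truncation-with-bias-correction trick can be sketched in its single-step form (hypothetical numbers; ACER and Retrace apply this along whole trajectories): clip the ratio rho = pi/mu at c, and add a correction term, weighted under the current policy, for the mass that was clipped away.

```python
def corrected_estimate(pi, mu, q, sampled_action, c=1.0):
    """Truncated importance-sampled estimate plus bias correction.
    pi: current policy, mu: behaviour policy that generated the data,
    q: action values, sampled_action: the action mu actually took."""
    rho = {a: pi[a] / mu[a] for a in pi}
    # on-trajectory term with the weight clipped at c
    main = min(c, rho[sampled_action]) * q[sampled_action]
    # correction term covering the clipped probability mass
    correction = sum(max(0.0, (rho[a] - c) / rho[a]) * pi[a] * q[a]
                     for a in pi)
    return main + correction

# behaviour policy mu rarely picks action 0; the current policy pi prefers it
pi = {0: 0.9, 1: 0.1}
mu = {0: 0.1, 1: 0.9}
q  = {0: 1.0, 1: 0.0}
est = corrected_estimate(pi, mu, q, sampled_action=0)
```

The untruncated weight here would have been 9; the clipped term plus correction stays bounded, and averaged over actions drawn from mu the estimator still recovers the expectation of q under pi (here 0.9).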

Remember that we want to use the actor-critic framework, so we want to estimate both the policy and the Q-function. The target for our Q-function loss is therefore the one given by the Retrace algorithm. I will not walk you through the proof of why this provides a good estimate, but let me give you the intuition for why the products neither vanish nor explode: the truncated importance sampling ratios are bounded, so the products stay bounded too, and the bias-correction term makes up for the probability mass lost by truncation, so the overall estimate does not end up badly biased.

The final ingredient is trust region policy optimisation. The problem is that when we optimize the policy directly in a deep reinforcement learning framework, small changes in parameter space can result in very large and unexpected changes in the policy. The solution is the natural gradient, which gives you the steepest direction in policy space, but it is expensive to compute. It can be approximated via the KL divergence between the policies of subsequent parameters, and trust region policy optimisation further approximates that KL divergence with a first-order Taylor expansion, so that the divergence between subsequent policies stays small and the policy does not change abruptly. This is particularly important if you want to learn in interaction with real users: you really cannot afford to act unexpectedly.
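The trust-region idea can be illustrated with a simplified stand-in (not the first-order update used in the talk): accept a gradient step only if the KL divergence between the old and new policy stays under a threshold, otherwise shrink the step.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl(p, q):
    # KL divergence between two discrete distributions
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def trust_region_step(theta, grad, delta=0.01, lr=1.0):
    """Backtracking line search: keep halving the step until the resulting
    policy is within KL distance delta of the old one."""
    old = softmax(theta)
    while lr > 1e-8:
        cand = [t + lr * g for t, g in zip(theta, grad)]
        if kl(old, softmax(cand)) <= delta:
            return cand
        lr *= 0.5          # the policy changed too much: shrink the step
    return theta

theta_new = trust_region_step([0.0, 0.0], [1.0, -1.0])
step_kl = kl(softmax([0.0, 0.0]), softmax(theta_new))   # bounded by delta
```

The full gradient step here would move the policy far outside the trust region; backtracking keeps the behaviour change small, which is exactly the property you want when learning against real users.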

Now, if we want a dialogue system that learns directly in the master action space, we have to have an adequate architecture for the neural network. We are in the actor-critic setting, so we are estimating the policy and the Q-function at the same time, and to make the most of the data they share a feature extractor on top of the belief state. Because we want to learn in the master action space, we have to choose between very many actions; so both the policy and the Q-function have one part that chooses the summary action, you can think of it as the type of the dialogue act, and another part that chooses which slot should complement that dialogue act. The gradient for the policy is given by trust region policy optimisation, and the gradient for the Q-function is given by ACER.
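The benefit of this factorized output can be shown with a tiny hypothetical sketch: one head scores summary acts, another scores slots, so |acts| + |slots| network outputs span |acts| x |slots| master actions.

```python
# toy act and slot inventories (illustrative, not the Cambridge domain's)
summary_acts = ["request", "confirm", "select"]
slots = ["food", "area", "pricerange"]

def master_action(act_scores, slot_scores):
    # greedily combine the two heads into one master action
    act = max(act_scores, key=act_scores.get)
    slot = max(slot_scores, key=slot_scores.get)
    return f"{act}({slot})"

n_outputs = len(summary_acts) + len(slots)   # 6 head outputs...
n_master = len(summary_acts) * len(slots)    # ...cover 9 master actions

chosen = master_action({"request": 0.7, "confirm": 0.2, "select": 0.1},
                       {"food": 0.5, "area": 0.3, "pricerange": 0.2})
```

With realistic inventories the gap is much larger, which is how the network covers a master space two orders of magnitude bigger than the summary space without an output unit per combination.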

So how does this work? We applied it in the Cambridge restaurant domain. We have a reasonably large belief state, and hence a very large number of master actions, and everything here is simply running on my laptop. Here are the results. The x-axis shows training dialogues and the y-axis shows success rate: whether the dialogue was successfully completed or not. One model is learning in the master action space, the other is learning with summary actions. As you would expect, the policy learning in the summary space is faster, because it is choosing between a much smaller number of actions. But although the master action space has two orders of magnitude more actions, learning there is only about twice as slow. So this is good news.

We also evaluated these policies in interaction with real users on Amazon Mechanical Turk, and you can see that the performance in terms of success rate is almost the same, but the master-action policy needs no hand-crafted heuristics. If you would like more details, we have a journal article about this, and it has just been accepted to IEEE/ACM Transactions on Audio, Speech, and Language Processing.

Okay, one more thing, which you probably already heard about; I hear that my student Florian was asked about it earlier. Here we addressed the problem of building user models. When we optimize dialogue management, we typically need a simulated user, and we often find that the simulated users we have are hand-coded and not as realistic as the real users we interact with. The solution is to train the user model in an end-to-end fashion, and the hoped-for outcome is a potentially more natural conversation with this simulated user.

So what does the neural user simulator consist of? It has three parts. The first part is the goal generator; you can think of it as a random generator that produces goals for the dialogue, just like the goals a real user would have. The second part is the feature extractor, which extracts, from what is in the dialogue state, the features that relate to the user's goal. And the third part is a sequence-to-sequence model, which decodes these features and their history into the user utterance.

Here is how it works. We extract features; for instance, if the system said "I'm sorry, there is no such restaurant", the extracted feature says that the user should relax their goal, because the user goal can change mid-dialogue. An RNN summarizes the history of these features in a hidden layer, and then the sequence-to-sequence model starts from the start-of-sentence symbol and decodes, word by word, what the simulated user says.
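The three parts can be sketched as follows. This is a heavily simplified, hypothetical stand-in: the real model uses a recurrent feature history and a learned sequence-to-sequence decoder, replaced here by templates for readability.

```python
import random

def generate_goal(rng):
    # part 1: random goal generator, like the goal a real user would have
    return {"food": rng.choice(["thai", "indian"]),
            "area": rng.choice(["north", "south"])}

def extract_features(system_act, goal):
    # part 2: features of the system act relative to the current goal
    if system_act.startswith("request("):
        slot = system_act[len("request("):-1]
        return ("answer", slot)
    if system_act == "no_match":
        return ("relax_goal", "food")
    return ("inform", None)

def respond(feature, goal):
    # part 3: template stand-in for the seq2seq decoder
    kind, slot = feature
    if kind == "answer":
        return f"i would like {goal[slot]} please"
    if kind == "relax_goal":
        goal.pop(slot, None)          # the goal can change mid-dialogue
        return "then any kind of food is fine"
    return "i am looking for a restaurant"

rng = random.Random(7)
goal = generate_goal(rng)
reply = respond(extract_features("request(area)", goal), goal)
```

In the trained simulator both the feature summary and the surface form are learned from data; the sketch only fixes the interfaces between the three parts.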

We trained this simulator on DSTC2, because there real users were talking to dialogue systems, so we can model how real users actually behave. And we evaluated the simulated user in a slightly unorthodox way: we were not so much interested in how well the user simulator forms sentences, but in how much it can help us in training a dialogue system.

So for each user simulator we built an environment in which we trained policies. One set of policies was trained with the neural user simulator, which is completely statistical, and another with the agenda-based user simulator, which is based on rules. For each simulator, we also selected the policy that performed best on the other simulator; this will hopefully become clear in a moment. We then deployed these policies to interact with real users on Amazon Mechanical Turk. So: one simulator, say the neural one, is used for policy training; we measure how well those policies perform, and we additionally take the policy trained on the neural simulator that performed best on the agenda-based one, and similarly for the agenda-based simulator.

What the results show is that if you train your policy on the agenda-based user simulator and select it by its performance on the agenda-based user simulator, it is not going to perform particularly well on real users. So a purely rule-based approach to building a user simulator is not particularly good. Policies trained on the neural user simulator transfer better, and the best performing policy overall is the one trained on the neural user simulator and selected by its performance on the agenda-based one. This shows that end-to-end learning is promising for modeling users, but combining it with the agenda-based simulator helps if we want the best performance.

Okay, so in the last five minutes I would like to talk about something that is probably close to us all in this community: how do we effectively evaluate dialogue models, and how do we compare them to each other? How can we produce good science? Similarly to what we heard earlier, only a handful of groups around the world have had access to this kind of infrastructure and data. This is something that we wanted to change in Cambridge, because we want to kick-start research and also allow people to easily compare with each other.

So we made our toolkit for building statistical dialogue systems, PyDial, open source. It provides simulated environments and benchmark algorithms to compare against, so if you want to test a new policy, you can easily compare it to the state of the art. And the large corpus that I have just described, called MultiWOZ, we are making open access as well. This work was funded by a faculty award.

Just a few words about PyDial. It provides implementations of statistical approaches to dialogue systems, and it is modular, so you can easily exchange your own module for the functionality currently available in the toolkit. It can very easily be extended to another domain: if you have a particular database, you can use it for your dialogue system. It offers multi-domain conversational functionality, and you can also subscribe to our mailing list. It was built with contributions not just from current members but also from previous members of the Dialogue Systems Group, and it is constantly expanding.

In terms of benchmarking, we wanted a way of comparing algorithms in a fair way. So we defined environments with different domains, different user settings, and different noise levels in the user input, simulating speech recognition errors. On these we benchmarked a number of state-of-the-art policy optimisation algorithms, including the ACER algorithm I have just talked about. This initiative was led by Iñigo Casanueva and Paweł Budzianowski, and it was presented at the NIPS Deep Reinforcement Learning Symposium last year.

And this is basically the end of my talk. The take-home message is that machine learning allows us to address many of the problems we are facing in building natural conversation. You have seen that it allows us to share concepts in belief tracking, so that we can have large-scale, multi-domain trackers. In the same vein, it allows us to build policy optimisation modules that choose between a wide variety of actions. And it also allows us to build more realistic models of users, so that we can train more accurate policies.

But there is a lot more to be done to actually achieve the goal of natural conversation; this is just the tip of the iceberg. Some of the things we are eager to work on: how can we incorporate structured knowledge, because we need to be aware of the knowledge base; if we want systems capable of very long conversations, we need more accurate and more sophisticated reinforcement learning models; and finally, we need to achieve sentiment awareness and devise a more nuanced reward function to take into account when we are building dialogue systems. That should bring us closer to the long-term vision, which is to have natural conversation with goal-directed dialogue systems.

Thank you very much.

so, what we compared with: there is a statistical version of the agenda-based

user simulator

but it

relies on hand-coding the structure

of the whole conversation, in the sense that it first

asks something, so some parts of it are hand-coded, and then it has

pockets which are trained

for the overall problem of natural conversation, that would not be applicable, because

we still have

that structure, which is fixed. so we did compare to that, but actually this neural

simulator was trained on a very small amount of data, so

i don't know if i have the exact numbers, but dstc2 is, i think, only one

thousand dialogues

so that's not a lot

now, because of that, the number of parameters was kept really small. so for instance,

if i go back:

we don't actually have here what the user said to the system

here we have as input the semantic form of the user's utterance

so this feature extractor is in fact very easy to build

otherwise you would need a cnn or something more sophisticated here, and that

would expand the number of parameters

also, how large

you choose these vectors to be implies how many parameters you have

in the model

so in this model everything is kept very small

just to account for the fact that you have a very small amount of data

so

it depends

what you mean: whether you want to start from scratch, or whether you want

to reuse some of the models

if you want to start from scratch

then basically everything

everything is domain-independent in that sense

in particular belief tracking

there

the belief tracker takes as input the user utterance

and the ontology

so the ontology is just an additional input to the belief tracker

and you embed

the word vectors that you have in your ontology to begin with

so traditionally you would look at a word and whether it appears in the user's utterance

here we take the word vector of that word

and compare the similarity of that word vector with our feature extraction

and we have

three generic feature extractors, for the domain, slot and value

so

so in principle this should work as it is on a different domain

right
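the slot-matching idea described above can be sketched roughly as follows. this is a toy illustration with made-up three-dimensional vectors and a hypothetical `match_score` helper, not the actual tracker from the talk:

```python
import numpy as np

# Toy word vectors; a real tracker would use pre-trained embeddings.
EMB = {
    "food":        np.array([0.9, 0.1, 0.0]),
    "cheap":       np.array([0.1, 0.9, 0.1]),
    "inexpensive": np.array([0.15, 0.85, 0.2]),
    "north":       np.array([0.0, 0.1, 0.95]),
}

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def match_score(utterance_words, ontology_term):
    """Soft match: highest embedding similarity between the ontology
    term and any word in the user utterance (0.0 if no word is known)."""
    vecs = [EMB[w] for w in utterance_words if w in EMB]
    return max((cosine(EMB[ontology_term], v) for v in vecs), default=0.0)

# "inexpensive" never matches the ontology value "cheap" exactly,
# but its word vector is close, so the soft feature still fires.
score = match_score(["any", "inexpensive", "place"], "cheap")
```

the point is that the feature is a similarity between vectors rather than an exact string match, which is what makes the same extractor reusable across domains.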

natural language generation is a more difficult problem in that sense

you would need to redefine the system acts

for it to work in a different domain

and then

it also depends on

what your knowledge base looks like: whether

you embed the words that are already in it, or maybe particular constraints

and so on

so it is not as direct a transfer

so that work has the huge advantage of not requiring

labelled data for the intermediate steps

and that is a huge advantage, because if you are generating millions of calls every

week

you simply cannot annotate them

so it is certainly worth investigating further because of that

but the downside is that it doesn't actually work

as well

and the reason for it is that we are still not able to figure out

how to train these networks so that they do not require that additional supervision

so a lot of the work

along that line is about

having an end-to-end differentiable neural network that you can propagate gradients through, but you

still need, at some stage, the supervision to allow you to actually have

a meaningful output

and another problem is the evaluation of such systems

the research in this area has exploded, and many people who are not originally from

dialogue are doing research in this area, and they treat this as a translation problem:

you have

the system input and the user output. and that is really not the case: you cannot

just compute the bleu score, because that doesn't say anything about

the quality of the dialogue, and it doesn't take into account the fact that you

can have a long-term conversation

yes

right

so on the point you raise, so to

say

there would be, i mean

people in speech recognition have looked at this problem of having to iterate

over a huge number of outputs, for instance in language modelling, and there

are techniques, like noise contrastive estimation, that let you avoid doing

a softmax and instead have unnormalised

outputs. so the thing is,

for this to work we would need to have some similarity metric,

some notion of confusability, between

different elements of the ontology

so i don't know whether i have a clear answer to how to actually

do that
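the contrast between a full softmax and a sampled, NCE-style objective can be sketched like this. the toy weights, dimensions and the choice of `k` noise samples are my own assumptions for illustration, not the setup discussed in the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM = 50_000, 16                     # large output vocabulary
W = rng.normal(size=(VOCAB, DIM)) * 0.01    # output embeddings
h = rng.normal(size=DIM)                    # context representation

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def full_softmax_loss(target):
    """Normalises over every output: O(VOCAB) work per example."""
    logits = W @ h
    logits -= logits.max()                  # numerical stability
    return float(-(logits[target] - np.log(np.exp(logits).sum())))

def nce_style_loss(target, k=10):
    """NCE-style objective: score the target against only k sampled
    'noise' outputs, treating the scores as unnormalised logits."""
    noise = rng.integers(0, VOCAB, size=k)
    s_target = W[target] @ h
    s_noise = W[noise] @ h
    # Logistic loss: the true output should score high, noise low.
    return float(-np.log(sigmoid(s_target)) - np.log(sigmoid(-s_noise)).sum())
```

the sampled loss touches `k + 1` rows of `W` instead of all 50,000, which is the computational point being made; what is missing for the ontology case, as the answer says, is a principled way of sampling confusable negatives.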

for the values,

we could embed the whole ontology and then

decode from it,

rather than embedding

and enumerating each one

because all you actually want to have is a good

word-space representation

so that you can almost

you can almost sample from it and then decode a particular

word, but that's very difficult to do

so i think it's an interesting problem

it's a really difficult problem, and we haven't actually addressed it in this work

so

the ontology

by itself doesn't

tell you what is a good or bad action

it is used here

in the sense that you know that a confirm action, or

a slot summary action, is something to do with slots

and that you know

how many slots you will talk about

rather than letting the system learn that from scratch
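a toy illustration of what a slot-level summary action space might look like. the slot names and the mapping function here are hypothetical, purely to show the idea of expanding a compact summary action back into a full system act:

```python
# Toy summary-action space: instead of one action per (act, slot, value)
# triple, the policy chooses among a handful of slot-level templates.
SLOTS = ["food", "area", "pricerange"]

SUMMARY_ACTIONS = ["inform", "bye"] + [
    f"{act}_{slot}" for act in ("request", "confirm") for slot in SLOTS
]

def to_master_action(summary_action, belief):
    """Expand a summary action into a full system act, filling in the
    top belief-state value where the act needs one."""
    if "_" not in summary_action:
        return (summary_action,)
    act, slot = summary_action.split("_", 1)
    if act == "confirm":
        # Pick the value the belief tracker currently considers most likely.
        top_value = max(belief[slot], key=belief[slot].get)
        return (act, slot, top_value)
    return (act, slot)

belief = {"food": {"italian": 0.7, "chinese": 0.2}}
```

the hand-coded part is exactly the knowledge that "confirm" needs a slot and a value: that structure comes from the ontology rather than being learned.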

so, you know, especially if you don't have enough training data, you can always incorporate

that into the system. but what we were interested in here was mostly to see whether

we can scale, because

if you look at the reinforcement learning literature, people really use reinforcement

learning for problems which can be easily simulated, and it's often a discrete space:

it's the setting of a joystick, where the action space is to choose between

a very small number of actions

and if you want to apply those techniques to dialogue seriously, you will inevitably

have to learn over larger

state-action spaces. that is really what we were interested in here, but obviously you

can always incorporate

what you just

described