Přepis řeči - THE NON-BAYESIAN RESTLESS MULTI-ARMED BANDIT: A CASE OF NEAR-LOGARITHMIC REGRET

that it there is a joint work with uh and i is due that of a seems to the found my the and you guy and uh a a separation much very from us C um so first the problem we are considering a a is that dynamic spectrum access and on the model and the is illustrated that system where looking at to um there are and need given then channels uh i i don't get it it to a primary system that's using a slot transmission structure and uh of second and their use of the objective here is to find the the i'd slot and the limit its sensing oh we a at any given so uh a set use the can only choose K channels to set and if the channel is i do it what transmit and uh if you transmit if you patch what i'd slot you will receive one unit to work then essentially is you get one packet delete and the problem here is how to way i choose which K channels was we should sense and the beginning of a based on observation as a visual have so far and to answer this question the first thing we need to specify is the model of channel all two so if we look at each channel so each channel is essentially piece the i do is a stochastic process and each channel is model as a stock as a process with on parameter so we do know what kind of stochastic process with be ways but the prime in so the simple liz so the process we can be i be here will be a i pro so the channel or as the ah for channel i will the i people in the only with a known means it so see the i'd say is the probability channellise i'd in any given as and changing i and not we want to choose without knowing which channel was more likely to be bus so i don't imagining for pink is with on the by S and you want to get as many had oh so now how do we choose channel um K out of you know there is lot and objective is to maximise house swoop so what is the maximum throughput would we kind even assume we not the perfect we know all the C in that scenario we no we should always sense the channel with the largest to seat with the max probability i and you that case i was we'll put we look at it is just see as every i does well we get when you in the now without knowing all the see guys is it possible to achieve the same maximum throughput be fine by the non model seen that and if yes how a turns of this is actually a well studied problem you've we take a D model this is you was special case of more the amp and vol and the region problem we can explain using this slot machine with a dependent are and that you just time count is that like one um to play but we don't know the we word statistics and the we word processor soon i and maximise long term i as essence of this problem is in the tradeoff between exploitation and exploration so you to it to like we want to play a a a um that has a large is the sample me based on what we have is there are so but that same time we have to play each are sufficiently many both a sufficient number of times it so that our some whole me is close enough to the true oh that's essentially the i which a problem and uh you can be read was we formulate it in the following way see the one to see that and is the we what mean of each are which is a little or no and the maximum we what we can i get over a reading of lance T you it has it's no is given by the maximal see that i use subscript the one here to in the extreme the here to see the white the largest one multiplied by right which is achieved by always playing the bass ah very into it or not we don't know all the see so we don't know which way is the best um we can design design a policy high that tell us which um to play based on what we have there are so far and uh this is the re what we can get for a given policy and how do we know how to these policy is it's measured by that you friends to these ideals in their real way where you do not so this is called the required also called cost of nor learning that's the cost we pay for not snowy them uh and very intuitive to this every time we spend time a a a a a a a that's not the best we waiting incur a loss of C that one minus seat we are spending time i a multiplied by the expected time we yeah play a the ice best so to minimize wet we want to minimize a model spend a each and uh object here is to minimize the growth rate of the web she it's intuitive whether always grew of with time because we and never stop playing each a to make sure our sample means zach and we can also see if we have a sampling year roles of whether with time T then we can achieve the maximum average reward of the best uh a find five the normal models area because if you divide by T C these is not be know you with and there is a problem is nice solved by at all i five they show that the minimum required will street is lot with T they also found the best leading constant uh and given in terms of kl divergence and they the also construct a the policy to achieve peace past well but was rather complicated policy and uh about ten to a fifty years later there's simple policies to achieve lot at this is essentially a index type of policy based on simple me one of the most uh uh uh well no one is proposed by our uh is the so called U C B one policy so in this case what compute the in that for each a are at each time T the in is them play the sum of me of these are class this turn is called of the confidence of on used term is given by a square with the two of log T divide by G i yeah i is the number of times we have played arm i not a car so you can see if we are playing a a a a significant a last log T used remote allow it will make the index the large as that we will have to so the is always played around with a large oh use actually were thing the policy to at you spend a lot of keep time each uh which that's what line all be has shown that what we have so you've what it you know i be model though wow a channel occupancy the problem is completely so it's just a special case of the classic the um and so you we consider a more general more um complex model that's the each channel is a markov chain to state B the i don't it change according to this a a transition that here and we don't know the transition problem it's we don't know P zero one Q one at all the channels are still has had an equal they have the same transition not how do we design channels action policy to maximise swoop turns out this is a variation of the uh a more yeah and it problem but has never been self so we are essentially a on the first to to address these more down band and um my and the what to challenges here so first here we can see if we have a markovian model for each channel then the optimal policy even if we know the transition probability is no longer the state home one channel we need to so each across channels based on our prediction or which channel is more likely to i do like these a fixed because of the channel memory makes prediction so that in this case we can not simply minimize our required by funding at that time was spent on a bad channel because there's no really bad channel say we need to switch the was smart we are all switching on much so you actually large mine to or what is the best way to switching on among channels based on how observation and in general there are infinite possibilities of how to to switching so the wheel are from these infinite possibility which one is the bad so that's really that difficult part and perhaps the reason that this problem has not addressed a lead to each so actually turns out if we do one of the transition probability it's is this problem itself is complex enough not this is is actually the S is multi um band problem and they're not that that makes uh formulate that we do ninety eight eight and this problem in general is peace space ha that's shown but have them each one that's the cool think here is we have a special recipe spend it the think we are you ways here is the only a two state markov G channel each channel and we have shown and they no model the optimal policy is a simple now and do this is uh a a a a real work uh um have down with a a couple of clever with er so the optimality is current to and there's certain conditions so if we have a key one one we to that P zero one which be is is that daisy chain it's more likely to stay rather than change state then of to melody is generally um shoe but that you are the think like you P was model that P zero once a sure what three channels and the all K equal to and mine there's one but we come it's can gently these holds for all and for you you a rising case we have be able to prove so but that's assume we can this time your to project or is true then we know and the not model myopic policy is the optimal and the more significant to these on the the problem is that the mouth you policy has a simple so a universal struck oh which means a we need to know is what if P one mine's squid to that P zero one we always switch in this way as we stay at a i don't channel if this channel with current the sense is i don't who was stay with you it becomes a it then we switched to the channel visited the log is that's are are way of speech and if you want one small the if zero one then way switching a you from way we actually stay if the channel was a the and the was which if becomes i with where why you re what then we go to somewhere else and how the with switch with switched to the channel most recently visited a are all channels with a T that you most of all you've there are not such an no the we switch to the channel busy long as anyway there essentially a of this says is all optimal we of change among channels i'm only takes two possible depending on the transition so that's really makes the are no problem significant a simpler to some instead of searching my infinite possible ways of how to switching on channels we know there are only two possible if we don't to P zero one what is either there this way or that the we can formulate a sort of conceptual more the and it well so we treat each way of channel switching as a a you really can see that what is a a a remote on a problem is nothing but a choice a possible option have and we don't know which of change the bass as exactly what do we have so then not we'll a which um is that what are we only have forms we of switching so we want to do which one is the good one because we don't know the transition probably so we don't know which one to choose so it same we just read of the problem to the class more they are banded problem not that i C there are two channel is we have to address the first one the price are we have to answer is how long to play each i so a region that we just play one as we got some word for the classic model but not a a a a a one way of switching on much so we have to follow that we of switching among channels long enough you all there to see if i so how long we play and terms not to the optimal lattice of formal each way use spell star depends on the transition probability unfortunately fortunately and that's something we don't not so that's one they we have to dress and the out way is the reward in this case and the longer i be kinda not even independent oh i be of rows are because each um is are only able way of switching on among channels and that they are also reaching a on the same as that of and J they are no longer and that right so to address these two challenges that's our uh solution here so first how long to play each uh the optimal lattice of playing each arm of only each we of a channel switching depends on the parameters are no priming mean so to get a a a round of was back to the trick we have here is we play a each and with increasing that so we do not fixed a a specific lance for playing each a but rather choose a sequence that rules with and we will pace certain price in terms of required but we can choose the increasing read of view sequence which really small a lot a lot of T and do the price a pay we pay will be how which we really a yeah a thing to deal with not i T samples um what we did it is we um proved a modified each one of thing which essentially is the chief for proving a lot required to in the classic i D and eight uh model so the uh channel thing an eh uh are we should not foundation is all the conditional um expectations should have should have the same me so that a conditional exactly expectation should be the same me and then we come by to the probability that the sample mean is to far from the truth and in this case we how we don't have identical um every we sample anymore so we need to to modify that because the conditional expectation no longer the same but we can get a the result if the conditional expectations are with thing a certain range of each other so then we can use similar techniques to prove there is and the what we have shown here is we can achieve a which very close to a log all there for the required essentially in in front of a lot we don't have a constant anymore is it's but a you know increasing sequence but this sequence can be a it is small as slow as you want so you choose a small you want to you hug close you want to get a lot T and based on that we can choose the increasing the for each L N so i was give the details on the relation between G T and uh L T actually very simple is just showing at each time what is the current L where you and uh that's a it or what is conclude um this paper here so what we have considered a a a a of the problem was a multi bit it by um dynamic spectrum access um we believe this general approach would be useful for for solving a either uh rest we spend it probably even a mark of general mark of C your process is on the primary a what is the message here the method here is if we are dealing with a small T are bandit problem with a known uh then them and we want to achieve the same i reach word as we can get and the not so in general problem is intractable but the problem may have a whole if and the the not model the optimal policy has what we call a find that option by means there are only fun and two ways of do we find the and optimal policy only has four that possible for depending on the primary don't we treat each possible for as as a a a a we don't to know the prior parameters then we tried to or which one is actually a one for that and a line on them so that's the general message but of course for each specific problem we have to but yeah sure whether log oh i'm sure close to log where can be but she and in the context of dynamic spectrum access that's great we have been focusing are the problem is really possible because house of this i a universal structure of the our P policy for that no model pace and uh i a little bit advertisement um i be giving a another problem related a problem being about forty minutes uh because this one a uh we can only somehow for a very special cases where you have only to state markov chain and all the channels are are are arms a stochastic identical the what if we're dealing with a general fine state markov chain with the no in the transition problem what can we do there so this is paper actually is a dressing this complete be general model um but we can also achieve not require but the definition of the required it is uh we oh oh path i don't one person something to sort of question from the speech true oh i yes so that's essentially here you that cool yes yeah so that the required we get it is these G Q locked chi right so we want chi G to grow very slow is time T and uh so if you want to achieve a lot a lot should be uh for chi chi then eventually do sequence uh this is just the number of but what is the value of L for the current so L one i is well we start we play one i L L one slot so then for the first L once lots the these sequence was what have the same value a one for the that what a a a a a model be out to we play out to time so for the next L tools as well as the but will be out so L is increasing so these sequence will be also be in crazy but even slower or because it's repeating the same value but it yeah okay or running times to time well just a record some um i field to see if a connection was not little huh but this is really trying to learn um trying to maximise this throughput put and there known parameters but a a is more like the inclusion was your peer users here we even only have one uh and secondary user i i i probably miss um but an action that i i don't really see whether two problems i really to um maybe we can talk more data you to to rooms so so let's that's the speaker

THE NON-BAYESIAN RESTLESS MULTI-ARMED BANDIT: A CASE OF NEAR-LOGARITHMIC REGRET

Spectrum Sensing for Cognitive Radio

Přednášející: Qing Zhao, Autoři: Wenhan Dai, Tsinghua University, China; Yi Gai, Bhaskar Krishnamachari, University of Southern California, United States; Qing Zhao, University of California Davis, United States