Fundamentals of learning: the exploration-exploitation trade-off
6 June 2012
The exploration-exploitation trade-off is a fundamental dilemma whenever you learn about the world by trying things out. The dilemma is between choosing what you know and getting something close to what you expect (‘exploitation’) and choosing something you aren’t sure about and possibly learning more (‘exploration’). For example, suppose you are in a restaurant and you look at the menu:
Fish and chips
Chole poori
Paneer uttappam
Khara dosa
Assuming for the sake of example that you’re not very good with Sri Lankan food, you’ve now got a choice. You can ‘exploit’ – go with the fish and chips, which will probably be alright – or you can ‘explore’ – try something you haven’t had before and see what you get. Obviously which you decide to do will depend on many things: how hungry you are, how good the restaurant reviews are, how adventurous you are, how often you reckon you’ll be coming back etc.
What’s important is that the study of the best way to make these kinds of choices – called reinforcement learning – has shown that optimal learning requires that you to sometimes make some bad choices. This means that sometimes you have to choose to avoid the action you think will be most rewarding, and take an action which you think will be less rewarding.
The rationale is that these ‘sub-optimal’ actions are necessary for your long term benefit – you need to go off track sometimes to learn more about the environment. The exploration-exploitation dilemma is really a trade-off : enjoy more now vs learn more now and enjoy later. You can’t avoid it, all you can do is position yourself somewhere along the spectrum.
Because the trade-off is fundamental we would expect to be able to see it in all learning domains, not just restaurant food choices. In work just published, we’ve been using a new task to look at how actions are learnt.
Using a joystick we asked people to explore the space of all possible movements, giving them a signal when they made a particular target movement. This task – which we’re pretty keen on – gives us a lens to look at the relation between how people explore the possible movements they can make and which particular movements they learn to rely on to generate predictable outcomes (which we call ‘actions’).
Using data gathered from this task, it is possible to see the exploitation-exploration trade-off in action. With each target people get 10 attempts to try to identify the right movement to make. Obviously some successful movements will be more efficient than others, because it is possible to hit the target after going all “round the houses” first, adding lots of extraneous movements and taking longer than needed.
If you had a success like this you could repeat it exactly (‘exploit’), or try and cut out some of the extraneous movement and risk missing the target (‘explore’). Obviously this refinement of action through trial and error is of critical interest to anyone who cares about how we learn skilled movements.
I calculated an average performance score for the first 50% and second 50% of attempts (basically a measure of distance travelled before hitting the target – so lower scores mean better performance). I also calculated how variable these performance scores were in the first 50% and second 50%.
Normally we would expect people who perform best in the first half of a test to perform best in the second half (depressingly people who start out ahead usually stay there!). But this analysis showed up something interesting: a strong correlation between variability in the first half and performance in the second half. You can see this in the graph:
This shows that people who are most inconsistent when they start to learn perform best towards the end of learning. Usually inconsistency is a bad sign, so it is somewhat surprising that it predicts better performance later on. The obvious interpretation is in terms of the exploration-exploitation trade-off.
The inconsistent people are trying out more things at the beginning, learning more about what works and what doesn’t. This provides them with the foundation to perform well later on. This pattern holds when comparing across individuals, but it also holds for comparing across trials (so for the same individual, their later performance is better for targets on which they are most inconsistent on early in learning).
You can read about this, and more, in our new paper, which is open-access over at PLoS One A novel task for the investigation of action acquisition.