Monday 18 June 2012

How to think outside the box - literally

This blog post is about innovation and creativity. More precisely, it's about a stupid little technique that once helped me be creative and achieve something unexpected: coming up with a quantum physics method and publishing a paper with barely any training in physics.


Most big ideas and breakthroughs are unexpected. They are not part of your plan for next week and, in the case of a company, probably not on your roadmap for next year. It is important to have a roadmap, but it is just as crucial to realise that your potentially biggest achievements are not yet on it. Keeping that in mind, what can you do right now to trigger that next big idea? This may be your moment: follow these instructions. It may take some time, but bear with me. I mean it, really, try this: if you can't commit yourself to trying something creative right now, when you have nothing better to do than read this crappy blog post, then when will you?


My thinking box
Step 1: To be creative, you want to go beyond your comfort zone: you want to think outside the box. What's the first thing you need for that? A box. That's right. If you want to think outside the box, you have to have the box first. So go and get one now. A cardboard box, large or small, will easily do. I used the one pictured on the right, which I found in the recycling bin in the office. Resume reading once you've found your box.


Step 2: Now that you have the box, what's inside? Inside the box are the things that you are comfortable thinking about. If you have a to-do list, a project plan, a roadmap for the next year(s), those all go inside the box. Everything that you anticipate happening to you and your work, everything predictable, goes inside the box.


Step 3+: When you're ready and have the box, you can start thinking outside of it. You will likely need a pencil to take notes; I simply sketch stuff on the wall of the box. You may normally prefer a quiet place to think: this time, try a busy place. Remember, you're outside your comfort zone now. The rest is up to you. Let me know how you saved the world using this technique.

Wednesday 6 June 2012

My big fat data-driven wedding: how to decide when to get married

There's all this hype about the importance of using data to support critical business decisions. But few things can be more important in someone's life than choosing the partner they plan to spend the rest of their life with. So look no further for a data-driven approach to support this critical decision: when should you stop looking and finally get married?

The formal setup of the problem is as follows: there are \( N \) potential marriage partners out there, and I want to choose the one and only partner of my life. My goal, of course, is to settle with no-one but the very best partner to marry. I can 'date' each candidate sequentially in a random order. When I date someone, I can evaluate how she ranks in comparison to all previous partners I have dated. Based on this, I can either go on and marry her, in which case the process stops, or reject her and carry on dating more partners. Once a partner is rejected, she cannot be called back again. The problem is hard because it's assumed that I know nothing about a partner before I have actually dated her. Some might immediately recognise the typical problem of exploration versus exploitation hidden in here. So when should I stop dating and marry someone?

Let's say I want to maximise the probability of ending up with the best partner as my wife. Firstly, there's no point in selecting someone who's not the best so far, because that also implies she cannot be the best overall. Also, the earlier I marry, the higher the chance of missing out on someone without even dating her. It turns out the optimal strategy is as follows: date the first \( k \) candidates and reject them irrespective of their 'performance'. Then keep on dating partners, and choose the first one who ranks above everyone dated before. The threshold \( k \) depends on \( N \); asymptotically it's about a third of all candidates ( \( k \approx \frac{1}{e}N \), see slides for details, or the little simulation at the end of this post).

I gave a talk about this theory in our lab at Cambridge not long after getting married, and did a quick back-of-the-envelope calculation to see how close my marriage decision was to the optimal. The theory predicts that, based on my criteria, I should've dated roughly 196,132 partners before getting married. Well, I leave it to you to guess whether I did that or not :) But at the end of the day I ended up making the best decision, so it doesn't matter.

I was recently reminded of this problem as I was thinking about making instant decisions in the PeerPerks platform, which, as it turns out, can be modelled as a more complicated version of the optimal marriage problem, called the probabilistic submodular multiple-choice secretary problem. In PeerPerks, we allow brands to offer freebies to influential users of social media. Our job is to make sure the promotion generates maximal success based on various indicators. Every once in a while a user turns up on the website and authenticates with their Twitter or Facebook account. At this point we have access to their social media profile, so it's like we've dated that person: we have a little insight into which topics she's interested in, what kind of people she's connected to, etc. Based on this instant evaluation, our algorithm decides whether or not the person qualifies to receive a perk.
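Speaking of back-of-the-envelope calculations, here is a minimal Monte Carlo sketch of the \( \frac{1}{e} \) rule (my own illustration, not code from the talk or slides), assuming candidates arrive in uniformly random order and can always be ranked exactly:

```python
import random

def best_partner_probability(N, k, trials=20_000):
    """Estimate P(marrying the overall best of N candidates) when we
    reject the first k outright, then choose the first candidate who
    beats everyone seen so far."""
    wins = 0
    for _ in range(trials):
        ranks = list(range(N))       # 0 = worst, N - 1 = best
        random.shuffle(ranks)        # random dating order
        benchmark = max(ranks[:k], default=-1)
        chosen = ranks[-1]           # if no-one beats the benchmark, we end up with the last candidate
        for r in ranks[k:]:
            if r > benchmark:        # first candidate better than all of the first k
                chosen = r
                break
        wins += (chosen == N - 1)
    return wins / trials

N = 100
for k in (10, 25, 37, 50, 75):       # N / e is roughly 37
    print(k, best_partner_probability(N, k))
```

The estimated success probability should peak around \( k \approx N/e \approx 37 \), at a value close to \( 1/e \approx 0.37 \), consistent with the threshold quoted above.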
 

Tuesday 6 March 2012

Sock gnomes versus laws of large numbers


Whilst finishing one of my previous posts on causality and influence, I was pairing up socks that had just come out of the laundry. I was outraged to observe that, yet again, I could hardly pair up even half of the socks. 'This cannot happen by chance, we must have sock gnomes who deliberately mess up your socks' - I was thinking to myself. Clearly, I'm not alone with this problem:

But then I realised I may be wrong, maybe it CAN happen by chance. Maybe it's in fact EXPECTED to happen, and if I could perfectly pair up all socks, I ought to be more surprised. So I started formulating the problem: 
We have \( n \) perfectly distinguishable pairs of socks in our laundry bag. We decide to do our laundry, but because our washing machine has finite capacity, we can only wash \( k \leq 2n \) individual socks. We select the particular socks to be washed by drawing from the laundry bag uniformly at random, without replacement. After this procedure, in our sample of \( k \) socks there will be some number \( Y_{n,k} \) of socks whose pairs are not in the sample. What is the expected value of \( Y_{n,k} \) as a function of \( n \) and \( k \)?
First I wanted to give an analytic answer, but I had waaay too many other things to use my brainpower for, so sorry, this blog post is based on Monte Carlo estimates that I could quickly hack together. (For those of you who don't know, Monte Carlo is a fancy name for simulating a bunch of random outcomes and then taking averages.)
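In case you want to hack it together yourself, a Monte Carlo estimate can be as simple as the following sketch (my reconstruction, not the original code):

```python
import random
from collections import Counter

def expected_unpaired(n, k, trials=20_000):
    """Monte Carlo estimate of E[Y_{n,k}]: draw k of the 2n socks
    uniformly at random without replacement, then count the drawn
    socks whose pair was left behind in the bag."""
    total = 0
    for _ in range(trials):
        sample = random.sample(range(2 * n), k)   # sock i belongs to pair i // 2
        pair_counts = Counter(sock // 2 for sock in sample)
        total += sum(1 for c in pair_counts.values() if c == 1)
    return total / trials

n = 10
for k in range(0, 2 * n + 1, 2):
    print(k, expected_unpaired(n, k))
```

Averaging over many trials like this gives the red expectation curves in the figures below; histogramming \( Y_{n,k} \) instead of averaging gives the gray rectangles.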


I came up with a visualisation to explain what's going on. Here's a simple (and ugly) version of the plot for only \( n=2 \) pairs of socks, with an explanation below.




2 pairs of socks means a total of 4 individual socks in our laundry bag that we may select to wash. The x axis shows the number of socks washed ( \( k \) ), which therefore ranges between 0 and 4. The y axis shows the number of socks left unpaired. The gray rectangles show the distribution of unpaired socks, with white meaning probability 0 and black meaning probability 1 (see the colorbar on the right).
If we wash 0 socks, then with probability 1 we have 0 unpaired socks, hence the black rectangle at (0,0).
If we wash a single sock, we obviously can't pair it up with anything, thus with probability one, we have 1 unpaired sock, hence the black rectangle at (1,1).
If we wash 2 socks, then with probability ~0.65 those two won't form a pair (the exact value is 2/3: once the first sock is drawn, only 1 of the 3 remaining socks is its pair), so we're left with two unpaired socks, hence the dark gray rectangle at (2,2); with probability ~0.35 (exactly 1/3) we can actually pair them up and are left with 0 unpaired socks, hence the lighter gray rectangle at (2,0). It's impossible to have a single unpaired sock, hence (2,1) is white.
And I hope you get it by now and can finish the reasoning all the way to the end, and figure out why the graph is always symmetric.


The blue dotted line shows the maximum possible number of unpaired socks: you can never have more than \( \min(k, 2n-k) \) unpaired socks, since only \( k \) socks were washed, and each unpaired sock's pair must be among the \( 2n-k \) socks left in the bag. So everything above the blue dotted line should be white. The red line shows the expected number of unpaired socks as a function of \( k \), the number of socks we have washed (remember, \( n \) is now fixed).

Below is the same kind of graph for the case of \( n=10 \) pairs of socks. Notice the checkerboard-like pattern, which is due to the fact that the parity of the number of unpaired socks always matches the parity of the number of socks washed. Pretty pleasing to the eye, isn't it?
The expected number of unpaired socks starts to become a smoother function of the number of socks washed, and it becomes smoother still as we increase the number of pairs to 50, as the following figure shows:
The shape of the expectation looks the same as a function of \( \frac{k}{2n} \); the variance around this mean, however, has decreased substantially.
If we further increase the number of socks to 200 pairs, the variance practically shrinks to 0:
And for a little animation putting all this together for growing values of \( n \) click here.

So what's the answer? Well, based on Monte Carlo simulations I conjecture that the following law of large numbers holds for sock pairing:


As \( n \rightarrow \infty\) and \( k \rightarrow \infty\) but such that \( \lambda = \frac{k}{2n}\) is fixed, the fraction of unpaired socks converges to a deterministic value with probability one:


\[ \frac{Y_{n,k}}{2n} \stackrel{1}{\rightarrow} \lambda\left(1-\lambda\right)\]
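The expectation part of this conjecture can actually be verified with a quick calculation (mine, not from the original post): a given pair contributes exactly one sock to the \( k \) washed ones with probability \( 2 \cdot \frac{k}{2n} \cdot \frac{2n-k}{2n-1} \), so by linearity of expectation over the \( n \) pairs,

\[ \mathbb{E}\left[\frac{Y_{n,k}}{2n}\right] = \frac{n}{2n} \cdot 2 \cdot \frac{k}{2n} \cdot \frac{2n-k}{2n-1} = \lambda \, \frac{2n-k}{2n-1} \rightarrow \lambda(1-\lambda) \quad \text{as } n \rightarrow \infty, \]

which matches the conjectured limit; the almost-sure convergence is the part that rests on the simulations.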


So the expected fraction of socks that you can't pair up can be as high as half of the total number of socks you have washed; my analysis therefore allowed me to rule out the alternative hypothesis of sock gnomes messing up my socks.


Take home message #1: Before you start believing in sock gnomes, satan, god, and other supernatural powers, check if events that seem unlikely really cannot be explained by more parsimonious models of probability.


Take home message #2: If you want to be able to pair up your socks, buy loads of the same colour and you won't have such problems.

Sunday 29 January 2012

Flattr: should we use it in academia?


I have just found a service that I have been looking for for a while: Flattr. It's a social micropayment platform, an evolution of the "Buy me a coffee" buttons that started appearing on blogs a few years ago. It allows you to donate a small amount of money each month and distribute it among the authors of content that you "liked" that month.


Here's how it works: you sign up on Flattr, top up your account with a small amount, let's say $5 a month, then you come back to my blog and press the Flattr button on the right. At the end of the month, your $5 will be distributed equally among the authors of the things you Flattr'ed, so I get my share of your $5. This way, the small contributions add up, and the community can reward the authors of content they frequently read and are inspired by. I personally think that, details and implementation aside, the general idea is brilliant. I'm definitely going to top up my account and suggest that people whose stuff I read add a Flattr button.


I actually first thought of such a social redistribution system in terms of academic funding, and almost launched a buy-me-a-coffee button on my publications page to make a point. Now, having found Flattr, I started thinking again about how such a system could be used for a more equal redistribution of funding in academia. Let's review some current inefficiencies in academic research funding:


Inflexible fellowship values: Money in academia is often awarded in discrete chunks, especially for junior positions. You apply for various grants, fellowships, scholarships, and you either get the whole thing for x years or you get nothing. If, for example, you got a Gates Scholarship to study for a PhD in Cambridge, you got it for three years; the amount is practically fixed for those three years (other than following inflation); you can't cease to be a Gates scholar if you do poorly, and you can't become one if you perform better. It's pretty much fixed and independent of your performance for 3-4 years. This lack of adaptation may generate random inequalities: some fellowships are only awarded every second year, for example, and it's close to impossible to get a decent academic salary in certain countries, independently of the quality of the research being done.

Poor metrics: In my opinion academia struggles with a big unsolved problem: how do we attribute authority, how do we measure and reward someone's contribution, inspiration, impact? The most widespread metrics are the number of citations, the h-index, impact factors, eigenfactors, etc. These are crude, heuristic measures of influence and authority. They certainly make some intuitive sense and vaguely correlate with what we really want to measure, but they are not a solution, just complicated features. If we want a fairer system we have to establish new standards for measuring influence/importance/etc. (Well, first of all we have to agree on what we want to measure.)


Discrete random variables: When measuring output, there are way too many discrete variables involved: Is your paper accepted? Yes or no? Does this paper cite your relevant papers appropriately? Yes or no? Which journal did your work appear in? These binary decisions quantise the opinion formed about your paper: if your paper is accepted, it means some undisclosed set of reviewers who looked at it didn't find a mistake and liked it. But how much did they like it? Were they absolutely amazed by it, or did they just not care enough to reject it? Were the reviewers respected academics, or were they incompetent? What does the rest of the community think? If I want to state that in my opinion a particular piece of research represents a huge step forward in machine learning, pretty much my only option to measurably reward the authors is to cite their paper, but maybe my work is on something completely different, so a citation would be inappropriate.


All in all, the measurement of scientific output is inherently inefficient because of these binary/discrete decisions. Everybody with an engineering background knows that discretisation means loss of information: the measurement of academic output is discretised rather arbitrarily, with little consideration given to effective information transfer. And this loss of bandwidth constrains the system of academic rewards to be either slow and accurate (you have to grow old to get recognition) or fast but inaccurate (you will be judged on the basis of a few noisy random variables). Often, I think, it's both slow and inaccurate :)


Does Flattr suggest a solution? Can something like Flattr help improve academic funding and the assignment of credit? Imagine you get a fellowship, and you have to re-allocate 5% of your earnings each month to fellow scientists according to how important their contributions are to your field. Can this redistribution be done in such a way that equilibrium salaries are commensurate with the value of each scientist? Can we create an incentive-compatible academia, where it is in everyone's best interest to honestly report whose research they find valuable? Let me know what you think in the comments.
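To make the equilibrium question a bit more concrete, here is a toy simulation (entirely my own sketch; the appreciation weights and all numbers are made up): everyone keeps 95% of their monthly income and re-allocates 5% according to fixed weights, and incomes converge to the stationary distribution of the resulting flow.

```python
import numpy as np

# Toy model: n scientists, each keeps 95% of their monthly income and
# re-allocates 5% to peers in proportions given by a fixed (made-up)
# "appreciation" matrix W, with zero diagonal and rows summing to 1.
rng = np.random.default_rng(0)
n = 5
W = rng.random((n, n))
np.fill_diagonal(W, 0.0)
W /= W.sum(axis=1, keepdims=True)

income = np.full(n, 1000.0)      # everyone starts on the same salary
for month in range(200):         # iterate the monthly redistribution
    income = 0.95 * income + 0.05 * (income @ W)

print(np.round(income, 2))       # equilibrium (stationary) incomes
```

Whether scientists would report those appreciation weights honestly is, of course, the incentive-compatibility question, and the hard part.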



Wednesday 25 January 2012

Observe, don't just see, Dr Watson

The other day I was reading Sherlock Holmes stories on the train. The following, quite famous, quote, in which Holmes explains his extraordinary inference and deduction skills to Dr Watson, caught my eye:
Watson: And yet I believe that my eyes are as good as yours.  
Holmes: Quite so. You see, but you do not observe...
In my interpretation this means that everybody is capable of seeing the same evidence, but what makes a good detective is the ability to focus on the relevant aspects and to put the pieces together to reconstruct the story underlying the observations.


I started thinking about the relevance of this philosophy to data science and data-driven enterprises. Many companies, governments, etc. now realise the value of data: the more data they have about their users/citizens, the better position they are in when making decisions - in theory. Take a look, for example, at the bazillion startups based on the business model "People like to engage in activity X. We will make their experience better by leveraging data from social networks". But can they really realise the potential in the data? My impression is that many of them can't. These companies see, but they do not observe: they may be sitting on terabytes of data, but they're incapable of using it to tell the right story about their users.


For that, they need real data scientists, who are a scarce resource. Not every company can afford the luxury of having an experienced data science team playing with their data. Data scientists are the private detectives of the modern world: they are presented with evidence (the data) and asked to uncover the hidden story that explains it all. In most cases Sherlock Holmes had to find out who the murderer was and what their motivations were; data scientists (or rather, the algorithms they build) try to figure out what made you share a link on Facebook, whether your recent credit card transaction was made by you or was fraudulent, or how to tell from thousands of USB stick descriptions that "four gigabytes" and "4GB" mean the same thing.


For the average reader, the above examples would be somewhat less exciting than any of the murder cases Sir Arthur Conan Doyle's famous fictional consulting detective has ever worked on. But hey, maybe that's only because no-one with sufficient writing talent has picked this up yet: "Twelve Adventures of Hercule Bayes, the Consultant Data Scientist". I'd buy it.