Sunday, December 14, 2008

MIT Tech Rev article on Timing and Virality: "A Winning Web Formula"

MIT Technology Review had an interesting article on predicting the popularity of a video, or of a story on Digg, based on the number of visits it receives during its first few hours: http://technologyreview.com/web/21753/?a=f. The idea is that advertisers can use that knowledge to place ads on the items that are going to be more popular. To quote part of the story:

"In the case of Digg, [Bernardo] Huberman [of HP Labs] says that within the first few hours it is clear whether a story will become popular or not (depending on how many "diggs"--or votes--it receives from the site's community of readers). Factoring in the time of day that a story is submitted (a noontime story will get, on average, twice as many early diggs as a story submitted at midnight), the researchers found that if a story receives a low number of diggs, it has relatively little hope of being one of the top viewed stories of the day. Conversely, if a story receives hundreds of diggs in the first hours, it's likely to be much more popular."

The mathematics behind this is interesting. I consulted for a social net telephony company, and one of the noticeable things was that the timing of the invitations mattered. Tuesdays and Wednesdays worked out better than Fridays or Mondays, for instance. There are good social and psychological reasons behind this. But since virality is an exponential function, even a tenth of a percent increase in acceptance rates goes a long way.

Just to clarify what I mean: if you have a system where a member brings on 1.2 friends on average, and they in turn each bring 1.2 members, and so on, and so on... then you have a virality coefficient of 1.2. The total number of people who have joined your network after t iterations of people inviting 1.2 friends is a geometric sum. So for a general virality coefficient R (instead of 1.2), and after t iterations, the number of members you have is a function of R^(t+1) and 1/(R-1), where ^ means "to the power of".
Specifically, starting with x people at time t=0, the total number is
x * (R^(t+1) - 1) / (R - 1).
Now since R-1 is a constant and we can write R^(t+1) = e^((t+1) * log(R)), we see why this function grows exponentially in t whenever R > 1.

This is why it is important for the virality coefficient R to be greater than 1.0. Otherwise, for R < 1 and t going to infinity, it is easy to see that this total is equal to x/(1-R). That is, the number of members approaches a constant value.
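To make the geometric sum concrete, here is a minimal Python sketch. The function names (total_members, saturation_level) and the seed count of 100 are my own illustrative choices, not anything from the article. It evaluates the closed-form total for a few values of R, including two that differ by only a tenth of a percent, and the limiting value for R < 1.

```python
def total_members(x, R, t):
    """Total members after t rounds of invitations, starting from x seed members.

    This is the geometric sum x * (1 + R + R**2 + ... + R**t),
    which equals x * (R**(t + 1) - 1) / (R - 1) when R != 1.
    """
    if R == 1:
        # Degenerate case: every round adds exactly x new members.
        return x * (t + 1)
    return x * (R ** (t + 1) - 1) / (R - 1)


def saturation_level(x, R):
    """Limit of total_members as t grows; only defined for 0 <= R < 1."""
    if not 0 <= R < 1:
        raise ValueError("the total only converges for 0 <= R < 1")
    return x / (1 - R)


if __name__ == "__main__":
    seeds = 100  # hypothetical number of initial members
    for R in (0.8, 1.2, 1.201):
        totals = [round(total_members(seeds, R, t)) for t in (10, 20, 40)]
        print(f"R = {R}: totals after 10, 20, 40 rounds = {totals}")
    print("saturation level for R = 0.8:", saturation_level(seeds, 0.8))
```

Running it shows the two regimes side by side: for R = 0.8 the totals flatten out near x/(1-R), while for R just above 1 the gap between 1.2 and 1.201 keeps widening with every round.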

So one may argue that the initial visits to an article on Digg or a YouTube video reinforce one another and set the virality coefficient R. But I think there is something more interesting going on here: the more a video is watched, or the more diggs an article gets, the more likely it is for other people to see it, almost regardless of content. R may actually increase with time, until it reaches saturation and drops off. This is mainly because you don't know what the item is until you see it, and your look inadvertently adds to the number of views. So you are in effect blindly following "the wisdom of the crowd", who themselves followed the wisdom of the crowd before them :-) One may argue it is more like a herd mentality instead. That is why Duncan Watts, a researcher at Yahoo, could show that the quality of a song is a poor indicator of its popularity.

The real problem is not just that poor-quality articles can become popular by catching a lot of eyeballs early on and so starting from favorable initial conditions. The real tragedy, IMHO, is that high-quality videos or Digg articles that do not get initial viewership get buried in the noise.

One way to avoid this is to not show a digg value for any article initially and to display new articles in random order. Or give initial posts randomly chosen fake values, some of them high, vary those values from viewer to viewer, see whether the posts do get a lot of stars or not, and then readjust accordingly, perhaps with a weighted average whose weights are inversely "related" to the initial fake values. The time of posting and reposting can also be manipulated. That may not be a good thing initially in terms of total viewership :-( But in the long run, it would let higher-quality content rise to the top and bring in more viewers as a result.
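Here is a rough Python sketch of one way the weighted-average idea could work. Everything in it is an assumption of mine: the weight 1 / (1 + shown_score) is just one possible reading of "inversely related", and the herd-effect model in simulate (a vote probability that rises with the displayed score) is a toy stand-in for real viewer behavior, not something measured on Digg.

```python
import random


def debiased_score(observations):
    """Weighted share of positive votes.

    observations is a list of (shown_score, voted) pairs: shown_score is the
    fake value displayed to that viewer, voted is True if the viewer voted
    the item up. Each observation gets weight 1 / (1 + shown_score), so votes
    cast while a high fake score was on display count for less.
    """
    weights = [1.0 / (1.0 + shown) for shown, _ in observations]
    total = sum(weights)
    if total == 0:
        return 0.0  # no observations yet
    positive = sum(w for w, (_, voted) in zip(weights, observations) if voted)
    return positive / total


def simulate(quality, viewers=10000, max_fake=100, herd=0.3, seed=0):
    """Toy model (an assumption): vote probability = quality + herd bonus.

    Each viewer sees a random fake score in [0, max_fake]; the herd bonus
    grows linearly with that fake score.
    """
    rng = random.Random(seed)
    observations = []
    for _ in range(viewers):
        shown = rng.randint(0, max_fake)
        p_vote = min(1.0, quality + herd * shown / max_fake)
        observations.append((shown, rng.random() < p_vote))
    return observations


if __name__ == "__main__":
    for quality in (0.1, 0.5):
        obs = simulate(quality)
        raw = sum(voted for _, voted in obs) / len(obs)
        print(f"quality {quality}: raw share {raw:.3f}, "
              f"debiased {debiased_score(obs):.3f}")
```

The design choice is simply to trust a vote more when the viewer had less of a crowd signal to follow. Other weightings would fit the same description, and which one works best would have to be settled by experiment.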