Monday, December 3, 2012

Personalized PageRank for link prediction

I was reading Edwin Chen's blog (he is a famous data scientist, and my role model!) and came across an interesting post.

The problem was edge prediction in a given social graph. After reading through Edwin's post, I decided to implement it with Map/Reduce, since I am not fluent with Scalding and Scala yet (these days I have been busy catching up on the Functional Programming course from Coursera, and it's so much fun).

An excerpt from Edwin's blog:

Briefly, a personalized PageRank is like standard PageRank, except that when randomly teleporting to a new node, the surfer always teleports back to the given source node being personalized (rather than to a node chosen uniformly at random, as in the classic PageRank algorithm).
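To make the quoted definition concrete, here is a toy Monte-Carlo step in Java. This is my own sketch, not code from Edwin's post; alpha here is the probability of following an out-edge, matching the update formula I use later.

```java
import java.util.List;
import java.util.Map;
import java.util.Random;

// One step of the personalized random surfer: with probability alpha follow a
// random out-edge; otherwise teleport back to the personalized source node
// (never to a uniformly random node, unlike classic PageRank).
public class PersonalizedSurfer {
    public static long step(Map<Long, List<Long>> graph, long source,
                            long current, double alpha, Random rng) {
        List<Long> neighbors = graph.get(current);
        if (rng.nextDouble() < alpha && neighbors != null && !neighbors.isEmpty()) {
            return neighbors.get(rng.nextInt(neighbors.size())); // follow an edge
        }
        return source; // teleport back to the source node
    }
}
```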

I was wondering how efficiently personalized PageRank could be computed for all vertices in a graph. Using Map/Reduce with Mahout's matrix library, the computation becomes the following (a sketch of the core update appears after the list):

1. Build the following matrix.
2. Build the follower matrix.
3. Union the following and follower matrices and row-normalize => undirected adjacency matrix.
4. Create a diagonal matrix and initialize startMatrix and prevMatrix to that diagonal.
5. updatedMatrix = startMatrix * (1 - alpha) + (prevMatrix * adjacencyMatrix) * alpha; iterate this a fixed number of times.
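To check the arithmetic of step 5, here is a minimal in-memory sketch using Mahout's matrix classes on a toy 3-node graph. The actual job works on HDFS-backed rows and performs each multiply as a Map/Reduce pass, so the class and variable names below are just for illustration.

```java
import org.apache.mahout.math.DenseMatrix;
import org.apache.mahout.math.Matrix;

public class PprSketch {
    // Iterate step 5: updated = start * (1 - alpha) + (prev * adjacency) * alpha.
    // Row v of the result is the PPR vector personalized to vertex v.
    public static Matrix iterate(Matrix start, Matrix adjacency,
                                 double alpha, int iterations) {
        Matrix prev = start;
        for (int i = 0; i < iterations; i++) {
            prev = start.times(1.0 - alpha).plus(prev.times(adjacency).times(alpha));
        }
        return prev;
    }

    public static void main(String[] args) {
        // Toy undirected graph 0-1, 1-2; rows already normalized as in step 3.
        Matrix adjacency = new DenseMatrix(new double[][] {
            {0.0, 1.0, 0.0},
            {0.5, 0.0, 0.5},
            {0.0, 1.0, 0.0}
        });
        Matrix start = new DenseMatrix(3, 3);
        for (int v = 0; v < 3; v++) {
            start.set(v, v, 1.0); // step 4: diagonal start/prev matrix
        }
        System.out.println(iterate(start, adjacency, 0.85, 3));
    }
}
```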

Even though Edwin built many features and trained a random forest to get an awesome model, I think that in reality the vertex-vertex similarity computations would become too costly, so I focused only on scaling personalized PageRank up to a large graph.

To test the Map/Reduce job's performance, I used the twitter-2010 dataset. Its statistics follow:
nodes: 41 652 230
arcs: 1 468 365 182
bits/link: 13.897 (64.29%)
bits/link (transpose): 13.148 (60.83%)
average degree: 35.253
maximum indegree: 770 155
maximum outdegree: 2 997 469
dangling nodes: 3.72%
buckets: 0.09%
largest component: 33 479 734 (80.38%)
spid: 0.17 (± 0.001)
average distance: 4.46 (± 0.002)
reachable pairs: 81.08% (± 0.257)
median distance: 5 (73.16%)
harmonic diameter: 5.29 (± 0.016)


Note that the maximum degree can be over 3,000,000, which quickly becomes a computational bottleneck, so I pruned vertices with degree above 10,000. We don't actually care much about these vertices, since they already have plenty of links. A sketch of such a pruning pass is below.
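For illustration, the pruning could be a simple Hadoop mapper like the following. The adjacency-list input format and class name are my assumptions here, not the actual PersonalizedPageRank code.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical pruning pass: each input line is "vertexId<TAB>neighbor,neighbor,...".
// Rows whose degree exceeds MAX_DEGREE are dropped before building the
// adjacency matrix.
public class DegreePruneMapper
        extends Mapper<LongWritable, Text, Text, Text> {

    private static final int MAX_DEGREE = 10000;

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] parts = value.toString().split("\t", 2);
        if (parts.length < 2) {
            return; // dangling vertex, no neighbor list
        }
        int degree = parts[1].split(",").length;
        if (degree <= MAX_DEGREE) {
            context.write(new Text(parts[0]), new Text(parts[1]));
        }
    }
}
```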

The result was better than I expected: the Map/Reduce job finishes 3 iterations in less than 2 hours on a 10-datanode cluster (8 cores and 16GB per node).

I pushed my code into my collaborative filtering experiment codebase as the class PersonalizedPageRank.
I will port this code to Giraph and compare runtimes.

Beyond performance, I plan to evaluate personalized-PageRank-based link prediction by using the PPR scores as recommendations and the actual follow data as the test set. I think MAP over the top-K recommendations can be used as the evaluation metric; a sketch is below. I will post the numbers in a few days.
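A minimal sketch of MAP@K, assuming the per-vertex top-K recommendation lists and ground-truth edge sets are already materialized (the names here are hypothetical):

```java
import java.util.List;
import java.util.Set;

public class MapAtK {

    // Average precision at K for a single source vertex: precision is
    // accumulated at each rank where a true edge appears.
    static double averagePrecision(List<Long> rankedTopK, Set<Long> actual, int k) {
        if (actual.isEmpty()) {
            return 0.0;
        }
        double hits = 0.0, sumPrecision = 0.0;
        for (int i = 0; i < Math.min(k, rankedTopK.size()); i++) {
            if (actual.contains(rankedTopK.get(i))) {
                hits += 1.0;
                sumPrecision += hits / (i + 1); // precision at this cut-off
            }
        }
        return sumPrecision / Math.min(actual.size(), k);
    }

    // MAP@K is the mean of the per-vertex average precisions.
    static double meanAveragePrecision(List<List<Long>> recs,
                                       List<Set<Long>> truths, int k) {
        double sum = 0.0;
        for (int v = 0; v < recs.size(); v++) {
            sum += averagePrecision(recs.get(v), truths.get(v), k);
        }
        return recs.isEmpty() ? 0.0 : sum / recs.size();
    }
}
```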

1 comment:

Unknown said...

Hi Do,
I was also reading Edwin Chen's blog, and I also thought that using MapReduce would be an interesting way to predict edges in very large graphs.

I have just started using Giraph with the Kaggle Facebook data set, but I am not getting very far. Can you share any more details of how you used Giraph for edge prediction? I can't see the PersonalizedPageRank class on GitHub.

Thank you,
Peter
