Problem was Edge Prediction in given social graph. after I read through Edwin`s blog, I decided to implement it with Map/Reduce since I am not fluent with Scalding and scalar yet(These days I have been busy to catching up Functional Programming course from Coursera and it`s so much fun).
excerpt from Edwin`s blog
Briefly, a personalized PageRank is like standard PageRank, except that when randomly teleporting to a new node, the surfer always teleports back to the given source node being personalized (rather than to a node chosen uniformly at random, as in the classic PageRank algorithm).
I was wondering how efficient to compute personalized pagerank for all vertices in graph. using Map/Reduce with mahout`s matrix library, this computation becomes following.
1. build following matrix
2. build follower matrix
3. union following matrix and follower matrix and row normalize => undirected adjacency matrix
4. create digonal matrix and init startMatrix, prevMatrix as digonal.
5. updatedMatrix = startMatrix * (1 - alpha) + (prevMatrix * adjacency matrix) * alpha, iterate this # times.
Even though Edwin built many features and build random forest for awesome model, I think in reality, vertex-vertex similarity computation will become too costly so just focus on scaling personalized pagerank up to large scale.
To test Map/Reduce`s perfomance, I used twitter-2010 dataset.
following is dataset`s stat.
|nodes||41 652 230|
|arcs||1 468 365 182|
|bits/link (transpose)||13.148 (60.83%)|
|maximum indegree||770 155|
|maximum outdegree||2 997 469|
|largest component||33 479 734 (80.38%)|
|spid||0.17 (± 0.001)|
|average distance||4.46 (± 0.002)|
|reachable pairs||81.08% (± 0.257)|
|median distance||5 (73.16%)|
|harmonic diameter||5.29 (± 0.016)|
Note that maximum degree can be over 3000000 which quickly become computational bottleneck. so I pruned vertices with more than 10000 degree which we actually doens`t care much since this vertices already have too much link.
The result was better than I expected. Map/Reduce runs less than 2 hours on 8cores, 16G, 10 datanode cluster for 3 iterations.
I pushed my code into my collaborative filtering experiment codebase as class PersonalizedPageRank.
I will port this code into Giraph and compare runtime.
Other than performance, I plan to evaluate personalized pagerank based link prediction by using PPR score as recommendations and actual training data as test set. think calculating MAP on topK can be used evaluation metric. I will update this metric in a few days.