Monday, December 3, 2012

Personalized Pagerank for link prediction

I read Edwin Chen`s blog(who is famous data scientist, my role model!) and read interesting post.

Problem was Edge Prediction in given social graph. after I read through Edwin`s blog, I decided to implement it with Map/Reduce since I am not fluent with Scalding and scalar yet(These days I have been busy to catching up Functional Programming course from Coursera and it`s so much fun).

excerpt from Edwin`s blog

Briefly, a personalized PageRank is like standard PageRank, except that when randomly teleporting to a new node, the surfer always teleports back to the given source node being personalized (rather than to a node chosen uniformly at random, as in the classic PageRank algorithm).

I was wondering how efficient to compute personalized pagerank for all vertices in graph. using Map/Reduce with mahout`s matrix library, this computation becomes following.

1. build following matrix
2. build follower matrix
3. union following matrix and follower matrix and row normalize => undirected adjacency matrix
4. create digonal matrix and init startMatrix, prevMatrix as digonal.
5. updatedMatrix = startMatrix * (1 - alpha) + (prevMatrix * adjacency matrix) * alpha, iterate this # times.

Even though Edwin built many features and build random forest for awesome model, I think in reality, vertex-vertex similarity computation will become too costly so just focus on scaling personalized pagerank up to large scale.

To test Map/Reduce`s perfomance, I used twitter-2010 dataset.
following is dataset`s stat.
nodes41 652 230
arcs1 468 365 182
bits/link13.897 (64.29%)
bits/link (transpose)13.148 (60.83%)
average degree35.253
maximum indegree770 155
maximum outdegree2 997 469
dangling nodes3.72%
largest component33 479 734 (80.38%)
spid0.17 (± 0.001)
average distance4.46 (± 0.002)
reachable pairs81.08% (± 0.257)
median distance5 (73.16%)
harmonic diameter5.29 (± 0.016)

Note that maximum degree can be over 3000000 which quickly become computational bottleneck. so I pruned vertices with more than 10000 degree which we actually doens`t care much since this vertices already have too much link.

The result was better than I expected. Map/Reduce runs less than 2 hours on 8cores, 16G, 10 datanode cluster for 3 iterations.

I pushed my code into my collaborative filtering experiment codebase as class PersonalizedPageRank.
I will port this code into Giraph and compare runtime.

Other than performance, I plan to evaluate personalized pagerank based link prediction by using PPR score as recommendations and actual training data as test set. think calculating MAP on topK can be used evaluation metric. I will update this metric in a few days.

RC File

I have been wondering how RC file actually works so I took a look at paper.
followings are simple review of paper.

In general, for map/reduce based data warehouse, following requirement needs to be addressed.

1. fast loading: able to load large amount data fast.
2. fast query process: query execution should run fast.
3. efficient storage space utilization: 
4. strong adaptivity to highly dynamic workload patterns: 

To achieve above requirement, there have been 3 major data placement scheme.

1. row-store
  • can`t provide fast query execution(full scan on all records, no skip on unnecessary columns)
  • can`t get high storage space utilization since it is hard to compress row efficiently
  • fast data load
  • strong adaptive ability to dynamo 
2. column-store
  • tuple reconstruction cost lots => query processing slow down
  • record reconstruction overhead => since record reside on multiple location in cluster, network cost goes high.
  • require pre-knowledge of possible queries for creating column groups
  • at query execution time, can avoid reading unnecessary columns.
  • high compression ratio
3. PAX store


What RC File provide
1. horizontally partition rows into multiple row groups
2. column-wise data compression within row group + lazy decompression during query execution
3. flexible row group size, trade off data compression performance query between execution performance
4. guarantees that data in the same row are located in the same node
5. column-wise data compression and skip unnecessary column reads

row group in RC File
1. sync marker: at beginning of the row group
2. metadata header: for row group: # records, each column bytes, each field in each column bytes.
3. table data section: column-store, columnA`s fields, columnB`s field….. sorted order
4. RLE for metadata header
5. each column is compressed independently using heavy-weight Gzip => lazy decompression

future work: automatically select best compression algorithm for each column according to it`s data type and data distribution

TODO: take a look at and others