viernes, 24 de septiembre de 2010

Huge Graph Manipulation: Symmetrizing a Directed Edgelist Graph

After months of struggling with efficient proccesing of graph on Python, I am starting to use very old tools from the bash command line to process graph. Remember: Always return to the basics.
In this case we want to extract the maximal symmetric subgraph of a directed graph and convert it to it's undirected representation. It is common to work with edge list representation file including one edge per line, without repeated edges.
The idea is to:
  1. Swap directed edges resulting in a lexicographical order for source and destination for each edge. Needed for next step. Using awk tool.
  2. Sort edges to put together symmetric connections, i.e. nodes connected by edges in both directions. Needed for the next step. Using sort tool.
  3. Filter repeated edges. That is, print only edge with more than one ocurrence. Use uniq tool.
cat graph.txt | awk -F'[ ]' '{ if ($1 <= $2) print $1, $2; else print $2, $1 }' | sort | uniq -d > graph.txt.symm

I benchmarked this one line script in my computer (Core 2 Duo 2.66MGz, 1GB RAM) with a LiveJournal social networks involving 77 million directed edges and the result was obtained after 20 minutes: 28 million undirected edges!
The secret behind standard POSIX tools is the correct usage of memory and disk to sort and process files. These tools don't abuse memory nor disk.