Advogato: Implementation of a network of contacts

Introduction

A while back I had to implement a website that offered some social networking. You know, people could be contacts, could be connected if there was a path following contacts from one to the other, and disconnected otherwise. The set of people connected to some given user constituted his network. Nobody is contact of himself.

Of course we needed to be able to say whether two people were contacts or not, and to know how many contacts a given user has. But we also needed to be able to determine whether two people were connected, to be able to restrict searches to networks, and to efficiently know their sizes. There are websites out there offering this kind of stuff so the first thing I did was to Google around. I didn't find anything of help though, so this article tries to fill that gap.

Representing Contacts

Since we need to keep track of contacts there's a join table with them. As far as those definitions is concerned it doesn't matter who invited to whom, so the relation being contact is symmetric. To simplify the SQL we specify commutativity in the database by doubling the entries in the contacts table. Thus, if u and v are contacts there are rows (u.id, v.id) and (v.id, u.id). This is not necessary technically, it is just a decission taken for convenience that may not be suitable in large sets.

With that table, knowing whether two people are contacts becomes a simple SELECT, and knowing how many contacts some user has becomes a simple SELECT COUNT(*). We can also easily know whether two people who are not contacts are at distance 2 from each other, and obtain the ID of the intermediate contact if that's the case:


  SELECT a.to_id FROM contacts a, contacts b
  WHERE a.from_id = #{u.id} 
    AND b.to_id   = #{v.id}
    AND a.to_id   = b.from_id
  LIMIT 1

Note how the duplication of rows simplifies the SQL, we can think of following contacts in some kind of direction. You may offer tests for distance 2, and perhaps distance 3, but this quickly explodes, so that needs some care. Of course you may cache stuff and provide just approximations. LinkedIn for instance does not give the exact size of your connections at distance 2, at least that's my interpretation of the plus sign attached to their counter.

Representing Networks

The contacts relation gives naturally a graph where u and v are contacts iff there's an edge between their nodes. Since being connected is a transitive relation, it takes a moment to realize that any two people who are connected have the same network.

That's the first key observation. Suppose w belongs to the network of u, which is connected to v. By definition there is a path P that connects w and u. Since there is a path Q connecting u and v, the path PQ connects w and v, and so w belongs to the network of v.

If you visualize this in the graph, at any given moment contact networks are exactly the connected components of the graph of contacts. Take a moment to visualize those connected components. I called them clusters in the application.

Now the second key observation comes: it is trivial to maintain a cluster field in the users table as long as the graph grows. Suppose a new user a joins the website and becomes contact of user u. Right, assign to a the cluster of u. Suppose existing users u and v become contacts. Right, join the components if they do not belong already to the same cluster. That's a single SQL statement:


  UPDATE USERS SET CLUSTER = '#{u.cluster}' WHERE CLUSTER = '#{v.cluster}'

And this is very important because social networks normally grow. It is rare that people break their edge, so the trade-off wins in mean. With this implementation it is trivial to know whether two people are connected, it translates to compare their cluster field. To compute the size of the network of some user translates to a simple SELECT COUNT(*) minus 1, and to restrict searches to the network of a given user you just put the cluster in some join.

If u breaks a relation with v then you need to pick either of them and recompute their transitive closure. That's the trade-off. Either they stay in the same cluster because they are still connected somehow, or either there's a split, that's expensive:


  cluster_so_far = Set.new
  to_visit       = Set.new([self.id])
  while to_visit.size > 0
    user_id = to_visit.entries[0]
    to_visit.delete(user_id)
    cluster_so_far << user_id
    contact_ids = []
    rs = dbh.query("select to_id from contacts where from_id = #{user_id}")
    rs.each {|row| contact_ids << row[0]}
    rs.free  
    to_visit += contact_ids.reject {|i| cluster_so_far.member?(i) || to_visit.member?(i)}
  end    
  
  # Normally IN is faster than OR, and both faster than an UPDATE per identifier.
  cluster_so_far.to_a.each_slice(200) do |s|
    dbh.query("update users set cluster = '#{cluster}' where id in (#{s.join(',')})")
  end

Conclusions

This implementation of a network of contacts has been running in production for some time, albeit the website has just a few number of users and have no experience about its scalability. I figured this out on my own and makes sense to me, but I am sure there's room for improvement and extensions.

Implementation of a network of contacts

Posted 10 Feb 2007 at 23:31 UTC by fxn

Finding community structure within networks, posted 11 Feb 2007 at 17:56 UTC by StevenRainwater » (Master)

related: 'tagging', posted 15 Feb 2007 at 12:35 UTC by lkcl » (Master)