Part of my final year project is a version control system. It’s not the most important part, but it is important.
I searched about for delta compression algorithms, but all of them were either inappropriate or would take too long for me to implement (not to mention overkill for a college project!). So I came up with my own line-oriented one. Last night I wrote a quick implementation of the algorithm out on paper in python. I haven’t tested it out yet, so I’m not sure about its performance. The end product will be written in C++.
def GetDiff(oldLines, newLines): """Constructs a diff to get from newlines to oldlines."""# Construct our special "hash" table. tblSize = len(newLines) * 2 + 1 tbl = tuple([[] for i in range(tblSize)]) for i in range(len(newLines)): pos = hash(newLines[i]) mod tblSize tbl[pos].append(i)
# Holds our diff information. result = []
# Find the common subsequences and differences. iOld = 0 iMax = 0 while iOld < len(oldLines): pos = hash(oldLines[iOld]) mod tblSize
# Start with no match nMax = 0 for iNew in tbl[pos]: nCur = 0 while nCur + iNew < len(newLines) and \ newLines[nCur + iNew] == oldLines[nCur + iOld]: nCur = nCur + 1 if nCur > nMax: nMax = nCur iMax = iNew
# Have we found a matching subsequence? if nMax == 0: # Nope. result.append(oldLines[iOld]) iOld = iOld + 1 else: # Woohoo! result.append((iMax, nMax)) iOld = iOld + nMax
return result
If anybody has any thoughts on it, I’d love to know.