Part of my final year project is a version control system. It’s not the most important part, but it is important.
I searched about for delta compression algorithms, but all of them were either inappropriate or would take too long for me to implement (not to mention overkill for a college project!). So I came up with my own line-oriented one. Last night I wrote a quick implementation of the algorithm out on paper in python. I haven’t tested it out yet, so I’m not sure about its performance. The end product will be written in C++.
def GetDiff(oldLines, newLines):
"""Constructs a diff to get from newlines to oldlines."""
# Construct our special "hash" table.
tblSize = len(newLines) * 2 + 1
tbl = tuple([[] for i in range(tblSize)])
for i in range(len(newLines)):
pos = hash(newLines[i]) mod tblSize
tbl[pos].append(i)
# Holds our diff information.
result = []
# Find the common subsequences and differences.
iOld = 0
iMax = 0
while iOld < len(oldLines):
pos = hash(oldLines[iOld]) mod tblSize
# Start with no match
nMax = 0
for iNew in tbl[pos]:
nCur = 0
while nCur + iNew < len(newLines) and \
newLines[nCur + iNew] == oldLines[nCur + iOld]:
nCur = nCur + 1
if nCur > nMax:
nMax = nCur
iMax = iNew
# Have we found a matching subsequence?
if nMax == 0:
# Nope.
result.append(oldLines[iOld])
iOld = iOld + 1
else:
# Woohoo!
result.append((iMax, nMax))
iOld = iOld + nMax
return result
If anybody has any thoughts on it, I’d love to know.