I'm building a new version control system, and I need input from people I want to use it: people like you. If we're going to get version control right, how should we do it?
Recently, I've been playing about with Perforce, a rather nice version control system. It's very easy to use, it's got a bunch of neat features, but it's not free software. So, initially, I set off to create myself a free equivalent of Perforce; something to fix up (what I see as) the deficiencies in CVS and keep Perforce's good bits. I'm calling the system `Perverse' - the Perl Version System.
It's gone beyond that, though. This is the chance I have to create a free version control system and get it right. (Read: "dammit!") To do this, I need input from the people I want to use the system: the developers, you guys. You can see the current wishlist at http://perverse.sourceforge.net/wishlist.html, but I think Advogato is a good place for us to have a discussion on version control and how it should be done.
Yes, I know that CVS does this and Aegis does that and Bitkeeper does the other; now's the time to think about what we want from a perfect version control system, not a currently existing one.
Then all me and the boys have to do is go away and write it...
Great. Yet Another Version Control System. And echoes of a newbie... "Wow. Wouldn't this be great?! I'm gonna build <this>! Okay. Now everybody tell me what it needs! Oh, and who wants to help? I don't have any code, so a lot of help would be great!"
On the other hand, Subversion is a project funded by Collab.Net. It is being led by Karl Fogel, one of the original CVS developers. A complete design is there, and initial coding is in the works. There are several dedicated developers (Karl and several others) producing the code, rather than somebody's "wish" to see something better.
Seriously. By all means, Perverse and other systems can and should be built. Exploration of ideas through multiple development streams. That is what Open Source is all about. But I'm going to place my bet on Karl and friends.
[ disclaimer: yes, I'm also working on Subversion. But I'm only doing the network coding. Karl, Ben, and Jim are defining what Subversion really "is". ]
My personal view, after having worked on version control system design for use in Open Source, is that Subversion is a waste of time. It is an incremental improvement over CVS, but it avoids the thorny and truly important issue: Distributed Branching.
This is very difficult to retrofit properly onto a design. By not spending the extra time up front to support it, Subversion IMO will be at best a short-term solution that has to be replaced shortly (possibly with another version control system also named 'Subversion' in order to keep market share). At worst it will be actively destructive for the Open Source community, by blocking distributed branching and raising the bar for replacement systems in a way that makes them later in coming.
Suppose you just want to keep a bunch of geographically distributed developers working on the same project with tight coordination. Distributed branching does not seem so critical...
In other cases, I can see where it would be useful, but I can also see how it would make things worse by making "code forking" easier...
It would be nice to have distributed repositories that can support asynchronous programming teams. One team should be able to selectively incorporate changes from a remote repository on a commit-by-commit basis.
I've never tried cvs branching, but I saw the cheat sheet someone wrote up for themselves on all the steps involved. I'd like branching to work much like asynchronous distributed repositories, with support for partial merges of two branches. Sometimes two branches (or two repositories) will never synch up completely, because one branch (or repository) contains custom modifications that don't belong elsewhere. Direct support for this kind of usage would be nice. You can do it today with patch, but that takes lots of discipline, and is prone to error.
Finally, I've always found that any attempt to change the MANIFEST of a source distribution (add new files, delete old files, rename or migrate files) always required extra patience with cvs. You have to delete files first, then remember what they were to invoke "cvs remove". Is there any rename capability? How about moving a directory?
All praise to Brian Berliner (and Larry Wall and whoever wrote rcs) for making me fairly happy with my diff management tools. As well as Brian Hogencamp for the cvs wrapper scripts I use every day. Are you the next in line?
It is not critical for a small project with few developers and tight coordination. OTOH, for that kind of project, almost anything will work.
For a large project, distributed branching is good in that it allows a much larger continuum of forks, making it possible for forks to live for a while before being folded back into the old project, or for a fork and its parent to share the minor changes (bugfixes and minor, clear improvements).
Distributed branches also have another major advantage: easier scalability, and "routing around" bad management. In a world where development is done in local branches and merged into "larger" branches (more used ones), whoever does the best integration wins (technically).
Forks are bad due to wasted effort; a large part of the reason for the wasted effort is the lack of good enough tools for handling forks. Small forks and rejoins can be very healthy for a project (e.g., the gcc -> egcs -> gcc development).
OK. Here's my CVS wishlist. I agree with other posters that we should do an incremental change to CVS.
First, the import/vendor branch mechanism is weak. Once a file leaves the vendor branch, it can never return. Well, that's not entirely true: you can set the default branch back to the vendor branch, but then you lose the ability to get the right file with the -D command. So I'd like support for tracking the date/time when default branches changed, so that this could be made to work.
Second, CVS is too slow. It takes forever to cruise through the FreeBSD cvs repo. A better way to manage the repository is needed. Perforce is nice because it KNOWS all the changes to the repo relative to the current working tree. cvsup is quite a bit faster than CVS because it has done many good optimizations.
There needs to be something like Perforce change numbers where each change collects a number of files. This will allow bonehead mistakes to be detected and avoided more easily (eg, forgetting to commit a file, maybe in a different directory).
Branching and merging need to be easier in CVS. If you've ever done lots of branching in both CVS and Perforce, you know how much more painful it is to do on CVS than Perforce. If branches were easier, it would be easier to work collaboratively on experimental things in a distributed manner (eg NEWCARD would be a branch, not in the mainline).
One could argue for days about the differences between the explicit-lock and catch-all change models of CVS and Perforce. I love and hate both of them.
One thing that needs to be considered is remote distribution as well as remote mirroring. Perforce makes the mirroring hard.
Look at TeamWare from Sun. This is the system that BitKeeper was (mostly) based on. You should be able to use their try and buy thing to get a 30 day license for it if you want to experiment to see how it works... you can certainly read the docs on docs.sun.com.
Its primary deficiency, in my mind, is the lack of support for anything other than filesystem mounts to talk to remote workspaces. Being able to use something like TeamWare but with a protocol in between would be ideal in my mind.
TeamWare gets the "distributed branching" thing very, very right, in my mind... it's just that the distribution part only works properly with NFS over local networks.
This kind of reminds me of the discussion on autoconf and automake...
What I want to do as a developer is develop. What I don't want to do as a developer is spend a lot of time wrapped up in the picky details of developing that don't have anything to do with the functionality of the program. What I mean by these things are autoconf, automake, worrying about portability, using cvs, pushing and pulling files all over the internet, etc. I just want to code. After you learn the UNIX shell, sed, awk, perl, C, java, python, automake, autoconf, and numerous other languages and sublanguages, it's not fun or a challenge to keep learning obscure syntax that is only good for one thing, it's pretty friggin' annoying.
Granted, some of this is unavoidable. I'm not going to stand up and say that autoconf and automake are great, but they are a necessary evil, because portability is still an issue. What I don't like is how every single program as it increases in flexibility ends up tacking on their own extension language and/or scripting facility. The myriad of different things a programmer has to know just to write a program that can successfully write "Hello World" over a telnet connection and compile on more than about 3 OS's is really ridiculous.
So, in my opinion, what do I want in a versioning system? I want it to be controllable through a language that is already known, (something like perl or guile would be good, but for gods sakes, don't create another language or scripting facility) and I want the tool to get out of my way. A versioning system has nothing to do with the functionality of the program it is managing and as such should take as little time to administer, use, install, update, troubleshoot, etc. as possible.
CVS is a kick ass system, and I actually read the CVS manual, (I think Per Bothner wrote it, or?) and I learned it, but I would have much preferred something more simple. Because really, you can't really use CVS without also using diff, patch, and many other tools. Sure UNIX is cumulative, but the learning curve (not in difficulty, but merely in time) is way too steep for a tool that doesn't actually contribute to the program but only the organization of the source code.
All of that ranting said, I don't know how to build a program that would fit my own needs, but then again, I don't have to, because I'm not proposing to implement a versioning system. :) What I do know is that if it uses some sort of language or scripting, it should use something widely known, it should stay as simple as possible, and using basic UNIX shell knowledge, the user should be able to sit down in front of it and pretty much figure it out. You'll know you failed if you end up with a 200 page manual, 8,000 features, a nifty scripting language that when used feels like trying to drive a nail with a wrench, and 10 O'Reilly books on your program. Because if it warrants that much work and documentation in order to use it, it probably isn't done well.
I'm not down on complex programs, after all, hell I'm an emacs bigot. I'm just against losing a hefty portion of time that would otherwise be productive learning tools that are merely a means to an end. A versioning system is not an end in itself (like, say, nethack) it's a tool that I want to use and put away. Please make sure that your version of this tool isn't like Microsoft's paperclip. :)
Two quick clarifications on the subversion project:
First, "funded by CollabNet" is too strong. It is being hosted at subversion.tigris.org and CollabNet is bearing the hosting costs. Also, some of the core people working on it are CollabNet employees. But then again, some of the most important people working on it are employed elsewhere. Actually, there is probably a good overlap between core subversion contributors and advogato users :)
Second, distributed branching has been raised as a desirable issue from the start. The decision was made to focus on "cvs without the bugs" first, simply to get something up and working. Work on the design has explicitly tried to keep options open for things like distributed branches.
Also, I invite interested developers to participate in all the projects on tigris.org. The "twin goals" of the site are to develop a project hosting infrastructure, and to host open source software engineering tool projects. The contributions of the more senior open source developers of advogato are very welcome.
I'd also rather people just work on improving CVS.
My wishlist is more specific than imp's. To wit:
If someone were to tackle the above, I'd be grateful.
I was very intrigued by PRCS; especially its Xdelta filesystem. I very much dislike the fact that it relies on Berkeley DB 3, which has an icky license. Actually, I'm most intrigued by Xdelta; I can see many uses for it beyond source code control. (Think wiki with WayBackMode yet not relying on a source code version control system, which is, IMO, too beefy for its needs.)
I have to agree with the apparent consensus here that we don't need yet another version control system project. There are plenty of half-finished ones out there already.
The thing I most dislike about CVS is it is too complex to learn.
I suggest that your Perverse system should have both a command line and a GUI interface. The command line should be written first, and the GUI version should run command-line scripts to operate.
Firstly, you should write the complete documentation needed to use the command line interface; this should include the concepts involved, and the actual commands. If this comes to more than about 3 pages of A4, it is too complicated. Review this on the web until you get a simple, coherent, consistent user interface.
Yet Another Version Control System.
Yep, the point of asking here is to find out what would stop it becoming yet another VCS.
And echoes of a newbie...
We'll let posterity decide that one... :)
Now everybody tell me what it needs!
Hmm. A lot of Open Source software is written to scratch an itch; this is all well and good, and a lot of really good software was written this way. Unfortunately, there's a lot of really bad software written this way too, because developers have been developing in a vacuum scratching an itch that they and nobody else have. If I'm going to scratch other people's itches, I don't think it's unreasonable for me to ask people what they are...
Oh, and who wants to help?
No, sir, I did not say that, and with good reason. Adding manpower to a late software program just makes it later. I'm more than happy to (in fact, I'd rather) write this myself, but then I'm grateful for the small team of very competent developers who have volunteered to help with this without me calling. But no more!
I don't have any code
No, sir, this is not true. I do not have very much code, but it's enough to self-host the project. I'm one of these rare types who doesn't believe in releasing a project while it's incomplete.
But I'm going to place my bet on Karl and friends.
I'm happy to bet that what they do will be better for some people, but not all of them.
Distributed branching does not seem so critical...
I'd say it was critical because I've come across several scenarios where you have sites that are separate and cannot talk to each other over the network. rsync and cvsup will give you read-only slave sites, but no way of getting slave changes back into the master repository.
Is there any rename capability? How about moving a directory?
CVS will bitch at you forever more if you try and do this; Perforce will bitch at you once. Why haven't people made this easy? Ho hum. :)
First, the import/vendor branch mechanism is weak.
I think the vendor model is not merely weak but actively broken.
There needs to be something like Perforce change numbers where each change collects a number of files.
Collecting multiple file edits into a single atomic change is a necessity.
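To make the idea concrete, here is a minimal sketch of change numbers that each collect several files. It is not how Perforce or Perverse actually stores anything; `Change`, `Repository`, `submit` and `files_in` are names made up for illustration, and the whole thing lives in memory.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Change:
    number: int               # repository-wide change number
    message: str
    files: Dict[str, str]     # path -> new contents for every file in the change


@dataclass
class Repository:
    changes: List[Change] = field(default_factory=list)

    def submit(self, message: str, files: Dict[str, str]) -> int:
        """Record all edited files under one new change number, or none at all."""
        if not files:
            raise ValueError("a change must contain at least one file")
        change = Change(len(self.changes) + 1, message, dict(files))
        self.changes.append(change)   # single append: the change lands as a unit
        return change.number

    def files_in(self, number: int) -> List[str]:
        """List the paths touched by a given change number."""
        return sorted(self.changes[number - 1].files)
```

With something like this, "which files went in with that fix?" is one lookup by change number, and a forgotten file in another directory shows up as a missing entry in the change.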
I suggest that your Perverse system should have both a command line and a GUI interface.
Yes, absolutely; I'm concentrating on exporting a sensible API so people can write their own clients and reporting systems, and so have a system that fits around their existing SCCS/RCS/whatever shell hacks.
Firstly, you should write the complete documentation needed to use the command line interface; this should include the concepts involved, and the actual commands. If this comes to more than about 3 pages of A4, it is too complicated.
Oh, it'll be longer than that not because it's complicated but because I like writing. A cheat sheet should fit on a side of A4; for the beginnings of a user manual for the interface I'm planning, please see the Quickstart manual.
To anyone working on version control: consider robustness.
Can a user (inadvertently or otherwise) corrupt the repository through an otherwise legal set of commands? If the power fails halfway through a commit, will the repository be usable when the machine comes back up? What happens if the repository disk becomes full?
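For the power-failure and disk-full questions, one common approach (an assumption on my part, not something any particular system in this thread is known to do) is to write each repository file to a temporary file and rename it into place. The helper name `atomic_write` is hypothetical.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace `path` with `data` so readers see the old or the new file, never a partial one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())   # push the bytes to disk before the rename
        os.replace(tmp_path, path)   # atomic rename on POSIX filesystems
    except BaseException:
        # A full disk or similar failure leaves the original file untouched.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

This only protects a single file. A commit that touches many files still needs something on top, such as a journal or a single "current revision" pointer that is itself updated this way.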
CVS falls down in these areas. For example:
You have likely screwed yourself here, since the "CVS" directory contains metadata, and it was copied into "dir2". One of the metadata files is a pointer to the repository directory. When "dir2" is committed, the contents of "dir1" in the repository will be overwritten. A new user may not realize this, and even experienced users are prone to oversights at times. I have done this, and a co-worker with 20+ years experience has done this. This problem is also messy to correct.
At my previous job, I wrote a version control system for Mentor Graphics IC design data. It was designed to be used by the typical absent-minded engineer. The software was setuid root. It gave up root privileges while keeping the ability to switch between the user's UID and a project-specific database UID. All metadata was kept in the repository, owned by the database UID. It was simply not possible for users to inadvertently goof up. Coupled with an atomic commit scheme in the repository, the system was almost bulletproof.
I have tried to introduce CVS to new users. I start off with the basics: checkout and commit. I don't get into branching or anything else. Inevitably, they try something that either damages the repository or "poisons" their work directory requiring a knowledgeable user to fix.
Please don't have history repeat itself. And never underestimate the stupidity or resourcefulness of your users. (My version control system checked that the invoking user owns his home directory and will refuse to continue if it is owned by someone else. This check has been triggered through production use. :-)
The one thing I've been missing in my seven or eight months of CVS experience is the ability to commit a bunch of files, but give them individual messages. It would also be nice if it built my changelog out of those commits, but now I'm just talking crazy talk. (Why not use a regexp or a list of files to check out? Auto-tar and gzip on export? Ack, too many ideas!)
It would be kinda nice to be able to review changes checked in by contributors before accepting them, but I guess there's always the possibility to revert to a prior version.
A word to the wise - CVS is lacking in features because writing a version control system is very, very difficult.
To begin with, there are a million fiddly weird cases you have to deal with. What do you do when someone removes a file and someone else modifies it? What if someone deletes a directory when someone else adds a file to it? How do you do merges? All of these are very deep and tricky issues to deal with.
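Even the simplest version of the remove-versus-modify case needs an explicit rule. Here is a rough sketch of a whole-file three-way decision, where `None` stands for "file does not exist". The function name and return convention are invented for illustration, and it makes no attempt at merging contents line by line.

```python
from typing import Optional, Tuple


def merge_file(base: Optional[str], ours: Optional[str], theirs: Optional[str]) -> Tuple[str, Optional[str]]:
    """Decide the fate of one file from three versions; None means the file is absent.

    Returns (status, contents), where status is "clean" or "conflict".
    """
    if ours == theirs:
        return ("clean", ours)        # both sides agree (including both deleting)
    if ours == base:
        return ("clean", theirs)      # only their side changed
    if theirs == base:
        return ("clean", ours)        # only our side changed
    # Both sides changed differently, e.g. one removed the file and the other edited it.
    return ("conflict", None)
```

So `merge_file("a", None, "b")` is a conflict (we removed it, they edited it), while `merge_file("a", None, "a")` cleanly stays removed. Directories, renames and real content merges each multiply the number of cases.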
Less obvious from a developer standpoint, but glaringly obvious to the end user, is that a version control system must be very, very stable. A single corrupted repository will get people to swear off a system forever, and you *cannot* break backwards compatibility with people's old scripts.
We're not talking about simple 'try to not have so many bugs' levels of stability here, I mean setting up the version control system so that it runs the entire regression testing suite every time you try to commit and rejects you if it fails.
Writing a version control system is a project I'd reserve for people with a lot of experience writing very tricky system code. It isn't a 'fun' project unless you honestly enjoy working on difficult aggravating code just because it's difficult aggravating code.
We evaluated ClearCase, and decided to stick with CVS. The only thing that a top-notch tool like ClearCase can do that CVS can't, is cleanly handling file moves. That must be fixed in a CVS replacement, because moves and renames are important when you start to do a lot of refactoring.
I am writing a little shell around CVS, called Trug (trug.sourceforge.net). It is process-driven, in that the developer says: "I'm now going to start to work on bug #123"; "Please backup my work"; "I'm finished with bug #123"; "I now want to integrate the work of bug #123 with the main branch"; etcetera. Trug takes care of the branching and merging commands. I think that is the way version control should be - support a process and support it in a very straightforward manner. Trug will support one process, the one I think is useful for our company. A more general tool may want to offer some "process builder's toolkit".
I hear a lot of people complain about the deficiencies of CVS. I use CVS a lot and I love it. What exactly is wrong with CVS in its current state?
In answer to aaronl's question of what's wrong with CVS, I've tried to set up a list of the most important points. This list is by no means comprehensive; there are a whole bunch of minor things that can be improved, and there may be issues that are major to some projects that I have not covered. And probably none of these issues are major for all projects; they are just major for some, and subtly influence how the projects run.
 For anything beyond a toy project, using modules to simulate renames is likely to lead to disaster. It may be possible to use modules to get renames to work with a wrapper, but this has a series of problems, including but not limited to the fact that modules are not branchable.
 Setting 'cvs server' as the shell for a user does not avoid shell access; CVS is not made to be trusted that way. You can get some protection through chroots, but getting detailed access control for your repository in the face of potentially hostile developers is not really feasible - and even trying to do so is so much work that projects do not do it, which is the main point. "Impossible" and "So inconvenient that people do not do it when they need it" are effectively equivalent for tools.
Two more CVS limitations for the list:
Okay, here's some random thoughts...
One thing that seems to come up a lot when I use CVS is being able to control other aspects of files. As other people have mentioned, being able to move files and directory structures around, and having the control system understand that. You should be able to move directories around, modify files, and all that in one commit. And you should be able to check out the project from before any of those moves, modifies, and whatever, and the control system will put it all in the right place.
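One way to picture this (purely a sketch, with made-up operation names `write` and `move`) is to record moves as first-class operations inside a commit and rebuild any older tree by replaying the history up to that point:

```python
from typing import Dict, List, Tuple

Op = Tuple[str, str, str]   # (kind, path, argument); hypothetical format


def apply_commit(tree: Dict[str, str], ops: List[Op]) -> Dict[str, str]:
    """Return a new tree with every operation of one commit applied."""
    tree = dict(tree)
    for kind, path, arg in ops:
        if kind == "write":                 # add or modify: arg is the new contents
            tree[path] = arg
        elif kind == "move":                # rename a file or directory: arg is the new path
            prefix = path.rstrip("/") + "/"
            for old in list(tree):
                if old == path:
                    tree[arg] = tree.pop(old)
                elif old.startswith(prefix):
                    tree[arg.rstrip("/") + "/" + old[len(prefix):]] = tree.pop(old)
        else:
            raise ValueError(f"unknown operation: {kind}")
    return tree


def checkout(history: List[List[Op]], revision: int) -> Dict[str, str]:
    """Rebuild the tree as it stood after the first `revision` commits."""
    tree: Dict[str, str] = {}
    for ops in history[:revision]:
        tree = apply_commit(tree, ops)
    return tree
```

A commit such as `[("move", "src", "lib"), ("write", "lib/main.c", "v2")]` moves a directory and modifies a file in one go, and `checkout(history, 1)` still puts everything back under `src/`. The sketch ignores awkward cases like moving onto an existing path.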
Another thing to keep handy is being able to understand file permissions, user/group ownerships, links, named pipes, etc. etc. The file permissions and such would obviously have to be handled as some sort of "checkout final" (as opposed to the usual "checkout working"). But as with all things, this should be optional.
Finally, the control system should allow for easy integration with a "policy" system. (Perhaps the previous idea of file permissions/ownerships and such falls under the policy control system.) For example, a project should be able to say only certain users can add files to a directory, or only certain users can modify these files, etc. etc. But these issues are project policies, not source control.
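As a rough illustration of keeping policy separate from source control, the control system could simply call out to a check like the one below before accepting an add or modify. The rule format, the action names and `is_allowed` are all hypothetical.

```python
from fnmatch import fnmatchcase
from typing import List, Set, Tuple

# Each rule: (path pattern, action, users allowed). All names here are hypothetical.
Rule = Tuple[str, str, Set[str]]


def is_allowed(rules: List[Rule], user: str, action: str, path: str) -> bool:
    """Allow an action unless a matching rule exists and does not list the user."""
    for pattern, rule_action, users in rules:
        if rule_action == action and fnmatchcase(path, pattern):
            return user in users      # first matching rule decides
    return True                       # no rule: the project places no restriction
```

For example, with `rules = [("release/*", "add", {"alice"})]`, only alice may add files under `release/`, while everyone may still modify them. The policy lives in project data, and the version control core never needs to know what the rules mean.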
And as with all open source projects, the program should have a good API, and allow for loadable modules, to leverage many people writing small components.