Storing Research Data -- SQLite vs. The Other Guys
StoringResearchData Created Monday 27 February 2012
I generally have a real aversion to using a full-on SQL implementation like MySQL or Postgres when it's not really necessary... and I typically think it's unnecessary until something forces me to change my mind. When I started writing small projects in Perl, I used to use Berkeley DB key-value pairs — lately I've used Python Pickles for similar purposes. It's simple and quick, which is nice, but it basically forces you to roll your own code.
Lately, everyone's been getting on the SQLite bandwagon, and it's pretty awesome. I've moved to using SQLite as my first choice when storing anything beyond flat text data. It has nice portability characteristics (unlike your homemade solution), simple backup and export formats. And, being able to make queries on the data is great for me, since these days most of my data are experimental results, etc.
But another good reason for using an external, portable data sink became obvious when I started to visualize and analyze my data using R. R has the RSQLite package, which made importing data into R for plotting and analysis a breeze. (Or as much of a breeze as anything is in R.) The thing is, roll-your-own formats may be perfectly "good enough" for isolated projects, but the minute you start wanting to use a different tool to view the data, having it already be in a format that is easily accessible is a major win. And if you're like me, you won't always know that in two months you're going to want to use the data in some completely different way. So I feel like I got that capability for "free" just because I decided to store my data in a more structured and portable way.
But SQLite is targeted at a specific niche — projects that would benefit from SQL behaviors but don't need all the robustness and consistency guarantees of "enterprise" databases. While you get good performance (because SQLite works inside your code, rather than through RPCs), if you start wishing for read/write concurrency (say, importing new data and plotting other unrelated data at the same time) you may find yourself frustrated with SQLite's limitations. That's what happened to me. As I started to generate large numbers of plots, I wanted to be able to run multiple scripts (some of which generate plots, some of which update other tables) at the same time — SQLite can balk at this.
So, I switched to a more traditional SQL database, which itself was relatively painless because of the underlying SQL standard. (R has libraries for it as well — another good reason not to roll your own even if you don't need all that SQL provides.) And in turn, this highlighted another unexpected value for the more "enterprise" systems: caching. Re-running long queries just to tweak a plot is considerably quicker with the big server as opposed to SQLite.
None of this is new information — in fact, it's a pretty textbook tour of the hows and whys of data storage. But for me, it inspired a change of heart, thanks to the convenience factor. In the future, I'll probably start with a Big Dog SQL server for research projects (running on my personal laptop) because it avoids the papercuts encountered when my projects get to big for SQLite. But I'll stick to SQLite for simpler things, to avoid the dependencies created by the more enterprise approach.