20 Feb 2007 logic   » (Journeyer)

Ruby ActiveRecord and lazy instantiation

Let's say I have a data set in a SQLite database. And, let's say I need to iterate over every row in that database, performing some action based on it. There's a few ways I can do this. First, the old, direct-to-driver version:

require 'sqlite3'

db = SQLite3::Database.new("test.db")
db.execute("select * from test_entries") do |row|
  ...
end

SQLite example

That works, but of course, it's bound forever to SQLite, and we'd like to abstract that a bit. The "new hotness" appears to be ActiveRecord. So, let's reproduce that using AR:

require 'active_record'

class TestEntry < ActiveRecord::Base; end

ActiveRecord::Base.establish_connection({
  :adapter => 'sqlite3', :dbfile => 'test.db'
})

TestEntry.find(:all) do |entry|
  ...
end

ActiveRecord example

Here's a tip: don't do that. ;-) That find call actually instantiates an ActiveRecord::Base for every single row in the database, then returns the result set, which you're now iterating over. So, for anything but a trivially-small result set, you're going to use a ton of memory, and create an enormous workload for very little good reason.

So, I suppose this blog entry is really just a whine: why isn't there a lazy-instantiation iterator for result sets in ActiveRecord? In this (common, I'd think) case where you want to do some sort of trivial filter or action across the entire table on a per-row basis, but don't care about the row after you've worked with it, that would be a huge win. The interesting thing is that this is exactly how the auto-generated association methods work, and you actually have to go out of your way to do eager loading of associations (since just automatically doing joins under the hood is an obviously bad idea).

My specific use case is an application log stored in a table that I'd like to iterate over for some basic statistical analysis, and a single day of logs is a 600M SQLite database (approximately 3.6M rows). This won't fly for doing a single day of processing, and I'd like to be able to do something for monthly and quarterly reporting.

I'll probably resort to DBI to accomplish this in a database-agnostic manner, but that feels like an incredible step backwards. I just can't understand why the find* methods weren't written as iterators that do lazy instantiation of the result set; the interface would be effectively the same, but the potential performance gains could be tremendous in certain cases.

Syndicated 2007-01-18 12:01:00 from esm

Latest blog entries     Older blog entries

New Advogato Features

New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!