
Image of Thomas The Tank Engine is from the
New York Times, I think, but I can’t find the exact page.
I love ActiveRecord. I don’t really much like working with relational databases (like Tolkein wizards, databases tend to be good at getting you out of the kind of trouble that databases get you into…).
Sometimes, though, you get a requirement that seems explicitly designed to drive both ActiveRecord and your database of choice stark raving mad. Here’s an example based on an actual client requirement (details changed to protect the innocent):
Let’s say you are running an application for swim coaches to store the times of their swimmers in meets. A coach needs to be able to log in and see all of a swimmer’s times in a given event — say, the 100 meter freestyle.
Easy enough so far. Now let’s say that when a coach runs a report, she wants to see not just a swimmer’s individual times in an event, but also the swimmer’s relay times in any event. However, the coach only needs to see the most recent relay time for a given swimmer. Let’s also assume that older relay times should not be shown even in historical reports — so if the coach is running a report for November 2008, a relay time in December 2008 should still block a November time from being displayed. Got it?
What to do? Unless I’m mistaking some arcane ActiveRecord feature, can we establish that this can’t be done in a single AR call. A relay time shows up in the report if a) the swimmer shows up in the individual part of the report and b) this is the most recent relay time for that swimmer — in other words, it’s dynamic in two ways.
Here are three ways that you can try and bang through a complex problem like this one. These aren’t mutually exclusive, mix and match to solve your own complicated problem.
Drop down to the database’s level
If ActiveRecord’s hooks into the database aren’t powerful enough for what you want to do, there’s always find_by_sql if you want to write raw SQL. In theory, you could drop all the way down to stored procedures, though Rails is really not going to make that easy for you.
The advantage of moving down to the database level is, at least potentially, that it’ll be faster than solutions at the Ruby/ActiveRecord level. The downside is that the chunk of SQL is likely to be somewhat opaque and hard to maintain. Stored procedures add other difficulties in terms of making testing and deployment more complex.
I’d love to show you an example of going to the database for this example, but honestly, we tinkered with this solution, but never walked all that far down the path. I’m not enough of an SQL wizard personally to make this anything but painful.
Divide and Conquer
If the overall search behavior is too complex for a single ActiveRecord call, then split it up into multiple calls and merge the results together. The obvious advantages of this plan is that the resulting Ruby code should be cleaner and easier to follow then the SQL chunk — Ruby is somewhat easier to decompose into smaller pieces. One disadvantage is that you are making more database calls — though this might be somewhat mitigated by the fact that the individual database calls are simpler and possibly faster than the one mega-call. A second disadvantage is that sort (and maybe filter) features that would otherwise be managed by the database are handled somewhat more slowly in Ruby.
A hidden advantage of this approach is that it’s probably the easiest to build up via a test-driven approach. Once the tests are in place, you can move to a faster solution with more confidence.
In this particular case, I tried to write a two-step plan, one find for the individual times, and then another find using the list of individual names to gather the relay times. That would have worked, except for the requirement about the relay time having to be the most recent.
The next step was to try a multi-step plan with one find for the individual times, and then a further search for each swimmer to find the relevant most recent relay times. The sets were then combined and sorted in Ruby. A complication was the requirement that older times never show up in the report, even if the most recent time wouldn’t. The Ruby code for this was actually not bad — the main method looked like this
def execute
result = individual_times + displayable_relay_times
result = result.sort_by { |r| r.sort_tuple(sort_column) }
result.paginate(:per_page => self.page_size, :page => self.page)
end
And the per user method looked something like this — obviously I’ve left some less-relevant pieces out:
def individual_swimmers
@individual_swimmers ||= individual_times.map(&:swimmer).uniq
end
def displayable_relay_times
@displayable_relay_times ||= begin
individual_swimmers.map { |s|
s.displayable_relay_times(filter)
}.flatten
end
end
The performance was surprisingly not bad when there weren’t that many swimmers. It was able to handle the 200-500. Which actually would probably work for about 95 – 99% of all user searches. Unfortunately the other 1 – 5% could be as much as ten times bigger.
The real killer here is the pagination — the code here paginates at the end after all the data is grabbed. If you could paginate just on the swimmer’s names, it would put a hard limit on the number of database calls in one report, and would probably be fine for performance. But — you really can’t do that. For one thing, you don’t know how many relays there are, so you can’t actually say that there will be 30 records on each page. Even if that is acceptable, unless you are sorting on the swimmer’s name, determining which records go on each page is basically impossible given the expectations of the sorted display. And even if you could work through that somehow, there’s still the issue of export to an external file, which wouldn’t paginate at all, and would still be really, really slow.
Still, if all the reports were in the 200 range or so, I probably would have stuck with this for a while.
On the plus side I have tests in place, which will come in handy when I try this:
Pre-Define Valuable Data
In this case, although the determination of which relay records to display is complicated in the context of a single database search, in general, recognizing which relay records are most recent is not complex, and is a deterministic property of the data at any given time.
So… say you added a column to the database called “most_recent_relay” and wrote a rake task to populate it user-by-user (hey, you could even use the swimmer method hinted at in the previous method). With that column in place, this report can definitely be managed with two ActiveRecord calls (the second would take the list of swimmer_ids, and find all most_recent_relay times with swimmers in that list). You might even be able to squeeze this into one ActiveRecord call if you were willing to tangle yourself in nested WHERE clauses for a few hours. (I think — I haven’t actually implemented this step yet).
The downside of this mechanism is the bookkeeping. Every time a new relay record is saved, the most_recent_relay field needs to be updated for that swimmer’s times. This is a good case for an ActiveRecord callback to manage this automatically when you try to save a record. A mistake here could lead to subtle bugs.
That’s what I’ve got. Anybody else got a better idea?
Also, feel free to check out Rails Prescriptions, home of a new ebook on Rails testing, coming soon to a PDF reader near you.

Will pagenate has a wrapper method around find by SQL for pagenation
I would have to sit down and seriously think about this before I came up with a SQL solution (it would involve a subselect), but I would definitely do that instead of redesigning part of the application.
I don’t think the SQL would be too bad. Can you share the salient details of the data model and a bit of sample data (for testing)? I don’t know enough about the domain to model it.
Perhaps I was not completely clear about this — I’m not arguing that this couldn’t be done in SQL, only that it can’t be done in “core” ActiveRecord. Digging straight into SQL is one strategy for solving it. It’s not the solution I’ve chosen at the moment. My experience with complicated SQL queries is not great, especially when it comes to long-term maintenance, but obviously it can be made to work. The person who worked on this problem before I did was probably more inclined to an esoteric SQL solution than I was.
For what it’s worth, the actual data model (which is not about swimmers), is slightly more complex than presented here — in particular it has more joins that need to be there to manage various filter and sort options.
I think you were clear, Noel. It sounds like you didn’t choose the SQL approach because you thought it would be hard to understand. However, I think it would actually be more maintainable than the caching approach, since it would have fewer moving parts and have less chance of being buggy. If it were indeed the best-performing, least buggy approach, then it would be well worth your while to consider it an opportunity to improve your SQL skills. I would recommend a test-driven approach.
The only “simple” SQL I could imagine would be a UNION. I can’t think of anything in ActiveRecord that would deliver that in one find. OTTOMH I can’t think of a nice way to wrap SQL set operations. In any case, I think I prefer the dual-query option, which is something I need to try to focus on, given I tend to reach for my SQL skills (acquired over decades) first.
When I was only learning Rails, I thought that ORM is everything, ActiveRecord is the best thing since sliced bread and that it can solve all world’s problems.
Turns out, some of the business logic required somewhere over there is _so_ complicated that it’s not the issue of whether ActiveRecord can tackle a query needed in one statement, but an issue arises just -what- hellish SQL would be required to fetch it.
I love business world—it challenges us to tinker with data structures we never otherwise thought would be possible.
I’ll admit, this bugs me a bit too. More times than one I’ve written really nice looking instance methods in my models to calculate something about a particular object, like a total from a sum of line items or a pretty representation of a database string. Works all fine and dandy until you want to do something with the results of that method other than just display it in a view. What about sorting by it? Or filtering by it? You have to end up re-writing the function in your SQL finders, resulting in horrible, horrible duplication that will come back to bite you sooner or later.
The approach I’ve taken in the past, and find myself doing more often now-a-days is making this method live in the database as a column, and updating it when the record changes (in an after_save callback). When I create the migration to add the field, I also write in a little bit of Ruby code that iterates through all the existing records and populates the new field with the results of the method so old records can still have access.
Not sure if it’s the best way or not, but I don’t know of a better solution.
[...] On Beyond Zebra: When ActiveRecord Isn’t Enough [...]
[...] Previously on Agile Ajax: I wrote about a complicated problem that went beyond ActiveRecord. I resolved to add a database column with pre-generated results to make the search logic easier. Commenters questioned my reluctance to use SQL directly. [...]