Background

I was recently working on a Rails 2.3.5 application when I found myself stumped on some perplexing behavior that arose when using a named scope, the #destroy_all method and an after destroy callback fired by an ActiveRecord observer. Let’s begin with the models. There are Asset and User models. The assets table’s columns include the integer file_size and timestamp delete_after. There is a named scope, expired, which returns all assets for which the value of delete_after is in the past. Assets belong to users.

class Asset < ActiveRecord::Base
  belongs_to :user, :counter_cache => true
  named_scope :expired, lambda { {:conditions => ['delete_after < ?', Time.now]} }
end
# == Schema information
# ...
# user_id      :integer     not null
# file_size    :integer     default(0), not null
# delete_after :datetime
# ...

Users have many assets. The users table has a large integer column, current_disk_usage, which holds the denormalized sum of the file sizes of the assets belonging to each user. There is an instance method, #recalculate_disk_usage!, which triggers the User instance to update current_disk_usage be recalculating the sum of the user’s assets.

class User < ActiveRecord::Base
  has_many :assets

  def recalculate_disk_usage!
    update_attribute(:current_disk_usage, assets.sum(:file_size))
  end
end
# == Schema information
# ...
# current_disk_usage :integer(12)   default(0), not null
# ...

There is an AssetObserver class which triggers User#recalculate_disk_usage! any time an asset record is created, updated or destroyed. (The implementation show below is inefficient, and one workaround to the problem discussed in this article was to make the observer a bit gentler on the database. That version can be found at the bottom of the article.)

class AssetObserver < ActiveRecord::Observer
  def after_create(asset)
    asset.user.recalculate_disk_usage! unless asset.file_size.zero?
  end

  def after_update(asset)
    asset.user.recalculate_disk_usage! if asset.file_size_changed?
  end

  def after_destroy(asset)
    asset.user.recalculate_disk_usage!
  end
end

The problem

The application in question must periodically purge asset records which have expired (where delete_after is in the past). This is accomplished with a Rake task which sends #destroy_all to the Asset.expired named scope.

task :purge_expired => :environment do
  Asset.expired.destroy_all
end

And this is where the confusion set in. Whenever I manually created, updated or destroyed an asset the asset observer kicked in as expected and the user’s current disk usage was updated accordingly. Whenever I ran the purge_expired Rake task, the correct assets were destroyed, but the user’s current disk usage would get set to zero. If I manually created, updated or destroyed another asset the current disk usage would once again be set to the correct, up-to-date value. It was time to see what SQL was actually being run, so I opened up the log file.

DELETE FROM "assets" WHERE "id" = 2
SELECT sum("assets".file_size) AS sum_file_size FROM "assets" WHERE (("assets".delete_after < '2010-04-30 06:41:20.755388') AND ("assets".user_id = 5))
UPDATE "users" SET "updated_at" = '2010-04-30 06:41:21.189777', "current_disk_usage" = 0 WHERE "id" = 5

J’accuse! The DELETE statement on the first line removes the record from the assets table, and the UPDATE on the last line shows the user’s disk usage being set to zero. But what’s up with the SELECT in the middle? It’s calculating the sum of the file sizes of the user’s assets, but it’s limited the sum to those assets which have expired – look at the first half of the WHERE clause. For some reason the conditions imposed by the Asset.expired named scope are filtering into the call to user.assets.sum(:file_size). I took a look at the implementation of ActiveRecord::Base.destroy_all for a clue.

# File activerecord/lib/active_record/base.rb, line 876
def destroy_all(conditions = nil)
  find(:all, :conditions => conditions).each { |object| object.destroy }
end

Not really an obvious answer, but it helped to steer my thinking. Asset.expired returns an instance of ActiveRecord::NamedScope::Scope. Sending #destroy_all to that scope calls ActiveRecord::Base.find(:all) which returns an array of Asset instances. We then iterate over that array, sending #destroy to each asset. For each asset, after #destroy runs, AssetObserver#after_destroy(asset) is called which calls asset.user.recalculate_disk_usage! which in turn calls asset.user.assets.sum(:file_size).

A workaround

Okay, but I still didn’t know why the expired named scope conditions at the beginning of the process were showing up again when calling user.assets near the end of the process. However I had a hunch the call to ActiveRecord::Base.find had something to do with it, so to test this hunch I modified the Rake task, removing the call to destroy_all and instead iterating over the scope directly.

task :purge_expired => :environment do
  Asset.expired.each { |asset| asset.destroy }
end

So what happens when I run the Rake task now?

DELETE FROM "assets" WHERE "id" = 12
SELECT sum("assets".file_size) AS sum_file_size FROM "assets" WHERE ("assets".user_id = 5)
UPDATE "users" SET "updated_at" = '2010-04-30 18:13:41.050527', "current_disk_usage" = 1696416 WHERE "id" = 5

Et voilà! I don’t know what destroy_all’s use of ActiveRecord::Base.find does that leaks into later elements of the destruction process. Bypassing destroy_all by iterating over the named scope directly seems to fix the problem, and to be honest, while the original version with destroy_all looked more like idiomatic Rails, the use of each feels more like idiomatic Ruby, i.e., purer. I think this may be a bug, but I’d need to get deeper into the guts of Rails’ traversal across the association to be sure. I’ve only encountered this problem on Rails 2.3. I haven’t tested the Rails 3 release candidate yet.

A different workaround

But wait, there’s more. As I mentioned earlier, the original implementation of AssetObserver isn’t efficient. Here’s the problem: each time an asset record is destroyed, the database is issued both SELECT SUM and UPDATE statements. The latter is unavoidable, but is the former strictly necessary? ActiveRecord provides models with #increment! and #decrement! instance methods. They’d normally be used to increment or decrement a counter by one, but they both take optional second arguments which are the amount by which the counter (current_disk_usage) should be changed. When the observer’s callback is triggered, the observed object (in this case, an asset) is passed as an argument to the callback method. Since the callback is acting on a single object at the time, instead of telling the user to calculate a sum of all remaining asset’s file sizes, we could just increment or decrement the current disk usage by the file size of the asset being changed (or in the case of updates, by the difference between the new and old file sizes). Here’s a simple version of how that would look:

class AssetObserver < ActiveRecord::Observer
  def after_save(asset)
    if asset.file_size_changed?
      difference = asset.file_size_change.last - asset.file_size_change.first
      asset.user.increment!(:current_disk_usage, difference)
    end
  end

  def after_destroy(asset)
    asset.user.decrement!(:current_disk_usage, asset.file_size)
  end
end

By removing the SELECT SUM call we certainly won’t hit the database as hard for each asset being removed, so perhaps this is a more correct way of doing things.

As it happens, the application this problem comes from has very light database use as it is. With only a couple hundred users, each with a few dozen assets, usually not more than a couple dozen assets total expiring each week, and reliably long window of inactivity each night (asset expiration happens once per day), peppering the database with a few dozen SELECT SUM statements isn’t a big deal. The use of User#recalculate_disk_usage! also has the benefit of being self-healing: if, somehow, the current disk usage value for a user were to become corrupted, it would be automatically fixed the next time #recalculate_disk_usage! was called. If instead of simply triggering the user’s recalculating of disk usage, AssetObserver becomes responsible for the actual values being added or subtracted from current_disk_usage there may be implications for possible data corruption (although that should be pretty well mitigated by the asset activities being wrapped in a single database transaction).