SUM, DISTINCT, and Eagerly-Loaded Associations
railsA common task in application is to list a collection of records along with a
summary such as the sum of a particular attribute across the collection. For
example, imagine we have Job model which has a numeric rate attribute. In a
jobs#index
action, we might list the jobs in the body of a table, and display
the sum of the rates in the table footer.
<table>
<tfoot>
<tr>
<td>…</td>
<td><%= number_to_currency @jobs.total_rate %></td>
</tr>
</tfoot>
<tbody>
<%= render @jobs %>
</tbody>
</table>
Job.total_rate
is easy enough to implement:
class Job < ApplicationRecord
def self.total_rate
sum :rate
end
end
Let’s seed the database and prove that this works.
class JobTest < ActiveSupport::TestCase
def test_total_rate_returns_sum_of_rates
2.times { Job.create! rate: 100 }
assert_equal 200, Job.total_rate
end
end
Running via Spring preloader in process 34676
Run options: --seed 49117
# Running:
.
Finished in 0.027461s, 36.4148 runs/s, 36.4148 assertions/s.
1 runs, 1 assertions, 0 failures, 0 errors, 0 skips
This test passes. Woo! Great effort, people. Chocolate milks all around. Let’s go home.
Or not…
Eager Loading and Row Duplication
What if there is a has_many
association on the Job model? For example, let’s
say our Jobs can be assigned to multiple Categories.
class Category < ApplicationRecord
has_and_belongs_to_many :jobs
end
class Job < ApplicationRecord
has_and_belongs_to_many :categories
end
If we are displaying the categories on jobs#index
, we may want to eagerly
load the categories association to ensure we don’t bombard the database with an
additional query for every category.
class JobsController < ApplicationController
def index
@jobs = Job.includes :categories
end
end
How would Job.total_rate
behave in this scenario?
categories = 2.times { |n| Category.create! name: "category #{n}" }
Job.create! do |job|
job.rate = 100
job.categories = categories
end
Job.create! do |job|
job.rate = 100
end
Job.includes(:categories).total_rate # => 300.0
Practically speaking, we know the total rate is 200.0 not 300.0. So what’s
going on? Well, here’s the SQL that’s being executed when we call
Job.total_rate
:
SELECT SUM("jobs"."rate") FROM "jobs"
LEFT OUTER JOIN "categories_jobs" ON "categories_jobs"."job_id" = "jobs"."id"
LEFT OUTER JOIN "categories" ON "categories"."id" = "categories_jobs"."category_id"
What’s happening is a row is returned for every existing Job-Category
combination. If a Job belongs to one Category, one row is returned; if a Job
belongs to two Categories, two rows are returned, etc. (Because we are doing a
LEFT OUTER JOIN
, if a Job belongs to zero Categories, we still get one row
for the Job itself.) Let’s look at a different query to see the actual result
set:
SELECT
jobs.id AS job_id,
category.id AS category_id,
jobs.rate AS rate
FROM jobs
LEFT OUTER JOIN categories_jobs ON categories_jobs.job_id = jobs.id
LEFT OUTER JOIN categories ON categories_jobs.category_id = categories.id
The result set looks like this:
job_id | category_id | rate |
1 | 1 | 100 |
1 | 2 | 100 |
2 | NULL | 100 |
So when we perform the SUM
the first Job is duplicated. We need
Job.total_rate
to account for this, but how? First, let’s write a test:
def test_total_rate_with_duplicate_rows
categories = 2.times.map { |n| Category.create! name: "category #{n}" }
Job.create! rate: 100, categories: categories
Job.create! rate: 100
assert_equal 200, Job.includes(:categories).total_rate
end
…and verify that the test, in fact, fails:
Running via Spring preloader in process 34719
Run options: --seed 64383
# Running:
.F
Failure:
JobTest#test_total_rate_with_duplicate_rows [/Users/john/Projects/summable/test/models/job_test.rb:13]:
Expected: 200
Actual: 300.0
bin/rails test test/models/job_test.rb:9
Finished in 0.080557s, 24.8272 runs/s, 24.8272 assertions/s.
2 runs, 2 assertions, 1 failures, 0 errors, 0 skips
Great! Now how do we make the test pass? Well, the goal is to calculate the
SUM
of the rates for each distinct job in the result set. In other words,
when considering the query that loads the Jobs, we want to calculate the SUM
for the Jobs whose IDs are in the original result set.
ActiveRecord::Base.pluck
seems like it might be useful here.
def self.total_rate
unscoped.where(id: distinct.ids).sum :rate
end
ActiveRecord::Base.ids
is a shortcut for pluck :id
. We call
ActiveRecord::Base.distinct
to return each of the matching IDs only once. Let’s rerun the tests, and see if
it works.
Running via Spring preloader in process 34924
Run options: --seed 55533
# Running:
..
Finished in 0.085292s, 23.4488 runs/s, 23.4488 assertions/s.
2 runs, 2 assertions, 0 failures, 0 errors, 0 skips
The tests pass, so, huzzah? Well, sort of.
Using Sub-selects
Let’s look in the logs and see the query(ies) that actually get executed.
SELECT DISTINCT "jobs"."id"
FROM "jobs"
LEFT OUTER JOIN "categories_jobs" ON "categories_jobs"."job_id" = "jobs"."id"
LEFT OUTER JOIN "categories" ON "categories"."id" = "categories_jobs"."category_id"
SELECT SUM("jobs"."rate") FROM "jobs" WHERE "jobs"."id" IN (2, 1)
Well, we can see why it works, but I have a nagging concern about the way this
works. It’s performing two separate queries. The first query looks up the Job
IDs and loads them into an Array. The second query performs the actual SUM
,
passing the Array of Job IDs in as a parameter. We only have two Jobs in the
database during the test, but what will it look like in production? What if we
have 100,000 Jobs? Well, it will perform the first query, creating an Array of
100,000 IDs, and then it will perform the second query, passing in a
100,000-element Array as a query parameter. Yikes.
The thing is, SQL is a clever language. You don’t have to perform two separate queries, and you don’t have to construct an arbitrarily long Array in order to perform this task, when you can just use a sub-select. The SQL for such a query might look something like this:
SELECT SUM(rate)
FROM jobs
WHERE id IN (SELECT DISTINCT id FROM jobs)
So what’s going on? ActiveRecord is able to generate sub-selects. Why isn’t it
doing it here? Well, pluck
is a special method; it executes immediately and
always returns an Array. We don’t actually want to use pluck
in this
situation. We want to use
ActiveRecord::Base.select.
def self.total_rate
unscoped.where(id: distinct.select(:id)).sum :rate
end
Let’s rerun the tests, and make sure it still works:
Running via Spring preloader in process 35770
Run options: --seed 48301
# Running:
..
Finished in 0.092239s, 21.6829 runs/s, 21.6829 assertions/s.
2 runs, 2 assertions, 0 failures, 0 errors, 0 skips
Looks good. Now let’s check the logs and see the query that gets executed.
SELECT SUM("jobs"."rate") FROM "jobs"
WHERE "jobs"."id" IN (SELECT DISTINCT "jobs"."id" FROM "jobs")
Bingo! There’s the sub-select. So does that mean we’re done? Well, not quite.
What happened to the JOIN
clauses?
ActiveRecord provides two different ways for joining additional tables in a
query:
ActiveRecord::Base.joins.
and
ActiveRecord::Base.includes.
Which method you choose depends on the reason you are joining the additional
tables. You use joins
when you just need to join the table in the query in
order to reference columns from that other table. You use includes
when you
want to eagerly load the referenced association when constructing the result
set, say to avoid an N+1 query problem.
The joins
method defaults to performing an INNER JOIN
. Rails 5 added
ActiveRecord::Base.left_outer_joins
(aliased as left_joins
) to perform a LEFT OUTER JOIN
instead. If you are
using Rails 4.2.x or earlier, you can still perform a LEFT OUTER JOIN
, but
you have to pass a SQL String to the joins
call.
# Rails 5 and later
Job.joins(:categories) # INNER JOIN
Job.left_outer_joins(:categories) # LEFT OUTER JOIN
# Rails 4.2.x and earlier
Job.joins(:categories) # INNER JOIN
Job.joins(<<SQL) # LEFT OUTER JOIN
LEFT OUTER JOIN categories_jobs ON categories_jobs.job_id = jobs.id
LEFT OUTER JOIN categories ON categories_jobs.category_id = categories.id
SQL
The includes
method always performs an OUTER JOIN
.
ActiveRecord omits the JOIN
clauses from the sub-select, because we created
them using includes
. We chose includes
because we wanted to eagerly load
the Categories when we loaded the Job. This makes sense for the goal of loading
and displaying a collection of Jobs, but there’s no reason to load the
Categories in Job.total_rate
because we aren’t instantiating the Jobs; we’re
just performing a SUM
. That was good enough to make the test pass, but will
that always work?
Eager Loading and Conditions
Let’s consider a new scenario. This time, instead of calculating the total rate for all Jobs, let’s calculate the total rate for Jobs that belong to a particular category. Here’s a test to demonstrate:
def test_total_rate_with_search_parameters
categories = 2.times.map { |n| Category.create! name: "category #{n}" }
Job.create! rate: 100, categories: categories
Job.create! rate: 100
assert_equal 100, Job.includes(:categories)
.where(categories: { name: 'category 1' })
.total_rate
end
In this case, we are filtering Jobs for those that belong to Categories with the name “category 1”. As only one Job belongs to that Category, we expect the total rate to be 100. Let’s run the test and see what happens.
Running via Spring preloader in process 36190
Run options: --seed 64894
# Running:
..E
Error:
JobTest#test_total_rate_with_search_parameters:
ActiveRecord::StatementInvalid: PG::UndefinedTable: ERROR: missing FROM-clause entry for table "categories"
LINE 1: ...IN (SELECT DISTINCT "jobs"."id" FROM "jobs" WHERE "categorie...
^
: SELECT SUM("jobs"."rate") FROM "jobs" WHERE "jobs"."id" IN (SELECT DISTINCT "jobs"."id" FROM "jobs" WHERE "categories"."name" = $1)
app/models/job.rb:7:in `total_rate'
test/models/job_test.rb:22:in `test_total_rate_with_search_parameters'
bin/rails test test/models/job_test.rb:16
Finished in 0.094740s, 31.6657 runs/s, 21.1104 assertions/s.
3 runs, 2 assertions, 0 failures, 1 errors, 0 skips
Pay particular attention to the query that gets executed:
SELECT SUM("jobs"."rate") FROM "jobs"
WHERE "jobs"."id" IN (SELECT DISTINCT "jobs"."id" FROM "jobs" WHERE "categories"."name" = 'category 1')
Recall that in the previous example, ActiveRecord has simplified the sub-select
by removing the JOIN
clauses. That wasn’t a problem before, but now, in this
scenario, we need the JOIN
clauses to be preserved within the sub-select to
ensure the categories table is available for our search.
Earlier we talked about joins
and left_outer_joins
. If we used those
instead of includes
, would the JOIN
clauses be preserved? Lets’ modify the
test and find out.
def test_total_rate_with_search_parameters
categories = 2.times.map { |n| Category.create! name: "category #{n}" }
Job.create! rate: 100, categories: categories
Job.create! rate: 100
assert_equal 100, Job.left_outer_joins(:categories)
.where(categories: { name: 'category 1' })
.total_rate
end
And here’s the test result:
Running via Spring preloader in process 37200
Run options: --seed 51677
# Running:
...
Finished in 0.115139s, 26.0555 runs/s, 26.0555 assertions/s.
3 runs, 3 assertions, 0 failures, 0 errors, 0 skips
Hey! It works. So is that enough? No. As mentioned before, the point of using
includes
is to eagerly load associations. If we called
Job.left_outer_joins(:categories)
from a controller action, the categories
table will be available in Job.total_rate
, but the Categories will not be
eagerly loaded. If we start displaying Categories, we will set off a flurry of
additional SQL queries to load the Categories one by one — the stereotypical
N+1 query.
Okay, so we really do want to use includes
. How can we get Job.total_rate
to work? Well, as it happens, an ActiveRecord::Relation has methods for
introspection into the parameters of its query. In this case, there’s an
accessor method named includes_values
; each time you call includes
the
associations you reference are added to includes_values
.
Job.includes(:categories).includes_values # => [:categories]
WARNING The includes_values
method is not mentioned in the Rails API
documentation, so it may be that it’s not meant for public use. I can’t promise
this will always work. We’re living dangerously.
What if we modified Job.total_rate
to use includes_values
and
left_outer_joins
?
def self.total_rate
subquery = distinct.select :id
if !subquery.includes_values.empty?
subquery = subquery.left_outer_joins subquery.includes_values
end
unscoped.where(id: subquery).sum :rate
end
We rollback the test to the previous version using includes
instead of
left_outer_joins
and rerun the tests.
Running via Spring preloader in process 37510
Run options: --seed 28323
# Running:
...
Finished in 0.097461s, 30.7815 runs/s, 30.7815 assertions/s.
3 runs, 3 assertions, 0 failures, 0 errors, 0 skips
It passed. To understand why, let’s look at the SQL that was generated:
SELECT SUM("jobs"."rate") FROM "jobs"
WHERE "jobs"."id" IN (
SELECT DISTINCT "jobs"."id" FROM "jobs"
LEFT OUTER JOIN "categories_jobs" ON "categories_jobs"."job_id" = "jobs"."id"
LEFT OUTER JOIN "categories" ON "categories"."id" = "categories_jobs"."category_id"
WHERE "categories"."name" = 'category 1'
)
Fantastic! So, we’re done right? Right? Sadly, no.
Eager Loading, Conditions, and Pagination
Even in a modest-sized application, there are times when you want to paginate a collection to keep response times and memory usage low. If our database has 100,000 Jobs, we don’t want to instantiate them all at once. But that doesn’t mean we don’t want to consider the entire result set in the footer. We may only be display 25 Jobs per page, but we probably want to know the total rate for all the matching Jobs.
The simplest way to paginate results is by adding LIMIT
and OFFSET
clauses
to our query. (Often this is not the most efficient
way, but it
is the most common, and so this is the approach I will be addressing below.)
The kaminari gem is my go-to gem for
adding pagination to a Rails application.
2.times { Job.create! }
Job.count # => 2
Job.page(1).per(1).count # => 1
So our goal is to ensure that Job.total_rate
performs the SUM
across all
matching jobs regardless of what page we are looking at. Let’s write a test to
see what happens when we use pagination.
def test_total_rate_with_joins_and_pagintation
categories = 2.times.map { |n| Category.create! name: "category #{n}" }
Job.create! rate: 100, categories: categories
Job.create! rate: 100, categories: categories
assert_equal 200, Job.includes(:categories)
.where(categories: { name: 'category 1' })
.page(1).per(1)
.total_rate
end
We use per(1)
to restrict our query to one Job per page, and page(1)
to
return the first page of results.
Running via Spring preloader in process 37954
Run options: --seed 43188
# Running:
...F
Failure:
JobTest#test_total_rate_with_joins_and_pagination [/Users/john/Projects/summable/test/models/job_test.rb:29]:
Expected: 200
Actual: 100.0
bin/rails test test/models/job_test.rb:25
Finished in 0.125655s, 31.8333 runs/s, 31.8333 assertions/s.
4 runs, 4 assertions, 1 failures, 0 errors, 0 skips
Curses! As feared, Job.total_rate
looks at the current page of results rather
than the entire result set. Let’s look at the SQL to see what’s going on.
SELECT SUM("jobs"."rate") FROM "jobs"
WHERE "jobs"."id" IN (
SELECT DISTINCT "jobs"."id" FROM "jobs"
LEFT OUTER JOIN "categories_jobs" ON "categories_jobs"."job_id" = "jobs"."id"
LEFT OUTER JOIN "categories" ON "categories"."id" = "categories_jobs"."category_id"
WHERE "categories"."name" = 'category 1'
LIMIT 1
OFFSET 1
)
Those LIMIT
and OFFSET
clauses are limiting the SUM
. What’s a developer
to do?
Well,
ActiveRecord::Base.except
can be very useful here. The except
method allows us to skip certain types of
query conditions. In this case, we want to drop the LIMIT
and OFFSET
conditions. Let’s rewrite Job.total_rate
one more time.
def self.total_rate
summable.sum :rate
end
def self.summable
subquery = except :limit, :offset
if !subquery.includes_values.empty?
subquery = subquery.left_outer_joins subquery.includes_values
end
unscoped.where id: subquery.distinct.select(:id)
end
I’m now using except
to drop the LIMIT
and OFFSET
clauses. While I’m at
it, I’m also extracting a new method, Job.summable
. Job.total_rate
was
becoming a little hairy for my tastes. I also moved distinct.select(:id)
because I felt the result was a bit more intuitive.
This is the moment of truth. Let’s run the tests.
Running via Spring preloader in process 41410
Run options: --seed 2220
# Running:
....
Finished in 0.121656s, 32.8797 runs/s, 32.8797 assertions/s.
4 runs, 4 assertions, 0 failures, 0 errors, 0 skips
Hooplah? Yes, hooplah. “Give me a milk, Lou. Chocolate.”
Summary
In closing, let’s look at all the code together.
# app/controllers/jobs_controller.rb
class JobsController < ApplicationController
def index
@jobs = Job.includes(:categories)
.page(params[:page]).per(params[:per])
end
end
# app/models/category.rb
class Category < ApplicationRecord
has_and_belongs_to_many :jobs
end
# app/models/job.rb
class Job < ApplicationRecord
has_and_belongs_to_many :categories
def self.total_rate
summable.sum :rate
end
def self.summable
subquery = except :limit, :offset
if !subquery.includes_values.empty?
subquery = subquery.left_outer_joins subquery.includes_values
end
unscoped.where id: subquery.distinct.select(:id)
end
end
# test/modesl/job_test.rb
require 'test_helper'
class JobTest < ActiveSupport::TestCase
def test_total_rate_returns_sum_of_rates
2.times { Job.create! rate: 100 }
assert_equal 200, Job.total_rate
end
def test_total_rate_with_duplicate_rows
categories = 2.times.map { |n| Category.create! name: "category #{n}" }
Job.create! rate: 100, categories: categories
Job.create! rate: 100
assert_equal 200, Job.includes(:categories).total_rate
end
def test_total_rate_with_search_parameters
categories = 2.times.map { |n| Category.create! name: "category #{n}" }
Job.create! rate: 100, categories: categories
Job.create! rate: 100
assert_equal 100, Job.includes(:categories)
.where(categories: { name: 'category 1' })
.total_rate
end
def test_total_rate_with_joins_and_pagintation
categories = 2.times.map { |n| Category.create! name: "category #{n}" }
Job.create! rate: 100, categories: categories
Job.create! rate: 100, categories: categories
assert_equal 200, Job.includes(:categories)
.where(categories: { name: 'category 1' })
.page(1).per(1)
.total_rate
end
end