Quickly list missing foreign key indexes

Run this code in a rails console to list foreign keys which aren't indexed.

c = ActiveRecord::Base.connection
c.tables.collect do |t|  
  columns = c.columns(t).collect(&:name).select {|x| x.ends_with?("_id" || x.ends_with("_type"))}
  indexed_columns = c.indexes(t).collect(&:columns).flatten.uniq
  unindexed = columns - indexed_columns
  unless unindexed.empty?
    puts "#{t}: #{unindexed.join(", ")}"
  end
end

This list will look something like this:

attachments: parent_id, asset_id
domain_names: organisation_id
event_memberships: user_id, event_id
events: editor_id
group_actions: user_id, group_id
groups: user_id
icons: parent_id
invitations: sender_id
legacy_actions: item_upon_id
news_items: author_id
organisations: midas_id
pages: author_id
pending_event_memberships: invitation_id, event_id
resources: user_id, resourceable_id
subscriptions: subscribable_id, user_id
taggings: tag_id, taggable_id, user_id

For each column in the list, ask yourself why you don't need an index.

Update: Andrew Coleman has added output in migration format. If you want to play around with it further, here's the original code on gist.

Using indexes in rails: Choosing additional indexes

The first part in this series of posts looked at adding indexes to foreign keys, to improve the performance when navigating rails associations. But many queries involve data other than just foreign keys. With the judicious use of indexes, we can improve these too.

Let's take the conversations table used in the first article, and add a column to hold the language, and some timestamps. Here's the full schema:

create_table conversations do |table|
  table.string   :subject, :null => false
  table.integer  :user_id, :null => false
  table.integer  :subject_id
  table.string   :subject_type
  table.string   :language_code
  table.datetime :created_at
  table.datetime :updated_at
end

We want to split conversations by their languages, so we'll add a named_scope to the Conversation class:

class Conversation
  belongs_to :user
  belongs_to :subject, :polymorphic => true
  
  named_scope :in_language, lambda {|language| 
    { :conditions => {:language_code => language}}
  }
end

Using Conversation.in_language 'en' will now get us all conversations in English. Like we did for foreign keys, we can see how long the query takes and read the explain plan.

mysql> SELECT * FROM conversations WHERE language_code = 'en';
90791 rows in set (3.94 sec)

mysql> EXPLAIN SELECT * FROM conversations WHERE language_code = 'en';
+-------------+------+---------------+---------+-------+---------+-------------+
| select_type | type | key           | key_len | ref   | rows    | Extra       |
+-------------+------+---------------+---------+-------+---------+-------------+
| SIMPLE      | ref  | NULL          | NULL    | NULL  | 1000111 | Using where | 
+-------------+------+---------------+---------+-------+---------+-------------+
1 row in set (0.00 sec)

Adding an index to the language_code column should improve the query performance, so let's do that and see the effect on our query:

mysql> SELECT * FROM conversations WHERE language_code = 'en';
90791 rows in set (3.02 sec)

mysql> EXPLAIN SELECT * FROM conversations WHERE language_code = 'en';
+-------------+------+---------------+---------+-------+---------+-------------+
| select_type | type | key           | key_len | ref   | rows    | Extra       |
+-------------+------+---------------+---------+-------+---------+-------------+
| SIMPLE      | ref  | lang_code_ix  | 3       | const |   98345 | Using where | 
+-------------+------+---------------+---------+-------+---------+-------------+
1 row in set (0.00 sec)

So the query now uses the index, and the time taken has gone from almost 4 seconds to just over 3. That's not nearly as big a performance gain as before, but why? The answer is in the number of rows returned: 90791. The index helps the database find the relevant rows quickly. However, it still has to read those rows, and reading over 90,000 rows will always take a significant amount of time.

In a real app we're unlikely to want to read all these rows at once, so let's do another quick comparison limiting the query to the first 100 rows:

Without the index:  

mysql> SELECT * FROM conversations WHERE language_code = 'en' LIMIT 100;
100 rows in set (1.32 sec)

And with the index:

mysql> SELECT * FROM conversations WHERE language_code = 'en' LIMIT 100;
100 rows in set (0.01 sec)

Much better.

Choosing between indexes

We've seen that by using an index and limiting the number of results we can quickly get the 'first' 100 English conversations. But in this case 'first' doesn't really mean anything. When no order clause is specified, MySQL may appear to order its results by id, but this is just a coincidence and shouldn't be relied on. Let's instead order our results by created_at to get the 100 most recent conversations.

mysql> SELECT * FROM conversations WHERE language_code = 'en' ORDER BY created_at DESC;
100 rows in set (4.42 sec)

mysql> EXPLAIN SELECT * FROM conversations WHERE language_code = 'en' ORDER BY created_at DESC;
+-------------+------+---------------+---------+-------+---------+-----------------------------+
| select_type | type | key           | key_len | ref   | rows    | Extra                       |
+-------------+------+---------------+---------+-------+---------+-----------------------------+
| SIMPLE      | ref  | lang_code_ix  | 3       | const |   98345 | Using where; using filesort | 
+-------------+------+---------------+---------+-------+---------+-----------------------------+
1 row in set (0.00 sec)

Even though this query uses our index and only returns 100 rows, it has still taken almost 4.5 seconds! The reason for this terrible performance is hinted in the extra information in the explain plan: using filesort. The database is reading all rows that match the condition (all 90791 of them), then using a filesort to order them before returning the first 100.

If we add an index on created_at and do the query again we get:

mysql> SELECT * FROM conversations WHERE language_code = 'en' ORDER BY created_at DESC;
100 rows in set (4.39 sec)

mysql> EXPLAIN SELECT * FROM conversations WHERE language_code = 'en' ORDER BY created_at DESC;
+-------------+------+---------------+---------+-------+---------+-----------------------------+
| select_type | type | key           | key_len | ref   | rows    | Extra                       |
+-------------+------+---------------+---------+-------+---------+-----------------------------+
| SIMPLE      | ref  | lang_code_ix  | 3       | const |   98345 | Using where; using filesort | 
+-------------+------+---------------+---------+-------+---------+-----------------------------+
1 row in set (0.00 sec)

It's pretty much exactly the same - still almost 4.5 seconds. This is because MySQL can only use one index per table in a query. It has to choose between the index on language_code and the one on created_at, and in this case chooses the language code index. We can force it to use our other index for comparison:

mysql> SELECT * FROM conversations 
       USE INDEX(created_ix) WHERE language_code = 'en' 
       ORDER BY created_at DESC LIMIT 100;
100 rows in set (0.02 sec)

mysql> EXPLAIN SELECT * FROM conversations 
       USE INDEX(created_ix) WHERE language_code = 'en' 
       ORDER BY created_at DESC LIMIT 100;
+-------------+------+---------------+---------+-------+---------+-----------------------------+
| select_type | type | key           | key_len | ref   | rows    | Extra                       |
+-------------+------+---------------+---------+-------+---------+-----------------------------+
| SIMPLE      | ref  | created_ix    | 8       | const | 9903411 | Using where                 | 
+-------------+------+---------------+---------+-------+---------+-----------------------------+
1 row in set (0.00 sec)

Using a trick stolen from Pratik Naik (in the comments of his article) we can force the use of a particular index in rails with a special named scope, and perform our query:

Conversation.named_scope :use_index, lambda {|index| 
  {:from => "#{quoted_table_name} USE INDEX(#{index})"}
}

Conversation.in_language('en').use_index('created_ix').all(:order => 'created_at desc')

But there is also another way - using indexing multiple columns.

Using compound indexes

A compound index indexes across two or more columns. When defining a compound index, the order of the columns is significant, as the database reduces the set of candidate rows by comparing the columns in turn. So an index created with add_index :conversations, [language_code, created_at] will compare created_at first, then language_code.

Because of this, we need to take some care in choosing the order of our columns. In general, the rule is to specify the most selective column first. That is, the column with the most unique values. So for our query, we'll add the following:

add_index :conversations, [created_at, language_code]

If we explain the query without forcing the index we find it is still efficient:

mysql> EXPLAIN SELECT * FROM conversations 
       WHERE language_code = 'en' 
       ORDER BY created_at DESC LIMIT 100;
+-------------+------+----------------+---------+-------+---------+-----------------------------+
| select_type | type | key            | key_len | ref   | rows    | Extra                       |
+-------------+------+----------------+---------+-------+---------+-----------------------------+
| SIMPLE      | ref  | lang_and_ca_ix |      48 | const |  640231 | Using where                 | 
+-------------+------+----------------+---------+-------+---------+-----------------------------+
1 row in set (0.00 sec)

A technique for choosing index column order

Sometimes it's hard to know which order your columns should be in an index, but there's an easy way to get a rough idea. Rewrite the query, removing all conditions, and selecting count(distinct column_to_index) for each column. So for our query, we'd do the following:

mysql> SELECT count(distinct language_code), count(distinct created_at)
       FROM conversations;
+-------------------------------+----------------------------+
| count(distinct language_code) | count(distinct created_at) |
+-------------------------------+----------------------------+
|                            21 |                     584089 | 
+-------------------------------+----------------------------+
1 row in set (1.90 sec)

From this it's clear that there are more distinct created_at values, so we probably want to index this column first. Note though that I said probably. When deciding on indexes, there are no hard and fast rules. Instead, we need to measure and analyse the queries used in our particular app, to ensure we are using the best indexes.

The next (and last) article in the series will go through some more advanced techniques, including when not to add an index, and how to spot redundant indexes.

Using indexes in rails: Index your associations

Many rails developers are great at building applications but have limited experience in database design. As a consequence, projects often have half-baked indexing strategies, and as a result suffer bad performance.

To try and improve this I've planned a series of posts on indexes, targetted at rails developers. In this first post I'll introduce indexes and how to index your associations, then I'll write about choosing additional indexes to improve query performance, and finally how to avoid redundant and duplicate indexes.

A brief overview of database indexes

Wikipedia states that 'a database index is a data structure that improves the speed of operations on a database table'. Unfortunately, this improvement comes at a cost.

For every index on a table, there is a penalty both when inserting and updating rows. Indexes also take up space on disk and in memory, which can affect the efficiency of queries. Finally, having too many indexes can cause databases to choose between them poorly, actually harming performance rather than improving it.

So while indexing is important, we shouldn't just throw indexes at our slow queries: we need to choose carefully how to index our data.

Indexing simple associations

By far the most common performance problem I've encountered in rails projects is a lack of indexes on foreign keys. There's no real excuse for this - not indexing foreign keys can cripple your app.

Take the following schema:

create_table users do |table|
  table.string :login
end

create_table conversations do |table|
  table.string  :subject, :null => false
  table.integer :user_id, :null => false
end

We can use this to map a one-to-many relationship between users and conversations, where user_id as the foreign key.

Here are the models to do that:

class User < ActiveRecord::Base
  has_many :conversations
end

class Conversation < ActiveRecord::Base
  belongs_to :user
end

With these models, to find all conversations for a particular user we'd use user.conversations, which in turns uses sql like this:

SELECT * FROM conversations WHERE user_id = 41;

I can run this query on a test database which I've randomly populated with 1,000,000 rows, to see how long it takes. Note, I've cut out the actual results as they are unimportant:

mysql> SELECT * FROM conversations WHERE user_id = 41;
12 rows in set (1.42 sec)

mysql> EXPLAIN SELECT * FROM conversations WHERE user_id = 41;
+-------------+------+---------------+---------+-------+---------+-------------+
| select_type | type | key           | key_len | ref   | rows    | Extra       |
+-------------+------+---------------+---------+-------+---------+-------------+
| SIMPLE      | ALL  | NULL          | NULL    | NULL  | 1001111 | Using where | 
+-------------+------+---------------+---------+-------+---------+-------------+
1 row in set (0.00 sec)

Although the query is simple, it took 1.42 seconds. The key column show the key or index that MySQL decided to use, in this case NULL as there are no indexes. The rows column is also relevant. It shows that MySQL will need to look at around 1,000,000 rows; that's a lot of data being loaded and compared.

What a difference just an index makes

If we then add an index on user_id:

add_index :conversations, :user_id, :name => 'user_id_ix'

And do the same select:

mysql> SELECT * FROM conversations WHERE user_id = 41;
12 rows in set (0.01 sec)

mysql> EXPLAIN SELECT * FROM conversations WHERE user_id = 41;
+-------------+------+---------------+---------+-------+---------+-------------+
| select_type | type | key           | key_len | ref   | rows    | Extra       |
+-------------+------+---------------+---------+-------+---------+-------------+
| SIMPLE      | ref  | used_id_ix    | 5       | const |  108    | Using where | 
+-------------+------+---------------+---------+-------+---------+-------------+
1 row in set (0.00 sec)

The difference is remarkable. From over 1.4 seconds to about 1 hundredth. Unless you have a cast-iron reason not to, index your foreign keys.

Indexing polymorphic associations

So for simple associations, we can add an index on the foreign_key column. For polymorphic associations the foreign key is made up of two columns, one for the id and one for the type. Let's add another association to our models to illustrate this.

add_column :conversations, :subject_id, :integer
add_column :conversations, :subject_type, :string

create_table :artworks do |table|
  table.string :title
end

class Artwork < ActiveRecord::Base
  has_one :conversation, :as => :subject
end

class Conversation < ActiveRecord::Base
  belongs_to :subject, :polymorphic => true
end

Here we've added an association between Artwork and Conversation, where an artwork can be the subject of a conversation. From an artwork, we can find the related conversation (if any) with artwork.conversation which will use the following SQL:

SELECT * FROM conversations WHERE subject_id = 196 and subject_type = 'Artwork';

Again the query takes around 1.4 seconds without any indexes. Now though, we have a choice on what to index. We can index either subject_type on its own, subject_id on its own, or both together.

Let's try each in turn, and measure the performance.

First, an index on just subject_type:

  
mysql> SELECT * FROM conversations WHERE subject_id = 196 and subject_type = 'Artwork';
12 rows in set (0.31 sec)

mysql> EXPLAIN SELECT * FROM conversations WHERE subject_id = 196 and subject_type = 'Artwork'
+-------------+------+---------------+---------+-------+---------+-------------+
| select_type | type | key           | key_len | ref   | rows    | Extra       |
+-------------+------+---------------+---------+-------+---------+-------------+
| SIMPLE      | ref  | sub_type_ix   | 5       | const | 89511   | Using where | 
+-------------+------+---------------+---------+-------+---------+-------------+
1 row in set (0.00 sec)

An index on just subject_id:

  
mysql> SELECT * FROM conversations WHERE subject_id = 196 and subject_type = 'Artwork';
12 rows in set (0.01 sec)

mysql> EXPLAIN SELECT * FROM conversations WHERE subject_id = 196 and subject_type = 'Artwork'
+-------------+------+---------------+---------+-------+---------+-------------+
| select_type | type | key           | key_len | ref   | rows    | Extra       |
+-------------+------+---------------+---------+-------+---------+-------------+
| SIMPLE      | ref  | sub_id_ix     | 5       | const | 204     | Using where | 
+-------------+------+---------------+---------+-------+---------+-------------+
1 row in set (0.00 sec)

An index on subject_id, subject_type:

mysql> SELECT * FROM conversations WHERE subject_id = 196 and subject_type = 'Artwork';
12 rows in set (0.01 sec)

mysql> EXPLAIN SELECT * FROM conversations WHERE subject_id = 196 and subject_type = 'Artwork'
+-------------+------+--------------------+---------+-------+---------+-------------+
| select_type | type | key                | key_len | ref   | rows    | Extra       |
+-------------+------+--------------------+---------+-------+---------+-------------+
| SIMPLE      | ref  | sub_id_and_type_ix | 5       | const | 5       | Using where | 
+-------------+------+--------------------+---------+-------+---------+-------------+
1 row in set (0.00 sec)

So subject_type compared ~90,000 rows in 0.31 seconds, subject_id compared ~200 rows in 0.01 seconds and subject_id, subject_type compared 4 rows also in 0.01 seconds. We should add an index to subject_id, subject_type as so:

add_index :conversations, [:subject_id, :subject_type]

Wrapping up

This should give a basic overview of indexes and the performance improvements they can give. Hopefully I've shown that foreign_keys should always be indexed, and how to index them. The next article (which I hope to publish later this week) will explain more about how to reason about indexes, and how to identify additional indexes (beyond those on foreign keys) to add.