Scaling Rails on Heroku

By: Kuba Fietkiewicz

People choose Ruby on Rails for their startups for a variety of reasons, not the least of which is the joy that it brings back to programming. Setting aside the arguments that Rails does not scale and that you should avoid it if you are building a high-traffic app, at YP, which is a top 30 Internet property, we have been using Rails for our flagship site for at least 5 years.

Engineers choose Heroku for many of the same reasons as they do Ruby. Heroku is every engineer’s favorite deployment platform, because it makes deployment of your app dead simple, removing much of the thinking required in setting up a scalable infrastructure.  Yet thinking is still required, as no platform, not even one as well engineered as Heroku, can overcome poor decision-making.

Suppose we have made the decision to deploy Rails on Heroku. How would one go about scaling a typical app there?  Since applications are complex and varied in their requirements, everyone’s favorite and unhelpful answer is “it depends.”  The fact is that you need to first understand where the problem is before you can begin to address it.  Ultimately, you will need to do a little bit of sleuthing before you can determine where to spend your time and money optimizing your infrastructure for scale.

Despite this, there are certainly a number of high-level architectural decisions that you can make immediately, which will make it easier for you to home in on problems and optimize your infrastructure.

A typical app has static assets such as images, style sheets, and videos; dynamic application code that accesses and acts on data; and the data itself in some data store. Each of these exerts different demands on your application and thus should be considered independently of the others. How big are the static assets?  Do the videos need to stream? How do we prevent buffering?  Does the application code have long running algorithms? Is the data constantly changing?  What is the ratio of reads to writes?  Is it transactional data or reporting data?  These are just some of the questions to consider.

Scaling problems are best addressed with an application architecture that supports scaling each architectural layer independently of the others through the separation of concerns.

How is this achieved?  Static assets should be deployed to infrastructure optimized to deliver them as fast as possible, namely Content Delivery Networks (CDNs). CDNs fulfill requests for assets from servers closest (or best suited) to the requesting client so as to reduce the latency from network congestion.  This offloads work from your dynamic application, which can instead focus on responding to user requests to act on data and building the response for presentation.  Finally, data and business logic access should be encapsulated behind an HTTP service layer that exposes a slowly changing contract to the front-end application and is responsible for providing the fastest possible access to business objects, regardless of back-end implementation.

The Setup

For the example app, I will make some initial assumptions so as to create a baseline benchmark. I will then move one lever at a time to show how it can affect application performance.  The app is very simple.  It has a /users/bench endpoint that performs a select for a random user on a table with 1MM rows, then displays that user’s information.

The App

$ rails new rails-bench --skip-test-unit --database=postgresql
$ cd rails-bench
$ rails g scaffold User name:string email:string username:string

config/database.yml

defaults: &defaults
 adapter: postgresql
 encoding: unicode
 pool: 5
 host: localhost
 port: 5432

config/routes.rb

get 'users/bench' => 'users#bench'

app/controllers/users_controller.rb

def bench
 @user = User.find_by(username: "user#{rand(0..1000000)}")
 render :show
end

Gemfile

gem "rails_12factor", group: :production
ruby "2.0.0"

Initial Heroku Setup

$ git init && git add . && git commit -m "init"
$ heroku create
Creating calm-age-1887... done, stack is cedar
http://calm-age-1887.herokuapp.com/ | git@heroku.com:calm-age-1887.git
Git remote heroku added

Database Setup

Given that Postgres is the db of choice on the Heroku platform, that’s what we’ll use.  The standard Dev database is inadequate for our test which will contain more than the allotted 10K rows in this tier:

$ heroku pg:info
 === HEROKU_POSTGRESQL_PINK_URL (DATABASE_URL)
 Plan:        Dev
 Status:      available
 Connections: 2
 PG Version:  9.3.1
 Created:     2013-11-30 23:20 UTC
 Data Size:   6.4 MB
 Tables:      0
 Rows:        0/10000 (In compliance)
 Fork/Follow: Unsupported
 Rollback:    Unsupported

So let’s upgrade it:

 $ heroku addons:add heroku-postgresql:standard-yanari
 Adding heroku-postgresql:standard-yanari on calm-age-1887... done, v6 ($50/mo)
 Attached as HEROKU_POSTGRESQL_NAVY_URL
 The database should be available in 3-5 minutes.
 ! The database will be empty. If upgrading, you can transfer
 ! data from another database with pgbackups:restore.
 Use `heroku pg:wait` to track status..
 Use `heroku addons:docs heroku-postgresql` to view documentation.

 $ heroku pg:promote HEROKU_POSTGRESQL_NAVY_URL
 Promoting HEROKU_POSTGRESQL_NAVY_URL to DATABASE_URL... done

Dataset

For the following set of tests we’ll use a dataset with 1MM users.  In order to seed the database with these users we’ll use the activerecord-import gem, and the following seeds script.  This reduces the time to load the dataset from 33min/MM down to 3min/MM.

Gemfile

gem "activerecord-import"

db/seeds.rb

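columns    = [:name, :email, :username]
options    = { validate: false }  # assumption: skip validations to speed up the import
save_slice = 100000
users      = []
(0..1000000).each do |index|
 users << [
   "first name, lastname #{index}",
   "name#{index}@mailinator.com", "user#{index}"
 ]
 # flush the accumulated rows every save_slice users
 if index % save_slice == 0
   User.import columns, users, options
   users = []
 end
end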
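The batching idea in the seeds script can also be sketched without a database. This is only an illustration: the method name `import_in_batches` is hypothetical, and the point where the real script would call `User.import` is marked with a comment. It uses `each_slice`, which yields the full batches and then whatever is left over, so no rows are dropped when the total is not a multiple of the batch size.

```ruby
# Hypothetical helper: groups generated rows into batches and records how
# many rows each batch holds instead of writing them to a database.
def import_in_batches(range, batch_size)
  batch_sizes = []
  range.each_slice(batch_size) do |indexes|
    rows = indexes.map do |i|
      ["first name, lastname #{i}", "name#{i}@mailinator.com", "user#{i}"]
    end
    # In the real seeds script this is where the bulk insert would happen,
    # e.g. User.import columns, rows, options
    batch_sizes << rows.size
  end
  batch_sizes
end

sizes = import_in_batches(0..1_000_000, 100_000)
puts "batches: #{sizes.size}, rows: #{sizes.sum}"
```

With the inclusive range 0..1000000 this gives 11 batches covering 1,000,001 rows: ten full batches and a final batch of one.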

Then we redeploy and migrate the database, and seed our data:

 $ git commit -am "added data import" && git push heroku master
 $ heroku run rake db:migrate
 Running `rake db:migrate` attached to terminal... up, run.9316
 ==  CreateUsers: migrating ====================================================
 -- create_table(:users)
 -> 0.0193s
 ==  CreateUsers: migrated (0.0227s) ===========================================

 $ heroku run rake db:seed
 Running `rake db:seed` attached to terminal... up, run.1153
 …
 

Benchmarks

For most of the benchmarks we will be using two tools: ApacheBench, Version 2.3 Revision: 1528965, and Jmeter v.2.10. For each test we will set the number of requests at 1000 with a concurrency at 100.  I will be using Jmeter for the purpose of gathering performance data on the application as a whole, as ApacheBench does not request any related/embedded static assets. [1][2]

Pulling The Levers: Tuning The Database

Impact: High

Tuning your database likely has the highest impact on the performance of your application, and its ability to scale.

Database Sizing

Sizing your database is an important consideration for the performance of your app.  Heroku recommends choosing a plan where your entire data set can fit into the Postgres in-memory cache as data served from cache can be 100-1000X faster than from disk.  [3]

Looking at our oversimplified 1MM user data in the database, we see that it only takes up 140MB of space and that it should easily fit into the Standard Yanari 400MB cache. [4]

 $ heroku pg:info
 === HEROKU_POSTGRESQL_NAVY_URL (DATABASE_URL)
 Plan:        Standard Yanari
 Status:      Available
 Data Size:   139.5 MB

Looking at the query that determines the cache hit ratio we see that 99% of data resides in the cache and so we have optimized our instance size to fit all of our data in memory:

 $ heroku pg:psql
 => SELECT
 ->   sum(heap_blks_read) as heap_read,
 ->   sum(heap_blks_hit)  as heap_hit,
 ->   sum(heap_blks_hit) / (sum(heap_blks_hit) + sum(heap_blks_read)) as ratio
 -> FROM
 ->   pg_statio_user_tables;
 heap_read | heap_hit |         ratio
 -----------+----------+------------------------
 133361 | 15432845 | 0.99143265867096966338
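As a quick check of the arithmetic behind that ratio, here is a small sketch using the two counters from the output above. The method name `cache_hit_ratio` is made up for illustration. Note the explicit float conversion: dividing two Ruby integers would truncate to 0.

```ruby
# Fraction of heap block requests served from cache rather than disk.
def cache_hit_ratio(heap_hit, heap_read)
  heap_hit.to_f / (heap_hit + heap_read)
end

ratio = cache_hit_ratio(15_432_845, 133_361)
puts format("%.4f", ratio)
```

This prints 0.9914, in line with the ratio column returned by the query.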

Our bench tests however do not show adequate performance even with the entire data set in cache:

 $ ab -n 1000 -c 100 http://host/users/bench
 Requests per second:    7.15 [#/sec] (mean)
 Time per request:       13987.977 [ms] (mean)

 $ java -jar ApacheJMeter.jar -n -t Rails\ Bench\ Plan.jmx
 1000 in 141.2s = 7.1/s Avg: 13602 Min:  2708 Max: 22793
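The two ApacheBench figures are tied together by the concurrency level: with 100 requests in flight at once, the mean time per request is the concurrency divided by the throughput. A minimal sketch of that relationship follows; the method name is hypothetical.

```ruby
# Mean time per request in milliseconds, given the number of concurrent
# requests and the measured throughput in requests per second.
def mean_time_per_request_ms(concurrency, requests_per_second)
  concurrency * 1000.0 / requests_per_second
end

puts mean_time_per_request_ms(100, 7.15).round
```

This prints 13986, which differs from the reported 13987.977 ms only because the throughput shown above is rounded to two decimals. It also explains why a query that takes a few hundred milliseconds on its own turns into a wait of roughly 14 seconds under this load.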

Data Indexes

The reason that our benchmark does not perform well even with the entire data set in cache, is that our test retrieves random rows from our entire data set.  This query plan shows that the entire users table must be scanned to find the correct row:

 $ heroku pg:psql
 => EXPLAIN ANALYZE SELECT * from users where username='user1234';
 QUERY PLAN
 --------------------------------------------------------------------------------------------------------
 Seq Scan on users  (cost=0.00..17777.00 rows=1 width=82) (actual time=261.886..370.853 rows=1 loops=1)
 Filter: ((username)::text = 'user1234'::text)
 Rows Removed by Filter: 1000000
 Total runtime: 370.989 ms
 (4 rows)

Simply adding an index to the right column in the database can drastically improve the performance of your queries:

$ rails g migration AddIndexToUser

db/migrate/<date>_add_index_to_user.rb

class AddIndexToUser < ActiveRecord::Migration
  def change
    add_index :users, :username
  end
end

$ git add . && git commit -m "add index to username"
$ git push heroku master
$ heroku run rake db:migrate
Running `rake db:migrate` attached to terminal... up, run.7636
Migrating to AddIndexToUser (20131201014140)
==  AddIndexToUser: migrating =================================================
-- add_index(:users, :username)
-> 26.3659s
==  AddIndexToUser: migrated (26.3661s) =======================================

Now if we run our query plan:

 => EXPLAIN ANALYZE SELECT * from users where username='user1';
 QUERY PLAN
 --------------------------------------------------------------------------------------------------------------------------------
 Index Scan using index_users_on_username on users  (cost=0.08..4.09 rows=1 width=82) (actual time=0.142..0.143 rows=1 loops=1)
 Index Cond: ((username)::text = 'user1'::text)
 Total runtime: 0.210 ms
 (3 rows)

We see a better than 1000X improvement in data access times (370.989 ms down to 0.210 ms).
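For intuition only, the difference between the two query plans can be imitated in plain Ruby. This is an analogy under stated assumptions, not how PostgreSQL works internally: an Array scan stands in for the sequential scan, and a Hash keyed by username stands in for the index (the index created by add_index is a B-tree by default, not a hash table). The data set is shrunk to 200,000 rows to keep the sketch quick, and the timings will vary by machine.

```ruby
require "benchmark"

users = (0...200_000).map { |i| { id: i, username: "user#{i}" } }

# "Sequential scan": inspect rows one by one until the username matches.
def seq_scan(rows, username)
  rows.find { |row| row[:username] == username }
end

# "Index": built once (like running the migration), then used for every lookup.
index = users.each_with_object({}) { |row, h| h[row[:username]] = row }

target = "user199999"
scan_time  = Benchmark.realtime { seq_scan(users, target) }
index_time = Benchmark.realtime { index[target] }

puts "scan:  #{(scan_time * 1000).round(3)} ms"
puts "index: #{(index_time * 1000).round(3)} ms"
```

Both lookups return the same row; only the amount of work needed to find it differs.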

And if we run our benchmarks we see a less dramatic, but still significant, roughly 12.5X improvement in request times for the ApacheBench benchmark, and a roughly 7X improvement in the Jmeter benchmark:

$ ab -n 1000 -c 100 http://host/users/bench
 Requests per second:    89.04 [#/sec] (mean)
 Time per request:       1123.054 [ms] (mean)

 $ java -jar ApacheJMeter.jar -n -t Rails\ Bench\ Plan.jmx
 Results = 1000 in  20.9s = 47.9/s Avg: 1556 Min: 372 Max:  2924
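The improvement factors can be recomputed directly from the throughput numbers reported before and after adding the index. A small sketch, with a hypothetical helper name:

```ruby
# Ratio of throughput after a change to throughput before it.
def speedup(before_rps, after_rps)
  (after_rps / before_rps).round(1)
end

puts "ApacheBench: #{speedup(7.15, 89.04)}X"
puts "Jmeter:      #{speedup(7.1, 47.9)}X"
```

This gives 12.5X for ApacheBench and 6.7X for Jmeter.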

The reason that you begin to see a difference in throughput between the ApacheBench and Jmeter benchmarks is that Jmeter is requesting the entire set of web content including the 2 other embedded files:

/assets/application.js:104kb
/assets/application.css:1kb

Previously this effect was not visible, as the request throughput was constrained by the database response times.

More Advanced Considerations (Not Implemented/Benchmarked)

Replication

One way to achieve increased read throughput on your data is via a master-slave replication setup.  This can be achieved on Heroku through the use of follower databases.  Follower databases are read-only copies of your master database, which are updated in near-real-time from transactions happening on the master. [5]

Creating a follower database is as easy as provisioning a new database with the --follow flag, which points to your current master:

$ heroku addons:add heroku-postgresql:ronin --follow <MASTER_DB_URL>

Followers are one way to have a hot standby ready for manual failover of your database.  Manual failover can be achieved simply by unfollowing your follower, and promoting it as your master:

$ heroku pg:unfollow <FOLLOWER_DB_URL>
$ heroku pg:promote <FOLLOWER_DB_URL>

Substitute <MASTER_DB_URL> and <FOLLOWER_DB_URL> with the names of your master and follower databases, respectively.
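To get extra read throughput out of followers, the application has to send its reads to them while keeping writes on the master. The sketch below shows one way that routing decision could look. Everything in it is hypothetical (the class name, the method names, and the placeholder URLs); it is not an API provided by Rails or Heroku, and it is not thread-safe as written.

```ruby
# Hypothetical router that picks a database for each kind of operation.
class ConnectionRouter
  def initialize(master, followers)
    @master = master
    @followers = followers
    @next = 0
  end

  # Writes always go to the master, since followers are read-only.
  def for_write
    @master
  end

  # Reads rotate across the followers, falling back to the master
  # when no follower is configured.
  def for_read
    return @master if @followers.empty?
    follower = @followers[@next % @followers.size]
    @next += 1
    follower
  end
end

router = ConnectionRouter.new("MASTER_DB_URL", ["FOLLOWER_1_URL", "FOLLOWER_2_URL"])
3.times { puts router.for_read }
puts router.for_write
```

Because followers are updated in near-real-time rather than synchronously, reads routed this way may occasionally see slightly stale data.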

HA Replication

High availability replication is available on all Premium and Enterprise plans, and is generally transparent to the application owner.  As it requires no owner input, other than purchasing the right plan, if this is something that is important to you, then make sure to purchase a premium plan and read about it here: [6]

https://devcenter.heroku.com/articles/heroku-postgres-ha

N+1 Queries

The N+1 problem comes from using naive ORM constructs to get at data within an association.  Using the example from Rails Guide on Active Record Querying [7]:

clients = Client.limit(10)
clients.each do |client| 
  puts client.address.postcode 
end

The above code executes 1 (to find 10 clients) + 10 (one per each client to load the address) = 11 queries in total. If we eagerly load the association using the includes() method, we can change the code to run only 2 queries instead of the original 11:

clients = Client.includes(:address).limit(10)
clients.each do |client|
  puts client.address.postcode
end

This results in only 2 queries being run:

SELECT * FROM clients LIMIT 10
 SELECT addresses.* FROM addresses
 WHERE (addresses.client_id IN (1,2,3,4,5,6,7,8,9,10))
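The query counts can be reproduced without a database. The following is a toy simulation, not Active Record: `CountingStore` is a hypothetical in-memory stand-in whose only job is to count how many "queries" each access pattern issues.

```ruby
# Toy data store that counts every "query" it is asked to run.
class CountingStore
  attr_reader :query_count

  def initialize
    @clients = (1..10).map { |id| { id: id } }
    @addresses = (1..10).map { |id| { client_id: id, postcode: "PC#{id}" } }
    @query_count = 0
  end

  def clients(limit)
    @query_count += 1
    @clients.first(limit)
  end

  def address_for(client_id)
    @query_count += 1
    @addresses.find { |a| a[:client_id] == client_id }
  end

  def addresses_for(client_ids)
    @query_count += 1
    @addresses.select { |a| client_ids.include?(a[:client_id]) }
  end
end

# Lazy loading: one query for the clients, then one more per client.
lazy = CountingStore.new
lazy.clients(10).each { |c| lazy.address_for(c[:id])[:postcode] }

# Eager loading: one query for the clients, one for all of their addresses.
eager = CountingStore.new
clients = eager.clients(10)
by_client = eager.addresses_for(clients.map { |c| c[:id] }).group_by { |a| a[:client_id] }
clients.each { |c| by_client[c[:id]].first[:postcode] }

puts "lazy:  #{lazy.query_count} queries"
puts "eager: #{eager.query_count} queries"
```

The lazy version issues 11 queries and the eager version 2, matching the counts described above.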

Database Conclusion

The majority of apps are constrained by their slowest component, their database.  Clearly tuning your database size, indexing the right columns, and optimizing your queries will have significant impact on the scalability and performance of your entire system.  This is the first place where a significant amount of effort should be invested.

Pulling The Levers: Static Assets

Impact: High

Heroku Static Asset Serving

One can enable the serving of static assets from a Rails app in Heroku by using the rails_12factor gem.

gem 'rails_12factor', group: :production

This isn’t enough. While in other environments, a reverse proxy such as nginx can and should be used to intercept and serve static content rather than sending those requests to the app layer, the Heroku routing mechanism precludes the need for nginx, and thus removes the ability for static content to be served in this way. [8] When you use Heroku, all asset requests get sent to the app layer to be served by your dynos. The idea is to use the right tool for the job, and this is not it.

The previous Jmeter test of 100 concurrent threads looped 10 times, with a 5 second ramp up and a limit of 4 concurrent embedded asset requests, shows that our throughput is quite low at 47.9 pages/sec:

$ java -jar ApacheJMeter.jar -n -t Rails\ Bench\ Plan.jmx
 Results = 1000 in  20.9s = 47.9/s Avg: 1556 Min: 372 Max: 2924

Static Assets Served From CDN

The best place to serve your static assets from is a CDN.  Using a Content Delivery Network optimizes the delivery of static assets on your site. It offloads all requests for these static assets from your web dynos, which in turn frees those dynos to handle more requests for dynamic content. [9]

We change our production.rb environment file to use CloudFront as the asset host:

config/environments/production.rb

config.action_controller.asset_host="http://d65asdf.cloudfront.net"
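The effect of this setting is that asset URLs in the rendered pages point at the CDN instead of at the app. The sketch below is a rough illustration of that rewriting, not Rails' actual implementation: the helper `asset_url` is written here from scratch, and it ignores the digest that Rails adds to asset filenames in production.

```ruby
# Rough illustration: prefix asset paths with the asset host when one is set.
def asset_url(path, asset_host = nil)
  return path if asset_host.nil?
  "#{asset_host.chomp('/')}#{path}"
end

puts asset_url("/assets/application.js")
puts asset_url("/assets/application.js", "http://d65asdf.cloudfront.net")
```

Without an asset host the browser requests the file from the app's own domain, and therefore from a dyno. With one, the request goes to the CDN, which fetches the file from the app only when it does not already have a copy.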

After redeploying, our same Jmeter test shows a dramatic improvement, now matching our ApacheBench test:

Results = 1000 in  10.0s = 99.9/s Avg: 525 Min: 106 Max:  1064

Static Assets Conclusion

The test clearly shows that offloading static assets to a CDN has significant positive impact on the performance of your application.  This would be far more pronounced if the static assets were of the number and size representative of today’s sites.  When your application is not busy serving static assets, it is freed up to be able to serve the maximum possible number of requests.

Pulling The Levers: Application Server

Heroku recommends running Rails on multi-process application servers such as Unicorn. [10] The theory is that the application server should take advantage of all available CPU cores. Given that a 1X dyno only has 1 core [11], the number of workers should not exceed 2 or 3.

Unicorn recommends setting the number of worker processes to at least the number of cores, but not much more than that.  More can be set to work around inefficiencies caused by occasional slow requests that are not CPU-intensive. Additionally, the combined memory footprint of the workers running Rails should not exceed the memory available on the machine. [12] Michael VanRooijen has previously shown the serious negative effect of too many Unicorn workers on a dyno. [13]

Following Heroku’s Unicorn configuration advice, we add the Unicorn gem to the Gemfile, a Unicorn configuration, a Procfile, and redeploy:

Gemfile

gem 'unicorn'

config/unicorn.rb:

 worker_processes 2 # amount of unicorn workers to spin up
 timeout 30         # restarts workers that hang for 30 seconds

 preload_app true

 before_fork do |server, worker|
   Signal.trap 'TERM' do
     puts 'Unicorn master intercepting TERM and sending myself QUIT instead'
     Process.kill 'QUIT', Process.pid
   end
   defined?(ActiveRecord::Base) and
     ActiveRecord::Base.connection.disconnect!
 end

 after_fork do |server, worker|
   Signal.trap 'TERM' do
     puts 'Unicorn worker intercepting TERM and doing nothing. Wait for master to send QUIT'
   end
   if defined?(ActiveRecord::Base)
     config = Rails.application.config.database_configuration[
       Rails.env
     ]
     config['reaping_frequency'] = ENV['DB_REAP_FREQ'] || 10 
     config['pool']            = ENV['DB_POOL'] || 5
     ActiveRecord::Base.establish_connection(config)
   end
 end

Procfile

 web: bundle exec unicorn -p $PORT -c ./config/unicorn.rb

With 2 workers our ApacheBench test shows an expected ~2X improvement:

 Requests per second:    208.93 [#/sec] (mean)
 Time per request:       478.622 [ms] (mean)

And our Jmeter test also shows significant gains in throughput, though not as pronounced:

Results =  1000 in 7.3s = 137.6/s Avg: 241 Min: 108 Max: 1053

Dyno Size

The standard Dyno size has 512MB of memory and 1 CPU core. [11]  With this sizing, it doesn’t make much sense to increase the number of workers above 2 or 3, especially if that number of workers stretches the Dyno’s memory limits.  To understand the memory utilization of your app, you can enable the log-runtime-metrics Labs feature, which provides insight into your app’s memory and CPU utilization by injecting those metrics into the log stream. [14]

 $ heroku labs:enable log-runtime-metrics
 $ heroku restart

The resulting log entries show that our app uses up a total of 136MB of memory with 3 workers.  Depending on the demands of the app, you may be able to get away with more worker processes, however given that we’re constrained by 1 CPU, this may not provide many dividends.
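As a back-of-the-envelope check, the memory ceiling on worker count can be estimated from these numbers. This is a sketch under stated assumptions: the helper name is hypothetical, the 0.8 headroom factor is an arbitrary safety margin, and the 136MB total is simply divided evenly across the 3 workers, which ignores the share used by the Unicorn master process.

```ruby
# Upper bound on worker count from memory alone. The headroom factor is an
# assumption that leaves room for the master process and for memory growth.
def max_workers_by_memory(dyno_mb, per_worker_mb, headroom = 0.8)
  ((dyno_mb * headroom) / per_worker_mb).floor
end

per_worker_mb = 136.0 / 3 # 136MB observed across 3 workers
puts max_workers_by_memory(512, per_worker_mb)
```

By this estimate memory alone would allow about 9 workers on a 512MB Dyno, so for this app the single CPU core is the tighter constraint.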

If we double the size of the Dyno, do we see a similar increase in throughput?

 $ heroku ps:resize web=2x

The ApacheBench benchmark improves slightly due to this change:

 Requests per second:    271.86 [#/sec] (mean)
 Time per request:       367.834 [ms] (mean)

However the Jmeter benchmark does not:

 Results = 1000 in 8.1s = 123.7/s Avg: 282 Min: 103 Max: 1278

It appears that this simple app is not bound by CPU. Increasing to 5 workers on a double-sized Dyno, the benchmarks show neither an increase in speed, nor an increase in throughput:

 Requests per second:    274.72 [#/sec] (mean)
 Time per request:       364.008 [ms] (mean) 
 Results =  1000 in  7.2s = 139.1/s Avg: 216 Min: 100 Max: 1179

If we instead double the number of Dynos with 2 workers on each, will we see a corresponding increase in throughput?

 $ heroku ps:resize web=1x
 $ heroku ps:scale web=2

This is not a significant change from our 1 Dyno with 2 workers (if anything, a slight degradation), and it is a significant degradation from the double-sized Dyno:

 Requests per second:    198.71 [#/sec] (mean)
 Time per request:       503.255 [ms] (mean)
 Results =  1000 in 8.2s = 122.6/s Avg: 251 Min: 106 Max: 1002

Application Server Conclusion

Tuning your application server requires a deep understanding of the characteristics of your application.  If your application’s behavior is that of short lived requests, then using and tuning an application server such as Unicorn can be very beneficial.  Under those circumstances, we were able to get a peak output of 274 requests/sec with 5 workers on a double sized Dyno.

Pulling The Levers: SOA Back End

Impact: High, Potentially Negative

When isolating your concerns, it is often advantageous to remove data access from the application and move it behind highly performing HTTP endpoints.  This pattern has several advantages.  It protects your application from frequently changing back-end implementation details, it offloads some of the work to other applications, and it allows different scaling decisions to be made for each isolated component.

Setup

To set up this scenario, we added a JSON endpoint /users/username.json?id=… to our app connected to our data store, and we created a new application that uses ActiveResource to call that endpoint, passing it our random username.

New App

 $ rails new rails-bench-soa --skip-test-unit

Gemfile

 gem 'activeresource'

config/routes.rb

 get 'users/bench' => 'users#bench'

app/controllers/users_controller.rb

 def bench
   @user = User.find(
     :first,
     :from => :username,
     :params => {:id => "user#{rand(0..1000000)}"}
   )
   render :show
 end

app/models/user.rb

 class User < ActiveResource::Base
   self.site = "http://apihost"
 end

The other files are the same as the previous app:

 config/environments/production.rb
 config/unicorn.rb
 Procfile
 $ git init && git add . && git commit -m "initial app"
 $ heroku create
 $ git push heroku master

Make sure to change your CloudFront origin server configuration to pull from your new Heroku domain.

Old App Changes

config/routes.rb

get 'users/username' => 'users#username'

app/controllers/users_controller.rb

 def username
   username = [User.find_by(username: params[:id])]
   render json: username
 end
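To make the exchange between the two apps concrete, here is a sketch of the request URL the front-end app ends up calling and of how the JSON array returned by the endpoint is reduced to a single user. It uses only the Ruby standard library rather than ActiveResource; the helper names are hypothetical and the sample response body is made up for illustration.

```ruby
require "json"
require "uri"

# Build the URL for the username lookup endpoint on the service app.
def username_lookup_url(site, username)
  uri = URI.parse(site)
  uri.path = "/users/username.json"
  uri.query = URI.encode_www_form(id: username)
  uri.to_s
end

# The endpoint renders a one-element array, so take the first entry.
def first_user(json_body)
  JSON.parse(json_body).first
end

puts username_lookup_url("http://apihost", "user42")

# Hypothetical response body; the field values are invented.
sample = '[{"id":43,"name":"first name, lastname 42","email":"name42@mailinator.com","username":"user42"}]'
puts first_user(sample)["username"]
```

Wrapping the record in an array on the server side is what lets the client ask for the :first element of the collection.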

Setting the Baseline

Running our ApacheBench benchmark against our new SOA endpoint, we see that Rails contributes sub-millisecond overhead and that we continue to be constrained by the database:

 Requests per second:    122.88 [#/sec] (mean)
 Time per request:       813.782 [ms] (mean)
 Completed 200 OK in 6ms (Views: 0.4ms | ActiveRecord: 4.4ms)
 Completed 200 OK in 4ms (Views: 0.6ms | ActiveRecord: 2.5ms)
 Completed 200 OK in 7ms (Views: 0.4ms | ActiveRecord: 6.0ms)

Testing the SOA Implementation

When we add network overhead by running our benchmarks against the new app, which uses the service endpoint instead of a direct database connection, we see a significant degradation in performance:

 Requests per second:    42.33 [#/sec] (mean)
 Time per request:       2362.617 [ms] (mean)

With similar results for Jmeter:

Results = 1000 in  25.0s = 40.0/s Avg: 1896 Min: 134 Max: 4314

With much of our app waiting for the network, we should be able to take advantage of more requesting Rails processes, so let’s double up on our web scale:

 $ heroku ps:scale web=2

While our benchmark’s throughput does not double, it does improve significantly:

 Requests per second:    74.60 [#/sec] (mean)
 Time per request:       1340.528 [ms] (mean)
 Results = 1000 in  15.6s = 64.2/s Avg: 962 Min: 129 Max: 4198

Increasing the number of dynos to 4 results in additional gains:

 Requests per second:    121.59 [#/sec] (mean)
 Time per request:       822.410 [ms] (mean)
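How close these gains come to linear scaling can be worked out from the ApacheBench throughput at 1, 2, and 4 dynos. A small sketch, with a hypothetical helper name:

```ruby
# Measured throughput as a fraction of perfect linear scaling from the
# single-dyno baseline.
def scaling_efficiency(base_rps, dynos, rps)
  (rps / (base_rps * dynos)).round(2)
end

puts "2 dynos: #{scaling_efficiency(42.33, 2, 74.60)}"
puts "4 dynos: #{scaling_efficiency(42.33, 4, 121.59)}"
```

This gives 0.88 at 2 dynos and 0.72 at 4 dynos, so each added dyno still helps, but with diminishing returns.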

SOA Conclusion

These benchmarks show that there is a very high overhead cost in implementing a service back end, one that will never be as fast as direct database access.  The cost may be the lesser of two evils, however, when you reach a scale that exceeds the capabilities of the database.  At that time the refactoring of your service back end can be made transparent to your front end, as you have provided a stable HTTP interface into that layer.  In the meantime, additional throughput can be had by horizontally scaling your application stack.

Though not illustrated in the above benchmarks, using Rails for your SOA back end can be like hitching an Airstream to your drag racer.  When the solution calls for speed, bringing the comforts of home may not be the wisest choice in that circumstance.  Although in this limited test we didn’t see a significant impact from Rails, rather one from the network, under heavy sustained load I would expect the performance of the service stack to suffer. Using lighter weight frameworks such as Sinatra can be an improvement, though by some benchmarks, not enough of one, and not always. [15]  In my experience this has been a successful next step. At YP we moved many of our back end services to Sinatra and in general received a 10X speed improvement in those cases.

Summary

This is by far NOT a comprehensive guide to scaling your application on Heroku.  A number of important topics were not covered including caching, monitoring, logging, offline processing, long queries, asynchronous processing, and the list goes on. The goal of this post is to provide you with the minimal amount of information that you need to begin the investigation into how to scale your app on Heroku.

What was shown is that tuning your database is the number one thing that you need to address when building your application.  This includes understanding the size of your dataset, understanding how your application queries your data, and understanding the most efficient ways to do so.

Moving static assets off your application server is essential to freeing that server to handle meaningful requests, and essential from a user experience standpoint.

Tuning your application server depends on the characteristics of your application and should be done with that understanding, and with the understanding of the limitations of the application server.

Implementing SOA may be counterproductive in the beginning, and will negatively affect the performance of your application. It will however be necessary at some point in the growth of your app as you look to find other ways to scale beyond the capabilities of your database.

This is just the tip of the iceberg of strategies that you will need to apply in your ongoing search to extract the maximum performance from your application. As you do this, you will need to move one lever at a time, and continuously benchmark your changes.

References

[1] http://httpd.apache.org/docs/2.2/programs/ab.html
[2] http://jmeter.apache.org
[3] https://devcenter.heroku.com/articles/heroku-postgres-plans#cache-size
[4] https://devcenter.heroku.com/articles/heroku-postgres-plans#standard-tier
[5] https://devcenter.heroku.com/articles/heroku-postgres-follower-databases
[6] https://devcenter.heroku.com/articles/heroku-postgres-ha
[7] http://guides.rubyonrails.org/active_record_querying.html#eager-loading-associations
[8] https://devcenter.heroku.com/articles/http-routing
[9] https://devcenter.heroku.com/articles/using-amazon-cloudfront-cdn-with-rails
[10] https://devcenter.heroku.com/articles/rails-unicorn
[11] https://devcenter.heroku.com/articles/dyno-size
[12] http://unicorn.bogomips.org/TUNING.html
[13] http://michaelvanrooijen.com/articles/2011/06/01-more-concurrency-on-a-single-heroku-dyno-with-the-new-celadon-cedar-stack/
[14] https://devcenter.heroku.com/articles/log-runtime-metrics
[15] http://www.techempower.com/benchmarks/#section=data-r7
