Improving Seed Fu

August 18th, 2009

Seed Fu is one of the most common bootstrapping solutions for Rails. Bootstrapping is a technique for storing initial data with your application’s code. Seed Fu lets you keep dedicated fixture files for development and production, so you can load less data in development mode or create default test users.

Bootstrapping a new developer database makes a great example. After building their database, a new developer may need to create a user:

$ script/runner "User.create(:email => 'me@mydomain.com', :password => 'apass', :password_confirmation => 'apass')"

Once you need to set up several users with settings and relationships, this quickly becomes difficult to document. Seed Fu has you build fixture files in db/fixtures/ or db/fixtures/#{RAILS_ENV} that look like this:

User.seed(:email) do |s|
  s.email    = 'me@mydomain.com'
  s.password = 'apass'
  s.password_confirmation = s.password
end

It also makes it easy for a new developer (or a production deployment) to bootstrap:

$ rake db:seed

The argument to seed(:email) defines the columns checked before writing the row. If there is already a row with the email address me@mydomain.com, Seed Fu will update that row instead of inserting a new row. This lets you run seed to update a database that already has content.
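
The update-or-insert behaviour is easier to see in a toy version. This is only a sketch against an in-memory array, not Seed Fu's code: TinyTable and its seed method are hypothetical names made up for the illustration, and the real plugin goes through ActiveRecord.

```ruby
# Hypothetical in-memory stand-in for a table. NOT Seed Fu's implementation.
class TinyTable
  attr_reader :rows

  def initialize
    @rows = []
  end

  # Update the row whose constraint columns match data, or insert a new one.
  def seed(constraints, data)
    existing = @rows.find { |row| constraints.all? { |c| row[c] == data[c] } }
    if existing
      existing.merge!(data)
      :updated
    else
      @rows << data.dup
      :inserted
    end
  end
end

table  = TinyTable.new
first  = table.seed([:email], { :email => 'me@mydomain.com', :password => 'apass' })
second = table.seed([:email], { :email => 'me@mydomain.com', :password => 'newpass' })
```

The second call matches on :email, so the table still holds one row and only the password changes.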

The bad news is Seed Fu was terribly slow on large datafiles and consumed RAM without freeing it, which means it never completed seeds of many large fixtures. The good news is I’ve got yer fix right here:

http://github.com/mixonic/seed-fu/tree

It’s waiting on the maintainer proper to merge upstream (though I haven’t heard back yet). Let’s see what changed.

Go Faster

On my laptop (2GHz Core 2 Duo, 7200 rpm hard drive, 2GB RAM):

real    114m1.482s
user    75m53.490s
sys     6m35.414s

And on a production server:

real    49m51.865s
user    27m4.381s
sys     1m6.247s

Those times are for importing 1,223,431 rows into a truncated database. Measured against user CPU time, that is 269 seeds a second on my laptop and 753 seeds a second on production (179 and 409 a second by wall-clock time). Seed Fu is still checking for existing records and using ActiveRecord to add seeds. So what changed?

The biggest change is dropping ActiveRecord validations. Validations are slow monsters. The next logical step would be to stop using ActiveRecord altogether, or at least toy with disabling callbacks, but that feels one step too far. Disabling validations means keeping valid data in your seeds becomes your responsibility. It’s a trade-off, but worth it.

Two smaller and 100% backwards compatible speed-ups are in this commit. The first walks the short constraints array instead of the longer data array when finding limiting conditions:

     def condition_hash
-      @data.reject{|a,v| !@constraints.include?(a)}
+      @constraints.inject({}) {|a,c| a[c] = @data[c]; a }
     end
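
To see the two versions side by side outside the plugin, here is a small standalone comparison. The data and constraints values are made up for the example. One edge case is worth knowing: if a constraint column is absent from the data, the reject version leaves it out while the inject version adds it with a nil value.

```ruby
data        = { :email => 'me@mydomain.com', :password => 'apass', :password_confirmation => 'apass' }
constraints = [:email]

# Old version: scan every attribute and keep the ones named in constraints.
by_reject = data.reject { |attr, _value| !constraints.include?(attr) }

# New version: walk the (short) constraints array and look each column up.
by_inject = constraints.inject({}) { |acc, c| acc[c] = data[c]; acc }

# Edge case: the constraint column is missing from the data.
sparse        = { :password => 'apass' }
sparse_reject = sparse.reject { |attr, _value| !constraints.include?(attr) }
sparse_inject = constraints.inject({}) { |acc, c| acc[c] = sparse[c]; acc }
```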

And the second avoids hitting method_missing after the first call:

-    def set_attribute(name, value)
-      @data[name.to_sym] = value
-    end
-

     def method_missing(method_name, *args) #:nodoc:
-      if (match = method_name.to_s.match(/(.*)=$/)) && args.size == 1
-        set_attribute(match[1], args.first)
+      if args.size == 1 and (match = method_name.to_s.match(/(.*)=$/))
+        self.class.class_eval "def #{method_name} arg; @data[:#{match[1]}] = arg; end"
+        send(method_name, args[0])
       else
         super
       end

method_missing is great for spreading some nice-looking sugar around, but it was being hit several times for each seed! By defining a real method on the first call and letting later calls go straight to it, we shave off more time.
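
Here is a standalone sketch of the same define-on-first-call idea. SeedRecorder and its misses counter are hypothetical, added only to show that method_missing fires once per attribute, and the sketch uses define_method where the patch above uses a class_eval string.

```ruby
# Hypothetical class for illustration only. NOT part of Seed Fu.
class SeedRecorder
  attr_reader :data, :misses

  def initialize
    @data   = {}
    @misses = 0 # how many times this instance fell through to method_missing
  end

  def method_missing(method_name, *args)
    if args.size == 1 && (match = method_name.to_s.match(/\A(.*)=\z/))
      @misses += 1
      attribute = match[1].to_sym
      # Define a real setter on the class so later calls skip method_missing.
      self.class.send(:define_method, method_name) { |value| @data[attribute] = value }
      send(method_name, args[0])
    else
      super
    end
  end

  def respond_to_missing?(method_name, include_private = false)
    method_name.to_s.end_with?('=') || super
  end
end

first = SeedRecorder.new
first.email = 'a@example.com'
first.email = 'c@example.com'

second = SeedRecorder.new
second.email = 'b@example.com'
```

Only the very first assignment goes through method_missing. The second recorder never touches it, because the setter already exists on the class.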

Use Less IO, Memory

The seed file I used, with 1.2 million rows, was 165M. Gzipped it is 16M. That means less IO for our slow disks, and fewer obnoxious files in source control. Seed Fu now reads .rb.gz just like .rb files.
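
Reading either kind of file takes very little code with Ruby's standard zlib library. The helper below is a sketch of the idea, not the code in the fork: read_fixture is a hypothetical name, and the fixture contents are generated just for the demonstration.

```ruby
require 'zlib'
require 'tmpdir'

# Hypothetical helper: return the fixture source whether or not it is gzipped.
def read_fixture(path)
  if path.end_with?('.gz')
    Zlib::GzipReader.open(path) { |gz| gz.read }
  else
    File.read(path)
  end
end

source = "User.seed(:email) { |s| s.email = 'me@mydomain.com' }\n" * 500
plain_text = gz_text = plain_size = gz_size = nil

Dir.mktmpdir do |dir|
  plain   = File.join(dir, 'users.rb')
  gzipped = plain + '.gz'
  File.write(plain, source)
  Zlib::GzipWriter.open(gzipped) { |gz| gz.write(source) }

  plain_text = read_fixture(plain)
  gz_text    = read_fixture(gzipped)
  plain_size = File.size(plain)
  gz_size    = File.size(gzipped)
end
```

Both calls return the same Ruby source, and the gzipped copy of this repetitive fixture is far smaller on disk.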

Seed Fu’s major failing point was that it grew to eat all RAM when dealing with gigabytes or even megabytes of fixtures. At one point, forking for each fixture looked like the only solution. It was clumsy and not very elegant.

The better solution was to break up the execution of large seed files. Seed Fu reads a .rb.gz or .rb file into memory as a string. If it hits:

# BREAK EVAL

It evaluates everything it has just collected and starts again from after the comment. The memory usage on the 1.2 million row import was about 60M of RAM (not unheard of for a Rails process), but it stayed there the whole import.
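
A minimal sketch of that loop, assuming the marker sits on a line of its own. eval_in_chunks is a hypothetical name rather than the method in the fork, and the $seeded global exists only so the example has something to observe.

```ruby
# Hypothetical sketch: evaluate Ruby source piece by piece, splitting on
# "# BREAK EVAL" lines. Returns the number of chunks that were evaluated.
def eval_in_chunks(source, context = TOPLEVEL_BINDING)
  marker = /\A\s*# BREAK EVAL\s*\z/
  chunk  = String.new
  count  = 0

  run = lambda do
    next if chunk.strip.empty?
    eval(chunk, context)
    count += 1
    chunk = String.new # drop the evaluated text so it can be garbage collected
  end

  source.each_line do |line|
    if line =~ marker
      run.call
    else
      chunk << line
    end
  end
  run.call # evaluate whatever follows the last marker
  count
end

$seeded = []
fixture = <<-'RUBY'
$seeded << 'first'
# BREAK EVAL
$seeded << 'second'
$seeded << 'third'
# BREAK EVAL
$seeded << 'fourth'
RUBY

chunks = eval_in_chunks(fixture)
```

Each chunk has to be valid Ruby on its own, so a marker cannot be placed in the middle of a statement or block.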

Add a Generator For Large Fixtures

165M fixture files are not being written by hand. Chances are, if you run into speed issues with Seed Fu, you have data coming from a third party. To keep the bootstrapping for your app as easy as rake db:seed, you need to create Seed Fu fixtures from XML, CSVs, web services, or any other kind of source.

Say hello to SeedFu::Writer! Use the writer to generate large fixtures that take advantage of # BREAK EVAL and the more concise seed_many syntax. Take a look:

seed_writer = SeedFu::Writer::SeedMany.new(
  :seed_file  => SEED_FILE,
  :seed_model => 'City',
  :seed_by    => [ :city, :state ]
)

FasterCSV.foreach( CITY_CSV,
  :return_headers => false,
  :headers => :first_row
) do |row|

  # Do some logic on row...

  # Write the seed
  #
  seed_writer.add_seed({
    :zip => row['zipcode'],
    :state => row['state'],
    :city => row['city'],
    :latitude => row['latitude'],
    :longitude => row['longitude']
  })

end

seed_writer.finish

See more detail at the bottom of my Seed Fu fork’s github page. SeedFu::Writer::SeedMany takes several arguments upon initialization:

:seed_file - Where to write output (probably db/fixtures/my_fixture.rb).
:seed_model - Which model to seed.
:seed_by - An array of which columns to constrain the seed by.
:quiet - Setting this to true will quiet standard out.
:chunk_size - How many seeds to write before breaking evaluation. Default is 100.

And SeedFu::Writer::Seed takes an additional argument to add_seed to set the :seed_by columns for that particular seed.
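
To make :chunk_size concrete, here is a rough sketch of what a chunking writer might do. It is not the code in the fork: ChunkedSeedWriter is a hypothetical class, and the exact layout it emits (one seed_many call per chunk, followed by a # BREAK EVAL line) is an assumption about the output.

```ruby
require 'stringio'

# Hypothetical writer for illustration. NOT SeedFu::Writer.
class ChunkedSeedWriter
  def initialize(io, options)
    @io         = io
    @model      = options.fetch(:seed_model)
    @seed_by    = options.fetch(:seed_by)
    @chunk_size = options.fetch(:chunk_size, 100)
    @pending    = []
  end

  # Buffer a seed and write the buffer out once it reaches chunk_size.
  def add_seed(attributes)
    @pending << attributes
    write_chunk if @pending.size >= @chunk_size
  end

  # Write out any seeds still waiting in the buffer.
  def finish
    write_chunk unless @pending.empty?
  end

  private

  def write_chunk
    constraints = @seed_by.map { |column| column.inspect }.join(', ')
    @io.puts "#{@model}.seed_many(#{constraints}, ["
    @io.puts @pending.map { |seed| "  #{seed.inspect}" }.join(",\n")
    @io.puts '])'
    @io.puts '# BREAK EVAL'
    @pending = []
  end
end

out    = StringIO.new
writer = ChunkedSeedWriter.new(out, :seed_model => 'City', :seed_by => [:city, :state], :chunk_size => 2)
writer.add_seed(:city => 'Springfield', :state => 'IL', :zip => '62701')
writer.add_seed(:city => 'Portland',    :state => 'OR', :zip => '97201')
writer.add_seed(:city => 'Austin',      :state => 'TX', :zip => '73301')
writer.finish
output = out.string
```

With a chunk size of 2, the three seeds come out as two seed_many calls, each followed by a break marker, so a loader never has to evaluate more than one chunk at a time.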

Seed Fu is Better, Now It’s Your Turn

This is a nice step for Seed Fu that begins to make it a real solution for 100s of megabytes of data. The writer gets us closer to a repeatable cycle for importing 3rd party data (just store your conversion scripts with the app code).

Do you have an alternative plan for large data-sets in Rails? What would you like to see Seed Fu do next?