My first Sinatra app was an email scraper


And yes, I did it my way

Updated on November 4, 2019

A couple of weekends ago I built my first Sinatra app to find some more business email patterns for Toofr. Gotta say... 'tis wonderful to work with such a clean, uncluttered Ruby framework.

Email Scraping: The Birth of the Blues

My little scraper is essentially a web app without a website. I didn't know very much about Sinatra when I got started. In fact, I still don't know much about it. I'm probably only scratching the surface of what it can do. What I do know is that it requires only a few files and that you can call your app's code from a Rake task. This is important, as I'll describe later on.

So why did I choose Sinatra? Simply put, I wanted a few things:

  • Keep it lightweight. My pure Ruby scraper script was just one file, so I figured a full-blown Rails app was wayyyy overpowered. I just need the bare bones.
  • Use ActiveRecord for database reading and writing. This way I wouldn't have to write any raw SQL for Postgres. I'm familiar with ActiveRecord syntax and wanted to build fast and not have to keep looking up the SQL for reads and writes.
  • Play nicely with Heroku. I decided from the beginning that I was going to run this little scraper on Heroku. I found from testing my script locally that the target site would ban my IP after a very low number of requests. Since Heroku spins up a new dyno every time a scheduled task runs, a great way to avoid using proxy servers is to just have Heroku make you a new dyno every time you get blocked.

Sales Hacking: All of Me

Exactly how lightweight is Sinatra? I was amazed. It's super light. Here's the file list. It's six files, and one of them is blank! (That's the Procfile, in my specific case, since I'm not running a website.)

  • app.rb - This file seems to be the guts. It loads all the Sinatra goodies and defines my models.
  • config.ru - I read that Heroku likes seeing this, so I included it.
  • environments.rb - This defines my Heroku and local database connections.
  • Gemfile - Just like a Rails app, this controls my libraries.
  • Procfile - Since I don't need Heroku to run any web or worker dynos, I include this file but leave it blank.
  • Rakefile - Here's where I put the actual scraper code. The scraper itself became a Rake task that gets called by Heroku Scheduler.
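
To make the Rakefile idea concrete, here's a rough sketch of what a scraper task could look like. This is not my actual Rakefile: the task name (scrape), the helper method (scrape_domains) and the example domains are all made up for illustration, and the real scraping logic is left out entirely.

```ruby
# Rakefile (hypothetical sketch; task and helper names are assumptions)
require 'rake' # harmless under `rake`, needed if this file is loaded from plain Ruby
# require './app' # in the real project this line would load Sinatra and the models

# Placeholder for the scraping logic. It only prints each domain
# and returns how many domains it was given.
def scrape_domains(domains)
  domains.each { |domain| puts "Scraping #{domain}..." }
  domains.size
end

desc 'Run the scraper once (meant to be triggered by a scheduler)'
task :scrape do
  # In the real app the list would presumably come from the Domain model.
  count = scrape_domains(%w[example.com example.org])
  puts "Done: #{count} domains"
end
```

With a task like this, the scheduled job's command would simply be something like "rake scrape", and each run starts from a clean process.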

The Best is Yet to Come

I'll describe the Ruby scraper in more detail in my next post. It's a pretty brute-force technique, but it's working really well!

A quick teaser - here are the contents of the app.rb file.

# app.rb

require 'sinatra'
require 'sinatra/activerecord'
require './environments'

class Domain < ActiveRecord::Base
end

class Page < ActiveRecord::Base
end
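
Since app.rb pulls in ./environments, here's a minimal sketch of how a file like that might turn a Heroku-style DATABASE_URL into connection settings. Treat it as an assumption-heavy illustration, not my real environments.rb: the database_config helper, the DB_CONFIG constant and the local database name are invented for this example.

```ruby
# environments.rb (hypothetical sketch, not the actual file)
require 'uri'

# Convert a postgres:// URL into the kind of hash that
# ActiveRecord::Base.establish_connection accepts.
def database_config(url)
  db = URI.parse(url)
  {
    adapter:  db.scheme == 'postgres' ? 'postgresql' : db.scheme,
    host:     db.host,
    port:     db.port,
    username: db.user,
    password: db.password,
    database: db.path.sub(%r{\A/}, ''), # strip the leading slash
    encoding: 'utf8'
  }
end

# Heroku provides DATABASE_URL; the local fallback name is an assumption.
DB_CONFIG = database_config(
  ENV['DATABASE_URL'] || 'postgres://localhost/scraper_development'
)

# With ActiveRecord loaded (app.rb requires sinatra/activerecord first):
# ActiveRecord::Base.establish_connection(DB_CONFIG)
```

The nice part of this approach is that the same file works locally and on Heroku; only the environment variable changes.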