Using Amazon Cloudsearch with Python and Boto Thu, Apr 12, 2012
This morning Amazon announced their latest addition to the toolkit, Cloudsearch. Here at ex.fm, we were lucky enough to be part of the private beta for Cloudsearch, so we’ve been running it in production for the past few months.
In terms of performance, Cloudsearch is much faster and more reliable than our previous Apache Solr install. The best feature by far is scaling on demand. Data is horizontally partitioned on the fly as the number of documents you’re indexing grows. The number of instance size and number of instances scales up and down depending on how many bits you’re pushing and pulling. For more on how the auto scaling works, see Dr. Vogels’ post. In 4 months, we’ve had a total of 15 minutes of downtime, which is very good for a private beta. We’re also not concerned with lock-in because it is so similar to the way we were using Solr. We were even able to swap in Cloudsearch for Solr without anyone really noticing, except that search results were twice as fast and had much better relevance.
Out of the box, there are command line tools for everything, but we wanted a nicer integration with our app and management scripts. We already use boto for all of our S3, EC2, and Route53 acccess to AWS, so it was a pretty natural fit to create a drop in module to add Cloudsearch. Disclaimer: what we came up with was designed specifically for our needs and timeline, so we took some liberties in not making the code entirely “boto-ized” and not all Cloudsearch features are fully implemented. We’ll contribute all of this back into boto and hopefully a stable version will land in boto master soon.
So let’s say we want to add search for our users. We want to be able to search by username, be able to drill down in the results to users in different countries, and sort the results in some interesting ways.
It’ll take a few minutes for your new domain to provision instances and get itself set up, so go check reddit and come back in a bit.
Now, time to get our documents in. If you’ve read the above links, you already know your new Cloudsearch cluster gives you two endpoints to work with; the document endpoint and the search endpoint. You can find these in the console once your cluster is ready. To get your documents indexed you post a JSON body to the document service endpoint in a schema Amazon calls Search Data Format (SDF) that looks like this:
Now let’s get our users indexed.
At this point, you should be able to check back at the console and see that you have 4 searchable documents. You can even run some test searches from the console. Try `dan*` and notice the faceting. Let search using some python.
Now lets say a user wants to delete their account and we need to remove them from the index.
We can build on the above to have our Cloudsearch cluster work even better for millions of documents like we deal with everyday at ex.fm.
Something that has worked very well for us is queueing incremental updates to songs and users. When a song or user is modified and we want to reindex the document in Cloudsearch, we add the documents id to a redis set. We then have a celery periodic task that collects all of the ids from the redis set, builds an SDF with hundreds of document updates and commits them all at once.
Another trick is using UTC timestamps in place of actually keeping a version number for each document. In our case, for songs and users, we do keep a version number and mongodb makes incrementing this extremely convenient and simple. For some new domains we’re working on though, we are using the timestamp as a version hack.
No downtime reindexing allows easy experimenting with a lot of different sorting algorithms (re: rank expressions). Rank expressions are really amazing and we’re still trying to wrap our heads around them fully. It’s incredibly simple to have a rank expression that implements any of the various popularity algorithms (ie Hacker News, Reddit, StubleUpon) in one line.
Lastly, it has been really handy to export entire mongodb collections to SDF’s and store them on S3 for even faster iteration and potential disaster recovery. We have a script that creates thousands of celery tasks; 1000 document ids at a time. Each celery task then pulls those 1000 objects from mongodb, creates the SDF and stores it on S3. We can then run another script, that creates thousands of more tasks to grab each SDF from S3 and directly post it to the document service. We’ve found this to be extremely effective for indexing millions of documents in a very timely fashion.
Many thanks are due to Jon Handler, Umesh Sampat, Gil Isaacs, Mike Bohlig, Stuart Tettemer, Puneet Gupta, Sonja Hyde-Moyer and everyone else from the Cloudsearch and AWS teams.
Sound like fun? Come work with us. jobs [at] ex.fm