Thursday, September 24, 2009

Pybots success stories and a call for help

Any report of the death of the Pybots project is an exaggeration. But not by much. First, some history.

Some history

The idea behind the Pybots project is to allow people to run automated tests for their Python projects, while using Python binaries built from the very latest source code from the Python subversion repository.

The idea originated from Glyph, of Twisted fame. He sent out a message to the python-dev mailing list in which he said:
"I would like to propose, although I certainly don't have time to implement, a program by which Python-using projects could contribute buildslaves which would run their projects' tests with the latest Python trunk. This would provide two useful incentives: Python code would gain a reputation as generally well-tested (since there is a direct incentive to write tests for your project: get notified when core python changes might break it), and the core developers would have instant feedback when a "small" change breaks more code than it was expected to."

This was back in July 2006. I volunteered to maintain a buildbot master (running on a server belonging to the PSF) and to rally a community of people interested in running this type of testing. The hard part was (and still is) finding people willing to donate client machines to act as build slaves for a particular project, and even more so finding people willing to keep up with the status of their build slaves. The danger here, as in any continuous integration system, is that once the status turns red and doesn't go back to green, people start to ignore the failed steps. Even if those steps exhibit new and interesting failures, it's too late at this point (this is related to the broken windows theory).

The project started fairly strong and gained some momentum, but then slowly ran out of steam. It was a combination of me not having the time to do the rallying, and of people losing interest in participating in the project. At the height of its momentum, in early 2007, the Pybots farm consisted of 11 buildslaves running automated tests for more than 20 Python projects, including Twisted, Django, SQLAlchemy, MySQLdb, Bazaar, nose, twill, Storm, Trac, CherryPy, Genshi and Roundup. Pretty much a who's who of the Python project world.

Early success stories

Here are some examples of bugs discovered by the buildslaves in the Python farm:
  • new keywords 'as' and 'with' in Python 2.6 causing problems for projects that had variables with those names
  • Python install step failing even though all unit tests were passing (this underscores the importance of functional testing)
  • platform-specific issues -- for example Bazaar issues on Windows due to TCP client behavior, Twisted issues on Red Hat 9 due to multicast behavior, Python core issues on OS X due to string formatting errors
(For a more thorough overview of the Pybots project, including lessons learned, see also my PyCon07 presentation.)

Recent signs of life and more success stories

In the last month or so there has been a flurry of activity related to the Pybots farm. It all started with an upgrade of the buildbot version on the machine hosting the Pybots buildmaster. This broke the master's configuration file, so the Pybots status page went completely dark.

As a result, Steve Holden posted a plea for help, which was answered by a few people who showed interest in adding build slaves to the project. In parallel, Jean-Paul Calderone jumped in to help on the buildmaster side and managed to fix the buildbot upgrade issue (thanks, JP!). David Stanek also expressed interest in taking a more active role on the buildmaster side.

Jean-Paul also sent more success stories to the Pybots mailing list. Here they are, verbatim, with his permission:

"The skip story:

The Twisted pybots slave started skipping every Twisted test one day. I noticed and filed http://twistedmatrix.com/trac/ticket/3703 (which goes into a bit of detail about why this happened). This happened to come up during the PyCon language summit, so there was some real-time discussion about it, resulting in a Python bug being filed, http://bugs.python.org/issue5571. Then, as that ticket shows, Benjamin Peterson was nice enough to fix the incompatibility.

The array/buffer story:

The Twisted pybots slave started to fail some Twisted tests one day. ;) The tests in question were actually calling into some PyCrypto code, so this failure wasn't in Twisted directly. PyCrypto loads some bytes into an array.array and then tries to hash them (for some part of its random pool API). I filed http://bugs.python.org/issue6071 on which someone explained that hashlib switched over to the new buffer API, lost support for hashing anything that only provides the old buffer API, and that array.array still only supports the old buffer API. This one hasn't been fixed yet, but it sounds like Gregory Smith plans to fix it before 2.7 is released.

There are other success stories too, incompatible changes that are more like bugs on the Twisted side than on the Python side (assuming one is generous and believes that incompatible changes in Python can actually be Twisted bugs ;). Things like typos that didn't result in syntax errors in an older version of Python but became syntax errors in newer versions (in particular, a variable was defined as 0x+80000000 instead of 0x80000000 - the former actually being valid syntax in 2.5 but became illegal in 2.6)."


My hope is that stories like these will convince more people of the usefulness of running tests for their projects against 'live' changes in the Python trunk (or other Python branches). I am not aware of any other testing project that accomplishes this for other programming languages.

In particular, if there is enough interest, we can also configure the Pybots master to trigger test runs for your project of choice using Py3k binaries! Think how cool you'll appear to your grandchildren!

How you can help

If you want to be involved in the Pybots project, please subscribe to the Pybots mailing list and show your interest by sending a message to the list.

Jeff Roberts on a scalable DNS scheme for EC2

My ex-colleague from OpenX, Jeff Roberts, has another great blog post on 'A Scalable DNS Scheme for Amazon's EC2 Cloud'. If you need to deploy an internal DNS infrastructure in EC2, you have to read this post. It's based on battle-tested experience.

Monday, September 14, 2009

A/B testing and online experimentation at Microsoft

Via Greg Linden, I found a great presentation from Ronny Kohavi on "Online experimentation at Microsoft". All kinds of juicy nuggets of information on how to conduct meaningful A/B testing and other types of controlled online experiments.

One of my favorite slides is 'Key Lessons', from which I quote:

  • "Avoid the temptation to try and build optimal features through extensive planning without early testing of ideas"
  • "Experiment often"
  • "Try radical ideas. You may be surprised"

The entire presentation is highly recommended. You can tell that this wisdom was earned in the school of hard knocks, which is the best school there is in my experience, at least for software engineering.

Wednesday, September 02, 2009

Bootstrapping EC2 images as Puppet clients

I've been looking at Puppet lately as an alternative to slack for automated deployment and configuration management. I can't say I love it, but I think it's good enough that it warrants banging your head against the wall repeatedly until you learn how to use it. I do wish it were written in Python, but hey, you do what you need to do. I did look at Fabric, and I might still use it for 'push'-type deployments, but it has nowhere near the features that Puppet has (and its development and maintenance just changed hands, which makes it too cutting edge for me at this point).

But this is not a post about Puppet -- although I promise I'll blog about that too. This is a post on how to get to the point of using Puppet in an EC2 environment, by automatically configuring EC2 instances as Puppet clients once they're launched.

While the mechanism I'll describe can be achieved by other means, I chose to use the Ubuntu EC2 AMIs provided by alestic. As an aside, if you're thinking about using Ubuntu in EC2, do yourself a favor and read Eric Hammond's blog (which can be found at alestic.com). He has a huge number of amazingly detailed posts related to this topic, and they're all worth your while to read.

Unsurprisingly, I chose a mechanism provided by the alestic AMIs to bootstrap my EC2 instances -- specifically, passing user-data scripts that are automatically run on the first boot of the instance. You can obviously also bake this into your own custom AMI, but the alestic AMIs already have this hook baked in, which I LIKE (picture Borat's voice). What's more, Eric kindly provides an easy way to run custom scripts from within the main user-data script -- I'm referring to his runurl utility, detailed in this blog post. Basically, you point runurl at the URL of another script you wrote, and runurl will download and run that script. You can also pass parameters to runurl, which will in turn be passed to your script.
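As a quick illustration of the parameter passing, a hypothetical call (this particular script and URL are made up) would look like:


# downloads the script at the given URL and runs it with '2.6.2' as its first argument
runurl ec2web.mycompany.com/install/python 2.6.2
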

Enough verbiage, let's see some examples.

Here is my user-data file, whose file name I am passing along as a parameter when launching my EC2 instances:


#!/bin/bash -ex

# write /etc/hosts, pointing this instance at my puppetmaster
cat <<EOL > /etc/hosts
127.0.0.1 localhost.localdomain localhost
10.1.1.1 puppetmaster

# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts
EOL

# fetch Eric Hammond's runurl utility and make it executable
wget -qO/usr/bin/runurl run.alestic.com/runurl
chmod 755 /usr/bin/runurl

# run my custom bootstrap scripts, hosted on an internal web server
runurl ec2web.mycompany.com/upgrade/apt
runurl ec2web.mycompany.com/customize/ssh
runurl ec2web.mycompany.com/customize/vim
runurl ec2web.mycompany.com/install/puppet


The first thing I do in this script is to add an entry to the /etc/hosts file pointing at the IP address of my puppetmaster server. You can obviously do this with an internal DNS server too, but I've chosen not to maintain my own internal DNS servers in EC2 for now.

My script then retrieves the runurl utility from alestic.com, puts it in /usr/bin and chmod's it to 755. Then the script uses runurl and points it at various other scripts I wrote, all hosted on an internal web server.

For example, the contents of upgrade/apt are:


#!/bin/bash
apt-get update
apt-get -y upgrade
apt-get -y autoremove


For ssh customizations, my script downloads a specific .ssh/authorized_keys file, so I can ssh to the new instance using certain ssh keys.
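I won't reproduce my exact customize/ssh here, but a minimal sketch along these lines does the trick (the URL and paths are just an example, following the same convention as the other scripts):


#!/bin/bash
# sketch of customize/ssh (hypothetical URL and paths): install an
# authorized_keys file so the instance accepts my ssh keys for root
mkdir -p /root/.ssh
chmod 700 /root/.ssh
wget -qO/root/.ssh/authorized_keys http://ec2web.mycompany.com/configs/ssh/authorized_keys
chmod 600 /root/.ssh/authorized_keys
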

To install and customize vim, I have customize/vim:


#!/bin/bash
apt-get -y install vim
wget -qO/root/.vimrc http://ec2web.mycompany.com/configs/os/.vimrc
echo 'alias vi=vim' >> /root/.bashrc


...where .vimrc is a customized file that I keep under the document root of the same web server where I keep my scripts.

Finally, install/puppet looks like this:


#!/bin/bash
apt-get -y install puppet
wget -qO/etc/puppet/puppetd.conf http://ec2web.mycompany.com/configs/puppet/puppetd.conf
/etc/init.d/puppet restart


Here I am installing puppet via apt-get, then I'm downloading a custom puppetd.conf configuration, which points at puppetmaster as its server name (instead of the default, which is puppet). Finally, I restart puppet so that the new configuration takes effect.
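For reference, a minimal puppetd.conf along these lines might look like the sketch below (the exact settings depend on your Puppet version, so treat this as an example rather than a drop-in file):


# /etc/puppet/puppetd.conf -- minimal sketch
[puppetd]
server = puppetmaster
logdir = /var/log/puppet
vardir = /var/lib/puppet
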

Note that I want to keep these scripts to the bare minimum that allows me to:

1) ssh into the instance in case anything goes wrong
2) install and configure puppet so the instance can talk to the puppetmaster

The actual package and application installations and customizations on my newly launched instance will be done through puppet, by associating the instance hostname with a node that is defined on the puppetmaster; I am also adding more entries to /etc/hosts as needed, using puppet-specific mechanisms such as the 'host' type (as promised, a blog post on this is forthcoming...).
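To give a rough idea until then, a node definition on the puppetmaster might look like the sketch below (the node name, class and host entry are all made up for illustration):


# hypothetical node definition in the puppetmaster's manifests
node 'web01.mycompany.com' {
  # made-up class that installs my application stack
  include webserver
  # add an /etc/hosts entry via the built-in 'host' type
  host { 'dbmaster':
    ensure => present,
    ip     => '10.1.1.2',
  }
}
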

Note that you need to make sure you have good security on the web server instance that is serving your scripts to runurl; Eric Hammond talks about using S3 for that, but it's too complicated IMO (you need to sign URLs and expire them, etc.). In my case, I preferred to use an internal Apache instance with basic HTTP authentication, and to only allow traffic on port 80 from certain security groups within EC2 (my Apache server doubles as the puppetmaster, BTW).

Is your hosting provider Reliam?

Some exciting news from RIS Technology, the hosting company I used to work for. They changed their name to Reliam, which stands for Reliable Internet Application Management. And I think it's an appropriate name, because RIS has always been much more involved in the application stack than your typical hosting provider. When I was there, we rolled out Django applications, Tomcat instances, and MySQL, PostgreSQL and Oracle installations, and we maintained them 24x7, which required a deep understanding of the applications. We also provided the glue that tied the various layers together, from deployment to monitoring.

Since I left, RIS/Reliam has invested heavily in a virtual infrastructure that can be combined, where it makes sense, with physical dedicated servers. The DB layer is usually dedicated, since the closer you are to bare metal, the better off you are in terms of database access. But the application layer can easily be virtualized and scaled on demand. So you get the scaling benefits of cloud computing and the performance benefits of dedicated servers.

Here are some stats for the infrastructure that RIS/Reliam used for supporting traffic during the recent Miss Universe event (they host missuniverse.com and missusa.com):

* 135 virtual servers running the Web application
* 9 virtual servers running mysql-proxy
* 1 master DB server and 5 read-only slave DB servers running MySQL
* 301.52 Mbps bandwidth
* 33,750 concurrent users
* over 150K concurrent sessions per second

An interesting note is that they used round robin DNS to load balance between the mysql proxies and had all proxies configured to use the master and all five slaves. They managed to get mysql-proxy 0.7.2 running with this patch.
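To make that setup concrete, a mysql-proxy invocation along the lines they describe might look like this (the host names are hypothetical, and I'm inferring the flags from the mysql-proxy documentation rather than from their actual configuration):


# one read/write master backend, five read-only slave backends
mysql-proxy \
  --proxy-address=:4040 \
  --proxy-backend-addresses=dbmaster:3306 \
  --proxy-read-only-backend-addresses=dbslave1:3306 \
  --proxy-read-only-backend-addresses=dbslave2:3306 \
  --proxy-read-only-backend-addresses=dbslave3:3306 \
  --proxy-read-only-backend-addresses=dbslave4:3306 \
  --proxy-read-only-backend-addresses=dbslave5:3306
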

So...what's the point of this note? It's a shout-out to my friends at RIS/Reliam, and a warm recommendation for them in case you need a hosting provider with strong technical capabilities that cover cloud/hybrid computing, system architecture design, application deployment and deep application monitoring and graphing.
