Showing posts from 2012

2012 Year in Review: devops blog posts and articles

15 things I learned from "Shipping Greatness"

I finished reading "Shipping Greatness" by Chris Vander Mey (ex-Googler, ex-Amazonian). It's a great book, and I highly recommend it to anyone who is interested in learning how to ship software as close to schedule as possible, with as few bugs as possible, which is really the best we can hope for in this industry. Here are some things I jotted down while reading the book. You should really read the whole book, if only for the 'in-the-trenches' vignettes from Google and Amazon.

1. When tackling a new product, always start with the needs/problems of your customers (of course Jeff Bezos is famous for this approach at Amazon).

2. Have a mission statement that inspires and that also fits on a t-shirt if possible (for example the mission statement of the personalization team at Amazon is "increase customer delight").

3. Define your product by writing a press release (Amazon does it; it forces you to be succinct and to capture the essence of your product).


Code performance vs system performance

Just a quick thought: as non-volatile storage becomes faster and more affordable, I/O will cease to be the bottleneck it currently is, especially for database servers. Granted, there are applications/web sites out there which will always have to shard their database layer because they deal with a volume of writes well above what a single DB server can handle (I'm talking about mammoth social media sites such as Facebook, Twitter, Tumblr etc). By database in this context I mean relational databases. NoSQL-like databases worth their salt are distributed from the get-go, so I am not referring to them in this discussion.

For people who are hoping not to have to shard their RDBMS, things like memcached for reads and super fast storage such as FusionIO for writes give them a chance to scale their single database server up for a much longer period of time (and by a single database server I mostly mean the server where the writes go, since reads can be scaled more easily by sending t…
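The read-scaling pattern mentioned above (check the cache first, fall back to the database on a miss) can be sketched like this; note this is a toy illustration using a plain dict in place of a real memcached client, with a made-up `db_query` stand-in:

```python
# Toy cache-aside read pattern: check the cache first, fall back to the
# database on a miss, then populate the cache for subsequent reads.
# A plain dict stands in for a real memcached client here.

cache = {}

def db_query(user_id):
    # Stand-in for a (slow) query against the master/replica DB
    return {"id": user_id, "name": "user-%d" % user_id}

def get_user(user_id):
    key = "user:%d" % user_id
    if key in cache:          # cache hit: no DB round trip
        return cache[key]
    row = db_query(user_id)   # cache miss: hit the database
    cache[key] = row          # populate the cache for next time
    return row

first = get_user(42)   # miss: goes to the DB
second = get_user(42)  # hit: served from the cache
```

The point is that only the first read of a given key touches the database; every subsequent read is served from memory, which is what lets a single write master survive much longer.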

Quick troubleshooting of Sensu 'no keepalive from client' issue

As I mentioned in a previous post, we started using Sensu as our internal monitoring tool. We also integrated it with Pager Duty. Today we terminated an EC2 instance that had been registered as a client with Sensu. I started to get paged soon after with messages of the type:

 keepalive : No keep-alive sent from client in over 180 seconds

Even after removing the client from the Sensu dashboard, the messages kept coming. My next step was of course to get on the #sensu IRC channel. I immediately got help from robotwitharose and portertech.  They had me try the following:

1) Try to remove the client via the Sensu API.

I used curl and ran:

curl -X DELETE http://sensu.server.ip.address:4567/client/myclient

2) Try to retrieve the client via the Sensu API and make sure I get a 404

curl -v http://sensu.server.ip.address:4567/client/myclient

This indeed returned a 404.

3) Check that there is a single redis process running

BINGO -- when I ran 'ps -def | grep redis', the command returned T…
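Step 3 can also be scripted; here is a small sketch that counts redis-server processes by parsing `ps`-style output (the sample output below is made up for illustration, and in real use you would feed the function the output of `ps -def`):

```python
# Count how many redis-server processes appear in `ps -def`-style output,
# ignoring the grep process itself. In real use you'd feed this
# subprocess.check_output(["ps", "-def"], text=True).

def count_processes(ps_output, name):
    return sum(
        1
        for line in ps_output.splitlines()
        if name in line and "grep" not in line
    )

# Made-up sample output illustrating the duplicate-redis situation
sample = """\
root      1001     1  0 10:00 ?  00:00:01 /usr/bin/redis-server /etc/redis/redis.conf
root      2002     1  0 11:30 ?  00:00:00 /usr/bin/redis-server /etc/redis/redis.conf
ubuntu    3003  2999  0 12:00 pts/0  00:00:00 grep redis
"""

num = count_processes(sample, "redis-server")  # 2 here: more than one is a problem
```

Anything greater than 1 means stale state can linger in one of the redis instances, which is exactly the kind of thing that keeps resurrecting a deleted client.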

Monitoring doesn't have to suck

This is a follow up to my previous post, where I detailed some of the things I want in a modern monitoring tool. A month has passed, and I thought I'd give a quick overview of some tools we started to use as part of our monitoring and graphing strategy.

Several people recommended Sensu, so my colleague Jeff Roberts has given it a shot, and he liked what he saw (blog post with more technical details hopefully soon from Jeff!). We're still tinkering with it, but we're already using it to monitor our Linux-based machines (both Ubuntu and CentOS). Jeff is working on our Chef infrastructure, and soon we'll be able to deploy the Sensu client via Chef. What's nice about Sensu, and different from other tools, is the queueing mechanism it uses for client-server communication, and for posting events such as 'send this metric to Graphite' or 'send this alert to Pager Duty' or 'send this notification to this email address'. It does have a few rough edge…

What I want in a monitoring tool

I started a new job a few weeks ago, and I'm now at a point where I'm investigating monitoring options. At past jobs I used Nagios, which I know will work, but I would like to look into other more modern tools. I am aware that #monitoringsucks, and I am pretty sure people have hashed these topics before, but here are some of the things I want from a modern monitoring tool:
Ideally open source, or, if not, affordable per-host-per-month pricing (we already signed up as a paying customer of Boundary, for example)Installation and configuration should be easily scriptableserver installation, as well as addition/modification of clients, should be easily automated so it can be done with Puppet/Chefan API would be idealRobust notifications/alerting rulesescalationsservice dependenciesevent handler scriptsalerts based on subsets of hosts/servicesfor example, alert me only when 2+ servers of the same type are downOut-of-the-box pluginsdatabase-specific checks, for exampleScalabilitythe monitor…
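The "alert me only when 2+ servers of the same type are down" rule from the wish list above can be sketched as a small aggregation check; the data shapes here are hypothetical, not any particular tool's API:

```python
# Only fire an alert when at least `threshold` servers of the same type
# are down, instead of paging on every single host failure.
from collections import Counter

def alerts(host_states, threshold=2):
    # host_states: list of (hostname, server_type, is_up) tuples
    down = Counter(t for _, t, up in host_states if not up)
    return sorted(t for t, n in down.items() if n >= threshold)

states = [
    ("web1", "web", True),
    ("web2", "web", False),
    ("web3", "web", False),   # two web servers down -> alert on "web"
    ("db1", "db", False),     # only one db server down -> no alert
]

triggered = alerts(states)  # ["web"]
```

This kind of aggregation is what separates a tool that pages you intelligently from one that wakes you up for every flapping host.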

3 things to know when starting out with cloud computing

In the same vein as my previous post, I want to mention some of the basic but important things that someone starting out with cloud computing needs to know. Many times people see 'the cloud' as something magical, as the silver bullet that will solve all their scalability and performance problems. These people are in for a rude awakening if they don't pay attention to the following points.

Expect failure at any time

There are no guarantees in the cloud. Failures can and will happen, suddenly and mercilessly. Their frequency will increase as you increase the number of instances that you run in the cloud. It's a sickening feeling to realize that one of your database instances is gone, and there's not much you can do to bring it back. At that point, you need to rely on your disaster recovery plan (and you have a DR plan, don't you?) and either launch a new instance from scratch, or, in the case of a MySQL master server for example, promote a slave to a master. The s…

10 things to know when starting out as a sysadmin

This post was inspired by Henrik Warne's post "Top 5 Surprises When Starting Out as a Software Developer". I thought it was a good idea to put together a similar list for sysadmins. I won't call them 'surprises', just 'things to know'. I found them useful when I started out, and I still find them useful today. I won't prioritize them either, because they're all important in their own way.

Backups are good only if you can restore them

You would be right to roll your eyes and tell yourself this is so obvious, but in my experience most people run backups regularly, but omit to try to restore from those backups periodically. Especially if you have a backup scheme with one full backup every N days followed by either incremental or differential backups every day, it's important to test that you can obtain a recent backup (yesterday's at a minimum) by applying those incrementals or differentials to the full backup. And remember, if it's no…
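As a toy model of the full-plus-incrementals scheme described above: a restore replays each day's incremental (here, a dict of changed files) on top of the full backup, and should reproduce yesterday's state. This is purely illustrative, not any backup tool's actual format:

```python
# Toy model: a backup is a dict mapping file path -> contents.
# A full backup is a complete snapshot; each incremental holds only
# the files that changed since the previous backup.

def restore(full, incrementals):
    state = dict(full)             # never mutate the full backup itself
    for inc in incrementals:       # apply in chronological order
        state.update(inc)
    return state

full = {"/etc/app.conf": "v1", "/var/data": "d1"}
incrementals = [
    {"/etc/app.conf": "v2"},       # day 1: config changed
    {"/var/data": "d2"},           # day 2: data changed
]

restored = restore(full, incrementals)
# Periodically compare `restored` against what's actually on disk --
# a backup you can't restore is not a backup.
```

The order of application matters: replaying incrementals out of order would silently resurrect stale file versions, which is exactly the kind of failure a periodic test restore catches.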

The dangers of uniformity

This blog post was inspired by the Velocity 2012 keynote given by Dr. Richard Cook and titled "How Complex Systems Fail". Approximately 6 minutes into the presentation, Dr. Cook relates a story which resonated with me. He talks about upgrading hospital equipment, specifically infusion pumps, which perform and regulate the infusion of fluids in patients. Pretty important and critical task. The hospital bought brand new infusion pumps from a single vendor. The pumps worked without a glitch for exactly 1 year. Then, at 20 minutes past midnight, the technician on call was alerted to the fact that one of the pumps stopped working. He fiddled with it, rebooted the equipment and brought it back to life (not sure about the patient attached to the pump though). Then, minutes later, other calls started to pour in. It turns out that approximately 20% of the pumps stopped working around the same time that night. Nightmare night for the technician on call, and we can only hope he retaine…

Installing Python scientific and statistics packages on Ubuntu

I tried to install the pandas Python library a while ago using easy_install/pip and I hit some roadblocks when it came to installing all the dependencies. So I tried it again, but this time I tried to install most of the required packages from source. Here are my notes, hopefully they'll be useful to somebody out there.

This is on an Ubuntu 12.04 machine.

Install NumPy

# wget
# tar xvfz numpy-1.6.2.tar.gz; cd numpy-1.6.2
# cat INSTALL.txt
# apt-get install libatlas-base-dev libatlas3gf-base
# apt-get install python-dev
# python setup.py install
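Once NumPy is installed, a quick sanity check from the Python prompt (assuming the install above succeeded) is:

```python
# Quick sanity check that NumPy imports and basic array math works
import numpy as np

a = np.arange(6).reshape(2, 3)
total = int(a.sum())  # 0+1+2+3+4+5 = 15
```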

Install SciPy

# wget
# tar xvfz scipy-0.11.0b1.tar.gz; cd scipy-0.11.0b1/
# cat INSTALL.txt
# apt-get install gfortran g++
# python setup.py install

Install pandas

Prereq #1: NumPy 
- already installed (see above)

Prereq #2: python-dateutil

# wget…

A sweep through my Instapaper for June 2012

I'm not sure if I'll do this every month, but it does seem like a good way of recapitulating the last month in terms of interesting blog posts and articles that came my way. So here's my list for the month of June 2012:

- Latency numbers every programmer should know -- from cache references to intercontinental network latency, some numbers that will help you do those back-of-the-envelope calculations when you need to speed things up in your infrastructure
- Cynic -- test harness by Ruslan Spivak for simulating remote HTTP service behavior, useful when you want to see how your application reacts to various failures when interacting with 3rd party services
- Amazon S3 performance tips and tricks -- some best practices for getting the maximum performance out of S3 from Doug Grismore, Director of Storage Operations at AWS
- How to stop sucking and be awesome instead -- Jeff Atwood advises you to embrace failure, ship often, listen to feedback, and more importantly work on stuff that matt…

Installing and using sysbench on Joyent SmartOS

If you read Percona's MySQL Performance blog (and if you run MySQL in production, you should!), then you know that one of their favorite load testing tools is sysbench. As it turns out, it's not trivial to install this tool, especially when you have to install from source, for example on Solaris-based systems such as the Joyent SmartOS machines. Here's what I did to get it to work.

Download source distribution for sysbench
I downloaded the latest version of sysbench (0.4.12) from the Sourceforge download page for the project.

Compile and install sysbench
If you launch a SmartOS machine in the Joyent cloud, you'll find out very quickly that it's lacking tools that you come to take for granted when dealing with Ubuntu or Fedora. In this case, you need to install compilers and linkers such as gcc and gmake. Fortunately, SmartOS has its own package installer called pkgin, so this is not too bad.
To see what packages are available if you know the tool you want to install,…

Using the Joyent Cloud API

Here are some notes I took while doing some initial experiments with provisioning machines in the Joyent Cloud. I used their CloudAPI directly, although in the future I also want to try the libcloud Joyent driver. The promise of the Joyent Cloud 'SmartMachines' is that they are really Solaris zones running on a SmartOS host, and that gives you more performance (especially I/O performance) than regular virtual machines such as the ones offered by most cloud vendors. I have yet to fully verify this performance increase, but it's next on my TODO list.

Installing the Joyent CloudAPI tools

I did the following on an Ubuntu 10.04 server:

- installed node.js -- I downloaded it in tar.gz format, then I ran the usual './configure; make; make install'
- installed the Joyent smartdc node package by running 'npm install smartdc -g'
- created new ssh RSA keypair: id_rsa_joyentapi (private key) and (public…