Wednesday, October 15, 2014

Testing CDN and geolocation with webpagetest.org

Assume you want to migrate some.example.com to a new CDN provider. Eventually you'll have to point example.mycompany.com as a CNAME to a domain name handled by the CDN provider, let's call it example.cdnprovider.com. To test this setup before you put it in production, the usual way is to get an IP address corresponding to example.cndprovider.com, then associate example.mycompany.com with that IP address in your local /etc/hosts file.

This works well for testing most of the functionality of your web site, but it doesn't work when you want to test geolocation-specific features such as displaying the currency based on the users's country of origin. For this, you can use a nifty feature from the amazing free service WebPageTest.

On the main page of WebPageTest, you can specify the test location from the dropdown. It contains a generous list of locations across the globe. To fake your DNS setting and point example.mycompany.com, you can specify something like this in the Script tab:

setDNSName example.mycompany.com example.cdnprovider.com
navigate http://example.mycompany.com

This will effectively associate the page you want to test with the CDN provider-specified URL, so you will hit the CDN first from the location you chose.

Monday, October 13, 2014

Watch the open files limit when running Riak

I was close to expressing my unbridled joy at how little hand-holding our Riak cluster needs, when we started to see strange increased latencies when hitting the cluster, on calls that should have been very fast. Also, the health of the Riak nodes seems fine in terms of CPU, memory and disk. As usual, our good old friend the error log file pointed us towards the solution. We saw entries like this in /var/log/riak/error.log:

2014-10-11 03:22:40.565 UTC [error] <0.12830.4607> CRASH REPORT Process <0.12830.4607> with 0 neighbours exited with reason: {error,accept_failed} in mochiweb_acceptor:init/3 line 34
2014-10-11 03:22:40.619 UTC [error] <0.168.0> {mochiweb_socket_server,310,{acceptor_error,{error,accept_failed}}}
2014-10-11 03:22:40.619 UTC [error] <0.12831.4607> application: mochiweb, "Accept failed error", "{error,emfile}"

A google search revealed that a possible cause of these errors is the dreaded open file descriptor limit, which is 1024 by default in Ubuntu.

To be perfectly honest, we hadn't done almost any tuning on our Riak cluster, because it had been running so smoothly. But recently we started to throw more traffic at it, hence issues with open file descriptors made sense. To fix it, we followed the advice in this Riak doc and created /etc/default/riak with the contents:

ulimit -n 65536

We also took the opportunity to apply the networking-related kernel tuning recommendations from this other Riak tuning doc and added these lines to /etc/sysctl.conf:

net.ipv4.tcp_max_syn_backlog = 40000
net.core.somaxconn=4000
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_sack = 1
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_fin_timeout = 15
net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_tw_reuse = 1

Then we ran sysctl -p to update the above values in the kernel. Finally we restarted our Riak nodes one at a time.

I am happy to report that ever since, we've had absolutely no issues with our Riak cluster.  I should also say we are running Riak 1.3, and I understand that Riak 2.0 has better tests in place for avoiding this issue.

I do want to give kudos to Basho for an amazingly robust piece of technology, whose only fault is that it gets you into the habit of ignoring it because it just works!

Thursday, October 02, 2014

A quick note on haproxy acl rules

I blogged in the past about haproxy acl rules we used for geolocation detection purposes. In that post, I referenced acl conditions that were met when traffic was coming from a non-US IP address. In that case, we were using a different haproxy backend. We had an issue recently when trying to introduce yet another backend for a given country. We added these acl conditions:

       acl acl_geoloc_akamai_true_client_ip_some_country req.hdr(X-Country-Akamai) -m str -i SOME_COUNTRY_CODE
       acl acl_geoloc_src_some_country req.hdr(X-Country-Src) -m str -i SOME_COUNTRY_CODE

We also added this use_backend rule:

      use_backend www_some_country-backend if acl_akamai_true_client_ip_header_exists acl_geoloc_akamai_true_client_ip_some_country or acl_geoloc_src_some_country

However, the backend www_some_country-backend was never chosen by haproxy, even though we could see traffic coming from IP address from SOME_COUNTRY_CODE.

The cause of this issue was that another use_backend rule (for non-US traffic) was firing before the new rule we added. I believe this is because this rule is more generic:

       use_backend www_row-backend if acl_akamai_true_client_ip_header_exists !acl_geoloc_akamai_true_client_ip_us or !acl_geoloc_src_us

The solution was to modify the use_backend rule for non-US traffic to fire only when the SOME_COUNTRY acl condition isn't met:

       use_backend www_row-backend if acl_akamai_true_client_ip_header_exists !acl_geoloc_akamai_true_client_ip_us !acl_geoloc_akamai_true_client_ip_some_country or !acl_geoloc_src_us !acl_geoloc_src_some_country

Maybe another solution would be to change the order of acls and use_backend rules. I couldn't find any good documentation on how this order affects what gets triggered when.

Modifying EC2 security groups via AWS Lambda functions

One task that comes up again and again is adding, removing or updating source CIDR blocks in various security groups in an EC2 infrastructur...