Mathias' thoughts...: July 2010

Thursday, July 8, 2010

Vote for the Ubuntu stack exchange

This morning Evan's email hit my inbox: there is a suggestion to create a Stack Exchange site for Ubuntu.

I've always been impressed by the stackoverflow and serverfault web sites. Granted, forums have been around for a long time - however, I love the user interaction provided by the folks behind Stack Exchange. A couple of months ago they created area51 to collect proposals for new sites that could use the same framework behind stackoverflow and serverfault. And again the user experience for handling these requests is great.

In my opinion Stack Exchange provides an excellent user experience that fosters user contributions and collaboration - in line with the values of the Ubuntu community.

So I went over to area51 and voted on the on-topic and off-topic questions for the Ubuntu proposal.

Tuesday, July 6, 2010

Velocity 2010: Fast by default - Thursday

Thursday was the last day of the conference and followed the same format as Wednesday: keynotes in the morning, three parallel tracks in the afternoon.

Creating Cultural Change

John Rauser from Amazon shared a few experiences about creating cultural changes inside and outside organizations.

Here are some key takeaways:

Try something new

Seek group identity

Welcome newcomers

Be relentlessly happy

These ideas actually reminded me of how the Ubuntu community has been built up.

In the Belly of the Whale: Operations at Twitter

John Adams of Twitter presented a few insights on how operations are run at Twitter.

He outlined several principles to keep in mind when building their infrastructure:

Nothing works the first time. Plan to rebuild everything more than once.

Deploy faster and more often, as less code changes with each deploy.

Detect problems as early as possible - to recover fast.

Disable/enable features in production, aka feature dark mode.
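The last principle, switching features on and off in production, can be illustrated with a minimal sketch. This is not Twitter's implementation: the `FeatureFlags` class, the `render_timeline` function and the `trending_topics` flag are hypothetical names that only show the idea of gating a code path behind a runtime switch.

```python
class FeatureFlags:
    """In-memory feature switches (hypothetical; a real system would read them from shared config)."""

    def __init__(self, defaults=None):
        self._flags = dict(defaults or {})

    def enable(self, name):
        self._flags[name] = True

    def disable(self, name):
        self._flags[name] = False

    def is_enabled(self, name):
        # Unknown features are treated as off, so new code ships dark by default.
        return self._flags.get(name, False)


def render_timeline(flags):
    """Return the parts of a page to render, skipping features that are switched off."""
    parts = ["tweets"]
    if flags.is_enabled("trending_topics"):
        parts.append("trending_topics")
    return parts


flags = FeatureFlags({"trending_topics": True})
full_page = render_timeline(flags)

# Under load, ops can turn the expensive feature off without a new deploy.
flags.disable("trending_topics")
degraded_page = render_timeline(flags)
```

The point of the design is that turning a feature off is a configuration change rather than a code rollback, which fits the goal of recovering fast.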

To support these guiding principles he listed some of the tools that are used:

Configuration management done with Puppet and svn.

Reviewboard to review changes made to the infrastructure

Ganglia to take care of monitoring

Scribe to collect and aggregate all logs into Hadoop HDFS using LZO compression.

Murder to deploy their code to all of their systems via bittorrent.

Google Analytics to track error pages, and Whale Watcher to track errors in logs.

Unicorn to power their Rails stack.

Lightning talks

Thursday's lightning talks covered another round of useful tools for optimizing page loads:

httpwatch: a commercial tool that loads web pages and analyses them

pagetest

speedtracer: a Chrome browser extension that provides insight into what the browser is doing when loading a page

fiddler2

Moving Fast

Robert Johnson of Facebook gave a talk about the culture of moving fast at Facebook. Here are a few short sentences to summarize his points:

How to scale? Have a team that reacts fast.

The release cycle: Make changes every day, as frequent small changes make it easier to figure out what went wrong.

Give control and responsibility to one person.

He finished with a few lessons that were learned:

New code is slow.

Give developers room to try things.

Nobody's job is to say no.

Practice of Continuous Deployment

Throughout the conference I heard the idea of continuous deployment multiple times. With continuous integration being pushed on the developer side, its counterpart on the ops side is continuous deployment: test, build, deploy. Deploy multiple times a day, with a good monitoring system to quickly identify when things go wrong. When things go wrong it's easier to identify what changed, as the number of changes in each deploy is rather low. All the big shops have a deployment dashboard to review what went live, when and by whom.

The Launchpad team is already following this idea: Launchpad edge gets a daily update of the code, running against the production database. Releases (with DB schema changes) are conducted on a monthly basis. Ubuntu provides something similar, as the development version is always available for installation - and releases are cut every 6 months.
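The deployment dashboard idea (what went live, when and by whom) can be sketched in a few lines. This is only an illustration under my own assumptions: the `Deploy` record, the `suspects` function and the sample revisions and authors are all made up, and none of the talks showed actual code.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Deploy:
    revision: str
    author: str
    deployed_at: datetime


def suspects(deploys, last_good, alert_at):
    """Return deploys that went live after the last known-good time, up to the alert."""
    return [d for d in deploys if last_good < d.deployed_at <= alert_at]


# Hypothetical deploy history for one day.
log = [
    Deploy("r101", "alice", datetime(2010, 7, 6, 9, 0)),
    Deploy("r102", "bob", datetime(2010, 7, 6, 11, 30)),
    Deploy("r103", "carol", datetime(2010, 7, 6, 14, 15)),
]

# Monitoring looked fine at 11:00 and raised an alert at 12:00.
culprits = suspects(
    log,
    last_good=datetime(2010, 7, 6, 11, 0),
    alert_at=datetime(2010, 7, 6, 12, 0),
)
for d in culprits:
    print(f"{d.deployed_at:%H:%M} {d.revision} by {d.author}")
```

With several small deploys a day, the window between "last known good" and "alert" usually contains only one or two changes, which is what makes the culprit easy to find.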

Monday, July 5, 2010

Velocity 2010: Fast by default - Tuesday and Wednesday

Here is a report on Velocity 2010, the Web Performance and Operations conference. In its third year it grew to more than 1100 attendees - this year was sold out.

Tuesday workshops

Tuesday was dedicated to workshops, even though most of them turned out to be presentations with demos given the number of participants, so there were not a lot of hands-on sessions. Here is a small selection of talks I found interesting throughout the day:

Infrastructure automation with Chef

Overview of the Chef project led by the high-energy and opinionated Adam Jacob from Opscode.

For me the most exciting part was that Chef provides a complete view of your infrastructure and the ability to query it any way you want.

Adam gave a few high-impact principles:

Being able to reconstruct a business from a source code repository, a data backup and bare metal resources.

Another interesting feature from the knife tool was the ability to start/spawn new instances in EC2 from the command line. For example the following command will give you an ec2 instance running your rails role within a few minutes:

knife ec2 server create 'role[rails]'

Protecting "Cloud" Secrets With Grendel

A technical overview of the Grendel project: OpenPGP as a software service.

The project gives the ability to share encrypted documents between multiple people. From the security perspective, each user's private key is stored in the cloud, encrypted with a passphrase that is known only to the user and transmitted via HTTP basic auth.

Wednesday sessions

Wednesday was the first day of the conference with keynotes in the morning and three tracks in the afternoon.

Datacenter Infrastructure Innovation

James Hamilton from Amazon Web Services gave an interesting overview of the different parts of building a data center.

An interesting point he made was that data centers should target 100% usage of their servers, while the industry standard is around 10 to 15% utilization on average. This objective led to the introduction of spot instances in EC2, so that resource usage could be maximized and the load on Amazon's cloud infrastructure flattened out. That reminds me of some comments from Google engineers stating that they try to pile as much work as possible on each of their servers. At their scale, having a server powered off costs money.

He covered other topics:

air conditioning: data centers could be run way hotter than they are now

power: the cost of power is a small part of the total cost of running a data center - server hardware being more than half of the cost. This is an interesting point with regards to the whole green computing movement.

Speed matters

Urs Hölzle from Google covered the importance of having web pages that load fast and a range of improvements Google has been working on over the last few years: from the web browser (via Chrome) down to the infrastructure (such as DNS).

He also highlighted that Google's page ranking process now takes into account the speed at which a page loads. As heard multiple times during the conference, there is now empirical evidence that directly links page load speed to revenue: the faster a page loads, the more people will stay on the web site.

Lightning talks

Wednesday's lightning demos showcased a list of tools that highlight performance bottlenecks and help track down why page loads are slow and how to improve them:

Yslow

dynaTrace

page speed

Getting Fast: Moving Towards a Toolchain for Automated Operations

Lee Thompson and Alex Honor reported on the work of the devtools-toolchain group. The group formed a few months ago to share experiences and build up a set of best practices. Of the use cases they've outlined, KaChing's Continuous Deployment was the most interesting one:

Release is a marketing concern.

Facebook operations

Tom Cook of Facebook gave a sneak peek at the life of operations at Facebook.

Very interesting talk about the development practices of one of the busiest websites on the internet. Facebook is running out of two data centers (one on the east coast, one on the west coast) while they're building their own data center in Oregon.

Their core OS is CentOS 5 with a customized kernel. For system management cfengine is set to update every 15 minutes, with a cfengine run taking around 30 seconds. All of the changes are peer reviewed.

On the deployment front, bug fixes are pushed out once a day while new features are rolled out on a weekly basis. Code is pushed to 10000s of servers using bittorrent swarms. Coordination is done via IRC, with the engineer available in case something goes wrong.

The developer is responsible for writing the code as well as testing and deploying it. New code is then exposed to a subset of real traffic. Ops are embedded in engineering teams and take part in design decisions. They're actually an interface to other ops.
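Exposing new code to a subset of real traffic is often done by hashing a stable identifier into a bucket. The sketch below is one common way to do it and is not taken from the talk: the `in_rollout` function, the `new_profile_page` feature name and the use of SHA-256 are my own assumptions.

```python
import hashlib


def in_rollout(user_id, feature, percent):
    """Deterministically place a user in or out of a rollout of `percent` (0-100)."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100) for this user and feature
    return bucket < percent


# Expose a hypothetical feature to roughly 5% of 10,000 simulated users.
users = range(10_000)
exposed = sum(in_rollout(u, "new_profile_page", 5) for u in users)
share = exposed / len(users)
```

Because the bucket depends only on the feature name and the user id, a given user sees consistent behaviour across requests, and raising the percentage only ever adds users to the exposed group.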

As a summary Tom gave a few points:

version control everything

optimize early

automate++

use configuration mgmt

plan to fail

instrument everything

don't waste time on dumb stuff