Mathias' thoughts...: February 2010

I had the opportunity to attend FOSDEM this year. The most amazing (and frustrating) part of the event was the huge number of talks that were given. Making choices was sometimes hard. However the FOSDEM team recorded most of the sessions and the videos are available now. Perfect for such a busy event!

Here a few highlights of the presentations I attended:

Linux distribution for the cloud

I started the day by attending Peter Eisentraut (a Debian and PostgreSQL core developer) session about Linux distributions for the cloud. He focused on the provisioning aspect of clouds by giving a history of how operating systems were installed: from floppy drives to Cloud images. He dedicated one slide to Ubuntu's cloud offering including Canonical Landscape commenting that Ubuntu is clearly the leader of distributions in the cloud space. He also outlined what were the current problems such as lack of standards and integration of existing software stacks. He pointed out that linux distributions could drive this.

The second part of his talk was focused on the Linux desktop and the impact of cloud services on it. Giving ChromeOS as an example he outlined how applications themselves were moved to the cloud. He then listed the problems with Cloud services with regards to the Free Software principles: non-free server side code, non-free hosting, little or not control over data, lack of open-source community.

He concluded by outlining the challenge in the domain: how could free software principles be transposed to the Cloud and its services? One great reference is Eben Moglen talk named "Freedom in the Cloud".

Beyond MySQL GA

Kristian Nielsen, a MariaDB developer, gave an overview of the Developer ecosystem around MySQL. He listed a few patches that were available to add new functionalities and fix some bugs: the Google patch, Percona patches, eBay patches, Galera multi-master replication for InnoDB as well as a growing list of storage engines. Few options are available to use them:

packages from third party repositories (such as ourdelta and percona)

MariaDB maintains and integrates most of the patches

a more do-it-yourself approach where you maintain a patch serie.

I talked with Kristian after the presentation about leveraging bzr and LP to make the maintenance easier. It could look like this:

Each patch would be available and maintained in a bzr branch - in LP or else where.

The Ubuntu MySQL package branch available in LP would be used as the base for creating a debian package (or the Debian packaging branch since Debian packages are also available in LP via bzr)

bzr-looms would glue the package bzr branch with the patches bzr branches. The loom could be available from LP or elsewhere.

bzr-builder would be used to create a recipe to build binary packages out of the loom branch.

Packages would be published in PPAs ready to be installed on Ubuntu systems.

The Cassandra distributed database

I finally managed to get in the NoSQL room to attend Eric Evans overview of the Cassandra project. He is a full time developer and employee of Rackspace. The project was started by Facebook to power their inbox search. Even though the project had been available for some years the developer community really started to grow in March 2009. It is now part of the Apache project and about to graduate to a Top Level Project there.

It is inspired by Dynamo from Amazon and provide a O(1) DHT with eventual consistency and a consistent hashing. Multiple client APIs are available:

Thrift

Ruby

Python

Scala

I left before the end of the talk as I wanted to catch the complete presentation about using git for packaging.

Cross distro packaging with (top)git

Thomas Koch gave an overview of using git and related tools to help in maintaining Debian packaging. He works in web shop where every web application is deployed as a Debian package.

The upstream release tarball is imported in a upstream git branch using the pristine-tar tool. Packaging code (ie the debian/ directory) is kept in a different branch.

Patches to the upstream code are managed by topgit as seperate git branches. He also noted that topgit was able to export the whole stack of patches in the quilt Debian source format using the tg export command.

Here is the list of tools associated with his workflow:

pristine-tar

git-buildpackage

git-import-orig

git-dch

topgit

The workflow he outlined looked very similar to the one based around bzr and looms.

Scaling Facebook with OpenSource tools

David Recordon from Facebook gave a good presentation on the challenges that Facebook runs into when it comes to scale effectively.

Here are a few numbers I caught during the presentation to give an idea about the scale of the Facebook infrastructure (Warning: they may be wrong - watch the video to double check):

8 billion minutes spent on Facebook every day

2.5 billions of pictures uploaded every month

400 billion page/view per month

25 TB of log per day

40 billions pictures stored in 4 resolutions which bring a grand total of 160 billions photos

4 millions of line codes in php

Their overall architecture can be broken into the following components:

Load balancers

Web server (php): Most of the code is written in PHP: the language is simple, it fits well for fast development environments and there are a lot of developers available. A few of the problems are CPU, Memory, how to reuse the PHP logic in other systems and the difficulty to write extensions to speed up critical parts of the code. An overview of the HipHop compiler was given: a majority of their PHP code can be converted to C++ code which is then compiled and deployed on their webserver. An apache module is coming up soon probably as a fastcgi extension.

memcached (fast, simple): A core component of their infrastructure. It's robust and scales well: 120 millions queries/second. They wrote up some patches which are now making their way to upstream.

Services (fast, complicated): David gave an overview of some of the services that Facebook opensourced:
- Thrift: an rpc server, now part of the Apache incubator.
- Hive: build on top of hadoop it is now part of the Apache project. It's an SQL-like frontend to hadoop aiming at simplifying access to the hadoop infrastructure so that more people (ie non-engineers) can write and run data analysis jobs.
- Scribe: a performant and scalable logging system. Logs are stored in a hadoop/hive cluster to help in data analysis.

Databases (slow, simple): 1000's of MySQL servers are used as a persistence layer. InnoDB is used for the storage engine and multiple independent clusters are used for reliability. Joins are done at the webserver layer. The database layer is actually the persistence storage layer with memcached acting as a distributed index.

Other talks that seemed interesting

I had planned to a attend a few other talks as well. Unfortunately either their schedule was conflicting with another interesting presentation or the room was completely full (which seemed to happen all day long with the NoSQL room). Here is a list of them: