Tuesday, September 19, 2006

Test Driven Development proves useful at Google:
Google is building very sophisticated products with complex, cutting-edge technologies: never-before-tried algorithms, optimizations, and heuristics. Their applications have significant scalability needs and have to deal with difficult issues such as spam, bots, and attacks.

Testing is very important because of all these challenges, plus the software needs to be durable: it can't crash.

Google believes that great code comes from happy engineers. Their engineering structure is very flat. Engineers are largely self-managing and take on a lot of responsibility. There is a very strong peer review culture, and engineers are empowered to set their own goals. This structure creates an organization of motivated, productive engineers who feel empowered to build quality software.

Google has gone through tremendous growth in code base, users, and engineers. More systematic processes for testing and analysis have been added as a result.

Google has focused a lot on the early part of the development process: quality via design and review. Design documents are required for all non-trivial projects, and a formal peer review process is applied to them. All changes to the code base require peer review. There are strict programming style guidelines and a formal introduction to those guidelines for all new engineers. Great code comes from a good early design and review process! The process moves a bit slower because of this, but quality and end results are better.
Goals of testing and analysis: a smooth development process without build breakage (unit testing and XP have made a big impact here), functional correctness and backward compatibility, robustness, scalability and performance, and understanding user needs/improving functionality.

Standard Google practices include unit tests and functional tests, continuous builds, a last known good build, release branches with authorized bug-fix check-ins, focused teams for release engineering, production engineering, and QA, bug tracking, and logging of production runs.

Google brings in XP consultants to educate engineers and employs extreme feedback mechanisms like monitors and ambient orbs for visual feedback. They have specialized test and analysis tools for use both during production and prior to production. Sometimes they hold fix-it weeks for fixing bugs, writing tests, improving documentation, etc.

When Google introduced XP to improve quality and other metrics, they hired a team of XP consultants and paired them with engineers. They created short projects for the employees with testing/XP as a theme; the teams focused on understanding the code base, building good unit tests and other TDD practices, and learning how to use infrastructure such as JUnit. XP was introduced at Google about eight months ago. Engineers are not forced to use XP, but adoption is going well, and they have already seen improvements in key metrics and in stability.

First steps were to build functional tests for existing code, develop tests that fail for bugs in the bug database, write unit tests for existing servlet handlers, use unit tests and TDD for new code, and devote fix-it weeks to developing unit tests and functional tests. A sketch of such a handler test follows.
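The talk didn't show actual code, but a servlet-handler unit test of the kind described might look something like the following JUnit 3-style sketch. The handler, its behavior, and every name here are hypothetical stand-ins, not Google's code:

    import junit.framework.TestCase;

    public class SearchHandlerTest extends TestCase {

        // Hypothetical handler under test: validates a query string and
        // produces a result line. Stands in for a real servlet handler.
        static class SearchHandler {
            String handle(String query) {
                if (query == null || query.trim().length() == 0) {
                    return "error: empty query";
                }
                return "results for: " + query.trim();
            }
        }

        public void testEmptyQueryReturnsError() {
            assertEquals("error: empty query", new SearchHandler().handle(""));
        }

        public void testQueryIsTrimmed() {
            assertEquals("results for: foo", new SearchHandler().handle("  foo  "));
        }
    }

The point of tests like this is that they run in the continuous build, so a breaking change is caught on the next build rather than at release time.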

Current status: very stable builds due to unit tests, and better backwards compatibility due to unit and functional tests. More TDD is planned, with many more tests. The goal is to get to a stage where XP and testing offer benefits beyond build stability and backwards compatibility: much higher quality in production software.

Google does a lot of logging of production runs and has tools and APIs to process the logs. Rule-based and stream-based tools feed off the logs and produce graphs, trigger pages, etc. Exception traces during production are extracted from the logs, and each stack trace is assigned to a particular engineer for further analysis (done with clever correlation against the SCM system to find the right engineer).
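A rough sketch of that extraction step: scan a log, collect each exception stack trace, and note its first application-owned frame, which is the frame a tool could then map to an owner via the SCM system. The log format and the "com.example." package prefix are made-up assumptions:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    public class TraceExtractor {
        public static void main(String[] args) throws IOException {
            BufferedReader in = new BufferedReader(new FileReader(args[0]));
            List<String> trace = new ArrayList<String>();
            String line;
            while ((line = in.readLine()) != null) {
                if (line.contains("Exception")) {
                    report(trace);                 // flush any previous trace
                    trace.clear();
                    trace.add(line);               // header line of a new trace
                } else if (line.trim().startsWith("at ") && !trace.isEmpty()) {
                    trace.add(line);               // stack frame
                } else {
                    report(trace);                 // any other line ends the trace
                    trace.clear();
                }
            }
            report(trace);
            in.close();
        }

        static void report(List<String> trace) {
            if (trace.isEmpty()) return;
            for (int i = 1; i < trace.size(); i++) {
                String frame = trace.get(i).trim();
                // First frame in the application's own packages is the one
                // to hand to an engineer (prefix is a placeholder).
                if (frame.startsWith("at com.example.")) {
                    System.out.println(trace.get(0) + " -> assign to owner of " + frame);
                    return;
                }
            }
            System.out.println(trace.get(0) + " -> no application frame found");
        }
    }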
Servers are packaged together with heap inspection methods, which can be invoked through special commands to the server. This produces a full memory dump which is then analyzed off-line by tools to locate problems. This is necessary because it is impossible to replicate the production system in its full complexity.
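The talk didn't say how the heap inspection is implemented. On a HotSpot JVM, one standard way a server could expose such a command is the HotSpot diagnostic MBean, sketched below; this is just an illustration of the idea, not Google's mechanism:

    import com.sun.management.HotSpotDiagnosticMXBean;
    import java.lang.management.ManagementFactory;

    public class HeapDumpCommand {
        // Write a full heap dump to the given file for off-line analysis.
        public static void dumpHeap(String path) throws Exception {
            HotSpotDiagnosticMXBean bean = ManagementFactory.newPlatformMXBeanProxy(
                    ManagementFactory.getPlatformMBeanServer(),
                    "com.sun.management:type=HotSpotDiagnostic",
                    HotSpotDiagnosticMXBean.class);
            bean.dumpHeap(path, true);  // true = dump only live objects
            System.out.println("heap written to " + path);
        }

        public static void main(String[] args) throws Exception {
            dumpHeap(args.length > 0 ? args[0] : "server-heap.hprof");
        }
    }

A real server would wire a method like this to a special admin request rather than to main().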

Google has many databases and uses an O-R mapping layer to hide the details of access. Performance issues are often related to database access, so tools are used to identify the database queries resulting from different servlet handlers. Ratchet tests fail if database activity exceeds set thresholds.
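A ratchet test might look like the sketch below: the test fails whenever a handler's query count exceeds the current known-good value, and the threshold is only ever lowered as the code improves. The countQueriesFor() hook is hypothetical; a real version would reset the O-R layer's query counter, exercise the handler, and read the counter back:

    import junit.framework.TestCase;

    public class DatabaseRatchetTest extends TestCase {

        // The ratchet: lower it as the code improves; raising it should
        // require a deliberate decision.
        private static final int MAX_QUERIES_PER_REQUEST = 12;

        public void testSearchHandlerStaysUnderQueryBudget() {
            int queries = countQueriesFor("/search?q=test");
            assertTrue("handler issued " + queries + " queries; budget is "
                    + MAX_QUERIES_PER_REQUEST, queries <= MAX_QUERIES_PER_REQUEST);
        }

        // Hypothetical hook into the O-R mapping layer's statistics.
        private int countQueriesFor(String request) {
            return 7;  // placeholder so the sketch is self-contained
        }
    }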

Multi-threaded program behavior is a difficult area to test, as these programs may demonstrate bad behavior (e.g. race conditions) only under certain circumstances that may be impossible to reproduce. Static analysis alone can't check for this behavior either. Google has a hybrid system that runs the system under test for a short time and then performs static analysis on the resulting execution trace.
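The talk didn't describe the analysis itself. One classic way to analyze a recorded trace for races is a lockset check in the spirit of the Eraser algorithm: for each shared field, intersect the sets of locks held across all accesses, and flag fields touched by multiple threads with no common lock. The sketch below is my illustration of that general technique, with a tiny hand-made trace, not Google's actual tool:

    import java.util.*;

    public class LocksetTraceAnalysis {

        // One recorded event: a thread touched a field while holding some locks.
        static class Access {
            final String field; final long thread; final Set<String> locksHeld;
            Access(String f, long t, Set<String> l) { field = f; thread = t; locksHeld = l; }
        }

        public static void main(String[] args) {
            // Made-up trace: both threads touch "counter", but only thread 1
            // holds lock L; "name" is always protected by lock M.
            List<Access> trace = new ArrayList<Access>();
            trace.add(new Access("counter", 1, new HashSet<String>(Arrays.asList("L"))));
            trace.add(new Access("counter", 2, new HashSet<String>()));
            trace.add(new Access("name", 1, new HashSet<String>(Arrays.asList("M"))));
            trace.add(new Access("name", 2, new HashSet<String>(Arrays.asList("M"))));
            analyze(trace);  // prints: possible race on: counter
        }

        static void analyze(List<Access> trace) {
            Map<String, Set<String>> locksets = new HashMap<String, Set<String>>();
            Map<String, Set<Long>> threads = new HashMap<String, Set<Long>>();
            for (Access a : trace) {
                Set<String> ls = locksets.get(a.field);
                if (ls == null) locksets.put(a.field, new HashSet<String>(a.locksHeld));
                else ls.retainAll(a.locksHeld);  // keep only locks held on EVERY access
                Set<Long> ts = threads.get(a.field);
                if (ts == null) { ts = new HashSet<Long>(); threads.put(a.field, ts); }
                ts.add(a.thread);
            }
            for (String field : locksets.keySet()) {
                boolean shared = threads.get(field).size() > 1;
                boolean unprotected = locksets.get(field).isEmpty();
                if (shared && unprotected)
                    System.out.println("possible race on: " + field);
            }
        }
    }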

Key messages: self-motivated engineers, grass-roots adoption of XP and other new techniques, and continuing improvements in productivity and quality, which is what helps Google keep up with their tremendous growth. Unit testing has helped them improve their infrastructure and work better together.
