Iterating on Iterations – The Year-Long Evolution of the Way We Work at Next Big Sound

[Originally posted in two parts on popforms.com.]

With epidemically low employee engagement, being highly effective and happy at work is the exception, not the norm. Why are we failing to engage at work? Why are healthy, high-performance teams so rare?

At least part of the answer to these troubling questions lies in the fact that most companies are organized in inflexible, hierarchical, command-and-control silos. These organizational structures are arguably ill-suited even to the assembly lines where they originated over a century ago, let alone today’s knowledge workforce. Even more surprisingly, of the many companies that have adopted a modern, iterative approach to product development (known as “Lean” or “Agile”), only a few take the same iterative approach to their organizations.

At Next Big Sound, we are committed to iterating not only on our products but also on the way that we work.

Our fundamental approach is rooted in openness. We want everyone to be directly involved in deciding to work in a particular way, and able to easily learn the history and the rationale behind past decisions. We continuously ask “Why?” and never settle for “I don’t know” or, worse, “Because we’ve always done it this way.”

Organizations are complex systems that exhibit surprising, emergent behaviors. We can’t predict the future – no one really knows if changing the organization in a specific way will have the intended results, in the same way that no one knows if adding a new feature will help make software successful. (Though we do know a thing or two about who might enter the Billboard 200 next year.) However, we are an organization that’s willing to quickly experiment with various ways of working and adjust based on what we’ve learned.

This is a story of the evolution of the way that we work at Next Big Sound, a record of the things we’ve tried and tweaked over the last year.

Since July 2013, we’ve been iteratively building a healthier, more flexible, high-performance organization: a place where highly engaged, happy folks can do some of the best work of their lives.

We’re far from done, and this account is also meant to provide the necessary context and encouragement for everyone to continue asking “Is there a better way to do it?”

July 2013: Creating project-focused, self-organizing teams

In the spring of 2013, we were working in several product teams and a “core” team which was responsible for infrastructure, storage, and our API. The idea was that each of these teams would have all (human) resources necessary to do the work planned for each product. In reality, though, it wasn’t always clear what teams were (or should be) working on, and there was sometimes a lack of focus with multiple projects going on at the same time within each team.

To address some of these issues, the entire company gathered to discuss a proposal of working in a different way.

First, we agreed to do away with strictly product-focused teams, and instead introduced project-focused teams. We defined a “project” as 2-4 weeks of focused work, and agreed that there would only be one project at a time per team. We also encouraged everyone to keep the teams small, in order to minimize communication overhead and maximize speed, and independent, in order to minimize external dependencies.

Before the start of the project, each team would scope the work and define a clear and measurable outcome. At the project’s completion, we would show everyone the progress during a “demo day”. Teams would also conduct retrospectives to learn what we did well or could do better.

We also outlined the role of management as simply “to provide clear business goals, and to help teams maximize productivity, minimize distraction, and remove roadblocks”. You might notice that, by omission, the role of management was (and still is) not to tell people what to do, or how to do it.

Instead of top-down management, teams would self-organize and self-manage, with everyone encouraged to take on the team lead role. (In fact, as of today, everyone at the company has served as a team lead on at least one project.) At the time, the role of team leads was loosely defined, with the main focus on ensuring communication within the team and with the rest of the company. We offered some loose guidelines, but each team had the choice to follow, not follow, or amend them.

Over the previous four years, and several versions of our flagship product, we had accumulated a significant amount of technical debt, with an aging storage system nearing capacity and two similar, but not quite identical, versions of our analytics dashboard in production. With that in mind, we agreed to have at least one “non-project” team to pay down technical debt and fix bugs at all times.

At the time, we thought that the most significant difference from prior iterations was that teams would now self-organize to complete specific projects. That is, people could join or ask others to join a team at any time, not just at the beginning of a project.

In retrospect, the more important change that we agreed to try was a new method of working that we later started calling “self-selection”. A year later, it is still a cornerstone of the way that we work at Next Big Sound: you get to pick what you work on, whom you work with, and where you work.

This is not a startup “perk” or a recruiting tactic; it is rooted in a deeply held belief that everyone should have the autonomy to work in the way that makes them most engaged, happiest, and thus most productive.

After a brief discussion, we dove right in, self-organizing project teams and selecting team leads. Watching folks self-select into projects was nerve-racking (Will it actually work? Will people select the difficult, unglamorous, but critical projects aimed at paying down our technical debt? Will it all devolve into chaos?), and yet it proceeded in a remarkably matter-of-fact fashion.

August 2013: Arrival of the BugBusters

With a month of working in this new way under our belt, everyone in the company met to discuss what we’d learned and to see how we might improve. While initial results were very positive – the sharpened focus and the self-selecting teams were working well – we decided we needed to do three things: shorten the length of projects (now called “iterations”); clarify the mechanics of an iteration; and reduce interruptions.

Interestingly, most of the teams had chosen one month as the length of the first iteration. A month is practically an eternity in startup time; accurately planning such a significant amount of work is difficult in any organization. Most teams had experienced significant changes in the scope of their iterations, which either unexpectedly ballooned or had to be reduced before completion. As a result, we decided to limit the length of iterations to two weeks, a practice that we stuck with until April of 2014.

There was also some confusion about the mechanics of iterations, which we clarified by specifying things like when iteration scopes should be defined, where they should be documented, and when retrospectives should be conducted. Since iterations were now a fixed length, it became possible to start and end all of them on the same day (typically every other Wednesday), which became the company-wide demo day.

We also noticed that there was a non-trivial amount of work required to fix bugs or data import issues, and address systems/ops-related alerts or outages. We wanted to keep these interruptions to a minimum, so we introduced an evolution of the “non-project” team idea: a 1-week BugBusters engineer rotation, which started each Wednesday at noon.

The BugBuster was tasked with incident management, i.e., triage of any issue or bug that might arise during the rotation. The BugBuster was not expected to be able to fix every issue, and would ask for help as necessary. Things that couldn’t be fixed quickly (e.g., within about a day) would become projects and go through the normal prioritization and self-selection processes.

BugBusters was perhaps the most impactful change that we introduced at this time, especially because, in recognition of the community-service aspect of the rotation, engineers could now choose to do an entire hack week before or after it, working on whatever they wanted at Next Big Sound.

This currently adds up to about four weeks of individual hack time per year for each engineer (with an additional two weeks of company-wide hack days). It’s also worth noting that although we recognized the critical nature of BugBusters, we did not mandate that there always be an engineer on rotation or that every engineer participate in BugBusters.

Instead, we chose to treat BugBusters as one of the projects up for self-selection, something that we had to clarify and “tune up” a few months later. Still, the BugBusters rotation (also fondly called “HackBusters”) is alive and well today, and is responsible for some of the continued innovation at Next Big Sound. Some of the projects that came out of hack weeks included Tunebot, explorations of Zipf’s Law for Facebook Page Likes, a Next Big Sound Charts iPhone App, and countless product experiments and improvements.

November 2013: When projects go un-selected

When people hear about self-selection, the first question is usually whether there are projects that don’t get selected. What about that “shit project” that no one wants to work on? What about projects with external customer deadlines?

Yes, it’s true, it happens: sometimes seemingly important projects don’t get selected. When this happens, no one is ever coerced or forced to work on a project that they did not select. Instead, we ask a lot of questions, starting with, not surprisingly, “Why?”.

If the project’s importance is obvious, why did it still not get selected? Was its importance clearly communicated? Does the team have the necessary skills to complete the work? Are we working on other projects that are higher priority? If we think the project is, in fact, important, we have at most two weeks to advocate for its selection for the following iteration. Otherwise, we could be thankful to the “wisdom of the crowd” for showing that the project is not as critical or time-sensitive as initially thought.

With that in mind, we introduced the concept of a project advocate: a person who could provide the necessary context to the team during self-selection. Ideally, the advocate would cover things like why this project is important to do during the coming iteration, and how it ties to company themes and goals. In addition, we decided that each project idea should explicitly list the skills required to complete it (e.g., front-end development, Java, design).

We also noticed that the timing of communication of iteration scopes, updates, and retrospective results was somewhat haphazard. (All these are communicated via e-mail to the entire company).

Because accountability – which, in this case, is literally the responsibility to provide an account of what’s happening to the entire team – is such a critical part of self-selection, we agreed to adhere to a clear communication schedule, which specified the exact timing of initial iteration scopes, mid-iteration progress updates, scope changes, and retrospectives.

In the two months since the last tune-up to the way that we were working, it had also become increasingly clear that BugBusters was a project that someone had to select at all times (rather than letting it go unstaffed during a week when no one selected it, as we had done before).

For at least one two-week iteration, no one selected to be on BugBusters, which actually resulted in higher interrupt levels for most engineers. In addition, certain tasks (like managing over 100 data sources) were falling disproportionately to several engineers and client services folks, who were unable to fully dedicate their time to other projects.

Most importantly, we realized that having the rotation was required for self-selection to apply fully to all engineers: we did not want to have a single engineer dedicated to BugBusters/incident management at all times, because she would not be able to fully participate in self-selection. More fundamentally, having someone on ops at all times (not just when someone chooses to be) is required for us to operate the Next Big Sound service.

After a lengthy and heated all-hands discussion, we agreed to have someone on BugBusters at all times and created a place for engineers to track (and trade) their upcoming rotations. To be clear, this still does not mean that every engineer must do BugBusters (although by this point, every engineer has completed at least one rotation). Simply put, as part of self-selection, we trust everyone to do what’s right for them, their team, and NBS.

March 2014: Doing demo day

With the mechanics of self-selection mostly worked out, we continued what has been (according to the founders) the longest sustained period of high productivity in the company’s 5-year history. We next turned our attention to demo day – a high point of the iteration when everyone gets to show their work and celebrate our progress as a team.

Perhaps the best way to illustrate the issues with demo day was to compare its intention with how it was actually practiced:

In theory

  • demos are short (under 7 minutes)
  • demo day should last about 1 hour
  • demos highlight “the difference between what was initially planned and what was accomplished, including identifying any loose ends”
  • demos are explicit opportunities to learn from others, and the most salient parts of retrospectives are emphasized
  • demo day artifacts (e.g., presentations) should be easily found
  • the entire history of any iteration should be easily accessible at any time to anyone in the company
  • we intend the software that we write to be tested, documented, and shipped to production during the iteration

In practice

  • demos are sometimes short, and frequently go over 7 minutes
  • demo days have lasted as long as 2 hours
  • demos sometimes highlight the difference between the original scope and what was completed; remaining work is sometimes documented as Trello cards
  • demos sometimes refer to lessons learned and stop/start/continue items from retrospectives
  • there is no central place for demo day artifacts
  • only the original and updated scopes and the results of retrospectives can be found in e-mail (and only if you are at the company when it is sent); it’s not always clear what was actually shipped and how that might differ from what was planned
  • the status of testing, documentation, and shipping is sometimes mentioned during demos, and not consistently documented
  • many demos use PowerPoint presentations, with “the rate of information transfer … asymptotically approaching zero”

The first thing that we nipped in the bud was the (over)use of PowerPoint. Instead, we opted to write out iteration summaries, and store them in a central place (Google Drive). We created an (optional) summary template, and agreed to keep summaries under six pages in length. We then decided to try reading the summaries simultaneously as a group during demo day, in the same way that we had done during several all-hands meetings (inspired by the same practice during meetings at Amazon).

With the introduction of the iteration summaries, we now had a rich, historic, narrative record of each iteration. Due to the high-bandwidth communication of the written word (and an occasional animated gif), the team now had a higher degree of awareness about the many projects going on than ever before.

However, because we also limited the actual demos to two minutes, the social aspect of presenting your project in front of the entire team was greatly diminished. The high-energy demo days became subdued. Before we could address this issue, though, we identified another gap between how we thought we worked and how we actually worked, one that we had to dive into first.

April 2014: Getting things done

Starting in July of 2013, we settled into the comfortable rhythm of two-week iterations. However, in reality, few big projects divide neatly into two-week chunks. As a result, either “work [expanded] so as to fill the time available for its completion” (according to Parkinson’s law), or people took on additional work to fill the available time within the iteration. Both of these situations were common, but not easily visible (which is why it took us this long to address them).

In addition, because we were emphasizing iterations that were exactly two weeks in length, shorter projects and leftover items from previous iterations that took less than two weeks to complete were effectively “second-class citizens.”

That is, two-week iterations subtly encouraged the selection of projects that took at least two weeks to complete. As a result, some projects took a long time to finish, staying at the “almost complete” mark for extended periods of time, and incurring a significant context-switching cost to get to 100% complete.

We also had a whole class of projects (namely data science reports and data journalism articles) that were much more dynamic in nature, with constantly changing priorities (driven by customers or partners) and highly variable effort required to complete them. In recognition of this, some team members were already working outside the structure of normal two-week iterations. A recent experience of cramming several smaller research projects into a two-week iteration had further highlighted the awkward fit of rigid iteration lengths for this type of work.

To address the above shortcomings of fixed length iterations, in early April, we agreed to try working in iterations that could vary in length, not to exceed two weeks. To minimize the cost of context-switching, we also encouraged folks to stay on projects for their duration (not just for an iteration). As one engineer put it, “One fully complete project is better than five halfway complete ones.”

We also agreed to make context-switching explicit: recognizing the fact that we can only work on one thing at a time, we can only mark one task as “in progress” at any given time, marking all others as “blocked”. (We use Trello for tracking our work).
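
To make that rule concrete, here is a minimal sketch of a board that enforces a single “in progress” task. This is illustrative Python only; we track real work in Trello, and none of the names below come from Trello’s API.

    # Hypothetical sketch of the "one task in progress at a time" rule.
    IN_PROGRESS = "in progress"
    BLOCKED = "blocked"

    class Board:
        def __init__(self, tasks):
            # Every task starts out blocked until someone explicitly starts it.
            self.status = {task: BLOCKED for task in tasks}

        def start(self, task):
            # Starting one task automatically blocks whatever was in progress.
            for name, current in self.status.items():
                if current == IN_PROGRESS:
                    self.status[name] = BLOCKED
            self.status[task] = IN_PROGRESS

    board = Board(["data import fix", "dashboard redesign", "API documentation"])
    board.start("dashboard redesign")
    board.start("data import fix")  # the dashboard work flips back to "blocked"
    print(board.status)

The point is the constraint, not the code: switching what is “in progress” becomes an explicit act, which keeps the cost of context-switching visible.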

We also changed demo day to accommodate the new, variable-length iterations. During the bi-weekly demo day, folks present the results of any iteration that has been completed before that Wednesday. That might be one iteration or ten. If an iteration is not fully completed before a particular demo day, its results are presented during the next one.

We’ve also returned to more interactive demo days. While we still ban PowerPoint and generate iteration summaries (which everyone is encouraged to read offline), we’ve shifted the focus of demo days back to actual demos.

Today

What has worked in the past may not work in the future. As we grow, we remain committed to collaboratively iterating not only on our products, but also on the way that we work.

Unwilling to settle for cookie cutter approaches, we’ll continue to experiment until we find methods that best fit our culture and the challenges ahead.

This evolution is not driven solely by management; in fact, many of the changes described above were championed by designers, client services folks, data journalists, data scientists, and engineers. We strongly believe that this is one of our competitive advantages, and one of the conditions that helps people working at Next Big Sound stay highly engaged.

Acknowledgements

I would like to thank my colleagues Liv Buli and Karl Sluis for their excellent and tireless feedback and advice on writing this story. I would also like to thank the entire Next Big Sound team for so fearlessly embracing the iterative approach to the way we work.

The importance of attribution in nascent fields and communities

In the rush to be original, innovative, provocative, or first-to-market, we often forget to acknowledge “prior art” or provide the context for the new ideas that we’re espousing.  The resulting lack of credibility is one of the most serious threats to emergent fields and their practitioner communities (such as devops or systems safety).

Would devops exist without ITIL or the work of Deming? Would Agile exist without Waterfall? Would the all-electric Tesla Model S exist without the hybrid Prius and the gas-guzzling Hummer?

That is not to say that there’s nothing new under the sun. However, even the most groundbreaking ideas do not exist in a vacuum, but only in relation to previous ideas. They build on—or refute—what came before. Humans suffer from a built-in resistance to change, and when new ideas are presented without proper context or attribution, they risk becoming just someone’s brilliant ideas, too easy to dismiss or to accept without full and critical evaluation, depending on the person’s popularity.

In science, it’s simply not enough to receive new ideas in dreams or visions; ideas that stick must have solid foundations, and often come with bibliographies many pages long.

Want to build your or your idea’s credibility? Want to strengthen your nascent field or emerging community? Emphasize their lineage, and give full attribution.

An open letter to #1 Recruiter From #1 Hedge Fund In The World

Recently, a recruiter (who I’m lovingly calling “#1 Recruiter”) sent this gem on LinkedIn with the subject “I would like to talk to you”:

I work at [Company]  (#1 Hedge Fund in the world), reviewed your profile and I would like to talk to you. Please let me know your availability to connect next week.

I tweeted and ignored the SPAM, but a few days later, #1 Recruiter followed up:

I am following up with you because I work at [Company] (#1 Hedge Fund in the world), reviewed your profile and I would like to talk to you. Please let me know your availability to connect next week.

Notice the expert use of copy-paste. To be fair, he did include a few extra links with information about the company, including their “Culture and Principles” web page. Nice touch!

This interaction neatly summarizes just about everything that’s wrong with recruiting (and LinkedIn). So instead of ignoring it, this time I wrote a brief reply, cc’ing the CEO of the #1 hedge fund in the world:

[#1 Recruiter],

If you actually reviewed my profile, you would see that I know at least half a dozen people who currently work at [Company]. I am *quite* familiar with the company, and appreciate its culture.

One of the core tenets of your company’s culture is radical openness and honesty. With that in mind, I’d like to be open and honest with you. What you’ve sent is SPAM. It reeks of mediocrity, the opposite of your company’s “overriding objective [of] excellence”. Stating only that you work for a company with money (“#1 hedge fund in the world”) as the reason to connect will not net you people who “value independent thinking and innovation”. I’d be wary of anyone who actually responds to your message (and I’d guess only about 1 or 2 out of 100 do); they, like you, hate their jobs and are just looking for money.

If you truly are looking for people who seek and can create “meaningful work and meaningful relationships”, why do you approach those you’re trying to recruit in such an utterly meaningless, repulsive way?

Why not take the time to actually tell candidates what working at [Company] would look like? Why not take 5 minutes to highlight the specific parts of the person’s background that stood out to you, and that would be especially relevant at [Company]? It would save you time in the long run, and help you find amazing candidates.

It’s not difficult, but it does require you to make a decision about which business you’d like to be in: selling counterfeit Viagra, or representing (favorably) the #1 hedge fund in the world.

#devopsWater

Free t-shirts are a long-standing tradition at tech conferences. Until this year, Devopsdays NYC, a tech conference that I co-organize, was no exception. At best, attendees wind up with a closet full of free t-shirts they wear with pride; at worst – and more often – these t-shirts end up in a landfill.

While planning this year’s conference, I found that each t-shirt would cost approximately $12 to produce. Given that Devopsdays NYC would have about 200 attendees, the total cost for t-shirts would add up to about $2,500. This particular number reminded me of a donation page on the Lotus Outreach website for their wells project. I quickly found the page, and the symmetry was astounding: “each well costs $2,500 and provides life-saving clean, safe drinking water to approximately 200 rural villagers – at the extraordinary cost of about $12 per person”.
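
The back-of-the-envelope arithmetic is easy to check; here is a quick sketch using the rounded figures from this paragraph (the real numbers were, of course, approximate):

    # Rough numbers quoted above; both sides are approximations.
    attendees = 200          # expected Devopsdays NYC attendance
    tshirt_cost = 12         # approximate cost to produce one t-shirt, in dollars
    well_cost = 2500         # cost of one deep-bore well, in dollars
    villagers_served = 200   # villagers served by one well

    print(attendees * tshirt_cost)       # 2400 dollars: roughly the t-shirt budget
    print(well_cost / villagers_served)  # 12.5 dollars per villager served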

Suddenly, it made sense: by forgoing ephemeral mementos, members of the devops community attending this year’s conference could build a deep-bore water well that would provide life-saving, clean, safe drinking water to approximately 200 rural villagers in Cambodia for years to come! I ran this idea by Patrick Debois (who coined the term “devops”) and others in the group that organizes Devopsdays events around the world, and they were enthusiastic to try this in New York!

At the end of the conference, we donated $12 of each Devopsdays NYC registration to Lotus Outreach, for a total of $2,500. My co-organizer Michael Rembetsy and I received an ovation when we described the impact that the attendees would have on life in one Cambodian village with this donation. We are all looking forward to following the progress of the well construction – the devops community will certainly enjoy seeing “Devopsdays NYC” on a plaque near the completed well, and knowing that they’ve made a huge difference in the lives of some of the poorest people on earth.

It is my hope that other tech conferences adopt the same approach, and donate the money they would normally spend on t-shirts to worthwhile causes like Lotus Outreach.

Here’s a video about the Lotus Outreach Wells Program:

The Human Side of Postmortems: Managing Stress and Cognitive Biases

[Download The Human Side of Postmortems to your Kindle].

Imagine you had to write a postmortem containing statements like these:

We were unable to resolve the outage as quickly as we would have hoped because our decision making was impacted by extreme stress.

We spent two hours repeatedly applying the fix that worked during the previous outage, only to find out that it made no difference in this one.

We did not communicate openly about an escalating outage that was caused by our botched deployment because we thought we were about to lose our jobs.

While the above scenarios are entirely realistic, it’s hard to find many postmortem write-ups that even hint at these “human factors.” Their absence is, in part, due to the social stigma associated with publicly acknowledging their contribution to outages. And yet, people dealing with outages are clearly subject to physical exhaustion and psychological stress, not to mention impaired reasoning due to a host of cognitive biases.

This report focuses on the effects and mitigation of stress and cognitive biases during outages and postmortems. This “human postmortem” is as important as the technical one, as it enables building more resilient systems and teams, and ultimately reduces the duration and severity of outages.

Here’s a video of me discussing this topic at Surge 2013: