Agh! People are starting to use my App! Now what?
This is a slightly beefed up version of the lightning talk I gave at Austin on Rails last night - Feb 25, 2014.
I had intended to give this talk on things I've learned about infrastructure as I've helped stabilize and grow systems at my last two jobs. I've had a lot of experience working on deployment infrastructure, from Dreamhost, to Rackspace, to Heroku, to AWS, and now I'm managing our infrastructure on Amazon OpsWorks using Chef (which I've been writing about recently).
However, as I thought about it, the problems that have been hindering our growth have been less about server infrastructure and more about visibility.
I heard a great saying recently, which I generalized to:
You don't have an X problem, you have a visibility problem.
It's like the grizzled engineer who walks into a ship's malfunctioning boiler room and calmly knocks on some pipe or fitting to fix the problem. The hard part isn't knocking on something; it's knowing what to knock on. If we know that a specific slow DB query is taking down the database, that's a tractable problem. "Our database just stops responding even though the load on it looks 'normal'" is much less so.
You don't have a performance problem, you have a visibility problem
I use and love New Relic for this. It's been a go-to tool for me at my last couple of positions to, say, quickly find that slow database query. Pushing some code, seeing a new slow query pop up, and having New Relic surface it in a nice, pretty UI is wonderful. It saves so much time. Sure, I KNOW that I can enable slow query logs, take those slow queries into my database, and run an EXPLAIN on them to figure out why there's a problem. And really, in an ideal world where I had a DBA managing my databases, I wouldn't even need that. But on the teams I've been on, in the 3-6 person range, having a tool like New Relic that can surface these quickly, easily, and obviously has been invaluable to getting things on the right track.
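If you do want to go the manual route, a minimal sketch for MySQL looks something like this (the threshold and log path are illustrative, not recommendations):

```ini
# my.cnf — turn on the slow query log (MySQL)
[mysqld]
slow_query_log      = 1
slow_query_log_file = /var/log/mysql/slow.log
long_query_time     = 1   # log anything slower than 1 second
```

Then you take the logged queries and run `EXPLAIN` on them to see what the planner is actually doing.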
I shudder to remember the low-visibility times I've had with some application systems, when the first, and generally only, signal that something was wrong was customers calling to report that they were having problems.
Can you imagine? I remember it. It was awful. So get yourself some visibility into your performance. There are a number of tools for this with different tradeoffs, but New Relic has been a go-to for me.
You don't have a code quality problem, you have a visibility problem
For this, my go-to tool is Code Climate. I've written before about how much I love Code Climate, so I won't go into detail here, except to say that what I've come to appreciate more and more is how helpful it is to have it hooked up to your group chat. I've had instances where I've been going back and forth in code review (you are doing code review of pull requests, right?) and, though the code still isn't wonderful looking, I don't have the heart to send it back for another round of refactoring.
But then merging that, and having Code Climate announce to your team's chat room that you just merged subpar code, is such a great kick in the pants to keep at it.
You don't have an error rate problem, you have a visibility problem
I recognize that "error rate" is probably not the best name for this, but I'm having a hard time finding a better word for that aspect of quality. You know, that aspect where your code doesn't fall over and die on a significant number of requests.
When I started at my current position, the team had been so "beaten" by the high error rates that most of the engineering team had just turned off all notifications from Airbrake. "There are too many errors! It's flooding my inbox, I can't use my email with them jamming it!" On the flip side, of course, though the errors flooding your inbox are making it hard to use your email software, they are making it even harder for your customers to use YOUR software.
We were getting thousands of errors a day, and it was bad. How could we be motivated to fix this?
I was tempted to just turn the volume up to 11. I am a huge fan of making things that should be fixed VERY PAINFUL. For example, if we sometimes need to create or change a certain type of user role but can't decide on a good UI for it because we don't understand the frequency and need, then rather than build a crappy UI, I'm very comfortable leaving engineers to change that record in a console until someone comes up with a good solution. Otherwise, you'll end up with a crappy UI in use for years, and generations of developers and support people cursing it. At least that's what I envision.
But then I found a tool called Raygun. It's exception tracking with sane notification. It sends us awesome emails like, "Hey, I just saw this error for the first time. You may care, who knows." (I'm paraphrasing liberally.) Then, if it sees it a few more times in the next few minutes, it sends a message like, "Hey, this error started 5 minutes ago, it's happening X times per minute. You may want to look into it." It sends a reminder 30 minutes later, an hour later, and then on a less regular basis until you fix it. Brilliant.
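That escalating cadence is easier to appreciate once you sketch it out. Here's a toy Ruby version of the idea — the class name, thresholds, and message wording are all invented for illustration, not Raygun's actual behavior:

```ruby
# Toy sketch of escalating error notification (invented, not Raygun's API):
# notify on first sighting, escalate if the error recurs quickly,
# then remind at widening intervals until someone fixes it.
class ErrorNotifier
  REMINDER_INTERVALS = [30 * 60, 60 * 60, 4 * 60 * 60] # seconds

  # The injectable clock makes the cadence testable without sleeping.
  def initialize(clock: -> { Time.now })
    @clock = clock
    @seen = {} # fingerprint => { count:, first_seen:, reminders_sent: }
  end

  # Returns a notification string when one should be sent, else nil.
  def record(fingerprint)
    now = @clock.call
    entry = @seen[fingerprint]
    if entry.nil?
      @seen[fingerprint] = { count: 1, first_seen: now, reminders_sent: 0 }
      return "First sighting of #{fingerprint} — you may care."
    end

    entry[:count] += 1
    elapsed = now - entry[:first_seen]

    # Escalate once the error repeats shortly after first being seen.
    if entry[:count] == 5 && elapsed < 5 * 60
      return "#{fingerprint} is recurring (#{entry[:count]} times in #{elapsed.to_i}s)."
    end

    # Widening reminders until it's fixed.
    interval = REMINDER_INTERVALS[entry[:reminders_sent]]
    if interval && elapsed >= interval
      entry[:reminders_sent] += 1
      return "Reminder: #{fingerprint} is still happening."
    end

    nil
  end
end
```

The point of the sketch is the shape of the policy: one loud ping up front, one escalation, and then decreasingly frequent nags, instead of one email per exception flooding your inbox.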
So with this, we turned all the error notifications on and kicked off a #RaygunZero initiative. We turned the errors into Trello cards as they came in, and the whole team worked for a couple weeks on getting that error rate as low as possible. It was a huge success. We went from thousands of errors a day to, often, single digits. We serve several million requests a day, and a big majority of our errors are due to spammers/scrapers/weird bots trying to access odd URLs. So we haven't had a zero day yet, but we've gotten into the single digits a few times, which is awesome.
It gives us amazing visibility into errors as soon as they hit production, so we can squash them right away. Pumped.
Raising the visibility of these issues is always easier than I think it will be, and more worthwhile
So yes, your code quality is important, and it always surprises me how easily I can fall into a pattern in my work where I'm not really seeing everything I need to see.
Technical Debt vs Sales Debt
And don't forget - you have a team of engineers trying to build features without accumulating too much technical debt, while at the same time your sales team is out trying to make deals. Want to know a secret? They're probably promising features and roadmaps that would make you gag. Even when they're trying to be on their best behavior, they're over-promising - just like you're probably being too optimistic in your estimates. The way to get out of your sales debt, though, is to minimize and prevent technical debt. Your code quality is your best way to stay on top there - it's a battle of inches.