The Bike Shed: 321: Leaving Breadcrumbs

About this Episode

Steph tells a cute story about escape artist huskies, and on a technical note, shares a journey in regards to class variables and modules inheritance.

Chris talks about how he's starting to pursue analytics and one of the things that he's struggling with that he's always historically struggled with is the idea of historical data. He's also noticed a lack of formalization of certain things and is working with his team to remedy that.

This episode is brought to you by ScoutAPM. Give Scout a try for free today and Scout will donate $5 to the open source project of your choice when you deploy.

Become a Sponsor of The Bike Shed!

Transcript:

CHRIS: Hello and welcome to another episode of The Bike Shed, a weekly podcast from your friends at thoughtbot about developing great software. I'm Chris Toomey.

STEPH: And I'm Steph Viccari.

CHRIS: And together, we're here to share a bit of what we've learned along the way. So, Steph, it's an entirely new year. What is new in your new year?

STEPH: Well, the year is off to an interesting start because we helped rescue a husky.

CHRIS: Rescue as in now this is your dog or rescue as in the dog was trapped in a well, and another dog told you about the dog being trapped in a well, and then you helped the trapped? [laughs] Which of those situations are we working with?

STEPH: [laughs] I'm really wishing it was the second version [laughs] where there's a dog that tells me about another dog trapped in a well. No, this is a third version where there was a husky that was wandering around the gym that we go to. And so Tim, my husband, called and said that "There's this husky, and he's super sweet, but he seems very lost." And our gym is located near a major road, and so we were worried that he was going to wander about and get hit.

So I hopped into our car and took a crate and a leash, and he hopped right in. Clearly, he belonged to somebody; he'd just escaped. So he hops right in, and then we bring him home. But I put him in the backyard because I want to keep him separate from our dog, Utah, just because I don't know this dog, and I want to keep him safe. And I go back inside to grab a few things. I come back out, and the husky is gone. And I'm like, well, shit. [laughs] Now I'm starting to understand why this husky is missing or why this husky seemed lost.

So then I started looking for the husky, and Tim comes home. He's helping me look for the husky. And it was one of those awful moments where we live near...it's not a major road, but people tend to speed on it. And the husky and I happen to see each other across the road. And so the husky was like, oh, human friend and starts coming across the road towards me. And there's this large SUV that's also coming from the other direction. I'm like, oh, this is it. This is my nightmare. This is becoming real. This dog is about to get hit.

Thankfully, the driver saw the husky and stopped in time, so everything was fine. And the husky just finished trotting across the road to me, brought him in, kept him in the kennel in the garage. We didn't have any backyard adventures after that. The husky then thanked us by howling most of the night. [laughs] So this poor husky has had an adventure. We've had an adventure.

And then, around 4:30 in the morning, I go out because I'm checking on the husky and going to let him out. And I'm scrolling on the app called Nextdoor. And I see that someone posted a picture of this exact husky that's like, "Please help me find my dog." And I was like, yes. Because we were going to have to take him to a county shelter or at least go see if he had a chip so then we could return him. But thankfully, we found the owner. I found out the husky's name is Sebastian. And then we had him for a few more hours, and then we had a wonderful husky and human reunion.

CHRIS: That story had everything. It had ups; it had downs; it had huskies. It had escape artist huskies, in fact. I have...this is only through Reddit because that's how people learn about things in the world, but huskies are a rather vocal dog breed. So when you say the dog was howling, huskies have a particular way of almost singing, and it kind of sounds like yelling rather than more traditional dog sounds. Was that the experience you had?

STEPH: Luckily, it wasn't too bad. His howling was more just; he didn't want to be in the crate. He seems like an indoor dog. So he's like, what am I doing outside in the garage? I should be indoors. And so he wasn't too loud. It was more he was just bemoaning his situation.

But our dog Utah could hear him upset in the garage. And so that was also getting Utah upset because he didn't understand why there was a dog so close. And that was what led to the sleepless night because we couldn't get both of them to calm down. Because then, as soon as one of them calm down, the other one would get the other one riled.

CHRIS: As it so often happens.

STEPH: I'm so grateful that it turned out to be a happy story, though. That part was wonderful. And if we see the husky again, now we know his name is Sebastian and that he'll just come home with us. [chuckles] And we'll know how to return him since he seems to be an escape artist.

CHRIS: And we were best friends forever.

STEPH: On a more technical note, I have quite the journey to share in regards to class variables and modules inheritance. But before I dive in, I'm curious, what's new in your world?

CHRIS: Oh. Well, I'm excited to dig into that story. But I've got two smaller things in my world this week that are top of mind. I don't really have answers on them. I have more questions. One is we're starting to pursue analytics. We want to try and understand our system a little bit better. What is the experience of our users? How are they coming into the system? What are they doing? How long does it take them to do the things that we want them to do? All those sorts of questions you want to be able to answer about your application.

And one of the things that I'm struggling with that I've always historically struggled with is the idea of historical data. So data changes over time, and often we actually want to know about those transition points. We want to know about the different states that a user or any record in the system has been in. And I'm finding myself feeling the same pain that I felt many times and starting to think again about the relevant options out there in the world.

To give a slightly more pointed example of what we're dealing with, users come in, and then there are a few steps for them to actually sign up for the application. And so their user record or their application, if you will, will go through a couple of different states. So they can be basically approved directly, and now they're an active user of the system, that's one option.

But they can also end up in a state where they're pending review. And then eventually, depending on the outcome of that review, whether it's manual or someone intervenes or what have you, then eventually they can transition to either being denied or being accepted. And then they'll again be an active user. And so there's a question now of how many of the users that end up in that pending state end up transitioning into active.

And as I looked at the database, I was like, I do not have this information right now. I know their current state. And the logs could tell me all of this. We don't have proper log archiving right now. And I also don't have a system for, like, let me pull down gigabytes of logs and try and sift through that to understand the answer, especially for something domain level like this.

But this is one specific example that represents a category of things in my mind. The stuff that I've looked at in this space otherwise is Event Sourcing. So the idea that rather than having a discrete representation of the state of your application, you store every event as an individual log, essentially of like user did X, thing happened, Y occurred. And then, at any given point, you need to know about the state of your system; you just reduce all of those events through some magical reducer that produces the current state.

I also very recently read an article called Event sourcing is Hard. So I have that in my head as a counterpoint. This seems like a thing that is non-trivial to do, makes sense for a certain scale. But of course, like anything else, it has its trade-offs.

Another thing that I've looked at and never really pursued mostly because it's in a different ecosystem, is Datomic, D-A-T-O-M-I-C, which I think I've mentioned before. But it's a database that actually stores data in this historical format. And so you can ask for the current value, but then you can also ask for what are all the states that this user has been in? And what are the timestamps of those changes?

One small thing that we do have that I really like...so this is one example of us; I think leaning into wanting to have more information, higher fidelity information, is often we want to know something like was this ticket paid? Did someone pay for this ticket? And so paid is a BooleanProperty on this ticket record within our system. So the ticket can be held for a little while and eventually gets paid. And now, yes, it has been paid for. It is good. You can use it. But often, we want to know not just that it's paid but when it was paid.

And so there's a gem that we are using on the project called time_for_a_boolean by former thoughtboter, Caleb Hearth. And it does a wonderful job of basically instead of storing a Boolean value in the database, you store a timestamp. But then the Boolean can be inferred. If there is a value, if there's a timestamp for that record in the database, then there are a bunch of helper methods that get introduced that say, like, paid? That's now a method that I can ask, and it will tell us that. But we can also find the paid_at, paid_at value.

And so we have this higher fidelity data when we need it, but we can also collapse it down to the simpler representation. Because most often, all we need to know is, have they paid for it? Cool, then they're good. They can come into the concert, that sort of thing. But yeah, this is a broader question that I don't have a great answer to. I think Postgres and Rails and just the nature of how we approach these applications pushes us in a certain direction.

Another thing I'm exploring is downstream analytic systems. What if I send a bunch of events to them, and they act as a half-event sourcing type thing? But yeah, this is going to be, I think, an open question for me for a while.

STEPH: Yeah, you said a lot of really good options. When you're talking about in our ecosystem, we get pushed in one direction or the other that makes me think of the projects that I've been on. Typically, what they'll reach for first is something like a Papertrail. So then, that way, they can check for the historical versions of an object and how it was changed and see who changed it. That's one way to track the logs. I like the idea that if you can outsource it and send all of those events to a logging system and then essentially ask for that data back as you need it.

You made me think of a recent project as well where we needed to track the state. So it was a patient matching system. And we really needed to know when a patient match was created or disconnected and then who did that and perhaps for what reason. And to ensure that we had as much information as possible, we took that opportunity to just create a record for it.

So we had a patient match record or...I forgot the name of the other one where we created where a patient did not have a match. But we were creating a record every time someone did that. Granted, probably that’s not going to happen nearly as often as someone paying for an event or the situations that you're describing.

This was ideally infrequently that someone was going to unmatch a patient because it meant that our system had matched people that shouldn't be matched, and then a human had intervened. But yeah, it's interesting the space that you're in. And you listed all the good things that I would have thought of.

CHRIS: I think you listed Papertrail, which is one that I hadn't actually thought of yet for this particular instance. This only came up earlier today also. So this is new in my head that I'm really being pushed in this direction. But I think Papertrail could be a good solution for where we're at. But it is one of those where you often don't know the thing you want to know.

And I'm terrified of losing data of like; I had the data. I knew it at one point in time, but now I can't reknow it in the future because I didn't write it down. That's one of the things that I just don't want to happen in the world. And so finding those ways of like, how can we architect a system so that we can do the normal, straightforward, boring things most of the time but then when we need to expand out the analytics dimension of the system that we're working on...and trying to thread that needle and find the ideal optimization on both sides is a tricky one.

But yeah, I'll definitely take another look at Papertrail and see if that...at a minimum, I think that's a good solution for where we're at now. And then this is going to be a thought that's going to roll around in the back of my head for a while. So if I come up with anything else, perhaps a grander solution, I'll certainly bring that back to The Bike Shed. But yeah, what else is up in your world? I want to hear the story of the class variables.

STEPH: Well, it is quite a journey. So I hope you're ready. Specifically, I was pairing with Joël, who was working on fixing a test that had been marked as being skipped for a while. We weren't really sure why. We figured maybe because it's flaky. But then, as Joël had restored that test, he realized it was actually failing consistently.

So it was a test that was failing for a reason folks maybe didn't understand, but they decided to cancel or to skip that test. But they didn't actually want to get rid of it because it seemed like a pretty important test based on the description. So Joël saw it and got excited because it seemed very relevant to some of the work he was already doing. So then, he is now investigating why this test is failing consistently.

So in this story, we have four main characters: we have a class, two modules, and a class variable. So enter the class stage left. All right, so this class defines a class variable which I have to say is not something I work with very much in Ruby. So class variables kind of felt a bit novel and diving back into like, oh yeah, these are a thing.

So the class defines a class variable that's called cache and assigns this variable to an instance of a cache. So then this class includes two modules who we'll call Module A and Module B. And we'll enter them stage right. And both of these models look to see if cache is already set. And if it's not, they also set the cache class variable.

So with that information, in our test, we don't want to exercise the real cache just because then if other tests are reading from that cache, which is proving to be a source of flakiness for these tests, then they are overriding each other's expectations, and it's causing some of the tests to flake.

So instead, we want to use a fake cache, just like an in-memory cache. So the test and its setup is already overriding. It's setting that class variable to say, hey, I want you to be a fake cache, just be in-memory. However, while executing that test, one of the modules is checking to see if that cache is set, which is being set in our test setup.

So test setup sets the value. We're running the test but then in the module, the model checks to see if it's set, and it’s suddenly nil instead of using the cache that we had set. So now it's defaulting back to say, "Oh, it's unset. So let me go back and set it to the real cache," which is exactly what we're trying to avoid.

So then the question became, if we're setting the class variable in our class, why is it being populated in one of the modules but it's not being populated in the other module? So one of them has it set to the in-memory cache, but the other one does not.

So I'm going to gloss over some of the details because this stuff is pretty tangling. But essentially, when the test is running, and it's loading the class, and we are overriding that class variable, it's getting shared with one of the modules because as soon as one of the models does set that class variable, there's a bidirectional link that gets set between the parent class which is the module in this case, and the class itself.

And as soon as that module sets the class variable, then they're going to talk to each other, and they're going to reference the same value. However, this only seems to happen for one of the parents. You can't do this for both. So if you have two parents that are trying to share a class variable with the same class, that doesn't work. So that's a particular bug that we were running into.

I do have some good news because if anybody is very nervous about the situation that I'm describing, I feel you. The good news is that in Ruby 3, they actually warn when this is happening and have introduced an error. So you don't have this inheritance confusion that can come out of the fact that these parent classes are also trying to share a class variable with this child class.

So in Ruby 3, if you are writing a class variable in that class but then you try to overwrite that class variable in the parent of that class or by the module that's being included, then an error is going to be raised. So it's going to warn you if you're creating this bidirectional link between those two class variables and that you shouldn't be overriding the child's ownership of that class variable.

Instead, if you're going to use class variables, which, one, is not my cup of tea, but if you're going to use class variables, it should be defined in the parent class, and then it can be shared downstream in the inheritance versus trying to go upstream and then having your ancestors essentially override some of those class variables.

So all of that is to say we were on a very interesting journey of understanding how class variables work, how the inheritance works, how that bidirectional link is getting established, and then how Ruby 3 comes in to warn us if something funky is happening.

CHRIS: Oh, that is interesting. And I'm now going to catalog that as a piece of information that my brain will retain for roughly the amount of time that we are recording this podcast and then immediately forget.

STEPH: As you should. [laughs]

CHRIS: It's one of the reasons that I try to avoid inheritance. And I try to avoid class variables as much as possible because of this category of problem, a very subtle bug that you have to try and really hone in. And you have to be very smart to debug this sort of thing. I don't want to be that smart. I want to code in a way that I can be less smart on any given Thursday. That's my goal in life.

I will ask one other question, though. So there's just a cache that this class and pair of modules are hanging around with, and then you want to swap it out for in-memory. This sounds remarkably like the Rails cache. Is this cache distinct special? Could it not just be backed by rails.cache, THE cache within the rails context, which can be backed by Memcached, or Redis, or in-memory when you're in tests, or the NullStore, which I think is the default in development is probably how that goes?

Is there a particular reason? Is this a special cache? Is there additional behavior that this cache has beyond the normal thing? Or is it just like, at some point, someone's like, oh, I need a cache. I'm just going to use a class variable, that'll be easy, which it definitely is, but then you run into complexities.

And caches are one of those hard things to get right. So it's one where I would immediately be like, whoa, whoa, I would love to not make up our own cache here. So I'm wondering, is there a distinct reason, or is it just this happened, and here we are?

STEPH: So I think we are using a custom cache that we are pointing to. So it is another service. It's not a Rails cache or an abstraction that we can point to and use. It is a different cache that we are using. And I'm trying to think back to the exact code. But there is a method that essentially checks to say, hey, should I use the real cache? Should I use the in-memory cache?

And that is something that we've explored to find a way to make this more global for the test suite because we really want to control this for all the tests. Because it's very easy to not realize in the test that you should avoid using that shared global cache. And so that way, the tests don't interact with each other but instead always use an individualized cache for each test to make sure that it is self-sufficient and independent. But we haven't gotten that far yet in figuring out how we can take a more global approach with this.

CHRIS: Gotcha. So I don't know the details. I assume there are reasons here. But just to play this out, if we find ourselves saying we have a reason to have a distinct cache, to have a special cache over here, but it's a cache...and caches fundamentally, that word always will raise my attention. It will be like, okay, this is a place that bugs will come and aggregate. And we need a distinct one that has special behavior as an external service, or that is just something like in...

There's a wonderful blog post that Mike Burns wrote at one point that was about...I think it was something like things that will make me look at your pull request in more detail. And I really loved it because it did capsulate all of these like, yeah, there are good reasons to do everything on this list. But if you do any of them, I will look at your pull requests and be like, oh, that's interesting. Why are we doing that, though? Do we have to do that? Are you sure? Are you triple sure we have to do that?

And this is definitely one of those things where caches automatically catch my attention. Even if we're using the built-in cache, I'm like, do we need to? Is that a definite thing? And then all the more so when we're using a custom bespoke one. Again, I assume that there are reasons that there's something special that's going on here. Perhaps the caching behavior is distinct from just it's Redis, and we throw data. And if it falls out the backside, that's fine. Maybe you need entirely different behavior here. But it is something that I would poke at a bunch.

STEPH: Yeah, you're asking a lot of good questions. I will have to go back and look at some of the code because we spent enough time in Ruby specifics that I didn't pay as much attention to the cache. Because right now, as we are working on these tests, we're trying to fix just the test without changing the application code, one, because that feels like a safer space. And if the test is flaky, we're just trying to change the test first.

But some of these tests we're starting to realize I'm not sure we can fix the test without also changing some of the application code, or the way that we do have to fix the test is really an incentive to back up and say maybe now's the time that we look at some of the application code. Because another question that comes to mind is why use a class variable, and does this need to be shared by the class and the modules?

And there's a part of me that suspects that maybe some of this logic was extracted to a module, but then it wasn't cleaned up in the other places. And so that's why we still have a reference. And it's essentially then being shared and set and unset and reset in those different places. So I think you ask some good questions, and I have some more questions of my own when we have time to revisit that portion of the test and application.

As another example of some of the tests that I've been working on, one of the tests that I...because we have a list, we can usually tell some of the tests that are flaky. So one of the ones that I was investigating was a similar issue where there was a shared resource, and someone had tried to mock it out. So they had taken the time to say, hey, I don't actually want to use that real resource that's over there; instead, I want to just return the scanned value.

But instead, they'd accidentally stubbed out a class-level method instead of the instance-level method. And so it was running, but it wasn't actually stubbing anything else since that's the method that's not getting called. So that was just an oversight for that test. So I fixed that test. But I noticed that we were using allow any instance of, so then I did take the time to go through that file and change and move away from the use of allow any instance of.

And for folks that are less familiar with allow any instance of, RSpec has some really great docs that talk about how it's very helpful for dealing with legacy code. But essentially, it is a code smell that you're using; allow any instance of because you are saying that my test is or my code is so complex that I can't really mock out the specific instances that I want to and then return specific behavior. So instead, I'm having to use this more global approach to say, hey, any instance of this method, I want you to mock it out versus this very specific instance that I know that I'm working with.

But we can include a link in the show notes because there's a nice write-up that talks about some of the reasons that allow any instance of is not recommended. So that's been kind of fun. There's been a little bit of joy to get to refactor away from that and actually stub out a specific instance.

Part of the work, too, that I'm noticing as Joël and I are going through these tests is leaving breadcrumbs for other developers as well because they have a very large team. And they're very junior friendly, which is just incredible. I love that so much about this company. And because they do hire a lot of juniors, then it is a tough codebase. It's a fairly old codebase.

So as these juniors are coming in, they're seeing a lot of these patterns. And they're propagating these old patterns that aren't necessarily the best patterns to propagate. But they're doing their best, and then they are reusing what they're seeing. So part of the work as we are revising these tests, my hope is that people will see some of these newer patterns and use those instead of following some of the older patterns.

CHRIS: I can only imagine that you're writing borderline novels in your pull request descriptions and commit messages there. I do wonder, is there an index of those that you're collecting? So there's like, here's the test remediation examples list, and you're slowly adding to them. This was a weird one with a class variable. And this was a weird one that had flakiness due to waiting or asynchronous behavior. And gathering examples of those, but specifically from the codebase.

I could see that being a really useful artifact because I happily traverse through git blame all the time. But I don't know that that's always a thing. And frankly, I have to work for it sometimes. So if there is that list of here are pull requests that specifically did X, Y, and Z, I think that could be super useful.

STEPH: Yeah, that's a great idea. And yes, they have some shared team documentation that speaks to specifically flaky tests because they're aware that this is a problem. They are working together to address this. And they have documentation that states ways to avoid flaky tests. If you encounter a flaky test, here are some of the ways that you can triage to find out what's wrong.

So as Joël and I have been finding good examples, then we've been contributing to that document. And they also have team meetings. So our plan is to attend some of those meetings and be like, "Hey, this is just some of the stuff that we've seen this week, some of the things that we improved and changed," and share the progress that we're making.

Since everyone is aware that there are these developers that are working hard to improve the test suite, but then share that information with the rest of the team so they too can feel...one, they can just see the changes that are taking place. But they too can also benefit and apply those strategies themselves when they see a flaky test.

Oh, but you did just remind me of a thing. So one of the tests that I was going through...I'm very intentionally going through and making the smallest change possible. So I will do the gross, ugly fix whatever it is to get something to pass, and then I will commit it. And then I'll think about okay, well, how can I make this better? So essentially, I have the fix, whether it's pretty or not. And then, after that, I start to have other commits that make it prettier.

And so, I had a pull request that had four commits that told the story that I was very happy about and progressed along in a more positive direction. And I issued that, and I discovered that Gerrit, when it sees four commits, it split all of them into their own change request.

And so, instead of having what I thought would be this nice story, now got split across these four change requests. And I thought, well, that's less helpful. So I ended up squashing two of them, but I still kept three of them because they stood alone, and each told a story. But that's something that I've learned about Gerrit.

CHRIS: Always so interesting how our tools shape our work.

STEPH: And it made me think back to the listener who asked the question about ensuring that CI runs for each commit. Well, here you go, Gerrit. [chuckles] Gerrit does it for you. It ensures that every commit gets split into its own change request.

CHRIS: I mean, as you said earlier, not my cup of tea but... [laughs]

STEPH: Yeah, I'm still lukewarm. I'm still discovering Gerrit and how we get along.

Mid-roll Ad

And now a quick break to hear from today's sponsor, Scout APM.

Scout APM is leading-edge application performance monitoring that's designed to help Rails developers quickly find and fix performance issues without having to deal with the headache or overhead of enterprise platform feature bloat. With a developer-centric UI and tracing logic that ties bottlenecks to source code, you can quickly pinpoint and resolve those performance abnormalities like N+1 queries, slow database queries, memory bloat, and much more.

Scout's real-time alerting and weekly digest emails let you rest easy knowing Scout's on watch and resolving performance issues before your customers ever see them. Scout has also launched its new error monitoring feature add-on for Python applications. Now you can connect your error reporting and application monitoring data on one platform.

See for yourself why developers call Scout their best friend and try our error monitoring and APM free for 14 days; no credit card needed. And as an added-on bonus for Bike Shed listeners, Scout will donate $5 to the open-source project of your choice when you deploy. Learn more at scoutapm.com/bikeshed. That's scoutapm.com/bikeshed.

What else is going on in your world?

CHRIS: In my world, we keep adding new users to the system. We keep doing more stuff. These are all wonderful things, the direction you certainly want to be heading. But as we're doing that, I've recognized that we had a lack of process and a lack of formalization of certain things.

And a lot of the noise of the work was just coming to me because I was the person that everybody knew. I can ask a question; Chris will know the answer, et cetera. And then there were things that we needed to keep an eye on. But because it was everyone's job, it was no one's job. So we've introduced the idea of a point person on the engineering team. So this is a role that will rotate each week. I think you and I have worked on a handful of projects that had something similar to this.

There was a team that we worked with that had an ad hoc list, which were just little tasks that needed to be done by developers. So there was one person who would run with that. I've heard it called captain before, the sprint captain. We're not really doing sprints. So for various reasons, that title didn't work for me. But point person is what I went with here.

And so the idea is rather than having product management or anyone else in the organization just individually reaching out to developers, we want to try and choke that off, have a single point of communication. And so just today, I introduced into Slack, a group, but it's a group of one person. So @pointdev is technically the handle for this person. It’s a group in Slack. And each week, we'll rotate who the members of that team are. And technically, you could add multiple, but the idea is this is just one person. So we'll rotate the person.

And what ends up happening is if anyone...say the product manager says, "@pointdev, what's the status on..." blah, blah, blah, that will notify the person who is the point person that week. So that's a nice feature in Slack so that we can condense it down and say rather than asking individuals, ask this alias. We're introducing one layer of abstraction in our communication tools, much like we do in our software.

So I'm drafting now the list of like, here's all the stuff that I think this person...because we're trying to push all of the quote, unquote, "other work" the non-product feature development work into this person's purview for a given week. So it's monitor Sentry for any new errors as they come up, triage them, and figure out what we want to do.

Ideally, and this is perhaps aspirational, I would like to keep inbox zero in Sentry. I know how you feel about that more generally and perhaps even more specifically within the world of errors, but that's my dream. We're going to see how it goes.

STEPH: I don't know if people know I am the opposite of inbox zero. This is the life that I'm living.

CHRIS: What about with errors, though? What about something like Sentry?

STEPH: I want to say that I would be a better human with my email. But I'm going, to be honest [laughs] and say that I would probably have the same approach where I am not an inbox zero person. I've come to terms with it. I used to really strive and think I needed to change. But I have reached a point of comfort with this is who I am. There are many like us, so shout out to all y'all.

CHRIS: Oh yeah, by far the more common approach, I think. So specifically with the errors, I struggle a bit with it because what ends up happening is we are implicitly ignoring the errors. And if we're doing that, I would rather just sit around and have a conversation and be like, let's just explicitly ignore them. There's a button in the UI. We can ignore them.

If this is not a real error, we can add it to the list of things that we do not report on. We can ignore that error. We can ignore it for a week and add a card to Trello that has a due date that says, "Hey, we got to work on this." But let's take that implicit indifference to that particular error mode of our application and make it explicit.

Let's draw that line in the sand such that when I see a new error pop up, I'm like, oh, that seems like something I should do something about. I really want high signal-to-noise when I'm seeing errors coming. And so I'm willing to work for that. But it is a trade-off, and it does take effort.

And it's noisy, especially browser extensions, and whatnot, just fighting the page. Facebook showed up one day. I don't know how Facebook got in there. Someone was browsing our website from within Facebook's browser, which I didn't know was a thing, but they had their own thing. And it fires a bunch of events, and Sentry was just like, let me slurp all of those up. Those seem fun. That was noisy. So we had to turn those off, but we explicitly turned them off.

STEPH: I do like the approach that you're taking where it's one person, and then it's a rotating shift because I think that makes it more reasonable for someone's who's like, hey, this is going to be noisy for a week. And then you're going to look through these emails and check all these errors, and then either silence them because you don't think that they're interesting or mute them for now. Or if you're going to convert it into a ticket, set a due date, whatever the triage approach is going to be.

But that feels more achievable versus inbox zero for life is just exhausting. But I feel like if you're doing it rotating week by week, that seems like a nice approach and also easier to keep it at inbox zero because that way, you are keeping up with all the errors. Because I agree; otherwise, what's the point of tracking all the errors if you're just going to ignore them?

CHRIS: Yeah, definitely the rotating, I think, is critical. I think the other thing that's been critical specifically on the error front is we've had now a handful of meetings where we triage the backlog together, the backlog of errors. So like, what all is coming into Sentry? What's going on? And we go through the process of determining is this a real thing? Should we fix this? Should we ignore it?

And we do that together so that it becomes not just one person's intuition about whether or not this is important or not or what the source of it might be but a shared intuition such that now any one of us, when it's our week, can ideally represent the team in that way and be like, never mind, never tell us about this again because it's very easy to silence things in Sentry that you would actually like to know about when they become real. But right now, we have this edge case that is an ignorable version. So trying to get there that's been fun.

But yeah, once again, Sentry, that's one of the things on this person's list. There are ad hoc support tickets for our operations team. So anything that needs to happen on a user's behalf that currently needs a developer to console, let's funnel all of those to this one individual, respond to any new questions. So this is where that Slack handle will be useful.

Check for any stuck jobs in Sidekiq. So is there anything that's been retrying for a while? Because it probably shouldn't. Maybe one or two retries is cool, but past that, something has gone wrong. And we should either get in there and fix it or just kill that job because it's never going to succeed, which is quite often the case but go in there and keep an eye on those and then look for anything.

We're starting to use due dates within Trello, which is currently our project management system. We'll see. Someday we're definitely going to grow out of that. But for now, it's good enough and checking for anything that's overdue or coming up in the next week in terms of due dates and just making sure that we're being responsive to that.

And so, I really like the idea of having this be a named set of things and a singular focus for one individual. Because again, that idea of like, if it's everybody's job, it's nobody's job. Or if it's nobody's job, then it's my job, and I don't want it to exclusively be my job. [chuckles] So I'm trying to make it not exclusively my job and to share the knowledge about it and make sure that these are skills that we all have and ideas and et cetera. But also, I would be fine to answer fewer questions in Slack each day.

STEPH: I have to admit, as soon as you were telling me that you had established this role, I was quietly congratulating you on helping delegate some of these responsibilities to the team. Because like you said, you are then the person that takes on all these tasks.

CHRIS: There's a laziness to that. Like, it's easy for me to just answer the questions. It's harder for me to put up a wall and say, "No, no, we have a process for this." And quite possibly, what's going to happen behind the scenes is that questions are going to come in to whoever is this point person. They're not going to know the answer. They're going to reach out to me, and then that conversation is still going to happen. But even by doing that now, now that person will see that answer, will understand the thinking or the background, the context that I have.

And so it's that weird thing of like, it would be so much easier for me to just answer one question. But to answer all the questions, well, I can't do that. And so I'm working to try and do more of the delegation to try and hand things off when they're in a known state and to identify this sort of stuff so that the team broadly can be stronger and better able to support everyone else in the organization. So that's the dream. We'll see how it goes.

STEPH: Yeah, I love that approach. I'm also thinking how interesting this role is because I'm imagining a mix between someone who is like the front point person at like an ER. So like, things are coming in, and they're in a tragic state and need help and need to be diagnosed.

But at the same time, you mentioned they're going around. They're checking Sidekiq. They're looking at some email errors. So they're also that night shift guard that's walking around with a flashlight just poking in each room. So it seems like a very stressful and low-key role all at the same time, all mixed up into one week. That person probably needs a beer at the end of the week.

CHRIS: There is a version of the story in my head that is...I wouldn't say this feels like a failure mode, but I would rather this not have to exist at all. I would rather things to be calmly humming along and not require a dedicated person each week to deal with the noise. I don't think that's realistic, certainly not as early on as we are in our organization. But I do wonder, is this a crutch? Is this something that we should be paying more attention to?

And I know in teams that you and I have worked with in the past that has been a recognition of like, this is a crutch. But it's a costly crutch. Like, we're taking an entire...in our case, it's not requiring the entirety of a developer's week. They're able to do this pretty easily and then still get a bunch...like, 75% of their time is still feature work. But we're just choking down who's the person that will be responding to questions when they pop up so that fewer individuals are interrupted?

But I have seen organizations where this definitely filled an entire week and spilled out more than. And then there was the recognition of that and the addition of another person that comes along and tries to fix stuff along the way as opposed to just responding. And so I want to make sure this isn't a band-aid but is, in fact, a necessary layer that we then try and shore up, you know, we should have fewer errors. That feels true. Okay, cool. Let's fix the bugs in the app.

And these ad hoc things that an admin needs to have done can that be a button in the UI? Can they actually self-serve in those cases? And we're slowly moving towards those. Ideally, fewer jobs get stuck in Sidekiq. And so, my hope is that this isn't a job that gets harder and harder over time. It's a job that potentially, if we're being honest, probably stays about this hard. I don't think it's ever going to be just like, nope, nobody needs to do anything. The app just runs, and it's great. And it never has bugs.

But that is a question in my mind as I start to embrace this thing of like one person is dedicated for a week to this. And if right now it's only 25% of their time, okay, that's probably fine. But if suddenly it's 50% of their time or 75% or 100% of their time for that whole week, that becomes too high of a bar in my mind. And I want to keep a close eye on it and make sure it's not trending in that direction. And I will be one of the people on the rotation. So I'll get to be in the trenches.

STEPH: I appreciate all the thoughtfulness that you're putting into it. And I'm thinking back on a project where we had a similar rotation because we had an issue Slack channel. And so anytime there was an issue, then it would get posted in there. And before, it was going out to everyone, or there was one particular person that was always picking it up and then trying to delegate it to others as they needed to. But then we started a similar rotation.

And one of the key benefits that I found from that is it signaled to the team, hey, this person might get pulled away. They can pick another ticket or two, but we need to give them lower priority tickets because there's a chance that they're going to get pulled away to work on something else. And that's okay, and we're going to plan for it.

Versus without this role in mind, then you had people all taking on high priority tickets, but then someone had to be the one that's like, well, I'm going to punt on my high priority and feel stressed about the fact that I've got this other thing to deal with. But then, I didn't actually do the work that I planned for.

So I feel like you're helping introduce calmness into the week, even if it is a stressful role. But then there's the goal that this becomes less of a stressful role, and if you see it trending in the opposite direction, then that's something to investigate.

But I also feel like triage and communication is such an important part of being a developer that it also feels very relevant upskilling for the whole team to go through. So there's also that benefit of where this approach also empowers the rest of the team to also experiences, build empathy, look for additional fixes, and then also build these important skills.

Overall, I really applaud your thoughtfulness. And I think it's a really good idea. And it will be interesting to see which direction that this role trends if it gets easier or if it's getting harder over time.

CHRIS: Well, thanks. I appreciate that. And I'll certainly report back as we develop this but hopefully, it stays about where it is. That feels right. And I think I'll probably...that's one of those things that I will monitor. And if I feel it moving in the wrong direction, then step in and try and get it back to this space because this feels like a maintainable reasonable amount.

And we shouldn't be fixing every bug and adding every button to the UI. That's just actually not how it works, unfortunately, would love to. That's not true. You shouldn't have every button in the UI. That's so many buttons. But broadly, I hope we can maintain roughly this, and I think identifying it and laying it out now I'm feeling good about having that structure. So yeah, we'll see how it goes. Will report back. But again, thank you for the kind words.

With that tour of a bunch of different things, should we wrap up?

STEPH: Let's wrap up.

CHRIS: The show notes for this episode can be found at bikeshed.fm.

STEPH: This show is produced and edited by Mandy Moore.

CHRIS: If you enjoyed listening, one really easy way to support the show is to leave us a quick rating or even a review on iTunes as it really helps other people find the show.

STEPH: If you have any feedback for this or any of our other episodes, you can reach us at @_bikeshed or reach me on Twitter @SViccari.

CHRIS: And I'm @christoomey.

STEPH: Or you can reach us at hosts@bikeshed.fm via email.

CHRIS: Thanks so much for listening to The Bike Shed, and we'll see you next week.

All: Byeeeeeeeee!!!!!

Announcer: This podcast was brought to you by thoughtbot. thoughtbot is your expert design and development partner. Let's make your product and team a success.

Support The Bike Shed

Episode Sponsors

Scout
Scout APM is leading-edge application performance monitoring designed to help Rails developers quickly find and fix performance issues without having to deal with the headache or overhead of enterprise-platform feature bloat. With a developer-centric UI and tracing logic that ties bottlenecks to source code, you can quickly pinpoint and resolve performance abnormalities -- like N+1 queries, slow database queries, memory bloat, and more. Scout's real-time alerting and weekly digest emails let you rest easy knowing Scout's on watch and resolving performance issues before your customers ever see them. Scout has also launched its new error monitoring feature add-on for Python applications. Now you can connect your error reporting and application monitoring data on one platform. See for yourself why developers worldwide call Scout their best friend and try our error monitoring and APM free for 14-days, no credit card needed! And as an added bonus for Bikeshed listeners: Scout will donate $5 to the open-source project of your choice when you deploy. Learn more at scoutapm.com/bikeshed.

Episode Sponsors

Episode Link

Embeddable Audio Player

Download URL

Social Network Quick Links