Learning about Observability (O11y) with Indy

Introduction

Recently I came across a blog post arguing that nobody really wants logs; what they really want is event streams. This got me thinking about the work we do when we debug production problems in Indy, and how difficult and elusive that task can be. In this post, I’ll explore why production support is hard, what we’re missing, and how we can improve the situation with a perspective shift.

Then, I’m going to talk about how we can expand on this new approach to radically improve the experience for users.

The Inherent Complexity of Indy

Indy hides a lot of complexity for build systems. It manages retrieval of content from groups of repositories seamlessly, by looking through all member repositories of a group until it finds the desired content. These repositories could be hosted, in which case the content is either stored locally or simply absent. However, the repositories could also be remote, in which case Indy checks its local cache and makes requests upstream to the remote server to pull down the desired content. This can incur any number of problems, ranging from request timeouts to server errors, all of which have to be handled gracefully by Indy as it searches the group membership for some repository that can provide the requested content.
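
To make this concrete, here’s a minimal sketch of the kind of resolution loop described above. The class and method names are hypothetical - they don’t reflect Indy’s actual code - but they show the shape of the logic and why failures from individual members have to be absorbed:

```java
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of the group resolution loop described above.
// None of these types reflect Indy's real API; they only illustrate the shape of the logic.
class GroupResolver {

    interface Repository {
        // Returns the content if this repository can provide it; empty otherwise.
        // Remote repositories may throw on timeouts or upstream server errors.
        Optional<byte[]> retrieve(String path) throws RepositoryException;
    }

    static class RepositoryException extends Exception {
        RepositoryException(String message, Throwable cause) { super(message, cause); }
    }

    // Walk the group membership in order, returning the first member that can serve
    // the path. Upstream failures are swallowed so the search can continue.
    Optional<byte[]> resolve(List<Repository> members, String path) {
        for (Repository member : members) {
            try {
                Optional<byte[]> content = member.retrieve(path);
                if (content.isPresent()) {
                    return content;
                }
            } catch (RepositoryException e) {
                // A timeout or server error from one member shouldn't fail the whole
                // request; note it and keep searching the rest of the group.
            }
        }
        return Optional.empty();
    }
}
```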

In the case of metadata files, Indy actually requests this content from all member repositories in a group, parses whatever matching files it can find, aggregates them, and provides the result back to the user. This process incurs all the same potential problems present when serving simple content - but magnified, since the same content must be retrieved from every place it is available. It also carries heavy overhead, since the metadata content must be parsed in order to merge it. And even when the content is retrieved, it may still be invalid, causing parser errors. All of this must be managed gracefully by Indy so the user experience is not diminished.
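
And a similarly hypothetical sketch of the aggregation pattern: every member is consulted, and both retrieval and parse failures are tolerated per member so that a merged result can still be produced. Again, none of these types are Indy’s real API:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of metadata aggregation across all group members.
// Types and method names are illustrative only.
class MetadataAggregator {

    interface MetadataSource {
        // Returns raw maven-metadata.xml content, or null when the member has none.
        // May throw on timeouts, server errors, etc.
        String fetchMetadata(String path) throws Exception;
    }

    interface MetadataParser {
        // Parses one metadata document; throws if the content is invalid.
        List<String> parseVersions(String rawXml) throws Exception;
    }

    // Ask every member for the metadata, parse whatever comes back, and merge the results.
    // Per-member failures degrade the result but must not fail the whole aggregation.
    List<String> aggregateVersions(List<MetadataSource> members, MetadataParser parser, String path) {
        List<String> merged = new ArrayList<>();
        for (MetadataSource member : members) {
            try {
                String raw = member.fetchMetadata(path);
                if (raw != null) {
                    merged.addAll(parser.parseVersions(raw));
                }
            } catch (Exception e) {
                // Unreachable or unparseable metadata from one member is tolerated here.
            }
        }
        return merged;
    }
}
```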

All repository managers do these things. However, Indy also includes an integration point to Brew, searching for matching builds and automatically setting up remote repositories to proxy that content too. The search process itself incorporates many opportunities for failure or even just confusing results, since it employs certain rules to determine whether a matching build is valid for inclusion. The latency introduced by querying Brew can also lead to problems with the user request, particularly if there are a lot of matching builds in Brew (extending the query response time and increasing the amount of filtering work Indy has to perform).

To cope with all the areas where latency may be injected, Indy employs numerous caches. However, these caches also add complexity. If a cache misses too often, performance may be severely impacted, and the user may start to see request timeouts. If a timeout tracked in the scheduler cache is not persisted and allowed to fire at the appropriate time, stale results can be returned, as older aggregation output continues to be used.

Because of its position as a gateway or middleman, Indy’s ability to resolve potential errors (or performance problems) is imperfect. Some problems may arise from slow or malfunctioning upstream servers; some may come from bad build data in Brew. Indy operates in an inherently chaotic problem domain.

For all of that complexity - or rather, because of it - Indy provides immense value by managing and organizing all of this activity and presenting a nice, easy-to-use interface over it.

Context: The Production Support Dilemma

Indy is quite valuable. But all of this chaos under the hood can be very hard to support. When the user receives an error, tracking it back to where the problem took place can involve inspecting the activity from literally thousands of repository interactions. If a piece of aggregated metadata is wrong, we have to inspect every source that may have contributed to the aggregation, including Brew. Then, we have to consider things like content expiration timing, which may have allowed the metadata to become stale. It’s possible there is something about the Brew build - or the Brew tag in which it exists - that disqualified it from inclusion.

If performance is a problem, it can result in request timeouts or builds that take hours when they should only take fifteen minutes. Performance problems can come from improperly sized thread pools, Infinispan caches, or the overall JVM heap. Performance problems can come from malfunctioning or overloaded upstream servers, like Repository X. Or, they could come from a minor code change that resulted in a deadlock while promoting content between repositories.

It’s also important to note that a single wrong response can cause a build to fail. While most sites address performance or system error using statistical methods and “representative” requests, any single incorrect response from Indy could cause a critical build to fail. That build might take 2 hours to run, and make thousands of requests to Indy in the process. So, even an error rate of 0.1% (1 in 1000) could result in a build that fails repeatedly.
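
To make that concrete, here’s a quick back-of-the-envelope calculation - assuming, unrealistically, that bad responses are independent of each other - showing how quickly the odds stack up against a large build:

```java
// Back-of-the-envelope illustration of the 0.1% point above, assuming (unrealistically)
// that bad responses are independent of each other.
public class BuildFailureOdds {
    public static void main(String[] args) {
        double errorRate = 0.001;          // 1 in 1000 responses is wrong
        int[] requestCounts = {1000, 5000};

        for (int requests : requestCounts) {
            // P(build fails) = 1 - P(every single response is correct)
            double pFail = 1.0 - Math.pow(1.0 - errorRate, requests);
            System.out.printf("%d requests -> ~%.0f%% chance of at least one bad response%n",
                    requests, pFail * 100);
        }
    }
}
```

At 1,000 requests, a 0.1% error rate already gives roughly a 63% chance of at least one bad response; at 5,000 requests, it becomes nearly certain.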

To support Indy, we’re mostly reliant on the log files for information about what is going wrong, which repositories are being checked, and so on. This is a losing game: producing enough log content to cover the wide variety of potential problems quickly balloons to the point where log files roll over before the error can even be reported and investigated. Even if we can store the log volume necessary to debug individual requests, this again falls apart when the system is under high load. In that situation, the sheer volume of log messages hitting the file alongside the ones related to the one request we care about can be immense, making it quite challenging to isolate the messages we want.

Yes, having proper log aggregation and indexing would help us reassemble the story of what happened during a particular request. But even if we filter on a particular unique request ID, we still have to make sense of the sequence of events that took place to serve that content. Since it has to check so many content sources in order to serve user requests, Indy makes extensive use of threads, and multiple thread pools, to achieve concurrency wherever possible. It could never scale appropriately otherwise. This means the chronological story of how a particular user request is served can involve interleaved messages from multiple threads operating concurrently. If only one thread had a problem - which maybe resulted in metadata that couldn’t be parsed - we’re right back to the problem of filtering out the noise to get at the problem.
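
For illustration, here’s a rough sketch (not Indy’s actual code) of the request-ID approach using SLF4J’s MDC. It shows why the ID has to be copied onto every worker thread by hand - and even then, filtering by it only narrows the noise; it doesn’t reconstruct the story:

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

// Illustrative sketch (not Indy's actual code) of tagging log lines with a request ID
// via SLF4J's MDC. The MDC is thread-local, so the ID has to be copied onto every
// worker thread explicitly.
public class RequestIdLogging {
    private static final Logger log = LoggerFactory.getLogger(RequestIdLogging.class);
    private static final ExecutorService pool = Executors.newFixedThreadPool(4);

    public static void handleRequest(String path) {
        MDC.put("requestId", UUID.randomUUID().toString());
        try {
            log.info("Resolving {}", path);

            Map<String, String> mdcSnapshot = MDC.getCopyOfContextMap();
            pool.submit(() -> {
                // Without this, log lines from the worker thread carry no requestId at all.
                if (mdcSnapshot != null) {
                    MDC.setContextMap(mdcSnapshot);
                }
                try {
                    log.info("Checking one member repository for {}", path);
                } finally {
                    MDC.clear();
                }
            });
        } finally {
            MDC.clear();
        }
    }
}
```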

When performance problems happen, we have to analyze date stamps in log messages to get an idea of where the system is spending its time. Even if you do manage to track the latency down to a subsystem within Indy, this still doesn’t provide much insight into why that part of the system is running slowly. We address this by reporting metrics on the different parts of Indy, especially those parts that cache results or touch external systems. Performance problems usually affect multiple requests, which can be lucky in a perverse sense: they give us more data about which parts of the system are functioning normally and which aren’t. However, if the performance problem isn’t apparent from the metrics, you’re stuck right back with the problem of trying to make sense of the logs that were recorded during that period. Neither metrics nor log messages are good at storing the context in which the problem occurred.
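
For reference, the kind of timer metric this refers to looks something like the following - shown here with the Dropwizard Metrics library purely for illustration, with a made-up metric name. It captures a latency distribution per subsystem, but says nothing about the context of any single slow request:

```java
import java.util.concurrent.TimeUnit;

import com.codahale.metrics.ConsoleReporter;
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Timer;

// Illustration of a subsystem timer metric using the Dropwizard Metrics library.
// The metric name is made up; the point is that latency is captured as a distribution
// rather than as individual log timestamps.
public class SubsystemTimingExample {
    private static final MetricRegistry registry = new MetricRegistry();
    private static final Timer upstreamTimer =
            registry.timer("indy.example.upstream-retrieval"); // hypothetical metric name

    public static void main(String[] args) throws Exception {
        for (int i = 0; i < 10; i++) {
            try (Timer.Context ignored = upstreamTimer.time()) {
                Thread.sleep(20); // stand-in for a call to an upstream repository
            }
        }

        // Report the timer's rate and latency percentiles once, to stdout.
        ConsoleReporter reporter = ConsoleReporter.forRegistry(registry)
                .convertDurationsTo(TimeUnit.MILLISECONDS)
                .build();
        reporter.report();
    }
}
```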

Each thing we serve to a user has a story. Errors and incorrect results are maybe the most interesting stories for production support. A “story” in this sense is the origin story for that result…how it came into being. Stories are made of context and the subject that context references. When we try to solve problems, we’re given the subject…the user always knows that part.

Therefore, context is really the holy grail of troubleshooting. What happened while serving the request that may have set the stage for the reported error? What incorrect content may have been fed into the aggregation to produce the incorrect result? Where did Indy look - but not find content - before the error or incorrect result? Did it find a cached answer? Did it experience an index cache miss? These questions are why a simple error message usually isn’t enough to diagnose a production problem.

Context is the single hardest thing to extract from log messages and metrics. Context is the story of where Indy looked for content before returning the result. It’s the story of what actions put the application into a state where failure may have been inevitable. Indy traverses certain logic - looks in certain places for content - and this traversal is the context we always try to reverse engineer.

Why not store it up front in a way that makes sense?

Context is not Logs…It’s Events

Currently, Indy produces a few streams of events as actions take place inside the application. These events are retained entirely within the application, and are used to trigger certain actions in response to something like file access or an update to a remote repository definition. Events contain pertinent details about an action that targeted a certain entity (file, repository, whatever), such that event listeners can make decisions about what other actions to take in response. For example, aggregated metadata files are deleted as a result of group definition updates, such as when member repositories are added or removed, rendering those files stale.

The idea of building an event - bundling contextual information along with a description of the target entity - should be enough to allow us to describe the way a user request is served, or the way a file has been altered or accessed. If we’re producing some sort of aggregated content, it should give us the conceptual room to describe the places from which content was read, and the places from which content was not read (and why)…along with any errors that came up while aggregating that content. Most importantly, it should allow us to key the whole description of context to some target entity, such that we can retrieve the history of that entity by looking through the full stream of events keyed to that target.

While the file events I used as an example above currently don’t hold enough context (or scope) to be very useful as troubleshooting devices, I believe the concept can be expanded for exactly that purpose.

What is an Event?

An event always describes a unit of work performed on some target entity.

This definition is deliberately scale-free. In fact, it really only makes two assertions (a minimal sketch in code follows the list):

  • It describes some cohesive “unit” of work.
  • The work is performed on some target entity.
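
A minimal, hypothetical shape for that definition in code might look like this - nothing here reflects Indy’s existing event classes:

```java
import java.time.Instant;
import java.util.List;
import java.util.Map;

// A minimal, hypothetical shape for the definition above: a unit of work performed
// on a target entity, with its context attached. This does not reflect Indy's
// existing event classes; it only illustrates the two assertions from the list.
public class ObservedEvent {

    public enum TargetType { FILE, USER_REQUEST, REPOSITORY }

    public final TargetType targetType;
    public final String targetKey;             // e.g. a file path, or a request ID
    public final String workDescription;       // plain-language summary of the unit of work
    public final Instant timestamp;
    public final Map<String, String> context;  // e.g. repositories consulted, cache hits/misses
    public final List<String> errors;          // anything that went wrong along the way

    public ObservedEvent(TargetType targetType, String targetKey, String workDescription,
                         Instant timestamp, Map<String, String> context, List<String> errors) {
        this.targetType = targetType;
        this.targetKey = targetKey;
        this.workDescription = workDescription;
        this.timestamp = timestamp;
        this.context = context;
        this.errors = errors;
    }
}
```

The important part is the combination of a target key with a plain-language description and a bag of context; everything else is negotiable.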

But…What Events Should We Publish?

It’s worth noting here that entities could exist on a few different conceptual levels, leading to events that are effectively nested within the context of other events. A file access event could refer to a subset of the work done while processing a user-request event, for instance. We don’t need to constrain the scope of these events, except to say that we probably don’t want too many different flavors of event floating around…we’re fundamentally in the business of routing, aggregating, and serving files to users, after all.

If we really wanted to, we could create events for every method of every component in the execution path. However, this misses the point. We really only care about a few things (a short sketch of these two slicings follows the list):

  • The history of actions targeting a given file, across multiple user requests (file-oriented event slicing)
  • The set of logical operations that took place when filling a user request (request-oriented event slicing)
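
Here is the short sketch referenced above: both slicings are just different groupings over the same event stream, keyed by request ID or by file path. The Event type here is a throwaway stand-in, not a real Indy class:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustration of the two "slicings" above over one shared event stream.
public class EventSlicing {

    static class Event {
        final String requestId;
        final String filePath;
        final String description;

        Event(String requestId, String filePath, String description) {
            this.requestId = requestId;
            this.filePath = filePath;
            this.description = description;
        }
    }

    public static void main(String[] args) {
        List<Event> stream = List.of(
                new Event("req-1", "org/foo/maven-metadata.xml", "aggregated metadata from 3 members"),
                new Event("req-1", "org/foo/foo-1.0.pom", "served from local cache"),
                new Event("req-2", "org/foo/maven-metadata.xml", "upstream timeout; served stale cache"));

        // Request-oriented slicing: everything that happened while serving one request.
        Map<String, List<Event>> byRequest =
                stream.stream().collect(Collectors.groupingBy(e -> e.requestId));

        // File-oriented slicing: the history of one file across many requests.
        Map<String, List<Event>> byFile =
                stream.stream().collect(Collectors.groupingBy(e -> e.filePath));

        System.out.println("events in req-1: " + byRequest.get("req-1").size());
        System.out.println("events touching the metadata file: "
                + byFile.get("org/foo/maven-metadata.xml").size());
    }
}
```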

The execution could be sliced in other ways to present context keyed to different details. In Indy, these are the keys we normally care about.

Event Streams

The other key point to using events as production support tools is that events have to escape their runtime.

Currently, Indy events exist only within the confines of the JVM, and are used only to trigger local actions. In order to become useful for production support, we must be able to store, index, and observe events. Just like metrics or logs, events must become a data stream that Indy produces as it executes.

Messaging platforms like ActiveMQ (UMB) or Kafka provide a great way to handle this sort of data stream. Indy can publish events to a topic, and one or more consumers can read events off the topic. One of these consumers would be an event observation application, which would store the events in an indexed database and provide a query interface (and probably, a UI). This application would fill a niche similar to Kibana or Grafana, allowing support engineers to look up a particular request or file and learn its history. If we create events with the right context data, it could even trivialize the work of troubleshooting many categories of error found in Indy.
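
As a sketch of what that publishing step might look like with the standard Kafka Java client - the broker address, topic name, and JSON payload are all invented for illustration - an event could be keyed by its target entity and sent like this:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

// Minimal sketch of publishing an event to a Kafka topic with the standard Java client.
// The broker address, topic name, and JSON payload are all made up for illustration.
public class EventPublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // Key the record by the target entity so consumers can slice by it later.
            String targetKey = "org/foo/maven-metadata.xml";
            String eventJson = "{\"target\":\"" + targetKey
                    + "\",\"work\":\"metadata aggregation\",\"sourcesConsulted\":3,\"errors\":[]}";

            producer.send(new ProducerRecord<>("indy.events", targetKey, eventJson));
            producer.flush();
        }
    }
}
```

Keying records by the target entity also gives consumers a natural way to group an entity’s history back together.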

We’re talking about production support here, but it’s worth noting that this isn’t the only use for an event stream. Event streams are composed of structured data, and are decoupled from the publishing system. This means we could attach message consumers for other purposes…such as populating a larger build content database for use in producing source containers, querying for vulnerabilities, or any number of other things. Indy may only provide a small part of the story for build content, but then again it might fill in some key puzzle pieces.

Self-Service Production Troubleshooting

We talk a lot about production support, and it’s implied that this involves a support engineer to help the user understand why things are going wrong. But there’s another case to consider here: wrong content that’s not a malfunction in Indy.

In the Java world, there’s a lot of discussion for newbies about how to use exceptions. Exceptions are resource intensive to produce, since they require copying the full stack trace on construction. So, the rule of thumb is to create exceptions only in exceptional circumstances. The implication here is that not all errors (or incorrect results, if you like) are actually exceptional. For these, it’s better to implement some sort of handling logic that might include retry logic, or reporting back to the user in some form.

The same principle applies to the use of support engineers. Ideally, support engineers should only be involved when Indy is malfunctioning.

Currently, we’re pretty far away from that ideal. Whenever content is returned that seems demonstrably wrong, a support case is opened and an engineer gets involved. Sometimes this is the right call, but often this happens because users don’t know what else to do, or how to fix their situation. Indy issues have been reported for these reasons (among many others):

  • Repository X failed to respond before the Indy caching request timed out, resulting in a 404 when the content is visible in Repository X via the user’s browser.
  • Repository X timeouts caused Indy’s metadata aggregation process to run too long, resulting in a user request timeout.
  • Build content was in the wrong Brew tag, and Indy’s build-proxying rules refused to include it in aggregated metadata.
  • An upstream Eclipse repository returned a 404 HTML page with an HTTP status code of 200, which was returned to the user.
    • …or failed to parse as metadata during the aggregation process
  • Repository X contained an incorrect file that collided with a GAV in Maven Central; since Repository X was first in the group membership ordering, the wrong file was served to the user.

None of these are Indy malfunctions, strictly speaking. Yes, Indy can implement compensatory logic in some cases to help narrow down the set of failure modes. However, some types of failure will always be outside of Indy’s control.

Also, notice that while Indy could in theory parse that HTML file to extract the 404 from it, in practice this is nearly impossible to generalize. And Indy would have to know to try to parse it in the first place. Many idiots can stand up a basic webserver that behaves badly, after all. The other thing to consider is the processing time involved in trying to improve Indy’s guesses at what’s really going on with a piece of content, and balancing that against the performance of the system when things are configured correctly. Too much investigation will penalize response times even from repositories not run by idiots.

The most attractive thing about publishing event streams instead of logs is that they won’t require a lot of interpretation, as long as we put in some effort to describe the work in plain terms. If we trivialize the debugging effort required for entire categories of errors, that means we can provide a UI to our users, and provide production troubleshooting as a self-service activity.

Conclusion

Indy implements relatively simple routing logic that combines with variable quality in upstream data and services to result in complex failure modes. However, we can use this relative simplicity of logic to construct events that describe the work taken to serve a user request, or the actions that targeted a particular file. Events consist of a target entity of some kind, and the context information about some unit of work affecting that target. This definition is intentionally fuzzy, leaving room for us to select which entities we care about. These entities define the types of events we will publish.

Event data is infinitely more valuable than raw structured logs, since events carry the context of an action intact. By contrast, logs force us to reconstruct the context that led to a given result. Providing intact context for how Indy arrived at a given result trivializes many types of production support by simply providing the list of places content was sought, what was rejected (and why), and what was eventually served. Trivializing the construction of this story presents the opportunity to expose this information to our users and provide a self-service troubleshooting experience for many common scenarios.

Events should be published as one or more data streams, ideally to a message bus like ActiveMQ (UMB) or Kafka. This decouples the observation system from the application producing events, which keeps scope as tight as possible for both, and makes ownership cheaper. It also provides an opportunity for new event consumers to be added later that can make other use of the data streams. As long as events contain structured information (in addition to human-readable parts), they can be reused and incorporated in other services…they can become more than simple elements for human observation.

Written on July 3, 2018
