Your most valuable business data
This post is part of the series “The Zen of Data Pipelines”, which explores how to build scalable and robust data systems. Most of the ideas and best practices described in this series were learned while implementing and maintaining a large-scale data pipeline at F-Secure for the Rapid Detection Service. You can find more posts in this series on this blog, or all series content on Gitbook.
The complexity of a data pipeline grows with the number of processing components and the data load. As the pipeline grows in size and complexity, the system as a whole tends to become more opaque: with multiple services running at the same time, it is difficult to keep track of errors and service status without a proper monitoring system in place. Under these circumstances, it is hard for developers to tell whether the system is behaving as expected, which usually means that it is not. In complex data pipelines - and in any other complex software system - visibility into how the system is performing is key to identifying issues and bottlenecks early and to understanding which components need improvement. Implementing the monitoring system from day one is a key business decision for your team.
Logs and metrics
Monitoring data can be split into two types: logs and metrics. Logs are data generated by the services in the pipeline, mostly from the services’ standard output. Metrics are snapshots of key values describing how the systems are performing and how much of their resources they are consuming. Given their nature, logs are best represented as unstructured data and metrics as time series. Since the structure and purpose of logs and metrics differ, each should be stored in a dedicated data store. While it is useful to perform full-text search over log entries, metrics can be used to visualize how the systems perform over time and to trigger automated alerts when they behave unexpectedly.
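To make the distinction concrete, here is a minimal sketch of a hypothetical pipeline component that emits both kinds of data: free-form log lines destined for the log store, and a numeric metric point destined for the time-series store. The component name, the fields and the `send_to_metrics_store` helper are illustrative, and the metric is written in InfluxDB line protocol only as one common example of a time-series format.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("parser-service")  # hypothetical component name


def send_to_metrics_store(line):
    """Placeholder transport; in practice this goes to an agent or an HTTP endpoint."""
    print(line)


def process_batch(events):
    start = time.time()
    failures = 0
    for event in events:
        if "payload" not in event:
            failures += 1
            # Log entry: free-form, searchable text destined for the log store.
            logger.error("dropping malformed event %s", event.get("id"))

    # Metric point: a numeric snapshot destined for the time-series store,
    # written here in InfluxDB line protocol (measurement,tags fields timestamp).
    send_to_metrics_store(
        f"batch_processed,service=parser "
        f"events={len(events)}i,failures={failures}i,duration={time.time() - start} "
        f"{time.time_ns()}"
    )


process_batch([{"id": 1, "payload": "..."}, {"id": 2}])
```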
Visualize the data
Graphical representations of monitoring data help the development team get a clear picture of how the system is behaving and performing. Examples of technical and business insights that monitoring data can provide include how the data load varies over time, how component auto scaling is performing, where the possible bottlenecks of the system are and how much the processing costs changed after a design refactoring. Graphical representations are often helpful not only for the technical teams working directly on the system, but also for the sales and business teams. Monitoring data can and should be shared with everyone to support decision making within the company.
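As a small illustration, the sketch below plots one such metric - ingestion rate over time - from a CSV export of metric points. The file name and column names are hypothetical; in practice a dashboarding tool such as Grafana or Chronograf reads the time-series store directly and keeps the charts up to date.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical export of metric points: one row per sample.
metrics = pd.read_csv("ingestion_rate.csv", parse_dates=["timestamp"])

# A single chart answering one question: how does the data load vary over time?
metrics.plot(x="timestamp", y="events_per_second", title="Pipeline ingestion rate")
plt.ylabel("events / s")
plt.tight_layout()
plt.savefig("ingestion_rate.png")
```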
Automated alerting
Both logs and metrics are essential for reasoning about the system status and for supporting technical and business decisions. However, graphical dashboards showing historical and near-real-time system metrics, together with searchable logs, are not enough in complex systems. As the number of components and the data load increase, automation becomes more important. It is not productive - and often impossible - to assign a person or a team to constantly inspect metrics and logs looking for issues and insights. Thus, for logs and metrics to be useful, there have to be automated alerts in place. Automated alerts are triggered when certain log or metric conditions hold true. An automated alerting system should inspect logs and metrics in near real time and perform actions based on the data. Those actions can be, for example, triggering self-healing scripts, scaling up a specific component (e.g. if the data load is high and the available component resources are running out) or calling whoever can fix the issue. The alerting conditions depend on the domain and system requirements, but some system resource alerts are common to most software projects, such as alerts that trigger when the CPU, memory or disk available to a component is running low.
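The sketch below shows the shape of such an alert evaluator: a small loop that periodically reads the latest metric values and acts when a condition holds. The `get_latest_metric` stub and the rules are hypothetical placeholders; a real deployment would rely on the alerting features of the monitoring stack (for example Kapacitor or Prometheus Alertmanager) and would call webhooks, self-healing scripts or the autoscaler instead of printing.

```python
import random
import time


def get_latest_metric(name):
    """Placeholder: return the newest value of a metric; normally a query to the store."""
    return random.uniform(0, 100)


ALERT_RULES = [
    # (metric, threshold, comparison, action description)
    ("disk_free_percent", 10.0, "below", "page the on-call engineer"),
    ("queue_depth", 5000.0, "above", "scale out the consumer group"),
]


def evaluate_alerts():
    for metric, threshold, direction, action in ALERT_RULES:
        value = get_latest_metric(metric)
        breached = value < threshold if direction == "below" else value > threshold
        if breached:
            # In practice: call a webhook, run a self-healing script, trigger autoscaling.
            print(f"ALERT: {metric}={value:.1f} is {direction} {threshold} -> {action}")


while True:
    evaluate_alerts()
    time.sleep(30)  # near-real-time polling interval
```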
When defining alert thresholds and conditions (e.g. trigger an alert when free disk space drops below 10%), aim for preemptive alerting: alerts that trigger early enough that there is time to fix the problem before the system is affected. In some cases there is a fine line between preemptive alerts and false positives, and the alert conditions may need frequent tuning to avoid noise. This brings us to the last point about alerts: test them. Alert conditions are code snippets that run over the latest metrics data and decide whether to trigger. They can be written in a general-purpose language or in a domain-specific one (e.g. TICK scripts by InfluxData), depending on the alerting system used. Thus, alerts are code. With time, the alerting system will become critical to your business. In complex pipelines the data format and the alert conditions change often, and the alert scripts need to change accordingly. The alerts need to be tested against real data in an automated way if we want to ensure that no blind spots appear when the data format or the alert code changes. There are frameworks that streamline testing alerts (e.g. kapacitor-unit), depending on the alerting system used.
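As an illustration of treating alerts as code, the sketch below unit-tests a simple threshold condition against recorded data points with pytest. The `low_disk_alert` predicate and the sample data are hypothetical; if the conditions live in TICK scripts, a framework such as kapacitor-unit plays the same role by replaying recorded data against the real scripts.

```python
import pytest


def low_disk_alert(points, threshold_percent=10.0):
    """Alert condition under test: fire when the newest disk_free_percent
    sample drops below the threshold."""
    return points[-1]["disk_free_percent"] < threshold_percent


def test_alert_fires_on_low_disk():
    recorded = [{"disk_free_percent": 42.0}, {"disk_free_percent": 8.5}]
    assert low_disk_alert(recorded)


def test_alert_stays_quiet_on_healthy_disk():
    recorded = [{"disk_free_percent": 35.0}, {"disk_free_percent": 30.1}]
    assert not low_disk_alert(recorded)


def test_schema_change_fails_loudly():
    # Guard against blind spots: a renamed field should raise, not silently never fire.
    with pytest.raises(KeyError):
        low_disk_alert([{"disk_free": 5.0}])
```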
Start monitoring early
In the early stages of a project, teams often overlook monitoring and focus on building the data system itself. This is a crucial mistake which you should avoid. Even the simplest monitoring system will provide actionable data about the system’s status, issues and bottlenecks and help you make important technical and business decisions, which is especially valuable early on. Building the monitoring early is an investment that pays off fast and lays the groundwork for a scalable and robust data system.
Takeaway: As complexity increases, monitoring data becomes more important in the context of data pipelines. Logs and metrics are different in nature and serve different purposes: store logs as searchable data and metrics as time series. Build an alerting system so that the team is alerted about potential issues early enough to fix them before the system is affected. Alert conditions are code, so it is important to automate their tests as well. Build the monitoring stack from the beginning; it will greatly help the team decide what to improve and where to invest time.