As businesses increase their emphasis on the stability of distributed computing systems, SRE practices play an important role, prompting vendors like New Relic to expand observability tools accordingly.
Service Level Management (SLM), a new feature for the New Relic One platform, became generally available this week free of charge to current customers. It provides a framework for site reliability engineering (SRE) teams to configure service-level indicators (SLIs) and service-level objectives (SLOs), automatically set baselines, and monitor the reliability of microservices according to these performance indicators.
A company spokesperson also said New Relic plans to announce a security offering this year but did not provide further details. FutureStack, the company’s annual user conference where it typically announces major product updates, is scheduled for May.
Beta testers of New Relic SLM said this week’s update shows how the move to a microservices architecture has expanded the role of both observability tools and SREs in their companies, and they welcomed the possible addition of security monitoring to that mix. SREs have also begun to play an expanded role in DevSecOps environments.
Ultimately, taking multiple types of measurements to capture the user experience, rather than tracking the raw performance of individual infrastructure components, is what makes microservices different from monoliths, observability different from monitoring, and SREs different from traditional sysadmins, a New Relic SLM early adopter said.
“Monitoring is very helpful when the failure modes are well understood, such as exhausting system resources like memory or threads,” said Andrew Myers, senior manager of SRE at Zip.co, an online payments company in Australia. “Observability helps us understand the state of a distributed system by looking at all the data it generates, not just individual [resources].”
Observability tools have entered the cutthroat consolidation phase
At least a few businesses have begun consolidating their observability tools on New Relic, adding logs and distributed traces to New Relic’s traditional APM tools as the platform evolves, along with metrics and data aggregated from third-party tools like Prometheus, and phasing out competing tools like Splunk and Grafana as a result.
However, some businesses are making consolidation choices that favor other vendors, and New Relic is playing catch-up in catering to SREs: two of its main competitors, Dynatrace and Datadog, have had SLI and SLO tracking features since 2020 and 2019, respectively.
These competitors also cover an entire category of IT security monitoring and DevSecOps that New Relic has not yet addressed. The observability market is ripe for further attrition and consolidation as users continue to reduce the number of IT management tools they use, including for security, and New Relic must keep pace with competitors, including in security monitoring, to succeed in the long run.
“[Adding application security tools] will make good sense as they continue to target the software delivery lifecycle and beyond to developers,” said Stephen Elliot, an analyst at IDC. “Code scanning is an interesting area, as well as vulnerability tests for developers.”
New Relic is still emerging from a major upheaval in May 2021, when it appointed a new CEO and reorganized its product portfolio to create New Relic One, a unified observability platform. According to the company’s latest earnings report, its revenue has continued to grow since then, with 14,600 customers in the third quarter of its fiscal year, which ended in January.
However, while it tackles the innovator’s dilemma, which is also creating chaos for enterprise IT vendors Splunk and ServiceNow, New Relic has yet to regain profitability. It forecast relatively flat revenue for its fiscal fourth quarter and does not expect profitability until fiscal 2023.
SREs, observability create unity from chaos
SREs played the role of facilitator as microservices matured at one company that was an early adopter of SLM, creating a centralized observability stack with New Relic and using it to organize communication among developers, platform engineers and product teams.
“In a monolithic environment, reliability is only with the SRE team – we only care if things break down in production,” said Stefan Kolesnikowicz, SRE chief at Achievers, a maker of employee recognition software based in Toronto.
As Achievers’ culture and microservices grew on the Google Cloud Platform, however, “everyone became responsible for reliability,” he said. The distributed nature of microservices, by definition, forces collaboration between the teams that develop and manage them, and their complexity cannot be handled by any single team.
The Achievers SRE team has created a developer self-service portal called Abattoir, a nod to the often-cited “cattle vs. pets” analogy for the highly automated, short-lived infrastructure that underlies rapidly changing microservices environments.
New Relic SLM will be integrated into Abattoir to let software engineers and product teams configure and monitor SLIs and SLOs for the services they manage, thanks to a new integration with Terraform that automatically creates objects in the New Relic observability database behind the scenes.
“We have a checkbox for that – really, engineers will just say, ‘Yeah, I want it,’” Kolesnikowicz said. “That all gets translated from YAML, where the engineers write it, and pushed into Terraform, [which] communicates with the New Relic API, which creates all of those things in New Relic.”
All of this reflects how system reliability has risen to the top of the priority list at Achievers, Kolesnikowicz said, as it has at many businesses where microservices are becoming mainstream.
“We’re trying to be more stringent, so if your error budget is running out, that’s your highest priority, to increase your reliability before you can release new features and introduce more risk to our platform,” Kolesnikowicz said. “[New Relic SLM] will give us better insight into how a system performs and its impact on the rest of the platform, and product integrations will allow them to see, ‘Hey, you’re slipping into your error budget.’”
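The error budget policy Kolesnikowicz describes can be illustrated with a little arithmetic. The sketch below is a minimal illustration, not Achievers’ or New Relic’s implementation: the function names, the event counts and the idea of gating releases at a zero-budget floor are all assumptions made for the example.

```python
def error_budget_remaining(slo_target: float, total_events: int, bad_events: int) -> float:
    """Return the fraction of the error budget still unspent.

    1.0 means the budget is untouched; 0.0 or below means it is exhausted.
    slo_target is a fraction such as 0.99 (hypothetical example value).
    """
    # The budget is the number of bad events the SLO tolerates in the window.
    allowed_bad = (1.0 - slo_target) * total_events
    if allowed_bad == 0:
        # No traffic, or a 100% target: any bad event exhausts the budget.
        return 0.0 if bad_events else 1.0
    return 1.0 - bad_events / allowed_bad


def release_allowed(slo_target: float, total_events: int, bad_events: int,
                    floor: float = 0.0) -> bool:
    """Hypothetical policy: block feature releases once the budget hits the floor."""
    return error_budget_remaining(slo_target, total_events, bad_events) > floor


if __name__ == "__main__":
    # A 99% SLO over 10,000 requests tolerates about 100 failures.
    print(error_budget_remaining(0.99, 10_000, 50))   # roughly half the budget left
    print(release_allowed(0.99, 10_000, 150))         # budget overspent
```

Under these assumptions, a team that has spent its whole budget would prioritize reliability work until the remaining fraction climbs back above the floor.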
SLI/SLO wish lists: burn rate alerts, edge metrics
Early SLM adopters would like to see built-in alerting on error budgets added to the tool in a future release. They can use the New Relic Query Language (NRQL) to configure custom alerts when error budget burn rates reach certain thresholds, but it would be easier if that alerting came packaged with SLM.
“It would also be nice to have some smarts to help teams decide on realistic targets for service levels based on the historical data we have as a baseline,” said Myers of Zip.co. “That’s something we need to coach our teams on internally.”
Another potential refinement for SLM is expanded support for the Prometheus metrics that Achievers tracks in its individual Kubernetes clusters via the Istio service mesh, according to Kolesnikowicz. New Relic One already integrates Prometheus metrics for other uses, but that support hasn’t yet been built into SLM.
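For readers unfamiliar with the term, a burn rate compares the observed error rate with the rate the SLO allows; a burn rate of 1 spends the budget exactly over the SLO period. The following is a minimal sketch of the kind of threshold check such an alert performs. It does not use NRQL or any New Relic API, and the function names and the sample threshold of 14.4 (a value commonly cited for a one-hour lookback against a 30-day SLO) are assumptions for illustration.

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    Assumes slo_target is a fraction strictly below 1.0, such as 0.99.
    """
    if total_events == 0:
        return 0.0  # no traffic in the lookback window, so nothing was burned
    error_rate = bad_events / total_events
    return error_rate / (1.0 - slo_target)


def should_alert(bad_events: int, total_events: int, slo_target: float,
                 threshold: float) -> bool:
    """Hypothetical alert condition: fire when the burn rate reaches the threshold."""
    return burn_rate(bad_events, total_events, slo_target) >= threshold


if __name__ == "__main__":
    # 20 failures in 1,000 requests against a 99% SLO burns budget at about 2x.
    print(burn_rate(20, 1_000, 0.99))
    print(should_alert(20, 1_000, 0.99, threshold=14.4))
    print(should_alert(200, 1_000, 0.99, threshold=14.4))
```

Teams typically pair a high threshold over a short window with a lower one over a longer window, so that both sudden outages and slow leaks are caught; the thresholds themselves are a judgment call for each service.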
“If you’re familiar with the SRE book, [it says] you can move the measurement closer to the user to improve its quality,” he said, referring to Google’s seminal Site Reliability Engineering manual. “Now, we measure [SLIs] on the server side – we want to measure it in the load balancer, which will be in our Istio instance.”
Error budget burn rates and support for Prometheus metrics are both on the vendor’s short-term roadmap for SLM, a New Relic spokesperson said.
Beth Pariseau, senior news writer at TechTarget, is an award-winning veteran of IT journalism. She can be reached at [email protected] or on Twitter @PariseauTT.