Why You Need Organizational SRE Support

Einic Yeo · June 16, 2019

Building reliable services is expected in today’s software ecosystem. Customers don’t care whether you’re working with highly integrated systems or in a more controlled environment–they simply need your service to work. Site reliability engineers (SREs) help you adhere to SLAs and SLOs, run tests, monitor software and hardware, and iterate processes to build reliable applications and systems.

Organizations need to support SRE efforts because of the potential costs of downtime, the customer expectations of reliability, and the increasing speed of software development. Reliability engineers can create a cohesive culture of reliability by adding visibility into the software development lifecycle (SDLC), creating processes, and collaborating with teammates to make systems more robust.

The Benefits of SRE

By having a hand in a little of everything, from monitoring a server’s disk space and setting up failovers to writing code for a web application, SREs can see the bigger picture of system reliability, no matter how complex the infrastructure. Over time, this exposure helps spread system knowledge across all of engineering and helps teams build more robust services. That kind of reach is only possible with company-wide buy-in, which is where SRE begins.

Everyone from the leadership team down to your most junior software engineer needs to understand the benefits of SRE and help support SRE initiatives. Teams that adopt SRE practices fill gaps between developers and IT operations teams, which cultivates a more collaborative DevOps-type culture. Below, we lay out some of the challenges to getting buy-in for SRE, along with a number of ways to win support from leadership and others in your organization.

The Challenges

SREs can act as a bridge between IT operations and development teams. By understanding how to write code, take ownership of that code, and maintain it, SREs will help bake reliability into everything across the entire system. It’s important to make this clear to leadership, engineering, customer support, sales, marketing–everyone.

However, organizations are often reluctant to change. Change can mean additional costs, lost productivity, and general confusion. To get organizational buy-in for SRE, you need to convince leadership that reliable systems reduce costs, while also convincing your teammates that shared responsibility for development and operations is beneficial. Make sure to address both points when presenting to the leadership team about getting started with SRE.

Every engineering team needs SRE in one capacity or another. Now, let’s look at some tips for making the case to leadership about the importance of SRE within your own organization:

Getting Buy-In From Leadership

You’ll need to present a compelling argument to leadership if you’d like them to get excited about change. To get started, here are a few suggestions of things to cover that should help you get organizational support for SRE:

SRE Mission

  • The mission of SRE should always be to provide a more reliable service for your customers, but the ways to achieve that will vary. Define what SRE should accomplish, and why it’s important for stakeholders, both internal and external. Then, be sure the mission answers the question, “How will SRE drive success for your team and customers?”

Scope of the Project

  • A clearly defined scope of the project can ease concerns and make leadership more comfortable with potential changes. Walk through the overall responsibilities of the SRE team, the vision, the structure, the culture, and time requirements. Define the tooling necessary and any methods you’ll use for breaking down silos and adding resiliency into your systems.

Implementation of SRE

  • What will the implementation of SRE look like, and what kinds of resources are required? Are you creating a fully dedicated SRE team, or distributing part-time SRE responsibilities amongst multiple teams? Because every company is different, there’s no one structure that works best. Just be sure to lay out a clear argument for why your chosen implementation works best for you.

Measuring SRE Efficacy

  • What will success look like? How will you measure the effectiveness of SRE efforts? This point is crucial as it will help leadership quantitatively monitor the efficacy of SRE and explain it to key stakeholders. Set SLAs and SLOs, but even more, define long-term goals for reducing incident frequency, shortening incident MTTA/MTTR (mean time to acknowledge/resolve), and measuring reactive vs. proactive incident readiness.
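The incident metrics named above can be computed from very little data. The following is a minimal sketch, not a prescribed method: the `Incident` record, its field names, and the helper functions are hypothetical, and it assumes both MTTA and MTTR are measured from the moment an incident is opened. Your incident tooling may define these intervals differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical record shape; real incident tools name these fields differently.
    opened: datetime
    acknowledged: datetime
    resolved: datetime


def mean_duration(durations: list[timedelta]) -> timedelta:
    """Average a list of durations; an empty list yields zero."""
    if not durations:
        return timedelta(0)
    return sum(durations, timedelta(0)) / len(durations)


def mtta(incidents: list[Incident]) -> timedelta:
    """Mean time to acknowledge, assumed here to run from open to acknowledgement."""
    return mean_duration([i.acknowledged - i.opened for i in incidents])


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean time to resolve, assumed here to run from open to resolution."""
    return mean_duration([i.resolved - i.opened for i in incidents])


def incidents_per_week(incidents: list[Incident], window: timedelta) -> float:
    """Incident frequency over a reporting window, normalized to one week."""
    if window <= timedelta(0):
        raise ValueError("window must be positive")
    return len(incidents) / (window / timedelta(weeks=1))


# Made-up sample data covering a two-week reporting window.
start = datetime(2019, 6, 1, 9, 0)
sample = [
    Incident(start, start + timedelta(minutes=4), start + timedelta(minutes=30)),
    Incident(
        start + timedelta(days=3),
        start + timedelta(days=3, minutes=6),
        start + timedelta(days=3, minutes=90),
    ),
]

print("MTTA:", mtta(sample))  # 0:05:00
print("MTTR:", mttr(sample))  # 1:00:00
print("Incidents per week:", incidents_per_week(sample, timedelta(days=14)))  # 1.0
```

Tracking these three numbers per quarter gives leadership a simple trend line: if SRE efforts are working, frequency, MTTA, and MTTR should all move downward over time.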

Why is SRE Needed?

  • In combination, everything above will help answer this question. But a clearly laid out statement of why you need SRE is essential. If possible, give examples of incidents or pain points where an SRE team would have been beneficial. If you can show that the value outweighs the cost, leadership is far more likely to buy into the effort.
Copyright notice: This article is licensed under CC 4.0 BY-SA. If you repost it, please include a link to the original source and this notice.

The Necessity of Supporting SRE

SRE is only as good as the team supporting it. Effective SRE depends on collaboration from other teams and business units. The more visibility an SRE team can get throughout the entire SDLC and incident management lifecycle, the better they can assess ways to improve the system. With reliability engineers solely focused on finding pain points and scalability concerns, the rest of the DevOps team can maintain more consistent continuous deployment and integration.

Giving time and energy to the efforts of SRE means you’re more prepared for incidents when they occur, and issues are escalated less often. Site reliability engineers can conduct post-incident reviews, create actionable runbooks with context, and help on-call engineers respond to problems, all while ensuring a greater level of reliability in new deployments and current infrastructure.
