Site reliability engineering


Site reliability engineering (SRE) is a set of principles and practices that applies aspects of software engineering to IT infrastructure and operations.[1] Its stated aim is to create highly reliable and scalable software systems. SRE is closely related to DevOps but differs from it in focus.[2][3][4]

History

The field of site reliability engineering originated at Google with Ben Treynor Sloss,[5][6] who founded a site reliability team after joining the company in 2003.[7] In 2016, Google employed more than 1,000 site reliability engineers.[8] The concept subsequently spread into the broader software development industry, and other companies began to employ site reliability engineers.[9] The position is more common at larger web companies, as small companies often do not operate at a scale that would require dedicated SREs.[9] Organizations that have adopted the concept include Airbnb, Dropbox, IBM,[10] LinkedIn, Netflix,[8] and Wikimedia.[11] According to a 2021 report by the DevOps Institute, 22% of organizations in a survey of 2,000 respondents had adopted the SRE model.[12][13]

Definition

As a job role, site reliability engineering may be performed by solo practitioners or by teams. Within a broader engineering organization, SREs are usually responsible for some combination of system availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.[14] Site reliability engineers often have backgrounds in software engineering, systems engineering, or system administration.[15] The focus areas of SRE include automation, system design, and improvements to system resilience.[15]

As a set of principles and practices, site reliability engineering can be performed by anyone. In this respect SRE resembles security engineering: everyone is expected to contribute to good security practices, but a company may eventually decide to hire specialists for the job. Just as companies hire security engineers to secure internet systems, they may hire SREs to define and meet their reliability goals.[citation needed]

Site reliability engineering has also been described as a specific implementation of DevOps. SRE focuses specifically on building reliable systems, whereas DevOps has a broader focus.[2][3][4] Despite this difference, some companies have rebranded their operations teams as SRE teams with little meaningful change.[9]

Principles and practices

There have been multiple attempts to define a canonical list of site reliability engineering principles. Although there is no consensus, most definitions include the following characteristics:[1][16]

  • Automation or elimination of anything repetitive in a cost-effective way.
  • Avoiding the pursuit of more reliability than is strictly necessary. Defining what is necessary is a practice in itself (see the list of practices below).
  • Systems designed with a bias toward the reduction of risks to availability, latency, and efficiency.
  • Observability—as in, the ability to ask arbitrary questions about a system without having to know ahead of time what to ask.[17]
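
The principle of not pursuing more reliability than necessary can be made concrete with a little arithmetic: each additional "nine" of availability shrinks the permitted downtime by a factor of ten. The following Python sketch is illustrative only. The function name `allowed_downtime`, the 30-day window, and the listed targets are assumptions chosen for the example, not figures taken from the cited sources.

```python
from datetime import timedelta


def allowed_downtime(target: float, window: timedelta = timedelta(days=30)) -> timedelta:
    """Return how much downtime an availability target permits within a window.

    `target` is a fraction such as 0.999 (hypothetical example value).
    """
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in the range (0, 1]")
    # The unavailable fraction of the window is simply 1 - target.
    return window * (1.0 - target)


if __name__ == "__main__":
    # Compare a few assumed targets over the default 30-day window.
    for target in (0.99, 0.999, 0.9999):
        print(f"{target:.2%} availability allows {allowed_downtime(target)} of downtime")
```

Comparing the permitted downtime across targets gives a rough sense of how much tighter operations must become for each extra nine, which is one input when deciding what level of reliability is actually necessary.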

Site reliability engineering practices also vary widely, but the following are commonly implemented, at least in part:

  • Toil management as the implementation of the first principle outlined above.
  • Defining and measuring reliability goals—SLIs, SLOs, and error budgets.
  • Non-Abstract Large Scale Systems Design (NALSD) with a focus on reliability.
  • Designing for and implementing observability.
  • Defining, testing, and running an incident management process.
  • Capacity planning.
  • Change and release management, including CI/CD.
  • Chaos engineering.
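
The relationship between SLIs, SLOs, and error budgets in the list above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the SLI is taken to be the ratio of good events to total events, and the class name `ErrorBudget`, its fields, and the sample numbers are hypothetical rather than drawn from any particular tool or from the cited sources.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Hypothetical event-based error budget for a single SLO window."""

    slo: float          # target fraction of good events, e.g. 0.999
    total_events: int   # all events observed in the window
    good_events: int    # events that met the success criterion

    @property
    def sli(self) -> float:
        """Measured service level indicator: good events / total events."""
        if self.total_events == 0:
            return 1.0  # assumption: no traffic counts as meeting the objective
        return self.good_events / self.total_events

    @property
    def budget_events(self) -> float:
        """Number of bad events the SLO tolerates in this window."""
        return (1.0 - self.slo) * self.total_events

    @property
    def consumed(self) -> float:
        """Fraction of the error budget already spent (may exceed 1.0)."""
        bad = self.total_events - self.good_events
        if self.budget_events == 0:
            return 0.0 if bad == 0 else float("inf")
        return bad / self.budget_events


if __name__ == "__main__":
    window = ErrorBudget(slo=0.999, total_events=1_000_000, good_events=999_500)
    print(f"SLI: {window.sli:.4%}")
    print(f"Budget consumed: {window.consumed:.0%}")
```

A team could compare `consumed` against a threshold of its own choosing to decide, for example, whether to slow down risky releases; where that threshold sits is a policy decision that this sketch does not prescribe.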

Implementations

SRE teams are structured in various ways, differing in how they engage with other teams in their companies and in how they apply SRE principles and practices. The following is a high-level overview of common SRE team implementations:[18]

Kitchen Sink, a.k.a. “Everything SRE”

The scope of services or workflows covered is usually unbounded.

Infrastructure

Infrastructure SRE teams focus on the reliability of behind-the-scenes systems that help make other teams' jobs more efficient. They are often confused with "Platform" or "Platform Operations" teams. Infrastructure SRE teams may pair up with one or more platform engineering teams, but they differ in that they perform most, if not all, of the work described in the principles and practices listed above. Platform teams tend to focus on building the platform; reliability is desirable for them, but it is not their sole priority.

Tools

These teams focus on tools used to measure, maintain, and improve system reliability, such as Nagios Core or Prometheus.

Product or application

These SRE teams are dedicated to a particular product or application. Large companies tend to staff several of them.

Embedded

Embedded SREs are usually solo practitioners or pairs staffed within a software engineering team, where they apply most of the principles and practices described above.

Consulting

These teams consult on how to implement SRE principles and practices. They are usually staffed by experienced SREs who have worked on teams in one or more of the implementations above. SREs on external-facing consulting teams are often called "Customer Reliability Engineers". They rarely, if ever, change the customer's configuration or code.

Large companies that have adopted SRE tend to combine the implementations described above, often with multiple teams of the same type. For example, a company might have several Product/application SRE teams to meet the specific demands of individual products, plus an Infrastructure SRE team that pairs with a platform engineering group to meet the reliability goals of a platform shared by those products.

Industry

The USENIX organization has held an annual SREcon conference since 2014 for site reliability engineers in the industry, and also holds regional conferences with similar themes.[19]

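See also

  • Chaos engineering
  • Cloud computing
  • Data center
  • Disaster recovery
  • High availability software
  • Infrastructure as code
  • Operations, administration, and management
  • Operations management
  • Reliability engineering
  • System administration
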
References

  1. "Evaluating where your team lies on the SRE spectrum". Google Cloud Blog. Retrieved 2021-06-26.
  2. Beyer, Betsy; Jones, Chris; Petoff, Jennifer; Murphy, Niall, eds. (2016). Site Reliability Engineering: How Google Runs Production Systems. Sebastopol, CA: O'Reilly Media. ISBN 978-1-4919-5118-7. OCLC 945577030.
  3. Vargo, Seth; Fong-Jones, Liz (March 1, 2018). What's the Difference Between DevOps and SRE? (class SRE implements DevOps) (Video). Google.
  4. "What is SRE? - SRE Explained - AWS". Amazon Web Services, Inc. Retrieved 2022-11-05.
  5. Hill, Patrick. "Love DevOps? Wait until you meet SRE". Atlassian. Retrieved June 17, 2021.
  6. "What is SRE?". Red Hat. Retrieved June 17, 2021.
  7. Treynor, Ben (2014). "Keys to SRE". USENIX SREcon14. Retrieved June 17, 2021.
  8. Fischer, Donald (March 2, 2016). "Are site reliability engineers the next data scientists?". TechCrunch. Retrieved June 17, 2021.
  9. Gossett, Stephen (June 1, 2020). "What Is a Site Reliability Engineer? What Does an SRE Do?". Built In. Retrieved June 17, 2021.
  10. "Site Reliability Engineering". IBM Cloud Education. IBM. November 12, 2020. Retrieved June 21, 2021.
  11. "SRE - Wikitech". wikitech.wikimedia.org. Retrieved 2021-10-17.
  12. Oehrlich, Eveline; Groll, Jayne; Garbani, Jean-Pierre (2021). Upskilling 2021 Enterprise DevOps Skills Report (PDF) (Report). DevOps Institute. Retrieved June 17, 2021.
  13. Oehrlich, Eveline (May 4, 2021). "What it takes to be a site reliability engineer". TechBeacon. Micro Focus. Retrieved June 17, 2021.
  14. Treynor, Ben. "In Conversation" (Interview). Interviewed by Niall Murphy. Google Site Reliability Engineering.
  15. Jones, Chris; Underwood, Todd; Nukala, Shylaja (June 2015). "Hiring Site Reliability Engineers" (PDF). ;login:. Vol. 40, no. 3. pp. 35–39. Retrieved June 17, 2021.
  16. "The 7 SRE Principles [And How to Put Them Into Practice]". www.blameless.com. Retrieved 2021-06-26.
  17. "Learn about observability | Honeycomb". docs.honeycomb.io. Retrieved 2021-06-26.
  18. "SRE at Google: How to structure your SRE team". Google Cloud Blog. Retrieved 2021-06-26.
  19. "Usenix SREcon". USENIX. 2021. Retrieved June 17, 2021.

Further reading

  • Limoncelli, Tom; Chalup, Strata R.; Hogan, Christina J. (September 2014). The Practice of Cloud System Administration: DevOps and SRE Practices for Web Services. Vol. 2. Upper Saddle River, NJ: Addison-Wesley. ISBN 978-0133478549. OCLC 891786231.
  • Beyer, Betsy; Jones, Chris; Petoff, Jennifer; Murphy, Niall Richard, eds. (2016). Site Reliability Engineering: How Google Runs Production Systems. O'Reilly. ISBN 978-1491929124.
  • Blank-Edelman, David N., ed. (2018). Seeking SRE: Conversations About Running Production Systems at Scale (1 ed.). Sebastopol, CA: O'Reilly. ISBN 978-1491978863. OCLC 1052565720.
  • Beyer, Betsy; Murphy, Niall; Kawahara, Kent; Rensin, David; Thorne, Stephen (2018). The Site Reliability Workbook: Practical Ways to Implement SRE. O'Reilly. ISBN 978-1492029502.
  • Welch, Nat (2018). Real-World SRE: The Survival Guide for Responding to a System Outage and Maximizing Uptime. Packt. ISBN 978-1788628884.
  • Adkins, Heather; Beyer, Betsy; Blankinship, Paul; Lewandowski, Piotr; Oprea, Ana; Stubblefield, Adam (2020). Building Secure and Reliable Systems: Best Practices for Designing, Implementing, and Maintaining Systems. O'Reilly. ISBN 978-1-4920-8312-2. OCLC 1129470292.
  • Rosenthal, Casey; Jones, Nora (2020). Chaos Engineering: System Resiliency in Practice. O'Reilly. ISBN 978-1492043867.

External links

  • Awesome Site Reliability Engineering resources list
  • How they SRE resources list
  • SRE Weekly weekly newsletter devoted to SRE
  • SRE at Google landing page for learning more about SRE in Google
  • Komodor K8s Reliability learning center with resources for SREs working with Kubernetes
