What is SRE?

Site Reliability Engineering (SRE) is a discipline that incorporates aspects of software engineering and applies them to infrastructure and operations problems. The main goals are to create scalable and highly reliable software systems. In general, an SRE team is responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of their service(s).

SRE Principles

  1. Define Service Levels
    Service Level Indicators (SLIs), Service Level Objectives (SLOs) and Service Level Agreements (SLAs) are the measures by which the reliability, availability and performance of a service are judged.
  2. Error Budgets
    An error budget is 1 minus the SLO of the service. A service with a 99.9% SLO has a 0.1% error budget. If our service receives 1,000,000 requests in four weeks, a 99.9% availability SLO gives us a budget of 1,000 errors over that period.
  3. Eliminate Toil
    Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows. The SRE's job is to eliminate as much toil as possible through automation.
  4. Automate Everything
    Automation by the SRE team provides:
       – Consistency as systems scale
       – A platform for extending to other systems
       – Faster repairs for common problems
       – Faster action than humans
       – Time savings by decoupling operator from
  5. Support Releases
    Running reliable services requires reliable release processes.
    Continuously build and deploy, including
    – Automating check gates
    – A/B deployments and other methods for checking sanity
    SREs are not afraid to roll back a problematic release.
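
To make the error budget arithmetic from principle 2 concrete, here is a minimal Python sketch. The `ErrorBudget` class and its field names are my own illustration, not part of any SRE tooling, and it assumes a simple request-based availability SLI (successful requests divided by total requests).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Error budget for a request-based availability SLO (hypothetical helper)."""

    slo: float            # target success ratio, e.g. 0.999 for 99.9%
    total_requests: int   # requests served in the SLO window
    failed_requests: int  # requests that counted as errors

    @property
    def sli(self) -> float:
        """Measured availability: the fraction of requests that succeeded."""
        if self.total_requests == 0:
            return 1.0
        return 1 - self.failed_requests / self.total_requests

    @property
    def allowed_failures(self) -> int:
        """Budget in requests: (1 - SLO) * total, rounded to the nearest request."""
        return round((1 - self.slo) * self.total_requests)

    @property
    def remaining(self) -> int:
        """Failures still allowed in this window (negative once overspent)."""
        return self.allowed_failures - self.failed_requests


# The example from the text: 1,000,000 requests in four weeks at a 99.9% SLO.
budget = ErrorBudget(slo=0.999, total_requests=1_000_000, failed_requests=250)
print(budget.allowed_failures)  # 1000
print(budget.remaining)         # 750
```

The 250 failed requests are an invented figure, used only to show how the remaining budget shrinks as errors accumulate.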

DevOps to SRE Transformation

It is not as easy as you might think. Let me explain why.

A common misconception is that SRE = DevOps. They are NOT equal, and SRE includes many things that DevOps does not cover. For example, DevOps focuses more on deployment velocity and application uptime, while SRE focuses on SLOs and error budgets. DevOps does not claim any authority over deployments, nor does it influence deployment velocity, whereas SRE can STOP deployments of the application when the error budget is exhausted. So SRE has real authority in the SDLC process, and it can also influence the business owner's view.
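
As a rough sketch of what that authority can look like in practice, a release pipeline could ask a small gate function before every deployment. The function name `can_deploy` and the exact policy (block once failures reach the budget) are assumptions for illustration; real teams usually agree on their own error budget policy.

```python
def can_deploy(slo: float, total_requests: int, failed_requests: int) -> bool:
    """Return True while the service still has error budget left.

    Hypothetical policy: releases are blocked as soon as the number of
    failed requests in the SLO window reaches the allowed number.
    """
    allowed_failures = round((1 - slo) * total_requests)
    return failed_requests < allowed_failures


# 99.9% SLO over 1,000,000 requests allows 1,000 failures.
print(can_deploy(0.999, 1_000_000, 400))    # True: budget left, release can go out
print(can_deploy(0.999, 1_000_000, 1_500))  # False: budget spent, releases stop
```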


There is a saying: "class SRE implements DevOps". If we take SRE as one big function....

SRE(DevOps) {
         ci();
         cd(); 
         mon();
         testing(); 
         ....
}
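
For readers who prefer something runnable, the same idea can be sketched in Python with an abstract base class. The method names follow the pseudocode above; the returned strings are placeholders I made up for illustration.

```python
from abc import ABC, abstractmethod


class DevOps(ABC):
    """The practices, stated as an interface."""

    @abstractmethod
    def ci(self) -> str: ...

    @abstractmethod
    def cd(self) -> str: ...

    @abstractmethod
    def mon(self) -> str: ...

    @abstractmethod
    def testing(self) -> str: ...


class SRE(DevOps):
    """One concrete implementation of those practices (illustrative only)."""

    def ci(self) -> str:
        return "build every change automatically"

    def cd(self) -> str:
        return "deploy through automated check gates"

    def mon(self) -> str:
        return "track SLIs against SLOs"

    def testing(self) -> str:
        return "run checks before and after each release"


team = SRE()
print(isinstance(team, DevOps))  # True
```

Trying to instantiate `DevOps()` directly raises a `TypeError`, which matches the spirit of the saying: DevOps describes the practices, and SRE is one way to carry them out.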

I would assume that most organizations moving to SRE are already practicing DevOps and want to increase their application or product uptime and focus on SLOs.

DevOps to SRE
1. Not many changes are required, so it is easy to get started on your SRE journey. DevOps is mainly focused on CI/CD, automation, and monitoring apps. With these in place, a DevOps team can adopt the SRE culture fairly easily by implementing the additional SRE controls. It is still a big change, but I would say this is where to start.

2. You can practice and adopt the SRE approach as an experiment in your own environment (product) at a low cost. As I mentioned above, you can start with the DevOps controls you already have, so practicing the SRE controls on top of them adds little cost.

3. Full-stack to SRE journey. Small and medium-sized companies often have a limited number of DevOps engineers who follow the full-stack engineering model. In that case, implementing SRE is a 5-step process.

[Figure: Fullstack to SRE Journey]


4. No knowledge or coverage gaps between SRE and DevOps teams. DevOps acts as the glue between teams that build solutions that depend on each other or consist of distinct pieces of software. So moving from DevOps to SRE is not going to be challenging.

[Figure: DevOps to SRE Model]

Again, this transformation depends on how teams collaborate with each other and how fast they can adapt to change. There are a few good books available for learning the SRE approach. But in my view, no textbook or theory can account for the specific team structure that you have. Understanding your current state is the starting point for the SRE journey.

Here are some SRE books:
Site Reliability Engineering: How Google Runs Production Systems (known as “The SRE Book”)
The Site Reliability Workbook: Practical Ways to Implement SRE (known as “The SRE Workbook”)
Seeking SRE: Conversations About Running Production Systems at Scale

Let me know your thoughts in the comments.