What is SRE?

Site Reliability Engineering (SRE) is a discipline that incorporates aspects of software engineering and applies them to infrastructure and operations problems. The main goals are to create scalable and highly reliable software systems. In general, an SRE team is responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of their service(s).

SRE Principles

  1. Define Service Levels
    The Service Level Indicator (SLI), Service Level Objective (SLO), and Service Level Agreement (SLA) are the parameters by which the reliability, availability, and performance of a service are measured.
  2. Error Budgets
    An error budget is 1 minus the SLO of the service. A service with a 99.9% SLO has a 0.1% error budget. If the service receives 1,000,000 requests in four weeks, a 99.9% availability SLO gives a budget of 1,000 errors over that period.
  3. Eliminate Toil
    Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows. An SRE's job is to eliminate as much toil as possible through automation.
  4. Automate Everything
    Automation built by the SRE team provides:
       – Consistency as systems scale
       – A platform for extending to other systems
       – Faster repairs for common problems
       – Faster action than humans
       – Time savings by decoupling the operator from the operation
  5. Support Releases
    Running reliable services requires reliable release processes.
    Continuously build and deploy, including
    – Automating check gates
    – A/B deployments and other methods for checking sanity
    SREs are not afraid to roll back a problematic release.
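
The error budget arithmetic from principle 2 can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation. The function names (error_budget, allowed_errors, budget_remaining, release_gate_open) are hypothetical. The idea of using the remaining budget as an automated check gate is an assumption that connects principles 2 and 5; the text above does not prescribe it. The SLO is passed as a decimal string so the arithmetic stays exact.

```python
from fractions import Fraction


def error_budget(slo: str) -> Fraction:
    """Return the error budget (1 - SLO) as an exact fraction.

    The SLO is given as a decimal string such as "0.999" so that the
    result is not affected by floating-point rounding.
    """
    target = Fraction(slo)
    if not 0 < target <= 1:
        raise ValueError("SLO must be greater than 0 and at most 1")
    return 1 - target


def allowed_errors(slo: str, total_requests: int) -> int:
    """Number of failed requests the budget permits, rounded down."""
    return int(error_budget(slo) * total_requests)


def budget_remaining(slo: str, total_requests: int, failed_requests: int) -> int:
    """Errors still available in the period (negative once overspent)."""
    return allowed_errors(slo, total_requests) - failed_requests


def release_gate_open(slo: str, total_requests: int, failed_requests: int) -> bool:
    """Hypothetical check gate (an assumption): allow a release only
    while some error budget is left for the period."""
    return budget_remaining(slo, total_requests, failed_requests) > 0


if __name__ == "__main__":
    # The example from the text: 1,000,000 requests at a 99.9% SLO.
    print(allowed_errors("0.999", 1_000_000))         # 1000
    print(budget_remaining("0.999", 1_000_000, 400))  # 600
```

With the figures from the text, a 99.9% SLO over 1,000,000 requests permits 1,000 failed requests. After 400 failures, 600 remain, and the hypothetical gate stays open until the remaining budget reaches zero.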