Ensuring product reliability from the outset

~ The role of site reliability engineering in software development ~

When people want to check that their internet connection is working, the natural reflex for most is to load Google. We assume it always operates an uninterrupted service, and Google engineers work hard to sustain that perception. All software products need constant improvement to resolve bugs, enhance the user experience and more, but how do developers achieve this while consistently providing a reliable service? Here Craig Cook, chief engineer at software engineering expert Catapult, explains the importance of adopting a site reliability approach from the outset.

Originating at Google in the early 2000s, site reliability engineering (SRE) is a discipline that takes aspects of software engineering and applies them to infrastructure and operational problems. The goal is to create scalable and reliable software systems that can operate autonomously and withstand the unpredictability inherent in software. Its principles include automating repetitive tasks, measuring continuously, managing system capacity and handling incidents.

The benefits of SRE

If businesses experience frequent outages, slow or poorly performing applications, or recurring bugs, they may benefit from embracing site reliability engineering more fully. SRE helps applications remain reliable during frequent updates. Through continuous monitoring and observability practices, engineers can detect and address issues before they escalate, limiting disruption to the user experience.

Moreover, SRE fosters collaboration between development and operations teams, enabling rapid changes without compromising reliability. This collaborative approach not only enhances the customer experience but also improves operational planning by minimising the impact of incidents and downtime.

Reliability from the outset

Site reliability isn’t just about removing bugs from existing products. Integrating SRE principles from the beginning of the product development lifecycle enables engineers to meet customer needs more effectively.

Designing for reliability ensures the product will be able to handle real-world demand, while automating as much as possible can streamline processes, reducing the burden on operations teams so that they can continue focusing on customer value. A streamlined pipeline can accelerate development cycles, enabling quicker delivery of features and fixes.

Businesses often view these as individual tasks for one dedicated site reliability engineer. However, reframing SRE as an essential skill for any software engineer, rather than a distinct role, ensures its prioritisation and maximises the benefits for businesses.

But isn’t this DevOps?

Some businesses believe that ‘DevOps engineers’ are responsible for tasks related to site reliability. That’s because the approaches of SRE and DevOps are similar: both prioritise automation and collaboration. They differ, however, in focus. DevOps is a high-level philosophy and set of principles that apply at the organisational level to improve collaboration between the different parties involved in building and maintaining software. In fact, it aims to tear down the barriers between those parties and bring them together into one collaborative working group. SRE, on the other hand, is a set of specific practices aimed at improving the reliability of software.

So, when considering how to improve efficiency in product development, engineers should follow both DevOps principles and SRE practices, which can help them develop a diverse skill set to build and maintain reliable, high-quality software infrastructure. However, it’s important to note that while DevOps principles and SRE practices go hand in hand, they are not interchangeable.

Applying SRE

Businesses must ensure that their teams have the necessary skills and expertise to adopt SRE practices. This may involve hiring new team members who are skilled in SRE, or partnering with consultancies to inject the required knowledge into the team. SRE requires more than technical expertise. It is also about understanding customer needs and aligning development accordingly.

Metrics provide valuable insights into the health, performance and reliability of software systems. By taking a metrics-focused approach, engineers can track performance against service level agreements and adapt software to meet customer needs. Alerts triggered by drops in performance or health signal that adjustments are needed. How often these alerts should fire depends on the application and the customer’s perspective.
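As a rough illustration of a metrics-focused check, the sketch below computes availability from request counts and compares it with a target. The function names, the 99.9% target and the sample figures are all assumptions invented for this example, not values from any real agreement.

```python
def availability(successful: int, total: int) -> float:
    """Fraction of requests that succeeded; zero traffic counts as fully available."""
    if total == 0:
        return 1.0
    return successful / total


def meets_target(successful: int, total: int, target: float = 0.999) -> bool:
    """True when measured availability is at or above the (hypothetical) target."""
    return availability(successful, total) >= target


# Hypothetical sample: 100,000 requests, of which 150 failed.
print(meets_target(99_850, 100_000))  # False: 99.85% is below the 99.9% target
```

In practice the counts would come from whatever monitoring system the team already uses, and the target would be taken from the agreement with the customer.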

Consider a service like online banking, where transactions such as transferring funds between accounts require security. For important actions like these, users expect a high level of reliability, so any errors should trigger immediate alerts for engineers to act on. Other operations are prone to human error. When logging into a portal, for example, people often mistype or even forget their password. These errors should be monitored, but do not require a serious alert for each instance.
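One way to picture this distinction is a small severity policy: any error in a critical operation raises an immediate alert, while failed logins are only counted and raise a warning once they pile up. The sketch below is a simplified illustration. The operation names, the threshold of 50 and the message formats are hypothetical.

```python
from collections import Counter

# Hypothetical policy: operations where any single error pages an engineer.
CRITICAL_OPERATIONS = {"transfer_funds"}
# Hypothetical threshold: failed logins only alert once they accumulate.
LOGIN_FAILURE_THRESHOLD = 50


def decide_alerts(error_events: list[str]) -> list[str]:
    """Return alert messages for one window of errors (one operation name per error)."""
    counts = Counter(error_events)
    alerts = []
    for operation, count in sorted(counts.items()):
        if operation in CRITICAL_OPERATIONS:
            alerts.append(f"PAGE: {count} error(s) in {operation}")
        elif operation == "login" and count >= LOGIN_FAILURE_THRESHOLD:
            alerts.append(f"WARN: {count} failed logins in this window")
    return alerts


# Three mistyped passwords and one failed transfer in the same window.
print(decide_alerts(["login"] * 3 + ["transfer_funds"]))
# ['PAGE: 1 error(s) in transfer_funds']
```

The login failures are still counted, so they remain visible in monitoring, but only an unusual volume of them would interrupt an engineer.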

When thinking about product development, considering site reliability engineering should be instinctive — just like we always go to Google when there are online issues. By integrating SRE principles into product development, teams can create customer-centric solutions that are resilient, scalable and equipped to handle real customer needs from the outset. To ensure engineers can deliver this high-quality software, businesses should reframe SRE — seeing it as an essential skill set rather than a distinct role.

Interested in how Catapult can help you integrate site reliability engineering into product development? Contact our software engineers.