Site Reliability Engineer

Our client is a global e-commerce company providing an online platform where businesses can easily create and order customised marketing materials. They're active in multiple international markets across Europe and North America.

They’re looking for a Site Reliability Engineer (SRE) to lead their monitoring and observability efforts. You'll define and improve SLOs and SLIs, guide teams on best practices, and help maintain a stable, reliable platform through modern monitoring solutions.

Key Responsibilities

Lead Monitoring & Observability Strategy: Develop and lead the implementation of the company’s monitoring and observability approach.
Define & Maintain SLOs/SLIs: Set, implement, and manage Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for critical services.
Mentor Product Managers & Engineering Leads: Guide teams on the definition and optimisation of SLOs/SLIs.
Collaborate Across Teams: Work closely with engineering, product, quality, and monitoring teams to manage incidents and maintain system health.
Set Up Monitoring Tools: Configure and manage tools like Datadog, Cloudflare, and Azure Cloud to monitor platform performance.
Improve Incident Management: Continuously improve processes to identify and resolve performance bottlenecks.
Optimise CI/CD Processes: Enhance CI/CD pipelines for better performance, reliability, and incident prevention.
Integrate Observability in Testing: Collaborate with QA teams to incorporate observability into testing processes for early issue detection.
Ensure High Availability & Security: Implement best practices to maintain high availability, performance, and security across the infrastructure.
Evolve SRE Practices: Drive the evolution of SRE practices and foster a culture of observability within the team.

What You Bring

Site Reliability Engineering Experience: Mid-level to senior experience in an SRE role, with a solid background as a developer.
E-commerce Experience: Experience working on high-traffic, customer-facing platforms such as e-commerce.
Monitoring & Observability Expertise: Strong experience with monitoring tools, observability frameworks, and related technologies.
Experience with Datadog or Similar Tools: Hands-on experience with Datadog or similar monitoring tools.
Cloud Experience: Experience working in a cloud-focused environment (e.g., Azure or similar).
Scripting Proficiency: Proficient in scripting for automation and system management.
SLO/SLI Implementation: Proven experience defining and implementing SLOs and SLIs for large-scale systems.
Incident Management & Collaboration: Deep understanding of incident management and effective collaboration with engineering teams.
Passion for System Reliability: Monitoring-focused and passionate about enhancing system reliability and visibility.
Mentorship Experience: Previous experience in mentoring and guiding teams on observability best practices.

Why Apply Now?

Don’t miss the opportunity to make a significant impact in a dynamic environment. This role allows you to mentor teams, implement best practices, and drive system improvements. Enjoy a flexible 4-day workweek and 100% remote work (Portugal-based).

Are you ready to take the next step in your career? Send your CV to ari.kilab@robertwalters.com
