Site Reliability Engineer
Our client is a global e-commerce company providing an online platform where businesses can easily create and order customised marketing materials. They're active in multiple international markets across Europe and North America.
They’re looking for a Site Reliability Engineer (SRE) to lead their monitoring and observability efforts. You'll define and improve SLOs and SLIs, guide teams on best practices, and help maintain a stable, reliable platform through modern monitoring solutions.
Key Responsibilities
- Lead Monitoring & Observability Strategy: Develop and lead the implementation of the company’s monitoring and observability approach.
- Define & Maintain SLOs/SLIs: Set, implement, and manage Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for critical services.
- Mentor Product Managers & Engineering Leads: Guide teams on the definition and optimisation of SLOs/SLIs.
- Collaborate Across Teams: Work closely with engineering, product, quality, and monitoring teams to manage incidents and maintain system health.
- Set Up Monitoring Tools: Configure and manage tools such as Datadog, Cloudflare, and Microsoft Azure to monitor platform performance.
- Improve Incident Management: Continuously improve the processes used to detect and resolve incidents and performance bottlenecks.
- Optimise CI/CD Processes: Enhance CI/CD pipelines for better performance, reliability, and incident prevention.
- Integrate Observability in Testing: Collaborate with QA teams to incorporate observability into testing processes for early issue detection.
- Ensure High Availability & Security: Implement best practices to maintain high availability, performance, and security across the infrastructure.
- Evolve SRE Practices: Drive the evolution of SRE practices and foster a culture of observability within the team.
What You Bring
- Site Reliability Engineering Experience: Mid-level to senior experience in an SRE role, backed by a solid background in software development.
- E-commerce Experience: Experience working on high-traffic, customer-facing platforms such as e-commerce.
- Monitoring & Observability Expertise: Strong experience with monitoring tools, observability frameworks, and related technologies.
- Experience with Datadog or Similar Tools: Hands-on experience with Datadog or similar monitoring tools.
- Cloud Experience: Experience working in a cloud-focused environment, ideally Azure or a comparable platform.
- Scripting Proficiency: Proficient in scripting for automation and system management.
- SLO/SLI Implementation: Proven experience defining and implementing SLOs and SLIs for large-scale systems.
- Incident Management & Collaboration: Deep understanding of incident management and effective collaboration with engineering teams.
- Passion for System Reliability: Monitoring-focused and passionate about enhancing system reliability and visibility.
- Mentorship Experience: Previous experience in mentoring and guiding teams on observability best practices.
Why Apply Now?
Don’t miss the opportunity to make a significant impact in a dynamic environment. This role allows you to mentor teams, implement best practices, and drive system improvements. Enjoy a flexible 4-day workweek and 100% remote work (Portugal-based).
Are you ready to take the next step in your career? Send your CV to ari.kilab@robertwalters.com
About the job
Contract type: Permanent
Specialism: Information Technology
Area: DevOps and Cloud
Industry: Marketing
Salary: Negotiable
Work type: Remote
Experience level: Manager
Location: Lisbon
Employment type: Full-time
Job reference: 7RHWBR-024BCFDD
Date posted: 8 May 2025
Consultant: Ari Kilab