Senior Site Reliability Engineer for the Datacraft team
Job Description
- We're taking autonomous search mainstream, making product discovery more intuitive and conversational for customers, and more profitable for businesses.
- We’re making conversational shopping a reality, connecting every shopper with tailored guidance and product expertise — available on demand, at every touchpoint in their journey.
- We're designing the future of autonomous marketing, taking the work out of workflows, and reclaiming the creative, strategic, and customer-first work marketers were always meant to do.
Become a Senior SRE at Bloomreach!
Join the newly formed Datacraft team — the team building the next-generation data platform for Bloomreach Engagement. Datacraft owns three interconnected domains:
- Data Warehouses (~60%) — making Bloomreach data first-class in customer DWHs (Snowflake, BigQuery, Databricks). The strategic goal for 2026–27 is to use DWHs to exponentially accelerate data adoption.
- Loomi Analytics Agent (~20%) — evolving Loomi Analytics into an agentic analytics assistant that can explore data across systems, explain insights, and act on them.
- Dashboards & Analytics Stack (~20%) — moving Engagement reporting onto DWH-backed, modern analytics stacks (semantic layers, headless BI tools).
As a Senior SRE, you will be the reliability backbone of this AI-first data team. Your work will directly shape the deployment, reliability, and observability of the pipelines and services that hundreds of enterprise customers depend on — from data exports into Databricks and BigQuery to the infrastructure the Loomi Analytics Agent uses to surface insights.
Datacraft is an AI-first team. We believe code is a commodity and expect every engineer to fluently use coding agents (e.g., Cursor, Claude Code, Copilot, Gemini CLI) as a core part of their daily workflow. The ability to leverage AI tooling to accelerate development, prototyping, and problem-solving is not optional — it's foundational.
For candidates at the P3 / Senior SRE level, starting monthly compensation begins at 3 800 € gross, with the final offer tailored to each candidate based on their skills and experience. Stock options and a comprehensive benefits package are also included. Working in one of our Central European offices (Bratislava, Prague, Brno) or from home on a full-time basis, you'll become a core part of the Engineering team.
What challenge awaits you?
As a P3 (Senior) SRE at Bloomreach, you are an independent professional — expert in reliability engineering, able to decompose objectives into actionable infrastructure improvements, and lead initiatives end-to-end with minimal day-to-day guidance.
We need you to build and operate an ecosystem where data engineers can safely and efficiently develop, debug, and operate data-intensive jobs and services — spanning Kafka ingest pipelines, Iceberg data lakes, multi-DWH exports, Databricks deployment and orchestration (Airflow / Cloud Composer), and agentic AI workloads.
Your responsibilities
a. Platform reliability & observability
- Build and maintain the reliability ecosystem where engineers can safely develop, debug, and operate Datacraft services running on GCP and Kubernetes (Dataproc, Cloud Composer, BigQuery, Snowflake/Databricks connectors).
- Ensure end-to-end observability across the full data platform — from Kafka ingest through GCS/Iceberg staging and Airflow orchestration to the Databricks and BigQuery destinations — enabling the team to catch missing loads, SLA breaches, data drift, and cost anomalies before customers notice (see the freshness-check sketch after this list).
- Drive scalability so services can scale vertically and horizontally based on operational and telemetry data (OpenTelemetry, Prometheus, VictoriaMetrics).
- Maintain team health dashboards and alerting (Grafana, PagerDuty, Sentry).
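To give a concrete flavor of this work, here is a minimal sketch of the kind of freshness check that could back a "missing load" alert. The metric name, the 2-hour SLA, the metadata stub, and the Pushgateway address are illustrative assumptions, not Bloomreach internals.

```python
"""Illustrative freshness check behind a "missing load" alert.

Assumptions (not Bloomreach internals): a load-audit lookup, a 2-hour
freshness SLA, and a Prometheus Pushgateway at pushgateway:9091.
"""
from datetime import datetime, timedelta, timezone

from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

FRESHNESS_SLA_SECONDS = 2 * 60 * 60  # hypothetical 2h SLA per destination table


def last_load_finished_at(table: str) -> datetime:
    # Stub for the sketch; in practice this would query pipeline
    # metadata (e.g. an Airflow task instance or a load-audit table).
    return datetime.now(timezone.utc) - timedelta(hours=3)


def report_freshness(table: str) -> None:
    lag = (datetime.now(timezone.utc) - last_load_finished_at(table)).total_seconds()

    registry = CollectorRegistry()
    gauge = Gauge(
        "dwh_export_freshness_lag_seconds",
        "Seconds since the last successful load of a destination table",
        ["table"],
        registry=registry,
    )
    gauge.labels(table=table).set(lag)
    # An alert rule on this gauge (lag > SLA) pages the team before a
    # customer notices the missing load.
    push_to_gateway("pushgateway:9091", job="freshness_check", registry=registry)


if __name__ == "__main__":
    report_freshness("engagement_events")  # hypothetical destination table
```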
b. Infrastructure as Code & deployments
- Own and evolve Terraform-based infrastructure for Datacraft services.
- Automate deployments, instance setup, and operational runbooks to eliminate manual/semi-manual steps.
- Maintain CI/CD pipelines (GitLab) with linters, security scans, code quality checks, and AI code reviews, enabling engineers to produce high-quality MRs.
c. Security & compliance
- Help the team fulfill security requirements for ISO and SOC 2 audits by enforcing security principles: key distribution and rotation, service-level authentication and authorization, encryption of data in transit, data isolation, resource limits, and audit logging.
- Ensure data access controls are properly enforced across multi-DWH environments (BigQuery, Snowflake, Databricks).
d. Incident management & L3 support
- Participate in and drive the L3 on-call rotation and incident resolution for Datacraft services.
- Contribute tooling for debugging, troubleshooting, and performance testing of data pipelines and orchestration layers.
- Use telemetry data and distributed tracing to navigate complex, distributed service architectures (a minimal tracing sketch follows this list).
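For illustration, here is a minimal OpenTelemetry sketch of how a pipeline step might be instrumented so on-call engineers can follow an export through its stages. The tracer, span, and attribute names are assumptions for the sketch; a real deployment would export to an OTLP collector rather than the console.

```python
"""Illustrative OpenTelemetry instrumentation of a DWH export step.

Span and attribute names are assumptions for the sketch; a real
deployment would use an OTLP exporter instead of the console one.
"""
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(ConsoleSpanExporter())
)
tracer = trace.get_tracer("datacraft.export")  # hypothetical tracer name


def export_batch(batch_id: str, destination: str) -> None:
    # One parent span per export; attributes let on-call filter traces
    # by destination or batch when triaging an incident.
    with tracer.start_as_current_span("dwh_export") as span:
        span.set_attribute("export.batch_id", batch_id)
        span.set_attribute("export.destination", destination)
        with tracer.start_as_current_span("stage_to_gcs"):
            pass  # staging work (GCS/Iceberg) would happen here
        with tracer.start_as_current_span("load_to_destination"):
            pass  # e.g. a BigQuery load job or a Databricks COPY INTO


if __name__ == "__main__":
    export_batch("2025-06-01T12:00", "bigquery")
```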
e. Agentic platform reliability
- Ensure reliability and observability of the Loomi Analytics Agent data infrastructure — LLM API gateway performance, MCP server health, and evaluation pipeline availability.
- Monitor and alert on data quality issues that could introduce inconsistencies or hallucinations into Loomi's responses — making the agent's data access patterns reliable and debuggable (a simple probe sketch follows below).
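For illustration, here is a minimal black-box probe against a gateway health endpoint. The URL and latency budget are hypothetical, and a production probe would export metrics and alert on error rate and p95 latency against an SLO rather than print single results.

```python
"""Illustrative black-box probe for an LLM gateway health endpoint.

The URL and the 2-second latency budget are hypothetical; a production
probe would export metrics instead of printing.
"""
import time
import urllib.request

GATEWAY_HEALTH_URL = "http://llm-gateway.internal/healthz"  # hypothetical
LATENCY_BUDGET_S = 2.0  # hypothetical per-probe budget


def probe() -> None:
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(GATEWAY_HEALTH_URL, timeout=10) as resp:
            ok = resp.status == 200
    except OSError:
        ok = False
    latency = time.perf_counter() - start
    # Single probes are noisy: alerting should aggregate them into an
    # error rate and a p95 latency measured against the SLO.
    print(f"ok={ok} latency={latency:.3f}s within_budget={latency <= LATENCY_BUDGET_S}")


if __name__ == "__main__":
    probe()
```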
Our tech stack
- Languages: Python (primary), Go, SQL
- Messaging & streaming: Apache Kafka
- Storage & databases: Databricks, BigQuery, Apache Iceberg, GCS, MongoDB, Redis
- Data processing & orchestration: Apache Spark, Dataflow, Airflow / Cloud Composer
- Infrastructure: GCP, Kubernetes, Terraform
- AI / Agentic: LLM APIs, MCP, agent orchestration frameworks
- Observability: Grafana, Prometheus, VictoriaMetrics, PagerDuty, Sentry, OpenTelemetry
- CI/CD & tooling: GitLab, Jira, Confluence
- AI coding agents: Cursor, Claude Code
Your qualifications
Professional experience
Impact
- You can articulate how your contributions transformed the way engineers work and fostered a strong SRE/DevOps culture.
- You can demonstrate how impactful reliability work connects to business success and customer outcomes.
Ownership
- You embrace the "you build it, you run it" principle — you love owning what you ship.
- You are cost-aware: effective vertical and horizontal autoscaling and detailed telemetry are how you keep cloud spend in check.
Systematic approach
- You believe Infrastructure as Code is what brings stability to chaos.
- You design for failure: SLOs, error budgets, and runbooks are first-class artifacts, not afterthoughts (a worked error-budget example follows this list).
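For context, the arithmetic behind an error budget is simple. Here is a minimal worked example, assuming an illustrative 99.9% availability SLO over a 30-day window:

```python
"""Worked error-budget arithmetic for an illustrative 99.9% SLO."""

SLO = 0.999                    # illustrative availability target
WINDOW_MINUTES = 30 * 24 * 60  # 30-day rolling window

budget_minutes = (1 - SLO) * WINDOW_MINUTES
print(f"Error budget: {budget_minutes:.1f} minutes per 30 days")
# -> Error budget: 43.2 minutes per 30 days

# Burn-rate view: failing 1% of requests burns the budget at 10x the
# sustainable rate (0.01 / 0.001), exhausting it in roughly 3 days.
burn_rate = 0.01 / (1 - SLO)
print(f"Burn rate at 1% failures: {burn_rate:.0f}x")
```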
Data-driven
- You use telemetry and metrics to give engineers actionable feedback on how applications and services behave.
- You can navigate complex data platform architectures using distributed tracing and debugging.
Technical skills
- Solid hands-on experience with GCP (BigQuery, Dataproc, Cloud Composer, GCS) and Kubernetes.
- Experience with Python; Go is a strong advantage.
- Familiarity with data pipeline technologies (Kafka, Airflow/Cloud Composer, Spark, Iceberg) — you don't need to write ETL code, but you need to operate it reliably and know when something is wrong.
- Fluent use of AI coding agents (Cursor, Claude Code, Copilot, Gemini CLI, or similar) — you already use these tools daily to accelerate work.
- Comfortable with on-call rotation and 24/7 incident response.
- Remote-first mindset — you know how to be effective in distributed teams.
- You are able to learn and adapt — essential when exploring new tech or navigating our growing codebase.
Strongly preferred
- Experience operating at least one DWH environment (Snowflake, Databricks, or BigQuery).
- Familiarity with agentic/LLM workloads — API reliability, latency SLOs, trace observability for AI systems.
- Experience with open table formats (Iceberg, Delta Lake) in production environments.
- Exposure to data security and compliance in the context of customer-facing DWH integrations (consent, data retention, PII handling).
Personal qualities
- Ownership & accountability — you take issues from detection through to resolution and follow-up prevention.
- Systematic thinking — you identify root causes, not symptoms, and document your findings so the team learns.
- Collaboration & communication — you explain trade-offs and constraints clearly to both engineers and non-engineers.
- Bias for reliability — operational excellence (SLOs, on-call friendliness, proactive alerting) is not a chore; it's your craft.
- Continuous improvement mindset — you are comfortable iterating, revisiting assumptions, and improving incrementally.
- Comfortable operating remote-first in a distributed team across Central Europe.
Your success story
In 30 days:
- Get to know the Datacraft team, the company, and the most important processes.
- Set up your local and GCP development environment and complete the Engagement engineering onboarding.
- Understand the current state of Datacraft services: pipelines, orchestration, observability gaps, and on-call runbooks.
In 90 days:
- Start contributing to the L3 on-call rotation, handling incidents, troubleshooting, and debugging — which will sharpen your understanding of the platform and surface fresh improvement ideas.
- Deliver your first meaningful reliability improvement: an observability enhancement, a piece of deployment automation, or an SLO definition for a key Datacraft service.
In 180 days:
- Own the reliability posture of at least one Datacraft domain end-to-end — able to independently design, operate, and continuously improve it.
- Drive measurable improvements in MTTR, alert signal-to-noise ratio, or deployment confidence across the team.
- Be a trusted reliability partner in architecture discussions — your input shapes how new Datacraft services are designed.