Senior Site Reliability Engineer for the Datacraft team
Job Description
- We're taking autonomous search mainstream, making product discovery more intuitive and conversational for customers, and more profitable for businesses.
- We’re making conversational shopping a reality, connecting every shopper with tailored guidance and product expertise — available on demand, at every touchpoint in their journey.
- We're designing the future of autonomous marketing, taking the work out of workflows, and reclaiming the creative, strategic, and customer-first work marketers were always meant to do.
Become a Senior SRE at Bloomreach!
Join the newly formed Datacraft team — the team building the next-generation data platform for Bloomreach Engagement. Datacraft owns three interconnected domains:
- Data Warehouses (~60%) — making Bloomreach data first-class in customer DWHs (Snowflake, BigQuery, Databricks). The strategic goal for 2026–27 is to use DWHs to exponentially accelerate data adoption.
- Loomi Analytics Agent (~20%) — evolving Loomi Analytics into an agentic analytics assistant that can explore data across systems, explain insights, and act on them.
- Dashboards & Analytics Stack (~20%) — moving Engagement reporting onto DWH-backed, modern analytics stacks (semantic layers, headless BI tools).
As a Senior SRE, you will be the reliability backbone of this AI-first data team. Your work will directly shape the deployment, reliability, and observability of the pipelines and services that hundreds of enterprise customers depend on — from data exports into Databricks and BigQuery to the infrastructure the Loomi Analytics Agent uses to surface insights.
Datacraft is an AI-first team. We believe code is a commodity and expect every engineer to fluently use coding agents (e.g., Cursor, Claude Code, Copilot, Gemini CLI) as a core part of their daily workflow. The ability to leverage AI tooling to accelerate development, prototyping, and problem-solving is not optional — it's foundational.
For candidates at the P3 / Senior SRE level, starting monthly compensation begins at 3 800 € gross, with the final offer tailored to each candidate based on their skills and experience. Stock options and a comprehensive benefits package are also included. Working in one of our Central European offices (Bratislava, Prague, Brno) or from home on a full-time basis, you'll become a core part of the Engineering team.
What challenge awaits you?
As a P3 (Senior) SRE at Bloomreach, you are an independent professional — expert in reliability engineering, able to decompose objectives into actionable infrastructure improvements, and lead initiatives end-to-end with minimal day-to-day guidance.
We need you to build and operate an ecosystem where data engineers can safely and efficiently develop, debug, and operate data-intensive jobs and services — spanning Kafka ingest pipelines, Iceberg data lakes, multi-DWH exports, Databricks deployment and orchestration (Airflow / Cloud Composer), and agentic AI workloads.
Your responsibilities
a. Platform reliability & observability
- Build and maintain the reliability ecosystem where engineers can safely develop, debug, and operate Datacraft services running on GCP and Kubernetes (Dataproc, Cloud Composer, BigQuery, Snowflake/Databricks connectors).
- Ensure end-to-end observability across the full data platform — from Kafka ingest through GCS/Iceberg staging and Airflow orchestration to the Databricks and BigQuery destinations — enabling the team to catch missing loads, SLA breaches, data drift, and cost anomalies before customers notice (see the freshness-check sketch after this list).
- Drive scalability so services can scale vertically and horizontally based on operational and telemetry data (OpenTelemetry, Prometheus, VictoriaMetrics).
- Maintain team health dashboards and alerting (Grafana, PagerDuty, Sentry).
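To give a concrete flavor of this work, here is a minimal sketch of the kind of freshness check that could back a "missing load" alert. The metric name, the 2-hour SLA, the metadata stub, and the Pushgateway address are illustrative assumptions, not Bloomreach internals.

```python
"""Illustrative freshness check behind a "missing load" alert.

Assumptions (not Bloomreach internals): a load-audit lookup, a 2-hour
freshness SLA, and a Prometheus Pushgateway at pushgateway:9091.
"""
from datetime import datetime, timedelta, timezone

from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

FRESHNESS_SLA_SECONDS = 2 * 60 * 60  # hypothetical 2h SLA per destination table


def last_load_finished_at(table: str) -> datetime:
    # Stub for the sketch; in practice this would query pipeline
    # metadata (e.g. an Airflow task instance or a load-audit table).
    return datetime.now(timezone.utc) - timedelta(hours=3)


def report_freshness(table: str) -> None:
    lag = (datetime.now(timezone.utc) - last_load_finished_at(table)).total_seconds()

    registry = CollectorRegistry()
    gauge = Gauge(
        "dwh_export_freshness_lag_seconds",
        "Seconds since the last successful load of a destination table",
        ["table"],
        registry=registry,
    )
    gauge.labels(table=table).set(lag)
    # An alert rule on this gauge (lag > SLA) pages the team before a
    # customer notices the missing load.
    push_to_gateway("pushgateway:9091", job="freshness_check", registry=registry)


if __name__ == "__main__":
    report_freshness("engagement_events")  # hypothetical destination table
```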
b. Infrastructure as Code & deployments
- Own and evolve Terraform-based infrastructure for Datacraft services.
- Automate deployments, instance setup, and operational runbooks to eliminate manual/semi-manual steps.
- Maintain CI/CD pipelines (GitLab) with linters, security scans, code quality checks, and AI code reviews, enabling engineers to produce high-quality MRs.
c. Security & compliance
- Help the team fulfill security requirements for ISO and SOC 2 audits by enforcing security principles: key distribution and rotation, service-level authentication and authorization, encryption of data in transit, data isolation, resource limits, and audit logging.
- Ensure data access controls are properly enforced across multi-DWH environments (BigQuery, Snowflake, Databricks).
d. Incident management & L3 support
- Participate in and drive the L3 on-call rotation and incident resolution for Datacraft services.
- Contribute tooling for debugging, troubleshooting, and performance testing of data pipelines and orchestration layers.
- Use telemetry data and distributed tracing to navigate complex, distributed service architectures (a minimal tracing sketch follows this list).
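For illustration, here is a minimal OpenTelemetry sketch of how a pipeline step might be instrumented so on-call engineers can follow an export through its stages. The tracer, span, and attribute names are assumptions for the sketch; a real deployment would export to an OTLP collector rather than the console.

```python
"""Illustrative OpenTelemetry instrumentation of a DWH export step.

Span and attribute names are assumptions for the sketch; a real
deployment would use an OTLP exporter instead of the console one.
"""
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(ConsoleSpanExporter())
)
tracer = trace.get_tracer("datacraft.export")  # hypothetical tracer name


def export_batch(batch_id: str, destination: str) -> None:
    # One parent span per export; attributes let on-call filter traces
    # by destination or batch when triaging an incident.
    with tracer.start_as_current_span("dwh_export") as span:
        span.set_attribute("export.batch_id", batch_id)
        span.set_attribute("export.destination", destination)
        with tracer.start_as_current_span("stage_to_gcs"):
            pass  # staging work (GCS/Iceberg) would happen here
        with tracer.start_as_current_span("load_to_destination"):
            pass  # e.g. a BigQuery load job or a Databricks COPY INTO


if __name__ == "__main__":
    export_batch("2025-06-01T12:00", "bigquery")
```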
e. Agentic platform reliability
- Ensure reliability and observability of the Loomi Analytics Agent data infrastructure — LLM API gateway performance, MCP server health, and evaluation pipeline availability.
- Monitor and alert on data quality issues that could introduce inconsistencies or hallucinations into Loomi's responses — making the agent's data access patterns reliable and debuggable (a simple probe sketch follows below).
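For illustration, here is a minimal black-box probe against a gateway health endpoint. The URL and latency budget are hypothetical, and a production probe would export metrics and alert on error rate and p95 latency against an SLO rather than print single results.

```python
"""Illustrative black-box probe for an LLM gateway health endpoint.

The URL and the 2-second latency budget are hypothetical; a production
probe would export metrics instead of printing.
"""
import time
import urllib.request

GATEWAY_HEALTH_URL = "http://llm-gateway.internal/healthz"  # hypothetical
LATENCY_BUDGET_S = 2.0  # hypothetical per-probe budget


def probe() -> None:
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(GATEWAY_HEALTH_URL, timeout=10) as resp:
            ok = resp.status == 200
    except OSError:
        ok = False
    latency = time.perf_counter() - start
    # Single probes are noisy: alerting should aggregate them into an
    # error rate and a p95 latency measured against the SLO.
    print(f"ok={ok} latency={latency:.3f}s within_budget={latency <= LATENCY_BUDGET_S}")


if __name__ == "__main__":
    probe()
```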
Our tech stack
- Languages: Python (primary), Go, SQL
- Messaging & streaming: Apache Kafka
- Storage & databases: Databricks, BigQuery, Apache Iceberg, GCS, MongoDB, Redis
- Data processing & orchestration: Apache Spark, Dataflow, Airflow / Cloud Composer
- Infrastructure: GCP, Kubernetes, Terraform
- AI / Agentic: LLM APIs, MCP, agent orchestration frameworks
- Observability: Grafana, Prometheus, VictoriaMetrics, PagerDuty, Sentry, OpenTelemetry
- CI/CD & tooling: GitLab, Jira, Confluence
- AI coding agents: Cursor, Claude Code
Your qualifications
Professional experience
Impact
- You can articulate how your contributions transformed the way engineers work and fostered a strong SRE/DevOps culture.
- You can demonstrate how impactful reliability work connects to business success and customer outcomes.
Ownership
- You embrace the "you build it, you run it" principle — you love owning what you ship.
- You are cost-aware: effective vertical and horizontal autoscaling and detailed telemetry are how you keep cloud spend in check.
Systematic approach
- You believe Infrastructure as Code is what brings stability to chaos.
- You design for failure: SLOs, error budgets, and runbooks are first-class artifacts, not afterthoughts (a worked error-budget example follows this list).
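For context, the arithmetic behind an error budget is simple. Here is a minimal worked example, assuming an illustrative 99.9% availability SLO over a 30-day window:

```python
"""Worked error-budget arithmetic for an illustrative 99.9% SLO."""

SLO = 0.999                    # illustrative availability target
WINDOW_MINUTES = 30 * 24 * 60  # 30-day rolling window

budget_minutes = (1 - SLO) * WINDOW_MINUTES
print(f"Error budget: {budget_minutes:.1f} minutes per 30 days")
# -> Error budget: 43.2 minutes per 30 days

# Burn-rate view: failing 1% of requests burns the budget at 10x the
# sustainable rate (0.01 / 0.001), exhausting it in roughly 3 days.
burn_rate = 0.01 / (1 - SLO)
print(f"Burn rate at 1% failures: {burn_rate:.0f}x")
```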
Data-driven
- You use telemetry and metrics to give engineers actionable feedback on how applications and services behave.
- You can navigate complex data platform architectures using distributed tracing and debugging.
Technical skills
- Solid hands-on experience with GCP (BigQuery, Dataproc, Cloud Composer, GCS) and Kubernetes.
- Experience with Python; Go is a strong advantage.
- Familiarity with data pipeline technologies (Kafka, Airflow/Cloud Composer, Spark, Iceberg) — you don't need to write ETL code, but you need to operate it reliably and know when something is wrong.
- Fluent use of AI coding agents (Cursor, Claude Code, Copilot, Gemini CLI, or similar) — you already use these tools daily to accelerate work.
- Comfortable with on-call rotation and 24/7 incident response.
- Remote-first mindset — you know how to be effective in distributed teams.
- You are able to learn and adapt — essential when exploring new tech or navigating our growing codebase.
Strongly preferred
- Experience operating at least one DWH environment (Snowflake, Databricks, or BigQuery).
- Familiarity with agentic/LLM workloads — API reliability, latency SLOs, trace observability for AI systems.
- Experience with open table formats (Iceberg, Delta Lake) in production environments.
- Exposure to data security and compliance in the context of customer-facing DWH integrations (consent, data retention, PII handling).
Personal qualities
- Ownership & accountability — you take issues from detection through to resolution and follow-up prevention.
- Systematic thinking — you identify root causes, not symptoms, and document your findings so the team learns.
- Collaboration & communication — you explain trade-offs and constraints clearly to both engineers and non-engineers.
- Bias for reliability — operational excellence (SLOs, on-call friendliness, proactive alerting) is not a chore; it's your craft.
- Continuous improvement mindset — you are comfortable iterating, revisiting assumptions, and improving incrementally.
- Comfortable operating remote-first in a distributed team across Central Europe.
Your success story
In 30 days:
- Get to know the Datacraft team, the company, and the most important processes.
- Set up your local and GCP development environment and complete the Engagement engineering onboarding.
- Understand the current state of Datacraft services: pipelines, orchestration, observability gaps, and on-call runbooks.
In 90 days:
- Start contributing to the L3 on-call rotation, handling incidents, troubleshooting, and debugging — which will sharpen your understanding of the platform and surface fresh improvement ideas.
- Deliver your first meaningful reliability improvement: an observability enhancement, a piece of deployment automation, or an SLO definition for a key Datacraft service.
In 180 days:
- Own the reliability posture of at least one Datacraft domain end-to-end — able to independently design, operate, and continuously improve it.
- Drive measurable improvements in MTTR, alert signal-to-noise ratio, or deployment confidence across the team.
- Be a trusted reliability partner in architecture discussions — your input shapes how new Datacraft services are designed.