Senior LLM Inference Engineer

Netherlands - Amsterdam

PDT – Data Science & AI

Role: Permanent

Hybrid

Join our AI team at Prosus, the largest consumer internet company in Europe and one of the biggest tech investors in the world. You'll be working on the team that drives growth and innovation across the company, with your work directly impacting how millions of people shop online.

Who we’re looking for

We're looking for a Senior MLOps Engineer whose core expertise is LLM serving at scale. You'll own the infrastructure that gets models into production and keeps them running efficiently: vLLM deployment, inference optimization (quantization, batching, KV cache), GPU cost management, and production-grade APIs with strict latency SLAs. Pipelines and CI/CD matter, but this role is defined by serving performance and infrastructure depth, not pipeline orchestration.

What you’ll do

Model Serving & APIs:

Deploy and optimize LLM serving infrastructure using vLLM
Apply inference optimizations (quantization, continuous batching, PagedAttention, KV cache management) to maximize throughput and minimize latency
Design and build production-grade async API services (FastAPI, etc.) with pre/post-processing, business logic, and strict latency SLAs
Continuously optimize serving costs through model compression, batching strategies, and infrastructure tuning
Implement A/B testing infrastructure and canary deployments for safe model rollouts

ML Pipelines:

Build ML pipelines for data ingestion, processing, model deployment, and evaluation
Own CI/CD for ML systems, including automated testing, model versioning, and deployment workflows
Implement monitoring for model performance, latency, throughput, and costs with budget alerting
Set up experiment tracking and model registry systems (MLflow, Weights & Biases, or similar)
Define and monitor SLIs/SLOs for production model serving

Infrastructure & Orchestration:

Manage Kubernetes and Slurm clusters for GPU workloads with multi-tenant resource allocation
Optimize GPU utilization and implement cost controls across training and inference workloads
Own CI/CD pipelines, model versioning, and deployment automation

Enablement & Best Practices:

Create templates and documentation to accelerate team productivity
Establish MLOps best practices and guide teams in their adoption
Support model training experiments when needed

Minimum qualifications

Hands-on production experience with vLLM (or an equivalent open-source LLM serving framework), not managed services. You've tuned inference at the infrastructure level: quantization, continuous batching, KV cache, GPU memory management.
5+ years in MLOps, platform engineering, or infrastructure with a focus on ML/LLM workloads
Proven experience with GPU cost optimization: tracking, budgeting, alerting, and resource efficiency at scale
Strong Python skills with experience building production APIs (FastAPI or similar)
Hands-on experience with Kubernetes and Docker for GPU workloads
Experience with job orchestration systems (Slurm, Ray, Argo, Kubeflow, or similar)
Solid understanding of monitoring and observability for production ML systems
Naturally curious with a track record of proactively identifying and implementing improvements

Preferred qualifications

Deep knowledge of GPU architectures and their performance implications for inference optimization
Expertise in model compression techniques for production deployment: quantization (INT8, INT4, FP8), pruning, and distillation
Understanding of security best practices for ML serving: authentication, authorization, rate limiting, model access controls
Experience managing multi-tenant GPU clusters with fair scheduling and resource isolation
Experience with Terraform or Pulumi for GPU-optimized infrastructure provisioning at scale
Experience supporting distributed training infrastructure: multi-node job orchestration, checkpoint management, and debugging training failures
Contributions to open-source MLOps tools or serving frameworks

What we offer

Critical infrastructure ownership for high-impact AI projects that are strategically vital to the company, with direct visibility to senior leadership including the CEO
State-of-the-art GPU infrastructure: H200 fleet, vLLM serving stack, cutting-edge optimization tools
Expert ML team who have released top Hugging Face models, published at NeurIPS, and built production systems that will run on your infrastructure
Significant autonomy in designing MLOps solutions, choosing tools, and shaping infrastructure strategy for LLM serving
Modern tooling: the latest MLOps frameworks, coding assistants, and a best-in-class development environment
Hybrid work model with our Amsterdam office, home to the AI House, which brings together 200+ AI professionals through events and collaborations
Competitive compensation, top-spec MacBook Pro, and an environment genuinely built for professional growth and learning

If you want to own the infrastructure that runs some of the most demanding LLM workloads in Europe, let's talk.

Our Diversity & Inclusion Commitment

We respect the dignity and human rights of individuals and communities wherever we operate in the world. Building an inclusive workplace where everyone feels welcome and can thrive is critical for us. We provide access to education, which helps everyone understand the important role they play and the positive impact they can have.

For a deeper look at our journey and future plans, explore our latest Annual Report. Stay up to date with our latest news to see what makes Prosus stand out. Learn more at www.prosus.com.
