Senior LLM Inference Engineer

Netherlands - Amsterdam
PDT – Data Science & AI / Permanent / Hybrid
Join our AI team at Prosus, the largest consumer internet company in Europe and one of the biggest tech investors in the world. You'll be working on the team that drives growth and innovation across the company, with your work directly impacting how millions of people shop online.
 
Who we’re looking for
 
We're looking for a Senior MLOps Engineer whose core expertise is LLM serving at scale. You'll own the infrastructure that gets models into production and keeps them running efficiently — vLLM deployment, inference optimization (quantization, batching, KV cache), GPU cost management, and production-grade APIs with strict latency SLAs. Pipelines and CI/CD matter, but this role is defined by serving performance and infrastructure depth, not pipeline orchestration.

What you’ll do

Model Serving & APIs:
  • Deploy and optimize LLM serving infrastructure using vLLM
  • Apply inference optimizations (quantization, continuous batching, PagedAttention, and KV cache management) to maximize throughput and minimize latency
  • Design and build production-grade async API services (FastAPI, etc.) with pre/post-processing, business logic, and strict latency SLAs
  • Continuously optimize serving costs through model compression, batching strategies, and infrastructure tuning
  • Implement A/B testing infrastructure and canary deployments for safe model rollouts
 
ML Pipelines:
  • Build ML pipelines for data ingestion, processing, model deployment, and evaluation
  • Own CI/CD for ML systems, including automated testing, model versioning, and deployment workflows
  • Implement monitoring for model performance, latency, throughput, and costs with budget alerting
  • Set up experiment tracking and model registry systems (MLflow, Weights & Biases, or similar)
  • Define and monitor SLIs/SLOs for production model serving
 
Infrastructure & Orchestration:
  • Manage Kubernetes and Slurm clusters for GPU workloads with multi-tenant resource allocation
  • Optimize GPU utilization and implement cost controls across training and inference workloads
  • Own CI/CD pipelines, model versioning, and deployment automation
 
Enablement & Best Practices:
  • Create templates and documentation to accelerate team productivity
  • Establish MLOps best practices and guide teams in their adoption
  • Support model training experiments when needed 

Minimum qualifications

  • Hands-on production experience with vLLM (or equivalent open-source LLM serving framework) — not managed services. You've tuned inference at the infrastructure level: quantization, continuous batching, KV cache, GPU memory management.
  • 5+ years in MLOps, platform engineering, or infrastructure with a focus on ML/LLM workloads
  • Proven experience with GPU cost optimization: tracking, budgeting, alerting, and resource efficiency at scale
  • Strong Python skills with experience building production APIs (FastAPI or similar)
  • Hands-on experience with Kubernetes and Docker for GPU workloads
  • Experience with job orchestration systems (Slurm, Ray, Argo, Kubeflow, or similar)
  • Solid understanding of monitoring and observability for production ML systems
  • Naturally curious with a track record of proactively identifying and implementing improvements

Preferred qualifications

  • Deep knowledge of GPU architectures and their performance implications for inference optimization
  • Expertise in model compression techniques for production deployment: quantization (INT8, INT4, FP8), pruning, and distillation
  • Understanding of security best practices for ML serving: authentication, authorization, rate limiting, model access controls
  • Experience managing multi-tenant GPU clusters with fair scheduling and resource isolation
  • Terraform or Pulumi for GPU-optimized infrastructure provisioning at scale
  • Experience supporting distributed training infrastructure: multi-node job orchestration, checkpoint management, debugging training failures
  • Contributions to open-source MLOps tools or serving frameworks

What we offer

  • Critical infrastructure ownership for high-impact AI projects that are strategically vital to the company, with direct visibility to senior leadership including the CEO
  • State-of-the-art GPU infrastructure: H200 fleet, vLLM serving stack, cutting-edge optimization tools
  • An expert ML team that has released top Hugging Face models, published at NeurIPS, and built production systems that will run on your infrastructure
  • Significant autonomy in designing MLOps solutions, choosing tools, and shaping infrastructure strategy for LLM serving
  • Modern tooling: the latest MLOps frameworks, coding assistants, and a best-in-class development environment
  • Hybrid work model with our Amsterdam office - home to the AI House, bringing together 200+ AI professionals through events and collaborations
  • Competitive compensation, top-spec MacBook Pro, and an environment genuinely built for professional growth and learning
 
If you want to own the infrastructure that runs some of the most demanding LLM workloads in Europe, let's talk.

Our Diversity & Inclusion Commitment
 
We respect the dignity and human rights of individuals and communities wherever we operate in the world. Building an inclusive workplace where everyone feels welcome and can thrive is critical for us. We provide access to education, which helps everyone understand the important role they play and the positive impact they can have.
 
For a deeper look at our journey and future plans, explore our latest Annual Report. Stay up to date with our latest news to see what makes Prosus stand out. Learn more at www.prosus.com.