Josef Doornink

AI Infrastructure & Reliability Engineer

Infrastructure engineer building the safety and reliability layer for autonomous AI systems. Currently designing Agentic SRE pipelines that route Kubernetes failures through LLM analysis, gated behind human approval and OPA admission policies — exploring what it takes to deploy non-deterministic reasoning in systems that require guarantees. 10+ years SRE/MLOps (CKS/CKA certified), preceded by a decade building FDA-cleared medical hardware where reliability meant patient safety. 10 peer-reviewed papers, 2 US patents.

What happens when you deploy a non-deterministic reasoning engine in a system that requires guarantees? K8gentS is an autonomous Kubernetes RCA agent built around that question. It routes cluster failures through Gemini-powered analysis, gates remediation behind both human approval and an OPA Gatekeeper admission policy, and exposes diagnostics via an MCP server published on the official MCP Registry as io.github.JDoornink/k8gents. See the README for an open discussion of failure modes and confidence-calibration tradeoffs.

AI Safety, Kubernetes, MCP, OPA Gatekeeper, Gemini, Python

A self-healing multimodal search engine demonstrating the agent-in-the-reliability-loop pattern applied to ML systems. CLIP embeddings + Qdrant power semantic video search ("find a red truck at night"); Prometheus and Evidently AI surface drift; an LLM agent reasons about drift signals and triggers automated retraining via GitHub Actions. Companion to K8gentS — same thesis (LLM as decision-maker inside a reliability contract), different domain.

LLM Agent, Self-Healing, CLIP, Vector Search, Drift Detection, Evidently AI, GKE, Python

Static analysis for the agent supply chain. A published Python CLI tool that validates MCP servers and scans AI agent implementations for security vulnerabilities — configurable security levels, CI/CD integration with threshold-based failure conditions, and SARIF output for integration with existing security tooling.

MCP, AI Safety, Eval Infrastructure, Python, CLI, Security, CI/CD

The source code driving this exact platform. A Python pipeline that uses Claude to parse unstructured job descriptions and output statically generated, tailored frontend bundles via Next.js — the dogfood project for the Agentic SRE thesis.

Next.js, Python RAG, TailwindCSS, GitOps
Reason Benefit AI Corporation (Startup)
Lead MLOps Engineer
October 2025 - Present
  • Designing and building the AKS-based ML training and inference platform — making the foundational architecture decisions for an early-stage AI company before production traffic.
  • Establishing patterns for distributed training resource management on Azure, balancing cost and iteration speed for research workloads.
  • Translating research-team requirements into cloud architectures, working iteratively as the platform and the model strategy co-evolve.
  • Standing up the distributed training pipeline from scratch — automated pipeline scripts, validation steps, and the reliability scaffolding research teams need to move fast.
Trimble
Lead Site Reliability Engineer (SRE) I -> II -> III
January 2019 - Present
  • Implemented New Relic observability stack with distributed tracing, cutting MTTR by 50%.
  • Built automation tooling in Python and Go eliminating 80+ hours/month of toil and accelerating deployment velocity 3x.
  • Architected AKS production environments handling 14K+ requests/day across 30+ microservices at 99.9% uptime SLA.
  • Built and maintained a Human Resources Information Management System responsible for $8.3M+ in ARR and a 43% YoY growth rate.
  • Led Kubernetes capacity planning and strategy to support 200% traffic growth.
  • Built CI/CD pipelines (GitHub Actions, Azure DevOps) with automated testing and rollback mechanisms.
  • Managed 100+ cloud resources via Terraform IaC; implemented CKS security controls for SOC2 compliance.
Viewpoint
Software Developer
March 2018 - January 2019
  • Developed cloud-based SaaS applications using .NET and Angular, migrating on-premises software to the Azure cloud platform.
  • Built RESTful APIs for multi-tenant applications serving thousands of users, with a focus on performance and scalability.
Onfulfillment
Software Developer I
March 2014 - March 2018
  • Engineered a multi-tenant e-commerce platform on the Microsoft stack (.NET, C#, SQL Server), integrated with third-party SaaS APIs.
  • Led the 'uplift' initiative migrating a legacy codebase to a modern greenfield platform, improving response times by 40% as measured by New Relic APM.
Legacy Biomechanics Research Lab
Biomechanical Research Engineer II
2007 - 2013
  • Served as Lead Test and Development Engineer for an NIH-funded, multimillion-dollar research project focused on bone fixation solutions.
  • Managed successful implant creation, delivery, and test methodology producing multiple US FDA-approved implants.
jdoornink.github.io