Josef Doornink

AI Infrastructure & Reliability Engineer

Infrastructure engineer with 7 years in SRE and MLOps, now building safety and reliability tooling for autonomous AI systems. CKS and CKA certified. Previously spent a decade in FDA-cleared medical hardware, where reliability meant patient safety; that work produced 10 peer-reviewed papers and 2 US patents. Focused on providing visibility and reliability in non-deterministic systems.

What happens when you deploy a non-deterministic reasoning engine in a system that requires guarantees? K8gentS is an autonomous Kubernetes RCA agent built around that question. It routes cluster failures through Gemini-powered analysis, gates remediation behind both a human approval and an OPA Gatekeeper admission policy, and exposes diagnostics via an MCP server published on the official MCP Registry as io.github.JDoornink/k8gents. See the README for an open discussion of failure modes and confidence calibration tradeoffs.

AI Safety, Kubernetes, MCP, OPA Gatekeeper, Gemini, Python

A self-healing multimodal search engine demonstrating the agent-in-the-reliability-loop pattern applied to ML systems. CLIP embeddings + Qdrant power semantic video search ("find a red truck at night"); Prometheus and Evidently AI surface drift; an LLM agent reasons about drift signals and triggers automated retraining via GitHub Actions. Companion to K8gentS — same thesis (LLM as decision-maker inside a reliability contract), different domain.

LLM Agent, Self-Healing, CLIP, Vector Search, Drift Detection, Evidently AI, GKE, Python

Static analysis for the agent supply chain. A published Python CLI tool that validates MCP servers and scans AI agent implementations for security vulnerabilities — configurable security levels, CI/CD integration with threshold-based failure conditions, and SARIF output for integration with existing security tooling.

MCP, AI Safety, Eval Infrastructure, Python, CLI, Security, CI/CD
Reason Benefit AI Corporation (Startup)
Lead MLOps Engineer
October 2025 - February 2026
  • Designed and built the AKS-based ML training and inference platform, making the foundational architecture decisions for an early-stage AI company before production traffic.
  • Established patterns for distributed training resource management on Azure, balancing cost and iteration speed for research workloads.
  • Translated research-team requirements into cloud architectures, iterating as the platform and the model strategy co-evolved.
  • Stood up the distributed training pipeline from scratch: automated pipeline scripts, validation steps, and the reliability scaffolding research teams need to move fast.
Trimble
Lead Site Reliability Engineer (SRE) I -> II -> III
January 2019 - Present
  • Implemented New Relic observability stack with distributed tracing, cutting MTTR by 50%.
  • Built automation tooling in Python and Go eliminating 80+ hours/month of toil and accelerating deployment velocity 3x.
  • Architected AKS production environments handling 14K+ requests/day across 30+ microservices at 99.9% uptime SLA.
  • Built and maintained a human resources information management system responsible for $8.3M+ in ARR and a 43% YoY growth rate.
  • Led Kubernetes capacity planning and strategy, supporting 200% traffic growth.
  • Built CI/CD pipelines (GitHub Actions, Azure DevOps) with automated testing and rollback mechanisms.
  • Managed 100+ cloud resources via Terraform IaC; implemented CKS security controls for SOC 2 compliance.
Viewpoint
Software Developer
March 2018 - January 2019
  • Developed cloud-based SaaS applications using .NET and Angular, migrating on-premises software to the Azure cloud platform.
  • Built RESTful APIs for multi-tenant applications serving thousands of users, with a focus on performance and scalability.
Onfulfillment
Software Developer I
March 2014 - March 2018
  • Engineered a multi-tenant e-commerce platform on the Microsoft stack (.NET, C#, SQL Server), integrated with third-party SaaS APIs.
  • Led 'uplift' initiative migrating legacy codebase to modern greenfield platform, improving response times by 40% as measured through New Relic APM.
Legacy Biomechanics Research Lab
Biomechanical Research Engineer II
2007 - 2013
  • Lead Test and Development Engineer for NIH-funded multimillion-dollar research project focused on bone fixation solutions.
  • Managed successful implant creation, delivery, and test methodology producing multiple US FDA-approved implants.
jdoornink.github.io