Josef Doornink

Site Reliability Engineer | AI Infrastructure & MLOps

A creative problem-solver who loves building the infrastructure that brings ideas into reality. An engineer with 10+ years of experience, 10 peer-reviewed papers, 2 US patents, a major design award, and FDA-cleared hardware. That foundation now drives deep SRE and AI infrastructure work (CKS/CKA certified), delivering systems with integrity and observability at scale. Currently designing autonomous agentic SRE pipelines that use LLMs for root-cause analysis.

What happens when you deploy a non-deterministic reasoning engine in a system that requires guarantees? K8gentS is an autonomous Kubernetes RCA agent built around that question. It routes cluster failures through Gemini-powered analysis, gates remediation behind both human approval and an OPA Gatekeeper admission policy, and exposes diagnostics via an MCP server published on the official MCP Registry as io.github.JDoornink/k8gents.

AI Safety | Kubernetes | MCP | OPA Gatekeeper | Gemini | Python
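
The dual gate described above can be illustrated with a minimal sketch. This is not the K8gentS implementation: the class, the function names, and the allow-list are hypothetical, and the Python policy check only stands in for an OPA Gatekeeper admission policy, which in practice is written in Rego and enforced by the cluster at admission time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RemediationProposal:
    """A fix suggested by the analysis step (hypothetical structure)."""
    action: str
    namespace: str
    approved_by_human: bool


# Hypothetical allow-list standing in for an admission policy.
ALLOWED_ACTIONS = {"restart_pod", "scale_deployment"}
PROTECTED_NAMESPACES = {"kube-system"}


def policy_allows(proposal: RemediationProposal) -> bool:
    """Stand-in for the policy gate: only known actions outside protected namespaces."""
    return (
        proposal.action in ALLOWED_ACTIONS
        and proposal.namespace not in PROTECTED_NAMESPACES
    )


def may_apply(proposal: RemediationProposal) -> bool:
    """Both gates must pass; neither one alone is sufficient."""
    return proposal.approved_by_human and policy_allows(proposal)
```

The point of the design is that the non-deterministic component only proposes; two independent, deterministic checks decide whether anything is applied.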

A production-grade video search engine capable of understanding semantic queries (e.g., "Find a red truck at night"). Demonstrates self-healing infrastructure that automatically detects model performance decay and triggers retraining.

MLOps | CLIP | Vector Search | Drift Detection | GKE | Python
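
As a rough illustration of the decay-triggered retraining idea, the sketch below compares recent quality scores against a baseline. The function name, the choice of metric, and the 0.05 tolerance are assumptions for illustration only; the project's actual drift signal is not described here.

```python
from statistics import mean


def needs_retraining(baseline_scores, recent_scores, max_drop=0.05):
    """Return True when the average recent score has fallen more than
    max_drop below the baseline average (hypothetical decay rule)."""
    return mean(baseline_scores) - mean(recent_scores) > max_drop
```

A scheduled job could evaluate a check like this on fresh evaluation data and start a retraining pipeline when it returns True.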

ESLint for agents. A published Python CLI tool that validates MCP servers and scans AI agent implementations for security vulnerabilities. Supports configurable security levels, CI/CD integration with threshold-based failure conditions, and multiple output formats including SARIF.

MCP | AI Safety | Python | CLI | Security | CI/CD
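
A threshold-based failure condition of the kind mentioned above might look like the following sketch. The severity names, their ranking, and the function name are assumptions, not the tool's actual interface.

```python
# Hypothetical severity ordering; a real tool may use different levels.
SEVERITY_RANK = {"low": 1, "medium": 2, "high": 3, "critical": 4}


def exit_code(finding_severities, fail_on="high"):
    """Return 1 (fail the CI job) if any finding is at or above the
    fail_on severity, otherwise 0 (pass)."""
    threshold = SEVERITY_RANK[fail_on]
    failed = any(SEVERITY_RANK[s] >= threshold for s in finding_severities)
    return 1 if failed else 0
```

In a CI job, the returned value would be passed to `sys.exit` so the pipeline stops when findings exceed the configured threshold.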

The source code behind this site. A Next.js (React) front end backed by a Python RAG agent pipeline that strictly parses unstructured job descriptions and generates targeted, statically built frontend bundles.

Next.js | Python RAG | TailwindCSS | GitOps
Reason Benefit AI Corporation | Startup
Lead MLOps Engineer
October 2025 - Present
  • Architect and maintain a large-scale Azure Kubernetes Service (AKS) production environment for ML model training and serving, supporting distributed model inference at scale.
  • Integrate Azure services to manage resource utilization across distributed training systems.
  • Collaborate with research teams to translate desired microservice architectures into cloud-based systems with a focus on reliability and scalability.
  • Drive distributed training pipeline creation by introducing automated pipeline scripts and validations to reduce training cycle time.
Trimble
Lead Site Reliability Engineer (SRE) I -> II -> III
January 2019 - Present
  • Built and maintained a human resources information management system responsible for $8.3M+ in ARR and a 43% YoY growth rate.
  • Architected AKS production environments handling 14K+ requests/day across 30+ microservices with a 99.9% uptime SLA.
  • Built automation tooling in Python and Go, eliminating 80+ hours/month of toil and accelerating deployment velocity 3x.
  • Reduced P99 latency by 45% and improved throughput by 60% through systematic profiling of distributed systems.
  • Implemented a New Relic observability stack with distributed tracing, cutting MTTR by 50%.
  • Led Kubernetes capacity planning and scaling strategies to support 200% traffic growth.
  • Built CI/CD pipelines (GitHub Actions, Azure DevOps) with automated testing and rollback mechanisms.
  • Managed 100+ cloud resources via Terraform IaC; implemented CKS security controls for SOC2 compliance.
Viewpoint
Software Developer
March 2018 - January 2019
  • Developed cloud-based SaaS applications using .NET and Angular, migrating on-premises software to the Azure cloud platform.
  • Built RESTful APIs for multi-tenant applications serving thousands of users, with a focus on performance and scalability.
Onfulfillment
Software Developer I
March 2014 - March 2018
  • Engineered a multi-tenant e-commerce platform using the Microsoft stack (.NET, C#, SQL Server), integrated with third-party SaaS APIs.
  • Led an 'uplift' initiative migrating the legacy codebase to a modern greenfield platform, improving response times by 40% as measured through New Relic APM.
Legacy Biomechanics Research Lab
Biomechanical Research Engineer II
2007 - 2013
  • Lead Test and Development Engineer for NIH-funded multimillion-dollar research project focused on bone fixation solutions.
  • Managed successful implant creation, delivery, and test methodology producing multiple US FDA-approved implants.
jdoornink.github.io