Josef Doornink

AI Infrastructure & Reliability Engineer

Infrastructure engineer with 7 years in SRE and MLOps, now building safety and reliability tooling for autonomous AI systems. CKS and CKA certified. Previously spent a decade in FDA-cleared medical hardware, where reliability meant patient safety; that work produced 10 peer-reviewed papers and 2 US patents. Focused on providing visibility and reliability in non-deterministic systems.


K8gentS (MCP Registry)

How do you put a non-deterministic reasoning engine inside a system that requires guarantees? K8gentS answers with guardrails-as-code: an autonomous Kubernetes root cause analysis (RCA) agent routes cluster failures through Gemini-powered analysis but constrains remediation with code-enforced policies. Human approval gates paired with OPA Gatekeeper rules create hard boundaries around AI decisions. Diagnostics are exposed via an MCP server (io.github.JDoornink/k8gents on the official registry); see the README for a discussion of failure modes and confidence calibration tradeoffs.

AI Safety, Kubernetes, MCP, OPA Gatekeeper, Gemini, Python

OmniSight-Core

A self-healing multimodal search engine demonstrating the agent-in-the-reliability-loop pattern applied to ML systems. CLIP embeddings + Qdrant power semantic video search ("find a red truck at night"); Prometheus and Evidently AI surface drift; an LLM agent reasons about drift signals and triggers automated retraining via GitHub Actions. Companion to K8gentS — same thesis (LLM as decision-maker inside a reliability contract), different domain.

LLM Agent, Self-Healing, CLIP, Vector Search, Drift Detection, Evidently AI, GKE, Python

Agent-Lint-CLI (PyPI)

Static analysis for the agent supply chain. A published Python CLI tool that validates MCP servers and scans AI agent implementations for security vulnerabilities — configurable security levels, CI/CD integration with threshold-based failure conditions, and SARIF output for integration with existing security tooling.

MCP, AI Safety, Eval Infrastructure, Python, CLI, Security, CI/CD

Reason Benefit AI Corporation (Startup)

Lead MLOps Engineer

October 2025 — February 2026

—Designing and building the AKS-based ML training and inference platform — making the foundational architecture decisions for an early-stage AI company before production traffic.
—Establishing patterns for distributed training resource management on Azure, balancing cost and iteration speed for research workloads.
—Translating research-team requirements into cloud architectures, working iteratively as the platform and the model strategy co-evolve.
—Standing up the distributed training pipeline from scratch — automated pipeline scripts, validation steps, and the reliability scaffolding research teams need to move fast.

Trimble

Lead Site Reliability Engineer (SRE) I -> II -> III

January 2019 — Present

—Implemented New Relic observability stack with distributed tracing, cutting MTTR by 50%.
—Built automation tooling in Python and Go eliminating 80+ hours/month of toil and accelerating deployment velocity 3x.
—Architected AKS production environments handling 14K+ requests/day across 30+ microservices at 99.9% uptime SLA.
—Built and maintained a human resources information management system responsible for $8.3M+ in ARR and a 43% YoY growth rate.
—Led Kubernetes capacity planning and strategies supporting 200% traffic growth.
—Built CI/CD pipelines (GitHub Actions, Azure DevOps) with automated testing and rollback mechanisms.
—Managed 100+ cloud resources via Terraform IaC; implemented CKS security controls for SOC2 compliance.

Viewpoint

Software Developer

March 2018 — January 2019

—Developed cloud-based SaaS applications using .NET and Angular, migrating on-premises software to the Azure cloud platform.
—Built RESTful APIs for multi-tenant applications serving thousands of users, with a focus on performance and scalability.

Onfulfillment

Software Developer I

March 2014 — March 2018

—Engineered a multi-tenant e-commerce platform on the Microsoft stack (.NET, C#, SQL Server), integrated with third-party SaaS APIs.
—Led the 'uplift' initiative migrating the legacy codebase to a modern greenfield platform, improving response times by 40% as measured through New Relic APM.

Legacy Biomechanics Research Lab

Biomechanical Research Engineer II

2007 — 2013

—Lead Test and Development Engineer for an NIH-funded, multimillion-dollar research project focused on bone fixation solutions.
—Managed successful implant creation, delivery, and test methodology producing multiple US FDA-approved implants.

jdoornink.github.io