Site Reliability Engineer

Thenuleap

Contract

Mar 12th, 2026

Remote

8 - 12 Years

Required Skillset:

Azure

Job Description

Site Reliability Engineer (SRE) – AI & Cloud Platforms

This is a remote position.

Only U.S. citizens (USC) and Green Card (GC) holders will be considered.

A LinkedIn profile is required.

This is a W2 position.

The maximum rate is $60.

Job Summary

The SRE – AI Platforms will own reliability, scalability, deployment automation, monitoring, and security of AI workloads running on Azure. This role combines cloud engineering and MLOps capabilities to ensure AI systems operate with high availability, compliance, and cost efficiency.

Key Responsibilities
Infrastructure & Deployment
Deploy and manage workloads on Azure Kubernetes Service
Configure serverless components using Azure Functions
Implement CI/CD pipelines for AI applications
Manage containerization (Docker/Kubernetes)
MLOps & Model Lifecycle
Deploy models via Azure Machine Learning
Implement model versioning & experiment tracking
Monitor model drift, bias, and performance degradation
Automate rollback and blue-green deployment strategies
Observability & Reliability
Configure centralized logging and alerting
Monitor latency, uptime, token consumption, GPU/CPU usage
Define and track SLAs/SLOs for AI services
Implement autoscaling policies
Security & Compliance
Implement RBAC using Microsoft Entra ID
Manage secrets with Azure Key Vault
Enforce network isolation (VNET, Private Endpoints)
Ensure compliance logging and auditability

Required Qualifications
6–10 years in Cloud Engineering / DevOps / SRE
Strong Azure infrastructure experience
Experience supporting AI/ML production workloads
Proficiency in Infrastructure-as-Code (Terraform / ARM)
Strong understanding of reliability engineering principles
