Site Reliability Engineer
Job Description
Site Reliability Engineer (SRE) – AI & Cloud Platforms
This is a remote position.
Only U.S. citizens (USC) and green card (GC) holders will be considered.
A LinkedIn profile is required.
This is a W2 position.
The maximum rate is $60.
Job Summary
The SRE – AI Platforms will own reliability, scalability, deployment automation, monitoring, and security of AI workloads running on Azure. This role combines cloud engineering and MLOps capabilities to ensure AI systems operate with high availability, compliance, and cost efficiency.
Key Responsibilities
Infrastructure & Deployment
Deploy and manage workloads on Azure Kubernetes Service
Configure serverless components using Azure Functions
Implement CI/CD pipelines for AI applications
Manage containerization (Docker/Kubernetes)
MLOps & Model Lifecycle
Deploy models via Azure Machine Learning
Implement model versioning & experiment tracking
Monitor model drift, bias, and performance degradation
Automate rollback and blue-green deployment strategies
Observability & Reliability
Configure centralized logging and alerting
Monitor latency, uptime, token consumption, and GPU/CPU usage
Define and track SLAs/SLOs for AI services
Implement autoscaling policies
Security & Compliance
Implement RBAC using Microsoft Entra ID
Manage secrets with Azure Key Vault
Enforce network isolation (VNET, Private Endpoints)
Ensure compliance logging and auditability
Required Qualifications
6–10 years in Cloud Engineering / DevOps / SRE
Strong Azure infrastructure experience
Experience supporting AI/ML production workloads
Proficiency in Infrastructure-as-Code (Terraform / ARM templates)
Strong understanding of reliability engineering principles