Site Reliability Engineer
Job Description
Senior Site Reliability Engineer
Location: NYC, NY (Hybrid)
Long Term Contract
What You’ll Do:
● Support the SRE team in developing and implementing enhancements to support workflows, focusing on automation and efficiency improvements
● Handle technical escalations, troubleshoot complex FIX and API connectivity issues, and actively participate in on-call rotations during non-traditional hours to ensure rapid response and resolution
● Coordinate incident postmortems and root cause analysis (RCA)
● Design, implement, and maintain comprehensive monitoring, logging, and tracing solutions (observability stack) to provide deep insights into system performance and user experience
Required Qualifications:
● 5+ years in a senior SRE role or a similar position, demonstrating deep knowledge and expertise in site reliability engineering and operations
● Knowledge of FIX protocol and messages, ability to read FIX logs
● Familiarity with REST APIs and a strong understanding of API integration
● Proficient in Python and scripting for automation and system management, with a proven track record of developing and implementing automation solutions
● Expertise in SQL and transactional databases, including querying and troubleshooting
● Strong analytical and troubleshooting skills with a proven ability to identify and resolve technical issues through root cause analysis
● In-depth knowledge of core networking concepts, including TCP/IP, routing, and DNS
● Familiarity with maintaining and troubleshooting systems in both cloud (AWS) and co-location (colo) environments
● Availability for flexible work hours and willingness to cover US market trading sessions, including L2 on-call coverage
● Knowledge of change management processes and risk management
Preferred Qualifications:
● Experience in the brokerage or financial industry.
● Proficient with cloud services, particularly AWS, and knowledgeable about cloud architecture best practices, including IAM, EC2, S3, and DynamoDB.
● Experience maintaining and supporting containerized systems, and familiarity with orchestration tools.
● Knowledge of Infrastructure as Code (IaC) practices and tools such as Terraform or CloudFormation.
● Ability to manage and troubleshoot job scheduling tools like Rundeck or Apache Airflow.
● Advanced skills in managing containerized environments using Kubernetes and OpenShift.
● Practical experience with Confluent Cloud or Redpanda for event streaming architectures.
● Experience with API-based applications and a basic understanding of using the browser developer console for front-end debugging.