Service Reliability Engineer, GI Application Management
American International Group, Inc.
Charlotte, NC, U.S.
Full-time
Posted Aug 24, 2025
About the role
As a Site Reliability Engineer (SRE), you will apply software engineering principles to IT operations, ensuring robust and scalable systems.
Responsibilities
- Maintain continuous uptime and accessibility of critical business applications and services.
- Respond to and resolve incidents and outages promptly.
- Automate repetitive, manual tasks (toil) to improve efficiency and reduce human error.
- Establish and maintain robust monitoring and alerting systems to gain real-time insights into system health and performance.
- Analyze usage patterns and forecast resource needs to ensure that systems can handle expected growth and traffic spikes without performance degradation.
- After major incidents causing outages, conduct blameless post-mortem reviews to analyze the root causes of failures, document learnings, and implement corrective measures to prevent future occurrences.
- Act as a bridge between development and operations teams, working closely with developers to improve application architecture, incorporate reliability best practices into the development lifecycle, and ensure optimal delivery efficiency.
- Establish clear, measurable targets for system performance and reliability, often based on Service Level Indicators (SLIs).
- Define and meet Service Level Objectives (SLOs), manage error budgets, and conduct blameless post-mortems for continuous improvement.
Requirements
- Bachelor's degree in a related field and 3+ years of relevant technology experience, demonstrating progressive responsibility and leadership in overseeing regional technology teams.
- Solid grasp of core technical areas such as programming (commonly Python, Go, or Java), system administration (Linux/Unix), networking, databases, and cloud computing platforms (such as AWS, Azure, or GCP).
- Practical experience running production systems, troubleshooting issues, and participating in on-call rotations is highly valued, as it builds crucial intuition for real-world system failures.
- Proficiency in scripting languages (e.g., Python, Bash) and Infrastructure as Code (IaC) tools (e.g., Terraform, Ansible) is crucial.
- Ability to quickly diagnose and resolve system incidents, minimize downtime, and implement solutions to prevent recurrence is paramount.
- Ability to rely on data from metrics, logs, and other sources to understand system behavior, analyze performance, identify trends, and make informed decisions to improve system reliability.
- Excellent communication skills to articulate technical concepts, collaborate on projects, and foster a shared understanding of reliability goals.
- Proactive approach to learning new technologies, methodologies, and tools to adapt to changing environments and continuously improve your skills and the systems you manage.
Benefits
- 401k matching
- Health insurance
- Veterans encouraged to apply
- Total Rewards Program
- Volunteer Time Off and Matching Grants Programs
About the Company
American International Group, Inc. (AIG) is a leading global insurance organization.
Job Details
Salary Range
Salary not disclosed
Location
Charlotte, NC, U.S.
Employment Type
Full-time