Service Reliability Engineer, GI Application Management

American International Group, Inc.

Charlotte, NC, U.S.

Full-time

Posted Aug 24, 2025

As a Site Reliability Engineer (SRE), you will apply software engineering principles to IT operations, ensuring robust and scalable systems.

Maintain continuous uptime and accessibility of critical business applications and services.
Respond to and resolve incidents and outages promptly.
Automate repetitive, manual tasks (toil) to improve efficiency and reduce human error.
Establish and maintain robust monitoring and alerting systems to gain real-time insights into system health and performance.
Analyze usage patterns and forecast resource needs to ensure that systems can handle expected growth and traffic spikes without performance degradation.
After major incidents, conduct blameless postmortem reviews to identify the root causes of failures, document learnings, and implement corrective measures to prevent recurrence.
Act as a bridge between development and operations teams, working closely with developers to improve application architecture, incorporate reliability best practices into the development lifecycle, and ensure optimal delivery efficiency.
Establish clear, measurable targets for system performance and reliability, often based on Service Level Indicators (SLIs).
Define and meet Service Level Objectives (SLOs) and manage error budgets to drive continuous improvement.

Bachelor's degree in a related field and 3+ years of relevant technology experience, demonstrating progressive responsibility and leadership in overseeing regional technology teams.
Solid grasp of core technical areas such as programming (commonly Python, Go, or Java), system administration (Linux/Unix), networking, databases, and cloud computing platforms (such as AWS, Azure, or GCP).
Practical experience running production systems, troubleshooting issues, and participating in on-call rotations is highly valued, as it builds crucial intuition for real-world system failures.
Proficiency in scripting languages (e.g., Python, Bash) and Infrastructure as Code (IaC) tools (e.g., Terraform, Ansible) is crucial.
Ability to quickly diagnose and resolve system incidents, minimize downtime, and implement solutions to prevent recurrence is paramount.
Ability to use data from metrics, logs, and other sources to understand system behavior, analyze performance, identify trends, and make informed decisions that improve system reliability.
Excellent communication skills to articulate technical concepts, collaborate on projects, and foster a shared understanding of reliability goals.
Proactive in learning new technologies, methodologies, and tools to adapt to changing environments and continuously improve your skills and the systems you manage.

American International Group, Inc. (AIG) is a leading global insurance organization.

Salary Range

Salary not disclosed
