Service Reliability Engineer, GI Application Management

American International Group, Inc.

Charlotte, NC, U.S.
Full-time
Posted Aug 24, 2025

About the role

As a Site Reliability Engineer (SRE), you will apply software engineering principles to IT operations, ensuring robust and scalable systems.

Responsibilities

  • Maintain continuous uptime and accessibility of critical business applications and services.
  • Respond to and resolve incidents and outages promptly.
  • Automate repetitive, manual tasks (toil) to improve efficiency and reduce human error.
  • Establish and maintain robust monitoring and alerting systems to gain real-time insights into system health and performance.
  • Analyze usage patterns and forecast resource needs to ensure that systems can handle expected growth and traffic spikes without performance degradation.
  • After major incidents causing outages, conduct blameless post-mortem reviews to analyze the root causes of failures, document learnings, and implement corrective measures to prevent future occurrences.
  • Act as a bridge between development and operations teams, working closely with developers to improve application architecture, incorporate reliability best practices into the development lifecycle, and ensure optimal delivery efficiency.
  • Establish clear, measurable targets for system performance and reliability, often based on Service Level Indicators (SLIs).
  • Define and meet Service Level Objectives (SLOs) and manage error budgets to drive continuous improvement.

Requirements

  • Bachelor's degree in a related field and 3+ years of relevant technology experience, demonstrating progressive responsibility and leadership in overseeing regional technology teams.
  • Solid grasp of core technical areas such as programming (commonly Python, Go, or Java), system administration (Linux/Unix), networking, databases, and cloud computing platforms (such as AWS, Azure, or GCP).
  • Practical experience running production systems, troubleshooting issues, and participating in on-call rotations is highly valued because it builds intuition for real-world system failures.
  • Proficiency in scripting languages (e.g., Python, Bash) and Infrastructure as Code (IaC) tools (e.g., Terraform, Ansible) is crucial.
  • Ability to quickly diagnose and resolve system incidents, minimize downtime, and implement solutions to prevent recurrence is paramount.
  • Ability to rely on data from metrics, logs, and other sources to understand system behavior, analyze performance, identify trends, and make informed decisions to improve system reliability.
  • Excellent communication skills to articulate technical concepts, collaborate on projects, and foster a shared understanding of reliability goals.
  • Proactive in learning new technologies, methodologies, and tools to adapt to changing environments and continuously improve your skills and the systems you manage.

Benefits

  • 401k matching
  • Health insurance
  • Veterans encouraged to apply
  • Total Rewards Program
  • Volunteer Time Off and Matching Grants Programs

About the Company

American International Group, Inc. (AIG) is a leading global insurance organization.

Job Details

Salary Range

Salary not disclosed
