Sr. Site Reliability Engineer @ Capgemini - Malvern, PA

Job Overview

7 days ago

Duration: 5+ Months


Job Description:


  • Are you an engineer who loves to solve impactful, complex operational problems?
  • Are you passionate about finding opportunities to improve system performance and efficiency, scalability, fault tolerance, and self-healing capabilities?
  • Are you excited about Chaos Engineering? Do you want to apply these principles and creatively experiment with our systems to discover hidden weaknesses?
  • Are you obsessed with understanding systems’ inner state, interactions between systems, or observability-driven development?


A successful candidate will likely have experience as a Full Stack Engineer who has supported their applications operationally. You will solve reliability problems across product families and continuously seek opportunities to improve our systems’ “-ilities”. You will also help define, maintain, and carry out subdivisional reliability engineering standards, contribute to enterprise-wide libraries for reliability, and train product SRE and product family SRE leads within the subdivision.


In this role you will:


  • Instrument, enhance and advocate for system observability. Identify and develop solutions to bridge systems observability gaps.
  • Collaborates with internal teams to evaluate the health, stability and reliability of systems/platforms. Looks for opportunities to improve system performance, efficiency, and resiliency.
  • Develops and communicates new standards and newly available tools and frameworks across subdivisions. Enforces reliability standards. Designs and develops new automated solutions for reliability.
  • Provides technical leadership, consultancy, and coaching on designing and implementing both traditional and serverless architectures in AWS with an emphasis on repeatability, scaling options, resilience, reliability, telemetry, networking, etc., including design patterns for resilient systems.
  • Leads failure modes analysis spanning product families when new features and architecture patterns are introduced. Facilitates post-incident reviews for any high-severity, client-impacting events local to the product family.
  • Leads cross-product or cross-subdivision chaos experimentation.
  • Designs, reviews, and coaches others on performance tests using appropriate components (e.g., requests per minute, number of threads, the construction of a request with headers and cookies).
  • Consults, reviews, coaches, and influences architectural decisions, including non-functional aspects, proposing potential technical solutions/enhancements, and explaining convincingly which is better and why.
  • Contributes to or leads Reliability Engineering and Resilience communities of practice. Remains informed about site reliability engineering activities happening within the subdivision.
  • Works with product owners to set subdivision goals for higher availability and SRE impact, and tracks progress toward achieving them.
  • Provides technical leadership, guidance, consulting, training, and governance on SRE to one or more product families in a subdivision.
  • Identifies opportunities to automate away toil and develops solutions, monitors error budget exhaustion rates, configures auto scaling thresholds for the product, and incorporates resilience patterns, such as circuit breakers, into the application code. Develops complex deployment and/or routing strategies for high availability.
  • Maintains and looks for opportunities to improve the centralized incident response playbook for the subdivision, which documents standards for managing communication and escalation during an incident.
  • Oversees blameless post-incident reviews for high-severity incidents involving multiple product families.


Core Responsibilities / Qualifications


  • Minimum of eight years related work experience, with at least three years of development experience.
  • Undergraduate degree or equivalent combination of training and experience. Graduate degree preferred.
  • Full stack development – JDK 8+ preferred, with Spring Boot, REST APIs, multithreaded and multiprocessing applications, and GraphQL. Experience with UI development (familiarity with Angular, TypeScript, Node.js, etc.) is a plus.
  • Ability to diagnose and resolve problems in high-throughput applications.
  • Experience with one or more observability frameworks or tools, such as OpenTelemetry (Java, JS, etc.), CloudWatch, Grafana, or Splunk.
  • Exposure to *nix environments including some shell script development and basic command execution.
  • Strong understanding of database principles and working knowledge of distributed storage and infrastructure solutions.
  • Experience with container management (such as Docker) and microservices architectures in cloud and on-premises infrastructure.
  • Working knowledge of AWS network foundations, application networking, edge, and network security.
  • Excellent communication and documentation skills.


The Capgemini Freelancer Gateway is enabled by a cutting-edge software platform that leads the contingent labor world for technology innovation. The software platform leverages Machine Learning and Artificial Intelligence to make sure the right people end up in the right job.


A global leader in consulting, technology services, and digital transformation, Capgemini is at the forefront of innovation to address the entire breadth of clients’ opportunities in the evolving world of cloud, digital, and platforms. Building on its strong 50-year heritage and deep industry-specific expertise, Capgemini enables organizations to realize their business ambitions through an array of services from strategy to operations. Capgemini is driven by the conviction that the business value of technology comes from and through people. It is a multicultural company of over 200,000 team members in more than 40 countries. The Group reported 2018 global revenues of EUR 13.2 billion.
