Senior Site Reliability Engineer, Kubernetes Platform @ Braze - New York, NY

Job Overview

8 days ago

WHO WE ARE

Braze delivers customer experiences across email, mobile, SMS, and web. Customers, including Burger King, Delivery Hero, HBO Max, Mercari, and Venmo, use the Braze platform to facilitate real-time experiences between brands and consumers in a more authentic and human way. And we do it at scale – each month, hundreds of billions of messages are sent to a network of over 3 billion active users through Braze.

Need more proof? Braze was named a Leader in the Forrester Wave™: Cross-Channel Campaign Management (Independent Platforms), Q3 2021, and was named to the Forbes Cloud 100 list for the fourth consecutive year. The company has also been selected as one of Fortune's Best Workplaces for Millennials in 2021, and was ranked #20 on Fortune's Best Medium Sized Workplaces in 2021. Braze is certified as a Great Place to Work in the UK and the U.S. and is recognized as one of the UK's Best Workplaces for Women.

Site Reliability Engineers (SREs) are responsible for keeping all internal-facing services and platforms running smoothly. In a nutshell, SREs ensure site uptime. SREs are a blend of sensible system administrators and software engineers who apply sound engineering principles, operational discipline, and mature automation to the environments and infrastructure services we provide. We specialize in systems, whether that is networking, the Linux kernel, or a more specific interest in scaling, algorithms, or distributed systems.

Our team improves automation and infrastructure reliability, and empowers Braze's other engineering teams to easily leverage the infrastructure products and platforms we create. Braze operates at a massive scale, with over 3.3 billion monthly active users across our customers, collecting hundreds of billions of data points each month and sending billions of messages to end users daily. We use a diverse technology stack rooted in Ruby on Rails, MongoDB, Redis, Kafka, Kubernetes, and more. As a Site Reliability Engineer at Braze, you will collaborate with your teammates and the engineering teams that consume your platforms to continuously improve the infrastructure, automation, and tooling used to build internal products from these technologies.

WHAT YOU'LL DO:

  • Partner with Braze's engineering teams on:
    • Architecting products to effectively utilize infrastructure platforms in a scalable, reliable manner
    • Debugging reliability and scalability issues across all layers of the stack, including the products that are built using our infrastructure platforms
    • Making monitoring and alerting fire on symptoms, not on outages
    • Ensuring that Braze meets its strict enterprise-grade SLAs with customers
  • Develop Braze's internal platform infrastructure:
    • Create infrastructure as code using Chef, Terraform, and Kubernetes
    • Develop build and deploy pipelines for applications in multiple languages using Docker, Kubernetes, etc.
    • Provide centralized/common tooling, services, and automation frameworks that are critical for scaling operations, capacity management, reducing operational pain, and improving the day-to-day workflow of Braze's engineering teams
  • Manage incidents:
    • Be on a PagerDuty rotation to respond to availability incidents and provide support for other engineers
    • Use your on-call shift to prevent incidents from ever happening
    • Hold retrospectives on everything that happens to turn lessons into system improvements, changes, automation, etc.

WHO YOU ARE:

  • Have 5+ years of experience as a Site Reliability, DevOps, or Software Engineer
  • Think about systems: interfaces, boundaries, edge cases, failure modes, behaviors, and specific implementations
  • Have an urge to collaborate, document, and deliver quickly
  • Have an enthusiastic, go-for-it attitude. When you see something broken, you can't help but fix it
  • Have a desire to solve everyday challenges facing software engineers and automate their toil away
  • Have an excellent ability to manage multiple tasks and expectations at once
  • Know your way around Linux and the Unix shell
  • Have strong programming skills - Ruby and/or Go preferred
  • Have experience with Docker, Kubernetes, Terraform, or similar IaC technologies

WHAT WE OFFER

  • Competitive compensation that includes equity
  • Generous time off policy to balance your work and life, including paid parental leave
  • Competitive medical, dental, and vision coverage for you and your dependents
  • Collaborative, transparent, and fun-loving office culture

If you are a California resident subject to the California Consumer Privacy Act, click here to understand how Braze processes your personal information and how you can exercise your rights.

If you are located in the EU or UK, visit our privacy policy to understand how Braze processes your personal information and how you can exercise your rights.
