About this role

Affirm is reinventing credit to make it more honest and friendly, giving consumers the flexibility to buy now and pay later without any hidden fees or compounding interest.

Site Reliability Engineering at Affirm is a small, yet crucial, team that helps our Engineering partners to “Operate What They Own” with excellence to protect their customers’ experience. SRE accomplishes this through defining frameworks and best practices for operating applications, building tooling, and providing training and consulting. Some of the many SRE responsibilities are:

Providing data and visibility to teams and leadership on application performance

Guiding the development of SLOs

Driving the Incident Management and Analysis process

Steering the implementation of Change Management and Deployment practices

Engaging in service and architectural conversations

Recommending observability and alerting configurations

The SRE team benefits from experience across many domains including:

infrastructure, platform, and distributed systems

capacity management, load and chaos testing

automation, observability, and configuration management

development and product experience

The SRE team is seeking seasoned and motivated software and systems engineers with the experience to build, iterate on, and expand incident lifecycle, reliability, and resilience practices throughout Affirm's Engineering organization and beyond.

Responsibilities

You will be responsible for setting the technical strategy and vision for your team on a multi-year time scale, and for helping your team tie it to critical, business-impacting projects.

You will collaborate across teams throughout the product development lifecycle, working with infrastructure, product management, developer experience, and analytics to ensure that technical sustainability, risks, and trade-offs are well understood and managed.

You will act as a force multiplier for your team through your definition and advocacy of technical solutions and operational processes.

You will take ownership of your team’s operations and availability by ensuring you have the right monitoring, triage rotations, playbooks, policies, testing, and alerting in place to support “keep the lights on” and on-call efforts.

You will foster a culture of quality and ownership on your team by setting code review and design standards for your team, and advocating for them beyond your team through your writing and tech talks.

You will help develop talent on your team by providing feedback and guidance, and leading by example.

What We Look For:

You have 8+ years of experience designing, developing, and launching backend systems at scale, and serving as a subject-matter point of reference for them, using scripting and development languages like Bash, Python, or Kotlin.

You have an extensive track record of developing highly available distributed systems using technologies like AWS, MySQL, Spark and Kubernetes.

You have a track record of managing, driving, and improving the Incident Lifecycle process, from live incident management through retrospective and post-incident analysis, to provide actionable insights that enhance overall system reliability, resilience, and performance.

You have 7+ years of experience in Site Reliability or Production Engineering teams.

You demonstrate curiosity with empathy, and strong opinions loosely held.

You have experience delivering major features, system components or deprecating existing functionality in a system through the definition of a technical and execution plan. You write high quality code that is easily understood and used by others.

You thrive in ambiguity, and are comfortable moving from low level language idioms all the way to the architecture of large systems to understand how they work.

Your growth and impact trajectory demonstrates that you have mastered gathering and iterating on feedback from your engineering and cross-functional peers.

You have strong verbal and written communication skills that support effective collaboration with our global engineering team and key stakeholders of an organization.

This position requires either equivalent practical experience or a Bachelor’s degree in a related field.

Staff Site Reliability Engineer (SRE & Platform Reliability)
