ABOUT THE JOB
Workflow is an extensive platform that unifies many web and mobile applications.
As an SRE, you will be responsible for one or two applications within the Workflow platform and will work closely with the corresponding development team.
Key Areas of Focus:
• Reducing Technical Debt
• Reducing Toil
• Observability/System Monitoring
• Incident Response throughout the SDLC
• Problem Management
First 6 months in the position:
• Clean up existing systems, fix bugs, and lay the groundwork for future SRE work
• Automate any tasks or parts of the system that are currently performed manually
• Configure and maintain the monitoring tooling for the target application
• Monitor application/infrastructure and take steps to improve overall system software performance, availability, and reliability by incorporating changes through defined feedback loops within the software delivery lifecycle
• Document tribal knowledge as you acquire it over time by creating runbooks/playbooks and ensuring critical system information is readily available to those who need it through dashboards
After the first 6 months in the position:
• Work closely with software developers and testers to ensure the product meets non-functional requirements such as security, performance, and availability
• Resolve NOC escalations and help prevent incidents from recurring by creating processes and automation
• Be a key part of our response to high-severity internal customer incidents, ensuring we meet all SLAs and SLOs
• Help build an SRE culture by sharing best practices, approaches, documentation, and code with other engineering teams across the organization
• Assist the product development team with managing its error budget
• Embrace failures and treat incidents as learning opportunities by conducting blameless postmortems
• Participate in product engineering stand-ups and related design activities
• Coach other team members to ensure systems are supported by following SRE best practices
Job Location
Remote (Hungary, Poland)
ABOUT THE COMPANY
This company is a top player in the vehicle lifecycle game! They're all about helping the people who make, insure, repair, and replace cars step up their transportation game using some seriously rad tech, like mobile, artificial intelligence, and connected car stuff. They've built a huge network of over 350 insurance companies, 24,000 repair facilities, OEMs, tons of parts suppliers, and other data and service providers to help their clients make better decisions, work faster, and create an awesome experience for their customers. And get this – they're a pretty big deal! They were ranked #17 in the Top 100 Digital Companies in Chicago in 2020 by Built in Chicago (which is like THE online community for digital tech entrepreneurs in Chicago), and Forbes named them one of the best mid-sized companies to work for in 2019. With over 2,600 full-time employees (plus 350 contractors), they're keeping things real in their sweet downtown Chicago headquarters at the historic Merchandise Mart building. Plus, it's totally eco-friendly – it's LEED certified and a total tech hub in the city.
We've won some pretty sweet awards too, like the Innovation Championship by Zurich, where we snagged 1st place out of 1,300 solutions from all over the world. We also won the Global Silver Award for Innovation in Insurance out of 359 innovations from 45 countries. And to top it all off, we were voted one of "the 3 best innovations at a global level" in InsurTech. Plus, Plug and Play Insurance Partners voted us the #1 InsurTech. We're pretty proud of all that!
ABOUT THE CANDIDATE
We're on the hunt for a top-notch Site Reliability Engineer (SRE) to join our product development team! As our SRE, you'll be the go-to person for ensuring our applications run smoothly and are always available for our users. You're a master troubleshooter who loves getting to the bottom of any problem that pops up, fixing it, and making sure our teams learn from it. If you're passionate about keeping things running smoothly and thrive in a fast-paced environment, then we want you on our team!
Requirements
• Experience with monitoring and data visualization tools: AppDynamics, AlertSite, Nagios, Grafana, Prometheus, Kibana, Datadog, or any cloud-native monitoring service such as CloudWatch
• Experience with source code management tools: GitHub, GitLab, SVN, Bitbucket
• Experience with incident management tools: RemedyForce, PagerDuty
• Experience with collaboration tools: Teams, Confluence, Microsoft Office 365
• Experience with project management tools: VersionOne, Jira
• Solid understanding of microservices and APIs
• Well versed in system management, monitoring, and analysis to identify opportunities for improving service health, manageability, and reliability
• Proven ability to dig through metrics, logs, and available sources to triage and resolve an incident at any time
• Eager to problem-solve and troubleshoot issues that may arise day-to-day
• Ability to document solutions, SRE architectural patterns, and best practices to ensure that teams have guidance as needed
• Experience and interest in working in an Agile environment
• Effective communication and interpersonal skills
Nice To Have Skills
Benefits
When you join our stellar team, you'll get tons of cool benefits, like:
• Building your skills with our Client Engagement team, who can help with all kinds of projects.
• Joining our awesome community of like-minded folks.
• Becoming a mentor or speaker and getting rewarded for it – both emotionally and financially!
• Attending meetups as a speaker or listener to learn and grow.
• We're all about broadening our horizons and sharing knowledge – so don't be afraid to ask questions and get curious!