Site Reliability Engineer (m/f/d)
US, Remote / Finance & Administration – IT
Acrolinx envisions a world connected by amazing content, supercharging the billions of enterprise content touchpoints that power the global customer experience. We deliver best-in-class Customer Experience Management (CEM) linguistic software solutions to many of the largest and most innovative international companies. Chances are, you’ve read some text today that Acrolinx software helped to shape. Our customers rely on our AI to ensure consistent and high-quality writing on a global scale.
We're looking for a motivated Site Reliability Engineer to join our Corporate IT team, working remotely on the East Coast or from our office in Waltham, Boston (MA).
Your mission is to build an ecosystem that serves as the foundation for a data-driven, predictive operations (AIOps) approach within the Corporate IT department. The goal is an optimal balance between high reliability, maintainability, scalability, resilience, and velocity across our hybrid IT infrastructure (on-premises data center and public cloud).
Impact & Responsibilities:
You run our infrastructure with VMware vSphere as well as Ansible, Terraform, and Kubernetes. You are responsible for monitoring and alerting on symptoms rather than on outages. You document every action so your findings turn into repeatable actions, and then into automation. Debugging production issues across services and levels of the stack is also part of your responsibilities, as is planning the growth of the Acrolinx infrastructure.
In addition, you will ensure end-user hardware availability for our globally operating team by working closely with suppliers and logistics partners. To that end, you will develop and maintain an automatic device enrollment program (DEP) for Windows and Apple computers to streamline the onboarding and offboarding process. You also act as a mentor, train other team members on design techniques and coding standards, and work with internal stakeholders to understand their needs. You are also responsible for implementing best practices and providing feedback to team members through peer reviews.
- You have a Master’s degree in Computer Science or a related field with more than 4 years of experience in SRE, Software Engineering or Operations Engineering roles and know your way around Linux and the Unix Shell. You have strong programming skills with experience in Java or Python.
- You have practical working experience with AWS services and IAM and know how to set them up from scratch. System administration experience with traditional on-premises data center infrastructure is a plus but not a must.
- You like to think about systems - edge cases, failure modes, behaviors, and specific implementations. You have worked with Docker, Kubernetes, Helm, Terraform, Ansible, or similar technologies and understand the purpose of configuration management systems such as Ansible. Past experience tuning and maintaining the performance of Linux and cloud-based systems is desirable.
- You have experience in observability and AIOps using one or more of the following: Dynatrace, Datadog, Grafana, Prometheus, ELK, Elastic, Kibana, CloudWatch, Kinesis.
- You are enthusiastic, have a go-for-it attitude and want to deliver quickly and iterate fast. You like to collaborate and communicate asynchronously. You are able to work independently and you do great work even when no one is watching. The drive to improve and deliver is just a part of your DNA.
- You are a team player and enjoy collaborating with cross-functional teams. You like to share your knowledge and experience and can document all the things, so you don't need to learn the same thing twice.
Additional Skills (recommended)
- Strong knowledge of Linux/Unix system fundamentals
- Experience with build automation, continuous integration, or continuous deployment tools
- Experience with virtualization infrastructure such as VirtualBox, OpenStack, and VMware
- Ability to prototype and demonstrate mechanisms for performance improvement, high availability, and system scaling
- Adept at assessing issues, with the ability to devise workable solutions quickly and respond appropriately
- Excellent interpersonal and diplomatic skills as well as a positive attitude
- Excellent written and verbal communication skills with the ability to present complex information in a clear, concise manner to all audiences
- Flexible, with the ability to change priorities quickly and focus on new ones without distraction
- Ability to deal with conflict and work under pressure to meet deliverable dates / timelines
- Experience in negotiating timelines and deliverables with a strong sense of urgency
- Familiarity with Atlassian tools (including Jira and Confluence)
- Interested in researching and introducing new technologies, practices, and techniques, and open to continued learning
- Knowledge of German is a plus
- Global responsibilities with a high degree of autonomy and many opportunities to make an impact
- Continuous opportunities for training as well as expanding your skills and experiences alongside the global growth of the company
- Start-up environment with flat, informal hierarchies and quick decision-making processes
- Being part of a truly diverse and international team
- Very attractive compensation with upside for strong performance
- 20 vacation days/year, flexible working hours, and an attractive range of other benefits