Cloud & Compute Operations Engineer
Company: Radiant Digital
Location: Ashburn, VA 20147
Description:
Job description:
We hire the brightest minds in the world to tackle some of the biggest questions in finance. We pair this expertise with machine learning, big data, and some of the most advanced technology available to predict movements in financial markets.
The Role
We are seeking an engineer to join our Compute Management (CPM) team, based within our Virginia (US) data centre, to ensure the high availability and smooth running of our High-Performance Compute platform. CPM sits within our Hybrid cloud management function, alongside:
- Hardware provisioning and break fix
- Deployment of compute to end-users
- Development of new features to meet user demands
Within that Function, our Infrastructure Engineering teams manage a complex on-premises cloud platform, which runs 24/7 across multiple time zones. Built across OpenStack, VMware and Kubernetes, the platform comprises thousands of servers, deployed in multi-MW data centres in Europe and the US, to run parallelised Research simulations. As well as working closely with other members of the Compute team, you will provide localised operational assistance and technical oversight, demonstrating sound technical understanding of our platforms and tooling, with the ability to provide assistance in a practical way. You will be enthusiastic and autonomous, and aspire to be inclusive and collaborative to help accelerate the growth of our Compute Function. In return, you will gain exposure to the latest hardware and software technologies in a forward-thinking company that values innovation, personal development and training. This role requires close interaction with Infrastructure, Datacentre and other Operations teams, giving broad exposure to a variety of enterprise products and technologies.
Key responsibilities of the role include:
- Investigating hardware and software problems and working with manufacturer support teams to resolve issues.
- Collaborating with UK- and Dallas-based peers to prioritise, troubleshoot, diagnose and resolve computer hardware-related issues.
- Liaising with key vendors providing on-site support, managing SLAs and delivery.
- Ensure best practice is followed, balancing quality and delivery.
- Issue ownership across compute with upward reporting.
- Support-focused work, triaging an inbound queue of requests, queries and logged issues, but also working directly with Senior Engineers to assist with their project work.
- Working with Datacentre teams on hardware implementation and management.
- Monitoring and analysing the health of the compute infrastructure using provided tooling.
- Reacting to alerts and providing resolutions to major incidents.
- Assisting with Security implementation and deployment.
- Decommissioning old hardware: uncabling, removal from racks and organizing ecological disposal.
- Escorting and assisting 3rd party installers and adhering to Company Security policies
- Serve as the primary responder for compute issues within the data centers
- Completing support tickets, and hardware audits, to ensure compliance with standards.
- Onsite spares management.
- Performing component-level resolution, including swap-outs and reseating.
This role will suit someone with a strong technical and methodical approach to any given task.
Who are we looking for?
The ideal candidate will have the following skills and experience:
- A technical background in compute server Hardware Administration & Operations.
- Hardware component identification / understanding
- An understanding of automation, scripting and CI/CD tooling, such as Ansible, Python and Jenkins
- Background in managing large-scale high-performance compute infrastructure
- Knowledge of managing tickets and work through JIRA or any ITIL enabled systems
- Solid interpersonal skills and comfort interacting with local and remote stakeholders at both infrastructure and management level
- A proactive mindset, initiative and a positive work ethic