Network HPC Engineer
Company: Clifyx
Location: Austin, TX 78745
Description:
Designing and deploying HPC clusters consisting of high-performance servers,
interconnected by high-speed networks such as InfiniBand (IB) or Ethernet/RoCE with
RDMA capabilities.
InfiniBand Responsibilities:
Fabric Design and Configuration: Designing InfiniBand fabrics, including switches,
host channel adapters (HCAs), and cables, to ensure optimal performance,
scalability, and fault tolerance. Configuring switch ports, virtual lanes (VLs), and
routing tables to facilitate efficient data communication within the InfiniBand fabric.
Topology Optimization: Analyzing workload characteristics and traffic patterns to
design InfiniBand topologies (e.g., fat-tree, hypercube) that minimize latency and
maximize bandwidth utilization. Implementing routing policies and congestion
control mechanisms to optimize traffic flow and prevent network congestion.
Fabric Monitoring and Management: Monitoring InfiniBand fabric health and
performance using management tools such as Subnet Manager (SM) and
Performance Monitoring Counters (PMCs). Performing regular maintenance tasks,
including firmware updates, port diagnostics, and error detection and correction.
Quality of Service (QoS): Implementing QoS policies to prioritize traffic based on
application requirements and service levels. Configuring traffic classes, service
levels, and virtual lanes (VLs) to ensure predictable performance for
latency-sensitive applications.
Security and Access Control: Securing the InfiniBand fabric with features such as
subnet partitioning (subnet manager security) and encryption to protect data
integrity and confidentiality. Enforcing access controls and authentication
mechanisms to restrict unauthorized access to the InfiniBand network.
RoCE Responsibilities
Network Design and Configuration: Designing and configuring RoCE networks,
including switches, network adapters, and Ethernet fabrics, to provide low-latency,
high-bandwidth communication for RDMA traffic. Optimizing network settings such
as MTU (Maximum Transmission Unit), buffer sizes, and flow control parameters to
maximize RoCE performance.
Congestion Management: Implementing congestion management mechanisms,
such as Priority Flow Control (PFC) and Data Center Bridging (DCB), to prevent
congestion and ensure fair allocation of network resources. Monitoring network
traffic and congestion levels to dynamically adjust congestion control settings and
avoid performance degradation.
Routing and Switching Optimization: Configuring RoCE-aware switches and routers
to support RDMA traffic and enable efficient routing of packets between endpoints.
Tuning switch port settings, forwarding tables, and routing protocols to minimize
packet loss and maximize throughput for RoCE traffic.
Performance Monitoring and Tuning: Monitoring RoCE network performance
metrics, such as latency, throughput, and packet loss, using tools like Ethernet
Performance Monitoring (EPM) and InfiniBand Performance Monitoring (IPM).
Analyzing performance data to identify bottlenecks, optimize network
configurations, and fine-tune RoCE parameters for optimal performance.
Security and Authentication: Implementing security measures, such as MACsec
(Media Access Control Security) and IPsec (Internet Protocol Security), to encrypt
and authenticate RDMA traffic over RoCE networks. Enforcing access controls and
certificate-based authentication to ensure secure communication between RoCE
endpoints.
Vendor Management: Coordinating with hardware and software vendors to ensure
compatibility and support for products in multi-vendor environments. Developing
Bills of Materials (BOMs). Defining technical requirements clearly, including
performance, scalability, compatibility, and specific features needed for RoCE.
Assessing technical specifications, performance benchmarks, and compatibility with
existing infrastructure. Implementing a proof of concept (PoC) to test the switches
in a controlled environment and ensure they meet performance and reliability
expectations. Evaluating the vendor's technical support capabilities, including
responsiveness, expertise, and available resources. Maintaining regular
communication with the vendor to stay informed about product updates, potential
issues, and upcoming changes. Scheduling periodic meetings to review performance
and bugs, discuss any concerns, and plan for future needs.
Recommended Qualifications
Bachelor's Degree in Computer Science, Information Technology, or related field: A
solid educational foundation in computer science or IT is essential for
understanding networking principles and protocols.
In-depth understanding of InfiniBand architecture, protocols (IBTA), and
technologies (e.g., Mellanox InfiniBand). Proficiency in RoCE (RDMA over Converged
Ethernet) protocols, including RoCEv2 and related standards.
Experience in designing and configuring high-performance networks, including
InfiniBand fabrics and RoCE-enabled Ethernet networks. Knowledge of fabric design
principles, topology optimization, and performance tuning techniques.
Ability to analyze network performance metrics, diagnose bottlenecks, and optimize
network configurations for low latency and high throughput. Experience in tuning
switch port settings, buffer sizes, and flow control parameters to maximize RoCE
performance.
Familiarity with security measures for InfiniBand and RoCE networks, including
subnet partitioning, encryption, and access controls. Knowledge of authentication
mechanisms and cryptographic protocols for securing RDMA traffic.
Proficiency in network monitoring tools and techniques for monitoring InfiniBand
and RoCE network health and performance. Ability to troubleshoot network issues,
diagnose connectivity problems, and resolve performance-related issues.
Certifications from programs offered by vendors such as Mellanox (now NVIDIA
Networking) for InfiniBand and RoCE technologies.
Hands-on experience in deploying, managing, and optimizing high-performance
computing (HPC) environments and data center networks. Experience working with
RDMA-enabled applications and parallel computing frameworks (e.g., MPI,
OpenMP).
Experience in implementing and troubleshooting complex network configurations,
including InfiniBand switches, gateways, and RoCE adapters.
Additional Nice-to-Haves:
Bachelor's degree in Computer Science, Computer Engineering, relevant technical
field, or equivalent practical experience.
CCNA, CCIE, or similar certification
Ability to work efficiently on multiple projects and under pressure
Previous experience with network equipment vendor products (e.g., Juniper, Cisco,
Arista, OEM).
Working knowledge of stateful and stateless firewalls
Comfortable with Linux or other UNIX implementations, with scripting skills.
Experience with Python scripting or Ansible for automation
Ability to "read code" as source documentation
DevOps CI/CD mindset for automation and scale