AI Computing Infrastructure Engineer – GPU & High-Performance Computing
Experience: 10+ years
Introduction

We are looking for a highly capable AI Infrastructure Engineer to design, implement, and optimize GPU-accelerated compute environments that power advanced AI and machine learning workloads. This role is critical in building and supporting scalable, high-performance infrastructure across data centers and hybrid cloud platforms, enabling training, fine-tuning, and inference of modern AI models.

Job Description

Must have:

  • 3–6 years of experience in AI/ML infrastructure engineering or high-performance computing (HPC).
  • Solid experience with GPU-based systems, container orchestration, and AI/ML frameworks.
  • Familiarity with distributed systems, performance tuning, and large-scale deployments.
  • Expertise in modern GPU architectures (e.g., NVIDIA A100/H100, AMD MI300), multi-GPU configurations (NVLink, PCIe, HBM), and accelerator scheduling for AI training and inference workloads.
  • Good understanding of modern AI model architectures, including LLMs (e.g., GPT, LLaMA), diffusion models, and multimodal encoder-decoder frameworks, with awareness of their compute and scaling requirements.
  • Knowledge of leading AI/ML frameworks (e.g., TensorFlow, PyTorch), NVIDIA’s AI stack (CUDA, cuDNN, TensorRT), and open-source tools like Hugging Face, ONNX, and MLPerf for model development and benchmarking.
  • Familiarity with AI pipelines for supervised/unsupervised training, fine-tuning (PEFT/LoRA/QLoRA), and batch or real-time inference, with expertise in distributed training, checkpointing, gradient strategies, and mixed-precision optimization (see the fine-tuning sketch after this list).
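For context on the PEFT/LoRA/QLoRA expectation above, here is a minimal sketch of attaching LoRA adapters to a causal language model with Hugging Face's peft and transformers libraries. The base model (gpt2), rank, and target modules are illustrative assumptions, not role requirements.

```python
# Minimal LoRA fine-tuning setup with Hugging Face peft (illustrative sketch).
# Base model, rank, and target modules are assumptions for demonstration.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("gpt2")  # hypothetical base model

# LoRA: freeze base weights, train only low-rank adapters on attention layers.
lora_config = LoraConfig(
    r=16,                        # rank of the low-rank update matrices
    lora_alpha=32,               # scaling factor applied to the update
    target_modules=["c_attn"],   # GPT-2's fused attention projection
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only adapter weights remain trainable
```

The same pattern extends to QLoRA by loading the base model in 4-bit precision before attaching the adapters.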
Responsibilities include:
  • Design and deploy AI infrastructure with multi-GPU clusters on NVIDIA or AMD platforms.
  • Configure GPU environments using CUDA, DGX systems, and the NVIDIA device plugin for Kubernetes.
  • Deploy and manage containerized environments with Docker, Kubernetes, and Slurm.
  • Support and optimize training, fine-tuning, and inference pipelines for LLMs and deep learning models.
  • Enable distributed training using DDP, FSDP, and ZeRO, with support for mixed precision (see the sketch after this list).
  • Tune infrastructure to optimize model performance, throughput, and GPU utilization.
  • Design and operate high-bandwidth, low-latency networks using InfiniBand and RoCE v2.
  • Integrate GPUDirect Storage and optimize data flow across Lustre, BeeGFS, and Ceph/S3.
  • Support fast data ingestion, ETL pipelines, and large-scale data staging.
  • Leverage NVIDIA’s AI stack including cuDNN, NCCL, TensorRT, and Triton Inference Server.
  • Conduct performance benchmarking with MLPerf and custom test suites.
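To illustrate the distributed-training and mixed-precision responsibilities above, here is a minimal PyTorch DistributedDataParallel loop using torch.cuda.amp. The model, data, and hyperparameters are placeholders; a real job would substitute its own modules and dataset.

```python
# Minimal PyTorch DDP + mixed-precision training sketch (illustrative only).
# Launch with: torchrun --nproc_per_node=<num_gpus> train.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE in the environment.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    device = f"cuda:{local_rank}"

    # Placeholder model and synthetic data for demonstration.
    model = DDP(torch.nn.Linear(1024, 1024).to(device), device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    scaler = torch.cuda.amp.GradScaler()  # loss scaling for fp16 stability

    for step in range(100):
        x = torch.randn(32, 1024, device=device)
        optimizer.zero_grad(set_to_none=True)
        with torch.cuda.amp.autocast():    # run the forward pass in mixed precision
            loss = model(x).square().mean()
        scaler.scale(loss).backward()      # gradients all-reduce across ranks here
        scaler.step(optimizer)
        scaler.update()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

NCCL is the expected backend on GPU clusters; FSDP and ZeRO follow the same launch pattern but shard parameters and optimizer state instead of replicating them.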
Certifications:
  • NVIDIA Certified Professional – Data Center AI
  • Kubernetes Administrator (CKA)
  • CCNP or CCIE Data Center 
  • Cloud Certification (AWS, Azure, or GCP)


Educational Qualifications

  • Bachelor's in Computer Science/Applications, B.Tech in Computer Science, or MCA
Primary Skills:
  • GPU Infrastructure Design & Optimization (NVIDIA A100/H100, AMD MI300)
  • CUDA Programming & NVIDIA DGX Systems Setup
  • Containerization with Docker, Kubernetes, and NVIDIA Device Plugin
  • Distributed AI Training (DDP, FSDP, ZeRO, Mixed Precision)
  • PyTorch, TensorFlow, and Model Optimization using TensorRT
  • High-Performance Networking (InfiniBand, RoCEv2, GPUDirect Storage)
  • AI Model Deployment using Triton Inference Server (see the client sketch after this list)
  • Data Management for AI Pipelines (Lustre, BeeGFS, Ceph, S3)
  • Infrastructure Performance Benchmarking (MLPerf, NCCL Tests)
  • Experience with LLMs and AI Model Scaling Requirements 
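To ground the Triton Inference Server item above, here is a minimal HTTP client sketch using the tritonclient library. The server URL, model name, and tensor names are hypothetical; real values come from the deployed model's config.pbtxt.

```python
# Minimal Triton Inference Server HTTP client sketch (illustrative only).
# Server URL, model name, and tensor names are assumptions; real values
# come from the deployed model's config.pbtxt.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Hypothetical text model taking token IDs as input.
input_ids = np.random.randint(0, 30000, size=(1, 128), dtype=np.int64)
infer_input = httpclient.InferInput("input_ids", list(input_ids.shape), "INT64")
infer_input.set_data_from_numpy(input_ids)

response = client.infer(model_name="example_model", inputs=[infer_input])
logits = response.as_numpy("logits")  # output tensor name from config.pbtxt
print(logits.shape)
```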
Secondary Skills:
  • Slurm Workload Manager for Scheduling AI Jobs
  • PEFT/LoRA/QLoRA-based Fine-tuning Strategies
  • Open-Source AI Tools – Hugging Face, ONNX, FastAPI for Model Serving
  • Integration with ETL/Data Ingestion Pipelines (Kafka, Spark, Airflow)
  • GPU Memory Optimization – HBM Utilization, GPU Resource Scheduling
  • AI Pipeline Automation using Python, Bash, or Terraform
  • Basic Cloud Infrastructure Knowledge (AWS EC2 GPU Instances, Azure ML, GCP Vertex AI)
  • Monitoring & Logging (Prometheus, Grafana, NVIDIA DCGM, ELK Stack); see the GPU telemetry sketch after this list
  • Hybrid Cloud Setup for AI Workloads
  • CI/CD Pipelines for MLOps (GitHub Actions, MLflow, Kubeflow Pipelines)
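As a small illustration of the GPU monitoring item above, here is a sketch that polls per-GPU utilization and memory through NVML via the pynvml bindings. Production monitoring in this role would more likely use DCGM exporters feeding Prometheus, so treat this as a minimal, assumption-laden example.

```python
# Minimal GPU telemetry sketch via NVML (pip install nvidia-ml-py).
# Production setups would typically use DCGM exporters + Prometheus;
# this is only an illustrative poller.
import time
import pynvml

pynvml.nvmlInit()
device_count = pynvml.nvmlDeviceGetCount()

for _ in range(3):  # poll a few times for demonstration
    for i in range(device_count):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # % GPU / memory busy
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)         # bytes used / total
        print(f"GPU{i}: util={util.gpu}% "
              f"mem={mem.used / 2**30:.1f}/{mem.total / 2**30:.1f} GiB")
    time.sleep(1)

pynvml.nvmlShutdown()
```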
Job Details
Role:
AI Computing Infrastructure Engineer – GPU & High-Performance Computing
Location:
Dubai
Close Date:
18-07-2025
Interested candidates may forward their detailed resumes to Careers@reflectionsinfos.com along with their notice period and current and expected CTC details.

Note to jobseekers: some fraudsters are promising jobs with Reflections Info Systems for a fee. No payment is ever sought for jobs at Reflections. We contact candidates only through our official website or LinkedIn, and all employment-related mails are sent from the official HR email ID. Please contact careers@reflectionsinfos.com for any clarification or alerts on this subject.