▸ GPU fleet operations at scale: 1000+ nodes across MI300X, H100, RTX 4090 — daily node lifecycle, capacity verification, health restoration
▸ Vendor coordination and RMA execution with Supermicro, Lenovo, Dell, and Inspur — end-to-end from fault isolation to replacement integration
▸ Maintenance window orchestration with zero customer disruption: workload evacuation, hardware swap, return to service
▸ Living knowledge base ownership: NOC triage checklists, actively used by my team, for five server/GPU combinations
▸ Production Kubernetes operations: kubectl-level workflows for diagnosing pod scheduling, node states, and workload placement
▸ Internal tooling builder: shipped NOC Handoff Generator and firmware lookup tool (FWSCOUT) to reduce repetitive work
▸ Operate a 1000+ GPU node fleet across AMD MI300X (ROCm 6.4.x), NVIDIA H100, and RTX 4090 platforms; own node-level triage from hardware diagnostics through OS, driver, and fabric layers
▸ Coordinate maintenance windows with internal engineering and external vendors to evacuate unhealthy nodes and integrate replacement hardware without customer disruption
▸ Execute the full vendor RMA lifecycle: fault isolation, RMA documentation, claim approval, and replacement coordination. Example: Supermicro PWS-3K06G-2R PSU replaced on a production GPU node, with the claim approved and resolved without workload loss
▸ Diagnose GPU-specific failure modes including ROCm RAS faults (rocm-smi --showrasinfo all), RCCL collective test failures, NVSwitch heartbeat timeouts, PCIe enumeration failures, and InfiniBand port faults on Mellanox mlx5 fabric
▸ Use kubectl in production for pod and node-state inspection on GPU clusters: scheduling diagnostics, log retrieval, exec-into-pod debugging, and node taint/drain workflows
▸ Manage hardware out-of-band via iDRAC, iLO, Lenovo XCC, and Supermicro IPMI for firmware updates, BIOS configuration, and BMC-level recovery
▸ Authored five NOC triage checklists covering Supermicro H100, KAYTUS H100, Dell MI300X, and Supermicro MI300X platforms — now the team's reference for first-touch fleet response
▸ Authored standard operating procedures for the NOC team: Jira ticket workflows, hardware replacement procedures, MobaXterm session setup
- Company industry:
- Business support services