Mark Fester

AI and GPU Linux and Infrastructure lead · CIRRASCALE CLOUD SERVICES

South Africa

Bachelor's degree, IT

Work experience

Total years of experience: 4 years, 4 months

AI and GPU Linux and Infrastructure lead

February 2022 - Present

CIRRASCALE CLOUD SERVICES

Texas, United States • Remote

▸ GPU fleet operations at scale: 1000+ nodes across MI300X, H100, RTX 4090 — daily node lifecycle, capacity verification, health restoration
▸ Vendor coordination and RMA execution with Supermicro, Lenovo, Dell, and Inspur — end-to-end from fault isolation to replacement integration
▸ Maintenance window orchestration with zero customer disruption: workload evacuation, hardware swap, return to service
▸ Own a living knowledge base: NOC triage checklists for five server/GPU combinations, actively used by the team
▸ Production Kubernetes operations: kubectl-level workflows for diagnosing pod scheduling, node states, and workload placement
▸ Internal tooling builder: shipped NOC Handoff Generator and firmware lookup tool (FWSCOUT) to reduce repetitive work
▸ Operate a 1000+ GPU node fleet across AMD MI300X (ROCm 6.4.x), NVIDIA H100, and RTX 4090 platforms; own node-level triage from hardware diagnostics through OS, driver, and fabric layers
▸ Coordinate maintenance windows with internal engineering and external vendors to evacuate unhealthy nodes and integrate replacement hardware without customer disruption
▸ Execute the full vendor RMA lifecycle: fault isolation, RMA documentation, claim approval, replacement coordination — example: Supermicro PSU PWS-3K06G-2R replacement on production GPU node, approved and resolved without workload loss
▸ Diagnose GPU-specific failure modes including ROCm RAS faults (rocm-smi --showrasinfo all), RCCL collective test failures, NVSwitch heartbeat timeouts, PCIe enumeration failures, and InfiniBand port faults on Mellanox mlx5 fabric
▸ Use kubectl in production for pod and node-state inspection on GPU clusters: scheduling diagnostics, log retrieval, exec-into-pod debugging, and node taint/drain workflows
▸ Manage out-of-band hardware via iDRAC, iLO, Lenovo XCC, and Supermicro IPMI; firmware updates, BIOS configuration, BMC-level recovery
▸ Authored five NOC triage checklists covering Supermicro H100, KAYTUS H100, Dell MI300X, and Supermicro MI300X platforms — now the team's reference for first-touch fleet response
▸ Authored standard operating procedures for the NOC team: Jira ticket workflows, hardware replacement procedures, MobaXterm session setup

Company industry: Business Support Services

Education

UNISA

June 2020