كلما زادت طلبات التقديم التي ترسلينها، زادت فرصك في الحصول على وظيفة!

إليك لمحة عن معدل نشاط الباحثات عن عمل خلال الشهر الماضي:

عدد الفرص التي تم تصفحها

عدد الطلبات التي تم تقديمها

استمري في التصفح والتقديم لزيادة فرصك في الحصول على وظيفة!

هل تبحثين عن جهات توظيف لها سجل مثبت في دعم وتمكين النساء؟

اضغطي هنا لاكتشاف الفرص المتاحة الآن!

نُقدّر رأيكِ

ندعوكِ للمشاركة في استطلاع مصمّم لمساعدة الباحثين على فهم أفضل الطرق لربط الباحثات عن عمل بالوظائف التي يبحثن عنها.

هل ترغبين في المشاركة؟

في حال تم اختياركِ، سنتواصل معكِ عبر البريد الإلكتروني لتزويدكِ بالتفاصيل والتعليمات الخاصة بالمشاركة.

ستحصلين على مبلغ 7 دولارات مقابل إجابتك على الاستطلاع.

Mark Fester

AI and GPU Linux and Infrastructure lead·CIRRASCALE CLOUD SERVICES

جنوب أفريقيا

بكالوريوس, IT

الخبرة العملية

مجموع سنوات الخبرة: 4 سنوات, 4 أشهر

AI and GPU Linux and Infrastructure lead

فبراير 2022 - حتى الآن

CIRRASCALE CLOUD SERVICES

تكساس، الولايات المتحدة •عن بُعد

فبراير 2022 - حتى الآن

▸ GPU fleet operations at scale: 1000+ nodes across MI300X, H100, RTX 4090 — daily node lifecycle, capacity verification, health restoration
▸ Vendor coordination and RMA execution with Supermicro, Lenovo, Dell, and Inspur — end-to-end from fault isolation to replacement integration
▸ Maintenance window orchestration with zero customer disruption: workload evacuation, hardware swap, return to service
▸ Living knowledge base ownership: NOC triage checklists for five server/GPU combinations actively used by my team
▸ Production Kubernetes operations: kubectl-level workflows for diagnosing pod scheduling, node states, and workload placement
▸ Internal tooling builder: shipped NOC Handoff Generator and firmware lookup tool (FWSCOUT) to reduce repetitive work
▸ Operate a 1000+ GPU node fleet across AMD MI300X (ROCm 6.4.x), NVIDIA H100, and RTX 4090 platforms; own node-level triage from hardware diagnostics through OS, driver, and fabric layers
▸ Coordinate maintenance windows with internal engineering and external vendors to evacuate unhealthy nodes and integrate replacement hardware without customer disruption
▸ Execute the full vendor RMA lifecycle: fault isolation, RMA documentation, claim approval, replacement coordination — example: Supermicro PSU PWS-3K06G-2R replacement on production GPU node, approved and resolved without workload loss
▸ Diagnose GPU-specific failure modes including ROCm RAS faults (rocm-smi --showrasinfo all), RCCL collective test failures, NVSwitch heartbeat timeouts, PCIe enumeration failures, and InfiniBand port faults on Mellanox mlx5 fabric
▸ Use kubectl in production for pod and node-state inspection on GPU clusters: scheduling diagnostics, log retrieval, exec-into-pod debugging, and node taint/drain workflows
▸ Manage out-of-band hardware via iDRAC, iLO, Lenovo XCC, and Supermicro IPMI; firmware updates, BIOS configuration, BMC-level recovery
▸ Authored five NOC triage checklists covering Supermicro H100, KAYTUS H100, Dell MI300X, and Supermicro MI300X platforms — now the team's reference for first-touch fleet response
▸ Authored standard operating procedures for the NOC team: Jira ticket workflows, hardware replacement procedures, MobaXterm session setup