Site Reliability Engineer
7 hours ago
About GMI
We are a fast-growing AI infrastructure startup based in Silicon Valley, working on cutting-edge technologies that power the future of artificial intelligence. We power developers, startups, and enterprises with scalable GPU cloud and inference solutions, helping AI builders turn ideas into reality. As we expand globally, we are looking for a dynamic and hands-on Site Reliability Engineer.
Role Overview
We are seeking a skilled Site Reliability Engineer to join the GMI Global Infrastructure team. This role is hands-on and critical to ensuring the stability, efficiency, and reliability of the large-scale, high-performance AI/ML clusters in our data center. The ideal candidate will bring expertise in system-level troubleshooting, AI cluster maintenance, and operational excellence to ensure maximum performance for our infrastructure. Experience with large-scale infrastructure automation is considered a strong plus.
Preferred Location: Taiwan
Responsibilities
- Design, implement and maintain scalable AI/ML infrastructure solutions.
- Proactively monitor GPU cluster health and performance, and troubleshoot issues across compute, accelerator, and storage systems.
- Automate deployment, configuration and management of infrastructure resources.
- Manage GPU node lifecycle workflows, including provisioning, scaling, maintenance, upgrades, and decommissioning.
- Implement CI/CD pipelines for infrastructure deployment and orchestration.
- Ensure security, compliance and best practices across infrastructure.
- Manage incident response for infrastructure resources (GPU, CPU, storage, network).
- Handle customer provisioning requests for GPU resources, including onboarding, configuration and troubleshooting; resolve customer service requests related to GPU infrastructure, ensuring high customer satisfaction.
- Stay current with emerging GPU hardware and software technologies, integrating improvements as appropriate.
- Regional/international travel to GMI data center locations.
Qualifications
- Bachelor's degree in Computer Science or related field.
- 3+ years of experience in data center operations, infrastructure, or systems engineering.
- Proven experience in site reliability engineering and infrastructure automation (e.g. Ansible, Terraform).
- Familiarity with container orchestration platforms (e.g. Kubernetes, NVIDIA GPU Operator, NVIDIA Network Operator, CNI, CSI) and job scheduling systems (e.g. Slurm).
- Familiarity with Linux system administration and scripting (Python, Bash).
- Familiarity with logging and monitoring tools such as Prometheus, Grafana, Loki.
- Good knowledge of GPU architecture, NVIDIA CUDA, NCCL, or related AI/ML frameworks is an added advantage.
- Strong troubleshooting skills and ability to analyze system logs and performance metrics.
- Excellent communication and teamwork abilities.
Meeting every qualification is not required. If you're excited about this role, we'd love to hear from you. We believe diverse perspectives and experiences strengthen our team.