
Site Reliability Engineer
22 hours ago
Role Overview
We are seeking a skilled Site Reliability Engineer to join the GMI Global Infrastructure team. This role is hands-on and critical to ensuring the stability, efficiency, and reliability of the large-scale, high-performance AI/ML clusters in our data center. The ideal candidate will bring expertise in system-level troubleshooting, AI cluster maintenance, and operational excellence to maximize the performance of our infrastructure. Experience with large-scale infrastructure automation is considered a strong plus.
Responsibilities
- Design, implement and maintain scalable AI/ML infrastructure solutions.
- Proactively monitor GPU cluster health and performance, and troubleshoot issues across compute, accelerator, and storage systems.
- Automate deployment, configuration and management of infrastructure resources.
- Manage GPU node lifecycle workflows, including provisioning, scaling, maintenance, upgrades and decommissioning.
- Implement CI/CD pipelines for infrastructure deployment and orchestration.
- Ensure security, compliance and best practices across infrastructure.
- Manage incident response for infrastructure resources (GPU, CPU, storage, network).
- Handle customer provisioning requests for GPU resources, including onboarding, configuration and troubleshooting; resolve customer service requests related to GPU infrastructure, ensuring high customer satisfaction.
- Stay current with emerging GPU hardware and software technologies, integrating improvements as appropriate.
- Travel regionally and internationally to GMI data center locations.
Qualifications
- Bachelor's degree in Computer Science or related field.
- 3+ years of experience in data center operations, infrastructure, or systems engineering.
- Proven experience in site reliability engineering and infrastructure automation (e.g. Ansible, Terraform).
- Familiarity with container orchestration platforms (e.g. Kubernetes, NVIDIA GPU Operator, NVIDIA Network Operator, CNI, CSI) and job scheduling systems (e.g. Slurm).
- Familiarity with Linux system administration and scripting (Python, Bash).
- Familiarity with logging and monitoring tools such as Prometheus, Grafana, Loki.
- Good knowledge of GPU architecture, NVIDIA CUDA, NCCL, or related AI/ML frameworks is an added advantage.
- Strong troubleshooting skills and ability to analyze system logs and performance metrics.
- Excellent communication and teamwork abilities.
- Meeting every qualification is not required. If you're excited about this role, we'd love to hear from you. We believe diverse perspectives and experiences strengthen our team.