Infra Engineer

14 hours ago


Taipei, Taipei City, Taiwan GMI Cloud Full time

About GMI

GMI Cloud is a fast-growing AI infrastructure company backed by Headline VC and one of only six cloud providers worldwide to earn NVIDIA's prestigious Reference Platform Cloud Partner designation. We operate 8 of our own GPU clusters across the U.S. and Asia, delivering a full spectrum of services from GPU compute to AI model inference API solutions. As an NVIDIA Reference Platform Cloud Partner, our infrastructure meets the highest standards for performance, security, and scalability in AI deployments. We empower AI startups and enterprises to "build AI without limits," providing everything they need to prototype, train, and deploy AI models quickly and reliably.

Role Overview

We are seeking a skilled Site Reliability Engineer to join the GMI Global Infrastructure team. This role is hands-on and critical to ensuring the stability, efficiency, and reliability of the large-scale, high-performance AI/ML clusters in our data center. The ideal candidate will bring expertise in system-level troubleshooting, AI cluster maintenance, and operational excellence to ensure maximum performance for our infrastructure. Experience with large-scale infrastructure automation is considered a strong plus.

Responsibilities

  1. Design, implement and maintain scalable AI/ML infrastructure solutions.
  2. Proactively monitor GPU cluster health and performance, and troubleshoot issues across compute, accelerator, and storage systems.
  3. Automate deployment, configuration and management of infrastructure resources.
  4. Manage GPU node lifecycle workflows, including provisioning, scaling, maintenance, decommissioning and upgrades of GPU nodes.
  5. Implement CI/CD pipelines for infrastructure deployment and orchestration.
  6. Ensure security, compliance and best practices across infrastructure.
  7. Manage incident response for infrastructure resources (GPU, CPU, storage, network).
  8. Handle customer provisioning requests for GPU resources, including onboarding, configuration and troubleshooting; resolve customer service requests related to GPU infrastructure, ensuring high customer satisfaction.
  9. Stay current with emerging GPU hardware and software technologies, integrating improvements as appropriate.
  10. Travel regionally and internationally to GMI data center locations.

Qualifications

  1. Bachelor's degree in Computer Science or related field.
  2. 3+ years of experience in data center operations, infrastructure, or systems engineering.
  3. Proven experience in site reliability engineering and infrastructure automation (e.g. Ansible, Terraform).
  4. Familiarity with container orchestration platforms (e.g. Kubernetes, NVIDIA GPU Operator, NVIDIA Network Operator, CNI, CSI) and job scheduling systems (e.g. Slurm).
  5. Familiarity with Linux system administration and scripting (Python, Bash).
  6. Familiarity with logging and monitoring tools such as Prometheus, Grafana, Loki.
  7. Good knowledge of GPU architecture, NVIDIA CUDA, NCCL, or related AI/ML frameworks is an added advantage.
  8. Strong troubleshooting skills and ability to analyze system logs and performance metrics.
  9. Excellent communication and teamwork abilities.

Meeting every qualification is not required. If you're excited about this role, we'd love to hear from you. We believe diverse perspectives and experiences strengthen our team.


