GPU Cluster Administrator


New Taipei City, Taiwan | 鴻騰精密 FIT - Foxconn Interconnect Technology Ltd Asia | Full time | NT$1,200,000 - NT$2,400,000 per year

The ideal candidate will work cross-functionally with data science, AI, and product teams to understand their compute needs and ensure reliable, scalable access to NVIDIA H200 GPU resources. To be effective in this position, you must feel comfortable owning the entire GPU infrastructure workflow, from cluster design, deployment, and monitoring to user management, scheduling, and optimization. The GPU Cluster Administrator is responsible for developing and delivering robust infrastructure solutions that enable AI workloads to run efficiently and cost-effectively across the organization.

[Responsibilities]

  • Design, deploy, and administer on-premises or cloud-hosted GPU clusters based on NVIDIA H200 (and related GPU families).
  • Configure and manage Kubernetes (K8s) clusters for GPU workloads, including node pools, taints/tolerations, and GPU-aware schedulers.
  • Implement and optimize load balancing, auto-scaling, and resource quotas to support multiple concurrent users and AI projects.
  • Manage user accounts and access control (RBAC), namespaces, and multi-tenant isolation for data scientists and engineers.
  • Maintain the GPU software stack: CUDA, cuDNN, NCCL, drivers, container runtimes, and relevant libraries/frameworks.
  • Develop standardized, reproducible environments using Docker/OCI images and Helm charts for ML and deep learning workloads.
  • Monitor GPU utilization, pod performance, and system health using tools such as Prometheus, Grafana, or equivalent observability stacks.
  • Troubleshoot performance bottlenecks, failed jobs, and scheduling issues across Kubernetes and GPU nodes.
  • Collaborate with data science teams to support training, inference, and experimentation workflows (e.g., PyTorch, TensorFlow).
  • Implement job scheduling policies (fair-share, priority queues, preemption) to balance throughput and responsiveness.
  • Ensure cluster security and compliance, including network policies, image scanning, secrets management, and audit logging.
  • Work with IT/DevOps and cloud teams to integrate GPU clusters with existing CI/CD, storage, and identity platforms.
  • Document best practices, provide training and onboarding materials for users, and serve as the primary point of contact for GPU infrastructure questions.

[Requirements]

  • B.S. with 5+ years of industry experience, M.S. with 3+ years, or Ph.D. in Computer Science, Computer Engineering, or a related field with a focus on infrastructure, distributed systems, or MLOps.
  • Hands-on experience administering GPU clusters (NVIDIA H100/H200/A100 or similar) in production environments.
  • Strong experience with Kubernetes for production workloads, including cluster setup, networking, storage, and security.
  • Solid understanding of GPU scheduling and resource management in K8s (e.g., device plugins, GPU operators, custom schedulers).
  • Proficiency with Linux system administration (networking, storage, services, security hardening).
  • Experience managing multi-user environments: identity and access management (IAM), RBAC, quotas, and namespace isolation.
  • Strong scripting and automation skills (Python, Bash, or similar) to build tools for cluster operations and user workflows.
  • Familiarity with container technologies (Docker, containerd) and image lifecycle management.
  • Experience with monitoring and logging stacks (Prometheus, Grafana, ELK/EFK, or similar) for infrastructure observability.
  • Understanding of common ML/DL frameworks (PyTorch, TensorFlow) and their resource requirements on GPU clusters.
  • Experience with cloud platforms (Azure, AWS, or GCP) for GPU-based workloads is a plus.
  • Knowledge of CI/CD and infrastructure-as-code tools (e.g., Terraform, Ansible, ArgoCD, Jenkins/GitHub Actions) is a plus.
  • Ability to work in a fast-paced, experimental environment where iteration and adaptation are key.
  • Strong problem-solving skills: you thrive when thrown into complex systems, can "figure things out as you go," and communicate clearly with both engineers and non-technical stakeholders.

