Site Reliability Engineer-AI Cloud

9 hours ago


Bade District, Taoyuan City, Taiwan Supermicro Full time NT$600,000 - NT$1,200,000 per year

Job Req ID: 26896

About Supermicro
Supermicro is a Top Tier provider of advanced server, storage, and networking solutions for Data Center, Cloud Computing, Enterprise IT, Hadoop/Big Data, Hyperscale, HPC and IoT/Embedded customers worldwide. We are the #5 fastest growing company among the Silicon Valley Top 50 technology firms. Our unprecedented global expansion has provided us with the opportunity to offer a large number of new positions to the technology community. We seek talented, passionate, and committed engineers, technologists, and business leaders to join us.

Job Summary
As a Cloud Reliability Engineer for our Linux-based AI cloud platforms, you will help us deploy and scale GPU-accelerated compute clusters, Kubernetes workloads, and supporting storage/network infrastructure while ensuring their high availability, performance, and security. You'll bridge Dev and Ops by automating infrastructure deployment, enhancing observability, and applying SRE best practices to support reliable AI and MLOps environments.

Essential Duties And Responsibilities
Cloud Infrastructure Automation
Design and deploy infrastructure on bare metal or cloud using Terraform, Ansible, or Helm. Automate workflows with Python or Go.

Platform Reliability
Maintain and scale GPU clusters, Kubernetes, and AI-optimized storage (Ceph, BeeGFS, Weka) to ensure stability and performance.

Monitoring & Alerting
Use Prometheus, Grafana, ELK, etc., to monitor system health and trigger alerts on anomalies.

Capacity Planning
Analyze usage patterns and forecast infrastructure needs for AI workloads.

Incident Management
Lead root cause analysis and manage SLOs/SLIs/SLAs to maintain high availability.

CI/CD Integration
Work with DevOps/MLOps teams on CI/CD pipelines using GitLab, ArgoCD, or similar tools.

Security & Compliance
Secure Linux systems, manage certificates, and enforce access controls (RBAC, LDAP SSO, TLS, segmentation).

Documentation & Playbooks
Maintain architecture diagrams, runbooks, and incident playbooks to support knowledge sharing and onboarding.

Qualifications

  • Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent experience; 3-7 years of experience in the areas below is preferred.
  • Proficiency in Linux (Ubuntu, RHEL/CentOS), containers (Docker, Podman), and orchestration (Kubernetes).
  • Experience managing GPU compute clusters (NVIDIA/CUDA, AMD/ROCm).
  • Hands-on experience with observability tools (Prometheus, Grafana, Loki, ELK, etc.).
  • Strong scripting and coding skills (Bash, Python, or Go).
  • Exposure to secure multi-tenant environments and zero trust architectures.
  • Familiarity with network protocols, DNS, DHCP, BGP, RoCEv2, and InfiniBand or high-throughput Ethernet fabrics.
  • Excellent collaboration and communication skills for cross-team, partner, and customer initiatives.

