Site Reliability Engineer - AI Cloud

2 weeks ago


Bade District, Taoyuan City, Taiwan | Supermicro | Full time | NT$280,000 - NT$1,440,000 per year

Job Req ID: 26896

About Supermicro
Supermicro is a Top Tier provider of advanced server, storage, and networking solutions for Data Center, Cloud Computing, Enterprise IT, Hadoop/Big Data, Hyperscale, HPC, and IoT/Embedded customers worldwide. We are the #5 fastest growing company among the Silicon Valley Top 50 technology firms. Our unprecedented global expansion has allowed us to offer a large number of new positions to the technology community. We seek talented, passionate, and committed engineers, technologists, and business leaders to join us.

Job Summary
As a Cloud Reliability Engineer for our Linux-based AI cloud platforms, you will help us deploy and scale those platforms while ensuring high availability, performance, and security across GPU-accelerated compute clusters, Kubernetes workloads, and supporting storage/network infrastructure. You'll bridge Dev and Ops by automating infrastructure deployment, enhancing observability, and applying SRE best practices to support reliable AI and MLOps environments.

Essential Duties And Responsibilities
Cloud Infra Automation
Design and deploy infrastructure on bare metal or cloud using Terraform, Ansible, or Helm. Automate workflows with Python or Go.

Platform Reliability
Maintain and scale GPU clusters, Kubernetes, and AI-optimized storage (Ceph, BeeGFS, Weka) to ensure stability and performance.

Monitoring & Alerting
Use Prometheus, Grafana, ELK, etc., to monitor system health and trigger alerts on anomalies.

Capacity Planning
Analyze usage patterns and forecast infrastructure needs for AI workloads.

Incident Management
Lead root cause analysis and manage SLOs/SLIs/SLAs to maintain high availability.

CI/CD Integration
Work with DevOps/MLOps teams on CI/CD pipelines using GitLab, ArgoCD, or similar tools.

Security & Compliance
Secure Linux systems, manage certificates, and enforce access controls (RBAC, LDAP SSO, TLS, segmentation).

Documentation & Playbooks
Maintain architecture diagrams, runbooks, and incident playbooks to support knowledge sharing and onboarding.

Qualifications

  • Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent experience; 3-7 years of experience in the areas below is preferred.
  • Proficiency in Linux (Ubuntu, RHEL/CentOS), containers (Docker, Podman), and orchestration (Kubernetes).
  • Experience managing GPU compute clusters (NVIDIA/CUDA, AMD/ROCm).
  • Hands-on experience with observability tools (Prometheus, Grafana, Loki, ELK, etc.).
  • Strong scripting and coding skills (Bash, Python, or Go).
  • Exposure to secure multi-tenant environments and zero trust architectures.
  • Familiarity with network protocols, DNS, DHCP, BGP, RoCEv2, and InfiniBand or high-throughput Ethernet fabrics.
  • Excellent collaboration and communication skills for cross-team, partner, and customer initiatives.

