
Site Reliability Engineer-AI Cloud
2 weeks ago
Job Req ID: 26896
About Supermicro
Supermicro is a Top Tier provider of advanced server, storage, and networking solutions for Data Center, Cloud Computing, Enterprise IT, Hadoop/Big Data, Hyperscale, HPC, and IoT/Embedded customers worldwide. We are the #5 fastest growing company among the Silicon Valley Top 50 technology firms. Our unprecedented global expansion has provided us with the opportunity to offer a large number of new positions to the technology community. We seek talented, passionate, and committed engineers, technologists, and business leaders to join us.
Job Summary
As a Cloud Reliability Engineer for our Linux-based AI cloud platforms, you will help us deploy and scale GPU-accelerated compute clusters, Kubernetes workloads, and supporting storage/network infrastructure while ensuring high availability, performance, and security. You'll bridge Dev and Ops by automating infrastructure deployment, enhancing observability, and applying SRE best practices to support reliable AI and MLOps environments.
Essential Duties And Responsibilities
Cloud Infra Automation
Design and deploy infrastructure on bare metal or cloud using Terraform, Ansible, or Helm. Automate workflows with Python or Go.
Platform Reliability
Maintain and scale GPU clusters, Kubernetes, and AI-optimized storage (Ceph, BeeGFS, Weka) to ensure stability and performance.
Monitoring & Alerting
Use Prometheus, Grafana, ELK, etc., to monitor system health and trigger alerts on anomalies.
Capacity Planning
Analyze usage patterns and forecast infrastructure needs for AI workloads.
Incident Management
Lead root cause analysis and manage SLOs/SLIs/SLAs to maintain high availability.
CI/CD Integration
Work with DevOps/MLOps teams on CI/CD pipelines using GitLab, ArgoCD, or similar tools.
Security & Compliance
Secure Linux systems, manage certificates, and enforce access controls (RBAC, LDAP SSO, TLS, segmentation).
Documentation & Playbooks
Maintain architecture diagrams, runbooks, and incident playbooks to support knowledge sharing and onboarding.
Qualifications
- Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent experience; 3-7 years of experience in the areas below is preferred.
- Proficiency in Linux (Ubuntu, RHEL/CentOS), containers (Docker, Podman), and orchestration (Kubernetes).
- Experience managing GPU compute clusters (NVIDIA/CUDA, AMD/ROCm).
- Hands-on experience with observability tools (Prometheus, Grafana, Loki, ELK, etc.).
- Strong scripting and coding skills (Bash, Python, or Go).
- Exposure to secure multi-tenant environments and zero trust architectures.
- Familiarity with network protocols, DNS, DHCP, BGP, RoCEv2, and InfiniBand or high-throughput Ethernet fabrics.
- Excellent collaboration and communication skills for cross-team, partner, and customer initiatives.