Site Reliability Engineer-AI Cloud
1 week ago
Job Req ID: 26896
About Supermicro
Supermicro is a Top Tier provider of advanced server, storage, and networking solutions for Data Center, Cloud Computing, Enterprise IT, Hadoop/Big Data, Hyperscale, HPC, and IoT/Embedded customers worldwide. We are the #5 fastest growing company among the Silicon Valley Top 50 technology firms. Our unprecedented global expansion has provided us with the opportunity to offer a large number of new positions to the technology community. We seek talented, passionate, and committed engineers, technologists, and business leaders to join us.
Job Summary
As a Cloud Reliability Engineer for our Linux-based AI cloud platforms, you will help us deploy and scale these platforms and ensure high availability, performance, and security across GPU-accelerated compute clusters, Kubernetes workloads, and supporting storage/network infrastructure. You'll bridge Dev and Ops by automating infrastructure deployment, enhancing observability, and applying SRE best practices to support reliable AI and MLOps environments.
Essential Duties And Responsibilities
Cloud Infra Automation
Design and deploy infrastructure on bare metal or cloud using Terraform, Ansible, or Helm. Automate workflows with Python or Go.
Platform Reliability
Maintain and scale GPU clusters, Kubernetes, and AI-optimized storage (Ceph, BeeGFS, Weka) to ensure stability and performance.
Monitoring & Alerting
Use Prometheus, Grafana, ELK, etc., to monitor system health and trigger alerts on anomalies.
Capacity Planning
Analyze usage patterns and forecast infrastructure needs for AI workloads.
Incident Management
Lead root cause analysis and manage SLOs/SLIs/SLAs to maintain high availability.
CI/CD Integration
Work with DevOps/MLOps teams on CI/CD pipelines using GitLab, ArgoCD, or similar tools.
Security & Compliance
Secure Linux systems, manage certificates, and enforce access controls (RBAC, LDAP SSO, TLS, segmentation).
Documentation & Playbooks
Maintain architecture diagrams, runbooks, and incident playbooks to support knowledge sharing and onboarding.
Qualifications
- Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent experience; 3-7 years of experience in the areas below is preferred.
- Proficiency in Linux (Ubuntu, RHEL/CentOS), containers (Docker, Podman), and orchestration (Kubernetes).
- Experience managing GPU compute clusters (NVIDIA/CUDA, AMD/ROCm).
- Hands-on experience with observability tools (Prometheus, Grafana, Loki, ELK, etc.).
- Strong scripting and coding skills (Bash, Python, or Go).
- Exposure to secure multi-tenant environments and zero trust architectures.
- Familiarity with network protocols, DNS, DHCP, BGP, RoCEv2, and InfiniBand or high-throughput Ethernet fabrics.
- Excellent collaboration and communication skills for cross-team, partner, and customer initiatives.