Site Reliability Engineer-AI Cloud
1 day ago
Job Req ID: 26896
About Supermicro:
Supermicro is a Top Tier provider of advanced server, storage, and networking solutions for Data Center, Cloud Computing, Enterprise IT, Hadoop/Big Data, Hyperscale, HPC, and IoT/Embedded customers worldwide. We are the #5 fastest growing company among the Silicon Valley Top 50 technology firms. Our unprecedented global expansion has provided us with the opportunity to offer a large number of new positions to the technology community. We seek talented, passionate, and committed engineers, technologists, and business leaders to join us.
As a Cloud Reliability Engineer for our Linux-based AI cloud platforms, you will help us deploy and scale GPU-accelerated compute clusters, Kubernetes workloads, and supporting storage/network infrastructure while ensuring their high availability, performance, and security. You'll bridge Dev and Ops by automating infrastructure deployment, enhancing observability, and applying SRE best practices to support reliable AI and MLOps environments.
Essential Duties and Responsibilities:
Cloud Infra Automation:
Design and deploy infrastructure on bare metal or cloud using Terraform, Ansible, or Helm. Automate workflows with Python or Go.
Platform Reliability:
Maintain and scale GPU clusters, Kubernetes, and AI-optimized storage (Ceph, BeeGFS, Weka) to ensure stability and performance.
Monitoring & Alerting:
Use Prometheus, Grafana, ELK, etc., to monitor system health and trigger alerts on anomalies.
Capacity Planning:
Analyze usage patterns and forecast infrastructure needs for AI workloads.
Incident Management:
Lead root cause analysis and manage SLOs/SLIs/SLAs to maintain high availability.
CI/CD Integration:
Work with DevOps/MLOps teams on CI/CD pipelines using GitLab, ArgoCD, or similar tools.
Security & Compliance:
Secure Linux systems, manage certificates, and enforce access controls (RBAC, LDAP SSO, TLS, segmentation).
Documentation & Playbooks:
Maintain architecture diagrams, runbooks, and incident playbooks to support knowledge sharing and onboarding.
Qualifications:
- Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent experience; 3-7 years of experience in the areas below is preferred.
- Proficiency in Linux (Ubuntu, RHEL/CentOS), containers (Docker, Podman), and orchestration (Kubernetes).
- Experience managing GPU compute clusters (NVIDIA/CUDA, AMD/ROCm).
- Hands-on experience with observability tools (Prometheus, Grafana, Loki, ELK, etc.).
- Strong scripting and coding skills (Bash, Python, or Go).
- Exposure to secure multi-tenant environments and zero trust architectures.
- Familiarity with network protocols, DNS, DHCP, BGP, RoCEv2, and InfiniBand or high-throughput Ethernet fabrics.
- Excellent collaboration and communication skills for cross-team, partner, and customer initiatives.