I’d like to share Slurm-web, an open source web interface to monitor Slurm HPC & AI clusters in real time.
It provides a read-only overview of job queues and node status, with metrics and advanced visualizations. It is useful for sysadmins, managers and end users alike, without needing to SSH into the cluster or run command-line tools.
Slurm is the world's leading workload manager and the de facto standard for HPC and AI clusters. Its command-line interface is powerful but complex to use. Many HPC centers deploy advanced stacks of exporters, Prometheus, InfluxDB and Grafana to build custom dashboards, and these stacks are complex and time-consuming to design and maintain.
Slurm-web offers clear and accessible views of the cluster, with minimal setup and overhead.
And it is released as free software under the GNU GPL v3!
Key features:
- Dashboard with interactive charts of resource and job status
- Instant job filtering and sorting
- Live job status updates with colored badges, readable at a glance
- GPU resource utilization monitoring
- Advanced visualization of node status with racking topology
- Intuitive visualization of QOS and advanced reservations
- Dark mode support
- Multi-cluster support
- LDAP authentication (including Active Directory support)
- Advanced RBAC permissions management
- Transparent caching
- Integration with Prometheus to collect and chart Slurm time-series metrics
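For readers wondering what "transparent caching" can mean in a setup like this: the idea is that repeated requests for the same data within a short window are answered from a cache instead of hitting the scheduler again, without callers having to know. The sketch below is a generic time-bounded (TTL) cache decorator in plain Python; it is an illustration of the technique, not Slurm-web's actual caching code, and the `ttl_cache` name and injectable `clock` parameter are hypothetical choices made for this example.

```python
import time
from functools import wraps
from typing import Any, Callable, Dict, Tuple


def ttl_cache(ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
    """Cache a function's results for ttl_seconds, keyed on its positional args.

    Hypothetical helper: `clock` is injectable so expiry can be tested
    without sleeping.
    """

    def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
        # Maps argument tuple -> (time stored, cached value).
        store: Dict[Tuple[Any, ...], Tuple[float, Any]] = {}

        @wraps(func)
        def wrapper(*args: Any) -> Any:
            now = clock()
            hit = store.get(args)
            if hit is not None and now - hit[0] < ttl_seconds:
                return hit[1]  # fresh entry: skip the backend call
            value = func(*args)  # missing or expired: query the backend
            store[args] = (now, value)
            return value

        return wrapper

    return decorator


# Usage with a hypothetical backend call: repeated dashboard refreshes within
# 10 seconds reuse the same answer instead of querying the scheduler again.
@ttl_cache(10)
def fetch_job_count(cluster: str) -> int:
    return 42  # stand-in for a real REST API request
```

A TTL keeps the data "live enough" for a monitoring view while bounding the load on the scheduler; the trade-off is that views can lag the real state by up to `ttl_seconds`.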
Architecture:
- Backend: Python (Flask), using the Slurm REST API to consume data from the scheduler
- Frontend: Vue.js and Tailwind CSS
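To give an idea of what consuming the Slurm REST API involves, here is a minimal standalone sketch using only the Python standard library, not Slurm-web's backend code. It assumes slurmrestd is reachable at a hypothetical `http://localhost:6820`, that JWT authentication is enabled (token passed in the `X-SLURM-USER-NAME` / `X-SLURM-USER-TOKEN` headers), and that the cluster serves API version `v0.0.40`; the exact version and response fields depend on your Slurm release. The `running_jobs` helper also illustrates a simple filter-and-sort over the returned jobs.

```python
import json
import urllib.request
from typing import Any, Dict, List

# Assumptions: hypothetical slurmrestd address and an API version your
# Slurm release may or may not provide; adjust both for a real cluster.
SLURMRESTD_URL = "http://localhost:6820"
API_VERSION = "v0.0.40"


def build_jobs_request(user: str, token: str) -> urllib.request.Request:
    """Prepare a GET request for the job list, authenticated with a Slurm JWT.

    Note: urllib normalizes header-name case (e.g. "X-slurm-user-name");
    HTTP header names are case-insensitive.
    """
    return urllib.request.Request(
        f"{SLURMRESTD_URL}/slurm/{API_VERSION}/jobs",
        headers={
            "X-SLURM-USER-NAME": user,
            "X-SLURM-USER-TOKEN": token,
            "Accept": "application/json",
        },
    )


def fetch_jobs(user: str, token: str) -> List[Dict[str, Any]]:
    """Send the request and return the "jobs" array of the JSON response."""
    with urllib.request.urlopen(build_jobs_request(user, token), timeout=10) as resp:
        return json.load(resp).get("jobs", [])


def running_jobs(jobs: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep RUNNING jobs, sorted by job ID.

    Assumption: "job_state" is a list of strings in recent API versions and a
    plain string in older ones, so both shapes are accepted here.
    """

    def states(job: Dict[str, Any]) -> List[str]:
        s = job.get("job_state", [])
        return [s] if isinstance(s, str) else list(s)

    return sorted(
        (job for job in jobs if "RUNNING" in states(job)),
        key=lambda job: job["job_id"],
    )
```

On a real cluster, a token for the current user can typically be obtained with `scontrol token`, after which `running_jobs(fetch_jobs(user, token))` would list that cluster's running jobs.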
It is easy to deploy with native deb and RPM packages for the most common Linux distributions. The roadmap is full of upcoming features such as reporting and job submission. In the meantime, I would be more than happy to get your feedback on this software!