Insights: GoogleCloudPlatform/cluster-toolkit
Overview
4 Releases published by 3 people
-
v1.51.0 Release v1.51.0
published
May 13, 2025 -
v1.51.1 Address bug in updated NVIDIA package causing Slurm job failures
published
May 20, 2025 -
v1.52.0 Release v1.52.0
published
May 22, 2025 -
v1.53.0 Release v1.53.0
published
Jun 5, 2025
116 Pull requests merged by 28 people
-
Allow to run Slurm tests on already deployed cluster
#4236 merged
Jun 6, 2025 -
Update Toolkit release to v1.54.0
#4250 merged
Jun 6, 2025 -
Adjust OFI tunables and disable rndv write mode
#4249 merged
Jun 6, 2025 -
Support disabling Single Node Health Checks using --task-prolog switc…
#4238 merged
Jun 6, 2025 -
Revert changes to fix test GKE kueue integration test
#4246 merged
Jun 6, 2025 -
add integration tests for GKE A3U DWS flex-start and queued-provisioning
#4245 merged
Jun 6, 2025 -
update terraform-provider version to v6.38.0
#4244 merged
Jun 5, 2025 -
Merge V1.53.0 into Develop
#4243 merged
Jun 5, 2025 -
Release candidate v1.53.0
#4221 merged
Jun 5, 2025 -
Made a separate file for kueue config integration test
#4241 merged
Jun 5, 2025 -
Kueue config application verification Integration tests
#4237 merged
Jun 5, 2025 -
Move Slurm H4D Blueprint and remove firewall rules for IRDMA VPCs
#4240 merged
Jun 5, 2025 -
Remove Docker configuration warning
#4229 merged
Jun 3, 2025 -
Update to datacenter-gpu-manager-4 package in A-series blueprints
#4228 merged
Jun 3, 2025 -
speed up deployment of GKE clusters
#4215 merged
Jun 3, 2025 -
Implement A3 High network blocking script as startup-script feature
#4233 merged
Jun 2, 2025 -
Add Managed Lustre support for non-default ports (GKE compatibility)
#4210 merged
Jun 2, 2025 -
Bump cryptography from 44.0.2 to 45.0.3 in /community/front-end/ofe
#4225 merged
Jun 2, 2025 -
Bump crispy-bootstrap5 from 2024.10 to 2025.4 in /community/front-end/ofe
#4224 merged
Jun 2, 2025 -
Bump the go-minor-and-patch-updates group with 2 updates
#4227 merged
Jun 2, 2025 -
Make controller to self-report IP address
#4212 merged
May 30, 2025 -
Add pairwise constraints to avoid scheduling some tests too close
#4219 merged
May 30, 2025 -
Update Toolkit release to v1.53.0
#4220 merged
May 30, 2025 -
updating a3m lssd slurm blueprint
#4218 merged
May 30, 2025 -
refactor a3m slurm ubuntu solutions into 1 blueprint
#4216 merged
May 30, 2025 -
Avoid "drop user" noise in setup.log
#4213 merged
May 30, 2025 -
Cleanup GCS Fuse configurations and add required permissions for fio-job-template
#4214 merged
May 30, 2025 -
add persistence to A3 Mega
#4203 merged
May 29, 2025 -
Fixing the missing comma between the mount_options config for gcs A3U and A4
#4207 merged
May 29, 2025 -
Replace extended_reservation variable with reservation for TPU
#4197 merged
May 28, 2025 -
add jobset for a4x
#4206 merged
May 28, 2025 -
Added GKE a4x blueprint and related configs
#4199 merged
May 28, 2025 -
Add authorized_cidr to deployment vars in gke integration tests
#4205 merged
May 28, 2025 -
Fixed the parser error in test-gke-a2-highgpu-kueue
#4204 merged
May 28, 2025 -
Updating the checkpoint PV for A3U and A4 to the recommended mount options
#4180 merged
May 27, 2025 -
Improve API consumption of `_handle_bulk_insert_op`
#4091 merged
May 27, 2025 -
Several minor fixes + missing file added to AF3 solution
#4202 merged
May 27, 2025 -
Re-order runners in HCLS blueprint
#4186 merged
May 27, 2025 -
Adding deprecation warnings to DDN-Exascaler module (and references)
#4189 merged
May 27, 2025 -
add a4H flex test
#4118 merged
May 27, 2025 -
Parallelstore switch region and update network name
#4131 merged
May 27, 2025 -
Added Verbose Flags in Integration Tests
#4196 merged
May 27, 2025 -
Deprecate imex, gpu driver and dra driver manifests
#4198 merged
May 27, 2025 -
Remove redundant http provider from kubectl apply
#4190 merged
May 24, 2025 -
Use terraform regex restriction on login name for python tests
#4188 merged
May 24, 2025 -
Add ubuntu 24.04 ARM64 check for ansible-vm test
#4183 merged
May 23, 2025 -
Crd/ubuntu
#4184 merged
May 23, 2025 -
Add recurse to condor spool directory
#4178 merged
May 23, 2025 -
Update variable names in the GKE A3 Ultra example to match the cloud doc
#4185 merged
May 23, 2025 -
A3M Slurm Solution- deprecate debian slurm image and add ubuntu slurm image
#4170 merged
May 22, 2025 -
Merge V1.52.0 into Develop
#4182 merged
May 22, 2025 -
Release candidate v1.52.0
#4160 merged
May 22, 2025 -
Update hpc-gke example and readme
#4179 merged
May 22, 2025 -
A3 Mega Slurm: fix package URL after Ansible upgrade in #4167
#4176 merged
May 22, 2025 -
Update readme for gke managed hyperdisk example
#4166 merged
May 22, 2025 -
Update kubernetes provider version to >=2.36 in gke-cluster module
#4175 merged
May 22, 2025 -
Update condition for workload policy
#4172 merged
May 22, 2025 -
Update documentation regarding deploying parallelstore blueprints using gcluster
#4174 merged
May 21, 2025 -
Drop PBS Pro modules and tests
#4165 merged
May 21, 2025 -
HCLS solution: Fix mistaken variable name in README
#4173 merged
May 21, 2025 -
Improvements to `python-integration-tests`
#4169 merged
May 21, 2025 -
Drop Ubuntu 20.04 support for Ansible installation
#4167 merged
May 21, 2025 -
Updated pfs-lustre deprecation message
#4171 merged
May 21, 2025 -
Merge v1.51.1 hotfix release into develop
#4154 merged
May 21, 2025 -
Update cluster-tool-kit-writers-json for shubpal07
#4149 merged
May 21, 2025 -
Update readme for gke managed parallelstore example
#4163 merged
May 21, 2025 -
Drop CentOS 7 from integration test for DDN/WhamCloud Lustre implementation
#4164 merged
May 21, 2025 -
Update the link for gke a3 mega example on readme
#4161 merged
May 21, 2025 -
Updating variable documentation in blueprints
#4162 merged
May 21, 2025 -
Allow deploying cluster without live reservation
#4057 merged
May 21, 2025 -
Auto execute basic Slurm integration test
#4157 merged
May 20, 2025 -
Fix invalid import
#4156 merged
May 20, 2025 -
Add kueue dependency for nvidia dra driver
#4147 merged
May 20, 2025 -
Merge v1.51.1 hotfix release into v1.52.0 candidate release branch
#4155 merged
May 20, 2025 -
Block broken release of nvidia-container-toolkit
#4152 merged
May 20, 2025 -
Update cluster-toolkit-writers.json
#4148 merged
May 20, 2025 -
Improve Slurm GPU testing
#4146 merged
May 20, 2025 -
Block broken release of nvidia-container-toolkit
#4145 merged
May 20, 2025 -
Do not fail `setup_controller` if `sync_instances` fails
#4141 merged
May 19, 2025 -
Use controller address instead of hostname when using network-attachment
#4143 merged
May 19, 2025 -
Add Ubuntu 24.04 Ansible installation and test coverage
#4140 merged
May 19, 2025 -
Upgrade Ansible to maximum allowed version on oldest supported OS distributions
#4139 merged
May 19, 2025 -
Drop CentOS 7 support for Ansible installation
#4138 merged
May 19, 2025 -
Add Debian 12 test coverage to Ansible installation
#4137 merged
May 19, 2025 -
deprecate old gke a3 mega blueprint
#4134 merged
May 19, 2025 -
Adding github username to toolkit writers list
#4142 merged
May 19, 2025 -
Update Toolkit release to v1.52.0
#4133 merged
May 19, 2025 -
New way to timeout on dpkg during `install_ansible`
#4136 merged
May 17, 2025 -
CloudSQL improvements: database flags, query insights, bump default version
#4115 merged
May 16, 2025 -
Fix script destination path concat in setup.py
#4124 merged
May 15, 2025 -
Updating Slurm dev key logic to use static file instead of env var
#4123 merged
May 15, 2025 -
Fix missing region
#4113 merged
May 15, 2025 -
Fixes to nodeset dynamic
#4125 merged
May 15, 2025 -
Move core logic of helm install into independent helm module
#4127 merged
May 15, 2025 -
Remove pre-set deployment names from GKE example blueprints
#4122 merged
May 14, 2025 -
remove gvnic-1 from pods manifests
#4088 merged
May 14, 2025 -
Fix documentation related to lustre blueprint friction
#4077 merged
May 13, 2025 -
Make delete instances status tracking "asynchronous"
#4031 merged
May 13, 2025 -
Small adjustment to README for modules to include managed-lustre
#4117 merged
May 13, 2025 -
update terraform-provider version to v6.34.1
#4116 merged
May 13, 2025 -
Merge V1.51.0 into Develop
#4112 merged
May 13, 2025 -
Add support for multiple task prolog and epilog scripts. Closes #4100
#4105 merged
May 12, 2025 -
Relax `retry_exception` test
#4065 merged
May 12, 2025 -
Fix multiple bugs in notifying jobs during failed resume
#4107 merged
May 12, 2025 -
Release candidate: v1.51.0
#4104 merged
May 12, 2025 -
Set resumeTimeout to max for all flex partitions
#4108 merged
May 12, 2025 -
Improve logging details of `subprocess.run`
#4106 merged
May 9, 2025 -
Get rid of template_cache
#4040 merged
May 9, 2025 -
Update Toolkit release to v1.51.0
#4101 merged
May 9, 2025 -
Update default Accelerator Image in A3 Ultra and A4 Slurm blueprints
#4089 merged
May 9, 2025 -
Better integrate, optimize, and control the gIB NCCL RDMA plugin installer
#4069 merged
May 9, 2025 -
Add developer key option for slurm
#4094 merged
May 9, 2025 -
Add NCCL tests to the DWS flex start examples
#4095 merged
May 9, 2025 -
Update Open Front End Django dependency and fix OFE integration test
#4092 merged
May 9, 2025 -
Fix gke a3 mega integration test
#4093 merged
May 9, 2025 -
Remove DWS Calendar example
#4090 merged
May 9, 2025
19 Pull requests opened by 15 people
-
Support for TPUv5 and v6e
#4109 opened
May 12, 2025 -
Improve handling of unexpected values for `upcomingMaintenance`
#4119 opened
May 13, 2025 -
Fix documentation links in the examples README.md
#4121 opened
May 14, 2025 -
Wait for startup
#4126 opened
May 15, 2025 -
Enable hpc toolkit for google cloud netapp volume
#4130 opened
May 16, 2025 -
Remove deprecated DDN-EXAScaler module
#4132 opened
May 16, 2025 -
Initial test of using gpu_topology
#4150 opened
May 20, 2025 -
Test PR, to see if this works
#4158 opened
May 20, 2025 -
Update vm-images supported images
#4168 opened
May 21, 2025 -
Slurm: fix permissions for compute_sa, disable public ips
#4193 opened
May 26, 2025 -
Improve Slurm tests
#4201 opened
May 27, 2025 -
feat: add skip_reservation_validation option to schedmd-slurm-gcp-v6-nodeset
#4211 opened
May 29, 2025 -
OFE: Local Development Environment | Startup-Script: Docker Install script fix
#4217 opened
May 30, 2025 -
AlphaFold 3 High Throughput Solution (af3-slurm)
#4231 opened
Jun 2, 2025 -
Slurm tests improve SSH
#4239 opened
Jun 3, 2025 -
kubernetes provider module implementation
#4247 opened
Jun 6, 2025 -
Bump django from 5.1.9 to 5.1.10 in /community/front-end/ofe
#4248 opened
Jun 6, 2025 -
Release candidate: v1.54.0
#4251 opened
Jun 6, 2025 -
Different Kueue Config Integration Tests for different machine types
#4252 opened
Jun 8, 2025
9 Issues closed by 7 people
-
CRD module improvements
#4235 closed
Jun 5, 2025 -
Is there value in naming the partitions "inference" and "datapipeline"?
#4151 closed
Jun 4, 2025 -
PermissionError: [Errno 13] Permission denied: b'/slurm/scripts/template_info.cache.dat'
#3946 closed
Jun 2, 2025 -
Duplicate
#4187 closed
May 23, 2025 -
Add main task prolog and epilog to support multiple scripts
#4100 closed
May 22, 2025 -
GPU-enabled enroot containers fail after updating to latest nvidia-container-toolkit
#4144 closed
May 20, 2025 -
Can't login into machines created by image-builder.yaml
#3792 closed
May 17, 2025 -
Generalize extra disk feature
#3936 closed
May 14, 2025 -
Missing prolog and epilog scripts on compute nodes
#4070 closed
May 9, 2025
1 Issue opened by 1 person
-
a3-ultragpu-8g NCCL Tests Failing
#4208 opened
May 29, 2025
6 Unresolved conversations
Sometimes conversations happen on old items that aren’t yet closed. Here is a list of all the Issues and Pull Requests with unresolved conversations.
-
Intermittent node setup failure
#3782 commented on
May 16, 2025 • 0 new comments -
job_db_uuid fails to insert into BigQuery
#3944 commented on
Jun 5, 2025 • 0 new comments -
New DWS Flex breaks terraform state
#4079 commented on
Jun 6, 2025 • 0 new comments -
Add workload policy to mig
#3791 commented on
May 13, 2025 • 0 new comments -
Quantum Simulation, write up of Sam Skillman's work.
#3893 commented on
May 20, 2025 • 0 new comments -
feat: AF3 notebook launcher example and unified memory option
#4018 commented on
Jun 3, 2025 • 0 new comments