Closed-Loop GPU Cooling
Feed-forward + PID daemon driving an Arctic P9 Max through an NZXT HID controller — because the BIOS wouldn't.
- GPU: AMD V620 · 215 W
- Setpoint: 75 °C
- Slew limit: 15 %/s
- Override: 95 °C E-stop
Premise
The Gigabyte motherboard hosting this GPU cannot drive a fan in closed-loop mode against a GPU temperature sensor: the BIOS only exposes CPU- and system-level sensors to the PWM headers. That leaves three options: the GPU runs uncontrolled on its stock blower, the case fans roar at full duty cycle whenever the GPU touches its thermal cap, or something else takes over the loop.
This is that something else. A user-space daemon reads junction and memory temperatures from sysfs, computes a duty target with a feed-forward + PID controller, and writes PWM to an Arctic P9 Max through an NZXT USB-HID controller. The motherboard PWM header sits unused — that's the only architecture that closed the loop on this specific hardware combination.
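As a rough illustration of the sensor-read step, the sketch below parses one hwmon directory into a label-to-temperature map. It assumes the amdgpu driver's usual layout, where each `tempN_input` file holds millidegrees Celsius and a matching `tempN_label` file names the sensor (for example `junction` or `mem`). The function name and the directory path in the usage note are hypothetical, not taken from the daemon.

```python
from pathlib import Path


def read_hwmon_temps(hwmon_dir):
    """Return {label: degrees Celsius} for each labelled sensor in one hwmon dir.

    Hypothetical helper: the real daemon may locate and read sensors differently.
    """
    temps = {}
    for label_file in sorted(Path(hwmon_dir).glob("temp*_label")):
        label = label_file.read_text().strip()
        input_file = label_file.with_name(
            label_file.name.replace("_label", "_input")
        )
        if input_file.exists():
            # hwmon reports temperatures as integer millidegrees Celsius.
            temps[label] = int(input_file.read_text().strip()) / 1000.0
    return temps
```

On a real system the directory would be something like `/sys/class/drm/card0/device/hwmon/hwmon2` (the card and hwmon indices vary per machine), and the controller would read `temps["junction"]` and `temps["mem"]` from the result.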
Today's run
- Peak temp (junc): 106 °C
- Avg temp (junc): 54.5 °C
- Peak power: 252 W
- Avg duty: 49%
Architecture
The control law is intentionally simple: feed-forward on GPU power (load → expected duty), corrected by PID on the temperature error against a 75 °C setpoint. A 95 °C hardware override kicks the fan to 100% regardless of the controller state. The duty signal is rate-limited to 15 %/s so the fan doesn't audibly chase noise.
- GPU: AMD V620 · 215 W TGP
- Fan: Arctic P9 Max
- Controller: NZXT RGB & Fan via liquidctl
- Setpoint: 75 °C junction
- Override: 95 °C E-stop
- Slew limit: 15 %/s
- Poll interval: 2 s
- Telemetry retention: 30 days · SQLite
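The control law described above can be sketched in a few lines. This is a minimal illustration, not the daemon's actual code: the setpoint, override, slew limit, and poll interval come from the list above, while the feed-forward slope, PID gains, minimum duty, and integral clamp are hypothetical placeholders. It also assumes the override bypasses the slew limiter, which the text implies but does not state outright.

```python
from dataclasses import dataclass


def clamp(value, low, high):
    return max(low, min(high, value))


@dataclass
class FanController:
    setpoint_c: float = 75.0       # junction setpoint
    override_c: float = 95.0       # E-stop threshold
    slew_pct_per_s: float = 15.0   # maximum duty change per second
    dt_s: float = 2.0              # poll interval
    # Everything below is a hypothetical placeholder, not the real tuning.
    ff_pct_per_w: float = 0.25     # feed-forward: duty percent per watt
    kp: float = 2.0
    ki: float = 0.1
    kd: float = 0.5
    min_duty: float = 20.0
    integral_limit: float = 200.0  # simple anti-windup clamp
    integral: float = 0.0
    prev_error: float = 0.0
    duty: float = 20.0

    def step(self, temp_c, power_w):
        """Return the next duty (percent) for one poll of temperature and power."""
        if temp_c >= self.override_c:
            # Assumed behaviour: the override skips the slew limiter entirely.
            self.duty = 100.0
            return self.duty

        error = temp_c - self.setpoint_c  # positive when too hot
        self.integral = clamp(
            self.integral + error * self.dt_s,
            -self.integral_limit,
            self.integral_limit,
        )
        derivative = (error - self.prev_error) / self.dt_s
        self.prev_error = error

        target = (
            self.ff_pct_per_w * power_w
            + self.kp * error
            + self.ki * self.integral
            + self.kd * derivative
        )
        target = clamp(target, self.min_duty, 100.0)

        # Rate-limit the change so the fan does not chase sensor noise.
        max_step = self.slew_pct_per_s * self.dt_s
        self.duty += clamp(target - self.duty, -max_step, max_step)
        return self.duty
```

With a 2 s poll and a 15 %/s limit, the duty can move at most 30 percentage points per tick, so a jump from idle to full load takes a few polls to reach its target unless the override fires first.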
The daemon runs as a systemd service with safe_stop.sh as its ExecStopPost hook, so the fan is always left at a safe duty, even after a crash. Metrics persist to a local SQLite database; the dashboard above pulls a daily snapshot of the last 24 h, lagged 30 minutes for privacy.
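The telemetry path can be sketched as follows: each sample is written to SQLite, rows older than the 30-day retention window are pruned, and the dashboard snapshot selects a 24 h window that ends 30 minutes before the present. The table name, column names, and function names are hypothetical; only the three time constants come from the description above.

```python
import sqlite3

RETENTION_S = 30 * 24 * 3600  # 30-day telemetry retention
WINDOW_S = 24 * 3600          # dashboard shows the last 24 h
LAG_S = 30 * 60               # snapshot is lagged by 30 minutes


def open_db(path=":memory:"):
    """Open the telemetry database and create the (hypothetical) schema."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS samples ("
        "ts REAL PRIMARY KEY, junction_c REAL, power_w REAL, duty_pct REAL)"
    )
    return db


def record(db, ts, junction_c, power_w, duty_pct):
    """Store one sample, then drop anything older than the retention window."""
    db.execute(
        "INSERT OR REPLACE INTO samples VALUES (?, ?, ?, ?)",
        (ts, junction_c, power_w, duty_pct),
    )
    db.execute("DELETE FROM samples WHERE ts < ?", (ts - RETENTION_S,))
    db.commit()


def snapshot(db, now):
    """Return samples from the 24 h window that ends 30 minutes before now."""
    end = now - LAG_S
    start = end - WINDOW_S
    return db.execute(
        "SELECT ts, junction_c, power_w, duty_pct FROM samples "
        "WHERE ts > ? AND ts <= ? ORDER BY ts",
        (start, end),
    ).fetchall()
```

Pruning on every insert keeps the sketch short; a real daemon polling every 2 s might prefer to prune on a timer instead.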
Current state
Production. Daily-driver. Has kept this GPU alive through every multi-hour training run on the self-improvement loop without a thermal trip below the 95 °C override. The dashed green line in the chart is the 75 °C setpoint; the amber dashed line is the 95 °C override. The temperature trace should sit near the green line under sustained load and well below the amber line at all times.
The repo includes the daemon, a small Chart.js dashboard for local monitoring, a smoke-test script that exercises the full control path without spinning up the systemd service, and the physical-swap documentation for taking the fan + controller in and out of the build.
What's next
The codebase is portable in spirit but currently hard-coded to this particular GPU + motherboard + fan controller triple. The natural next step is to generalize it: a config-driven hardware abstraction layer, a more general PID + FF tuning tool, and a published GitHub release for anyone else fighting a BIOS that won't close the loop where they need it to.
Related work