The call

It was 3:12 AM on a Saturday when my phone started buzzing. Not a gentle notification. A call. From a farmer named Henk who, under normal circumstances, would never voluntarily speak to a software person.

"Your computer is cooking my tomatoes."

That sentence will wake you up faster than any amount of coffee. I pulled up the remote dashboard on my phone and saw the greenhouse temperature graph doing something I'd never seen before. It was a perfectly smooth ramp from 22 degrees C straight to 48 degrees C. The heaters were running at 100%. The ventilation was off. The humidity control had given up entirely.

The control system had been running fine for three weeks. Nothing had changed in the firmware. Nothing had changed in the hardware. And yet, at some point around midnight, the climate controller had decided these tomatoes needed a tropical vacation.

The investigation

First thing Saturday morning, I SSH'd into the gateway and started pulling logs. The FreeRTOS system had three main tasks: sensor reads at priority 3, the PID control loop at priority 4, and CAN TX for the actuator commands at priority 5. All three were still running. No watchdog resets. No task stack overflows. The system was perfectly healthy, except for the part where it was trying to cook produce.

The PID controller output was pinned at maximum. That much was obvious from the heater state. But the setpoint was correct at 21 degrees C, and the temperature reading was... also correct. The sensor was faithfully reporting 47.8 degrees C. So the controller should have been driving the heaters down, not up.

I pulled the PID internals from the diagnostic registers. Proportional term: negative, as expected. Derivative term: near zero, reasonable for a slow temperature rise. Integral term: 4.2 billion.

That's not a typo. The integral accumulator had wound up to a value so large that the proportional and derivative terms were rounding errors by comparison.

The 3 AM visit

I drove to the greenhouse. You could feel the heat from the parking lot. Inside, it was like walking into a steam room. The tomato plants looked personally offended.

I connected a debugger to the control board and started dumping the integral accumulator history. Using xTaskGetTickCount timestamps, I could trace when the windup started. It wasn't gradual. At 23:47:12, the integral term was a reasonable 34.7. At 23:47:13, it was NaN. At 23:47:14, it was still NaN. And it stayed NaN for hours.

But wait. If the integral term was NaN, how did the PID output become a very real, very large positive number? That's when I looked at the anti-windup guard.

The NaN that broke everything

The anti-windup code looked like this:

if (integral > max_integral) {
    integral = max_integral;
}
if (integral < -max_integral) {
    integral = -max_integral;
}

Perfectly reasonable code. One small problem. In C, NaN > max_integral evaluates to false. So does NaN < -max_integral. NaN is not greater than anything. NaN is not less than anything. NaN is not equal to anything, including itself. The anti-windup guard looked at the NaN, shrugged, and let it through.
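Here's a minimal sketch of that shrug, assuming a single-precision accumulator. The names clamp_integral_naive and show_nan_slips_through are made up for illustration; the clamp simply mirrors the shape of the guard above.

```c
#include <assert.h>
#include <math.h>

/* Same shape as the guard above: two ordered comparisons and nothing else. */
static float clamp_integral_naive(float integral, float max_integral)
{
    if (integral > max_integral) {
        integral = max_integral;
    }
    if (integral < -max_integral) {
        integral = -max_integral;
    }
    return integral;
}

/* Feed the guard a NaN: both comparisons are false, so neither branch runs. */
static void show_nan_slips_through(void)
{
    float out = clamp_integral_naive(NAN, 100.0f);
    assert(isnan(out)); /* the NaN came out exactly as it went in */
}
```

Ordinary out-of-range values still get clamped, which is exactly why this kind of guard passes every test that only uses real numbers.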

The PID output calculation multiplied the NaN integral by the integral gain and added the proportional and derivative terms, so the sum was NaN as well. And here's the fun part: the PWM driver interpreted that float as a raw bit pattern, which happened to map to a very large positive number. The clamp to the 0-100% output range then pinned it at 100%.

So the heaters went to full blast, and the anti-windup guard kept saying "everything's fine" because NaN comparisons kept returning false.
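To see why a NaN read as raw bits looks like a huge number, here's a small sketch. It assumes 32-bit IEEE 754 single precision and is not the actual driver code; float_bits and bits_are_nan are hypothetical helpers.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Copy the raw bits of a float into an unsigned integer (no conversion). */
static uint32_t float_bits(float value)
{
    uint32_t bits;
    assert(sizeof(float) == sizeof(uint32_t)); /* assumption: 32-bit float */
    memcpy(&bits, &value, sizeof bits);
    return bits;
}

/* IEEE 754 single precision: NaN means exponent all ones, mantissa nonzero. */
static int bits_are_nan(uint32_t bits)
{
    return (bits & 0x7F800000u) == 0x7F800000u && (bits & 0x007FFFFFu) != 0u;
}
```

Because the exponent field of any NaN is all ones, its bit pattern read as an unsigned integer is at least 0x7F800001, which is a little over two billion. Any clamp to 0-100 applied after that point will happily pin it at the top.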

Where did the NaN come from?

The humidity sensor was on the I2C bus, configured with a 10ms timeout. During the night, when the greenhouse temperature dropped and humidity rose, condensation formed on the sensor connector. The SDA line got pulled low by moisture bridging the pins, causing I2C read timeouts. The sensor driver, when it timed out, returned -1 as a raw ADC value. The conversion function dutifully computed the humidity from -1, which involved a logarithm of a negative number.

log(-0.0023) equals NaN. That NaN got passed to the PID controller as the process variable. The proportional and derivative terms handled it correctly by accident (they produced NaN, which the output clamping caught). But the integral term accumulated it. Once the integral was NaN, it stayed NaN forever, because NaN + anything = NaN.
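The poisoning is easy to reproduce. This is a sketch with a hypothetical integrator_t and integrator_step, not the controller's real code; it assumes plain rectangular integration with no input validation.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical integrator state, for illustration only. */
typedef struct {
    float integral;
} integrator_t;

/* Accumulate error * dt with no validation of the incoming error. */
static void integrator_step(integrator_t *st, float error, float dt)
{
    assert(st != NULL);
    st->integral += error * dt; /* NaN + anything is NaN, so one bad sample sticks */
}
```

Step it with a good error, then a NaN, then a good error again. The accumulator is still NaN after the third step, because there is no real number you can add to NaN to get back to a real number.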

The I2C timeout of 10ms was too aggressive. During condensation events, the bus recovery took longer than that, and every timed-out read produced another NaN. The sensor would eventually clear up as the condensation evaporated, but the integral accumulator was already poisoned.

The fix

Two main changes. Both simple. Both obvious in hindsight.

First, a NaN guard before the PID input stage:

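// isnan() and isinf() come from <math.h>
if (isnan(process_value) || isinf(process_value)) {
    // Hold last known good value
    process_value = last_valid_pv;
    sensor_fault_count++;
} else {
    // Remember this sample so the hold above always has a recent fallback
    last_valid_pv = process_value;
}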

Second, conformal coating on the sensor connectors to prevent moisture bridging. Also bumped the I2C timeout from 10ms to 50ms with a single retry, which let the bus recover from transient moisture events without triggering a read failure.
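The timeout-plus-retry part might look something like the sketch below. The i2c_read_fn callback type, sensor_read_with_retry, and the fake bus are all hypothetical stand-ins, since the real driver API isn't shown here. The one design choice worth noting is that failure is reported as an explicit false rather than a sentinel like -1 that can wander into a conversion function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bus read: returns true on success and stores the raw sample. */
typedef bool (*i2c_read_fn)(uint32_t timeout_ms, uint16_t *raw_out);

#define SENSOR_TIMEOUT_MS 50u /* bumped from 10 ms */
#define SENSOR_RETRIES    1u  /* one retry after the first attempt */

/* Try the read, retry once on failure, and report failure explicitly. */
static bool sensor_read_with_retry(i2c_read_fn read, uint16_t *raw_out)
{
    assert(read != NULL && raw_out != NULL);
    for (uint32_t attempt = 0; attempt <= SENSOR_RETRIES; attempt++) {
        if (read(SENSOR_TIMEOUT_MS, raw_out)) {
            return true;
        }
    }
    return false; /* caller must treat this as a sensor fault */
}

/* Fake bus for bench testing: fails a configurable number of times first. */
static unsigned fake_failures_left = 0;

static bool fake_read(uint32_t timeout_ms, uint16_t *raw_out)
{
    (void)timeout_ms; /* the fake bus never actually waits */
    if (fake_failures_left > 0) {
        fake_failures_left--;
        return false;
    }
    *raw_out = 512;
    return true;
}
```

With the fake bus set to fail once, the retry recovers the sample. Set it to fail twice and the caller gets a clean false instead of a number that merely looks like data.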

I also added a NaN check inside the anti-windup guard itself, because defense in depth isn't paranoia when you've seen what NaN can do at 3 AM.
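A sketch of what a NaN-aware guard can look like. The name clamp_integral_safe is hypothetical, and resetting the accumulator to zero on NaN is an assumption on my part; holding the previous value would be another reasonable policy.

```c
#include <assert.h>
#include <math.h>

/* Anti-windup clamp that refuses to pass a NaN accumulator along. */
static float clamp_integral_safe(float integral, float max_integral)
{
    assert(max_integral >= 0.0f);
    if (isnan(integral)) {
        return 0.0f; /* assumption: reset on NaN rather than hold */
    }
    /* Ordered comparisons are fine from here on, including for +/- infinity. */
    if (integral > max_integral) {
        return max_integral;
    }
    if (integral < -max_integral) {
        return -max_integral;
    }
    return integral;
}
```

The explicit isnan() check has to come first. Once NaN is ruled out, the ordinary comparisons behave the way you expect, and even an infinite accumulator gets clamped to the limit.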

The lesson

The bug wasn't in the PID algorithm. The PID math was textbook correct. The bug was in the assumption that sensor inputs would always be valid numbers. In a lab, they are. In a greenhouse at midnight in October, when condensation is forming on every cold surface, they are not.

Three takeaways from this one:

Guard your inputs, not just your outputs. The anti-windup guard protected against large numbers but was blind to NaN. Input validation at the boundary of your control system is not optional.

NaN is not a number, and it's not your friend. In C (and C++, and most languages that follow IEEE 754), every ordered comparison against NaN is false, so range checks and clamps written with < and > wave it straight through. If you work with floating point in embedded systems, you need explicit NaN checks. Period.

Test with fault injection, not just happy paths. We had hundreds of hours of simulation time on the PID controller. All with clean sensor data. Nobody thought to inject NaN into the process variable because "the sensor driver handles errors." It did. By returning a value that produced NaN one function call later.
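A fault-injection test doesn't need to be fancy. Here's a minimal sketch using a stripped-down PI controller (no derivative term, for brevity). The pi_ctrl_t struct, pi_step, the gains, and the sample values are all hypothetical; the point is only that the test feeds NaN and infinity into the process variable and checks that the output stays sane.

```c
#include <assert.h>
#include <math.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical PI controller state, for illustration only. */
typedef struct {
    float kp, ki;
    float integral, max_integral;
    float last_valid_pv;
    unsigned fault_count;
} pi_ctrl_t;

/* One control step with an input guard, anti-windup, and a 0-100% output clamp. */
static float pi_step(pi_ctrl_t *c, float setpoint, float pv, float dt)
{
    assert(c != NULL);
    if (!isfinite(pv)) {          /* NaN or infinity: hold last good value */
        pv = c->last_valid_pv;
        c->fault_count++;
    } else {
        c->last_valid_pv = pv;
    }
    float error = setpoint - pv;
    c->integral += c->ki * error * dt;
    if (c->integral > c->max_integral)  c->integral = c->max_integral;
    if (c->integral < -c->max_integral) c->integral = -c->max_integral;
    float out = c->kp * error + c->integral;
    if (out > 100.0f) out = 100.0f;
    if (out < 0.0f)   out = 0.0f;
    return out;
}

/* Inject bad samples mid-stream; true if every output stayed within 0-100%. */
static bool survives_fault_injection(void)
{
    pi_ctrl_t c = { .kp = 2.0f, .ki = 0.1f, .integral = 0.0f,
                    .max_integral = 50.0f, .last_valid_pv = 20.0f,
                    .fault_count = 0u };
    const float samples[] = { 20.0f, 20.5f, NAN, INFINITY, NAN, 21.0f, 22.0f };
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++) {
        float out = pi_step(&c, 21.0f, samples[i], 1.0f);
        if (!(out >= 0.0f && out <= 100.0f)) { /* written so NaN fails the check */
            return false;
        }
    }
    return c.fault_count == 3u; /* all three injected faults were counted */
}
```

Note the shape of the range check in the test itself: it is written as a negated "in range" condition, so a NaN output fails the test instead of slipping past it.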

Henk's tomatoes survived, by the way. Most of them. He still calls the control system "the sauna machine," which I suppose is fair. Plants don't respect your weekend plans, and neither do the physics of condensation on a cold October night.