Laptop Battery Woes

Posted 03/02/19 #hardware #linux

My daily driver laptop is an HP EliteBook 8570p. Pretty nice little machine, and I got it for sub-$200 used. Total brick of a business laptop – it's got a serial port!

It's now been about two years since I bought it used, and it's probably coming up on being six or seven years old all told. When I bought it I threw an SSD in it, and didn't do anything else to it until recently. Its battery life has never been great, and it's been frozen a couple of times, so it barely worked on battery.

As a result I recently got a replacement battery, and got the super-XL sticking-out-the-back kind. Popped the old one out, socketed the new one in, and was off to the races.

Until it died about a week later.

What the hell? Oh well, I thought, infant mortality happens. The failure mode of this battery in particular was that it would report 72% charge constantly, but the laptop died immediately when unplugged from AC. Amazon replacement was processed quickly and the replacement battery (of the same brand) seemed fine.

Until it died about a week later.

WTF! Same failure mode (but this one was stuck reporting 30 instead of 72 percent). This time I had a suspicion as to what killed it – it died as I was building the Signal client from the AUR... excessive load? Perhaps this battery wasn't actually built to the HP spec.

So I returned the replacement battery, and got a different brand of 9-cell. That battery worked fine...

Until it died about a week later.

Well, actually, this one didn't die. I'm using it right now to type this. It's a little more complicated. Some manufacturing difference meant that instead of permanently kicking the bucket, this battery would only crap out temporarily. The laptop would die immediately, but after letting it sit for about 30 seconds it could be started again.

I did some testing and discovered that high CPU load was what was causing it to drop out. stress-ng --cpu 8 would kill it immediately, and even tasks like pipenv install (really not that intensive) would kill the laptop after about five seconds of load. I suspect that this battery is also not manufactured to the actual standard, and is instead a good-enough-for-most knockoff. (For reference, the 6-cell first-party battery the laptop came with worked just fine under stress-ng, but had degraded over the years to only last about 10 mins with nothing running.)


What to do about it?

After some more experimentation, and a lot of waiting for whatever fuse in the battery to reset, I figured out the actual cause is not directly power, but frequency. The operating range for an i7-3740QM is 1200MHz to 3700MHz. I discovered that on my machine, any time the CPU turbos over 3.1GHz the battery eats it. I don't have an adequate explanation as to why it does this, but it's repeatable.

Thankfully, the Linux power management subsystem is relatively easy to interface with. It exposes the relevant attributes of the CPU frequency scaling under /sys/devices/system/cpu/cpufreq/, and it can also be managed neatly with the tool cpupower. The immediate solution was to run cpupower frequency-set -u 3.0GHz, and that solved the crashing issue. With the maximum turbo speed limited to 3.0GHz the system was stable on battery, even under load from stress-ng --cpu 8.
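
For poking at those sysfs attributes by hand, here is a minimal sketch. It assumes the usual layout of one policyN directory per frequency domain, each holding scaling_min_freq, scaling_max_freq and scaling_cur_freq with values in kHz. The function names and the optional root-directory argument are made up for illustration (the argument is only there so the functions can be pointed at another directory).

```shell
#!/bin/sh
# Sketch: inspect and cap CPU frequency limits through sysfs.
# Assumes one policy* directory per frequency domain, values in kHz.
# Function names and the optional root argument are illustrative.

CPUFREQ_ROOT=/sys/devices/system/cpu/cpufreq

# Print "policy min max current" (kHz) for every policy directory.
show_freq_limits() {
    root=${1:-$CPUFREQ_ROOT}
    for policy in "$root"/policy*; do
        [ -r "$policy/scaling_max_freq" ] || continue
        printf '%s %s %s %s\n' "${policy##*/}" \
            "$(cat "$policy/scaling_min_freq")" \
            "$(cat "$policy/scaling_max_freq")" \
            "$(cat "$policy/scaling_cur_freq")"
    done
}

# Write a new upper limit (kHz) to every policy; needs root on a real system.
cap_max_freq() {
    khz=$1
    root=${2:-$CPUFREQ_ROOT}
    for policy in "$root"/policy*; do
        [ -w "$policy/scaling_max_freq" ] || continue
        echo "$khz" > "$policy/scaling_max_freq"
    done
}
```

Run as root, cap_max_freq 3000000 should have roughly the same effect as the cpupower command above, and show_freq_limits is a quick way to confirm the limit actually stuck.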

Sticking this in a script to run at startup isn't the end of the story though, for two reasons:

The first is straightforward enough: I want all 3.7GHz when I'm plugged in. The second is more dastardly: The laptop would fail to boot intermittently on battery. I have full disk encryption set up on my system, and it would (sometimes) turn off right after I entered my passphrase during boot.

The issue was that some intensive computation that was run as part of the decryption step (my guess would be some key-derivation or keyfile-unlocking) was causing the system to turbo itself above the magic 3.1GHz number. The problem though is: How do I add a startup script to fix this when the problem occurs before the root filesystem is even mounted?


The Solution: Custom mkinitcpio hook

If you haven't looked into it, most Linux systems have two phases of boot: first comes the "pre" boot, where the init ramdisk is loaded and devices are started (including the full disk crypto), and then the system switches to the main partition and runs what most of us think of as startup scripts. (How do the initial ramdisk and the kernel get loaded, you ask? The bootloader does it, and the bootloader cheats. In my system, the kernel and init ramdisk aren't encrypted, and they aren't in the root partition; they are in the EFI system partition. That's where rEFInd looks for them and loads them into memory, using interfaces to the disk provided by the UEFI implementation from HP.)

On Arch, the construction of the initial ramdisk is governed by a tool called mkinitcpio. These ramdisks are built from a collection of "hooks". A hook can have two parts: a little script that adds some files to the ramdisk when the image is built, and a script that is copied into the image and run during early boot. These are respectively called build hooks and run hooks. The hooks that ship with your system live in /usr/lib/initcpio/{install,hooks} (build hooks in install, run hooks in hooks). The set of hooks that compose the init ramdisk image is configured in /etc/mkinitcpio.conf. The relevant line is

HOOKS=(base udev autodetect modconf block \
	encrypt lvm2 filesystems keyboard resume fsck)

There's also /etc/initcpio/{install,hooks}, which is where local hooks are intended to be placed – just what we are looking for!

After doing the research to figure out the problem, the solution is remarkably simple.

/etc/initcpio/install/setfreqlim:

#!/bin/bash

build() {
    add_runscript #Copy the runtime hook into the image
}

help() {
    cat <<HELPEOF
This hook sets upper cpu freq limit at boot
HELPEOF
}

/etc/initcpio/hooks/setfreqlim:

#!/usr/bin/ash

run_hook() {
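  # Limit the maximum frequency to 3GHz (sysfs takes the value in kHz)
  for policy in /sys/devices/system/cpu/cpufreq/* ; do
    # Skip entries that don't expose a writable limit
    [ -w "$policy/scaling_max_freq" ] || continue
    echo 3000000 > "$policy/scaling_max_freq"
  done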
}

/etc/mkinitcpio.conf: (relevant part)

HOOKS=(base udev setfreqlim autodetect modconf block \
	encrypt lvm2 filesystems keyboard resume fsck)

Note that setfreqlim (our new hook) appears before encrypt – the one that was causing our crashes.
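
Since the ordering is the whole point, it can be worth checking mechanically before rebuilding the image. Here is a rough sketch, assuming a single active HOOKS=(...) line (possibly wrapped with backslashes, as above) with no parentheses inside it. hook_runs_before is a made-up helper name, not something mkinitcpio provides.

```shell
#!/bin/sh
# Sketch: check that one hook is listed before another in mkinitcpio.conf.
# Assumes one uncommented HOOKS=(...) line, optionally backslash-wrapped.
# hook_runs_before is an illustrative helper, not part of mkinitcpio.

hook_runs_before() {
    conf=$1; first=$2; second=$3
    # Drop comment lines, flatten newlines and backslashes to spaces,
    # then keep only what sits between "HOOKS=(" and the next ")".
    hooks=$(grep -v '^[[:space:]]*#' "$conf" | tr '\n\\' '  ' |
        sed -n 's/.*HOOKS=(\([^)]*\)).*/\1/p')
    seen_first=no
    for hook in $hooks; do
        [ "$hook" = "$first" ] && seen_first=yes
        if [ "$hook" = "$second" ]; then
            # Succeed only if the first hook was already seen.
            [ "$seen_first" = yes ]
            return
        fi
    done
    return 1    # the second hook is not in the list at all
}

# Possible use:
#   hook_runs_before /etc/mkinitcpio.conf setfreqlim encrypt && echo "order ok"
```

The function returns non-zero both when the order is wrong and when the second hook is missing, which is the conservative answer for this use.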

After adding our custom hook, all that was needed was the standard invocation mkinitcpio -p linux to update the init ramdisk stored in the EFI system partition.


udev rules for dynamic frequency limiting

The other issue I mentioned – wanting to only limit frequency on battery – had a simpler solution. All that was required were two udev rules to limit and un-limit frequency in response to events from the power subsystem:

/etc/udev/rules.d/90-battery.rules:

SUBSYSTEM=="power_supply", ENV{POWER_SUPPLY_STATUS}!="Discharging", \
	RUN+="/usr/bin/cpupower frequency-set -u 3.7GHz"
SUBSYSTEM=="power_supply", ENV{POWER_SUPPLY_STATUS}=="Discharging", \
	RUN+="/usr/bin/cpupower frequency-set -u 3GHz"

I was actually kind of surprised that this works! That is, I was surprised that the update happens fast enough to keep the battery from crapping out, even if you unplug it while it's under 100% load and turbo'd up. If you have to do the same thing, it's helpful to plug in and unplug a few times to see which event fires first on your system. For example, on my box the battery device notes that it is discharging a good half second before the AC notices it has been unplugged.
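
For checking the state by hand rather than waiting for an event, the same decision can be sketched as a small shell function. This is not what the rules above do internally, just an equivalent check. It assumes every battery under /sys/class/power_supply has a status file that reads "Discharging" when running unplugged. pick_max_freq and its optional root argument are illustrative names.

```shell
#!/bin/sh
# Sketch: pick a frequency cap from the battery state in sysfs.
# Assumes batteries under /sys/class/power_supply expose a "status" file
# that reads "Discharging" when the machine is running unplugged.
# pick_max_freq and its optional root argument are illustrative names.

pick_max_freq() {
    root=${1:-/sys/class/power_supply}
    for status in "$root"/*/status; do
        [ -r "$status" ] || continue
        if [ "$(cat "$status")" = "Discharging" ]; then
            echo 3GHz      # on battery: stay under the 3.1GHz trouble spot
            return 0
        fi
    done
    echo 3.7GHz            # on AC (or no battery found): full turbo
}

# Possible use from a root shell:
#   cpupower frequency-set -u "$(pick_max_freq)"
```

Defaulting to the full 3.7GHz when no battery is found mirrors the != "Discharging" rule above. Flipping that default would be the safer choice if the battery device is slow to show up.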


Bonus issue: ACPI power information not updating

One final problem I encountered was that the battery charge state was not being correctly updated. Checking the state with upower would report it as having been updated every 30 seconds or so, but the state would not change. Polling it directly with acpi -V would report an accurate state, and cause the cached state in upower to update. This had a simple, ugly, and functional solution:

/etc/systemd/system/acpipoll.service:

[Unit]
Description=Force ACPI info to update

[Service]
Type=oneshot
ExecStart=/usr/bin/acpi -V

/etc/systemd/system/acpipoll.timer:

[Unit]
Description=Run acpi -V regularly

[Timer]
OnBootSec=5s
OnUnitActiveSec=5s

[Install]
WantedBy=timers.target

Once this was ironed out, the info in upower was up to date. This is visible in my i3blocks setup. I use the battery2 block from the i3blocks-contrib collection. The last trick was to make the block respond to updates via udev, instead of polling. This meant configuring the block with a signal and adding a rule to my udev config.

~/.config/i3/i3blocks/i3blocks.conf: (location depends on your setup)

[battery2]
command=~/.config/i3/i3blocks/battery2
markup=pango
interval=30
signal=2

/etc/udev/rules.d/90-battery.rules: (addition)

SUBSYSTEM=="power_supply", RUN+="/usr/bin/pkill -RTMIN+2 i3blocks"

The end

Yikes! What a mess.

Well, at least it works now. In fact, it works great and I get ~4 hours of active use out of a charge. Interesting learning experience, but I think next time I might just spring for the first-party 9-cell.

I'm also curious whether (and how) this would have worked under Windows. This smells like the kind of thing Microsoft might have a special "hack" for. (Un)Fortunately I've totally nuked the Windows install on this thing and don't plan on reinstalling it any time soon.

I'll leave it to a future post to talk about the other two hardware issues I've fixed on my laptop recently, both thermal.