
hwmon: disambiguate colliding chip labels #3646

Open
mwimpelberg28 wants to merge 8 commits into prometheus:master from mwimpelberg28:mwimpelberg/hwmon-dedup-chip-labels

Conversation

@mwimpelberg28


Summary

Fixes #3637.

Multiple hwmon nodes can be registered under a single parent device — for example, asus-nb-wmi on recent ASUS laptops registers one hwmon for fan control and another for WMI sensors. Both device symlinks resolve to the same /sys/devices/platform/asus-nb-wmi, so hwmonName produces the same platform_asus_nb_wmi chip label for both, and any sensor file that exists in both nodes (e.g. pwm1_enable) trips:

collected metric "node_hwmon_pwm_enable" { ... chip="platform_asus_nb_wmi" ... } was collected before with the same name and label values
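
To make the collision concrete, here is a minimal sketch of how a device-derived label ends up identical for both nodes. It is not the exporter's actual hwmonName code: `sanitize` and `baseChipLabel` are hypothetical helpers, the hwmon4/hwmon5 numbering is made up, and the only assumption carried over from the description is that the label is built from the resolved device path.

```go
package main

import (
	"fmt"
	"path/filepath"
	"regexp"
	"strings"
)

// invalidChars matches runs of characters that are replaced in a label fragment.
var invalidChars = regexp.MustCompile(`[^a-zA-Z0-9_]+`)

// sanitize is a hypothetical helper: it replaces invalid runs with "_" and
// trims leading and trailing underscores.
func sanitize(s string) string {
	return strings.Trim(invalidChars.ReplaceAllString(s, "_"), "_")
}

// baseChipLabel (hypothetical) joins the parent directory name and the
// device name of an already-resolved device path.
func baseChipLabel(devicePath string) string {
	parent, dev := filepath.Split(filepath.Clean(devicePath))
	return sanitize(filepath.Base(parent)) + "_" + sanitize(dev)
}

func main() {
	// Both device symlinks resolve to the same path, so the labels match.
	// The hwmon numbers are illustrative only.
	resolved := [][2]string{
		{"hwmon4", "/sys/devices/platform/asus-nb-wmi"},
		{"hwmon5", "/sys/devices/platform/asus-nb-wmi"},
	}
	for _, r := range resolved {
		fmt.Printf("%s -> chip=%q\n", r[0], baseChipLabel(r[1]))
	}
}
```

Since the label depends only on the resolved path, any sensor file present in both nodes yields two series with the same name and label values.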

Approach

Update now does two passes:

  • Pass 1: enumerate /sys/class/hwmon/*, compute the device-derived base chip name for each, and count collisions.
  • Pass 2: when a base name is shared, suffix the chip label with the chip's name file content if it disambiguates, otherwise with the hwmonX basename (always unique within a boot). The include/exclude filter is also moved here so user regexes match the label that is actually emitted in the metric.

Entries that already produce a unique chip label are unaffected — no surprise suffixes for users not hitting the collision.
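
The two passes can be sketched roughly as follows. This is an illustration of the approach described above, not the code in this PR: the `node` type and `disambiguate` function are hypothetical, the hwmon numbers are made up, and the "asus" name content in the second group is an assumed value chosen to force the fallback.

```go
package main

import (
	"fmt"
	"sort"
)

// node is a hypothetical, already-parsed view of one /sys/class/hwmon entry.
type node struct {
	Dir  string // hwmonX basename
	Base string // device-derived base chip label
	Name string // sanitized content of the name file, may be empty
}

// disambiguate returns the chip label to emit for each node, keyed by Dir.
func disambiguate(nodes []node) map[string]string {
	// Pass 1: count base labels and (base, name) pairs.
	baseCounts := make(map[string]int, len(nodes))
	nameCounts := make(map[string]int, len(nodes))
	for _, n := range nodes {
		baseCounts[n.Base]++
		nameCounts[n.Base+"\x00"+n.Name]++
	}

	// Pass 2: suffix only the labels that actually collide.
	labels := make(map[string]string, len(nodes))
	for _, n := range nodes {
		switch {
		case baseCounts[n.Base] == 1:
			labels[n.Dir] = n.Base // unique already, leave untouched
		case n.Name != "" && nameCounts[n.Base+"\x00"+n.Name] == 1:
			labels[n.Dir] = n.Base + "_" + n.Name // name file disambiguates
		default:
			labels[n.Dir] = n.Base + "_" + n.Dir // fall back to hwmonX
		}
	}
	return labels
}

func main() {
	labels := disambiguate([]node{
		{"hwmon0", "ieee80211_phy0", "mt7996_phy0_0"},
		{"hwmon1", "ieee80211_phy0", "mt7996_phy0_1"},
		{"hwmon2", "platform_asus_nb_wmi", "asus"}, // assumed name content
		{"hwmon3", "platform_asus_nb_wmi", "asus"}, // assumed name content
	})
	dirs := make([]string, 0, len(labels))
	for d := range labels {
		dirs = append(dirs, d)
	}
	sort.Strings(dirs)
	for _, d := range dirs {
		fmt.Println(d, labels[d])
	}
}
```

An include/exclude regex would then be matched against the values in `labels` rather than against `Base`. The sketch does not guard against a suffixed label colliding with some other node's unique base label.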

This is closer in spirit to the discussion in #333 (the same class of bug for dual-socket coretemp boxes), but contained: the fix only kicks in when an actual collision is detected.

Test plan

New collector/hwmon_linux_test.go:

  • TestHwmonDuplicateChipNamesAreDisambiguated: reproduces the #3637 ASUS WMI scenario (two hwmon dirs sharing one platform device, both exposing pwm1_enable) and asserts both that Gather succeeds and that the chip labels are distinct.
  • TestHwmonUniqueChipNamesAreUnchanged: guards against unintended label drift for users not hitting the collision.
  • TestHwmonDuplicateChipNamesWithSameNameFile: exercises the hwmonX-basename fallback when the name file content also collides.
  • Full collector test suite still passes (go test ./collector/), including the existing fixture-driven e2e checks.
  • go vet ./... clean.

Multiple hwmon nodes can be registered under a single parent device
(for example asus-nb-wmi exposes one hwmon for fan control and another
for WMI sensors). Both currently resolve to the same chip label
(`platform_asus_nb_wmi`) and trigger "metric collected before with the
same name and label values" errors at scrape time.

Detect this collision in a first pass and append the chip's `name` file
content (or the hwmonX basename if names also collide) to the chip
label in a second pass. The include/exclude filter is moved into the
same pass so user regexes match the label that is actually emitted.

Fixes: prometheus#3637

Signed-off-by: Matthew Wimpelberg <matt.wimpelberg@grafana.com>
@mwimpelberg28

Author

@SuperQ would you have a moment to take a look? Happy to address any feedback.

@SuperQ

SuperQ commented May 15, 2026

Member

This should add to the test fixtures so that it's tested in the end-to-end test.

Comment thread collector/hwmon_linux_test.go Outdated
mwimpelberg28 and others added 2 commits May 15, 2026 06:20
Co-authored-by: Ben Kochie <superq@gmail.com>
Signed-off-by: Matthew Wimpelberg <120263653+mwimpelberg28@users.noreply.github.com>
Two new hwmon nodes share a single platform device (asus-nb-wmi) with
distinct `name` file contents, exercising the disambiguation path in
the end-to-end test. Without the fix in the previous commit, the
duplicate base chip name `platform_asus_nb_wmi` would have triggered a
registry error before any metrics were scraped.

Also expand the e2e chip-include regex to admit the new chips so
their disambiguated labels appear in the expected output.

Signed-off-by: Matthew Wimpelberg <matt.wimpelberg@grafana.com>
@mwimpelberg28 mwimpelberg28 requested a review from SuperQ May 15, 2026 16:40
@mwimpelberg28

mwimpelberg28 commented May 16, 2026

Author

> This should add to the test fixtures so that it's tested in the end-to-end test.

This should be fixed. @SuperQ can you please review?

@champtar

champtar commented Jun 1, 2026


Just opened issue #3673 and found this PR

Why not just always have chip and chip_name labels?

@mwimpelberg28

Author

> Why not just always have chip and chip_name labels?

Thanks for the pointer to #3673. This PR does fix that case. In your mt7996 example all three hwmon* nodes resolve to the same device (ieee80211/phy0), so they collide on the chip label today. With this PR the collision is detected and each node gets a unique chip label (suffixed with its distinct name content, e.g. ieee80211_phy0_mt7996_phy0_0/_1/_2), so the "was collected before with the same name and label values" error goes away.

I'd lean away from always emitting chip_name as a label on every metric for two reasons:

  1. It's a breaking change for everyone. Adding chip_name to every hwmon series changes the label set on all existing time series, which breaks existing queries, recording rules, and dashboards even for users who never hit a collision. This PR is deliberately conservative — labels only change for nodes that actually collide.
  2. It wouldn't fully solve it anyway. Two hwmon nodes under the same parent can also share identical name content (covered here by TestHwmonDuplicateChipNamesWithSameNameFile), so chip+chip_name still isn't guaranteed unique — you'd need the hwmonX fallback regardless.

The node_hwmon_chip_names info-metric is also the idiomatic Prometheus join pattern (like node_uname_info) for attaching human-readable names without bloating every series, so I'd prefer to keep it.

Happy to add an e2e fixture mirroring the mt7996 layout if it'd help demonstrate the #3673 fix concretely.

Comment thread collector/hwmon_linux.go Outdated
}

entries := make([]hwmonEntry, 0, len(hwmonFiles))
chipCounts := make(map[string]int)


I'm wondering if we could initialise this with a predefined length.

Suggested change
- chipCounts := make(map[string]int)
+ chipCounts := make(map[string]int, len(hwmonFiles))

It seems chipCounts will always end up holding every chip in hwmonFiles.

Author


Good call — both maps only ever get keys while iterating hwmonFiles, so len(hwmonFiles) is a correct upper bound. Done in 9dd5608.
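
As a small illustration of the capacity hint discussed in this thread, the sketch below pre-sizes a counting map with the length of its input. `countKeys` is a hypothetical helper, not a function from the collector.

```go
package main

import "fmt"

// countKeys tallies how often each key occurs. Keys are only ever added
// inside this loop, so len(keys) is an upper bound on the number of
// distinct entries and is a safe capacity hint.
func countKeys(keys []string) map[string]int {
	counts := make(map[string]int, len(keys))
	for _, k := range keys {
		counts[k]++
	}
	return counts
}

func main() {
	counts := countKeys([]string{
		"platform_asus_nb_wmi",
		"platform_asus_nb_wmi",
		"ieee80211_phy0",
	})
	fmt.Println(counts["platform_asus_nb_wmi"], counts["ieee80211_phy0"], len(counts))
}
```

The hint only affects the initial allocation; the map still grows if more keys are added later, so an overestimate costs some memory but never correctness.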

Comment thread collector/hwmon_linux.go Outdated

entries := make([]hwmonEntry, 0, len(hwmonFiles))
chipCounts := make(map[string]int)
nameCounts := make(map[string]int)


Ditto.

Author


Same here — pre-sized nameCounts with len(hwmonFiles) as well in 9dd5608.

@nicolastakashi


@mwimpelberg28 overall LGTM, can you just fix DCO?

mwimpelberg28 and others added 2 commits June 10, 2026 07:25
Three hwmon nodes share a single ieee80211/phy0 parent device, each with
a distinct name file (mt7996_phy0_{0,1,2}) and a temp1 sensor. This is the
multi-hwmon-per-device layout reported in prometheus#3673, where all three collide on
the base chip label ieee80211_phy0 and the duplicate temp1 series trips the
registry error before any metrics are scraped.

With the disambiguation fix each node gets a unique chip label suffixed by
its name file, so the end-to-end test now exercises the prometheus#3673 scenario in
addition to the existing asus-nb-wmi case. The chip-include regex is widened
to admit the new ieee80211 chips.

Signed-off-by: Matthew Wimpelberg <matt.wimpelberg@grafana.com>

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Matthew Wimpelberg <matt.wimpelberg@grafana.com>
Both maps are populated only while iterating hwmonFiles, so len(hwmonFiles)
is a correct upper bound on the number of distinct keys and avoids
incremental rehashing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Matthew Wimpelberg <matt.wimpelberg@grafana.com>
@mwimpelberg28 mwimpelberg28 force-pushed the mwimpelberg/hwmon-dedup-chip-labels branch from 9dd5608 to 736fbfb Compare June 10, 2026 11:26
@mwimpelberg28

Author

@nicolastakashi thanks! DCO is fixed now — added the missing Signed-off-by to the two commits that were lacking it. The check is passing. 🟢

@mwimpelberg28

Author

@SuperQ could you please have a look?



Development

Successfully merging this pull request may close these issues.

Metric node_hwmon_pwm_enable was collected before with the same name and label values

4 participants