hwmon: disambiguate colliding chip labels#3646
Conversation
Multiple hwmon nodes can be registered under a single parent device (for example asus-nb-wmi exposes one hwmon for fan control and another for WMI sensors). Both currently resolve to the same chip label (`platform_asus_nb_wmi`) and trigger "metric collected before with the same name and label values" errors at scrape time. Detect this collision in a first pass and append the chip's `name` file content (or the hwmonX basename if names also collide) to the chip label in a second pass. The include/exclude filter is moved into the same pass so user regexes match the label that is actually emitted. Fixes: prometheus#3637 Signed-off-by: Matthew Wimpelberg <matt.wimpelberg@grafana.com>
|
@SuperQ would you have a moment to take a look? Happy to address any feedback. |
|
This should add to the test fixtures so that it's tested in the end-to-end test. |
Co-authored-by: Ben Kochie <superq@gmail.com> Signed-off-by: Matthew Wimpelberg <120263653+mwimpelberg28@users.noreply.github.com>
Two new hwmon nodes share a single platform device (asus-nb-wmi) with distinct `name` file contents, exercising the disambiguation path in the end-to-end test. Without the fix in the previous commit, the duplicate base chip name `platform_asus_nb_wmi` would have triggered a registry error before any metrics were scraped. Also expand the e2e chip-include regex to admit the new chips so their disambiguated labels appear in the expected output. Signed-off-by: Matthew Wimpelberg <matt.wimpelberg@grafana.com>
This should be fixed. @SuperQ can you please review? |
|
Just opened issue #3673 and found this PR Why not just always have |
Thanks for the pointer to #3673 — this PR actually does fix that case. In your mt7996 example all three I'd lean away from always emitting
The Happy to add an e2e fixture mirroring the mt7996 layout if it'd help demonstrate the #3673 fix concretely. |
| } | ||
|
|
||
| entries := make([]hwmonEntry, 0, len(hwmonFiles)) | ||
| chipCounts := make(map[string]int) |
There was a problem hiding this comment.
I'm wondering if we could initialise this is a pre defined length.
| chipCounts := make(map[string]int) | |
| chipCounts := make(map[string]int, len(hwmonFiles)) |
Seems chipCounts will always have all the chips inside hwmonFiles
There was a problem hiding this comment.
Good call — both maps only ever get keys while iterating hwmonFiles, so len(hwmonFiles) is a correct upper bound. Done in 9dd5608.
|
|
||
| entries := make([]hwmonEntry, 0, len(hwmonFiles)) | ||
| chipCounts := make(map[string]int) | ||
| nameCounts := make(map[string]int) |
There was a problem hiding this comment.
Same here — pre-sized nameCounts with len(hwmonFiles) as well in 9dd5608.
|
@mwimpelberg28 overall LGTM, can you just fix DCO? |
Three hwmon nodes share a single ieee80211/phy0 parent device, each with
a distinct name file (mt7996_phy0_{0,1,2}) and a temp1 sensor. This is the
multi-hwmon-per-device layout reported in prometheus#3673, where all three collide on
the base chip label ieee80211_phy0 and the duplicate temp1 series trips the
registry error before any metrics are scraped.
With the disambiguation fix each node gets a unique chip label suffixed by
its name file, so the end-to-end test now exercises the prometheus#3673 scenario in
addition to the existing asus-nb-wmi case. The chip-include regex is widened
to admit the new ieee80211 chips.
Signed-off-by: Matthew Wimpelberg <matt.wimpelberg@grafana.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Matthew Wimpelberg <matt.wimpelberg@grafana.com>
Both maps are populated only while iterating hwmonFiles, so len(hwmonFiles) is a correct upper bound on the number of distinct keys and avoids incremental rehashing. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Matthew Wimpelberg <matt.wimpelberg@grafana.com>
9dd5608 to
736fbfb
Compare
|
@nicolastakashi thanks! DCO is fixed now — added the missing |
|
@SuperQ could you please have a look? |
Summary
Fixes #3637.
Multiple hwmon nodes can be registered under a single parent device — for example,
asus-nb-wmion recent ASUS laptops registers one hwmon for fan control and another for WMI sensors. Bothdevicesymlinks resolve to the same/sys/devices/platform/asus-nb-wmi, sohwmonNameproduces the sameplatform_asus_nb_wmichip label for both, and any sensor file that exists in both nodes (e.g.pwm1_enable) trips:Approach
Updatenow does two passes:/sys/class/hwmon/*, compute the device-derived base chip name for each, and count collisions.namefile content if it disambiguates, otherwise with the hwmonX basename (always unique within a boot). The include/exclude filter is also moved here so user regexes match the label that is actually emitted in the metric.Entries that already produce a unique chip label are unaffected — no surprise suffixes for users not hitting the collision.
This is closer in spirit to the discussion in #333 (the same class of bug for dual-socket
coretempboxes), but contained: the fix only kicks in when an actual collision is detected.Test plan
New
collector/hwmon_linux_test.go:TestHwmonDuplicateChipNamesAreDisambiguated— reproduces the Metric node_hwmon_pwm_enable was collected before with the same name and label values #3637 ASUS WMI scenario (two hwmon dirs sharing one platform device, both exposingpwm1_enable) and asserts bothGathersucceeds and the chip labels are distinct.TestHwmonUniqueChipNamesAreUnchanged— guards against unintended label drift for users not hitting the collision.TestHwmonDuplicateChipNamesWithSameNameFile— exercises the hwmonX-basename fallback when thenamefile content also collides.go test ./collector/), including the existing fixture-driven e2e checks.go vet ./...clean.