Improve histogram, summary performance under contention by striping observationCount #1794
+24
−7
I was working on improving the performance of opentelemetry-java metrics under high contention, and realized that the same strategy that helped over there helps the Prometheus implementation as well!
The idea here is recognizing that `Buffer.observationCount` is the bottleneck under contention. In contrast to the other histogram / summary `LongAdder` fields, `Buffer.observationCount` is an `AtomicLong`, which performs much worse than `LongAdder` under high contention. It's necessary that the type is `AtomicLong` because the CAS APIs accommodate the two-way communication that the record / collect paths need to signal that a collection has started and all records have successfully completed (preventing partial writes).

However, we can "have our cake and eat it too" by striping `Buffer.observationCount` into many instances, such that the contention on any one instance is reduced. This is actually what `LongAdder` does under the covers. This implementation stripes it into `Runtime.getRuntime().availableProcessors()` instances, and uses `Thread.currentThread().getId() % stripedObservationCounts.length` to select which instance any particular record thread should use.

Performance increase is substantial. Here's the before and after of `HistogramBenchmark` on my machine (Apple M4 Mac Pro w/ 48gb RAM):

Before:
After:
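
Separately from the benchmark numbers, here is a minimal sketch of the striping idea in isolation. It is not the code from this PR: the class name `StripedCounter` and its methods are hypothetical, and it deliberately leaves out the CAS-based signaling between the record and collect paths. It only shows how spreading increments across several `AtomicLong` instances, selected by thread id, lets different threads mostly hit different counters while the total can still be recovered by summing.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical illustration of counter striping; not the actual Buffer implementation.
public class StripedCounter {
    private final AtomicLong[] stripes;

    public StripedCounter(int stripeCount) {
        stripes = new AtomicLong[stripeCount];
        for (int i = 0; i < stripeCount; i++) {
            stripes[i] = new AtomicLong();
        }
    }

    // Pick a stripe from the current thread's id so that different threads
    // tend to land on different AtomicLong instances.
    public long increment() {
        int index = (int) (Thread.currentThread().getId() % stripes.length);
        return stripes[index].incrementAndGet();
    }

    // The logical count is the sum over all stripes. Without extra coordination
    // this is only a consistent snapshot once writers have stopped.
    public long sum() {
        long total = 0;
        for (AtomicLong stripe : stripes) {
            total += stripe.get();
        }
        return total;
    }

    public static void main(String[] args) throws InterruptedException {
        StripedCounter counter =
            new StripedCounter(Runtime.getRuntime().availableProcessors());
        Thread[] threads = new Thread[8];
        for (int t = 0; t < threads.length; t++) {
            threads[t] = new Thread(() -> {
                for (int i = 0; i < 100_000; i++) {
                    counter.increment();
                }
            });
            threads[t].start();
        }
        for (Thread thread : threads) {
            thread.join();
        }
        System.out.println(counter.sum()); // prints 800000
    }
}
```

The trade-off in this sketch is that reads become more expensive (every stripe must be visited) in exchange for cheaper writes under contention, which suits a workload where records vastly outnumber collections.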