perfc 0.11.0
perfc is a C++17 library providing lightweight performance counters.
Use waf to build, run unit tests, and install.
To use perfc in a wtools project, import the pkg-config dependency perfc, e.g.:
Trivial use case (see also perfcDemo in roadrunner/extras):
Many performance-sensitive applications require metrics with such low overhead that they can safely be included in release builds and act as a source of application telemetry. The perfc project provides building blocks to do this using atomic performance counters.
In the typical scenario, performance counters are updated by performance-sensitive threads and sampled by a performance-insensitive thread.
The core of the library is the perfc::Counter type (see also section Counters below for examples and group Counters for details), which represents a thread-safe atomic (possibly lock-free) value with or without a timestamp.
A common foreseen scenario is that an application has potentially many performance counters that need to be sampled at intervals and, e.g., published as application telemetry. To facilitate this, perfc provides perfc::Register (see also section Register below and group Counter Register), which allows registration of counters together with metadata and enumeration of registered counters.
#include <perfc/counter.hpp>
perfc provides the templated counter type perfc::Counter<T,Clock>, which represents either a counter with timestamp, if Clock is a TrivialClock, or a counter without timestamp, if Clock is void.
A small example where a timestamped counter is written to from a worker thread and read from by a monitoring thread:
All templates provide the following basic operations shown above:
perfc::Counter<T>::Store(ValueType, MemoryOrder) -> void
perfc::Counter<T>::Load(MemoryOrder) -> ValueType
where ValueType is simply T if Clock is void, and perfc::Timestamped<T,TimePoint> otherwise.
If Clock is a TrivialClock the partial specialization perfc::Counter<T,Clock> provides the additional operation where the current time is automatically sampled from the clock:
perfc::Counter<T,Clock>::Store(T, MemoryOrder) -> void
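To make the shape of Store and Load more concrete, the following is a minimal, self-contained sketch built on std::atomic. It is not perfc's implementation: SketchCounter and RunWorkerAndMonitor are hypothetical names, std::memory_order stands in for perfc's MemoryOrder, the default memory orders are assumptions, and only the case without timestamp is modelled.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

// Hypothetical stand-in for a counter without timestamp (not perfc's type).
template <typename T>
class SketchCounter {
public:
    using ValueType = T;

    // Default memory orders are assumptions made for this sketch.
    void Store(ValueType value,
               std::memory_order order = std::memory_order_release) noexcept {
        value_.store(value, order);
    }

    ValueType Load(std::memory_order order = std::memory_order_acquire) const noexcept {
        return value_.load(order);
    }

private:
    std::atomic<T> value_{T{}};
};

// A worker thread publishes its loop count while a monitor thread samples it.
// Returns the counter value observed after both threads have finished.
inline std::uint64_t RunWorkerAndMonitor(std::uint64_t iterations) {
    SketchCounter<std::uint64_t> loops;
    std::atomic<bool> done{false};

    std::thread monitor([&] {
        std::uint64_t last = 0;
        while (!done.load(std::memory_order_acquire)) {
            const std::uint64_t now = loops.Load(std::memory_order_relaxed);
            assert(now >= last);  // samples of one atomic never go backwards here
            last = now;
            std::this_thread::yield();
        }
    });

    std::thread worker([&] {
        for (std::uint64_t i = 1; i <= iterations; ++i) {
            loops.Store(i, std::memory_order_relaxed);  // cheap update on the hot path
        }
        done.store(true, std::memory_order_release);
    });

    worker.join();
    monitor.join();
    return loops.Load();
}
```

Calling RunWorkerAndMonitor(1000) returns 1000 once both threads have been joined, since joining the worker orders its last store before the final load.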
There are aliases defined for common counter types:
| Alias | Type |
|---|---|
| Counters without timestamp | |
| perfc::CounterDouble | perfc::Counter<double, void> |
| perfc::CounterU64 | perfc::Counter<std::uint64_t, void> |
| perfc::CounterI64 | perfc::Counter<std::int64_t, void> |
| Counters with timestamp | |
| perfc::CounterDoubleTs | perfc::Counter<double, std::chrono::steady_clock> |
| perfc::CounterU64Ts | perfc::Counter<std::uint64_t, std::chrono::steady_clock> |
| perfc::CounterI64Ts | perfc::Counter<std::int64_t, std::chrono::steady_clock> |
See group Counter Register and perfc::Register for interface details.
A lock-free counter type has very little performance impact. On x86-64 a relaxed counter store results in a mov instruction to main memory. With an internal timestamp, however, there is the additional cost of querying the clock, and the counter load/store operations are likely not lock-free. The perfc project contains benchmarks (see section Benchmarks) that can be executed to observe actual results.
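Whether a given value type can be handled lock-free depends on the platform and can be queried from the standard library. The sketch below does this for a plain 64-bit value and for a hypothetical timestamped pair; SketchTimestamped is an assumption made for illustration and need not match the layout of perfc::Timestamped.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <type_traits>

// Hypothetical value-plus-timestamp pair (not perfc's Timestamped type).
struct SketchTimestamped {
    std::uint64_t value;
    std::chrono::steady_clock::time_point time;
};

// std::atomic requires a trivially copyable value type.
static_assert(std::is_trivially_copyable<SketchTimestamped>::value,
              "value type must be trivially copyable");

// Compile-time answers; the results depend on the target platform.
constexpr bool kPlainAlwaysLockFree =
    std::atomic<std::uint64_t>::is_always_lock_free;
constexpr bool kTimestampedAlwaysLockFree =
    std::atomic<SketchTimestamped>::is_always_lock_free;

inline void PrintLockFreedom() {
    std::printf("plain 64-bit value: %s\n",
                kPlainAlwaysLockFree ? "always lock-free" : "not always lock-free");
    std::printf("value with timestamp (%zu bytes): %s\n",
                sizeof(SketchTimestamped),
                kTimestampedAlwaysLockFree ? "always lock-free" : "not always lock-free");
}
```

The sketch only queries the property and never loads or stores the wider type, so it says nothing about the actual cost of such operations.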
For the absolute highest performance, lock-free counters should be used. In addition, counters updated at the same time (e.g. at the end of a loop) should be laid out in contiguous memory, the idea being that multiple counters fitting on the same cache line lead to higher throughput. This has been observed in the benchmarks. For example, without contention on an Intel(R) Xeon(R) CPU E5-2687W 0 @ 3.10GHz with benchmark ReleaseAcquireBatchedContention (tip: repeat the test yourself with command perfBench --benchmark_filter=CounterFixture/ReleaseAcquireBatchedContention/.*threads:1/*):
| Number of counters | Mean Store Rate |
|---|---|
| 1 | 680.183M/s |
| 8 | 1.37381G/s |
| 16 | 1.45461G/s |
| 1 with timestamp | 15.6979M/s (1) |
(1): Given for comparison; this result uses non-lock-free operations and queries the clock for every store operation.
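The contiguous layout recommended above can be sketched with standard-library atomics as follows. The 64-byte cache-line size is an assumption (typical for x86-64 but not guaranteed), and LoopCounters, StoreAll and SumAll are hypothetical names rather than part of perfc.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>

// Assumed cache-line size; adjust for the target platform.
constexpr std::size_t kAssumedCacheLine = 64;
constexpr std::size_t kCounterCount = 8;

// Counters that are updated together, stored contiguously and starting on a
// cache-line boundary so that eight 64-bit values can share one line.
struct alignas(kAssumedCacheLine) LoopCounters {
    std::array<std::atomic<std::uint64_t>, kCounterCount> values{};
};

// Update all counters at the end of a loop iteration with relaxed stores.
inline void StoreAll(LoopCounters& counters, std::uint64_t value) {
    for (auto& v : counters.values) {
        v.store(value, std::memory_order_relaxed);
    }
}

// Sample all counters, e.g. from a monitoring thread.
inline std::uint64_t SumAll(const LoopCounters& counters) {
    std::uint64_t sum = 0;
    for (const auto& v : counters.values) {
        sum += v.load(std::memory_order_relaxed);
    }
    return sum;
}
```

Aligning the struct also keeps unrelated data from sharing the line with the counters, at the cost of some padding when fewer counters are used.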
roadrunner/extras project.
The test names have the following pattern:
<Fixture>/<Test Name>/<Counter#>/.../threads:<Thread#>
or
<Benchmark>_<Stats>
when reporting statistics over multiple runs of <Benchmark>.
| <Fixture> | Notes |
|---|---|
| CounterFixture | 64-bit counters without timestamp (lock-free on x86-64). |
| CounterSteadyClockFixture | 64-bit counters with timestamp. |
| <Test Name> | Notes |
|---|---|
| ReleaseAcquireContention | Explores the performance impact of worst-case counter access contention. The first thread continuously performs stores whereas subsequent threads continuously perform loads. Results from the benchmark with 1 thread serve as the baseline with no contention. Results with two (or more) threads show the impact of cache coherency. |
| ReleaseAcquireBatchedContention | Similar to the ReleaseAcquireContention test but performs relaxed operations in batches. At the end of each batch the operations are synchronized with the barriers perfc::CounterRelease and perfc::CounterAcquire. |
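The batching idea behind ReleaseAcquireBatchedContention can be sketched with standard-library fences. This is an illustration only: std::atomic_thread_fence is used as a stand-in, and it is an assumption that perfc::CounterRelease and perfc::CounterAcquire behave comparably. Batch, StoreBatch, LoadBatchIsConsistent and RunBatchedDemo are hypothetical names, as is the extra published sequence number used to pair the fences.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <thread>

constexpr std::size_t kBatchSize = 8;

struct Batch {
    std::array<std::atomic<std::uint64_t>, kBatchSize> counters{};
    std::atomic<std::uint64_t> published{0};  // number of the last complete batch
};

// Writer: relaxed stores for the whole batch, then one release barrier.
inline void StoreBatch(Batch& batch, std::uint64_t value) {
    for (auto& c : batch.counters) {
        c.store(value, std::memory_order_relaxed);
    }
    std::atomic_thread_fence(std::memory_order_release);
    batch.published.store(value, std::memory_order_relaxed);
}

// Reader: one acquire barrier, then relaxed loads for the whole batch.
// Returns true if every counter is at least as new as the batch number seen.
inline bool LoadBatchIsConsistent(const Batch& batch) {
    const std::uint64_t seen = batch.published.load(std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_acquire);
    for (const auto& c : batch.counters) {
        if (c.load(std::memory_order_relaxed) < seen) {
            return false;
        }
    }
    return true;
}

// Runs one writer and one reader; returns true if the reader never saw a
// counter older than the batch number it had read.
inline bool RunBatchedDemo(std::uint64_t batches) {
    Batch batch;
    std::atomic<bool> ok{true};

    std::thread reader([&] {
        while (batch.published.load(std::memory_order_relaxed) < batches) {
            if (!LoadBatchIsConsistent(batch)) {
                ok.store(false, std::memory_order_relaxed);
            }
        }
    });
    std::thread writer([&] {
        for (std::uint64_t i = 1; i <= batches; ++i) {
            StoreBatch(batch, i);
        }
    });

    writer.join();
    reader.join();
    return ok.load() && LoadBatchIsConsistent(batch);
}
```

The point of the pattern is that the synchronization cost is paid once per batch instead of once per counter, while each individual store and load stays relaxed.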
| <Stats> | Notes |
|---|---|
| mean | Mean |
| median | Median |
| stddev | Standard deviation |
Examples:
CounterFixture/ReleaseAcquireContention/1/repeats:5/real_time/threads:1
Uses fixture CounterFixture and test ReleaseAcquireContention; each iteration of the test operates on 1 counter and the benchmark is executed with 1 thread (i.e. no contention).
CounterSteadyClockFixture/ReleaseAcquireContention/8/repeats:5/real_time/threads:2
Uses fixture CounterSteadyClockFixture and test ReleaseAcquireContention; each iteration of the test operates on 8 counters and two threads compete to access the same counters.
CounterSteadyClockFixture/ReleaseAcquireBatchedContention/8/repeats:5/real_time/threads:2_stddev
Provides the standard deviation over the five (repeats:5) separate benchmark runs.
The result of each test has the form:
where each of "Time", "CPU" and "Rate" is the average over the iterations. The user counter "Rate" represents a normalized throughput for each case, so that different configurations and benchmarks can be compared. More precisely, it is the average per-thread load/store operation rate on all counters.
AtomicCounter specifies requirements that are met by all perfc::Counter types.
For a type C:
- C satisfies DefaultConstructible.
- C::ValueType satisfies TriviallyCopyable, CopyConstructible and CopyAssignable.
- C::ClockType satisfies TrivialClock or is void.
- C::TimePointType is C::ClockType::time_point, or void if C::ClockType is void.
- C::CounterType is std::atomic<C::ValueType>.
- C::IS_ALWAYS_LOCK_FREE is true if operations are always lock-free, false otherwise.

Given
- c, an lvalue of type C
- t, an lvalue of type C::ValueType
- o, an expression of type perfc::MemoryOrder

| Expression | Return Value | Semantics |
|---|---|---|
| c.Store(t) (1) | void | Stores the value atomically using (1) a default memory order or (2) the specified memory order. |
| c.Store(t, o) (2) | void | |
| c.Load() (1) | C::ValueType | Loads the value atomically using (1) a default memory order or (2) the specified memory order. |
| c.Load(o) (2) | C::ValueType | |
| c.IsLockFree() | bool | Queries whether all operations are lock-free (returning true) or not (returning false). |
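Part of these requirements can be checked at compile time. The following C++17 sketch uses the detection idiom for the expressions in the table. SatisfiesAtomicCounterSketch and SketchCounterU64 are hypothetical names, std::memory_order stands in for perfc::MemoryOrder, and the ClockType, TimePointType and CounterType requirements are deliberately left out, so this is a partial check rather than a full statement of AtomicCounter.

```cpp
#include <atomic>
#include <cstdint>
#include <type_traits>
#include <utility>

// Primary template: C does not provide the required members.
template <typename C, typename = void>
struct SatisfiesAtomicCounterSketch : std::false_type {};

// Chosen when ValueType, Store, Load, IsLockFree and IS_ALWAYS_LOCK_FREE exist.
template <typename C>
struct SatisfiesAtomicCounterSketch<
    C,
    std::void_t<typename C::ValueType,
                decltype(std::declval<C&>().Store(
                    std::declval<typename C::ValueType&>())),
                decltype(std::declval<C&>().Load()),
                decltype(std::declval<C&>().IsLockFree()),
                decltype(C::IS_ALWAYS_LOCK_FREE)>>
    : std::integral_constant<
          bool,
          std::is_default_constructible<C>::value &&
              std::is_trivially_copyable<typename C::ValueType>::value &&
              std::is_same<decltype(std::declval<C&>().Load()),
                           typename C::ValueType>::value &&
              std::is_same<decltype(std::declval<C&>().IsLockFree()),
                           bool>::value> {};

// Hypothetical counter used to exercise the check (not a perfc type).
struct SketchCounterU64 {
    using ValueType = std::uint64_t;
    static constexpr bool IS_ALWAYS_LOCK_FREE =
        std::atomic<ValueType>::is_always_lock_free;

    void Store(ValueType v,
               std::memory_order o = std::memory_order_release) noexcept {
        value_.store(v, o);
    }
    ValueType Load(std::memory_order o = std::memory_order_acquire) const noexcept {
        return value_.load(o);
    }
    bool IsLockFree() const noexcept { return value_.is_lock_free(); }

private:
    std::atomic<ValueType> value_{0};
};

static_assert(SatisfiesAtomicCounterSketch<SketchCounterU64>::value,
              "stand-in counter should pass the partial check");
static_assert(!SatisfiesAtomicCounterSketch<int>::value,
              "int has no ValueType, Store or Load");
```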
Thread-safe - means that member or non-member functions are safe to call in parallel.
Thread-compatible - applies to classes as a whole and means that no external synchronization is required for parallel accesses to separate class instances or for const-only accesses on the same instance. Mixed const/non-const access must be externally synchronized.
Thread-hostile - means that parallel access is never safe. This may occur if e.g. a free function (or const member function) accesses a global shared state in a non-const fashion.
This table summarizes the operations that are safe in parallel for the different classifications:
| Parallel operation | Thread-safe | Thread-compatible | Thread-hostile |
|---|---|---|---|
| Const-only | x | x | |
| Non-const w/ separate instances | x | x | |
| Non-const w/ same instance | x | | |
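The difference between the first two classifications can be illustrated with a short sketch. PlainTally and AtomicTally are hypothetical types introduced here for illustration: the first is thread-compatible (parallel non-const use is safe only on separate instances), the second is thread-safe (parallel non-const use is also safe on the same instance).

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

// Thread-compatible: no internal synchronization.
struct PlainTally {
    std::uint64_t n = 0;
    void Add() { ++n; }
};

// Thread-safe: every operation is atomic.
struct AtomicTally {
    std::atomic<std::uint64_t> n{0};
    void Add() { n.fetch_add(1, std::memory_order_relaxed); }
};

// Non-const access to separate instances: safe for thread-compatible types.
inline std::uint64_t SeparateInstances(std::uint64_t per_thread) {
    PlainTally a;
    PlainTally b;
    std::thread t1([&] { for (std::uint64_t i = 0; i < per_thread; ++i) a.Add(); });
    std::thread t2([&] { for (std::uint64_t i = 0; i < per_thread; ++i) b.Add(); });
    t1.join();
    t2.join();
    return a.n + b.n;
}

// Non-const access to the same instance: requires a thread-safe type.
inline std::uint64_t SharedInstance(std::uint64_t per_thread) {
    AtomicTally shared;
    std::thread t1([&] { for (std::uint64_t i = 0; i < per_thread; ++i) shared.Add(); });
    std::thread t2([&] { for (std::uint64_t i = 0; i < per_thread; ++i) shared.Add(); });
    t1.join();
    t2.join();
    return shared.n.load();
}
```

Sharing a single PlainTally between the two threads would be a data race; that is the case the table leaves unmarked for thread-compatible types.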
See also related terminology: