Paul Mackerras
perf supersedes oprofile, perfmon and similar tools for understanding performance.
- perf_event: initial proposal in Dec 2008; developed through 2009 in the tip tree.
- In mainline in 2.6.31.
- Initially supported recent x86 processors and 64-bit POWER.
- Now have hardware support for SH and sparc64.
- Most architectures have at least basic support, as long as you have high-res timers.
Concepts
- Basic abstraction: a counter for a single event, not the whole PMU. Makes it easier for users.
- Types of event: Hardware, counted by the PMU; Software, function call in kernel code; Tracepoint; Hardware data breakpoint.
- Sampling: Record information in a ring buffer. Can record IP, timestamp, addresses, etc.
- Can sample every N counts or N per second.
The kernel now has a variety of interesting events already added to various subsystems.
Counting and sampling are complementary; you can take a sample each time a counter reaches a given number of events, and that sample can carry a broader set of data.
Samples can also be expressed in terms of the number per second you’d like.
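The two ways of asking for samples can be illustrated with a small simulation. This is only a sketch on synthetic data, not how the kernel implements it: the function names and the evenly spaced event stream are assumptions made for illustration.

```python
def sample_every_n(event_times, period):
    """Return the timestamps at which a sample fires when sampling every `period` events."""
    return [t for i, t in enumerate(event_times, start=1) if i % period == 0]


def period_for_target_rate(events_seen, elapsed_seconds, samples_per_second):
    """Estimate the period that would give roughly `samples_per_second` samples."""
    event_rate = events_seen / elapsed_seconds
    return max(1, round(event_rate / samples_per_second))


# Synthetic stream: 10,000 events spread evenly over one second.
event_times = [i / 10_000 for i in range(1, 10_001)]

# "Every N counts": one sample per 1000 events.
fixed = sample_every_n(event_times, period=1000)

# "N per second": derive a period from the observed event rate, then sample with it.
period = period_for_target_rate(len(event_times), 1.0, samples_per_second=100)
by_rate = sample_every_n(event_times, period)

print(len(fixed), period, len(by_rate))  # 10 100 100
```

With a fixed period the number of samples follows the event rate; with a target rate the period has to be adjusted as the event rate changes.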
Events can be per-task or per-cpu; you can also set inheritance so that a counter follows forks and aggregates the child counts back to the parent. Counting can be filtered by CPU mode: user, kernel, hypervisor (woot!).
If the PMU is overloaded by the events we've asked for, the kernel arbitrates with a scheduler, time-slicing the events in an effort to give each of them time on the hardware. This leads to a problem: if you have counters that depend on each other, they may end up being measured in different time slices, and program behaviour can change between slices. Whoops.
The answer is grouping; you can flag those dependencies to force the counters to be scheduled together. You can also pin groups to make sure they're always on the PMU, or get an error if that can't be done.
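A toy model shows why dependent counters want to be in the same group. It is a sketch under stated assumptions: the workload numbers are made up, the PMU is pretended to have room for one event (or one group) per slice, and time is measured in slices. The scaling step, raw count times time enabled over time running, is the usual way a multiplexed count is extrapolated.

```python
def scaled_count(raw, time_enabled, time_running):
    """Extrapolate a multiplexed counter to the whole period it was enabled."""
    return raw * time_enabled / time_running


def miss_ratio(slices, refs_scheduled, misses_scheduled):
    """Count each event only in the slices where it is on the PMU, scale, then divide."""
    n = len(slices)
    ref_slices = [i for i in range(n) if refs_scheduled(i)]
    miss_slices = [i for i in range(n) if misses_scheduled(i)]
    refs_raw = sum(slices[i][0] for i in ref_slices)
    misses_raw = sum(slices[i][1] for i in miss_slices)
    refs = scaled_count(refs_raw, n, len(ref_slices))
    misses = scaled_count(misses_raw, n, len(miss_slices))
    return misses / refs


# Synthetic workload: (cache references, cache misses) per time slice.
# Behaviour alternates between a quiet phase (10% misses) and a busy one (50%).
workload = [(100, 10), (1000, 500)] * 4


def even(i):
    return i % 2 == 0


def odd(i):
    return i % 2 == 1


# Ungrouped: the two counters take turns, so they see different phases.
ungrouped = miss_ratio(workload, refs_scheduled=even, misses_scheduled=odd)

# Grouped: both counters are always on the PMU in the same slices.
grouped = miss_ratio(workload, refs_scheduled=even, misses_scheduled=even)

print(ungrouped, grouped)  # 5.0 0.1
```

The ungrouped result claims five misses per reference, which cannot happen. The grouped result only covers the slices the group ran in, but it is a ratio that actually occurred.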
Optionally records mmaps, exec events, exits, forks, and similar strace-type data.
There’s also another layer of abstraction that lets us express things in terms of generic hardware events, such as CPU cycles, cache references and misses, and branch prediction stats; cache stats can come from the various L1 and internal CPU caches.
This is all obviously architecture dependent.
Software/Tracepoint Events
- Elapsed times; this is a bit of fakery/guesstimation.
- Page faults, context switches and CPU migrations.
Tracepoints are interesting; ftrace has been blown out with bolt-ons; any kernel tracepoint can be used as an event. Integrates the ftrace infrastructure.
Process-level monitoring doesn’t require root privs, although system-wide does.
perf diff can take dumps and compare them for you. perf top gives you, well, top for perf.
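The kind of comparison perf diff makes can be sketched as follows. This is not perf's code or output format: the symbol names and sample counts are hypothetical, and each profile is reduced to a plain mapping from symbol to number of samples.

```python
def overhead_percent(samples):
    """Turn per-symbol sample counts into percentages of the whole profile."""
    total = sum(samples.values())
    return {sym: 100.0 * n / total for sym, n in samples.items()}


def diff_profiles(baseline, new):
    """Return the change in overhead (percentage points) per symbol."""
    base_pct = overhead_percent(baseline)
    new_pct = overhead_percent(new)
    symbols = sorted(set(base_pct) | set(new_pct))
    return {s: new_pct.get(s, 0.0) - base_pct.get(s, 0.0) for s in symbols}


# Hypothetical profiles from two runs of the same program.
before = {"parse_input": 600, "hash_lookup": 300, "memcpy": 100}
after = {"parse_input": 250, "hash_lookup": 150, "memcpy": 100}

deltas = diff_profiles(before, after)
for sym, delta in sorted(deltas.items(), key=lambda kv: kv[1]):
    print(f"{sym:12s} {delta:+6.1f}%")
```

Comparing percentages rather than raw sample counts keeps the two runs comparable even when they collected different numbers of samples.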
Q&A
- No plans for System Z.
- Josh Berkus wants to know if PostgreSQL could utilise it in the same way that they use dtrace probes. “Not very clear on dtrace probes or how it works.” Another attendee thinks that perf events are not really suited to that problem space.