================================
PSI - Pressure Stall Information
================================

:Date: April, 2018
:Author: Johannes Weiner <[email protected]>

When CPU, memory or IO devices are contended, workloads experience
latency spikes, throughput losses, and run the risk of OOM kills.

Without an accurate measure of such contention, users are forced to
either play it safe and under-utilize their hardware resources, or
roll the dice and frequently suffer the disruptions resulting from
excessive overcommit.

The psi feature identifies and quantifies the disruptions caused by
such resource crunches and the time impact they have on complex
workloads or even entire systems.

Having an accurate measure of productivity losses caused by resource
scarcity aids users in sizing workloads to hardware--or provisioning
hardware according to workload demand.

As psi aggregates this information in realtime, systems can be managed
dynamically using techniques such as load shedding, migrating jobs to
other systems or data centers, or strategically pausing or killing low
priority or restartable batch jobs.

This allows maximizing hardware utilization without sacrificing
workload health or risking major disruptions such as OOM kills.

Pressure interface
==================

Pressure information for each resource is exported through the
respective file in /proc/pressure/ -- cpu, memory, and io.

The format for CPU is as such:

some avg10=0.00 avg60=0.00 avg300=0.00 total=0

and for memory and IO:

some avg10=0.00 avg60=0.00 avg300=0.00 total=0
full avg10=0.00 avg60=0.00 avg300=0.00 total=0

The "some" line indicates the share of time in which at least some
tasks are stalled on a given resource.

The "full" line indicates the share of time in which all non-idle
tasks are stalled on a given resource simultaneously. In this state
actual CPU cycles are going to waste, and a workload that spends
extended time in this state is considered to be thrashing. This has
severe impact on performance, and it's useful to distinguish this
situation from a state where some tasks are stalled but the CPU is
still doing productive work. As such, time spent in this subset of the
stall state is tracked separately and exported in the "full" averages.

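As an illustration of the line format above, the following is a minimal
sketch of a parser for a single "some" or "full" line. The struct
psi_line type and the psi_parse_line() helper are hypothetical names
introduced for this example; they are not part of any kernel or libc
interface.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical container for one line of a pressure file. */
struct psi_line {
	char kind[8];			/* "some" or "full" */
	double avg10, avg60, avg300;	/* averages as printed in the file */
	unsigned long long total;	/* cumulative stall time as printed */
};

/* Returns 0 on success, -1 if the line does not match the format. */
static int psi_parse_line(const char *line, struct psi_line *out)
{
	int n = sscanf(line, "%7s avg10=%lf avg60=%lf avg300=%lf total=%llu",
		       out->kind, &out->avg10, &out->avg60, &out->avg300,
		       &out->total);

	if (n != 5)
		return -1;
	if (strcmp(out->kind, "some") && strcmp(out->kind, "full"))
		return -1;
	return 0;
}
```

A caller would typically read /proc/pressure/memory line by line with
fgets() and pass each line to the helper.
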
The ratios are tracked as recent trends over ten, sixty, and three
hundred second windows, which gives insight into short term events as
well as medium and long term trends. The total absolute stall time
(in us) is tracked and exported as well, to allow detection of latency
spikes which wouldn't necessarily make a dent in the time averages, or
to average trends over custom time frames.

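Averaging over a custom time frame can be sketched as follows: sample
the "total" field twice and divide the difference by the wall-clock time
that passed between the samples. The helper name psi_stall_percent() is
hypothetical, and the sketch assumes both "total" and the elapsed time
are given in microseconds.

```c
/*
 * Share of wall-clock time spent stalled between two samples of the
 * "total" field, in percent. All arguments are assumed to be in
 * microseconds. Returns 0.0 for an empty interval or a counter that
 * appears to have gone backwards.
 */
static double psi_stall_percent(unsigned long long total_prev_us,
				unsigned long long total_now_us,
				unsigned long long elapsed_us)
{
	if (elapsed_us == 0 || total_now_us < total_prev_us)
		return 0.0;
	return 100.0 * (double)(total_now_us - total_prev_us) /
	       (double)elapsed_us;
}
```

For example, a total that grows by 150000us across a sampling interval
of 1000000us corresponds to 15% of that interval spent stalled.
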
Monitoring for pressure thresholds
==================================

Users can register triggers and use poll() to be woken up when resource
pressure exceeds certain thresholds.

A trigger describes the maximum cumulative stall time over a specific
time window, e.g. 100ms of total stall time within any 500ms window to
generate a wakeup event.

To register a trigger, the user has to open the psi interface file under
/proc/pressure/ representing the resource to be monitored and write the
desired threshold and time window. The open file descriptor should be
used to wait for trigger events using select(), poll() or epoll().
The following format is used:

<some|full> <stall amount in us> <time window in us>

For example, writing "some 150000 1000000" into /proc/pressure/memory
would add a 150ms threshold for partial memory stall measured within
a 1sec time window. Writing "full 50000 1000000" into /proc/pressure/io
would add a 50ms threshold for full io stall measured within a 1sec
time window.

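The trigger format above can be produced with snprintf(). This is a
minimal sketch; psi_format_trigger() is a hypothetical helper name, and
the resulting string would still have to be written to the opened psi
interface file by the caller.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Format "<some|full> <stall amount in us> <time window in us>" into
 * buf. Returns the string length on success, or -1 if buf is too
 * small or formatting failed.
 */
static int psi_format_trigger(char *buf, size_t len, int full,
			      unsigned long stall_us,
			      unsigned long window_us)
{
	int n = snprintf(buf, len, "%s %lu %lu", full ? "full" : "some",
			 stall_us, window_us);

	if (n < 0 || (size_t)n >= len)
		return -1;
	return n;
}
```

Calling it with full=0, stall_us=150000 and window_us=1000000 yields the
"some 150000 1000000" string used in the example above.
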
Triggers can be set on more than one psi metric, and more than one
trigger for the same psi metric can be specified. However, each trigger
requires a separate file descriptor so that it can be polled separately
from the others. Therefore a separate open() syscall should be made for
each trigger, even when opening the same psi interface file.

Monitors activate only when the system enters a stall state for the
monitored psi metric and deactivate upon exit from the stall state.
While the system is in the stall state, psi signal growth is monitored
at a rate of 10 times per tracking window.

The kernel accepts window sizes ranging from 500ms to 10s, therefore the
min monitoring update interval is 50ms and the max is 1s. The min limit
is set to prevent overly frequent polling. The max limit is chosen as a
high enough number after which monitors are most likely not needed and
psi averages can be used instead.

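The relation between window size and update interval described above
can be sketched as a small userspace check. The macro and function names
are hypothetical and only mirror the limits stated in this document;
they are not taken from kernel headers.

```c
/* Limits as stated in the text above (hypothetical names). */
#define PSI_WINDOW_MIN_US	500000UL	/* 500ms */
#define PSI_WINDOW_MAX_US	10000000UL	/* 10s */
#define PSI_UPDATES_PER_WINDOW	10UL

/*
 * Returns the monitoring update interval in microseconds for a given
 * tracking window, or 0 if the window is outside the accepted range.
 */
static unsigned long psi_update_interval_us(unsigned long window_us)
{
	if (window_us < PSI_WINDOW_MIN_US || window_us > PSI_WINDOW_MAX_US)
		return 0;
	return window_us / PSI_UPDATES_PER_WINDOW;
}
```

Checking the window before writing a trigger lets a program report a
clear error of its own instead of relying on the failed write().
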
When activated, a psi monitor stays active for at least the duration of
one tracking window to avoid repeated activations/deactivations when the
system is bouncing in and out of the stall state.

Notifications to userspace are rate-limited to one per tracking window.

The trigger will de-register when the file descriptor used to define the
trigger is closed.

Userspace monitor usage example
===============================

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <poll.h>
#include <string.h>
#include <unistd.h>

/*
 * Monitor memory partial stall with 1s tracking window size
 * and 150ms threshold.
 */
int main(void) {
	const char trig[] = "some 150000 1000000";
	struct pollfd fds;
	int n;

	fds.fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK);
	if (fds.fd < 0) {
		printf("/proc/pressure/memory open error: %s\n",
			strerror(errno));
		return 1;
	}
	fds.events = POLLPRI;

	/* Register the trigger; the terminating NUL is written as well. */
	if (write(fds.fd, trig, strlen(trig) + 1) < 0) {
		printf("/proc/pressure/memory write error: %s\n",
			strerror(errno));
		return 1;
	}

	printf("waiting for events...\n");
	while (1) {
		/* Block until the trigger fires or an error occurs. */
		n = poll(&fds, 1, -1);
		if (n < 0) {
			printf("poll error: %s\n", strerror(errno));
			return 1;
		}
		if (fds.revents & POLLERR) {
			printf("got POLLERR, event source is gone\n");
			return 0;
		}
		if (fds.revents & POLLPRI) {
			printf("event triggered!\n");
		} else {
			printf("unknown event received: 0x%x\n", fds.revents);
			return 1;
		}
	}

	return 0;
}

Cgroup2 interface
=================

In a system with a CONFIG_CGROUPS=y kernel and the cgroup2 filesystem
mounted, pressure stall information is also tracked for tasks grouped
into cgroups. Each subdirectory in the cgroupfs mountpoint contains
cpu.pressure, memory.pressure, and io.pressure files; the format is
the same as the /proc/pressure/ files.

Per-cgroup psi monitors can be specified and used the same way as
system-wide ones.
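
Since the per-cgroup files follow the <resource>.pressure naming
described above, the path to open for a per-cgroup monitor can be built
as in the sketch below. psi_cgroup_path() is a hypothetical helper, and
the cgroup directory is passed in by the caller because the cgroup2
mountpoint differs between systems.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Build "<cgroup_dir>/<resource>.pressure" into buf, where resource is
 * one of "cpu", "memory" or "io". Returns the path length on success,
 * or -1 if buf is too small or formatting failed.
 */
static int psi_cgroup_path(char *buf, size_t len, const char *cgroup_dir,
			   const char *resource)
{
	int n = snprintf(buf, len, "%s/%s.pressure", cgroup_dir, resource);

	if (n < 0 || (size_t)n >= len)
		return -1;
	return n;
}
```

The resulting path can then be passed to open() in place of
/proc/pressure/memory in the usage example, with the same trigger
string written to it.
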