Intel P-State driver
--------------------

This driver provides an interface to control the P-State selection for the
SandyBridge+ Intel processors.

The following document explains P-States:
http://events.linuxfoundation.org/sites/events/files/slides/LinuxConEurope_2015.pdf
As stated in the document, P-State doesn’t exactly mean a frequency. However, for
the sake of the relationship with cpufreq, P-State and frequency are used
interchangeably.

Understanding the cpufreq core governors and policies is important before
discussing the Intel P-State driver in more detail. Based on the callbacks
a cpufreq driver provides to the cpufreq core, the core supports two types of
drivers:

- with target_index() callback: In this mode, the driver simply provides the
  minimum and maximum frequency limits and an additional interface,
  target_index(), to set the current frequency. The cpufreq subsystem
  has a number of scaling governors ("performance", "powersave", "ondemand",
  etc.). Depending on which governor is in use, the cpufreq core will call for
  transitions to a specific frequency using the target_index() callback.

- with setpolicy() callback: In this mode, the driver does not provide the
  target_index() callback, so the cpufreq core can't request a transition to a
  specific frequency. The driver provides minimum and maximum frequency limits
  and callbacks to set a policy. The policy is referred to as the "scaling
  governor" in cpufreq sysfs. The cpufreq core can request the driver to
  operate in either of two policies: "performance" and "powersave". The driver
  decides which frequency to use based on that policy selection, within the
  minimum and maximum frequency limits.

The Intel P-State driver falls under the latter category: it implements the
setpolicy() callback. The driver decides what P-State to use based on the
policy requested by the cpufreq core. If the processor is capable of
selecting its next P-State internally, the driver offloads this
responsibility to the processor (aka HWP: Hardware P-States). If not, the
driver implements algorithms to select the next P-State.

Since these policies are implemented in the driver, they are not the same as
the cpufreq scaling governor implementations, even if they have the same names
in the cpufreq sysfs (scaling_governor). For example, the "performance" policy
is similar to cpufreq’s "performance" governor, but "powersave" is completely
different from the cpufreq "powersave" governor. The driver's "powersave"
strategy is similar to cpufreq "ondemand", where the requested P-State is
related to the system load.

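As an illustration of where the active policy shows up, the following minimal
sketch reads the scaling_governor file of every CPU from the standard cpufreq
sysfs layout. The function name read_scaling_governors and the cpu_root
parameter are hypothetical names chosen for this example; only the sysfs path
layout is taken from cpufreq itself.

```python
from pathlib import Path


def read_scaling_governors(cpu_root="/sys/devices/system/cpu"):
    """Return {cpu_name: policy_name} for every CPU exposing cpufreq sysfs.

    cpu_root is a parameter only so the function can be pointed at a test
    directory; on a real system the default path is used.
    """
    governors = {}
    pattern = "cpu[0-9]*/cpufreq/scaling_governor"
    for path in sorted(Path(cpu_root).glob(pattern)):
        cpu_name = path.parent.parent.name  # e.g. "cpu0"
        try:
            governors[cpu_name] = path.read_text().strip()
        except OSError:
            # File vanished or is unreadable (e.g. CPU went offline); skip it.
            continue
    return governors


if __name__ == "__main__":
    for cpu, policy in read_scaling_governors().items():
        print(f"{cpu}: {policy}")
```

With the Intel P-State driver loaded, each entry is expected to read either
"performance" or "powersave", since those are the only two policies described
above.
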
Sysfs Interface

In addition to the frequency-controlling interfaces provided by the cpufreq
core, the driver provides its own sysfs files to control the P-State selection.
These files have been added to /sys/devices/system/cpu/intel_pstate/.
Any changes made to these files are applicable to all CPUs (even in a
multi-package system).

    max_perf_pct: Limits the maximum P-State that will be requested by
    the driver, expressed as a percentage of the available performance. The
    available (P-State) performance may be reduced by the no_turbo
    setting described below.

    min_perf_pct: Limits the minimum P-State that will be requested by
    the driver, expressed as a percentage of the max (non-turbo)
    performance level.

    no_turbo: Limits the driver to selecting P-States below the turbo
    frequency range.

    turbo_pct: Displays the percentage of the total performance
    supported by hardware that is in the turbo range. This number
    is independent of whether turbo has been disabled or not.

    num_pstates: Displays the number of P-States that are supported
    by hardware. This number is independent of whether turbo has
    been disabled or not.

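The files listed above can be inspected with a few lines of code. The sketch
below is only an illustration: the helper name read_intel_pstate and its
sysfs_dir parameter are hypothetical, and it assumes each file holds a single
integer, as the descriptions above suggest. Files that are missing or
unreadable are skipped rather than treated as errors.

```python
from pathlib import Path

# File names taken from the list above.
PSTATE_FILES = ("max_perf_pct", "min_perf_pct", "no_turbo",
                "turbo_pct", "num_pstates")


def read_intel_pstate(sysfs_dir="/sys/devices/system/cpu/intel_pstate"):
    """Return {file_name: integer value} for the intel_pstate sysfs files."""
    values = {}
    for name in PSTATE_FILES:
        try:
            values[name] = int((Path(sysfs_dir) / name).read_text().strip())
        except (OSError, ValueError):
            # Not present on this system, not readable, or not an integer.
            continue
    return values


if __name__ == "__main__":
    for name, value in read_intel_pstate().items():
        print(f"{name}: {value}")
```

On a system without the Intel P-State driver the function returns an empty
dictionary. Changing a limit means writing a new value to the corresponding
file, which normally requires root privileges.
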
For example, if a system has these parameters:
    Max 1 core turbo ratio: 0x21 (Max 1 core ratio is the maximum P-State)
    Max non turbo ratio: 0x17
    Minimum ratio: 0x08 (Here the ratio is called max efficiency ratio)

Sysfs will show:
    max_perf_pct:100, which corresponds to 1 core ratio
    min_perf_pct:24, max_efficiency_ratio / max 1 Core ratio
    no_turbo:0, turbo is not disabled
    num_pstates:26 = (max 1 Core ratio - Max Efficiency Ratio + 1)
    turbo_pct:39 = (max 1 core ratio - max non turbo ratio) / num_pstates

Refer to "Intel® 64 and IA-32 Architectures Software Developer’s Manual
Volume 3: System Programming Guide" to understand ratios.

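The arithmetic behind the example values can be reproduced with a short
sketch. The function name and the rounding choices are assumptions made so
that the results match the numbers quoted above (min_perf_pct rounded down,
turbo_pct rounded up); they are not taken from the driver source.

```python
def example_sysfs_values(max_turbo_ratio, max_non_turbo_ratio, min_ratio):
    """Recompute the example sysfs values from the three ratios.

    Rounding directions are assumptions chosen to match the example:
    8/33 = 24.2% is shown as 24, and 10/26 = 38.5% is shown as 39.
    """
    num_pstates = max_turbo_ratio - min_ratio + 1
    min_perf_pct = (min_ratio * 100) // max_turbo_ratio  # round down
    turbo_range = max_turbo_ratio - max_non_turbo_ratio
    turbo_pct = -(-(turbo_range * 100) // num_pstates)   # round up
    return {
        "num_pstates": num_pstates,
        "min_perf_pct": min_perf_pct,
        "turbo_pct": turbo_pct,
    }


if __name__ == "__main__":
    print(example_sysfs_values(0x21, 0x17, 0x08))
```

For the ratios in the example (0x21, 0x17, 0x08) this yields num_pstates = 26,
min_perf_pct = 24 and turbo_pct = 39.
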
cpufreq sysfs for Intel P-State

Since this driver registers with cpufreq, cpufreq sysfs is also presented.
There are some important differences, which need to be considered.

scaling_cur_freq: This displays the real frequency which was used during
the last sample period instead of what is requested. Some other cpufreq
drivers, like acpi-cpufreq, display what is requested (some changes are on the
way to fix this for the acpi-cpufreq driver). The same is true for frequencies
displayed at /proc/cpuinfo.

scaling_governor: This displays the currently active policy. Since each CPU
has its own cpufreq sysfs, it is normally possible to set a scaling governor
for each CPU. This is not possible with Intel P-States, as there is one common
policy for all CPUs. Here, the last requested policy applies to all CPUs. It is
suggested that one use the cpupower utility to change the policy for all CPUs
at the same time.

scaling_setspeed: This attribute can never be used with Intel P-State.

scaling_max_freq/scaling_min_freq: This interface can be used similarly to
the max_perf_pct/min_perf_pct of Intel P-State sysfs. However, since
frequencies are converted to the nearest possible P-State, this is prone to
rounding errors. This method is not preferred for limiting performance.

affected_cpus: Not used
related_cpus: Not used

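The rounding issue noted for scaling_max_freq/scaling_min_freq can be
illustrated with a simplified model. The sketch assumes that each P-State
ratio corresponds to a multiple of a 100 MHz bus clock and that a requested
frequency maps to the nearest ratio; both the constant and the helper names
are assumptions for illustration, and the driver's actual conversion may
differ.

```python
BUS_CLOCK_KHZ = 100_000  # assumption: 100 MHz per P-State ratio step


def nearest_pstate(freq_khz, min_ratio, max_ratio):
    """Map a requested frequency to the nearest ratio within the limits."""
    ratio = (freq_khz + BUS_CLOCK_KHZ // 2) // BUS_CLOCK_KHZ
    return max(min_ratio, min(max_ratio, ratio))


def effective_freq_khz(freq_khz, min_ratio, max_ratio):
    """Frequency that the selected ratio represents in this model."""
    return nearest_pstate(freq_khz, min_ratio, max_ratio) * BUS_CLOCK_KHZ


if __name__ == "__main__":
    for requested in (2_340_000, 2_360_000):
        print(requested, "->", effective_freq_khz(requested, 0x08, 0x21))
```

In this model, requests of 2340000 kHz and 2360000 kHz land on different
P-States (2300000 kHz and 2400000 kHz), which is the kind of rounding the
percentage-based max_perf_pct/min_perf_pct interface is meant to avoid.
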
For contemporary Intel processors, the frequency is controlled by the
processor itself and the P-State exposed to software is related to
performance levels. The idea that frequency can be set to a single
frequency is fictional for Intel Core processors. Even if the scaling
driver selects a single P-State, the actual frequency the processor
will run at is selected by the processor itself.

Tuning Intel P-State driver

When HWP mode is not used, debugfs files have also been added to allow the
tuning of the internal governor algorithm. These files are located at
/sys/kernel/debug/pstate_snb/. The algorithm uses a PID (Proportional
Integral Derivative) controller. The PID tunable parameters are:
    deadband
    d_gain_pct
    i_gain_pct
    p_gain_pct
    sample_rate_ms
    setpoint

To adjust these parameters, some understanding of the driver implementation is
necessary. There are some tweaks described here, but be very careful. Adjusting
them requires an expert-level understanding of the relationship between power
and performance. These parameters are only useful when the "powersave" policy
is active.

- To make the system more responsive to load changes, sample_rate_ms can
  be adjusted (current default is 10ms).
- To make the system use higher performance, even if the load is lower,
  setpoint can be adjusted to a lower number. This will also lead to a faster
  ramp-up time to reach the maximum P-State.

If there are no derivative and integral coefficients, the next P-State will be
equal to:
    current P-State - ((setpoint - current cpu load) * p_gain_pct)

For example, if the current PID parameters are (these are the defaults for
Core processors like SandyBridge):
    deadband = 0
    d_gain_pct = 0
    i_gain_pct = 0
    p_gain_pct = 20
    sample_rate_ms = 10
    setpoint = 97

If the current P-State = 0x08 and current load = 100, this will result in the
next P-State = 0x08 - ((97 - 100) * 0.2) = 8.6 (rounded to 9). Here the P-State
goes up by only 1. If during the next sample interval the current load doesn't
change and is still 100, then the P-State goes up by one again. This process
will continue as long as the load is more than the setpoint until the maximum
P-State is reached.

For the same load at setpoint = 60, this will result in the next P-State
= 0x08 - ((60 - 100) * 0.2) = 16

So by changing the setpoint from 97 to 60, the next P-State increases from 9
to 16. This makes the processor execute at a higher P-State for the same CPU
load. If the load continues to be more than the setpoint during the next
sample intervals, then the P-State will go up again until the maximum P-State
is reached. The ramp-up time to reach the maximum P-State will be much shorter
when the setpoint is 60 than when it is 97.

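The proportional-only calculation above can be written out as a small model.
This is a sketch of the formula as stated in this document, not the driver's
code: the function names, the integer round-to-nearest step, and the clamping
to a minimum and maximum P-State (0x08 and 0x21, borrowed from the earlier
sysfs example) are assumptions for illustration.

```python
def next_pstate(current, load, setpoint, p_gain_pct,
                min_pstate=0x08, max_pstate=0x21):
    """current - ((setpoint - load) * p_gain_pct / 100), rounded and clamped."""
    step_x100 = -(setpoint - load) * p_gain_pct  # step scaled by 100
    step = (step_x100 + 50) // 100               # round to nearest integer
    return max(min_pstate, min(max_pstate, current + step))


def samples_to_reach_max(start, load, setpoint, p_gain_pct,
                         max_pstate=0x21, limit=1000):
    """Count sample intervals until max_pstate is reached under constant load."""
    pstate, samples = start, 0
    while pstate < max_pstate and samples < limit:
        pstate = next_pstate(pstate, load, setpoint, p_gain_pct,
                             max_pstate=max_pstate)
        samples += 1
    return samples


if __name__ == "__main__":
    print(next_pstate(0x08, 100, 97, 20))           # setpoint = 97
    print(next_pstate(0x08, 100, 60, 20))           # setpoint = 60
    print(samples_to_reach_max(0x08, 100, 97, 20))
    print(samples_to_reach_max(0x08, 100, 60, 20))
```

Under these assumptions, a constant load of 100 starting from P-State 0x08
reaches 0x21 after 25 sample intervals with setpoint = 97 and after 4 with
setpoint = 60. At sample_rate_ms = 10 that is roughly 250 ms versus 40 ms in
this simplified model.
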
Debugging Intel P-State driver

Event tracing

To debug P-State transitions, the Linux event tracing interface can be used.
There are two specific events which can be enabled (provided the kernel
configs related to event tracing are enabled).

    # cd /sys/kernel/debug/tracing/
    # echo 1 > events/power/pstate_sample/enable
    # echo 1 > events/power/cpu_frequency/enable
    # cat trace
    gnome-terminal--4510 [001] ..s. 1177.680733: pstate_sample: core_busy=107
        scaled=94 from=26 to=26 mperf=1143818 aperf=1230607 tsc=29838618
        freq=2474476
    cat-5235 [002] ..s. 1177.681723: cpu_frequency: state=2900000 cpu_id=2

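When many pstate_sample events are collected, it can help to turn each event
into named fields. The sketch below parses the key=value pairs of one event
that has been joined back onto a single line; the function name is
hypothetical and the field names are simply whatever appears in the trace.

```python
import re

# The pstate_sample event shown above, joined onto one line.
SAMPLE = ("gnome-terminal--4510 [001] ..s. 1177.680733: pstate_sample: "
          "core_busy=107 scaled=94 from=26 to=26 mperf=1143818 "
          "aperf=1230607 tsc=29838618 freq=2474476")


def parse_pstate_sample(line):
    """Return the key=value fields of a pstate_sample event as integers.

    Returns None for lines that are not pstate_sample events.
    """
    marker = "pstate_sample:"
    if marker not in line:
        return None
    fields = line.split(marker, 1)[1]
    return {key: int(value) for key, value in re.findall(r"(\w+)=(\d+)", fields)}


if __name__ == "__main__":
    sample = parse_pstate_sample(SAMPLE)
    print(sample)
    # Observation about this one sample only: 100 * aperf / mperf,
    # truncated, equals the reported core_busy value.
    print(sample["aperf"] * 100 // sample["mperf"], sample["core_busy"])
```

The "from" and "to" fields show the P-State before and after the sample, so
filtering for events where they differ is one way to spot transitions.
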
Using ftrace

If function-level tracing is required, the Linux ftrace interface can be used.
For example, if we want to check how often a function to set a P-State is
called, we can set the ftrace filter to intel_pstate_set_pstate.

    # cd /sys/kernel/debug/tracing/
    # cat available_filter_functions | grep -i pstate
    intel_pstate_set_pstate
    intel_pstate_cpu_init
    ...

    # echo intel_pstate_set_pstate > set_ftrace_filter
    # echo function > current_tracer
    # cat trace | head -15
    # tracer: function
    #
    # entries-in-buffer/entries-written: 80/80   #P:4
    #
    #                              _-----=> irqs-off
    #                             / _----=> need-resched
    #                            | / _---=> hardirq/softirq
    #                            || / _--=> preempt-depth
    #                            ||| /     delay
    #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
    #              | |       |   ||||       |         |
                Xorg-3129  [000] ..s.  2537.644844: intel_pstate_set_pstate <-intel_pstate_timer_func
     gnome-terminal--4510  [002] ..s.  2537.649844: intel_pstate_set_pstate <-intel_pstate_timer_func
         gnome-shell-3409  [001] ..s.  2537.650850: intel_pstate_set_pstate <-intel_pstate_timer_func
              <idle>-0     [000] ..s.  2537.654843: intel_pstate_set_pstate <-intel_pstate_timer_func

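To answer the "how often is it called" question from such output, the trace
lines can be counted per CPU. The following is a minimal sketch that assumes
the line layout shown above (TASK-PID, CPU in brackets, flags, timestamp,
function); the regular expression and the function name calls_per_cpu are
written for this example only.

```python
import re
from collections import Counter

# TASK-PID  [CPU]  flags  TIMESTAMP:  FUNCTION
TRACE_LINE = re.compile(
    r"^\s*(?P<task>.+)-(?P<pid>\d+)\s+\[(?P<cpu>\d+)\]\s+\S+\s+"
    r"(?P<ts>\d+\.\d+):\s+(?P<func>\w+)"
)


def calls_per_cpu(trace_text, function="intel_pstate_set_pstate"):
    """Count how many times `function` appears in the trace, per CPU number."""
    counts = Counter()
    for line in trace_text.splitlines():
        if line.lstrip().startswith("#"):
            continue  # header line
        match = TRACE_LINE.match(line)
        if match and match.group("func") == function:
            counts[int(match.group("cpu"))] += 1
    return counts


if __name__ == "__main__":
    example = """\
# tracer: function
Xorg-3129 [000] ..s. 2537.644844: intel_pstate_set_pstate <-intel_pstate_timer_func
gnome-terminal--4510 [002] ..s. 2537.649844: intel_pstate_set_pstate <-intel_pstate_timer_func
gnome-shell-3409 [001] ..s. 2537.650850: intel_pstate_set_pstate <-intel_pstate_timer_func
<idle>-0 [000] ..s. 2537.654843: intel_pstate_set_pstate <-intel_pstate_timer_func
"""
    print(dict(calls_per_cpu(example)))
```

The same function can be fed the full contents of the trace file; dividing
each count by the traced time span then gives an approximate call rate per
CPU.
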