- Real-Time group scheduling
- --------------------------
- CONTENTS
- ========
- 0. WARNING
- 1. Overview
- 1.1 The problem
- 1.2 The solution
- 2. The interface
- 2.1 System-wide settings
- 2.2 Default behaviour
- 2.3 Basis for grouping tasks
- 3. Future plans
- 0. WARNING
- ==========
- Fiddling with these settings can result in an unstable system. The knobs are
- root only and assume root knows what he is doing.
- Most notable:
- * very small values in sched_rt_period_us can result in an unstable
- system when the period is smaller than either the available hrtimer
- resolution, or the time it takes to handle the budget refresh itself.
- * very small values in sched_rt_runtime_us can result in an unstable
- system when the runtime is so small the system has difficulty making
- forward progress (NOTE: the migration thread and kstopmachine both
- are real-time processes).
- 1. Overview
- ===========
- 1.1 The problem
- ---------------
- Realtime scheduling is all about determinism: a group has to be able to rely on
- the amount of bandwidth (e.g. CPU time) being constant. In order to schedule
- multiple groups of realtime tasks, each group must be assigned a fixed portion
- of the CPU time available. Without a minimum guarantee a realtime group can
- obviously fall short. A fuzzy upper limit is of no use since it cannot be
- relied upon. That leaves us with just the single fixed portion.
- 1.2 The solution
- ----------------
- CPU time is divided by means of specifying how much time can be spent running
- in a given period. We allocate this "run time" to each realtime group, and
- the other realtime groups will not be permitted to use it.
- Any time not allocated to a realtime group will be used to run normal priority
- tasks (SCHED_OTHER). Any allocated run time not used will also be picked up by
- SCHED_OTHER.
- Let's consider an example: a fixed frame rate realtime renderer must deliver 25
- frames a second, which yields a period of 0.04s per frame. Now say it will also
- have to play some music and respond to input, leaving it with around 80% CPU
- time dedicated for the graphics. We can then give this group a run time of 0.8
- * 0.04s = 0.032s.
- This way the graphics group will have a 0.04s period with a 0.032s run time
- limit. Now if the audio thread needs to refill the DMA buffer every 0.005s, but
- needs only about 3% CPU time to do so, it can make do with a run time of
- 0.03 * 0.005s = 0.00015s. So this group can be scheduled with a period of
- 0.005s and a run time of 0.00015s.
- The remaining CPU time will be used for user input and other tasks. Because
- realtime tasks have explicitly allocated the CPU time they need to perform
- their tasks, buffer underruns in the graphics or audio can be eliminated.
- NOTE: the above example is not fully implemented yet. We still
- lack an EDF scheduler to make non-uniform periods usable.
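The arithmetic in the example above can be reproduced with a small sketch.
The helper name run_time_us and its parameters are made up for illustration
and are not part of any kernel interface; values are in microseconds to match
the units used by the settings described later.

```python
def run_time_us(period_us, cpu_share):
    """Run time (in microseconds) for a group that needs `cpu_share`
    (a fraction between 0.0 and 1.0) of the CPU in each period."""
    if not 0.0 <= cpu_share <= 1.0:
        raise ValueError("cpu_share must be between 0.0 and 1.0")
    return round(period_us * cpu_share)


# Graphics group: 25 frames/s gives a 0.04s period, 80% of it for rendering.
graphics_us = run_time_us(40_000, 0.80)
# Audio group: DMA buffer refill every 0.005s, needing about 3% CPU time.
audio_us = run_time_us(5_000, 0.03)

print(graphics_us, audio_us)
```

This gives 32000us (0.032s) for the graphics group and 150us (0.00015s) for
the audio group, matching the figures in the text.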
- 2. The Interface
- ================
- 2.1 System-wide settings
- ------------------------
- The system-wide settings are configured under the /proc virtual file system:
- /proc/sys/kernel/sched_rt_period_us:
- The scheduling period that is equivalent to 100% CPU bandwidth
- /proc/sys/kernel/sched_rt_runtime_us:
- A global limit on how much time realtime scheduling may use. Even without
- CONFIG_RT_GROUP_SCHED enabled, this will limit time reserved to realtime
- processes. With CONFIG_RT_GROUP_SCHED it signifies the total bandwidth
- available to all realtime groups.
- * Time is specified in us because the interface is s32. This gives an
- operating range from 1us to about 35 minutes.
- * sched_rt_period_us takes values from 1 to INT_MAX.
- * sched_rt_runtime_us takes values from -1 to (INT_MAX - 1).
- * A run time of -1 specifies runtime == period, i.e. no limit.
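As a rough illustration of the rules listed above, the following sketch checks
a period/runtime pair against the documented ranges and reports the resulting
realtime bandwidth. The function names are hypothetical, and only the ranges
given in this section are checked; any further validation the kernel may
perform is not modelled here.

```python
INT_MAX = 2**31 - 1  # the interface is s32


def read_setting(name):
    """Read one of the settings from /proc (Linux only), e.g.
    read_setting("sched_rt_period_us")."""
    with open("/proc/sys/kernel/" + name) as f:
        return int(f.read())


def rt_bandwidth(period_us, runtime_us):
    """Return the fraction of each period available to realtime tasks."""
    if not 1 <= period_us <= INT_MAX:
        raise ValueError("sched_rt_period_us must be in 1..INT_MAX")
    if not -1 <= runtime_us <= INT_MAX - 1:
        raise ValueError("sched_rt_runtime_us must be in -1..(INT_MAX - 1)")
    if runtime_us == -1:
        return 1.0  # runtime == period, i.e. no limit
    return runtime_us / period_us


print(rt_bandwidth(1_000_000, 950_000))
print(rt_bandwidth(1_000_000, -1))
```

On a Linux system the two arguments could be obtained with
read_setting("sched_rt_period_us") and read_setting("sched_rt_runtime_us").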
- 2.2 Default behaviour
- ---------------------
- The default values are sched_rt_period_us = 1000000 (1s) and
- sched_rt_runtime_us = 950000 (0.95s). This leaves 0.05s per period to be used
- by SCHED_OTHER (non-RT tasks). These defaults were chosen so that a runaway
- realtime task will not lock up the machine but will leave a little time to
- recover it. Setting the runtime to -1 restores the old behaviour.
- By default all bandwidth is assigned to the root group and new groups get the
- period from /proc/sys/kernel/sched_rt_period_us and a run time of 0. If you
- want to assign bandwidth to another group, reduce the root group's bandwidth
- and assign some or all of the difference to another group.
- Realtime group scheduling means you have to assign a portion of total CPU
- bandwidth to the group before it will accept realtime tasks. Therefore you will
- not be able to run realtime tasks as any user other than root until you have
- done that, even if the user has the rights to run processes with realtime
- priority!
- 2.3 Basis for grouping tasks
- ----------------------------
- Enabling CONFIG_RT_GROUP_SCHED lets you explicitly allocate real
- CPU bandwidth to task groups.
- This uses the cgroup virtual file system and "<cgroup>/cpu.rt_runtime_us"
- to control the CPU time reserved for each control group.
- For more information on working with control groups, you should read
- Documentation/cgroup-v1/cgroups.txt as well.
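Handing some run time from one group to another, as described under "Default
behaviour", might look roughly like the sketch below. The function name
move_rt_runtime is hypothetical, the group directories depend on where the cpu
controller is mounted on a given system, and the sketch assumes both files
hold plain non-negative integers. Only the file name cpu.rt_runtime_us is
taken from the text above.

```python
from pathlib import Path

RUNTIME_FILE = "cpu.rt_runtime_us"


def move_rt_runtime(from_group, to_group, amount_us):
    """Move `amount_us` of realtime run time from one control group
    directory to another (both given as directory paths)."""
    src = Path(from_group) / RUNTIME_FILE
    dst = Path(to_group) / RUNTIME_FILE
    src_runtime = int(src.read_text())
    dst_runtime = int(dst.read_text())
    if not 0 <= amount_us <= src_runtime:
        raise ValueError("cannot move more run time than the donor has")
    # Shrink the donor first so the sum of run times never grows
    # beyond what was allocated before the call.
    src.write_text(str(src_runtime - amount_us))
    dst.write_text(str(dst_runtime + amount_us))
```

A call would take two group directories and an amount in microseconds, for
example move_rt_runtime(root_dir, child_dir, 100000), where root_dir and
child_dir are placeholders for real cgroup directories.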
- Group settings are checked against the following limits in order to keep the
- configuration schedulable:
- \Sum_{i} runtime_{i} / global_period <= global_runtime / global_period
- For now, this can be simplified to just the following (but see Future plans):
- \Sum_{i} runtime_{i} <= global_runtime
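The simplified limit can be expressed as a one-line check. This is only a
sketch of the inequality above, with a made-up function name; it assumes all
groups share the global period.

```python
def is_schedulable(group_runtimes_us, global_runtime_us):
    """True if the run times of all groups fit within the global run time
    (all groups are assumed to use the same, global period)."""
    return sum(group_runtimes_us) <= global_runtime_us


# With the default global run time of 950000us:
print(is_schedulable([500_000, 400_000], 950_000))  # fits
print(is_schedulable([500_000, 500_000], 950_000))  # over-committed
```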
- 3. Future plans
- ===============
- There is work in progress to make the scheduling period for each group
- ("<cgroup>/cpu.rt_period_us") configurable as well.
- The constraint on the period is that a subgroup must have a smaller or
- equal period to its parent. But realistically it's not very useful _yet_
- as it's prone to starvation without deadline scheduling.
- Consider two sibling groups A and B; both have 50% bandwidth, but A's
- period is twice the length of B's.
- * group A: period=100000us, runtime=50000us
- - this runs for 0.05s once every 0.1s
- * group B: period= 50000us, runtime=25000us
- - this runs for 0.025s twice every 0.1s (or once every 0.05 sec).
- This means that currently a while (1) loop in A will run for the full period of
- B and can starve B's tasks (assuming they are of lower priority) for a whole
- period.
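The starvation can be seen in a toy simulation. This is a simplified model,
not the kernel's algorithm: time advances in 1000us ticks, every group always
has work to do, the group listed first has the higher priority, and each
group's budget is refilled at the start of its period. The run times assume
50% bandwidth for both groups (50000us for A, 25000us for B).

```python
def simulate(groups, horizon_us, tick_us=1000):
    """groups: list of (name, period_us, runtime_us), highest priority
    first. Returns {name: [time run in each of its periods]}."""
    budget = {}
    usage = {name: [] for name, _, _ in groups}
    for now in range(0, horizon_us, tick_us):
        # Refill budgets at the start of each group's period.
        for name, period, runtime in groups:
            if now % period == 0:
                budget[name] = runtime
                usage[name].append(0)
        # Run the highest-priority group that still has budget left.
        for name, _, _ in groups:
            if budget[name] >= tick_us:
                budget[name] -= tick_us
                usage[name][-1] += tick_us
                break
    return usage


usage = simulate([("A", 100_000, 50_000), ("B", 50_000, 25_000)], 100_000)
print(usage)
```

In this model A consumes the whole of B's first period, so B gets no time at
all in it and only runs in its second period.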
- The next project will be SCHED_EDF (Earliest Deadline First scheduling) to bring
- full deadline scheduling to the Linux kernel. Deadline scheduling the above
- groups and treating end of the period as a deadline will ensure that they both
- get their allocated time.
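A toy model illustrates the idea. As a simplification, time advances in 1000us
ticks, every group always has work to do, and the end of a group's current
period is treated as its deadline; the group with the earliest deadline that
still has budget runs. The run times assume 50% bandwidth for both groups
(50000us for A, 25000us for B). This is an illustrative sketch, not a
description of any actual SCHED_EDF implementation.

```python
def simulate_edf(groups, horizon_us, tick_us=1000):
    """groups: list of (name, period_us, runtime_us).
    Returns {name: [time run in each of its periods]}."""
    budget, deadline = {}, {}
    usage = {name: [] for name, _, _ in groups}
    for now in range(0, horizon_us, tick_us):
        # A new period refills the budget and sets a new deadline.
        for name, period, runtime in groups:
            if now % period == 0:
                budget[name] = runtime
                deadline[name] = now + period
                usage[name].append(0)
        ready = [name for name, _, _ in groups if budget[name] >= tick_us]
        if ready:
            # Earliest deadline first; ties go to the group listed first.
            chosen = min(ready, key=lambda name: deadline[name])
            budget[chosen] -= tick_us
            usage[chosen][-1] += tick_us
    return usage


edf_usage = simulate_edf(
    [("A", 100_000, 50_000), ("B", 50_000, 25_000)], 100_000)
print(edf_usage)
```

In this model B receives its full 25000us in each of its two periods and A
still receives its 50000us within its single period.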
- Implementing SCHED_EDF might take a while to complete. Priority Inheritance is
- the biggest challenge as the current Linux PI infrastructure is geared towards
- the limited static priority levels 0-99. With deadline scheduling you need to
- do deadline inheritance (since priority is inversely proportional to the
- deadline delta (deadline - now)).
- This means the whole PI machinery will have to be reworked - and that is one of
- the most complex pieces of code we have.
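The notion of deadline inheritance can be sketched in a few lines. This is a
conceptual illustration only, with hypothetical names, and says nothing about
how the kernel's PI code would actually be reworked: a task holding a lock
temporarily runs with the earliest deadline among itself and the tasks waiting
for that lock.

```python
def effective_deadline(own_deadline_us, waiter_deadlines_us):
    """Deadline a lock holder should run with: its own deadline or the
    earliest deadline of any task blocked on the lock, whichever is
    sooner."""
    return min([own_deadline_us, *waiter_deadlines_us])


# A holder with a late deadline is boosted by an urgent waiter.
print(effective_deadline(100_000, [50_000, 80_000]))
# With no waiters the holder keeps its own deadline.
print(effective_deadline(100_000, []))
```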