The PMCSched framework: Scheduling algorithms made easy
PMCSched is an open-source framework designed to simplify the creation and testing of novel scheduling and resource-management strategies. This webpage introduces the framework, outlines its key abstractions, and provides a tutorial on its usage.
PMCSched was born as a continuation of PMCTrack, an OS-oriented performance monitoring tool for Linux. With the PMCSched framework, we take PMCTrack’s potential one step further by easing scheduling development. In particular, PMCSched was designed to facilitate the development of scheduling or resource management algorithms in the operating system (kernel) space, while also allowing for cooperation with user space runtimes. Unlike other existing frameworks that require patching the Linux kernel to function, PMCSched makes it possible to incorporate new scheduling-related OS-level support in Linux via a kernel module (creating a PMCSched plugin) that can be loaded in unmodified (vanilla) kernels, making its adoption easier in production systems.
PMCSched leverages PMCTrack’s APIs for hardware performance monitoring (hardware PMCs) and cache partitioning. In addition, PMCSched provides key data structures, such as pmcsched_thread_data_t for thread representation, group_app_t and sched_app_t for tracking applications at different levels, and sched_thread_group_t for managing scheduling decisions per core group, enabling scalable implementations. The framework also allows for efficient communication between applications and the OS kernel via shared memory, exposed through /proc/pmc/schedctl. Inspired by Solaris’ schedctl(), PMCSched enables direct access to scheduling-related data using mmap(), allowing loadable kernel modules to implement optimizations without additional system calls. This capability helped us implement cache-partitioning policies and interactions between user-space runtimes and the OS kernel.
Building a PMCSched plugin boils down to instantiating an interface of scheduling operations and implementing the corresponding functions in a separate .c file. This way, all plugins adhere to a “contract”, that is, a predefined set of requirements, which is a common practice in kernel-module development. This interface is represented by the sched_ops_t structure, defined as follows:
typedef struct sched_ops {
    char* description;
    sched_policy_mm_t policy;
    unsigned long flags;
    struct list_head link_schedulers;
    pmcsched_counter_config_t* counter_config;
    /* Callbacks */
    int  (*on_fork_thread) (pmcsched_thread_data_t* t, unsigned char is_new_app);
    void (*on_exec_thread) (pmon_prof_t* prof);
    void (*on_active_thread) (pmcsched_thread_data_t* t);
    void (*on_inactive_thread) (pmcsched_thread_data_t* t);
    void (*on_exit_thread) (pmcsched_thread_data_t* t);
    void (*on_free_thread) (pmcsched_thread_data_t* t, unsigned char is_last_thread);
    void (*on_migrate_thread) (pmcsched_thread_data_t* t, int prev_cpu, int new_cpu);
    int  (*on_read_plugin) (char *aux);
    int  (*on_write_plugin) (char *line);
    void (*sched_timer_periodic) (void);
    void (*sched_kthread_periodic) (sized_list_t* migration_list);
    int  (*on_new_sample) (pmon_prof_t* prof, int cpu, pmc_sample_t* sample,
                           int flags, void* data);
} sched_ops_t;
The structure sched_ops_t consists of a set of fields and callbacks. The purpose of these callbacks is as follows:
on_fork_thread: Called when a thread invokes the fork() system call. It receives a pmcsched_thread_data_t parameter with thread information and an additional boolean parameter, is_new_app, to distinguish between newly created processes and existing process threads. This is an ideal location for plugins to initialize per-thread metrics.
on_exec_thread: Invoked when a thread calls the exec() system call.
on_active_thread: Executes when a thread becomes runnable, receiving the thread’s PMCSched descriptor as a parameter. Similarly, on_inactive_thread is invoked when the thread blocks, sleeps, or terminates.
on_exit_thread: Triggered when a thread terminates execution.
on_free_thread: Executes when the kernel frees the memory associated with a thread’s task structure. If the current thread is the last in its process, the is_last_thread parameter is set to 1, signaling plugins to free process-wide data structures.
on_migrate_thread: Called when a thread migrates from one core to another, enabling tracking of thread movement for load balancing.
on_write_plugin: PMCSched exposes the /proc/pmc/sched special file for framework configuration. This callback allows plugins to expose configurable parameters, enabling users to update plugin-specific settings dynamically.
on_read_plugin: Invoked when the read() system call is used on /proc/pmc/sched, allowing plugins to include relevant information in the output.
sched_timer_periodic: Allows plugins to perform periodic operations per core group, such as recalculating scheduling metrics.
sched_kthread_periodic: Called periodically by a kernel thread (kthread), allowing the execution of blocking operations such as thread migrations.
on_new_sample: Defined by plugins that use PMCs to gather scheduling-relevant statistics per thread. Invoked when PMCTrack collects new samples, allowing plugins to compute high-level metrics from hardware event data.
This structure provides a flexible interface for implementing scheduling and resource-management strategies in PMCSched. The remaining fields in sched_ops_t serve the following purposes:
policy: Stores a constant enumerated value that uniquely identifies the scheduling policy.
description: Contains a human-readable description of the scheduling policy, displayed when reading the /proc/pmc/sched file.
flags: Determines whether the plugin relies on PMCSched to handle locking synchronization (PMCSCHED_CPUGROUP_LOCK) or manages the locks independently (PMCSCHED_CUSTOM_LOCK). Plugins choosing the latter must handle their own synchronization, such as managing lists of active/inactive threads and avoiding race conditions, but gain finer control over locking mechanisms.
link_schedulers: Allows the plugin’s descriptor to be inserted into the doubly-linked list of plugins maintained by PMCSched. Many Linux kernel data structures include list_head fields for integration into generic doubly-linked lists.
counter_config: Defines the hardware performance counter configuration for PMC events gathered on a per-thread basis. This configuration follows PMCTrack’s raw event format. Additionally, it specifies the high-level metrics the plugin needs to compute from the gathered event values. PMCTrack provides an API to automate metric calculations and expose virtual counters to its components. If the plugin does not utilize performance monitoring counters, counter_config must be set to NULL.
To illustrate the process of plugin creation, let us walk through a simple example and its associated code. First, we must implement the predefined sched_ops_t interface (described in the previous section) in a separate .c source file; to that end, we create a new file named example_plugin.c. As explained, its functions will handle particular scheduling events. The minimal required callbacks, along with a policy ID, optional flags, and a string description, are defined as follows:
sched_ops_t thesis_plugin = {
    .policy = SCHED_EXAMPLE,
    .description = "Example plugin",
    .flags = PMCSCHED_CPUGROUP_LOCK,
    .sched_timer_periodic = sched_timer_periodic_example,
    .sched_kthread_periodic = sched_kthread_example,
    .on_exec_thread = on_exec_thread_example,
    .on_active_thread = on_active_thread_example,
    .on_inactive_thread = on_inactive_thread_example,
    .on_fork_thread = on_fork_thread_example,
    .on_exit_thread = on_exit_thread_example,
    .on_migrate_thread = on_migrate_thread_example,
};
The following code shows a possible implementation of on_active_thread_example in which the core logic runs only on even-numbered invocations. For this purpose, it uses an atomic counter to track invocations. On every invocation, it unconditionally adds the task to the list of active threads of both its associated application and the core group, and checks whether the task belongs to a newly activated application. If the DEBUG option is enabled, the function outputs relevant information to the kernel trace buffer, including the current invocation count. Conditional behavior based on invocation frequency can be useful for scheduling strategies, such as periodic task migrations or load-balancing operations, that need not run on every call.
#define DPRINTK(fmt, args...)                       \
    do { if (IS_ENABLED(DEBUG))                     \
            trace_printk(fmt, ##args); } while (0)

static atomic_t invocation_count = ATOMIC_INIT(0);

static void on_active_thread_example(pmcsched_thread_data_t *t)
{
    sched_thread_group_t *cur_group = get_cur_group_sched();
    app_t_pmcsched *app =
        get_group_app_cpu(t, cur_group->cpu_group->group_id);
    int count = atomic_inc_return(&invocation_count);

    t->cur_group = cur_group;
    t->cmt_data = &app->app_cache.app_cmt_data;

    insert_sized_list_tail(&app->app_active_threads, t);
    insert_sized_list_tail(&cur_group->active_threads, t);

    if (sized_list_length(&app->app_active_threads) == 1) {
        insert_sized_list_tail(&cur_group->active_apps, app);
        DPRINTK("App active (invocation %d)\n", count);
    }
    /* Else, a thread of a multi-threaded app activated */

    if (count % 2 == 0) {
        /* Some periodic action (core logic) ... */
    }

    /* Prevent the invocation counter from overflowing */
    if (count >= (1 << 30))
        atomic_set(&invocation_count, 0);
}
Registering our new plugin in PMCSched requires declaring the plugin’s descriptor in the framework’s main header file, pmcsched.h. Since we implement the functions in a separate file (example_plugin.c), the descriptor must be declared as extern:
extern struct sched_ops thesis_plugin;
Two additional changes in the header file are necessary to finish registering our SCHED_EXAMPLE plugin: first, in the enumeration of scheduling policies, and second, in the array of available schedulers. We start by adding the plugin’s ID to the enumeration of available plugins:
/* Supported scheduling policies */
typedef enum {
    /* Examples of previous scheduling plugins */
    SCHED_DUMMY_MM = 0,
    SCHED_GROUP_MM,
    SCHED_BUSYBCS_MM,
    /* New plugin */
    SCHED_EXAMPLE,
    NUM_SCHEDULERS
} sched_policy_mm_t;
We now include the plugin’s descriptor in the array of available plugins:
static __attribute__ ((unused)) struct sched_ops*
available_schedulers[NUM_SCHEDULERS] = {
    /* Examples of previous scheduling plugins */
    &dummy_plugin,
    &group_plugin,
    &busybcs_plugin,
    /* Our new plugin */
    &thesis_plugin,
};
The last step is to include the source file of our new plugin in the list of .c files to be compiled, which is found in the architecture-specific Makefile of PMCTrack’s kernel module. For instance, to target Intel processors, the example plugin’s object file (example_plugin.o) is added as shown below:
MODULE_NAME=mchw_intel_core
obj-m += $(MODULE_NAME).o
PMCSCHED-objs= pmcsched.o dummy_plugin.o group_plugin.o \
               busy_plugin.o example_plugin.o
Once the PMCTrack kernel module is loaded into the system (see PMCTrack’s official documentation), the scheduling plugins can be selected and configured by writing to special files in Linux’s procfs. To do so, we first activate PMCSched via the /proc/pmc/mm_manager file, managed by PMCTrack:
# Check PMCSched's monitoring module ID (platform specific)
$ cat /proc/pmc/mm_manager
[*] 0 - This is just a proof of concept
[ ] 1 - IPC sampling SF estimation module
[ ] 2 - PMCSched
[ ] 3 - AMD QoS extensions (monitoring and allocation)

## Activate PMCSched
$ echo 'activate 2' > /proc/pmc/mm_manager

## Make sure that PMCSched has been activated
$ cat /proc/pmc/mm_manager
[ ] 0 - This is just a proof of concept
[ ] 1 - IPC sampling SF estimation module
[*] 2 - PMCSched
[ ] 3 - AMD QoS extensions (monitoring and allocation)
Reading from /proc/pmc/sched allows us to determine which PMCSched plugin is currently active, as well as to retrieve the ID of our new plugin:
$ cat /proc/pmc/sched
The developed schedulers in PMCSched are:
[*] 0 - Dummy default plugin (Proof of concept)
[ ] 1 - Group Scheduling Plugin (Proof of concept)
[ ] 2 - Busy scheduler
[ ] 3 - Example scheduler
- To change the active scheduler echo 'scheduler <number>'
--- (Plugin specific output)
Once the plugin's ID is known, the active plugin can be changed by writing to the same file with the following command: echo 'scheduler <id>' > /proc/pmc/sched.
Arguably, one of PMCSched’s coolest features is its ability to collect information from Performance Monitoring Counters (PMCs) using the APIs provided by PMCTrack. You can configure your plugin to collect certain events and derived metrics, such as instruction count, cycles, LLC misses, and LLC references. This is particularly useful for profiling newly arriving applications. Let us illustrate how to collect a number of interesting PMC metrics.
Firstly, we need to prepare the descriptors for the various performance metrics:
static metric_experiment_set_t metric_description = {
    .nr_exps = 1, /* 1 set of metrics */
    .exps = {
        /* Metric set 0 */
        {
            .metrics = {
                PMC_METRIC("IPC", op_rate, INSTR_EVT, CYCLES_EVT, 1000),
                PMC_METRIC("RPKI", op_rate, LLC_ACCESSES, INSTR_EVT, 1000000),
                PMC_METRIC("MPKI", op_rate, LLC_MISSES, INSTR_EVT, 1000000),
                PMC_METRIC("MPKC", op_rate, LLC_MISSES, CYCLES_EVT, 1000000),
                PMC_METRIC("STALLS_L2", op_rate, STALLS_L2_MISS, CYCLES_EVT, 1000),
            },
            .size = NR_METRICS,
            .exp_idx = 0,
        },
    }
};
This definition relies on a set of event and metric indexes that we must define upfront:
enum event_indexes {
    INSTR_EVT = 0,
    CYCLES_EVT,
    LLC_ACCESSES,
    LLC_MISSES,
    STALLS_L2_MISS,
    L2_LINES,
    PMC_EVENT_COUNT
};

enum metric_indices {
    IPC_MET = 0,
    RPKI_MET,
    MPKI_MET,
    MPKC_MET,
    STALLS_L2_MET,
    NR_METRICS,
};
Finally, we can prepare the pmcsched_counter_config_t, which is the configuration exposed to PMCTrack. In the example below, we set the profiling mode to TBS_SCHED_MODE (time-based sampling, as opposed to event-based sampling with EBS_SCHED_MODE).
static pmcsched_counter_config_t cconfig = {
    .counter_usage = {
        .hwpmc_mask = 0x3b, /* bitops -h 0,1,3,4,5 */
        .nr_virtual_counters = CMT_MAX_EVENTS,
        .nr_experiments = 1,
        .vcounter_desc = {"llc_usage", "total_llc_bw", "local_llc_bw"},
    },
    .pmcs_descr = &pmc_configuration,
    .metric_descr = {&metric_description, NULL},
    .profiling_mode = TBS_SCHED_MODE,
};
This configuration is passed as part of the plugin definition via the counter_config field. We also specify that, for every new sample collected, PMCSched should call our plugin’s function profile_thread_example():
sched_ops_t example_pmc_plugin = {
    .policy = SCHED_EXAMPLE_PMCs,
    .description = "Plugin that uses PMCs",
    .counter_config = &cconfig,
    (...)
    .on_new_sample = profile_thread_example,
};
The profiling function can then update the thread’s instruction counter from the sample:
static int profile_thread_example(pmon_prof_t* prof, int cpu,
                                  pmc_sample_t* sample, int flags, void* data)
{
    pmcsched_thread_data_t* t = prof->monitoring_mod_priv_data;
    (...)
    t->instr_counter += sample->pmc_counts[0];
}
and use this information to decide, depending on the algorithm, how to classify the application.
Here’s a list of relevant publications that leverage PMCSched:
Carlos Bilbao, Juan Carlos Saez, Manuel Prieto-Matias, Flexible system software scheduling for asymmetric multicore systems with PMCSched: A case for Intel Alder Lake. Concurrency and Computation: Practice and Experience, 2023. DOI: 10.1002/cpe.7814
Carlos Bilbao, Juan Carlos Saez, Manuel Prieto-Matias, Divide&Content: A Fair OS-Level Resource Manager for Contention Balancing on NUMA Multicores, IEEE Transactions on Parallel and Distributed Systems, 2023. DOI: 10.1109/TPDS.2023.3309999
Javier Rubio, Carlos Bilbao, Juan Carlos Saez, Manuel Prieto-Matias, Exploiting Elasticity via OS-runtime Cooperation to Improve CPU Utilization in Multicore Systems. 32nd Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), Dublin, Ireland, 2024. DOI: 10.1109/PDP62718.2024.00014
Carlos Bilbao, Juan Carlos Saez, Manuel Prieto-Matias, Rapid Development of OS Support with PMCSched for Scheduling on Asymmetric Multicore Systems. Euro-Par 2022: Parallel Processing Workshops. Also in book Lecture Notes in Computer Science (LNCS, volume 13835). DOI: 10.1007/978-3-031-31209-0_14
You can contact the two main project contributors: