The BFQ disk scheduler was invented by Paolo Valente. The current version of BFQ in DragonFlyBSD is implemented according to his technical report. Some additional features, inspired by the Linux version but written entirely from scratch, have also been added.
Like the CFQ (Complete Fairness Queueing) disk scheduler under Linux, BFQ is a fair queueing scheduler that aims to improve interactivity and lower the latency of the system. Maximizing throughput, however, is not a major design goal of BFQ. It is therefore best to switch to BFQ on a desktop machine, where interactivity generally matters more than throughput.
The core concept of BFQ is the “budget” of every thread: the maximum amount of service (measured by the size of the I/O requests) that a thread can receive while the scheduler is serving it exclusively. Once a thread uses up its budget, it is removed from the scheduler, assigned a new budget, and queued again into the fair queue. BFQ then selects another thread to serve exclusively.
BFQ is based on a fair queueing algorithm named WF^2Q+. This algorithm was first used on routers to fairly dispatch network packets from various connections. If we replace the terms “packets” and “connections” with “I/O requests” and “threads (or processes)”, we arrive at the basic idea of how this algorithm applies to the BFQ scheduler.
The WF^2Q+ algorithm decides which thread BFQ serves next when the previous thread uses up its budget. It is based on the notion of “virtual time”, which is actually service offered and received (measured in bytes or sectors in the implementation). The algorithm maintains a global virtual time, the amount of service offered globally, and two attributes for every thread: the virtual eligible time and the virtual deadline. The former is the total service the thread has received, while the latter is the expected “time” of selection: the thread expects to be selected when the global virtual time reaches its deadline.
The WF^2Q+ algorithm always selects the thread with the minimum deadline among the threads whose eligible time is no later than the global virtual time. Intuitively, if all threads consume the same amount of budget, they are selected alternately and receive the same share of disk service; if one thread consumes more budget than the others, it is selected less frequently.
The BFQ scheduler is written on top of the dsched framework. However, it needs more features than dsched provides: the scheduler has to be notified when the disk is idle or about to become idle, and only with this notification can it dispatch further I/O requests to the driver. Therefore, before implementing the scheduler itself, a request polling feature was added to dsched.
Before request polling was implemented, dsched did not offer a dequeue() interface to the scheduling policies running on top of it. Instead, it provided functions for a scheduler to call when the scheduler “guesses” that the disk may be able to receive more I/O requests.
The request polling feature transfers the guessing work to dsched by maintaining a variable called tag_queue_depth, the estimated depth of the disk’s NCQ or TCQ. Another variable, max_tag_queue_depth, is initialized to the maximum depth of the disk’s TCQ or NCQ, which can be acquired from the driver.
The request polling feature is not restricted to BFQ; any policy on the dsched framework can make use of it. To use this feature, a policy must monitor current_tag_queue_depth and push as many bios as it can until the depth reaches the maximum value. Monitoring can be done when bios are done and when bios are pushed to the scheduler through the queue() interface. The policy may also register a polling_func callback, which is called by dsched_strategy_request_polling() to dispatch the bios; each such strategy() call will decrease the current_tag_queue_depth.
Note that unlike dsched_strategy_async(), a policy cannot register a biodone() callback to be called when the dispatched bio is done. Instead, if such a callback is needed, the policy should define a function (of type dsched_bio_done_t) and assign it to polling_func in the policy structure. Note: this function must not block (e.g. on locks) and must be MPSAFE, and it must not be changed after the prepare() interface is called.
The WF^2Q+ fair queueing algorithm is implemented separately. To efficiently implement the operations that WF^2Q+ requires, a data structure named “augmented binary tree” is used. With its help, WF^2Q+ can select the proper thread, as described above, in O(log(N)) time, where N is the number of threads in the tree. The insert and delete operations are O(log(N)) as well. Detailed information on implementing WF^2Q+ with an augmented tree can be found in the references.
Before the implementation of BFQ, the tree.h header, which contains the red-black tree definition in DragonFly, did not support the augment callback. Thus the tree.h from FreeBSD was ported.
In the current version, a helper thread is used to execute the following operations:
The bfq_dequeue() function is the core of the BFQ scheduler. It takes responsibility for serving a thread within a preset time slice, dispatching the bios of that thread, and selecting another thread from the WF^2Q+ fair queue when the current thread runs out of budget. It should be called whenever the disk is idle or about to become idle.
To avoid blocking ithreads (interrupt threads), in the current version a helper thread is used to dispatch all bios to the lower driver; that is to say, bfq_dequeue() is only called by the helper thread.
bfq_dequeue() could be called by:
dsched_request_polling_biodone(), which is called by an interrupt thread when an I/O request is completed by the hard drive.
bfq_queue(), after a user thread pushes its bios to the scheduler.
bfq_timeout(), after the scheduler finishes suspending.
bfq_destroy_tdio(), when the tdio being destroyed is being waited on by the scheduler.
These callers now uniformly send an lwkt message to the helper thread, and all bfq_dequeue() calls are thus serialized.
bfq_timeout() needs to acquire BFQ_LOCK, which may cause the calling thread (the callout facility) to block on it. To avoid this situation, in the current version a function that sends a message to the helper thread is called when the callout alarm fires.
Due to the high latency experienced in some test cases (blogbench), we found that blocking on destroying a thread is not healthy. Therefore the helper thread now receives a message to destroy a tdio and calls bfq_destroy_tdio() instead. Note that in that function, no operation should be applied to the destroyed thread_io structure, because it may have been recycled.
As almost all major scheduler operations are serialized (actually, only bfq_queue() and the customized biodone function are exceptions), the performance is not as high as expected, which some benchmarks confirm. The helper thread seems to be the most likely source of the high latency; this should be fixed in the next version by refactoring all the synchronizing operations and using as little locking as possible.
Ideally, equal budgets would be assigned to all threads, and each thread would use up its budget immediately. However, this abstraction is far from real-world conditions. First, a thread could do random I/O, which is very time consuming. Second, a thread could be CPU-intensive, seldom doing I/O and thus consuming its budget very slowly.
As the BFQ scheduler operates in the service domain and does not account for time-domain latency issues, actual performance (and interactivity) can be hurt by the two types of threads above. As a result, we have to add time-domain restrictions on all threads to ensure low latency.
First, we assign a preset time slice (200ms) to every thread, and a thread is only served within that interval. If a thread does not use up its budget in time, the scheduler reduces its budget to the amount consumed during the current time slice. Note that a lower budget does not mean a lower share of bandwidth: because of the WF^2Q+ algorithm, the thread will simply be selected more frequently.
Second, if a thread with enough budget pushes no further I/O requests even after the whole scheduler suspends to wait a while for it, its budget is reduced as well. And if the thread consumes its budget too slowly (for example, at the current speed it would consume less than 2/3 of its budget), it is punished by being charged a full budget. As a result, the next time it is selected will be later than expected.
Third, if a thread uses up its budget within the time slice, its budget is increased. There are two types of increment:
As one would expect, through this process of budget adjustment, every thread ends up with a budget it can consume in just about one time slice.
It is possible that a thread pushes one bio and then waits for it to be done before pushing another. Although it may be doing sequential I/O, the scheduler could misinterpret this behavior and switch to another thread too early.
To avoid this issue, the AS (anticipatory scheduling) feature is introduced in BFQ: the scheduler suspends for a while when the currently served thread has enough budget but no bio in its queue. If the thread pushes one or more bios while the scheduler waits, its service is not interrupted after the scheduler resumes.
However, if a thread takes too long to “think”, it cannot enjoy the AS feature. This is described in the next section.
The AS feature is implemented with the help of the thread’s think time (ttime): the interval between the time when the last bio a thread pushed is done and the time when the thread pushes its next bio.
We accumulate the think time and compute an average value, by which the scheduler judges whether a thread takes too long to “think”.
If a thread “thinks” for too long, AS waiting would be a waste of time, so we turn off the AS feature for such a thread.
seek_avg is calculated by accumulating the value current_seek_start - last_seek_end. A “seeky” thread tends to receive a smaller budget, and the scheduler will not sample the disk peak rate after serving it.
The peak speed of the hard drive is estimated from the amount of I/O done when:
The peak rate is used to adapt the max budget automatically:
max_budget = peak_rate * time_slice_length
We have defined three debug levels:
BFQ_DEBUG_CRITICAL: printing errors or warnings.
BFQ_DEBUG_NORMAL: printing important and non-frequently appearing scheduler decisions.
BFQ_DEBUG_VERBOSE: printing all scheduler decisions.
Also, we make use of the KTR facility to print the ttime_avg before a thread is destroyed. To enable KTR, add the following lines to your kernel configuration file:
BFQ creates a subtree under the dsched node for every device using it. The subtree has the following nodes:
max_budget: [R/W] global maximum budget; if the auto max budget feature is turned on, this is the automatically adjusted maximum budget.
peak_rate: [R] Estimated disk speed, in units of 1/1024 byte per microsecond (fixed-point representation).
peak_samples: [R] Number of valid samples used to calculate the peak rate. It stops changing after reaching 80.
as_miss: [R] Counter of times that a thread does not push any bio after AS waiting.
as_hit: [R] Counter of times that a thread pushes at least one bio after AS waiting.
as_wait_avg_all: [R] Average AS waiting time (ms).
as_wait_avg_miss: [R] Average AS waiting time (ms), when AS is missed.
as_wait_max: [R] Maximum AS waiting time (ms), measured in the helper thread.
as_wait_max2: [R] Maximum AS waiting time (ms), measured in the
as_high_wait_count: [R] Counter of times that the scheduler does an AS waiting for longer than 50ms, measured in the helper thread.
as_high_wait_count: [R] Counter of times that the scheduler does an AS waiting for longer than 50ms, measured in the
avg_time_slice: [R] Average length of time slice.
max_time_slice: [R] Maximum length of time slice.
as_switch: [R/W] Switch controlling the global AS feature.
auto_max_budget_switch: [R/W] Switch controlling the auto max budget adapting feature.
BFQ currently has two tunable parameters: the global AS switch and the max budget.
It has been reported that turning AS on may hurt interactivity and greatly increase maximum latency, probably due to the over-serialized implementation of BFQ. However, blogbench results show that turning AS on also greatly increases throughput.
Suggestion: turn the AS feature on, as it has little effect on average latency.
A thread can be assigned a budget no larger than the max budget. Generally, a higher budget means higher throughput because of fewer operations on the WF^2Q+ augmented tree, while a lower budget forces the scheduler to spend more time on those operations.
However, real-world experiments show that too high a budget hurts interactivity heavily. Too low a budget also causes higher latency, and if the budget is less than 64KB (65536), which is smaller than some bios, the scheduler degrades into a round-robin scheduler, which is a poor form for a disk scheduler.
Suggestions: do not use the auto max budget; it is usually too high. A budget of 1/10 of the automatic max budget may be appropriate. In general, 512K (the default), 256K, or 192K can be good values. It is best to determine the max budget by binary search, guided by the results of some benchmarks.
When switching the dsched policy away from BFQ, the system may deadlock. (This happens when the sysctl process and the helper thread are on the same CPU.)
Sync and async bios should be treated differently: as async ones take less time to complete, the budget and the length of the time slice should be different from those of sync bios.
Paolo Valente and Fabio Checconi, “High Throughput Disk Scheduling with Fair Bandwidth Distribution”, IEEE Transactions on Computers, vol. 59, no. 9.
I. Stoica and H. Abdel-Wahab, “Earliest Eligible Virtual Deadline First: A Flexible and Accurate Mechanism for Proportional Share Resource Allocation”.