logo
home | user blogs | submit | FAQ | search | top stories | user account
Linux: Anticipatory I/O Scheduler Submitted by Jeremy on Friday, January 24, 2003 - 23:11
With his recent release of 2.5.59-mm5, Andrew Morton [interview] explained the anticipatory I/O scheduler that he has recently merged from Nick Piggin. The resulting thread is an informative read, explaining how and why this noticeably increases performance.

Much of the benefit gained from anticipatory I/O scheduling is from the fact that read operations are usually synchronous, meaning that the first read has to happen and report back before the next read is scheduled. Without this recent patch, a command requiring multiple synchronous reads gets each of them scheduled at the back of the queue one at a time, resulting in noticeably poor performance during streamed writes. With this patch, the reads are moved toward the front of the queue, and a few millisecond pause is actually added "anticipating" the next read request. Andrew's full explanation follows, going into much more detail.
From: Andrew Morton
To: linux-kernel, linux-mm
Subject: 2.5.59-mm5
Date: Thu, 23 Jan 2003 19:50:44 -0800

http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.59/2.5.59-mm5/

. -mm3 and -mm4 were not announced - they were sync-up patches as we
worked on the I/O scheduler.

. -mm5 has the first cut of Nick Piggin's anticipatory I/O scheduler.
Here's the scoop:

The problem being addressed here is (mainly) kernel behaviour when there
is a stream of writeout happening, and someone submits a read.

In 2.4.x, the disk queues contain up to 30 megabytes of writes (say, one
seconds's worth). When a read is submitted the 2.4 I/O scheduler will try
to insert that at the right place between the writes. Usually, there is no
right place and the read is appended to the queue. That is: it will be
serviced in one second.

But the problem with reads is that they are dependent - neither the
application nor the kernel can submit read #N until read #N-1 has
completed. So something as simple as

cat /usr/src/linux/kernel/*.c > /dev/null

requires several hundred dependent reads. And in the presence of a
streaming write, each and every one of those reads gets stuck at the end of
the queue, and takes a second to propagate to the head. The `cat' takes
hundreds of seconds.

The celebrated read-latency2 patch recognises the fact that appending a
read to a tail of writes is dumb, and puts the read near the head of the
queue of writes. It provides an improvement of up to 30x. The deadline
I/O scheduler in 2.5 does the same thing: if reads are queued up, promote
them past writes, even if those writes have been waiting longer.

So far so good, but these fixes are still dumb. Because we're solving
the dependent read problem by creating a seek storm. Every time someone
submits a read, we stop writing, seek over and service the read, and then
*immediately* seek back and start servicing writes again.

But in the common case, the application which submitted a read is about
to go and submit another one, closeby on-disk to the first. So whoops, we
have to seek back to service that one as well.

So what anticipatory scheduling does is very simple: if an application
has performed a read, do *nothing at all* for a few milliseconds. Just
return to userspace (or to the filesystem) in the expectation that the
application or filesystem will quickly submit another read which is
closeby.

If the application _does_ submit the read then fine - we service that
quickly. If it does not submit a read then we lose. Time out and go back
to doing writes.

The end result is a large reduction in seeking - decreased read latency,
increased read bandwidth and increased write bandwidth.

The code as-is has rough spots and still needs quite some work. But it
appears to be stable. The test which I have concentrated on is "how long
does my laptop take to compile util-linux when there is a continuous write
happening". On ext2, mounted noatime:

2.4.20: 538 seconds
2.5.59: 400 seconds
2.5.59-mm5: 70 seconds
No streaming write: 48 seconds

A couple of VFS changes were needed as well.

More details on anticipatory scheduling may be found at

http://www.cs.rice.edu/~ssiyer/r/antsched/

Changes since 2.5.59-mm2:

+preempt-locking.patch

Speed up the smp preempt locking.

+ext2-allocation-failure-fix.patch

ext2 ENOSPC crash fix

+ext2_new_block-fixes.patch

ext2 cleanups

+hangcheck-timer.patch

A form of software watchdog

+slab-irq-fix.patch

Fix a BUG() in slab when memory exhaustion happens at a bad time.

+sendfile-security-hooks.patch

Reinstate lost security hooks around sendfile()

+buffer-io-accounting.patch

Fix IO-wait acounting

+aic79xx-linux-2.5.59-20030122.patch

aic7xxx driver update

+topology-remove-underbars.patch

cleanup

+mandlock-oops-fix.patch

file locking fix

+reiserfs_file_write.patch

reworked reiserfs write code.

-exit_mmap-fix2.patch

Dropped

+generic_file_readonly_mmap-fix.patch

Fix MAP_PRIVATE mmaps for filesystems which don't support ->writepage()

+seq_file-page-defn.patch

Compile fix

+exit_mmap-fix-ppc64.patch
+exit_mmap-ia64-fix.patch

Fix the exit_mmap() problem in arch code.

+show_task-fix.patch

Fix oops in show_task()

+scsi-iothread.patch

software suspend fix

+numaq-ioapic-fix2.patch

NUMAQ stuff

+misc.patch

Random fixes

+writeback-sync-cleanup.patch

remove some junk from fs-writeback.c

+dont-wait-on-inode.patch

Fix large delays in the writeback path

+unlink-latency-fix.patch

Fix large delays in unlink()

+anticipatory_io_scheduling-2_5_59-mm3.patch

Anticipatory scheduling implementation

All 65 patches:

kgdb.patch

devfs-fix.patch

deadline-np-42.patch
(undescribed patch)

deadline-np-43.patch
(undescribed patch)

setuid-exec-no-lock_kernel.patch
remove lock_kernel() from exec of setuid apps

buffer-debug.patch
buffer.c debugging

warn-null-wakeup.patch

reiserfs-readpages.patch
reiserfs v3 readpages support

fadvise.patch
implement posix_fadvise64()

ext3-scheduling-storm.patch
ext3: fix scheduling storm and lockups

auto-unplug.patch
self-unplugging request queues

less-unplugging.patch
Remove most of the blk_run_queues() calls

lockless-current_kernel_time.patch
Lockless current_kernel_timer()

scheduler-tunables.patch
scheduler tunables

htlb-2.patch
hugetlb: fix MAP_FIXED handling

kirq.patch

kirq-up-fix.patch
Subject: Re: 2.5.59-mm1

ext3-truncate-ordered-pages.patch
ext3: explicitly free truncated pages

prune-icache-stats.patch
add stats for page reclaim via inode freeing

vma-file-merge.patch

mmap-whitespace.patch

read_cache_pages-cleanup.patch
cleanup in read_cache_pages()

remove-GFP_HIGHIO.patch
remove __GFP_HIGHIO

quota-lockfix.patch
quota locking fix

quota-offsem.patch
quota semaphore fix

oprofile-p4.patch

oprofile_cpu-as-string.patch
oprofile cpu-as-string

preempt-locking.patch
Subject: spinlock efficiency problem [was 2.5.57 IO slowdown with CONFIG_PREEMPT enabled)

wli-11_pgd_ctor.patch
(undescribed patch)

wli-11_pgd_ctor-update.patch
pgd_ctor update

stack-overflow-fix.patch
stack overflow checking fix

ext2-allocation-failure-fix.patch
Subject: [PATCH] ext2 allocation failures

ext2_new_block-fixes.patch
ext2_new_block cleanups and fixes

hangcheck-timer.patch
hangcheck-timer

slab-irq-fix.patch
slab IRQ fix

Richard_Henderson_for_President.patch
Subject: [PATCH] Richard Henderson for President!

parenthesise-pgd_index.patch
Subject: i386 pgd_index() doesn't parenthesize its arg

sendfile-security-hooks.patch
Subject: [RFC][PATCH] Restore LSM hook calls to sendfile

macro-double-eval-fix.patch
Subject: Re: i386 pgd_index() doesn't parenthesize its arg

mmzone-parens.patch
asm-i386/mmzone.h macro paren/eval fixes

blkdev-fixes.patch
blkdev.h fixes

remove-will_become_orphaned_pgrp.patch
remove will_become_orphaned_pgrp()

buffer-io-accounting.patch
correct wait accounting in wait_on_buffer()

aic79xx-linux-2.5.59-20030122.patch
aic7xxx update

MAX_IO_APICS-ifdef.patch
MAX_IO_APICS #ifdef'd wrongly

dac960-error-retry.patch
Subject: [PATCH] linux2.5.56 patch to DAC960 driver for error retry

topology-remove-underbars.patch
Remove __ from topology macros

mandlock-oops-fix.patch
ftruncate/truncate oopses with mandatory locking

put_user-warning-fix.patch
Subject: Re: Linux 2.5.59

reiserfs_file_write.patch
Subject: reiserfs file_write patch

vmlinux-fix.patch
vmlinux fix

smalldevfs.patch
smalldevfs

sound-firmware-load-fix.patch
soundcore.c referenced non-existent errno variable

generic_file_readonly_mmap-fix.patch
Fix generic_file_readonly_mmap()

seq_file-page-defn.patch
Include in fs/seq_file.c, as it uses PAGE_SIZE

exit_mmap-fix-ppc64.patch

exit_mmap-ia64-fix.patch
Fix ia64's 64bit->32bit app switching

show_task-fix.patch
Subject: [PATCH] 2.5.59: show_task() oops

scsi-iothread.patch
scsi_eh_* needs to run even during suspend

numaq-ioapic-fix2.patch
NUMAQ io_apic programming fix

misc.patch
misc fixes

writeback-sync-cleanup.patch

dont-wait-on-inode.patch

unlink-latency-fix.patch

anticipatory_io_scheduling-2_5_59-mm3.patch
Subject: [PATCH] 2.5.59-mm3 antic io sched



From: Alex Bligh
Subject: Re: 2.5.59-mm5
Date: Fri, 24 Jan 2003 11:03:28 -0000

--On 23 January 2003 19:50 -0800 Andrew Morton wrote:

> So what anticipatory scheduling does is very simple: if an application
> has performed a read, do *nothing at all* for a few milliseconds. Just
> return to userspace (or to the filesystem) in the expectation that the
> application or filesystem will quickly submit another read which is
> closeby.

I'm sure this is a really dumb question, as I've never played
with this subsystem, in which case I apologize in advance.

Why not follow (by default) the old system where you put the reads
effectively at the back of the queue. Then rather than doing nothing
for a few milliseconds, you carry on with doing the writes. However,
promote the reads to the front of the queue when you have a "good
lump" of them. If you get further reads while you are processing
a lump of them, put them behind the lump. Switch back to the putting
reads at the end when we have done "a few lumps worth" of
reads, or exhausted the reads at the start of the queue (or
perhaps are short of memory).

IE (with a "lump" = 20) and "a few" = 3.

W0 W1 W2 ... W50 W51

[Read arrives, we process some writes]

W5 ... W50 W51 R0

[More reads arrive, more writes processed]

W10 ... W50 W51 R0 R1 R2 .. R7

[Haven't got a big enough lump, but a
write arrives]

W12 W13... W50 W51 W52 R0 R1 R2 .. R7

[More reads arrive, more writes processed]

W14 W15 ... W50 W51 W52 R0 R1 R2 .. R7 R8 R9.. R19

[Another read arrives, after 4 more writes have
been processed, and we move the lump to the
front]

R0 R1 R2 .. R7 R8 R9.. R19 R20 W18 W19 ... W50 W51 W52

[Some reads are processed, and some more arrive, which
we insert into our lump at the front]

R0 R1 R2 .. R7 R8 R9.. R19 R20 R21 R22 W18 W19 ... W50 W51 W52

Then either if the reads are processed at the front of
the queue faster than they arrive, and the "lump" disappears,
or if we've processed 3 x 20 = 60 reads, we revert to
sticking reads back at the end.

All this does is lump between 20 and 60 reads together.

The advantage being that you don't "do nothing" for a few
milliseconds, and can attract larger lumps, than by
waiting without incurring additional latency.

Now of course you have the ordering problem (in that I've
assumed you can insert things into the queue at will),
but you have that anyway.

--
Alex Bligh



From: Andrew Morton
Subject: Re: 2.5.59-mm5
Date: Fri, 24 Jan 2003 03:16:32 -0800

Alex Bligh wrote:
>
>
>
> --On 23 January 2003 19:50 -0800 Andrew Morton wrote:
>
> > So what anticipatory scheduling does is very simple: if an application
> > has performed a read, do *nothing at all* for a few milliseconds. Just
> > return to userspace (or to the filesystem) in the expectation that the
> > application or filesystem will quickly submit another read which is
> > closeby.
>
> I'm sure this is a really dumb question, as I've never played
> with this subsystem, in which case I apologize in advance.
>
> Why not follow (by default) the old system where you put the reads
> effectively at the back of the queue. Then rather than doing nothing
> for a few milliseconds, you carry on with doing the writes. However,
> promote the reads to the front of the queue when you have a "good
> lump" of them.

That is the problem. Reads do not come in "lumps". They are dependent.
Consider the case of reading a file:

1: Read the directory.

This is a single read, and we cannot do anything until it has
completed.

2: The directory told us where the inode is. Go read the inode.

This is a single read, and we cannot do anything until it has
completed.

3: Go read the first 12 blocks of the file and the first indirect.

This is a single read, and we cannot do anything until it has
completed.

The above process can take up to three trips through the request queue.

In this very common scenario, the only way we'll ever get "lumps" of reads is
if some other processes come in and happen to want to read nearby sectors.
In the best case, the size of the lump is proportional to the number of
processes which are concurrently trying to read something. This just doesn't
happen enough to be significant or interesting.

But writes are completely different. There is no dependency between them and
at any point in time we know where on-disk a lot of writes will be placed.
We don't know that for reads, which is why we need to twiddle thumbs until the
application or filesystem makes up its mind.



From: Alex Tomas
Subject: Re: 2.5.59-mm5
Date: 24 Jan 2003 14:23:58 +0300

>>>>> Andrew Morton (AM) writes:

AM> But writes are completely different. There is no dependency
AM> between them and at any point in time we know where on-disk a lot
AM> of writes will be placed. We don't know that for reads, which is
AM> why we need to twiddle thumbs until the application or filesystem
AM> makes up its mind.

it's significant that application doesn't want to wait read completion
long and doesn't wait for write completion in most cases.



From: Andrew Morton
Subject: Re: 2.5.59-mm5
Date: Fri, 24 Jan 2003 03:50:17 -0800

Alex Tomas wrote:
>
> >>>>> Andrew Morton (AM) writes:
>
> AM> But writes are completely different. There is no dependency
> AM> between them and at any point in time we know where on-disk a lot
> AM> of writes will be placed. We don't know that for reads, which is
> AM> why we need to twiddle thumbs until the application or filesystem
> AM> makes up its mind.
>
>
> it's significant that application doesn't want to wait read completion
> long and doesn't wait for write completion in most cases.

That's correct. Reads are usually synchronous and writes are rarely
synchronous.

The most common place where the kernel forces a user process to wait on
completion of a write is actually in unlink (truncate, really). Because
truncate must wait for in-progress I/O to complete before allowing the
filesystem to free (and potentially reuse) the affected blocks.

If there's a lot of writeout happening then truncate can take _ages_. Hence
this patch:

Truncates can take a very long time. Especially if there is a lot of
writeout happening, because truncate must wait on in-progress I/O.

And sys_unlink() is performing that truncate while holding the parent
directory's i_sem. This basically shuts down new accesses to the entire
directory until the synchronous I/O completes.

In the testing I've been doing, that directory is /tmp, and this hurts.

So change sys_unlink() to perform the actual truncate outside i_sem.

When there is a continuous streaming write to the same disk, this patch
reduces the time for `make -j4 bzImage' from 370 seconds to 220.

namei.c | 12 ++++++++++++
1 files changed, 12 insertions(+)

diff -puN fs/namei.c~unlink-latency-fix fs/namei.c
--- 25/fs/namei.c~unlink-latency-fix 2003-01-24 02:41:04.000000000 -0800
+++ 25-akpm/fs/namei.c 2003-01-24 02:47:36.000000000 -0800
@@ -1659,12 +1659,19 @@ int vfs_unlink(struct inode *dir, struct
return error;
}

+/*
+ * Make sure that the actual truncation of the file will occur outside its
+ * diretory's i_sem. truncate can take a long time if there is a lot of
+ * writeout happening, and we don't want to prevent access to the directory
+ * while waiting on the I/O.
+ */
asmlinkage long sys_unlink(const char * pathname)
{
int error = 0;
char * name;
struct dentry *dentry;
struct nameidata nd;
+ struct inode *inode = NULL;

name = getname(pathname);
if(IS_ERR(name))
@@ -1683,6 +1690,9 @@ asmlinkage long sys_unlink(const char *
/* Why not before? Because we want correct error value */
if (nd.last.name[nd.last.len])
goto slashes;
+ inode = dentry->d_inode;
+ if (inode)
+ inode = igrab(inode);
error = vfs_unlink(nd.dentry->d_inode, dentry);
exit2:
dput(dentry);
@@ -1693,6 +1703,8 @@ exit1:
exit:
putname(name);

+ if (inode)
+ iput(inode); /* truncate the inode here */
return error;

slashes:

Related Links:
[ add new comment ]

Control panel

Comment viewing options:

Select your prefered way to display the comments and click 'Update settings' to active your changes.


Subject: All Hail the Kernel Gods!
Author:Anonymous
Date:Saturday, 01/25/2003 - 00:57
I Love Linux! But really though, 2.6 is looking to be a kick ass release. Most of the areas where 2.4 was weak are being fixed. Wonderful stuff!
[ reply to this comment ]

Subject: different idea
Author:Anonymous
Date:Saturday, 01/25/2003 - 01:38
seems to me the only way to avoid the seek storm and idling while waiting for the next read is to read ahead using something along the lines of a processor's branch prediction: from what i know of filesystems the _realistic_ choices for what is supposed the be read next (i.e. after a read()) are usually rather limited.

now that kind of prediction might waste more time than idling, and it'd be difficult to finetune how much to read ahead, but _i_ certainly know too little about the kernel code to decide how errr... uninformed my idea was :)
[ reply to this comment ]

 
Subject: Some problems
Author:Anonymous
Date:Saturday, 01/25/2003 - 02:00
Yeah you have a couple of problems here.
First is memory consumption. Reads fill memory. This means you can't do massive readahead.

Second is predicting reads. For non sequential files this really is _very_ difficult, and trying to do accurate prediction is an exercise in futility. Even applications themselves often have little idea about what they will read next. Processors at least have run the path before and have the knowledge the code hasn't changed.
[ reply to this comment ]

 
Subject: memory consumption & prediction
Author:Anonymous
Date:Saturday, 01/25/2003 - 02:27
yup, while i'd personally think of memory as a less valuable resource than time, i completely agree (as i mentioned) that prediction might waste more time than idling. though i hate having ideas dashed, i'm glad to have that suspicion confirmed, thanks (also to riel for pointing out times).

i'm currently digging my way into the kernel code, and while the code is easy enough to read, understanding which considerations went into the design is another matter altogether.

therefore i assumed that reading ahead like that might result in better performance in the same way as memory pooling (like apr does as opposed to massive amounts of subsequent malloc()s and free()s), or blocked reading (at the expense of some memory) might speed things up.

obviously at the kernel level, such things are influenced by more hardware-specific matters, of which, alas, i know too little.
[ reply to this comment ]

 
Subject: Memory resource...
Author:Anonymous
Date:Saturday, 01/25/2003 - 03:08
Well memory actually becomes a concern when thinking about servers especially. Say an FTP server. This is one problem that happened in 2.4 sometime.

You have say 1000 clients reading at 15K/s, 15MB/s. Thats not hard, but say most are reading different files, so to get decent streaming performance out of a disk in this situation, you'd like probably 1MB readahead for each file.

(At 200K readahead for each stream, a 20MB/s disk with 10ms avg seek time will spend half its time seeking - 10MB/s, not enough, 1MB gives you 17MB/s).

This means at any time you could have 1GB of readahead outstanding. In practice maybe 750MB. Once you get a few more clients and run into some memory pressure your readahead starts getting discarded and you start running into all sorts of trouble.

Anticipatory IO scheduling can get as good numbers with say 32KB readahead per file (~20MB readahead outstanding).

Thats how the theory goes, anyway.
[ reply to this comment ]

 
Subject: who reads ahead here?
Author:mysticalreaper
Date:Monday, 01/27/2003 - 21:30
I'm just curious... in the FTP server example you mentioned above, you say that a readahead of a particular size per file would be maintained on the server... who does the read ahead though? I'm assuming the FTP server would, because the kernel would have no idea what the application (ftpd) is attempting to achieve. And if the FTP server is doing the read-ahead, can you configure how much readahead goes on?
[ reply to this comment ]

 
Subject: The presumption is that the k
Author:Anonymous
Date:Monday, 01/27/2003 - 22:47
The presumption is that the kernel does the readahead, assuming the file is sent sequentially. Indeed, isn't this the type of application (FTP/HTTP server) for which sendfile() was invented?
[ reply to this comment ]

 
Subject: Yep kernel
Author:Anonymous
Date:Monday, 01/27/2003 - 23:12
Yes the kernel has pretty good mechanisms for detecting sequential IO. This is about the only common read load you can detect and plan for ahead of time.
[ reply to this comment ]

 
Subject: Re: Yep Kernel
Author:rml
Date:Wednesday, 01/29/2003 - 15:43
Yes the kernel has pretty good mechanisms for detecting sequential IO. This is about the only common read load you can detect and plan for ahead of time.

Actually, the kernel currently has no mechanism for detecting sequential I/O.
[ reply to this comment ]

 
Subject: mm/readahead.c
Author:Anonymous
Date:Wednesday, 01/29/2003 - 22:05
...
[ reply to this comment ]

 
Subject: idling is no problem
Author:riel
Date:Saturday, 01/25/2003 - 02:07
Lets take your run-of-the-mill 7200 RPM disk. This disk turns 120 times a second, meaning you have on average 4.2 milliseconds rotational latency and a seek time of 7 to 10 ms average, less for nearby tracks. This means that reading far-away data should take at least 15ms, while reading nearby data should take at most 8ms.

CPUs, on the other hand, are extremely fast. For most small reads it should be possible to return the follow-up read within 1 millisecond. This means that anticipatory scheduling with idle time will in the very worst case make the disk a few percent slower, measurably slower, but not noticably slower. In the best case (which is also the normal case) it will make the workload faster.
[ reply to this comment ]

 
Subject: idling might still be a problem?
Author:Anonymous
Date:Monday, 01/27/2003 - 22:54
True, this will be an improvement to the current situation. But the readahead idea still has me wondering if we can improve even more:

In the case that the next read will be directly adjacent, the CPU issues the next read after 1 millisecond. The disk still has to finish the rotation it just started, though, so we loose 4.2 milliseconds, more on notebooks.

If we instead read another n pages and add them as the least recently used, neither the disc, the CPU or the memory will pay much. But the user gains 3.2 milliseconds on the next read, ~75%, quite impressive.

Ok, now tell me why I am braindead and this trick doesn't work. ;-)
[ reply to this comment ]

 
Subject: you are right and wrong
Author:Anonymous
Date:Monday, 01/27/2003 - 23:14
You are right in analysis, but actually 1, we _do_ do some readahead, and 2, drives have a track buffer so when it gets a read request it will read the entire track that sector is on.
[ reply to this comment ]

 
Subject: What I'm curious about is...
Author:Anonymous
Date:Tuesday, 01/28/2003 - 05:31
How about the occasional case of non-dependent reads? If we're streaming a file out of an FTP server (eg. w/ sendfile or similar), or requesting a bunch of reads via an aio interface, we don't have the dependent-read problem. Given that, does it make sense to treat these reads in the traditional fashion, and only perform the idling trick for reads that directly resulted in the task blocking?

Also, would there be any value to letting the task hint when a given read is the last read in a chain of dependent reads?

My guess, from the comment above, is that the most you'd save is that 'few percent'. Thoughts?
[ reply to this comment ]

 
Subject: We do
Author:Anonymous
Date:Tuesday, 01/28/2003 - 06:41
we do have the dependant read problem. Readahead might mask it but we still have to send reads in a specific order. There can be non dependant reads however. We have a mechanism for detecting them and stopping anticipation if we get any. It just isn't in mm yet!

Nick
[ reply to this comment ]

Subject: are they ignoring Knuth?
Author:Anonymous
Date:Saturday, 01/25/2003 - 07:32
I know DE Knuth is slightly out of date, but he showed that an adaptive double buffer algorithm can do O(log n) reads even for non-predictive data. Under some circumstances, anticipation looks like it could improve performance, and I've done some simulations which suggest it does.

However, I think there are some race-conditions which could cause anticipatory lockups under heavy read conditions (eg - multithreaded databse queries).
[ reply to this comment ]

 
Subject: What is the algorithm?
Author:Anonymous
Date:Saturday, 01/25/2003 - 11:45
What is Knuth's algorithm? And I don't know what you are referring to that is O(ln N), but its not the algorithm cost that we're worried about, its being able to properly predict reads and service them without using a lot of memory.

Also what reace conditions do you think there are?
[ reply to this comment ]

Subject: this sounds great but...
Author:Anonymous
Date:Tuesday, 01/28/2003 - 10:02
This looks like an interesting and pragmatic way to solve the problem, it's counter-intuitive to me but it makes sense and the benchmarks are in agreement. My only question is, does this (once all the rough spots and problems are worked out) have any chance of making it into 2.5 mainline?
[ reply to this comment ]

 
Subject: Yes
Author:Anonymous
Date:Tuesday, 01/28/2003 - 12:12
You could say it is a new feature, but it is also a performance bugfix to the code which is already in mainline. It is having some obscure problems at the moment, but once it is very well tested, debugged and benchmarked, I think Linus would be mad not to. I think for a lot of disk bound workloads, it will be the single biggest performance in 2.5
[ reply to this comment ]

Subject: Some disperse thoughts
Author:Anonymous
Date:Tuesday, 01/28/2003 - 19:45
I suppose the process wich sent the request must continue running during that "do *nothing at all* for a few milliseconds" (otherwise, how could it send another request?); i.e., the kernel has to tell the process that the data has been read and is available for use (even if it is not).

Then, what happens if that process tries to use such data inmediately? Perhaps there is some kind of page-fault that blocks the process until the pending read has really ended?

Now, I do not understand the "few milliseconds" pause. Why cannot the kernel just enqueue the request and let the process continue?. This could be somethink like asynchronous reads for synchronous processes, where synchronizations happens at kernel level and only when data is strictly needed.

Just some disperse thoughts from a newbe, please excuse me if I am too naive.
[ reply to this comment ]

 
Subject: You got the order of things w
Author:Anonymous
Date:Tuesday, 01/28/2003 - 21:03
You got the order of things wrong. First the kernel reads the requested block and delivers it to the process. Then the disk does nothing (the kernel doesn't send any more requests) for a few milliseconds, while the application processes the data, and figures out that it needs another block. If it does, the head is already in place, and the data is read very fast. If it doesn't need another block, I/O has been delayed a minimal amount of time, much less than the time saved in the first case.

So, on average, time is saved, and the disk needs fewer seeks, thus reducing the wear on the drive mechanics at the same time.
[ reply to this comment ]

 
Subject: What about reading changed data?
Author:Anonymous
Date:Wednesday, 01/29/2003 - 09:09
I'm not even experienced enough to call myself a kernel newbie, but one thing jumps out at me:

When moving the reads to the beginning of the queue, instead of appending them to the end, what happens when one of the writes in the queue will change the data that's being read?
[ reply to this comment ]

 
Subject: Not an issue.
Author:Anonymous
Date:Wednesday, 01/29/2003 - 09:28
I'm pretty sure the read won't get scheduled in that manner. All the read I/O requests are for accesses that missed the block cache. All the write requests presumably are for blocks being purged from the block cache. As I understand it, blocks queued for writeout aren't removed from the block cache until after they're written out. Therefore, a read request that comes in for a block that happens to be queued for writeout would still be a "hit", and thus would not be inserted on the I/O queue.
[ reply to this comment ]

 
Subject: Changes over time
Author:sgp
Date:Wednesday, 01/29/2003 - 09:51
As a fellow newbie, that was my initial reaction, but if you had two totally independant processes, one writing, one reading, then they can't really make any expectations about scheduling.

On the other hand, if these processes are in some way linked, (like the traditional producer / consumer pair), then they need to sort out file locking between them.
That's my understanding; I could be wrong :)
[ reply to this comment ]

 
Subject: Many thanks
Author:Anonymous
Date:Wednesday, 01/29/2003 - 17:46
Now everything makes sense for me.
[ reply to this comment ]

Subject: This might deteriorate performance for databases
Author:Anonymous
Date:Wednesday, 01/29/2003 - 19:34
In a database that does mostly reads, I think the suggested scheme will deteriorate performance. In such a database, you will have several reads in the request queue at any one time, and the application will usually know the amount of data it needs to read. There is no correlation between the location of the next read by a thread and the previous read, so you could just as well start servicing a completely different read as opposed to waiting for the next read from that process.

What I would like to see is a way to detect this scenario. Two features of a system where the proposed scheme would not work are:
  1. There are several reads in the request queue (thus there are possibilities for "lumping" them together).
  2. Reads do not go to the same area of the disk. This might be a process attribute or something like that. I would really like to see my database processes tagged automatically as "do not anticipate reads" while a find command would be different.
In general I think the scheme would work, but it has to detect when the scheme does not work.
[ reply to this comment ]

 
Subject: Just because you put reads on
Author:Anonymous
Date:Wednesday, 01/29/2003 - 22:39
Just because you put reads on the "fast track" to the front of the queue and you hold off a stream of writes does not mean you disable the elevator for reads. It just means that writes are temporarily throttled when a random read request comes in.
[ reply to this comment ]

Subject: Reads starve writes
Author:Con Kolivas
Date:Friday, 01/31/2003 - 16:20
Normally writes being much slower than reads cause starvation of the reads. With the anticipatory scheduler my first set of contest results (sent to lkml) show that for the first time, reads are progressing so unabated that writes are now being starved. Having said that, the anticipatory i/o definitely does virtually fix the problem of writes starving reads, so I assume the next step needs to be taken now to prevent the new problem.

Quick summary of the relevant results:
io_load:
Kernel [runs] Time CPU% Loads LCPU% Ratio
2.5.59 3 153 50.3 8 13.7 1.94*
2.5.59-mm6 3 106 70.8 4 9.4 1.36*
read_load:
Kernel [runs] Time CPU% Loads LCPU% Ratio
2.5.59 3 102 76.5 5 4.9 1.29*
2.5.59-mm6 3 733 10.8 56 6.3 9.40*
[ reply to this comment ]

Linux Daily News Network
logo

Log in
Username:

Password:

Remember me

» Register
» New password

- FAQ
- Submit story
- Contact Us
- Features (Jan 26, 10:42 PM)
- FreeBSD (Feb 15, 12:03 AM)
- Hurd (Dec 25, 03:54 PM)
- Linux (Feb 25, 06:59 AM)
- NetBSD (Feb 22, 12:36 AM)
- OpenBSD (Feb 11, 03:41 PM)
- Tools (Feb 18, 02:59 PM)

Donate To KernelTrap
Total Raised: $418.65
Total Donations: 22
Make a donation to KernelTrap
Help buy our new server.

Who's online
This site is currently sustaining more than 75 page views a minute.

Top stories
Today's top stories:
- Linux: Where The Anticipatory Scheduler Shines
- Linux: Anticipatory I/O Scheduler
- Linux: Fair Queuing Disk Schedulers
- Linux: A Deadline I/O Scheduler
- Interview: Andrew Morton
- KernelTrap: Logo and Theme Contest, Cash Prizes
- FAQ
- Linux: Humor In Kernel Code
- KernelTrap: Buying A New Server
- Linux: Listing Threads in /proc

All time top stories:
- Linux: Humor In Kernel Code
- Linux: Native POSIX Threading Library (NPTL)
- Linux: Andrew Morton on 2.5's performance improvements
- Linux: Benchmarking 2.4 vs 2.5, UP vs SMP
- Linux: qconf Screenshots
- Linux: 2.6 vs. 3.0; What's In A Name?
- Linux: What's Left To Merge By 2.6
- Linux: Panicking In Morse Code


home | user blogs | submit | FAQ | search | top stories | user account

Except where otherwise stated original content is (c)2003 KernelTrap.
Trademarks are the property of their respective owners. Comments are the property and responsibility of their authors.
KernelTrap is powered by Drupal.