| Linux: Anticipatory I/O Scheduler |
| Submitted by Jeremy on Friday, January 24, 2003 - 23:11 |
|
With his recent release of 2.5.59-mm5, Andrew Morton [interview] explained the anticipatory I/O scheduler that he has recently merged from Nick Piggin. The resulting thread is an informative read, explaining how and why this noticeably increases performance.
Much of the benefit gained from anticipatory I/O scheduling is from the fact that read operations are usually synchronous, meaning that the first read has to happen and report back before the next read is scheduled. Without this recent patch, a command requiring multiple synchronous reads gets each of them scheduled at the back of the queue one at a time, resulting in noticeably poor performance during streamed writes. With this patch, the reads are moved toward the front of the queue, and a few millisecond pause is actually added "anticipating" the next read request. Andrew's full explanation follows, going into much more detail.
From: Andrew Morton To: linux-kernel, linux-mm Subject: 2.5.59-mm5 Date: Thu, 23 Jan 2003 19:50:44 -0800
http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.59/2.5.59-mm5/
. -mm3 and -mm4 were not announced - they were sync-up patches as we worked on the I/O scheduler.
. -mm5 has the first cut of Nick Piggin's anticipatory I/O scheduler. Here's the scoop:
The problem being addressed here is (mainly) kernel behaviour when there is a stream of writeout happening, and someone submits a read.
In 2.4.x, the disk queues contain up to 30 megabytes of writes (say, one seconds's worth). When a read is submitted the 2.4 I/O scheduler will try to insert that at the right place between the writes. Usually, there is no right place and the read is appended to the queue. That is: it will be serviced in one second.
But the problem with reads is that they are dependent - neither the application nor the kernel can submit read #N until read #N-1 has completed. So something as simple as
cat /usr/src/linux/kernel/*.c > /dev/null
requires several hundred dependent reads. And in the presence of a streaming write, each and every one of those reads gets stuck at the end of the queue, and takes a second to propagate to the head. The `cat' takes hundreds of seconds.
The celebrated read-latency2 patch recognises the fact that appending a read to a tail of writes is dumb, and puts the read near the head of the queue of writes. It provides an improvement of up to 30x. The deadline I/O scheduler in 2.5 does the same thing: if reads are queued up, promote them past writes, even if those writes have been waiting longer.
So far so good, but these fixes are still dumb. Because we're solving the dependent read problem by creating a seek storm. Every time someone submits a read, we stop writing, seek over and service the read, and then *immediately* seek back and start servicing writes again.
But in the common case, the application which submitted a read is about to go and submit another one, closeby on-disk to the first. So whoops, we have to seek back to service that one as well.
So what anticipatory scheduling does is very simple: if an application has performed a read, do *nothing at all* for a few milliseconds. Just return to userspace (or to the filesystem) in the expectation that the application or filesystem will quickly submit another read which is closeby.
If the application _does_ submit the read then fine - we service that quickly. If it does not submit a read then we lose. Time out and go back to doing writes.
The end result is a large reduction in seeking - decreased read latency, increased read bandwidth and increased write bandwidth.
The code as-is has rough spots and still needs quite some work. But it appears to be stable. The test which I have concentrated on is "how long does my laptop take to compile util-linux when there is a continuous write happening". On ext2, mounted noatime:
2.4.20: 538 seconds 2.5.59: 400 seconds 2.5.59-mm5: 70 seconds No streaming write: 48 seconds
A couple of VFS changes were needed as well.
More details on anticipatory scheduling may be found at
http://www.cs.rice.edu/~ssiyer/r/antsched/
Changes since 2.5.59-mm2:
+preempt-locking.patch
Speed up the smp preempt locking.
+ext2-allocation-failure-fix.patch
ext2 ENOSPC crash fix
+ext2_new_block-fixes.patch
ext2 cleanups
+hangcheck-timer.patch
A form of software watchdog
+slab-irq-fix.patch
Fix a BUG() in slab when memory exhaustion happens at a bad time.
+sendfile-security-hooks.patch
Reinstate lost security hooks around sendfile()
+buffer-io-accounting.patch
Fix IO-wait acounting
+aic79xx-linux-2.5.59-20030122.patch
aic7xxx driver update
+topology-remove-underbars.patch
cleanup
+mandlock-oops-fix.patch
file locking fix
+reiserfs_file_write.patch
reworked reiserfs write code.
-exit_mmap-fix2.patch
Dropped
+generic_file_readonly_mmap-fix.patch
Fix MAP_PRIVATE mmaps for filesystems which don't support ->writepage()
+seq_file-page-defn.patch
Compile fix
+exit_mmap-fix-ppc64.patch +exit_mmap-ia64-fix.patch
Fix the exit_mmap() problem in arch code.
+show_task-fix.patch
Fix oops in show_task()
+scsi-iothread.patch
software suspend fix
+numaq-ioapic-fix2.patch
NUMAQ stuff
+misc.patch
Random fixes
+writeback-sync-cleanup.patch
remove some junk from fs-writeback.c
+dont-wait-on-inode.patch
Fix large delays in the writeback path
+unlink-latency-fix.patch
Fix large delays in unlink()
+anticipatory_io_scheduling-2_5_59-mm3.patch
Anticipatory scheduling implementation
All 65 patches:
kgdb.patch
devfs-fix.patch
deadline-np-42.patch (undescribed patch)
deadline-np-43.patch (undescribed patch)
setuid-exec-no-lock_kernel.patch remove lock_kernel() from exec of setuid apps
buffer-debug.patch buffer.c debugging
warn-null-wakeup.patch
reiserfs-readpages.patch reiserfs v3 readpages support
fadvise.patch implement posix_fadvise64()
ext3-scheduling-storm.patch ext3: fix scheduling storm and lockups
auto-unplug.patch self-unplugging request queues
less-unplugging.patch Remove most of the blk_run_queues() calls
lockless-current_kernel_time.patch Lockless current_kernel_timer()
scheduler-tunables.patch scheduler tunables
htlb-2.patch hugetlb: fix MAP_FIXED handling
kirq.patch
kirq-up-fix.patch Subject: Re: 2.5.59-mm1
ext3-truncate-ordered-pages.patch ext3: explicitly free truncated pages
prune-icache-stats.patch add stats for page reclaim via inode freeing
vma-file-merge.patch
mmap-whitespace.patch
read_cache_pages-cleanup.patch cleanup in read_cache_pages()
remove-GFP_HIGHIO.patch remove __GFP_HIGHIO
quota-lockfix.patch quota locking fix
quota-offsem.patch quota semaphore fix
oprofile-p4.patch
oprofile_cpu-as-string.patch oprofile cpu-as-string
preempt-locking.patch Subject: spinlock efficiency problem [was 2.5.57 IO slowdown with CONFIG_PREEMPT enabled)
wli-11_pgd_ctor.patch (undescribed patch)
wli-11_pgd_ctor-update.patch pgd_ctor update
stack-overflow-fix.patch stack overflow checking fix
ext2-allocation-failure-fix.patch Subject: [PATCH] ext2 allocation failures
ext2_new_block-fixes.patch ext2_new_block cleanups and fixes
hangcheck-timer.patch hangcheck-timer
slab-irq-fix.patch slab IRQ fix
Richard_Henderson_for_President.patch Subject: [PATCH] Richard Henderson for President!
parenthesise-pgd_index.patch Subject: i386 pgd_index() doesn't parenthesize its arg
sendfile-security-hooks.patch Subject: [RFC][PATCH] Restore LSM hook calls to sendfile
macro-double-eval-fix.patch Subject: Re: i386 pgd_index() doesn't parenthesize its arg
mmzone-parens.patch asm-i386/mmzone.h macro paren/eval fixes
blkdev-fixes.patch blkdev.h fixes
remove-will_become_orphaned_pgrp.patch remove will_become_orphaned_pgrp()
buffer-io-accounting.patch correct wait accounting in wait_on_buffer()
aic79xx-linux-2.5.59-20030122.patch aic7xxx update
MAX_IO_APICS-ifdef.patch MAX_IO_APICS #ifdef'd wrongly
dac960-error-retry.patch Subject: [PATCH] linux2.5.56 patch to DAC960 driver for error retry
topology-remove-underbars.patch Remove __ from topology macros
mandlock-oops-fix.patch ftruncate/truncate oopses with mandatory locking
put_user-warning-fix.patch Subject: Re: Linux 2.5.59
reiserfs_file_write.patch Subject: reiserfs file_write patch
vmlinux-fix.patch vmlinux fix
smalldevfs.patch smalldevfs
sound-firmware-load-fix.patch soundcore.c referenced non-existent errno variable
generic_file_readonly_mmap-fix.patch Fix generic_file_readonly_mmap()
seq_file-page-defn.patch Include in fs/seq_file.c, as it uses PAGE_SIZE
exit_mmap-fix-ppc64.patch
exit_mmap-ia64-fix.patch Fix ia64's 64bit->32bit app switching
show_task-fix.patch Subject: [PATCH] 2.5.59: show_task() oops
scsi-iothread.patch scsi_eh_* needs to run even during suspend
numaq-ioapic-fix2.patch NUMAQ io_apic programming fix
misc.patch misc fixes
writeback-sync-cleanup.patch
dont-wait-on-inode.patch
unlink-latency-fix.patch
anticipatory_io_scheduling-2_5_59-mm3.patch Subject: [PATCH] 2.5.59-mm3 antic io sched
From: Alex Bligh Subject: Re: 2.5.59-mm5 Date: Fri, 24 Jan 2003 11:03:28 -0000
--On 23 January 2003 19:50 -0800 Andrew Morton wrote:
> So what anticipatory scheduling does is very simple: if an application > has performed a read, do *nothing at all* for a few milliseconds. Just > return to userspace (or to the filesystem) in the expectation that the > application or filesystem will quickly submit another read which is > closeby.
I'm sure this is a really dumb question, as I've never played with this subsystem, in which case I apologize in advance.
Why not follow (by default) the old system where you put the reads effectively at the back of the queue. Then rather than doing nothing for a few milliseconds, you carry on with doing the writes. However, promote the reads to the front of the queue when you have a "good lump" of them. If you get further reads while you are processing a lump of them, put them behind the lump. Switch back to the putting reads at the end when we have done "a few lumps worth" of reads, or exhausted the reads at the start of the queue (or perhaps are short of memory).
IE (with a "lump" = 20) and "a few" = 3.
W0 W1 W2 ... W50 W51
[Read arrives, we process some writes]
W5 ... W50 W51 R0
[More reads arrive, more writes processed]
W10 ... W50 W51 R0 R1 R2 .. R7
[Haven't got a big enough lump, but a write arrives]
W12 W13... W50 W51 W52 R0 R1 R2 .. R7
[More reads arrive, more writes processed]
W14 W15 ... W50 W51 W52 R0 R1 R2 .. R7 R8 R9.. R19
[Another read arrives, after 4 more writes have been processed, and we move the lump to the front]
R0 R1 R2 .. R7 R8 R9.. R19 R20 W18 W19 ... W50 W51 W52
[Some reads are processed, and some more arrive, which we insert into our lump at the front]
R0 R1 R2 .. R7 R8 R9.. R19 R20 R21 R22 W18 W19 ... W50 W51 W52
Then either if the reads are processed at the front of the queue faster than they arrive, and the "lump" disappears, or if we've processed 3 x 20 = 60 reads, we revert to sticking reads back at the end.
All this does is lump between 20 and 60 reads together.
The advantage being that you don't "do nothing" for a few milliseconds, and can attract larger lumps, than by waiting without incurring additional latency.
Now of course you have the ordering problem (in that I've assumed you can insert things into the queue at will), but you have that anyway.
-- Alex Bligh
From: Andrew Morton Subject: Re: 2.5.59-mm5 Date: Fri, 24 Jan 2003 03:16:32 -0800
Alex Bligh wrote: > > > > --On 23 January 2003 19:50 -0800 Andrew Morton wrote: > > > So what anticipatory scheduling does is very simple: if an application > > has performed a read, do *nothing at all* for a few milliseconds. Just > > return to userspace (or to the filesystem) in the expectation that the > > application or filesystem will quickly submit another read which is > > closeby. > > I'm sure this is a really dumb question, as I've never played > with this subsystem, in which case I apologize in advance. > > Why not follow (by default) the old system where you put the reads > effectively at the back of the queue. Then rather than doing nothing > for a few milliseconds, you carry on with doing the writes. However, > promote the reads to the front of the queue when you have a "good > lump" of them.
That is the problem. Reads do not come in "lumps". They are dependent. Consider the case of reading a file:
1: Read the directory.
This is a single read, and we cannot do anything until it has completed.
2: The directory told us where the inode is. Go read the inode.
This is a single read, and we cannot do anything until it has completed.
3: Go read the first 12 blocks of the file and the first indirect.
This is a single read, and we cannot do anything until it has completed.
The above process can take up to three trips through the request queue.
In this very common scenario, the only way we'll ever get "lumps" of reads is if some other processes come in and happen to want to read nearby sectors. In the best case, the size of the lump is proportional to the number of processes which are concurrently trying to read something. This just doesn't happen enough to be significant or interesting.
But writes are completely different. There is no dependency between them and at any point in time we know where on-disk a lot of writes will be placed. We don't know that for reads, which is why we need to twiddle thumbs until the application or filesystem makes up its mind.
From: Alex Tomas Subject: Re: 2.5.59-mm5 Date: 24 Jan 2003 14:23:58 +0300
>>>>> Andrew Morton (AM) writes:
AM> But writes are completely different. There is no dependency AM> between them and at any point in time we know where on-disk a lot AM> of writes will be placed. We don't know that for reads, which is AM> why we need to twiddle thumbs until the application or filesystem AM> makes up its mind.
it's significant that application doesn't want to wait read completion long and doesn't wait for write completion in most cases.
From: Andrew Morton Subject: Re: 2.5.59-mm5 Date: Fri, 24 Jan 2003 03:50:17 -0800
Alex Tomas wrote: > > >>>>> Andrew Morton (AM) writes: > > AM> But writes are completely different. There is no dependency > AM> between them and at any point in time we know where on-disk a lot > AM> of writes will be placed. We don't know that for reads, which is > AM> why we need to twiddle thumbs until the application or filesystem > AM> makes up its mind. > > > it's significant that application doesn't want to wait read completion > long and doesn't wait for write completion in most cases.
That's correct. Reads are usually synchronous and writes are rarely synchronous.
The most common place where the kernel forces a user process to wait on completion of a write is actually in unlink (truncate, really). Because truncate must wait for in-progress I/O to complete before allowing the filesystem to free (and potentially reuse) the affected blocks.
If there's a lot of writeout happening then truncate can take _ages_. Hence this patch:
Truncates can take a very long time. Especially if there is a lot of writeout happening, because truncate must wait on in-progress I/O.
And sys_unlink() is performing that truncate while holding the parent directory's i_sem. This basically shuts down new accesses to the entire directory until the synchronous I/O completes.
In the testing I've been doing, that directory is /tmp, and this hurts.
So change sys_unlink() to perform the actual truncate outside i_sem.
When there is a continuous streaming write to the same disk, this patch reduces the time for `make -j4 bzImage' from 370 seconds to 220.
namei.c | 12 ++++++++++++ 1 files changed, 12 insertions(+)
diff -puN fs/namei.c~unlink-latency-fix fs/namei.c --- 25/fs/namei.c~unlink-latency-fix 2003-01-24 02:41:04.000000000 -0800 +++ 25-akpm/fs/namei.c 2003-01-24 02:47:36.000000000 -0800 @@ -1659,12 +1659,19 @@ int vfs_unlink(struct inode *dir, struct return error; } +/* + * Make sure that the actual truncation of the file will occur outside its + * diretory's i_sem. truncate can take a long time if there is a lot of + * writeout happening, and we don't want to prevent access to the directory + * while waiting on the I/O. + */ asmlinkage long sys_unlink(const char * pathname) { int error = 0; char * name; struct dentry *dentry; struct nameidata nd; + struct inode *inode = NULL; name = getname(pathname); if(IS_ERR(name)) @@ -1683,6 +1690,9 @@ asmlinkage long sys_unlink(const char * /* Why not before? Because we want correct error value */ if (nd.last.name[nd.last.len]) goto slashes; + inode = dentry->d_inode; + if (inode) + inode = igrab(inode); error = vfs_unlink(nd.dentry->d_inode, dentry); exit2: dput(dentry); @@ -1693,6 +1703,8 @@ exit1: exit: putname(name); + if (inode) + iput(inode); /* truncate the inode here */ return error; slashes: Related Links:
|
| [ add new comment ] | |
|
| Linux Daily News Network |  | |
| Donate To KernelTrap | Total Raised: $418.65
Total Donations: 22
Help buy our new server. | |
| Who's online | | This site is currently sustaining more than 75 page views a minute. | |
|