Superpage support for FreeBSD 7.0-CURRENT
Background
The Opteron has two distinct TLBs: an instruction TLB (ITLB) that is
used for fetching instructions and a data TLB (DTLB) that is used for
accessing data. These two TLBs have the same basic organization. In
both TLBs, the 4KB and 2MB page mappings are implemented by two
distinct groups of entries. The group for caching 4KB page mappings is
organized as a two-level hierarchy. The first level has 32 entries and
is fully associative. The second level has 512 entries and is four-way
set associative. Thus, with 512 second-level entries of 4KB each, this
group provides coverage for 2MB of memory. In contrast, the group for
caching 2MB page mappings is organized as a single level. This single
level has 8 entries and is fully associative. Thus, this group provides
coverage for 16MB of memory. In total, each TLB provides coverage for
18MB of memory.
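The coverage figures above follow from multiplying the number of entries by the page size. A minimal sketch of that arithmetic, in which the helper name tlb_reach is purely illustrative and only the second-level 4KB entries are counted, as in the text:

```python
KB = 1024
MB = 1024 * KB


def tlb_reach(entries, page_size):
    """Bytes of memory covered when every entry maps a distinct page."""
    return entries * page_size


reach_4k = tlb_reach(512, 4 * KB)  # second-level entries for 4KB pages
reach_2m = tlb_reach(8, 2 * MB)    # entries for 2MB pages
total = reach_4k + reach_2m

print(reach_4k // MB, reach_2m // MB, total // MB)  # prints: 2 16 18
```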
A fundamental consequence of the Opteron's TLB organization is that
using 2MB page mappings instead of 4KB page mappings is not certain to
reduce the number of TLB misses. Depending on the degree of spatial
locality in a given stream of memory accesses, the larger number of
entries for mapping 4KB pages may have a greater impact on the number
of TLB misses than the larger coverage provided by 2MB page mappings.
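One way to see this effect is with a toy model. The sketch below is an illustration under simplifying assumptions, not a model of the real hardware: each group of entries is treated as a single fully associative TLB with LRU replacement (the real second level for 4KB pages is four-way set associative), and the access stream is hypothetical. The stream repeatedly touches one address in each of 16 different 2MB regions, so the 16 distinct 4KB pages fit easily in 512 entries, while the 16 distinct 2MB pages overflow 8 entries.

```python
from collections import OrderedDict

KB = 1024
MB = 1024 * KB


def count_misses(addresses, page_size, entries):
    """Count misses in a fully associative TLB model with LRU replacement."""
    tlb = OrderedDict()
    misses = 0
    for addr in addresses:
        page = addr // page_size
        if page in tlb:
            tlb.move_to_end(page)        # mark as most recently used
        else:
            misses += 1
            tlb[page] = True
            if len(tlb) > entries:
                tlb.popitem(last=False)  # evict the least recently used page
    return misses


# Hypothetical stream: cycle over one address in each of 16 distinct 2MB regions.
regions = 16
rounds = 100
stream = [r * 2 * MB for _ in range(rounds) for r in range(regions)]

misses_4k = count_misses(stream, 4 * KB, 512)  # 4KB mappings, 512 entries
misses_2m = count_misses(stream, 2 * MB, 8)    # 2MB mappings, 8 entries
print(misses_4k, misses_2m)
```

In this model the 4KB configuration misses only on the first touch of each page, whereas the 2MB configuration misses on every access because the cyclic pattern defeats LRU with only 8 entries.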
Benchmarks
Results for the following benchmarks are presented below:
- NAS BT
- NAS CG
- NAS IS
- NAS LU
- HPCC RandomAccess, which updates a sequence of pseudo-randomly chosen
elements within a large array of integers.
- NAS SP
- ASCI Sweep3d
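The access pattern of RandomAccess described in the list above can be sketched as follows. This is a simplified illustration, not the HPCC source: the shift-register polynomial, the seed, and the function names are assumptions, and the table is kept tiny. The choice of four updates per table element matches the ratio of updates to table size reported in the RandomAccess results.

```python
POLY = 0x7                 # assumed feedback polynomial for the generator
MASK64 = (1 << 64) - 1
SEED = 0x123456789ABCDEF   # arbitrary illustrative seed


def next_random(ran):
    """Advance a 64-bit shift-register sequence by one step."""
    feedback = POLY if ran >> 63 else 0
    return ((ran << 1) & MASK64) ^ feedback


def random_access(table, updates, seed=SEED):
    """XOR pseudo-randomly chosen elements of table, in place."""
    size = len(table)  # must be a power of two
    ran = seed
    for _ in range(updates):
        ran = next_random(ran)
        table[ran & (size - 1)] ^= ran  # index comes from the low bits
    return table


log2_size = 10
table = list(range(1 << log2_size))
random_access(table, 4 * len(table))
changed = sum(1 for i, v in enumerate(table) if v != i)
print(changed, "of", len(table), "entries differ from their initial value")
```

Because each update is an XOR and the sequence depends only on the seed, applying the same updates a second time restores the original table, which gives a simple self-check.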
Results
The following results were obtained on a system with two Opteron model
875 CPUs, providing four 2.2GHz processor cores, and 4GB of
DDR333/PC2700 memory.
NAS BT
Base (4KB) pages only:
Class A fastest:
NAS Parallel Benchmarks (NPB3.1-SER) - BT Benchmark
...
Class           = A
Size            = 64x 64x 64
Iterations      = 200
Time in seconds = 139.50
Mop/s total     = 1206.32
...
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss        | 36551697 |
k8-bu-fill-request-l2-miss,mask=tlb-reload | 2519186 |
k8-bu-fill-request-l2-miss,mask=dc-fill    | 348159278 |
Class A slowest:
NAS Parallel Benchmarks (NPB3.1-SER) - BT Benchmark
...
Class           = A
Size            = 64x 64x 64
Iterations      = 200
Time in seconds = 146.10
Mop/s total     = 1151.88
...
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss        | 36589963 |
k8-bu-fill-request-l2-miss,mask=tlb-reload | 2580998 |
k8-bu-fill-request-l2-miss,mask=dc-fill    | 369602599 |
Class B fastest:
NAS Parallel Benchmarks (NPB3.1-SER) - BT Benchmark
...
Class           = B
Size            = 102x 102x 102
Iterations      = 200
Time in seconds = 648.49
Mop/s total     = 1082.80
...
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss        | 344909514 |
k8-bu-fill-request-l2-miss,mask=tlb-reload | 12524942 |
k8-bu-fill-request-l2-miss,mask=dc-fill    | 1396996205 |
Class B slowest:
NAS Parallel Benchmarks (NPB3.1-SER) - BT Benchmark
...
Class           = B
Size            = 102x 102x 102
Iterations      = 200
Time in seconds = 658.58
Mop/s total     = 1066.20
...
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss        | 345186532 |
k8-bu-fill-request-l2-miss,mask=tlb-reload | 12606643 |
k8-bu-fill-request-l2-miss,mask=dc-fill    | 1495266107 |
Class C fastest:
NAS Parallel Benchmarks (NPB3.1-SER) - BT Benchmark
...
Class           = C
Size            = 162x 162x 162
Iterations      = 200
Time in seconds = 2710.04
Mop/s total     = 1057.65
Operation type  = floating point
...
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss        | 3408494462 |
k8-bu-fill-request-l2-miss,mask=tlb-reload | 52826901 |
k8-bu-fill-request-l2-miss,mask=dc-fill    | 6111631586 |
Class C slowest:
NAS Parallel Benchmarks (NPB3.1-SER) - BT Benchmark
...
Class           = C
Size            = 162x 162x 162
Iterations      = 200
Time in seconds = 2804.08
Mop/s total     = 1022.18
...
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss        | 3408302257 |
k8-bu-fill-request-l2-miss,mask=tlb-reload | 52799739 |
k8-bu-fill-request-l2-miss,mask=dc-fill    | 6816122830 |
Superpages (2MB) enabled:
Class A fastest:
NAS Parallel Benchmarks (NPB3.1-SER) - BT Benchmark
...
Class           = A
Size            = 64x 64x 64
Iterations      = 200
Time in seconds = 137.53
Mop/s total     = 1223.65
...
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss        | 147302424 |
k8-bu-fill-request-l2-miss,mask=tlb-reload | 526256 |
k8-bu-fill-request-l2-miss,mask=dc-fill    | 321393533 |
Class A slowest:
NAS Parallel Benchmarks (NPB3.1-SER) - BT Benchmark
...
Class           = A
Size            = 64x 64x 64
Iterations      = 200
Time in seconds = 143.66
Mop/s total     = 1171.38
...
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss        | 147248436 |
k8-bu-fill-request-l2-miss,mask=tlb-reload | 476324 |
k8-bu-fill-request-l2-miss,mask=dc-fill    | 344940375 |
Class B fastest:
NAS Parallel Benchmarks (NPB3.1-SER) - BT Benchmark
...
Class           = B
Size            = 102x 102x 102
Iterations      = 200
Time in seconds = 641.63
Mop/s total     = 1094.38
...
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss        | 2901287350 |
k8-bu-fill-request-l2-miss,mask=tlb-reload | 3037011 |
k8-bu-fill-request-l2-miss,mask=dc-fill    | 1459851057 |
Class B slowest:
NAS Parallel Benchmarks (NPB3.1-SER) - BT Benchmark
...
Class           = B
Size            = 102x 102x 102
Iterations      = 200
Time in seconds = 683.84
Mop/s total     = 1026.82
...
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss        | 2906122665 |
k8-bu-fill-request-l2-miss,mask=tlb-reload | 2796492 |
k8-bu-fill-request-l2-miss,mask=dc-fill    | 1504506460 |
Class C fastest:
NAS Parallel Benchmarks (NPB3.1-SER) - BT Benchmark
...
Class           = C
Size            = 162x 162x 162
Iterations      = 200
Time in seconds = 2681.53
Mop/s total     = 1068.90
...
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss        | 1172510928 |
k8-bu-fill-request-l2-miss,mask=tlb-reload | 52119375 |
k8-bu-fill-request-l2-miss,mask=dc-fill    | 6289777367 |
Class C slowest:
NAS Parallel Benchmarks (NPB3.1-SER) - BT Benchmark
...
Class           = C
Size            = 162x 162x 162
Iterations      = 200
Time in seconds = 2816.99
Mop/s total     = 1017.50
...
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss        | 1131505508 |
k8-bu-fill-request-l2-miss,mask=tlb-reload | 50963655 |
k8-bu-fill-request-l2-miss,mask=dc-fill    | 7025429869 |
NAS CG
Base (4KB) pages only:
Class A fastest:
NAS Parallel Benchmarks (NPB3.1-SER) - CG Benchmark
...
Class           = A
Size            = 14000
Iterations      = 15
Time in seconds = 4.78
Mop/s total     = 313.01
...
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss        | 3106717 |
k8-bu-fill-request-l2-miss,mask=tlb-reload | 482850 |
k8-bu-fill-request-l2-miss,mask=dc-fill    | 22043504 |
Class A slowest:
NAS Parallel Benchmarks (NPB3.1-SER) - CG Benchmark
...
Class           = A
Size            = 14000
Iterations      = 15
Time in seconds = 5.51
Mop/s total     = 271.70
...
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss        | 3066947 |
k8-bu-fill-request-l2-miss,mask=tlb-reload | 481963 |
k8-bu-fill-request-l2-miss,mask=dc-fill    | 34579348 |
Class B fastest:
NAS Parallel Benchmarks (NPB3.1-SER) - CG Benchmark
...
Class           = B
Size            = 75000
Iterations      = 75
Time in seconds = 244.10
Mop/s total     = 224.12
...
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss        | 111854837 |
k8-bu-fill-request-l2-miss,mask=tlb-reload | 21003069 |
k8-bu-fill-request-l2-miss,mask=dc-fill    | 1314393607 |
Class B slowest:
NAS Parallel Benchmarks (NPB3.1-SER) - CG Benchmark
...
Class           = B
Size            = 75000
Iterations      = 75
Time in seconds = 285.97
Mop/s total     = 191.31
...
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss        | 112057803 |
k8-bu-fill-request-l2-miss,mask=tlb-reload | 21108396 |
k8-bu-fill-request-l2-miss,mask=dc-fill    | 1364662192 |
Class C fastest:
NAS Parallel Benchmarks (NPB3.1-SER) - CG Benchmark
...
Class           = C
Size            = 150000
Iterations      = 75
Time in seconds = 1191.48
Mop/s total     = 120.31
...
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss        | 464226489 |
k8-bu-fill-request-l2-miss,mask=tlb-reload | 88299146 |
k8-bu-fill-request-l2-miss,mask=dc-fill    | 23605375212 |
Class C slowest:
NAS Parallel Benchmarks (NPB3.1-SER) - CG Benchmark
...
Class           = C
Size            = 150000
Iterations      = 75
Time in seconds = 1317.57
Mop/s total     = 108.80
...
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss        | 461912607 |
k8-bu-fill-request-l2-miss,mask=tlb-reload | 88598334 |
k8-bu-fill-request-l2-miss,mask=dc-fill    | 23643448137 |
Superpages (2MB) enabled:
Class A fastest:
NAS Parallel Benchmarks (NPB3.1-SER) - CG Benchmark
...
Class           = A
Size            = 14000
Iterations      = 15
Time in seconds = 4.71
Mop/s total     = 317.41
...
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss        | 647957 |
k8-bu-fill-request-l2-miss,mask=tlb-reload | 129030 |
k8-bu-fill-request-l2-miss,mask=dc-fill    | 23246708 |
Class A slowest:
NAS Parallel Benchmarks (NPB3.1-SER) - CG Benchmark
...
Class           = A
Size            = 14000
Iterations      = 15
Time in seconds = 5.15
Mop/s total     = 290.49
...
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss        | 684237 |
k8-bu-fill-request-l2-miss,mask=tlb-reload | 127660 |
k8-bu-fill-request-l2-miss,mask=dc-fill    | 30077877 |
Class B fastest:
NAS Parallel Benchmarks (NPB3.1-SER) - CG Benchmark
...
Class           = B
Size            = 75000
Iterations      = 75
Time in seconds = 235.79
Mop/s total     = 232.02
...
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss        | 10990087 |
k8-bu-fill-request-l2-miss,mask=tlb-reload | 2750068 |
k8-bu-fill-request-l2-miss,mask=dc-fill    | 1296215696 |
Class B slowest:
NAS Parallel Benchmarks (NPB3.1-SER) - CG Benchmark
...
Class           = B
Size            = 75000
Iterations      = 75
Time in seconds = 237.56
Mop/s total     = 230.29
...
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss        | 10844272 |
k8-bu-fill-request-l2-miss,mask=tlb-reload | 2896637 |
k8-bu-fill-request-l2-miss,mask=dc-fill    | 1298386998 |
Class C fastest:
NAS Parallel Benchmarks (NPB3.1-SER) - CG Benchmark
...
Class           = C
Size            = 150000
Iterations      = 75
Time in seconds = 954.65
Mop/s total     = 150.16
...
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss        | 34646011 |
k8-bu-fill-request-l2-miss,mask=tlb-reload | 12812213 |
k8-bu-fill-request-l2-miss,mask=dc-fill    | 23587747901 |
Class C slowest:
NAS Parallel Benchmarks (NPB3.1-SER) - CG Benchmark
...
Class           = C
Size            = 150000
Iterations      = 75
Time in seconds = 1187.69
Mop/s total     = 120.69
...
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss        | 35294567 |
k8-bu-fill-request-l2-miss,mask=tlb-reload | 13562483 |
k8-bu-fill-request-l2-miss,mask=dc-fill    | 23607632109 |
NAS IS
Base (4KB) pages only:
NAS Parallel Benchmarks (NPB3.1-SER) - IS Benchmark
...
Class           = B
Size            = 33554432
Iterations      = 10
Time in seconds = 14.61
Mop/s total     = 22.97
...
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss        | 266275465 |
k8-bu-fill-request-l2-miss,mask=tlb-reload | 8462468 |
k8-bu-fill-request-l2-miss,mask=dc-fill    | 357401963 |
NAS Parallel Benchmarks (NPB3.1-SER) - IS Benchmark
...
Class           = C
Size            = 134217728
Iterations      = 10
Time in seconds = 75.86
Mop/s total     = 17.69
...
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss        | 1578142847 |
k8-bu-fill-request-l2-miss,mask=tlb-reload | 110965762 |
k8-bu-fill-request-l2-miss,mask=dc-fill    | 1693873350 |
Superpages (2MB) enabled:
NAS Parallel Benchmarks (NPB3.1-SER) - IS Benchmark
...
Class           = B
Size            = 33554432
Iterations      = 10
Time in seconds = 14.61
Mop/s total     = 22.97
...
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss        | 48427949 |
k8-bu-fill-request-l2-miss,mask=tlb-reload | 1029579 |
k8-bu-fill-request-l2-miss,mask=dc-fill    | 354548602 |
NAS Parallel Benchmarks (NPB3.1-SER) - IS Benchmark
...
Class           = C
Size            = 134217728
Iterations      = 10
Time in seconds = 73.15
Mop/s total     = 18.35
...
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss        | 674948451 |
k8-bu-fill-request-l2-miss,mask=tlb-reload | 3152255 |
k8-bu-fill-request-l2-miss,mask=dc-fill    | 1680029146 |
NAS LU
Base (4KB) pages only:
Class A fastest:
NAS Parallel Benchmarks (NPB3.1-SER) - LU Benchmark
...
Class           = A
Size            = 64x 64x 64
Iterations      = 250
Time in seconds = 161.49
Mop/s total     = 738.72
...
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss        | 82601586 |
k8-bu-fill-request-l2-miss,mask=tlb-reload | 8450603 |
k8-bu-fill-request-l2-miss,mask=dc-fill    | 1308714190 |
Class B fastest:
NAS Parallel Benchmarks (NPB3.1-SER) - LU Benchmark
...
Class           = B
Size            = 102x 102x 102
Iterations      = 250
Time in seconds = 754.71
Mop/s total     = 660.95
...
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss        | 486988395 |
k8-bu-fill-request-l2-miss,mask=tlb-reload | 38374743 |
k8-bu-fill-request-l2-miss,mask=dc-fill    | 6291580047 |
Class B slowest:
NAS Parallel Benchmarks (NPB3.1-SER) - LU Benchmark
...
Class           = B
Size            = 102x 102x 102
Iterations      = 250
Time in seconds = 851.09
Mop/s total     = 586.10
...
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss        | 483748670 |
k8-bu-fill-request-l2-miss,mask=tlb-reload | 37579675 |
k8-bu-fill-request-l2-miss,mask=dc-fill    | 6556588998 |
Class C fastest/slowest:
NAS Parallel Benchmarks (NPB3.1-SER) - LU Benchmark
...
Class           = C
Size            = 162x 162x 162
Iterations      = 250
Time in seconds = 3321.74
Mop/s total     = 613.83
...
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss        | 4990875972 |
k8-bu-fill-request-l2-miss,mask=tlb-reload | 150705836 |
k8-bu-fill-request-l2-miss,mask=dc-fill    | 26863403465 |
Superpages (2MB) enabled:
Class A fastest:
NAS Parallel Benchmarks (NPB3.1-SER) - LU Benchmark
...
Class           = A
Size            = 64x 64x 64
Iterations      = 250
Time in seconds = 156.50
Mop/s total     = 762.29
...
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss        | 23515311 |
k8-bu-fill-request-l2-miss,mask=tlb-reload | 1861857 |
k8-bu-fill-request-l2-miss,mask=dc-fill    | 1301863982 |
Class A slowest:
NAS Parallel Benchmarks (NPB3.1-SER) - LU Benchmark
...
Class           = A
Size            = 64x 64x 64
Iterations      = 250
Time in seconds = 187.51
Mop/s total     = 636.22
...
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss        | 23998922 |
k8-bu-fill-request-l2-miss,mask=tlb-reload | 1924850 |
k8-bu-fill-request-l2-miss,mask=dc-fill    | 1337772973 |
Class B fastest:
NAS Parallel Benchmarks (NPB3.1-SER) - LU Benchmark
...
Class           = B
Size            = 102x 102x 102
Iterations      = 250
Time in seconds = 757.81
Mop/s total     = 658.24
...
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss        | 978918453 |
k8-bu-fill-request-l2-miss,mask=tlb-reload | 9097857 |
k8-bu-fill-request-l2-miss,mask=dc-fill    | 6460987793 |
Class B slowest:
NAS Parallel Benchmarks (NPB3.1-SER) - LU Benchmark
...
Class           = B
Size            = 102x 102x 102
Iterations      = 250
Time in seconds = 876.22
Mop/s total     = 569.29
...
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss        | 985982705 |
k8-bu-fill-request-l2-miss,mask=tlb-reload | 9043611 |
k8-bu-fill-request-l2-miss,mask=dc-fill    | 6598315671 |
Class C fastest/slowest:
HPCC RandomAccess (single CPU)
Despite the relatively small coverage of the Opteron's TLB, the
implementation of superpages has a significant effect on HPCC
RandomAccess's execution time. In fact, the relative benefit grows as
the data array grows. Although the number of TLB misses is not
significantly reduced by the implementation of superpages, the number
of level-two cache misses that occur as a result of the page table walk
on a TLB miss is significantly reduced.
Base (4KB) pages only:
Main table size = 2^25 = 33554432 words
Number of updates = 134217728
CPU time used = 9.664062 seconds
Real time used = 9.661612 seconds
0.013891856 Billion(10^9) Updates per second [GUP/s]
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss        | 267821389 |
k8-bu-fill-request-l2-miss,mask=tlb-reload | 77020792 |
Main table size = 2^26 = 67108864 words
Number of updates = 268435456
CPU time used = 25.812500 seconds
Real time used = 25.808392 seconds
0.010401092 Billion(10^9) Updates per second [GUP/s]
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss        | 538577658 |
k8-bu-fill-request-l2-miss,mask=tlb-reload | 307834660 |
Main table size = 2^27 = 134217728 words
Number of updates = 536870912
CPU time used = 60.531250 seconds
Real time used = 60.532559 seconds
0.008869126 Billion(10^9) Updates per second [GUP/s]
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss        | 1080880550 |
k8-bu-fill-request-l2-miss,mask=tlb-reload | 830909203 |
Main table size = 2^28 = 268435456 words
Number of updates = 1073741824
CPU time used = 147.843750 seconds
Real time used = 147.858119 seconds
0.007261974 Billion(10^9) Updates per second [GUP/s]
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss        | 2168714320 |
k8-bu-fill-request-l2-miss,mask=tlb-reload | 1914222731 |
Superpages (2MB) enabled:
Main table size = 2^25 = 33554432 words
Number of updates = 134217728
CPU time used = 7.554688 seconds
Real time used = 7.570868 seconds
0.017728182 Billion(10^9) Updates per second [GUP/s]
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss        | 250758853 |
k8-bu-fill-request-l2-miss,mask=tlb-reload | 707255 |
Main table size = 2^26 = 67108864 words
Number of updates = 268435456
CPU time used = 15.195312 seconds
Real time used = 15.209009 seconds
0.017649766 Billion(10^9) Updates per second [GUP/s]
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss        | 521179270 |
k8-bu-fill-request-l2-miss,mask=tlb-reload | 1522535 |
Main table size = 2^27 = 134217728 words
Number of updates = 536870912
CPU time used = 30.460938 seconds
Real time used = 30.508333 seconds
0.017597517 Billion(10^9) Updates per second [GUP/s]
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss        | 1066126972 |
k8-bu-fill-request-l2-miss,mask=tlb-reload | 4084499 |
Main table size = 2^28 = 268435456 words
Number of updates = 1073741824
CPU time used = 45.148438 seconds
Real time used = 45.142784 seconds
0.023785459 Billion(10^9) Updates per second [GUP/s]
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss        | 2148464301 |
k8-bu-fill-request-l2-miss,mask=tlb-reload | 5553789 |
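The trend in the RandomAccess results can be checked directly from the reported numbers. The sketch below only recomputes ratios from the real times and counter values listed above. The variable names are illustrative, and the fraction is simply the tlb-reload counter divided by the DTLB miss counter.

```python
# (log2 table size, real seconds, DTLB misses, L2 misses on TLB reload)
base = [
    (25, 9.661612, 267821389, 77020792),
    (26, 25.808392, 538577658, 307834660),
    (27, 60.532559, 1080880550, 830909203),
    (28, 147.858119, 2168714320, 1914222731),
]
superpages = [
    (25, 7.570868, 250758853, 707255),
    (26, 15.209009, 521179270, 1522535),
    (27, 30.508333, 1066126972, 4084499),
    (28, 45.142784, 2148464301, 5553789),
]

# Speedup of the superpage runs over the base runs, per table size.
speedups = [b[1] / s[1] for b, s in zip(base, superpages)]

# Fraction of TLB misses whose page table walk also misses in the L2 cache.
base_walk_miss = [reload / tlb for _, _, tlb, reload in base]
super_walk_miss = [reload / tlb for _, _, tlb, reload in superpages]

for (n, *_), sp, bf, sf in zip(base, speedups, base_walk_miss, super_walk_miss):
    print(f"2^{n}: speedup {sp:.2f}x, walk L2 miss rate {bf:.1%} -> {sf:.2%}")
```

The speedup grows with the table size, while the number of TLB misses differs by only a few percent between the two configurations.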
NAS SP
Base (4KB) pages only:
NAS Parallel Benchmarks (NPB3.1-SER) - SP Benchmark
...
Class           = B
Size            = 102x 102x 102
Iterations      = 400
Time in seconds = 548.91
Mop/s total     = 646.75
...
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss        | 527845506 |
k8-bu-fill-request-l2-miss,mask=tlb-reload | 26849809 |
k8-bu-fill-request-l2-miss,mask=dc-fill    | 3655324848 |
Superpages (2MB) enabled:
Class B fastest:
NAS Parallel Benchmarks (NPB3.1-SER) - SP Benchmark
...
Class           = B
Size            = 102x 102x 102
Iterations      = 400
Time in seconds = 571.43
Mop/s total     = 621.27
...
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss        | 5902207237 |
k8-bu-fill-request-l2-miss,mask=tlb-reload | 2593303 |
k8-bu-fill-request-l2-miss,mask=dc-fill    | 3702943459 |
Class B slowest:
NAS Parallel Benchmarks (NPB3.1-SER) - SP Benchmark
...
Class           = B
Size            = 102x 102x 102
Iterations      = 400
Time in seconds = 598.55
Mop/s total     = 593.11
...
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss        | 5898831120 |
k8-bu-fill-request-l2-miss,mask=tlb-reload | 2845902 |
k8-bu-fill-request-l2-miss,mask=dc-fill    | 3884630919 |
ASCI Sweep3d
The execution time for Sweep3d varies widely when only 4KB pages are
used. Results for both the fastest and the slowest executions are
reported below. The variance in execution time correlates with the
number of level-two cache misses resulting from data accesses. In
contrast, the number of TLB misses remains almost constant. Likewise,
the number of TLB misses that result in a level-two cache miss remains
almost constant. Roughly 70% of TLB misses result in a level-two cache
miss.
Base (4KB) pages only:
SWEEP3D - Method 5 - Pipelined Wavefront with Line-Recursion
Version 2.2b
S6P1 - 6 angles/octant, 4 moments
global grid: 150x150x150
1 domains - 1x 1 decomposition
1 domain pipelined blocks - 150 k-planes by 6 angles each
estimated memory usage per domain: 433.6 MB
0 global messages per iteration
100.00% domain parallel efficiency - due to decomposition & blocking
98.22% multitasking efficiency on 16 processors
98.22% combined efficiency on a cluster of 1 16-way SMPs
DSA leakage calculation: ON
Flux fixups: ON (after 7 iterations)
Iteration Monitor:
its = 1 err = 1. fixs = 0
its = 2 err = 197.571813 fixs = 0
its = 3 err = 1.43683571 fixs = 0
its = 4 err = 0.659707703 fixs = 0
its = 5 err = 0.403871684 fixs = 0
its = 6 err = 0.260737027 fixs = 0
its = 7 err = 0.169897955 fixs = 0
its = 8 err = 0.246048596 fixs = 873936
its = 9 err = 0.0704761545 fixs = 835176
its = 10 err = 0.0436432705 fixs = 818336
its = 11 err = 0.0267308566 fixs = 809760
its = 12 err = 0.0155931674 fixs = 804960
Balance quantities:
External Source: 125.
Absorption: 124.346867
I-leakages: -0.104657868 0.104657868
J-leakages: -0.104657868 0.104657868
K-leakages: -0.104657869 0.104657869
CPU time was: 242.857002
Elapsed time was: 243.100158
CPU grind time: 0.125
Wall grind time: 0.125
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss        | 213398828 |
k8-bu-fill-request-l2-miss,mask=tlb-reload | 152724857 |
k8-bu-fill-request-l2-miss,mask=dc-fill    | 1914878620 |
SWEEP3D - Method 5 - Pipelined Wavefront with Line-Recursion
...
global grid: 150x150x150
...
CPU time was: 215.805087
Elapsed time was: 215.999874
CPU grind time: 0.111
Wall grind time: 0.111
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss        | 213089346 |
k8-bu-fill-request-l2-miss,mask=tlb-reload | 152228694 |
k8-bu-fill-request-l2-miss,mask=dc-fill    | 1411701049 |
When both 4KB and 2MB pages are used, the execution time for Sweep3d
varies similarly, correlating with the number of level-two cache misses
resulting from data accesses. Results for both the fastest and the
slowest executions are reported below. In this case, the number of TLB
misses and the number of level-two cache misses resulting from TLB
misses vary somewhat. However, the variance in these amounts is
insignificant relative to that for level-two cache misses resulting
from data accesses.
Superpages (2MB) enabled:
SWEEP3D - Method 5 - Pipelined Wavefront with Line-Recursion
...
global grid: 150x150x150
...
CPU time was: 197.694443
Elapsed time was: 197.898695
CPU grind time: 0.102
Wall grind time: 0.102
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss        | 281363041 |
k8-bu-fill-request-l2-miss,mask=tlb-reload | 1773114 |
k8-bu-fill-request-l2-miss,mask=dc-fill    | 1319508735 |
SWEEP3D - Method 5 - Pipelined Wavefront with Line-Recursion
...
global grid: 150x150x150
...
CPU time was: 228.654639
Elapsed time was: 229.247475
CPU grind time: 0.118
Wall grind time: 0.118
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss        | 283856498 |
k8-bu-fill-request-l2-miss,mask=tlb-reload | 2604953 |
k8-bu-fill-request-l2-miss,mask=dc-fill    | 2029245350 |
Overall, the fastest execution time with both 4KB and 2MB pages is 8.4%
lower than the fastest execution time using only 4KB pages. The number
of TLB misses actually increases when 2MB pages are used. Recall that
there are only 8 TLB entries supporting data accesses to 2MB pages.
However, fewer than 1% of these TLB misses result in a level-two cache
miss (about 0.6% in the fastest run and 0.9% in the slowest). Thus, the
number of level-two cache misses resulting from TLB misses decreases
dramatically when 2MB pages are used, specifically, from about 152
million to about 1.77 million.
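The Sweep3d comparison can be recomputed from the CPU times and counters reported above for the fastest run of each configuration. The sketch below is only that arithmetic. The dictionary keys are illustrative names, not counter names.

```python
# Fastest Sweep3d runs, values copied from the results above.
base_fast = {
    "cpu_seconds": 215.805087,
    "tlb_misses": 213089346,
    "tlb_l2_misses": 152228694,  # L2 misses caused by TLB reloads
}
super_fast = {
    "cpu_seconds": 197.694443,
    "tlb_misses": 281363041,
    "tlb_l2_misses": 1773114,
}

# Relative reduction in CPU time when 2MB pages are also used.
reduction = 1 - super_fast["cpu_seconds"] / base_fast["cpu_seconds"]

# Fraction of TLB misses that also miss in the level-two cache.
base_fraction = base_fast["tlb_l2_misses"] / base_fast["tlb_misses"]
super_fraction = super_fast["tlb_l2_misses"] / super_fast["tlb_misses"]

print(f"CPU time reduction: {reduction:.1%}")
print(f"TLB misses that miss in L2, 4KB only: {base_fraction:.1%}")
print(f"TLB misses that miss in L2, with 2MB: {super_fraction:.2%}")
```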