Superpage support for FreeBSD 7.0-CURRENT

Background

The Opteron has two distinct TLBs, an instruction TLB (ITLB) that is used for fetching instructions and a data TLB (DTLB) that is used for accessing data.  These two TLBs have the same basic organization.  In both TLBs, the 4KB and 2MB page mappings are implemented by two distinct groups of entries.  The group for caching 4KB page mappings is organized as a two-level hierarchy.  The first level has 32 entries and is fully associative.  The second level has 512 entries and is four-way set associative.  Thus, this group provides coverage for 2MB of memory.  In contrast, the group for caching 2MB page mappings is organized as a single level.  This single level has 8 entries and is fully associative.  Thus, this group provides coverage for 16MB of memory.  In total, each TLB provides coverage for 18MB of memory.
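
The coverage figures above follow from multiplying entry counts by page size. The short Python sketch below reproduces that arithmetic. The function name coverage is only illustrative, and the sketch assumes that the 2MB figure for the 4KB group counts the 512-entry second level alone, without adding the 32 first-level entries on top.

```python
# Coverage (reach) of each Opteron TLB, using the entry counts quoted above.
KB = 1024
MB = 1024 * KB


def coverage(entries, page_size):
    """Bytes of memory mapped when every entry holds a distinct page."""
    return entries * page_size


# Assumption: the 2MB figure for the 4KB group counts only the 512-entry
# second level; the 32 first-level entries are not added on top.
small = coverage(512, 4 * KB)   # group caching 4KB page mappings
large = coverage(8, 2 * MB)     # group caching 2MB page mappings
total = small + large

print(small // MB, large // MB, total // MB)
```

The 2MB group covers eight times as much memory as the 4KB group but has far fewer entries, which is the trade-off discussed next.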

A fundamental consequence of the Opteron's TLB organization is that the use of 2MB page mappings instead of 4KB page mappings is not certain to result in a smaller number of TLB misses.  Depending on the degree of spatial locality in a given stream of memory accesses, the larger number of entries for mapping 4KB pages may have greater impact on the number of TLB misses than the larger coverage provided by 2MB page mappings.

Benchmarks

Results for the following benchmarks are presented below: NAS BT, NAS CG, NAS IS, NAS LU, HPCC RandomAccess, NAS SP, and ASCI Sweep3d.

Results

The following results were obtained on a system with two Opteron model 875 CPUs, providing four 2.2GHz processor cores, and 4GB of DDR333/PC2700 memory.

NAS BT

Base (4KB) pages only:

Class A fastest:

 NAS Parallel Benchmarks (NPB3.1-SER) - BT Benchmark
...
 Class           =                        A
 Size            =             64x  64x  64
 Iterations      =                      200
 Time in seconds =                   139.50
 Mop/s total     =                  1206.32
...
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss 36551697
k8-bu-fill-request-l2-miss,mask=tlb-reload 2519186
k8-bu-fill-request-l2-miss,mask=dc-fill 348159278

Class A slowest:

 NAS Parallel Benchmarks (NPB3.1-SER) - BT Benchmark
...
 Class           =                        A
 Size            =             64x  64x  64
 Iterations      =                      200
 Time in seconds =                   146.10
 Mop/s total     =                  1151.88
...

k8-dc-l1-dtlb-miss-and-l2-dtlb-miss 36589963
k8-bu-fill-request-l2-miss,mask=tlb-reload 2580998
k8-bu-fill-request-l2-miss,mask=dc-fill 369602599

Class B fastest:

 NAS Parallel Benchmarks (NPB3.1-SER) - BT Benchmark
...
 Class           =                        B
 Size            =            102x 102x 102
 Iterations      =                      200
 Time in seconds =                   648.49
 Mop/s total     =                  1082.80
...

k8-dc-l1-dtlb-miss-and-l2-dtlb-miss 344909514
k8-bu-fill-request-l2-miss,mask=tlb-reload 12524942
k8-bu-fill-request-l2-miss,mask=dc-fill 1396996205

Class B slowest:

 NAS Parallel Benchmarks (NPB3.1-SER) - BT Benchmark
...
 Class           =                        B
 Size            =            102x 102x 102
 Iterations      =                      200
 Time in seconds =                   658.58
 Mop/s total     =                  1066.20
...

k8-dc-l1-dtlb-miss-and-l2-dtlb-miss 345186532
k8-bu-fill-request-l2-miss,mask=tlb-reload 12606643
k8-bu-fill-request-l2-miss,mask=dc-fill 1495266107

Class C fastest:

 NAS Parallel Benchmarks (NPB3.1-SER) - BT Benchmark
...
 Class           =                        C
 Size            =            162x 162x 162
 Iterations      =                      200
 Time in seconds =                  2710.04
 Mop/s total     =                  1057.65
 Operation type  =           floating point
...

k8-dc-l1-dtlb-miss-and-l2-dtlb-miss 3408494462
k8-bu-fill-request-l2-miss,mask=tlb-reload 52826901
k8-bu-fill-request-l2-miss,mask=dc-fill 6111631586

Class C slowest:

 NAS Parallel Benchmarks (NPB3.1-SER) - BT Benchmark
...
 Class           =                        C
 Size            =            162x 162x 162
 Iterations      =                      200
 Time in seconds =                  2804.08
 Mop/s total     =                  1022.18
...

k8-dc-l1-dtlb-miss-and-l2-dtlb-miss 3408302257
k8-bu-fill-request-l2-miss,mask=tlb-reload 52799739
k8-bu-fill-request-l2-miss,mask=dc-fill 6816122830

Superpages (2MB) enabled:

Class A fastest:

 NAS Parallel Benchmarks (NPB3.1-SER) - BT Benchmark
...
 Class           =                        A
 Size            =             64x  64x  64
 Iterations      =                      200
 Time in seconds =                   137.53
 Mop/s total     =                  1223.65
...
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss 147302424
k8-bu-fill-request-l2-miss,mask=tlb-reload 526256
k8-bu-fill-request-l2-miss,mask=dc-fill 321393533

Class A slowest:

 NAS Parallel Benchmarks (NPB3.1-SER) - BT Benchmark
...
 Class           =                        A
 Size            =             64x  64x  64
 Iterations      =                      200
 Time in seconds =                   143.66
 Mop/s total     =                  1171.38
...

k8-dc-l1-dtlb-miss-and-l2-dtlb-miss 147248436
k8-bu-fill-request-l2-miss,mask=tlb-reload 476324
k8-bu-fill-request-l2-miss,mask=dc-fill 344940375

Class B fastest:

 NAS Parallel Benchmarks (NPB3.1-SER) - BT Benchmark
...
 Class           =                        B
 Size            =            102x 102x 102
 Iterations      =                      200
 Time in seconds =                   641.63
 Mop/s total     =                  1094.38
...

k8-dc-l1-dtlb-miss-and-l2-dtlb-miss 2901287350
k8-bu-fill-request-l2-miss,mask=tlb-reload 3037011
k8-bu-fill-request-l2-miss,mask=dc-fill 1459851057

Class B slowest:

 NAS Parallel Benchmarks (NPB3.1-SER) - BT Benchmark
...
 Class           =                        B
 Size            =            102x 102x 102
 Iterations      =                      200
 Time in seconds =                   683.84
 Mop/s total     =                  1026.82
...

k8-dc-l1-dtlb-miss-and-l2-dtlb-miss 2906122665
k8-bu-fill-request-l2-miss,mask=tlb-reload 2796492
k8-bu-fill-request-l2-miss,mask=dc-fill 1504506460

Class C fastest:

 NAS Parallel Benchmarks (NPB3.1-SER) - BT Benchmark
...
 Class           =                        C
 Size            =            162x 162x 162
 Iterations      =                      200
 Time in seconds =                  2681.53
 Mop/s total     =                  1068.90
...

k8-dc-l1-dtlb-miss-and-l2-dtlb-miss 1172510928
k8-bu-fill-request-l2-miss,mask=tlb-reload 52119375
k8-bu-fill-request-l2-miss,mask=dc-fill 6289777367

Class C slowest:

 NAS Parallel Benchmarks (NPB3.1-SER) - BT Benchmark
...
 Class           =                        C
 Size            =            162x 162x 162
 Iterations      =                      200
 Time in seconds =                  2816.99
 Mop/s total     =                  1017.50
...

k8-dc-l1-dtlb-miss-and-l2-dtlb-miss 1131505508
k8-bu-fill-request-l2-miss,mask=tlb-reload 50963655
k8-bu-fill-request-l2-miss,mask=dc-fill 7025429869

NAS CG

Base (4KB) pages only:

Class A fastest:

 NAS Parallel Benchmarks (NPB3.1-SER) - CG Benchmark
...
 Class           =                        A
 Size            =                    14000
 Iterations      =                       15
 Time in seconds =                     4.78
 Mop/s total     =                   313.01
...
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss 3106717
k8-bu-fill-request-l2-miss,mask=tlb-reload 482850
k8-bu-fill-request-l2-miss,mask=dc-fill 22043504

Class A slowest:

 NAS Parallel Benchmarks (NPB3.1-SER) - CG Benchmark
...
 Class           =                        A
 Size            =                    14000
 Iterations      =                       15
 Time in seconds =                     5.51
 Mop/s total     =                   271.70
...
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss 3066947
k8-bu-fill-request-l2-miss,mask=tlb-reload 481963
k8-bu-fill-request-l2-miss,mask=dc-fill 34579348

Class B fastest:

 NAS Parallel Benchmarks (NPB3.1-SER) - CG Benchmark
...
 Class           =                        B
 Size            =                    75000
 Iterations      =                       75
 Time in seconds =                   244.10
 Mop/s total     =                   224.12
...
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss 111854837
k8-bu-fill-request-l2-miss,mask=tlb-reload 21003069
k8-bu-fill-request-l2-miss,mask=dc-fill 1314393607

Class B slowest:

 NAS Parallel Benchmarks (NPB3.1-SER) - CG Benchmark
...
 Class           =                        B
 Size            =                    75000
 Iterations      =                       75
 Time in seconds =                   285.97
 Mop/s total     =                   191.31
...
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss 112057803
k8-bu-fill-request-l2-miss,mask=tlb-reload 21108396
k8-bu-fill-request-l2-miss,mask=dc-fill 1364662192

Class C fastest:

 NAS Parallel Benchmarks (NPB3.1-SER) - CG Benchmark
...
 Class           =                        C
 Size            =                   150000
 Iterations      =                       75
 Time in seconds =                  1191.48
 Mop/s total     =                   120.31
...
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss 464226489
k8-bu-fill-request-l2-miss,mask=tlb-reload 88299146
k8-bu-fill-request-l2-miss,mask=dc-fill 23605375212

Class C slowest:

 NAS Parallel Benchmarks (NPB3.1-SER) - CG Benchmark
...
 Class           =                        C
 Size            =                   150000
 Iterations      =                       75
 Time in seconds =                  1317.57
 Mop/s total     =                   108.80
...
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss 461912607
k8-bu-fill-request-l2-miss,mask=tlb-reload 88598334
k8-bu-fill-request-l2-miss,mask=dc-fill 23643448137

Superpages (2MB) enabled:

Class A fastest:

 NAS Parallel Benchmarks (NPB3.1-SER) - CG Benchmark
...
 Class           =                        A
 Size            =                    14000
 Iterations      =                       15
 Time in seconds =                     4.71
 Mop/s total     =                   317.41
...
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss 647957
k8-bu-fill-request-l2-miss,mask=tlb-reload 129030
k8-bu-fill-request-l2-miss,mask=dc-fill 23246708

Class A slowest:

 NAS Parallel Benchmarks (NPB3.1-SER) - CG Benchmark
...
 Class           =                        A
 Size            =                    14000
 Iterations      =                       15
 Time in seconds =                     5.15
 Mop/s total     =                   290.49
...
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss 684237
k8-bu-fill-request-l2-miss,mask=tlb-reload 127660
k8-bu-fill-request-l2-miss,mask=dc-fill 30077877

Class B fastest:

 NAS Parallel Benchmarks (NPB3.1-SER) - CG Benchmark
...
 Class           =                        B
 Size            =                    75000
 Iterations      =                       75
 Time in seconds =                   235.79
 Mop/s total     =                   232.02
...
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss 10990087
k8-bu-fill-request-l2-miss,mask=tlb-reload 2750068
k8-bu-fill-request-l2-miss,mask=dc-fill 1296215696

Class B slowest:

 NAS Parallel Benchmarks (NPB3.1-SER) - CG Benchmark
...
 Class           =                        B
 Size            =                    75000
 Iterations      =                       75
 Time in seconds =                   237.56
 Mop/s total     =                   230.29
...
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss 10844272
k8-bu-fill-request-l2-miss,mask=tlb-reload 2896637
k8-bu-fill-request-l2-miss,mask=dc-fill 1298386998

Class C fastest:

 NAS Parallel Benchmarks (NPB3.1-SER) - CG Benchmark
...
 Class           =                        C
 Size            =                   150000
 Iterations      =                       75
 Time in seconds =                   954.65
 Mop/s total     =                   150.16
...
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss 34646011
k8-bu-fill-request-l2-miss,mask=tlb-reload 12812213
k8-bu-fill-request-l2-miss,mask=dc-fill 23587747901

Class C slowest:

 NAS Parallel Benchmarks (NPB3.1-SER) - CG Benchmark
...
 Class           =                        C
 Size            =                   150000
 Iterations      =                       75
 Time in seconds =                  1187.69
 Mop/s total     =                   120.69
...
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss 35294567
k8-bu-fill-request-l2-miss,mask=tlb-reload 13562483
k8-bu-fill-request-l2-miss,mask=dc-fill 23607632109

NAS IS

Base (4KB) pages only:

 NAS Parallel Benchmarks (NPB3.1-SER) - IS Benchmark
...
 Class           =                        B
 Size            =                 33554432
 Iterations      =                       10
 Time in seconds =                    14.61
 Mop/s total     =                    22.97
...
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss 266275465
k8-bu-fill-request-l2-miss,mask=tlb-reload 8462468
k8-bu-fill-request-l2-miss,mask=dc-fill 357401963

 NAS Parallel Benchmarks (NPB3.1-SER) - IS Benchmark
...
 Class           =                        C
 Size            =                134217728
 Iterations      =                       10
 Time in seconds =                    75.86
 Mop/s total     =                    17.69
...
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss 1578142847
k8-bu-fill-request-l2-miss,mask=tlb-reload 110965762
k8-bu-fill-request-l2-miss,mask=dc-fill 1693873350

Superpages (2MB) enabled:

 NAS Parallel Benchmarks (NPB3.1-SER) - IS Benchmark
...
 Class           =                        B
 Size            =                 33554432
 Iterations      =                       10
 Time in seconds =                    14.61
 Mop/s total     =                    22.97
...
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss 48427949
k8-bu-fill-request-l2-miss,mask=tlb-reload 1029579
k8-bu-fill-request-l2-miss,mask=dc-fill 354548602

 NAS Parallel Benchmarks (NPB3.1-SER) - IS Benchmark
...
 Class           =                        C
 Size            =                134217728
 Iterations      =                       10
 Time in seconds =                    73.15
 Mop/s total     =                    18.35
...
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss 674948451
k8-bu-fill-request-l2-miss,mask=tlb-reload 3152255
k8-bu-fill-request-l2-miss,mask=dc-fill 1680029146

NAS LU

Base (4KB) pages only:

Class A fastest:

 NAS Parallel Benchmarks (NPB3.1-SER) - LU Benchmark
...
 Class           =                        A
 Size            =             64x  64x  64
 Iterations      =                      250
 Time in seconds =                   161.49
 Mop/s total     =                   738.72
...
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss 82601586
k8-bu-fill-request-l2-miss,mask=tlb-reload 8450603
k8-bu-fill-request-l2-miss,mask=dc-fill 1308714190

Class B fastest:

 NAS Parallel Benchmarks (NPB3.1-SER) - LU Benchmark
...
 Class           =                        B
 Size            =            102x 102x 102
 Iterations      =                      250
 Time in seconds =                   754.71
 Mop/s total     =                   660.95
...
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss 486988395
k8-bu-fill-request-l2-miss,mask=tlb-reload 38374743
k8-bu-fill-request-l2-miss,mask=dc-fill 6291580047

Class B slowest:

 NAS Parallel Benchmarks (NPB3.1-SER) - LU Benchmark
...
 Class           =                        B
 Size            =            102x 102x 102
 Iterations      =                      250
 Time in seconds =                   851.09
 Mop/s total     =                   586.10
...
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss 483748670
k8-bu-fill-request-l2-miss,mask=tlb-reload 37579675
k8-bu-fill-request-l2-miss,mask=dc-fill 6556588998

Class C fastest/slowest:

 NAS Parallel Benchmarks (NPB3.1-SER) - LU Benchmark
...
 Class           =                        C
 Size            =            162x 162x 162
 Iterations      =                      250
 Time in seconds =                  3321.74
 Mop/s total     =                   613.83
...
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss 4990875972
k8-bu-fill-request-l2-miss,mask=tlb-reload 150705836
k8-bu-fill-request-l2-miss,mask=dc-fill 26863403465

Superpages (2MB) enabled:

Class A fastest:

 NAS Parallel Benchmarks (NPB3.1-SER) - LU Benchmark

...
 Class           =                        A
 Size            =             64x  64x  64
 Iterations      =                      250
 Time in seconds =                   156.50
 Mop/s total     =                   762.29
...
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss 23515311
k8-bu-fill-request-l2-miss,mask=tlb-reload 1861857
k8-bu-fill-request-l2-miss,mask=dc-fill 1301863982

Class A slowest:

 NAS Parallel Benchmarks (NPB3.1-SER) - LU Benchmark

...
 Class           =                        A
 Size            =             64x  64x  64
 Iterations      =                      250
 Time in seconds =                   187.51
 Mop/s total     =                   636.22
...
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss 23998922
k8-bu-fill-request-l2-miss,mask=tlb-reload 1924850
k8-bu-fill-request-l2-miss,mask=dc-fill 1337772973

Class B fastest:

 NAS Parallel Benchmarks (NPB3.1-SER) - LU Benchmark

...
 Class           =                        B
 Size            =            102x 102x 102
 Iterations      =                      250
 Time in seconds =                   757.81
 Mop/s total     =                   658.24
...
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss 978918453
k8-bu-fill-request-l2-miss,mask=tlb-reload 9097857
k8-bu-fill-request-l2-miss,mask=dc-fill 6460987793

Class B slowest:

 NAS Parallel Benchmarks (NPB3.1-SER) - LU Benchmark
...
 Class           =                        B
 Size            =            102x 102x 102
 Iterations      =                      250
 Time in seconds =                   876.22
 Mop/s total     =                   569.29
...
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss 985982705
k8-bu-fill-request-l2-miss,mask=tlb-reload 9043611
k8-bu-fill-request-l2-miss,mask=dc-fill 6598315671

Class C fastest/slowest:

HPCC RandomAccess (single CPU)

Despite the relatively small coverage of the Opteron's TLB, the implementation of superpages has a significant effect on HPCC RandomAccess's execution time.  In fact, the relative benefit grows as the data array grows.  Although superpages do not significantly reduce the number of TLB misses, they significantly reduce the number of level-two cache misses caused by the page table walk on a TLB miss.
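
The relationship between these counters can be checked directly from the RandomAccess figures reported in this section. The following Python sketch is only an illustration: the dictionary layout and the helper name summarize are assumptions, while the numbers are the real time and counter values listed for each table size.

```python
# table size exponent -> (real seconds, DTLB misses, tlb-reload L2 misses)
base = {
    25: (9.661612, 267821389, 77020792),
    26: (25.808392, 538577658, 307834660),
    27: (60.532559, 1080880550, 830909203),
    28: (147.858119, 2168714320, 1914222731),
}
superpages = {
    25: (7.570868, 250758853, 707255),
    26: (15.209009, 521179270, 1522535),
    27: (30.508333, 1066126972, 4084499),
    28: (45.142784, 2148464301, 5553789),
}


def summarize(base, sp):
    """Per table size: speedup, TLB-miss ratio, and walk-miss fractions."""
    rows = []
    for n in sorted(base):
        t4, miss4, walk4 = base[n]
        t2, miss2, walk2 = sp[n]
        rows.append((n, t4 / t2, miss2 / miss4, walk4 / miss4, walk2 / miss2))
    return rows


rows = summarize(base, superpages)
for n, speedup, miss_ratio, frac4, frac2 in rows:
    print(f"2^{n}: speedup {speedup:.2f}x, TLB misses x{miss_ratio:.2f}, "
          f"walks missing L2: {frac4:.1%} (4KB) vs {frac2:.2%} (2MB)")
```

With these inputs the speedup rises from roughly 1.3x at 2^25 words to roughly 3.3x at 2^28 words, while the TLB miss count changes by only a few percent.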

Base (4KB) pages only:

Main table size   = 2^25 = 33554432 words
Number of updates = 134217728
CPU time used  = 9.664062 seconds
Real time used = 9.661612 seconds
0.013891856 Billion(10^9) Updates    per second [GUP/s]

k8-dc-l1-dtlb-miss-and-l2-dtlb-miss
267821389
k8-bu-fill-request-l2-miss,mask=tlb-reload
77020792

Main table size   = 2^26 = 67108864 words
Number of updates = 268435456
CPU time used  = 25.812500 seconds
Real time used = 25.808392 seconds
0.010401092 Billion(10^9) Updates    per second [GUP/s]

k8-dc-l1-dtlb-miss-and-l2-dtlb-miss
538577658
k8-bu-fill-request-l2-miss,mask=tlb-reload
307834660

Main table size   = 2^27 = 134217728 words
Number of updates = 536870912
CPU time used  = 60.531250 seconds
Real time used = 60.532559 seconds
0.008869126 Billion(10^9) Updates    per second [GUP/s]

k8-dc-l1-dtlb-miss-and-l2-dtlb-miss
1080880550
k8-bu-fill-request-l2-miss,mask=tlb-reload
830909203

Main table size   = 2^28 = 268435456 words
Number of updates = 1073741824
CPU time used  = 147.843750 seconds
Real time used = 147.858119 seconds
0.007261974 Billion(10^9) Updates    per second [GUP/s]

k8-dc-l1-dtlb-miss-and-l2-dtlb-miss
2168714320
k8-bu-fill-request-l2-miss,mask=tlb-reload
1914222731

Superpages (2MB) enabled:

Main table size   = 2^25 = 33554432 words
Number of updates = 134217728
CPU time used  = 7.554688 seconds
Real time used = 7.570868 seconds
0.017728182 Billion(10^9) Updates    per second [GUP/s]

k8-dc-l1-dtlb-miss-and-l2-dtlb-miss
250758853
k8-bu-fill-request-l2-miss,mask=tlb-reload
707255

Main table size   = 2^26 = 67108864 words
Number of updates = 268435456
CPU time used  = 15.195312 seconds
Real time used = 15.209009 seconds
0.017649766 Billion(10^9) Updates    per second [GUP/s]

k8-dc-l1-dtlb-miss-and-l2-dtlb-miss
521179270
k8-bu-fill-request-l2-miss,mask=tlb-reload
1522535

Main table size   = 2^27 = 134217728 words
Number of updates = 536870912
CPU time used  = 30.460938 seconds
Real time used = 30.508333 seconds
0.017597517 Billion(10^9) Updates    per second [GUP/s]

k8-dc-l1-dtlb-miss-and-l2-dtlb-miss
1066126972
k8-bu-fill-request-l2-miss,mask=tlb-reload
4084499

Main table size   = 2^28 = 268435456 words
Number of updates = 1073741824
CPU time used  = 45.148438 seconds
Real time used = 45.142784 seconds
0.023785459 Billion(10^9) Updates    per second [GUP/s]

k8-dc-l1-dtlb-miss-and-l2-dtlb-miss
2148464301
k8-bu-fill-request-l2-miss,mask=tlb-reload
5553789

NAS SP

Base (4KB) pages only:

 NAS Parallel Benchmarks (NPB3.1-SER) - SP Benchmark
...
 Class           =                        B
 Size            =            102x 102x 102
 Iterations      =                      400
 Time in seconds =                   548.91
 Mop/s total     =                   646.75
...
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss 527845506
k8-bu-fill-request-l2-miss,mask=tlb-reload 26849809
k8-bu-fill-request-l2-miss,mask=dc-fill 3655324848

Superpages (2MB) enabled:

Class B fastest:

 NAS Parallel Benchmarks (NPB3.1-SER) - SP Benchmark
...
 Class           =                        B
 Size            =            102x 102x 102
 Iterations      =                      400
 Time in seconds =                   571.43
 Mop/s total     =                   621.27
...
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss 5902207237
k8-bu-fill-request-l2-miss,mask=tlb-reload 2593303
k8-bu-fill-request-l2-miss,mask=dc-fill 3702943459

Class B slowest:

 NAS Parallel Benchmarks (NPB3.1-SER) - SP Benchmark
...
 Class           =                        B
 Size            =            102x 102x 102
 Iterations      =                      400
 Time in seconds =                   598.55
 Mop/s total     =                   593.11
...
k8-dc-l1-dtlb-miss-and-l2-dtlb-miss 5898831120
k8-bu-fill-request-l2-miss,mask=tlb-reload 2845902
k8-bu-fill-request-l2-miss,mask=dc-fill 3884630919

ASCI Sweep3d

The execution time for Sweep3d varies widely when only 4KB pages are used.  Results for both the fastest and the slowest executions are reported below.  The variance in execution time correlates with the number of level-two cache misses resulting from data accesses.  In contrast, the number of TLB misses remains almost constant.  Likewise, the number of TLB misses that result in a level-two cache miss remains almost constant.  Roughly 70% of TLB misses result in a level-two cache miss.
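
As a rough check of these observations, the Python sketch below recomputes the ratios from the counter values reported for the two 4KB-only runs in this section. The dictionary keys and the helper name spread are illustrative assumptions; only the numbers come from the measurements.

```python
# Counter values from the two 4KB-only Sweep3d runs in this section.
runs = {
    "slowest": {"elapsed": 243.100158, "dtlb_miss": 213398828,
                "tlb_reload": 152724857, "dc_fill": 1914878620},
    "fastest": {"elapsed": 215.999874, "dtlb_miss": 213089346,
                "tlb_reload": 152228694, "dc_fill": 1411701049},
}


def spread(key):
    """Relative difference of a counter between the slowest and fastest run."""
    lo = runs["fastest"][key]
    hi = runs["slowest"][key]
    return (hi - lo) / lo


# Fraction of TLB misses whose page table walk also misses the L2 cache.
walk_fraction = {name: r["tlb_reload"] / r["dtlb_miss"]
                 for name, r in runs.items()}

for name, frac in walk_fraction.items():
    print(f"{name}: {frac:.1%} of TLB misses miss in L2")
for key in ("elapsed", "dtlb_miss", "tlb_reload", "dc_fill"):
    print(f"{key}: varies by {spread(key):.1%}")
```

The TLB counters differ by well under 1% between the two runs, whereas the data-access L2 miss count differs by more than 30%, in line with the variation in elapsed time.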

Base (4KB) pages only:

 SWEEP3D - Method 5 - Pipelined Wavefront with Line-Recursion
 Version 2.2b
 S6P1   -  6 angles/octant,  4 moments
 global grid: 150x150x150
 1domains   -  1x 1decomposition
 1domain pipelined blocks - 150k-planes by 6angles each
 estimated memory usage per domain:  433.6 MB
 0global messages per iteration
100.00% domain parallel efficiency - due to decomposition & blocking
 98.22% multitasking efficiency on  16 processors
 98.22% combined efficiency on a cluster of    1  16-way SMPs
 DSA leakage calculation: ON
 Flux fixups: ON (after 7iterations)
 Iteration Monitor:
  its =  1 err =   1.  fixs =  0
  its =  2 err =   197.571813  fixs =  0
  its =  3 err =   1.43683571  fixs =  0
  its =  4 err =   0.659707703  fixs =  0
  its =  5 err =   0.403871684  fixs =  0
  its =  6 err =   0.260737027  fixs =  0
  its =  7 err =   0.169897955  fixs =  0
  its =  8 err =   0.246048596  fixs =  873936
  its =  9 err =   0.0704761545  fixs =  835176
  its =  10 err =   0.0436432705  fixs =  818336
  its =  11 err =   0.0267308566  fixs =  809760
  its =  12 err =   0.0155931674  fixs =  804960
 Balance quantities:
  External Source:   125.
  Absorption:        124.346867
  I-leakages:       -0.104657868  0.104657868
  J-leakages:       -0.104657868  0.104657868
  K-leakages:       -0.104657869  0.104657869
 CPU     time was:   242.857002
 Elapsed time was:   243.100158
 CPU grind time:   0.125   
 Wall grind time:  0.125   

k8-dc-l1-dtlb-miss-and-l2-dtlb-miss 213398828
k8-bu-fill-request-l2-miss,mask=tlb-reload 152724857
k8-bu-fill-request-l2-miss,mask=dc-fill 1914878620

 SWEEP3D - Method 5 - Pipelined Wavefront with Line-Recursion
...
 global grid: 150x150x150
...
 CPU     time was:   215.805087
 Elapsed time was:   215.999874
 CPU grind time:   0.111   
 Wall grind time:  0.111   

k8-dc-l1-dtlb-miss-and-l2-dtlb-miss 213089346
k8-bu-fill-request-l2-miss,mask=tlb-reload 152228694
k8-bu-fill-request-l2-miss,mask=dc-fill 1411701049

When both 4KB and 2MB pages are used, the execution time for Sweep3d varies similarly, correlating with the number of level-two cache misses resulting from data accesses.  Results for both the fastest and the slowest executions are reported below.  In this case, the number of TLB misses and the number of level-two cache misses resulting from TLB misses vary somewhat.  However, the variance in these amounts is insignificant relative to that for level-two cache misses resulting from data accesses.

Superpages (2MB) enabled:

 SWEEP3D - Method 5 - Pipelined Wavefront with Line-Recursion
...
 global grid: 150x150x150
...
 CPU     time was:   197.694443
 Elapsed time was:   197.898695
 CPU grind time:   0.102   
 Wall grind time:  0.102   

k8-dc-l1-dtlb-miss-and-l2-dtlb-miss 281363041
k8-bu-fill-request-l2-miss,mask=tlb-reload 1773114
k8-bu-fill-request-l2-miss,mask=dc-fill 1319508735

 SWEEP3D - Method 5 - Pipelined Wavefront with Line-Recursion
...
 global grid: 150x150x150
...
 CPU     time was:   228.654639
 Elapsed time was:   229.247475
 CPU grind time:   0.118   
 Wall grind time:  0.118   

k8-dc-l1-dtlb-miss-and-l2-dtlb-miss 283856498
k8-bu-fill-request-l2-miss,mask=tlb-reload 2604953
k8-bu-fill-request-l2-miss,mask=dc-fill 2029245350

Overall, the fastest execution time with both 4KB and 2MB pages is 8.4% lower than the fastest execution time using only 4KB pages.  The number of TLB misses actually increases when 2MB pages are used.  Recall that there are only 8 TLB entries supporting data accesses to 2MB pages.  However, fewer than 1% of these TLB misses result in a level-two cache miss.  Thus, the number of level-two cache misses resulting from TLB misses decreases dramatically when 2MB pages are used, specifically, from about 152 million to about 1.77 million.
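
The comparison of the two fastest runs can be reproduced from the CPU times and counter values reported above. The Python sketch below is only an illustration; the variable names are assumptions, and the numbers are taken from the fastest 4KB-only run and the fastest run with superpages enabled.

```python
# Fastest Sweep3d runs: CPU seconds, DTLB misses, tlb-reload L2 misses.
fastest_4kb = {"cpu": 215.805087, "dtlb_miss": 213089346,
               "tlb_reload": 152228694}
fastest_2mb = {"cpu": 197.694443, "dtlb_miss": 281363041,
               "tlb_reload": 1773114}

# Relative reduction in execution time when superpages are enabled.
time_reduction = 1 - fastest_2mb["cpu"] / fastest_4kb["cpu"]

# Growth in TLB misses and shrinkage in L2 misses caused by page table walks.
tlb_miss_growth = fastest_2mb["dtlb_miss"] / fastest_4kb["dtlb_miss"]
walk_miss_shrink = fastest_4kb["tlb_reload"] / fastest_2mb["tlb_reload"]

print(f"execution time reduced by {time_reduction:.1%}")
print(f"TLB misses grow by a factor of {tlb_miss_growth:.2f}")
print(f"L2 misses from TLB reloads shrink by a factor of {walk_miss_shrink:.0f}")
```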