Computer Systemen 2010: Labbook

Started Labbook 2011

April 12, 2010

No connection with ow137, ow132, or ow127. Could connect to ow140.

March 26, 2010

There was still a performance gap between gcc-2.95.3 and gcc-4.2.
Played with the CFLAGS. The most important seems to be the -march=core2 flag (-m64, -mssse3 and -mfpmath=sse did not have a comparable effect). With -march=core2 the performance gap is closed. There is also no difference between -O2 and -O3, although there is between -O1 and -O2.
The trick for stable speedups is to check whether the baseline speedup equals 1.0. If not, the measured baseline speedup gives a good indication of how far to scale the speedup down.
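The scaling idea can be sketched as below, assuming the baseline implementation is timed in the same run as the solution under test. The helper name and its arguments are hypothetical and are not part of the driver.

```c
#include <assert.h>

/* Hypothetical helper: rescale a measured speedup by the speedup that the
 * baseline implementation itself reached in the same run. On a machine that
 * behaves like the reference, baseline_speedup is 1.0 and nothing changes. */
double scaled_speedup(double measured_speedup, double baseline_speedup)
{
    assert(baseline_speedup > 0.0);
    return measured_speedup / baseline_speedup;
}
```

For example, a solution that reports 2.4 in a run where the naive baseline reports 1.2 would be scored as roughly 2.0.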

March 25, 2010

From the standard 02h query on Intel machines, Todd Allen's cpuid-tool retains only the size of the L2 cache and, for some descriptors, whether the cache is 4-way (4w) or 8-way (8w):
- L2_256K: 0x3c, 0x42 (4w), 0x7a, 0x7e, 0x82 (8w)
- L2_512K: 0x3e, 0x43 (4w), 0x7b, 0x7f, 0x83 (8w), 0x86
- L2_1Mor2M: 0x44 (4w), 0x45 (4w), 0x84 (8w), 0x85 (8w)
- L2_2M: 0x45, 0x85, 0x88
This could be needed to port the driver code to older processors.
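A port to older processors could start from a lookup like the sketch below. It contains only the descriptor bytes listed above; the function name is hypothetical, and the choice to report 0x44 and 0x84 as 1024 KB (because 0x45 and 0x85 also appear in the 2M group) is an assumption.

```c
/* Returns the L2 size in KB for a CPUID 02h descriptor byte, using only the
 * descriptors from the list above. Returns 0 for any other byte, which
 * includes descriptors such as 0x78 and 0x7c seen on ow137 and faro. */
unsigned l2_kb_from_descriptor(unsigned char d)
{
    switch (d) {
    case 0x3c: case 0x42: case 0x7a: case 0x7e: case 0x82:
        return 256;
    case 0x3e: case 0x43: case 0x7b: case 0x7f: case 0x83: case 0x86:
        return 512;
    case 0x44: case 0x84:            /* assumption: the 1M half of "1Mor2M" */
        return 1024;
    case 0x45: case 0x85: case 0x88:
        return 2048;
    default:
        return 0;                    /* not in the list above */
    }
}
```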

March 24, 2010

Todd Allen's cpuid-tool indicates that it is possible to read the 0x80000006 on faro:
L2 unified cache information (0x80000006/ecx):
line size (bytes) = 0x40 (64)
lines per tag = 0x0 (0)
associativity = 8-way (6)
size (Kb) = 0x400 (1024)
Note that Todd Allen's cpuid also displays the cache characteristics with the highest bit unequal to 0.
According to Todd, only part of the cache characteristics are relevant for the ow-machines:
cache and TLB information (2):
0xb1: unknown
0xb0: instruction TLB: 4K, 4-way, 128 entries
0x05: unknown
0xf0: 64 byte prefetching
0x57: unknown
0x56: unknown
0x78: L2 cache: 1M, 4-way, 64 byte lines
0x30: L1 cache: 32K, 8-way, 64 byte lines
0xb4: unknown
0x2c: L1 data cache: 32K, 8-way, 64 byte lines
Todd also gets zeros for the standard 05h request.
The block-size can be queried on ow-machines with 0x80000006:
L2 unified cache information (0x80000006/ecx):
line size (bytes) = 0x40 (64)
lines per tag = 0x0 (0)
associativity = 4-way (4)
size (Kb) = 0x400 (1024)
Got the same extended 06h information from my cpuinfo_memory code on ow137:
80000006 : 00000000 00000000 04004040 00000000
unified L2 cache line size in bytes : 0x40h (64 bytes)
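The fields of the ecx word can be pulled apart with a few shifts. This is a minimal sketch, assuming the third value in the dump above is ecx and that the field positions follow the usual 0x80000006 layout; the struct and function names are hypothetical.

```c
#include <stdint.h>

struct l2_info {
    unsigned line_bytes;  /* bits 7..0: line size in bytes */
    unsigned assoc_code;  /* bits 15..12: encoded associativity, not a way count */
    unsigned size_kb;     /* bits 31..16: cache size in KB */
};

/* Splits the ecx value returned by the extended 06h request. */
struct l2_info decode_ext06_ecx(uint32_t ecx)
{
    struct l2_info info;
    info.line_bytes = ecx & 0xffu;
    info.assoc_code = (ecx >> 12) & 0xfu;
    info.size_kb    = ecx >> 16;
    return info;
}
```

For 04004040 this gives a 64-byte line, associativity code 4 and 1024 KB, the same values cpuid-tool prints for the ow-machines.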
So it seems that CACHE_BLOCK is actually present in /proc/cpuinfo, as cache_alignment (not on faro).
Checked Minfo.txt for configurations different from faro / ow137:
- u015425; cache size : 6144 KB, cache_alignment : 64 (0x40h and 0x1800)
- u016743; cache size : 3072 KB, cache_alignment : 64 (0x40h and 0xc00)
- licor ; cache size : 2048 KB, cache_alignment : 128 (not online)
- combi ; cache size : 2048 KB, cache_alignment : 128 (not online)
- u003453; cache size : 2048 KB, cache_alignment : 128 0x40h (64 bytes) (register_d(2) 7Dh) Pentium D
- noordkaap; cache size : 2048 KB, cache_alignment : 128 0x40h (64 bytes) (7Dh) Pentium 4
- flex2 ; cache size : 2048 KB, cache_alignment : 128
- synanceja; cache size : 2048 KB, cache_alignment : 128 0x40h (64 bytes) (7Dh) Pentium 4
- schaap ; cache size : 2048 KB, cache_alignment : 128 (not online)
- advance; cache size : 2048 KB, cache_alignment : 128 0x40h (64 bytes) (7Dh) Pentium 4
- ....
- gonzo ; cache size : 2048 KB, cache_alignment : 64 (register_d(0) 7Dh)
- kermit ; cache size : 2048 KB, cache_alignment : 64 (register_d(0) 7Dh)
- cassia ; cache size : 2048 KB, cache_alignment : 64 (not online anymore)
- ork ; cache size : 2048 KB, cache_alignment : ? 0x40h (64 bytes) (register_d(2) 7Dh)
- cedar ; cache size : 2048 KB, cache_alignment : ? 0x40h (64 bytes) (register_d(2) 7Dh)
- es ; cache size : 2048 KB, cache_alignment : ? 0x40h (64 bytes) (register_d(0) 78h)
- beijing; cache size : 2048 KB, cache_alignment : ? (not online anymore)
- kers ; cache size : 2048 KB, cache_alignment : ? (not online anymore)
- rolic ; cache size : 2048 KB, cache_alignment : ? (not online anymore)
- foobar2; cache size : 1024 KB, cache_alignment : 128 (no home)
- foobar3; cache size : 1024 KB, cache_alignment : 128 0x40h (64 bytes) (7Ch) Pentium D
- flex1 ; cache size : 1024 KB, cache_alignment : 128 0x40h (64 bytes) (7Ch) Pentium D
- plataan; cache size : 1024 KB, cache_alignment : ? 0x40h (64 bytes) (register_d(0) 7Ch)
- u014026; cache size : 1024 KB, cache_alignment : ? 0x40h (64 bytes) (register_d(0) 78h) E2180
- faro ; cache size : 1024 KB, cache_alignment : ? (64 bytes register_d(2) 7Ch) Pentium 4
- bacchus; cache size : 1024 KB, cache_alignment : ? (not online anymore)
- acachia; cache size : 1024 KB, cache_alignment : ? (not online anymore)
- dadel ; cache size : 1024 KB, cache_alignment : ? (not online anymore)
- mopti ; cache size : 1024 KB, cache_alignment : ? (not online anymore)
- bonsai ; cache size : 1024 KB, cache_alignment : ? (not online anymore)
- schoten2; cache size : 1024 KB, cache_alignment : 128 (not online)
- arena ; cache size : 512 KB, cache_alignment : ? (32 bytes register_d(0): 43h) (core dump Todd ) Pentium II
- dkmp002; cache size : 512 KB, cache_alignment : ? (not online anymore)
- trantor; cache size : 512 KB, cache_alignment : ? (not online anymore)
Arena is an interesting old machine. Its cache size can only be queried with standard 02h -> register_d(0) 43h.
Also note u003453, where the cache_alignment and line size differ. My conclusion: cache_alignment is not related to block / line size.
Updated fcyc.c with this code (seems to be quite stable):
(arnoud@ow137 43) ./driver -t
Processor Cache Size ~= 1024 Kb
Processor Block Size ~= 64 Kb
Rotate: Version = naive_rotate: baseline implementation:
Dim 256 512 1024 1536 2048 Mean
Your CPEs 12.8 28.7 54.4 60.8 63.3
Baseline CPEs 12.0 30.0 60.0 99.0 107.0
Speedup 0.9 1.0 1.1 1.6 1.7 1.2

Summary of Your Best Scores:
Rotate: 1.2 (naive_rotate: baseline implementation)
Without cache_clear, the code was faster:
Rotate: Version = naive_rotate: baseline implementation:
Dim 256 512 1024 1536 2048 Mean
Your CPEs 19.7 43.6 91.1 101.2 105.7
Baseline CPEs 12.0 30.0 60.0 99.0 107.0
Speedup 0.6 0.7 0.7 1.0 1.0 0.8
To be tested with student solutions.

March 22, 2010

Wrote a version of cpuid which queries the cache characteristics for faro:
register_a : 605B5001
register_b : 00000000
register_c : 00000000
register_d : 007C7040

register_d(0) : 40h (no integrated L2 cache (P6 core) or L3 cache (P4 core))
register_a(1) : 50h (code TLB, 4K/4M/2M pages, fully, 64 entries)
register_d(1) : 70h (trace L1 cache, 12 KµOPs, 8 ways)
register_a(2) : 5Bh (data TLB, 4K/4M pages, fully, 64 entries)
register_d(2) : 7Ch (code and data L2 cache, 1024 KB, 8 ways, 64 byte lines, dual-sectored)
register_a(3) : 60h (data L1 cache, 16 KB, 8 ways, 64 byte lines, sectored)
This may well be true: it agrees with Intel's quickref and corresponds with the info in /proc/cpuinfo. Yet it is a bit difficult to get the 64 bytes out of a string.
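The register_x(n) notation above is byte n of the 32-bit register, counted from the low end. A minimal sketch of that extraction (the function name is hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* Returns descriptor byte `index` (0 = lowest byte) of a CPUID 02h register. */
uint8_t descriptor_byte(uint32_t reg, unsigned index)
{
    assert(index < 4);
    return (uint8_t)(reg >> (8u * index));
}
```

With register_d = 007C7040 this yields 40h, 70h and 7Ch for indices 0 to 2. Byte 0 of register_a (01h here) is not a descriptor: for the 02h request it is the number of times the query has to be issued.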
Let's see whether I can query the block size directly with the other cache characteristics call, 04h. Unfortunately, this call doesn't work on faro (hypothesis: MISC_ENABLE.LCMV is not set to 0).
On ow137 request 04h actually works (but 02h cannot be interpreted as indicated at 02h):
cpuid 02h
register_b(0) : F0h
register_d(0) : 78h (code and data L2 cache, 1024 KB, 4 ways, 64 byte lines)
register_a(1) : B1h (code TLB, 4M pages, 4 ways, 4 entries and code TLB, 2M pages, 4 ways, 8 entries)
register_b(1) : 57h
register_d(1) : 30h (code L1 cache, 32 KB, 8 ways, 64 byte lines)
register_a(2) : B0h (code TLB, 4K pages, 4 ways, 128 entries)
register_b(2) : 56h
register_d(2) : B4h (data TLB, 4K pages, 4 ways, 256 entries)
register_a(3) : 5h (data TLB, 4M pages, 4 ways, 32 entries)
register_d(3) : 2Ch

cpuid 04h 02h
register_a : 04000121 (4 cores, data cache L2, self initializing)
register_b : 01C0003F (C00h=3072 physical line partitions)
register_c : 0000003F (63 sets)
register_d : 00000001
sandpile indicated that the highest bytes should be 0 to be valid. Added a test on the highest bytes, but there are still too many values to be consistent. Strangely enough, the response on 04h is now 0000000.
Tried to interpret the information manually as indicated above, but it seems rubbish. Try 8000_0005h and 8000_0006h tomorrow.
Also try Todd Allen's cpuid-tool and msr-tool (to activate MISC_ENABLE.LCMV).
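For comparison, here is a sketch that decodes the three 04h registers recorded above using the field layout from Intel's CPUID documentation (every size field is stored minus one). That this layout is the right way to read the ow137 values is an assumption; the struct and function names are hypothetical.

```c
#include <stdint.h>

struct leaf4_cache {
    unsigned type;        /* EAX[4:0]: 1 = data, 2 = instruction, 3 = unified */
    unsigned level;       /* EAX[7:5]: cache level */
    unsigned line_bytes;  /* EBX[11:0] + 1 */
    unsigned partitions;  /* EBX[21:12] + 1 */
    unsigned ways;        /* EBX[31:22] + 1 */
    unsigned sets;        /* ECX + 1 */
    unsigned long size_bytes;
};

/* Decodes one CPUID 04h sub-leaf into cache parameters. */
struct leaf4_cache decode_leaf4(uint32_t eax, uint32_t ebx, uint32_t ecx)
{
    struct leaf4_cache c;
    c.type       = eax & 0x1fu;
    c.level      = (eax >> 5) & 0x7u;
    c.line_bytes = (ebx & 0xfffu) + 1;
    c.partitions = ((ebx >> 12) & 0x3ffu) + 1;
    c.ways       = (ebx >> 22) + 1;
    c.sets       = ecx + 1;
    c.size_bytes = (unsigned long)c.ways * c.partitions * c.line_bytes * c.sets;
    return c;
}
```

Read this way, 04000121 / 01C0003F / 0000003F describe a level-1 data cache with 64-byte lines, 1 partition, 8 ways and 64 sets, i.e. 32 KB, which would agree with descriptor 2Ch from the 02h request.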

March 10, 2010

Looked for the difference between handout32 and handout64. Found the reason for the small arrays (MAX_DIM=2304 vs 1280).
Modified handout64/driver.c to BSIZE=32, to test whether the measurement is stable on a 32-bit machine (faro). Compiled with gcc-4.1.2.
Had to finish implementation of fcyc_v. Result:
Rotate: Version = naive_rotate: baseline implementation:
Dim 64 128 256 512 1024 Mean
Your CPEs 1617.0 1617.0 1617.0 1617.0 1617.0
Baseline CPEs 16.0 12.0 12.0 30.0 60.0
Speedup 0.0 0.0 0.0 0.0 0.0 0.0
Tried handout64/driver.c again with OLD-call num_cycles = fcyc_v(). On faro this gives reliable measurements:
Rotate: Version = naive_rotate: baseline implementation:
Dim 256 512 1024 1536 2048 Mean
Your CPEs 39.3 61.0 198.2 227.6 239.0
Baseline CPEs 12.0 30.0 60.0 99.0 107.0
Speedup 0.3 0.5 0.3 0.4 0.4 0.4

Summary of Your Best Scores:
Rotate: 0.4 (naive_rotate: baseline implementation)
(arnoud@faro 72) ./driver -t
Rotate: Version = naive_rotate: baseline implementation:
Dim 256 512 1024 1536 2048 Mean
Your CPEs 39.5 68.2 199.4 227.1 239.0
Baseline CPEs 12.0 30.0 60.0 99.0 107.0
Speedup 0.3 0.4 0.3 0.4 0.4 0.4
Result on ow137 also seems deterministic (compiled with gcc-2.95.3):
Rotate: Version = naive_rotate: baseline implementation:
Dim 256 512 1024 1536 2048 Mean
Your CPEs 11.8 27.9 55.1 57.4 65.7
Baseline CPEs 12.0 30.0 60.0 99.0 107.0
Speedup 1.0 1.1 1.1 1.7 1.6 1.3

Summary of Your Best Scores:
Rotate: 1.3 (naive_rotate: baseline implementation)
bash-3.2$ ./driver -t
Rotate: Version = naive_rotate: baseline implementation:
Dim 256 512 1024 1536 2048 Mean
Your CPEs 11.4 23.9 54.9 63.4 65.8
Baseline CPEs 12.0 30.0 60.0 99.0 107.0
Speedup 1.1 1.3 1.1 1.6 1.6 1.3
With gcc-4.1.2, the result is less deterministic (but not zero):
bash-3.2$ ./driver -t
Rotate: Version = naive_rotate: baseline implementation:
Dim 256 512 1024 1536 2048 Mean
Your CPEs 18.0 46.8 92.0 104.8 109.4
Baseline CPEs 12.0 30.0 60.0 99.0 107.0
Speedup 0.7 0.6 0.7 0.9 1.0 0.8

Summary of Your Best Scores:
Rotate: 0.8 (naive_rotate: baseline implementation)
bash-3.2$ ./driver -t
Rotate: Version = naive_rotate: baseline implementation:
Dim 256 512 1024 1536 2048 Mean
Your CPEs 11.8 30.2 55.9 63.4 66.1
Baseline CPEs 12.0 30.0 60.0 99.0 107.0
Speedup 1.0 1.0 1.1 1.6 1.6 1.2
Added code to inspect cache size:
bash-3.2$ ./driver -t
Processor Cache Size ~= 1024 Kb
Now add the code to actually clear this cache size. Maybe have to do a cpuid to find size of CACHE_BLOCK (not in /proc/cpuinfo).
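Clearing could look roughly like the sketch below: read one byte from every block of a buffer as large as the cache, so that earlier contents are pushed out. It is a best-effort approach under assumed parameters, not the fcyc.c code; the function name and the choice to allocate per call are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

static volatile int sink;  /* keeps the compiler from dropping the reads */

/* Touches one byte per block in a buffer of cache_bytes bytes.
 * Returns the number of blocks touched, or 0 on bad input or failed malloc. */
size_t clear_cache(size_t cache_bytes, size_t block_bytes)
{
    size_t touched = 0;
    size_t i;
    int acc = 0;
    char *buf;

    if (cache_bytes == 0 || block_bytes == 0)
        return 0;
    buf = malloc(cache_bytes);
    if (buf == NULL)
        return 0;
    memset(buf, 1, cache_bytes);           /* make sure the pages are mapped */
    for (i = 0; i < cache_bytes; i += block_bytes) {
        acc += buf[i];                     /* one access per cache block */
        touched++;
    }
    sink = acc;
    free(buf);
    return touched;
}
```

With the 1024 KB reported above and a 64-byte block, one pass makes 16384 accesses.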

March 5, 2010

Replaced perflab64/fcyc.c with code-all/fcyc.c. faro is 32-bit, emicro has no default gcc, u015425 has a default gcc. Compilation complains about a missing void version of fcyc.
fcyc is now a wrapper around fcyc_full. Default parameters are k=3 (samples within convergence), epsilon = 0.01, maxsamples = 20, compensate=0. An additional parameter is clear_cache (no default), but it seems to be mainly used with value=0. In the original driver, clear_cache=1, compensate=1 and cache_size is defined!
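The role of k, epsilon and maxsamples can be illustrated with a k-best sketch: keep the k smallest samples and stop once the k-th smallest is within a factor (1 + epsilon) of the smallest. Whether code-all/fcyc.c applies exactly this test is an assumption; the function below is hypothetical and takes the samples as an array, with the array length playing the role of maxsamples.

```c
#include <assert.h>
#include <stddef.h>

#define KBEST 3  /* k = 3, as in the defaults above */

/* Feeds samples in measurement order and returns the smallest one seen when
 * the KBEST smallest agree within epsilon; *converged tells whether that
 * happened before the samples ran out. */
double kbest_estimate(const double *samples, size_t n, double epsilon,
                      int *converged)
{
    double best[KBEST];
    size_t count = 0, s, i;

    assert(samples != NULL && converged != NULL);
    *converged = 0;
    for (s = 0; s < n; s++) {
        double v = samples[s];
        if (count < KBEST)
            best[count++] = v;
        else if (v < best[KBEST - 1])
            best[KBEST - 1] = v;           /* replaces the current k-th best */
        else
            continue;                      /* too slow to matter */
        /* restore ascending order with one insertion pass */
        for (i = count - 1; i > 0 && best[i] < best[i - 1]; i--) {
            double t = best[i];
            best[i] = best[i - 1];
            best[i - 1] = t;
        }
        if (count == KBEST && (1.0 + epsilon) * best[0] >= best[KBEST - 1]) {
            *converged = 1;
            break;
        }
    }
    return count > 0 ? best[0] : 0.0;
}
```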
In original fcyc, both CACHE_BYTES and CACHE_BLOCK are explicitly defined (32 bits). In code-all/fcyc.c, the cache size is also explicitly set (to Pentium III values). No interface to set the cache_size. This explains why fcyc is called with clear_cache mostly off.
fcyc is also called with only one parameter, instead of a parameter list.
Created fcyc_v version. Works, yet it still sometimes gives a CPE of 0.0.
Created find_cpe_v version. Needs maxcount, which is the dimension of the data (do I need dimension or dimension*dimension?). The dimension is used for sampling. To be tested.

March 3, 2010

Students encounter non-deterministic CPE measurements. Analyzed the measurement code:
- Difference between 2004code/fcyc.c and perflab64/fcyc.c is just the existence of fcyc_v() with void *params.
- Difference between 2010code/fcyc.c and perflab64/fcyc.c is 20 lines.
- 2010/opt uses benchmark.c. This file calls find_cpe_full in ../src/cpe.c. This function calls measure_function (which calls fcyc) several times and fits a slope through the cycle_vals.
- perflab64/driver.c makes a direct call to fcyc_v in test_rotate and calculates the CPE as num_cycles/work.
- perflab64/driver.c could improve by calling find_cpe(elem_fun_t f, int maxcnt), yet this means that the function arguments have to be converted from voids to ints.
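The two ways of arriving at a CPE in the list above can be put side by side. This is a minimal sketch with hypothetical function names: a least-squares slope through (element count, cycles) pairs, and the plain ratio of one measurement.

```c
#include <assert.h>
#include <stddef.h>

/* Least-squares slope of cycles[] against elements[]: cycles per element
 * with any fixed overhead absorbed by the intercept. Returns 0.0 when the
 * fit is not defined. */
double cpe_slope(const double *elements, const double *cycles, size_t n)
{
    double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0, denom;
    size_t i;

    assert(elements != NULL && cycles != NULL);
    for (i = 0; i < n; i++) {
        sx  += elements[i];
        sy  += cycles[i];
        sxx += elements[i] * elements[i];
        sxy += elements[i] * cycles[i];
    }
    denom = (double)n * sxx - sx * sx;
    if (n < 2 || denom == 0.0)
        return 0.0;
    return ((double)n * sxy - sx * sy) / denom;
}

/* Single-measurement estimate: num_cycles / work. */
double cpe_ratio(double num_cycles, double work)
{
    return work > 0.0 ? num_cycles / work : 0.0;
}
```

With cycle counts of 250, 450 and 650 for 100, 200 and 300 elements, the slope is 2.0 while the ratio for the first measurement is 2.5, because the ratio also counts the fixed 50-cycle overhead.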

March 1, 2010

Tried perflab on 64-bit machines. Program compiles, but crashes. Segmentation fault is in get_comp_counter in clock.c:85. This is a bug in code-all/src/clock.c under #ifndef USE_TSC (get_comp_counter calling itself).
Naive result (-g) and (-g -mssse3 -mfpmath=sse -m64):
Rotate: Version = naive_rotate: baseline implementation:
Dim 64 128 256 512 1024 Mean
Your CPEs 19.2 10.8 11.2 23.7 58.5
Baseline CPEs 12.0 20.0 45.0 125.0 145.0
Speedup 0.6 1.8 4.0 5.3 2.5 2.3
Naive result (-g -O1 ) and (-g -O2 -mssse3 -mfpmath=sse -m64 ) and (-g -O4):
Dim 64 128 256 512 1024 Mean
Your CPEs 5.8 6.3 6.7 22.5 58.7
Baseline CPEs 12.0 20.0 45.0 125.0 145.0
Speedup 2.1 3.2 6.7 5.6 2.5 3.6
Lucky naive result (-g -O1 -mssse3 -mfpmath=sse -m64):
Dim 64 128 256 512 1024 Mean
Your CPEs 5.7 3.8 5.7 22.5 58.6
Baseline CPEs 12.0 20.0 45.0 125.0 145.0
Speedup 2.1 5.3 7.9 5.5 2.5 4.1
Modified alignment from
while ((unsigned)orig % BSIZE)
((char *)orig)++;
via
while ((unsigned long)orig % BSIZE)
orig=(pixel *)(char *)(((char *)orig)+1);
to
while ((unsigned long)orig % BSIZE)
orig=(pixel *)(((char *)orig)+1);
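The byte-by-byte loop can also be written as a single rounding step. The sketch below assumes BSIZE is a power of two (32 in the driver) and uses uintptr_t, which is as wide as a pointer on both 32-bit and 64-bit machines; the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Rounds p up to the next multiple of bsize; bsize must be a power of two. */
void *align_up(void *p, uintptr_t bsize)
{
    uintptr_t addr = (uintptr_t)p;

    assert(bsize != 0 && (bsize & (bsize - 1)) == 0);
    return (void *)((addr + bsize - 1) & ~(bsize - 1));
}
```

In the driver this would read orig = (pixel *)align_up(orig, BSIZE), provided the buffer was allocated with at least BSIZE - 1 spare bytes.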
Need some bigger matrices, but driver crashes on matrices bigger than 1024.
Rotate: Version = naive_rotate: baseline implementation:
Dim 64 128 256 512 1024 Mean
Your CPEs 19.0 10.8 13.6 27.7 58.5
Baseline CPEs 16.0 14.0 12.0 32.0 60.0
Speedup 0.8 1.3 0.9 1.2 1.0 1.0
Check alignment in driver.

February 23, 2010

Studied the Web Aside "Achieving Greater Parallelism with SIMD Instructions". Randy Bryant answered that the code of the new book was compiled with 4.2.1, although they also compiled this code with 3.4.2.
Compiling opt/combine.c with 4.2.0 on ow147 gave the error 'unrecognized command line option "-msse4"'. The same happens with the standard 4.1.2. The older versions gcc-3.2, gcc-3.3, gcc-3.4.0 and gcc-3.4.3 are not available on these machines. Solved the issue by using the option "-msse4a".
Following result for combine:
Integer Sum combine4: Array reference, accumulate in temporary:
1.44 cycles/element
Integer Sum simd_v8a: SSE code, 8*VSIZE-way parallelism, reassociate:
0.09 cycles/element
With option "-mssse3" (note the extra s) the result is:
Integer Sum simd_v8a: SSE code, 8*VSIZE-way parallelism, reassociate:
0.03 cycles/element
With option "-msse3" the result is:
Integer Sum simd_v8a: SSE code, 8*VSIZE-way parallelism, reassociate:
0.13 cycles/element
With option "-msse2" the result is:
Integer Sum simd_v8a: SSE code, 8*VSIZE-way parallelism, reassociate:
0.07 cycles/element
With option "-msse" the 4-way parallelism is faster:
Integer Sum simd_v4a: SSE code, 4*VSIZE-way parallelism, reassociate:
0.14 cycles/element
Integer Sum simd_v8a: SSE code, 8*VSIZE-way parallelism, reassociate:
0.28 cycles/element
SSSE3 is the Core 2 extension of SSE3. SSE4 exists in three variants. SSE4a refers to AMD's small subset. The other two subsets (SSE4.1 and SSE4.2) are available since GCC 4.3.4 with the switch -msse4 (selects both; use -msse4.1 or -msse4.2 to select the subsets separately). GCC 4.3.4 is currently the latest release of the previous series.
Latest release version is gcc 4.4.3.
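Which instruction set a given -m switch actually enables can be checked at compile time through the macros GCC predefines (__SSE__, __SSE2__, __SSE3__, __SSSE3__, __SSE4_1__, __SSE4_2__). A minimal sketch; the function name is hypothetical:

```c
/* Returns the highest Intel SSE level enabled by the compiler flags in use. */
const char *sse_level(void)
{
#if defined(__SSE4_2__)
    return "sse4.2";
#elif defined(__SSE4_1__)
    return "sse4.1";
#elif defined(__SSSE3__)
    return "ssse3";
#elif defined(__SSE3__)
    return "sse3";
#elif defined(__SSE2__)
    return "sse2";
#elif defined(__SSE__)
    return "sse";
#else
    return "none";
#endif
}
```

Compiled with -mssse3 this returns "ssse3". It only reflects what the compiler may emit, not what the CPU supports; for that, /proc/cpuinfo remains the place to look.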
/proc/cpuinfo shows that the machines in P1.27 support at most ssse3. Only found two machines which support sse4_1 (u016743 and u015425), but both had only gcc-4.2.
Luckily, the code comes with the compiled programs included. Result on u15425:
Integer Sum simd_v8a: SSE code, 8*VSIZE-way parallelism, reassociate:
0.16 cycles/element
Read the article Improving Particle Filter Performance Using SSE Instructions by Peter Djeu, Michael Quinlan and Peter Stone

February 15, 2010

Replaced the 2009 phase4 with the original phase4.

February 2, 2010

Reactivated computer system environment by adding binutils-2.17 and gcc-2.95.3 to .pkgrc. /home/cs_ai/pkg/ has to be added to PACKAGEPATH in .bashrc.

Previous Labbook

Labbook 2009