$Id$ Add standard deviation and other statistics calculations to "make stats" in results. Alternatively, we might report min, 1Q, median, 3Q, max, as standard deviation for non-normal distributions isn't always sensible. Add flags to various file-related benchmarks bw_file_rd, bw_mmap_rd, lat_fcntl.c, lat_fs, lat_mmap, and lat_pagefault, for parallelism which selects whether each instance has its own file or shares a file. Figure out how to improve lat_select. It doesn't really work for multi-processor systems. Linus suggests that we have each process do some amount of work, and vary the amount of work until context switch times for the producer degrade. The current architecture of lat_select is too synchronous and favors simple hand-off scheduling too much. From Linus. Look into threads vs. process scaling. benchmp currently uses separate processes (via fork()); some benchmarks such as page faults and VM mapping might have very different performance for threads vs. processes since Linux (at least) has per-memory space locks for many of these things. From Linus. Add a '-f' option to lat_ctx which causes the work to be floating point summation (so we get floating point state too). (Suggestion by Ingo Molnar) Add a threads benchmark suite (context switch, mutex, semaphore, ...). Create a new process for each measurement, rather than reusing the same process. This is mostly to get different page layouts and mostly impacts the memory latency benchmarks, although it can also affect lat_ctx. Write/extend the results processing system/scripts to graph/display/ process results in the "-P " dimension, and to properly handle results with differing parallelism when reporting standard results. The parallelism is stored in the results file as SYNC_MAX. Add "bw_udp" benchmark to measure UDP bandwidth [in progress] Make a bw_tcp mode that measures bandwidth for each block and graph that as offset/bandwidth. Make the disk stuff autosize such that you get the same number of data points regardless of disk size. Fix the getsummary to include simple calls. Think about the issues of int/long/long long/double/long double load/stores. Maybe do them all. This will (at least) require adding a test to scripts/build for the presence of long double on this system. Make all results print out bandwidths in powers of 10/sizes in powers of two. Documentation on porting. Check that file size is right in the benchmarking system. Compiler version info included in results. XXX - do this! memory store latency (complex) Why can't I just use the read one and make it write? Well, because the read one is list oriented and I need to figure out reasonable math for the write case. The read one is a load per statement whereas the write one will be more work, I think. RPC numbers reserved for the benchmark. Check all the error outputs and make sure they are consistent. On all the normalized graphs, make sure that they mean the same thing. I do not think that the bandwidth measurements are "correct" in this sense. Document the timing.c interfaces. Run the whole suite through gcc -Wall and fix all the errors. Also make sure that it compiles and has the right sizes for 64 bit OS. [Mon Jul 1 13:30:01 PDT 1996, after meeting w/ Kevin] Do the load latency like so loop: load r1 { increase the number of nops until they start to make the run time longer - the last one was the memory latency. } use the register { increase the number of nops until they start to make the run time longer - the last one was the cache fill shadow. } repeat Do the same thing w/ a varying number of loads (& their uses), showing the number of outstanding loads implemented to L1, L2, mem. Do hand made assembler to get accurate numbers. Provide C source that mimics the hand made assembler for new machines. Think about a report format for the hardware stuff that showed the numbers as triples L1/L2/mem (or quadruples for alphas).