Python Interpreters Benchmarks

26 Jan 2020, Sunday, 7:18 pm GMT

 How programs are measured
The Process
  1. Each program is run and measured at the smallest input value, program output redirected to a file and compared to expected output. As long as the output matches expected output, the program is then run and measured at the next larger input value until measurements have been made at every input value.
  2. If the program gives the expected output within an arbitrary cutoff time (120 seconds) the program is measured again (5 more times) with output redirected to /dev/null.
  3. If the program doesn't give the expected output within an arbitrary timeout (usually one hour) the program is forced to quit. If measurements at a smaller input value have been successful within an arbitrary cutoff time (120 seconds), the program is measured again (5 more times) at that smaller input value, with output redirected to /dev/null.
  4. The measurements shown on the website are either
    • within the arbitrary cutoff - the lowest time and highest memory use from 6 measurements
    • outside the arbitrary cutoff - the sole time and memory use measurement
  5. Of course, programs taking 4 or 5 hours were only measured once! (A rough sketch of the whole process is shown below.)
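
For the curious, here is a rough Python sketch of that process. It is not the actual measurement script; the argv, input_values, expected and out.log names are placeholders for illustration:

   import os, subprocess, time

   CUTOFF  = 120      # seconds: arbitrary cutoff for the 5 repeat measurements
   TIMEOUT = 3600     # seconds: arbitrary timeout before a program is forced to quit
   REPEATS = 5

   def run_once(argv, n, out):
      """Run the program once at input value n; return elapsed seconds."""
      t0 = time.time()
      subprocess.run(argv + [str(n)], stdout=out, timeout=TIMEOUT, check=True)
      return time.time() - t0

   def measure(argv, input_values, expected):
      results = {}
      for n in input_values:                              # smallest input value first
         try:
            with open("out.log", "wb") as log:            # output redirected to a file
               elapsed = run_once(argv, n, log)
         except subprocess.TimeoutExpired:
            break                                         # forced to quit: stop here
         if open("out.log", "rb").read() != expected(n):
            break                                         # wrong output: stop here
         times = [elapsed]
         if elapsed <= CUTOFF:                            # measure again, 5 more times
            with open(os.devnull, "wb") as devnull:
               times += [run_once(argv, n, devnull) for _ in range(REPEATS)]
         results[n] = min(times)                          # the lowest time is shown
      return results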
How do you measure Time-used?

Each program is run as a child-process of a Python script using Popen.

  • CPU secs: The script child-process usr+sys rusage time is taken using os.wait3
  • Elapsed secs: The time is taken before forking the child-process and after the child-process exits, using time.time()

Time measurements include program startup time - see ↓ What about Java® VM warm-up?
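
As a rough sketch (not the actual measurement script), the two times could be taken like this, starting the program with subprocess.Popen and reaping it with os.wait3:

   import os, subprocess, time

   def time_child(argv):
      t0 = time.time()                          # elapsed: taken before starting the child
      child = subprocess.Popen(argv)            # run the program as a child-process
      pid, status, rusage = os.wait3(0)         # block until the child-process exits
      elapsed = time.time() - t0                # elapsed: taken after the child exits
      cpu = rusage.ru_utime + rusage.ru_stime   # CPU secs: usr + sys rusage
      return cpu, elapsed, status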

On win32 -

How do you measure Memory-used?

By sampling GLIBTOP_PROC_MEM_RESIDENT for the program and its child processes every 0.2 seconds. Obviously those measurements are unlikely to be reliable for programs that run for less than 0.2 seconds.

On win32 - QueryInformationJobObject(hJob,JobObjectExtendedLimitInformation) PeakJobMemoryUsed
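
On Linux, a simplified stand-in for that sampling loop might look like the following. It reads /proc/<pid>/statm rather than GLIBTOP_PROC_MEM_RESIDENT and ignores child processes, so it only approximates what the real script does:

   import os, time

   def peak_resident_kb(pid, poll=0.2):
      page_kb = os.sysconf("SC_PAGE_SIZE") // 1024
      peak = 0
      while True:
         try:
            with open(f"/proc/{pid}/statm") as f:
               resident_pages = int(f.read().split()[1])   # resident set, in pages
         except FileNotFoundError:                         # the program has exited
            return peak
         peak = max(peak, resident_pages * page_kb)
         time.sleep(poll)      # programs shorter than 0.2s may never be sampled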

Why measure Memory-used?

Huge differences in default memory allocation don't necessarily mean there'll be huge differences in Memory-used for tasks that require memory to be allocated:

              Java         C          Free Pascal
n-body        13,996KB     320KB      8KB

Look at programs for the tasks that do require memory to be allocated:
                      Java         C           Free Pascal
k-nucleotide          494,040KB    153,452KB   128,396KB
reverse-complement    511,484KB    248,632KB   128,584KB
regex-dna             557,080KB    289,088KB   124,544KB
binary-trees          506,592KB     99,448KB    65,828KB
How do you measure Code-used?

We start with the source-code markup you can see, remove comments, remove duplicate whitespace characters, and then apply minimum GZip compression. The Code-used measurement is the size in bytes of that GZip compressed source-code file.
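
As a rough illustration in Python (the real script handles each language's comment syntax; C-style comments are assumed here):

   import gzip, re

   def code_used(source_text):
      # strip C-style comments - illustrative only
      no_comments = re.sub(r"/\*.*?\*/|//[^\n]*", "", source_text, flags=re.S)
      # remove duplicate whitespace characters
      collapsed = re.sub(r"\s+", " ", no_comments).strip()
      # minimum GZip compression; the size in bytes of the result is Code-used
      return len(gzip.compress(collapsed.encode(), compresslevel=1))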

Thanks to Brian Hurt for the idea of using size of compressed source code instead of lines of code.

(Note: There is some evidence that complexity metrics don't provide any more information than SLoC or LoC.)

How do you measure ≈ CPU Load?

The GTop cpu idle and GTop cpu total are taken before forking the child-process and after the child-process exits. The percentages represent the proportion of cpu not-idle to cpu total for each core.

On win32 - GetSystemTimes UserTime and IdleTime are taken before forking the child-process and after the child-process exits. The percentage represents the proportion of TotalUserTime to UserTime+IdleTime (because that's like the percentage you'll see in Task Manager).
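
On Linux, the per-core calculation could be sketched like this, reading /proc/stat directly rather than through GTop (an assumption made only to keep the example self-contained):

   def per_core_times():
      cores = []
      with open("/proc/stat") as f:
         for line in f:
            if line.startswith("cpu") and line[3].isdigit():   # cpu0, cpu1, ...
               fields = [int(x) for x in line.split()[1:]]
               idle = fields[3] + fields[4]                    # idle + iowait
               cores.append((idle, sum(fields)))
      return cores

   def cpu_load(before, after):
      loads = []
      for (idle0, total0), (idle1, total1) in zip(before, after):
         busy = (total1 - total0) - (idle1 - idle0)
         loads.append(round(100.0 * busy / (total1 - total0)))
      return " ".join(f"{p}%" for p in loads)                  # e.g. '27% 34% 28% 67%'

Take per_core_times() before forking the child-process and again after it exits, then pass both samples to cpu_load().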

What hardware and OS do you measure the programs on?

dual-core 2.5GHz Intel® i5-7200U® with 16GB of RAM and 250GB SSD disk drive.

Archlinux™ Kernel Linux 4.13.11-1-ARCH

 FAQs
Why don't you include language X?
Why don't you include my favorite implementation of language X?
Why don't you include Microsoft® Windows®?

Because I know it will take more time than I choose to donate. Been there; done that.

A couple of years ago, someone complained that publishing their own measurements wouldn't show their favorite language implementation on a website highly ranked by search engines. By now - if they had actually made measurements, and published and promoted them - their website would be highly ranked. But they did nothing.

AFAICT we all feel the same way about this: we should sit on our hands and wait for someone else to do the chores we don't wish to do. Of course, measurements of "proggit popular" language implementations like PyPy and LuaJIT will attract attention and be the basis of yet another successful website (unlike more Fortran or Ada or Pascal or Lisp).

If you're interested in something not shown on the benchmarks game website then please take the program source code and the measurement scripts and publish your own measurements.

The Python script "bencher" does repeated measurements of program cpu time, elapsed time, resident memory usage, and cpu load while a program is running, and summarizes those measurements - download bencher and unzip it into your ~ directory, check the requirements and recommendations, and read the license before use.

(As an alternative, you should take a look at these Python measurement scripts designed for statistically rigorous Java performance evaluation - JavaStats.)

Why don't you accept every program that gives the correct result?

We are trying to show the performance of various programming language implementations - so we ask that contributed programs not only give the correct result, but also use the same algorithm to calculate that result.

We do show one contest where you can use different algorithms - meteor-contest.

If the point was to compare different algorithms we'd need to do something like the Problem Based Benchmark Suite.

If the point was to broadly compare different programming platforms we'd need to do something like Plat_Forms - The web development platform comparison.

What about Java® VM warm-up?

Let's see how much, or how little, the time taken to invoke the JVM might contribute to the usual Java program times shown in the benchmarks game. Here are some additional (Intel® i5-7200U® dual-core) elapsed time measurements, taken after the Java programs started and before they exited.

In the first case (Cold), we simply started and measured the program 66 times; and then discarded the first measurement leaving 65 data points.

   public static void main(String[] args){
      for (int i=0; i<1; ++i){
         System.gc();
         long t1 = System.nanoTime();
         nbody.program_main(args);
         long t2 = System.nanoTime();
         System.err.println( String.format( "%.6f", (t2 - t1) * 1e-9 ) );
      }
   }

In the second case (Warmed), we started the program once and repeated the measurement 66 times without restarting the JVM; and then discarded the first measurement leaving 65 data points.

   public static void main(String[] args){
      for (int i=0; i<66; ++i){
         System.gc();
         long t1 = System.nanoTime();
         nbody.program_main(args);
         long t2 = System.nanoTime();
         System.err.println( String.format( "%.6f", (t2 - t1) * 1e-9 ) );
      }
   }

Compare these additional measurements against the usual Java program measurements shown in the benchmarks game --

"1.7.0_06" Java HotSpot(TM) 64-Bit Server VM
System.nanoTime()  1) Cold   2) Warmed   
  mean σ mean σ   usual
meteor contest   0.0118s 0.0007 0.0016s 0.0002 0.22s
fasta-redux   2.45s 0.00 2.32s 0.00 2.51s
spectral-norm   4.44s 0.02 4.20s 0.16 4.51s
pidigits   4.69s 0.09 4.44s 0.05 4.61s
fasta   5.07s 0.46 4.84s 0.02 5.13s
chameneos-redux   5.84s 0.46 5.70s 0.48 5.65s
mandelbrot   7.93s 0.23 7.99s 0.01 7.02s
k-nucleotide   8.09s 0.28  --   --  8.05s
regex-dna   8.65s 0.27  --   --  8.61s
binary-trees   10.54s 0.28 7.66s 0.16 9.08s
fannkuch-redux   16.89s 1.32 17.26s 0.10 17.38s
nbody   22.43s 0.00 22.41s 0.00 22.50s
binary-trees-redux   34.15s 0.39 33.93s 0.31 33.38s

The largest and most obvious effects of bytecode loading and dynamic optimization can be seen with the meteor-contest program which only runs for a fraction of a second.

What does N mean?

N means the value passed to the program on the command-line (or the value used to create the data file passed to the program on stdin). Larger N causes the program to do more work - mostly measurements are shown for the largest N, the largest workload.

Read ↑ How programs are measured

What does '27% 34% 28% 67%' ≈ CPU Load mean?

When the program was being measured: the first core was not-idle about 27% of the time, the second core was not-idle about 34% of the time, the third core was not-idle about 28% of the time, the fourth core was not-idle about 67% of the time.

When all the programs show ≈ CPU Load like this '0% 0% 0% 100%' you are probably looking at measurements of programs forced to use just one core - the fourth core (rather than being allowed to use any or all of the CPU cores).

Read ↑ How do you measure ≈ CPU Load?

What do #2 #3 mean?

Nothing - they are arbitrary suffixes that identify a specific program.

 How to contribute programs
How much effort should I put into getting the program correct?

Do design-iteration on your own computer, or in a language newsgroup. Only contribute programs which give correct results - diff the program output with the provided output file before you contribute the program.

How should I implement programs?

Prefer plain vanilla programs - after all we're trying to compare language implementations not programmer effort and skill. We'd like your programs to be easily viewable - so please format your code to fit in less than 80 columns (we don't measure lines-of-code!).

We are trying to show the performance of various programming language implementations - so we ask that contributed programs not only give the correct result, but also use the same algorithm to calculate that result.

We do show one contest where you can use different algorithms - meteor-contest.

How should I implement loops?

Don't manually unroll loops!

How should I implement data-input?

Programs are measured across a range of input-values; programs are expected to either take a single command-line parameter or read text from stdin.

(Look at what the other programs do.)

How should I implement data-output?

Programs should write to stdout. Program output is redirected to a log-file and diff'd with the expected output.

(Look at what the other programs do.)
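
Putting the data-input and data-output conventions together, the skeleton of a contributed program might look like this (Python shown; the work inside main is just a stand-in):

   import sys

   def main(n):
      total = sum(range(n))          # stand-in for the real benchmark work
      print("total", total)          # write results to stdout - they will be diff'd

   if __name__ == "__main__":
      main(int(sys.argv[1]))         # N is passed as a single command-line parameter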

How should I identify my program?

Include a header comment in the program like this:

/* The Computer Language Benchmarks Game
   http://benchmarksgame.alioth.debian.org/

   contributed by …
   modified by …
*/
Finally! Use the Tracker to contribute programs

Attach the full source-code file of a tested program. Please don't paste source-code into the description field. Please don't contribute patch-files.

The Tracker

  • Go to the "Play the Benchmarks Game" Tracker
  • Debian issue their own security certificate - your web browser will complain.
  • Click the "Log In" link at the top right of the page.
  • (If you don't have an Alioth account click the "[New Account]" link and create your account.)
  • Click the "Play the Benchmarks Game" "Submit New" link
  • You may only attach a program you have written. By attaching a program, you give permission for the program to be published under a Revised BSD License.
  • Now start from the bottom of the form and work your way up

Start from the bottom

  1. Attach the program source-code file - do this first because it's easy to forget.
  2. Say in the Detailed description how this program fixes an error or is faster or was missing or … Give us reasons to accept your program.
  3. Each Summary text must be unique! Follow this convention:
    language, benchmark, your-name, date, (version)
    Ruby nsieve Glenn Parker 2005-03-28
  4. Language Implementation: select the language implementation
  5. Task: select the benchmark
  6. Click the Submit button
How can I track what happens to the program I contributed?

You created an ↓ Alioth ID with a valid email address so you'll receive email updates when your program is accepted and measured.

 What…? Where…? Why…?

Please create an Alioth ID, login and ask your questions in the discussion forum.

Note: Debian issue their own security certificate - your web browser will complain.
