Python Interpreters Benchmarks

18 Jun 2020, Thursday, 7:29 pm GMT

 And please don't jump to conclusions!

"Since compiler optimizations and code changes also alter layout, it is currently impossible to distinguish the impact of an optimization from that of its layout effects. ... the performance impact of -03 over -02 optimzations is indistinguishable from random noise."
STABILIZER: Statistically Sound Performance Evaluation pdf

"The performance of a benchmark, even if it is derived from a real program, may not help to predict the performance of similar programs that have different hot spots."
Benchmarks are a crock

"It may seem paradoxical to use an interpreted language in a high-throughput environment, but we have found that the CPU time is rarely the limiting factor; the expressibility of the language means that most programs are small and spend most of their time in I/O and native run-time code."
Interpreting the Data: Parallel Analysis with Sawzall page 27 pdf

"We measure three specific areas of JavaScript runtime behavior: 1) functions and code; 2) heap-allocated objects and data; 3) events and handlers. We find that the benchmarks are not representative of many real websites and that conclusions reached from measuring the benchmarks may be misleading."
JSMeter: Characterizing Real-World Behavior of JavaScript Programs

"Would you believe us if we told you: “we can predict the benefit of our optimization, O, by evaluating it in one or a few experimental setups using a handful of benchmarks?” Again, you should not: we all know that computer systems are highly sensitive and there is no reason to believe that the improvement with O is actually due to O; it may be a result of a biased experimental setup."
Producing Wrong Data Without Doing Anything Obviously Wrong!

Your application is the ultimate benchmark
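
As a minimal sketch of that advice, in Python (the workload function below is only a placeholder - substitute the code path your own application actually spends its time in): time the real work repeatedly and look at the spread, not a single number, because run-to-run noise of the kind described above is real.

    import statistics
    import time

    def workload():
        # Placeholder: stands in for your application's real work.
        return sum(i * i for i in range(1_000_000))

    def measure(fn, repeats=20):
        # Collect several samples; a single timing hides the noise.
        samples = []
        for _ in range(repeats):
            start = time.perf_counter()
            fn()
            samples.append(time.perf_counter() - start)
        return samples

    samples = measure(workload)
    print(f"min    {min(samples):.4f}s")
    print(f"median {statistics.median(samples):.4f}s")
    print(f"stdev  {statistics.stdev(samples):.4f}s")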

"In order to find the optimal cost/benefit ratio, Wirth used a highly intuitive metric, the origin of which is unknown to me but that may very well be Wirth's own invention. He used the compiler's self-compilation speed as a measure of the compiler's quality. Considering that Wirth's compilers were written in the languages they compiled, and that compilers are substantial and non-trivial pieces of software in their own right, this introduced a highly practical benchmark that directly contested a compiler's complexity against its performance. Under the self compilation speed benchmark, only those optimizations were allowed to be incorporated into a compiler that accelerated it by so much that the intrinsic cost of the new code addition was fully compensated."

Oberon: The Overlooked Jewel pdf - Michael Franz, in L. Boszormenyi, J. Gutknecht, G. Pomberger, "The School of Niklaus Wirth", 2000.
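
As a hedged illustration of Wirth's criterion - the numbers below are invented - the acceptance test is a one-liner: an optimization earns its place only if the rebuilt compiler, compiling its own enlarged source, is still faster than before.

    def accept_optimization(baseline_s, optimized_s, extra_source_s):
        # baseline_s:     old compiler compiling its own source
        # optimized_s:    new compiler compiling that same source
        # extra_source_s: added time to compile the optimization's own code
        return optimized_s + extra_source_s < baseline_s

    # Hypothetical: a compiler made 10% faster, where compiling the
    # optimization's own code costs either 0.3s or 1.5s.
    print(accept_optimization(10.0, 9.0, 0.3))   # True  - pays for itself
    print(accept_optimization(10.0, 9.0, 1.5))   # False - new code costs too much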

 Apples and Oranges

Programming languages are compared against each other as though their designers intended them to be used for the exact same purpose - that just isn't so.

"Lua is a tiny and simple language, partly because it does not try to do what C is already good for, such as sheer performance, low-level operations, or interface with third-party software. Lua relies on C for those tasks."
Programming in Lua, preface

"Most (all?) large systems developed using Erlang make heavy use of C for low-level code, leaving Erlang to manage the parts which tend to be complex in other languages, like controlling systems spread across several machines and implementing complex protocol logic."
Frequently Asked Questions about Erlang

The difficulty is that programming languages (and programming language implementations) are more different than apples and oranges, but the question is still asked - Will my program be faster if I write it in language X? - and there's still a wish for a simpler answer than - It depends how you write it!

 Programmer skill and effort

No attempt has been made to assess whether programs contributed for a particular interpreter were consistently the work of more highly skilled programmers than programs contributed for other interpreters. The source code comments show that some programs were contributed by core developers of the interpreter implementation - world-renowned expert programmers. It's fair to say that other programs were contributed by less skilled programmers.

No attempt has been made to assess whether programs contributed for a particular language were consistently worked on longer and harder than programs contributed for other languages. As a very crude indication of how that may vary, look at which languages have had more programs contributed and which have had fewer (relative to an even split).

 A very modest resource

Quite unexpectedly, some of the tasks shown on the benchmarks game website have become a very modest resource for programming language researchers. For example -

Having a collection of programs that implement the same tasks in different programming languages is at least convenient. Presumably convenience is why Bjarne Stroustrup refers to the benchmarks game (to support his point about run-time typing) in "Software Development for Infrastructure" pdf, IEEE Computer, January 2012.

Several of the tasks had already been used as benchmarks and were adopted almost unchanged for the benchmarks game, for example - fannkuch (now fannkuch-redux) and binary-trees. Similarly, many of the tasks from the benchmarks game have been adopted by other projects - for example, 10 of the 26 WebKit SunSpider JavaScript tests.

 A Decade in the Making

Once upon a time, Doug Bagley had a burst of crazy curiosity -

"When I started this project, my goal was to compare all the major scripting languages. Then I started adding in some compiled languages for comparison ... and it's still growing with no end in sight. I'm doing it so that I can learn about new languages, compare them in various (possibly meaningless) ways, and most importantly, have some fun. ... By the way, the word Great in the title refers to quantity, not quality (I will let the reader judge that)."

The Internet Archive preserves some of those web pages abandoned in early 2002 -

"Hi, the shootout is an unfinished project. I've decided to discontinue updates to it for now while I work on some other things. Thanks for everyone's help."

Aldo Calpini ported Doug Bagley's "Great Computer Language Shootout" to Win32 but there have been no updates since 2003.

2004 saw another revival, this time by Brent Fulgham -

"The project goals have not changed substantially since Doug's original project. This work is continuing so that we all can learn about new languages, compare them in various (possibly meaningless) ways, and most importantly, have some fun!"

Brent Fulgham chose to host the project at alioth.debian.org (like sourceforge or savannah but a service for Debian Developers). Sometimes that seemed to confuse people into thinking "the Great Computer Language Shootout" was an integral part of Debian rather than a tiny independent project.

Although Brent Fulgham started off with those same programs from Doug Bagley's Great Computer Language Shootout, new benchmark tasks were created and those Doug Bagley programs were replaced during 2005. Isaac Gouy designed a new website, started making performance measurements on Gentoo Linux on an Intel® Pentium® 4 machine, and took on the day-to-day admin work.

The Virginia Tech shooting in April 2007 once again pushed gun violence into the media headlines. There was no wish to be associated with or to trivialise the slaughter behind the phrase shootout, so the project was renamed on 20th April 2007 - The Computer Language Benchmarks Game.

By 2008 Brent had stopped updating the Debian Linux AMD™ Sempron™ measurements and moved on to other projects.

 x64 and multi-core

By 2008 new computer hardware was commonly multi-core; by 2008 operating systems were often 64-bit - so the benchmarks game was moved to a new dual-core Intel® i5-4210U machine.

That was a lot more work.

 Fresh! Fresh! Fresh!

Since September 2008 new measurements have been recorded and published more than 1000 times. Fresh measurements are usually published several times a week in response to some trigger event - a new program was contributed, a new version of a programming language implementation was released, or a new version of Ubuntu was released.

Although up-to-date content has been a priority and new measurements are made soon after new versions of a programming language implementation are released - that isn't enough.

Just like any other long-running software project, some of the measured programs will be legacy code that hasn't been updated to take advantage of the latest improvements in the language implementation. Just like any other long-running software project, there's a time lag between the release of improvements in a language implementation, understanding how those improvements can be used, and overcoming community fatigue to contribute programs that use them.

  • Unique Visitors, October 2010 through September 2011 - 319,829

  • Unique Visitors, October 2011 through September 2012 - 340,917

  • Unique Visitors, October 2012 through September 2013 - 337,231

Revised BSD license
