There's a sudden swarm of compute intensive projects being written in Python instead of C and its descendants. Specifically, while in performance sensitive, latency sensitive projects C(++) is still king, performance sensitive and throughput sensitive projects are suddenly being written in a language where for loops are 100x slower. The core example is of course neural network training. What's going on? Basically, python is both easier to learn for novices, less bug prone for experts, and has a faster interpreter than C++.
C++ interpreter? Well of course, C is faster than python. But C++, at least in its earliest implementations, is a turing complete interpreted language that outputs C code. Python has a faster interpreter than C++ template metaprograms, which are python's primary competition.
Why every project ends up having a metalanguage and an inner language
What is the fastest way to add 4 numbers in a function? In our dreams, it's
int sum = 0;
for(int i = 0; i < number_elements; i++) {
sum += *(++array);
}
If number_elements is known at compile time, we really want the compiled program to be
int sum = *(array)
sum += *(array + 1)
sum += *(array + 2)
sum += *(array + 3)
How do we get that? C++ has a wonderful answer. As long as number_elements is a constexpr, the lines of code that compute it are
Structure of a High Performance latency insensitive codebase
Typical C Structure:
Metalanguage produces program written in inner language. inner program is compiled inner program processes input metadata inner program processes input and produces output
Python Structure:
Metalanguage processes input metadata Metalanguage produces program written in inner language inner program is compiled inner program processes input and produces output
Language diversity vs metalanguage diversity
High performance C/C++ is many interpreted metalanguages targetting a single high performance compiled language. C++, CMake, m4, autoconf and friends are all metalanguages that produce C.
High performance python is a single interpreted metalanguage, Python, targetting many high performance compiled languages like torch, jax, tensorflow, sql, and z3. It turns out that
Turing completeness
For maintainability, you don't really want your inner program and your metaprogram to both have complex control flow. Traditionally when writing C or C descendants, this has been managed by picking a weak metalanguage like macro substitution, or even just by reprimanding Jimmy from MIT when he submits a pull request that's calculating primes at compile time using SFINAE.