In reply to @apropos "makes sense. it seems": Data parallelism (parallel for loops) is kind of solved by polyhedral compilers, but it's really a pain to implement. I tried. Basically you take your loop bounds, create constraints modeling read-after-write and write-after-read dependencies to preserve the significant ordering, then model scheduling as constraint programming or integer linear programming, and you get a list of valid schedules. Then you need to optimize the schedule so data is reused and L1/L2/TLB cache misses are limited, and by that point you're way too deep in the rabbit hole because no one but someone with a PhD can maintain that stuff
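To make the dependence part concrete, here's a toy brute-force sketch (hypothetical names, nothing from a real polyhedral framework) of the check those constraints encode: two distinct iterations conflict if one writes a cell the other reads or writes. A real polyhedral compiler answers this symbolically with ILP over the loop bounds instead of enumerating iterations:

```python
def is_parallel(writes, reads, n):
    """True if no pair of distinct iterations in [0, n) carries a dependence.

    `writes(i)` / `reads(i)` return the memory cells iteration i touches,
    as (array_name, index) pairs. Brute force: check every ordered pair.
    """
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            wi, wj = set(writes(i)), set(writes(j))
            rj = set(reads(j))
            # write of i overlapping a read of j covers both RAW and WAR
            # (the (j, i) pair catches the other direction); wi & wj is WAW
            if wi & rj or wi & wj:
                return False
    return True

# A[i] = B[i]: each iteration touches its own cell -> parallel for is safe
print(is_parallel(lambda i: [("A", i)], lambda i: [("B", i)], 8))      # True
# A[i] = A[i-1]: iteration i reads what i-1 wrote -> must stay sequential
print(is_parallel(lambda i: [("A", i)], lambda i: [("A", i - 1)], 8))  # False
```

The symbolic version replaces the nested loops with an ILP feasibility question ("does there exist i != j with write(i) == read(j) inside the bounds?"), which is where the constraint-programming machinery comes in.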