In reply to @apropos "makes sense. it seems": Data parallelism (parallel for loops) is kind of solved by polyhedral compilers, but it's really a pain to implement. I tried. Basically you take your loop bounds, create constraints modeling read-after-write and write-after-read dependencies to preserve the significant ordering, then model scheduling as constraint programming or integer linear programming, and you get a list of valid schedules. Then you need to optimize the schedule so data is reused and L1/L2/TLB cache misses are limited, and by that point you're way too deep in the rabbit hole because no one but someone with a PhD can maintain that stuff
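To make the dependence part concrete, here's a toy brute-force sketch (hypothetical names, nothing from a real polyhedral framework) of the check those constraints encode: two distinct iterations conflict if one writes a cell the other reads or writes. A real polyhedral compiler answers this symbolically with ILP over the loop bounds instead of enumerating iterations:

```python
def is_parallel(writes, reads, n):
    """True if no pair of distinct iterations in [0, n) carries a dependence.

    `writes(i)` / `reads(i)` return the memory cells iteration i touches,
    as (array_name, index) pairs. Brute force: check every ordered pair.
    """
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            wi, wj = set(writes(i)), set(writes(j))
            rj = set(reads(j))
            # write of i overlapping a read of j covers both RAW and WAR
            # (the (j, i) pair catches the other direction); wi & wj is WAW
            if wi & rj or wi & wj:
                return False
    return True

# A[i] = B[i]: each iteration touches its own cell -> parallel for is safe
print(is_parallel(lambda i: [("A", i)], lambda i: [("B", i)], 8))      # True
# A[i] = A[i-1]: iteration i reads what i-1 wrote -> must stay sequential
print(is_parallel(lambda i: [("A", i)], lambda i: [("A", i - 1)], 8))  # False
```

The symbolic version replaces the nested loops with an ILP feasibility question ("does there exist i != j with write(i) == read(j) inside the bounds?"), which is where the constraint-programming machinery comes in.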