Part 2: The Truth About GPU Optimization, Why “GPU Support” Isn’t Always What It Seems

Published on October 6, 2025
In Part 1, we looked at why GPUs are so well suited to hard, time-sensitive optimization problems. But there’s an important caveat: not every solver can simply bolt a GPU on and expect breakthrough results.
In his recent keynote at GTC Paris, Jensen Huang stated at the start of his talk: “What makes accelerated computing special, is it’s not just a new processor that you compile software to. You have to reformulate how you do computing. You have to reformulate your algorithm. And it turns out to be incredibly hard for people to reformulate software and algorithms to be highly parallelized.”
That is the crux of the problem in optimization. Solvers already decompose problems into steps, but CPUs and GPUs execute those steps very differently. Some operations can run across thousands of data elements at once, while others must run in strict sequence, like the way potatoes must be peeled before they’re chopped. This is Amdahl’s Law in plain terms: the sequential parts set the pace no matter how much parallel horsepower you add, and too much coordination overhead can even slow the system down. Because of this, you cannot just “port” a CPU solver to a GPU and expect dramatic gains.
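To make that concrete, here is a back-of-the-envelope illustration in plain Python. The 30% sequential fraction is a made-up number chosen purely to show the shape of the curve: once the sequential part dominates, adding more parallel workers barely moves the needle.
```python
# Amdahl's Law: overall speedup = 1 / (s + (1 - s) / n),
# where s is the fraction of work that must run sequentially
# and n is the number of parallel workers.

def amdahl_speedup(sequential_fraction: float, workers: int) -> float:
    return 1.0 / (sequential_fraction + (1.0 - sequential_fraction) / workers)

# Illustrative numbers only: if 30% of a solve is inherently sequential,
# no amount of parallel hardware can push the overall speedup past ~3.3x.
for n in (8, 128, 4096):
    print(f"{n:>5} workers: {amdahl_speedup(0.30, n):.2f}x speedup")
#     8 workers: 2.58x speedup
#   128 workers: 3.27x speedup
#  4096 workers: 3.33x speedup
```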
This is also why “GPU support” can be misleading. In many cases it means a narrow slice of code has been offloaded to the GPU, often the matrix work inside the linear programming relaxation. The core search through the combinatorial space — branch and bound, branch and cut, constraint propagation — still runs sequentially on the CPU. For large problems, that sequential backbone dominates runtime, so the headline acceleration rarely appears, and shuttling many small tasks between CPU and GPU can even make things slower.
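In code terms, that pattern often looks something like the sketch below. This is not any particular solver’s implementation, and every function in it is a hypothetical placeholder; the point is simply that the GPU accelerates one step inside a loop whose control flow remains fundamentally sequential.
```python
# Schematic branch-and-bound loop with "GPU support": only the LP
# relaxation is offloaded, while node selection, pruning, and branching
# remain a one-node-at-a-time CPU loop that sets the overall pace.
# All functions and methods here are hypothetical placeholders.

def branch_and_bound(root, incumbent=float("inf")):
    stack = [root]                                  # CPU: sequential work list
    while stack:
        node = stack.pop()                          # CPU: pick the next node
        bound = solve_lp_relaxation_on_gpu(node)    # GPU: the offloaded slice
        if bound >= incumbent:                      # CPU: pruning decision
            continue
        if node.is_integer_feasible(bound):
            incumbent = min(incumbent, bound)       # CPU: update best solution
        else:
            stack.extend(branch(node, bound))       # CPU: create child nodes
    return incumbent
```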
Coordination and orchestration are therefore as important as raw parallelism. The solver needs dynamic scheduling so threads abandon unpromising regions early, load balancing so no set of workers goes idle, and asynchronous pipelines so new work starts the instant data is ready rather than waiting for fixed checkpoints. Getting kernel sequencing, memory movement, and early-exit logic right is key. Done well, this can even yield superlinear scaling because wasted work is pruned quickly and resources are immediately redirected to the most promising parts of the search.
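As a rough illustration of that orchestration idea, here is a minimal sketch built on Python’s standard concurrent.futures. The explore_region function is a hypothetical placeholder, and a real solver would also push the tightened cutoff back into workers that are already running; the sketch only shows the basic pattern of consuming results the instant they arrive and pruning early.
```python
# Minimal sketch of asynchronous orchestration with early exit:
# results are consumed the moment they are ready (no fixed checkpoints),
# unpromising regions are discarded immediately, and the best-known
# cutoff tightens as soon as better solutions appear.
# explore_region is a hypothetical placeholder, not a real API.
from concurrent.futures import ThreadPoolExecutor, as_completed

def parallel_search(regions, cutoff):
    best = cutoff
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = [pool.submit(explore_region, r, best) for r in regions]
        for done in as_completed(futures):   # react as each result lands
            bound, value = done.result()
            if bound >= best:                # early exit: this region can't win
                continue
            best = min(best, value)          # tighten the cutoff right away
    return best
```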
To get the most out of GPUs, you must re-engineer for parallelism. That means rethinking neighborhood exploration, candidate evaluation, and how data is laid out for efficient GPU memory access. NVIDIA cuOpt takes this approach, using parallel heuristics and metaheuristics to explore thousands of candidates at once. That is why it shines on vehicle routing problems such as the Traveling Salesman Problem (TSP) and the broader Vehicle Routing Problem (VRP) family, where exploring more options faster directly improves routes, costs, and service times — and why these gains matter in real logistics.
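To give a feel for the data-parallel mindset, here is a small NumPy sketch (not cuOpt code) that scores thousands of candidate tours in one vectorized pass instead of one at a time; a GPU kernel exploits the same pattern at far larger scale.
```python
# NumPy sketch of data-parallel candidate evaluation for a TSP-style
# problem: lay candidate tours out as a dense array and score them all
# in a single vectorized pass. Not cuOpt code; illustrative only.
import numpy as np

rng = np.random.default_rng(0)
n_cities, n_candidates = 50, 10_000
coords = rng.random((n_cities, 2))
dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)

# Each row is one candidate tour (a random permutation of the cities).
candidates = np.stack([rng.permutation(n_cities) for _ in range(n_candidates)])

# Score every candidate at once: sum of leg lengths, wrapping back to the start.
legs = dist[candidates, np.roll(candidates, -1, axis=1)]
tour_lengths = legs.sum(axis=1)

best_tour = candidates[np.argmin(tour_lengths)]
print(f"best of {n_candidates:,} candidate tours: {tour_lengths.min():.3f}")
```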
Accelerating and Broadening cuOpt with SimpleRose
At SimpleRose, we’re aiming to be the fastest solver on the planet, no matter the problem type. We recognize that in order to get there we need to go well beyond just building on top of cuOpt. That’s why we’re innovating with novel approaches to parallelism.
For problem classes that benefit from specific forms of data parallelism, like PDLP (a first-order primal–dual method for linear programming) and heuristic MIP algorithms, cuOpt already excels. We leverage cuOpt in those scenarios, but we’re also extending its capabilities by combining task parallelism with new data parallelization strategies to drive significantly better overall performance.
In traditional task parallelism, the MIP solver processes distinct tasks independently. We’ve been able to do this for some time. In fact, we’ve tested real-world problems using hundreds of machines in parallel to solve MIPs, achieving not only linear speedups but in some cases even superlinear speedups. That’s because of our ability to embrace asynchronous parallelism so that nodes can be processed opportunistically and cuts can be applied as soon as they’re ready. In our next article, we’ll dive deeper into this approach, illustrating how it eliminates idle waiting time and maximizes CPU utilization, drastically reducing solve times compared to deterministic solvers.
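To give a flavor of what that looks like in code, here is a conceptual sketch (not our production solver) of workers that pick up nodes opportunistically and publish cuts to a shared pool the moment they are found, with no synchronization barrier. solve_node and generate_cuts are hypothetical placeholders.
```python
# Conceptual sketch of asynchronous task parallelism in branch and bound:
# workers pull nodes from a shared queue, pick up whatever cuts already
# exist, and publish new cuts immediately so nobody waits on a barrier.
# solve_node and generate_cuts are hypothetical placeholders.
import queue
import threading

node_queue = queue.Queue()
cut_pool, pool_lock = [], threading.Lock()

def worker():
    while True:
        node = node_queue.get()
        if node is None:                     # shutdown signal
            return
        with pool_lock:
            active_cuts = list(cut_pool)     # use whatever cuts exist right now
        result = solve_node(node, active_cuts)
        with pool_lock:
            cut_pool.extend(generate_cuts(result))   # publish cuts immediately
        for child in result.children:
            node_queue.put(child)            # new work is available instantly
```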
We’re also working on novel data parallelization techniques. One such technique uses cuOpt’s PDLP to solve LPs quickly at low accuracy, then hands the result to a CPU Simplex solver for higher accuracy, while cuOpt’s heuristic solver helps prune branches in branch-and-bound as our solver and cuOpt run in parallel.
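Below is a high-level sketch of how such a pipeline could be wired together. None of the function names are real cuOpt or SimpleRose APIs; they simply stand in for the three roles described above.
```python
# High-level sketch only: gpu_pdlp_solve, gpu_heuristic_bound, and
# cpu_simplex_polish are stand-ins, not real cuOpt or SimpleRose APIs.
# A fast low-accuracy GPU LP solve and a GPU heuristic bound run in
# parallel; if the node survives pruning, a CPU simplex solver polishes
# the approximate solution to full accuracy.
from concurrent.futures import ThreadPoolExecutor

def solve_node_lp(model):
    with ThreadPoolExecutor() as pool:
        approx = pool.submit(gpu_pdlp_solve, model, tolerance=1e-4)   # fast, low accuracy
        bound = pool.submit(gpu_heuristic_bound, model)               # cheap pruning signal
        if bound.result() >= model.incumbent:
            return None                      # pruned: the approximate LP result is discarded
        return cpu_simplex_polish(model, warm_start=approx.result())  # high-accuracy refinement
```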
By designing the solver to handle tasks asynchronously, meaning processing each piece as soon as it’s ready instead of waiting for a fixed sequence, we can maximize parallelism across CPUs and GPUs. This delivers faster, more efficient results while letting each type of hardware and algorithm focus on what it does best.
What we’ve outlined here is undoubtedly ambitious, but we believe it’s worth the effort! We’ve been at it for 6 years, and are already seeing incredible results in our early testing across a variety of problem types. This is why we believe we can truly revolutionize optimization across industries.
Want to be part of our early access program? Contact us today!


