Part 4: Where GPUs Really Speed Up Optimization: Targeted Acceleration in Mathematical Solvers


Published on October 20, 2025 by Brian Schaefer

GPUs have worn many hats. They’ve powered ultra-realistic video games, fueled the cryptocurrency mining boom, and, more recently, trained deep learning models with billions of parameters. But at SimpleRose, we’ve been asking a different question: where can GPUs make a practical difference in mathematical optimization? It turns out the answer isn’t “everywhere.” By focusing only on the parts of the solver where GPUs’ massive parallelism actually moves the needle, we avoid wasted computation, reduce overhead, and keep the rest of the pipeline running at peak efficiency.

It’s tempting to think you can just drop a GPU into the equation and make every optimization problem faster. In reality, many steps in solving an optimization model are inherently sequential or involve irregular data patterns that GPUs don’t handle well. By identifying the tasks that do map well to thousands of parallel threads, we get the benefits of GPU speed without introducing bottlenecks elsewhere. This philosophy drives every GPU-related feature we’ve explored.

In this post, we’ll take you behind the scenes of our experiments, prototypes, and pilot results, showing exactly where GPU acceleration earns its keep, and where the CPU still rules.

1. LP Acceleration with PDLP + Crossover

One of our most promising targeted uses is a hybrid LP solving approach. We start with NVIDIA cuOpt’s PDLP solver on the GPU. PDLP is a first-order method that is blisteringly fast at finding a low-accuracy feasible (but not necessarily optimal) solution for massive LPs. We then cross over to our CPU-based simplex solver, which refines that solution to optimality with high accuracy.
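To make the division of labor concrete, here is a minimal numpy sketch of the PDHG iteration that PDLP builds on. This is an illustration, not cuOpt’s implementation: the real solver adds restarts, preconditioning, and adaptive step sizes, and runs these operations on the GPU. The point is that each iteration is nothing but matrix–vector products and elementwise projections, which parallelize almost perfectly.

```python
import numpy as np

# Minimal PDHG iteration for the LP  min c^T x  s.t.  A x <= b,  x >= 0.
# Toy data: maximize x1 + 2*x2 subject to x1 + x2 <= 3 and x1 <= 2.
c = np.array([-1.0, -2.0])            # objective coefficients (minimizing)
A = np.array([[1.0, 1.0],
              [1.0, 0.0]])            # constraint matrix
b = np.array([3.0, 2.0])              # right-hand side

step = 0.9 / np.linalg.norm(A, 2)     # stability requires tau*sigma*||A||^2 < 1
x = np.zeros(2)                       # primal iterate
y = np.zeros(2)                       # dual iterate

for _ in range(5000):
    # Primal gradient step, then projection onto x >= 0.
    x_new = np.maximum(x - step * (c + A.T @ y), 0.0)
    # Dual step with extrapolation, then projection onto y >= 0.
    y = np.maximum(y + step * (A @ (2 * x_new - x) - b), 0.0)
    x = x_new

print(x.round(3))   # approaches the optimum (0, 3); crossover would then
                    # snap this low-accuracy point to an exact vertex
```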

This works with our philosophy because PDLP’s heavy matrix–vector operations are perfect for GPUs, while the crossover step leverages the CPU’s efficiency for fine-grained, sequential refinements. We avoid running simplex entirely on the GPU, which would waste cycles and memory bandwidth on work that doesn’t parallelize well. In some pilots, we’ve even run PDLP and simplex concurrently, and whichever finishes first wins. It’s not about chasing GPU usage for its own sake. It’s about winning the wall-clock race.
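The racing pattern itself is simple. The sketch below uses hypothetical placeholder solver functions (these are not real SimpleRose or cuOpt entry points) to show the “first one back wins” control flow:

```python
import concurrent.futures as cf
import time

def solve_with_pdlp_gpu(model):
    time.sleep(0.2)                   # stand-in for a GPU PDLP + crossover run
    return ("pdlp+crossover", 42.0)

def solve_with_simplex_cpu(model):
    time.sleep(0.5)                   # stand-in for a pure CPU simplex run
    return ("simplex", 42.0)

model = object()                      # placeholder for a loaded LP instance

with cf.ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(solve_with_pdlp_gpu, model),
               pool.submit(solve_with_simplex_cpu, model)]
    done, not_done = cf.wait(futures, return_when=cf.FIRST_COMPLETED)
    winner, objective = next(iter(done)).result()
    for f in not_done:
        f.cancel()                    # best effort; a real racer would signal
                                      # the losing solver to stop early

print(f"winner: {winner}, objective: {objective}")
```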

2. Parallel Heuristics for Faster MILP Search

MILP branch-and-bound search starts out highly sequential, and finding good incumbent solutions early can dramatically shorten solve times. This makes it a perfect opportunity for targeted GPU use. While the Rose CPU solver runs branch-and-bound as usual, a parallel GPU-based heuristic search (via cuOpt) hunts for better feasible solutions. When the GPU finds a new incumbent, it sends it back to the CPU solver, tightening bounds and pruning more of the search tree. In certain MIP instances, identifying structural features of the problem for generating cutting planes can also be done in parallel; GPUs can accelerate that process, feeding the CPU solver more cuts in less time.
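The hand-off pattern looks roughly like the sketch below. The names and the queue-based plumbing are illustrative, not the Rose solver’s real internals; the heuristic worker stands in for a GPU search that would run many randomized attempts in parallel.

```python
import queue
import threading

incumbents: queue.Queue = queue.Queue()

def heuristic_worker():
    # A GPU heuristic would launch thousands of attempts in parallel;
    # here we simply emit a few dummy feasible objective values.
    for value in (110.0, 104.0, 101.5):
        incumbents.put(value)

def branch_and_bound(node_bounds):
    best = float("inf")               # best known feasible objective (minimizing)
    for bound in sorted(node_bounds): # process nodes in best-bound order
        while not incumbents.empty(): # absorb any incumbents from the heuristic
            best = min(best, incumbents.get())
        if bound >= best:
            break                     # pruned: this node cannot beat the incumbent
        # ... otherwise branch on the node as usual ...
    return best

worker = threading.Thread(target=heuristic_worker)
worker.start()
worker.join()   # joined early so the demo is deterministic; in the real
                # pipeline the heuristic and the tree search run concurrently
print(branch_and_bound([120.0, 103.0, 99.0]))   # prints 101.5
```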

This works with our philosophy because the GPU is used only for the part of the MILP process that is inherently parallelizable: running many heuristic attempts at once. We don’t try to run the whole branch-and-bound on the GPU, avoiding the mismatch between GPU strengths and branching logic.

3. Presolve

Presolve is the solver’s “tidy-up” phase, simplifying the problem before the heavy solving begins. Some presolve passes, such as bound tightening, coefficient scanning, symmetry exploitation, and redundant-constraint detection, are data-parallel and can benefit from GPU execution. Others are too conditional to map efficiently to GPU threads. Our presolve strategy evaluates which tasks to offload and which to keep on CPUs, ensuring we’re not spending GPU cycles where they won’t pay dividends in reduced solve time.
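As an example of what “data-parallel” means here, the sketch below vectorizes one simplified bound-tightening pass for constraints Ax ≤ b with box bounds l ≤ x ≤ u. A full presolve also handles negative coefficients, lower bounds, and fixed variables; this sketch only tightens upper bounds, but every (constraint, variable) pair is independent, which is exactly the shape of work that maps onto thousands of GPU threads (via a drop-in GPU array library such as CuPy).

```python
import numpy as np   # the same pass runs on GPU with a compatible array library

# Toy problem: x1 + 2*x2 <= 4 and 3*x1 + x2 <= 9 with 0 <= x <= 10.
A = np.array([[1.0, 2.0],
              [3.0, 1.0]])
b = np.array([4.0, 9.0])
l = np.zeros(2)
u = np.full(2, 10.0)

# Minimum possible contribution of each term A_ij * x_j over [l_j, u_j].
term_min = np.minimum(A * l, A * u)              # shape (m, n)
row_min = term_min.sum(axis=1, keepdims=True)    # minimum activity per row
residual = b[:, None] - (row_min - term_min)     # slack left for x_j alone

safe_A = np.where(A > 0, A, 1.0)                 # avoid division by zero
implied = np.where(A > 0, residual / safe_A, np.inf)  # only A_ij > 0 caps u_j

u_new = np.minimum(u, implied.min(axis=0))       # tightest bound per variable
print(u_new)    # upper bounds tighten from [10, 10] to [3, 2]
```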

We’re also leveraging NVIDIA’s latest high-performance computing hardware, which supports NVLink connections between CPU and GPU. NVLink enables aggregate bidirectional bandwidth exceeding 1,000 GB/sec on supported systems — a massive leap from the roughly 64 GB/sec available with PCIe. This ultra-fast interconnect reduces data transfer bottlenecks, making GPU acceleration practical for more parts of the solver pipeline. Because our approach selectively offloads only the most parallelizable tasks to the GPU, rapid CPU–GPU communication ensures we can hand off and retrieve data with minimal overhead, keeping the entire process efficient.
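To put those bandwidth numbers in perspective, a quick back-of-the-envelope calculation (illustrative arithmetic, not a benchmark) shows what they mean for handing a large constraint matrix between CPU and GPU:

```python
# Transfer time for a 16 GB problem at the interconnect speeds quoted above.
data_gb = 16
for name, gb_per_s in [("PCIe (~64 GB/s)", 64), ("NVLink (1,000+ GB/s)", 1000)]:
    print(f"{name}: {data_gb / gb_per_s * 1000:.0f} ms")
# PCIe:   250 ms per hand-off
# NVLink:  16 ms per hand-off
```

At 250 ms per hand-off, the transfer can dominate a fast presolve pass or heuristic round-trip; at 16 ms, it is close to free, which is what makes frequent, fine-grained offloading viable.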

At SimpleRose, we use GPUs where they actually matter and let CPUs do the rest. It’s not about chasing hardware trends. It’s about combining the best of both worlds to find the fastest, most efficient path to the best solution.