From “One Hour” to “Reasonable”: A Spectral Trick for Accelerating Repeated Linear Solves
A friend asked me whether GPU acceleration could speed up a Python simulation that took roughly an hour to run. Before going for CUDA, JIT compilers, or rewriting everything, we looked for the true bottleneck. It turned out the code was repeatedly solving the same parameterized linear system thousands of times. That structure admits a classic spectral (eigenmode) optimization: diagonalize once, then evaluate the resolvent for many frequencies using cheap elementwise operations.
This post explains the math behind the equivalence between the original direct-solve implementation and the optimized spectral method, includes pseudocode to map to the actual code, and investigates why “GPU” and “JIT” would likely not be the best first moves for this problem.
How this started: “can we just use a GPU?”
The other day, a friend approached me with a piece of code that was "too slow" for his liking. It was taking about an hour to run a single simulation, which made him ask me: "can we use any form of GPU acceleration here?"
I've worked with code that needed to run for days, so my first thought was that an hour was not automatically horrible. I don't know exactly what he uses this code for, but I understand that an hour becomes a problem when you're sweeping parameters, running optimizations, or doing uncertainty quantification.
Before opening the code, I told him:
- If the bottleneck is dense linear algebra, GPUs can help. If the bottleneck is Python overhead, memory traffic, or functions with no GPU implementation, GPUs won't help much.
- Moving arrays to the GPU, managing kernels, and reshaping code for GPU-friendly execution can be hard.
- If the runtime is spent in Python loops doing simple arithmetic, tools like Numba can produce near-C speed. If the runtime is already inside optimized BLAS/LAPACK calls, JIT will not help much.
The structure of the problem
The code models electrostatic response of graphene ribbons, computes induced charge densities,
and then computes induced fields by integrating over a momentum-like parameter q and frequency ω.
To be honest, I did not understand much of the physics behind the code. It involved Bessel/Struve functions, discretized operators, and multiple nested loops. But the core computation was clear: solving the same matrix problem over and over inside a triple nested loop, changing only a scalar prefactor that depended on the Fermi energy and the photon energy.
The repeated linear system
Fix a wavevector sample q (loop index i) and a Fermi energy Ef (loop index dd).
For each photon energy ħω (loop index j), both scripts compute the same total electrostatic potential vector φtot(q, ω) from the linear system:

$(I - \mathit{EE}_\omega A(q))\, \varphi_{\text{tot}}(q, \omega) = \varphi_{\text{ext}}(q)$

Where:
- I is the identity matrix of size n_tot × n_tot, with n_tot = number_of_ribbons · nn.
- φext(q) is the external potential sampled on the ribbon grid (flattened). In code: phi_tilde_ext_C_0[i].flatten().
- EEω is a scalar prefactor depending on Ef and ω: EEω = −(α ħc Ef)/(π εeff W[0]) · 1/((ħω)(ħω + iΓ)).
- A(q) is the q-dependent matrix (your VV_D_matrix_array[i]), conceptually A(q) ≡ V(q) D(q), where V(q) is the Coulomb operator and D(q) is the finite-difference operator. (With number_of_ribbons = 1, this is just an nn × nn matrix.)
Key point: this linear system is the mathematical object both scripts solve; they only differ in the numerical method used.
Method 1 (slow): direct solve with np.linalg.solve
The original script forms $M(q, \omega) = \mathit{EE}_\omega A(q)$, and then solves: $\varphi_{\text{tot}}(q, \omega) = \text{solve}(I - M(q, \omega), \varphi_{\text{ext}}(q))$.
```python
MM_matrix = EE_omega * VV_D_matrix_array[i]
phi_tot = np.linalg.solve(I - MM_matrix, phi_ext)
```
This is correct, stable, and straightforward. But it is also expensive: for each Ef, for each q,
for each ω, you do a dense solve. When the frequency grid is large, the same expensive work repeats with only the scalar
EE_omega changing.
Method 2 (fast): spectral (eigenmode) decomposition of the resolvent
After some investigation (and a lot of rubber-duck debugging with ChatGPT), we realized the system has the form:
$\varphi_{\text{tot}}(q, \omega) = (I - \mathit{EE}_\omega A(q))^{-1} \varphi_{\text{ext}}(q)$,
where A(q) does not depend on ω.
So we can precompute a spectral decomposition of A(q) once per q, and reuse it for all ω values.
Why “spectral”?
The expensive object in the original implementation is the parameter-dependent inverse $R(s) \equiv (I - sA)^{-1}$, with $s \equiv \mathit{EE}_\omega$, often called the resolvent. The point of the optimization is that we stop treating A as a black box and instead use its spectrum (its eigenvalues and eigenvectors) to evaluate R(s) cheaply for many s values.
If $A$ is diagonalizable, $A = V \Lambda V^{-1}$ with $\Lambda = \text{diag}(\lambda_1, \ldots, \lambda_n)$, then $R(s) = V (I - s\Lambda)^{-1} V^{-1}$. But $(I - s\Lambda)^{-1}$ is diagonal, so it is just $\text{diag}(1/(1 - s\lambda_1), \ldots, 1/(1 - s\lambda_n))$.
In other words: once the eigenpairs $(\lambda_n, v_n)$ are known for a fixed q,
all ω-dependence enters only through the scalars $1 - \mathit{EE}_\omega \lambda_n$.
Evaluating the response for many ω values therefore reduces to inexpensive elementwise operations on the eigenvalues
(followed by a basis transform with V).
This is exactly why “spectral” is an appropriate name: the computation is driven by the spectrum of A
rather than by repeating dense linear solves.
The equivalence
Assume $A(q)$ is diagonalizable: $A(q) = V(q) \Lambda(q) V(q)^{-1}$. Then:

Starting from the linear system $(I - \mathit{EE}_\omega A) \varphi_{\text{tot}} = \varphi_{\text{ext}}$, we obtain:

$\varphi_{\text{tot}} = (I - \mathit{EE}_\omega A)^{-1} \varphi_{\text{ext}} = V (I - \mathit{EE}_\omega \Lambda)^{-1} V^{-1} \varphi_{\text{ext}}$

Define the eigenbasis coordinates $\hat{\varphi}(q) \equiv V(q)^{-1} \varphi_{\text{ext}}(q)$. Because $\Lambda$ is diagonal, the inverse becomes mode-by-mode division:

$(\hat{\varphi}_{\text{tot}})_n = \frac{\hat{\varphi}_n}{1 - \mathit{EE}_\omega \lambda_n}, \qquad \varphi_{\text{tot}} = V \hat{\varphi}_{\text{tot}}$
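This equivalence is easy to check numerically. A minimal sketch (the random matrix `A` and scalar `s` are toy placeholders, with `s` playing the role of EE_omega):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))   # a generic matrix is diagonalizable almost surely
s = 0.3 + 0.1j                    # stands in for the scalar EE_omega

# Direct resolvent: R(s) = (I - s A)^{-1}
R_direct = np.linalg.inv(np.eye(n) - s * A)

# Spectral resolvent: V (I - s Lambda)^{-1} V^{-1}
lam, V = np.linalg.eig(A)
R_spectral = V @ np.diag(1.0 / (1.0 - s * lam)) @ np.linalg.inv(V)

print(np.allclose(R_direct, R_spectral))   # True
```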
Pseudocode: original vs spectral method
To make the computational difference concrete, here is the same calculation in two styles. Both produce the same $\varphi_{\text{tot}}(q, \omega)$ and downstream observables; they only differ in how the linear system is solved across many energies.
Original approach (direct solve for each ω)
```text
for Ef in Ef_list:
    for q in q_grid:
        A = A(q)                 # build VV(q) D(q) once per q
        Dq = D(q)
        phiExt = phi_ext(q)      # known RHS, once per q
        for omega in omega_list:
            EE = EE(Ef, omega)   # scalar prefactor
            phiTot = solve(I - EE*A, phiExt)
            rhoInd = (EE/W) * Dq * phiTot
            E_q_ind[q, omega] = integrate_theta( M(q) * rhoInd )
    E_ind[omega] = integrate_q( E_q_ind[q, omega] )
    P[Ef] = integrate_omega( spectral_weight(omega) * |E_ind + E_ext|^2 )
```
Spectral approach (diagonalize once per q, vectorize over all ω)
```text
precompute spectral weights in omega (Lorentz * cross section)
for q in q_grid:
    A = A(q)
    Dq = D(q)
    V, lam = eig(A)              # A = V diag(lam) V^{-1}, once per q
    phiExt = phi_ext(q)
    phiHat = solve(V, phiExt)    # V^{-1} phiExt
    for Ef in Ef_list:
        EE_vec = EE(Ef, omega_list)                    # vector over all omega
        denom[omega, n] = 1 - EE_vec[omega] * lam[n]   # broadcast: no omega loop
        phiHatTot[omega, n] = phiHat[n] / denom[omega, n]
        phiTot[omega, :] = V * phiHatTot[omega, :]     # basis transform
        rhoInd[:, omega] = (EE_vec[omega]/W) * Dq * phiTot[omega, :]
        E_q_ind[Ef, q, omega] = integrate_theta( M(q) * rhoInd[:, omega] )
for Ef in Ef_list:
    E_ind[omega] = integrate_q( E_q_ind[Ef, q, omega] )
    P[Ef] = integrate_omega( spectral_weight(omega) * |E_ind + E_ext|^2 )
```
Rule of thumb: the direct method does one dense solve per (q, ω).
The spectral method replaces those repeated solves with one eigendecomposition per q plus cheap
mode-by-mode divisions for all ω (and a basis transform).
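Here is a self-contained toy version of both methods side by side (the sizes `n`, `Nw`, the random `A`, and the `EE` grid are stand-ins for the real operators and prefactors):

```python
import numpy as np

rng = np.random.default_rng(1)
n, Nw = 50, 400                      # matrix size, number of frequencies
A = rng.standard_normal((n, n))      # stand-in for A(q)
phi_ext = rng.standard_normal(n)     # stand-in for phi_ext(q)
hw = np.linspace(0.5, 2.0, Nw)       # toy energy grid
EE = -1.0 / (hw * (hw + 0.01j))      # toy EE_omega prefactors

# Method 1: one dense solve per omega
phi_direct = np.stack([
    np.linalg.solve(np.eye(n) - s * A, phi_ext) for s in EE
])

# Method 2: one eigendecomposition, then elementwise work
lam, V = np.linalg.eig(A)
phi_hat = np.linalg.solve(V, phi_ext)             # V^{-1} phi_ext
denom = 1.0 - EE[:, None] * lam[None, :]          # shape (Nw, n)
phi_spectral = (V @ (phi_hat[None, :] / denom).T).T

print(np.abs(phi_direct - phi_spectral).max())    # tiny: floating-point noise
```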
Why the second script vectorizes over all energies
Let {ħωk} be the energy grid stored in the code array hw_list,
and define the vector $\mathit{EE} = (\mathit{EE}_{\omega_1}, \ldots, \mathit{EE}_{\omega_N})$.
For fixed q, the eigenvalues λn are fixed, so the denominators
$d_{k,n} = 1 - \mathit{EE}_{\omega_k} \lambda_n$
form an $N_\omega \times n_{\text{tot}}$ array.
Then the eigenbasis solution for all energies at once is
$\hat{\Phi}_{k,n} = \hat{\varphi}_n / d_{k,n}$,
and the total solutions (stored row-wise) are
$\Phi_{\text{tot}} = \hat{\Phi}\, V^{T}$.
This is exactly what broadcasting implements in the code:
```python
denom = 1.0 - EE_omega_vec[:, None] * w[None, :]   # shape (Nw, ntot)
temp = phi_hat_i[None, :] / denom                  # eigenbasis solutions, all omega
phi_tot_all_omega = (V @ temp.T).T                 # back to the original basis
```
Each row of phi_tot_all_omega is φtot(q, ωk).
A short note on broadcasting (for readers new to vectorization)
NumPy broadcasting is a set of rules that lets arrays with compatible shapes behave as if they had been replicated along length-1 dimensions (without copying data).
Here, EE_omega_vec[:, None] has shape (Nw, 1) and w[None, :] has shape (1, ntot).
Multiplying them broadcasts both to shape (Nw, ntot), producing all products EE(ω_k) * λ_n at once.
Then phi_hat_i[None, :] divides across the same grid, computing
$\hat{\varphi}_n/(1 - \mathit{EE}_{\omega_k} \lambda_n)$ for every (k, n) pair in one shot.
Finally, multiplying by V converts each row back from eigenbasis to the original basis.
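A tiny standalone illustration of the shapes involved, with made-up numbers (2 frequencies, 3 modes, and φ̂ set to all ones so `temp` is just `1/denom`):

```python
import numpy as np

EE_omega_vec = np.array([0.1, 0.2])      # shape (Nw,) = (2,)
w = np.array([1.0, 2.0, 3.0])            # eigenvalues, shape (ntot,) = (3,)
phi_hat_i = np.ones(3)                   # toy eigenbasis coordinates

denom = 1.0 - EE_omega_vec[:, None] * w[None, :]   # broadcast to shape (2, 3)
temp = phi_hat_i[None, :] / denom                  # all (k, n) pairs in one shot

print(denom)
# [[0.9 0.8 0.7]
#  [0.8 0.6 0.4]]
```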
So… why not GPU? why not JIT?
NumPy is already doing a lot of work in compiled code
Even when your script is “Python”, many expensive operations happen in optimized native libraries:
- np.linalg.solve and np.linalg.eig call LAPACK routines (Fortran/C under the hood).
- Matrix multiplications call BLAS (often OpenBLAS, MKL, or Accelerate).
- Many array ops (broadcasting arithmetic, np.trapezoid, etc.) run in compiled loops.
One subtle point: vectorization and parallelism are different. Vectorization means "express work as array ops so compiled code does the loops." Parallelism may or may not happen depending on your BLAS/LAPACK backend: many are multi-threaded, so large matrix ops can already use multiple CPU cores. But NumPy does not automatically parallelize your outer Python loops.
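The "compiled loops beat interpreter loops" part is easy to demonstrate even in a single thread. A minimal sketch (timings are illustrative, not a rigorous benchmark):

```python
import time
import numpy as np

x = np.random.default_rng(0).standard_normal(1_000_000)

# Python-level loop: interpreter overhead on every element
t0 = time.perf_counter()
s_loop = 0.0
for v in x:
    s_loop += v
t_loop = time.perf_counter() - t0

# Vectorized: a single compiled loop inside NumPy
t0 = time.perf_counter()
s_vec = x.sum()
t_vec = time.perf_counter() - t0

print(f"loop: {t_loop:.3f}s   vectorized: {t_vec:.5f}s")
```

The two sums agree to floating-point noise, but the vectorized version is typically orders of magnitude faster, with or without multi-threaded BLAS.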
Why JIT compilation likely does not help much for this bottleneck
JIT tools (e.g., Numba) shine when runtime is dominated by Python-level loops and simple numeric kernels that the JIT can compile. Here, the expensive steps are dominated by LAPACK/BLAS (solves/eigendecomposition) and special functions (Bessel/Struve), which are already compiled and highly optimized. JIT-ing the surrounding control flow usually cannot accelerate those inner library calls.
JIT could help if Python overhead in the outer loops is truly dominant, or if we can fuse many small array ops into one compiled loop.
But the largest gains came from changing the actual algorithmic structure.
We replaced Nω solves with one eigendecomposition plus cheap diagonal operations.
Why GPU is not the obvious first move
GPU acceleration can be effective, but it is never free, and it is not always the answer:
- We need GPU-capable equivalents for all heavy kernels (dense linear algebra, special functions, integration).
- Data transfer between CPU and GPU can erase wins if not managed carefully.
- Even if we move linear algebra to the GPU, we may still be bottlenecked by operations that remain on CPU (or by Python-side orchestration).
Once we use this spectral method, the remaining workload is a mix of moderate-size dense linear algebra and vectorized elementwise operations. That might map well to a GPU in some environments. But at that point the decision is informed: profile again and see whether the new bottleneck is GPU-friendly.
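Re-profiling does not require anything exotic; the standard library is enough. A minimal sketch, where `simulate` is a placeholder for the real entry point:

```python
import cProfile
import pstats
import numpy as np

def simulate():
    # Placeholder workload standing in for the real simulation entry point
    A = np.random.default_rng(0).standard_normal((200, 200))
    for _ in range(20):
        np.linalg.eig(A)

with cProfile.Profile() as prof:
    simulate()

# Show the 10 functions with the largest cumulative time
pstats.Stats(prof).sort_stats("cumulative").print_stats(10)
```

If the top entries are LAPACK-backed calls, a GPU linear-algebra library might help; if they are Python-level orchestration, it probably won't.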
Numerical considerations
The two methods are algebraically equivalent, but floating-point may still differ a bit because:
- if V is ill-conditioned, computing $V^{-1} \varphi_{\text{ext}}$ can amplify roundoff error.
- if $|1 - \mathit{EE}_\omega \lambda_n|$ is tiny, diagonal divisions also amplify numerical noise (and physically correspond to strong modal response).
- if A is not diagonalizable, the eigenbasis can be unstable. A Schur decomposition (upper-triangular) is typically a more stable route to the same idea.
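The Schur route can be sketched as follows (SciPy assumed available; `A`, `b`, and `s` are toy placeholders). The Schur factor Q is unitary, so no potentially ill-conditioned eigenvector inverse is needed; the price is one triangular solve per value of s, which is still much cheaper than a dense solve:

```python
import numpy as np
from scipy.linalg import schur, solve_triangular

rng = np.random.default_rng(2)
n = 30
A = rng.standard_normal((n, n))      # stand-in for A(q)
b = rng.standard_normal(n)           # stand-in for phi_ext(q)
s = 0.25 + 0.05j                     # stand-in for EE_omega

# A = Q T Q^H with Q unitary and T upper triangular (complex Schur form)
T, Q = schur(A.astype(complex), output="complex")

# (I - s A)^{-1} b = Q (I - s T)^{-1} Q^H b: a triangular solve, O(n^2) per s
y = solve_triangular(np.eye(n) - s * T, Q.conj().T @ b)
x_schur = Q @ y

x_direct = np.linalg.solve(np.eye(n) - s * A, b)
print(np.allclose(x_schur, x_direct))   # True
```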
Takeaway
We started with a “should we use a GPU?” question and ended with a better one: what structure does the code have, and can we exploit it to make things go brrr?
The key observation was that the computation repeatedly solves
$(I - \mathit{EE}_\omega A) \varphi_{\text{tot}} = \varphi_{\text{ext}}$
for many ω, with $A$ fixed for each q.
Diagonalizing A once per q turns repeated dense solves into cheap mode-by-mode divisions
in the eigenbasis, plus a basis transform.
In performance work, this pattern is very common. The biggest gains come first from changing the algorithm, and only then from “accelerating” the existing one, whatever that means.