I’ve put together a repo demonstrating how to train PPO directly on a single TSPLIB instance (lin318) from scratch—without pre-training or GPUs.
Repo: https://github.com/jivaprime/TSP
1. Experiment Setup
Problems: TSPLIB lin318 (Opt: 42,029) and rd400 (Opt: 15,281)
Hardware: Google Colab (CPU only)
Model: Single-instance PPO policy + Value network. Starts from random initialization.
Local Search: Light 2-opt during training, Numba-accelerated 3-opt for evaluation (a 2-opt sketch follows this list).
Core Concept: Instead of a "stable average-error minimizer," this policy is designed as a high-variance explorer. The goal isn't to keep the average gap low, but to occasionally "spike" very low-error tours that local search can polish.
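To make the "light 2-opt during training" step concrete, here is a minimal sketch of a Numba-jitted, first-improvement 2-opt sweep over a precomputed distance matrix. Names like `two_opt_pass`, `tour`, and `dist` are illustrative, not the repo's actual API; during training you would call something like this once or a few times per candidate tour rather than running it to convergence.

```python
import numpy as np
from numba import njit

@njit(cache=True)
def two_opt_pass(tour, dist):
    """One first-improvement 2-opt sweep.

    tour: 1-D integer array of city indices (modified in place).
    dist: (n, n) array of pairwise distances.
    Returns True if at least one improving move was applied.
    """
    n = tour.shape[0]
    improved = False
    for i in range(n - 1):
        a = tour[i]
        b = tour[i + 1]
        # When i == 0, skip j == n - 1 so the two chosen edges are never adjacent.
        j_max = n - 1 if i == 0 else n
        for j in range(i + 2, j_max):
            c = tour[j]
            d = tour[(j + 1) % n]
            # Gain from replacing edges (a, b) and (c, d) with (a, c) and (b, d).
            delta = dist[a, c] + dist[b, d] - dist[a, b] - dist[c, d]
            if delta < -1e-10:
                # Apply the move by reversing the segment tour[i+1 .. j].
                lo, hi = i + 1, j
                while lo < hi:
                    tour[lo], tour[hi] = tour[hi], tour[lo]
                    lo += 1
                    hi -= 1
                b = tour[i + 1]
                improved = True
    return improved
```

Looping until it returns False gives full 2-opt convergence; 3-opt is analogous but considers three-edge exchanges, which is why it benefits even more from Numba.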
2. Results: lin318
Best Shot: 42,064 (Gap ≈ +0.08%)
Time: Reached within ~20 minutes on Colab CPU.
According to the logs (included in the repo), the sub-0.1% shot appeared around elapsed=0:19:49. While the average error oscillates around 3–4%, the policy successfully locates a deep basin that 3-opt can exploit.
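For clarity, "gap" throughout means relative excess over the known optimum. Checking the lin318 number above:

```python
def gap_pct(cost, opt):
    """Relative excess over the known optimum, in percent."""
    return 100.0 * (cost - opt) / opt

print(gap_pct(42_064, 42_029))  # ~0.083%, reported above as ~+0.08%
```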
3. Extended Experiment: Smart ILS & rd400
I extended the pipeline with "Smart ILS" (Iterated Local Search) post-processing to see if we could hit the exact optimum.
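For readers unfamiliar with the pattern, a generic ILS loop alternates a random perturbation (the classic double-bridge kick) with local search, keeping a candidate only if it beats the incumbent. The sketch below is that textbook version, not the repo's actual "Smart ILS": `two_opt_pass` is the sweep sketched earlier, and `double_bridge`, the acceptance rule, and all parameter names are placeholders of mine.

```python
import numpy as np

def tour_length(tour, dist):
    """Total length of the closed tour under distance matrix dist."""
    return dist[tour, np.roll(tour, -1)].sum()

def double_bridge(tour, rng):
    """Classic 4-opt 'double bridge' kick: cut the tour into four pieces and reorder them."""
    n = len(tour)
    i, j, k = sorted(rng.choice(np.arange(1, n), size=3, replace=False))
    return np.concatenate((tour[:i], tour[j:k], tour[i:j], tour[k:]))

def iterated_local_search(seed_tour, dist, n_iters=1000, seed=0):
    """Generic ILS: perturb the incumbent, re-optimize with 2-opt, accept if shorter."""
    rng = np.random.default_rng(seed)
    best = seed_tour.copy()
    while two_opt_pass(best, dist):   # polish the seed to a 2-opt local optimum
        pass
    best_len = tour_length(best, dist)
    for _ in range(n_iters):
        cand = double_bridge(best, rng)
        while two_opt_pass(cand, dist):
            pass
        cand_len = tour_length(cand, dist)
        if cand_len < best_len:       # accept only strict improvements
            best, best_len = cand, cand_len
    return best, best_len
```

A stronger local search (e.g., the 3-opt mentioned above) and less naive perturbation/acceptance rules are the natural places where a "smart" variant would differ.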
A. lin318 + ILS
Took the PPO-generated tour (0.08% gap) as a seed.
Ran Smart ILS for ~20 mins.
Result: Reached the exact optimum (42,029).
B. rd400 + ILS
PPO Phase: ~2 hours on CPU. Produced tours with ~1.9% gap.
ILS Phase: Used PPO tours as seeds. Ran for ~40 mins.
Result: Reached 0.079% gap (Cost 15,293 vs Opt 15,281).
4. Summary
The workflow separates concerns effectively:
PPO: Drives the search into a high-quality basin (1–2% gap).
ILS: Digs deep within that basin to find the optimum.
If you are interested in instance-wise RL, CPU-based optimization, or comparing against ML-TSP baselines (POMO, AM, NeuroLKH), feel free to check out the code.
Constructive feedback is welcome!