I might be wrong regarding more sophisticated infra though.
With Stategraph, you'll get all the benefits and isolation of separate state files, but when you changed resources, you'll get meaningful plans around all of the infrastructure they impact, not just the statically defined boundaries of a state file.
Splitting state files is the common workaround but that only creates new problems like cross state dependencies and orchestration glue. The real issue is the storage model which is a single JSON blob with a global lock. Treating state as a graph with proper concurrency control avoids contention while keeping a cohesive view of infrastructure.
We have about 30 services with each managing their own terraform state. We also have a shared infra repo managing some top level items. We haven’t run into any issues (with any regularity at least) that I can think of but I’m wondering if this could be a good tool for us as we grow and things become even more complex?
We were all experienced with Go but at the time the Go SDK was very awkward, although I think some of that has been resolved with generics now. TF is less expressive but I think that’s actually better for 99% of cases.
Sounds very neat if you’re an big enough org.
I'm curious if this will be compatible with tools like Spacelift or Env Zero, or if they are going to build their own runner/agent to compete in that space.
eschatology•1h ago
I don’t see the state file as a complete downside. It is very simple and very easy to understand. It makes it easy to tell or predict what terraform will do given the current state and desired state.
Its simpleness makes troubleshooting easier: the state files are easy to read and manipulate or repair in the event of a drift, mismatch, or botched provider update.
With the solution proposed it feels like the state becomes a black box I shouldn’t put my hands in. I wonder how the troubleshooting scenarios change with it.
Personally, I haven’t ran into the scaling issue described; at any given time there is usually only one entity working with the state file. We do use terragrunt for larger systems but it is manageable. ~1000 engineer org.
lawnchair•39m ago
Not every Terraform setup runs into scaling pain. The trouble tends to show up in larger repos with thousands of resources where teams share big chunks of infra. That is where global locks and full refreshes become a bottleneck and where we think graph semantics help.