Does it? How? Why would it be the vector store that makes it easier for them to censor the content? Why not censor the documents in S3 directly, or the entries in the relational database? What is different about censoring those vs. a vector store?
Not to mention there is lock-in once you've gone to the trouble of using a specific embedding model on a bunch of content. Ideally we'd converge on backwards-compatible, open source approaches, but cloud vendors want to offer "value" by offering "better" embedding models that are not open source.
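To make that lock-in concrete: a corpus indexed with one embedding model can only be queried with vectors from that same model, so switching models means re-embedding everything and rebuilding the index. A rough sketch of the mechanics, with a hypothetical `embed_new` callback standing in for whatever model API you would migrate to (not any particular vendor's interface):

```python
from typing import Callable, List

Vector = List[float]

def query_index(index_vectors: List[Vector], query_vec: Vector) -> int:
    """Nearest neighbor by dot product; only meaningful if query_vec comes from
    the same embedding model that produced index_vectors."""
    if index_vectors and len(query_vec) != len(index_vectors[0]):
        raise ValueError("query embedded with a different model than the index")
    scores = [sum(q * v for q, v in zip(query_vec, vec)) for vec in index_vectors]
    return max(range(len(scores)), key=scores.__getitem__)

def migrate_corpus(docs: List[str], embed_new: Callable[[str], Vector]) -> List[Vector]:
    """Switching embedding models invalidates every stored vector: the whole corpus
    has to be pushed through the new model and the ANN index rebuilt, so the cost
    grows linearly with corpus size (and with the new model's pricing)."""
    return [embed_new(d) for d in docs]
```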
Are there laws on the books that would force them to apply the technology in this way?
GovCloud exists so that AWS can sell to the US government and their contractors without impacting other customers who have different or less stringent requirements.
> Filtering looks to be applied after coarse retrieval. That keeps the index unified and simple, but it struggles with complex conditions. In our tests, when we deleted 50% of data, TopK queries requesting 20 results returned only 15—classic signs of a post-filter pipeline.
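That symptom is easy to reproduce. Here is a minimal sketch, using brute-force NumPy as a stand-in for the real ANN index (assumptions mine; this says nothing about S3 Vectors' actual internals), of why a filter applied after a fixed-k coarse retrieval returns fewer than k hits once the filter excludes a large slice of the candidates:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus: 10k vectors, with roughly half the rows soft-deleted.
vectors = rng.normal(size=(10_000, 64))
deleted = rng.random(10_000) < 0.5
query = rng.normal(size=64)

def post_filter_topk(query, vectors, deleted, k=20):
    # Coarse retrieval first: take exactly k nearest, filter-unaware.
    scores = vectors @ query                 # brute-force stand-in for the ANN index
    candidates = np.argsort(-scores)[:k]
    # Filter second: survivors can be fewer than k.
    return [i for i in candidates if not deleted[i]]

def pre_filter_topk(query, vectors, deleted, k=20):
    # Filter first, then take the k nearest among eligible rows.
    live = np.flatnonzero(~deleted)
    scores = vectors[live] @ query
    return live[np.argsort(-scores)[:k]]

print(len(post_filter_topk(query, vectors, deleted)))  # typically ~10, never more than 20
print(len(pre_filter_topk(query, vectors, deleted)))   # 20, as long as enough rows remain
```

A pre-filter (or filter-aware traversal) keeps the result count stable at the cost of a more complex or slower index path, which is presumably the tradeoff the unified post-filter design is making.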
Things like this are why I'd much prefer if Amazon provided detailed documentation of how their stuff works, rather than leaving it to the development community to poke around and derive those details independently.
I actually think AWS did a reasonably good job of this with DynamoDB. Most of the performance tradeoffs, indexing, etc. are pretty clear if you read enough docs, without exposing a ton of unnecessary internals.
Full disclosure: I founded ScaNN in GCP databases and am the lead for AlloyDB Semantic Search. And all these opinions are my own.
Absolutely this. So much engineering time has been wasted on reverse-engineering internal details of things in AWS that could be easily documented. I once spent a couple days empirically determining how exactly cross-AZ least-outstanding-requests load balancing worked with AWS's ALB because the docs didn't tell me. Reverse-engineering can be fun (or at least I kinda enjoy it) but it's not a good use of our time and is one of those shadow costs of using the Cloud.
It's not like there's some secret sauce here in most of these implementation details (there aren't that many ways to design a load balancer). If there were, I'd understand not telling us. This is probably less an Apple-style culture of secrecy and more laziness, plus a belief that the important details have been abstracted away from us users because "The Cloud", when in fact these details really do matter for performance and other design decisions we have to make.
This. There's a lot of freedom in how teams operate. Some teams have great internal documentation, others don't, and a lot of it is scattered across the internal Amazon wiki. I recall having to reach out on Slack on multiple occasions to figure out how certain systems worked, because diving through the docs and the relevant issue trackers didn't make it clear.
Having worked inside AWS, I can tell you one big reason is the attitude/fear that anything we put in our public docs may end up getting relied on by customers. If customers rely on the implementation working in a specific way, then changing that detail requires a LOT more work to avoid breaking customers' workloads. If it is even possible at that point.
Rely on undocumented behavior at your own risk.
IME the implementation of ANN + metadata filtering is often the "secret sauce" behind many vector database implementations.
Yes, I’m the founder and maintainer of the Milvus project, and also a big fan of many AWS projects, including S3, Lambda, and Aurora. Personally, I don’t consider S3Vector to be among the best products in the S3 ecosystem, though I was impressed by its excellent latency control. It’s not particularly fast, nor is it feature-rich, but it seems to embody S3’s design philosophy: being “good enough” for certain scenarios.
In contrast, the products I’ve built usually push for extreme scalability and high performance. Beyond Milvus, I’ve also been deeply involved in the development of HBase and Oracle products. I hope more people will dive into the underlying implementation of S3Vector—this kind of discussion could greatly benefit both the search and storage communities and accelerate their growth.
Postgres for vector search is fine for toy products or stuff that's outside the hot loop of your business, but for high-performance applications it's just inadequate.
As far as I understand, Milvus is appropriate for very large scale, so will probably continue targeting enterprise.
I also didn’t see any latency info on their docs page https://docs.aws.amazon.com/AmazonS3/latest/API/API_S3Vector...
It took a while, but eventually open source dies.