Where I work, we have maybe a hundred different sources of structured logs: our own applications, Kubernetes, databases, CI/CD software, lots of system processes. There's no common schema other than the basics (timestamp, message, source, Kubernetes metadata). Apps produce all sorts of JSON fields, and we have thousands and thousands of fields across all these apps.
It'd be okay to define a small core subset, but we'd need a sensible "catch-all" rule for the rest. All fields need to be searchable, though it's of course OK if performance is a little worse for non-core fields, as long as you can go into the schema and explicitly add a field in order to speed things up.
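A minimal sketch of the "core subset plus catch-all" idea in Python (the core field names and the `extra` column are made up for illustration; in practice the catch-all would map to something like a JSON-typed column in the database, this only shows the splitting logic):

```python
# Sketch: split each structured log record into a fixed core schema
# plus a catch-all for everything else. The core field names below are
# hypothetical; real ones would depend on your deployment.
import json

CORE_FIELDS = {"timestamp", "message", "source", "k8s_pod", "k8s_namespace"}

def split_record(record: dict) -> dict:
    """Return a row with core fields as dedicated columns and all
    remaining fields serialized into a single catch-all JSON column."""
    core = {k: record.get(k) for k in CORE_FIELDS}
    extra = {k: v for k, v in record.items() if k not in CORE_FIELDS}
    core["extra"] = json.dumps(extra, sort_keys=True)
    return core

row = split_record({
    "timestamp": "2024-05-01T12:00:00Z",
    "message": "request handled",
    "source": "checkout-api",
    "latency_ms": 42,          # non-core field -> goes to catch-all
})
```

Promoting a field for speed would then just mean adding it to the core set and backfilling a real column for it.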
Also, how does Greptime scale with that many fields? Does it do fine with thousands of columns?
I imagine it would be a good idea to have one table per source. Is it easy/performant to search multiple tables (union ordered by time) in a single query?
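The time-ordered union across tables is essentially a k-way merge of per-source streams that are each already sorted by timestamp. A plain-Python sketch of that shape (the sample rows are made up; a real engine would run this over its own scan operators rather than Python lists):

```python
# Sketch: merge rows from several per-source "tables", each already
# sorted by timestamp (first tuple field), into one time-ordered
# stream -- the same shape as UNION ALL ... ORDER BY ts executed as
# a k-way merge instead of a full re-sort.
import heapq

app_logs = [(1, "app", "started"), (4, "app", "request")]
db_logs  = [(2, "db", "connect"), (3, "db", "query")]

merged = list(heapq.merge(app_logs, db_logs, key=lambda r: r[0]))
# merged is ordered by timestamp across both sources
```

Because each input is already time-sorted, the merge is streaming and cheap; the cost question is mostly whether the query engine pushes this down rather than sorting the full union.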
Secondly, I prefer wide tables to consolidate all sources for easy management and scalability. With GreptimeDB's columnar storage based on Parquet, unused columns don't incur storage costs.
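To illustrate why unused columns are cheap in a columnar layout: only columns that a batch of rows actually contains need to be materialized at all. A toy sketch of that property (this is not GreptimeDB's or Parquet's actual storage code, just the idea):

```python
# Toy column store: a column that no row ever sets allocates nothing,
# unlike a dense row-oriented layout where every row carries every field.
from collections import defaultdict

class ColumnBatch:
    def __init__(self):
        self.columns = defaultdict(list)  # column name -> values
        self.n_rows = 0

    def append(self, row: dict):
        for name, value in row.items():
            col = self.columns[name]
            # pad with nulls for rows written before this column appeared
            col.extend([None] * (self.n_rows - len(col)))
            col.append(value)
        self.n_rows += 1

batch = ColumnBatch()
batch.append({"ts": 1, "msg": "a"})
batch.append({"ts": 2, "msg": "b", "latency_ms": 10})  # rare field
# "latency_ms" holds one null plus one value; columns never mentioned
# by any row cost nothing at all.
```

Real formats like Parquet go further and compress the null runs, so a wide table with thousands of mostly-absent columns stays small on disk.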
I find it interesting that Greptime is completely time-oriented. I don't think you can create tables without a time PK? The last time I needed log storage, I ended up picking ClickHouse, because it has no such restrictions on primary keys. We use non-time-based tables all the time, as well as dictionaries. So it seems Greptime is a lot less flexible?
Could you elaborate on why you find this inconvenient? I assumed logs, for example, would naturally include a timestamp.
(it seems greptime is.)
Something about the sentence structure is just off-putting.
firesteelrain•6h ago
You can predict and control costs with these models.
killme2008•4h ago