Words about Arrays and Tables

https://buttondown.com/hillelwayne/archive/2000-words-about-arrays-and-tables/

59•todsacerdoti•1d ago

Comments

mvc•23h ago

XSLT/XPath is an example of a platform that provides multiple axes through which to access your data structure.

https://developer.mozilla.org/en-US/docs/Web/XML/XPath/Refer...

kccqzy•23h ago

I'm not sure what the insight in this article is. All the things mentioned seem pretty straightforward. But I'll comment on a few:

> Really any finite set can be a "dimension".

This is absolutely true and yet typical programming languages don't do a good job of unleashing this. When was the last time you wished you could define an array where the indices are from a custom finite set, perhaps a range 100 to 200, perhaps an enum? I only know of two languages, Pascal and Haskell, that are sufficiently flexible. All other languages insist that indices for arrays must be integers starting from either 0 or 1. Haskell has a mechanism to let you control what is an array index: https://hackage.haskell.org/package/base-4.18.1.0/docs/Data-... The lack of this ability makes programmers use more general hash maps by default and therefore leave performance on the table.
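To make the "range 100 to 200" case concrete, here is a minimal Python sketch of an array indexed by an arbitrary inclusive integer range. `RangeArray` is a hypothetical helper written for this illustration, not a standard-library type; the point is only that the index translation is a single subtraction over flat storage.

```python
class RangeArray:
    """Flat list indexed by an inclusive integer range lo..hi (hypothetical helper)."""

    def __init__(self, lo, hi, fill=0):
        if hi < lo:
            raise ValueError("empty range")
        self.lo, self.hi = lo, hi
        self._data = [fill] * (hi - lo + 1)

    def _offset(self, ix):
        # Translate a custom index into a 0-based position, with a bounds check.
        if not self.lo <= ix <= self.hi:
            raise IndexError(f"index {ix} outside {self.lo}..{self.hi}")
        return ix - self.lo

    def __getitem__(self, ix):
        return self._data[self._offset(ix)]

    def __setitem__(self, ix, value):
        self._data[self._offset(ix)] = value

    def __len__(self):
        return len(self._data)


scores = RangeArray(100, 200)
scores[100] = 7
scores[200] = 9
```

A language with native support could do the same subtraction at compile time; in Python the sketch only shows the shape of the idea, not a performance win.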

> We can also transform Row -> PowerSet(Col) into Row -> Col -> Bool, aka a boolean matrix.

I mean sure. Does the author know type signatures and their operations like this are isomorphic to algebraic operations on size of the possibly infinite sets? The function A -> B has exactly |B|^|A| implementations so that's why Col -> Bool is in this sense "same" as PowerSet(Col). And the function operator associates to the right so adding Row -> doesn't change anything. Wait till you learn why sum types are called sum types and why product types are called product types. Here's another thing that might be mind-blowing: arrays of an unknown length can be thought of as the infinite union of arrays of lengths of all natural numbers: Array<T, 0> | Array<T, 1> | … so using the above notation the cardinality is 1 + |T| + |T|^2 + …; but they can also be represented as a standard linked list List<T>=Nil|(T, List<T>) and using the above notation we have |List<T>|=1+|T|*|List<T>| which simplifies to |List<T>|=1/(1-|T|). And guess what, the Taylor expansion of that is just the earlier 1 + |T| + |T|^2 + … precisely.
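The counting argument can be checked by brute force on a small example. The sketch below (names such as `cols` and `count_lists` are made up for illustration) enumerates every function Col -> Bool, every element of PowerSet(Col), and the bijection between them, then computes partial sums of 1 + |T| + |T|^2 + … for lists of bounded length. The 1/(1-|T|) form is best read as a formal power-series identity rather than a number.

```python
from itertools import chain, combinations, product

cols = ["a", "b", "c"]
bools = [False, True]

# Every function Col -> Bool, written as a dict from column to boolean.
functions = [dict(zip(cols, outputs)) for outputs in product(bools, repeat=len(cols))]

# Every element of PowerSet(Col).
subsets = [
    frozenset(s)
    for s in chain.from_iterable(combinations(cols, r) for r in range(len(cols) + 1))
]

# The bijection: a function corresponds to the set of columns it sends to True.
as_subsets = {frozenset(c for c in cols if f[c]) for f in functions}


def count_lists(t_size, max_len):
    """Number of lists of length <= max_len over a type with t_size values."""
    return sum(t_size ** n for n in range(max_len + 1))
```

With three columns there are 2^3 functions and 2^3 subsets, and `as_subsets` covers every subset exactly once.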

Jtsummers•22h ago

> I only know of two languages, Pascal and Haskell, that are sufficiently flexible.

Julia, Fortran, some BASICs allow custom integer ranges.

Ada allows you to use any discrete range (integer, character, enums).

https://en.wikipedia.org/wiki/Comparison_of_programming_lang... - A more comprehensive list

The two columns of interest are "Specifiable index type" and "Specifiable base index".

cestith•21h ago

Older versions of Perl allowed you to set a lexically-scoped array base integer. It defaulted to zero and the docs mentioned 1-based arrays, but IIRC it could be any integer. This was deprecated several years ago in 5.12 as a harmful practice. Specifying using features of a Perl version of 5.16 or newer actually makes assigning to it (except for 0) a compile-time error.

Interestingly, though, Perl’s native arrays are decidedly not an array of primitives. They are arrays of scalar values, and a scalar value can not only contain a character, a string, an integer, a float, or a reference to another item (scalar, hash, array, blessed object, filehandle, etc) but it can return different values when called in a numeric context vs a string context. There are certainly ways to get at primitive values, but it’s not in Perl’s native array semantics.

pklausler•21h ago

Non-default lower bounds in Fortran are a famous pitfall, however; they don't persist in many cases where one might think that they do, and are not 100% portable across compilers.

zahlman•20h ago

> When was the last time you wished you could define an array where the indices are from a custom finite set, perhaps a range 100 to 200, perhaps an enum?

> ...

> The lack of this ability makes programmers use more general hash maps by default and therefore leave performance on the table.

You can simulate this in any language by hash-mapping the custom set elements to indices. (And in the general case, you won't be able to determine the indices faster than a hash table lookup.)

In fact, I imagine there are languages and implementations where hash table lookup is the fastest way to map 'a':'z' to 0:25.

In fact, let me try a (surely totally bogus) micro-benchmark right now:

  $ python -m timeit --setup 'lookup = dict(zip("abcdefghijklmnopqrstuvwxyz", range(26)))' '[lookup[x] for x in "abcdefghijklmnopqrstuvwxyz"]'
  500000 loops, best of 5: 884 nsec per loop
  $ python -m timeit '[ord(x) - 97 for x in "abcdefghijklmnopqrstuvwxyz"]'
  200000 loops, best of 5: 1.21 usec per loop

Huh.

...And surely if you use the hash map directly, instead of using it to grab an index into another container, overall performance only gets better (unless maybe you need to save per-instance memory, and you have multiple containers using the same custom index scheme).

Hash tables are a pretty neat idea.
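For anyone who wants to rerun the comparison as a script rather than from the shell, here is a rough equivalent using the `timeit` module. The function names and the `compare` helper are invented for this sketch, and wrapping each expression in a function adds call overhead the command-line version does not have, so absolute numbers will differ and vary by machine and Python version.

```python
import string
import timeit

letters = string.ascii_lowercase
lookup = dict(zip(letters, range(26)))


def via_dict():
    # Map each letter to its index with a hash table lookup.
    return [lookup[x] for x in letters]


def via_ord():
    # Map each letter to its index with arithmetic on the code point.
    return [ord(x) - 97 for x in letters]


def compare(number=10_000):
    """Return total seconds for `number` runs of each approach."""
    return {
        "dict": timeit.timeit(via_dict, number=number),
        "ord": timeit.timeit(via_ord, number=number),
    }


timings = compare()
print(timings)
```

Both functions produce the same list, so whatever difference shows up is purely the cost of the lookup strategy in that interpreter.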

galaxyLogic•22h ago

The characteristic feature of Arrays is that all but the last element have their next element and all but the first element have their previous element. So what would be the value of having or using indexes other than Integers to index Arrays?

You can also define integer-valued named constants, and use those if you prefer.

larrik•22h ago

That sounds more like a linked list than an array.

In C, an array is just contiguous bytes, and you reach an index through math (starting location + (index * size)).

Pretty sure lots of languages still do this to some degree, as your lookup times are effectively zero, though altering the number of elements is way more complex/expensive than in a linked list.
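The address arithmetic can be imitated in Python with a raw byte buffer. This is only a sketch under the assumption of 4-byte little-endian integers; `make_buffer` and `element_at` are hypothetical names, and real C arrays involve no unpacking step.

```python
import struct

ITEM = struct.Struct("<i")  # assumed element type: 4-byte little-endian signed int


def make_buffer(values):
    """Pack integers back to back into one contiguous bytearray."""
    buf = bytearray(ITEM.size * len(values))
    for i, v in enumerate(values):
        ITEM.pack_into(buf, i * ITEM.size, v)
    return buf


def element_at(buf, index, start=0):
    # address = starting location + (index * size)
    (value,) = ITEM.unpack_from(buf, start + index * ITEM.size)
    return value


buf = make_buffer([10, 20, 30, 40])
third = element_at(buf, 2)
```

Growing the buffer means allocating a larger one and copying, which is the cost mentioned above relative to a linked list.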

deepsun•22h ago

> and you reach an index through math (starting location + (index * size)).

By the way, that's the only reason why C-derived languages use unintuitive zero-based indexing. There's really no other reason to call the first element a[0] instead of a[1].

dpassens•20h ago

Ah, but is it unintuitive though? Once one understands that in the vast majority of cases, C deals in pointers, not arrays, it makes perfect sense that indexing uses an offset rather than the nth element.

deepsun•7m ago

Yes, but that's only C. Languages that have no pointer arithmetic (JavaScript, Python, Java, Rust, Go, you name it) got that from C even though they didn't have to.

tom_•18h ago

The index of the first element is always 0 plus the index of the first element. This means it sort-of doesn't matter what the index of the first element of the array is - but, viewed another way, it's an excellent argument for 0. Starting at 0 means you don't need to add a constant in the (very common) case of dealing with a subarray that starts at the first element of the containing array.

Jtsummers•22h ago

> So what would be the value of having or using indexes other than Integers to index Arrays?

It removes a level of indirection, wasted space, and mental overhead. Suppose you have some table of 'a'..'z' -> whatever. What are your options?

- Switch/case. This may be optimized and fast, but that's not guaranteed across all languages and compilers.

- Hash map. This introduces memory overhead even if it is technically O(1) for access times.

- If in C, you could do:

  T* table = (T*)malloc(sizeof(T) * 26) - 'a'

So that the characters can be used naturally without later modification (not a feature in every language, though).

- Manually calculate each offset every time:

  table[c - 'a'] = ...

With language appropriate type conversions, if needed.

Or you can allow any custom range to work, and do the math in that last option automatically because it's trivial for a compiler to figure it out and it removes unnecessary mental overhead from the program reader and writer.

  table['a']

Just works.

> You can also define integer-valued named constants, and use those if you prefer.

Because everyone wants to see `lower_case_a = 0` and the 25 other cases in their code along with all other possible constants. At least use an enum, sane languages even make them type-safe (or at least safer).
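As a rough illustration of the enum suggestion, Python's `IntEnum` members can index a plain list directly. The `Color` enum and `counts` table are made up for this sketch, and Python does not enforce that only `Color` values are used as indices, so it is the "safer" end of the spectrum at best.

```python
from enum import IntEnum


class Color(IntEnum):
    RED = 0
    GREEN = 1
    BLUE = 2


# One slot per enum member; the member itself is the index.
counts = [0] * len(Color)
counts[Color.GREEN] += 1
counts[Color.BLUE] += 2

by_name = {member.name: counts[member] for member in Color}
```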

mcphage•21h ago

> The characteristic feature of Arrays is that all but the last element have their next element and all but the first element have their previous element.

Where are you getting that from? That's not the characteristic feature of Arrays, that's the characteristic feature of Doubly Linked Lists.

zahlman•20h ago

I wouldn't call it a characteristic feature of either. Any ordered, finite sequence of elements has those properties. Doubly linked lists simply offer O(1) access to "next elements" and "previous elements" given an element.

galaxyLogic•9h ago

Use the index 'n' of any element of the array. Unless that is the first element, you can always take the element n - 1 of the same array. That is the "previous element" of the element at n. And so on.

I'm not saying that if you have value of any element you could derive the value of the previous or next element from it. I'm saying "it has a previous element" (unless it is the 1st element) meaning there is such a previous element and it is also possible to access that previous element if you know the index of the current element under investigation.

mpweiher•22h ago

This might be interesting:

Structuring Arrays with Algebraic Shapes

https://dl.acm.org/doi/abs/10.1145/3736112.3736141

https://news.ycombinator.com/item?id=44399757

nyrikki•21h ago

One thing that may help is to drop the typing for generic words or pigeonholes, that will help with the FORTRAN example.

The IBM 704 had 3 index registers, which could be combined together with a bitwise OR and then subtracted from a base address; that's where FORTRAN got its 7D arrays from.

When you mix that concept of decrementing index registers with hypercubes, you will naturally get to the extract format of BI tools like classic Tableau.

zahlman•20h ago

2000 words, specifically. Thanks again, title filter. Even just "Arrays and Tables" would have been better.

I sincerely believe this aspect of the filter is a misfeature. Submissions that have bad reasons to put a number in the title are generally submissions that are just bad (and should be flagged) anyway.

zamadatix•5m ago

Rather than attempt to autocorrect, it should just throw the submission back to the user for reconsideration. "2000 words about" is genuinely not that helpful of a title starter, but I'm not sure the autocorrection is any better.

Animats•20h ago

I get the feeling this guy likes Lua. Lua combines arrays and dicts into "tables", which are usually indexed from 1 but can be indexed by any type as a dict. He never mentions Lua, though.

Multidimensional arrays are a blind spot for modern language designers. Not arrays of arrays, as this author points out. FORTRAN had multidimensional arrays, but C didn't. Go and Rust don't have them. Part of the trouble is slices. Once you have slices, people want slicing in multidimensional arrays along any axis, which complicates the basic array representation. Discussions in this area dissolved into bikeshedding and nothing got done for Go and Rust.

zahlman•20h ago

> Lua combines arrays and dicts into "tables", which are usually indexed from 1 but can be indexed by any type as a dict.

I think it's more accurate to just call them hash tables ("dicts") that happen to support some very limited array-like functionality. "Appending" still looks like key insertion if you don't use a named method (https://stackoverflow.com/questions/27434142); `ipairs` misses those non-integer keys but also misses non-contiguous keys, or even a zero key (https://stackoverflow.com/questions/55108794); the `#` operator is not well defined once you add those non-contiguous keys (https://stackoverflow.com/questions/2705793) (so you can't actually cleanly override the choice to "start from 1"); slicing etc. aren't provided as operators (https://stackoverflow.com/questions/24821045); etc.

layer8•19h ago

C and C++ do have multidimensional arrays [0], in the sense that they are arrays of arrays where the inner arrays all have the same fixed size, encoded into the multidimensional array type. Likewise in Go and Rust. This is different from, say, Java, where if you have an array of arrays, the inner arrays can all have different sizes, and there is no array type that could express that they shouldn’t.

Being able to express slices across arbitrary dimensions natively is another matter.

[0] https://en.cppreference.com/w/cpp/language/array.html#Multid...
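The difference between a rectangular multidimensional array and an array of arrays can be sketched in Python. `Grid` below is a hypothetical helper that stores a 2-D array in one flat list in row-major order, so every row has the same length by construction; the `jagged` list shows the Java-style alternative where nothing constrains the inner lengths.

```python
class Grid:
    """Rectangular 2-D array stored in one flat list, row-major (hypothetical helper)."""

    def __init__(self, rows, cols, fill=0):
        self.rows, self.cols = rows, cols
        self._data = [fill] * (rows * cols)

    def _offset(self, r, c):
        if not (0 <= r < self.rows and 0 <= c < self.cols):
            raise IndexError((r, c))
        return r * self.cols + c

    def __getitem__(self, rc):
        return self._data[self._offset(*rc)]

    def __setitem__(self, rc, value):
        self._data[self._offset(*rc)] = value

    def row(self, r):
        # A row is one contiguous run of the flat storage.
        return self._data[r * self.cols:(r + 1) * self.cols]

    def column(self, c):
        # A column is a strided walk over the same storage.
        return self._data[c::self.cols]


grid = Grid(2, 3)
grid[1, 2] = 5

jagged = [[1, 2, 3], [4]]  # inner lists are free to differ in length
```

Rows come out as contiguous runs while columns need a stride, which hints at why slicing along arbitrary axes complicates the representation.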

marcosdumay•18h ago

> Go and Rust

The trade-offs are way too severe for people to agree to any implementation in a low level language like Rust or one that wants to be low level like Go.

We could get them in something like Python.

teleforce•15h ago

> 2000 words about arrays and tables

Merely 2000 words? We have a complete book for that [1].

Joking aside, D4M has seamlessly combined spreadsheet, table, database and graph concepts based on associative array mathematics [2].

At one extreme, people bolt everything onto a PostgreSQL database; at the other, they integrate clunky disparate systems. D4M is a breath of fresh air, based on mathematics not unlike the venerable SQL relational database concepts [3].

[1] Mathematics of Big Data Spreadsheets, Databases, Matrices, and Graphs:

https://mitpress.mit.edu/9780262038393/mathematics-of-big-da...

[2] D4M: Dynamic Distributed Dimensional Data Model:

https://d4m.mit.edu/

[3] Just Use Postgres for Everything: How to reduce complexity and move faster:

https://www.amazingcto.com/postgres-for-everything/

[4] Just Use Postgres for Everything (430 comments):

https://news.ycombinator.com/item?id=33934139

