The naive approach of trying a list of strptime format strings in order breaks on ambiguous dates like "01/02/03" - is that January 2nd 2003, or February 1st 2003, or 1st February 2003? The answer depends on the locale and context that we often don't have.
Our current approach: scan the first 50 non-null values, rank format candidates by match frequency, flag ambiguous dates for user confirmation, and store the detected format alongside the column metadata for future imports. We handle about 94% of real-world mixed-format columns automatically, but the remaining 6% need user input.
Some patterns that came up more often than expected in real datasets: dates with ordinal suffixes ("1st", "2nd", "3rd"), fiscal quarter notation ("Q3 2024"), and Unix timestamps stored as strings.
Curious what approaches others have used, particularly around the ambiguity resolution step.
stop50•1h ago