You can't just assume UTF-8

https://csvbase.com/blog/9

195 points by calpaterson on 2024-04-29 | 471 comments

Automated Summary

The article 'You can't just assume UTF-8' discusses the complexity of encoding letters into binary for computers to understand. Different encoding systems, such as ASCII, Unicode, UTF-8, and UTF-16, encode characters differently, leading to incompatibilities. The article takes the example of the euro sign (€) and demonstrates how it is encoded differently in UTF-8, UTF-16, Win-1252, GB18030, Big5, and Shift_JIS. The article suggests that detection of encoding can be achieved using statistical methods that analyze byte and character frequencies. Heuristics have proven to be effective in determining encoding, outperforming the assumption that text is UTF-8.

Comments

JonChesterfield on 2024-04-29

How about assume utf-8, and if someone has some binary file they'd rather a program interpret as some other format, they turn it into utf-8 using a standalone program first. Instead of burning this guess-what-bytes-they-might-like nonsense into all the software.

We don't go "oh that input that's supposed to be json? It looks like a malformed csv file, let's silently have a go at fixing that up for you". Or at least we shouldn't, some software probably does.

ezoe on 2024-04-29

I doubt you can handle UTF-8 properly with that attitude.

The problem is, there is one very popular OS where it's very hard to enforce UTF-8 everywhere: Microsoft Windows.

It's very hard to ensure that the whole software stack you depend on uses the Unicode version of the Win32 API. Actually the native character encoding in Windows is UTF-16 so you can't just assume UTF-8. If you're writing low level code, you have to convert UTF-8 to UTF-16 and back. Even if you don't, you have to ensure all the low-level code you depend on does the same for you.

Oh, and don't forget about the Unicode normalizations. There is no THE UTF-8. There are a bunch of UTF-8s with different Unicode normalizations. Apple macOS uses NFD while others mostly use NFC.

These are just some examples. When people living in an ASCII world casually say "I just assume UTF-8", in reality they still assume it's ASCII.

josephg on 2024-04-30

> Actually the native character encoding in Windows is UTF-16 so you can't just assume UTF-8. If you're writing low level code, you have to convert UTF-8 to UTF-16 and back.

Yes. You should convert your strings. Thankfully, UTF-16 is very difficult to confuse with UTF-8 because they're completely incompatible encodings. Conversion is (or should be) a relatively simple process in basically any modern language or environment. And personally, I've never run into a problem where the difference between NFC and NFD mattered. (Do you have an example?). The different forms are (or should be) visually completely identical for the user - at least on modern computers with decent unicode fonts.
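
For illustration, a minimal JavaScript sketch of that round trip (JS strings are sequences of UTF-16 code units internally; TextEncoder/TextDecoder transcode to and from UTF-8 bytes):

    const utf8 = new TextEncoder().encode('€5');         // Uint8Array [0xE2, 0x82, 0xAC, 0x35]
    const text = new TextDecoder('utf-8').decode(utf8);  // back to a (UTF-16) string: '€5'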

The largest problem with UTF-8 (and its biggest strength) is how similar it is to ASCII. It is for this reason we should consider emoji to be a wonderful gift to software correctness everywhere. Correctly handling emoji requires that your software can handle unicode correctly - because they need multi-unit encoding with both UTF-16 and UTF-8. And emoji won't render correctly unless your software can also handle grapheme clusters.

> When people living in an ASCII world casually say "I just assume UTF-8", in reality they still assume it's ASCII.

Check! If your application deals with text, throw your favorite multi-codepoint emoji into your unit testing data. (Mine is the polar bear). Users love emoji, and your software should handle it correctly. There's no excuse! Even the windows filesystem passes this test today.

hnfong on 2024-04-30

This.

My native language uses some additional CJK chars on plane 2, and before the ~2010s a lot of software had glitches with characters beyond the basic plane of Unicode. I am forever grateful for the "Gen Z" who pushed for emojis.

Javascript's String.length is still semantically broken though. Too bad it's part of an unchangeable spec...

winternewt on 2024-04-30

There's no definition of String.length that would be the obvious right choice. It depends on the use case. So probably better to let the application provide its own implementation.

josephg on 2024-04-30

> So probably better to let the application provide its own implementation.

I’d be very happy with the standard library providing multiple “length” functions for strings. Generally I want three:

- Length in bytes of the utf-8 encoded form. Eg useful for http’s content-length field.

- Number of Unicode codepoints in the text. This is useful for cursor positions, CRDT work, and some other stuff.

- Number of grapheme clusters in the text when displayed.

These should all be reasonably easy to query. But they’re all different functions. They just so happen to return the same result on (most) ascii text. (I’m not sure how many grapheme clusters \0 or a bell is).
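
For illustration, a sketch of all three measures in JavaScript (assuming Intl.Segmenter is available), using the polar bear emoji mentioned upthread:

    const s = 'a🐻‍❄️';  // 'a' plus polar bear: U+1F43B + ZWJ + U+2744 + U+FE0F

    new TextEncoder().encode(s).length;   // 14: bytes of UTF-8 (e.g. Content-Length)
    [...s].length;                        // 5: Unicode codepoints
    [...new Intl.Segmenter(undefined, { granularity: 'grapheme' }).segment(s)].length;
                                          // 2: grapheme clusters as displayed
    s.length;                             // 6: UTF-16 code units, i.e. none of the above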

Javascript’s string.length is particularly useless because it isn’t even any of the above methods. It returns the number of bytes needed to encode the string as UTF16, divided by 2. I’ve never wanted to know that. It’s a totally useless measure. Deceptively useless, because it’s right there and it works fine so long as your strings only ever contain ascii. Last I checked, C# and Java strings have the same bug.

yau8edq12i on 2024-04-30

I don't know about Java, but the C# standard library is exceptionally well designed with respect to variable-width encoding. https://learn.microsoft.com/en-us/dotnet/standard/base-types...

The built-in string.Length property is useless (it returns the number of UTF-16 char values) and I agree that's a problem, but the solution is also built into the language, unlike in JS.

LegionMammal978 on 2024-04-30

JS these days also has ways to iterate over codepoints and grapheme clusters. If you treat the string as an iterator, then its elements will be single-codepoint strings, on which you can call .codePointAt(0) to get the values. (The major JS engines can allegedly elide the allocations for this.) The codepoint count can be obtained most simply with [...string].length, or more efficiently by looping over the iterator manually.

The Intl.Segmenter API [0] can similarly yield iterable objects with all the grapheme clusters of a string. Also, the TextEncoder [1] and TextDecoder [2] APIs can be used to convert strings to and from UTF-8 byte arrays.

[0] https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...

[1] https://developer.mozilla.org/en-US/docs/Web/API/TextEncoder

[2] https://developer.mozilla.org/en-US/docs/Web/API/TextDecoder
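
For illustration, a short sketch of the codepoint iteration described above (the sample string is arbitrary):

    for (const ch of 'e\u0301🐻') {                       // iterates codepoints, not UTF-16 units
      console.log(ch, ch.codePointAt(0).toString(16));    // e 65, combining acute 301, bear 1f43b
    }
    [...'e\u0301🐻'].length;                              // 3 codepoints ('e\u0301🐻'.length is 4)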

singpolyma3 on 2024-04-30

Don't you want grapheme clusters for cursor positions?

Length in encoded form can be found after encoding by checking the length of the binary content I guess.

I think for historical reasons access to codepoints can be useful, but it's rarely what one wants.

CryZe on 2024-04-30

There's Intl.Segmenter now, which does Unicode segmentation to count the number of graphemes, for example: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...

Though you are right in that I don't know of a built-in way to count Unicode Scalar Values (USVs).

josephg on 2024-04-30

JavaScript’s recently added implementation of String[Symbol.iterator] iterates through Unicode characters. So for example, [...str] will split any string into a list of Unicode scalar values.

josephg on 2024-04-30

Yep. I don't use eslint, but if I did I would want a lint against any use of string.length. It's almost never what you want, especially now that javascript supports unicode through [...str].

extraduder_ire on 2024-04-30

String.length is fine, since it counts UTF16 (UCS2?) code units. The attribute was only accidentally useful for telling how many characters were in a string for a long time, so people think it should work that way.

extraduder_ire on 2024-04-30

> I've never run into a problem where the difference between NFC and NFD mattered. (Do you have an example?)

The main place I've seen it get annoying is searching for some text in some other text, unless you normalize the data you're searching through the same way as you normalize your search string.

josephg on 2024-04-30

After reading the above comment I went looking for Unicode documentation talking about the different normalisation forms. One point surprised me because I hadn’t ever thought of it: they say search should be insensitive to normalisation form, so generally you should normalise all text before running a search.

That’s a great tip - obvious in hindsight but one I’d never considered.
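
For illustration, a minimal sketch of that tip in JavaScript (the strings are hypothetical):

    const haystack = 'Ame\u0301lie';  // 'é' as 'e' plus combining acute (NFD)
    const needle   = 'Am\u00e9lie';   // 'é' as one precomposed codepoint (NFC)

    haystack.includes(needle);                                    // false, yet they render identically
    haystack.normalize('NFC').includes(needle.normalize('NFC'));  // true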

kevin_thibedeau on 2024-04-30

> your software should handle it correctly. There's no excuse!

It is valid for the presentation of compound emoji to fall back to their component parts. You can't expect every platform to have an up to date database of every novel combination. A better test is emoji with color modifiers. Another good one is grandfathered symbols with both a text and emoji presentation and forcing the chosen glyph with the variant selector prefix.

josephg on 2024-04-30

> You can't expect every platform to have an up to date database of every novel combination.

On modern desktop OSes and smart phones, I do expect my platform to have an up-to-date unicode database & font set. Certainly for something like the unicode polar bear, which was added in 2020. I'll begrudgingly look the other way for terminals, embedded systems and maybe video games... but generally it should just work everywhere.

Server code generally shouldn't interact with unicode grapheme clusters at all. I'm struggling to think of any common, valid reason to use a unicode character database in 'normal' backend server code.

> Another good one is grandfathered symbols with both a text and emoji presentation and forcing the chosen glyph with the variant selector prefix.

I didn't know about that one. I'll have to try it out.

wheybags on 2024-04-30

> I'm struggling to think of any common, valid reason to use a unicode character database in 'normal' backend server code.

Case insensitive search

kevin_thibedeau on 2024-04-30

Unicode gets used in places that aren't continuously updated. Good luck showing a pirate flag emoji on an embedded device like an infotainment system.

PeterisP on 2024-04-30

I think being continuously updated should be tied to receiving new external content.

It's fine to have an embedded device that's never updated, but never receives new content - it doesn't matter that a system won't be able to show a new emoji because it doesn't have any content that uses that new emoji.

However, if it is expected to display new and updated content from the internet, then the system itself has to be able to get updated, and actually get updated; there's no acceptable excuse for that. If it's going to pull new content, it must also pull new updates for itself.

ndriscoll on 2024-04-30

As the user/owner of the device, no thanks. It should have code updates if and only if I ask it to, which I probably won't unless you have some compelling reason. For the device owner, pulling new updates by itself is just a built in backdoor/RCE exploit, and in practice those backdoors are often used maliciously. I'd much rather my devices have no way to update and boot from ROM.

bee_rider on 2024-04-30

So like a CD player needs some way to get updates? I guess they could send out CDs with updates but approximately nobody would actually do that.

PeterisP on 2024-05-02

The fact that we have to go as far back as CD players for a decent example illustrates my point - the "CD player" content distribution model is long dead, effectively nobody sells CD players or devices like CD players, effectively nobody distributes digital content on CDs or things like CDs (like, CD sales are even below vinyl sales) - almost every currently existing product receives content through a channel where updates could trivially be sent as well.

And if we're talking about how new products should be designed, then the "almost" goes away and they 100% wouldn't receive new content through anything like CDs, the issue transforms from an irrelevant niche (like CDs nowadays) to a nonexistent one.

kps on 2024-04-30

> Another good one is grandfathered symbols with both a text and emoji presentation and forcing the chosen glyph with the variant selector prefix.

I despise that Unicode retroactively applied default emoji presentation to existing symbols, breaking old text. Who the hell thought that was a good idea?

ezoe on 2024-04-30

That's one good thing emoji bring to the software developers mind set.

Before emoji, if somebody opened a bug report like: "Your software doesn't handle UTF-8 correctly. It doesn't handle Japanese.",

the response was "Huh? We don't bother to support Japanese. Go pound sand. Close ticket with wontfix.".

Now it's "Your software doesn't handle UTF-8 correctly. It doesn't handle emoji" and we're like "Oh shit! My software can't handle my beloved emoji!"

sharpshadow on 2024-04-30

Exactly. I was, and still am, surprised how fast and how widely emoji were adopted.

pasc1878 on 2024-04-30

Possibly because the major software companies made it work on phones. Thus users saw it working in many apps and complain when your app fails to do the same.

Google, Apple, IBM and MS also did a lot of localisation so their code bases deal with encoding.

It is FOSS Unix software that had the ASCII mindset, probably because C and C++ string types are byte arrays and many programmers want to treat strings as arrays. The macOS and Windows APIs do take UTF strings as their input, not char* (agreed, earlier versions did not, but they have provided the UTF encodings for at least 25 years).

marcosdumay on 2024-04-30

> And personally, I've never run into a problem where the difference between NFC and NFD mattered.

You mean like opening a file by name?

eclipticplane on 2024-04-30

> (Mine is the polar bear).

Mine is the crying emoji.

And after enough failures in breaking the system, the 100 emoji.

josephg on 2024-04-30

Those are ok, but both of those emoji are represented as a single unicode codepoint. Some bugs (particularly UI bugs) only show up when multiple unicode characters combine to form a single grapheme cluster. I'd recommend something fancier.

I just tried it in gnome-terminal, and while the crying emoji works fine, polar bear or a country flag causes weird issues.

rocqua on 2024-04-30

Crying emoji but with a different skin color?

josephg on 2024-04-30

That’d work!

thayne on 2024-04-30

> Apple macOS uses NFD while others mostly use NFC.

It's actually worse than that.

Older versions of macOS did enforce NFD for file names, but more recent versions don't, at least at the OS level. But many Apple programs, such as Finder, _will_ use NFD. Except that it isn't even the Unicode-standardized NFD; it is Apple's own modified version of it. And this can cause issues when, for example, you create a file in Finder, then search for it using `find`, typing the name of the file the exact same way, but it can't find the file because find got an NFC form while the actual file name is in NFD.

OTOH, in many applications you don't really care about the normalization form used. For example, if you are parsing a CSV, you probably don't need to worry about whether one of the cells uses a single code point or two code points to represent that accented e.

neonsunset on 2024-04-30

Thanks, yet another quantum of knowledge that makes one's life irreversibly, ever so slightly, worse. But not as bad as encryption (and learning all the terrible ways most applications have broken implementations of it).

staunton on 2024-04-30

> most applications have broken implementations

What applications? Almost nobody writes their own implementations of encryption nowadays (nor should they). You mean openssl is "broken"?

neonsunset on 2024-04-30

By broken implementations I meant incorrect usage of cryptographic APIs - padding errors, nonce reuse, using weak hash functions, etc.

ezoe on 2024-04-30

Yeah, I know that, but I omitted it to make my comment shorter. The world would be slightly better if there were no macOS.

DavidPiper on 2024-04-30

And immeasurably better if there were no Microsoft Windows ;)

(Kidding, mostly.)

magicalhippo on 2024-04-30

We make some B2B software running on Windows, integrating with customer systems. We get a lot of interesting files.

About a decade ago I wrote some utility code for reading files, where it'll try to detect a BOM first, and if there isn't one, scan for invalid UTF-8 sequences. If none are found, assume UTF-8; else assume Windows-1252. Worked well for us so far.

Still get the occasional flat file in Windows-1252 with one random field containing UTF-8, so some special handling is needed for those cases. But that's rare.

Fortunately we don't have to worry about normalization for the most part. If we're parsing, then any delimiters will be one of the usual suspects and the rest is data.
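
For illustration, a sketch of that heuristic using the WHATWG TextDecoder (assuming a runtime that supports the utf-16 and windows-1252 labels; fatal: true makes the UTF-8 check throw on the first invalid sequence):

    function decodeFlatFile(bytes /* Uint8Array */) {
      // 1. A byte order mark pins the encoding down exactly.
      if (bytes[0] === 0xEF && bytes[1] === 0xBB && bytes[2] === 0xBF)
        return new TextDecoder('utf-8').decode(bytes.subarray(3));
      if (bytes[0] === 0xFF && bytes[1] === 0xFE)
        return new TextDecoder('utf-16le').decode(bytes.subarray(2));
      if (bytes[0] === 0xFE && bytes[1] === 0xFF)
        return new TextDecoder('utf-16be').decode(bytes.subarray(2));
      // 2. No BOM: accept the bytes as UTF-8 only if they are valid UTF-8.
      try {
        return new TextDecoder('utf-8', { fatal: true }).decode(bytes);
      } catch {
        // 3. Fall back to Windows-1252, which maps every byte to something.
        return new TextDecoder('windows-1252').decode(bytes);
      }
    }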

surfingdino on 2024-04-30

Microsoft Windows is a source of many a headache for me as almost every other client I write code for has to deal with data created by humans using MS Office. Ordinary users could be excused, because they are not devs but even devs don't see a difference between ASCII and UTF-8 and continue to write code today as if it was 1986 and nobody needed to support accented characters.

morpheuskafka on 2024-04-30

I got a ticket about some "folders with Chinese characters" showing up on an SMB share at work. My first thought was a Unicode issue, and sure enough: when you combine two UTF-8/ASCII A-z code units together as one UTF-16 code unit, it will usually wind up in the CJK Unified Ideographs range of Unicode. Some crappy software had evidently bypassed the appropriate Windows APIs and just directly written a C-style ASCII string onto the filesystem without realizing that NTFS is UTF-16.
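
For illustration, the mechanism in two lines of JavaScript: a pair of ASCII bytes reinterpreted as a single UTF-16LE code unit lands in the CJK Unified Ideographs block:

    const bytes = new TextEncoder().encode('ab');  // [0x61, 0x62]
    new TextDecoder('utf-16le').decode(bytes);     // '\u6261', a CJK ideograph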

dgellow on 2024-04-30

I’ve shared it multiple times before, but I love it: the WTF-8 spec on ill-formed UTF-16 https://simonsapin.github.io/wtf-8/#ill-formed-utf-16

0xEF on 2024-04-30

Do you know of a resource that explains character encoding in greater detail? Just for my own curiosity. I am learning web development and boy, do they browbeat UTF-8 into us. Which, okay, I'll make sure that declaration is in my metadata, but none of them bother to explain how or why we got to that point, or why it all seems so splintered.

screwt on 2024-05-02

(very late reply, but in case you see it)

This Joel On Software article [0] is a good starting point. Incredibly it's now over 20 years old so that makes me feel ancient! But still relevant today.

The suggestion that the web should just use utf-8 everywhere is largely true today. But we still have to interact with other software that may not use utf-8 for various legacy reasons - the CSV file example in the original article is a good example. Joel's article also mentions the solution discussed in the original article, i.e. use heuristics to deduce the encoding.

[0] https://www.joelonsoftware.com/2003/10/08/the-absolute-minim...

int_19h on 2024-04-30

> If you're writing low level code, you have to convert UTF-8 to UTF-16 and back.

It depends. If you're writing an app, just add the necessary incantation to your manifest, and all the narrow char APIs start talking UTF-8 to you.

For a library, yeah.

ezoe on 2024-04-30

Don't be surprised when such a UTF-8 locale program breaks badly on Windows with the default language set to Japanese.

int_19h on 2024-04-30

Why would it break? If you just assume that the system codepage is UTF-8, then sure. If you specifically say in your manifest that you want UTF-8, then Windows (10+) will give you UTF-8 regardless of which locale it is:

https://learn.microsoft.com/en-us/windows/apps/design/global...

Comma2976 on 2024-04-30

Some[1] would see breaking Windows as a feature

[1]Me, surely at least 1 other

mikhailfranco on 2024-04-30

Some [1] may also consider working for any company/app that needs to display an emoji, to be a waste of at least one life (your life, and all your users' lives).

[1] Me, for sure.

Comma2976 on 2024-04-30

Some[1] may have missed the point, emojis were never supported by some[2] in their projects and they consider them to be the mark of the beast[3]

[1]you may find him at ::1

[2]HEAD~2

[3]Front-end work

998244353 on 2024-04-30

Non-technical users don't want to do that, and won't understand any of that. That's the unfortunate reality of developing software for people.

If Excel generates CSV files with some Windows-1234 encoding, then my "import data from CSV" function needs to handle that, in one way or another. A significant number of people generate CSV files from Excel and if these people upload these files into my application and I assume an incorrect encoding, they won't care that Microsoft is using obsolete or weird defaults. They will see it as a bug in my program and demand that I fix my software. Even if Excel offers them a choice in encoding, they won't understand any of that and more importantly they don't want to deal with that right now, they just want the thing to work.

berkes on 2024-04-30

> then my "import data from CSV" function needs to handle that, in one way or another.

It doesn't. Well, maybe "another".

Your function or even app doesn't need to handle it. Here's what we did on a bookkeeping app: remove all the heuristics, edge cases, broken-CSV handling and validation from all the CSV ingress points.

Add one service that did one thing: receive a CSV, and normalize it. All ingress points could now assume a clean, valid, UTF8 input. It removed thousands of LOCs, hundreds of issues from the backlog. It made debugging easy and it greatly helped the clients even.

At some point, when we offered their import runs for download, we added our normalized versions as [original name]_clean.csv. Got praise for that. Clients loved it, as they were often well aware of their internal mess of broken CSVs.

xorcist on 2024-04-30

The point is to separate the cleaning stage from the import stage. Having a clean utf-8 csv makes debugging import issues so much easier. And there are several well working csv tools such as the Python one, that can not only detect character encodings but also record separators and various quotation idiosyncrasies that you also need to be aware of when dealing with Microsoft Office files. Other people have already thought long and hard about that stuff so you don't have to.

edflsafoiewq on 2024-04-30

OTOH if no one ever pushes back against entropy, your great-grandkids will still be dealing with Windows-1234 problems in 2100.

vincnetas on 2024-04-30

How to push back against entropy?

Could this work? Implement handling of ancient excel files in your SaaS product, but charge extra dollar for parsing legacy formats and provide information how to export correct files from excel next time :)

vaylian on 2024-04-30

Plus, Excel really likes to use semicolons instead of commas for comma-separated-files. That's another idiosyncrasy that programmers need to take into account.

iggldiggl on 2024-04-30

I think that happens mainly in those locales where a comma is the traditional decimal separator and the dot gets used for 1000s grouping.

chopin on 2024-04-30

I'm German and I hate this. The icing on top: Excel separates function arguments with a semicolon in German locale, too. Got me some head scratching. Examples are separated with comma, iirc.

unclebucknasty on 2024-04-30

>A significant number of people generate CSV files from Excel and if these people upload these files into my application and I assume an incorrect encoding...

What if you just gave them instructions?

carlosjobim on 2024-04-30

I think most people would rather get in their car, drive to the ocean, board a ship and join the navy, and come back after a few years abroad - than following some instructions on how to use the computer.

Joker_vD on 2024-04-30

Joining the navy and then be obliged to follow instructions on so many things you'd normally never even be ordered to do (because now you're in the Navy, sailor! and there are three ways to do any thing: the right way, the wrong way, and the Navy way, and guess which way we use here in the Navy?), just to not follow some instructions on how to use the computer... is something I can actually imagine some people doing.

unclebucknasty on 2024-04-30

Pretty funny, and probably some truth to that being a user sentiment.

OTOH, they have to follow some process to use the software. For just the CSV export, they already have to ensure column orders, values, formats, maybe headers, etc. Selecting UTF-8 from a dropdown seems like the easy part.

lolc on 2024-04-30

> they already have to ensure column orders, values, formats, maybe headers, etc.

My accrued compatibility shims disagree.

unclebucknasty on 2024-04-30

Sorry to hear that. You have my sympathies.

nkrisc on 2024-04-30

They’ll probably switch to the application that does it for them and just works, instead of the one telling them to do something they don’t really understand.

unclebucknasty on 2024-04-30

Selecting UTF-8 from a dropdown on the export doesn't seem too onerous an ask. If that's the differentiator between you and your competitors, then you might have bigger problems.

nkrisc on 2024-04-30

I have sat in on and observed many in-person usability sessions with various applications and websites.

I can tell you right now that anyone reading this site is in a completely different universe of tech competency than the general public, and even professionals who aren’t tech-focused.

You would have lost many of them simply with the jumble of letters “UTF-8”.

jimmaswell on 2024-04-30

I hope someday, with improved public education, we can have better users. Unlikely, with Gen Z apparently having worse computer skills than the previous generation...

ghnws on 2024-04-30

Everyone knowing about character encodings sounds like a lot of time wasted.

Gen Z has skills in what is available and popular today, just like the previous generation had.

unclebucknasty on 2024-04-30

Yeah, I started writing software a "while" ago. I've encountered a handful of users (and charsets) in my time.

With CSV exports, there's already a good bit of training you have to do WRT the column layout, file format, headers (or not), cell value formats, etc. There's far more lift involved in training users there than ensuring they select "UTF-8" when they select CSV.

And, there's really nothing technical about it, as they don't need to understand what "UTF-8" actually means any more than they need to understand what "CSV" stands for. It's a simple ask in a list of asks.

I weigh this against having every developer now believe they have to check/convert every character set, which can be unreliable and produce garbage, BTW. And, speaking of garbage, there are some encodings that you won't be able to convert in any case, so it would be technically impossible to preserve data integrity without pushing some requirement on the source up to the user.

So, it's about tradeoffs. And, again, if asking users to choose UTF-8 is the difference between customers choosing your app or a competitor's, then you probably need to be more worried about that than your charset encoding.

EraYaN on 2024-04-30

Hell we just accept xlsx and xls natively. It sucks but it solves many issues. Also causes some, but you'll never get it out of users' heads that Excel files are not a data exchange format.

unclebucknasty on 2024-04-30

Definitely helps. Should be able to detect the charset from the Excel metadata, at least.

Of course, if that charset uses code points not available in your app's native charset, then you're kind of back to square one (unless your use case tolerates garbled or missing data).

ghnws on 2024-04-30

Having worked quite a bit in b2b applications, that is exactly the kind of thing that makes customers consider other options.

unclebucknasty on 2024-04-30

Interesting. My experience is that business users are much more teachable and mindful than consumers, especially when working with data.

zarzavat on 2024-04-29

Agreed. Continuing to support other encodings is like insisting that cars should continue to have cassette tape players.

It’s much easier to tell the people with old cassette tapes to rip them, rather than try to put a tape player in every car.

fl7305 on 2024-04-29

> It’s much easier to tell the people with old cassette tapes to rip them

I assume you mean "rip them", as in transcode to a different format?

In that case, you need a tool that takes the old input format(s) and convert them to the new format.

For text files, you'd need a tool that takes the old text files with various encodings and converts them to UTF-8.

Isn't the point of the article to describe how an engineer would create such a tool?

jcranmer on 2024-04-29

> Isn't the point of the article to describe how an engineer would create such a tool?

Honestly, no, because the tool that it's suggesting how to write isn't one that will even come close to doing a good job.

If you want to write such a tool, the first thing you need to do is to understand what the correct answer is. And to do that, you need to sample your input set to figure out what the correct answer should be for several inputs where it matters. There's unfortunately no easy way to avoid that work; universal charset detection isn't really a thing that works all that well.

fl7305 on 2024-04-29

I agree 100% on the technical issues.

But the point of the article is not the same thing as how well it achieved its goals.

javajosh on 2024-04-30

>universal charset detection isn't really a thing that works all that well.

This seems like something LLMs would be good at. A mundane use of them, but I bet they'd be really good at determining that the input has the wrong encoding. Then the program would iterate through encodings, from most probable to least, and select the one that the LLM likes the most. Granted, this means your tool will be 1GB or more. But hey, thems the breaks.

fl7305 on 2024-04-30

Yeah, that could be an interesting use of LLMs. It could at least tell you which languages might be present in the input text.

In the 1980s, we had a version of 7-bit ASCII in Sweden where the three extra Swedish vowels "åäö" were represented by "{}|".

So what might look like regular US 7-bit ASCII should be interpreted as the Swedish version if the text is in Swedish with "{}|" where "åäö" normally goes.
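
For illustration, a sketch of re-reading such text as Unicode, assuming the ISO 646 SE assignments (where { } | stand for ä å ö, and [ ] \ for Ä Å Ö):

    const SE = { '{': 'ä', '}': 'å', '|': 'ö', '[': 'Ä', ']': 'Å', '\\': 'Ö' };
    const fromSwedishAscii = (s) => s.replace(/[{}|[\]\\]/g, (c) => SE[c]);

    fromSwedishAscii('V{lkommen till Malm|');  // 'Välkommen till Malmö'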

extraduder_ire on 2024-04-30

I'm glad that didn't stick around, like ¥/₩ being used for directory separators like \. I can't imagine trying to read source code with those substitutions.

0xffff2 on 2024-04-29

A tool whose purpose is to transcode should be asking the user to select explicit input and output formats, not guessing.

fl7305 on 2024-04-29

> should be asking the user to select explicit input and output formats

It depends on the requirements.

If you're hired by a company to convert millions of old textfiles, they might want you to do it as well as possible using heuristics without any human input as a starting point.

tempaccount1234 on 2024-04-30

Unless the input is such garbage that asking the user is pointless (like a lot of web sources) and fixing it manually is too time-expensive.

It becomes easier to just use https://pypi.org/project/ftfy/ on the input.

taneq on 2024-04-30

No you don’t, they (the ones with the piles of old mixtapes) do.

fl7305 on 2024-04-30

I'm an engineer. I'm sometimes hired by they to create tools like that.

gwervc on 2024-04-29

UTF-8 uses 50% more bytes than UTF-16 to encode Chinese or Japanese texts.

arp242 on 2024-04-29

Only with "pure" CJK text in a flat text file; for most real-world situations you'll have enough ASCII text that UTF-8 will be smaller: HMTL/XML tags, email headers, things like that. I did some tests a few years back, and wasn't really able to come up with a real-world situation where UTF-16 is smaller. I'm sure some situations exist, but by and large, CJK users are better off with UTF-8.

kijin on 2024-04-30

Yep. I'm a heavy user of CJK languages and I don't give a damn about the slightly increased plaintext storage. Give me UTF-8 any day, every day. Legacy two-byte encodings can't represent all of the historical glyphs anyway, so there's no room for nationalist crap here.

shinalin on 2024-04-30

Well, it's great that you did some tests a few years back, but I'm not sure how that qualifies you to make such a sweeping generalization about CJK text encoding. It's easy to dismiss UTF-16's benefits when you're only looking at a narrow slice of the real world, ignoring the vast amounts of pure CJK literature, historical archives, and user-generated content out there.

jiggawatts on 2024-04-30

Databases tend to pack rows efficiently enough that there are scenarios where this is noticeable.

Cloudef on 2024-04-30

Only for dense CJK text, and even then if you compress the difference goes away https://utf8everywhere.org/#asian

bawolff on 2024-04-29

Which basically never matters and in any case where it actually does, gzip will make it equal again.

iraqmtpizza on 2024-04-29

zip-then-encrypt leaks information about the plaintext. if it's life or death, better not to compress at all

bawolff on 2024-04-30

Only when the attacker can choose part of the plaintext and do the same thing over and over again with different chosen plaintexts to compare results.

Yes, there are scenarios where that matters. However the vast majority of usecases of utf-8 don't fit that or even use encryption at all.

iraqmtpizza on 2024-04-30

That is not the only way. There are other ways of knowing partial contents of files and changes to files, depending on the situation. If the document is a known form in which one of five boxes is checked by the sender, it's probably not hard to rule out certain selections based on the ciphertext length, if not pin down the contents exactly.

bawolff on 2024-04-30

I'm not sure I entirely understand your example (if there are 5 checkboxes and 1 checked, presumably the length would be the same regardless of which one is checked). However, to your broader point, I agree there exist scenarios along those lines (e.g. fingerprinting known communication based on length), but most of them apply even better when not using compression.

iraqmtpizza on 2024-04-30

The checkbox example is completely plausible. There is no guarantee that all checkboxes lead to the same number of bytes changed in the file when checked. What if the format makes a note of the page number wherever a checkbox is checked? 1X could be two bytes and 15X would be three.

And even if the format only stored the checkbox states as a single bit each (unlikely), compression algorithms don't care. They will behave differently on different byte sequences, which can easily lead to a difference in output length.

Also, it's already been done with voice calls with no attacker-controlled data: https://web.archive.org/web/20080901185111/https://technolog...

zrm on 2024-04-30

The attack you're referring to is not specific to compression. It's the same class of attack that can reveal keystrokes over older versions of ssh based on packet size and timing, even on uncompressed connections. Conversely, fixed-bitrate voice streams don't have the same vulnerability as variable-bitrate encodings even though they're still compressed.

The version of your checkbox example which is vulnerable without any formal data compression is when the checkbox is encoded in a field that is only included or changes in length if the value isn't the default, common in uncompressed variable-length encodings like JSON.

iraqmtpizza on 2024-04-30

I'm sure that the people getting hacked care deeply about whether the attack they suffered was sui generis.

Also, zip/deflate etc was not designed to eliminate side channel leakage. Some compression schemes obviously (with padding) can mitigate leaks, but it has to be done deliberately

zrm on 2024-04-30

Any of it has to be done deliberately. The length of the data reveals something about its contents whether it's compressed or not.

The special concern with compression is when attacker-controlled data is compressed against secret data because then the attacker can measure the length multiple times and deduce the secret based not just on the length but on how the length changes when the secret is constant and the attacker-controlled data varies. This can be mitigated with random padding (makes the attack take many times more iterations because it now requires statistical sampling) or prevented by compressing the sensitive data and attacker-controlled data separately.

bawolff on 2024-04-30

If your example needs additional assumptions to be relevant, then you should probably state them when you bring up the example.

iraqmtpizza on 2024-05-01

like what lol

thwarted on 2024-04-30

Encryption is completely unrelated to the task at hand, which is text encoding and compressing, and text encoding is not encryption.

tnmom on 2024-04-29

Huh, never heard that before. Does it leak more information than just encrypting without zipping? Struggling to imagine how this attack works.

Jach on 2024-04-30

It's an extension of the chosen-plaintext attack, and so requires the attacker to be able to send custom text that they know is in the encrypted payload. If the unencrypted payload is "our-secret-data :::: some user specified text", then the attacker can eventually determine the contents of our-secret-data by observing how the size of the encrypted response changes as they change the text when the compression step matches up with a part of the secret data. It can be defeated by adding random-length padding after compression and before the encryption step, though.

bawolff on 2024-04-30

Essentially if you zip something, repeated text will be deduplicated.

For example "FooFoo" will be smaller than "FooBar" since there is a repeated pattern in the first one.

The attacker can look at the file size and make guesses about how repetitive the text is if they know what the uncompressed or normal size is.

This gets more powerful if the attacker can insert some of their own plaintext.

For example if the plaintext is "Foo" and the attacker inserts "Fo" (giving "FooFo") the result will be smaller than if they inserted zq where there is no pattern. By making lots of guesses the attacker can figure out the secret part of the text a little bit at a time just by observing the size of the ciphertext after inserting different guesses.
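
For illustration, a toy version of that guessing game using Node's zlib (the secret and the guesses are made up; the attacker observes only the compressed length):

    const { deflateSync } = require('node:zlib');

    const secret = 'token=hunter2;';  // hidden from the attacker

    // The attacker appends a guess and watches the compressed size.
    const size = (guess) => deflateSync(Buffer.from(secret + guess)).length;

    console.log(size('token=hunter2'));  // smaller: the guess repeats the secret verbatim
    console.log(size('token=abcdefg'));  // larger: only 'token=' finds a match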

iraqmtpizza on 2024-04-29

Encrypting without zipping doesn't leak any information about the content. You can't rule out certain byte sequences (other than by total length) just by looking at the ciphertext length.

If "oui" compresses to two bytes and "non" compresses to one byte, and then you go over them with a stream cipher, which is which:

A: ;

B: *&

lisper on 2024-04-30

This has nothing to do with compression. If you use "yes" and "no" instead of "oui" and "non" (which just happen to be three characters each) and you compress "yes" to "T" and "no" to "F" then the uncompressed text will be the leaky one.

stouset on 2024-04-30

It’s an example meant to prove the idea.

lisper on 2024-04-30

Yes, and my example was an example meant to prove the opposite idea. The point is that it is irrelevant whether you compress or not. You can leak information either way.

iraqmtpizza on 2024-04-30

I leak the length of my phone call and you leak:

1. the length of your phone call; and

2. what language you were speaking; oh and

3. half the words you said

(i.e. pwned)

https://web.archive.org/web/20080901185111/https://technolog...

lisper on 2024-04-30

> you leak [a bunch of stuff]

How? Remember, the uncompressed text gets encrypted too.

iraqmtpizza on 2024-04-30

It's in the article if you would bother to read it LOL. "simply measuring the size of packets without decoding them can identify whole words and phrases with a high rate of accuracy . . . [the researchers] can search for chosen phrases within the encrypted data"

lisper on 2024-04-30

Ah.

That article is about voice calls. Totally different topic. Nothing to do with UTF-8.

iraqmtpizza on 2024-04-30

[flagged]

BalinKing on 2024-04-30

Cryptography noob here: I'm confused by "Encrypting without zipping doesn't leak any information about the content." Logically speaking, if we compress first and therefore "the content" will now refer to "the zipped content", doesn't this mean we still can't get any useful information?

thadt on 2024-04-30

Not OP, but 'zipping and encrypting' one thing (a file for example) does not leak information by itself. The problem comes when an adversary is able to see the length of your encrypted data, and then can see how that length changes over time - especially if the attacker can control part of the input fed to the compressor.

So if you compressed the string "Bob likes yams" and I could convince you to append a string to it and compress again, then I could see how much the compressed length changed.

If the string I gave you was something already in your data then the string would compress more than it would if the string I gave you was not already in your data - "Bob likes yams and potatoes" will be larger than "Bob likes yams likes Bob".

If the only thing I can see about your data is the length and how it changes under compression - and I can get you to compress that along with data that I hand to you - then eventually I can learn the secret parts of your data.

bawolff on 2024-04-30

Encryption generally leaks the size of the plaintext.

This is true in both the compressed and non-compressed case. However with compression the size of the plaintext depends on the contents, so the leak of the size can matter more than when not using compression.

Even without compression this can matter sometimes. Imagine encrypting "yes" vs "no".

BalinKing on 2024-04-30

> Encryption generally leaks the size of the plaintext.

Ah, I see. Naïvely, this seems like a really bad thing for an encryption algorithm to do—is there no way around it? Like, why is encryption different from hashing in this regard?

bawolff on 2024-04-30

There are methods, but they are generally very inefficient bandwidth-wise in the general case. The general approach is to add extra text (padding) so that all messages are a fixed size (or e.g. rounded up to some power of 2). The higher the fixed size is, the less information is leaked and the less efficient it is. E.g. if you pad to 64MB but need to transmit a 1MB message, that is 63MB of extra data to transmit.

Part of the problem (afaik) is we lack good mathematical tools to analyze the trade-offs of different padding sizes vs how much extra privacy they provide. This makes it hard to reason about how much padding is "enough".

Another approach is adding a random amount of padding. This can be defeated if you can force the victim to resend messages (which you then average out the size of).

Hashing is different because you don't have to reconstruct the message from the hash. With encryption, the recipient needs to decrypt the message eventually and get the original back. However, there is no way to transmit a (maximally compressed) message in less space than it takes up.

There are special cases where this doesn't apply, e.g. if you have a fixed transmission schedule where you send a specific number of bytes on an agreed-upon schedule.
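
For illustration, a sketch of the fixed-bucket idea: pad every message up to the next power of two, so many plaintext lengths collapse into one observable size (the 0x80 end marker is one common way to make padding removable; this is an illustrative scheme, not a vetted one):

    function padToPowerOfTwo(plaintext /* Uint8Array */) {
      let size = 1;
      while (size < plaintext.length + 1) size *= 2;  // +1 leaves room for the marker
      const padded = new Uint8Array(size);            // zero-filled by default
      padded.set(plaintext);
      padded[plaintext.length] = 0x80;                // to unpad: strip zeros, then this byte
      return padded;
    }

    padToPowerOfTwo(new Uint8Array(600_000)).length;  // 1048576, same as a ~1 MB message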

stouset on 2024-04-30

Yes, of course it leaks more information than encryption without compression, because that’s just encryption which doesn’t leak anything.

In an enormous number of real-world cases, adversaries can end up including attacker-controlled input alongside secret data. In that case you can guess at secret data, and if you guess correctly, you get smaller compressed output. But even without that, imagine the worst case: a 1TB file that compresses to a handful of bytes. Pretty clearly the overwhelming majority of the text is just duplicate bytes. That’s information which is leaked.

kazinator on 2024-04-30

That's a small price to pay for:

- not having to worry about byte order;

- and cruft like surrogate pairs;

- being able to pass the text through 8 bit string representations and even manipulate it as 8 bit in certain useful ways

Add bitmapped graphics to any text, and it will dominate the size.

Size of text is just noise in the world of streaming video, terabyte drives, fiber to the home ...

neonsunset on 2024-04-30

[flagged]

ryandrake on 2024-04-29

> We don't go "oh that input that's supposed to be json? It looks like a malformed csv file, let's silently have a go at fixing that up for you". Or at least we shouldn't, some software probably does.

What ever happened to the Robustness Principle[1]? I think the entire comment section of this article has forgotten it. IMO the best software accepts many formats and "deals with it," or at least attempts to, rather than just exiting with "Hahah, Error 19923 Wrong Input Format. Try again, loser."

1: https://en.wikipedia.org/wiki/Robustness_principle

jerf on 2024-04-29

We collectively discovered that we were underestimating the long term costs, by a lot, so its lustre has faded. This is in some sense relatively recent, so word is still getting around, as the programming world does not move as quickly as it fancies itself to.

If you'd like to see why, read the HTML 5 parsing portion of the spec. Slowly and carefully. Try to understand what is going on and why. A skim will not reveal the issue. You will come to a much greater understanding of the problem. Some study of what had happened when we tried to upgrade TCP (not the 4->6 transition, that's its own thing) and why the only two protocols that can practically exist on the Internet anymore are TCP and UDP may also be of interest.

fanf2 on 2024-04-29

RFC 9413 “maintaining robust protocols” https://datatracker.ietf.org/doc/rfc9413/ was originally titled “The Harmful Consequences of the Robustness Principle (draft-thomson-postel-was-wrong)” https://datatracker.ietf.org/doc/draft-thomson-postel-was-wr... and it has several examples

riffraff on 2024-04-30

Doesn't HTML5 prove that such an approach was more effective tho?

HTML was a massive success, while at the same time when we tried having a strict spec with XHTML, we failed miserably.

I mean, the purist in me wants strictness, but it seems looseness wins.

kelnos on 2024-04-30

I think the particular domain matters.

The explosion of the web happened in no small part because of how easy it was to write some HTML and get a basic, working webpage out of it. If you nested some tags the wrong way and the browser just put up an error page, rather than doing a (usually) pretty good job figuring out what you actually meant, people would get frustrated faster and not bother with it at all.

But imagine if our C/C++/Java/Rust/Go/etc. compilers were like "syntax error, but ehhhhh you probably meant to put a closing brace there, so let's just pretend you did". That would be a nightmare of bugs and security issues.

The difficulty in drawing a line in the sand and sticking to the spec, though, is that of user blame. Let's say you implement a spec perfectly -- even if you are the originator of the spec -- and then someone comes along and builds something of their own that writes out files that don't conform to the spec. Your software throws up an error and says "invalid file", but the other piece of software can read it back in just fine. Users don't know or care about specifications; they just know that your software "doesn't work" for the files they have, and the other software does. If you try to tell them that the file is bad, and the other software has a bug, they really won't care.

astine on 2024-04-30

> But imagine if our C/C++/Java/Rust/Go/etc. compilers were like "syntax error, but ehhhhh you probably meant to put a closing brace there, so let's just pretend you did". That would be a nightmare of bugs and security issues.

That's how Perl ended up the way it is.

AnthonyMouse on 2024-04-30

A possible solution to this is for large organizations to be intransigent about standards compliance. If your personal mail server rejects mail that isn't well-formed, you're just being a masochist because nobody is going to change for you. If Google does it, everybody else is going to fix their stuff because otherwise they can't send to gmail.

jerf on 2024-04-30

It proves that engineering is hard and that everything has costs and benefits, and you can't make them go away by just ignoring them. It turns out that "being robust" had a lot of costs we didn't see.

It also shows the harmfulness of binary black and white thinking in engineering. There are choices other than "just let everyone do whatever and hope all the different things picking up the pieces do it in more or less the same way" and "rigidly specify a spec and blow up the universe at the slightest deviation". Both of those easy-to-specify choices have excessive costs. There is no escape from hard design tasks. XHTML may always have been doomed to fail, but that is not to say that HTML had to be allowed to be as loosey-goosey as it is, either.

Had a gradient of failure been introduced rather than a rigid rock wall, things very likely wouldn't have gotten as badly out of hand as they did. If, for instance, a screwed up table was specified to deliberately render in a very aesthetically unappealing manner, but not crash the entire page the way XHTML did, people would have not come to depend so much on HTML being sloppy. The resulting broken page would still be somewhat usable, but there would have been motivation to fix it, rather than the world we actually live in where it all just seemed to work.

tomsmeding on 2024-04-29

Can you give a hint as to what the issue is that one should find reading a portion of the HTML 5 spec? Or is it genuinely unexplainable without experiencing something first-hand?

fl7305 on 2024-04-29

>>> the best software accepts many formats and "deals with it,"

>> We collectively discovered that we were underestimating the long term costs, by a lot, so its lustre has faded.

>> If you'd like to see why, read the HTML 5 parsing portion of the spec.

> Can you give a hint as to what the issue is that one should find reading a portion of the HTML 5 spec

I think the point was that the HTML 5 spec tries to parse all kinds of weird input instead of drawing a line in the sand and forcing the input to follow a simple format?

Spivak on 2024-04-30

It's ungodly complicated to be sure but websites from 30 years ago also still render unmodified so it's hard to say it's bad all of the time.

jerf on 2024-04-30

It isn't bad. In fact it's quite good. But it is very much a case of closing the barn door after the animals got out. You see in the standard that the effort was put to corral them back in, and I'm very glad they did, but it certainly was not free.

travisb on 2024-04-29

Among other things, security became a concern.

Being lenient is all well and good when the consequences are mild. When the consequences of misinterpreting, or of interpreting differently from a second implementation, become costly (such as a security exploit), the Robustness Principle becomes less obviously a win.

It's important to understand that every implementation will try to fix-up formatting problems in their own way unique to their particular implementation. From that you get various desync or reinterpretation attacks (eg. HTTP request smuggling).

thfuran on 2024-04-29

It was a horrible idea. The real robustness principle is "Follow the spec".

nomel on 2024-04-29

I've tried telling users "sorry, your file isn't to spec", and they say "but it works with <competitor>", and that ideology flies right out the window, along with their money.

ryandrake on 2024-04-30

Exactly. "Accepting and trying" is how a lot of popular software won their market. Look at HN's favorite media player, VLC. In the past, media player software were horrible, refusing to play all but the most tightly constrained set of allowed containers/codecs. I remember spending the early 2000s trying to get Windows Media Player to play MPEGs by downloading codec packs and trying to cast secret spells into the Windows Registry. Yuck! Then along comes VLC which accepts whatever you throw at it, and that software is basically everywhere now. You can throw line noise at VLC and it will try its best to play something!

silotis on 2024-04-30

The trick is to enforce conformance right from the start in the first implementation of a format. Shipping a product that doesn't interop with the existing tools is a non-starter so the devs will have to fix their shit first.

As you say, unfortunately the genie cannot be put back in the bottle for formats that already have defective implementations in the wild.

riffraff on 2024-04-30

The problem is that unless you restrict everyone else, you'll get your own product failing to interact with *new* tools too.

E.g. you produce valid .wat files, but my software which also outputs those has some bits screwed up.

My program can read both kinds of .wat file but yours can't, and I have 5% market share.

Your users complain they sometimes receive files your software can't read while the competitor can. Do you tell them "well that file is invalid, tell whoever sent it to you to change the software they use"?

The genie can't stay in the bottle unless you have some sort of certification authority and even that may not be enough (see USB)

thfuran on 2024-04-30

But the more every actor aims to follow the spec rather than reproduce everyone else's bugs, the less quickly the ecosystem devolves into a horrid tangled mess that's nigh impossible for a new entrant.

quectophoton on 2024-04-30

"So you're saying that I can have an advantage over everyone else if I implement all the spec features, plus one extra feature that adds convenience to our users?

And you're saying that by doing this, not only do I gain an advantage over the existing competition, but I also make it more difficult for more competitors to appear?

[takes notes]"

thfuran on 2024-05-01

Microsoft is a few decades ahead of you there.

jimjimjim on 2024-04-30

This is the answer.

The customer does not know. They just want it to work. They may be using something that someone else gave them. The original source system of the file may not be changeable. But most importantly, their boss just wants it to work. Or else.

thfuran on 2024-04-30

People will pay you to do all kinds of terrible not-robust things.

Joker_vD on 2024-04-30

Yes, and that's why Postel's law is more of an empirical observation (a law of nature, if you will) about which software survives and which doesn't. You may dislike it, but that won't make it go away.

nomel on 2024-05-01

I see it as the opposite, actually. They will pay you for a robust product: one that works for them. They have no care for the technical minutiae of your implementation, because they're just trying to do actually interesting things, with the help of your product. This is the customer perspective.

pixl97 on 2024-04-30

That's when you ask to see their file, then hack <competitors> system.

vidarh on 2024-04-30

That is fine in contexts where a wrong guess does no harm.

But that is not always the case, and e.g. silently "fixing" text encoding issues can often corrupt the data if you get it wrong.

By all means offer options if you want, but if you do, flag very clearly to the user that they're taking a risk of corrupting the data unless any errors are very apparent and trivial to undo.

david422 on 2024-04-30

> "deals with it,"

This basically loses data integrity if it's wrong though.

You might want to do that with human input if it's helpful to the user - ie user enters a phone number and you strip dashes etc. But if it's machine to machine, it should just follow the spec.

tedunangst on 2024-04-29

Seems like you've uploaded a jpeg. Let me OCR that into CSV for you. Hmm, no text found. Let's pass it to a multimodal LLM.

kelnos on 2024-04-30

The article addresses this, that current thinking in many places is that the robustness principle / Postel's Law maybe wasn't the best idea.

If you reject malformed input, then the person who created it has to go back and fix it and try again. If you interpret malformed input the best you can (and get it right), then everyone else implementing the same thing in the future now also has to implement your heuristics and workarounds. The malformed input effectively becomes a part of the spec.

This is why HTML & CSS are the garbage dump they are today, and why different browsers still don't always display pages exactly alike. The reason HTML5 exists is because people finally just gave up and decided to standardize all the broken behavior that was floating around in the wild. Pre-HTML5, the web was an outright dumpster fire of browser compatibility issues (as opposed to the mere garbage dump we have today).

Anyway, it's not really important to try to convince you that Postel's Law is bad; what's important is that you know that many people are starting to think it's bad, and there's no longer any strong consensus that it was ever a good thing.

arp242 on 2024-04-29

> What ever happened to the Robustness Principle

Bush hid the facts

cesarb on 2024-04-30

> Bush hid the facts

For those who don't know that reference: https://en.wikipedia.org/wiki/Bush_hid_the_facts

(A text file containing only the ASCII bytes "Bush hid the facts", when opened in Windows Notepad, displays a sequence of CJK characters instead of the expected English sentence.)
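You can reproduce the effect directly by forcing the same misinterpretation (a quick Python check; Notepad's old IsTextUnicode heuristic guessed UTF-16LE for those 18 ASCII bytes):

  >>> b"Bush hid the facts".decode("utf-16-le")
  '畂桳栠摩琠敨映捡獴'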

lamontcg on 2024-04-30

I've lived through dealing with non-UTF8 encoding issues and it was a truly gigantic pain in the ass. I'm much more on the side now of people who only want to deal with UTF8 and fully support software that tells any other encoding to go pound sand. The harder life gets for people who use other encodings (yes, particularly Microsoft) the more incentive they have to eventually get on board and stop costing everyone time and effort managing all this nonsense.

jppittma on 2024-04-30

I think people have collectively decided that they want their programs stupid and predictable, rather than smart and unwieldy.

kazinator on 2024-04-30

What happened to it:

https://en.wikipedia.org/wiki/Robustness_principle#Criticism

Postel's Law doesn't pass a software engineering smell test.

The idea that software should guess and repair bad inputs is deeply flawed. It is a security threat and a source of enshittification.

chuckadams on 2024-04-30

> enshittification

I am so over this word.

kazinator on 2024-04-30

I put a TODO item in my calendar for December 2024 to stop using it. :)

ykonstant on 2024-04-30

The enshittification of enshittification was an inevitability.

fl7305 on 2024-04-29

> they turn it into utf-8 using a standalone program first

I took the article to be for people who would be writing that "standalone program"?

I have certainly been in a position where I was the person who had to deal with input text files with unknown encodings. There was no-one else to hand off the problem to.

0xffff2 on 2024-04-29

In that case, you should be explicitly asking the user what the input format is.

fl7305 on 2024-04-29

> you should be explicitly asking the user what the input format is.

Me: "I'll make the program stop and ask the user."

Customer: "No. We have 10 million files. Start by using heuristics for batch processing."

Spivak on 2024-04-30

And it's usually worse, we have 10 million files coming from sources neither we nor our users control.

Or "what do you mean ask them, they barely know what a jpeg is?!"

MichaelZuo on 2024-04-30

There's someone that must be paying your invoices? Charge them extra for the extra work that needs to be done to sort things out.

Spivak on 2024-04-30

??? The work to use one of the many encoding guessing tools https://github.com/Joungkyun/libchardet and then get it correct for almost every document?

You just look bad if you can't do what every other piece of software is able to do. Charging for it takes that to another level.

MichaelZuo on 2024-04-30

Then you don't have to worry about it since you won't get the work in the future? Someone else, with this presumably correct software, will always be able to do it for less, faster, and at a higher quality.

That's how business works...

If such a business competitor doesn't exist, then yes charge extra, and actually do the work correctly.

Spivak on 2024-04-30

Am I missing something here? The work is ingesting documents from uncontrolled sources that might not all be UTF-8 and handling them correctly. Using an encoding guessing tool is the means to do that. In practice, since there are only a few widely-used encodings and they're not terribly ambiguous, everything just works and users happily use the software.

This isn't some theoretical thing; we do this at $dayjob right now, not only guessing the encoding but the file type as well, so that we can make sense out of whatever garbage our users upload. Everything from malformed CSV exports from Excel to PDFs that are just JPEGs of scanned documents. It works, and it works well.

And of course it does, the files our users are handing to us work on their machines. They can open them up and look at them in whatever local software they produced them with, there's no excuse for us to be unable to do the same.

MichaelZuo on 2024-04-30

Then you don't need to worry about it either way?

bobmcnamara on 2024-04-30

Heckin no thank you!

The FCC ULS database records are stored in a combination of no fewer than three different encodings (1252, UTF-8, and something else for a handful of German names) that vary per record.

When I brought this up, they said something to the effect of: it's already Unicode, it has tilde letters!

eythian on 2024-04-30

I did once have a file that had UTF8, Windows-1252, MARC8, and VT100 (really) all mixed up in it. I think the data had gone through multiple migrations between software in its past.

I had to write my own "clean this as well as possible" thing, and it did a good enough job.

PaulDavisThe1st on 2024-04-29

"I dunno, got it from a friend on a USB stick"

amarshall on 2024-04-29

> some software probably does.

Browsers do, kind of https://mimesniff.spec.whatwg.org/#rules-for-identifying-an-...

SuperNinKenDo on 2024-04-30

Not every encoding can make a round trip through Unicode without you writing ad hoc handling code for every single one. There are a number of reasons some of these are still in use, and Unicode destroying information is one of them.

thaumasiotes on 2024-04-30

> We don't go "oh that input that's supposed to be json? It looks like a malformed csv file, let's silently have a go at fixing that up for you". Or at least we shouldn't, some software probably does.

Browsers used to have a menu option to choose the encoding you wanted to use to decode the page.

In Firefox, that's been replaced by the magic option "Repair Text Encoding". There is no justification for this.

They seem to be in the process of disabling that option too:

> Note: On most modern pages, the Repair Text Encoding menu item will be greyed-out because character encoding changes are not supported.

( https://support.mozilla.org/en-US/kb/text-encoding-no-longer... )

This note is logical gibberish; encoding isn't something that has to be supported by the page. Decoding is a choice by the browser!

shiomiru on 2024-04-30

It seems the decision was made in the name of security:

https://hsivonen.fi/no-encoding-menu/

> Supporting the specific manually-selectable encodings caused significant complexity in the HTML parser when trying to support the feature securely (i.e. not allowing certain encodings to be overridden). With the current approach, the parser needs to know of one flag to force chardetng, which the parser has to be able to run in other situations anyway, to run.

> Elaborate UI surface for a niche feature risks the whole feature getting removed

> Telemetry [...] suggested that users aren’t that good at choosing correctly manually.

In other words, it's trying to protect users from themselves by dumbing down the browser. (Never mind that people who know what they are doing have probably also turned off telemetry...)

thaumasiotes on 2024-04-30

Wow. That writeup is insane.

>> Supporting the specific manually-selectable encodings caused significant complexity in the HTML parser when trying to support the feature securely (i.e. not allowing certain encodings to be overridden).

There's no explanation of why you'd want this, or why it's security-relevant.

(Farther down, there's a mention of self-XSS, which definitely isn't relevant.)

>> Elaborate UI surface for a niche feature risks the whole feature getting removed

They've already removed the whole feature. That was easier to do after they mostly disabled it, not harder.

>> Telemetry showed users making a selection from the menu when the encoding of the page being overridden had come from a previous selection from the menu.

That would be an example of "working as expected". The removal of the ability to do this is the problem that disabling the encoding menu causes! Under the old, correct approach, you'd guess what the encoding was until you got it right. Under the new approach, the browser guesses for you, and if the first guess is wrong, screw you.

n_plus_1_acc on 2024-04-30

Probably because most websites now send a correct encoding header or meta tag, so the user changing it can only make it wrong. (This assumes no encoding header is ever wrong, which does happen in reality.)

layer8 on 2024-04-30

It does happen a lot that old text content in non-UTF-8 encoding is mistakenly served explicitly marked as UTF-8. It is precisely in such circumstances that the Repair Text option is useful.

kelnos on 2024-04-30

Solutions that require lots of unrelated people to start doing something a different way are not really solutions.

kstrauser on 2024-04-29

If you give me a computer timestamp without a timezone, I can and will assume it's in UTC. It might not be, but if it's not and I process it as though it is, and the sender doesn't like the results, that's on them. I'm willing to spend approximately zero effort trying to guess what nonstandard thing they're trying to send me unless they're paying me or my company a whole lot of money, in which case I'll convert it to UTC upon import and continue on from there.

Same with UTF-8. Life's too short for bothering with anything else today. I'll deal with some weird janky encoding for the right price, but the first thing I'd do is convert it to UTF-8. Damned if I'm going to complicate the innards of my code with special case code paths for non-UTF-8.

If there were some inherent issue with UTF-8 that made it significantly worse than some other encoding for a given task, I'd be sympathetic to that explanation and wouldn't be such a pain in the neck about this. For instance, if it were the case that it did a bad job of encoding Mandarin or Urdu or Xhosa or Persian, and the people who use those languages strongly preferred to use something else, I'd understand. However, I've never heard a viable explanation for not using UTF-8 other than legacy software support, and if you want to continue to use something ancient and weird, it's on you to adapt it to the rest of the world because they're definitely not going to adapt the world to you.
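As a minimal sketch of that intake policy (Python; the function name and the normalize-to-UTC step are illustrative, not anyone's production code):

  from datetime import datetime, timezone

  def parse_timestamp(s: str) -> datetime:
      dt = datetime.fromisoformat(s)
      if dt.tzinfo is None:
          # No offset supplied: assume UTC, per the stated policy.
          dt = dt.replace(tzinfo=timezone.utc)
      return dt.astimezone(timezone.utc)  # work in UTC internally

  parse_timestamp("2024-04-29T14:03:06")        # naive -> treated as UTC
  parse_timestamp("2024-04-29T14:03:06-08:00")  # offset -> converted to UTC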

hnick on 2024-04-30

It depends on the domain. If you are writing calendar software, it is legitimate to have "floating time" i.e. your medication reminder is at 7pm every day, regardless of time zone, travel, or anything else.

Unfortunately Google and many other companies have decided UTC is the only way, so this causes issues with ICS files that use that format sometimes when they are generating their helpful popups in the GMail inbox.

kstrauser on 2024-04-30

Those aren’t timestamps. They’re descriptions of how to select them.

hnick on 2024-04-30

Oh yes true. Somehow I missed that important detail.

cesarb on 2024-04-30

> If you are writing calendar software, it is legitimate to have "floating time" i.e. your medication reminder is at 7pm every day, regardless of time zone, travel, or anything else.

If you have to take medication (for instance, an antibiotic) every 24 hours, it must be taken at the same UTC hour, even if you took a train to a town in another timezone. Keeping the same local time even when the timezone changes would be wrong for that use case.

hnick on 2024-04-30

If you're there for a while, you'll need to adapt anyway since your biorhythms will too. But there are plenty of other cases like a reminder to check something after dinner, or my standard wake-up alarm in the morning. Or if I plan to travel, book lunch at a nice place for 1pm, and put it in my calendar I just want it to be 1pm wherever I go, without caring about TZ changes.

Calendars, alarms, and reminders have some overlap here and floating time can be good for some cases.

kstrauser on 2024-04-30

There are very few drugs where that's a requirement. Your kidneys and liver aren't smart enough to metabolize anything at precisely the same rate every day anyway.

kccqzy on 2024-04-29

> For instance, if it were the case that it did a bad job of encoding Mandarin

I don't know if you picked this example on purpose, but using UTF-8 to encode Chinese is 50% larger than the old encoding (GB2312). I remember people cared about this like twenty years ago. I don't know of anyone that still cares about this encoding inefficiency. Any compression algorithm is able to remove such encoding inefficiency while using negligible CPU to decompress.
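The 50% figure is easy to verify (Python; GB2312 spends two bytes per Chinese character where UTF-8 spends three):

  >>> "中文".encode("gb2312")
  b'\xd6\xd0\xce\xc4'
  >>> "中文".encode("utf-8")
  b'\xe4\xb8\xad\xe6\x96\x87'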

kstrauser on 2024-04-29

That doesn't seem like the worst issue imaginable. I doubt there are too many cases where every byte counts, text uses a significant portion of the available space, and compression is unavailable or inefficient. If we were still cramming floppies full of text files destined for very slow computers, that'd be one thing. Web pages full of uncompressed text are still either so small that it's a moot point or so huge with JS, images, and fonts that the relative text size isn't that significant.

Which is all to say that you're right, but I can't imagine that it's more than a theoretical nuisance outside some extremely niche cases.

bawolff on 2024-04-29

> Web pages full of uncompressed text

Are basically non existent. Almost all modern web servers transparently compress html. Sending uncompressed text over the wire is extremely rare.

iraqmtpizza on 2024-04-29

They shouldn't be non-existent. Zip-then-encrypt is not secure due to information leakage.

EDIT: also, it's not safe: message length is dependent on the values of the plaintext bytes, period. I'm not saying don't live dangerously, I'm just saying live dangerously knowingly.

fanf2 on 2024-04-29

The information leakage problem occurs when compression is done in the TLS layer, because then the compression context includes both headers (with cookies) and bodies (containing potentially attacker-controlled data). But if you do compression at the HTTP layer using its Transfer-Encoding then the compression context only covers the body, which is safe.

rocqua on 2024-04-30

It can still leak data if attackers can get their input reflected. I.e. I send you a word, and then I get to observe a compressed and encrypted message including my word and sensitive data. If my word matches the sensitive data, the ciphertext will be smaller. Hence I can learn things about the plaintext. That is no longer good encryption.

bawolff on 2024-04-30

What you are talking about is generally referred to as the "BREACH" attack. While there may theoretically be scenarios where it is relevant, in practice it almost never is, so the industry has largely decided to ignore it (it's important to distinguish this from the CRIME attack, which is about HTTP headers instead of the response body and has a much higher likelihood of being exploitable while still being hard).

The reason it's usually safe is that to exploit it you need:

- a secret inside the html file

- the secret has to stay constant and cannot change (since it is an adaptive attack; CSRF tokens and similar things usually change on every request, so cannot be attacked)

- the attacker has to have a method to inject something into the html file and repeat it for different payloads

- the attacker has to be able to see how many bytes the response is (or some other side channel)

- the attacker is not one of the ends of the communication (no point to attack yourself)

Having all these requirements met is very unlikely.

kstrauser on 2024-04-29

Do you often send raw bitmaps for the same reason?

iraqmtpizza on 2024-04-30

Do you often get completely pwned and have your encrypted calls transcribed by people eating doughnuts because you thought it was safe to compress sensitive data before encrypting? https://web.archive.org/web/20080901185111/https://technolog...

gary_0 on 2024-04-29

For Asian languages, UTF-8 is basically the same size as any other encoding when compressed[0] (and you should be using compression if you care about space) so in practice there is no data size advantage to using non-standard encodings.

[0] https://utf8everywhere.org/#asian

neild on 2024-04-30

In addition, Chinese characters encode more information than English letters, so a text written in Chinese will generally consume fewer bytes than the same text in English even when using UTF-8.

(Consider: Horse is five letters, but 馬 is one character. Even at three bytes per character, Chinese wins.)
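The byte counts bear that out (a quick Python check):

  >>> len("horse".encode("utf-8")), len("馬".encode("utf-8"))
  (5, 3)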

Panzer04 on 2024-04-30

Presumably that derives from the overhead of encoding an English character as a full byte? Given there are only 26 characters normally, you could fit them into 5 bits instead, which funnily enough does actually line up with the Chinese character encoding (5 letters x 5 bits vs 1 character x 24 bits).

kps on 2024-04-30

Yes. It's the non-Latin alphabets that lose with either UTF-8 or UTF-16, compared with stateful ISO 2022 page switching.

kstrauser on 2024-04-30

True, but even then you wouldn’t want to store it egregiously badly.

PeterisP on 2024-04-29

A key aspect is that nowadays we rarely encode pure text - while other encodings are more efficient for encoding pure Mandarin, nowadays a "Mandarin document" may be an HTML or JSON or XML file where less than half of the characters are from CJK codespace, and the rest come from all the formatting overhead which is in the 7-bit ASCII range, and UTF-8 works great for such combined content.

tomoyoirl on 2024-04-29

To say nothing of mod_gzip

layer8 on 2024-04-30

> For instance, if it were the case that it did a bad job of encoding Mandarin

Please look up the issues caused by Han unification in Unicode. It’s an important reason why the Chinese and Japanese encodings are still used in their respective territories.

Shorel on 2024-05-01

That makes Unicode a failure then

LaffertyDev on 2024-04-30

I can't help myself. The grandest of nitpicks is coming your way. I'm sorry.

> If you give me a computer timestamp without a timezone, I can and will assume it's in UTC.

Do you mean, give you an _offset_? In `2024-04-29T14:03:06.0000-08:00`, the `-08:00` is an offset. It only tells you when this stamp occurred relative to UTC. It does not tell you anything about the region or zone itself. While I have consumed APIs that give me the timezone context as part of the response, none of them are part of the timestamp itself.

The only time you should assume a timestamp is UTC is if it has the `Z` at the end (assuming 8601) or is otherwise marked as UTC. Without that, you have absolutely no information about where or when the time occurred -- it is local time. And if your software assumes a local timestamp is UTC, then I argue it is not the sender of that timestamp's problem that your software is broken.

My desire to meet you at 4pm has no bearing on whether the DST switchover has happened, or my government decides to change the timezone rules, or {any other way the offset for a zone can change for future or past times}. My reminder to take my medicine at 7pm is not centered on UTC or my physical location on the planet. It's just at 7pm. Every day. If I go from New York to Paris then no, I do not want your software to tell me my medicine is actually supposed to be at midnight. It's 7pm.

But, assuming you aren't doing any future scheduling, calendar appointments, bookings, ticket sales, transportation departure, human-centric logs, or any of the other ways Local Time is incredibly useful -- ignore away.

kstrauser on 2024-04-30

As I mentioned in another reply, "remind me every day at 7PM" isn't a timestamp. It's a formula for how to determine when the next timestamp is going to occur. Even those examples are too narrow, because it's really closer to "remind me the next time you notice that it's after 7PM wherever I happen to be, including if that's when I cross a time zone and jump instantly from 6:30PM to 7:30PM".

Consider my statement more in the context of logs of past events. The only time you can reasonably assume a given file is in a particular non-UTC TZ is when it came from a person sitting in your same city, from data they collected manually, and you're confident that person isn't a time geek who uses UTC for everything. Otherwise there's no other sane default when lacking TZ/offset data. (I know they're not the same, but they're similar in the sense that they can let you convert timestamps from one TZ to another).

LaffertyDev on 2024-04-30

> As I mentioned in another reply, "remind me every day at 7PM" isn't a timestamp. It's a formula for how to determine when the next timestamp is going to occur. Even those examples are too narrow, because it's really closer to "remind me the next time you notice that it's after 7PM wherever I happen to be, including if that's when I cross a time zone and jump instantly from 6:30PM to 7:30PM".

That's certainly fair in the context of a recurring event with some formula. I caution that a lot of people will still immediately reach for timestamps to calculate that formula, particularly for a next occurrence, and in the context of this conversation they would be given as an ISO 8601 datetime based on Local Time. I would also caution that calendar events that have a distinct moment in time when they start are also prime candidates for Local Time, where a UTC-default mentality will cause errors.

> Consider my statement more in the context of logs of past events

From the stance of computer generated historical log data, I definitely agree that UTC everywhere is a sane default and safe to assume :)

(And, in your defense, I would definitely argue UTC-everywhere gets you 95% of the way there for 5% of the effort... I get why people make the tradeoff)

> (I know they're not the same, but they're similar in the sense that they can let you convert timestamps from one TZ to another).

More nitpicking on my part, again, I'm sorry, it lets you convert from one _offset_ to another or from an offset to UTC. Think Arizona being a special snowflake who (mostly!) doesn't observe DST. You can't assume all UTC-7 offsets are all Mountain Time.

Spivak on 2024-04-30

It's always nice to see someone who actually understands time.

"Convert to UTC and then throw away the time zone" only works when you need to record a specific moment in time so it's crazy how often it's recommended as the universal solution. It really isn't that hard to store (datetime, zone) and now you're not throwing away information if you ever need to do date math.

globular-toast on 2024-04-30

Yeah, I've been trying to convince people forever to store time zones with timestamps when appropriate. If you record events from around the world and don't record what time zone they happened in you can't even answer basic questions like "what proportion happened before lunch time?"

People love simple rules and they will absolutely take things too far. Most developers learn "just use UTC!" and think that's the last thing they ever need to learn about time.

logrot on 2024-04-30

Fundamentally I agree, but sadly the world isn't that simple.

You usually end up with having to deal with whatever eccentric sh!t that ultimately comes from the same source as the payment for the job.

mikhailfranco on 2024-04-30

Developers should assume UTF-8 for text files going forward.

UTF-8 should have no BOM. It is the default. And there are no undefined Byte sequences that need an Order. Requiring a UTF-8 BOM just destroys the happy planned property that ASCII-is-UTF8. Why spoil that good work?

Other variants of Unicode have BOMs, e.g. UTF-16BE. We know CJK languages need UTF-16 for compression. The BOM is only a couple more bytes. No problem, so far so good.

But there are old files, that are in 'platform encoding'. Fine, let there be an OS 'locale', that has a default encoding. That default can be overridden with another OS 'encoding' variable. And that can be overridden by an application arg. And there may be a final default for a specific application, that is only ever used with one language encoding. Then individual files can override all of the above ...

Text file (serialization) formats should define handling of an optional BOM followed by an ASCII header that defines the encoding of the body that follows. One can also imagine three-letter filetypes that have a unique or default encoding, with values maintained by, say, IANA (like MIME types). XML got this right. JSON and CSV are woefully neglectful, almost to the point of criminal liability.

But in the absence of all of the above, the default-default-default-default-default is UTF-8.

We are talking about the future, not the past. Design for UTF-8 default, and BOMs for other Unicodes. Microsoft should have defined BOMs for Windows-Blah encodings, not for UTF-8!

When the whole world looks to the future, Microsoft will follow. Eventually. Reluctantly.

jrochkind1 on 2024-04-30

The specific use case the OP author was focusing on was CSV. (A format which has no place to signal the encoding inline). They noted that, to this day, Windows Excel will output CSV in Win-1252. (And the user doing the CSV export has probably never even heard of encodings).

If you assume UTF-8, you will have corrupted text.

I agree that I'm mad about Excel outputting Win-1252 CSV by default.

out_of_protocol on 2024-04-30

The more pressure on M$, the faster the change. Also, almost all software nowadays assumes UTF-8 for plaintext anyway.

jrochkind1 on 2024-04-30

You are suggesting that if software developers willfully refuse to implement measures to detect CP-1252 in CSVs, instead insisting on assuming UTF-8 even though they know that will result in lots of corruption with data from the most popular producer of CSVs -- you are suggesting that will be pressure on MS to make it output in UTF-8 by default instead?

If the world worked that way, it would be a very different world than the one we've got.

out_of_protocol on 2024-05-05

If you think you can detect only CP-1252, I have news for you. It's a 1-byte encoding; it can't fit much. So be prepared for a whole zoo of other 1-byte encodings from that era (e.g. Cyrillic letters: welcome to CP-1251, where everything above the first 128 characters has a totally different meaning). Writing an encoding detector is not easy at all, and the chance of guessing wrong is high. I'm glad most of the world (but not Excel) has moved away from that can of worms.

> You are suggesting that if software developers willfully refuse to implement

It will happen anyway, sooner or later.

> will result in lots of corruption

There is lots of corruption anyway (e.g. the euro sign, some European letters, whole other alphabets like the Cyrillic one). The most dangerous are subtle ones like the German ẞ.

P.S. I don't have Office installed to check, but the online version exports non-CP-1252 chars as "?". So nice.

P.P.S. Apple Numbers offers a choice of encoding on export, with UTF-8 the default. Google Sheets exports as UTF-8 with no choice.

jrochkind1 on 2024-05-12

OP is in fact about doing your best to detect encoding without proper tagging. It is not perfect and you won't have 100% accuracy. But it can get pretty decent (OP is literally about the techniques used to do so, statistically etc), and is necessary because of the actual world we live in. If just refusing to do it would get MS to export in UTF-8 by default that would of course have happened a long time ago!

brnt on 2024-05-01

And that's far from the only thing Excel does quietly. It also changes commas to semicolons based on system locale. It's braindead.

teknopaul on 2024-04-30

IMHO, presume 8-bit, encapsulating 7-bit US-ASCII. That includes UTF-8 and many, many other encodings.

Don't interpret user-supplied strings at all. Define max lengths as byte lengths.

Remain agnostic of encoding, especially in libraries.

It's easier than people think it is, thanks to some very clever people's work a long time ago.

PaulHoule on 2024-04-30

Programming languages have lumbered slowly towards UTF-8 by default but from time to time you find an environment with a corrupted character encoding.

I worked at an AI company that was ahead of its time (actually I worked at a few of those) where the data scientists had a special talent for finding Docker images with strange configurations, so all the time I'd find out that one particular container was running with a Hungarian or some other wrong charset.

(And that's the problem with Docker... People give up on being in control of their environment and just throw in five different kitchen sinks and it works... Sometimes)

calpaterson on 2024-04-30

If csv files bring criminal liability then I am guilty.

Sidenote: this particular criminal conspiracy is open to potential new members. Please join the Committee To Keep Csv Evil: https://discord.gg/uqu4BkNP5G

Jokes aside, talking about the future is grand but the problem is that data was written in the past and we need to read it in the present. That means that you do have to detect encoding and you can't just decide that the world runs on UTF-8. Even today, mainland China does not use UTF-8 and is not, as far as I know, in the process of switching.

I understand UTF-8 is mostly fine even for East Asian languages though - and bytes are cheap

mikhailfranco on 2024-04-30

Old CSV files bring no stigma.

There is no CSV specification. That does bring opprobrium. RFC 4180 is from 2005, long after Unicode and XML, so people should have known better. The absence of a standard points to disagreement, perhaps commercial disagreement, but the IETF should be independent, should it not?

That failure to standardize encoding, and other issues (headers, etc.), has wasted an enormous amount of time for many creative hackers, who could have been producing value, rather than banging their head against a vague assembly of pseudo-spec-illusions. Me included, sadly.

eduction on 2024-04-30

You're blaming the lack of a spec for all the wasted time but that's not the cause.

The cause is that CSV is popular and is popular because it is incompletely defined. (See also: HTML, RSS.)

Making a CSV spec post-hoc solves nothing, as others here have pointed out, because there is already an "installed base" so to speak. If anything it's worse because it might mislead some people into thinking they can write to the spec and handle any CSV file.

The right move, if you want a nicely precise and strict spec, is to admit it's a new thing and give it a new name, maybe CSVS or something like that.

But good luck making it popular - there are plenty of CSV libraries out there that handle most CSV files well enough, just as there is tons of software that handles HTML and RSS well enough (which is why I am a fan of both those formats).

calpaterson on 2024-04-30

> The cause is that CSV is popular and is popular because it is incompletely defined. (See also: HTML, RSS.)

This is a surprisingly deep thought. I'm going to need a couple of days to reflect on this...

bee_rider on 2024-04-30

The only argument I'll present for giving over CSV to those who want to flail at the idea of standardizing it: the name implies too much standardization already. Why "C"SV? Most CSV processing tools accept a delimiter, right? They are just whatever-separated values; use semicolons or tabs if you want.

mikhailfranco on 2024-04-30

I explained how old files could be read.

Let me reframe it as a Schelling Point [1] - the uncoordinated coordination problem.

You arrange to meet your file on a certain day in New York; no place or time was mentioned. When and where will you go? It seems impossible.

But perhaps you go at noon, to the UTF-8 Building in midtown Manhattan. Are you there now?

[1] https://en.wikipedia.org/wiki/Focal_point_(game_theory)

calpaterson on 2024-04-30

I'm sorry, quite right - you did. Using the computer locale.

The trouble is that people transmit data from one computer to another and so from one locale to another. And sadly, they do not always set the character encoding header correctly, if they even know what that is.

I mean take csvbase's case. It has to accept csv files from anyone. And christ preserve me, they aren't going to label them as "Win-1252" or "UTF-16" or whatever.

There is no alternative but statistical detection. And there is good evidence that this solution is fairly satisfactory, because millions of computers are using it right now! csvbase uploads run into more problems with trailing commas than with character detection getting it wrong at this point, that is your "Schelling Point" I'm afraid!
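For a sense of how little code the statistical step is, here is a sketch using the Python chardet package (csvbase's actual pipeline may differ):

  import chardet

  raw = open("upload.csv", "rb").read()
  guess = chardet.detect(raw)  # e.g. {'encoding': 'Windows-1252', 'confidence': 0.73, ...}
  text = raw.decode(guess["encoding"] or "utf-8", errors="replace")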

mikhailfranco on 2024-04-30

HTTP actually does quite a good job of providing headers containing MIME type and encoding. There is a little work to get the default (e.g. HTML and XML are different), and decide on the case where the XML payload encoding is different to the HTTP transport encoding (e.g. perhaps XML parsers need a way to override the embedded header).

So we end up at another plausible future-directed design decision: computer-computer communication should use HTTP. I think many systems have ended up there already, perhaps prompted by the issues we have discussed.

Moral: good specs attract usage; bad, incomprehensible or inconsistently implemented specs fade away.

mikhailfranco on 2024-04-30

So trailing commas are another failure. That issue is mildly entangled with headers. Perhaps start making a list...

e63f67dd-065b on 2024-04-30

> China does not use UTF-8 and is not, as far as I know, in the process of switching.

That’s … not true? Most Chinese software and websites are utf8 by default, and it’s been that way for a while. GBK and her sisters might linger around in legacy formats, but UTF-8 has certainly reached the point of mass adoption from what I can tell.

WorldMaker on 2024-04-30

> We know CJK languages need UTF-16 for compression.

My understanding is that it is for the opposite of compression: it saves memory when uncompressed, versus multi-byte UTF-8 sequences needing more bytes. My understanding is that multi-byte UTF-8 sequences compress pretty well, as they have common patterns that form dictionary "words" just as easily as anything else. UTF-8 seems to be winning in the long run even for CJK and astral-plane languages on disk, and the operating systems and applications that were preferring UTF-16 in memory are mostly only doing so out of backwards compatibility, and are themselves often using more UTF-8 buffers internally as those reflect the files at rest.

(.NET has a backwards compatibility based on using UTF-16 codepoint strings by default but has more and more UTF-8 only pathways and has some interesting compile time options now to use UTF-8 only today. Python 3 made the choice that UTF-8 was the only string format to support, even with input from CJK communities. UTF-8 really does seem to be slowly winning everything.)

> JSON and CSV are woefully neglectful,

As the article also points out, JSON probably got it right: UTF-8 only and BOM is an error (because UTF-8) (but parsers are allowed to gently ignore that error if they wish). https://www.rfc-editor.org/rfc/rfc8259#section-8.1

That seems to be the way forward for new text-based formats that only care about backward compatibility with low-byte ASCII: UTF-8 only, no BOM. UTF-8 (unlike UTF-16, which is missing reservations for some of the surrogates it would need) is infinitely expandable if we ever do find a reason to extend past the "astral plane".

(Anyone still working in CSV by choice is maybe guilty of criminal liability though. I still think the best thing Excel could do to help murder CSV is give us a file extension to force Excel to open a JSON file, like .XLJSON. Every time I've resorted to CSV has been because "the user needs to double click the file and open in Excel". Excel has great JSON support, it just won't let you double click a file for it, which is the only problem, because no business executive wants the training on "Data > From JSON" no matter how prominent in the ribbon that tool is.)

> When the whole world looks to the future, Microsoft will follow.

That ship is turning slowly. Windows backward compatibility guarantees likely mean that Windows will always have some UTF-16, but the terminals in Windows now correctly default to UTF-8 (since Windows 10) and even .NET with its compatibility decrees is more "UTF-8 native" than ever (especially when compiling for running on Linux, which is several layers of surprise for anyone that was around in the era where Microsoft picked UCS-2 as its one true format in the first place).

PaulHoule on 2024-04-30

You can fit Japanese comfortably in a 16 bit charset but Chinese needs more than that.

My take though is that CSV is not a good thing because the format isn't completely standardized: you just can't count on people having done the right thing with escaping, whether a particular column is intended to be handled as strings or numeric values, etc.

Where I work we publish data files in various formats, I'm planning on making a Jupyter notebook to show people how to process our data with Pandas -- one critical point is that I'm going to use one of the commercial statistics data formats (like Stata) because I can load the data right the first time and not look back. (e.g. CSV = good because it is "open" is wrong)

If I am exporting files for Excel users I export an Excel file. Good Excel output libraries have been around for at least 20 years and even if you don't have fun with formulas, formatting and all that it is easy to make a file that people will load right the first time and every time.

WorldMaker on 2024-04-30

> Good Excel output libraries have been around for at least 20 years

I wish that were the case more often. Depends on your ecosystem, of course.

For instance, I've yet to find a good XLSX library for JS that works well "isomorphically" (in the browser as well as Node/etc). Every one I tried either had native dependencies and couldn't run in-browser or had a cost (time, money, size) I couldn't budget for at the time.

I have found some XLS libraries for JS that were extremely "mid", but outputting XLS is nearly as bad as CSV in 2024. (Including all the huge messy legacy of character set Encoding problems.)

The best and worst thing about CSV is that it seems "low overhead": it seems really cheap to output. ("How hard can it be, just ','.join(records)?" has been the pit so many of us fall into over and over again and sometimes never come out.) In terms of low overhead: in a world where all my APIs are already talking JSON if I can wrap an existing HTTP API with just two extra headers to get "free" Excel files for my users, that could be a beautiful world:

  Content-Type: application/json+excel
  Content-Disposition: attachment; filename="BusinessUsersDumbReportRequest.xljson"
All the pieces are already there. If you could teach every user to use "Data > From JSON" we could maybe have nice things today instead of yet another CSV export dump. We just need someone on the Excel team to greenlight a "double to click to open an .XLJSON file" feature.

PaulHoule on 2024-04-30

Why not just add a file association for plain .json files?

WorldMaker on 2024-04-30

First and foremost to avoid accidents: We don't want a misconfigured HTTP website accidentally opening a new Excel window for every fetch/XHR call or to have to fight Excel defaults to get JSON to open up in our IDE or Dev Tools of choice. We don't want random shell scripts accidentally curl-ing things to Excel. Things like that.

Secondly for "ownership" reasons: We don't want to give non-developers the mistaken impression that Excel "owns" JSON and that it is a Microsoft format. I've had people tell me that CSV must be a Microsoft format because you can double click them in Excel and they show an Excel-like icon (in some ways it has been too long since Lotus existed and Excel was in the "we'll take all of our competitors' file associations too" era). On the one hand it might be nice to blame all of CSV's problems on Microsoft and Excel if that were actually the case, but on the other hand it also confuses people as to the real/valid uses of the format. Unfortunately, too, that transitive relationship goes both ways and I've heard second hand that CSV files are among the reasons the Excel team hopes to never add another file type association again because supposedly they get far too many support requests for CSV file problems that maybe shouldn't be their job to deal with.

A separate file association adds some intent of "this file was actually meant to be opened in Excel and hopefully the developer actually tested it some".

Pet_Ant on 2024-04-30

> Requiring a UTF-8 BOM just destroys the happy planned property that ASCII-is-UTF8. Why spoil that good work?

So that you know you are dealing with UTF-8. Assuming ASCII only works if you are only dealing with English texts and data.

themerone on 2024-04-30

The BOM only tells you that you are not dealing with ASCII. You could encounter anything after it.

remram on 2024-04-30

Who is assuming ASCII?

bobmcnamara on 2024-04-30

Much code incorrectly assumes that ASCII can be rendered in any code page.

kazinator on 2024-04-30

Indeed, you can't assume UTF-8.

What you do, rather, is drop support for non-UTF-8.

Work with tech-stacks whose text handling is based strictly around Unicode and UTF-8, and find enough opportunities that way that you don't have to care about anything else.

Let the customers who cling to data in weird encodings go to someone who makes a niche out of supporting that.

djha-skin on 2024-04-30

Joel Spolsky spoke against this exact statistics-based approach when he wrote about Unicode[1]:

> What do web browsers do if they don’t find any Content-Type, either in the http headers or the meta tag? Internet Explorer actually does something quite interesting: it tries to guess, based on the frequency in which various bytes appear in typical text in typical encodings of various languages, what language and encoding was used. Because the various old 8 bit code pages tended to put their national letters in different ranges between 128 and 255, and because every human language has a different characteristic histogram of letter usage, this actually has a chance of working. It’s truly weird, but it does seem to work often enough that naïve web-page writers who never knew they needed a Content-Type header look at their page in a web browser and it looks ok, until one day, they write something that doesn’t exactly conform to the letter-frequency-distribution of their native language, and Internet Explorer decides it’s Korean and displays it thusly, proving, I think, the point that Postel’s Law about being “conservative in what you emit and liberal in what you accept” is quite frankly not a good engineering principle.

1: https://www.joelonsoftware.com/2003/10/08/the-absolute-minim...

kelnos on 2024-04-30

I don't think he was speaking against the statistics-based approach itself, just against Postel's Law in general.

Ideally people would see gibberish (or an error message) immediately if they don't provide an encoding; then they'll know something is wrong, figure it out, fix it, and never have the issue again.

But if we're in a situation where we already have lots and lots of text documents that don't have an encoding specified, and we believe it's not feasible to require everyone to fix that, then it's actually pretty amazing that we can often correctly guess the encoding.

vbezhenar on 2024-04-30

There's the enca library (and CLI tool) which does that. I used it often before UTF-8 became overwhelming. The situation was especially dire with Russian encodings. There were three 1-byte encodings which were quite widespread: KOI8-R, mostly found in unixes; CP866, used in DOS; and CP1251, used in Windows. What's worse, with Windows you sometimes had to deal with both CP866 and CP1251, because it includes a DOS subsystem with a separate codepage.

bastawhiz on 2024-04-30

Exactly. I used this technique at Mozilla in 2010 when processing Firefox add-ons, and it misidentified scripts as having the wrong encoding pretty frequently. There's far less weird encoding out there than there are false positives from statistics-based approaches.

mark-r on 2024-04-30

20 years old, but still true.

bhaney on 2024-04-29

I'm just gonna assume UTF-8

duskwuff on 2024-04-29

I'm disappointed that the article doesn't discuss this in more detail. Most byte sequences are not valid UTF-8. If you can decode a message as UTF-8 with no errors, that is almost certainly the correct encoding to use; it's extremely unlikely that some text in another encoding just happened to be perfectly valid as UTF-8. (The converse is not true; most 8-bit text encodings will happily decode UTF-8 sequences to nonsense strings like 🚩.)

If UTF-8 decoding fails, then it's time to pull out the fancy statistical tools to (unreliably) guess an encoding. But that should be a fallback, not the first thing you try.

Izkata on 2024-04-30

> If UTF-8 decoding fails, then it's time to pull out the fancy statistical tools to (unreliably) guess an encoding.

Don't really even need to do that. There are only a handful of other encodings still in common use; just try each of them as fallbacks and see which one works without errors, and you'll manage the vast majority of what's not UTF-8.

(We recently did just that for a system that handles unreliable input, I think I remember our fallback only has 3 additional encodings before it gives up and it's been working fine)
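A minimal sketch of that strict-UTF-8-first fallback chain (Python; the particular fallback list here is illustrative -- pick the encodings your own data actually contains):

  def decode_best_effort(raw: bytes) -> str:
      # Strict UTF-8 first: false positives are vanishingly rare.
      for enc in ("utf-8", "windows-1252", "shift_jis", "gb18030"):
          try:
              return raw.decode(enc)
          except UnicodeDecodeError:
              continue
      return raw.decode("latin-1")  # never fails, but may be wrong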

kelnos on 2024-04-30

The person you're replying to sort of addresses this, though not completely.

Since UTF-8 is a variable-length encoding, it somewhat naturally has some error detection built in. Fixed-length encodings don't really have that, and for some of them, any byte value, 0 to 255, in any position, is valid. (Some have a few byte values that are invalid or reserved, but the point still stands.)

So you could very easily pick a "next most common encoding" after UTF-8 fails, try it, find that it works (that is, no bytes are invalid in that encoding), but it turns out that's still not actually the correct encoding. The statistics-based approach will nearly always yield better results. Even a statistics-based approach that restricts you to a few possible encodings that you know are most likely will do better.

mark-r on 2024-04-30

Unfortunately Windows code page 1252 has no invalid bytes, so it will always succeed. You'd better try that one last.

Dwedit on 2024-04-30

81, 8D, 8F, 90, 9D are invalid.

layer8 on 2024-04-30

These are actually interpreted as the corresponding C1 control codes by Windows, so arguably not invalid in practice, just formally reserved to be reassigned to other characters in the future.

actionfromafar on 2024-04-29

Not extremely unlikely. Many charsets decode fine as UTF-8 as long as the message happens to fit in ASCII.

duskwuff on 2024-04-29

> Many charsets decode fine as UTF-8 as long as the message happens to fit in ASCII.

At which point the message is effectively ASCII. UTF-8 is a superset of ASCII, so "decoding" ASCII as UTF-8 is fine.

(Yes, I know there are some Japanese text encodings where 0x5c is decoded as "¥" instead of "\". But they're sometimes treated as backslashes even though they look like ¥ symbols, so handling them "correctly" is complicated.)

eqvinox on 2024-04-30

"Fun" fact: some video subtitle formats (ASS specifically) use "text{\b1}bold" to format things — but since they were primarily used to subtitle Japanese anime, this frequently became "text{¥b1}bold". Which is all good and well, except when those subtitles moved to UTF-8 they kept the ¥. So now you have to support ¥ (0xC2 0xA5) as a markup/control character in those subtitles.

Izkata on 2024-04-30

I'm guessing they're thinking of Extended ASCII (the 8-bit one that's actually multiple different encodings, but the lower half is shared with 7-bit ASCII and so that part does fit in UTF-8 while the upper half likely won't if the message actually uses it).

actionfromafar on 2024-05-01

Yes. And some test messages will decode fine, and then suddenly one won't.

jwilk on 2024-04-30

ISO-2022-JP (sometimes?) disguises perfectly as ASCII:

  $ echo は | iconv -t ISO-2022-JP | hd
  00000000  1b 24 42 24 4f 1b 28 42  0a                       |.$B$O.(B.|
  00000009

jcranmer on 2024-04-29

The definition GP is using most likely refers to non-ASCII sequences that validly decode as UTF-8, because virtually every major charset in practical use has ASCII as a subset.

layer8 on 2024-04-30

ASCII (or ISO646) has local variants [0] that replace certain characters, like $ with £, or \ with ¥. The latter is still in use in Japan. That’s why "ASCII" is sometimes clarified as "US-ASCII".

[0] https://en.wikipedia.org/wiki/ASCII#7-bit_codes

kelnos on 2024-04-30

I'm skeptical. Any charset that uses bytes 128-255 as characters is unlikely to successfully decode as UTF-8. Are there really many charsets that only use 0-127, or where most text ends up only using 0-127?

PaulDavisThe1st on 2024-04-29

such encodings are also UTF-8 then, are they not?

tsimionescu on 2024-04-30

I think there are a bunch of encodings that just repurposed a few ASCII characters as different characters - someone on this page was giving the example of some Swedish encoding where {}| were replaced with three accented Swedish letters. There are probably a bunch of others. In those cases, the text will decode fine as UTF-8, but it will display the wrong thing.
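That trap is reproducible (a Python sketch; the Swedish ISO 646 variant maps [ \ ] { | } to Ä Ö Å ä ö å, and Python ships no codec for it, so the translation table below is hand-rolled):

  raw = b"sm|rg}s"  # "smörgås" as typed under the Swedish ASCII variant
  print(raw.decode("utf-8"))  # 'sm|rg}s' -- decodes without error, displays wrong
  print(raw.decode("ascii").translate(str.maketrans("[\\]{|}", "ÄÖÅäöå")))  # 'smörgås'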

srj on 2024-04-29

A distinction without a difference?

actionfromafar on 2024-05-02

I meant, some messages will decode fine as UTF-8, but in some other messages there may be letters which don't fit in 7 bits. So some simple testing, especially with English words, will show it to work fine. But as soon as a non-7-bit character creeps in, it will stop working fine.

calpaterson on 2024-04-29

榥\ue0af侬펭懃䒥亷

bhaney on 2024-04-29

And a good day to you too, my friend whose input I'm going to discard

eqvinox on 2024-04-30

It's garbage anyway, which you can (non-reliably) guess by there being a Korean character in the middle of C/J kanji. (Kanji are not completely gone from Korean, but mostly.)

calpaterson on 2024-04-30

It's mojibake for "probably a bad idea" [in Chinese]

klysm on 2024-04-30

You can’t just do that! /s

hnick on 2024-04-30

Based on my past role, you can't even assume UTF-8 when the file says it's UTF-8.

Clients would constantly send CSV or other files with an explicit BOM or other marking indicating UTF-8 but the parser would choke since they just output native Windows-1252 or similar into it. I think some programs just spit it out since it's standard.

groestl on 2024-04-29

I will assume it, I will enforce it where I can, and I will fight tooth and nail should push come to shove.

I got 99 problems, but charsets ain't one of them.

zadokshi on 2024-04-29

Better to assume UTF-8 and fail with a clear message/warning. Sure, you can offer to guess to help the end user if it fails, but as other people have pointed out, it's been standard for a long time now. Even Python caved and accepted it as the default: https://peps.python.org/pep-0686/

Veserv on 2024-04-29

Off-topic, but the bit numbering convention is deliciously confusing.

Little-endian bytes (lowest byte is leftmost) and big-endian bits (bits contributing less numerical value are rightmost) are normal, but the bits are referenced/numbered little-endian (first bit is leftmost even though it contributes the most numerical value). When I first read the numbering convention I thought it was going to be a breath of fresh air of someone using the much more sane, but non-standard, little-endian bits with little-endian bytes, but it was actually another layered twist. Hopefully someday English can write numbers little-endian, which is objectively superior, and do away with this whole mess.

kstrauser on 2024-04-29

> Hopefully someday English can write numbers little-endian, which is objectively superior

Upon reading this, I threw my laptop out the window.

CoastalCoder on 2024-04-30

> Upon reading this, I threw my laptop out the window.

And having shattered the dirty window, your room was enlightened.

Veserv on 2024-04-29

It actually is if we did not need to consider historical baggage.

Especially in programming where we already use in-band encoding like 0x to denote a hex string or 0b to denote a binary string. I like using 1{s}, e.g. 1x to denote a little-endian hex string and 1b to denote a little-endian binary string, to denote little-endian encoding.

But, even ignoring programming, it is still better in normal use. The Arabic language got it right writing little-endian (Arabic numbers are written the same, but Arabic is a right-to-left language, so it is actually little-endian), and the European languages just stole it stupidly, copying the form instead of the function.

filmor on 2024-04-29

Arabic numerals are derived from Indian scripts, which are written left to right, yet the order of digits is already the same as we have today.

pezezin on 2024-04-30

Not to mention that in most (all?) spoken languages numbers are big-endian.

Veserv on 2024-04-30

From what I have seen, low numbers in Arabic are spoken/written little-endian (twenty five is five and twenty). Apparently German as well. The internet claims that historically large numbers in Arabic were also written out (as in when using number words rather than numerals) little-endian.

umanwizard on 2024-04-30

Numbers below 100 in German are little-endian. 85 is "fünfundachtzig" (five-and-eighty).

kibwen on 2024-04-30

"Although generally found in text written with the Arabic abjad ("alphabet"), numbers written with these numerals also place the most-significant digit to the left, so they read from left to right (though digits are not always said in order from most to least significant[10]). The requisite changes in reading direction are found in text that mixes left-to-right writing systems with right-to-left systems."

https://en.wikipedia.org/wiki/Hindu%E2%80%93Arabic_numeral_s...

o11c on 2024-04-29

Default UTF-8 is better than the linked suggestion of using a heuristic, but failing catastrophically when old data is encountered is unacceptable. There must be a fallback.

(Note that the heuristic for "is this intended to be UTF-8" is pretty reliable, but most other encoding-detection heuristics are very bad quality)

lifthrasiir on 2024-04-29

You can't just assume UTF-8, but you can verify that it is almost surely encoded in UTF-8 unlike other legacy encodings. Which makes UTF-8 the first and foremost consideration.

norir on 2024-04-29

If it's turtles all the way down and at every level you use utf-8, it's hard to see how any input with a different encoding (for the same underlying text) will not be detected before any unintended side effects are invoked.

At this point, I don't see any sufficiently good reason to not use utf-8 exclusively in any new system. Conversions to and from other encodings would only be done at well defined boundaries when I'm calling into dependencies that require non utf-8 input for whatever reason.

bandyaboot on 2024-04-29

> In the most popular character encoding, UTF-8, character number 65 ("A") is written:

> 01000001

> Only the second and final bits are 1, or "on".

Isn’t it more accurate to say that the first and penultimate bits are 1, or “on”?

fl7305 on 2024-04-29

It depends on whether your bit numbering is like x86 (your description) or PowerPC (leftmost bit is 0).

mark-r on 2024-04-30

Thanks for giving me an example of an architecture where the bits are labelled backwards, I'd never encountered that before. I've always appreciated that the bit number represents 2 to the power of that number.

duskwuff on 2024-04-29

Basically everyone uses x86 bit numbering. It has the pleasant property that the place value of every bit is always 2^n (or -2^n for a sign bit), and zero-extending a value doesn't change the numbering of its bits.
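That property is easy to state in code (Python; LSB-0 numbering, where bit n has place value 2**n):

  x = 0b01000001                           # 'A', 65
  bits = [(x >> n) & 1 for n in range(8)]  # bit 0 first
  assert bits == [1, 0, 0, 0, 0, 0, 1, 0]  # bits 0 and 6 set: 2**0 + 2**6 == 65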

fl7305 on 2024-04-29

Sure, it is by far the industry standard.

It works much better for handling discrete integers.

Once you get into bitfield instructions, it is nice to have bit 0 be the leftmost bit.

mark-r on 2024-04-30

It's a mistake to call it x86 bit numbering, it's way older than that.

bandyaboot on 2024-04-29

The more I thought it through, even assuming x86, I guess there’s just no “correct” way to casually reference bit positions when we read them in the opposite order from the machine. Are they being referenced from the perspective of a human consumer of text, or the machine’s perspective as a consumer of bits? If I were writing that content, I’d have a difficult time deciding on which to use. If I were writing for a lay person, referencing left-to-right seems obvious, but in this case where the audience is primarily developers, it becomes much less obvious.

skerit on 2024-04-30

This confused me too. Until reading this I didn't even think much about how I read binary numbers right-to-left by default.

vitaut on 2024-04-29

This is so spectacularly outdated. KOI-8 has been dead for ages.

vkaku on 2024-04-30

The probability of web content not being in UTF-8 keeps getting lower and lower.

Last I tracked, as of this month, 0.3% of surveyed web pages used Shift JIS. It has been declining steadily. I really hope people move to UTF-8. While it is important to understand how the code pages and encodings helped, I think it's a good time to actually start moving a lot of applications to use UTF-8. I am perfectly okay if people want to use UTF-16 (the OG Unicode) and its extensions alternatively, especially for Asian applications.

Yes, historic data preservation requires a different strategy than designing stuff for the future. It is okay, however, to migrate to these encodings and keep giving old data and software new life.

layer8 on 2024-04-30

Just to be pedantic, the OG Unicode is UCS-2, not UTF-16, the main difference being that surrogate characters didn’t exist originally.

vkaku on 2024-05-04

Good point. Ah, we peel onions every few years.

mihaaly on 2024-04-30

Excellent article, good content, good length, enlightened subtexts and references, joy to read.

lolc on 2024-04-30

Just the most recent episode: A statistician is using PHP, on Windows, to analyze text for character frequency. He's rather confused by the UTF-16LE encoding and thinks the character "A" is numbered 4100 because that's what is shown in a hex-editor. I tried explaining about the little-endian part, and mb-string functions in PHP. And that PHP is not a good fit for his projects.
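
A minimal Python illustration of what his hex editor was showing (U+0041 stored low byte first, so the bytes read as "4100"):

    print("A".encode("utf-16-le").hex())  # prints "4100": 0x41 then 0x00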

Then I realized that this is hilarious and I won't be able to kick him from his local minimum there. Everything he could learn about encodings would first complicate his work.

flohofwoe on 2024-04-30

The post seems to assume that only UTF-16 has Byte Order Marks, but as pointless as it sounds, UTF-8 has a BOM too (EF BB BF). It seems to be a Windows thing though, haven't seen it in the wild anywhere else (and also rarely on Windows - since text editors typically allow saving UTF-8 files with or without BOM. I guess it depends on the text editor which of those is the default).
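
For the curious, the UTF-8 "BOM" is just U+FEFF run through the ordinary UTF-8 encoder; a one-line Python check:

    print("\ufeff".encode("utf-8").hex())  # prints "efbbbf"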

LinAGKar on 2024-04-30

That's not really a byte order mark though, it's just the UTF-8 encoding of U+FEFF, which corresponds to the byte order mark in UTF-16. Honestly, emitting that into UTF-8 was probably the result of a bug originally, caused by Windows Unicode APIs being designed for UTF-16.

flohofwoe on 2024-04-30

Wikipedia begs to differ though (check under "UTF-8": https://en.wikipedia.org/wiki/Byte_order_mark)

Also, at least the Rider IDE and VSCode let you choose whether to save UTF-8 with or without a BOM.

My guess is that it's used mainly to differentiate between UTF-8 and codepage encoded 8-bit ASCII files (which are still very common on Windows).

calpaterson on 2024-04-30

Yes you're right, UTF-8 technically does as well. I've never seen them in real life either.

UTF-16 BOMs do have a useful function as I recall: they really help Excel detect your character encoding (Excel is awful at detecting character encoding).

rob74 on 2024-04-30

30 years ago: "you can't just assume ASCII"

Today: "you can't just assume UTF-8"

The more things change, the more they stay the same...

drdaeman on 2024-04-29

Anyone got EBCDIC on their bingo cards? Because if the argument is "legacy encodings are still relevant in 2024" then we also need to bring EBCDIC (and EBCDIK and UTF-EBCDIC for more perverted fun) into the picture. Makes heuristics extra fun.

Or, you know, just say "nah, I can, that ancient stuff doesn't matter (outside of obligatory exceptions, like software archeology) anymore." If someone wants to feed me a KOI8-R or JIS X 0201 CSV heirloom, they should convert it into something modern first.

skissane on 2024-04-29

> Anyone got EBCDIC on their bingo cards? Because if the argument is "legacy encodings are still relevant in 2024"

I have a hobby interest in IBM mainframes and IBM i, so yes to EBCDIC for me. (I have encountered them professionally too, but only to a very limited extent.) In practice, I find looking for 0x40 (EBCDIC space) a useful heuristic. Even in binary files, since many mainframe data structures are fixed length space padded.
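
A minimal Python sketch of that heuristic (the function name and the 10% threshold are my own arbitrary choices, not anything standard):

    def looks_like_ebcdic(data: bytes) -> bool:
        # EBCDIC text and fixed-length records are space-padded with 0x40,
        # so a high proportion of 0x40 bytes is a decent hint
        return len(data) > 0 and data.count(0x40) / len(data) > 0.10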

> then we also need to bring EBCDIC (and EBCDIK and UTF-EBCDIC for more perverted fun) into the picture. Makes heuristics extra fun.

Actual use of UTF-EBCDIC, while not nonexistent, has always been extremely rare. A person could spend an entire career dedicated to IBM mainframes and never encounter it.

EBCDIK, at first I wondered if that was a joke, now I realise it is a name used for non-IBM Japanese EBCDIC code pages. Again, something one can spend a whole career in mainframes and never encounter – if one never works in Japan, if one works for a vendor whose products aren't sold in Japan, probably even if one works for a vendor whose products are sold in Japan but only to IBM sites (as opposed to Fujitsu/Hitachi sites).

iamcreasy on 2024-04-30

By heuristics, is the author referring to the rules and policies published by Unicode? [1]

Link [1] was referenced as a solution to this problem in this article: Absolute Minimum Every Software Developer Must Know About Unicode in 2023 [2].

[1] https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Bound...

[2] https://tonsky.me/blog/unicode/

mgaunard on 2024-04-29

There is a pretty successful world language standard: English.

bawolff on 2024-04-29

Kids these days. Latin was a much more elegant standard.

CoastalCoder on 2024-04-30

Yeah, but: Quis legentibus scribit? ("Who writes for those who read?")

jujube3 on 2024-04-29

Actually, I can just assume UTF-8, since that's what the world standardized on. Just like I can assume the length of a meter or the weight of a gram. There is no need to have dozens of incompatible systems.

fl7305 on 2024-04-29

That comparison doesn't hold.

If you're dealing with lengths, you can get input data in meters, centimeters, millimeters, inches, feet, etc.

If the input data is human heights, would you automatically assume meters even if the input data is "183"?

If the input data is the weight of humans, would you always assume grams, even if the input data is "75"?

asddubs on 2024-04-29

would you guess the unit instead of specifying what you expect?

fl7305 on 2024-04-29

> would you guess the unit instead of specifying what you expect?

It depends on the circumstances. It might be the least bad thing to do. Or not.

But that wasn't my point. I replied to this:

> I can assume the length of a meter or the weight of a gram

Sure, the length of a meter and the "weight" of a gram are both standardized. (To be very picky, "gram" is a mass, not a weight. The actual weight depends on the "g" constant, which on average is 9.81 m/s^2 on earth, but can vary about 0.5%.)

So if you know the input is in meters, you don't need to do any further processing.

But dealing with input text files with an unknown encoding is like dealing with input lengths with an unknown unit.

So while UTF-8 itself might be standardized, it is not the same as all input text files always being in UTF-8.

You can choose to say that all input text files must be in valid UTF-8, or the program refuses to load them. Or you can use silent heuristics. Or something inbetween.

jujube3 on 2024-04-29

> It depends on the circumstances. It might be the least bad thing to do. Or not.

It is not.

https://en.wikipedia.org/wiki/Mars_Climate_Orbiter#Cause_of_...

kelnos on 2024-04-30

Giving an example of a specific circumstance where it's bad doesn't invalidate the idea that it "depends on circumstances".

fl7305 on 2024-04-30

Is that supposed to be proof that it never depends on the circumstances?

Write all software like NASA needs to do for spacecraft?

Let me know when anyone has developed a web browser to those standards.

jujube3 on 2024-04-30

I don't understand the outraged tone. Asking developers to write actually good software shouldn't be viewed as some kind of crazy imposition. We don't have to write everything like it is running on a spacecraft (which I never claimed), but we should try to make it actually good. For example, if there was a web browser with a security compromise and the makers left it unfixed for a long time, there would be consequences. Saying "well, it's just a browser" wouldn't cut it.

More to the point, what situation can you think of where guessing measurement units is a good idea? In a CNC machine? Maps program? Somewhere else? You seem to have omitted the actual counterargument part from your counterargument, while adding a hearty dash of misplaced outrage.

etiennebausson on 2024-04-30

Why would you assume grams when the SI unit is kg and the inputted value fits?

(Yeah, my only gripe with SI, but it's a frustrating one.)

qwertox on 2024-04-29

You can't assume it if you're handed a random sample of text files.

duskwuff on 2024-04-29

You can start by assuming UTF-8, then move on to other heuristics if UTF-8 decoding fails. UTF-8 is "picky" about the sequence of bytes in multi-byte sequences; it's extraordinarily unlikely that text in any other encoding will satisfy its requirements.

(Other than pure ASCII, of course. But "decoding" ASCII text as UTF-8 is safe anyway, so that hardly matters.)
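
In Python terms, a minimal sketch of that approach (the cp1252 fallback is just an example choice, not a recommendation):

    def decode_text(data: bytes) -> str:
        # UTF-8 is strict about multi-byte sequences, so a successful decode
        # is strong evidence the input really is UTF-8 (or plain ASCII,
        # which decodes identically)
        try:
            return data.decode("utf-8")
        except UnicodeDecodeError:
            # fall back to other heuristics or a legacy default here
            return data.decode("cp1252")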

ranger_danger on 2024-04-30

Shift-JIS specifically cannot be reliably detected, and can pass for UTF-8 but actually make no sense if you look at a conversion of it.

kelnos on 2024-04-30

I don't think that's true. Looking at how it's encoded[0], it seems similar to many other country/language-specific encodings: bytes 0-127 are the control chars, latin alphabet, and symbols, which is more-or-less ASCII, while 128-255 represent characters specific to the language at hand.

The only way you'd successfully decode Shift-JIS as UTF-8 is if it essentially is just latin-alphabet text (though the yen symbol would incorrectly display as a '\'). If it includes any non-trivial amount of Japanese in it, it'll fail to decode as UTF-8.
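
A quick Python check bears that out (a small sketch; any Japanese phrase will do):

    data = "こんにちは".encode("shift_jis")  # b'\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd'
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        # 0x82 is a UTF-8 continuation byte and can't start a sequence
        print("not valid UTF-8")  # this branch is taken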

As for whether or not you can then (after it fails to decode as UTF-8) use statistical analysis to reliably figure out that it's in fact Shift-JIS, and not something else, I can't speak to that.

[0] https://en.wikipedia.org/wiki/Shift_JIS#Shift_JIS_byte_map

duskwuff on 2024-04-30

Do you have an example in mind? Looking at the Shift-JIS encoding tables, that seems unlikely to happen in a text of any nontrivial length; there's a small number of Shift-JIS sequences which would be valid as UTF-8, and any meaningfully long text is likely to stray outside that set.

ranger_danger on 2024-04-30

I don't think it's fair to require "meaningfully long text" since when you're dealing with strings in programming they can often be of any arbitrary length.

duskwuff on 2024-04-30

Encoding detection is usually applied to a larger document, at the point it's ingested into an application. If you're applying it to short strings, something's not right -- where did those strings come from?

ranger_danger on 2024-04-30

Taking an ID3 tag example, if you are mass-converting/sanitizing/etc. tag titles and other similar metadata, the strings are often very short, sometimes only even a single codepoint or character, and proper assumptions of encoding can not be relied on because so many people violate specs and put whatever they want in there, which is the whole point of wanting to sanitize the info in the first place.

kelnos on 2024-04-30

Even then, I think it's likely that your average short byte sequence that's valid Shift-JIS would still not be valid UTF-8.

ranger_danger on 2024-05-02

I disagree... half of the UTF-8 Latin-1 supplement range overlaps with the entire half-width katakana alphabet.

plorkyeran on 2024-04-29

I in fact can assume it. If the assumption is wrong then that's someone else's problem. 15 years ago I wrote a bunch of code using uchardet to detect encodings and it was a pretty useful feature at the time. In the last decade everything I've touched has required UTF-8 unless it's been interoperating with a specific legacy system which has some other fixed charset, and it's never been an issue.

treflop on 2024-04-29

There’s a difference between assuming and not making a distinction.

Very few developers I’ve met could make that distinction. They’d see a few off characters and think it’s some one-off bug, but it’s because both sides are assuming an encoding.

Even if you said you’d pay them one billion dollars to fix it, they’d absolutely be unable to.

jujube3 on 2024-04-29

OK, for a billion dollars I will write a program that deals with KOI-8 or any other text encoding from the 1960s you care to name.

Please pay up front.

Dylan16807 on 2024-04-30

> Even if you said you’d pay them one billion dollars to fix it, they’d absolutely be unable to.

Unless you want it fixed immediately, then a million dollars should motivate almost any developer to spend a month learning, a month doing, and a few years on vacation. A billion is incomprehensible.

paulddraper on 2024-04-29

You can assume encoding is UTF-8, length is in meters, and timezone is UTC.

-

You just won't always be right.

kazinator on 2024-04-30

The trick is to be wrong in precisely those situations in which you don't care.

nijave on 2024-04-29

Microsoft would like a word with you (utf-8-bom & utf-16)

pixl97 on 2024-04-30

Microsoft wants a word with everybody.... and that word is "pay us more money".

pyuser583 on 2024-04-29

Universal UTF-8 is a hope we aspire to, not a reality we assume.

fiddlerwoaroof on 2024-04-29

The way you get that reality is you do the opposite of the recommendation of Postel’s law: be very picky about what you consume and fail loudly if it’s not UTF-8

mschuster91 on 2024-04-30

> CSV files, in particular, have no way to signal, in-band, which encoding is used.

That's actually wrong. Add a UTF-8 BOM; that's enough for Excel (and some other libraries) to know what is going on [1].
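
In Python, for instance, the "utf-8-sig" codec writes that BOM for you (a minimal sketch; the file name and contents are made up):

    # "utf-8-sig" prepends the EF BB BF signature that Excel looks for
    with open("prices.csv", "w", encoding="utf-8-sig", newline="") as f:
        f.write("name,price\ncafé,3.50\n")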

[1] https://csv.thephpleague.com/8.0/bom/

planede on 2024-04-30

A bag of bytes is a bag of bytes. Any encoding should be either assumed by the protocol being used or otherwise specified.

jcranmer on 2024-04-29

I haven't seen discussion of this point yet, but the post completely fails to provide any data to back up its assertion that charset detection heuristics work, because the feedback I've seen from people who actually work with charsets is that they largely don't (especially if they're based on naive one-byte frequency analysis). Okay, sure, it works if you want to distinguish between KOI8-R and Windows-1252, but what about Windows-1252 and Windows-1257?

See for example this effort in building a universal charset detector in Gecko: https://bugzilla.mozilla.org/show_bug.cgi?id=1551276

toast0 on 2024-04-29

I've done some charset detection, although it's been a while. Heuristics kind of work for some things --- I'm a big fan of: if it's decodable as utf-8, it's probably utf-8, unless there are zero bytes (in most text). If there's a lot of zero bytes, maybe it's UCS-2 or UTF-16, and you can try to figure out the byte order and see if it decodes as utf-16.
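
Something like this Python sketch, roughly (the return labels and the byte-order trick are my own approximations of the idea):

    def guess_encoding(data: bytes) -> str:
        if b"\x00" not in data:
            try:
                data.decode("utf-8")
                return "utf-8"
            except UnicodeDecodeError:
                return "unknown-8-bit"
        # zero bytes suggest UTF-16; their positions hint at the byte order
        # (ASCII-range characters put the zero in the high byte)
        order = "utf-16-le" if data[1::2].count(0) > data[0::2].count(0) else "utf-16-be"
        try:
            data.decode(order)
            return order
        except UnicodeDecodeError:
            return "unknown"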

If it doesn't fit in those categories, you've got a much harder guessing game. But usually you can't actually ask the source what it is, because they probably don't know and might not understand the question or might not be contactable. Usually, you have to guess something, so you may as well take someone else's work to guess, if you don't have better information.

renhanxue on 2024-04-29

Yeah. The fantastic python library ftfy ("fixes text for you", https://ftfy.readthedocs.io/en/latest/index.html), designed to fix mangled Unicode (mojibake, of many different varieties), mentions in its docs that heuristic encoding guessers are the cause of many of the problems ftfy is designed to fix. It's magical, by the way.

It has an entire section of its documentation dedicated to why you should not use heuristic charset detection: https://ftfy.readthedocs.io/en/latest/avoid.html#don-t-use-c...

torstenvl on 2024-04-30

That section explains why not to use a specific naive charset detection library that doesn't have a strong prior for UTF-8. There's no basis for extrapolating that further.

Karellen on 2024-04-29

Don't worry, I never assume UTF-8.

I require UTF-8. If it isn't currently UTF-8, it's someone else's problem to transform it to UTF-8 first. If they haven't, and I get non-UTF-8 input, I'm fine bailing on that with a "malformed input - please correct" error.

fl7305 on 2024-04-29

That works until you can't pay your bills unless you take a new contract where you have to deal with a large amount of historical text files from various sources.

Then it's no longer "someone else's problem".

cheeze on 2024-04-29

Even then, at that point I'm writing an adapter to convert into UTF-8 JIT in front of my service. Or modifying my program when that contract comes up, but I'm not going to waste time proactively safeguarding against a what-if that may come up at some point.

fl7305 on 2024-04-29

> I'm not going to waste time proactively safeguarding against a what-if

The article doesn't say that you should. It clearly states that for many cases, the input format is known or explicitly stated in the input headers.

The article talks about cases where the input files are in an unknown input format. Even then, it states: "Perhaps there is a case to be made that csvbase's auto-detection should be a more explicit user interface action than a pre-selected combo-box."

But for the case where the requirements call for heuristics, the article then talks about how that can be done.

> at that point I'm writing an adapter to convert into UTF-8 JIT in front of my service.

And at that point you might need the advice in the article, right?

hot_gril on 2024-04-30

The article is overly broad, and if the point is that not every single system ever uses UTF-8... well of course.

kelnos on 2024-04-30

> at that point I'm writing an adapter to convert into UTF-8 JIT in front of my service

Right, and at that point you're probably gonna need these statistics-based heuristics to write that adapter. Unless you know specifically what other encoding each bit of input is. If you do, then, again, you are not the target audience for this article.

kazinator on 2024-04-30

If you have to deal with a large number of historic text files that are not in UTF-8, why, that's a chance to promote UTF-8!

ahi on 2024-04-30

> a large amount of historical text files from various sources

Sounds like academia. "You" can't pay your bills anyway.

lmm on 2024-04-29

[flagged]

Karellen on 2024-04-29

If the Unicode consortium haven't been able to come up with a way of encoding their name correctly, I don't see what hope I have of doing so.

Bonus - as soon as the Unicode consortium do find a way, my software should be able to handle it with no further changes. Well, it might need a recompile against a newer `libicu` as I don't think they maintain ABI backcompat between versions. But there's not much I can do about that.

lmm on 2024-05-01

Unicode can't do it without a breaking change. But if you support non-unicode encodings in the traditional, documented and standard way then your application will handle it fine. Who knows, one day a successor to unicode may come out that can handle all languages properly within a single encoding, in which case an application written to be encoding-aware will support it without even a recompile.

okanat on 2024-04-29

What a bad, hyperbolic take. UTF-8 can encode the entire Unicode space. All you need is up-to-date libraries and fonts to display the codepoints correctly. It is backwards compatible forever. So requiring UTF-8 allows the Japanese to represent their writing system exactly how it is and keep the scheme for a very long time, with room to improve.

Manabu-eo on 2024-04-30

Japanese text in utf-8 is frequently rendered with the Chinese version of the kanji due to han unification, not "representing their writing method exactly". The Shift-JIS encoding communicates that the text is in Japanese via the encoding itself, which facilitates selecting the correct font.

And it does indeed facilitate that: in practice it works better than encoding in utf-8, which lacks an in-band way to communicate the language, while out-of-band metadata often fails, is ignored, or doesn't exist.

eviks on 2024-04-30

UTF-8 has language tags, an in-band way to communicate a language.

lmm on 2024-04-30

It "has" them but they're officially deprecated and generally unsupported.

GrantMoyer on 2024-04-29

I get that you're referring to Han Unification, but if software doesn't display unified glyphs with the correct style, that's an issue with the font rendering system, not Unicode. Sure, the font rendering system's job may have been easier had Unicode made different choices, but encoding-wise, Unicode is no more ambiguous than any other encoding it's round trip compatible with. The font rendering system is free to assume all unified glyphs should be rendered in a Japanese style, just like it would have with a Japanese-centric encoding.

lmm on 2024-04-29

> The font rendering system is free to assume all unified glyphs should be rendered in a Japanese style

In theory yes. In practice it doesn't.

> just like it would have with a Japanese-centric encoding.

The difference is that encoding-aware systems naturally use Japanese fonts for Japanese encodings and other fonts for other encodings.

Dylan16807 on 2024-04-30

> The difference is that encoding-aware systems naturally use Japanese fonts for Japanese encodings and other fonts for other encodings.

Systems like what?

Switching your entire encoding system to set a different font is by far the stupidest way to do it.

lmm on 2024-04-30

> Systems like what?

Encoding-aware programs, UI toolkits, etc..

> Switching your entire encoding system to set a different font is by far the stupidest way to do it.

If it's stupid and it works, it's not stupid. I wish there were other reliable ways to have international programs display Japanese correctly, but there aren't.

Dylan16807 on 2024-04-30

> Encoding-aware programs, UI toolkits, etc..

Name just one please.

To be clear, we're talking about a program/library that can handle both unicode and shift-JIS, and it will render a character that unicode considers identical in different ways depending on what encoding you loaded the character from, right?

lmm on 2024-04-30

> Name just one please.

Firefox (not the best example for a number of reasons, if you're going to follow up then I'll talk about a different one, but if you really want just one then it's the program I have to hand right now).

> To be clear, we're talking about a program/library that can handle both unicode and shift-JIS, and it will render a character that unicode considers identical in different ways depending on what encoding you loaded the character from, right?

Yes.

Dylan16807 on 2024-04-30

> Firefox

Huh, today I learned.

Though if you're making any attempt to use valid tags you'll have <html lang="ja">, and that solves the problem for the web context at least as far as my testing goes.

lmm on 2024-04-30

> Though if you're making any attempt to use valid tags you'll have <html lang="ja">

Right, which is why that's a bad example. But it's the same for most "normal" applications (even e.g. text editors, or I remember hitting this kind of thing in WinRAR), and a lot of the time there isn't a standard way of indicating the language/locale in the file. Even within firefox, there are (admittedly rare) cases where you're viewing something that isn't HTML and doesn't have HTTP headers so using the encoding or manually setting the page language is still the only way to make it work - and applications that have a manual "file language" setting are the exception rather than the rule.

toast0 on 2024-04-29

My understanding is Unicode (and therefore UTF-8) can encode all the codepoints encodable by Shift JIS. I know that you need a language context to properly display the codepoints that have been Han Unified, so that could lead to display problems. But if we're trying to properly display a Japanese name, it's probably easier to put the appropriate language context in a UTF-8 document than it is to embed Shift JIS text into a UTF-8 document.

Realistically --- if someone hands me well marked Shift JIS content, I'm just going to reencode it as UTF-8 anyway... And if they hand me unmarked Shift JIS content, I'll try to see if I can decode it as UTF-8 and throw it away as invalid if not.
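
In Python, that reencode is a couple of lines (a sketch; the sample string stands in for real well-marked input):

    sjis_bytes = "日本語".encode("shift_jis")  # stand-in for well-marked input
    # decode() raises if the marking was wrong and it isn't valid Shift JIS
    utf8_bytes = sjis_bytes.decode("shift_jis").encode("utf-8")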

torstenvl on 2024-04-30

> My understanding is Unicode (and therefore UTF-8) can encode all the codepoints encodable by Shift JIS

Trivia: There are some variants of ShiftJIS where this isn't entirely true. The traditional yuugen gaisha symbol, for example, which is analogous to U+32CF (LTD), is not supported. The /VENDORS/APPLE/JAPANESE.TXT file uses a PUA designator and then a sequence of four Unicode code points to convert it.

lmm on 2024-04-29

> if we're trying to properly display a Japanese name, it's probably easier to put the appropriate language context in a UTF-8 document than it is to embed Shift JIS text into a UTF-8 document.

You'd think that, but in practice I've found the opposite. Applications that use encodings managed to display things properly. Applications that hardcode UTF-8 don't.

eviks on 2024-04-30

There are both in-band and out-of-band ways to signal language while staying with utf-8, so you don't need to exclude anyone

lmm on 2024-04-30

Only if you actually implement one of those ways of signalling. Which most (IME all) people who make UTF-8 only applications don't!

kelnos on 2024-04-30

I'm no encoding geek, but I'd say I have more than a passing familiarity with the issues involved. Still, I didn't know UTF-8 had in-band language signaling until today, so it perhaps doesn't surprise me that many applications don't implement it. (UI toolkits should, though... there's kinda no excuse for that.)

eviks on 2024-04-30

Just like Shift-JIS only works if you actually implement it! Which, I guess, all of the utf-8-only apps also don't?

lmm on 2024-04-30

If you implement any kind of encoding support (that is, any kind of support for non-ASCII/non-unicode) you will probably have working Shift-JIS support even if you never test it, because Shift-JIS works the same as every other encoding you might test with. If you tested French or Spanish or really anything that wasn't English, you will display Japanese fine.

If you implement only unicode then you put yourself in a situation where Japanese is uniquely different from every other language, and your program will not work properly for Japanese unless you tested Japanese specifically.

eviks on 2024-04-30

This wouldn't solve your original problem, since UTF-8 is also popular with the Japanese, so just adding Shift-JIS isn't enough. It comes down to the same basic thing: to support Japanese, you have to do some extra work to get and use extra info about the language, which is also possible within UTF-8, counter to your initial broad claim of the opposite.

lmm on 2024-05-01

> This wouldn't solve your original problem since UTF8 also popular with Japanese

Only among people who don't care. Implementing encoding support means your app supports a way to display Japanese properly. If you want to add more ways to display Japanese properly, go ahead, but that's supererogatory, whereas UTF8-only apps don't have a way to display Japanese properly at all.

> to support Japanese, you have to do some extra work to get and use extra info about a language, which is also possible within UTF8

It's not. There's some imaginary theorycrafted way in which it might notionally be possible within UTF-8, but not one UTF-8-only app has ever actually implemented support for displaying Japanese properly. The only approach to displaying Japanese properly that has ever actually been implemented in reality is to support multiple encodings (or to support only a Japanese encoding, but that has obvious downsides), and if you make your app encoding-aware then that's enough, you don't have to do anything else to be able to display Japanese properly for people who care about displaying Japanese properly (it's always possible for an app maker to go above and beyond, but I'm sure you'll agree the big difference is between an app that has a way to display proper Japanese at all and one that does not).

gabrielhidasy on 2024-04-29

Are there Japanese characters missing in UTF-8? They should be added ASAP.

I know there's a weird Chinese/Japanese encoding problem where characters that kind-of look alike have the same character id, and the font file is responsible for disambiguation (terrible for multi-language content and we should really add more characters to create versions for each, but still the best we have).

torstenvl on 2024-04-30

IMHO the Unicode Consortium should standardize on using a variant selector for switching between Chinese and Japanese variants of unified Han characters. Best of both worlds: language-independent specificity while keeping the ability to have both in the same document.

pavel_lishin on 2024-04-29

Does the entirety of Japan shun utf8?

koito17 on 2024-04-29

Many Japanese websites have migrated from Shift-JIS to UTF-8, but this still ignores the fact that e.g. television captioning uses special characters[1] that are not found in UTF-8 or Shift-JIS. Windows itself has a habit of using its own Windows-932 encoding, which frequently causes problems in the Unix software I use. (e.g. Emacs fails at auto-detecting this format, and instructing Emacs to use Shift-JIS will result in decoding issues)
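
For what it's worth, Python ships a "cp932" codec for the Windows flavor; a small sketch of the mismatch, using one of the NEC extension characters (my example, not from the standards):

    data = b"\x87\x40"            # '①' in Windows-932 (cp932), a NEC extension
    print(data.decode("cp932"))   # decodes fine
    try:
        data.decode("shift_jis")  # Python's stricter JIS X 0208-based codec
    except UnicodeDecodeError:
        print("not valid plain Shift-JIS")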

[1] section 1.2 in https://www.fontucom.com/pdf/AFSARIBR-Map.pdf

Dwedit on 2024-04-29

All the television caption characters shown in that chart have a Unicode code-point defined, so they exist in UTF-8.

cheeze on 2024-04-29

If Emacs fails to do it correctly, end developers are screwed.

koito17 on 2024-04-29

Windows-932 is the text encoding I dread the most. I wish literally everything could use Unicode, but we're not quite there yet. There's a reason encoding-japanese[1] on NPM has nearly 1 million weekly downloads. Unfortunately, I have to use it in one of my React Native applications since one of the servers I speak to returns text encoded as Shift-JIS and Hermes does not implement JavaScript's TextDecoder API. Shift-JIS is marginally better than the Windows code page, but I'd really prefer UTF-8.

[1] https://www.npmjs.com/package/encoding-japanese

lmm on 2024-04-29

More or less. The proportion of Japanese websites that use Shift-JIS is actually increasing, for example. (It's true that the absolute number of Japanese websites using UTF-8 is increasing, but that's misleading - it's only due to the overall growth of the web).

java-man on 2024-04-29

This is interesting. Can you show which Japanese names cannot be encoded in UTF-8 please?

TheDong on 2024-04-30

The parent said "displayed correctly", not "encoded".

For example, if I want to talk about the fairly rare japanese name '刃一' (jinichi), there's a chance your computer displays it correctly, but there's also a chance your computer displays the chinese variant of the first character, making it look wrong. It's basically up to font-choice of your computer.

The "correct" way to fix that would be for me to be able to tag it with 'lang=ja', but Hacker News doesn't let me include html tags or some other 'language selector' in my comment, so I'm unable to indicate whether that's supposed to be the chinese or japanese variant of the character.

Most unicode text files don't have extra metadata indicating if a certain bit of text is japanese or chinese, so displaying it correctly by adding the correct 'lang' tag is impossible, especially since it's perfectly possible for one utf-8 text to mix both chinese and japanese.

See https://heistak.github.io/your-code-displays-japanese-wrong/

int_19h on 2024-04-30

How does Shift-JIS fix this?

TheDong on 2024-04-30

I didn't propose any solution or claim Shift-JIS fixes this. It doesn't really since a single Shift-JIS document can only encode the japanese variant, not both.

However, a unicode codepoint which acted as a "language hint" switch would solve this, and wouldn't require doubling the number of han codepoints.

There already are unicode variant selectors, but iiuc they only apply to a single character at a time, and no one actually uses them, so they're not very useful.

lmm on 2024-04-30

The Shift-JIS codepoints for the characters of that name are understood to refer to Japanese characters, so fonts render them correctly. Encoding-aware programs have different representations (such as GB 18030 codepoints) for the similar-but-different Chinese characters that unicode-only programs tend to display these characters as, and so will render them differently.

int_19h on 2024-04-30

If I understand you correctly, it means that the encoding itself serves as the metadata that indicates Chinese/Japanese. In which case, why is it unreasonable to ask for the same for UTF-8, except using some more clearly specified way to indicate this (like lang="ja" etc), rather than encoding it all into separate characters?

lmm on 2024-05-01

> If I understand you correctly, it means that the encoding itself serves as the metadata that indicates Chinese/Japanese.

Only if you have some unicodebrained mentality where you consider a Chinese character that looks sort of similar to a Japanese character to be "the same". If you think of them as two different characters then they're just different characters, which may or may not be present in particular encodings (which is completely normal if you make a program that handles multiple encodings: not every character exists in every encoding)

> In which case, why is it unreasonable to ask for the same for UTF-8, except using some more clearly specified way to indicate this (like lang="ja" etc), rather than encoding it all into separate characters?

Firstly, you have to "draw the rest of the fucking owl" and actually implement that language selection mechanism. Secondly, if you implement some clever extension mechanism on top of UTF-8 that's only needed for Japanese and can only really be tested by people who read Chinese or Japanese, realistically even if you implement it perfectly, most app makers won't use it or will use it wrong. Whereas if you implement encoding-awareness in the standard way that we've been doing for decades and test with even one non-default encoding, your application will most likely work fine for Japanese, Chinese and every other language even if you never test it with Japanese.

lmm on 2024-04-29

Any of the well-known han unification examples. People claim this can be solved with some kind of out-of-band font selection vaporware, but the kind of person who thinks it's fine to demand UTF-8 everywhere never actually stoops to implement this imaginary font selection functionality.

ogurechny on 2024-04-30

Well, I must nitpick: Shift JIS is actually just one of those out-of-band vaporware methods. Sure, it's better supported because of legacy, but new code that doesn't care about handling some lang metadata is not going to care about Shift JIS either.

Of course, there is no correct solution (people in other comments seem to believe it exists a bit too quickly). A dynamic HTML page can load pieces of text written by people all across the world; it can't rely on any “main” language. Those people can't be automatically classified (based on some browser settings or location). Then there are always people who know both Chinese and Japanese (while using some other system locale, to make things more complex). It is wrong to assume that they should not be able to use Chinese forms and Japanese forms at once, even inside a single paragraph or phrase.

I wonder why Unicode has not simply introduced some “combining language form character” to make user choice stick. After all, there's a whole subsystem for emoji modification, and those things were once “weird Japanese texting customs”. As for complexity of handling Unicode text, it asymptotically reaches its maximum anyway.

Wait a second, there are “variation selectors” and some “Ideographic Variation Database”. Is it the solution? Can IMEs and converters simply stamp each variable character with invisible mark based on current input language? I suppose there's some catch…

lmm on 2024-04-30

> Shift JIS is actually just one of those out-of-band vaporware methods.

It's the opposite of vaporware; there's a whole bunch of well-known software that handles encodings correctly.

> new code that doesn't care about handling some lang metadata is not going to care about Shift JIS either.

If you support encodings the way almost every programming language tells you to, you'll handle Shift-JIS just fine. If you use the "legacy" encodings approach and test even one non-English language, you'll handle Japanese fine.

> Can IMEs and converters simply stamp each variable character with invisible mark based on current input language? I suppose there's some catch…

They're officially deprecated, IMEs and converters don't use them, naïve search implementations (which is to say most search implementations, because people do not test for an obscure edge case) break, ....

ogurechny on 2024-04-30

But there is no “naive” implementation of Unicode. It won't handle emoji, normalization, and thousands other things. People use The Library anyway. Adding another mutagen to it should be no different from the rest.

As for official stance, maybe it's time for some group to agree on certain non-conflicting sequences, and implement them in some library/stack. A particularly evil solution would choose arbitrary existing combining characters that can't be declared “illegal” retroactively.

By the way, is it possible to make proper quotes from other languages as described above when using Shift JIS?

lmm on 2024-05-01

> But there is no “naive” implementation of Unicode.

I assure you there are thousands if not millions of apps that implement search the naive way, by comparing bytes.

> It won't handle emoji, normalization, and thousands other things.

Yeah. But it works well enough for Americans, so the maintainers don't care enough to fix it.

> As for official stance, maybe it's time for some group to agree on certain non-conflicting sequences, and implement them in some library/stack.

And then what? What do you do with documents in Japanese that have already been converted without using your new codepoints? What do you do with programs that ignore your new standard? What do you do about search being broken in most apps with no prospect of fixing it? There are already some PUA codepoints for a proper implementation of Japanese, but most app authors don't even understand the problem, never mind being willing to support something that's "non-standard". Asking them to support traditional non-unicode encodings, which is something that's at least relatively well-known and standardised, and something they can test without knowing Japanese, is much easier.

> By the way, is it possible to make proper quotes from other languages as described above when using Shift JIS?

No, not for arbitrary languages. If you want to mix arbitrary languages, you need some structure that's a rope of strings with an encoding per segment. But that's exactly the same thing that the unicode advocates claim is easy ("just have a sequence of spans with different lang tags"), and at least if you use traditional encodings then you don't have to do any such fancy business for the much more common case of a file that's entirely in Japanese (perhaps with some ASCII keywords).

mnau on 2024-04-30

> implement this imaginary font selection functionality

We are already doing GSUB/GDEF tables and a lot of horrible stuff to display pretty glyphs on the screen. Hell, there are literal instructions for a VM in TTF files to help with pretty rasterization on low-res screens.

Making a font rendering library is hard, Nintendo hard.

That's just the nature of the beast. Fonts are messy, and if we want one standard to deal with it once and for all, that means some compromises must be made. CJK is not some obscure alphabet. In this case, perfect is the enemy of good.

I am just glad there really is only one (non-niche) standard, obligatory https://xkcd.com/927/

lmm on 2024-04-30

> CJK is not some obscure alphabet.

Right. One would think that misrendering a major world language used by over 100 million people would be an issue that warrants some attention. But too many of the HN crowd don't care.

> In this case, perfect is enemy of good.

It's not though. Unicode-only is not just imperfect, it's an outright regression for Japanese. Meanwhile traditional encoding-aware programs render Japanese just fine.

mnau on 2024-05-02

> It's not though. Unicode-only is not just imperfect, it's an outright regression for Japanese.

Unicode has variant selectors which can deal with region variants (e.g. https://imgur.com/a/syMcWNO).

Granted, it's not very widely used, but Unicode provided a solution. The onus is now on application developers and font providers.

Choosing to limit characters to 2 bytes was a good technical choice back in 1992; non-unified CJK wouldn't fit (that was a time of 2-4MB RAM and 100-200MB HDDs). A solution to the problem was later provided.

Using Shift-JIS is like running your own on-IP, non-UDP protocol on the internet.

Sure, you can do it, but it's just a choice with no future. Time is better invested in improving the standard protocol, which is Unicode (e.g. conversion software).

lmm on 2024-05-07

> Unicode has variant selectors which can deal with region variants (e.g. https://imgur.com/a/syMcWNO) to deal with the issue.

They're officially deprecated and cause issues with breaking search etc.. So they're still below feature parity with using a traditional encoding. Traditional encoding support is also easier to test since documents with traditional encodings are more widespread and exist for many world languages, not just Japanese.

kstrauser on 2024-04-29

They cannot.

ranger_danger on 2024-04-29

Unless you're using an OS older than Windows 2000, or a linux distro from the 2000s, where some form of Unicode was not the default encoding, or maybe an ancient Win32 program compiled without "UNICODE" defined, it shouldn't be a problem. I specifically work with a lot of Japanese software and have not seen this problem in many years.

And even back in the mid 2000s, the only real problems I saw otherwise were things like malformed html pages that assumed a specific encoding that they wouldn't tell you, or an MP3 file with an ID3 tag with CP932 shoved into it against the (v1) spec.

I also disagree with the author that Shift-JIS can be "good enough" heuristically detected, due to its use of both 7 and 8-bit values in both the high and low bytes to mean different things depending on what character is actually intended. Even string searching requires a complex custom-made version just for Shift-JIS handling.

smeagull on 2024-04-29

[flagged]

chungy on 2024-04-30

The article's pretty weird for presenting little-endian UTF-16 as normal and barely even mentioning that big-endian is an option (in fact, seems to refer to it as "backwards"), even though big-endian is a much more human readable format.

pylua on 2024-04-30

Stupid question: how are the headers passed for http? What encoding describes the encoding?

a57721 on 2024-04-30

The RFCs for HTTP require headers to be in US-ASCII.

mark-r on 2024-04-30

I think the headers are pure ASCII, which as noted is a subset of almost all encodings.

hot_gril on 2024-04-30

Java and Javascript both use UTF-16 for internal string representation, even though JSON specifies UTF-8. Windows APIs too. I'm still not sure why, but it means that one char uses at least 2 bytes even if it's in the ASCII range.

mark-r on 2024-04-30

Early adopters of Unicode used the first available encoding, UCS-2. UTF-16 is an extension of that to handle the increased range of code points that came later.
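
A small Python illustration of the difference (a sketch): a BMP character still fits in one 16-bit unit, while a later code point needs a surrogate pair.

    print("€".encode("utf-16-le").hex())   # "ac20": U+20AC fits one 16-bit unit
    print("😀".encode("utf-16-le").hex())  # "3dd800de": U+1F600 as the pair D83D DE00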

dandigangi on 2024-04-29

Except I can

GnarfGnarf on 2024-04-30

Why would anyone use anything other than UTF-8 in this day and age?

Windows took a gamble years ago when the winner was not obvious, so we’re stuck with UCS-2, but you can circumvent that with a good string library like Qt’s QString.

otikik on 2024-04-30

However, you can check for invalid utf-8 sequences, throw an error with "invalid encoding on byte x, please use valid utf-8" if one is encountered, and from that point on assume utf-8.
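
In Python, the decode error even hands you that byte offset (a minimal sketch):

    def require_utf8(data: bytes) -> str:
        try:
            return data.decode("utf-8")
        except UnicodeDecodeError as e:
            # e.start is the offset of the first offending byte
            raise ValueError(
                f"invalid encoding on byte {e.start}, please use valid utf-8"
            ) from e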

Dwedit on 2024-04-29

But you can assume non-UTF-8 upon seeing an invalid UTF-8 byte sequence. From there, it can be application-specific depending on what encoding you expect to see.

(And of course UTF-16 if there's a BOM)

camgunz on 2024-04-30

Dear lazyweb: I think I read something about Postel's Law being essential to the internet's success -- maybe this was also IPv6 related? Does anyone else remember this?

teliskr on 2024-04-30

I maintain a system I created in 2004 (crazy, right?). Not sure how we lived, but at the time emojis were not as much of a thing. This has come back to bite me several times.

ahi on 2024-04-30

Why utf-8 when we have MARC-8? https://en.wikipedia.org/wiki/MARC-8

moefh on 2024-04-30

I had never heard of MARC-8 before, but after a quick google search:

"There are no defined MARC-8 character sets for Armenian, Bengali, Devanagari, Ethiopic, Syriac, Tamil or Thai."[0]

"If a character is not representable in MARC-8 of a MARC-21 record, then UTF-8 must be used instead."[1]

[0] https://help.oclc.org/Metadata_Services/Connexion/Connexion_...

[1] https://en.wikipedia.org/wiki/MARC-8

tanin on 2024-04-30

I'm actually having this issue where users import CSV files that don't seem to be valid. DuckDB would throw errors like: Invalid Closing Quote: found non trimable byte after quote at line 34, Invalid Input Error: Invalid unicode (byte sequence mismatch) detected in value construction, Value with unterminated quote found

One example: pgAdmin can export a database table into a CSV... but the CSV isn't valid for DuckDB to consume. Because, for some odd reason, pgAdmin uses a single quote to escape a double quote.

This blog is pretty timely. Thank you for writing it!

AzzyHN on 2024-04-30

I'm not a programmer, but how hard is it to write something that just checks what encoding the file uses? Or at least does its best to guess?

torstenvl on 2024-04-30

Pretty hard. There's no true way of knowing when you're right. You can make educated guesses based on statistical likelihood of certain patterns, but nothing stops people from constructing text that happens to look more "normal" when interpreted as another character set.

If I see the bytes 0xE5 0xD1 0xCD 0xC8 0xC7, then it is a decent bet that it's intended to be مرحبا in DOS 708. But there's no definitive reason why it couldn't be σ╤═╪╫ (DOS 437) or åÑÍÈÇ (Windows-1252) or Еямих (KOI8-RU). Especially if you don't know for sure that the data is intended to be natural language.

I can be pretty reasonably certain that it isn't, say, Japanese. 0xE5 in ShiftJIS is the first of a double-byte sequence, and it's odd, so the next byte would have to come from the range 0x40-0x9E, and 0xD1 does not fall within that range. In other words, it simply isn't valid ShiftJIS (it also isn't valid UTF-8). So you can narrow down the possibilities... but you're still making an educated guess. And you have to figure out a way to program that fuzzy, judgment-y analysis into software.
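
You can see the ambiguity directly in Python (a sketch; DOS 708 isn't in the standard codec set, so only the other interpretations are shown):

    data = bytes([0xE5, 0xD1, 0xCD, 0xC8, 0xC7])
    for codec in ("cp437", "cp1252", "koi8_r"):
        # every one of these decodes without error; none is provably "right"
        print(codec, data.decode(codec))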

hgs3 on 2024-04-30

I've written code to do this. If you're lucky there will be a BOM (byte order mark) or MIME type to indicate the encoding form. In this case you know the encoding. If you don't have this information, then you must guess the encoding. The issue is guessing may not produce accurate results, especially if you don't have enough bytes to go by.

The program I wrote to guess the encoding would scan the bytes in multiple passes. Each pass would check if the bytes encoded valid characters in some specific encoding form. After a pass completed I would assign it a score based on how many successfully encoded characters were (or were not) found. After all passes completed I'd pick the highest score and assume that was the encoding. This approach ended up being reasonably reliable assuming there were enough bytes to go by.
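
A compressed Python sketch of that scoring idea (the candidate list and the printable-character score are my own stand-ins for the real passes):

    CANDIDATES = ("utf-8", "shift_jis", "cp1252", "koi8_r")

    def score(data: bytes, encoding: str) -> int:
        # one "pass": decode, then count characters that look like real text
        try:
            text = data.decode(encoding)
        except UnicodeDecodeError:
            return -1
        return sum(ch.isprintable() or ch.isspace() for ch in text)

    def detect(data: bytes) -> str:
        return max(CANDIDATES, key=lambda enc: score(data, enc))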

bastawhiz on 2024-04-30

It's not easy! Go check out the source for chardet (Python). It's not pretty.

Havoc on 2024-04-29

You underestimate my willingness to happy path code...

missblit on 2024-04-30

Encodings I have used or been exposed to in my career: ASCII, Latin1, Windows-1252, UTF-8, UTF-16, UTF-32, GBK, Zawgyi, Shift-JIS

kdklol on 2024-04-30

>You can't just assume UTF-8

But I will, because in this day and age, I should be perfectly able to do so. Non-use of UTF-8 should be simply considered a bug and not treating text as UTF-8 should frankly be a you problem. At least for anything reasonably modern, by which I mean made in the last 15 years at least.

dublin on 2024-04-29

Make your life easy. Assume 7-bit ASCII. No one needs all those other characters, anyway...

AlienRobot on 2024-04-29

Do we really need 128 permutations just to express an alphabet of 26 letters?

I think we should use a 4 bit encoding.

0 - NUL

1-7 - aeiouwy

8 - space

9-12 - rst

13-15 - modifiers

When modifier bits are set, the values of the next half-byte change to represent the rest of the alphabet, numbers, symbols, etc. depending on the bits set.

Dylan16807 on 2024-04-30

I like it, but what if we use 2 to pick symbols, 3 to pick numbers, and 4 and 5 to pick capital letters.

apantel on 2024-04-30

R S T L N E

jonathankoren on 2024-04-30

As someone who has repeatedly had to deal with Unicode nonsense; I wholeheartedly agree. Also, you don’t need accents. You just need to know how to read and have context. See: live and live, read and read, etc.

rch on 2024-04-30

Don't assume; force UTF-8.

teknopaul on 2024-04-30

I suspect the author doesn't know about the first-bit-being-1 thing.

utf8 is magic.

You can assume US ASCII for lots of very useful text protocols, like http and Stomp, and not care what the variable string bytes mean.

Soooo many software architects don't grok the magic of it.

You can define an 8-bit parser that checks for "a:"(some string)\n

and work with a shitload of human languages.
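
Something like this sketch (the function is mine, not from any particular protocol library); it works because ':' and '\n' can never appear inside a utf8 multi-byte sequence:

    def parse_header(line: bytes) -> tuple[bytes, bytes]:
        # split "a:"(some string)"\n" as raw bytes; the value passes
        # through untouched in whatever human language it's written in
        key, _, value = line.rstrip(b"\n").partition(b":")
        return key, value

    print(parse_header("a:日本語\n".encode("utf-8")))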

The company I work for does not realise that most of the 50-year-old legacy C it has is fine with utf8 for all the arbitrary fixed-length or \0-terminated strings it stores.

neonsunset on 2024-04-30

I will assume UTF-8 and if it's not, it will be your fault :)

smeagull on 2024-04-29

I absolutely can. If it's not UTF-8, I assume it's worthless.

hchak on 2024-04-29

Hoping that LLMs can solve our “Tower of Babel” problem… :)

stalfosknight on 2024-04-29

You can on all of Apple's platforms.

teddyh on 2024-04-30

If you’re actually in a position where you need to guess the encoding, something like “ftfy” <https://github.com/rspeer/python-ftfy> (webapp: <https://ftfy.vercel.app/>) is a perfectly reasonable choice.

But, you should always do your absolute utmost not to be put in a situation where guessing is your only choice.

eqvinox on 2024-04-30

The solution, obviously, is to train an LLM to recognize the character set.

layer8 on 2024-04-30

That sounds like an efficient use of computing resources.

koito17 on 2024-04-29

The comments in this thread are a bit amusing.

I wish I could live in the world where I could bluntly say "I will assume UTF-8 and ignore the rest of the world". Many Japanese documents and sites still use Shift JIS. Windows has this strange Windows-932 format that you will frequently encounter in CUE files outputted by some CD ripping software. ARIB STD-B24, the captioning standard used in Japanese television, has its own text encoding with characters not found in either JIS X 0201 or JIS X 0208. These special characters are mostly icons used in traffic and weather reports, but transcoding to UTF-8 still causes trouble with these icons.

samatman on 2024-04-29

It's more like "I will assume UTF-8 and ignore edge case encoding problems which still arise in Japan, for some strange reason".

We are not running short on Unicode codepoints. I'm sure they can spare a few more to cover the Japanese characters and icons which invariably get mentioned any time this subject comes up on HN. I don't know why it hasn't happened and I won't be making it my problem to solve. Best I can do is update to version 16 when it's released.

koito17 on 2024-04-29

I mention Japanese because I deal with Japanese text daily. I could mention some Chinese documents and sites using GBK to save space (since such encodings use exactly 2 bytes per character whereas the average size in UTF-8 is strictly larger than 2 bytes). But I am not very familiar with it. Overall, I would not say these are "strange reasons".

samatman on 2024-04-29

Other encodings exist, yes. But they can all be mapped to UTF-8 without loss of information[0]. If someone wants to save space, they should use compression, which will reduce any information, regardless of encoding, to approximately the same size. So it's perfectly reasonable to write software on the assumption that data encoded in some other fashion must be first reëncoded as UTF-8

[0]: Except Japanese, people hasten to inform us every time this comes up. Why? Why haven't your odd characters and icons been added to Unicode, when we have cuneiform? That's the strange part. I don't understand why it's the case.

Cloudef on 2024-04-30

Unicode did a kind of dumb thing with CJK: unifying the Chinese and Japanese kanji makes displaying CJK text a much harder problem than it should be, as it now also relies on a language-specific font to be displayed correctly[0]. I guess this could be bandaided by some sort of language marker in the UTF-8 bytestring, which a text shaping engine would then have to understand and switch the font accordingly.

0: https://heistak.github.io/your-code-displays-japanese-wrong/

mnau on 2024-04-30

Wasn't it solved by https://en.wikipedia.org/wiki/Han_unification#Ideographic_Va... ?

Kind of a band-aid (it's necessary to stuff a variant selector after a CJK codepoint), but should work.

These decisions were made back in 1992, and keeping codepoints within 16 bits was one of the desired goals. Non-unified CJK wouldn't fit. In hindsight, it looks like a rather unfortunate decision, but having more codepoints than would fit in 16 bits could have seriously hampered adoption and a different standard would have won (compute resources were far more limiting back then).

In either case, it's like 4-byte addressing in IPv4: in hindsight, 6+ bytes would have been better, but what's done is done.

Edit: Even in the 2000s, when C# was released, string was just a sequence of 16-bit code units (not codepoints), so they could deal with the BMP without problems and the astral planes were ... mostly DIY. They added Rune support (32-bit codepoints) only in .NET Core 3.0 (2019).

Cloudef on 2024-04-30

Seems so, I wonder when the whole stack starts supporting it (IMEs, fonts, text shaping engines)

Adobe / Google seems to have a font https://ken-lunde.medium.com/improving-font-information-proc...

EDIT: Seems IVS is really old, but its still a problem so not holding my breath

int_19h on 2024-04-30

They aren't strange, but they are sort of self-inflicted, so it's not unreasonable for others to say, "we're not going to spend time and effort to deal with this mess".

I'm Russian. 20 years ago that meant having to deal with two other common encodings aside from UTF-8 (CP1251 and KOI8-R). 25 years ago, it was three encodings (CP866 was the third one). Tricks like what the article describes were very common. Things broke all the time anyway because heuristics aren't reliable.

These days, everything is in UTF-8, and we're vastly better off for it.

kelnos on 2024-04-30

Unless the Unicode Consortium decides to undo the Han Unification stuff, I don't think it's going to get better for Japanese users, and programmers who build for a Japanese audience will have to continue to suffer with Shift-JIS.

samatman on 2024-04-30

There will be no undoing of anything, fortunately. Unicode is committed to complete backward compatibility, to the point where typos in a character name are supplemented with an alias, rather than corrected. Han Unification was an unforced error based on the proposition, which was never workable, that sixteen bits could work for everyone. This is entirely Microsoft's fault, by the way. But it shouldn't be, and won't be, fixed by breaking compatibility. That way lies madness.

There are two additional planes set aside for further Hanzi, the Supplementary and Tertiary Ideographic Planes; the latter is still mostly empty. Eventually the last unique ideograph used only to spell ten known surnames from the 16th century will also be added as a codepoint.

I view the continued use of Shift-JIS in Japan as part of a cultural trend, related to the continued and widespread use of fax machines, or the survival of floppy disks for many years after they were effectively dead everywhere else. That isn't at all intended as an insult; it's that Japanese matters stay within Japan to a high degree. Japanese technology has less outside pressure for cross-compatibility.

Shift-JIS covers all the corner cases of the language, Unicode has been slow to do likewise, and it isn't like Japanese computers don't understand UTF-8, so people have been slow to switch. It's the premise of "unaware of how it works in the rest of the world" that I object to. It's really just Japan. Everywhere else, including the Chinese-speaking parts of the world, there's Unicode data and legacy-encoded data, and the solution to the latter is to encode it in the former.

lifthrasiir on 2024-04-30

> ARIB STD-B24, the captioning standard used in Japanese television, has its own text encoding with characters not found in either JIS X 0201 or JIS X 0208.

Amazingly enough, ARIB STD-B24 is one of the major sources of Unicode emoji. So transcoding would actually work for them! (I am aware of some exceptions, but semantically there is no loss.) Unicode and UTF-8 are truly eating every other legacy encoding, so much so that it is becoming more reasonable to have a separate transcoding step.

msla on 2024-04-30

But remember, adding emoji to a character encoding standard is Morally Bad, somehow, and Proof Of Intellectual Decay In The Modern World. Also, Unicode invented emoji to Sap And Impurify Our Bodily Fluids and there is no reason for any of them to exist.

/s

bongodongobob on 2024-04-29

Most people on this site probably live in the world where everything is done in English. That's the norm for the vast majority of businesses and people in the US.

fl7305 on 2024-04-29

Even for those people, there are still tons of old text files in Windows-1252 etc. floating around.

You can choose to never work on projects where you have to deal with files like that.

But there may come a day when you have to choose between not paying your rent and writing a tool that converts old text files to UTF-8. At that point, it's nice to have references on the internet about how other people have actually dealt with it and what works. "Abort with an error" is not very useful advice then.
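
For the record, the core of such a tool is tiny once you've guessed the source encoding. A sketch assuming the input really is Windows-1252, with hypothetical file names:

    raw = open("legacy.txt", "rb").read()
    # Five byte values (0x81, 0x8D, 0x8F, 0x90, 0x9D) are undefined in
    # Windows-1252, so a strict decode can still fail on dirty data.
    text = raw.decode("cp1252", errors="strict")
    # newline="" preserves whatever line endings the original had.
    open("legacy.utf8.txt", "w", encoding="utf-8", newline="").write(text)

The hard part, of course, is the guessing.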

pixl97 on 2024-04-30

Why would you write a tool that does that instead of just digging up one that's already written? This sounds like the folly of writing one's own encryption library.

fl7305 on 2024-04-30

Which library do you pick? Which ones are good? Bad?

What are the common pitfalls that the library must deal with?

How do you know how to evaluate them if all you know is "not UTF-8? Abort with error" ?

bongodongobob on 2024-04-30

As far as implementing new tech goes, this sounds like about the easiest research project you could ask for. Grab a few and test em out. Not sure why you're making this out to be some kind of intractable problem.

fl7305 on 2024-04-30

If you look for comments by "jcranmer" elsewhere on this page you'll find examples of why it's not as simple as you make it sound.

Do you have any experience at all with different charsets under the hood by the way?

bongodongobob on 2024-04-30

I do. I cut my IT teeth in the '00s on an old-ass system from the '80s. I remember having problems feeding its reports into modern systems. Goofy problems with EOL and EOF and some other hiccups. It wasn't that bad.

pixl97 on 2024-04-30

Honestly if you can't review/read/figure that out without writing a library of your own, you probably shouldn't be writing a library of your own in the first place.

fl7305 on 2024-04-30

You can't pick and use a library like this without understanding the underlying concepts. That goes for both encryption and the charset conversion issue. It's not always just plug and play.

There are examples of where people used encryption libraries in the wrong way and undermined the strength of the encryption (for example, CVE-2024-31497 in PuTTY).

A very big part of software development is dealing with leaky abstractions. We don't work with perfect black boxes. We need to understand enough of how things work in the lower layers to avoid problems. Note here that I wrote "enough", not "everything" or "write everything yourself".

I would not want a person writing software to handle charset conversion if they refuse to learn how the various encodings work, which charsets will happily decode as another charset and which won't, etc.
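
That last point in a nutshell (a Python sketch): single-byte charsets like Latin-1 decode any byte sequence without complaint, so a wrong guess fails silently, while UTF-8 rejects most wrong guesses loudly.

    data = "café".encode("cp1252")    # b'caf\xe9'
    print(data.decode("latin-1"))     # 'café' -- any bytes "succeed" under latin-1
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as e:
        print("not UTF-8:", e)        # 0xE9 begins an incomplete multibyte sequence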

eviks on 2024-04-30

Your example seems to be due to, paraphrasing, the lack of a library rather than an inability to choose one.

"older approach, PuTTY developers said, was devised at a time when Microsoft Windows lacked native support for a cryptographic random number generator."

So "enough of how things work" could just be "pick a modern encryption library" that doesn't come from the dark ages when t there were no random numbers

Same with encodings: picking a library requires a much lower level of understanding, because you can rely on the expertise of others.

jheriko on 2024-04-30

no mention of the BOM... tragic.
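
For what it's worth, BOM sniffing fits in a few lines; a sketch, noting that most UTF-8 in the wild carries no BOM at all, so it's a hint rather than an answer:

    def bom_encoding(data: bytes):
        # Check the longer BOMs first: UTF-32-LE's BOM begins with UTF-16-LE's.
        if data.startswith(b"\x00\x00\xfe\xff"): return "utf-32-be"
        if data.startswith(b"\xff\xfe\x00\x00"): return "utf-32-le"
        if data.startswith(b"\xef\xbb\xbf"):     return "utf-8-sig"
        if data.startswith(b"\xfe\xff"):         return "utf-16-be"
        if data.startswith(b"\xff\xfe"):         return "utf-16-le"
        return None  # no BOM: back to heuristics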

mseepgood on 2024-04-30

Yes, I can.

timoteostewart on 2024-04-30

Fascinating topic. There are two ways the user/client/browser receives reports about the character encoding of content. And there are hefty caveats about how reliable those reports are.

(1) First, the Web server usually reports a character encoding, a.k.a. charset, in the HTTP headers that come with the content. Of course, the HTTP headers are not part of the HTML document but are rather part of the overhead of what the Web server sends to the user/client/browser. (The HTTP headers and the `head` element of an HTML document are entirely different.) One of these HTTP headers is called Content-Type, and conventionally this header often reports a character encoding, e.g., "Content-Type: text/html; charset=UTF-8". So this is one place a character encoding is reported.

If the actual content is not an (X)HTML file, the HTTP header might be the only report the user/client/browser receives about the character encoding. Consider accessing a plain text file via HTTP. The text file isn't likely to itself contain information about what character encoding it uses. The HTTP header of "Content-Type: text/plain; charset=UTF-8" might be the only character encoding information that is reported.

(2) Now, if the content is an (X)HTML page, a charset encoding is often also reported in the content itself, generally in the HTML document's head section in a meta tag such as '<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>' or '<meta charset="utf-8">'. Now just because an HTML document self-reports that it uses a UTF-8 (or whatever) character encoding, that's hardly a guarantee that the document does in fact use said character encoding.

Consider the case of a program that generates web pages using a boilerplate template that still uses an ancient default of ISO-8859-1 in the meta charset tag of its head element, even though the body content that goes into the template is pulled from a database that spits out UTF-8 by default. Boom. Mismatch. Janky code is spitting out mismatched and inaccurate character encoding information every day.

Or consider web servers. Consider a web server whose config file contains the typo "uft-8" because somebody fat-fingered while updating the config (I've seen this in random web pages). Or consider a web server that uses a global default of "utf-8" in its outgoing HTTP headers even when the content being served is a hodge-podge of UTF-8, WINDOWS-1251, WINDOWS-1252, and ISO-8859-1. This too happens all the time.

I think the most important takeaway is that with both HTTP headers and meta tags, there's no intrinsic link between the character encoding being reported and the actual character encoding of the content. What a Web server tells me and what's in the meta tag in the markup just count as two reports. They might be accurate, they might not be. If it really matters to me what the character encoding is, there's nothing for it but to determine the character encoding myself.
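
To make the "two reports" idea concrete, here's a rough Python sketch that reads both reports and tests them against the actual bytes (the URL and the regex are illustrative; real meta-tag parsing should follow the WHATWG sniffing algorithm):

    import re
    import urllib.request

    with urllib.request.urlopen("https://example.com/") as resp:
        header_charset = resp.headers.get_content_charset()  # report #1: HTTP header
        body = resp.read()

    m = re.search(rb'charset=["\']?([-\w]+)', body, re.IGNORECASE)
    meta_charset = m.group(1).decode("ascii") if m else None  # report #2: the markup

    print("header says:", header_charset, "| meta says:", meta_charset)
    try:
        body.decode(header_charset or "utf-8")
        print("the bytes are at least consistent with the header's report")
    except (UnicodeDecodeError, LookupError):  # LookupError catches typos like "uft-8"
        print("the header's report is wrong (or misspelled)")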

I have a Hacker News reader, https://www.thnr.net, and my program downloads the URL for every HN story with an outgoing link. I have seen binary files sent with a "UTF-8" Content-Type header. I have seen UTF-8 files sent with an "inode/x-empty" Content-Type header. My logs have literally hundreds of goofy, inaccurate reports of content types and character encodings. Because I'm fastidious and I want to know what a file actually is, I have a function `get_textual_mimetype` that analyzes the content of what the URL's web server sends me. My program downloads the content and uses tools such as `iconv` and `isutf8` to get some information about what encoding it might be. It uses `xmlwf` to check if it's well-formed XML. It uses `jq` to check whether it's valid JSON. It uses `libmagic`. There's a lot of fun stuff the program does to pin down with a high degree of certainty what the content is. I want my program to know whether the content is an application/pdf, an image/webp, a text/html, an application/xhtml+xml, a text/x-csrc, or whatever. Only a rigorous analysis will tell you the truth. (If anyone is curious, the source for `get_textual_mimetype` is in the repo for my HN reader project: https://github.com/timoteostewart/timbos-hn-reader/blob/main... )
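
A compressed sketch of that style of pipeline (tool names are the ones mentioned above; exact flags and exit-code conventions are assumptions worth verifying against your installed versions):

    import subprocess

    def exits_zero(cmd):
        return subprocess.run(cmd, capture_output=True).returncode == 0

    path = "payload.bin"                                    # hypothetical download
    print("valid UTF-8:", exits_zero(["isutf8", path]))     # isutf8 is from moreutils
    print("valid JSON:", exits_zero(["jq", "empty", path]))
    # xmlwf traditionally prints nothing for well-formed XML, so check its output:
    xml_out = subprocess.run(["xmlwf", path], capture_output=True).stdout
    print("well-formed XML:", xml_out == b"")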

matheusmoreira on 2024-04-30

Yeah but I'm gonna do it anyway. If it's not UTF-8 it's terrible and broken and not worth supporting unless some serious cash is on the table.

klysm on 2024-04-29

Meh, I'd rather start making those assumptions and treat everything that doesn't use UTF-8 as broken.

AlienRobot on 2024-04-29

"You can't assume a 32 bit integer starts from 0"

jheriko on 2024-04-30

Fuck me it's 2024

2014 me shaking his head in ways 2004 me saw coming in 1994.

I'm only 40.

FUCK

TypicalHog on 2024-04-30

IMO humanity should just do ASCII and call it a day. I'm not talking about the viability of this, just my wishes.