
Our line-of-business software allows the user to save certain data as CSV. Since there are many different formats (all called "CSV") in use in the wild, we are trying to decide what our "default format" should look like.

  • Regarding line/field separators and escaping, there is a standard we can use: RFC 4180.

  • Regarding text encoding, UTF-8 seems to have emerged in the last decade as the "default text file format", so we will use that.

The one question left open is: Should we add a BOM at the start or not? I have read multiple opinions and pros/cons on the use of BOMs in general, but is there an "official" recommendation or at least some kind of community consensus on the use of BOMs in CSV files?
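
For concreteness, here is a minimal sketch of the format we have in mind so far (Python is used only for illustration; the file name and sample rows are made up):

```python
import csv

# Made-up sample rows; the comma, quotes, and umlaut force RFC 4180
# quoting and a non-ASCII byte into the output.
rows = [
    ["id", "name", "note"],
    [1, "Müller, Heinz", 'He said "hi"'],
]

# newline="" lets the csv module emit RFC 4180's CRLF row endings itself;
# encoding="utf-8" writes no BOM, "utf-8-sig" would prepend one.
with open("export.csv", "w", encoding="utf-8", newline="") as f:
    csv.writer(f, dialect="excel").writerows(rows)
```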

Heinzi
  • If it has a BOM then it is not UTF-8. But what formats do the consuming programs expect? If they need a BOM (mainly Microsoft software), then you need to add one, but UTF-8 + BOM ≠ UTF-8. – ctrl-alt-delor Jun 18 '18 at 18:26
  • Even though CSV is apparently easier to generate, there are so many compatibility issues, especially if you stray outside pure 7-bit ASCII, that I would very, very strongly recommend you generate actual XLSX if the goal is for users to open it in Excel (rather than re-import it into some other software, in which case you will have to give options for separators, encoding, etc.). There are libraries for most languages out there, and you'll save yourself and your users a lot of time. – jcaron Jun 18 '18 at 22:27
  • If you do take the CSV route, check what happens when you open the file on both Mac and PC, ideally with several versions of Excel. Also be aware that some versions of Excel do not behave the same when you double-click on the file to open it or open the file via the menu. – jcaron Jun 18 '18 at 22:30
  • Why does it matter if it opens correctly in Excel? Nothing in the question states Excel needs to be able to parse the generated file... – rubenvb Jun 19 '18 at 14:53
  • The BOM is in the UTF-8 spec. Its use is discouraged, since it has very little point, but it's valid UTF-8. – Andy Lynch Jun 03 '21 at 13:41

2 Answers


Not for UTF-8, but see the various caveats in the comments.

It's unnecessary (UTF-8 has no byte order, unlike UTF-16/32) and not recommended by the Unicode standard. It's also quite rare to see UTF-8 with a BOM "in the wild", so unless you have a valid reason (e.g., as commented, you'll be working with software that expects the BOM), I'd recommend the BOM-less approach.

Wikipedia mentions some software, mainly from Microsoft, that emits and expects a BOM, but unless you're working with it, don't use one.
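
To make the difference concrete, a small sketch (assuming Python, whose `utf-8-sig` codec is what emits the BOM): on disk, the BOM is just three extra leading bytes.

```python
data = "ä,b\r\n"

without_bom = data.encode("utf-8")     # b'\xc3\xa4,b\r\n'
with_bom = data.encode("utf-8-sig")    # b'\xef\xbb\xbf\xc3\xa4,b\r\n'

# The payload is identical; only the EF BB BF prefix differs.
assert with_bom == b"\xef\xbb\xbf" + without_bom
```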

Kayaman
  • There's also widespread software requiring a BOM: Excel needs a BOM to correctly identify a CSV file as UTF-8 rather than "ANSI", i.e., the local compatibility locale. (But Excel also does [strange things](https://superuser.com/q/1204233/14517) when saving such a file, so we advise users to use our "real" Excel export instead of the CSV export if they want to open the file with Excel.) – Heinzi Jun 18 '18 at 08:55
  • @Heinzi It's generally only MS software that violates the standard and gets angry about a missing or unexpected BOM in UTF-8. – jpmc26 Jun 18 '18 at 13:51
  • @Heinzi I learnt a long time ago that you cannot really win when working with CSV and Excel. It's simply a lousy CSV reader. Too bad it's what normal users expect. – pipe Jun 18 '18 at 13:53
  • @jpmc26 I guess you missed the part where the BOM for UTF-8 is explicitly allowed by the standard? That's far from "violating the standard". – Voo Jun 18 '18 at 14:44
  • @Voo: Requiring a BOM for UTF-8 certainly violates the standard, considering it is "[*neither required nor recommended*](https://www.unicode.org/versions/Unicode5.0.0/ch02.pdf#G19273)". – Deduplicator Jun 18 '18 at 15:57
  • @Deduplicator If that logic really worked, then not being able to handle UTF-8 with a BOM would be just as much a violation (since the standard clearly defines how UTF-8 with a BOM is supposed to look), and the only way you'd be standards-compliant would be to implement *both* versions. At which point the whole "optional" part of BOM handling would be rather lost. That is obviously not how optional features in specifications are defined. – Voo Jun 18 '18 at 16:10
  • @Deduplicator: MS-DOS and Windows systems have a large base of legacy text files in encodings other than UTF-8. Quality applications allow a user to specify how a text file is encoded when opening it, but often include an "auto" option. If a user selects "UTF-8", a UTF-8 file will be opened correctly with or without a BOM. If a user selects "auto", some UTF-8 files that don't have a BOM may be misidentified as using some other encoding. I'm not sure what one would expect an application to do differently, since files that are "misidentified" could be bit-for-bit identical with files that contain meaningful text using other encodings. – supercat Jun 18 '18 at 16:14
  • @supercat: That you need some out-of-band method for signalling the encoding is true for all text. Though some more modern file formats like HTML allow you to assume e.g. ASCII until you run across an explicit in-band notice. Also, for pure ASCII, being able to treat it as a plethora of extended-ASCII encodings at will is a feature, not a bug. – Deduplicator Jun 18 '18 at 17:12
  • .NET has a tendency to prepend a BOM. I think they made it that way at the time because it was important to differentiate UTF-8 from other common encodings. Might have been a mistake. – usr Jun 18 '18 at 17:36
  • @supercat The problem is that we can only sacrifice so much current-compatibility and forward-compatibility on the altar of backward-compatibility, and I think this discussion is just a symptom of the fact that some people now think that Windows has ended up on the wrong side of that threshold. It *could* instead err on the side of assuming that a file is the now-ubiquitous UTF-8 encoding instead of erring on the side of assuming that it's one of the legacy encodings, when faced with ambiguity. – mtraceur Jun 18 '18 at 17:55
  • @mtraceur: Given that text files tend to end with control-Z, making them instead end with control-Z and a format indicator would have caused fewer compatibility problems than putting a marker at the front (many applications that know nothing about such markers will simply ignore everything after a control-Z). Otherwise, given that the MS file systems are generally case-insensitive, it might have been possible to use the case of a file extension (e.g. tXt) as an indication that a file should be presumed to be UTF-8, while still being usable by older applications. – supercat Jun 18 '18 at 18:32
  • @mtraceur: It's also important to realize that, within Microsoft, it is still [fairly common](https://blogs.msdn.microsoft.com/oldnewthing/20180608-00/?p=98945) for people to refer to UTF-16LE as "Unicode" without further specification. ISTM that at least some of them think of UTF-8 as "that weird thing that the web uses" rather than "the standard text encoding that everyone uses." – Kevin Jun 18 '18 at 19:04
  • @Voo: That conflicts with many other format-specific requirements where a BOM is illegal. For example, a shell script with a BOM before the `#!` is invalid. At best a BOM in UTF-8 is "allowed, when no format-/application-specific requirement precludes it", not "allowed", and as such should not be used. The standards are actually clear about the SHOULD NOT. – R.. GitHub STOP HELPING ICE Jun 18 '18 at 19:39
  • @supercat: No out-of-band signaling is fundamentally needed to identify UTF-8 text even on DOS/Windows systems where legacy codepages may also be in use. Simply attempt to interpret the file as UTF-8. If that succeeds without error, then the file is not meaningful *textual content* in some other encoding (at best, it's an attempt to show UTF-8 mojibake in some other encoding) and you can assume it was intended to be UTF-8. If it fails, you can assume whatever legacy encoding is appropriate for the user's locale. (A sketch of this heuristic follows these comments.) – R.. GitHub STOP HELPING ICE Jun 18 '18 at 19:43
  • @R. "Simply attempt to interpret the file as UTF-8. If that succeeds without error" And what does "succeeds without error" mean? Please be more specific and come up with an algorithm that won't have a gigantic false-positive rate (remember, every time you guess wrong you will have ruined a user's experience). Now, sure, you could say "ah, screw backwards compatibility, it's just too ugly and who cares about old programs", but backcompat is the main reason why Windows still has an 8x% market share on the desktop while Linux is still hovering around, what, 5%? – Voo Jun 18 '18 at 21:15
  • @Voo: In a valid UTF-8 file, any byte in the range 0xC0-0xDF will be followed by one in the range 0x80-0xBF, any byte 0xE0-0xEF will be followed by two in the range 0x80-0xBF, any 0xF0-0xF7 will be followed by three 0x80-0xBF, and no other bytes in the range 0x80-0xBF or 0xF8-0xFF will occur. Most non-contrived text files having that property will be either pure ASCII (which is also valid UTF-8) or be UTF-8-encoded. – supercat Jun 18 '18 at 21:45
  • @supercat There are billions of text files out there (for once that might not even be that much of an overstatement) that are not pure ASCII and do not have any characters >= 192. Take a look at the Windows-1252 code page (the simplest of all code pages; there are much more interesting ones around) and check what's between 127 and 192. Your assumption might work for US Americans, but the world is a much larger place than that. – Voo Jun 18 '18 at 21:54
  • @Voo: If a file has any characters 0x80-0xBF that aren't immediately preceded by other characters in the range 0x80-0xF7, it isn't a valid UTF-8 file. – supercat Jun 18 '18 at 22:02
  • @supercat Yes, and sadly there are a lot of perfectly valid sentences in Windows-1252 that have the same property (`He owes me 80€…`, to give just one reasonably non-contrived example off the top of my head). I agree that it's reasonably easy to figure out the right encoding for long texts. But for texts that are just a few sentences? Devilishly hard to get high enough accuracy. I'm German; I remember the 90s and the start of the 2000s on the web, when browsers tried to figure out encodings and failed miserably often enough to be memorable. – Voo Jun 18 '18 at 22:14
  • @Voo: That's a bad example. The Euro is encoded at 0x80 in Windows-1252, and is preceded in your example by an ASCII character. That's invalid UTF-8 (continuation byte with no lead byte). – Kevin Jun 18 '18 at 22:29
  • @Voo: "Succeeds without error" is strictly defined. There is a formal syntax for UTF-8 defined via ABNF in RFC 3629, and equivalent definitions defined in the Unicode standard and elsewhere. It's actually stricter than what supercat explained, and if you spend a few minutes trying you'll realize that you're not going to produce meaningful legacy-encoding strings that satisfy the UTF-8 syntax. – R.. GitHub STOP HELPING ICE Jun 19 '18 at 01:02
  • It should also be pointed out that the standard doesn't recommend removing a BOM that's already there, both in case of software that depends on it, and to prevent data loss during round-tripping. This suggests that the intent may be "follow precedent when possible, but prefer no-BOM when you're the one *making* the precedent". – Justin Time - Reinstate Monica Jun 19 '18 at 07:05
  • @Voo illegal monopolistic practices are the reason for the Windows market share being what it is. Linux is far far far more backwards compatible than Windows. – Miles Rout Jun 20 '18 at 04:18
  • @Kevin Yeah, wrong way around. And while I can come up with some working examples, those are probably sufficiently unlikely that it'd presumably work. – Voo Jun 20 '18 at 21:25
  • @Miles I can't even take a binary compiled for one current distribution and run it on a different current one. Sure, the kernel is great at backcompat, but anything in user space? Yeah, good luck with that. I can't imagine any non-trivial GUI program from 2001 (I started with Red Hat 7 or something around that time) running on a 2018 distro. (And if your idea is "oh, just recompile it", then you've demonstrated why, even after way over a decade without "illegal monopolistic practices", Linux still isn't running on every desktop - we'll get there with Android, though.) – Voo Jun 20 '18 at 21:36
  • @Voo can you take a Windows program that relies on the presence of dozens of third-party DLLs from the early 2000s and expect it to just run perfectly on Windows today? Of course not. And no, don't fucking recompile it. Bundle the dependencies. Jesus. And let's be very clear here: Microsoft has not stopped their monopolistic practices one bit. – Miles Rout Jun 21 '18 at 22:56
  • The full quote: "Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature." [source](https://www.unicode.org/versions/Unicode5.0.0/ch02.pdf#G19273). It's definitely not a violation of the standard, but it is frowned upon. – lightswitch05 Jun 26 '18 at 14:49
  • @MilesRout To be fair, MS at least has a tendency to try to maintain bug compatibility with previous versions of Windows, in addition to normal backwards compatibility. This tends to be harder for Linux as a whole to do, because both the kernel and individual distros have to be designed with bug compatibility in mind. – Justin Time - Reinstate Monica Jul 02 '18 at 17:22
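
A minimal sketch of the "try UTF-8 first" heuristic debated in the comments above, assuming Python and cp1252 as the legacy fallback (both choices are illustrative assumptions, not part of any answer):

```python
def detect_and_decode(raw: bytes) -> str:
    try:
        # Strict UTF-8 decoding rejects stray continuation bytes such as
        # cp1252's 0x80 ("€"), so legacy text rarely passes by accident.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumed legacy fallback; real code would pick the user's locale
        # code page instead of hard-coding cp1252.
        return raw.decode("cp1252", errors="replace")

# Voo's example: 0x80 is invalid as a first byte in UTF-8, so we fall back.
print(detect_and_decode("He owes me 80€".encode("cp1252")))  # He owes me 80€
```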

There is still no widespread convention, AFAIK, though UTF-8 itself is certainly now generally accepted.

The BOM is an awful artifact:

  • It is invisible (a zero-width space).

  • Some software might break on the first column name, which no longer contains only letters but starts with that strange BOM.

  • The header line might be copied into value lines, corrupting the first value.

  • It is only needed by some Windows software (Notepad, Excel) to distinguish between UTF-8 and whichever ANSI encoding that local Windows machine uses.

So, sadly, one should support the BOM, perhaps as an option.

Use a naming scheme for the files (...-utf8.txt, ...-utf8bom.txt).
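
A minimal sketch of such optional BOM support, assuming Python and the naming scheme above (the helper name is made up):

```python
def open_export(path: str):
    # Hypothetical rule: the "-utf8bom" suffix selects the BOM variant;
    # Python's "utf-8-sig" codec prepends the BOM, plain "utf-8" does not.
    encoding = "utf-8-sig" if path.endswith("-utf8bom.txt") else "utf-8"
    return open(path, "w", encoding=encoding, newline="")
```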


In many cases we could use HTML as an export alternative. This allows setting the encoding inside the file itself. An extra feature is background/foreground coloring of rows and cells, which raises the quality of the export.
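
A sketch of such an HTML export, assuming Python; the `<meta charset>` declares the encoding in-band, and the inline style stands in for the coloring (function name, data, and colors are made up):

```python
from html import escape

def to_html_table(rows):
    # Inline style stands in for the row/cell coloring mentioned above.
    body = "\n".join(
        "<tr>"
        + "".join(f'<td style="background:#f8f8f8">{escape(str(v))}</td>'
                  for v in row)
        + "</tr>"
        for row in rows
    )
    return ('<!DOCTYPE html><html><head><meta charset="utf-8"></head>'
            f"<body><table>\n{body}\n</table></body></html>")

print(to_html_table([["id", "name"], [1, "Müller"]]))
```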

Joop Eggen
  • Whether formatting "heightens the quality of the export" is hugely dependent on the intended use of the file. CSV is often used as a simple *machine readable* format, and making the recipient parse HTML instead would be a big *disadvantage* in that case. – IMSoP Jun 18 '18 at 11:55
  • @IMSoP For human usage, however, 90% of the time it means loading in Excel, and HTML is much nicer there. For B2B, indeed, all that nice processing by other software becomes a question mark. Hence my "in many cases we could use HTML". In fact, Excel can do CSV export too. – Joop Eggen Jun 18 '18 at 12:16
  • If you're choosing a naming scheme, keep the audience in mind. `-utf8-windows.csv` is better. Almost everyone knows what Windows is, in the context of computers, but far fewer users know what a Byte Order Mark is. – MSalters Jun 18 '18 at 12:49
  • Is it necessary to specify `utf8` in the filenames if the plan is to always use UTF-8? Even if it later becomes necessary to save some data in another encoding, UTF-8 could be the default, and `-gb18030.csv` could be marked. – Davislor Jun 18 '18 at 13:35
  • @Davislor yes, if it is a broadly communicated, known standard. Otherwise error reports will come in about `tschÃ¼ÃŸ` being garbage whereas `tschüß` should have been written. On Stack Overflow many IT errors are about encodings. End users will experience problems too. – Joop Eggen Jun 18 '18 at 13:48
  • @JoopEggen "Broadly communicated known standard" in what community exactly? I've been doing software development for nearly 10 years now and I've never seen that - not even on Windows, and _certainly_ not on Linux or OSX, where you almost always deal with UTF-8. – Cubic Jun 18 '18 at 14:16
  • @Cubic Davislor indeed means UTF-8, and I say _if_ it is a broadly communicated standard [_in the company_], _then_ my proposed ending `-utf8.txt` indeed makes no sense. – Joop Eggen Jun 18 '18 at 15:00
  • @JoopEggen Ah, perception fail. I didn't see the 'if' in that sentence. – Cubic Jun 18 '18 at 15:01
  • It's interesting to note that Notepad doesn't actually _need_ the BOM to identify UTF-8... provided the file contains at least one sequence that is valid in UTF-8, but not in ASCII. – Justin Time - Reinstate Monica Jun 19 '18 at 07:02
  • @JustinTime Yes, for some years now, but not before. The MS developers are not that bad (POSIX compliance, now UTF-8 support). – Joop Eggen Jun 19 '18 at 09:17