A Blog about coding & physics

UTF-8 BOM Detection in Java

Have you ever run into this? You read a file encoded in UTF-8, and it always seems to start with a mysterious character that may be printed as "?" on the screen but is not visible in any text editor. This is caused by the BOM (byte order mark) of UTF-8 files.

What is BOM?

See this: Wikipedia BOM

Put simply, a BOM is a marker at the start of a text stream that identifies its encoding. It is not required by the UTF-8 standard, as the Unicode Standard says:

While there is obviously no need for a byte order signature when using UTF-8, there are occasions when processes convert UTF-16 or UTF-32 data containing a byte order mark into UTF-8. When represented in UTF-8, the byte order mark turns into the byte sequence <EF BB BF>. Its usage at the beginning of a UTF-8 data stream is neither required nor recommended by the Unicode Standard, but its presence does not affect conformance to the UTF-8 encoding scheme. Identification of the <EF BB BF> byte sequence at the beginning of a data stream can, however, be taken as a near-certain indication that the data stream is using the UTF-8 encoding scheme.

Link To Document

Plus: A discussion on ZhiHu
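To make the detection concrete: the UTF-8 BOM is the three bytes EF BB BF, so one way to spot it is to look at the first three bytes of the raw data. The following is only a minimal sketch; the class name Utf8BomDetector and the method startsWithUtf8Bom are names I made up for illustration, not part of any library.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper (name chosen for illustration): checks raw bytes for a UTF-8 BOM.
final class Utf8BomDetector {
    // The UTF-8 encoding of U+FEFF is the byte sequence EF BB BF.
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    // Returns true if the data begins with the three BOM bytes.
    static boolean startsWithUtf8Bom(byte[] data) {
        if (data.length < UTF8_BOM.length) {
            return false;
        }
        for (int i = 0; i < UTF8_BOM.length; i++) {
            if (data[i] != UTF8_BOM[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] withBom = "\uFEFFhi".getBytes(StandardCharsets.UTF_8);
        byte[] plain = "hi".getBytes(StandardCharsets.UTF_8);
        System.out.println(startsWithUtf8Bom(withBom)); // true
        System.out.println(startsWithUtf8Bom(plain));   // false
    }
}
```

Checking bytes like this works before any decoding happens, which can be handy if you do not yet know whether the file is UTF-8 at all.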

How to Deal with BOM

There may be many ways to do this, but I found a simple one. The BOM is the Unicode character U+FEFF, written \uFEFF in Java. Therefore, if the text read from a UTF-8 file starts with the character \uFEFF, simply removing that first character solves the problem.
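Here is a minimal sketch of that idea, assuming the whole file fits in memory. The names BomStripper, stripBom and readUtf8WithoutBom are my own placeholders for illustration. It relies on the fact that Java's UTF-8 decoder keeps the BOM and hands it to you as the character \uFEFF.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper (name chosen for illustration): removes a leading BOM character.
final class BomStripper {
    // The BOM as a single Java char, U+FEFF.
    private static final char BOM = '\uFEFF';

    // Returns the text without a leading BOM; text without a BOM is returned unchanged.
    static String stripBom(String text) {
        if (!text.isEmpty() && text.charAt(0) == BOM) {
            return text.substring(1);
        }
        return text;
    }

    // Reads a whole file as UTF-8 and drops the BOM if one is present.
    static String readUtf8WithoutBom(Path path) throws IOException {
        String text = new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
        return stripBom(text);
    }

    public static void main(String[] args) {
        String withBom = "\uFEFFhello";
        System.out.println(withBom.length());           // 6
        System.out.println(stripBom(withBom).length()); // 5
    }
}
```

Only the first character is checked, so a U+FEFF that appears later in the text is left alone.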

How to Write files without BOM

Well, most text editors on Windows automatically add a BOM to your UTF-8 files, because this is the convention Microsoft favours. The only exception I know of is Notepad++, a great text editor for programmers (know of another exception? Feel free to tell me by e-mail). So on Windows you basically have to live with it.

Things are much better on Linux. With UTF-8 used everywhere, Linux never relies on a BOM to tell a UTF-8 file from an ANSI one. I may never have to worry about BOMs on Linux. Thanks to Microsoft's stupid idea for reminding me that BOMs exist.
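On the Java side, at least, writing without a BOM takes no special effort: encoding a String with StandardCharsets.UTF_8 does not prepend one. The only thing to watch for is a \uFEFF that is already inside your text, for example because it was read from a file that had one. A rough sketch, with NoBomWriter, encodeWithoutBom and writeUtf8 being names I invented for illustration:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper (name chosen for illustration): writes UTF-8 text without a BOM.
final class NoBomWriter {
    // Encodes text as UTF-8, dropping a leading U+FEFF if the text carries one.
    static byte[] encodeWithoutBom(String text) {
        String clean = text.startsWith("\uFEFF") ? text.substring(1) : text;
        // The UTF-8 encoder itself never adds a BOM.
        return clean.getBytes(StandardCharsets.UTF_8);
    }

    // Writes the encoded bytes to the given path.
    static void writeUtf8(Path path, String text) throws IOException {
        Files.write(path, encodeWithoutBom(text));
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("no-bom", ".txt");
        writeUtf8(tmp, "\uFEFFhello");
        System.out.println(Files.readAllBytes(tmp).length); // 5
        Files.delete(tmp);
    }
}
```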
