Remove BOM mark from text files in Python

16 Mar

In working with Tensorflow and TFLearn on Windows I frequently run into a problem with my source data files being encoded as UTF-8 with a BOM header.  A BOM is a byte order mark, a single unicode character that prefaces the file.  Many data loading utilities load with the incorrect encoding and will throw a ValueError about this unexpected character.

The large files often used as training data can be challenging to open/re-encode properly so I created this method to rewrite the file in-place without the BOM mark.

I actually use this in my TFLearn training scripts automatically, like this:

Such that any BOM related error during loading attempts a repair of the file, and a retry automatically.

Leave a Reply

Your email address will not be published. Required fields are marked *