bytes, characters and encoding

 

 

  Byte files contain 8-bit bytes. Character files can contain 8-bit, 16-bit, or 32-bit characters. Only 8-bit and 16-bit characters are discussed on this site.

 

  Unless you choose a method which writes 16-bit characters, Java will usually write 8-bit bytes to files.
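That difference is easy to see on disk. Here is a minimal sketch (the class name and temp files are mine, not from the text) comparing an 8-bit write against a 16-bit write of the same one-character string:

```java
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.RandomAccessFile;

public class ByteVsCharDemo {
    public static void main(String[] args) throws IOException {
        // FileWriter uses the machine's default encoding: "A" becomes one 8-bit byte.
        File f8 = File.createTempFile("demo8", ".txt");
        f8.deleteOnExit();
        try (FileWriter w = new FileWriter(f8)) {
            w.write("A");
        }
        // RandomAccessFile.writeChars writes full 16-bit chars: "A" becomes two bytes.
        File f16 = File.createTempFile("demo16", ".txt");
        f16.deleteOnExit();
        try (RandomAccessFile raf = new RandomAccessFile(f16, "rw")) {
            raf.writeChars("A");
        }
        System.out.println("FileWriter wrote " + f8.length() + " byte(s)");   // 1
        System.out.println("writeChars wrote " + f16.length() + " byte(s)");  // 2
    }
}
```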

 

 

bytes

 

 Eight-bit byte files are commonly called text files. Java's default output encoding for characters is usually an 8-bit one (UTF-8 on modern JVMs, cp1252 on older Windows setups), which uses a single byte for the familiar ASCII characters.  So most Java I/O write methods produce machine-default byte files of 8-bit bytes.

 

The following 8-bit formats all appear identical in a file, at least for plain ASCII text (characters below 128), so you needn't worry about the differences among them.

ASCII

ISO-8859-1

ISO Latin-1

US-ASCII

UTF-8

Windows default cp1252
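You can verify that identity for ASCII text directly. The class name below is mine; the charset lookup names are the standard Java ones:

```java
import java.nio.charset.Charset;
import java.util.Arrays;

public class AsciiCompare {
    public static void main(String[] args) {
        String s = "Hello";  // plain ASCII text, all characters below 128
        byte[] ref = s.getBytes(Charset.forName("US-ASCII"));
        // For ASCII text, these charsets all produce identical bytes.
        for (String name : new String[] {"ISO-8859-1", "UTF-8", "windows-1252"}) {
            byte[] b = s.getBytes(Charset.forName(name));
            if (!Arrays.equals(ref, b)) throw new AssertionError(name + " differs");
        }
        System.out.println("All four encodings produced identical bytes for ASCII text");
    }
}
```

Above 127 the formats diverge (UTF-8, for instance, switches to multi-byte sequences), which is why the ASCII qualifier matters.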

 

 

characters

 

 Whereas the 8-bit formats above all look the same in a file, there are three forms of 16-bit Unicode which can appear in character files:

UTF-16BE, UTF-16LE, and UTF-16

 

 UTF-16BE: This is the simplest and most common form of 16-bit characters.  (UTF stands for Unicode Transformation Format.)  For the English-language characters with which you are familiar, each 8-bit character simply has a zero byte added at the front.  The low byte (like the 8-bit hex 41 for "A") stays the same.  The "BE" in UTF-16BE stands for Big Endian, meaning the most significant ("big") byte comes first.  The zeroes are at the front.  So it's hex 00 41 for "A".

 

 UTF-16LE: This is similar to Big Endian, but the "LE" stands for Little Endian, meaning the least significant ("little") byte comes first.  The zeroes are now at the end.  So "A" would be hex 41 00 in Little Endian.

 

 UTF-16: When Java writes this form, it is simply Big Endian with an additional Byte Order Mark (BOM) at the front of the file; on reading, the BOM tells the decoder which byte order follows.  Here is a table of the most common BOMs:

 

Hex            For Char Set

FE FF          UTF-16BE

FF FE          UTF-16LE

00 00 FE FF    UTF-32BE

FF FE 00 00    UTF-32LE

 

The BOM is always optional.  It is a file prefix used to identify the file's format to readers who may not know that format in advance.

 

The BOM itself is usually ignored and not returned by most everyday byte- or text-oriented programs (such as Notepad).  However, if you are reading such a file in a way that is sensitive to the actual number of characters present (such as with RandomAccessFile), you will have to deal with the BOM.
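The byte orders and BOMs described here can be checked from Java itself. This sketch (class name mine) prints the raw bytes each UTF-16 variant produces for "A":

```java
import java.nio.charset.Charset;

public class BomDemo {
    static String hex(byte[] b) {
        StringBuilder sb = new StringBuilder();
        for (byte x : b) sb.append(String.format("%02X ", x));
        return sb.toString().trim();
    }
    public static void main(String[] args) {
        // UTF-16BE: zero byte at the front, no BOM
        System.out.println(hex("A".getBytes(Charset.forName("UTF-16BE"))));  // 00 41
        // UTF-16LE: zero byte at the end, no BOM
        System.out.println(hex("A".getBytes(Charset.forName("UTF-16LE"))));  // 41 00
        // Plain "UTF-16": Big Endian preceded by the FE FF BOM
        System.out.println(hex("A".getBytes(Charset.forName("UTF-16"))));    // FE FF 00 41
    }
}
```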

 

Java writes UTF-16 with a BOM automatically when you wrap a stream in an OutputStreamWriter constructed with the "UTF-16" charset.  Note that the writeChars(...) method of RandomAccessFile writes plain UTF-16BE with no BOM.
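Whether a BOM actually appears is easy to verify. This sketch (class name and temp file are mine) shows that RandomAccessFile's writeChars(...) emits big-endian bytes with no BOM:

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

public class WriteCharsDemo {
    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("demo", ".bin");
        f.deleteOnExit();
        try (RandomAccessFile raf = new RandomAccessFile(f, "rw")) {
            raf.writeChars("A");  // each char is written high byte first
        }
        byte[] b = new byte[2];
        try (RandomAccessFile raf = new RandomAccessFile(f, "r")) {
            raf.readFully(b);
        }
        // Big-endian UTF-16 with no BOM: 00 41, and the file is only 2 bytes long
        System.out.printf("%02X %02X%n", b[0], b[1]);
        if (b[0] != 0x00 || b[1] != 0x41 || f.length() != 2)
            throw new AssertionError("unexpected bytes");
    }
}
```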

 

 

encoding

 

  The words "character" and "char," which is also a Java primitive type, imply the Unicode character set.  Internally, Java holds chars as 16-bit UTF-16 values; when they are written out, they are converted to some encoding.  That encoding can be UTF-8, UTF-16, or an 8-bit one like ISO 8859-1, which is also called ISO Latin-1.  On Windows machines the default encoding is usually cp1252, Windows Western Europe / Latin-1.

 


 

  You can determine your machine's normal character encoding with either of these snippets:

 

String s = System.getProperty( "file.encoding" );

System.out.println( s );

      or

InputStream fis = new FileInputStream( "anyregularsystemfilewilldohere" );

InputStreamReader isr = new InputStreamReader( fis );

String s = isr.getEncoding( );

System.out.println( s );

isr.close( );

 

  Note that you cannot obtain the Unicode encoding of any individual file with the InputStreamReader getEncoding( ) method shown above.  It always returns the machine's normal character encoding, regardless of the encoding that was actually written inside the file.

 

  To specify an explicit encoding when writing to a file, use an OutputStreamWriter.  There, you can specify any supported encoding you want.  For an example, see the OutputStreamWriter class.
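As a brief sketch of that (the class name is mine), an OutputStreamWriter built with the "UTF-16" charset name writes the BOM followed by big-endian characters:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.util.Arrays;

public class ExplicitEncoding {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        // The second constructor argument names any supported encoding.
        try (OutputStreamWriter out = new OutputStreamWriter(bos, "UTF-16")) {
            out.write("A");
        }
        byte[] b = bos.toByteArray();
        // Java's "UTF-16" encoder writes the FE FF BOM, then big-endian chars.
        if (!Arrays.equals(b, new byte[] {(byte) 0xFE, (byte) 0xFF, 0x00, 0x41}))
            throw new AssertionError(Arrays.toString(b));
        System.out.println("UTF-16 output starts with the FE FF BOM");
    }
}
```

The same constructor accepts names like "ISO-8859-1" or "UTF-8" when an 8-bit output is wanted instead.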