bytes, characters and
encoding
■ Byte files contain 8-bit bytes. Character files can contain 8-bit,
16-bit, or 32-bit characters. Only 8-bit and 16-bit characters are discussed on
this site.
■ Unless
you choose a method which writes 16-bit characters, Java will usually write
8-bit bytes to files.
bytes
■ Eight-bit byte
files are commonly called text files.
But Java's default output encoding for characters is the platform charset (UTF-8 since Java 18), and UTF-8 uses a single 8-bit byte for each ASCII character.
So most Java I/O write methods produce machine-default byte files of
8-bit bytes.
■ The following 8-bit
formats all encode plain ASCII text (code points 0 through 127) identically, so
for ordinary English text you needn't worry about the differences between them.
ASCII
ISO-8859-1
ISO Latin-1
US-ASCII
UTF-8
Windows default cp1252
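The claim that these formats agree on plain ASCII text can be checked directly with the standard Charset API. This is an illustrative sketch (the class and method names are mine, not from the original); note that the "windows-1252" charset ships with standard JDKs but is technically in an optional module.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

public class AsciiSameBytes {
    // Encode a string with the named charset and return the raw bytes.
    static byte[] encode(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        String text = "Hello";   // plain ASCII characters only
        byte[] ascii = encode(text, "US-ASCII");
        // For code points 0-127 these charsets all produce identical bytes.
        for (String name : new String[] { "ISO-8859-1", "UTF-8", "windows-1252" }) {
            System.out.println(name + " identical to US-ASCII: "
                    + Arrays.equals(ascii, encode(text, name)));
        }
    }
}
```

For text containing characters above code point 127 (accented letters, for example), the encodings diverge and the comparison above would fail.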
characters
■ Whereas the 8-bit formats above are interchangeable for ASCII text, there are three forms of 16-bit Unicode which can appear in character files:
UTF-16BE, UTF-16LE, and UTF-16.
■ UTF-16BE:
This is the simplest and most common form of 16-bit characters. (UTF means Unicode Transformation Format.) For the English-language characters
with which you are familiar, each 8-bit character simply has a zero byte added
at the front; the low-order byte (like the
8-bit hex 41 for "A") stays the
same. The "BE" in UTF-16BE stands
for Big Endian, meaning the big end (the most significant byte) comes first. The
zeroes are at the front. So it's hex 00 41
for "A".
■ UTF-16LE:
This is similar to Big Endian, but LE stands for Little Endian, meaning the
little end (the least significant byte) comes first. The zeroes now come second. So "A"
would be hex 41 00 in Little Endian.
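The byte-order difference between the two forms can be seen directly with String.getBytes and the standard StandardCharsets constants (a short sketch; the class name is mine):

```java
import java.nio.charset.StandardCharsets;

public class Utf16ByteOrder {
    public static void main(String[] args) {
        byte[] be = "A".getBytes(StandardCharsets.UTF_16BE); // 00 41
        byte[] le = "A".getBytes(StandardCharsets.UTF_16LE); // 41 00
        System.out.printf("UTF-16BE: %02X %02X%n", be[0], be[1]);
        System.out.printf("UTF-16LE: %02X %02X%n", le[0], le[1]);
    }
}
```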
■ UTF-16:
In Java this is simply Big Endian with an additional Byte Order Mark (BOM) located at
the front of the file. (In general, the BOM tells a reader which byte order follows.) Here is a table
of the most common BOMs:
Hex bytes      Character set
FE FF          UTF-16BE
FF FE          UTF-16LE
00 00 FE FF    UTF-32BE
FF FE 00 00    UTF-32LE
■ The BOM is always optional. It is a file prefix used to identify the
file format to readers who may not know the format in advance.
■ The BOM itself is usually
ignored, and not returned, by most
everyday byte- or text-oriented programs (such as Notepad). However, if you are reading such a file in a
way that is sensitive to the actual number of bytes present
(such as with RandomAccessFile), then you will have to
deal with the BOM yourself.
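Dealing with the BOM yourself amounts to checking the first two bytes and skipping them if they match. Here is a minimal sketch (the class and helper names are mine; a ByteArrayInputStream stands in for a real file):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class BomSkipper {
    // Returns how many leading bytes form a UTF-16 BOM (0 if none).
    static int utf16BomLength(byte[] bytes) {
        if (bytes.length >= 2) {
            int b0 = bytes[0] & 0xFF, b1 = bytes[1] & 0xFF;
            if ((b0 == 0xFE && b1 == 0xFF) || (b0 == 0xFF && b1 == 0xFE)) {
                return 2;   // FE FF (big endian) or FF FE (little endian)
            }
        }
        return 0;
    }

    public static void main(String[] args) throws IOException {
        // "A" in UTF-16: BOM FE FF, then the character bytes 00 41.
        byte[] fileBytes = { (byte) 0xFE, (byte) 0xFF, 0x00, 0x41 };
        try (InputStream in = new ByteArrayInputStream(fileBytes)) {
            in.skip(utf16BomLength(fileBytes));   // position past the BOM
            System.out.printf("first char bytes: %02X %02X%n", in.read(), in.read());
        }
    }
}
```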
Note that the writeChars(...) method of RandomAccessFile writes big-endian 16-bit characters but no BOM; the standard way to get a BOM written automatically is an OutputStreamWriter constructed with the "UTF-16" charset.
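You can verify which encoders add a BOM without writing any files, because String.getBytes uses the same charset encoders (a short sketch; the class name is mine):

```java
import java.nio.charset.StandardCharsets;

public class BomDemo {
    public static void main(String[] args) {
        byte[] withBom = "A".getBytes(StandardCharsets.UTF_16);    // FE FF 00 41
        byte[] noBom   = "A".getBytes(StandardCharsets.UTF_16BE);  // 00 41
        System.out.println("UTF-16 length: " + withBom.length);    // 4: BOM included
        System.out.println("UTF-16BE length: " + noBom.length);    // 2: no BOM
    }
}
```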
encoding
■ The words "character" and "char" (the latter also a Java primitive type) imply the Unicode
character set. Inside Java, a char is a 16-bit UTF-16 code unit; when characters are written to a file, they are converted using some encoding. This can be
UTF-8 (the default since Java 18), UTF-16,
or an 8-bit encoding like ISO 8859-1, which is also called ISO Latin-1. On Windows machines the traditional default encoding
is cp1252, Windows Western Europe / Latin-1.
■ You can determine your machine's default character encoding with
either of these snippets:
String s = System.getProperty( "file.encoding" );
System.out.println( s );
or
InputStream fis = new FileInputStream( "anyregularsystemfilewilldohere" );
InputStreamReader isr = new InputStreamReader( fis );
String s = isr.getEncoding( );
System.out.println( s );
■ Note
that, under Windows, you cannot obtain the Unicode encoding of an individual file with the InputStreamReader
getEncoding( )
method shown above. It always returns
the machine's default character
encoding, regardless of the encoding actually used inside the file.
■ To specify an explicit encoding when writing to a file, use
an OutputStreamWriter. There you can specify
any encoding you want. For an example, see the OutputStreamWriter
class.
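A minimal sketch of an explicit-encoding write follows (the class name is mine, and a ByteArrayOutputStream stands in for the FileOutputStream you would normally wrap):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class ExplicitEncodingWrite {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // The second constructor argument selects the encoding explicitly.
        try (Writer writer = new OutputStreamWriter(out, StandardCharsets.ISO_8859_1)) {
            writer.write("Ärger");   // 'Ä' is a single byte (C4) in ISO-8859-1
        }
        byte[] bytes = out.toByteArray();
        System.out.println("bytes written: " + bytes.length);  // one byte per character
    }
}
```

Had the writer been constructed with StandardCharsets.UTF_16 instead, the same five characters would have produced twelve bytes: a two-byte BOM plus two bytes per character.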