What is Unicode?
It is the consistent encoding, representation and handling of text,
where the text can be in any language or writing system. For English
characters, numbers and symbols, the ASCII standard maps each character to a
number. The ASCII table can be referred to at the link below.
http://www.asciitable.com/
The most popular encoding format is UTF-8, where the 8 means it uses
8-bit blocks (bytes) to represent a character. Depending on the character, the
number of blocks required varies. One bit is a single memory cell capable of
storing 0 or 1.
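As a quick illustration, a couple of lines of Python (just a demonstration, not
part of any standard) show that same character-to-number mapping:

    # Illustration: ASCII assigns each character a number, and the mapping works both ways.
    for ch in ["A", "a", "0", "$"]:
        code = ord(ch)                      # numeric code of the character
        print(ch, code, hex(code), chr(code))
    # 'A' prints 65 (0x41), matching the ASCII table linked above.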
Now let’s understand how the encoding is represented. Any
character whose decimal value is 127 or below (7F in hexadecimal) can be
represented in 8 bits, i.e. one byte. If you checked the ASCII table from the
link, you will see that ASCII has exactly 128 characters (0 to 127), which cover
the whole American English character set including numbers, symbols and control
characters. This is one reason UTF-8 is such a popular form of representation:
it represents every ASCII character in a single byte. Characters from other
languages may take more than one block. I know, it is not fair to all the other
characters, but this is how it is.
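A short Python sketch (purely illustrative; the sample characters are my own
choice) makes this size difference visible:

    # Illustration: ASCII characters take one byte in UTF-8, other scripts take more.
    for ch in ["A", "é", "अ", "😀"]:
        encoded = ch.encode("utf-8")
        print(ch, len(encoded), encoded)
    # 'A' -> 1 byte, 'é' -> 2 bytes, 'अ' (Devanagari) -> 3 bytes, '😀' -> 4 bytes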
UTF-8 is widely used on the internet. Ever noticed the "content
type" in the meta tag? If not, right-click on a webpage and choose "View page
source"; the tag will be inside the <head> part. XML declares it the same way:
<?xml version="1.0" encoding="utf-8" ?>
Now let’s do the math on how characters are converted or
encoded; we will stick to English letters and numbers for this demonstration
(x means a free bit available for storing data).
A single 8-bit block starts with 0, and the character is converted
into binary using the Unicode code point assigned to every character. For
example, "A" has the ASCII representation decimal 65,
which converted to hex becomes 41, which is nothing but
its Unicode code point (U+0041). Converting each hex digit to binary, 4 becomes
0100 and 1 becomes 0001, so the UTF-8 binary is 01000001.
Representation | Character | UTF-8 (Hex) | Unicode code point | UTF-8 (Binary) | UTF-16 (Hex)  | UTF-32 (Hex)
0xxxxxxx       | DEL       | 0x7F (7F)   | U+007F             | 01111111       | 0x007F (007F) | 0x0000007F (007F)
0xxxxxxx       | A         | 0x41 (41)   | U+0041             | 01000001       | 0x0041 (0041) | 0x00000041 (0041)
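If you want to verify the table yourself, a short Python sketch (using the
big-endian codec names so that no byte-order mark is added) reproduces each row:

    # Illustration: reproduce the table rows for DEL and A.
    for ch in ["\x7f", "A"]:                    # DEL and A
        print(
            "U+%04X" % ord(ch),                 # Unicode code point
            ch.encode("utf-8").hex(),           # UTF-8:  7f / 41
            ch.encode("utf-16-be").hex(),       # UTF-16: 007f / 0041
            ch.encode("utf-32-be").hex(),       # UTF-32: 0000007f / 00000041
        )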
Now, what if your character needs multiple blocks, i.e. more than
8 bits? The first byte of the sequence tells you how many bytes follow (a small
sketch after this list shows the same rules in code):
· 00 to 7F hex (0 to 127): first and only byte of a sequence; the byte looks like 0xxxxxxx, so the number of free bits is 7.
· 80 to BF hex (128 to 191): continuation byte in a multi-byte sequence.
· C2 to DF hex (194 to 223): first byte of a two-byte sequence, where the 1st byte starts with 110xxxxx and the second byte starts with 10xxxxxx, so the number of free bits is (5+6)=11.
· E0 to EF hex (224 to 239): first byte of a three-byte sequence, where the 1st byte starts with 1110xxxx and the second and third bytes start with 10xxxxxx, so the number of free bits is (4+6+6)=16.
· F0 to F4 hex (240 to 244): first byte of a four-byte sequence, where the 1st byte starts with 11110xxx and the second, third and fourth bytes start with 10xxxxxx, so the number of free bits is (3+6+6+6)=21. (F5 to FF never appear as the first byte of valid UTF-8.)
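To make these patterns concrete, here is a rough Python sketch of a hand-rolled
encoder that follows the byte layouts above and checks itself against Python's
built-in encoder. It skips validity checks such as surrogates, so treat it as
an illustration only:

    # Illustration: encode a single code point into UTF-8 by hand,
    # following the lead-byte / continuation-byte patterns listed above.
    def utf8_bytes(code_point):
        if code_point <= 0x7F:                       # 1 byte:  0xxxxxxx
            return bytes([code_point])
        elif code_point <= 0x7FF:                    # 2 bytes: 110xxxxx 10xxxxxx
            return bytes([0xC0 | (code_point >> 6),
                          0x80 | (code_point & 0x3F)])
        elif code_point <= 0xFFFF:                   # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            return bytes([0xE0 | (code_point >> 12),
                          0x80 | ((code_point >> 6) & 0x3F),
                          0x80 | (code_point & 0x3F)])
        else:                                        # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            return bytes([0xF0 | (code_point >> 18),
                          0x80 | ((code_point >> 12) & 0x3F),
                          0x80 | ((code_point >> 6) & 0x3F),
                          0x80 | (code_point & 0x3F)])

    for ch in ["A", "é", "अ", "😀"]:
        assert utf8_bytes(ord(ch)) == ch.encode("utf-8")
        print(ch, utf8_bytes(ord(ch)).hex())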
Lastly, I would like to share an article from 2003 by a developer
who was frustrated with programs not understanding the UTF encodings. If you
have been through what I explained above, it should be enough to swim these waters.
http://www.joelonsoftware.com/articles/Unicode.html
Dhanyawadah! (Thank you!)