Unicode | What and Why?


This article discusses the following questions:

Why were character sets (ASCII, Unicode, etc.) needed?
How did ASCII emerge?
What is Unicode, and why does it exist?
Why is 1 byte = 8 bits?
Why does a Java character take two bytes?
Why does a C character take one byte?


In the early days of computing, the only language a machine understood was binary (1s and 0s). We humans are not comfortable with binary: writing even a name in it is slow and easily becomes a mess. People needed a way to use a human language instead, and since much of this technology was emerging in the US, that language was English. A written language is, at its core, a set of characters, so each character was assigned a fixed binary sequence. ASCII defined 128 such characters in 7 bits: the 26 English letters in both cases, digits, punctuation, and control codes. Stored in an 8-bit byte, this left one bit spare, and later 8-bit sets such as Latin-1 (ISO-8859-1) used it to reach 256 characters, adding the accented letters needed for German, French, and other Western European languages. In 8-bit form, the English letters look like this:

a => 01100001
b => 01100010
c => 01100011
A => 01000001, etc.
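
The mapping above can be reproduced with a few lines of Python. This is only a small sketch; the helper name to_bits is made up for illustration and is not part of any library.

```python
def to_bits(ch: str) -> str:
    """Return the numeric code of one character as an 8-digit binary string."""
    # ord() gives the character's code; "08b" pads the binary form to 8 digits.
    return format(ord(ch), "08b")


for ch in "abcA":
    print(ch, "=>", to_bits(ch))
```

Running it prints the same four bit patterns listed above, one per line.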

This is one reason the 8-bit byte became standard: 1 byte = 8 bits gives 2^8 = 256 values, enough to hold any one character of such a set.
That's why the character data type of C, C++, and many other technologies takes one byte.
This is how ASCII (American Standard Code for Information Interchange) was created.
Following this model, many countries and vendors made their own character sets, such as KOI8 for Russian, Shift JIS for Japanese, Big5 for Traditional Chinese, and the ISO-8859 family for European languages.
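
To make the sizes concrete, here is a rough Python sketch of what fits in ASCII and what needs an 8-bit set. The helper fits_in_ascii is a hypothetical name used only for this example.

```python
# One byte has 2**8 possible values; ASCII itself uses only the first 2**7.
print(2 ** 8)  # 256
print(2 ** 7)  # 128


def fits_in_ascii(text: str) -> bool:
    """True if every character has a code below 128."""
    return all(ord(ch) < 128 for ch in text)


e_acute = "\u00e9"  # the letter e with an acute accent, as used in French

print(fits_in_ascii("Hello"))     # True
print(fits_in_ascii(e_acute))     # False: outside ASCII
print(e_acute.encode("latin-1"))  # b'\xe9': still one byte in Latin-1
```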

But this created a problem. Consider the following:
"The character set is the fundamental raw material of any language and they are used to represent information. Like natural languages, computer language will also have a well-defined character set, which is useful to build the programs."
This means that choosing a programming language also meant choosing a character set.
So if we wrote software in C using one-byte ASCII characters, that software would understand English only, not Chinese or French. Software was language-specific.
Humans do not share a single language. Every country has its own, and every country wanted software to understand its language too. What was needed was one character set containing the characters of all the world's languages, which programming languages could then use to build software.
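
The problem with many separate character sets can be seen by decoding the same byte under several of them. The sketch below uses Python's built-in codecs; the three sets chosen here are just an illustration.

```python
raw = bytes([0xE4])  # a single byte with the value 228

# Codec names are Python's own; picking these three sets is an arbitrary choice.
decoded = {name: raw.decode(name) for name in ("latin-1", "koi8_r", "iso8859_7")}

for name, ch in decoded.items():
    # Print the character each set assigns to this byte, with its Unicode number.
    print(f"{name:10} -> {ch} (U+{ord(ch):04X})")
```

The same byte comes out as a Western European, a Cyrillic, and a Greek letter, so software had to know in advance which set a text was written in.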
So the characters of the world's languages were collected into one new character set. There were far more than 256 of them, so 2 bytes were assigned per character, which allows 2^16 = 65,536 characters. But the world's writing systems contain more than 65,536 characters. To go beyond that limit, each character is given a unique number called a code point, conventionally written in hexadecimal (for example U+0041 for A) in the range U+0000 to U+10FFFF, and an encoding such as UTF-8, UTF-16, or UTF-32 converts that number into binary. The result, one unique code for every character, is called Unicode. This is how Unicode arose.
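
As a rough illustration of code points and encodings, the Python sketch below reports the code point of a character and how many bytes it takes in three common encodings. The function name describe is made up for this example.

```python
def describe(ch: str) -> dict:
    """Code point and encoded size in bytes of one character."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "utf8": len(ch.encode("utf-8")),
        # The "-be" variants are used so that no byte-order mark is counted.
        "utf16": len(ch.encode("utf-16-be")),
        "utf32": len(ch.encode("utf-32-be")),
    }


# A, the euro sign, and an emoji whose code point is above 65,535
for ch in ("A", "\u20ac", "\U0001F600"):
    print(describe(ch))
```

The emoji has a code point above 65,535, so two bytes are no longer enough for it, even in UTF-16.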

So languages that adopted the original 16-bit Unicode design have a 2-byte character data type.
For example: Java, C#, etc.
That's why Java reserves 2 bytes for its char data type (one UTF-16 code unit; characters above U+FFFF take two of them), while C, which grew up with ASCII, reserves 1 byte for char.
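
The two sizes can be checked from Python as a rough sketch. ctypes exposes the C char type directly; utf16_code_units is a hypothetical helper that counts 16-bit units the way Java's String.length() does.

```python
import ctypes


def utf16_code_units(text: str) -> int:
    """Number of 16-bit units the text needs in UTF-16."""
    # Each UTF-16 code unit is 2 bytes; "-le" avoids counting a byte-order mark.
    return len(text.encode("utf-16-le")) // 2


print(ctypes.sizeof(ctypes.c_char))    # 1: a C char is one byte
print(utf16_code_units("A"))           # 1: fits in one 2-byte unit
print(utf16_code_units("\U0001F600"))  # 2: an emoji needs a pair of units
```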
