Global Wordbank–First Steps

An English Hindi Wordbank

And the need for a Global Wordbank

Arvind Kumar


A Global Wordbank would make the generation of bilingual or multilingual thesauruses in any desired languages just a click away. It would just be a matter of choosing the target languages (say Spanish-Russian, German-Arabic or French-Hindi) and clicking to get output in the shape of a thesaurus, either in printed book format or as an online computer program.


Some day, somebody, some organization, some university (or UNESCO?) may step in and help initiate the process of such a Wordbank, which may end up as a repository of words from all the major languages of the world.


New Impetus to a Tradition

The value of thesauruses as linguistic tools has been recognized the world over since ancient times. Attempts were made to create word lists in ancient societies. In India's Vedic age, the sage Kashyap collected 1,800 words in Nighantu (circa 1500 B.C.) and arranged them subject-wise.

The famed Sanskrit thesaurus Amar Kosh was composed by Amar Singh (circa 6th-7th century A.D.). In 1,502 verses, it has 8,000 words divided into 3 hierarchical cantos and 25 chapters (Panel 1). Verse was considered an appropriate medium because it made the text easier to memorize and helped preserve it unaltered; scholars would commit the entire text to memory. Subsequent centuries continued to see lively lexical activity in India.

Panel 1

Structure of Amar Kosh

Canto 1

1. Heaven

2. Sky

3. Directions

4. Time

5. Intellect

6. Words

7. Dramatic Arts*

8. The Nether World

9. Hell

10. Water

Canto 2

1. Earth

2. Towns

3. Mountains

4. Plants

5. Animals**

6. Man

7. Brahmins or priests

8. Kshatriyas or warriors

9. Vaishyas or traders and farmers

10. Shudras or menials

Canto 3

1. Adjectives

2. Words with narrow meanings

3. Words with many meanings

4. Non-changing words (avyayas)

5. Words as per gender



*Dramatic Arts come under Heaven, as they were considered a heavenly activity, but the Performers of Dramatic Arts come under Shudras.

** Not all animals are listed here. Lion is found under Kshatriyas, Cow under Vaishyas.

This can be explained by pointing out that in Amar Singh’s time, Indian society was divided into four Varnas or major castes, each having its own well-demarcated areas of activity.

Contact with the West, and establishment of British rule in the nineteenth century, gave a new impetus to language studies in India. The rulers needed to understand their subjects better. The discipline of Indology came into being. Simultaneously, a great effort was afoot to propagate Christianity. To make vernacular translations of the Bible, Christian missionaries took to learning Indian languages and formalized grammars to suit their needs. Scholars made bilingual dictionaries; among them is the famous and unrivalled Sanskrit-English Dictionary (1872) by Sir Monier Monier-Williams.

The English language became the official medium of education and the catalytic agent of change. Those who wanted to get government jobs and prosper in various other ways under the British took whole-heartedly to English. A sizable body of Indians came in contact with Western culture and industrialization. Modernization came to mean Westernization. In reaction, many leaders arose to contest Christianity, fight Westernization and reform medieval society. Thus, contact with Western ideas and literature started a cultural cross-fertilization. Ever since, there has been a great churning, an awakening: the Indian renaissance.

Independence from British rule in 1947 greatly accelerated lexical activity. The nascent nation had to come to terms with the world's nations. This gave a new urgency to dictionary making. Under the British, many had opposed English as an imperial language imposed by conquerors. Now, it came to be perceived as India's main bridge to the world. While some bilingual dictionaries were made between Hindi and many world languages, the main stress remained on English-Hindi and Hindi-English dictionaries. The Government of India set up commissions to coin technical terms so that Hindi could replace English as the medium of governance.

The making of India's first modern thesaurus

As with many others, my first exposure to a thesaurus was through Roget, in 1952-53--almost a century after the first publication of his work. At that time, in Delhi, I worked in the Hindi editorial department of a publishing house that brought out the magazines Caravan (English) and Sarita (Hindi). To advance myself, I was simultaneously studying in evening classes.

At times, I had to translate some Hindi items into English and quite often found myself at a loss for the right word. A friend suggested I keep Roget's Thesaurus by my side. The copy I got was a small old-style edition, from long before the vast International editions of the present day. In it, words pertaining to opposite concepts were printed in two facing columns like armies arrayed against each other. I found it a great treasure trove of words and a valuable linguistic tool. My English usage improved and the editor was happier with my output--so much so that I was shifted from the Hindi to the English editorial department!

How I wished Hindi had such a valuable resource! It was beyond my wildest dreams that one day I would make such a thesaurus myself--for a young man of 22-23, that would be audacity itself. I hoped someday someone would do it.

Two decades later, on Christmas evening, 1973, I was in Bombay editing a Hindi fortnightly magazine, Madhuri, for the Times of India group. A Hindi thesaurus had yet to be compiled. I saw that I would have to do it myself.

My wife Kusum and I realized that such a huge task would require full-time attention for many years. It would mean leaving my lucrative job and, in the absence of any outside financial support, surviving on our savings. We spent some months collecting reference material in 1974, but the main work began in 1976, when I started working on it part-time in the evenings, with Kusum as an assistant. This was a sort of testing the waters and finding our way in a sea that had been neither charted nor navigated before in Hindi. We took the final plunge in 1978, when I left my job and moved to Delhi to devote myself wholly to the thesaurus.

Problems of Structure

I had imagined we would be able to complete it within two years. After all, we had the excellent model of Roget before us! I assigned numbers to all the concepts as per our model and put the numbered cards in Rogetian sequence. Now, all we had to do was to write appropriate Hindi words on them. Alas, it was not that simple. To check the model, I went through the first few pages of a Hindi dictionary. I found very many important concepts missing in my model. There was no way to add more cards in between the already numbered cards.

Roget had based his work on the so-called scientific classification. (Panel 2.) However, language is anything but scientific. While study of words is a science, their coinage is not. People coin words in various unscientific ways, mostly associative, at times whimsical. Associations vary from people to people and time to time.

Panel 2

Rogetian Structure

Roget compartmentalized the language into six classes; two more were added later to give the contemporary structure of eight classes:

1. Abstract Relations

2. Space

3. Physics

4. Matter

5. Sensations

6. Intellect

7. Volition

8. Affections


When Roget's model failed us, we thought of emulating Amar Singh. However, we found Amar Singh was too much out of sync with the expansion of knowledge and language in the intervening centuries. No longer do wars or arms remind an Indian of a warrior from the Kshatriya caste. Nor would one think of a lion in the context of a Kshatriya or of a cow in that of a Vaishya. The Shudras are no longer menials or servants. In Amar Singh's time, music was a heavenly activity, but a musician was a menial. Thus, he put music in the first canto, under Heaven, and the musician under Shudras. None of this would do in today's world. Panel 1 (above) illustrates and explains this.

The somber realization was that we were left with no model. We were totally on our own with no idea of what order, sequence, pattern, or structure to give to our word-groups for a reader to make the best use of it. We decided to evolve our own system as we progressed with the work. There were at least five false starts and it was to be a full 14 years before we could come out with a viable solution.

The first edition of our Hindi thesaurus--Samantar Kosh--was published in 1996. It has 160,850 expressions/records under 1,100 headings and 23,759 subheadings, arranged purely by association. The only guiding principle in their placement is that one heading leads naturally to the next. Panel 3 gives a list of the first 30 headings to illustrate this.

Panel 3

Purely associational sequence of the first 30 headings

in Samantar Kosh

1. The Universe

2. Sky

3. Stellar Body

4. Movement of Stellar Bodies

5. Rotation of Earth

6. Eclipse

7. Solar System (all the non-earth planets)

8. Sun and Moon

9. Earth

10. Geography

11. Plains and Deserts

12. Jungles and Gardens

13. Garden and Urban Trees

14. Garden Flowers

15. Pits and Caves

16. Mountains and Valleys

17. Indian Mountains (list and synonyms of important mountains like the Himalayas--we give only 30 out of many)

18. Ponds and Lakes

19. Water Supply (wells for drinking and irrigation, water carriers, water taps)

20. Rivers

21. Indian Rivers (Ganges has 37 synonyms here, Yamuna 20)

22. River: From its source to the end (source of a river, waterfall, flow of water, confluence, delta, submergence in the sea, etc.)

23. Flood

24. Flood Control (dams, etc.)

25. Draining Out (canals, drains, sewers)

26. Seas and Bays

27. Landmass (coast, ground reclaimed from water, marsh, etc.)

28. Islands and Continents

29. Asia and Countries of South Asia

30. India and the States of India

Names of some headings are given in bold letters, just to draw a user’s attention to the nature of the subjects in their vicinity.

We have jumped to an unassociated category only when unavoidable.

Problems of handling large data

We had started by following the time-honored tradition of using index cards. By 1990, we had 60,000 hand-written cards containing more than 250,000 words. The cards were arranged in wooden trays. The trays represented broad categories and occupied a complete room. In them were arranged conceptual groups and subgroups. To change the sequence, we would simply inter-shift trays, or subgroups within a tray. The task of handling the room-full of data was getting out of hand. There was much overlapping of categories and repetition of words. As we had no help, the initial attempt to index the body of words proved futile and had to be abandoned.

The task of handling the data at the time of publishing, too, was giving me nightmares. First, the cards would go to typists. They would mix up their sequence or misplace or lose some of them. A large number of typing mistakes would occur. We would have to proofread the typescripts carefully. The typed sheets, in their turn, could be mixed up. Typesetting at the printing press would add even more errors.

Enter the Computer and Shabda Lexicographer

It was my son Sumeet, a general surgeon, who realized that continuing to work with index cards would never succeed, with the limited human resources we had. He thought computerization was the only solution. Since we did not have the funds to hire software experts, Sumeet learnt computing by himself and started exploring options. The option of using a word processing program was abandoned as it would not allow automatic sorting, indexing or easy rearrangement.

While we came across some existing computer programs, called Assassin, Astute, Avocon, and Thesaurus Development, they were not available in India and would probably not have allowed input in Indian scripts anyway.

Meanwhile C-DAC (Centre for Development of Advanced Computing, India) had come out with the GIST Card, an ingenious device that enabled the input, display and printing of Indian scripts under DOS and UNIX.

After considering all aspects, Sumeet chose the DOS version of FoxPro 2.5 in combination with the GIST card as the software platform for our work.

He then programmed FoxPro to handle the specialized task of creating a thesaurus. This program enabled me to get my work done faster. He then kept on adding new features to the program and evolved it to fully satisfy all requirements of thesaurus making. We call it Shabda Lexicographer amongst ourselves.

The value of having a database for a thesaurus or dictionary, and of the way Sumeet's Shabda Lexicographer is designed, cannot be overestimated. With it, we can manipulate the data in any way we like. We can add as many new categories/concepts as we like, include extra columns to accommodate more languages, enter any number of synonyms at will, and shift whole groups from one place to another to change the sequence. We can examine synonyms alphabetically and/or by the first or the last letter/word.
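The kind of manipulation described here can be sketched in a few lines of Python. This is only an illustration and not the Shabda Lexicographer itself, which was written in FoxPro; the `Concept` record, the helper functions and the sample words (Hindi in romanized form) are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Concept:
    """One heading of the thesaurus. Its position in the list is its sequence."""
    heading: str
    # Language code -> synonyms. A new language is simply a new key ("column").
    synonyms: Dict[str, List[str]] = field(default_factory=dict)


def move_group(concepts, start, length, new_pos):
    """Lift `length` concepts beginning at `start` and re-insert them at
    index `new_pos` of the remaining list."""
    group = concepts[start:start + length]
    rest = concepts[:start] + concepts[start + length:]
    return rest[:new_pos] + group + rest[new_pos:]


def by_last_letter(words):
    """Order words by their spelling read backwards (last letter first)."""
    return sorted(words, key=lambda w: w[::-1])


# Sample data, invented for the example.
bank = [
    Concept("Sky", {"en": ["sky", "heavens", "firmament"]}),
    Concept("Earth", {"en": ["earth", "globe", "world"]}),
    Concept("Rivers", {"en": ["river", "stream"]}),
]

bank[0].synonyms["hi"] = ["akash", "aasman"]           # add a language column
bank = move_group(bank, start=2, length=1, new_pos=0)  # Rivers now comes first
```

Because the sequence is just the order of records, nothing has to be renumbered when a group moves, which is exactly what numbered index cards could not offer.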

We can convert synonyms into index entries and output them as an exhaustive index. If we enter, create or associate a pronunciation pattern to English words, a rhyming book can also be made available as a dictionary or even as a rhyming thesaurus!
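As a rough sketch of both ideas, the Python below inverts a few thesaurus entries into an alphabetical index and groups words by a pronunciation pattern. The function names, the hand-entered rhyme patterns and the sample synonyms are assumptions for illustration; only the heading numbers are borrowed from Panel 3.

```python
from collections import defaultdict

# Sample data, invented for the example: heading number -> (heading, synonyms).
entries = {
    2: ("Sky", ["sky", "heavens", "firmament"]),
    9: ("Earth", ["earth", "world", "globe"]),
    27: ("Landmass", ["land", "earth", "ground"]),
}


def build_index(entries):
    """Turn every synonym into an index entry pointing at its heading numbers."""
    index = defaultdict(list)
    for number, (_heading, words) in entries.items():
        for word in words:
            index[word].append(number)
    # Alphabetical order of entries; ascending heading numbers within an entry.
    return {word: sorted(numbers) for word, numbers in sorted(index.items())}


def rhyme_groups(patterns):
    """Group words that share a hand-entered pronunciation pattern."""
    groups = defaultdict(list)
    for word, pattern in patterns.items():
        groups[pattern].append(word)
    return {pattern: sorted(words) for pattern, words in groups.items()}


index = build_index(entries)
rhymes = rhyme_groups({"sky": "ai", "high": "ai", "land": "and", "sand": "and"})
```

A word such as "earth" that belongs to two headings ends up with two references in the index, so the reader is led to both senses.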

The Shabda Lexicographer also allows us to output data as a theme-wise thesaurus or an alphabetical one. The application includes a facility to convert the data into tagged text for import into popular programs such as MS Word, Adobe PageMaker and QuarkXPress, and enables us to be ready for pre-press in just a day or two. This is how we were able to give our publishers fully formatted laser prints of Samantar Kosh, ready for press.

The result has been a full-fledged and satisfying program to the extent that we can use it to make bilingual and multi-lingual thesauruses.

The Index is critical

Publishers consider an index a necessary evil that consumes space and paper and increases the price of a book. Nevertheless, in a thesaurus, an index is perhaps more important than the main book. It is the basic key, the point of entry, the portal opening into the treasure house of words. It deserves equal importance, and care has to be given to its compilation and organization. In my opinion, the index should be devised to act as half the thesaurus, so to speak. This is because a thesaurus is not read from start to finish, nor is it browsed at leisure. Users turn to the thesaurus when they need a better word for a concept, and the quality of the index determines the degree of their success in finding it. For example, when the index entry Science or Disease appears, the names of other sciences or diseases should appear below it as sub-entries, sparing the reader the need to open the main thesaurus to find their names. Only when a reader wants more words for a particular science or disease should he need to go to its section in the main book.

Hence, we have adopted an inclusive approach in our selection of synonyms, with the goal of helping users reach the desired expression easily, and have added some short, clipped explanatory phrases as synonyms. These help a reader understand the basic meaning of a concept and are of great help in indexing by concept. Furthermore, our index can also act as a reverse dictionary.
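To make the idea concrete, here is a minimal Python sketch of an index entry that carries its own sub-entries, together with a lookup that works backwards from an explanatory phrase to a word. The data, the function names and the four-space indentation are assumptions chosen for the example, not a description of the printed index.

```python
# Sample data, invented for the example.
sub_entries = {
    "disease": ["malaria", "cholera", "typhoid"],
    "science": ["chemistry", "astronomy", "botany"],
}

# Short explanatory phrases entered as synonyms let the index work backwards,
# from a description to the word.
phrases = {
    "study of stars": "astronomy",
    "study of plants": "botany",
}


def render_index_entry(term, sub_entries):
    """Return an index entry with its sub-entries indented beneath it."""
    lines = [term]
    lines += ["    " + child for child in sorted(sub_entries.get(term, []))]
    return "\n".join(lines)


def reverse_lookup(fragment, phrases):
    """Find words whose explanatory phrase contains `fragment`."""
    return sorted(word for phrase, word in phrases.items() if fragment in phrase)


print(render_index_entry("science", sub_entries))
print(reverse_lookup("plants", phrases))
```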

Where bilingual dictionaries fail

The makers of bilingual dictionaries would love to give a one-to-one correspondence for words in two languages. However, the fact is that it is very rare to find two words in two languages carrying the same meaning, weight, background and associations. To give a simple example, for the English word success, Hindi has two main words, saphalata and kamyabi. All three words have different cultural and semantic backgrounds and contexts. Success carries with it the sense of reaching somewhere. Saphalata is a word emanating from an agricultural background. Literally it means fruitfulness, having come to fruition. Kamyabi has an Indo-Persian origin denoting achievement of an objective. Success leads to succession (uttaradhikar), but neither saphalata nor kamyabi can lead one to uttaradhikar.

I am always at a loss to find the English equivalent of a very common Hindi word like shobha. I find listed many English words as its rough equivalents in many Hindi-English, Sanskrit-English dictionaries: splendor, brilliance, luster, beauty, grace, loveliness, elegance, show… None of these satisfies me. Shobha carries with it a little of each of these but also something beyond them.

In addition, dictionaries cannot offer more than a few options for any given word. A bilingual thesaurus offers a whole range of words for a word-group and helps the user select the most suitable one from the two languages.

An English/Hindi Wordbank

A bilingual English-Hindi thesaurus is a crying need in India, which has a very high density of English-speaking people. It may sound incredible, but it is true that many Indians are more at home with English than Hindi. They even think and dream in English. Add to them the vast number of Non-Resident Indians spread all over the world, especially in the USA and the UK, and the non-Indian researchers and scholars of Hindi, and the number of people who are often looking for the correct Hindi word goes up dramatically. There is also a whole class of people translating to and from Hindi who need parallel English and Hindi words.

To fulfill their needs, we started work on an English-Hindi Wordbank in 1997, immediately after the publication of our Hindi Thesaurus.

The first step was to find equivalent English words for the headings and sub-headings in the Samantar Kosh. In the FoxPro table, we added more columns to accommodate the corresponding English headings, sub-headings and synonyms, and modified Shabda Lexicographer accordingly. We then started with the Hindi data and entered appropriate corresponding English synonyms in every category. To ensure the completeness of the Wordbank, we are now scanning it from the English point of view and filling in any blanks found.
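The scan for blanks can be pictured with a small Python sketch. The field names below are hypothetical and do not reflect the actual FoxPro table layout; the sample records (Hindi in romanized form) are invented for the example.

```python
# Sample records, invented for the example. Each record holds the heading and
# the synonyms of one concept in both languages.
records = [
    {"heading_hi": "akash", "heading_en": "Sky",
     "syn_hi": ["akash", "aasman"], "syn_en": ["sky", "heavens"]},
    {"heading_hi": "nadi", "heading_en": "River",
     "syn_hi": ["nadi", "sarita"], "syn_en": []},
    {"heading_hi": "", "heading_en": "Skyscraper",
     "syn_hi": [], "syn_en": ["skyscraper", "high-rise"]},
]


def find_blanks(records, lang):
    """Return the records whose heading or synonyms in `lang` are missing."""
    return [
        r for r in records
        if not r[f"heading_{lang}"] or not r[f"syn_{lang}"]
    ]


# Pass 1: concepts entered from the Hindi side that still lack English words.
missing_english = find_blanks(records, "en")
# Pass 2: concepts found from the English side that still lack Hindi words.
missing_hindi = find_blanks(records, "hi")
```

Running the scan in both directions is what lets each language expose gaps in the other.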

This process of cross-fertilization, I am happy to say, has helped me change, enrich and improve the Hindi thesaurus too. Many new categories have been added, many more expressions included.

The end product will be a unique bilingual thesaurus that covers concepts existing in both languages and will offer readers a unique and unprecedented linguistic tool.

I also feel that the same data would enable the creation of a very useful English thesaurus and this should be able to stand on its own in the English language market.

Output Options

There are various options open for the bilingual data. It can be manipulated in various ways and sizes for different audiences, both theme-wise and alphabetically.

- If one wants to keep Hindi as the first language, a paragraph of Hindi synonyms can be followed by one of English synonyms, concept by concept.

- The opposite of it can also be achieved. An English paragraph can be followed by a Hindi paragraph, concept by concept.

- The categories may also be presented alphabetically, sorted by either of the two alphabets (though English is probably better). In this case, too, a paragraph of one language will be followed by one of the other.

- In any case, the thesaurus will need to have two independent indexes, one for each language. Here too there are similar options.

- We may have Hindi-Hindi and English-English indexes.

- Bilingual Hindi-English/English-Hindi indexes too are possible. The bilingual indexes can also double as instant cross dictionaries.
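These options amount to different views of the same records. The Python below is a minimal sketch of that idea; the `render` and `cross_index` functions, their parameters and the sample data (Hindi in romanized form) are assumptions made for illustration.

```python
# Sample data, invented for the example.
concepts = [
    {"heading": "Sky", "en": ["sky", "heavens"], "hi": ["akash", "aasman"]},
    {"heading": "River", "en": ["river", "stream"], "hi": ["nadi", "sarita"]},
]


def render(concepts, first="hi", alphabetical=False):
    """Lay out each concept as two paragraphs, `first` language on top.

    Theme-wise order is the stored order; alphabetical order sorts by the
    English heading.
    """
    second = "en" if first == "hi" else "hi"
    ordered = sorted(concepts, key=lambda c: c["heading"]) if alphabetical else concepts
    return "\n\n".join(
        ", ".join(c[first]) + "\n" + ", ".join(c[second]) for c in ordered
    )


def cross_index(concepts, source, target):
    """Bilingual index: each source word lists the target-language synonyms."""
    index = {}
    for c in concepts:
        for word in c[source]:
            index.setdefault(word, []).extend(c[target])
    return dict(sorted(index.items()))


print(render(concepts, first="en", alphabetical=True))
print(cross_index(concepts, "hi", "en")["sarita"])
```

The bilingual index built by `cross_index` is what allows it to double as a cross dictionary: looking up a word in one language returns the whole group of words in the other.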

A Global Wordbank and the crucial role of English

The process does not end here. It opens unending new vistas of opportunity. The Internet is making the world more interactive. Considerable manpower and energy are being devoted to solving the problems of Machine Translation. Already, machine translations of many web pages are available. In the contemporary context, the need for a Global Wordbank becomes obvious. Languages of the world can be linked to each other through a link language and stored in the Wordbank--a modern-day, all-encompassing Rosetta Stone.

A lot of this correlation exists in the proprietary bilingual thesauruses available today and this word-base could be used after addressing copyright issues. Since English recurs in many of them as one of the languages, it suggests itself as the global link language by which concepts from different languages could be mapped to one another.
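A small Python sketch shows how a link language could work. Each language is mapped only to English concept labels, and any pair of languages is then joined through those labels. The `links` table, the `pair_thesaurus` function and the romanized sample words are assumptions for illustration only.

```python
# Sample data, invented for the example. Every language is linked only to the
# English concept label, never directly to another language.
links = {
    "es": {"water": ["agua"], "sky": ["cielo"]},
    "ru": {"water": ["voda"], "sky": ["nebo"]},
    "hi": {"water": ["pani", "jal"], "sky": ["akash", "aasman"], "river": ["nadi"]},
}


def pair_thesaurus(links, lang_a, lang_b):
    """Join two languages on the English concepts that both of them cover."""
    shared = sorted(set(links[lang_a]) & set(links[lang_b]))
    return {
        concept: (links[lang_a][concept], links[lang_b][concept])
        for concept in shared
    }


spanish_russian = pair_thesaurus(links, "es", "ru")
spanish_hindi = pair_thesaurus(links, "es", "hi")
```

With n languages this needs only n mappings to the link language instead of n(n-1)/2 separate bilingual tables. The price is that a concept the link language does not distinguish, or that one side has not yet entered, cannot be carried across, as with "river" in the Spanish-Hindi pair here.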

The technology finally exists to input and display all the scripts of the world. Unicode, a character encoding standard designed to accommodate the characters of all the world's scripts, is now widely available in Windows 2000 as well as UNIX and Linux.
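As a brief illustration, the Python below spells out a Hindi word as Unicode code points and shows that it survives a round trip through two common encodings. It uses only the standard library; the choice of word is ours.

```python
import unicodedata

# "shabda", the Hindi word for "word", spelled out as code points.
word = "\u0936\u092c\u094d\u0926"

# Every character has its own number and name, whatever the script.
names = [unicodedata.name(ch) for ch in word]
for ch, name in zip(word, names):
    print(f"U+{ord(ch):04X}  {name}")

# The same text survives a round trip through two common Unicode encodings.
utf8 = word.encode("utf-8")
utf16 = word.encode("utf-16-le")
assert utf8.decode("utf-8") == utf16.decode("utf-16-le") == word
```

Once words are stored this way, Hindi, English and any other language can sit side by side in the same table without a script-specific device.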

Our English-Hindi thesaurus could even be the nucleus from which such a Global Wordbank starts, as it is in a relational database format, can be converted to Unicode and has a full-fledged lexicon-making program.

Such a Global Wordbank would make the generation of bilingual or multilingual thesauruses in any desired languages just a click away. It would just be a matter of choosing the target languages (say Spanish-Russian, German-Arabic or French-Hindi) and clicking to get output in the shape of a thesaurus, either in printed book format or as an online computer program.

Some day, somebody, some organization, some university (or UNESCO?) has to step in and initiate the process of such a Wordbank, which may end up as a repository of words from all the major languages of the world.

It is high time world languages came closer together. Perhaps the time is now?

(Written in 2001)


I will soon write an update and give a detailed account of our progress in the matter: a sort of progress report on what we have done, the coming of Arvind Lexicon, and what we are about to do.

(AK. 24 Feb 2011)


©Arvind Kumar