Tuesday, June 28, 2016

Article Summary for Lecture # 7 - Aitchison

The Thesaurus:
A Historical Viewpoint, 
with a Look to the Future


Four decades of use, experimentation, and development have allowed users, researchers, and catalogers to refine thesauri into very effective search tools. Aitchison and Clarke trace this history, noting the landmark events that helped create what we now know as the thesaurus. They draw on earlier printed histories of thesauri (specifically Gilchrist’s Thesaurus in Retrieval), and go on to define a thesaurus as “a treasury or storehouse of knowledge, as a dictionary, encyclopedia and the like.” The primary purpose of a thesaurus is to match the vocabulary used by the indexer with the language of the searcher.

The word “thesaurus” was first used in an information retrieval sense in 1957 by Peter Luhn of IBM, and the idea evolved steadily through the 1950s. One particular highlight on the timeline of the thesaurus is the Uniterm System, which used uncontrolled single words taken from the text of documents. This ultimately proved difficult, since only single-word terms were available for handling synonyms, homonyms, etc. Fortunately, it was superseded by vocabularies containing significant numbers of compound terms. During this time, thesauri also listed terms in alphabetical order, a practice that was carried into the standardization of format in 1967, when the Thesaurus of Engineering and Scientific Terms (TEST) was published.

This predominant format displays descriptors and non-descriptors alphabetically, with synonyms and broader, narrower, and related terms shown under each descriptor. A subject overview or systematic display was of secondary importance, and the idea of a detailed classified arrangement was considered too complex. An example of the simpler systematic displays that were used is the Descriptor Group Display, in which main groups are divided into subgroups, and within the subgroups, descriptors are further organized into clusters. For example:

14    DEMOGRAPHY. POPULATIONS
                14.01 POPULATION DYNAMICS
                14.01.01
                    CIVIL REGISTRATION
                    DEMOGRAPHIC STATISTICS
                    POPULATION DATA
                       USE: DEMOGRAPHIC STATISTICS
                    etc.
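
To make the display above a little more concrete, here is a minimal Python sketch of how descriptors, non-descriptors, and USE references could be stored and looked up. This is my own illustration, not something from the article: the `Descriptor` and `Thesaurus` classes and their method names are hypothetical, and only the terms and group notation come from the example above.

```python
from dataclasses import dataclass, field


@dataclass
class Descriptor:
    """A preferred term filed under a Descriptor Group notation (hypothetical model)."""
    term: str
    group: str                                    # e.g. "14.01.01"
    use_for: list = field(default_factory=list)   # non-descriptors pointing here


class Thesaurus:
    def __init__(self):
        self.descriptors = {}   # preferred term -> Descriptor
        self.use_refs = {}      # non-descriptor -> preferred term

    def add_descriptor(self, term, group):
        self.descriptors[term] = Descriptor(term, group)

    def add_use_reference(self, non_descriptor, preferred):
        # Mirrors an entry such as "POPULATION DATA  USE: DEMOGRAPHIC STATISTICS".
        if preferred not in self.descriptors:
            raise KeyError(f"unknown descriptor: {preferred}")
        self.use_refs[non_descriptor] = preferred
        self.descriptors[preferred].use_for.append(non_descriptor)

    def resolve(self, term):
        """Return the descriptor to index or search with, or None if unknown."""
        term = term.upper()
        if term in self.descriptors:
            return term
        return self.use_refs.get(term)

    def cluster(self, group):
        """List the descriptors filed under one group notation, alphabetically."""
        return sorted(d.term for d in self.descriptors.values() if d.group == group)


thesaurus = Thesaurus()
thesaurus.add_descriptor("CIVIL REGISTRATION", "14.01.01")
thesaurus.add_descriptor("DEMOGRAPHIC STATISTICS", "14.01.01")
thesaurus.add_use_reference("POPULATION DATA", "DEMOGRAPHIC STATISTICS")

print(thesaurus.resolve("population data"))
print(thesaurus.cluster("14.01.01"))
```

The point of the sketch is that the indexer and the searcher both pass through `resolve`, so whichever word they start from, they end up on the same descriptor.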

Sectioning thesaurus information in this way requires a classification scheme, which is an indispensable tool. An editor who works only with an alphabetical list is, in a sense, working blind, but if a rigorous classification is developed, the compiler has a better chance of building accurate and meaningful relationships between the terms. In the early days, most thesauri were compiled manually, which was a massive investment of both time and space (e.g. Thesaurofacet was held in more than 20 shoeboxes containing cards for 16,000 descriptors and 7,000 non-descriptors) and was highly vulnerable to human error and mid-process interruptions. This is where computer-aided compilation came in handy. By the late 1970s, computer compilation was more common; however, there was no software to maintain the systematic display of a faceted-style thesaurus.

During this time, access was usually limited to one workplace. The thesaurus was typically a large tome that stood by the bank of filing cards or the optical coincidence viewer, and even computerized thesauri were limited in space. However, trained searchers became fluent with the process and, alongside trained indexers, were able to fully harness the power of the thesaurus to perform effective searches. Nowadays, PCs are everywhere, and each of them provides access to unlimited networks. To apply thesauri to information retrieval today, the authors feel the following challenges need to be addressed:
  • Access to information proceeds through any number of different portals, gateways, and search engines, many geared to particular audiences and subject areas. There is no universal thesaurus, but a multitude of different vocabularies for different applications.
  • In the ‘publish once, re-utilize many times’ environment, it is hard to predict in which systems or networks a given document may eventually appear. Indexers must struggle to foresee all the needs that may arise for a given document.
  • With the data entry/indexing task distributed among a vast number of authors, webmasters, system administrators, etc., quality control cannot be enforced across organizational boundaries.
  • How can we train end-users to use a thesaurus properly? The experience of most information providers is that users do not want to cope with anything complicated, and the thesaurus is perceived as very complicated. Those beautifully presented systematic displays, carefully designed for selecting the right term(s) for each required concept, are often rejected as an unnecessary impediment and delay between the user and goal.

Confronting these challenges has recently led to two major trends in thesaurus development:
  • Hunting for adaptations that will make a controlled vocabulary much quicker, easier, and more intuitive to use.
  • A drive toward interoperability of systems, meaning that vocabularies are designed for easy integration into downstream applications such as content management systems, indexing/meta-tagging interfaces, search engines, and portals.

Today’s users are more than happy to browse through a simple classified directory, using point-and-click interaction with established headings instead of actively thinking up search terms. This is why some companies are developing taxonomies that make things easier for the searcher, and perhaps even the indexer. There is even talk of hiding the vocabulary altogether by implementing synonym sets for selected terms, which can be used to drive automatic expansion of free-text search queries.
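
As a rough idea of what synonym-driven query expansion could look like, here is a small Python sketch. It is only an illustration under my own assumptions: the `SYNONYM_SETS` data and the `expand_query` function are hypothetical, matching is simple whole-word matching on lowercased text, and punctuation is not handled.

```python
# Hypothetical synonym sets; a real system would load these from the thesaurus.
SYNONYM_SETS = [
    {"demographic statistics", "population data"},
    {"car", "automobile", "motor vehicle"},
]


def expand_query(query, synonym_sets=SYNONYM_SETS):
    """Return the normalized query plus one variant per synonym of any term found in it."""
    normalized = " ".join(query.lower().split())
    padded = f" {normalized} "          # padding makes whole-word matching simple
    expansions = [normalized]
    for synset in synonym_sets:
        for term in sorted(synset):
            if f" {term} " not in padded:
                continue
            # Swap the matched term for each of its synonyms in turn.
            for other in sorted(synset - {term}):
                variant = padded.replace(f" {term} ", f" {other} ").strip()
                if variant not in expansions:
                    expansions.append(variant)
    return expansions


for variant in expand_query("Used car prices"):
    print(variant)
```

The searcher only ever types a free-text query; the vocabulary does its work behind the scenes, which is the "hidden" part of the idea.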

Another topic current thesaurus developers need to think about is interoperability, which makes things easier for users. Gone are the days of looking a term up in a printed thesaurus and then keying the selected terms into the indexing system. Now that users expect to copy and paste terms or simply click on them, a search system has to be capable of interacting with the thesaurus database. These newest concerns were reflected in the recommendations for an updated standard from the Workshop on Electronic Thesauri held in 1999, which say, “The standard should provide for a broader group of controlled vocabularies than those that fit the standard definition of ‘thesaurus.’ This includes, for example, ontologies, classifications, taxonomies and subject headings, in addition to standard thesauri. The primary concern is with shareability (interoperability), rather than with construction or display. Therefore, this new standard will probably not supersede Z39.19, but supplement it.”

Overall, I think Aitchison and Clarke’s article is very thorough and offers a lot of insight into the world of thesauri. This is a great read for anyone interested in the development of thesauri or the organization of information. They provide a ton of information and references to support their examples, and they write in a way that makes it easy for even beginners to pull information from the article and form new ideas about, and appreciation for, the thesaurus in their lives!
_______________________________________________________________________
For more information, check out the full article (citation below)!

Aitchison, J., & Clarke, S.D. (2004). The thesaurus: A historical viewpoint, with a look to the future. Cataloging & Classification Quarterly 37(3/4):5-21.
