Google and beyond: information retrieval on the World Wide Web
Northedge defines a web directory
as “a human compiled list of links to web pages, typically organized into a
hierarchical structure of subject categories.” Back in 1994, a mere 3 years
after Berners-Lee created the “Web,” there were fewer than 10,000 websites. This
number had grown to almost 3.5 million by 1998, and in 2006 it was estimated at
over 100 million. Imagine if those websites were books. Without anyone to
organize and sort through all of them, it would take forever for us users to retrieve
any kind of information, let alone navigate the sea of changes that authors and
creators make to their sites on a daily basis. With a librarian involved, however,
users can simply submit their queries to the librarian.
In the case
of the internet, search engines are the librarians. Several criteria measure
the quality of the search engine, such as:
- The size of the corpus – the more books the librarian can search, the better.
- The speed of the answer – if we do not get our information quickly, we will find another search engine.
- The availability of service – if it is not available when it is needed, the users are going to find another search engine.
- The accuracy of results – if the information the user gets back is not what they are looking for, they will find another search engine that returns related results. However, if the three preceding criteria are not met, accurate results are not going to matter.
Search engines require their users to submit their searches
through a search box, which allows the user to choose whatever terms they like –
unlike web directories, which constrain users to search using vocabulary chosen
by the indexer. Since it might take a while for a search engine to sift through
over 100 million constantly changing websites, it only makes sense to implement
an indexing program (called a spider or robot). This program accesses web
pages, analyses their contents and records the results in a database (referred
to as an “index”), which enables fast access to sought information and bridges
the gap between the search engine and the requested content.
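To make the idea of an index concrete, here is a minimal sketch (in Python, with made-up pages and URLs) of how a spider’s output might be turned into a searchable index. A real crawler like Googlebot fetches pages over HTTP, follows links, and handles far more complexity; this only shows the core idea of mapping words to the pages they appear on.

```python
# A minimal sketch of how a spider might build an inverted index.
# The URLs and page contents are hypothetical; a real crawler fetches
# pages over HTTP, follows links, and respects robots.txt.
from collections import defaultdict

# Stand-in for pages a crawler has already fetched (URL -> page text).
crawled_pages = {
    "http://example.com/cats": "funny cat videos and cat pictures",
    "http://example.com/java": "java programming language tutorials",
}

def build_index(pages):
    """Map each word to the set of URLs it appears on."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Return URLs containing every word in the query."""
    words = query.lower().split()
    results = [index.get(w, set()) for w in words]
    return set.intersection(*results) if results else set()

index = build_index(crawled_pages)
print(search(index, "cat videos"))   # {'http://example.com/cats'}
```

Because the index is built ahead of time, answering a query is just a few dictionary lookups rather than a fresh crawl of the whole Web – which is what makes fast responses possible at all.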
Today,
one of the most used search engines is Google. Google’s software agent
(indexer), called “Googlebot”, continually locates billions of web pages,
analyses their content, and saves the results in the Google index. The algorithms
it uses are a company secret, as they are what sets Google apart from its
competitors (Bing, Yahoo, etc.). Googlebot breaks down webpages into words and
examines their context within the page (position – is it in a header, subheader,
body text, etc.?), and results are returned to the user, ranked by the
algorithms’ weighted scale from most to least relevant.
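Since the real ranking algorithms are secret, the following is only a toy illustration (in Python, with invented pages and weights) of what position-based weighting could look like: a query term found in a title or header contributes more to a page’s score than the same term buried in the body text.

```python
# Toy illustration of position-based weighting. The weights are invented
# for illustration; Google's actual ranking factors are not public.

# Hypothetical parsed pages: each position maps to the words found there.
pages = {
    "http://example.com/java-island": {
        "title": ["java", "island", "travel"],
        "header": ["visiting", "java", "indonesia"],
        "body": ["beaches", "volcanoes", "coffee"],
    },
    "http://example.com/java-lang": {
        "title": ["java", "programming"],
        "header": ["java", "tutorial"],
        "body": ["classes", "objects", "java", "code"],
    },
}

WEIGHTS = {"title": 3.0, "header": 2.0, "body": 1.0}

def score(page, query_words):
    """Sum position weights for every query-word occurrence on the page."""
    total = 0.0
    for position, words in page.items():
        total += WEIGHTS[position] * sum(words.count(w) for w in query_words)
    return total

query = ["java", "tutorial"]
ranked = sorted(pages, key=lambda url: score(pages[url], query), reverse=True)
print(ranked)  # pages ordered from highest toy score to lowest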
In
addition, while Google’s search box may seem to ask “What subject do you want
information on?” in reality, it is asking “What word or combination of words
will be most likely to appear on web pages that address the subject I am
interested in, and least likely to appear on pages that are irrelevant to me?”
This may trip up users who are unfamiliar with how search engines work, and
this may be the one negative Northedge presents about search engines – there is
no one-to-one correspondence between words and meanings, and a single word may have
multiple meanings (search for Java – the Indonesian island – and the results are
dominated by the computer programming language). He also offers information on
alternatives to search engines, which include META tags (the assignment of
subject keywords by the web content creators), and folksonomies/tagging
(creation of a taxonomy by the collective actions of users on the Web – see del.icio.us and flickr).
These alternatives are somewhat controversial, because users and/or creators
can deliberately assign misleading or inaccurate keywords to the content for
financial gain or malicious reasons.
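For a sense of what author-assigned META keywords look like to an indexer, here is a small sketch using Python’s standard html.parser module; the HTML and keywords are made up, and, as noted above, such keywords cannot be trusted blindly.

```python
# Sketch of reading author-assigned META keywords with Python's standard
# html.parser. The HTML below is a made-up example; as noted above, such
# keywords may be misleading, so indexers cannot rely on them alone.
from html.parser import HTMLParser

class MetaKeywordParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name", "").lower() == "keywords":
            self.keywords.extend(
                k.strip() for k in attrs.get("content", "").split(",")
            )

html = """
<html><head>
  <meta name="keywords" content="cats, cat videos, funny pets">
</head><body>Funny cat videos.</body></html>
"""

parser = MetaKeywordParser()
parser.feed(html)
print(parser.keywords)  # ['cats', 'cat videos', 'funny pets']
```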
Ultimately,
Northedge offers insight into the possible future of web searches: computer-generated
indexes whose contents may nevertheless be driven by data sets produced by human
indexing techniques and human linguistic research. I agree with this assertion, because it seems that as more technologies are developed and released, the search process becomes more streamlined and tailored to what the user REALLY wants from their search. This article is very informative, and if you want to know more about the inner workings of search engines, it is a fascinating read. I definitely came away knowing more about what happens once I search for "cat videos" on Google.
______________________________________
To read the whole article, see the citation below:
Northedge, R. (2007, April). Google and beyond: Information retrieval on the World Wide Web. The Indexer, 25(3), 192-195.