Thursday, July 7, 2016

Article Summary for Lecture # 10 - Northedge


Google and beyond: information retrieval on the World Wide Web
Northedge defines a web directory as “a human compiled list of links to web pages, typically organized into a hierarchical structure of subject categories.” Back in 1994, a mere three years after Berners-Lee created the “Web”, there were fewer than 10,000 websites. That number inflated to almost 3.5 million in 1998, and by 2006 it was estimated at over 100 million. Imagine if those websites were books. Without anyone to organize and sort through all of them, it would take forever for users to retrieve any kind of information, let alone navigate the sea of changes that authors and creators make to their sites on a daily basis. With a librarian involved, however, users can simply submit their queries to the librarian.
In the case of the internet, search engines are the librarians. Several criteria measure the quality of a search engine, such as:
  • The size of the corpus – the more books the librarian can search, the better.
  • The speed of the answer – if we do not get our information quickly, we will find another search engine.
  • The availability of service – if it is not available when it is needed, the users are going to find another search engine.

  • The accuracy of results – if the information users get back is not what they are looking for, they will find another search engine that returns relevant results. However, if the three preceding criteria are not met, accurate results will not matter.
Search engines require their users to submit their searches through a search box, which allows the user to choose whatever terms they like – unlike web directories, which constrain users to search using vocabulary chosen by the indexer. Since it might take a while for a search engine to sift through over 100 million constantly changing websites, it only makes sense to implement an indexing program (called a spider or robot). This program accesses web pages, analyses their contents and records the results in a database (referred to as an “index”), which enables fast access to sought information and bridges the gap between the search engine and the requested content.
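To make the idea of an index concrete, here is a minimal sketch in Python (my own illustration, not anything from the article): the page addresses and texts are invented stand-ins for content a spider would fetch, and the “index” is simply a map from each word to the pages that contain it.

    # Invented sample pages standing in for content a spider would fetch.
    pages = {
        "example.com/coffee": "fresh java coffee beans roasted daily",
        "example.com/island": "java is an island in indonesia",
        "example.com/code": "the java programming language runs on a virtual machine",
    }

    # Build the "index": each word points to the set of pages containing it,
    # so a later query can be answered without re-reading every page.
    index = {}
    for url, text in pages.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(url)

    print(sorted(index["java"]))  # every page that mentions the word "java"

Answering a query then becomes a fast lookup in the index rather than a crawl of the live Web, which is exactly the gap-bridging role Northedge describes.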

Today, one of the most used search engines is Google. Google’s software agent (indexer), called “Googlebot”, continually locates billions of web pages, analyses their content, and saves the results in the Google index. The algorithms it uses are a company secret, as they are what set Google apart from its competitors (Bing, Yahoo, etc.). Googlebot breaks web pages down into words and examines their context within the page (position – is the word in a header, sub-header, body text, etc.?). Sources are then returned to the user, based on the algorithm’s weighted scale, in order from assumed most relevant to least.
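Since the real ranking algorithms are secret, the following is only a toy illustration of the general idea: occurrences of a query word in a header are assumed to count for more than occurrences in body text, and pages are returned highest score first. The weights and sample pages are invented for illustration.

    # Assumed weights for illustration only – the real values are not public.
    WEIGHTS = {"header": 3.0, "body": 1.0}

    pages = {
        "pageA": {"header": "cat videos", "body": "funny cat clips and more"},
        "pageB": {"header": "dog training", "body": "the occasional cat appears"},
    }

    def score(page, term):
        # Weight each occurrence of the term by where on the page it appears.
        return sum(WEIGHTS[part] * text.lower().split().count(term)
                   for part, text in page.items())

    # Rank pages from assumed most relevant to least for the query "cat".
    results = sorted(pages, key=lambda name: score(pages[name], "cat"), reverse=True)
    print(results)  # ['pageA', 'pageB']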

In addition, while Google’s search box may seem to ask “What subject do you want information on?”, in reality it is asking “What word or combination of words will be most likely to appear on web pages that address the subject I am interested in, and least likely to appear on pages that are irrelevant to me?”. This may trip up users who are unfamiliar with how search engines work, and it may be the one negative Northedge presents about search engines – there is no one-to-one correspondence between words and meanings, and a single word may have multiple meanings (search for Java – the island – and only results about the computer programming language are returned). He also offers information on alternatives to search engines, which include META tags (the assignment of subject keywords by the web content creators) and folksonomies/tagging (the creation of a taxonomy by the collective actions of users on the Web – see del.icio.us and flickr). These alternatives are somewhat controversial, because users and/or creators can deliberately assign misleading or inaccurate keywords to content for financial gain or malicious reasons.
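A short sketch of that ambiguity, continuing the toy index from above: a keyword query for "java" cannot distinguish the island from the programming language, while folksonomy-style tags (user-assigned labels, as on del.icio.us and flickr) can narrow the results – provided, as Northedge warns, that the tags were assigned honestly. All of the data here is invented.

    # Toy index and invented user-assigned tags, for illustration only.
    index = {"java": {"example.com/island", "example.com/code", "example.com/coffee"}}
    tags = {
        "example.com/island": {"travel", "indonesia"},
        "example.com/code": {"programming"},
        "example.com/coffee": {"food", "drink"},
    }

    print(index["java"])  # keyword match returns all three pages, whatever the meaning

    # Filtering by a user-assigned tag separates the meanings – if the tags are honest.
    print({page for page in index["java"] if "travel" in tags[page]})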

Ultimately, Northedge offers insight into the possible future of web searches: computer-generated indexes, though the data contained in those indexes may still be driven by data sets produced by human indexing techniques and human linguistic research. I agree with this assertion, because it seems that as more technologies are developed and released, the search process becomes more streamlined and tailored to what the user REALLY wants from their search. This article is very informative, and if you want to know more about the inner workings of search engines, it is a fascinating read. I definitely came away knowing more about what happens once I search for "cat videos" on Google.
______________________________________
To read the whole article, see the citation below:

Northedge, R. (2007, April). Google and beyond: Information retrieval on the World Wide Web. The Indexer, 25(3), 192-195.
