Thursday, July 7, 2016

Verbal Subject Analysis III: Webpage Databases (a.k.a. "Search Engines")

Human vs. Automatic Indexing

  • Both are related to the subject analysis of information resources.
  • Human indexing is used to describe the subject analysis of various periodical databases.
  • Automatic indexing is a term used for the subject analysis operations by the computer algorithms of various webpage databases (a.k.a. search engines).
    • Research from the 1960s through the 1980s tried to get a computer to calculate what articles were about. The most frequent words (articles such as a, an, and the) don't tell you much about an article, and neither do the least used words. The key is finding the sweet spot in between, based on what the author usually writes about.
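The mid-frequency idea above (often attributed to H. P. Luhn's early work at IBM) can be sketched in a few lines of Python. This is a toy illustration, not any historical algorithm: the frequency thresholds below are arbitrary choices for the example text.

```python
from collections import Counter

def candidate_index_terms(text, low=3, high=5):
    """Keep words whose frequency falls in a middle band: the most
    frequent words (a, an, the, ...) say little about what a document
    is about, and so do words that appear only once or twice.
    The low/high cutoffs are illustrative assumptions."""
    counts = Counter(text.lower().split())
    return {w for w, n in counts.items() if low <= n <= high}

# Hypothetical sample text for the example.
doc = ("the dachshund is a dog bred to hunt the dachshund has a long "
       "body the dachshund appears in the dog shows and the dog "
       "breeders favor the dachshund")
print(sorted(candidate_index_terms(doc)))  # ['dachshund', 'dog']
```

Here "the" (6 occurrences) falls above the band and the one-off words fall below it, leaving the mid-frequency words as candidate subject terms.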
Why Webpage Database?
  • It is always important to know the documentary unit of an information database. 
  • The adjective associated with database is always a cue to the documentary unit. 
  • Webpage databases are informational databases in which a webpage is the documentary unit. 
  • They are also known as search engines and discovery databases.
Analysis of Websites and their Structure
  • What are webpages? What are websites? Webpages are to websites as pages are to books.
  • Standards (or lack thereof) for the authoring of web sites and webpages
    • HTML and other markup languages
    • Editors
  • What are the implications of the lack of authoring standards for web-based information resources?
Location of Webpage Subject Metadata
  • In webpage headers: For individual webpages, subject metadata can be created by authors and included in HTML headers.
  • In separate metadata record databases:
    • Subject metadata can be created by intermediaries using Dublin Core schema
    • In search engines, subject metadata is inferred "automatically" by computer algorithm.
Search Engine Questions
  • For greater understanding we need to be able to answer:
    • Why do search engines produce different results for the exact same query?
    • What is the principle for ranking the display of search engine records in response to a query?
The Term "Search Engine"
  • The term has become the common designation for webpage databases. In actuality, however, webpage databases have three parts:
    • Spidering/crawling software to collect webpages.
    • Indexing software to build the index of surrogate records.
    • Retrieval software to facilitate retrieval of surrogates.
Automatic Indexing in Context
  1. Obtain information resource - spidering/crawling
    • Steps for spidering/crawling:
      • Computers owned by the search engine retrieve documents by following all hyperlinks on each retrieved webpage
      • Determination is made whether a webpage needs to be indexed (because it is new) or reindexed (if it has already been indexed)
      • Determination is made whether reindexing is warranted
      • New webpages and those meeting criteria for reindexing are then placed in the indexing queue
  2. Describe information resource in surrogate record - read off webpages by indexing software
    • Left-side elements must be inferred by the searcher:
      • Examine structure of retrieved records
      • Examine advanced search interface
      • Element sets are not standard, i.e., they will vary across search engines.
    • Right-side content:
      • What is the source for the content?
      • Authority control?
  3. Subject analyze information resource in surrogate record - indexing software:
    • Verbal - inferred by computer algorithm
    • Classification - inferred by computer algorithm
    • Subject Indexing in Search Engines
      • The subject fields of webpage surrogate records include the words that describe what the webpage is about. 
      • Right side subject content is inferred through the application of proprietary algorithms.
      • Subject terms added to surrogate records are weighted:
        • Doc #1: SU = dogs (.99); breeding (.87); dachshund (.30)
        • Doc #2: SU = cats (.92); dogs (.44); dachshund (.03)
        • The weights are computed by proprietary algorithm.
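The three-part pipeline described above (spidering/crawling, then indexing into surrogate records) can be sketched as a toy in-memory model. The URLs, page text, and link graph below are invented for illustration; a real crawler would also decide whether a previously seen page warrants reindexing, which is only noted in a comment here.

```python
from collections import deque

# Toy web: URL -> (page text, outgoing hyperlinks). Hypothetical data.
WEB = {
    "a.html": ("dogs and dachshund breeding", ["b.html", "c.html"]),
    "b.html": ("cats and dogs", ["a.html"]),
    "c.html": ("dachshund care", []),
}

def crawl(seed):
    """Spidering: follow every hyperlink, queueing each newly
    discovered page for indexing exactly once."""
    seen, queue, to_index = set(), deque([seed]), []
    while queue:
        url = queue.popleft()
        if url in seen:   # already indexed; a real crawler would decide
            continue      # here whether reindexing is warranted
        seen.add(url)
        to_index.append(url)
        queue.extend(WEB[url][1])
    return to_index

def build_index(urls):
    """Indexing: build an index mapping each term to the set of
    pages (surrogate records) that contain it."""
    index = {}
    for url in urls:
        for term in WEB[url][0].split():
            index.setdefault(term, set()).add(url)
    return index

index = build_index(crawl("a.html"))
print(sorted(index["dachshund"]))  # ['a.html', 'c.html']
```

Retrieval software (the third part) would then look up query terms in this index, as sketched in the next section.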
Retrieval from Search Engines
  • Unlike bibliographic databases, in which the ordering of retrieved surrogate records is reverse chronological, search engines use a relevance-based ranking.
  • The retrieval component of a search engine takes the entered query and compares it to the terms in the index.
  • The documents that are retrieved first are those that have a higher "relevance" score:
    • Doc #1: SU = dogs (.99); breeding (.87); dachshund (.30)
    • Doc #2: SU = cats (.92); dogs (.44); dachshund (.03)
    • A "dogs" query would rank document #1 ahead of document #2
    • A "breeding" query would rank document #1 ahead of document #2
    • A "cats" query would rank document #2 ahead of document #1
How are Subject Weights Calculated?
  • Conventional methods (dating from the 1950s) for automatically inferring what a document is about include the following three techniques:
    • Frequency of word occurrences
    • Location of word occurrences
    • Size of word occurrences
  • In the web era, however, these techniques did not scale well to meet the needs of databases containing billions of records:
    • Could facilitate retrieval of relevant documents, but could not distinguish between "good" and "bad" documents.
    • Were also subject to manipulation by authors desiring higher search engine retrieval (spamming)
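A toy score combining the three conventional signals shows why they were so easy to spam: an author only has to repeat a word, or place it in the title, to inflate its weight. The scoring formula and multipliers below are invented for illustration and do not correspond to any historical algorithm.

```python
def conventional_weight(term, body, title, heading_words):
    """Combine frequency, location, and size signals into one score.
    The multipliers are arbitrary illustrative choices."""
    freq = body.lower().split().count(term)              # frequency
    loc = 2.0 if term in title.lower().split() else 1.0  # location: title
    size = 1.5 if term in heading_words else 1.0         # size: heading type
    return freq * loc * size

honest = conventional_weight("dachshund",
                             "the dachshund is a small dog",
                             "Dachshund Care", {"dachshund"})
# Spamming: repeating the term many times inflates the score.
spam = conventional_weight("dachshund",
                           "dachshund " * 50,
                           "Dachshund", {"dachshund"})
print(honest, spam)  # 3.0 150.0
```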
Two responses to Early Indexing Failure
  • Yahoo! era (late 1990s)
    • Human indexing (website directories)
    • More discussion during lectures on classification. 
  • Google era (since 1999)
      • Additional criteria introduced to infer aboutness, e.g.:
        • $ - paid submissions, such as AltaVista
        • Quality - PageRank algorithm of Google
Google Approach to Automatic Indexing
  • The issue addressed by Google is the quality problem: how to cause the "best" documents to rise to the top of a set of retrieved webpages.
  • The solution involves identifying additional criteria to include in the subject-weighting algorithm.
  • Google maintains additional metadata elements for each surrogate record in its index of webpages:
    • How many other webpages link to a given webpage
      • The more webpages (i.e., linkers) a dachshund webpage has pointing to it, the more quality it has.
      • This factors into the weight assigned to the "dachshund" descriptor in the subject field of its surrogate record.
    • Who are the linkers
      • Those linkers that have a higher quality rank are given more weight than those linkers with a lower quality rank.
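The two criteria above (how many pages link to a page, and the quality of those linkers) are the intuition behind PageRank. A simplified textbook version can be sketched as a power iteration over a link graph; this is not Google's production algorithm, and the three-page graph is invented for illustration.

```python
def pagerank(links, damping=0.85, iters=50):
    """links: page -> list of pages it links to.
    Each round, every page passes a damped share of its score
    along its outlinks, so a page scores highly when it has many
    linkers and when those linkers themselves score highly."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        new = {p: (1 - damping) / len(pages) for p in pages}
        for p, outs in links.items():
            for q in outs:
                new[q] += damping * rank[p] / len(outs)
        rank = new
    return rank

# Hypothetical graph: both b and c link to a; a links only to b.
ranks = pagerank({"a": ["b"], "b": ["a"], "c": ["a"]})
print(max(ranks, key=ranks.get))  # a  (most, and best, linkers)
```

Page a ranks highest because it has two linkers, one of which (b) is itself highly ranked; c, with no linkers at all, ranks lowest.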
