Thursday, July 7, 2016

Verbal Subject Analysis III: Webpage Databases (a.k.a. "Search Engines")

Human vs. Automatic Indexing

  • Both are related to the subject analysis of information resources.
  • Human indexing is used to describe the subject analysis of various periodical databases.
  • Automatic indexing is a term used for the subject analysis operations by the computer algorithms of various webpage databases (a.k.a. search engines).
    • Research from the 1960s through the 1980s tried to get a computer to calculate what articles were about. The most frequent words (articles such as a, an, and the) don't tell you much about an article, and neither do the least used words. The key is finding the sweet spot in between, based on what the author usually writes about.
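The mid-frequency idea above (often attributed to H. P. Luhn's early work at IBM) can be sketched in a few lines of Python. This is a toy illustration, not any historical algorithm: the frequency thresholds below are arbitrary choices for the example text.

```python
from collections import Counter

def candidate_index_terms(text, low=3, high=5):
    """Keep words whose frequency falls in a middle band: the most
    frequent words (a, an, the, ...) say little about what a document
    is about, and so do words that appear only once or twice.
    The low/high cutoffs are illustrative assumptions."""
    counts = Counter(text.lower().split())
    return {w for w, n in counts.items() if low <= n <= high}

# Hypothetical sample text for the example.
doc = ("the dachshund is a dog bred to hunt the dachshund has a long "
       "body the dachshund appears in the dog shows and the dog "
       "breeders favor the dachshund")
print(sorted(candidate_index_terms(doc)))  # ['dachshund', 'dog']
```

Here "the" (6 occurrences) falls above the band and the one-off words fall below it, leaving the mid-frequency words as candidate subject terms.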
Why Webpage Database?
  • It is always important to know the documentary unit of an information database. 
  • The adjective associated with database is always a cue to the documentary unit. 
  • Webpage databases are informational databases in which a webpage is the documentary unit. 
  • They are also known as search engines and discovery databases.
Analysis of Websites and their Structure
  • What are webpages? What are websites? Webpages are to websites as pages are to books.
  • Standards (or lack thereof) for the authoring of web sites and webpages
    • HTML and other markup languages
    • Editors
  • What are the implications of the lack of authoring standards for web-based information resources?
Location of Webpage Subject Metadata
  • In webpage headers: For individual webpages, subject metadata can be created by authors and included in HTML headers.
  • In separate metadata record databases:
    • Subject metadata can be created by intermediaries using Dublin Core schema
    • In search engines, subject metadata is inferred "automatically" by computer algorithm.
Search Engine Questions
  • For greater understanding we need to be able to answer:
    • Why do search engines produce different results for the exact same query?
    • What is the principle for ranking the display of search engine records in response to a query?
The Term "Search Engine"
  • The term has become the common designation for webpage databases. In actuality, however, webpage databases have three parts:
    • Spidering/crawling software to collect webpages.
    • Indexing software to build the index of surrogate records.
    • Retrieval software to facilitate retrieval of surrogates.
Automatic Indexing in Context
  1. Obtain information resource - spidering/crawling
    • Steps for spidering/crawling:
      • Computers owned by the search engine retrieve documents by following all hyperlinks on each retrieved webpage
      • Determination is made whether a webpage needs to be indexed (because it is new) or reindexed (if it has already been indexed)
      • Determination is made whether reindexing is warranted
      • New webpages and those meeting criteria for reindexing are then placed in the indexing queue
  2. Describe information resource in surrogate record - read off webpages by indexing software
    • Left-side elements must be inferred by the searcher:
      • Examine structure of retrieved records
      • Examine advanced search interface
      • Element sets are not standard, i.e., they will vary across search engines.
    • Right-side content:
      • What is the source for the content?
      • Authority control?
  3. Subject analyze information resource in surrogate record - indexing software:
    • Verbal - inferred by computer algorithm
    • Classification - inferred by computer algorithm
    • Subject Indexing in Search Engines
      • The subject fields of webpage surrogate records include the words that describe what the webpage is about. 
      • Right side subject content is inferred through the application of proprietary algorithms.
      • Subject terms added to surrogate records are weighted:
        • Doc #1: SU = dogs (.99); breeding (.87); dachshund (.30)
        • Doc #2: SU = cats (.92); dogs (.44); dachshund (.03)
        • The weights are computed by proprietary algorithm.
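The three-part pipeline described above (spidering/crawling, then indexing into surrogate records) can be sketched as a toy in-memory model. The URLs, page text, and link graph below are invented for illustration; a real crawler would also decide whether a previously seen page warrants reindexing, which is only noted in a comment here.

```python
from collections import deque

# Toy web: URL -> (page text, outgoing hyperlinks). Hypothetical data.
WEB = {
    "a.html": ("dogs and dachshund breeding", ["b.html", "c.html"]),
    "b.html": ("cats and dogs", ["a.html"]),
    "c.html": ("dachshund care", []),
}

def crawl(seed):
    """Spidering: follow every hyperlink, queueing each newly
    discovered page for indexing exactly once."""
    seen, queue, to_index = set(), deque([seed]), []
    while queue:
        url = queue.popleft()
        if url in seen:   # already indexed; a real crawler would decide
            continue      # here whether reindexing is warranted
        seen.add(url)
        to_index.append(url)
        queue.extend(WEB[url][1])
    return to_index

def build_index(urls):
    """Indexing: build an index mapping each term to the set of
    pages (surrogate records) that contain it."""
    index = {}
    for url in urls:
        for term in WEB[url][0].split():
            index.setdefault(term, set()).add(url)
    return index

index = build_index(crawl("a.html"))
print(sorted(index["dachshund"]))  # ['a.html', 'c.html']
```

Retrieval software (the third part) would then look up query terms in this index, as sketched in the next section.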
Retrieval from Search Engines
  • Unlike bibliographic databases, in which the ordering of retrieved surrogate records is reverse chronological, search engines use a relevance-based ranking.
  • The retrieval component of a search engine takes the entered query and compares it to the terms in the index.
  • The documents that are retrieved first are those that have a higher "relevance" score:
    • Doc #1: SU = dogs (.99); breeding (.87); dachshund (.30)
    • Doc #2: SU = cats (.92); dogs (.44); dachshund (.03)
    • A "dogs" query would rank document #1 ahead of document #2
    • A "breeding" query would rank document #1 ahead of document #2
    • A "cats" query would rank document #2 ahead of document #1
How are Subject Weights Calculated?
  • Conventional methods (dating from the 1950s) for automatically inferring what a document is about include the following three techniques:
    • Frequency of word occurrences
    • Location of word occurrences
    • Size of word occurrences
  • In the web era, however, these techniques did not scale well to meet the needs of databases containing billions of records:
    • Could facilitate retrieval of relevant documents, but could not distinguish between "good" and "bad" documents.
    • Were also subject to manipulation by authors desiring higher search engine retrieval (spamming)
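A toy score combining the three conventional signals shows why they were so easy to spam: an author only has to repeat a word, or place it in the title, to inflate its weight. The scoring formula and multipliers below are invented for illustration and do not correspond to any historical algorithm.

```python
def conventional_weight(term, body, title, heading_words):
    """Combine frequency, location, and size signals into one score.
    The multipliers are arbitrary illustrative choices."""
    freq = body.lower().split().count(term)              # frequency
    loc = 2.0 if term in title.lower().split() else 1.0  # location: title
    size = 1.5 if term in heading_words else 1.0         # size: heading type
    return freq * loc * size

honest = conventional_weight("dachshund",
                             "the dachshund is a small dog",
                             "Dachshund Care", {"dachshund"})
# Spamming: repeating the term many times inflates the score.
spam = conventional_weight("dachshund",
                           "dachshund " * 50,
                           "Dachshund", {"dachshund"})
print(honest, spam)  # 3.0 150.0
```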
Two responses to Early Indexing Failure
  • Yahoo! era (late 1990s)
    • Human indexing (website directories)
    • More discussion during lectures on classification. 
  • Google era (since 1999)
      • Additional criteria introduced to infer aboutness, e.g.:
        • $ - paid submissions, such as AltaVista
        • Quality - PageRank algorithm of Google
Google Approach to Automatic Indexing
  • The issue addressed by Google is the quality problem: how to cause the "best" documents to rise to the top of a set of retrieved webpages.
  • The solution involves identifying additional criteria to include in the subject-weighting algorithm.
  • Google maintains additional metadata elements for each surrogate record in its index of webpages:
    • How many other webpages link to a given webpage
      • The more webpages (i.e., linkers) a dachshund webpage has pointing to it, the more quality it has.
      • This factors into the weight assigned to the "dachshund" descriptor in the subject field of its surrogate record.
    • Who are the linkers
      • Those linkers that have a higher quality rank are given more weight than those linkers with a lower quality rank.
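The two criteria above (how many pages link to a page, and the quality of those linkers) are the intuition behind PageRank. A simplified textbook version can be sketched as a power iteration over a link graph; this is not Google's production algorithm, and the three-page graph is invented for illustration.

```python
def pagerank(links, damping=0.85, iters=50):
    """links: page -> list of pages it links to.
    Each round, every page passes a damped share of its score
    along its outlinks, so a page scores highly when it has many
    linkers and when those linkers themselves score highly."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        new = {p: (1 - damping) / len(pages) for p in pages}
        for p, outs in links.items():
            for q in outs:
                new[q] += damping * rank[p] / len(outs)
        rank = new
    return rank

# Hypothetical graph: both b and c link to a; a links only to b.
ranks = pagerank({"a": ["b"], "b": ["a"], "c": ["a"]})
print(max(ranks, key=ranks.get))  # a  (most, and best, linkers)
```

Page a ranks highest because it has two linkers, one of which (b) is itself highly ranked; c, with no linkers at all, ranks lowest.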
