Lesson 8. Search Engine Indexing + PageRank

Lesson Objective

Understand how search engines use indexing to store and organise web pages.
Understand the PageRank algorithm.

Lesson Notes

Search Engines

Google was created by Larry Page and Sergey Brin while they were PhD students at Stanford University in California. They launched it in 1998.

A search engine is a software system that responds to user queries by providing hyperlinks to relevant web pages.

These results are often accompanied by summaries and images, helping users find the information they seek.

Popular search engines like Google, Bing, and Yahoo use complex algorithms to rank and display the most relevant content.

Indexing

Search engine indexing is the process of discovering, storing, and organizing web page content so that it can be easily and quickly searched, analyzed, and retrieved by search engines.

The process of Indexing involves the following stages:

Web Crawlers

Search engines send out automated programs called crawlers (or spiders) to explore the web.
These crawlers visit web pages, follow links, and collect information about the content on each page.
During crawling, the search engine discovers new pages and updates information about existing ones.
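
The crawling steps above can be sketched in a few lines of Python. This is only an illustration, not how any real search engine is built: the `TOY_WEB` dictionary, its URLs and the `crawl` function are all made up for this example, and the "web" is held in memory instead of being fetched over the network.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical toy "web": URL -> HTML. A real crawler would download pages over HTTP.
TOY_WEB = {
    "a.example/home": '<a href="a.example/about">About</a> <a href="b.example/blog">Blog</a>',
    "a.example/about": '<a href="a.example/home">Home</a>',
    "b.example/blog": '<a href="a.example/home">Home</a> <a href="c.example/missing">Gone</a>',
}


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, web):
    """Visit pages breadth-first, following links and recording what each page links to."""
    seen = {start_url}
    queue = deque([start_url])
    link_graph = {}  # page -> list of pages it links to
    while queue:
        url = queue.popleft()
        html = web.get(url)
        if html is None:  # the page could not be "fetched", so skip it
            continue
        parser = LinkExtractor()
        parser.feed(html)
        link_graph[url] = parser.links
        for link in parser.links:
            if link not in seen:  # only queue pages we have not discovered yet
                seen.add(link)
                queue.append(link)
    return link_graph


graph = crawl("a.example/home", TOY_WEB)
for page, links in graph.items():
    print(page, "->", links)
```

Starting from one page, the crawler discovers the other two pages by following links, and it quietly skips the link to a page that does not exist.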

Index

Once the crawler collects data from a web page, it processes the content and creates an index.
The index is like a massive catalog of all the content available on the internet.
It includes information about words, phrases, images, videos, meta tags and descriptions built into each webpage.
The index helps search engines quickly retrieve information the user is looking for.
Other factors that affect indexing:

the keywords used in the <title> tag.
the age of the website and the date of its last update (or the frequency of updates).
the number and relevancy of keywords appearing in <h1> tags.
the relevancy of the domain name to the content.
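
One common way to picture the index is as an "inverted index": a lookup table that maps each word to the pages that contain it. The sketch below assumes a tiny set of made-up pages (`PAGES`) and a very simple way of splitting text into words. Real indexes store far more information than this.

```python
import re
from collections import defaultdict

# Hypothetical crawled pages: URL -> text found on the page.
PAGES = {
    "pizza.example/best": "The best pizza places in town",
    "pasta.example/home": "Fresh pasta and pizza recipes",
    "news.example/today": "Town news and weather",
}


def tokenize(text):
    """Split text into lowercase words, ignoring punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Build an inverted index: word -> set of URLs containing that word."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


index = build_index(PAGES)
print(sorted(index["pizza"]))  # pages that mention "pizza"
print(sorted(index["town"]))   # pages that mention "town"
```

Because the table is keyed by word, finding every page that mentions "pizza" is a single lookup instead of a scan through every page.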

Retrieval

When a user enters a search query (such as “best pizza places”), the search engine looks up the query terms in its index.
It identifies relevant pages based on the indexed content.
The search engine then ranks these pages to display the most relevant results to the user.
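
The retrieval steps above might look something like the following sketch. The `INDEX` data and the scoring rule (add up how often each query word appears on a page) are assumptions chosen to keep the example short. Real search engines combine many more signals.

```python
from collections import Counter

# Hypothetical index: word -> {URL: number of times the word appears on that page}
INDEX = {
    "best": {"pizza.example/best": 2, "reviews.example/top": 1},
    "pizza": {"pizza.example/best": 3, "pasta.example/home": 1, "reviews.example/top": 1},
    "places": {"pizza.example/best": 1},
}


def search(query, index):
    """Look up each query term in the index and rank pages by total term count."""
    scores = Counter()
    for term in query.lower().split():
        for url, count in index.get(term, {}).items():
            scores[url] += count
    # Highest score first; ties are broken alphabetically by URL.
    return sorted(scores, key=lambda url: (-scores[url], url))


results = search("best pizza places", INDEX)
for position, url in enumerate(results, start=1):
    print(position, url)
```

Here `pizza.example/best` comes first because it matches all three query words, and matches them most often.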

Page Ranking

The PageRank algorithm is a fundamental component of Google's search engine. Google uses it to rank web pages in its search results.

It's named after Larry Page, one of Google's co-founders.

PageRank measures the importance of website pages based on the number and quality of links pointing to them.

How does it work?

Imagine the entire web as a vast interconnected graph, where each web page is a node, and hyperlinks between pages are edges.

PageRank assigns a numerical weight (a score) to each page within this graph.
The score reflects the page's relative importance within the entire set of web pages.

A hyperlink from one page to another counts as a "vote" of support for the linked page.
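
To make the graph idea concrete, the sketch below stores a small made-up web as a Python dictionary and counts the "votes" (inbound links) each page receives. The page names and the `count_votes` function are hypothetical. Counting votes is only the starting point, because PageRank also takes the importance of each voting page into account.

```python
# Hypothetical link graph: page (node) -> pages it links to (outbound edges)
LINKS = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}


def count_votes(links):
    """Count how many inbound links ("votes") each page receives."""
    votes = {page: 0 for page in links}
    for source, targets in links.items():
        for target in targets:
            votes[target] = votes.get(target, 0) + 1
    return votes


votes = count_votes(LINKS)
print(votes)
```

Page C receives three votes, A and B receive one each, and D receives none because no page links to it.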

PageRank is recursive. The importance of a page depends on the importance of other pages linking to it.

If a page receives many links from other high-ranking pages, its own rank increases.
Conversely, a page with few inbound links, or with links only from low-ranking pages, receives a lower rank.

Page Ranking - Algorithm

The original PageRank algorithm is: PR(A) = (1 - d) + d × (PR(T1)/C(T1) + … + PR(Tn)/C(Tn))

PR(A) is the PageRank of page A
PR(Ti) is the PageRank of page Ti, where T1 to Tn are the pages that link to page A
d is the damping factor, a value between 0 and 1 (commonly set to 0.85)
C(Ti) is the number of outbound links on page Ti
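
Because each page's score depends on the scores of other pages, the formula is usually applied repeatedly until the values settle. The sketch below does this for a made-up three-page web. The graph, the starting score of 1.0, the number of iterations and d = 0.85 are all assumptions for illustration, and the code expects every page to have at least one outbound link.

```python
# Hypothetical link graph: page -> pages it links to
LINKS = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}


def pagerank(links, d=0.85, iterations=100):
    """Repeatedly apply PR(A) = (1 - d) + d * sum(PR(Ti) / C(Ti))."""
    pr = {page: 1.0 for page in links}  # every page starts with the same score
    for _ in range(iterations):
        new_pr = {}
        for page in links:
            # Add up PR(Ti) / C(Ti) for every page Ti that links to this page.
            incoming = sum(
                pr[source] / len(targets)
                for source, targets in links.items()
                if page in targets
            )
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr


ranks = pagerank(LINKS)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

Page C ends up with the highest score because both A and B link to it. Page A comes second because its only inbound link is from the highly ranked C, and page B comes last because it receives only half of A's vote.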

Many other factors also affect where a page appears in search results. These include:

Domain name - relevance to the search term
Frequency of search term in web page
Age of web page
Frequency of page updates
Magnitude of content updates
Keywords in Tags

Beyond PageRank...

While PageRank was the original algorithm used by Google, it's no longer the sole factor in ranking search results.

Google now employs a variety of algorithms, including machine learning models, to refine search rankings.

However, PageRank remains a foundational concept in understanding how links impact a page's visibility.

mrahmedcomputing