Webcrawler

<< ---------------------------------------------------------------- >>

Just gathering open information

Reqs: Scrape all the content accessible online and store all of it. Respect website crawling policies (robots.txt). Complete this process within one week.

Process of web crawling:

pull in a URL from the to-crawl list
check if we have already crawled it
check if crawling it is compliant with its host's robots.txt file
get the IP address of the host via DNS
make an HTTP request to load the contents of the page
check if we have already crawled a different URL with identical content
parse the content
store the results
add any referenced URLs to our to-crawl list
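
The steps above can be sketched as a single-threaded loop. This is only a minimal illustration under several assumptions, not a design that would meet the one-week requirement: the names `crawl`, `http_fetch`, `dns_resolve`, `extract_links`, and `USER_AGENT` are hypothetical, the to-crawl list and the store are in-memory Python structures, duplicate content is detected with a SHA-256 hash of the page body, links are pulled out with a rough regex rather than a real HTML parser, and an unreachable robots.txt is treated as "allow everything".

```python
import hashlib
import re
import socket
from collections import deque
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser

USER_AGENT = "notebook-crawler"  # hypothetical user-agent name


def http_fetch(url):
    """Make an HTTP request and return the body as text."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def dns_resolve(host):
    """Get the IP address of the host via DNS."""
    return socket.gethostbyname(host)


def extract_links(base_url, html):
    """Rough href extraction (assumption: double-quoted href attributes only)."""
    return [urljoin(base_url, href) for href in re.findall(r'href="([^"#]+)"', html)]


def crawl(seeds, fetch=http_fetch, resolve=dns_resolve, max_pages=100):
    to_crawl = deque(seeds)   # to-crawl list
    seen_urls = set()         # URLs already pulled from the list
    seen_hashes = set()       # hashes of content already stored
    robots = {}               # host root -> parsed robots.txt
    store = {}                # url -> stored result

    while to_crawl and len(store) < max_pages:
        url = to_crawl.popleft()                      # pull in a URL
        if url in seen_urls:                          # already crawled?
            continue
        seen_urls.add(url)

        parts = urlparse(url)
        if parts.scheme not in ("http", "https") or not parts.hostname:
            continue                                  # skip mailto:, javascript:, etc.

        root = f"{parts.scheme}://{parts.netloc}"
        if root not in robots:                        # load robots.txt once per host
            parser = RobotFileParser()
            try:
                parser.parse(fetch(root + "/robots.txt").splitlines())
            except OSError:
                parser.parse([])                      # assumption: no robots.txt means allow all
            robots[root] = parser
        if not robots[root].can_fetch(USER_AGENT, url):
            continue                                  # not compliant with robots.txt

        try:
            ip = resolve(parts.hostname)              # DNS lookup
            html = fetch(url)                         # HTTP request
        except OSError:
            continue                                  # unreachable host or failed request

        digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
        if digest in seen_hashes:                     # identical content under another URL
            continue
        seen_hashes.add(digest)

        links = extract_links(url, html)              # parse the content
        store[url] = {"ip": ip, "html": html, "links": links}  # store the results
        to_crawl.extend(links)                        # add referenced URLs

    return store
```

Passing `fetch` and `resolve` as parameters is just a convenience for trying the loop without network access; calling `crawl(["https://example.com/"])` with the defaults would perform real DNS lookups and HTTP requests.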
