How to find all links on a website using Python

This post demonstrates how to crawl a website and collect all of its links.

A web crawler, also known as a web spider or web robot, is a program or automated script that browses the World Wide Web in a methodical, automated manner.

The way a crawler works is simple: it starts from a given web page and fetches all the links on that page.
It then moves on to the next page and performs the same operation, and so on. The crawler maintains a stack of URLs, and as soon as a URL from the stack is visited, it is removed. The crawler keeps fetching links in this way until the stack becomes empty.
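The loop described above can be sketched in a few lines. This is only a minimal illustration, not the script from this post: the `crawl` function, the `get_links` callback and the small `SITE` dictionary are hypothetical names, and the dictionary stands in for real HTTP requests so the example runs offline.

```python
def crawl(start_url, get_links):
    """Visit every URL reachable from start_url, using a list as a stack."""
    remaining = [start_url]   # URLs still waiting to be fetched
    visited = [start_url]     # every URL discovered so far
    while remaining:
        url = remaining.pop()            # remove the URL from the stack
        for link in get_links(url):      # links found on that page
            if link not in visited:
                visited.append(link)
                remaining.append(link)
    return visited


# Hypothetical four-page site used in place of real HTTP requests.
SITE = {
    "/": ["/about", "/blog"],
    "/about": ["/"],
    "/blog": ["/about", "/blog/post-1"],
    "/blog/post-1": [],
}

found = crawl("/", lambda url: SITE.get(url, []))
print(found)  # ['/', '/about', '/blog', '/blog/post-1']
```

Because a link is only pushed when it has not been seen before, pages that link back to each other do not cause an endless loop.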





The crawler maintains two lists:
1. remaining
2. visited 

After that, the crawler keeps fetching URLs until the stack is empty. As soon as the crawler reads a URL, it pops that URL from the list and reports the number of URLs still left in the stack, i.e. in the remaining list.
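As a small illustration of that step, the sketch below pops one URL and reports how many are left. The `next_url` helper and the example.com addresses are made up for this example and are not taken from the original script.

```python
remaining = ["http://example.com/", "http://example.com/about"]


def next_url(remaining):
    """Pop the most recently added URL and report how many are still left."""
    url = remaining.pop()
    return url, len(remaining)


url, left = next_url(remaining)
print(url, left)  # http://example.com/about 1
```

Note that `list.pop()` takes the last item, which is what makes the list behave like a stack.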

Next, it finds every <a> tag and takes the link from its href attribute. Each URL is checked against the visited list, and if it is not there yet, it is appended to both the remaining and the visited list.
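One possible way to do this with only the standard library is shown below. It is a sketch under assumptions: it uses `html.parser.HTMLParser` and `urllib.parse.urljoin`, which may differ from what the original script uses, and `LinkParser`, `add_new_links` and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag fed to it."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def add_new_links(html, page_url, remaining, visited):
    """Append every link not seen before to both lists."""
    parser = LinkParser(page_url)
    parser.feed(html)
    for link in parser.links:
        if link not in visited:
            visited.append(link)
            remaining.append(link)


# Sample page: one new link, one already visited, one <a> without href.
page = ('<p><a href="/about">About</a> '
        '<a href="http://example.com/">Home</a> '
        '<a name="top">Top</a></p>')
remaining = []
visited = ["http://example.com/"]
add_new_links(page, "http://example.com/", remaining, visited)
print(remaining)  # ['http://example.com/about']
```

Relative links such as `/about` are turned into absolute URLs first, so the same page is not queued twice under two different spellings.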

Finally, it prints all the URLs from the list using a simple for loop.

The script is available on my GitHub account.


