Thursday 29 March 2012

Extracting All Hyperlinks From Webpages - Python

In this example, I am going to show how easily you can extract all the links from a webpage using Python. If you are learning to write a small-scale crawler, this can be a quick starting point for extracting the links from any webpage.

Basically, we will send an HTTP request to the webpage and read the HTML response, unless the connection cannot be established. In that case, we will simply inform the user that we could not connect to the website.
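For instance, a minimal sketch of just this fetch-and-handle-errors step might look like the following (the techgaun.com URL is only a placeholder here):

#!/usr/bin/python

import urllib2

url = "http://www.techgaun.com"  # any URL would do
try:
    # Send the request and read back the raw HTML of the page
    response = urllib2.urlopen(urllib2.Request(url))
    html = response.read()
    print "Fetched %d bytes of HTML" % len(html)
except urllib2.URLError:
    print "Can't Connect to the website"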

For all of this, we will import a few modules; the most important ones are re and urllib2, for regular expressions and HTTP requests/responses respectively.

We then write the regex for the hyperlinks, which we will search for in the HTML data sent back by the server. Note the <a href=[\'"]?([^\'" >]+). The parentheses are there to let us capture the information we need, i.e. the actual links.
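To see exactly what the parentheses capture, here is a quick sketch that runs the same regex against a couple of made-up anchor tags:

import re

links_regex = re.compile('<a href=[\'"]?([^\'" >]+)', re.IGNORECASE)
# A tiny, made-up HTML snippet just for demonstration
sample = '<a href="http://www.techgaun.com/about">About</a> <a href=index.html>Home</a>'
print links_regex.findall(sample)
# prints: ['http://www.techgaun.com/about', 'index.html']

Notice that only the captured group (the URL itself) is returned, whether or not the href value is quoted.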

Now that you understand what we'll be doing, below is the Python script to extract the hyperlinks from any webpage.

#!/usr/bin/python

import re, urllib2
from sys import argv

# Use the URL given on the command line, or fall back to a default
if len(argv) != 2:
    print "No URL specified. Taking default URL for link extraction"
    url = "http://www.techgaun.com"
else:
    url = str(argv[1])

# Capture whatever follows href= inside an anchor tag
links_regex = re.compile('<a href=[\'"]?([^\'" >]+)', re.IGNORECASE)
url_request = urllib2.Request(url)
try:
    # Send the request, read the HTML and print every captured link
    response = urllib2.urlopen(url_request)
    html = response.read()
    links = links_regex.findall(html)
    print '\n'.join(links)
except urllib2.URLError:
    print "Can't Connect to the website"

Now run the script as python extracter.py http://www.techgaun.com, or with any other URL you wish.

So isn't it a good start for writing your own simple web crawler? :P