Friday, June 15, 2007

How to count HTML links in a web page

We can parse the HTML pages of a site and then write a recursive program to count the number of HTML links present in each web page.
Here is a code snippet that will give you an idea.

Suppose the content of your web page is given as a String object in Java.
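
For illustration, such a page could be held in a String like the one below. This is only a sketch: the class name SamplePage, the field name HTML and the markup itself are assumptions, and the two URLs are the ones that appear in the output shown in this post.

```java
// Hypothetical holder for the example page content (not from the original post).
class SamplePage {
    // A minimal HTML page containing exactly two links.
    static final String HTML =
            "<html><body>\n"
            + "<a href=\"http://localhost/mypage1.html\">My page 1</a>\n"
            + "<a href=\"http://localhost/mypage2.html\">My page 2</a>\n"
            + "</body></html>";

    public static void main(String[] args) {
        // Print the page so the content can be inspected.
        System.out.println(HTML);
    }
}
```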

This program parses the above HTML page and returns the two HTML links present in it, as follows.

2
http://localhost/mypage1.html
http://localhost/mypage2.html

Here 2 is the total count of HTML links present in the web page, and the links themselves follow, one per line.

The following program gives us the HTML links and their total count.
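
A minimal sketch of such a program might look like this. It is not necessarily the original code: the class and method names (LinkCounter, extractLinks, collectLinks) are assumptions, and it only recognises lowercase, double-quoted href attributes, which is enough for the example page. It collects the links recursively into an ArrayList, then prints the count followed by each link.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical class name; a sketch of a recursive link counter.
class LinkCounter {
    private static final String MARKER = "href=\"";

    // Returns every double-quoted href value found in the page, in order.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        collectLinks(html, 0, links);
        return links;
    }

    // Finds the next href at or after 'from', stores its value,
    // then recurses on the rest of the page.
    static void collectLinks(String html, int from, List<String> links) {
        int start = html.indexOf(MARKER, from);
        if (start < 0) {
            return; // no more links: end of recursion
        }
        start += MARKER.length();
        int end = html.indexOf('"', start);
        if (end < 0) {
            return; // unterminated attribute: stop here
        }
        links.add(html.substring(start, end));
        collectLinks(html, end + 1, links);
    }

    public static void main(String[] args) {
        // Assumed sample page with two links.
        String page = "<html><body>"
                + "<a href=\"http://localhost/mypage1.html\">My page 1</a>"
                + "<a href=\"http://localhost/mypage2.html\">My page 2</a>"
                + "</body></html>";
        List<String> links = extractLinks(page);
        System.out.println(links.size());
        for (String link : links) {
            System.out.println(link);
        }
    }
}
```

Because the recursion goes one level deeper per link, a page with a very large number of links would be better served by a plain loop.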




This code can be modified to use a HashMap instead of an ArrayList. This may be helpful if we want to count only the unique links present in the page.
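
One way this could look is sketched below, under some assumptions: the names UniqueLinkCounter and countUniqueLinks are made up for illustration, a regular expression is used to find double-quoted href values, and a LinkedHashMap (a HashMap that remembers insertion order) maps each link to the number of times it occurs. The size of the map is then the number of unique links.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name; counts each distinct link once.
class UniqueLinkCounter {
    // Matches href="..." and captures the quoted value in group 1.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Maps every distinct link to the number of times it appears in the page.
    static Map<String, Integer> countUniqueLinks(String html) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            // merge() inserts 1 for a new key, otherwise adds 1 to the old value.
            counts.merge(m.group(1), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Assumed sample page: mypage1 is linked twice, mypage2 once.
        String page = "<a href=\"http://localhost/mypage1.html\">1</a>"
                + "<a href=\"http://localhost/mypage2.html\">2</a>"
                + "<a href=\"http://localhost/mypage1.html\">1 again</a>";
        Map<String, Integer> counts = countUniqueLinks(page);
        System.out.println(counts.size()); // number of unique links
        for (String link : counts.keySet()) {
            System.out.println(link);
        }
    }
}
```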

With a HashMap, we can also use this program to recursively crawl each page of a website. Each link is a unique key in the map, so we can fetch the page for each link using standard Java techniques and then apply the program again to the newly obtained page content.
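
The crawling idea can be sketched roughly as follows. Everything named here (SiteCrawler, crawl, visit, the fetcher parameter, the maxPages limit) is an assumption made for illustration. The fetcher is a function from a URL to that page's content, so that real fetching code, for example something built on java.net.URL.openStream(), can be plugged in later; the demo uses an in-memory map of made-up pages instead of the network. The map of visited URLs is what stops the recursion from following the same link twice.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name; a sketch of a recursive crawler.
class SiteCrawler {
    // Matches href="..." and captures the quoted value in group 1.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Returns a map from each visited URL to the number of links on that page.
    static Map<String, Integer> crawl(String startUrl,
                                      Function<String, String> fetcher,
                                      int maxPages) {
        Map<String, Integer> linkCounts = new LinkedHashMap<>();
        visit(startUrl, fetcher, maxPages, linkCounts);
        return linkCounts;
    }

    static void visit(String url, Function<String, String> fetcher,
                      int maxPages, Map<String, Integer> linkCounts) {
        // Stop at the page limit, and never visit the same URL twice.
        if (linkCounts.size() >= maxPages || linkCounts.containsKey(url)) {
            return;
        }
        String html = fetcher.apply(url);
        List<String> links = new ArrayList<>();
        if (html != null) { // null means the page could not be fetched
            Matcher m = HREF.matcher(html);
            while (m.find()) {
                links.add(m.group(1));
            }
        }
        // Record the page before recursing so that cycles terminate.
        linkCounts.put(url, links.size());
        for (String link : links) {
            // Links are assumed to be absolute; relative links are not resolved.
            visit(link, fetcher, maxPages, linkCounts);
        }
    }

    public static void main(String[] args) {
        // Made-up in-memory site standing in for real HTTP fetching.
        Map<String, String> site = new HashMap<>();
        site.put("http://localhost/index.html",
                "<a href=\"http://localhost/mypage1.html\">1</a>"
                + "<a href=\"http://localhost/mypage2.html\">2</a>");
        site.put("http://localhost/mypage1.html",
                "<a href=\"http://localhost/index.html\">home</a>");
        site.put("http://localhost/mypage2.html", "<p>no links here</p>");

        Map<String, Integer> result = crawl("http://localhost/index.html", site::get, 100);
        for (Map.Entry<String, Integer> e : result.entrySet()) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```

A page limit such as maxPages is worth keeping in any real crawler, since a site can link to far more pages than you intend to fetch.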
