11/4/09

C# Extract HTML Links From Website

Using C# to get a list of the active links on a web page is not as hard as it may seem. The trick is in properly processing the page's raw HTML code...

The reasons for retrieving a list of links from a web page are open-ended. Some people want to check the links on a page to make sure that none are broken. Others might need the links for SEO purposes such as PageRank checking.


Whatever the goal, C# is a good programming language to write an application capable of sorting through HTML code quickly.

C#'s .NET Framework provides programmers with two string-processing methods that will be important here: Substring and IndexOf. Substring will be used to cut out the parts of the string we want to keep and discard the parts we don't, while IndexOf gives us a way to effortlessly scan through large amounts of text.
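To see how the two methods work together, here is a minimal example that pulls one URL out of a hard-coded snippet of HTML (the sample string is illustrative, not from the article):

```csharp
using System;

class StringBasics
{
    static void Main()
    {
        string html = "<p>See <a href=\"http://example.com\">this site</a>.</p>";

        // IndexOf locates the first occurrence of a phrase (or returns -1 if absent).
        int start = html.IndexOf("href=\"");
        start += "href=\"".Length;            // jump past the key phrase to the URL itself

        // Substring(start, length) cuts out just the part we want.
        int end = html.IndexOf("\"", start);  // position of the closing quote mark
        string link = html.Substring(start, end - start);

        Console.WriteLine(link);              // http://example.com
    }
}
```

The same two calls, repeated in a loop, are all the full algorithm needs.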

So the algorithm will go as follows:

1. While-loop as long as a "key" phrase exists in the text. In this case the key phrase will be "a href=". As long as that phrase is present in the HTML code, there is a link to be found.

2. Use the IndexOf function to find the key phrase, then use the Substring function to extract the link from between the quote marks that follow it.

3. Discard all the text from the beginning of the code up to (and including) the link we just found.

Of course there are many optional steps that can be added to refine the algorithm, but those are the essential ones. Note that the last step is particularly important: if you don't routinely discard the data that has already been processed, the while-loop will get stuck on the same match forever.
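The three steps above can be sketched as follows. This is a minimal illustration, not production code: the `ExtractLinks` helper name and the sample HTML in `Main` are my own, and real pages (single quotes, unquoted attributes, uppercase tags) would need the optional refinements mentioned above:

```csharp
using System;
using System.Collections.Generic;

class LinkExtractor
{
    // Collects every href value from raw HTML using only IndexOf and Substring.
    public static List<string> ExtractLinks(string html)
    {
        var links = new List<string>();
        string key = "a href=\"";

        // Step 1: loop as long as the key phrase still exists in the remaining text.
        while (html.IndexOf(key) != -1)
        {
            // Step 2: extract the link from between the quote marks after the key phrase.
            int start = html.IndexOf(key) + key.Length;
            int end = html.IndexOf("\"", start);
            if (end == -1) break;               // malformed tag; stop scanning
            links.Add(html.Substring(start, end - start));

            // Step 3: discard everything up to and including the link we just found,
            // otherwise the loop would keep matching the same link forever.
            html = html.Substring(end + 1);
        }
        return links;
    }

    static void Main()
    {
        string html = "<a href=\"http://a.com\">A</a> <a href=\"http://b.com\">B</a>";
        foreach (string link in ExtractLinks(html))
            Console.WriteLine(link);            // prints http://a.com then http://b.com
    }
}
```

Reassigning `html` to the leftover tail at the end of each pass is what guarantees the loop terminates: every iteration either shrinks the string past one link or breaks out on a malformed tag.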

As a closing point, this article does not cover how to get access to the HTML code in the first place, since that is a whole different topic. If you are interested, a simple Google search reveals a wealth of information.
