Sometimes we need to retrieve only the text of an HTML page from a given URL (e.g. http://msdn.microsoft.com/vbasic/).

In this article I use regular expressions to perform this task. We need to perform the following steps:

1) Download the HTML page from the given URL
2) Grab the body of the HTML
3) Perform a regular expression search and replace to remove all HTML tags
4) Perform a regular expression search and replace to remove all script blocks (JScript/VBScript)
5) Perform a regular expression search and replace to remove all HTML comments (i.e. <!-- ... -->)
6) Perform a regular expression search and replace to remove other unwanted entities (i.e. &nbsp; &lt; &gt; ...)

Step 1 : Download HTML page from given URL

Public Function GetHtmlPageSource(ByVal url As String, Optional ByVal username As _
     String = Nothing, Optional ByVal password As String = Nothing) As String
  Dim st As System.IO.Stream
  Dim sr As System.IO.StreamReader

  Try
    ' make a Web request
    Dim req As System.Net.WebRequest = System.Net.WebRequest.Create(url)
    ' if the username/password are specified, use these credentials
    If Not username Is Nothing AndAlso Not password Is Nothing Then
      req.Credentials = New System.Net.NetworkCredential(username, _
       password)
    End If
    ' get the response and read from the result stream
    Dim resp As System.Net.WebResponse = req.GetResponse
    st = resp.GetResponseStream
    sr = New System.IO.StreamReader(st)
    ' read all the text in it
    Return sr.ReadToEnd
  Catch ex As Exception
    Return ""
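  Finally
    ' always close readers and streams; either may still be Nothing if the request failed
    If Not sr Is Nothing Then sr.Close()
    If Not st Is Nothing Then st.Close()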
  End Try
End Function
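
As a side note on the cleanup in the Finally block, here is a minimal sketch of the same read-everything step written with a Using block, which disposes the reader automatically. It assumes VB 2005 or later (the Using statement is not available in earlier versions), and ReadAllText is a hypothetical helper name, not part of the article's class. A MemoryStream stands in for the response stream so the sketch runs without network access.

```vbnet
Imports System
Imports System.IO
Imports System.Text

Module StreamReadDemo
    ' ReadAllText is a hypothetical helper used only for this illustration.
    ' The Using block closes the reader (and the underlying stream)
    ' even if ReadToEnd throws.
    Function ReadAllText(ByVal source As Stream) As String
        Using sr As New StreamReader(source)
            Return sr.ReadToEnd()
        End Using
    End Function

    Sub Main()
        ' A MemoryStream stands in for resp.GetResponseStream here.
        Dim bytes As Byte() = Encoding.UTF8.GetBytes("<html><body>ok</body></html>")
        Console.WriteLine(ReadAllText(New MemoryStream(bytes)))
    End Sub
End Module
```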

Step 2 : Grab Body of HTML

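Private Function GetHTMLBody(ByRef strInput As String) As String
  '(?<bodystart><\s*(body)((.|\n)*?)\s*>)(?<body>(.|\n)*?)(?<bodyend><\s*\/(body)\s*>)
  Dim strBodyRegX As String
  strBodyRegX = "<\s*body(.|\n)*?\s*>((.|\n)*?)<\s*\/body\s*>"
  ' IgnoreCase so that <BODY> is matched as well as <body>
  Dim re As New System.Text.RegularExpressions.Regex(strBodyRegX, System.Text.RegularExpressions.RegexOptions.IgnoreCase)
  ' Read group 2 (the text between the body tags) directly; Replace would keep
  ' everything outside the match, such as the <head> section.
  Dim m As System.Text.RegularExpressions.Match = re.Match(strInput)
  ' if no <body> element is found, fall back to the whole input
  GetHTMLBody = strInput
  If m.Success Then GetHTMLBody = m.Groups(2).Value
End Function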
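
To see body extraction on its own, here is a minimal sketch that runs against an in-memory string instead of a downloaded page. ExtractBody and the sample markup are hypothetical, and the pattern is a simplified variant (a named group and [\s\S] in place of (.|\n)), not the exact expression used above.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module BodyExtractDemo
    ' ExtractBody is a hypothetical helper name used only for this illustration.
    Function ExtractBody(ByVal html As String) As String
        ' Capture everything between the opening and closing body tags.
        Dim re As New Regex("<\s*body[^>]*>(?<body>[\s\S]*?)<\s*/\s*body\s*>", RegexOptions.IgnoreCase)
        Dim m As Match = re.Match(html)
        If m.Success Then Return m.Groups("body").Value
        ' No body element found: return the input unchanged.
        Return html
    End Function

    Sub Main()
        Dim page As String = "<html><head><title>T</title></head><BODY class=""x""><p>Hi</p></BODY></html>"
        Console.WriteLine(ExtractBody(page))   ' prints <p>Hi</p>
    End Sub
End Module
```

The title text in the head section is not part of the result, because only the captured group is returned.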

Step 3/4/5/6 : Perform regular Expression Search & Replace to clean all HTML tags/Script Blocks/Comments/other words

This is the tricky part of the regular expression search and replace. To handle each kind of match separately, you need to define a delegate for match handling. The delegate must be of type MatchEvaluator, and its constructor takes one argument: the function that will handle each match. Check the following declaration of the MatchEvaluator.

Dim MatchDelegate As New MatchEvaluator(AddressOf MatchHandler)
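
To make the delegate mechanism concrete, here is a minimal, self-contained sketch that is independent of HTML. The pattern, the group names word and num, and the functions DemoHandler and Transform are all hypothetical and chosen only for illustration. Regex.Replace calls the handler once per match, and the handler checks which named group succeeded to decide on the replacement text.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module MatchEvaluatorDemo
    ' Hypothetical handler: upper-cases words and masks numbers.
    Function DemoHandler(ByVal m As Match) As String
        If m.Groups("word").Success Then
            Return m.Value.ToUpper()
        ElseIf m.Groups("num").Success Then
            Return "#"
        End If
        ' Neither group matched: leave the text unchanged.
        Return m.Value
    End Function

    Function Transform(ByVal input As String) As String
        Dim re As New Regex("(?<word>[a-z]+)|(?<num>\d+)")
        ' The delegate wraps the handler; Replace invokes it for every match.
        Return re.Replace(input, New MatchEvaluator(AddressOf DemoHandler))
    End Function

    Sub Main()
        Console.WriteLine(Transform("abc 123 de"))   ' prints ABC # DE
    End Sub
End Module
```

Testing Group.Success rather than comparing Value with an empty string also distinguishes a group that matched an empty string from one that did not take part in the match.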


'//Returns text without HTML tags
Function ProcessHTML(ByRef strInput As String) As String
  Dim strRegX As String
  '//This regx will find        //
  '=====================================
  '-->all html tag <xxx></xxx>
  '-->all script blocks <script></script>
  '-->all comments <!-- dfvcvc -->
  '-->all &nbsp; &lt; &gt; ...
  Dim sb As New System.Text.StringBuilder

  strRegX = "(?<script><\s*script(.|\n)*?\s*>((.|\n)*?)<\s*\/script\s*>)" & vbCrLf & _
      "|(?<com><!--[\s\S]*?-->)                 (?#ASP/ASP.net/HTML block comment)" & vbCrLf & _
      "|(?<nbsp>&nbsp;)" & vbCrLf & _
      "|(?<tag><(.|\n)+?>)(?#strip html tags)" & vbCrLf & _
      "|(?<gt>&gt;)" & vbCrLf & _
      "|(?<lt>&lt;)"
  Dim re As New System.Text.RegularExpressions.Regex(strRegX, RegexOptions.IgnoreCase Or RegexOptions.IgnorePatternWhitespace Or RegexOptions.Multiline)
  sb.Append(re.Replace(GetHTMLBody(strInput), MatchDelegate))
  ProcessHTML = sb.ToString
End Function

Private Function MatchHandler(ByVal m As Match) As String
  If m.Groups("script").Value <> "" Then
    MatchHandler = " "
  ElseIf m.Groups("com").Value <> "" Then
    MatchHandler = " "
  ElseIf m.Groups("nbsp").Value <> "" Then
    MatchHandler = " "
  ElseIf m.Groups("tag").Value <> "" Then
    MatchHandler = " "
  ElseIf m.Groups("gt").Value <> "" Then
    MatchHandler = " "
  ElseIf m.Groups("lt").Value <> "" Then
    MatchHandler = " "
  Else
    MatchHandler = m.ToString
  End If
End Function
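
The cleaning step can be tried without downloading anything. The following is a minimal sketch under stated assumptions: StripHtml is a hypothetical helper, the pattern is a condensed single-line variant of the one above, and because every branch of the handler returns a single space, the sketch passes a plain replacement string instead of a MatchEvaluator. The final whitespace collapse is an optional extra step that is not part of the original code.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module StripDemo
    ' StripHtml is a hypothetical helper name used only for this illustration.
    Function StripHtml(ByVal html As String) As String
        ' Script blocks and comments come first so their contents are removed too.
        Dim pattern As String = "<\s*script[^>]*>[\s\S]*?<\s*/\s*script\s*>" & _
            "|<!--[\s\S]*?-->" & _
            "|<[^>]+>" & _
            "|&nbsp;|&gt;|&lt;"
        Dim text As String = Regex.Replace(html, pattern, " ", RegexOptions.IgnoreCase)
        ' Optional: collapse the runs of spaces left behind by the replacements.
        Return Regex.Replace(text, "\s+", " ").Trim()
    End Function

    Sub Main()
        Dim sample As String = "<p>Hello&nbsp;<b>world</b></p><!-- note --><script>var x = 1;</script>"
        Console.WriteLine(StripHtml(sample))   ' prints Hello world
    End Sub
End Module
```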

How to execute the code

Dim strHTML, strOnlyText As String
' WebRequest.Create needs an absolute URL, including the scheme
strHTML = GetHtmlPageSource("http://www.msn.com")
strOnlyText = ProcessHTML(strHTML)

or wrap in a function

Public Function GetOnlyTextFromHTML(ByVal URL As String) As String
  GetOnlyTextFromHTML = ProcessHTML(GetHtmlPageSource(URL))
End Function

I hope you enjoy this article.

The full class implementation of the source code is given below.

CSpider.vb

'//Author: Nayan S. patel
'//Date : 6/1/2004
'//Copyright © 2004 Reserved

Imports System.Text.RegularExpressions
Public Class CSpider
  Dim MatchDelegate As New MatchEvaluator(AddressOf MatchHandler)
  Public Function GetOnlyTextFromHTML(ByVal URL As String) As String
    GetOnlyTextFromHTML = ProcessHTML(GetHtmlPageSource(URL))
  End Function

  Public Function GetHtmlPageSource(ByVal url As String, Optional ByVal username As _
     String = Nothing, Optional ByVal password As String = Nothing) As String
    Dim st As System.IO.Stream
    Dim sr As System.IO.StreamReader

    Try
      ' make a Web request
      Dim req As System.Net.WebRequest = System.Net.WebRequest.Create(url)
      ' if the username/password are specified, use these credentials
      If Not username Is Nothing AndAlso Not password Is Nothing Then
        req.Credentials = New System.Net.NetworkCredential(username, _
         password)
      End If
      ' get the response and read from the result stream
      Dim resp As System.Net.WebResponse = req.GetResponse
      st = resp.GetResponseStream
      sr = New System.IO.StreamReader(st)
      ' read all the text in it
      Return sr.ReadToEnd
    Catch ex As Exception
      Return ""
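    Finally
      ' always close readers and streams; either may still be Nothing if the request failed
      If Not sr Is Nothing Then sr.Close()
      If Not st Is Nothing Then st.Close()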
    End Try
  End Function
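  Private Function GetHTMLBody(ByRef strInput As String) As String
    '(?<bodystart><\s*(body)((.|\n)*?)\s*>)(?<body>(.|\n)*?)(?<bodyend><\s*\/(body)\s*>)
    Dim strBodyRegX As String
    strBodyRegX = "<\s*body(.|\n)*?\s*>((.|\n)*?)<\s*\/body\s*>"
    ' IgnoreCase so that <BODY> is matched as well as <body>
    Dim re As New System.Text.RegularExpressions.Regex(strBodyRegX, RegexOptions.IgnoreCase)
    ' Read group 2 (the text between the body tags) directly; Replace would keep
    ' everything outside the match, such as the <head> section.
    Dim m As Match = re.Match(strInput)
    ' if no <body> element is found, fall back to the whole input
    GetHTMLBody = strInput
    If m.Success Then GetHTMLBody = m.Groups(2).Value
  End Function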

  '//Returns text without HTML tags
  Public Function ProcessHTML(ByRef strInput As String) As String
    Dim strRegX As String
    '//This regx will find        //
    '=====================================
    '-->all html tag <xxx></xxx>
    '-->all script blocks <script></script>
    '-->all comments <!-- dfvcvc -->
    '-->all &nbsp; &lt; &gt; ...
    Dim sb As New System.Text.StringBuilder

    strRegX = "(?<script><\s*script(.|\n)*?\s*>((.|\n)*?)<\s*\/script\s*>)" & vbCrLf & _
        "|(?<com><!--[\s\S]*?-->)                 (?#ASP/ASP.net/HTML block comment)" & vbCrLf & _
        "|(?<nbsp>&nbsp;)" & vbCrLf & _
        "|(?<tag><(.|\n)+?>)(?#strip html tags)" & vbCrLf & _
        "|(?<gt>&gt;)" & vbCrLf & _
        "|(?<lt>&lt;)"
    Dim re As New System.Text.RegularExpressions.Regex(strRegX, RegexOptions.IgnoreCase Or RegexOptions.IgnorePatternWhitespace Or RegexOptions.Multiline)
    sb.Append(re.Replace(GetHTMLBody(strInput), MatchDelegate))
    ProcessHTML = sb.ToString
  End Function
  Private Function MatchHandler(ByVal m As Match) As String
    If m.Groups("script").Value <> "" Then
      MatchHandler = " "
    ElseIf m.Groups("com").Value <> "" Then
      MatchHandler = " "
    ElseIf m.Groups("nbsp").Value <> "" Then
      MatchHandler = " "
    ElseIf m.Groups("tag").Value <> "" Then
      MatchHandler = " "
    ElseIf m.Groups("gt").Value <> "" Then
      MatchHandler = " "
    ElseIf m.Groups("lt").Value <> "" Then
      MatchHandler = " "
    Else
      MatchHandler = m.ToString
    End If
  End Function
End Class

Happy Programming........


Submitted By : Nayan Patel  (Member Since : 5/26/2004 12:23:06 PM)

Job Description : He is the moderator of this site and currently works as an independent consultant. He works with VB.net/ASP.net, SQL Server and other MS technologies. He is MCSD.net, MCDBA and MCSE. In his free time he likes to watch funny movies and do oil painting.



© 2008 BinaryWorld LLC. All rights reserved.