Remove HTML with Regular Expressions

When working with websites, you often need to strip HTML tags, tag attributes, or the complete contents of an HTML tag from some text. Regular expressions can make this very easy, so we thought we would share a few that we use all the time.

Find HTML Tags

<.*?>

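This expression finds every opening and closing HTML tag, with or without attributes, so you can use it to strip all tags from an input string. The lazy quantifier .*? stops at the first > it meets, so each tag is matched on its own. By default . does not match a newline, so a tag that is split across lines is not matched.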
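
As a rough illustration of the expression in use, here is a minimal sketch that replaces every match with an empty string. The class name HtmlText and the method name StripTags are made up for this example and are not part of any library.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlText
{
    // Hypothetical helper (not from the original post): removes anything that
    // looks like a tag by replacing every match of <.*?> with an empty string.
    public static string StripTags(string html)
    {
        return Regex.Replace(html, "<.*?>", string.Empty);
    }

    public static void Main()
    {
        string html = "<p>Hello <a href=\"#\">world</a></p>";
        Console.WriteLine(StripTags(html)); // prints "Hello world"
    }
}
```

This only removes the tags themselves. The text between them, including the contents of elements such as script or style, is left in place.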

Find HTML Tag and Content

<head.*?>(.|\n)*?</head>

This expression matches an opening <head> tag, everything that follows it, and the closing </head> tag. The (.|\n)*? part lets the match run across line breaks. It gives us the option to remove the complete <head> section from a document.

Using the Regular Expressions

The following C# code uses the second regular expression to remove the <head> element from the HTML content by replacing it with an empty string:

using System.Text.RegularExpressions;
...
string content = "<html><head><title>Using Regular Expressions</title></head><body><h1>Using Regular Expressions</h1><p>Regular expressions are really quite powerful and can make replacing HTML really easy.</p></body></html>";

// match the opening <head> tag, everything inside it, and the closing </head> tag
string pattern = "<head.*?>(.|\n)*?</head>";
string replacedContent = Regex.Replace(content, pattern, string.Empty);
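
A possible variation, shown here only as a sketch: instead of (.|\n)*?, the RegexOptions.Singleline option lets . match newlines directly. The pattern below is also tightened with \b and [^>]* so that a <header> tag is not mistaken for <head>. The names HeadRemover and RemoveHead are hypothetical, and the pattern is a swapped-in alternative rather than the one used above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HeadRemover
{
    // Hypothetical helper: removes the <head> element, including multi-line content.
    public static string RemoveHead(string html)
    {
        // \b stops <header> from matching; [^>]* skips any attributes.
        // Singleline lets . match newlines; IgnoreCase also matches <HEAD>.
        return Regex.Replace(
            html,
            @"<head\b[^>]*>.*?</head>",
            string.Empty,
            RegexOptions.Singleline | RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        string html = "<html><head>\n<title>Demo</title>\n</head><body><header>Hi</header></body></html>";
        Console.WriteLine(RemoveHead(html));
        // prints "<html><body><header>Hi</header></body></html>"
    }
}
```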

To remove all HTML attributes from some HTML you could use the first regular expression and a MatchEvaluator:

string content = "<div class=\"a-class\" id=\"an-id\">Strip <em style=\"color:#0f0\">any</em> HTML attributes from this content</div>";

string pattern = "<.*?>";
string filteredContent = System.Text.RegularExpressions.Regex.Replace(content, pattern, delegate(System.Text.RegularExpressions.Match match)
{
	// called once for each match, i.e. once per tag
	string m = match.ToString();
	// drop everything after the first space, which is where the attributes start
	int spacePosition = m.IndexOf(" ");
	if (spacePosition >= 0)
	{
		return m.Substring(0, spacePosition) + ">";
	}
	else
	{
		// no space means no attributes (for example </em>), so keep the tag as it is
		return m;
	}
});
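
To make the result easier to see, here is a small sketch that wraps the same idea in a method and prints the output. It uses a lambda in place of the anonymous delegate; the names AttributeStripper and StripAttributes are invented for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AttributeStripper
{
    // Hypothetical helper: cuts each matched tag at its first space.
    public static string StripAttributes(string html)
    {
        return Regex.Replace(html, "<.*?>", match =>
        {
            string tag = match.Value;
            int spacePosition = tag.IndexOf(' ');
            // No space means no attributes (for example </em>), so keep the tag.
            return spacePosition >= 0 ? tag.Substring(0, spacePosition) + ">" : tag;
        });
    }

    public static void Main()
    {
        string html = "<div class=\"a-class\" id=\"an-id\">Strip <em style=\"color:#0f0\">any</em> attributes</div>";
        Console.WriteLine(StripAttributes(html));
        // prints "<div>Strip <em>any</em> attributes</div>"
    }
}
```

Because the tag is cut at the first space, a self-closing tag written as <br /> comes out as <br>, and a tag whose name is followed by a tab or line break instead of a space is left unchanged.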