Need regular expression to find all phrases in html [duplicate] - c#

This question already has an answer here:
Regex to remove all spans from HTML keeping inner text as it is
(1 answer)
Closed 7 years ago.
I parse HTML (held as a string in C# code) and need to get all phrases from it. For example, given this HTML:
<div><div>text1</div>text2</div>
I want to get an array of strings:
text1
text2
If a regular expression is impossible, please provide an algorithm that skips all tag names and tag attributes and returns only the text content.
Update: this is not a duplicate of the span question, because the text can be in any tag, not only span. I need all the text, without tags and attributes. I don't want to use the HtmlAgilityPack parser.
Update 2: I found a regex (yes, it is possible):
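// Parse the HTML and save each non-empty text node in the list.
// Requires: using System.Text.RegularExpressions;
public void FindTextHtml(string html, List<string> list)
{
    var ms = Regex.Matches(html, @">([^<>]*)<", RegexOptions.IgnoreCase | RegexOptions.Multiline);
    foreach (Match m in ms)
    {
        var text = m.Groups[1].Value;
        // Adjacent tags such as "<div><div>" produce an empty match; skip those.
        if (!string.IsNullOrWhiteSpace(text))
            list.Add(text);
    }
}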
Full source code available here

What you are looking for is here: Grabbing HTML Tags
The matches you are looking for would be in the ...(.*?)... group. Hope this helps

Use the HtmlAgilityPack library to parse the HTML (or XML) file, then use the code below to get your text:
string path = @"path to the file";
HtmlAgilityPack.HtmlDocument hd = new HtmlAgilityPack.HtmlDocument();
hd.Load(path);
string result = hd.DocumentNode.InnerText.Trim();
That is all you need.

Related

RegEx to pull out specific URL format from HTML source

I'm having problems with RegEx and trying to pull out a specifically formatted HTML link from a page's HTML source.
The HTML source contains many of these links. The link is in the format:
<a class="link" href="pagedetail.html?record_id=123456">RecordName</a>
For each matching link, I would like to be able to easily extract the following two bits of information:
The URL bit. E.g. pagedetail.html?record_id=123456
The link name. E.g. RecordName
Can anyone please help with this as I'm completely stuck. I'm needing this for a C# program so if there is any C# specific notation then that would be great. Thanks
TIA
People will tell you that you should not parse HTML with regex, and I think that is a valid statement.
But sometimes, with well-formed HTML and a really simple case like yours seems to be, you can use a regex to do the job.
For example you can use this regex and obtain group 1 for the URL and group 2 for the RecordName
<a class="link" href="([^"]+)">([^<]+)<
DEMO
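A minimal sketch of how the regex above could be used from C#, assuming the links always have exactly the attribute order and quoting shown in the question. The class name LinkExtractor and the method Extract are hypothetical names chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: the names are illustrative, not from the original answer.
public static class LinkExtractor
{
    // Group 1 captures the href value, group 2 captures the link text.
    private static readonly Regex LinkRegex =
        new Regex("<a class=\"link\" href=\"([^\"]+)\">([^<]+)<");

    // Returns (url, name) pairs for every matching link in the HTML.
    public static List<KeyValuePair<string, string>> Extract(string html)
    {
        var links = new List<KeyValuePair<string, string>>();
        foreach (Match m in LinkRegex.Matches(html))
        {
            links.Add(new KeyValuePair<string, string>(
                m.Groups[1].Value, m.Groups[2].Value));
        }
        return links;
    }

    public static void Main()
    {
        // Sample link taken from the question.
        var html = "<a class=\"link\" href=\"pagedetail.html?record_id=123456\">RecordName</a>";
        foreach (var link in Extract(html))
        {
            Console.WriteLine(link.Key + " -> " + link.Value);
        }
    }
}
```

Any change in the markup (extra attributes, single quotes, a different attribute order) makes this pattern miss the link, which is the usual argument for a real HTML parser.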
I feel a bit silly answering this, because it should be evident through the two comments to your question, but...
You should not parse HTML with REGEX!
Use an XML parser if the markup is well-formed, or better yet a dedicated HTML parser like the HTML Agility Pack, which tolerates malformed markup and lets you query the document with XPath.
You can use the TagRegex and EndTagRegex classes (from System.Web.RegularExpressions) to scan the HTML string and find the tag you want. You iterate through the string position by position and try to match a tag at each one.
e.g.
var position = 0;
var tagRegex = new TagRegex();
var endTagRegex = new EndTagRegex();
while (position < html.Length)
{
    // Try to match an opening tag at the current position.
    var match = tagRegex.Match(html, position);
    if (match.Success)
    {
        var tagName = match.Groups["tagname"].Value;
        if (tagName == "a")
        { ... }
    }
    else
    {
        // Otherwise try a closing tag, and read the name from that match.
        var endMatch = endTagRegex.Match(html, position);
        if (endMatch.Success)
        {
            var tagName = endMatch.Groups["tagname"].Value;
            if (tagName == "a")
            { ... }
        }
    }
    position++;
}

Reasons of working slowly and solution of that [duplicate]

This question already has answers here:
RegEx match open tags except XHTML self-contained tags
(35 answers)
Closed 8 years ago.
Here is my function, which uses a regex. It works correctly, but it extracts the tags very slowly.
I think it searches the HTML character by character, which is why it is slow. Is there any way to speed it up?
string s = Sourcecode(richTextBox6.Text);
// Gets everything between <a ...> and </a> (tags included).
Regex regex = new Regex("(?i)<a([^>]+)>(.+?)</a>");
string gelen = s;
string inside = null;
Match match = regex.Match(gelen);
if (match.Success)
{
inside= match.Value;
richTextBox2.Text = inside;
}
string outputStr = "";
foreach (Match ItemMatch in regex.Matches(gelen))
{
Console.WriteLine(ItemMatch);
inside = ItemMatch.Value;
// Append a line break so the next match goes on its own line.
outputStr += inside + "\r\n";
}
richTextBox2.Text = outputStr;
Change outputStr to a StringBuilder; if you are appending a large number of items, this will speed things up. As already mentioned, parsing HTML with a regex might be an issue (it depends a lot on your input).
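A rough sketch of the StringBuilder suggestion, reusing the pattern from the question. The names AnchorCollector and CollectAnchors are hypothetical, and the regex is kept in a static field with RegexOptions.Compiled on the assumption that the function is called many times; whether this helps in practice depends on the input.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper: the names are illustrative only.
public static class AnchorCollector
{
    // Same pattern as in the question, constructed once and reused.
    private static readonly Regex AnchorRegex =
        new Regex("(?i)<a([^>]+)>(.+?)</a>", RegexOptions.Compiled);

    // Returns every matched <a ...>...</a> element, one per line.
    public static string CollectAnchors(string html)
    {
        var output = new StringBuilder();
        foreach (Match m in AnchorRegex.Matches(html))
        {
            // Append grows an internal buffer instead of copying the
            // whole accumulated string on every iteration, as += does.
            output.Append(m.Value).Append("\r\n");
        }
        return output.ToString();
    }

    public static void Main()
    {
        var html = "<a href=\"x\">one</a><p>skip</p><A href=\"y\">two</A>";
        Console.Write(CollectAnchors(html));
    }
}
```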
The trouble with parsing HTML is that it isn't an exact science. If it was XHTML that you were parsing, then things would be a lot easier.
Because HTML isn't necessarily well-formed XML you will come into lots of problems trying to parse it.
It almost needs to be done on a site-by-site basis.
You should not parse HTML using Regex (although you can use a compiled Regex in the code above to make it a bit quicker).
Regex is not built for parsing HTML. You can use a third-party library that is built specifically for this purpose.
List of HTML Parsing Libraries
If you don't want to use 3rd party libraries, then you can use the System.Windows.Forms.WebBrowser for this purpose.
You can also use Fizzler; it builds on the HTML Agility Pack and adds jQuery-style CSS selectors.
Then there is Majestic-12 HTML Parse, which is very quick.
You can also use TidyNet.Tidy to convert the HTML to XHTML, and then use an XML parser.
Check the following example on how improper usage of Regex can degrade performance.

Regex for Removing Comma between <a> tag text C#

I have the following string. I tried many regexes to remove the comma inside the a tag's text, but none of them worked. Whenever the text inside an a tag contains a comma, I want the comma replaced by an empty string.
Getty Center, Restaurant at the
I have tried this regex, but it is not working. Here input is a string that contains the HTML.
input = Regex.Replace(input, @"<a(\s+[^>]*)?>[^\w\s]</a(\s+[^>]*)?>", "");
Please help me out. Thank you.
You can use the Regex to find and modify the content of the tag like so.
var input = "Getty Center, Restaurant at the";
var regex = new Regex(@"<a[^>]*>(?<content>.*?)</a[^>]*>",
    RegexOptions.Singleline);
var match = regex.Match(input);
while (match.Success) {
var group = match.Groups["content"];
input = input.Substring(0, group.Index)
+ group.Value.Replace(",", "")
+ input.Substring(group.Index + group.Length);
match = regex.Match(input, group.Index);
};
The loop is in place to catch multiple tags in the same string. The Regex however is fairly naive. It will mess with tags nested inside the A tag, and will parse incorrectly if a > is in any of the attributes. (Though that would probably be bad HTML anyway.) A proper HTML parser is recommended for this reason.
I would suggest using an HTML parser. Plenty of free, open-source ones are available; one of the best I have found is HtmlAgilityPack.
Some examples at Some Examples
In a nutshell, the following code snippet will give you all img tags:
HtmlDocument myDoc = new HtmlDocument();
myDoc.Load(path);
HtmlNodeCollection imgs = myDoc.DocumentNode.SelectNodes("//img");
Hope that helps
If you want to directly use the replace, you will have to match only the comma and not the text before or after the comma. You'd have to use look ahead and look behind to check if the comma is in the tag. Although this is doable, it is not advised to do this.
An alternative is to use matching groups to match the whole text in the tag and group the comma if it exists and replace the match.
<a[^>]+>[\w\s]*(,?)[\w\s]*<\/a>
The first capture group captures the comma if present. You can test it here: http://rubular.com/r/K2jjIaObty
The best option would be to use a html parser to capture contents of the a tag, search for comma and replace.
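As an alternative to lookarounds, the "match the whole tag, then fix its text" idea can be sketched with a MatchEvaluator passed to Regex.Replace. This is only a sketch under assumptions: AnchorCommaRemover and RemoveCommas are hypothetical names, the href="#" in the sample is made up because the original link markup is not shown, and a tags nested inside other a tags are not handled.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: the names are illustrative only.
public static class AnchorCommaRemover
{
    // Group 1: opening tag, group 2: inner text, group 3: closing tag.
    // \b keeps the pattern from matching other tags that start with "a".
    private static readonly Regex AnchorRegex = new Regex(
        @"(<a\b[^>]*>)(.*?)(</a>)",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Removes commas from the text of every a element, leaving
    // commas outside the element untouched.
    public static string RemoveCommas(string html)
    {
        return AnchorRegex.Replace(html, m =>
            m.Groups[1].Value
            + m.Groups[2].Value.Replace(",", "")
            + m.Groups[3].Value);
    }

    public static void Main()
    {
        // The href value is an assumption made for this example.
        var input = "x, <a href=\"#\">Getty Center, Restaurant at the</a>, y";
        Console.WriteLine(RemoveCommas(input));
    }
}
```

Because the evaluator runs once per match, multiple links in the same string are handled without an explicit loop.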

C# Remove Span Tags I Insert

Via a WYSIWYG text editor I insert span tags with a class of "comment". I want to remove every instance of these span tags, together with their contents, from a string.
So how do I get from here:
string content = "<p>sadf<span class=\"otherclass\"><span class=\"comment\">asdfsdafsdafsadfsdf</span></span></p>";
to here:
content = "<p>sadf<span class=\"otherclass\"></span></p>";
I know about the HTMLAgilityPack but don't want to add the overhead for HTML that I control. I prefer a regex solution.
EDIT: I only want to remove spans with the "comment" class.
Inadequate answer:
content = Regex.Replace(content, @"<span\s+class=""comment"">.*?</span>", "");
The regex to filter your string would be <span\s+class=\"comment.*?span>. You might be interested in trying RegexBuddy; it has helped me figure out my regexes very nicely.
Capture the match as a string and replace it in your content string.
Edited after I realized that you just needed to remove the <span class="comment"></span>, like BLUEPIXY did.
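For reference, a small sketch that runs the Regex.Replace call above against the sample string from the question. CommentSpanRemover and Remove are hypothetical names, and RegexOptions.Singleline is an addition not present in the original call, included so that comments spanning several lines are also matched. The pattern stops at the first closing span, so it assumes comment spans never contain other spans.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: the names are illustrative only.
public static class CommentSpanRemover
{
    // Lazy .*? stops at the first </span> after the opening comment span.
    private static readonly Regex CommentSpanRegex = new Regex(
        @"<span\s+class=""comment"">.*?</span>",
        RegexOptions.Singleline);

    // Removes every span with class "comment", including its contents.
    public static string Remove(string content)
    {
        return CommentSpanRegex.Replace(content, "");
    }

    public static void Main()
    {
        // Sample string taken from the question.
        string content = "<p>sadf<span class=\"otherclass\"><span class=\"comment\">asdfsdafsdafsadfsdf</span></span></p>";
        Console.WriteLine(Remove(content));
    }
}
```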

C# extracting certain parts of a string

I have a console application which is parsing HTML documents via the WebRequest method (http). The issue is really with extracting data from the html code that is returned.
Below is a fragment of the html I am interested in:
<span class="header">Number of People:</span>
<span class="peopleCount">1001</span> <!-- this is the line we are interested in! -->
<span class="footer">As of June 2009.</span>
Assume that the above html is contained in a string called "responseHtml". I would like to just extract the 'People Count' value, (second line).
I've searched Stack Overflow and found some code that could work:
How do I extract text that lies between parentheses (round brackets)?
But when I implement it, it doesn't work - I don't think it likes the way I have placed HTML tags into the regex:
string responseHtml; // this is already filled with html code above ^^
string insideBrackets = null;
Regex regex = new Regex("\\<span class=\"peopleCount\">?<TextInsideBrackets>\\w+\\</span>");
Match match = regex.Match(responseHtml);
if (match.Success)
{
insideBrackets = match.Groups["TextInsideBrackets"].Value;
Console.WriteLine(insideBrackets);
}
The above just fails to work, is it something to do with the html span brackets? All I want is the text value in between the tags for that specific span.
Thanks in advance!
Try this one:
Regex regex = new Regex("class=\\\"peopleCount\\\"\\>(?<data>[^\\<]*)",
RegexOptions.CultureInvariant
| RegexOptions.Compiled
);
It should be a tad faster, as you are basically saying the data you are looking for starts after peopleCount"> and ends at the first <
(I changed the group name to data)
Cheers,
Florian
?<TextInsideBrackets> is incorrect
You need:
(?<TextInsideBrackets>...)
I assume you want to do a named capture.
You should use
Regex regex = new Regex("\\<span class=\"peopleCount\">(?<TextInsideBrackets>\\w+)\\</span>");
and not
Regex regex = new Regex("\\<span class=\"peopleCount\">?<TextInsideBrackets>\\w+\\</span>");
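Putting the corrected named capture together with the HTML fragment from the question gives roughly the following. PeopleCountParser and Extract are hypothetical names; the pattern drops the backslashes before the angle brackets, since they need no escaping, and it assumes the span is written exactly as in the fragment.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: the names are illustrative only.
public static class PeopleCountParser
{
    // (?<TextInsideBrackets>...) is a named capture group.
    private static readonly Regex CountRegex = new Regex(
        "<span class=\"peopleCount\">(?<TextInsideBrackets>\\w+)</span>");

    // Returns the text inside the peopleCount span, or null if absent.
    public static string Extract(string responseHtml)
    {
        Match match = CountRegex.Match(responseHtml);
        return match.Success
            ? match.Groups["TextInsideBrackets"].Value
            : null;
    }

    public static void Main()
    {
        // Fragment taken from the question.
        string responseHtml =
            "<span class=\"header\">Number of People:</span>\n" +
            "<span class=\"peopleCount\">1001</span>\n" +
            "<span class=\"footer\">As of June 2009.</span>";
        Console.WriteLine(Extract(responseHtml));
    }
}
```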
