This question already has answers here:
RegEx match open tags except XHTML self-contained tags
(35 answers)
Closed 8 years ago.
Here is my function, which uses a regex. It works correctly, but it extracts the tags very slowly.
I think it searches the HTML code character by character, which is why it is slow. Is there any way to speed it up?
string s = Sourcecode(richTextBox6.Text);
// Captures everything between <a ...> and </a> (tags included).
Regex regex = new Regex("(?i)<a([^>]+)>(.+?)</a>");
string gelen = s;
string inside = null;
Match match = regex.Match(gelen);
if (match.Success)
{
    inside = match.Value;
    richTextBox2.Text = inside;
}
string outputStr = "";
foreach (Match ItemMatch in regex.Matches(gelen))
{
    Console.WriteLine(ItemMatch);
    inside = ItemMatch.Value;
    // Append the match followed by a line break.
    outputStr += inside + "\r\n";
}
richTextBox2.Text = outputStr;
Change outputStr to a StringBuilder; if you are appending a large number of items, this will speed things up. As already mentioned, parsing HTML with a regex can be a problem (it depends a lot on your input).
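A minimal sketch of the StringBuilder suggestion, assuming the same pattern as in the question. The class and method names (AnchorCollector, CollectAnchors) are made up for illustration and are not part of the original code:

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class AnchorCollector
{
    // Hypothetical helper: returns every <a ...>...</a> match, one per line.
    public static string CollectAnchors(string html)
    {
        // Same pattern as in the question.
        var regex = new Regex("(?i)<a([^>]+)>(.+?)</a>");

        // StringBuilder appends into a growable buffer instead of
        // allocating a new string on every += in the loop.
        var output = new StringBuilder();
        foreach (Match m in regex.Matches(html))
        {
            output.Append(m.Value).Append("\r\n");
        }
        return output.ToString();
    }
}
```

In the question's code this would be used as richTextBox2.Text = AnchorCollector.CollectAnchors(gelen);. Whether it helps noticeably depends on how many matches there are; with only a handful, the regex itself is likely the dominant cost.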
The trouble with parsing HTML is that it isn't an exact science. If it was XHTML that you were parsing, then things would be a lot easier.
Because HTML isn't necessarily well-formed XML, you will run into lots of problems trying to parse it.
It almost needs to be done on a site-by-site basis.
You should not parse HTML using Regex. (You can, however, use a compiled Regex in your code above to make it a bit quicker.)
Regex is not built for parsing HTML. You can use one of the third-party libraries that are built specifically for this purpose.
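A rough sketch of what "compiled Regex" could look like here, assuming the pattern from the question. The names CompiledAnchorRegex and FindAnchors are placeholders, not anything from the original post:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CompiledAnchorRegex
{
    // Built once and reused across calls. RegexOptions.Compiled costs more
    // at construction time in exchange for faster matching afterwards.
    private static readonly Regex AnchorRegex = new Regex(
        "<a([^>]+)>(.+?)</a>",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Hypothetical helper: returns the full text of every anchor match.
    public static List<string> FindAnchors(string html)
    {
        var result = new List<string>();
        foreach (Match m in AnchorRegex.Matches(html))
        {
            result.Add(m.Value);
        }
        return result;
    }
}
```

The gain is only worth it when the same pattern is run many times or over large inputs; for a single small page the extra construction cost may outweigh it.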
List of HTML Parsing Libraries
If you don't want to use 3rd party libraries, then you can use the System.Windows.Forms.WebBrowser for this purpose.
You can also use Fizzler, which is built on HTML Agility Pack and adds support for jQuery-style selectors.
Then there is the Majestic-12 HTML parser, which is very quick.
You can also use TidyNet.Tidy to convert the HTML to XHTML, and then use an XML parser.
Check the following example on how improper usage of Regex can degrade performance.
This question already has an answer here:
Regex to remove all spans from HTML keeping inner text as it is
(1 answer)
Closed 7 years ago.
I parse HTML (held as a string in C# code) and need to get all the text phrases from it. For example, given this HTML:
<div><div>text1</div>text2</div>
I want to get array of strings:
text1
text2
If this is impossible with a regular expression, please provide an algorithm that skips all tag names and tag attributes and returns only the text content.
Update: this is not a duplicate of the span problem, because the text can be in any tag, not only span. I need all the text, without tags and attributes. I don't want to use the HtmlAgility parser.
Update 2: I found a regex (yes, it is possible):
// Parse the HTML and save each text node in the list.
public void FindTextHtml(string html, List<string> list)
{
    var ms = Regex.Matches(html, @">([^<>]*)<", RegexOptions.IgnoreCase | RegexOptions.Multiline);
    foreach (Match m in ms)
    {
        var text = m.Groups[1].Value;
        list.Add(text);
    }
}
Full source code available here
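One thing to watch with a pattern like >([^<>]*)< is that it also matches the empty gap between adjacent tags (for example between <div> and <div>), so the list can contain empty strings. A possible variant, sketched here with lookarounds and a made-up class name (TextNodeFinder), keeps only non-blank text. It is an illustration under the assumption that the input contains no scripts, styles, or comments whose content should be excluded:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TextNodeFinder
{
    // Hypothetical helper: returns the non-blank text found between tags.
    public static List<string> FindText(string html)
    {
        var list = new List<string>();
        // (?<=>) requires a preceding '>' and (?=<) a following '<',
        // without consuming either, so only the text itself is matched.
        foreach (Match m in Regex.Matches(html, @"(?<=>)[^<>]+(?=<)"))
        {
            var text = m.Value.Trim();
            if (text.Length > 0)
            {
                list.Add(text);
            }
        }
        return list;
    }
}
```

For the sample <div><div>text1</div>text2</div>, this yields text1 and text2. It still treats the HTML as flat text, so entity decoding and real-world markup quirks are not handled.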
What you are looking for is here: Grabbing HTML Tags
The matches you are looking for would be in the ...(.*?)... group. Hope this helps
Use the HtmlAgilityPack DLL to parse XML and HTML files, and then use the code below to get your text:
string path = @"path to the file";
HtmlAgilityPack.HtmlDocument hd = new HtmlAgilityPack.HtmlDocument();
hd.Load(path);
string result = hd.DocumentNode.InnerText.Trim();
That is all you need.
I'm having problems with RegEx and trying to pull out a specifically formatted HTML link from a page's HTML source.
The HTML source contains many of these links. The link is in the format:
<a class="link" href="pagedetail.html?record_id=123456">RecordName</a>
For each matching link, I would like to be able to easily extract the following two bits of information:
The URL bit. E.g. pagedetail.html?record_id=123456
The link name. E.g. RecordName
Can anyone please help with this? I'm completely stuck. I need this for a C# program, so any C#-specific notation would be great. Thanks
TIA
People will tell you that you should not parse HTML with regex, and I think that is a valid statement.
But sometimes, with well-formatted HTML and really simple cases like yours seems to be, you can use a regex to do the job.
For example, you can use this regex and read group 1 for the URL and group 2 for the RecordName:
<a class="link" href="([^"]+)">([^<]+)<
DEMO
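A short sketch of how that pattern could be used from C#, assuming every link has exactly the attribute order and quoting shown in the question. RecordLinkExtractor and Extract are illustrative names only:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RecordLinkExtractor
{
    // The pattern from the answer, written as a verbatim string
    // (a doubled "" stands for one literal quote character).
    private static readonly Regex LinkRegex =
        new Regex(@"<a class=""link"" href=""([^""]+)"">([^<]+)<");

    // Hypothetical helper: returns (URL, link name) pairs in document order.
    public static List<KeyValuePair<string, string>> Extract(string html)
    {
        var result = new List<KeyValuePair<string, string>>();
        foreach (Match m in LinkRegex.Matches(html))
        {
            // Group 1 is the href value, group 2 is the link text.
            result.Add(new KeyValuePair<string, string>(
                m.Groups[1].Value, m.Groups[2].Value));
        }
        return result;
    }
}
```

Each returned pair has the URL in Key and the record name in Value. Any change to the markup (extra attributes, single quotes, different attribute order) would make the pattern miss links silently.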
I feel a bit silly answering this, because it should be evident through the two comments to your question, but...
You should not parse HTML with REGEX!
Use an XML parser, or better yet, a dedicated tool, like the HTML Agility Pack (which is still an XML parser, but fancier for working with HTML).
You can use the TagRegex and EndTagRegex classes (from System.Web.RegularExpressions) to parse the HTML string and find the tag you want. You need to step through the HTML string position by position to find the desired tag.
e.g.
var position = 0;
var tagRegex = new TagRegex();
var endTagRegex = new EndTagRegex();
while (position < html.Length)
{
    // Try to match an opening tag at the current position.
    var match = tagRegex.Match(html, position);
    if (match.Success)
    {
        var tagName = match.Groups["tagname"].Value;
        if (tagName == "a")
        { ... }
    }
    else
    {
        // Otherwise try a closing tag, and read the name from that match.
        var endMatch = endTagRegex.Match(html, position);
        if (endMatch.Success)
        {
            var tagName = endMatch.Groups["tagname"].Value;
            if (tagName == "a")
            { ... }
        }
    }
    position++;
}
This question already has answers here:
RegEx match open tags except XHTML self-contained tags
(35 answers)
Closed 9 years ago.
I want to use a regular expression to get the airline code between the <AirlineCode> and </AirlineCode> tags.
I only want the values of the <AirlineCode> tags that are within the <Flight> tags. There are more <AirlineCode> tags outside, and I don't want the airline values from them.
I tried the regex below, but it gives me all the airline codes regardless of their position. Please help.
var regex = new Regex(@"<AirlineCode>(.*?)</AirlineCode>", RegexOptions.IgnoreCase);
Match m = regex.Match("<PNRViewRS><AirGroup><Flight CnxxIndicator=\"N\"><Arrival></Arrival><Carrier><AirlineCode>DL</AirlineCode></Carrier></Flight><Flight CnxxIndicator=\"N\"><Arrival></Arrival><Carrier><AirlineCode>AA</AirlineCode></Carrier></Flight></AirGroup></PNRViewRS>");
int matchCount = 0;
while (m.Success)
{
Console.WriteLine("Match" + (++matchCount));
for (int i = 1; i <= 2; i++)
{
Group g = m.Groups[i];
//do stuff...
}
m = m.NextMatch();
}
In general, it's a bad idea to try parsing XML with regular expressions. The reason is that regex is insufficiently expressive, even with back references and such. The questions linked in the comments are worth reading to understand why this is generally a bad idea.
That said, you can be successful if you know for certain the format of your file, and if you're willing to do a little non-regex parsing as well.
In your situation, you have essentially:
<Flight>
<AirlineCode>
</AirlineCode>
<Flight>
<AirlineCode>
</AirlineCode>
<Flight>
<AirlineCode>
</AirlineCode>
<Flight>
And you want all of the <AirlineCode> tags that occur within <Flight> tags.
The way to approach this problem is to extract the <Flight> tags and their contents with one regex, and then use another regex to extract the <AirlineCode> tags from those extracted <Flight> tags. Don't try to do it in a single regular expression. You will not succeed.
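The two-pass approach described above might look roughly like this. The class name, method name, and both patterns are assumptions made for the sketch, and it presumes that <Flight> elements are never nested or self-closing:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class FlightAirlineCodes
{
    // Pass 1: each <Flight ...>...</Flight> block, attributes allowed.
    private static readonly Regex FlightRegex = new Regex(
        @"<Flight\b[^>]*>(.*?)</Flight>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Pass 2: the <AirlineCode> values inside one extracted block.
    private static readonly Regex AirlineRegex = new Regex(
        @"<AirlineCode>(.*?)</AirlineCode>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Hypothetical helper: airline codes that occur inside <Flight> only.
    public static List<string> Extract(string xml)
    {
        var codes = new List<string>();
        foreach (Match flight in FlightRegex.Matches(xml))
        {
            foreach (Match code in AirlineRegex.Matches(flight.Groups[1].Value))
            {
                codes.Add(code.Groups[1].Value);
            }
        }
        return codes;
    }
}
```

An <AirlineCode> that sits outside every <Flight> block is never seen by the second pattern, which is the filtering the question asks for.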
If your data really is that simple, then this will work. I won't say that I recommend this approach. There are too many things that can go wrong. Data formats have a distressing tendency to change, and that fragile regex solution is likely to break if the format changes even a little bit. An XML parser solution will be much more robust.
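For comparison, an XML parser version could be as small as the following LINQ to XML sketch. It assumes the document is well-formed and uses no XML namespaces; FlightAirlineCodesXml is a made-up name:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class FlightAirlineCodesXml
{
    // Hypothetical helper: values of <AirlineCode> elements that are
    // descendants of a <Flight> element.
    public static List<string> Extract(string xml)
    {
        return XDocument.Parse(xml)
            .Descendants("Flight")
            .Descendants("AirlineCode")
            .Select(e => e.Value)
            .ToList();
    }
}
```

Because the parser works on the element tree, attribute changes, whitespace, and reordering inside <Flight> do not affect the result. Element names are case-sensitive here, unlike the IgnoreCase regex in the question.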
_request = (HttpWebRequest)WebRequest.Create(url);
_response = (HttpWebResponse) _request.GetResponse();
StreamReader streamReader = new StreamReader(_response.GetResponseStream());
string text = streamReader.ReadToEnd();
The text contains HTML tags. How can I get the text without the HTML tags?
How do you extract text from dynamic HTML without using 3rd party libraries? Simple, you invent your own HTML parsing library using the string parsing functions present in the .NET framework.
Seriously, doing this by yourself is a bad idea. If you're pulling dynamic HTML off the web, you have to be prepared for different closing tags, mismatched tags, missing end tags, and so forth. Unless you have a really good reason why you need to write one yourself, just use HTML Agility Pack, and let that do the hard work for you.
Also, make sure you're not succumbing to Not Invented Here Syndrome.
Try this:
System.Xml.XmlDocument docXML = new System.Xml.XmlDocument();
docXML.Load(url);
string textWithoutTags = docXML.InnerText;
Be happy :)
1) Do not use Regular Expressions. (see this great StackOverflow post: RegEx match open tags except XHTML self-contained tags)
2) Use HtmlAgilityPack. But I see you do not want 3rd Party libraries, so we are forced to....
3) Use XmlReader. You can pretty much use the example code straight from MSDN, and just ignore all cases of XmlNodeType except for XmlNodeType.Text. For that case simply write your output to a StreamWriter.
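A minimal sketch of option 3, assuming the input is well-formed XML (XHTML); ordinary HTML with unclosed tags will make XmlReader throw. It collects the text nodes into a list rather than writing them to a StreamWriter, and the names XmlTextExtractor and ExtractText are invented for the example:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class XmlTextExtractor
{
    // Hypothetical helper: returns the value of every text node, in order.
    public static List<string> ExtractText(string xhtml)
    {
        var texts = new List<string>();
        using (var reader = XmlReader.Create(new StringReader(xhtml)))
        {
            while (reader.Read())
            {
                // Skip elements, attributes, comments and so on;
                // keep only the text content.
                if (reader.NodeType == XmlNodeType.Text)
                {
                    texts.Add(reader.Value);
                }
            }
        }
        return texts;
    }
}
```

To get a single string, the list can be joined with string.Join(" ", texts), or each value can be written to a StreamWriter inside the loop as the answer suggests.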
This question has been asked before. There are a few ways to do it, including using a Regular Expression or as pointed out by Adrian, the Agility Pack.
See this question: How can I strip HTML tags from a string in ASP.NET?
I have a string input buffer that contains HTML.
That html contains a lot of text, including some stuff I want to parse.
What I'm actually looking for are the lines like this : "< strong>Filename< /strong>: yadayada.thisandthat.doc< /p>"
(Although position and amount of whitespace / semicolons is variable)
What's the best way to get all the filenames into a List< string> ?
Well a regular expression to accomplish this will be very hard to write and will end up being unreliable anyway.
Probably your best bet is to have a whitelist of extensions you want to look for (.doc, .pdf etc), and trawl through the html looking for instances of these extensions. When you find one, track back to the next whitespace character and that's your filename.
Hope this helps.
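The whitelist idea above could be sketched like this, using only string functions. ExtensionScanner and Find are hypothetical names, and the sketch assumes that filenames contain no spaces and are preceded by whitespace:

```csharp
using System;
using System.Collections.Generic;

public static class ExtensionScanner
{
    // Hypothetical helper: for each whitelisted extension, find every
    // occurrence and walk back to the previous whitespace character.
    public static List<string> Find(string text, params string[] extensions)
    {
        var names = new List<string>();
        foreach (var ext in extensions)
        {
            int index = text.IndexOf(ext, StringComparison.OrdinalIgnoreCase);
            while (index >= 0)
            {
                int end = index + ext.Length;
                int start = index;
                // Track back until whitespace or the start of the text.
                while (start > 0 && !char.IsWhiteSpace(text[start - 1]))
                {
                    start--;
                }
                names.Add(text.Substring(start, end - start));
                index = text.IndexOf(ext, end, StringComparison.OrdinalIgnoreCase);
            }
        }
        return names;
    }
}
```

A call such as ExtensionScanner.Find(html, ".doc", ".pdf") returns the candidate names grouped by extension. The sketch is deliberately naive: ".doc" also matches inside ".docx", and a filename that directly follows a tag with no whitespace would come back with the tag attached.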
You have a couple of options. You can use regular expressions, it could be something like Filename: (.*?)< /p> , but it will need to be much more complex. You would need to look at more of the text file to write a proper one. This could work depending on the structure of all your text, if there is always a certain tag after a filename for example.
If it is valid HTML, you can also use an HTML parser like HTML Agility Pack to go through the HTML and pull out the text from certain tags, then use a regex to separate out the path.
I'm not sure a regular expression is the best way to do this, traversing the HTML tree is probably more sensible, but the following regex should do it:
<\s*strong\s*>\s*Filename\s*<\s*/strong\s*>[\s:]*([^<]+)<\s*/p\s*>
As you can see, I've been extremely tolerant of whitespace, as well as of the content of the filename. Also, multiple (or no) colons are permitted.
The C# to build a List (off the top of my head):
List<String> fileNames = new List<String>();
Regex regexObj = new Regex(@"<\s*strong\s*>\s*Filename\s*<\s*/strong\s*>[\s:]*([^<]+)<\s*/p\s*>", RegexOptions.IgnoreCase);
Match matchResults = regexObj.Match(subjectString);
while (matchResults.Success) {
    // Group 1 holds the filename; group 0 would be the entire match, tags included.
    fileNames.Add(matchResults.Groups[1].Value.Trim());
    matchResults = matchResults.NextMatch();
}