This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Using C# regular expressions to remove HTML tags
I'm trying to write a code that will return only the content of an HTML file. The best way I've figured revolves either around eliminating all elements within < ..> brackets, or to make a list of all text in between >...< brackets. I'm pretty new to regular expressions, but I'm pretty sure they're the way to go.
Here's the code I've tried
Regex reg = new Regex(#"<.*>");
file = reg.Replace(file, "");
Which works, as long as there is only one <...> before a block of text. Any file that has two or more of those elements in sequence, like <...><...>, and it just starts deleting any text it finds. Can someone tell me what I'm doing wrong?
Regex are regulary greedy (they match the longest string they can find). Try checking, depending on the language you are looking for, for the +? or *? operators, that will try the shortest match. Otherwise you must build another regex.
Well, the unexpected behavior you're getting is because your regular expression is greedy
If you change your regex to
Regex reg = new Regex(#"<.*?>");
file = reg.Replace(file, "");
you'll get what you expect.
Also, Know that Regex doesn't handle nesting, which HTML has a lot of, and I'd avoid using Regex to parse HTML unless you're trying to match a very specific thing, on a specifically formed piece of html.
Related
This question already has answers here:
RegEx match open tags except XHTML self-contained tags
(35 answers)
Closed 9 years ago.
I want to use regular expression to get the airline code between <AirlineCode> and </AirlineCode> tags.
I only want the values of the <AirlineCode> tags that are w/in the <Flight> tags. There are more <AirlineCode>tags outside and I don't want the airline values from them.
I tried w/ the regex below but it's giving me all airline codes regardless of the position consideration mentioned. Please help.
var regex = new Regex(#"<AirlineCode>(.*?)</AirlineCode>", RegexOptions.IgnoreCase);
Match m = regex.Match("<PNRViewRS><AirGroup><Flight CnxxIndicator=\"N\"><Arrival></Arrival><Carrier><AirlineCode>DL</AirlineCode></Carrier></Flight><Flight CnxxIndicator=\"N\"><Arrival></Arrival><Carrier><AirlineCode>AA</AirlineCode></Carrier></Flight></AirGroup></PNRViewRS>");
int matchCount = 0;
while (m.Success)
{
Console.WriteLine("Match" + (++matchCount));
for (int i = 1; i <= 2; i++)
{
Group g = m.Groups[i];
//do stuff...
}
m = m.NextMatch();
}
In general, it's a bad idea to try parsing XML with regular expressions. The reason is that regex is insufficiently expressive, even with back references and such. The questions linked in the comments are worth reading to understand why this is generally a bad idea.
That said, you can be successful if you know for certain the format of your file, and if you're willing to do a little non-regex parsing as well.
In your situation, you have essentially:
<Flight>
<AirlineCode>
</AirlineCode>
<Flight>
<AirlineCode>
</AirlineCode>
<Flight>
<AirlineCode>
</AirlineCode>
<Flight>
And you want all of the <AirlineCode> tags that occur within <Flight> tags.
The way to approach this problem is to extract the <Flight> tags and their contents with one regex, and then use another regex to extract the <AirlineCode> tags from those extracted <Flight> tags. Don't try to do it in a single regular expression. You will not succeed.
If your data really is that simple, then this will work. I won't say that I recommend this approach. There are too many things that can go wrong. Data formats have a distressing tendency to change, and that fragile regex solution is likely to break if the format changes even a little bit. An XML parser solution will be much more robust.
I have been using this regular expression to extract file names out of file path strings:
Regex r = new Regex(#"\w+[.]\w+$+");
This works, as long as there is no space in the file name. For example:
r.Match("c:\somestuff\myfile.doc").Value = "myfile.doc"
r.Match("c:\somestuff\my file.doc").Value = "file.doc"
I need my regular expression to give me "my file.doc", and not just "file.doc"
I tried messing around with the expression myself. In particular I tried adding \s+ after learning that that is for matching whitespaces. I didn't get the results I hoped for.
I did devise a solution just to get the job done: I started at the end of the string, went backwards until a backslash was reached. This gave me the file name in reverse order (i.e. cod.elifym) into an array of chars, then I used Array.Reverse() to turn it around. However I'd like to learn how to achieve this by simply modifying my original regular expression.
Does it have to be a regular expression? Use System.IO.Path.GetFileName() instead.
Regex r = new Regex(#"[\w ]+\.\w+$");
A working regex might simply look like:
[^\\]+$
Consider using:
System.IO.Path.GetFileName(path)
It would be great if someone could provide me the Regular expression for the following string.
Sample 1: <div>abc</div><br>
Sample 2: <div>abc</div></div></div></div></div><br>
As you can see in the samples provided above, I need to match the string no matter how many number of </div> occurs. If there occurs any other string between </div> and <br>, say like this <div>abc</div></div></div>DEF</div></div><br> OR <div>abc</div></div></div></div></div>DEF<br>, then the Regex should not match.
Thanks in advance.
Try this:
<div>([^<]+)(?:<\/div>)*<br>
As seen on rubular
Notes:
This only works if there are not tags in the abc part (or anything that has a < symbol).
You might want to use start and end of string anchors (^<div>([^<]+)(?:<\/div>)*<br>$ if you want your string to match the pattern exactly.
If you want to allow the abc part to be empty, use * instead of +
That being said, you should be wary of using regex to parse HTML.
In this example, you can use regex because you are parsing a (hopefully) known, regular subset of HTML. But a more robust solution (ie: an [X]HTML parser like HtmlAgilityPack) is preferred when it comes to parsing HTML.
You need to use a real parser. Things like infinitely nested tags can't be handled via regex.
You could also include a named group in the the expression, e.g.:
<div>(?<text>[^<]*)(?:<\/div>)*<br>
Implemented in C#:
var regex = new Regex(#"<div>(?<text>[^<]*)(?:<\/div>)*<br>");
Func<Match, string> getGroupText = m => (m.Success && m.Groups["text"] != null) ? m.Groups["text"].Value : null;
Func<string, string> getText = s => getGroupText(regex.Match(s));
Console.WriteLine(getText("<div>abc</div><br>"));
Console.WriteLine(getText("<div>123</div></div></div></div></div><br>"));
NullUserException's answer is good. Here are a couple of questions, and variations, depending on what you want.
Do you want to prevent anything from occurring before the open div tag? If so, keep the ^ at the beginning of the regex. If not, drop it.
The rest of this post refers to the following section of the regex:
([^<]+?)
Do you want to capture the contents of the div, or just know that it matches your form? To capture, leave it as is. If you don't need to capture, drop the parentheses from the above.
Do you want to match if there is nothing inside the div? If so change the + in the above to *
Finally, although it will work fine, you don't need the ? in the above.
I think, this regex is more flexible:
<div\b[^><]*+>(?>.*?</div>)(?:\s*+</div>)*+\s*+<br(?:\s*+/)?>
I don't include the ^ and $ in the beginning and the end of my regex because we cannot assure that your sample will always in a single line.
I have an string input-buffer that contains html.
That html contains a lot of text, including some stuff I want to parse.
What I'm actually looking for are the lines like this : "< strong>Filename< /strong>: yadayada.thisandthat.doc< /p>"
(Although position and amount of whitespace / semicolons is variable)
What's the best way to get all the filenames into a List< string> ?
Well a regular expression to accomplish this will be very hard to write and will end up being unreliable anyway.
Probably your best bet is to have a whitelist of extensions you want to look for (.doc, .pdf etc), and trawl through the html looking for instances of these extensions. When you find one, track back to the next whitespace character and that's your filename.
Hope this helps.
You have a couple of options. You can use regular expressions, it could be something like Filename: (.*?)< /p> , but it will need to be much more complex. You would need to look at more of the text file to write a proper one. This could work depending on the structure of all your text, if there is always a certain tag after a filename for example.
If it is valid HTML you can also use a HTML parser like HTML Agility Pack to go through the html and pull out text from certain tags, then use a regex to seperate out the path.
I'm not sure a regular expression is the best way to do this, traversing the HTML tree is probably more sensible, but the following regex should do it:
<\s*strong\s*>\s*Filename\s*<\s*/strong\s*>[\s:]*([^<]+)<\s*/p\s*>
As you can see, I've been extremely tolerant of whitespace, as well as tolerant on the content of the filename. Also, multiple (or no) semicolons are permitted.
The C# to build a List (off the top of my head):
List<String> fileNames = new List<String>();
Regex regexObj = new Regex(#"<\s*strong\s*>\s*Filename\s*<\s*/strong\s*>[\s:]*([^<]+)<\s*/p\s*>", RegexOptions.IgnoreCase);
Match matchResults = regexObj.Match(subjectString);
while (matchResults.Success) {
fileNames.Add(matchResults.Groups[0].Value);
matchResults = matchResults.NextMatch();
}
I have a string of test like this:
<customtag>hey</customtag>
I want to use a RegEx to modify the text between the "customtag" tags so that it might look like this:
<customtag>hey, this is changed!</customtag>
I know that I can use a MatchEvaluator to modify the text, but I'm unsure of the proper RegEx syntax to use. Any help would be much appreciated.
I wouldn't use regex either for this, but if you must this expression should work:
<customtag>(.+?)</customtag>
I'd chew my own leg off before using a regular expression to parse and alter HTML.
Use XSL or DOM.
Two comments have asked me to clarify. The regular expression substitution works in the specific case in the OP's question, but in general regular expressions are not a good solution. Regular expressions can match regular languages, i.e. a sequence of input which can be accepted by a finite state machine. HTML can contain nested tags to any arbitrary depth, so it's not a regular language.
What does this have to do with the question? Using a regular expression for the OP's question as it is written works, but what if the content between the <customtag> tags contains other tags? What if a literal < character occurs in the text? It has been 11 months since Jon Tackabury asked the question, and I'd guess that in that time, the complexity of his problem may have increased.
Regular expressions are great tools and I do use them all the time. But using them in lieu of a real parser for input that needs one is going to work in only very simple cases. It's practically inevitable that these cases grow beyond what regular expressions can handle. When that happens, you'll be tempted to write a more complex regular expression, but these quickly become very laborious to develop and debug. Be ready to scrap the regular expression solution when the parsing requirements expand.
XSL and DOM are two standard technologies designed to work with XML or XHTML markup. Both technologies know how to parse structured markup files, keep track of nested tags, and allow you to transform tags attributes or content.
Here are a couple of articles on how to use XSL with C#:
http://www.csharpfriends.com/Articles/getArticle.aspx?articleID=63
http://www.csharphelp.com/archives/archive78.html
Here are a couple of articles on how to use DOM with C#:
http://msdn.microsoft.com/en-us/library/aa290341%28VS.71%29.aspx
http://blogs.msdn.com/tims/archive/2007/06/13/programming-html-with-c.aspx
Here's a .NET library that assists DOM and XSL operations on HTML:
http://www.codeplex.com/Wiki/View.aspx?ProjectName=htmlagilitypack
If there won't be any other tags between the two tags, this regex is a little safer, and more efficient:
<customtag>[^<>]*</customtag>
Most people use HTML Agility Pack for HTML text parsing. However, I find it a little robust and complicated for my own needs. I create a web browser control in memory, load the page, and copy the text from it. (see example below)
You can find 3 simple examples here:
http://jakemdrew.wordpress.com/2012/02/03/getting-only-the-text-displayed-on-a-webpage-using-c/
//This is to replace all HTML Text
var re = new RegExp("<[^>]*>", "g");
var x2 = Content.replace(re,"");
//This is to replace all
var x3 = x2.replace(/\u00a0/g,'');