I want to query a string (HTML) from a database and display it on a webpage. The problem is that the data has a
<p> around the text (ending with </p>).
I want to strip this outer tag in my view model or in the controller action that returns this data. What is the best way of doing this in C#?
Might be overkill for your needs, but if you want to parse the HTML you can use the HtmlAgilityPack - certainly a cleaner solution in general than most suggested here, although it might not be as performant:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml("<p> around the text (ending with </p>");
string result = doc.DocumentNode.FirstChild.InnerHtml;
If you're absolutely sure the string will always have that tag, you can use String.Substring like myString.Substring(3, myString.Length-7) or so.
A more robust method would be to either manually code the appropriate tests or use a regular expression, or ultimately, use an HTML parser as suggested by BrokenGlass's answer.
UPDATE: Using regexes you could do:
String filteredString = Regex.Match(myString, "^<p>(.*)</p>").Groups[1].Value;
You could add \s* after the initial ^ to also remove leading whitespace. You can also check the Success property of the Match result to see whether the string matched the <p>...</p> pattern at all.
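As a minimal sketch of the regex approach described above, the helper below returns the inner text only when the whole string is wrapped in <p>...</p>, and otherwise returns the input unchanged. The class and method names (ParagraphStripper, StripOuterP) are hypothetical, and the use of RegexOptions.Singleline is an assumption so that the paragraph may contain line breaks.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ParagraphStripper
{
    // Optional whitespace, one outer <p>...</p> pair, optional whitespace.
    // Singleline (an assumption here) lets '.' span line breaks.
    private static readonly Regex OuterP =
        new Regex(@"^\s*<p>(.*)</p>\s*$", RegexOptions.Singleline);

    // Returns the captured inner text when the pattern matches,
    // otherwise returns the input unchanged.
    public static string StripOuterP(string html)
    {
        Match m = OuterP.Match(html);
        return m.Success ? m.Groups[1].Value : html;
    }

    public static void Main()
    {
        Console.WriteLine(StripOuterP("<p>some <b>text</b></p>"));
        Console.WriteLine(StripOuterP("no paragraph here"));
    }
}
```

Note that this only looks at the first <p> and the last </p>, so a string holding several sibling paragraphs would keep the inner </p><p> pairs; an HTML parser handles that case better.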
If the data is always surrounded by <p> ... </p>:
string withoutParas = withParas.Substring(3, withParas.Length - 7);
Try the string method Remove(): first remove 4 characters at the LastIndexOf() of </p>, then remove 3 characters at the IndexOf() of <p>.
If you are absolutely guaranteed that your string will always fit the pattern of <p>...</p>, then the other solutions using data.Substring(3, data.Length - 7) are sufficient. If, however, there's any chance that it could look at all different, then you really need to use an HTML parser. The consensus is that the HTML Agility Pack is the way to go.
s = s.Replace("<p>", String.Empty).Replace("</p>", String.Empty);
I have a source to a web page and I need to extract the body. So anything between </head><body> and </body></html>.
I've tried the following with no success:
var match = Regex.Match(output, @"(?<=\</head\>\<body\>)(.*?)(?=\</body\>\</html\>)");
It finds a string but cuts it off long before </body></html>. I escaped characters based on the RegEx cheat sheet.
What am I missing?
I'd recommend using the HtmlAgilityPack instead - parsing HTML with regular expressions is very, very fragile.
The latest version even supports Linq so you can get your content like this:
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://stackoverflow.com");
string html = doc.DocumentNode.Descendants("body").Single().InnerHtml;
Regex is not meant for this kind of HTML handling, as many here would say. Without your sample web page / HTML, I can only suggest removing the lazy ? quantifier from (.*?) and trying again. After all, an HTML page will have only one head and body.
Though regexes are definitely not the best tool for this task, there are a few suggestions and points I would like to make:
Un-escape the angle brackets. With the @ before your string, the backslashes go through to the regex, and angle brackets do not need to be escaped in a .NET regex.
With your regex, the head/body tag combinations cannot have any whitespace between them.
With your regex, the body tag cannot have any attributes.
I would suggest something more like:
(?<=</head>\s*<body(\s[^>]*)?>)(.*?)(?=</body>\s*</html>)
This seems to work for me on the source of this page!
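A minimal sketch of how the suggested pattern might be used from C#. One thing worth checking on a real page is that '.' does not match line breaks by default in .NET, so this sketch assumes RegexOptions.Singleline is wanted; IgnoreCase is likewise an assumption. The names BodyExtractor and TryExtractBody are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BodyExtractor
{
    // The pattern suggested above. Singleline lets '.' cross line breaks;
    // IgnoreCase tolerates <BODY> and similar (both are assumptions).
    private static readonly Regex BodyPattern = new Regex(
        @"(?<=</head>\s*<body(\s[^>]*)?>)(.*?)(?=</body>\s*</html>)",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Returns true and the text between the body tags when the pattern matches.
    public static bool TryExtractBody(string page, out string body)
    {
        Match m = BodyPattern.Match(page);
        body = m.Success ? m.Value : string.Empty;
        return m.Success;
    }

    public static void Main()
    {
        string page =
            "<html><head><title>t</title></head>\n" +
            "<body class=\"main\">\n<p>Hello</p>\n</body>\n</html>";

        string body;
        if (TryExtractBody(page, out body))
            Console.WriteLine(body);
        else
            Console.WriteLine("no match");
    }
}
```

This is still a regex over HTML, so comments or scripts containing </body> text can confuse it; it is only meant for pages whose shape you control.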
As the others have said, the correct way to handle this is with an HTML-specific tool. I just want to point out some problems with that cheat-sheet.
First, it's wrong about angle brackets: you do not need to escape them. In fact, it's wrong twice: it also says \< and \> match word boundaries, which is both incorrect for .NET, and incompatible with the advice about escaping angle brackets.
That cheat-sheet is just a random collection of regex syntax elements; most of them will work in most flavors, but many are guaranteed not to work in your particular flavor, whatever it happens to be. I recommend you disregard it and rely instead on .NET-specific documents or Regular-Expressions.info. The books Mastering Regular Expressions and Regular Expressions Cookbook are both excellent, too.
As for your regex, I don't see how it could behave the way you say it does. If it were going to fail, I would expect it to fail completely. Does your HTML document contain a CDATA section or SGML comment with </body></html> inside it? Or is it really two or more HTML documents run together?
It would be great if someone could provide me the Regular expression for the following string.
Sample 1: <div>abc</div><br>
Sample 2: <div>abc</div></div></div></div></div><br>
As you can see in the samples provided above, I need to match the string no matter how many </div> tags occur. If any other string occurs between </div> and <br>, like this: <div>abc</div></div></div>DEF</div></div><br> OR <div>abc</div></div></div></div></div>DEF<br>, then the regex should not match.
Thanks in advance.
Try this:
<div>([^<]+)(?:<\/div>)*<br>
As seen on rubular
Notes:
This only works if there are no tags in the abc part (or anything that has a < symbol).
You might want to use start and end of string anchors (^<div>([^<]+)(?:<\/div>)*<br>$) if you want your string to match the pattern exactly.
If you want to allow the abc part to be empty, use * instead of +.
That being said, you should be wary of using regex to parse HTML.
In this example, you can use regex because you are parsing a (hopefully) known, regular subset of HTML. But a more robust solution (i.e., an [X]HTML parser such as HtmlAgilityPack) is preferred when it comes to parsing HTML.
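As a quick check of the anchored pattern from the notes above, the following sketch runs it against the four strings given in the question. The class name DivSamples and the helper IsMatch are hypothetical; the forward slash is left unescaped because .NET does not require escaping it.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DivSamples
{
    // Anchored form of the pattern: <div>, text without '<',
    // any number of </div>, then <br> at the end of the string.
    private static readonly Regex Pattern =
        new Regex(@"^<div>([^<]+)(?:</div>)*<br>$");

    public static bool IsMatch(string input) => Pattern.IsMatch(input);

    public static void Main()
    {
        string[] samples =
        {
            "<div>abc</div><br>",
            "<div>abc</div></div></div></div></div><br>",
            "<div>abc</div></div></div>DEF</div></div><br>",
            "<div>abc</div></div></div></div></div>DEF<br>",
        };

        // The first two samples should match; the two containing DEF should not.
        foreach (string s in samples)
            Console.WriteLine($"{IsMatch(s)}  {s}");
    }
}
```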
You need to use a real parser. Things like infinitely nested tags can't be handled via regex.
You could also include a named group in the expression, e.g.:
<div>(?<text>[^<]*)(?:<\/div>)*<br>
Implemented in C#:
var regex = new Regex(@"<div>(?<text>[^<]*)(?:<\/div>)*<br>");
Func<Match, string> getGroupText = m => (m.Success && m.Groups["text"] != null) ? m.Groups["text"].Value : null;
Func<string, string> getText = s => getGroupText(regex.Match(s));
Console.WriteLine(getText("<div>abc</div><br>"));
Console.WriteLine(getText("<div>123</div></div></div></div></div><br>"));
NullUserException's answer is good. Here are a couple of questions, and variations, depending on what you want.
Do you want to prevent anything from occurring before the open div tag? If so, keep the ^ at the beginning of the regex. If not, drop it.
The rest of this post refers to the following section of the regex:
([^<]+?)
Do you want to capture the contents of the div, or just know that it matches your form? To capture, leave it as is. If you don't need to capture, drop the parentheses from the above.
Do you want to match if there is nothing inside the div? If so change the + in the above to *
Finally, although it will work fine, you don't need the ? in the above.
I think this regex is more flexible:
<div\b[^><]*+>(?>.*?</div>)(?:\s*+</div>)*+\s*+<br(?:\s*+/)?>
I don't include the ^ and $ at the beginning and the end of my regex because we cannot be sure that your sample will always be on a single line.
I have the following string:
<div id="mydiv">This is a "div" with quotation marks</div>
I want to use regular expressions to return the following:
<div id='mydiv'>This is a "div" with quotation marks</div>
Notice how the id attribute in the div is now surrounded by apostrophes?
How can I do this with a regular expression?
Edit: I'm not looking for a magic bullet to handle every edge case in every situation. We should all be wary of using regex to parse HTML but, in this particular case and for my particular need, regex IS the solution... I just need a bit of help getting the right expression.
Edit #2: Jens helped to find a solution for me, but anyone randomly coming to this page should think long and very hard about using this solution. In my case it works because I am very confident of the type of strings that I'll be dealing with. I know the dangers and the risks; make sure you do too. If you're not sure whether you know, then it probably indicates that you don't know and shouldn't use this method. You've been warned.
This could be done in the following way: I think you want to replace every instance of ", that is between a < and a > with '.
So, you look for each " in your file, look behind for a <, and ahead for a >. The regex looks like:
(?<=\<[^<>]*)"(?=[^><]*\>)
You can replace the found characters to your liking, maybe using Regex.Replace.
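A minimal sketch of that replacement with Regex.Replace, applied to the string from the question. The pattern is the one above with the unnecessary backslashes before < and > dropped; the names AttributeQuotes and ToSingleQuotes are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AttributeQuotes
{
    // A double quote that has '<' somewhere before it and '>' somewhere after it,
    // with no other angle bracket in between, i.e. a quote inside a tag.
    private static readonly Regex QuoteInsideTag =
        new Regex(@"(?<=<[^<>]*)""(?=[^<>]*>)");

    public static string ToSingleQuotes(string html) =>
        QuoteInsideTag.Replace(html, "'");

    public static void Main()
    {
        string input = "<div id=\"mydiv\">This is a \"div\" with quotation marks</div>";
        Console.WriteLine(ToSingleQuotes(input));
    }
}
```

The quotes in the text content are left alone because a > sits between them and the nearest preceding <. An attribute value that itself contains an apostrophe would end up broken, which is one of the risks the question's edits warn about.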
Note: While I have found the Stack Overflow community most friendly and helpful, these Regex/HTML questions are met with a little too much anger, in my opinion. After all, this question here does not ask "What regex matches all valid HTML, and does not match anything else."
I see you're aware of the dangers of using Regex to do these kinds of replacements. I've added the following answer for those in search of a method that is a lot more 'stable' if you want to have a solution that will keep working as the input docs change.
Using the HTML Agility Pack (project page, nuget), this does the trick:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml("your html here");
// or doc.Load(stream);
var nodes = doc.DocumentNode.DescendantNodes();
foreach (var node in nodes)
{
    foreach (var att in node.Attributes)
    {
        att.QuoteType = AttributeValueQuote.SingleQuote;
    }
}
var fixedText = doc.DocumentNode.OuterHtml;
//doc.Save(/* stream */);
You can match:
(<div.*?id=)"(.*?)"(.*?>)
and replace this with:
$1'$2'$3
So say I have some html with an image tag like this:
<p> (1) some image is below:
<img src="/somwhere/filename_(1).jpg">
</p>
I want a regex that will just get rid of the parentheses in the filename so my html will look like this:
<p> (1) some image is below:
<img src="/somwhere/filename_1.jpg">
</p>
Does anyone know how to do this? My programming language is C#, if that makes a difference...
I will be eternally grateful and send some very nice karma your way. :)
I suspect your job would be much easier if you used the HTML Agility Pack instead of regexes. Judging from the answers, it will make parsing the HTML a lot easier for what you are trying to do.
Hope this helps,
Best regards,
Tom.
This (rather dense) regex should do it:
string s = Regex.Replace(input, @"(<img\s+[^>]*src=""[^""]*)\((\d+)\)([^""]*""[^>]*>)", "$1$2$3");
Nick's solution is fine if the file names always match that format, but this one matches any parenthesis, anywhere in the attribute:
s = Regex.Replace(s, @"(?i)(?<=<img\s+[^>]*\bsrc\s*=\s*""[^""]*)[()]", "");
The lookbehind ensures that the match occurs inside the src attribute of an img tag. It assumes the attribute is enclosed in double-quotes (quotation marks); if you need to allow for single-quotes (apostrophes) or no quotes at all, the regex gets much more complicated. I'll post that if you need it.
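To illustrate the lookbehind version on the HTML from the question, here is a small sketch. The names ImgSrcCleaner and RemoveParens are hypothetical, and the inline (?i) flag is expressed as RegexOptions.IgnoreCase instead.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ImgSrcCleaner
{
    // Matches '(' or ')' only when preceded by an <img ... src=" that has not
    // yet been closed by another double quote (.NET allows this
    // variable-length lookbehind).
    private static readonly Regex ParenInSrc = new Regex(
        @"(?<=<img\s+[^>]*\bsrc\s*=\s*""[^""]*)[()]",
        RegexOptions.IgnoreCase);

    public static string RemoveParens(string html) =>
        ParenInSrc.Replace(html, "");

    public static void Main()
    {
        string html =
            "<p> (1) some image is below:\n" +
            "<img src=\"/somwhere/filename_(1).jpg\">\n" +
            "</p>";
        Console.WriteLine(RemoveParens(html));
    }
}
```

The (1) in the paragraph text is untouched because no <img ... src=" precedes it; only the parentheses inside the double-quoted src value are removed.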
In this simple case, you could just use string.Replace, for example:
string imgFilename = "/somewhere/image_(1).jpg";
imgFilename = imgFilename.Replace("(", "").Replace(")", "");
Or do you need a regex for replacing the complete tag inside a HTML string?
Regex.Replace(some_input, @"(?<=<\s*img\s*src\s*=\s*""[^""]*?)(?:\(|\))(?=[^""]*?""\s*\/?\s*?>)", "");
Finds ( or ) preceded by <img src =" and, optionally, text (with any whitespace combination, though I didn't include newline), and followed by optional text and "> or "/>, again with any whitespace combination, and replaces them with nothingness.
How can I write a regular expression to replace links with no link text like this:
with
http://www.somesite.com
?
This is what I was trying to do to capture the matches, and it isn't catching any. What am I doing wrong?
string pattern = "<a\\s+href\\s*=\\s*\"(?<href>.*)\">\\s*</a>";
I wouldn't use a regex - I'd use the Html Agility Pack, and a query like:
// Note: SelectNodes returns null when nothing matches, so check for that in real code.
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[.='']")) {
    link.InnerHtml = link.GetAttributeValue("href", "");
}
I could be wrong, but I think you simply need to change the quantifier within the href group to be lazy rather than greedy.
string pattern = @"<a\s+href\s*=\s*""(?<href>.*?)"">\s*</a>";
(I've also changed the type of the string literal to use @, for better readability.)
The rest of the regex appears fine to me. That you're not capturing any matches at all makes me think otherwise, but there could be a problem in the rest of the code (or even the input data - have you verified that?).
I would suggest
string pattern = "(<a\\b[^>]*href=\"([^\"]+)\"[^>]*>)[\\s\\r\\n]*(</a>)";
This way also links with their href attribute somewhere else would be captured.
Replace with
"$1$2$3"
The usual word of warning: HTML and regex are essentially incompatible. Use with caution, this might blow up.
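A minimal sketch of the pattern and the "$1$2$3" replacement suggested above, wired together with Regex.Replace. The names EmptyLinkFiller and Fill are hypothetical, and the sample markup is an assumption about what an empty link looks like, since the question's own example did not survive.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EmptyLinkFiller
{
    // Group 1: the opening <a ...> tag, group 2: the href value,
    // group 3: the closing </a>. Only whitespace may sit between the tags.
    private static readonly Regex EmptyLink = new Regex(
        "(<a\\b[^>]*href=\"([^\"]+)\"[^>]*>)[\\s\\r\\n]*(</a>)");

    // Re-emits the opening tag, then the href as link text, then the closing tag.
    public static string Fill(string html) =>
        EmptyLink.Replace(html, "$1$2$3");

    public static void Main()
    {
        Console.WriteLine(Fill("<a href=\"http://www.somesite.com\"></a>"));
        Console.WriteLine(Fill("<a href=\"http://www.somesite.com\">already has text</a>"));
    }
}
```

Links that already contain text are left unchanged, because the pattern only allows whitespace between the opening and closing tags.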
Marc Gravell has the right answer, regexes are fundamentally bad at parsing HTML (see Can you provide some examples of why it is hard to parse XML and HTML with a regex? for why). See Can you provide an example of parsing HTML with your favorite parser? for examples using a variety of parsers.