Regex to get values between Double Quotes

I have a value I am pulling from a database:
<iframe width="420" height="315" src="//www.youtube.com/embed/8GRDA1gG8R8" frameborder="0" allowfullscreen></iframe>
I am trying to get the src value using regex.
Regex.Match(details.Tables["MarketingDetails"].Rows[0]["MarketingVideo"].ToString(), "\\\"([^\\\"]*)\\\"").Groups[2].Value
That is how I am currently writing it.
How would I write this to pull the correct value of src?

You could do it like this:
// The capture group sits inside the quotes, so Groups[1] holds only the attribute value.
Match match = Regex.Match(@"<iframe width=""420"" height=""315"" src=""//www.youtube.com/embed/8GRDA1gG8R8"" frameborder=""0"" allowfullscreen></iframe>", @"src=""([^""]*)""");
Console.WriteLine(match.Groups[1].Value);
However, as others have already commented on your question, it's better practice to use an actual HTML parser.
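If a regex really is all that is needed for a single, well-known snippet, the idea above can be wrapped in a small helper. This is only a minimal sketch: the class and method names (AttributeRegexSketch, GetAttribute) are made up for illustration, it assumes double-quoted attribute values, and it would also match an attribute such as data-src, so it is no substitute for a parser.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AttributeRegexSketch
{
    // Hypothetical helper: returns the value of the first name="..." attribute,
    // or null when no such attribute is found. Assumes double-quoted values.
    public static string GetAttribute(string html, string name)
    {
        string pattern = "\\b" + Regex.Escape(name) + "\\s*=\\s*\"(?<value>[^\"]*)\"";
        Match m = Regex.Match(html, pattern, RegexOptions.IgnoreCase);
        return m.Success ? m.Groups["value"].Value : null;
    }

    public static void Main()
    {
        string html = "<iframe width=\"420\" height=\"315\" src=\"//www.youtube.com/embed/8GRDA1gG8R8\" frameborder=\"0\" allowfullscreen></iframe>";
        Console.WriteLine(GetAttribute(html, "src")); // prints //www.youtube.com/embed/8GRDA1gG8R8
    }
}
```

The named group keeps the call site readable, and checking m.Success avoids silently returning an empty string when the attribute is missing.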

Don't use regex to parse XML or HTML. It's not worth it. I'll let you read this post, and it sort of exaggerates the point, but the main thing to keep in mind is that you can get into a lot of trouble with regex and HTML.
So, instead you should use an actual HTML/XML parser! For starters, use XElement, a class built into the .NET Framework (namespace System.Xml.Linq).
string input = "<iframe width=\"420\" height=\"315\" src=\"//www.youtube.com/embed/8GRDA1gG8R8\" frameborder=\"0\" allowfullscreen=''></iframe>";
XElement html = XElement.Parse(input);
string src = html.Attribute("src").Value;
This will make src have the value //www.youtube.com/embed/8GRDA1gG8R8. You can then split that up to get whatever you need from it.
I should also note that your input is not valid XML. allowfullscreen does not have a value attached, which is why I added =''.
If you need to handle something more complex, such as your actual input, use an HTML parser (XElement is meant for XML). With the Html Agility Pack it looks like this (using the previous example):
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(input);
string src = doc.DocumentNode
    .Element("iframe")
    .Attributes["src"]
    .Value;
This parser is more forgiving of invalid, incorrect, or just irregular input. It will parse your original input just fine (that is, without the added ='').

Related

Using C# to convert incorrect html string to real html

My original issue is that I am trying to serialize a string containing html tags to an XML element.
hello World, this
is
a nice
test
<ul>
<li>to demonstrate my issue</li>
<li>and find a solution</li>
</ul>
However, I have 2 issues:
1. Serializing HTML to XML: I did not succeed in defining the Serializable class so that it serializes correctly with XmlSerializer, so I decided that using CDATA sections might be the better way. This, however, is not correctly deserialized by the target tool (which I have no influence on). What I need is plain and correct HTML (XHTML?) within the XML output file.
2. The string looks e.g. as above, but is not fully correct HTML (no <p> tags, no <br> tags).
Now I would like to replace the newlines with a p or br tag. I have had a look here and used the suggested solution:
string result = "<p>" + text
.Replace(Environment.NewLine + Environment.NewLine, "</p><p>")
.Replace(Environment.NewLine, "<br />")
.Replace("</p><p>", "</p>" + Environment.NewLine + "<p>") + "</p>";
However, this does not generate valid HTML in all cases. In the example above, it would create <br />s between the <li> tags or put <ul> tags within <p> tags, neither of which is allowed.
The target would be a result like the following (line breaks are only for better readability and don't matter here):
<p>hello World, this</p>
<p>is<br/>
a nice<br/>
test<br/></p>
<ul>
<li>to demonstrate my issue</li>
<li>and find a solution</li>
</ul>
Do you have any suggestions on how to solve this, either with string.Replace, Regex, or a better solution (HtmlDocument)?
Please note: I have no influence on deserialization; the XML output is evaluated by a tool I have no influence on, and it has to be UTF-8 encoded.
Thank you!
EDIT: Clearly separated the 2 issues
EDIT2: No influence on deserialization
EDIT3: Added target output

What you're trying to do is implement a "tag soup parser", which takes text that may or may not be HTML as input and transforms it into a valid DOM that an HTML parser can handle.
You don't want to reinvent this wheel, and most definitely not with simple string replaces. See How to parse bad html? for some hints.
Alternatively, you can encode the input HTML in such a way that it doesn't interfere with the XML you're trying to put it in: a CDATA section or base64-encoding the input would suffice. Don't use "entity encoding", as your XML parser is going to complain about HTML entities that aren't XML entities.

I've had to do something similar (ensuring 3rd-party content has valid HTML). If I were doing this, I'd do the following:
1) Replace line breaks with HTML line breaks
string result = text.Replace(Environment.NewLine, "<br />");
2) Use the Html Agility Pack to fix any invalid HTML
var doc = new HtmlDocument();
HtmlNode.ElementsFlags["p"] = HtmlElementFlag.Closed;
doc.OptionFixNestedTags = false;
doc.LoadHtml(result);
if (doc.ParseErrors.Count() > 0)
{
    // throw error
}
else
{
    // get fixed html
    result = doc.DocumentNode.OuterHtml;
}

How to Remove all the HTML tags and display a plain text using C#

I want to remove all HTML tags from a string. I can achieve this using a regex,
but if the string contains a number inside angle brackets, such as <100>, it should not be removed.
var withHtml = "<p>hello <b>there<1234></b></p>";
var withoutHtml = Regex.Replace(withHtml, "\\<[^\\>]*\\>", string.Empty);
Result: hello there
But the needed output is:
hello there 1234

Your example of HTML isn't valid HTML since it contains a non-HTML tag. I figure you intended for the angle-brackets to be encoded.
I don't think regular expressions are suitable for HTML parsing. I recommend using an HTML parser such as HTML Agility Pack to do this.
Here's an example:
var withHtml = "<p>hello <b>there<1234></b></p>";
var document = new HtmlDocument();
document.LoadHtml(withHtml);
var withoutHtml = HtmlEntity.DeEntitize(document.DocumentNode.InnerText);
Just add the HtmlAgilityPack NuGet package and a reference to System.Xml to make it work.

Not sure you can do this in one regular expression, or that a regex is really the correct way as others have suggested. A simple improvement that gets you almost there is:
Regex.Replace(withHtml, "\\<[^\\>0-9]*\\>", string.Empty);
This gives "hello there<1234>". You then just need to replace the remaining angle brackets.
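For what it's worth, a single pass does seem possible if the replacement is computed per match. The following is a rough sketch rather than a recommendation: the class and method names are invented for illustration, and it assumes that a "number in angle brackets" means digits only and that it should be replaced by a space plus the digits, as in the needed output from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StripTagsKeepNumbersSketch
{
    // Removes anything that looks like a tag, but keeps the digits of
    // pseudo-tags such as <1234>, prefixed with a space.
    public static string Strip(string html)
    {
        return Regex.Replace(
            html,
            @"<(?:(?<num>\d+)|[^>]*)>",
            m => m.Groups["num"].Success ? " " + m.Groups["num"].Value : string.Empty);
    }

    public static void Main()
    {
        Console.WriteLine(Strip("<p>hello <b>there<1234></b></p>")); // prints hello there 1234
    }
}
```

The same caveats as for any regex-based tag stripping apply: attribute values containing '>' or script content will not be handled correctly.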

Better alternative to getting the 'inner html' of a div?

I have a string:
<div class="className1234"><p>Some html</p></div>
From this string, I would like to get <p>Some html</p>, i.e. I would like to remove the surrounding div tags based on the fact that its class contains 'className'.
What I've Tried
What I've tried works, but it's kludgy, and I know there'll be a better alternative like regex or something. What I currently do is chain a series of Substring(), IndexOf() and Replace() calls to strip out the divs.
EDIT: I've used the phrase 'innerhtml' because I'd like to think there's a library out there somewhere that would allow me to manipulate a string with regard to the tags within it.
PLEASE NOTE: There's no JQuery involved in this. It's all server-side C#.
(See tags)

I would suggest Html Agility Pack, it's designed to allow operations on html documents, kind of like the builtin support for XML in the framework.
It might be overkill, but it will get the work done, easily, and you won't have to care about bad html

How about:
XmlDocument doc = new XmlDocument();
doc.LoadXml(divStr);
// classAtr will be null if the root is not a div whose class attribute contains className1234
XmlNode classAtr = doc.SelectSingleNode("/div/@class[contains(., 'className1234')]");
string result = classAtr != null ? doc.DocumentElement.InnerXml : divStr;
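The same idea can also be written with LINQ to XML instead of XPath. This is a minimal sketch that assumes the fragment is well-formed XML (XElement.Parse throws an XmlException otherwise); the names InnerXmlSketch, UnwrapDiv and classPart are hypothetical and only serve the example.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class InnerXmlSketch
{
    // Returns the inner markup of the root element when it is a div whose
    // class attribute contains classPart; otherwise returns the input unchanged.
    public static string UnwrapDiv(string fragment, string classPart)
    {
        XElement root = XElement.Parse(fragment);
        string cls = (string)root.Attribute("class"); // null when the attribute is absent
        bool matches = root.Name.LocalName == "div" && cls != null && cls.Contains(classPart);
        if (!matches)
        {
            return fragment;
        }
        // Concatenate the child nodes without re-indenting them.
        return string.Concat(root.Nodes().Select(n => n.ToString(SaveOptions.DisableFormatting)));
    }

    public static void Main()
    {
        Console.WriteLine(UnwrapDiv("<div class=\"className1234\"><p>Some html</p></div>", "className"));
        // prints <p>Some html</p>
    }
}
```

For real-world HTML that is not valid XML, an HTML parser such as the Html Agility Pack remains the safer choice.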

Whenever you need to manipulate HTML, you should use a dedicated HTML parser/DOM library. One library I've found recommended here on StackOverflow for .Net is HTMLAgilityPack.

As others said, HtmlAgilityPack is the best choice for HTML parsing. Also be sure to download HAP Explorer from the HtmlAgilityPack site and use it to test your selects. Anyway, this SelectNodes call will get the matching nodes:
HtmlDocument doc = new HtmlDocument();
doc.Load(htmlFile);
var myNodes = doc.DocumentNode.SelectNodes("/div/@class[. = 'className1234']");
if (myNodes != null) // SelectNodes returns null when nothing matches
{
    foreach (HtmlNode node in myNodes)
    {
        // your code
    }
}

How to get text off a webpage?

I want to get text off of a webpage in C#.
I don't want to get the HTML, I want the real text off of the webpage. Like if I type "<b>cake</b>", I want the cake, not the tags.

Use the HTML Agility Pack library.
It is a very good library for parsing HTML. For your requirement, use this code:
HtmlAgilityPack.HtmlWeb web = new HtmlAgilityPack.HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load("Your path (local, web)");
var result = doc.DocumentNode.SelectNodes("//body//text()"); // returns an HtmlNodeCollection
foreach (var node in result)
{
    string achievedText = node.InnerText; // the text you want
}

It depends. If your application downloads the webpage using a WebBrowser component, then that component will do the parsing for you automatically in the background (just like Internet Explorer). Just walk the DOM tree and extract the text you want. You will find HtmlElement.InnerText property especially useful :)

You can strip tags using regular expressions such as this one [2] (a simple example):
// You can import System.Text.RegularExpressions for convenience, of course.
System.Text.RegularExpressions.Regex tag = new System.Text.RegularExpressions.Regex(@"<.+?>");
myHTML = tag.Replace(myHTML, String.Empty);
But if you need to retrieve large volumes of well-structured data, then you might be better off using an HTML library [1]. (If the webpage is XHTML, all the better: use the System.Xml classes.)
[1] Like http://htmlagilitypack.codeplex.com/, for example.
[2] This might have unintended side effects if you're trying to get data out of JavaScript, or if the data is inside an element's attribute and includes angle brackets. You'll also need to handle escape sequences like &amp;.
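To illustrate the last point about escape sequences, here is a small sketch that strips tags and then decodes entities with System.Net.WebUtility.HtmlDecode. The class and method names are invented for the example, and the tag pattern is deliberately naive, with the same limitations described in the footnote above.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class PlainTextSketch
{
    // Naive tag stripping followed by entity decoding.
    public static string ToPlainText(string html)
    {
        string withoutTags = Regex.Replace(html, @"<[^>]+>", string.Empty);
        return WebUtility.HtmlDecode(withoutTags);
    }

    public static void Main()
    {
        Console.WriteLine(ToPlainText("<b>cake</b> &amp; tea")); // prints cake & tea
    }
}
```

Decoding after stripping matters: decoding first would turn &lt;b&gt; into a real-looking tag that the regex would then remove.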

parsing XML with ampersand

I have a string which contains XML. I just want to parse it into an XElement, but it contains an ampersand. I still have a problem parsing it with HtmlDecode. Any suggestions?
string test = " <MyXML><SubXML><XmlEntry Element=\"test\" value=\"wow&\" /></SubXML></MyXML>";
XElement.Parse(HttpUtility.HtmlDecode(test));
I also added these methods to replace those characters, but I am still getting an XmlException.
string encodedXml = test.Replace("&", "&amp;").Replace("<", "&lt;").Replace(">", "&gt;").Replace("\"", "&quot;").Replace("'", "&apos;");
XElement myXML = XElement.Parse(encodedXml);
I even tried it with this:
string newContent = SecurityElement.Escape(test);
XElement myXML = XElement.Parse(newContent);

Ideally the XML is escaped properly prior to your code consuming it. If this is beyond your control you could write a regex. Do not use the String.Replace method unless you're absolutely sure the values do not contain other escaped items.
For example, "wow&amp;".Replace("&", "&amp;") results in wow&amp;amp; which is clearly undesirable.
Regex.Replace can give you more control to avoid this scenario, and can be written to only match "&" symbols that are not part of other character entities, such as &lt;, something like:
string result = Regex.Replace(test, "&(?!(amp|apos|quot|lt|gt);)", "&amp;");
The above works, but admittedly it doesn't cover the variety of other entities that start with an ampersand, such as &nbsp;, and the list can grow.
A more flexible approach would be to decode the content of the value attribute, then re-encode it. If you have value="&wow&amp;" the decode process would return "&wow&", then re-encoding it would return "&amp;wow&amp;", which is desirable. To pull this off you could use this:
string result = Regex.Replace(test, @"value=\""(.*?)\""", m => "value=\"" +
    HttpUtility.HtmlEncode(HttpUtility.HtmlDecode(m.Groups[1].Value)) +
    "\"");
var doc = XElement.Parse(result);
Bear in mind that the above regex only targets the contents of the value attribute. If there are other areas in the XML structure that suffer from the same issue then it can be tweaked to match them and replace their content in a similar fashion.
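As a variation on the negative-lookahead approach, the pattern can be widened to also leave numeric character references (such as &#38; or &#x26;) alone. This is only a sketch under the assumption that the five predefined XML entities and numeric references are the only ones that should survive; the names AmpersandSketch and EscapeBareAmpersands are made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class AmpersandSketch
{
    // Escapes every '&' that does not start a predefined XML entity
    // or a numeric character reference.
    public static string EscapeBareAmpersands(string xml)
    {
        return Regex.Replace(
            xml,
            @"&(?!(?:amp|apos|quot|lt|gt|#\d+|#x[0-9a-fA-F]+);)",
            "&amp;");
    }

    public static void Main()
    {
        string test = "<MyXML><SubXML><XmlEntry Element=\"test\" value=\"wow&\" /></SubXML></MyXML>";
        XElement doc = XElement.Parse(EscapeBareAmpersands(test));
        string value = doc.Element("SubXML").Element("XmlEntry").Attribute("value").Value;
        Console.WriteLine(value); // prints wow&
    }
}
```

Note that an HTML-only entity such as &nbsp; would be escaped to &amp;nbsp; by this pattern, which keeps the XML parseable but changes the text, so whether that is acceptable depends on the data.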
EDIT: updated solution that should handle content between tags as well as anything between double quotes. Be sure to test this thoroughly. Attempting to manipulate XML/HTML tags with regex is not favorable as it can be error prone and over-complicated. Your case is somewhat special since you need to sanitize it first in order to make use of it.
string pattern = "(?<start>>)(?<content>.+?(?<!>))(?<end><)|(?<start>\")(?<content>.+?)(?<end>\")";
string result = Regex.Replace(test, pattern, m =>
m.Groups["start"].Value +
HttpUtility.HtmlEncode(HttpUtility.HtmlDecode(m.Groups["content"].Value)) +
m.Groups["end"].Value);
var doc = XElement.Parse(result);

Your string doesn't contain valid XML; that's the issue. You need to change your string to:
<MyXML><SubXML><XmlEntry Element="test" value="wow&amp;" /></SubXML></MyXML>

HtmlEncode will not do the trick; it will probably create even more ampersands (for instance, a " might become &quot;, which is an XML entity reference). The XML entity references are the following:
&amp; &
&apos; '
&quot; "
&lt; <
&gt; >
But it might also give you things like &nbsp;, which is fine in HTML but not in XML. Therefore, like everybody else said, correct the XML first by making sure that any character that is NOT PART OF THE ACTUAL MARKUP OF YOUR XML (that is to say, anything INSIDE your XML as a variable or text) and that occurs in the entity reference list is translated to its corresponding entity (so < would become &lt;). If the text containing the illegal character is text inside an XML node, you could take the easy way and surround the text with a CDATA section; this won't work for attributes, though.

Filip's answer is on the right track, but you can hijack the System.Xml.XmlDocument class to do this for you without an entire new utility function.
XmlDocument doc = new XmlDocument();
string xmlEscapedString = (doc.CreateTextNode("Unescaped '&' containing string that would have broken your xml")).OuterXml;

The ampersand makes the XML invalid. This cannot be fixed by a stylesheet, so you need to write code with some other tool or in VB/C#/PHP/Delphi/Lisp/etc. to remove it or to translate it to &amp;.

This is the simplest and best approach. It works with all characters and allows you to parse XML for any web service call, e.g. SharePoint ASMX.
public string XmlEscape(string unescaped)
{
    XmlDocument doc = new XmlDocument();
    var node = doc.CreateElement("root");
    node.InnerText = unescaped;
    return node.InnerXml;
}

If your string is not valid XML, it will not parse. If it contains an ampersand on its own, it's not valid XML. Contrary to HTML, XML is very strict.

You should 'encode' rather than decode. But calling HttpUtility.HtmlEncode will not help you as it will encode your '<' and '>' symbols as well and your string will no longer be an XML.
I think that for this case the best solution would be to replace '&' with '& amp;' (with no space)

Perhaps consider writing your own XMLDocumentScanner. That's what NekoHTML is doing to have the ability to ignore ampersands not used as entity references.
