How can html be parsed as XML when containing '...&body='? [duplicate] - c#

This question already has answers here:
What is the best way to parse html in C#? [closed]
(15 answers)
Closed 8 years ago.
I have an HTML file that is a well-formed XML document (all tags are paired), but it contains an anchor like the one below:
link
The XML parser invoked by XDocument.Load throws an XmlException that says:
Additional information: '=' is an unexpected token. The expected token is ';'.
How can I tell the parser that '&body' is not an entity? Must I escape the '&' character?

Not all HTML is valid XML, so you shouldn't try to parse it as XML (although in this case it looks like the document contains some unescaped characters that should be fixed anyway).
Instead, use something like the HTML Agility Pack to parse your HTML and work with the document that way.
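
To the literal question: yes, a bare '&' has to be written as '&amp;' before an XML parser will accept it. The following is a minimal sketch of that point only, not a general fix for arbitrary HTML. The anchor is invented for illustration (the original link markup is not shown in the question), and the names EscapeDemo and IsWellFormed are hypothetical.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

public static class EscapeDemo
{
    // Returns true when the text parses as XML, false when XDocument.Parse throws.
    public static bool IsWellFormed(string xml)
    {
        try
        {
            XDocument.Parse(xml);
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }

    public static void Main()
    {
        // Hypothetical anchor; the address and query string are assumptions.
        string raw = "<a href=\"mailto:someone@example.com?subject=Hi&body=Hello\">link</a>";

        // Escape only the ampersand that starts "&body=".
        string escaped = raw.Replace("&body=", "&amp;body=");

        Console.WriteLine(IsWellFormed(raw));      // the bare '&' is rejected
        Console.WriteLine(IsWellFormed(escaped));  // '&amp;' is accepted

        // After parsing, the attribute value contains a plain '&' again.
        string href = (string)XDocument.Parse(escaped).Root.Attribute("href");
        Console.WriteLine(href);
    }
}
```

The first check should report false and the second true. Escaping by hand only works when you control the file; for HTML you do not control, an HTML parser remains the safer route.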

Related

How to remove Only HTML tags in the program [duplicate]

This question already has an answer here:
Retrieving Inner Text of Html Tag C#
(1 answer)
Closed 3 years ago.
I want to remove the HTML tags from some source text with C#.
Unfortunately, the text also contains content like <This is content>, which is not a tag.
First, I tried the Regex class like this:
Regex.Replace(htmltext,"[\\x00-\\x1f<>:\"/\\\\|?*]" +
"|^(CON|PRN|AUX|NUL|COM[0-9]|LPT[0-9]|CLOCK\\$)(\\.|$)" +
"|[\\. ]$", String.Empty);
But in this case,
"<This is content>" was removed as well.
Could anyone please tell me how to remove only the HTML tags?
Thanks and regards.
Don't try and parse HTML with Regex. It tends not to go well.
Use a parser, HTML Agility Pack is very popular.
With HTML Agility Pack you can simply read a node's InnerText property to extract the contents without HTML tags.

XML parsing throwing an exception for '&' and '<' characters [duplicate]

This question already has answers here:
How to parse invalid (bad / not well-formed) XML?
(4 answers)
Closed 3 years ago.
I get an exception while parsing XML that contains '&' and '<' characters. I have read that such characters make the XML invalid, but I receive the data from a third party and can't reformat it.
Below is my XML parsing code using XDocument:
string data = profile.Content.ReadAsStringAsync().Result; //Read input
XDocument doc = new XDocument();
if (data != "")
{
string rawHtml = WebUtility.HtmlDecode(data);
doc = XDocument.Parse(rawHtml); //Parse input into XDocument
}
Here, data contains the actual XML input, not a path to an XML file.
Please suggest how to handle these special characters.
This data is not XML.
Check what you agreed with the third party.
If the contract was to exchange data in XML, then they are failing to satisfy the contract and you should deal with it the way you would deal with any other faulty goods from a supplier: return it and ask for your money back.
If the agreement didn't specify that they would send you XML, then you shouldn't be trying to parse it with an XML parser.
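
Separately from the contract question, the WebUtility.HtmlDecode call in the question's code may itself be introducing the problem. This is an assumption, since the actual payload is not shown: if the third party sends correctly escaped XML, decoding it first turns '&amp;' and '&lt;' back into bare '&' and '<', which the parser then rejects. A minimal sketch with an invented payload and hypothetical names (DecodeDemo, Parses):

```csharp
using System;
using System.Net;
using System.Xml;
using System.Xml.Linq;

public static class DecodeDemo
{
    // Returns true when the text parses as XML, false when XDocument.Parse throws.
    public static bool Parses(string xml)
    {
        try
        {
            XDocument.Parse(xml);
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }

    public static void Main()
    {
        // Invented payload: well-formed XML with escaped special characters.
        string data = "<note>Fish &amp; chips, 1 &lt; 2</note>";

        // Decoding before parsing unescapes the entities and breaks the markup.
        string decoded = WebUtility.HtmlDecode(data);

        Console.WriteLine(Parses(data));     // escaped input parses
        Console.WriteLine(Parses(decoded));  // decoded input does not
    }
}
```

If the raw response already fails to parse without the decoding step, the answer above applies: the data is not XML.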

C# (.NET), Html parse using regex [duplicate]

This question already has answers here:
What is the best way to parse html in C#? [closed]
(15 answers)
Closed 6 years ago.
Using Regex, I'm trying to get data from HTML code, but I don't know how to build the pattern without relying on any HTML tags.
I have a marker string (item-desc) and a number of characters to take after it; those characters are my data.
For example, given item-desc12345abcde and a count of 6 characters, I want to get 12345a.
This expression gives me only 1 character after my string:
Regex itemInfoFilter = new Regex(@"item-desc\s*(.+?)\s*>", RegexOptions.Compiled | RegexOptions.IgnoreCase);
I don't recommend using regular expressions to parse HTML.
Use an HTML parser instead:
HTML Agility Pack
From what I understand of your question, I think this should work: item-desc(.){6}(?=[\s'"])
The pattern assumes that the six characters are followed by whitespace (\s), ' or ".
Hope this helps.
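
As a variation on the pattern above, here is a minimal sketch that puts the quantifier inside the capture group, so the group holds all the requested characters; with the quantifier outside, as in (.){6}, Groups[1] keeps only the last character matched. It also drops the trailing lookahead, because the question's sample item-desc12345abcde has no space or quote after the sixth character. The names ItemDescDemo and TakeAfterMarker are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ItemDescDemo
{
    // Returns the 'count' characters that follow "item-desc", or null if there is no match.
    public static string TakeAfterMarker(string input, int count)
    {
        // .{count} inside the parentheses captures the whole run of characters.
        string pattern = "item-desc(.{" + count + "})";
        Match match = Regex.Match(input, pattern, RegexOptions.IgnoreCase);
        return match.Success ? match.Groups[1].Value : null;
    }

    public static void Main()
    {
        Console.WriteLine(TakeAfterMarker("item-desc12345abcde", 6)); // prints 12345a
    }
}
```

This is still plain string matching; it does not make regular expressions a sound way to parse HTML structure.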

Easiest way to extract some html from string [duplicate]

This question already has answers here:
What is the best way to parse html in C#? [closed]
(15 answers)
Closed 9 years ago.
I have a long C# string of HTML code and I want to extract just the bullet points, "<ul><li></li></ul>".
Say I have the following HTML string.
var html = "<div class=ClassC441AA82DA8C5C23878D8>Here is a text that should be ignored.</div>This text should be ignored too<br><ul><li>* Need this one</li><li>Another bullet point I need</li><li>A bulletpoint again that I want</li><li>And this is the last bullet I want</li></ul><div>Ignore this line and text</div><p>Ignore this as well.</p>Text not important."
I need everything between the '<ul>' and '</ul>' tags. The '<ul>' tags themselves can be excluded.
Regular expressions are not my strong suit, but if they can be used here I would appreciate some help.
My code is in C#.
You should use the HtmlAgilityPack for things like this. I wrote a little introduction to it a while ago that may help you get going: http://colinmackay.scot/2011/03/22/a-quick-intro-to-the-html-agility-pack/
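
A minimal sketch of that suggestion, assuming the HtmlAgilityPack NuGet package is referenced. HtmlDocument, LoadHtml, SelectSingleNode, SelectNodes, InnerHtml and InnerText are part of that library; ListDemo, InnerOfFirstList and the shortened sample string are hypothetical and only stand in for the question's HTML.

```csharp
using System;
using HtmlAgilityPack; // third-party package: HtmlAgilityPack on NuGet

public static class ListDemo
{
    // Returns the markup inside the first <ul>, or null when there is none.
    public static string InnerOfFirstList(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when the XPath matches nothing.
        HtmlNode list = doc.DocumentNode.SelectSingleNode("//ul");
        return list?.InnerHtml;
    }

    public static void Main()
    {
        // Shortened stand-in for the HTML string in the question.
        string html = "<div>Ignored</div>Ignored too<br>" +
                      "<ul><li>First bullet</li><li>Second bullet</li></ul>" +
                      "<p>Ignored as well.</p>";

        Console.WriteLine(InnerOfFirstList(html));

        // To get each bullet's text instead of markup, walk the <li> nodes.
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNodeCollection items = doc.DocumentNode.SelectNodes("//ul/li");
        if (items != null) // SelectNodes returns null rather than an empty collection
        {
            foreach (HtmlNode item in items)
            {
                Console.WriteLine(item.InnerText);
            }
        }
    }
}
```

Because the parser tolerates unquoted attributes and unclosed tags such as <br>, this approach does not require the input to be well-formed XML.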

Get text from HTML [duplicate]

This question already has answers here:
How do you convert Html to plain text?
(20 answers)
Closed 1 year ago.
I need a way to get all the text from my aspx files.
They may also contain JavaScript, but I only need this for the HTML code.
Basically, I need to extract everything in Text or Value attributes, text within code, and so on.
Is there any parser API available?
Cheers!
Alex
As an alternative, you might consider using LINQ to XML to pull out the parts you are interested in.
