This question already has answers here:
What is the best way to parse html in C#? [closed]
(15 answers)
Closed 9 years ago.
I have a long C# string of HTML and I want to extract just the bullet points ("<ul><li></li></ul>").
Say I have the following HTML string.
var html = "<div class=ClassC441AA82DA8C5C23878D8>Here is a text that should be ignored.</div>This text should be ignored too<br><ul><li>* Need this one</li><li>Another bullet point I need</li><li>A bulletpoint again that I want</li><li>And this is the last bullet I want</li></ul><div>Ignore this line and text</div><p>Ignore this as well.</p>Text not important.";
I need everything between the '<ul>' and '</ul>' tags. The '<ul>' tags themselves can be excluded.
Regular expressions are not my strong suit, but if one can be used here I need some help.
My code is in C#.
You should use the HtmlAgilityPack for things like this. I wrote a little introduction to it a while ago that may help you get going: http://colinmackay.scot/2011/03/22/a-quick-intro-to-the-html-agility-pack/
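For comparison, here is a minimal regex-only sketch of what the question asks for. It is not the parser-based approach recommended above, and it assumes a single, flat list with no nested <ul> elements; on real-world HTML a parser is the safer choice. The sample string is shortened from the question.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Sample input, shortened from the question.
var html = "<div>Ignore me</div>Ignored too<br><ul><li>* Need this one</li><li>Another bullet point I need</li></ul><p>Ignore this as well.</p>";

// Capture everything between the first <ul> and the first </ul>.
// Singleline lets '.' span newlines; the lazy '.*?' stops at the first closing tag.
var list = Regex.Match(html, @"<ul[^>]*>(.*?)</ul>",
    RegexOptions.Singleline | RegexOptions.IgnoreCase);

// Pull the inner text of each <li> out of the captured block.
var items = Regex.Matches(list.Groups[1].Value, @"<li[^>]*>(.*?)</li>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase)
    .Cast<Match>()
    .Select(m => m.Groups[1].Value)
    .ToList();

foreach (var item in items)
    Console.WriteLine(item);
```

If the list is missing, `list.Groups[1].Value` is an empty string and `items` is simply empty.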
This question already has answers here:
What is the best way to parse html in C#? [closed]
(15 answers)
Closed 2 years ago.
I would prefer to use a regex rather than an HTML parser.
What is the best way to extract a base64 image from an HTML string like this:
"<p>This is test </p>
<p><img src=\"data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/4gKgSUNDX1BST0ZJTEUA....+tzPaXLlstlSjpcxKPEqV/zH//2Q==\"></p>"
I need this part so I can access the base64 image:
/9j/4AAQSkZJRgABAQAAAQABAAD/4gKgSUNDX1BST0ZJTEUA....+tzPaXLlstlSjpcxKPEqV/zH//2Q==
If there is an adequate HTML parser for this use case as suggested by others in the comments, go for that...
But if that doesn't work, regular expressions to the rescue! This uses a positive lookbehind assertion and matches everything up to the first double quote. It should work; adjust if it doesn't...
using System;
using System.Text.RegularExpressions;

var val = "<p>This is test </p><p><img src=\"data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/4gKgSUNDX1BST0ZJTEUA....+tzPaXLlstlSjpcxKPEqV/zH//2Q==\"></p>";
// The lookbehind anchors the match right after the prefix; [^\"]* then stops at the closing quote.
var match = Regex.Match(val, "(?<=data:image/jpeg;base64,)[^\"]*");
Console.WriteLine(match.Value);
// output: /9j/4AAQSkZJRgABAQAAAQABAAD/4gKgSUNDX1BST0ZJTEUA....+tzPaXLlstlSjpcxKPEqV/zH//2Q==
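If the image type is not always JPEG, a variant with named groups may be more convenient. The sketch below is an assumption-laden illustration rather than part of the original answer: the group names `type` and `data` are made up for this example, and the payload `SGVsbG8=` is a placeholder (base64 for the ASCII text "Hello"), not a real image.

```csharp
using System;
using System.Text.RegularExpressions;

// Placeholder payload: "SGVsbG8=" decodes to the 5 ASCII bytes of "Hello".
var html = "<p>This is test </p><p><img src=\"data:image/png;base64,SGVsbG8=\"></p>";

// 'type' captures the image subtype, 'data' the base64 characters only,
// so the match ends at the closing quote without needing a lookbehind.
var m = Regex.Match(html,
    "data:image/(?<type>[a-zA-Z0-9.+-]+);base64,(?<data>[A-Za-z0-9+/=]+)");

var type = m.Groups["type"].Value;
// Decode the captured text into raw bytes (an empty array if nothing matched).
var bytes = Convert.FromBase64String(m.Groups["data"].Value);

Console.WriteLine(type + ": " + bytes.Length + " bytes");
```

The decoded `bytes` can then be written to a file or loaded into an image type as needed.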
This question already has answers here:
My regex is matching too much. How do I make it stop? [duplicate]
(5 answers)
Closed 6 years ago.
I'm converting a lot of code from legacy to maintainable, and I'm creating a list of regexes we can use to convert all the pages quickly and consistently. My regex skills are those of a child running with a knife... it's not great. I've looked up a lot of different ways to find only the first set, but I can't seem to get it to work. Can anyone solve this specific problem for me?
Here is the regex search and replace I'm using.
regex: (rs.*)\.Fields\[\"(\w+)\"\].Value
replace: $1.GetValue<object>("$2")
Works
code to search: ...rsProducts.Fields["Price"].Value...
result: rsProducts.GetValue<object>("Price")
This, as I want it to, finds the rs (recordset) of something and changes the way that we extract the value to use an extension method.
Does Not Work
code to search: ...rsProducts.Fields["Price"].Value + rsProducts.Fields["Price2"].Value...
result: rsProducts.Fields["Price"].Value + rsProducts.Fields["Price2"].Value
should be: rsProducts.GetValue<object>("Price") + rsProducts.GetValue<object>("Price2")
In this case the search should match two distinct instances, but instead it matches the entire line (checked on regexr.com).
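You're not handling the case of the + between the two. The greedy .* in (rs.*) runs on to the last .Fields on the line, so both expressions are swallowed by a single match. Make the quantifier lazy:
(rs.*?)\.Fields\[\"(\w+)\"\]\.Value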
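To see the difference, here is a small sketch that runs both patterns through `Regex.Replace` on the failing line from the question. The variable names are only for illustration, and the dot before `Value` is escaped in both patterns.

```csharp
using System;
using System.Text.RegularExpressions;

var code = "rsProducts.Fields[\"Price\"].Value + rsProducts.Fields[\"Price2\"].Value";
var replacement = "$1.GetValue<object>(\"$2\")";

// Greedy: (rs.*) grabs as much as it can, so only the last .Fields[...] is rewritten.
var greedy = Regex.Replace(code, @"(rs.*)\.Fields\[""(\w+)""\]\.Value", replacement);

// Lazy: (rs.*?) stops at the first .Fields[...], so each expression is matched separately.
var lazy = Regex.Replace(code, @"(rs.*?)\.Fields\[""(\w+)""\]\.Value", replacement);

Console.WriteLine(greedy);
// rsProducts.Fields["Price"].Value + rsProducts.GetValue<object>("Price2")
Console.WriteLine(lazy);
// rsProducts.GetValue<object>("Price") + rsProducts.GetValue<object>("Price2")
```

If recordset names are always plain identifiers, `(rs\w*)` would be a stricter alternative to `(rs.*?)`, since it cannot run across operators or spaces at all.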
This question already has answers here:
What is the best way to parse html in C#? [closed]
(15 answers)
Closed 6 years ago.
Using Regex, I'm trying to get data from HTML code, but I don't know how to build the expression without relying on any HTML tags.
I have a marker string (item-desc) and a count of characters that follow it; those characters are my data.
For example, given item-desc12345abcde and a count of 6, I should get 12345a.
This expression gives me only one character after my string:
Regex itemInfoFilter = new Regex(@"item-desc\s*(.+?)\s*>", RegexOptions.Compiled | RegexOptions.IgnoreCase);
I don't recommend using regular expressions to parse HTML.
Use an HTML parser instead:
HTML Agility Pack
From what I understand of your question, I think this should work: item-desc(.{6})(?=[\s'"])
Here I assume that the six characters are followed by whitespace (\s), ' or "
Hope this helps
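A minimal sketch of the "N characters after a marker" idea, without the lookahead so that it also handles the item-desc12345abcde example from the question. `TakeAfter` is a hypothetical helper name introduced here, not something from the original posts.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: returns the `count` characters that follow `marker`,
// or an empty string when the marker is absent or too few characters follow it.
string TakeAfter(string input, string marker, int count)
{
    // Regex.Escape keeps any special characters in the marker literal;
    // (.{N}) captures exactly N characters in a single group.
    var pattern = Regex.Escape(marker) + "(.{" + count + "})";
    var m = Regex.Match(input, pattern, RegexOptions.IgnoreCase);
    return m.Success ? m.Groups[1].Value : string.Empty;
}

var result = TakeAfter("item-desc12345abcde", "item-desc", 6);
Console.WriteLine(result); // 12345a
```

Note that `.` does not match a newline by default, so this returns an empty string if a line break falls inside the six characters.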
This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Full text search in HTML ignoring tags / &
I did lots of googling but didn't find any help.
I have a WebBrowser control which has an HTML body. The body contains data that also includes special characters. I want to add a search box which will search the page and highlight the matches. The search text can include special characters like \,/,?,$,^,&,<,> .
How should I achieve this using jQuery/JavaScript or C#?
Here's an answer I gave to a similar question:
https://stackoverflow.com/a/5887719/96100
However, window.find(), which the above answer relies on, is likely to be removed from browsers in the future and is not going to be replaced in the short term. That being the case, I've written a flexible search function for my Rangy library. Demo (with highlighting) here:
http://rangy.googlecode.com/svn/trunk/demos/textrange.html
This question already has answers here:
How do you convert Html to plain text?
(20 answers)
Closed 1 year ago.
I need a way to get all text from my aspx files.
They may also contain JavaScript, but I only need this for the HTML code.
Basically I need to extract everything in Text or Value attributes, text within the markup, whatever...
Is there any parser API available?
Cheers!
Alex
As an alternative, you might consider playing with LINQ to XML to strip the interesting stuff out.
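A rough sketch of what that could look like, under a big assumption: the markup must be well-formed XML. Real .aspx files usually are not (directives such as <%@ Page %> and undeclared asp: prefixes will make `XDocument.Parse` throw), so the sample below wraps the controls in a made-up root element that declares a placeholder namespace for the asp prefix.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical, well-formed sample; the root element and the "urn:asp" namespace are placeholders.
var markup =
    "<root xmlns:asp=\"urn:asp\">" +
    "<asp:Label runat=\"server\" Text=\"Hello\" />" +
    "<p>World</p>" +
    "<asp:Button runat=\"server\" Value=\"Go\" />" +
    "</root>";

var doc = XDocument.Parse(markup);

// Values of every Text or Value attribute, in document order.
var attributeTexts = doc.Descendants()
    .Attributes()
    .Where(a => a.Name.LocalName == "Text" || a.Name.LocalName == "Value")
    .Select(a => a.Value);

// Non-empty text nodes between tags.
var nodeTexts = doc.DescendantNodes()
    .OfType<XText>()
    .Select(t => t.Value.Trim())
    .Where(s => s.Length > 0);

var all = attributeTexts.Concat(nodeTexts).ToList();

foreach (var text in all)
    Console.WriteLine(text);
```

For markup that is not well-formed, a tolerant HTML parser is likely a better fit than LINQ to XML.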