C# (.NET), Html parse using regex [duplicate]

This question already has answers here:
What is the best way to parse html in C#? [closed]
(15 answers)
Closed 6 years ago.
Using Regex, I'm trying to get data from HTML code, but I don't know how to build the pattern without relying on any HTML tags.
I have a marker string (item-desc) and a count of characters after that string, which make up my data.
For example, given item-desc12345abcde and a count of 6 characters, I want to get 12345a.
This expression gives me only 1 character after my string:
Regex itemInfoFilter = new Regex(@"item-desc\s*(.+?)\s*>", RegexOptions.Compiled | RegexOptions.IgnoreCase);

I don't recommend using regular expressions to parse HTML.
Use an HTML parser instead:
HTML Agility Pack

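From what I understand of your question, I think this should work: item-desc(.{6})(?=[\s'"])
The quantifier sits inside the group, so group 1 holds all six characters; with (.){6} it would hold only the last one.
The lookahead assumes that your value is followed by whitespace (\s), ' or ". Drop it if other characters can follow.
Hope this helps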
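
A minimal sketch of the fixed-length capture, assuming the marker is the literal text item-desc and that no lookahead is wanted. The class name ItemDescDemo and the method TakeAfterMarker are hypothetical, introduced only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ItemDescDemo
{
    // Returns the `count` characters that follow the literal marker "item-desc",
    // or null when the marker is missing or fewer than `count` characters follow it.
    public static string TakeAfterMarker(string input, int count)
    {
        // .{n} inside the group captures all n characters at once.
        var pattern = "item-desc(.{" + count + "})";
        var match = Regex.Match(input, pattern, RegexOptions.IgnoreCase);
        return match.Success ? match.Groups[1].Value : null;
    }

    public static void Main()
    {
        Console.WriteLine(TakeAfterMarker("item-desc12345abcde", 6)); // 12345a
    }
}
```

Building the pattern from the count keeps the length configurable; note that . does not match a line break unless RegexOptions.Singleline is set.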

Related

Best way to separate base64 image from a string in C# [duplicate]

This question already has answers here:
What is the best way to parse html in C#? [closed]
(15 answers)
Closed 2 years ago.
I prefer to use regex rather than an HTML parser.
What is the best way to extract a base64 image from an HTML string like this:
"<p>This is test </p>
<p><img src=\"data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/4gKgSUNDX1BST0ZJTEUA....+tzPaXLlstlSjpcxKPEqV/zH//2Q==\"></p>"
I need this part so I can access the base64 image:
/9j/4AAQSkZJRgABAQAAAQABAAD/4gKgSUNDX1BST0ZJTEUA....+tzPaXLlstlSjpcxKPEqV/zH//2Q==

If an HTML parser is adequate for this use case, as others suggested in the comments, go for that.
If it isn't, a regular expression can do the job. This one uses a positive lookbehind assertion and matches everything up to the first double quote. It should work; adjust it if it doesn't.
using System;
using System.Text.RegularExpressions;
var val = "<p>This is test </p><p><img src=\"data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/4gKgSUNDX1BST0ZJTEUA....+tzPaXLlstlSjpcxKPEqV/zH//2Q==\"></p>";
// The lookbehind anchors the match right after the data URI prefix; [^\"]* stops at the closing quote.
var match = Regex.Match(val, "(?<=data:image/jpeg;base64,)[^\"]*");
Console.WriteLine(match.Value);
// output: /9j/4AAQSkZJRgABAQAAAQABAAD/4gKgSUNDX1BST0ZJTEUA....+tzPaXLlstlSjpcxKPEqV/zH//2Q==

Regex - Find Img Element [duplicate]

This question already has answers here:
My regex is matching too much. How do I make it stop? [duplicate]
(5 answers)
Closed 4 years ago.
I am trying to get my regular expression to work, to no avail.
All I want to do is find the image tags in an HTML string so I can replace them.
This is what I think should work:
var regex = new Regex(@"<img.*>");
return regex.Replace(content, "<p><i><b>(See Image Online)</b></i></p>");
And it does work partially, but it seems to be stripping out more than just the image tag.
This is an example of what I want to match:
<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAM0AAAD
NCAMAAAAsYgRbAAAAGXRFWHRTb2Z0d2FyZQBBZG9iZSBJbWFnZVJlYWR5c
cllPAAAABJQTFRF3NSmzMewPxIG//ncJEJsldTou1jHgAAAARBJREFUeNrs2EEK
gCAQBVDLuv+V20dENbMY831wKz4Y/VHb/5RGQ0NDQ0NDQ0NDQ0NDQ0NDQ
0NDQ0NDQ0NDQ0NDQ0NDQ0NDQ0PzMWtyaGhoaGhoaGhoaGhoaGhoxtb0QGho
aGhoaGhoaGhoaGhoaMbRLEvv50VTQ9OTQ5OpyZ01GpM2g0bfmDQaL7S+ofFC6x
v3ZpxJiywakzbvd9r3RWPS9I2+MWk0+kbf0Hih9Y17U0nTHibrDDQ0NDQ0NDQ0
NDQ0NDQ0NTXbRSL/AK72o6GhoaGhoRlL8951vwsNDQ0NDQ1NDc0WyHtDTEhD
Q0NDQ0NTS5MdGhoaGhoaGhoaGhoaGhoaGhoaGhoaGposzSHAAErMwwQ2HwRQ
AAAAAElFTkSuQmCC" alt="beastie.png">

You need either
new Regex(@"<img.*?>");
if lazy quantifiers are supported, or if not,
new Regex(@"<img[^>]*>");
Your problem is that .* is greedy: it does not stop at the first ">" it finds but runs on to the LAST one.
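
A small sketch contrasting the greedy pattern with the negated character class. The class ImgReplaceDemo and its method names are hypothetical, and the sample markup is made up for illustration. One detail worth keeping in mind for the multi-line tag in the question: . does not match a line break by default, so <img.*?> needs RegexOptions.Singleline there, while [^>]* crosses line breaks on its own.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ImgReplaceDemo
{
    private const string Placeholder = "<p><i><b>(See Image Online)</b></i></p>";

    // Greedy: ".*" runs to the last '>' on the line and swallows the markup after the tag.
    public static string ReplaceGreedy(string content) =>
        Regex.Replace(content, @"<img.*>", Placeholder);

    // Negated class: stops at the first '>' and also matches across line breaks.
    public static string ReplaceTag(string content) =>
        Regex.Replace(content, @"<img[^>]*>", Placeholder);

    public static void Main()
    {
        var html = "<p>before</p><img src=\"a.png\" alt=\"a\"><p>after</p>";
        Console.WriteLine(ReplaceGreedy(html)); // the trailing <p>after</p> is lost
        Console.WriteLine(ReplaceTag(html));    // only the img tag is replaced
    }
}
```

Neither pattern copes with a literal > inside an attribute value; that case is one of the reasons a real HTML parser is usually the safer choice.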

How to match 2 substrings in response with regex in c# [duplicate]

This question already has answers here:
Returning only part of match from Regular Expression
(4 answers)
Closed 4 years ago.
I have a response string
"c=2020&action=approvecomment&_wpnonce=7508ac918a' data-wp-lists='dim:the-comment-list:comment-2020:unapproved:e7e7d3:e7e7d3:new=approved"
I'm trying to extract 2020 and 7508ac918a. I don't understand how to get these substrings out with a regex in C#, given a simple regex like
c=(\d+)&action=approvecomment&_wpnonce=(.*?)' .+new=approved.

In a regex you can create named match groups.
They look like this: (?<GROUPNAME>.+?)
So your _wpnonce part could become something like this: (?<GROUPNAME>.*?)
Then you can grab each group individually, for example:
Match result = myRegex.Match(someString);
someOtherString = result.Groups["GROUPNAME"].Value;
I use Regex101 to build and test my regexes. (Whoever made that site deserves a crown with shiny stones on it!! :)
https://regex101.com/
Hope this helps
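
Applied to the response string from the question, named groups might look like the sketch below. The group names id and nonce, the class NonceDemo and the method TryExtract are hypothetical, and the pattern assumes the nonce ends at the first single quote.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NonceDemo
{
    // "id" and "nonce" are illustrative group names, not part of the original question.
    private static readonly Regex CommentLink = new Regex(
        @"c=(?<id>\d+)&action=approvecomment&_wpnonce=(?<nonce>[^']*)'");

    public static bool TryExtract(string response, out string id, out string nonce)
    {
        Match m = CommentLink.Match(response);
        id = m.Groups["id"].Value;       // empty string when the match fails
        nonce = m.Groups["nonce"].Value; // empty string when the match fails
        return m.Success;
    }

    public static void Main()
    {
        var response = "c=2020&action=approvecomment&_wpnonce=7508ac918a' data-wp-lists='dim:the-comment-list:comment-2020:unapproved:e7e7d3:e7e7d3:new=approved";
        if (TryExtract(response, out var id, out var nonce))
            Console.WriteLine(id + " " + nonce); // 2020 7508ac918a
    }
}
```

The same values are also available by position as Groups[1] and Groups[2]; names mainly help when the pattern grows or groups get reordered.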

Get words between "<" and ">" in .net [duplicate]

This question already has answers here:
RegEx match open tags except XHTML self-contained tags
(35 answers)
What is the best way to parse html in C#? [closed]
(15 answers)
Closed 7 years ago.
I have written a program to identify tags (between < and >) in a string. From the string below I am able to get <P>, <OL> and <LI>, but DIV is not matched. Any idea what I am doing wrong?
string yy = @"<P> </P><OL><LI><DIV align=center>fjsdhfsdjf</DIV></LI><LI>";
MatchCollection allMatchResults = null;
var regexObj = new Regex(@"<\w*>");
allMatchResults = regexObj.Matches(yy);

DIV is not being matched because \w does not match spaces. Use new Regex(@"<[^>]+>");
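
A short sketch that runs the <[^>]+> pattern against the string from the question. The class TagListDemo and the method FindTags are hypothetical names used only for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TagListDemo
{
    // Collects every substring that starts with '<' and ends at the next '>'.
    public static List<string> FindTags(string html)
    {
        var tags = new List<string>();
        foreach (Match m in Regex.Matches(html, @"<[^>]+>"))
            tags.Add(m.Value);
        return tags;
    }

    public static void Main()
    {
        string yy = @"<P> </P><OL><LI><DIV align=center>fjsdhfsdjf</DIV></LI><LI>";
        // Closing tags and the DIV with its attribute are included as well.
        Console.WriteLine(string.Join(" ", FindTags(yy)));
    }
}
```

Because [^>] accepts spaces, = and /, this picks up tags with attributes and closing tags, which <\w*> skips.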

You are not getting DIV because it has an attribute. Use .*? to include attributes or any other text.
var regexObj = new Regex(@"<\w.*?>");
You can use Html Agility Pack to easily parse and manipulate the HTML.

\w* matches only word characters (letters, digits and underscore).
The problem here is the space and the = inside the DIV tag.
Quick solution:
<[^>]+> instead of <\w*>
But you may want to consider this:
RegEx match open tags except XHTML self-contained tags

Your regex is wrong; it should be something like
@"<[^>]+>"
Also, if you have to run a lot of regexes like this, it may be better to use something like HtmlAgilityPack. It lets you parse the HTML into node lists that you can iterate through.
Samples can be found here.

I have more confidence in this method; we use it daily where I work.
It's a translation company, so we translate XML, HTML and PHP files into different languages.
var myRegex = new Regex(@"(<[^>]+>)");
Here is just the regex:
(<[^>]+>)

Easiest way to extract some html from string [duplicate]

This question already has answers here:
What is the best way to parse html in C#? [closed]
(15 answers)
Closed 9 years ago.
I have a long C# string of HTML code and I want to specifically extract the bullet points "<ul><li></li></ul>".
Say I have the following HTML string.
var html = "<div class=ClassC441AA82DA8C5C23878D8>Here is a text that should be ignored.</div>This text should be ignored too<br><ul><li>* Need this one</li><li>Another bullet point I need</li><li>A bulletpoint again that I want</li><li>And this is the last bullet I want</li></ul><div>Ignore this line and text</div><p>Ignore this as well.</p>Text not important.";
I need everything between the '<ul>' and '</ul>' tags. The '<ul>' tag itself can be excluded.
Regular expressions are not my strongest side, but if one can be used here I need some help.
My code is in C#.

You should use the HtmlAgilityPack for things like this. I wrote a little introduction to it a while ago that may help you get going: http://colinmackay.scot/2011/03/22/a-quick-intro-to-the-html-agility-pack/
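
A rough sketch of what that might look like for the string in the question, assuming the HtmlAgilityPack NuGet package is referenced. The class BulletDemo and the method BulletTexts are hypothetical names, and the XPath //ul/li assumes the wanted items are direct children of a ul element.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // external dependency: NuGet package "HtmlAgilityPack"

public static class BulletDemo
{
    // Returns the trimmed text of every <li> that is a direct child of a <ul>.
    public static List<string> BulletTexts(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<string>();
        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection items = doc.DocumentNode.SelectNodes("//ul/li");
        if (items == null) return result;

        foreach (HtmlNode li in items)
            result.Add(li.InnerText.Trim());
        return result;
    }

    public static void Main()
    {
        var html = "<div>ignored</div><ul><li>* Need this one</li><li>Another bullet point I need</li></ul><p>ignored</p>";
        foreach (var text in BulletTexts(html))
            Console.WriteLine(text);
    }
}
```

If the markup between the tags is needed rather than the text, SelectSingleNode("//ul") followed by InnerHtml is the analogous call.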
