Get words between "<" and ">" in .net [duplicate] - c#

This question already has answers here:
RegEx match open tags except XHTML self-contained tags
(35 answers)
What is the best way to parse html in C#? [closed]
(15 answers)
Closed 7 years ago.
I have written a program to identify tags (anything between < and >) in a string. From the string below I am able to get <P>, <OL> and <LI>, but not <DIV>. Any idea what I am doing wrong?
string yy = @"<P> </P><OL><LI><DIV align=center>fjsdhfsdjf</DIV></LI><LI>";
MatchCollection allMatchResults = null;
var regexObj = new Regex(@"<\w*>");
allMatchResults = regexObj.Matches(yy);

DIV is not being matched because \w does not match spaces. Use new Regex(@"<[^>]+>");
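A minimal sketch of that suggestion applied to the string from the question. The class and method names (TagFinder, FindTags) are made up for illustration; only the pattern comes from the answer.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class TagFinder
{
    // Returns every substring that starts at '<' and runs to the next '>'.
    public static string[] FindTags(string html)
    {
        return Regex.Matches(html, @"<[^>]+>")
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    public static void Main()
    {
        string yy = @"<P> </P><OL><LI><DIV align=center>fjsdhfsdjf</DIV></LI><LI>";
        foreach (string tag in FindTags(yy))
        {
            Console.WriteLine(tag);
        }
    }
}
```

Because [^>]+ accepts any character except >, the space and the = inside <DIV align=center> no longer stop the match. Closing tags such as </P> are matched as well.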

You are not getting DIV because it has an attribute. Use .*? to include attributes or any other text.
var regexObj = new Regex(@"<\w.*?>");
You can also use Html Agility Pack to easily parse and manipulate the HTML.

\w* will match only alphanumeric characters (and underscores).
The problem here lies in the space and the = inside the DIV tag.
Quick solution:
<[^>]+> instead of <\w*>
But you may want to consider this:
RegEx match open tags except XHTML self-contained tags

Your regex is wrong; it should be something like
@"<[^>]+>"
Also, if you have to write a lot of regexes like this, it may be better to use something like Html Agility Pack. It parses the HTML into node lists that you can iterate through.
Samples can be found here.

I have more confidence in this method; we use it daily where I work.
It's a translation company, so we translate XML, HTML and PHP files into different languages.
var myRegex = new Regex(@"(<[^>]+>)");
Here is just the regex:
(<[^>]+>)

Related

Best way to separate base64 image from a string in C# [duplicate]

This question already has answers here:
What is the best way to parse html in C#? [closed]
(15 answers)
Closed 2 years ago.
I prefer to use regex rather than an HTML parser.
What is the best way to extract a base64 image from an HTML string like this:
"<p>This is test </p>
<p><img src=\"....+tzPaXLlstlSjpcxKPEqV/zH//2Q==\"></p>"
I need this part so I can access the base64 image:
/9j/4AAQSkZJRgABAQAAAQABAAD/4gKgSUNDX1BST0ZJTEUA....+tzPaXLlstlSjpcxKPEqV/zH//2Q==
If there is an adequate HTML parser for this use case as suggested by others in the comments, go for that...
But, if that doesn't work, regular expressions to the rescue! This is using a positive lookbehind assertion and is matching everything until the first double quote. Should work -- adjust if it doesn't...
var val = "<p>This is test </p><p><img src=\"....+tzPaXLlstlSjpcxKPEqV/zH//2Q==";
var match = Regex.Match(val, "(?<=data:image/jpeg;base64,)[^\"]*");
Console.WriteLine(match.Value);
// output: /9j/4AAQSkZJRgABAQAAAQABAAD/4gKgSUNDX1BST0ZJTEUA....+tzPaXLlstlSjpcxKPEqV/zH//2Q==
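Since the sample string above is elided, here is a self-contained sketch of the same lookbehind technique. It assumes the src attribute starts with data:image/jpeg;base64, (as the answer's pattern implies); the payload QUJD and the names Base64ImageExtractor and ExtractJpegPayload are placeholders made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class Base64ImageExtractor
{
    // Returns the text after "data:image/jpeg;base64," up to the next
    // double quote, or null if no such prefix occurs in the input.
    public static string ExtractJpegPayload(string html)
    {
        Match match = Regex.Match(html, "(?<=data:image/jpeg;base64,)[^\"]*");
        return match.Success ? match.Value : null;
    }

    public static void Main()
    {
        // "QUJD" is a made-up placeholder payload, not real image data.
        string html = "<p>This is test </p><p><img src=\"data:image/jpeg;base64,QUJD\"></p>";
        Console.WriteLine(ExtractJpegPayload(html));
    }
}
```

The lookbehind keeps the prefix out of the match, so match.Value is only the payload. Images of another type (for example PNG) would need the prefix in the pattern adjusted.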

Strip html with regex, except tags that contains a character [duplicate]

This question already has answers here:
RegEx match open tags except XHTML self-contained tags
(35 answers)
Closed 3 years ago.
I have a program that imports emails into a database. To make the emails more readable in another program I have to strip the HTML from them. I am using this string extension to strip the HTML.
public static string StripHtml(this string input)
{
    return Regex.Replace(input, "<.*?>", String.Empty);
}
The problem is that when I copy forwarded mails, the email address of the sender is written inside a tag.
< example@forwared.com >
Is there a way to use regex to remove all the tags, except tags that contain @ or an email address?
The solution here is a possible way: Remove html tags except <br> or <br/> tags with javascript. But if there is a way to do it with just regex, I would prefer that.
You can use the regex below, which adds an extra condition to your original regex:
<.[^@]*?>
Working Demo: https://regex101.com/r/CNOvS7/1/
Use [^@]* instead of .*
It’s a character set that matches anything but @. The ^ stands for “not”. You could similarly use [^0-9]* to exclude all digits, for example.
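A sketch of the extension method from the question with the negated character set swapped in. The class name HtmlStripper and the sample mail text are assumptions made for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlStripper
{
    // Removes each span from '<' to the nearest '>' unless an '@'
    // appears between them, in which case the span is left alone.
    public static string StripHtml(this string input)
    {
        return Regex.Replace(input, "<[^@]*?>", string.Empty);
    }

    public static void Main()
    {
        string mail = "<p>Hi</p> < example@forwared.com > <br>";
        Console.WriteLine(mail.StripHtml());
    }
}
```

One consequence of this approach is that real tags with an @ in an attribute, such as a mailto: link, are also left in place.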

C# (.NET), Html parse using regex [duplicate]

This question already has answers here:
What is the best way to parse html in C#? [closed]
(15 answers)
Closed 6 years ago.
Using Regex, I'm trying to get data from HTML code, but I don't know how to build the pattern without relying on any HTML tags.
I have some string (item-desc) and a count of symbols after this string, which must be my data.
For example: in item-desc12345abcde, using a regex with a count of 6 symbols, I should get 12345a.
This expression gives me only 1 symbol after my string:
Regex itemInfoFilter = new Regex(@"item-desc\s*(.+?)\s*>", RegexOptions.Compiled | RegexOptions.IgnoreCase);
I don't recommend using regular expressions to parse HTML.
Use an HTML parser instead:
HTML Agility Pack
From what I understand of your question, I think this should work: item-desc(.{6})(?=[\s'"])
This assumes that your data is followed by a space (\s), ' or "
Hope this helps
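A minimal sketch of taking a fixed number of characters after a marker, matching the item-desc12345abcde example from the question. It drops the lookahead, so nothing is assumed about what follows the data; the names FixedLengthCapture and TakeAfterMarker are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FixedLengthCapture
{
    // Returns the first `count` characters that follow `marker`,
    // or null if the marker is missing or fewer characters follow it.
    public static string TakeAfterMarker(string input, string marker, int count)
    {
        // The quantifier sits inside the group so that group 1 holds
        // all `count` characters rather than only the last one.
        string pattern = Regex.Escape(marker) + "(.{" + count + "})";
        Match match = Regex.Match(input, pattern, RegexOptions.IgnoreCase);
        return match.Success ? match.Groups[1].Value : null;
    }

    public static void Main()
    {
        Console.WriteLine(TakeAfterMarker("item-desc12345abcde", "item-desc", 6));
    }
}
```

Note that . does not match a newline by default, so data that spans lines would need RegexOptions.Singleline.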

Regex to match XML elements in a text file [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
RegEx match open tags except XHTML self-contained tags
I have a text file consisting of conversion instruction templates.
I need to parse this text file and match something like this:
(Source: <element>)
And get the "element".
Or this pattern:
(Source: <element attr="name" value=""/>)
And get "element attr="name"".
I am currently using this regex:
\(Source:\ ?\<(.*?)\>\)
Sorry for being a newbie. :)
Thanks for all your help.
-JRC
Try this regex, which detects attributes quoted with either ” or " characters:
\(Source:\s+<(\w+\s*(?:\w+=[\"”][^\"”]+[\"”])?)[^>]*>\)
and your code:
var result = Regex.Match(strInput,
    "\\(Source:\\s+<(\\w+\\s*(?:\\w+=[\"”][^\"”]+[\"”])?)[^>]*>")
    .Groups[1].Value;
Explanation:
(subexpression)
Captures the matched subexpression and assigns it a one-based ordinal number.
?
Matches the previous element zero or one time.
\w
Matches any word character.
+
Matches the previous element one or more times.
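A sketch of the capture-group approach run against both forms from the question. The pattern here is a simplified variant, not the exact one above: it only handles straight double quotes and treats the first attribute as optional, which is an assumption about what is wanted. The names SourceElementParser and ExtractElement are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SourceElementParser
{
    // Group 1: the element name, optionally followed by its first
    // attribute="value" pair. Anything else before '>' is skipped.
    private static readonly Regex SourcePattern = new Regex(
        @"\(Source:\s*<(\w+(?:\s+\w+=""[^""]*"")?)[^>]*>\)");

    // Returns the captured element text, or null if the input does not match.
    public static string ExtractElement(string line)
    {
        Match match = SourcePattern.Match(line);
        return match.Success ? match.Groups[1].Value : null;
    }

    public static void Main()
    {
        Console.WriteLine(ExtractElement("(Source: <element>)"));
        Console.WriteLine(ExtractElement("(Source: <element attr=\"name\" value=\"\"/>)"));
    }
}
```

Groups[0] is always the whole match, which is why the captured subexpression is read from Groups[1].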

How do I use regex properly? Why isn't this working? [duplicate]

This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
RegEx match open tags except XHTML self-contained tags
string regex = "<Name[.\\s]*>[.]*s[.]*</Name>";
string source = "<Name xmlns=\"http://xml.web.asdf.com\">Session</Name>";
bool hit = System.Text.RegularExpressions.Regex.IsMatch(
source,
regex,
System.Text.RegularExpressions.RegexOptions.IgnoreCase
);
Why is hit false? I'm trying to find any Name XML field that has an 's' in the name. I don't understand what could be wrong.
Thanks!
You are using . inside a character class, where it means a literal dot. I think you meant it in the sense of "any character", so use .* rather than [.]*
string regex = "<Name(.|\\s)*>.*s.*</Name>";
With XPath, this could be as easy as /Name[contains(.,'s')]
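As a hedged alternative to both the regex and the XPath one-liner, here is a sketch using LINQ to XML. It compares only the local element name, because the sample element sits in a default namespace (xmlns="..."), and an unprefixed XPath name test such as /Name selects only elements in no namespace. The names NameFinder and HasNameContaining are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class NameFinder
{
    // True if any element whose local name is "Name" has text
    // containing `text`, compared case-insensitively.
    public static bool HasNameContaining(string xml, string text)
    {
        XElement root = XElement.Parse(xml);
        return root.DescendantsAndSelf()
                   .Where(e => e.Name.LocalName == "Name")
                   .Any(e => e.Value.IndexOf(text, StringComparison.OrdinalIgnoreCase) >= 0);
    }

    public static void Main()
    {
        string source = "<Name xmlns=\"http://xml.web.asdf.com\">Session</Name>";
        Console.WriteLine(HasNameContaining(source, "s"));
    }
}
```

Parsing the XML avoids having to account for attributes and namespaces in the pattern at all.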
