I need a regex that matches CDATA elements in html - c#

I'm trying to write a regular expression to match CDATA elements in HTML in a web crawler class in c#.
What I have used in the past is : \<\!\[CDATA\[(?<text>[^\]]*)\]\]\> , but the problem is that this breaks in the presence of array [] elements if there is javascript contained within the CDATA tags. The negation is necessary because if there are multiple I want to match them all.
If I modify the regex to match the end '>' character I have the same problem. Any javascript with a > operator breaks my regex.
So I need to use a negative look-ahead within this regex to ignore ']]>'. How would I write this?
Here's some test data for a quick setup of the problem:
//Matches any
string pattern = @"\<\!\[CDATA\[(?<text>[^\]]*)\]\]\>";
var rx = new Regex(pattern, RegexOptions.Singleline);
/* Testing...*/
string eg = @"<![CDATA[TesteyMcTest//]]><![CDATA[TesteyMcTest2//]]><![CDATA[TesteyMcTest//]]><! [CDATA[TesteyMcTest2//]]>
<![CDATA[Thisisal3ongarbi4trarys6testwithnumbers//]]><![CDATA [thisisalo4ngarbitrarytest6withumbers123456//]]><![CDATA[ this.exec = (function(){ var x = this.GetFakeArray(); var y = x[0]; return y > 3;});//]]> ";
var mz = rx.Matches(eg);
This example matches every instance of CDATA except for the last one, which contains javascript and ']', '>'
Thanks in advance,

The problem is that your <text> subpattern is wrong! You don't need to avoid every ], you only need to avoid a ] that is followed by ]>. You can use this subpattern instead:
(?<text>(?>[^]]+|](?!]>))*)
the whole pattern: (note that many characters don't need to be escaped)
@"<!\s*\[CDATA\s*\[(?<text>(?>[^]]+|](?!]>))*)]]>"
I added two \s* to match all your example strings, but if you want to disallow these optional spaces, you can remove the \s*.
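As a quick check of the pattern above, here is a minimal sketch. The class and method names (CDataDemo, ExtractCData) are made up for illustration, and the closing bracket inside the character class is written as [^\]] purely for readability; in .NET it should behave the same as [^]].

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class CDataDemo
{
    // Pattern from the answer above: consume anything that is not "]",
    // or a "]" that is not followed by "]>".
    private static readonly Regex CDataRegex = new Regex(
        @"<!\s*\[CDATA\s*\[(?<text>(?>[^\]]+|\](?!\]>))*)\]\]>");

    // Returns the content of every CDATA section found in the input.
    public static string[] ExtractCData(string html)
    {
        return CDataRegex.Matches(html)
                         .Cast<Match>()
                         .Select(m => m.Groups["text"].Value)
                         .ToArray();
    }

    public static void Main()
    {
        string sample = "<![CDATA[first//]]><![CDATA[ var y = x[0]; return y > 3; //]]>";
        foreach (string text in ExtractCData(sample))
        {
            Console.WriteLine(text);
        }
    }
}
```

The second section contains both ] and >, which is exactly the case that broke the original [^\]]* subpattern.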

Does the following work for you: http://regex101.com/r/cT0pT0
\[CDATA\[(.*?)\]\]>
It seems to match what you are asking for. The key is that .*? (a non-greedy match) stops at the first ]]> it encounters. With RegexOptions.Singleline, as in your code, the dot also matches newlines inside the CDATA content.
NOTE - it is usually a REALLY BAD IDEA to use regex for parsing HTML. There are plenty of good libraries available to do the job far more robustly.
See for example What is the best way to parse html in C#?

Related

Regex-like construction to match %([text]) where [text] can contain escaped parens

I'm trying to resolve tokens in a string.
What I would like is given input like this:
string input = "asdf %(text) %(123) %(a\)a) asdf";
That I could run that through regex.Replace() and have it replace on "%(text)", "%(123)" and "%(a\)a)".
That is, that it would match everything between a starting "%(" and a closing ")" unless the closing ")" was escaped. (But of course, then you could escape the slash with another slash, which would prevent it from escaping the end paren...)
I'm pretty sure standard regular expressions can't do this, but I'm wondering if any of the various fancy expanded capabilities of the C# regular expression library could, rather than just iterating across the string totally manually? Or is there some other method that could do this? I feel like it's a common enough problem that there has to be some way to solve it without implementing the solution from scratch, given the immensity of the .NET Framework. If I do have to implement iterating through the string and replacing with string.Replace(), I will, but it just seems so inelegant.
How about
var regex = new Regex(@"%\(.*?(?<!\\)(?:\\\\)*\)");
var result = regex.Replace(source,"");
%\( match literal %(
.*? match anything non-greedy
(?<!\\) preceding character to next match must not be \
(?:\\\\)* match zero or more literal \\ (i.e. match escaped backslashes)
\) match literal )
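To see the pattern above against the input from the question, here is a small sketch. TokenDemo and ReplaceTokens are hypothetical names, and the replacement text "X" is only there to make the result visible.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TokenDemo
{
    // Pattern from the answer above: "%(", a lazy run of characters, then a ")"
    // that is preceded by an even number (possibly zero) of backslashes.
    private static readonly Regex TokenRegex =
        new Regex(@"%\(.*?(?<!\\)(?:\\\\)*\)");

    // Replaces every %(...) token with the given replacement text.
    // Note: Regex.Replace interprets "$" sequences in the replacement string.
    public static string ReplaceTokens(string input, string replacement)
    {
        return TokenRegex.Replace(input, replacement);
    }

    public static void Main()
    {
        string input = @"asdf %(text) %(123) %(a\)a) asdf";
        Console.WriteLine(ReplaceTokens(input, "X")); // prints: asdf X X X asdf
    }
}
```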
This is working for me :
String something = "\"asdf %(text) %(123) %(a\\)a) asdf\";";
String change = something.replaceAll("%\\(.*\\)", "");
System.out.println(change);
The output
"asdf asdf";

Regex to find anchor tag consist of new line in c# .net

I want to find the href from an anchor tag, so I have used this regex:
<a\s*[^>]*\s*href\s*\=\s*([^(\s*|\>)]*)\s*[^>]*>\s*Text\s*<\/a>
Options = Ignorecase + singleline
Example
<a href="/abc/xzy/pqr.com" class="m">Text</a>
So Group[1]="/abc/xzy/pqr.com"
But If the content is like
<a href="/abc/xzy/ //Contains new line
pqr.com" class="m">Text</a>
so Group[1]="/abc/xzy/
So I want to know how to get "/abc/xzy/pqr.com" if the content contains new line(\r\n)
Your capture group is a bit weird: [^(\s*|\>)]* is a character class, so it matches any character that is not (, whitespace (\s), an asterisk *, |, > or ).
What you can do however is to put quotes before and after the capture group:
<a\s*[^>]*\s*href\s*\=\s*"([^(\s*|\>)]*)"\s*[^>]*>\s*Text\s*<\/a>
                         ^              ^
And then change the character class to [^"] (not quotes):
<a\s*[^>]*\s*href\s*\=\s*"([^"]*)"\s*[^>]*>\s*Text\s*<\/a>
                           ^^^^
regex101 demo.
That said, it would be better to use a proper HTML parser instead of regex. Making a suitable regex is more tedious because it is easy to forget about a lot of different scenarios, but if you're certain of how your data comes through, regex might be a quick way to get what you need.
If you want to consider single quotes and no quotes at all in some cases, you might try this instead:
<a\s*[^>]*\s*href\s*=\s*((?:[^ ]|[\n\r])+)\s*[^>]*>\s*Text\s*<\/a>
Updated regex101.
This regex uses (?:[^ ]|[\n\r])+ instead, which accepts non-space characters and newlines (and carriage returns just in case). Note that \s matches spaces, tabs, newlines and form feeds.
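The patterns above capture the attribute value including the line break, so getting "/abc/xzy/pqr.com" still takes one more step. A minimal sketch, assuming the quoted-attribute variant of the pattern and simply stripping \r and \n from the captured group (HrefDemo and ExtractHref are hypothetical names):

```csharp
using System;
using System.Text.RegularExpressions;

public static class HrefDemo
{
    // Quoted-attribute variant of the pattern from the answer above.
    private static readonly Regex AnchorRegex = new Regex(
        @"<a\s*[^>]*\s*href\s*=\s*""([^""]*)""\s*[^>]*>\s*Text\s*</a>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns the href value with CR/LF removed, or null if nothing matches.
    public static string ExtractHref(string html)
    {
        Match match = AnchorRegex.Match(html);
        if (!match.Success)
        {
            return null;
        }
        return match.Groups[1].Value.Replace("\r", "").Replace("\n", "");
    }

    public static void Main()
    {
        string html = "<a href=\"/abc/xzy/\r\npqr.com\" class=\"m\">Text</a>";
        Console.WriteLine(ExtractHref(html)); // prints: /abc/xzy/pqr.com
    }
}
```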

Regular Expression to match /u/{word or underscore or numbers}

I have tried and failed for two days now to successfully match /u/{word or underscore or numbers}. I also need to ignore the value if it is in a link (ex: <a href="asdfasdf/u/word" />). I have exhausted all options. Can someone please help me out here?
Edit: I am unfamiliar with regular expressions and am still trying to figure them out. Excuse me if this is a noobish question. And to clarify, I can get the matches fine. I just don't understand in Regex how to ignore a match completely if a certain character follows.
Example:
/u/username
/u/username this is
this/is/u/user
<a href="http://www.regex.com/u/something/" />
I want to match the first two occurrences of /u/username.
This is embarrassing, but here is my current regex /u/\w*[^"]
You can use this pattern:
/u/\w*
It will match the string /u/ followed by zero or more letters, numbers, or underscores. To ensure that the string consists only of this pattern, use start (^) and end ($) anchors, like this:
^/u/\w*$
For example:
string result = Regex.Match(input, @"^/u/\w*$").Value;
If you're trying to do some special parsing of HTML, I'm afraid regular expressions are a pretty bad option. You really should find some way of properly parsing the document first. Nevertheless, here's a very crude pattern that will ignore this sequence if it happens to be inside an href attribute (it also assumes the attribute value will be surrounded by quotation marks):
(?<!href="[^"]*)/u/\w*
For example:
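string input = @"<a href=""http://www.regex.com/u/foo"" /> /u/bar"; // illustrative input: one /u/ inside an href, one outside
string pattern = @"(?<!href=""[^""]*)/u/\w*";
string result = Regex.Match(input, pattern).Value; // will match /u/bar but not /u/foo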
This pattern will match any sequence that doesn't have a word character (letter, number, or underscore), quote, or forward slash in front of it:
(?<![\w""/])/u/\w*
This example shows how it can be used to get all matches from the string:
var input = @"/u/username
/u/username this is
this/is/u/user <a href=""http://www.regex.com/u/something/"" />";
var pattern = @"(?<![\w""/])/u/\w*";
foreach (Match match in Regex.Matches(input, pattern))
{
    System.Console.WriteLine(match.Value);
}
The output will be:
/u/username
/u/username
This regular expression will meet your test scenario
\w*(/u)*[a-z,A-Z,0-9]+$
This actually catches on the characters unique to HTML tags, so as long as you want to ignore HTML code, this will do the trick.

parsing tweet text with regex

Regex-noob here. Looking for some C# regex code to "syntax highlight" twitter text. So given this tweet:
@taglius here's some tweet text that shouldn't be highlighted #tagtestpix http://aurl.jpg
I want to find the user mentions (@), hashtags (#), and urls (http://) and add appropriate html to color highlight these elements. Something like
<font color=red>@taglius</font> here's some tweet text that shouldn't be highlighted <font color=blue>#tagtestpix</font> <font color=yellow>http://aurl.jpg</font>
This isn't the exact html I will use, but I think you get the idea.
The answers above are parts of the whole answer, so I think I can add a little extra to answer your question:
Your highlight function would look something like this:
public static String HighlightTwitter(String input)
{
    // (?<!\w) requires that @ or # is not preceded by a word character
    String result = Regex.Replace(input, @"(?<!\w)@\w+", @"<font color=""red"">$0</font>");
    result = Regex.Replace(result, @"(?<!\w)#\w+", @"<font color=""blue"">$0</font>");
    result = Regex.Replace(result, @"\bhttps?://[-\w]+(\.\w[-\w]*)+(:\d+)?(/[^.!,?;""\'<>()\[\]\{\}\s\x7F-\xFF]*([.!,?]+[^.!,?;""\'<>\(\)\[\]\{\}\s\x7F-\xFF]+)*)?\b", @"<font color=""yellow"">$0</font>", RegexOptions.IgnoreCase);
    return result;
}
I have included (?<!\w) to make sure that @ and # are at the start of a word (a plain \b does not work here, because @ and # are not word characters), and \b around the URL pattern to make sure that URLs stand alone. This means that #this_will_highlight but#this_will_not.
If performance might be an issue, you can make the Regex instances static members with RegexOptions.Compiled.
E.g.:
private static Regex regexAt = new Regex(@"(?<!\w)@\w+", RegexOptions.Compiled);
...
String result = regexAt.Replace(input, @"<font color=""red"">$0</font>");
...
The following would match the '#' character followed by a sequence of alpha-num characters:
#\w+
The following would match the '@' character followed by a sequence of alpha-num characters:
\@\w+
There are a lot of free-form http url match expressions, this is the one I use most commonly:
https?://[-\w]+(\.\w[-\w]*)+(:\d+)?(/[^.!,?;""\'<>()\[\]\{\}\s\x7F-\xFF]*([.!,?]+[^.!,?;""\'<>\(\)\[\]\{\}\s\x7F-\xFF]+)*)?
Lastly, you're going to get false positive hits with all of these, so you're going to need to look real hard at how to correctly delineate these tags. For instance, say you have the following tweet:
the url http://Roger@example.com/#bookmark is interesting.
Obviously this is going to be a problem, as all three of the expressions will match inside the url. To avoid this you will need to figure out what characters are allowed to precede or follow the match. As an example, the following requires whitespace or the start of the string to precede the @name reference and requires a ',' or whitespace to follow it.
(?<=^|\s)@\w+(?=[,\s])
Regex patterns are not easy, I recommend getting a tool like Expresso.
You can parse out the @ replies using (\@\w+). You can parse out the hash tags using (#\w+).
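One way to sidestep the "match inside the url" problem described above is to run a single alternation with a MatchEvaluator, so that a URL is consumed whole before the mention and hashtag branches are tried. This is only a sketch under assumptions: the URL branch is deliberately simplified to https?://\S+ (so trailing punctuation would be swallowed), the (?<!\w) lookbehinds are my choice of delimiter, and TweetHighlighter is a made-up name.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TweetHighlighter
{
    // Branches are tried left to right at each position: once the URL branch
    // has matched, any @ or # inside that URL is already consumed.
    private static readonly Regex TokenRegex = new Regex(
        @"(?<url>https?://\S+)|(?<mention>(?<!\w)@\w+)|(?<tag>(?<!\w)#\w+)",
        RegexOptions.IgnoreCase);

    // Wraps URLs, mentions and hashtags in colored <font> tags.
    public static string Highlight(string tweet)
    {
        return TokenRegex.Replace(tweet, m =>
        {
            string color = m.Groups["url"].Success ? "yellow"
                         : m.Groups["mention"].Success ? "red"
                         : "blue";
            return "<font color=\"" + color + "\">" + m.Value + "</font>";
        });
    }

    public static void Main()
    {
        Console.WriteLine(Highlight("@taglius see http://Roger@example.com/#bookmark #tagtestpix"));
    }
}
```

Because everything happens in one pass, the inserted <font> markup is never re-scanned by a later replacement.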

regex replace - but with a few exceptions

I have a string containing HTML and I need to replace some words to be links - I do this with the following code;
string lNewHTML = Regex.Replace(lOldHTML, @"(\bword1\b|\bword2|word3\b)", "$1", RegexOptions.IgnoreCase);
The code works, but I need to include some exceptions to the replace - e.g. I will not replace anything in an img-, li- or a-tag (including link text and attributes like href and title), but still allow replacements in p-, td- and div-tags.
Can anyone figure this one out?
OK, after some time trying to construct a fitting regex, here is my attempt. This might need additional work, but it should point you in the right direction.
I am matching the words "word1" and "word2" when they are not inside a "tag1" or "tag2" tag. You need to adjust this to your needs, of course. Enable RegexOptions.IgnorePatternWhitespace if you'd like to keep my formatting.
Unfortunately, I have not come up with a regex you could simply plug into Regex.Replace: this regex matches the whole string since the previous match, and the word you are concerned with is in the first group. That group carries the index and length of the word, so you can easily replace it using String.Substring.
(?:
  \G
  (?:
    (?>
      <tag1(?<N>)
      |<tag2(?<N>)
      |</tag1(?<-N>)
      |</tag2(?<-N>)
      |.)*?
    (?(N)(?!))
  )*
)
(word1|word2)
You need to use the Replace overload with the MatchEvaluator parameter so that you examine each match and decide whether to replace or not.
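A minimal sketch of the MatchEvaluator approach, under stated assumptions: the pattern first swallows whole a and li elements (no nesting support) and any other single tag, and only the last branch captures a word. The evaluator returns protected markup unchanged and wraps captured words in a link. LinkifyDemo, Linkify and the "/wiki/" href are hypothetical and only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LinkifyDemo
{
    // Branch 1: a whole <a>...</a> or <li>...</li> element (nesting not handled).
    // Branch 2: any other single tag, so attribute values are never rewritten.
    // Branch 3: the words that should become links.
    private static readonly Regex Pattern = new Regex(
        @"<(a|li)\b[^>]*>.*?</\1\s*>|<[^>]*>|\b(?<word>word1|word2|word3)\b",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static string Linkify(string html)
    {
        return Pattern.Replace(html, match =>
        {
            if (!match.Groups["word"].Success)
            {
                return match.Value; // protected markup: leave untouched
            }
            string word = match.Groups["word"].Value;
            // "/wiki/" is a placeholder target, not taken from the question.
            return "<a href=\"/wiki/" + word + "\">" + word + "</a>";
        });
    }

    public static void Main()
    {
        string html = "<p>word1 <a href=\"x\" title=\"word2\">word3</a> <img alt=\"word1\"> word2</p>";
        Console.WriteLine(Linkify(html));
    }
}
```

Only the bare word1 and word2 inside the p element are linked here; the a element and the img attribute are passed through as they are. For anything beyond simple, well-formed markup, an HTML parser is still the safer route.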
