parsing tweet text with regex

Regex noob here. I'm looking for some C# regex code to "syntax highlight" Twitter text. Given this tweet:
@taglius here's some tweet text that shouldn't be highlighted #tagtestpix http://aurl.jpg
I want to find the user mentions (@), hashtags (#), and URLs (http://) and add appropriate HTML to color-highlight these elements. Something like:
<font color=red>@taglius</font> here's some tweet text that shouldn't be highlighted <font color=blue>#tagtestpix</font> <font color=yellow>http://aurl.jpg</font>
This isn't the exact HTML I will use, but I think you get the idea.

The answers above are parts of the whole answer, so I think I can add a little extra to answer your question:
Your highlight function would look something like this:
public static String HighlightTwitter(String input)
{
    // (?<!\w) keeps a match from starting in the middle of a word.
    String result = Regex.Replace(input, @"(?<!\w)@\w+", @"<font color=""red"">$0</font>");
    result = Regex.Replace(result, @"(?<!\w)#\w+", @"<font color=""blue"">$0</font>");
    result = Regex.Replace(result, @"\bhttps?://[-\w]+(\.\w[-\w]*)+(:\d+)?(/[^.!,?;""\'<>()\[\]\{\}\s\x7F-\xFF]*([.!,?]+[^.!,?;""\'<>\(\)\[\]\{\}\s\x7F-\xFF]+)*)?\b", @"<font color=""yellow"">$0</font>", RegexOptions.IgnoreCase);
    return result;
}
I have included (?<!\w) to make sure that @ and # are at the start of a word, and \b to make sure that URLs stand alone. (A plain \b in front of @ or # would do the opposite, because neither character is a word character, so \b there requires a word character before it.) This means that #this_will_highlight but#this_will_not.
If performance might be an issue, you can make the Regex instances static members with RegexOptions.Compiled.
E.g.:
private static Regex regexAt = new Regex(@"(?<!\w)@\w+", RegexOptions.Compiled);
...
String result = regexAt.Replace(input, @"<font color=""red"">$0</font>");
...

The following would match the '@' character followed by a sequence of alpha-num characters:
@\w+
The following would match the '#' character followed by a sequence of alpha-num characters:
\#\w+
There are a lot of free-form http url match expressions, this is the one I use most commonly:
https?://[-\w]+(\.\w[-\w]*)+(:\d+)?(/[^.!,?;""\'<>()\[\]\{\}\s\x7F-\xFF]*([.!,?]+[^.!,?;""\'<>\(\)\[\]\{\}\s\x7F-\xFF]+)*)?
Lastly, you're going to get false positives with all of these, so you need to look hard at how to correctly delimit these tags. For instance, take the following tweet:
the url http://Roger@example.com/#bookmark is interesting.
Obviously this is going to be a problem, as all three of the expressions will match inside the URL. To avoid this you will need to figure out what characters are allowed to precede or follow the match. As an example, the following requires whitespace or the start of the string to precede the @name reference and requires a ',' or whitespace to follow it.
(?<=^|\s)@\w+(?=[,\s])
Regex patterns are not easy; I recommend getting a tool like Expresso.
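One way to sidestep the overlap problem described above is to match all three token types in a single pass, with the URL alternative listed first, so that an '@' or '#' inside a URL is consumed as part of the URL. The following is only a minimal sketch: the class name TweetHighlighter, the group names, and the deliberately simplified URL branch (everything up to the next whitespace) are assumptions for illustration, not part of the answers here.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TweetHighlighter
{
    // Alternatives are tried left to right at each position. The URL branch
    // comes first, so '@' and '#' inside a URL never start their own match.
    // Assumption: a URL is "http(s)://" plus every following non-whitespace
    // character, which also swallows trailing punctuation.
    private static readonly Regex Token = new Regex(
        @"(?<url>https?://\S+)|(?<!\w)(?<mention>@\w+)|(?<!\w)(?<tag>#\w+)",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    public static string Highlight(string input)
    {
        // The evaluator runs once per match and picks a color by checking
        // which named group took part in the match.
        return Token.Replace(input, m =>
        {
            string color = m.Groups["url"].Success ? "yellow"
                         : m.Groups["mention"].Success ? "red"
                         : "blue";
            return "<font color=\"" + color + "\">" + m.Value + "</font>";
        });
    }
}
```

Calling TweetHighlighter.Highlight on the problem tweet above should wrap the whole URL in a single yellow font tag and leave Roger@example.com and #bookmark alone.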

You can parse out the @ replies using (@\w+). You can parse out the hashtags using (#\w+).

Related

I need a regex that matches CDATA elements in html

I'm trying to write a regular expression to match CDATA elements in HTML in a web crawler class in c#.
What I have used in the past is : \<\!\[CDATA\[(?<text>[^\]]*)\]\]\> , but the problem is that this breaks in the presence of array [] elements if there is javascript contained within the CDATA tags. The negation is necessary because if there are multiple I want to match them all.
If I modify the regex to match the end '>' character I have the same problem. Any javascript with a > operator breaks my regex.
So I need to use a negative look-ahead within this regex to ignore ']]>'. How would I write this?
Here's some test data for a quick setup of the problem:
//Matches any
string pattern = @"\<\!\[CDATA\[(?<text>[^\]]*)\]\]\>";
var rx = new Regex(pattern, RegexOptions.Singleline);
/* Testing...*/
string eg = @"<![CDATA[TesteyMcTest//]]><![CDATA[TesteyMcTest2//]]><![CDATA[TesteyMcTest//]]><! [CDATA[TesteyMcTest2//]]>
<![CDATA[Thisisal3ongarbi4trarys6testwithnumbers//]]><![CDATA [thisisalo4ngarbitrarytest6withumbers123456//]]><![CDATA[ this.exec = (function(){ var x = this.GetFakeArray(); var y = x[0]; return y > 3;});//]]> ";
var mz = rx.Matches(eg);
This example matches every instance of CDATA except for the last one, which contains javascript and ']', '>'
Thanks in advance,

The problem is that your <text> subpattern is wrong! You don't need to avoid ], you need to avoid ] followed by ]>. You can use this subpattern instead:
(?<text>(?>[^]]+|](?!]>))*)
the whole pattern: (note that many characters don't need to be escaped)
@"<!\s*\[CDATA\s*\[(?<text>(?>[^]]+|](?!]>))*)]]>"
I added two \s* to match all your example strings, but if you want to disallow these optional spaces, you can remove the \s*.

Does the following work for you: http://regex101.com/r/cT0pT0
\[CDATA\[(.*?)\]\]>
It seems to match what you are asking for... Key here is that the use of .*? (non greedy match) stops on the first occasion that you get ]]>
NOTE - it is usually a REALLY BAD IDEA to use regex for parsing HTML. There are plenty of good libraries available to do the job far more robustly.
See for example What is the best way to parse html in C#?
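As a rough illustration of the non-greedy approach in this answer, the sketch below collects the body of every CDATA section. The class and method names are hypothetical, and RegexOptions.Singleline is added on the assumption that a CDATA body may span several lines.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CDataExtractor
{
    // .*? is lazy, so each match stops at the first "]]>" it reaches.
    // Singleline lets '.' match newlines inside the CDATA body.
    private static readonly Regex CData = new Regex(
        @"<!\[CDATA\[(?<text>.*?)\]\]>", RegexOptions.Singleline);

    public static List<string> Extract(string html)
    {
        var results = new List<string>();
        foreach (Match m in CData.Matches(html))
        {
            results.Add(m.Groups["text"].Value);
        }
        return results;
    }
}
```

Because only the full "]]>" sequence ends a match, a lone ']' or '>' inside embedded JavaScript does not cut the capture short.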

Regular Expression to match /u/{word or underscore or numbers}

I have tried and failed for two days now to successfully match /u/{word or underscore or numbers}. I also need to ignore the value if it is in a link (ex: <a href="asdfasdf/u/word" />). I have exhausted all options. Can someone please help me out here?
Edit: I am unfamiliar with regular expressions and am still trying to figure them out. Excuse me if this is a noobish question. And to clarify, I can get the matches fine. I just don't understand in Regex how to ignore a match completely if a certain character follows.
Example:
/u/username
/u/username this is
this/is/u/user
<a href="http://www.regex.com/u/something/" />
I want to match the first two occurrences of /u/username.
This is embarrassing, but here is my current regex /u/\w*[^"]

You can use this pattern:
/u/\w*
It will match the string /u/ followed by zero or more letters, numbers, or underscores. To ensure that the string consists only of this pattern, use start (^) and end ($) anchors, like this:
^/u/\w*$
For example:
string result = Regex.Match(input, @"^/u/\w*$").Value;
If you're trying to do some special parsing of HTML, I'm afraid regular expressions are a pretty bad option. You really should find some way of properly parsing the document first. Nevertheless, here's a very crude pattern that will ignore this sequence if it happens to be inside an href attribute (it also assumes the attribute value will be surrounded by quotation marks):
(?<!href="[^"]*)/u/\w*
For example:
string input = @"<a href=""/u/foo"">/u/bar</a>";
string pattern = @"(?<!href=""[^""]*)/u/\w*";
string result = Regex.Match(input, pattern).Value; // will match /u/bar but not /u/foo
This pattern will match any sequence that doesn't have a word character (letter, number, or underscore), quote, or forward slash in front of it:
(?<![\w""/])/u/\w*
This example shows how it can be used get all matches from the string:
var input = @"/u/username
/u/username this is
this/is/u/user <a href=""http://www.regex.com/u/something/"" />";
var pattern = @"(?<![\w""/])/u/\w*";
foreach (Match match in Regex.Matches(input, pattern))
{
    System.Console.WriteLine(match.Value);
}
The output will be:
/u/username
/u/username

This regular expression will meet your test scenario
\w*(/u)*[a-z,A-Z,0-9]+$
This actually catches on the characters unique to HTML tags, so as long as you want to ignore HTML code, this will do the trick.

Regular expression with URL extraction

I am using C# for this project and basically what I need is a way to turn plain text into HTML. I found a regular expression (I think on Stack Overflow, actually) for converting links in the text to anchor links in HTML. It looks like this:
Regex regx = new Regex(@"https?://([-\w\.]+)+(:\d+)?(/([\w/_\.]*(\?\S+)?)?)?", RegexOptions.IgnoreCase);
MatchCollection matches = regx.Matches(input);
foreach (Match match in matches)
{
output = output.Replace(match.Value, String.Format("<a href=\"{0}\">{0}</a>", match.Value));
}
It works great, however I found a flaw in that it doesn't consider a dash (-) as part of the URL, so when it hits the first dash it closes the anchor tag.
So I obviously need to include the dash somehow in the regular expression, but the problem is that I have absolutely no clue about RegEx and it just looks like Russian to me.
Does anyone have an idea what small edit I need to make to the RegEx expression to make it include a dash as allowed characters in the URL?

Try this: @"https?://([-\w\.]+)+(:\d+)?(/([-\w/_\.]*(\?\S+)?)?)?"
I added a dash to the second character class (the part in square brackets) to match dashes in the part of the URL that is not the domain name.
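For reference, here is a minimal sketch of how the dash-aware pattern could be applied with Regex.Replace and a match evaluator instead of a loop over String.Replace, so that each URL is wrapped exactly once at the position where it was found. The class name Linkifier and the anchor markup are assumptions for illustration.

```csharp
using System.Text.RegularExpressions;

public static class Linkifier
{
    // Same pattern as suggested above, with '-' allowed in the path part.
    private static readonly Regex Url = new Regex(
        @"https?://([-\w\.]+)+(:\d+)?(/([-\w/_\.]*(\?\S+)?)?)?",
        RegexOptions.IgnoreCase);

    public static string Linkify(string text)
    {
        // Each match is replaced in place, so a URL that occurs twice,
        // or that is a prefix of another URL, is not rewritten again.
        return Url.Replace(text, m =>
            "<a href=\"" + m.Value + "\">" + m.Value + "</a>");
    }
}
```

With this version, a URL such as http://my-site.com/some-page should end up inside a single anchor tag, dashes included.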

I use this one which supports the ftp and file schemes as well as http:
@"\b((https?|ftp|file)://|(www|ftp)\.)[-A-Z0-9+&@#/%?=~_|$!:,.;\(\)]*[A-Z0-9+&@#/%=~_|$]"
It will recognise a URL that contain parameters delimited by & like this:
http://www.cbsnews.com/video/watch/?id=7400904n&tag=re1.channel
The original is at Extract URLs from a text (Regex). I modified it slightly to recognise a URL that contains parentheses like this:
http://msdn.microsoft.com/en-us/library/ms686722(v=VS.85).aspx
You need to specify RegexOptions.IgnoreCase with this regex, though of course you could simplify it by replacing A-Z with \w.

Regex to adjust HTML hrefs in c#

I need to use regex to search through an html file and replace href="pagename" with href="pages/pagename"
Also the href could be formatted like HREF = 'pagename'
I do not want to replace any hrefs that could be upper or lowercase that begin with http, ftp, mailto, javascript, #
I am using c# to develop this little app in.

HTML manipulation through Regex is not recommended since HTML is not a "regular language." I'd highly recommend using the HTML Agility Pack instead. That gives you a DOM interface for HTML.

I have not tested with many cases, but for this case it worked:
var str = "href='page' href = 'www.goo' href='http://' href='ftp://'";
Console.WriteLine(Regex.Replace(str, @"href ?= ?(('|"")([a-z0-9_#.-]+)('|""))", "x", RegexOptions.IgnoreCase));
Result:
"x x href='http://' href='ftp://'"
You'd better keep backup files before running this :P

There are lots of caveats when using a find/replace with HTML and XML. The problem is, there are many variations of syntax which are permitted. (and many which are not permitted but still work!)
But, you seem to want something like this:
search for
([Hh][Rr][Ee][Ff]\s*=\s*['"])(\w+)(['"])
This means:
[Hh]: any of the items in square-brackets, followed by
\s*: any number of whitespaces (maybe zero),
=
\s* any more whitespaces,
['"] either quote type,
\w+: a word (without any slashes or dots - if you want to include .html then use [.\w]+ instead ),
and ['"]: another quote of any kind.
replace with
$1pages/$2$3
Which means the things in the first bracket, then pages/, then the stuff in the second and third sets of brackets.
You will need to put the first string in @"..." quotes (a verbatim string), and also escape the double quotes as "".
Note that it won't do anything even vaguely intelligent, like making sure the quotes match. Warning: try never to use the "any character" (.) symbol in this kind of regex, as it will grab large sections of text, over and including the next quotation mark, possibly up to the end of the file!
See a regex tutorial for more info, e.g. http://www.regular-expressions.info/dotnet.html
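Putting the search and replace steps above together, a hedged sketch might look like the following. It goes slightly beyond the answer: a negative lookahead skips values that start with http, https, ftp, mailto, javascript, or #, and a backreference requires the closing quote to match the opening one. The class name HrefRewriter and the exact exclusion list are assumptions based on the question's wording.

```csharp
using System.Text.RegularExpressions;

public static class HrefRewriter
{
    // Group 1: "href =" with optional whitespace; group 2: the opening quote;
    // group 3: the attribute value. The lookahead rejects excluded prefixes,
    // and \2 insists on the same quote character at the end.
    private static readonly Regex Href = new Regex(
        @"(href\s*=\s*)(['""])(?!(?:https?|ftp|mailto|javascript):|#)([^'""]+)\2",
        RegexOptions.IgnoreCase);

    public static string Rewrite(string html)
    {
        // Rebuild the attribute with "pages/" inserted before the value.
        return Href.Replace(html, "${1}${2}pages/${3}${2}");
    }
}
```

This is still a text-level substitution, so the caveats in the answers above about using regex on HTML apply here as well.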

Regex replace string function not working as expected

I'm trying to implement a hashtag function in a web app to easily embed search links into a page. The issue is that I'm trying to do a replace on the hash marks so they don't appear in the HTML output. Since I also want to be able to have hash marks in the output, I can't just do a final Replace on the entire string at the end of processing. I'll eventually want to be able to escape some hash marks, like so: \#1 is my answer, and then find and replace the \# with just #, but that is another problem that I'm not ready for yet (though I'm still thinking about it).
This is what I have so far mocked up in a console app,
static void Main(string[] args)
{
Regex _regex = new Regex(@"(#([a-z0-9]+))");
string link = _regex.Replace("<p>this is #my hash #tag.</p>", MakeLink("$1"));
}
public static string MakeLink(string tag)
{
return string.Format("<a href=\"{0}\">{1}</a>", tag.Replace("#", ""), tag);
}
The output being:
<p>this is <a href="#my">#my</a> hash <a href="#tag">#tag</a>.</p>
But when I step through MakeLink() in the debugger, its argument is displayed as "$1", and the hashes are not replaced as expected.
Is there a better tool for the job than regex? Or can I do something else to get this working correctly?

Note that you're passing a literal "$1" into MakeLink, not the first captured group. Thus your .Replace("#", "") is doing nothing. The regular expression then replaces the two occurrences of "$1" in the output of MakeLink with the first capture group.
If you replace "$1" with "$2" then I think you get the result you want, just not quite in the manner you're expecting.
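An alternative that avoids the literal "$1" problem altogether is the Regex.Replace overload that takes a MatchEvaluator: the delegate is called once per match, so MakeLink receives the real captured text. This is only a sketch; the class name HashtagLinker and the "/search?q=" link format are hypothetical, since the question does not show the actual link markup.

```csharp
using System.Text.RegularExpressions;

public static class HashtagLinker
{
    // Group 1 captures the tag text without the leading '#'.
    private static readonly Regex Hashtag = new Regex(
        @"#([a-z0-9]+)", RegexOptions.IgnoreCase);

    public static string Link(string html)
    {
        // The lambda runs for every match, after the capture is known.
        return Hashtag.Replace(html, m => MakeLink(m.Groups[1].Value));
    }

    // Assumption: the "/search?q=" URL is made up for this example.
    public static string MakeLink(string tag)
    {
        return string.Format("<a href=\"/search?q={0}\">{0}</a>", tag);
    }
}
```

Because the capture group already excludes the hash mark, no extra Replace("#", "") call is needed inside MakeLink.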

To not replace your escaped hashtags, just modify your current regex to not match anything that starts with an escape:
Regex _regex = new Regex(@"(?<!\\)(#([a-z0-9]+))");
And then apply a new regex to find only escaped hashtags and replace them with unescaped ones:
Regex _escape = new Regex(@"\\(#([a-z0-9]+))");
string unescaped = _escape.Replace(input, "$1");
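The two passes described above can also be folded into one. In the sketch below, an alternation matches either an escaped hash (\#) or a real hashtag, and a match evaluator decides what to emit. The class name and the "/search?q=" link format are again assumptions for illustration only.

```csharp
using System.Text.RegularExpressions;

public static class EscapedHashtagLinker
{
    // First alternative: a backslash-escaped '#'. Second: a real hashtag,
    // with the tag text captured in group 1. The escape is tried first,
    // so "\#1" is never treated as a hashtag.
    private static readonly Regex Token = new Regex(
        @"\\#|#([a-z0-9]+)", RegexOptions.IgnoreCase);

    public static string Process(string input)
    {
        return Token.Replace(input, m =>
            m.Groups[1].Success
                // Hypothetical link format, not taken from the question.
                ? string.Format("<a href=\"/search?q={0}\">{0}</a>", m.Groups[1].Value)
                // An escaped hash becomes a plain '#'.
                : "#");
    }
}
```

Doing both replacements in one pass means the unescaped '#' produced from "\#" can never be picked up again by the hashtag pattern.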
