Regular expression with URL extraction

Regular expression with URL extraction - c#

I am using C# for this project and need a way to turn plain text into HTML. I found a regular expression (I think on Stack Overflow, actually) for converting links in the text to anchor tags in HTML. It looks like this:
Regex regx = new Regex(@"https?://([-\w\.]+)+(:\d+)?(/([\w/_\.]*(\?\S+)?)?)?", RegexOptions.IgnoreCase);
MatchCollection matches = regx.Matches(input);
foreach (Match match in matches)
{
    output = output.Replace(match.Value, String.Format("<a href=\"{0}\">{0}</a>", match.Value));
}
It works great, but I found a flaw: it doesn't consider a dash (-) as part of the URL, so when it hits the first dash it closes the anchor tag.
So I obviously need to include the dash somehow in the regular expression, but the problem is that I have absolutely no clue about regex and it just looks like Russian to me.
Does anyone have an idea what small edit I need to make to the regex to make it accept a dash as an allowed character in the URL?

Try this: @"https?://([-\w\.]+)+(:\d+)?(/([-\w/_\.]*(\?\S+)?)?)?"
I added a dash to the second character class (the part in square brackets) to match dashes in the part of the URL that is not the domain name.
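For completeness, here is a minimal sketch of how the pattern with the extra dash could be used to build the anchors in a single pass. The names Linkifier and Linkify are made up for illustration. It swaps the loop over output.Replace for Regex.Replace with a match evaluator, so a URL that appears twice in the text is not wrapped twice. It does not HTML-encode the URL, which real input would probably need.

```csharp
using System.Text.RegularExpressions;

public static class Linkifier
{
    // Pattern from the answer above: the dash is now allowed in the path part too.
    private static readonly Regex UrlPattern = new Regex(
        @"https?://([-\w\.]+)+(:\d+)?(/([-\w/_\.]*(\?\S+)?)?)?",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: wraps every matched URL in an anchor tag.
    public static string Linkify(string input)
    {
        // The evaluator runs once per match, so each URL is replaced exactly once.
        return UrlPattern.Replace(
            input,
            m => "<a href=\"" + m.Value + "\">" + m.Value + "</a>");
    }
}
```

Calling Linkifier.Linkify("see http://example.com/my-page.html now") wraps the whole URL, including the part after the dash.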

I use this one, which supports the ftp and file schemes as well as http:
@"\b((https?|ftp|file)://|(www|ftp)\.)[-A-Z0-9+&@#/%?=~_|$!:,.;\(\)]*[A-Z0-9+&@#/%=~_|$]"
It will recognise a URL that contains parameters delimited by & like this:
http://www.cbsnews.com/video/watch/?id=7400904n&tag=re1.channel
The original is at Extract URLs from a text (Regex). I modified it slightly to recognise a URL that contains parentheses like this:
http://msdn.microsoft.com/en-us/library/ms686722(v=VS.85).aspx
You need to specify RegexOptions.IgnoreCase with this regex, though of course you could simplify it by replacing A-Z0-9 and _ with \w.
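A short sketch of collecting every URL from a block of text with this pattern follows. UrlFinder and FindUrls are hypothetical names used only for this example. Note how the final character class keeps trailing punctuation such as a comma or full stop out of the match.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class UrlFinder
{
    // Pattern from the answer above; IgnoreCase is required because it only lists A-Z.
    private static readonly Regex UrlPattern = new Regex(
        @"\b((https?|ftp|file)://|(www|ftp)\.)[-A-Z0-9+&@#/%?=~_|$!:,.;\(\)]*[A-Z0-9+&@#/%=~_|$]",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: returns all URLs found in the text, in order of appearance.
    public static List<string> FindUrls(string text)
    {
        var urls = new List<string>();
        foreach (Match match in UrlPattern.Matches(text))
        {
            urls.Add(match.Value);
        }
        return urls;
    }
}
```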

Related

Obtain a particular URL from a large string

I have some exported performance data from Chrome in C#, and it contains a large number of URLs. I want one specifically, and only the first occurrence of it. Actually it could be any occurrence, as it's repeated a number of times. If I have a string of various garbage and URLs mixed together, how would I find the one that starts with https and ends in mpa?
So it would be like https://thisisaurl.com/2020/11/20/14243324324/324234/test.mpa. Note that everything between the https and mpa could be different. The thisisaurl.com part will probably stay the same, but I can't be sure right now. I just know the URL will end in mpa.
I've been playing with something like this:
var linkParser = new Regex(@"\b(?:https?://|mpa\.)\S+\b", RegexOptions.Compiled | RegexOptions.IgnoreCase);
foreach (Match m in linkParser.Matches(logs[i].Message))
    Console.WriteLine(m.Value);
But it wasn't giving me what I'm looking for, just some other URL starting with https. I'd appreciate any help.
Also example below:
{"columnNumber":104001,"functionName":"","lineNumber":1,"scriptId":"9","url":"https://www.blabla.com/assets/build/js/show/videoTop-b8f5d35a3719d4f31aee.min.js"},{"columnNumber":82859,"functionName":"makeRequestStandard","lineNumber":427,"scriptId":"31","url":"https://tags.blabla2.com/utag/i/comsite/prod/utag.js"},{"columnNumber":82357,"functionName":"makeRequest","lineNumber":427,"scriptId":"31","url":"https://tags.blabla2.com/utag/i/comsite/prod/utag.js"},{"columnNumber":168917,"functionName":"request","lineNumber":427,"scriptId":"31","url":"https://tags.blabla2.com/utag/i/comsite/prod/utag.js"},{"columnNumber":162205,"functionName":"send","lineNumber":427,"scriptId":"31","url":"https://tags.blabla2.com/utag/i/comsite/prod/utag.js"},{"columnNumber":132652,"functionName":"postHeartbeat","lineNumber":427,"scriptId":"31","url":"https://tags.blabla2.com/utag/i/comsite/prod/utag.js"},{"columnNumber":131860,"functionName":"sendHeartbeat","lineNumber":427,"scriptId":"31","url":"https://tags.blabla2.com/utag/i/comsite/prod/utag.js"},{"hasPostData":true,"headers":{"Content-Type":"application/json","Referer":"https://www.bla.com/info/me/a4Af8ptKlr5gthauHnde5C9JdeJcNnWa","User-Agent":"Mozilla/5.0
https://blabla3.bla.com/2020/11/20/1822084675943/383378/value.mpa\",\":null,\"evs\":[{\"name\":\"\",\"attr\":
So in the case of above want
https://blabla3.bla.com/2020/11/20/1822084675943/383378/value.mpa
Note the string contains a lot more than above, that's just a middle snippet.

You had the right idea using a regular expression; it just needs a few tweaks. I was able to match the string you were looking for with a quick and dirty regex of https://[^\s]+\w\.mpa. It matches the characters https:// literally (in .NET the / does not need escaping, although writing \/ is harmless), [^\s]+ matches one or more non-whitespace characters (^ is a negation, \s is short for whitespace characters), \w matches a word character (i.e. the end of value in value.mpa), and \.mpa matches .mpa literally (an unescaped . would match any character). You can tweak it as needed for case insensitivity or other needs.
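To get only the first occurrence, Regex.Match can be used instead of looping over Matches. The sketch below is one possible variant, not the pattern from the answer verbatim: it assumes URLs in the log are delimited by whitespace or double quotes, so it excludes the quote from the URL body and uses a lazy quantifier to stop at the first .mpa. MpaFinder and FindFirstMpaUrl are hypothetical names.

```csharp
using System.Text.RegularExpressions;

public static class MpaFinder
{
    // Assumption: a URL never contains whitespace or a double quote.
    // The lazy +? stops at the first ".mpa" instead of the last one.
    private static readonly Regex MpaUrl = new Regex(
        @"https://[^\s""]+?\.mpa",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: returns the first matching URL, or null if there is none.
    public static string FindFirstMpaUrl(string text)
    {
        Match match = MpaUrl.Match(text);
        return match.Success ? match.Value : null;
    }
}
```

Excluding the quote matters when the log has long runs without whitespace: otherwise a match could start at an earlier https URL and run across several JSON fields before reaching .mpa.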

how can I use unnamed Regex groups in C# inside my regex?

Hey, so my current regex is @"(into)(to)add\s[^\s]{1,}\1|\2[^\s]{1,}". I want the input to be something like "add word into/to category". The regex in general works fine, except for the \1|\2 part. I tried using groups and all sorts of solutions, but I just can't figure out how to make it accept either into or to.
Can anyone help me out? (This is in C#, using the Regex class.)

If I have understood you correctly, you don't need back references to (unnamed) groups; you can use a simple alternation, like this:
@"add \w+ (into|to) \w+"
That will match either into or to in the search string.
Edit:
Let's get a little more 'advanced', using the optional sign '?':
@"add \w+ (in)?to \w+"
This will match 'in' zero or one time, followed by 'to', so it matches into as well as to, just like the alternation above.
Edit2:
I have a feeling you want to use a variable inside your regex. You can of course do that like this:
string search = "into|to";
Regex regEx = new Regex(@"add \w+ (" + search + @") \w+");

From your given example I think you're looking for a regex like add\s\w+\s(into|to)\s\w+. Your current regex only matches strings containing "intoto", which is probably not what you want.
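If the goal is to pull the word and the category out of the command, capture groups do the job. This is only a sketch: AddCommand and TryParse are invented names, and it assumes the whole input is a single command, hence the ^ and $ anchors.

```csharp
using System.Text.RegularExpressions;

public static class AddCommand
{
    // (?:into|to) is a non-capturing alternation, so the word is group 1 and the category is group 2.
    private static readonly Regex Pattern = new Regex(
        @"^add\s+(\S+)\s+(?:into|to)\s+(\S+)$",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: returns true and fills the out parameters when the input matches.
    public static bool TryParse(string input, out string word, out string category)
    {
        Match match = Pattern.Match(input);
        word = match.Success ? match.Groups[1].Value : "";
        category = match.Success ? match.Groups[2].Value : "";
        return match.Success;
    }
}
```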

Regex to find anchor tag consist of new line in c# .net

I want to find the href of an anchor tag, so I have used this regex:
<a\s*[^>]*\s*href\s*\=\s*([^(\s*|\>)]*)\s*[^>]*>\s*Text\s*<\/a>
Options = IgnoreCase + Singleline
Example:
<a href="/abc/xzy/pqr.com" class="m">Text</a>
So Group[1]="/abc/xzy/pqr.com"
But if the content is like
<a href="/abc/xzy/ //Contains new line
pqr.com" class="m">Text</a>
then Group[1]="/abc/xzy/
So I want to know how to get "/abc/xzy/pqr.com" if the content contains a new line (\r\n).

Your capture group is a bit odd: [^(\s*|\>)]* is a character class, so it matches any character that is not a parenthesis, a whitespace character (\s), an asterisk, a pipe, or >.
What you can do, however, is put quotes before and after the capture group:
<a\s*[^>]*\s*href\s*\=\s*"([^(\s*|\>)]*)"\s*[^>]*>\s*Text\s*<\/a>
And then change the character class to [^"] (anything but a double quote):
<a\s*[^>]*\s*href\s*\=\s*"([^"]*)"\s*[^>]*>\s*Text\s*<\/a>
regex101 demo.
This said, it would be better to use a proper html parser instead of regex. It's just that it's more tedious to make a suitable regex because you can forget about a lot of different scenarios, but if you're certain of how your data comes through, regex might be a quick way to get what you need.
If you want to consider single quotes and no quotes at all in some cases, you might try this instead:
<a\s*[^>]*\s*href\s*=\s*((?:[^ ]|[\n\r])+)\s*[^>]*>\s*Text\s*<\/a>
Updated regex101.
This regex uses (?:[^ ]|[\n\r])+ instead, which accepts non-space characters and newlines (and carriage returns, just in case). Note that \s covers spaces, tabs, newlines and form feeds.
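As a rough illustration of the quoted variant, the sketch below captures the href with [^"]* (which also spans line breaks) and then strips the \r and \n characters from the captured value. HrefExtractor and ExtractHref are hypothetical names, and the link text is assumed to be the literal word Text, as in the question.

```csharp
using System.Text.RegularExpressions;

public static class HrefExtractor
{
    // Quoted-value pattern: [^""]* matches anything except a double quote, newlines included.
    private static readonly Regex AnchorPattern = new Regex(
        @"<a\s*[^>]*\s*href\s*\=\s*""([^""]*)""\s*[^>]*>\s*Text\s*<\/a>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Hypothetical helper: returns the href with line breaks removed, or null if nothing matches.
    public static string ExtractHref(string html)
    {
        Match match = AnchorPattern.Match(html);
        if (!match.Success)
        {
            return null;
        }
        // Drop the carriage returns and newlines that split the attribute value.
        return Regex.Replace(match.Groups[1].Value, @"[\r\n]+", "");
    }
}
```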

parsing tweet text with regex

Regex-noob here. Looking for some C# regex code to "syntax highlight" Twitter text. So given this tweet:
@taglius here's some tweet text that shouldn't be highlighted #tagtestpix http://aurl.jpg
I want to find the user mentions (@), hashtags (#), and URLs (http://) and add appropriate HTML to color-highlight these elements. Something like
<font color=red>@taglius</font> here's some tweet text that shouldn't be highlighted <font color=blue>#tagtestpix</font> <font color=yellow>http://aurl.jpg</font>
This isn't the exact html I will use, but I think you get the idea.

The answers above are parts of the whole answer, so I think I can add a little extra to answer your question:
Your highlight function would look something like this:
public static String HighlightTwitter(String input)
{
    String result = Regex.Replace(input, @"\B@\w+", @"<font color=""red"">$0</font>");
    result = Regex.Replace(result, @"\B#\w+", @"<font color=""blue"">$0</font>");
    result = Regex.Replace(result, @"\bhttps?://[-\w]+(\.\w[-\w]*)+(:\d+)?(/[^.!,?;""\'<>()\[\]\{\}\s\x7F-\xFF]*([.!,?]+[^.!,?;""\'<>\(\)\[\]\{\}\s\x7F-\xFF]+)*)?\b", @"<font color=""yellow"">$0</font>", RegexOptions.IgnoreCase);
    return result;
}
I have included \B to make sure that the @ or # is not glued to the end of a word, and \b to make sure that URLs stand alone. (A \b in front of @ or # would do the opposite, because both are non-word characters.) This means that @this_will_highlight but@this_will_not.
If performance might be an issue, you can make the regexes static members with RegexOptions.Compiled.
E.g.:
private static Regex regexAt = new Regex(@"\B@\w+", RegexOptions.Compiled);
...
String result = regexAt.Replace(input, @"<font color=""red"">$0</font>");
...

The following would match the '@' character followed by a sequence of alphanumeric characters:
@\w+
The following would match the '#' character followed by a sequence of alpha-num characters:
\#\w+
There are a lot of free-form http url match expressions, this is the one I use most commonly:
https?://[-\w]+(\.\w[-\w]*)+(:\d+)?(/[^.!,?;""\'<>()\[\]\{\}\s\x7F-\xFF]*([.!,?]+[^.!,?;""\'<>\(\)\[\]\{\}\s\x7F-\xFF]+)*)?
Lastly, you're going to get false positives with all of these, so you'll need to look hard at how to correctly delineate these tags. For instance, take the following tweet:
the url http://Roger@example.com/#bookmark is interesting.
Obviously this is going to be a problem, as all three of the expressions will match inside the URL. To avoid this you will need to figure out what characters are allowed to precede or follow the match. As an example, the following requires whitespace or the start of the string to precede the @name reference and requires a ',' or whitespace to follow it.
(?<=^|\s)@\w+(?=[,\s])
Regex patterns are not easy, I recommend getting a tool like Expresso.
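One way to sidestep the overlap problem is to match all three token types in a single pass, with the URL alternative first, so that any @ or # inside a URL is consumed as part of the URL. The sketch below assumes a deliberately simple URL pattern (everything up to whitespace, a quote or an angle bracket), so trailing punctuation would be included in the link. TweetHighlighter and Highlight are hypothetical names.

```csharp
using System.Text.RegularExpressions;

public static class TweetHighlighter
{
    // URL first: once it matches, the '@' and '#' inside it are never tested separately.
    // (?<!\w) keeps mentions and hashtags from matching in the middle of a word.
    private static readonly Regex Token = new Regex(
        @"(?<url>https?://[^\s<>""]+)|(?<!\w)(?<mention>@\w+)|(?<!\w)(?<tag>#\w+)",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Hypothetical helper: wraps each token in a font tag coloured by its type.
    public static string Highlight(string input)
    {
        return Token.Replace(input, m =>
        {
            string color = m.Groups["url"].Success ? "yellow"
                         : m.Groups["mention"].Success ? "red"
                         : "blue";
            return "<font color=\"" + color + "\">" + m.Value + "</font>";
        });
    }
}
```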

You can parse out the @ replies using (\@\w+). You can parse out the hash tags using (#\w+).

Regex to adjust HTML hrefs in c#

I need to use regex to search through an html file and replace href="pagename" with href="pages/pagename"
Also the href could be formatted like HREF = 'pagename'
I do not want to replace any hrefs that could be upper or lowercase that begin with http, ftp, mailto, javascript, #
I am using c# to develop this little app in.

HTML manipulation through Regex is not recommended since HTML is not a "regular language." I'd highly recommend using the HTML Agility Pack instead. That gives you a DOM interface for HTML.

I have not tested this with many cases, but for this one it worked:
var str = "href='page' href = 'www.goo' href='http://' href='ftp://'";
Console.WriteLine(Regex.Replace(str, @"href ?= ?(('|"")([a-z0-9_#.-]+)('|""))", "x", RegexOptions.IgnoreCase));
Result:
"x x href='http://' href='ftp://'"
You'd better keep backup files before running this :P

There are lots of caveats when using a find/replace with HTML and XML. The problem is, there are many variations of syntax which are permitted. (and many which are not permitted but still work!)
But, you seem to want something like this:
search for
([Hh][Rr][Ee][Ff]\s*=\s*['"])(\w+)(['"])
This means:
[Hh][Rr][Ee][Ff]: each pair in square brackets matches either case of one letter, so together they match "href" in any capitalization, followed by
\s*: any number of whitespace characters (maybe zero),
=
\s*: any more whitespace,
['"]: either quote type,
\w+: a word (without any slashes or dots; if you want to include .html then use [.\w]+ instead),
and ['"]: another quote of either kind.
replace with
$1pages/$2$3
Which means the contents of the first set of brackets, then pages/, then the contents of the second and third sets of brackets.
You will need to put the first string in @"" quotes, and also escape the double quotes as "".
Note that it won't do anything even vaguely intelligent, like making sure the quotes match. Warning: try never to use the "any character" symbol (.) in this kind of regex, as it will grab large sections of text, up to and including the next quotation mark, possibly up to the end of the file!
see a regex tutorial for more info, e.g. http://www.regular-expressions.info/dotnet.html
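Put together in C#, the search and replace could look roughly like the sketch below. HrefRewriter and PrefixPages are made-up names. The sketch uses RegexOptions.IgnoreCase in place of the [Hh][Rr][Ee][Ff] spelling and [.\w]+ so that names like page.html are covered. Because [.\w]+ cannot match ':' or '#', links starting with http:, ftp:, mailto:, javascript: or # are left alone.

```csharp
using System.Text.RegularExpressions;

public static class HrefRewriter
{
    // Group 1: href= plus the opening quote; group 2: the page name; group 3: the closing quote.
    private static readonly Regex HrefPattern = new Regex(
        @"(href\s*=\s*['""])([.\w]+)(['""])",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: inserts "pages/" in front of every plain page name.
    public static string PrefixPages(string html)
    {
        return HrefPattern.Replace(html, "$1pages/$2$3");
    }
}
```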
