C# regex to parse /simple1/1.2-SNAPSHOT/

I need to extract the last two path segments of a URL, "simple1" and "1.2-SNAPSHOT" in the sample URL below. My code (which tries to capture simple1/1.2-SNAPSHOT/) doesn't work. Can anyone help?
http://localhost:8060/nexus/service/local/repositories/snapshots/content/org/sonatype/mavenbook/simple1/1.2-SNAPSHOT/
List<string> artifacts = new List<string>(); // these are already folder URLs
// store all URLs of the artifacts to be deleted
artifacts = nexusAPI.findArtifacts(repository, contents, days, pattern);
var regex = new Regex(".*\\/(.*\\/.*\\/)$");
foreach (string url in artifacts)
{
Console.WriteLine("group/artifact: {0}", regex.Matches(url));
}

I would just split the string on '/' and take the last two parts. The regex isn't going to do anything more than that.
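A minimal sketch of the split approach, using the sample URL from the question. The helper name LastTwoSegments is hypothetical, and the use of StringSplitOptions.RemoveEmptyEntries is an assumption added here so that the trailing slash does not produce an empty last element.

```csharp
// Top-level program (C# 9 or later).
using System;

// Hypothetical helper: returns the last two non-empty path segments of a URL.
static string[] LastTwoSegments(string url)
{
    // RemoveEmptyEntries drops the empty piece produced by the trailing '/'.
    string[] parts = url.Split(new[] { '/' }, StringSplitOptions.RemoveEmptyEntries);
    if (parts.Length < 2)
        throw new ArgumentException("URL has fewer than two path segments.", nameof(url));
    return new[] { parts[parts.Length - 2], parts[parts.Length - 1] };
}

string sampleUrl = "http://localhost:8060/nexus/service/local/repositories/snapshots/content/org/sonatype/mavenbook/simple1/1.2-SNAPSHOT/";
string[] lastTwo = LastTwoSegments(sampleUrl);
Console.WriteLine("group/artifact: {0}/{1}", lastTwo[0], lastTwo[1]);
// prints "group/artifact: simple1/1.2-SNAPSHOT"
```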

If you must use a regex, the issue you're running into is that quantifiers are greedy by default: each .* matches as much as it possibly can. So the first step is to make the quantifier lazy. Use this as your pattern:
(.*?)/
This tells the regex to match any characters up to the next slash, and then stop.
When you call Regex.Matches(url, "(.*?)/"), you get back a MatchCollection with one match per segment. From there, you can just look at the last two elements.
Of course, as SledgeHammer mentioned, this is one case where a regex is unnecessary and even cumbersome. Simply calling url.Split(new char[] {'/'}) will give you the parts you need.
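The regex route could look roughly like the sketch below. The helper name LastTwoByRegex is hypothetical, and the code assumes the URL ends with a slash, as in the question. One detail worth noting: passing a MatchCollection straight to Console.WriteLine prints its type name rather than the matched text, so the values have to be read from each Match and its groups.

```csharp
// Top-level program (C# 9 or later).
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: reads the last two "segment/" matches and returns
// the text captured before each slash.
static string[] LastTwoByRegex(string url)
{
    // Each match is one "segment/" chunk; Groups[1] is the part before the slash.
    MatchCollection matches = Regex.Matches(url, "(.*?)/");
    if (matches.Count < 2)
        throw new ArgumentException("URL has fewer than two slash-terminated segments.", nameof(url));
    return new[]
    {
        matches[matches.Count - 2].Groups[1].Value,
        matches[matches.Count - 1].Groups[1].Value
    };
}

string sampleUrl = "http://localhost:8060/nexus/service/local/repositories/snapshots/content/org/sonatype/mavenbook/simple1/1.2-SNAPSHOT/";
string[] lastTwo = LastTwoByRegex(sampleUrl);
Console.WriteLine("group/artifact: {0}/{1}", lastTwo[0], lastTwo[1]);
// prints "group/artifact: simple1/1.2-SNAPSHOT"
```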

Related

Regex - Extract also URLs with www

I use this regex to find URLs:
(http|ftp|https):\/\/([\w\-_]+(?:(?:\.[\w\-_]+)+))([\w\-\.,#?^=%&:/~\+#]*[\w\-\#?^=%&/~\+#])?
Problem is, that it doesn't find urls which start with www.
How can I solve this?
Here is my data source that I need to extract urls from.

This answer is based on the XML file you provided in your comment.
There are a couple of issues with your file. Besides URLs starting with https, http and www, it contains URLs that start with download.somedomain.com and marketplace.somedomain.com, so it is inconsistent. Another issue is that a URL can be followed directly by . or </ with no space after it, and the file has no structure that lets you go through it line by line or chunk by chunk.
Finally, it contains duplicates.
I chose to solve this by splitting the regex into two parts:
The first part takes every candidate that starts like a valid URL, without looking at how it ends.
The second part extracts the valid URL from what the first part returned.
To handle duplicates, I used a HashSet.
The solution does not consider specific tags or content in the XML; it only cares about URLs in the text.
Here is the solution:
// input holds the text of the XML file
HashSet<string> urls = new HashSet<string>(); // a set drops duplicates automatically
// Step 1: anything that starts with a scheme or "www."
var beginWith = new Regex(@"\b(?:(http|ftp|https)?://|www\.)\S+\b", RegexOptions.Compiled | RegexOptions.IgnoreCase);
// Step 2: trim each candidate down to host plus path
var endWith = new Regex(@"([\w\-_]+(?:(?:\.[\w\-_]+)+))([\w\-\.,#?^=%&:/~\+#]*[\w\-\#?^=%&/~\+#])?");
foreach (Match item in beginWith.Matches(input))
{
    foreach (Match url in endWith.Matches(item.Value))
    {
        urls.Add(url.Value);
    }
}
The code can indeed be shortened and improved; I leave that to you.
Here are the first five URLs from the final output for the file:
www.w3.org/2005/Atom
marketplace.xboxlive.com/resource/product/v1
www.xbox.com/live/accounts
download.xbox.com/content/images/66acd000-77fe-1000-9115-d802534307d4/1033/boxartlg.jpg
download.xbox.com/content/images/66acd000-77fe-1000-9115-d802534307d4/1033/boxartsm.jpg
etc.....

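Well, just check whether your string starts with "https://" or "http://"; if not, add https:// at the beginning.
string url = "";
// Both checks must fail before a scheme is added, hence && rather than ||.
if (!url.StartsWith("https://") && !url.StartsWith("http://"))
{
    // Strings are immutable: Insert returns a new string, so assign the result.
    url = url.Insert(0, "https://");
}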

Finding reCaptcha ID with Regex

Alright, so I've been trying to pull the reCaptcha ID out of a web page source that I'm downloading. I was going to do this with Regex: find the line that contains the ID, then pull the ID out of that line (if that makes sense).
Here's how I'm doing it right now:
WebClient W = new WebClient();
W.Encoding = System.Text.Encoding.UTF8;
string pattern = "recaptcha_challenge_field";
string SourceCode = W.DownloadString("http://www.xtremetop100.com/in.php?COLLCC=4025385947&COLLCC=1765882190&site=1132330052");
foreach (string Match in Regex.Split(SourceCode, Environment.NewLine))
{
if (Regex.IsMatch(Match, pattern, RegexOptions.IgnoreCase))
{
MessageBox.Show(Match);
}
}
The problem is that it shows the whole page source instead of just the line that contains the pattern. I tried changing the encoding because I thought the source was being returned as one big line, but I guess that's not the answer. Any help here, guys? Thank you.

First of all, your naming convention is terrible!
Local variable names should start with a lowercase letter, so "sourceCode" and "match". A capital letter suggests that "Match" is a class, not a variable.
Secondly, why are you using Split from the Regex class only to split a string by newlines? Use the built-in string method:
foreach (string line in sourceCode.Split(new string[] { Environment.NewLine }, StringSplitOptions.None))
Now, if you notice, I changed the name of the variable, so your code will look like this:
if (Regex.IsMatch(line, pattern, RegexOptions.IgnoreCase))
{
MessageBox.Show(line);
}
And now it is obvious that the code does exactly what you wrote: if a line matches the pattern, it shows the whole line.
Another thing: what is your regex pattern? As written, it is just a plain substring comparison that checks whether the line matches or not. Try reading more about regex. Your pattern should look more like
recaptcha_challenge_field=([0-9]+). I don't know exactly, because the link you posted contains only a "refresh" meta tag.
Also, try the Regex.Match method instead of Regex.IsMatch. It gives you more information: not only whether the string matches your pattern, but also what the groups within it captured.
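To illustrate the Regex.Match suggestion, here is a small sketch. The helper name FindChallengeId and the sample page text are hypothetical, and the pattern is only the guess given above, so it would need adjusting to the real markup. The sketch also splits on both "\r\n" and "\n"; this is an assumption about one possible cause of the "whole page" symptom, since a page that uses bare "\n" line endings would not be split by Environment.NewLine on Windows.

```csharp
// Top-level program (C# 9 or later).
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: returns the captured ID, or an empty string when no line matches.
static string FindChallengeId(string sourceCode)
{
    // Assumed pattern (the guess from the answer); the real page may differ.
    var idPattern = new Regex(@"recaptcha_challenge_field=([0-9]+)", RegexOptions.IgnoreCase);

    // Split on Windows and Unix line endings so either kind of page works.
    string[] lines = sourceCode.Split(new[] { "\r\n", "\n" }, StringSplitOptions.None);
    foreach (string line in lines)
    {
        Match match = idPattern.Match(line);
        if (match.Success)
            return match.Groups[1].Value; // group 1 holds only the digits
    }
    return string.Empty;
}

// Made-up page source, for illustration only.
string samplePage = "<html>\n<a href=\"challenge?recaptcha_challenge_field=12345\">x</a>\n</html>";
Console.WriteLine(FindChallengeId(samplePage)); // prints "12345"
```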

Regex to Replace the end of the Url

I have a URL that follows a pattern like the one below:
http://i.ebayimg.com/00/s/MTUw12323gxNTAw/$(KGr123qF,!p0F123Q~~60_12.JPG?set_id=88123231232F
I need a regex to find and replace the end of the URL, _12.JPG, with _14.JPG. So basically I need to capture the _[numbers only].JPG pattern and replace it with my value.

var regex = new Regex(@"_\d+\.JPG");
var newUrl = regex.Replace(url, "_14.JPG");
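For reference, a self-contained version of the same replacement, run against the sample URL from the question. The helper name ReplaceImageSuffix is hypothetical; the pattern is the one shown above.

```csharp
// Top-level program (C# 9 or later).
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: swaps the "_<digits>.JPG" part of the URL for newSuffix.
static string ReplaceImageSuffix(string url, string newSuffix)
{
    // \d+ matches one or more digits; \. matches a literal period.
    return Regex.Replace(url, @"_\d+\.JPG", newSuffix);
}

string sampleUrl = "http://i.ebayimg.com/00/s/MTUw12323gxNTAw/$(KGr123qF,!p0F123Q~~60_12.JPG?set_id=88123231232F";
Console.WriteLine(ReplaceImageSuffix(sampleUrl, "_14.JPG"));
// prints the same URL with "_12.JPG" changed to "_14.JPG"; the query string is kept
```

The pattern is case-sensitive as written; RegexOptions.IgnoreCase could be passed if lowercase .jpg extensions also need to match.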

_[0-9]+\.JPG\?
works for the sample URL. You didn't really mention whether you wanted the
?set_id=88123231232F gone or not.

Basically, you normally shouldn't need to worry about periods elsewhere in the URL. They can occur, but the additional constraint of the .jpg extension should keep false matches from being much of an issue.
// /_(\d?\d)\.jpg/ig
var regex = new Regex(@"_(\d?\d)\.[Jj][Pp][Gg]");
That will capture one or two digits between an underscore and .jpg.
I will double check this, but it should work for both one digit and two digits.

Regex between, from the last to specific end

Today I want to extract some text from a string.
The text must lie between the last slash and the end of .partX.rar or .rar.
First I tried to find the end of the text and then the beginning. Once I had those two pieces I merged them, but I got empty results.
String:
http://hosting.xyz/1234/15-game.part4.rar.html
http://hosting.xyz/1234/16-game.rar.html
Regex:
Begin: (([^/]*)$) - start from the last /
End: (.*(?=.part[0-9]+.rar|.rar)) - stop before .partX.rar or .rar
As you can see, if I merge these patterns I don't get any result.
What is more, "End" selects only up to .partX instead of .partX.rar.
All I want is:
15-game.part4.rar and 16-game.rar
What I tried:
(([^/]*)$)(.*(?=.part[0-9]+.rar|.rar))
(([^/]*)$)
(.*(?=.part[0-9]+.rar|.rar))
I also tried
/[a-zA-Z0-9]+
but I do not know how to match symbols. This could be the easiest way, but it matches only letters and digits, not - or _.
If only I could match symbols too.

You don't really need a regex for this as you can merely split the url on / and then grab the part of the file name that you need. Since you didn't mention a language, here's an implementation in Perl:
use strict;
use warnings;
my $str1="http://hosting.xyz/1234/15-game.part4.rar.html";
my $str2="http://hosting.xyz/1234/16-game.rar.html";
my $file1=(split(/\//,$str1))[-1]; #last element of the resulting array from splitting on slash
my $file2=(split(/\//,$str2))[-1];
foreach($file1,$file2)
{
s/\.html$//; #for each file name, if it ends in ".html", get rid of that ending.
print "$_\n";
}
The output is:
15-game.part4.rar
16-game.rar

Nothing could be simpler! :-)
Use this:
new Regex(@"^.*/(.*)\.html$")
You'll find your filename in the first captured group (I don't have a C# compiler at hand, so I can't give you a working sample, but you have a working regex now! :-) )
See a demo here: http://rubular.com/r/UxFNtJenyF
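A possible C# usage of that pattern, as a sketch. The helper name ExtractArchiveName is hypothetical, and the pattern is written as a verbatim string so the backslash reaches the regex engine unchanged.

```csharp
// Top-level program (C# 9 or later).
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: returns the file name without ".html", or "" if the URL does not match.
static string ExtractArchiveName(string url)
{
    // The greedy ^.*/ runs to the last slash; group 1 is everything before ".html".
    Match match = Regex.Match(url, @"^.*/(.*)\.html$");
    return match.Success ? match.Groups[1].Value : string.Empty;
}

Console.WriteLine(ExtractArchiveName("http://hosting.xyz/1234/15-game.part4.rar.html")); // prints "15-game.part4.rar"
Console.WriteLine(ExtractArchiveName("http://hosting.xyz/1234/16-game.rar.html"));       // prints "16-game.rar"
```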

I'm not a C# coder so I can't write the full code here, but I think you'll need a negative lookahead, like this:
new Regex(@"/(?!.*/)(.+?)\.html$");
Matched group #1 will contain your string, i.e. "16-game.rar" or "15-game.part4.rar".

Use two regexes:
first, replace .*/ with nothing;
then replace \.html with nothing.
Job done!
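In C# the two substitutions might be sketched as follows. The helper name StripToArchiveName is hypothetical, and the second pattern is anchored with $ here (an addition to the suggestion above) so that only a trailing .html is removed.

```csharp
// Top-level program (C# 9 or later).
using System;
using System.Text.RegularExpressions;

// Hypothetical helper applying the two replacements in order.
static string StripToArchiveName(string url)
{
    // Step 1: the greedy .*/ removes everything up to and including the last slash.
    string fileName = Regex.Replace(url, ".*/", "");
    // Step 2: remove a trailing ".html" (anchoring with $ is an added assumption).
    return Regex.Replace(fileName, @"\.html$", "");
}

Console.WriteLine(StripToArchiveName("http://hosting.xyz/1234/15-game.part4.rar.html")); // prints "15-game.part4.rar"
Console.WriteLine(StripToArchiveName("http://hosting.xyz/1234/16-game.rar.html"));       // prints "16-game.rar"
```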

Regex BBCode to HTML

I am writing a BBCode-to-HTML converter.
The converter should skip unclosed tags.
I thought of two ways to do it:
1) Match all tags at once with a single regex call, like:
Regex re2 = new Regex(@"\[(\/?(?:b|i|u|quote|strike))\]");
MatchCollection mc = re2.Matches(sourcestring);
and then loop over the MatchCollection using two pointers to find opening and closing tags, replacing each pair with the right HTML tag.
2) Call a regex once per tag and replace directly:
Regex re = new Regex(@"\[b\](.*?)\[\/b\]");
string s1 = re.Replace(sourcestring2,"<b>$1</b>");
Which is more efficient?
The first option uses one regex but requires me to loop through all the tags, find all the pairs, and skip tags that don't have a pair.
Another positive thing is that I don't care about the content between the tags; I just work by position and replace them.
In the second option I don't need to worry about looping or writing a special replace function,
but it requires executing multiple regexes and replaces.
What can you suggest?
If the second option is the right one,
there is a problem with the regex
\[b\](.*?)\[\/b\]
How can I fix it to also match across multiple lines, like:
[b]
test 1
[/b]
[b]
test 2
[/b]

One option would be to use more SAX-like parsing: instead of looking for a particular regex, you look for [, have your program handle that event in some manner, then look for the ], handle that event, and so on. Although more verbose than the regex, it may be easier to understand, and it wouldn't necessarily be slower.

r = new System.Text.RegularExpressions.Regex(@"(?:\[b\])(?<name>(?>\[b\](?<DEPTH>)|\[/b\](?<-DEPTH>)|.)+)(?(DEPTH)(?!))(?:\[/b\])", System.Text.RegularExpressions.RegexOptions.Singleline);
var s = r.Replace("asdfasdf[b]test[/b]asdfsadf", "<b>$1</b>");
That should match only elements that have matching closing tags, and it also handles multi-line input (the Singleline option makes . match newlines as well, so the whole input is treated as a single line).
It should also handle [b][b][/b] properly by ignoring the first [b].
As to whether this method is better than your first one, I couldn't say. But hopefully this will point you in the right direction.
Here is code that works with your example:
System.Text.RegularExpressions.Regex r;
r = new System.Text.RegularExpressions.Regex(@"(?:\[b\])(?<name>(?>\[b\](?<DEPTH>)|\[/b\](?<-DEPTH>)|.)+)(?(DEPTH)(?!))(?:\[/b\])", System.Text.RegularExpressions.RegexOptions.Singleline);
var s = r.Replace("[b]bla bla[/b]bla bla[b] " + "\r\n" + "bla bla [/b]", "<b>$1</b>");
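For the narrower multi-line part of the question, the simple pattern from option 2 can be kept and combined with RegexOptions.Singleline, as in the sketch below. The helper name ConvertBold is hypothetical. Unlike the balancing-group pattern, this version does not track nesting: it pairs each [b] with the nearest following [/b] and leaves a [b] with no closing tag untouched.

```csharp
// Top-level program (C# 9 or later).
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: converts closed [b]...[/b] pairs, including ones that span lines.
static string ConvertBold(string source)
{
    // Singleline makes '.' match '\n' too; the lazy .*? stops at the nearest [/b].
    return Regex.Replace(source, @"\[b\](.*?)\[/b\]", "<b>$1</b>", RegexOptions.Singleline);
}

string sampleText = "[b]\ntest 1\n[/b]\n[b]\ntest 2\n[/b]";
Console.WriteLine(ConvertBold(sampleText));
```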
