Finding reCaptcha ID with Regex - c#

Alright, so I've been trying to pull the reCaptcha ID out of a web source that I'm downloading, I was going to do this with Regex, pull the line out with what it contains it, then pull the ID from there [If that makes sense].
Here's how I'm doing it right now:
WebClient W = new WebClient();
W.Encoding = System.Text.Encoding.UTF8;
string pattern = "recaptcha_challenge_field";
string SourceCode = W.DownloadString("http://www.xtremetop100.com/in.php?COLLCC=4025385947&COLLCC=1765882190&site=1132330052");
foreach (string Match in Regex.Split(SourceCode, Environment.NewLine))
{
if (Regex.IsMatch(Match, pattern, RegexOptions.IgnoreCase))
{
MessageBox.Show(Match);
}
}
Problem being, that it just shows the whole page source besides the line with the "pattern" in it. I tried changing the encoding type because I thought it was returning the source as one big sentence, but I guess that's not the answer. Any help here guys? Thank you.

Fist of all you have terrible name convention!
Local variable should have names starts with small letter, so "sourceCode", "match". Big letter suggest that "Match" is class not variable.
Secondly, why you using Split from Regex class only to split string by new lines? Use build-in string method:
foreach (string line in sourceCode.Split(new string[] { Environment.NewLine }, StringSplitOptions.None))
Now... if you notice, I change the name of variable, so now your code will looks like
if (Regex.IsMatch(line, pattern, RegexOptions.IgnoreCase))
{
MessageBox.Show(line);
}
And you will see obvious, code do excaly what you want: If line match the pattern, then show whole line.
Another thing: what is your regex pattern? This is more like comparison to check if it's match or not. Try read about regex more. Your pattern should looks more like
recaptcha_challenge_field=([0-9]+). I don't know exactly, because the link you are posted contain only "refresh" meta tag.
And try to use Regex.Match method instead Regex.IsMatch. It gives you more information, not only if the string match your pattern, but what groups within you capture.

Related

Regex - Extract also URLs with www

I use this regex to find URLs:
(http|ftp|https):\/\/([\w\-_]+(?:(?:\.[\w\-_]+)+))([\w\-\.,#?^=%&:/~\+#]*[\w\-\#?^=%&/~\+#])?
Problem is, that it doesn't find urls which start with www.
How can I solve this?
Here is my data source that I need to extract urls from.
This answer is based on the provide xml file you come with in your comment.
There are couple of issues with your file, beside starting with https, http and www, it contains urls that start with download.somedomain.com, marketplace.somedmain.com, so it is inconsistence. the other issues is the ending of the the url can end with ., </, it does not have spaces after ending the url and it does not have a pattern to go through it line by line or chunk by chunk.
And last thing it contains duplicates.
The way I chose to solve, by chopping regex in 2 parts:
One part take all urls that start with valid url, with out looking at the end of it.
The second part take care of the valid url of what is remained from first part.
Regarding duplicates, I used hashset for that.
The solution does not consider specific tags in the xml or specific contain, it just care about urls in content.
Here is the solution:
HashSet<string> urls = new HashSet<string>();
var beginWith = new Regex(#"\b(?:(http|ftp|https)?://|www\.)\S+\b", RegexOptions.Compiled | RegexOptions.IgnoreCase);
foreach (Match item in beginWith.Matches(input))
{
var endWith = new Regex(#"([\w\-_]+(?:(?:\.[\w\-_]+)+))([\w\-\.,#?^=%&:/~\+#]*[\w\-\#?^=%&/~\+#])?");
foreach (Match url in endWith.Matches(item.ToString()))
{
urls.Add(url.ToString());
}
}
The code here can in deed be reduced and improved. I leave it for your fantasy.
Here is the final and 5 first urls output of the file:
www.w3.org/2005/Atom
marketplace.xboxlive.com/resource/product/v1
www.xbox.com/live/accounts
download.xbox.com/content/images/66acd000-77fe-1000-9115-d802534307d4/1033/boxartlg.jpg
download.xbox.com/content/images/66acd000-77fe-1000-9115-d802534307d4/1033/boxartsm.jpg
etc.....
Well, just check if your string contain "https://" or "http://", if not, add https:// at the beginning ^^
string url = "";
if (!url.Contains("https://") || !url.Contains("http://"))
{
url.Insert(0, "https://");
}

C# regex to parse /simple1/1.2-SNAPSHOT/

I need to find the last two values at the end of such a string, "simple1" and "1.2-SNAPSHOT" in the sample url below. But my code below (try to get simple1/1.2-SNAPSHOT/) doesn't work, can anyone help?
http://localhost:8060/nexus/service/local/repositories/snapshots/content/org/sonatype/mavenbook/simple1/1.2-SNAPSHOT/
List<string> artifacts = new List<string>(); // this is already foler URL
// store all URLs to the artifacts be deleted
artifacts = nexusAPI.findArtifacts(repository, contents, days, pattern);
var regex = new Regex(".*\\/(.*\\/.*\\/)$");
foreach (string url in artifacts)
{
Console.WriteLine("group/artifact: {0}", regex.Matches(url));
}
I would just split the string on '/' and get the last two parts. The regex isn't going to do anything more then that.
If you must use RegEx, you're encountering an issue in that regexes are greedy - that means it puts as much in each .* as it possibly can. So your first step is to make the regex not greedy. Simply use this as your pattern:
(.*?)/
Here's a simple test showing how that this works.
This tells the regex to look for any character up to the slash, and then stop.
When you call Regex.Matches(url, "(.*?)/"), you will get returned an array of the matching data. From there, you can just look at the last two elements.
Of course, as SledgeHammer mentioned, this is one case where regex is unnecessary and even cumbersome. Simply working with url.Split(new char[] {'/'}) will give you the results you need.

Regular expression for recognizing url

I want to create a Regex for url in order to get all links from input string.
The Regex should recognize the following formats of the url address:
http(s)://www.webpage.com
http(s)://webpage.com
www.webpage.com
and also the more complicated urls like:
- http://www.google.pl/#sclient=psy&hl=pl&site=&source=hp&q=regex+url&pbx=1&oq=regex+url&aq=f&aqi=g1&aql=&gs_sm=e&gs_upl=1582l3020l0l3199l9l6l0l0l0l0l255l1104l0.2.3l5l0&bav=on.2,or.r_gc.r_pw.&fp=30a1604d4180f481&biw=1680&bih=935
I have the following one
((www\.|https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+[\w\d:##%/;$()~_?\+-=\\\.&]*)
but it does not recognize the following pattern: www.webpage.com. Can someone please help me to create an appropriate Regex?
EDIT:
It should works to find an appropriate link and moreover place a link in an appropriate index like this:
private readonly Regex RE_URL = new Regex(#"((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+[\w\d:##%/;$()~_?\+-=\\\.&]*)", RegexOptions.Multiline);
foreach (Match match in (RE_URL.Matches(new_text)))
{
// Copy raw string from the last position up to the match
if (match.Index != last_pos)
{
var raw_text = new_text.Substring(last_pos, match.Index - last_pos);
text_block.Inlines.Add(new Run(raw_text));
}
// Create a hyperlink for the match
var link = new Hyperlink(new Run(match.Value))
{
NavigateUri = new Uri(match.Value)
};
link.Click += OnUrlClick;
text_block.Inlines.Add(link);
// Update the last matched position
last_pos = match.Index + match.Length;
}
I don't know why your result in match is only http:// but I cleaned your regex a bit
((?:(?:https?|ftp|gopher|telnet|file|notes|ms-help):(?://|\\\\)(?:www\.)?|www\.)[\w\d:##%/;$()~_?\+,\-=\\.&]+)
(?:) are non capturing groups, that means there is only one capturing group left and this contains the complete matched string.
(?:(?:https?|ftp|gopher|telnet|file|notes|ms-help):(?://|\\\\)(?:www\.)?|www\.) The link has now to start with something fom the first list followed by an optional www. or with an www.
[\w\d:##%/;$()~_?\+,\-=\\.&] I added a comma to the list (otherwise your long example does not match) escaped the - (you were creating a character range) and unescaped the . (not needed in a character class.
See this here on Regexr, a useful tool to test regexes.
But URL matching is not a simple task, please see this question here
I've just written up a blog post on recognising URLs in most used formats such as:
www.google.com
http://www.google.com
mailto:somebody#google.com
somebody#google.com
www.url-with-querystring.com/?url=has-querystring
The regular expression used is /((([A-Za-z]{3,9}:(?:\/\/)?)(?:[-;:&=\+\$,\w]+#)?[A-Za-z0-9.-]+|(?:www.|[-;:&=\+\$,\w]+#)[A-Za-z0-9.-]+)((?:\/[\+~%\/.\w-_]*)?\??(?:[-\+=&;%#.\w_]*)#?(?:[\w]*))?)/ however I would recommend you got to http://blog.mattheworiordan.com/post/13174566389/url-regular-expression-for-links-with-or-without-the to see a complete working example along with an explanation of the regular expression in case you need to extend or tweak it.
The regex you give doesn't work for www. addresses because it is expecting a URI scheme (the bit before the URL, like http://). The 'www.' part in your regular expression doesn't work because it would only match www.:// (which is meaningless)
Try something like this instead:
(((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+)|(www\.)[\w\d:##%/;$()~_?\+-=\\\.&]*)
This will match something with a valid URI scheme, or something beginning with 'www.'

How to Extract the Word Following a Symbol?

I have a string that could have any sentence in it but somewhere in that string will be the # symbol, followed by an attached word, sort of like #username you see on some sites.
so maybe the string is "hey how are you" or it's "#john hey how are you".
IF there's an "#" in the string i want to pull what comes immediately after it into its own new string.
in this instance how can i pull "john" into a different string so i could theoretically notify this person of his new message? i'm trying to play with string.contains or .replace but i'm pretty new and having a hard time.
this btw is in c# asp.net
You can use the Substring and IndexOf methods together to achieve this.
I hope this helps.
Thanks,
Damian
Here's how you do it without regex:
string s = "hi there #john how are you";
string getTag(string s)
{
int atSign = s.IndexOf("#");
if (atSign == -1) return "";
// start at #, stop at sentence or phrase end
// I'm assuming this is English, of course
// so we leave in ' and -
int wordEnd = s.IndexOfAny(" .,;:!?", atSign);
if (wordEnd > -1)
return s.Substring(atSign, wordEnd - atSign);
else
return s.Substring(atSign);
}
You should really learn regular expressions. This will work for you:
using System.Text.RegularExpressions;
var res = Regex.Match("hey #john how are you", #"#(\S+)");
if (res.Success)
{
//john
var name = res.Groups[1].Value;
}
Finds the first occurrence. If you want to find all you can use Regex.Matches. \S means anything else than a whitespace. This means it also make hey #john, how are you => john, and #john123 => john123 which may be wrong. Maybe [a-zA-Z] or similar would suit you better (depends on which characters the usernames is made of). If you would give more examples, I could tune it :)
I can recommend this page:
http://www.regular-expressions.info/
and this tool where you can test your statements:
http://regexlib.com/RESilverlight.aspx
The best way to solve this is using Regular Expressions. You can find a great resource here.
Using RegEx, you can search for the pattern you are after. I always have to refer to some documentation to write one...
Here is a pattern to start with - "#(\w+)" - the # will get matched, and then the parentheses will indicate that you want what comes after. The "\w" means you want only word characters to match (a-z or A-Z), and the "+" indicates that there should be one or more word characters in a row.
You can try Regex...
I think will be something like this
string userName = Regex.Match(yourString, "#(.+)\\s").Groups[1].Value;
RegularExpressions. Dont know C#, but the RegEx would be
/(#[\w]+) / - Everything in the parans is captured in a special variable, or attached to RegEx object.
Use this:
var r = new Regex(#"#\w+");
foreach (Match m in r.Matches(stringToSearch))
DoSomething(m.Value);
DoSomething(string foundName) is a function that handles name (found after #).
This will find all #names in stringToSearch

C# Extracting a name from a string

I want to extract 'James\, Brown' from the string below but I don't always know what the name will be. The comma is causing me some difficuly so what would you suggest to extract James\, Brown?
OU=James\, Brown,OU=Test,DC=Internal,DC=Net
Thanks
A regex is likely your best approach
static string ParseName(string arg) {
var regex = new Regex(#"^OU=([a-zA-Z\\]+\,\s+[a-zA-Z\\]+)\,.*$");
var match = regex.Match(arg);
return match.Groups[1].Value;
}
You can use a regex:
string input = #"OU=James\, Brown,OU=Test,DC=Internal,DC=Net";
Match m = Regex.Match(input, "^OU=(.*?),OU=.*$");
Console.WriteLine(m.Groups[1].Value);
A quite brittle way to do this might be...
string name = #"OU=James\, Brown,OU=Test,DC=Internal,DC=Net";
string[] splitUp = name.Split("=".ToCharArray(),3);
string namePart = splitUp[1].Replace(",OU","");
Console.WriteLine(namePart);
I wouldn't necessarily advocate this method, but I've just come back from a departmental Christmas lyunch and my brain is not fully engaged yet.
I'd start off with a regex to split up the groups:
Regex rx = new Regex(#"(?<!\\),");
String test = "OU=James\\, Brown,OU=Test,DC=Internal,DC=Net";
String[] segments = rx.Split(test);
But from there I would split up the parameters in the array by splitting them up manually, so that you don't have to use a regex that depends on more than the separator character used. Since this looks like an LDAP query, it might not matter if you always look at params[0], but there is a chance that the name might be set as "CN=". You can cover both cases by just reading the query like this:
String name = segments[0].Split('=', 2)[1];
That looks suspiciously like an LDAP or Active Directory distinguished name formatted according to RFC 2253/4514.
Unless you're working with well known names and/or are okay with a fragile hackaround (like the regex solutions) - then you should start by reading the spec.
If you, like me, generally hate implementing code according to RFCs - then hope this guy did a better job following the spec than you would. At least he claims to be 2253 compliant.
If the slash is always there, I would look at potentially using RegEx to do the match, you can use a match group for the last and first names.
^OU=([a-zA-Z])\,\s([a-zA-Z])
That RegEx will match names that include characters only, you will need to refine it a bit for better matching for the non-standard names. Here is a RegEx tester to help you along the way if you go this route.
Replace \, with your own preferred magic string (perhaps & #44;), split on remaining commas or search til the first comma, then replace your magic string with a single comma.
i.e. Something like:
string originalStr = #"OU=James\, Brown,OU=Test,DC=Internal,DC=Net";
string replacedStr = originalStr.Replace("\,", ",");
string name = replacedStr.Substring(0, replacedStr.IndexOf(","));
Console.WriteLine(name.Replace(",", ","));
Assuming you're running in Windows, use PInvoke with DsUnquoteRdnValueW. For code, see my answer to another question: https://stackoverflow.com/a/11091804/628981
If the format is always the same:
string line = GetStringFromWherever();
int start = line.IndexOf("=") + 1;//+1 to get start of name
int end = line.IndexOf("OU=",start) -1; //-1 to remove comma
string name = line.Substring(start, end - start);
Forgive if syntax is not quite right - from memory. Obviously this is not very robust and fails if the format ever changes.

Categories

Resources