Extracting Groups and Sub-groups in RegEx

This question is, in a way, a continuation of my previously answered question: Getting "Unterminated [] set." Error in C#
I'm using a regular expression in C# to extract URLs:
Regex find = new Regex(@"(?<First>[,""]url=)(?<Url>[^\\]+)(?<Last>\\u00)");
Where the text contains URLs in the format:
,url=http://domain.com?itag=25\u0026,url=http://hello.com?itag=11\u0026
I'm getting the entire URL in 'Url' group, but I'd also like to have the itag value in a separate "iTag" group. I know this can be done using sub-groups and I've been trying but can't figure out exactly how to do this.

You already have named groups defined in the Regex. The syntax ?<First> names everything within those parentheses First.
When you match using Regex, use the Groups property to access the GroupCollection and extract a group value by name.
var first = regex.Match(line).Groups["First"].Value;
The following pattern adds an additional group for iTag while retaining the full Url. To keep the itag out of Url, move the iTag group outside the Url parentheses.
(?<First>[,""]url=)(?<Url>[^\?]*\?itag=(?<iTag>[0-9]*))(?<Last>\\u0026)
Here's the code.
Regex regex = new Regex("(?<First>[,\"]url=)(?<Url>[^\\?]*\\?itag=(?<iTag>[0-9]*))(?<Last>\\u0026)");
string input = ",url=http://domain.com?itag=25\u0026,url=http://hello.com?itag=11\u0026";
foreach(Match match in regex.Matches(input))
{
System.Console.WriteLine("1. "+match);
System.Console.WriteLine(" 1. "+match.Groups["First"]);
System.Console.WriteLine(" 2. "+match.Groups["Url"]);
System.Console.WriteLine(" 3. "+match.Groups["iTag"]);
System.Console.WriteLine(" 4. "+match.Groups["Last"]);
}
Results:
1. ,url=http://domain.com?itag=25&
 1. ,url=
 2. http://domain.com?itag=25
 3. 25
 4. &
1. ,url=http://hello.com?itag=11&
 1. ,url=
 2. http://hello.com?itag=11
 3. 11
 4. &
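To illustrate the "move it outside" variant mentioned above, here is a minimal sketch in which Url stops before the query string and iTag is a sibling group rather than a nested one. The class and method names (UrlParts, Extract) are made up for this example, and the input is assumed to contain a literal & after each itag value, as in the code above.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class UrlParts
{
    // Url ends before "?itag="; iTag is a sibling group, not nested inside Url.
    private static readonly Regex Pattern = new Regex(
        @"(?<First>[,""]url=)(?<Url>[^?]*)\?itag=(?<iTag>[0-9]*)(?<Last>&)");

    // Returns one (Url, iTag) pair per match, in input order.
    public static List<KeyValuePair<string, string>> Extract(string input)
    {
        var result = new List<KeyValuePair<string, string>>();
        foreach (Match m in Pattern.Matches(input))
        {
            result.Add(new KeyValuePair<string, string>(
                m.Groups["Url"].Value, m.Groups["iTag"].Value));
        }
        return result;
    }
}
```

Calling UrlParts.Extract on the sample input gives pairs whose Key is the URL without the query string and whose Value is the itag digits.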

Related

c# Regex named groups access non existent group name [duplicate]

In my regex the pattern is something like this:
@"Something\(\d+, ""(.+)""(, .{1,5}, \d+, (?<somename>\d+)?)\),"
So I would like to know if <somename> exists. If it was a normal capture group, I could just check if the capture groups are greater than the number of groups without that/those capture group(s), but I don't have the option here.
Could anyone help me find a way round this? I don't need it to be efficient, it's just for a one-time program that's used for sorting, so I don't mind if it takes a bit to run. It's not going to be for public code.

According to the documentation:
If groupname is not the name of a capturing group in the collection,
or if groupname is the name of a capturing group that has not been
matched in the input string, the method returns a Group object whose
Group.Success property is false and whose Group.Value property is
String.Empty.
var regex = new Regex(@"Something\(\d+, ""(.+)""(, .{1,5}, \d+, (?<somename>\d+)?)\),");
var match = regex.Match(input);
var group = match.Groups["somename"];
bool exists = group.Success;
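A self-contained sketch of the same check, using a deliberately simpler pattern than the one in the question. The pattern, the group names id and extra, and the OptionalGroup class are assumptions made for illustration only.

```csharp
using System.Text.RegularExpressions;

public static class OptionalGroup
{
    // "extra" sits inside an optional non-capturing group, so it may not participate in a match.
    private static readonly Regex Pattern =
        new Regex(@"Item\((?<id>\d+)(?:, (?<extra>\d+))?\)");

    // True only when the input matches and the optional "extra" group captured something.
    public static bool HasExtra(string input)
    {
        Match match = Pattern.Match(input);
        return match.Success && match.Groups["extra"].Success;
    }
}
```

OptionalGroup.HasExtra("Item(1, 2)") is true, while OptionalGroup.HasExtra("Item(1)") is false because the group exists in the pattern but did not take part in the match.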

Stripping text line with regular expression with C#

In the text shown below, I would need to extract the info in between the double quotes (The input is a text file)
Tag = "571EC002A-TD"
Tag = "571GI001-RUN"
Tag = "571GI001-TD"
The output should be,
571EC002A-TD
571GI001-RUN
571GI001-TD
How should I frame my regex in C# to match this and save it to a text file?
I managed to read all the lines into my code, but the regex gives me some undesirable values.
Thanks in advance.

A simple regex could be:
Regex tagRegex = new Regex(@"Tag\s?=\s?""(.+?)""");

UPDATE
For those that ask why not use String.Substring: The great advantage of regular expressions over string operations is that they don't generate temporary strings until you actually ask for a matched value. Matches and groups contain only indexes to the source string. This can be a huge advantage when processing log files.
You can match the content of a tag using a regex like
Tag\s*=\s*"(?<tagValue>.*?)"
The ? in .*? results in a non-greedy search, i.e. only text up to the first double quote is extracted. Otherwise the pattern would match everything up to the last double quote.
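The difference between the greedy and the non-greedy form can be seen with a small sketch. The class name GreedyVsLazy and the sample line with two quoted values are assumptions for illustration.

```csharp
using System.Text.RegularExpressions;

public static class GreedyVsLazy
{
    // Captures the text after Tag = " using either the lazy (.*?) or the greedy (.*) form.
    public static string Capture(string input, bool lazy)
    {
        string pattern = lazy
            ? @"Tag\s*=\s*""(.*?)"""   // stops at the first closing quote
            : @"Tag\s*=\s*""(.*)""";   // runs to the last quote on the line
        return Regex.Match(input, pattern).Groups[1].Value;
    }
}
```

For the input Tag = "A" Other = "B", the lazy form captures A, while the greedy form captures A" Other = "B.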
(?<tagValue>.*?) defines a named group. This way you can refer to the captured value by name and even use LINQ to process the values.
The resulting C# code may look like this after escaping:
var myRegex=new Regex("Tag\\s*=\\s*\"(?<tagValue>.*?)\"");
...
var tags=myRegex.Matches(someText)
.OfType<Match>()
.Select(match=>match.Groups["tagValue"].Value);
The result is an IEnumerable<string> with all tag values. You can convert it to an array or List using ToArray() or ToList() just like any other IEnumerable.
The equivalent code using a loop would be
var myRegex=new Regex("Tag\\s*=\\s*\"(?<tagValue>.*?)\"");
...
List<string> tagValues=new List<string>();
foreach(Match m in myRegex.Matches(someText))
{
tagValues.Add(m.Groups["tagValue"].Value);
}
The LINQ version though can be extended very easily. For example, File.ReadLines returns an IEnumerable and doesn't wait to load everything in memory before returning. You could write something like:
var tags=File.ReadLines(myBigLog)
.SelectMany(line=>myRegex.Matches(line).OfType<Match>())
.Select(match=>match.Groups["tagValue"].Value);
If the tag names vary, you could also capture the tag name. If, for example, all tag names share a tag prefix, you could use the pattern:
(?<tagName>tag\w+)\s*=\s*"(?<tagValue>.*?)"
And extract both tag name and value in the Select function, eg :
.Select(match=>new {
TagName=match.Groups["tagName"].Value,
Value=match.Groups["tagValue"].Value
});
Regex instances are immutable and thread safe, which means you can create one static Regex object and use it repeatedly, or even use PLINQ to match multiple lines in parallel simply by adding AsParallel() before the call to SelectMany.

If those strings will always be like that, you can go for a simpler approach by just using Substring:
line.Substring(7, line.Length - 8)
That will give you your desired output.

Reading in a text file more 'intelligently'

I have a text file which contains a list of alphabetically organized variables with their variable numbers next to them formatted something like follows:
aabcdef 208
abcdefghijk 1191
bcdefga 7
cdefgab 12
defgab 100
efgabcd 999
fgabc 86
gabcdef 9
h 11
ijk 80
...
...
I would like to read each text as a string and keep its designated ID, for example read "aabcdef" and store it in an array at index 208.
The 2 issues I'm running into are:
1. I've never read from a file in C#. Is there a way to read from the start of a line to the first whitespace as a string, and then the next token as an int until the end of the line?
2. Given the nature and size of these files, I do not know the highest ID value in each file (not all numbers are used, so a file could contain a number like 3000 but only list 200 variables). How can I store these variables flexibly when I don't know how big the array/list/stack/etc. would need to be?

Basically you need a Dictionary instead of an array or list. You can read all lines with File.ReadLines method then split each of them based on space and \t (tab), like this:
var values = File.ReadLines("path")
.Select(line => line.Split(new [] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries))
.ToDictionary(parts => int.Parse(parts[1]), parts => parts[0]);
Then values[208] will give you aabcdef. It looks like an array doesn't it :)
Also make sure there are no duplicate numbers: Dictionary keys must be unique, so ToDictionary throws an ArgumentException when it meets a duplicate key.
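If duplicates or malformed lines are possible, one option is to fill the dictionary in a loop instead of using ToDictionary. This is only a sketch under assumptions: the VariableTable class is hypothetical, malformed lines are skipped silently, and a later duplicate ID overwrites an earlier one, which may or may not be what you want.

```csharp
using System;
using System.Collections.Generic;

public static class VariableTable
{
    // Builds an ID -> name table from lines such as "aabcdef 208".
    public static Dictionary<int, string> Parse(IEnumerable<string> lines)
    {
        var table = new Dictionary<int, string>();
        foreach (string line in lines)
        {
            string[] parts = line.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries);
            int id;
            // Skip lines that do not look like "<name> <number>".
            if (parts.Length != 2 || !int.TryParse(parts[1], out id))
            {
                continue;
            }
            // The indexer overwrites on a duplicate key instead of throwing.
            table[id] = parts[0];
        }
        return table;
    }
}
```

With a file, this could be called as VariableTable.Parse(File.ReadLines("path")).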

I've been thinking about how to improve on the other answers, and I've found this alternative Regex-based solution, which makes searching the whole string (whether it comes from a file or not) safer.
Note that you can alter the regular expression to include other separators. The sample expression detects spaces and tabs.
In the end, I found that MatchCollection gives a safer result: you always know that group 2 is an integer and group 1 is text, because the regular expression does a lot of the checking for you!
StringBuilder builder = new StringBuilder();
builder.AppendLine("djdodjodo\t\t3893983");
builder.AppendLine("dddfddffd\t\t233");
builder.AppendLine("djdodjodo\t\t39838");
builder.AppendLine("djdodjodo\t\t12");
builder.AppendLine("djdodjodo\t\t444");
builder.AppendLine("djdodjodo\t\t5683");
builder.Append("djdodjodo\t\t33");
// Replace this line with calling File.ReadAllText to read a file!
string text = builder.ToString();
MatchCollection matches = Regex.Matches(text, @"([^\s^\t]+)(?:[\s\t])+([0-9]+)", RegexOptions.IgnoreCase | RegexOptions.Multiline);
// Here's the magic: we convert an IEnumerable<Match> into a dictionary!
// Check that using regexps, int.Parse should never fail because
// it matched numbers only!
IDictionary<int, string> lines = matches.Cast<Match>()
.ToDictionary(match => int.Parse(match.Groups[2].Value), match => match.Groups[1].Value);
// Now you can access your lines as follows:
string value = lines[33]; // <-- Look up by key (the number)
Update:
As we discussed in chat, this solution wasn't working for an actual use case you showed me. The approach itself is fine; the problem is your particular data, because the keys have the form "[something].[something]" (for example: address.Name).
I've changed given regular expression to ([\w\.]+)[\s\t]+([0-9]+) so it covers the case of key having a dot.
It's about improving the matching regular expression to fit your requirements! ;)
Update 2:
Since you told me that keys may contain any character, I've changed the regular expression to ([^\s^\t]+)(?:[\s\t])+([0-9]+).
Now a key is anything except whitespace (the second ^ inside the brackets is a literal character, so a caret is excluded as well).
Update 3:
Also I see you're stuck in .NET 3.0 and ToDictionary was introduced in .NET 3.5. If you want to get the same approach in .NET 3.0, replace ToDictionary(...) with:
Dictionary<int, string> lines = new Dictionary<int, string>();
foreach(Match match in matches)
{
lines.Add(int.Parse(match.Groups[2].Value), match.Groups[1].Value);
}

Error in adapter

How can I find a number in a web page?
Let me give an example:
If I want to find 1234 among the following numbers, it should show me only 1234, not 123412 (which merely contains 1234).
1234124 -
113412 -
352523434653 -
1234
I wrote the following code. How can I change it to get this result?
foreach(DataRow row in dt.Rows)
{
string url = "http://play.dcc.fc.up.pt:2241/PTECH/recommenders/music/<userid>?groups=<userid>";
var test = url.Replace("<userid>", Convert.ToString(row["UserID"]));
System.Diagnostics.Process.Start(url);
string client = (new WebClient()).DownloadString("http://play.dcc.fc.up.pt:2241/PTECH/recommenders/music/UserID?groups=UserID");
if (client.Contains(Convert.ToString(TrackID)))

A regular expression that sets some sort of boundary before and after the value should work fine. For example, the word boundary \b lets you pick out the number when it has spaces or other separators around it (don't forget to check whether a match was found before indexing into the result):
var value = Regex.Matches("foo,1234 bar", @"\b1234\b")[0].Value;
Check Regular Expression Language reference for more options.
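A minimal sketch of the boundary check that avoids indexing into an empty result. The NumberFinder class and its method are hypothetical names; Regex.Escape is used so the searched value is treated literally.

```csharp
using System.Text.RegularExpressions;

public static class NumberFinder
{
    // True when 'number' occurs as a whole token, not as part of a longer run of digits or letters.
    public static bool ContainsWholeNumber(string text, string number)
    {
        string pattern = @"\b" + Regex.Escape(number) + @"\b";
        return Regex.IsMatch(text, pattern);
    }
}
```

Note that \b treats letters, digits and the underscore as word characters, so 1234 inside x1234 or 1234_a would not count as a whole token either.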

Regular expression for recognizing url

I want to create a Regex for url in order to get all links from input string.
The Regex should recognize the following formats of the url address:
http(s)://www.webpage.com
http(s)://webpage.com
www.webpage.com
and also the more complicated urls like:
- http://www.google.pl/#sclient=psy&hl=pl&site=&source=hp&q=regex+url&pbx=1&oq=regex+url&aq=f&aqi=g1&aql=&gs_sm=e&gs_upl=1582l3020l0l3199l9l6l0l0l0l0l255l1104l0.2.3l5l0&bav=on.2,or.r_gc.r_pw.&fp=30a1604d4180f481&biw=1680&bih=935
I have the following one
((www\.|https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+[\w\d:#@%/;$()~_?\+-=\\\.&]*)
but it does not recognize the following pattern: www.webpage.com. Can someone please help me to create an appropriate Regex?
EDIT:
It should find each link and, moreover, place a hyperlink at the matching position, like this:
private readonly Regex RE_URL = new Regex(@"((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+[\w\d:#@%/;$()~_?\+-=\\\.&]*)", RegexOptions.Multiline);
foreach (Match match in (RE_URL.Matches(new_text)))
{
// Copy raw string from the last position up to the match
if (match.Index != last_pos)
{
var raw_text = new_text.Substring(last_pos, match.Index - last_pos);
text_block.Inlines.Add(new Run(raw_text));
}
// Create a hyperlink for the match
var link = new Hyperlink(new Run(match.Value))
{
NavigateUri = new Uri(match.Value)
};
link.Click += OnUrlClick;
text_block.Inlines.Add(link);
// Update the last matched position
last_pos = match.Index + match.Length;
}

I don't know why your match result is only http://, but I cleaned up your regex a bit:
((?:(?:https?|ftp|gopher|telnet|file|notes|ms-help):(?://|\\\\)(?:www\.)?|www\.)[\w\d:#@%/;$()~_?\+,\-=\\.&]+)
(?:) are non-capturing groups, which means there is only one capturing group left, and it contains the complete matched string.
(?:(?:https?|ftp|gopher|telnet|file|notes|ms-help):(?://|\\\\)(?:www\.)?|www\.) The link now has to start either with a scheme from the first list, followed by an optional www., or with www. alone.
[\w\d:#@%/;$()~_?\+,\-=\\.&] I added a comma to the list (otherwise your long example does not match), escaped the - (you were creating a character range) and unescaped the . (escaping is not needed in a character class).
See this on Regexr, a useful tool for testing regexes.
But URL matching is not a simple task; please see this question here.
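As a quick check against the three formats listed in the question, a sketch like the following runs the cleaned-up pattern over a sample sentence. The UrlFinder class and the sample text are assumptions for illustration, and the pattern is far from a complete URL grammar; for instance, it would also swallow a comma or full stop that directly follows a link.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class UrlFinder
{
    // Scheme plus optional "www.", or a bare "www.", followed by at least one allowed character.
    private static readonly Regex UrlPattern = new Regex(
        @"((?:(?:https?|ftp|gopher|telnet|file|notes|ms-help):(?://|\\\\)(?:www\.)?|www\.)[\w\d:#@%/;$()~_?\+,\-=\\.&]+)");

    // Returns every link found in the text, in order of appearance.
    public static List<string> FindAll(string text)
    {
        var links = new List<string>();
        foreach (Match m in UrlPattern.Matches(text))
        {
            links.Add(m.Value);
        }
        return links;
    }
}
```

For the text "see http://www.webpage.com and https://webpage.com or www.webpage.com now", FindAll returns the three addresses as separate entries.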

I've just written up a blog post on recognising URLs in most used formats such as:
www.google.com
http://www.google.com
mailto:somebody@google.com
somebody@google.com
www.url-with-querystring.com/?url=has-querystring
The regular expression used is /((([A-Za-z]{3,9}:(?:\/\/)?)(?:[-;:&=\+\$,\w]+@)?[A-Za-z0-9.-]+|(?:www.|[-;:&=\+\$,\w]+@)[A-Za-z0-9.-]+)((?:\/[\+~%\/.\w-_]*)?\??(?:[-\+=&;%@.\w_]*)#?(?:[\w]*))?)/ however I would recommend you go to http://blog.mattheworiordan.com/post/13174566389/url-regular-expression-for-links-with-or-without-the to see a complete working example along with an explanation of the regular expression in case you need to extend or tweak it.

The regex you give doesn't work for www. addresses because it expects a URI scheme (the bit before the URL, like http://). The 'www.' part in your regular expression doesn't help because it would only match www.:// (which is meaningless).
Try something like this instead:
(((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+|www\.)[\w\d:#@%/;$()~_?\+-=\\\.&]*)
This will match something with a valid URI scheme, or something beginning with 'www.'
