Multiple pattern matching using RegEx - c#

I'm trying to use RegEx to split a string into several objects. Each record is separated by a :, and each field is separated by a ~.
So sample data would look like:
:1~Name1:2~Name2:3~Name3
The RegEx I have so far is
:(?<id>\d+)~(?<name>.+)
This however will only match the first record, when really I would expect 3. How do I get the RegEx to return all matches rather than just the first?

Your last .+ is greedy, so it gobbles up the Name1 as well as the rest of the string.
Try
:(?<id>\d+)~(?<name>[^:]+)
This means that the Name can't have a : in it (which is probably OK for your data), and makes sure the name doesn't grab into the next field.
(And also use the Regex.Matches method which grabs all matches, not just the first).
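To see the difference between the two patterns side by side, here is a minimal sketch. The class and method names (RecordParser, Parse) are made up for illustration and are not part of the question.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class RecordParser
{
    // Hypothetical helper: returns "id=name" for every match of the pattern.
    // The pattern is expected to define the named groups "id" and "name".
    public static string[] Parse(string input, string pattern)
    {
        return Regex.Matches(input, pattern)
            .Cast<Match>()
            .Select(m => m.Groups["id"].Value + "=" + m.Groups["name"].Value)
            .ToArray();
    }
}
```

Calling Parse on the sample data with the greedy pattern :(?<id>\d+)~(?<name>.+) should give a single entry whose name swallows the rest of the string, while the [^:]+ version should give three entries, one per record.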

Use:
var result = Regex.Matches(input, @":(?<id>\d+)~(?<name>[^:]+)").Cast<Match>()
.Select(m => new
{
Id = m.Groups["id"].Value,
Name = m.Groups["name"].Value
});

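You would be better off using the String.Split() method:
string[] records = myString.Split(new[] { ':' }, StringSplitOptions.RemoveEmptyEntries);
foreach (string rec in records)
{
string[] fields = rec.Split('~');
// fields[0] is the id, fields[1] is the name
}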

Related

How to split a string every time the character changes?

I'd like to turn a string such as abbbbcc into an array like this: [a,bbbb,cc] in C#. I have tried the regex from this Java question like so:
var test = "aabbbbcc";
var split = new Regex("(?<=(.))(?!\\1)").Split(test);
but this results in the sequence [a,a,bbbb,b,cc,c] for me. How can I achieve the same result in C#?
Here is a LINQ solution that uses Aggregate:
var input = "aabbaaabbcc";
var result = input
.Aggregate(" ", (seed, next) => seed + (seed.Last() == next ? "" : " ") + next)
.Trim()
.Split(' ');
It aggregates each character based on the last one read, then if it encounters a new character, it appends a space to the accumulating string. Then, I just split it all at the end using the normal String.Split.
Result:
["aa", "bb", "aaa", "bb", "cc"]
I don't know how to get it done with split. But this may be a good alternative:
//using System.Linq;
var test = "aabbbbcc";
var matches = Regex.Matches(test, "(.)\\1*");
var split = matches.Cast<Match>().Select(match => match.Value).ToList();
There are several things going on here that are producing the output you're seeing:
1. The regex combines a positive lookbehind and a negative lookahead to find the last character that matches the one preceding it but does not match the one following it.
2. It creates capture groups for every match, which are then fed into the Split method as delimiters. The capture groups are required by the negative lookahead, specifically the \1 identifier, which basically means "the value of the first capture group in the statement", so it cannot be omitted.
3. Regex.Split, given a capture group or multiple capture groups to match on when identifying the splitting delimiters, will include the delimiters used for every individual Split operation.
Number 3 is why your string array is looking weird, Split will split on the last a in the string, which becomes split[0]. This is followed by the delimiter at split[1], etc...
There is no way to override this behaviour on calling Split.
Either compensation as per Gusman's answer or projecting the results of a Matches call as per Ruard's answer will get you what you want.
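The behaviour described above can be observed directly with a minimal sketch like the following. The class and method names (SplitDemo, WithCapture) are invented for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SplitDemo
{
    // The pattern contains a capturing group, so Regex.Split interleaves
    // the captured character with the pieces it produces.
    public static string[] WithCapture(string input)
    {
        return Regex.Split(input, @"(?<=(.))(?!\1)");
    }
}
```

For the input "aabbbbcc" the returned array should begin with "aa", "a", "bbbb", "b": every real piece sits at an even index and is followed by the captured character that delimited it.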
To be honest I don't exactly understand how that regex works, but you can "repair" the output very easily:
Regex reg = new Regex("(?<=(.))(?!\\1)", RegexOptions.Singleline);
var res = reg.Split("aaabbcddeee").Where((value, index) => index % 2 == 0 && value != "").ToArray();
You could do this easily with LINQ, but I don't think its runtime will be as good as the regex's.
It is a whole lot easier to read, though.
var myString = "aaabbccccdeee";
var splits = myString.ToCharArray()
.GroupBy(chr => chr)
.Select(grp => new string(grp.Key, grp.Count()));
returns the values ['aaa', 'bb', 'cccc', 'd', 'eee'].
However, this won't work if you have a string like "aabbaa": you'll just get ["aaaa","bb"] as a result instead of ["aa","bb","aa"].
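One way around that limitation, as a rough sketch, is to walk the string once and cut a new piece whenever the character changes, so that only consecutive characters are grouped. This replaces GroupBy with a plain loop; RunSplitter and SplitRuns are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public static class RunSplitter
{
    // Splits the input into runs of identical consecutive characters.
    public static List<string> SplitRuns(string input)
    {
        var runs = new List<string>();
        int start = 0; // index where the current run begins
        for (int i = 1; i <= input.Length; i++)
        {
            // Close the run at the end of the string or when the character changes.
            if (i == input.Length || input[i] != input[start])
            {
                runs.Add(input.Substring(start, i - start));
                start = i;
            }
        }
        return runs;
    }
}
```

With this approach SplitRuns("aabbaa") keeps the two "aa" runs separate instead of merging them.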

match first digits before # symbol

How do I match the leading digits before the first # in this line?
26909578#Sbrntrl_7x06-lilla.avi#356028416#2012-10-24 09:06#0#http://bitshare.com/files/dvk9o1oz/Sbrntrl_7x06-lilla.avi.html#[URL=http://bitshare.com/files/dvk9o1oz/Sbrntrl_7x06-lilla.avi.html]http://bitshare.com/files/dvk9o1oz/Sbrntrl_7x06-lilla.avi.html[/URL]#http://bitshare.com/files/dvk9o1oz/Sbrntrl_7x06-lilla.avi.html#http://bitshare.com/?f=dvk9o1oz#http://bitshare.com/delete/dvk9o1oz/4511e6f3612961f961a761adcb7e40a0/Sbrntrl_7x06-lilla.avi.html
I'm trying to get this number: 26909578
My attempt:
string text = @"26909578#Sbrntrl_7x06-lilla.avi#356028416#2012-10-24 09:06#0#http://bitshare.com/files/dvk9o1oz/Sbrntrl_7x06-lilla.avi.html#[URL=http://bitshare.com/files/dvk9o1oz/Sbrntrl_7x06-lilla.avi.html]http://bitshare.com/files/dvk9o1oz/Sbrntrl_7x06-lilla.avi.html[/URL]#http://bitshare.com/files/dvk9o1oz/Sbrntrl_7x06-lilla.avi.html#http://bitshare.com/?f=dvk9o1oz#http://bitshare.com/delete/dvk9o1oz/4511e6f3612961f961a761adcb7e40a0/Sbrntrl_7x06-lilla.avi.html";
MatchCollection m1 = Regex.Matches(text, @"(.+?)#", RegexOptions.Singleline);
but this outputs all of the text.
Make it explicit that it has to start at the beginning of the string:
@"^(.+?)#"
Alternatively, if you know that this will always be a number, restrict the possible characters to digits:
@"^\d+"
Alternatively use the function Match instead of Matches. Matches explicitly says, "give me all the matches", while Match will only return the first one.
Or, in a trivial case like this, you might also consider a non-RegEx approach. The IndexOf() method will locate the '#' and you could easily strip off what came before.
I even wrote a sscanf() replacement for C#, which you can see in my article A sscanf() Replacement for .NET.
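The IndexOf() idea can be sketched as follows. FirstField and BeforeHash are illustrative names only, and returning the whole string when there is no '#' is an assumption about the desired behaviour.

```csharp
using System;

public static class FirstField
{
    // Returns everything before the first '#'.
    // Assumption: if there is no '#', the whole string is returned.
    public static string BeforeHash(string text)
    {
        int hash = text.IndexOf('#');
        return hash < 0 ? text : text.Substring(0, hash);
    }
}
```

For a line that starts with 26909578#Sbrntrl_7x06-lilla.avi#..., BeforeHash should return "26909578".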
If you don't want to use regex, use a StringBuilder and loop until you hit the #, like this:
StringBuilder sb = new StringBuilder();
string yourdata = "yourdata";
int i = 0;
// Stop at the first '#', or at the end of the string if there is none.
while (i < yourdata.Length && yourdata[i] != '#')
{
sb.Append(yourdata[i]);
i++;
}
// When the loop ends, the StringBuilder holds everything before the first '#'.
string answer = sb.ToString();
The entire string (except the final url) is composed of segments that can be matched by (.+?)#, so you will get several matches. Retrieve only the first match from the collection returned by matching .+?(?=#)

Match.Regex syntax

I have a string that can be either
"MyName (ctid 5555)"
or
"OtherName (id 555-5555-5555-555)"
I tried to write a regex to fetch ctid or id, like so:
"(?<=ctid=).+(?=))"
Checking here gave 0 results.
What's wrong with my syntax?
Try this pattern: (?<=\((?:ctid|id)\s).+?(?=\))
It uses a look-behind to check for "ctid" or "id" followed by whitespace, then it matches any content up till the closing parenthesis.
string[] inputs = { "MyName (ctid 5555)", "OtherName (id 555-5555-5555-555)" };
string pattern = @"(?<=\((?:ctid|id)\s).+?(?=\))";
foreach (var input in inputs)
{
var result = Regex.Match(input, pattern).Value;
Console.WriteLine(result);
}
If you clarify your question a better solution might exist. If you care to know whether the value was a "ctid" or an "id" then named capture groups could be used.
Based on your example, I am assuming you need a regex that matches each format explicitly:
try
{
var idRegEx = @"^.*?\s\(id\s(\d{3}-\d{4}-\d{4}-\d{3})\)$";
var ctIdRegex = @"^.*?\s\(ctid\s(\d{4})\)$";
// Groups[1].Value is an empty string when the pattern does not match.
var idMatch = Regex.Match(textToTest, idRegEx, RegexOptions.IgnoreCase).Groups[1].Value;
var ctIdMatch = Regex.Match(textToTest, ctIdRegex, RegexOptions.IgnoreCase).Groups[1].Value;
}
catch(ArgumentException)
{
// Regex is wrong
}
Assuming that a ctid is always 4 digits, and an id is always 3-4-4-3 digits, and that either way it is enclosed in round brackets, I would do:
\((?:ctid (?<ctid>\d{4})|id (?<id>\d{3}-\d{4}-\d{4}-\d{3}))\)
This adds named groups and does validity checking at the same time. For example, you can use match.Groups["ctid"].Value to get the ctid value, or match.Groups["id"].Value to get the id value. Because there is validity checking, you'll never get (what I am assuming is) an invalid id like "(id 123)" (since it doesn't have the 3-4-4-3 pattern).
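As a rough sketch of how that pattern might be used, the helper below reports which kind of identifier was found. IdParser and Describe are hypothetical names, and the "kind:value" return format is just an assumption for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class IdParser
{
    private static readonly Regex Pattern = new Regex(
        @"\((?:ctid (?<ctid>\d{4})|id (?<id>\d{3}-\d{4}-\d{4}-\d{3}))\)");

    // Returns "ctid:<value>" or "id:<value>", or null when no valid identifier is present.
    public static string Describe(string input)
    {
        Match match = Pattern.Match(input);
        if (!match.Success)
        {
            return null;
        }
        // Only one of the two named groups takes part in a successful match.
        if (match.Groups["ctid"].Success)
        {
            return "ctid:" + match.Groups["ctid"].Value;
        }
        return "id:" + match.Groups["id"].Value;
    }
}
```

Checking Group.Success on each named group is what tells the two cases apart; an input such as "Bad (id 123)" should produce null because it fails the 3-4-4-3 check.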
Not sure what you want exactly
(?:(ct)?id)\s(.+?)\)
But this regex worked for me at
http://derekslager.com/blog/posts/2007/09/a-better-dotnet-regular-expression-tester.ashx
you just need to grab the 2nd group though...
If you don't really want the look around regex, then
\((ct)?id\s(.+?)\)
might do it as well (and is more readable for regex beginners)
Well, you're looking for 'ctid=' and the string has 'ctid '. You'll also need to escape the parenthesis in the lookahead (change ')' to '\)').

C# search into a string for a specific pattern, and put in an Array

I'm having the following string as an example:
<tr class="row_odd"><td>08:00</td><td>08:10</td><td>TEST1</td></tr><tr class="row_even"><td>08:10</td><td>08:15</td><td>TEST2</td></tr><tr class="row_odd"><td>08:15</td><td>08:20</td><td>TEST3</td></tr><tr class="row_even"><td>08:20</td><td>08:25</td><td>TEST4</td></tr><tr class="row_odd"><td>08:25</td><td>08:30</td><td>TEST5</td></tr>
I need to have the output as a one-dimensional array.
Like 11111=myArray(0) , 22222=myArray(1) , 33333=myArray(2) ,......
I have already tried the myString.replace, but it seems I can only replace a single Char that way. So I need to use expressions and a for loop for filling the array, but since this is my first c# project, that is a bridge too far for me.
Thanks,
It seems like you want to use a Regex search pattern. Then return the matches (using a named group) into an array.
var regex = new Regex(@"act=\?(?<Id>\d+)");
var ids = regex.Matches(input).Cast<Match>()
.Select(m => m.Groups["Id"])
.Where(g => g.Success)
.Select(g => Int32.Parse(g.Value))
.ToArray();
(PS. I'm not positive about the regex pattern - you should check into it yourself).
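Applied to the sample string in the question, the same Matches-plus-named-group idea might look like the sketch below. It assumes the goal is the text of every <td> cell, which the question does not state explicitly; CellExtractor and Cells are invented names. Regex matching like this is only reasonable for simple, regular markup such as the sample.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class CellExtractor
{
    // Assumption: cells are plain <td>...</td> pairs with no attributes or nesting.
    public static string[] Cells(string html)
    {
        return Regex.Matches(html, @"<td>(?<cell>.*?)</td>")
            .Cast<Match>()
            .Select(m => m.Groups["cell"].Value)
            .ToArray();
    }
}
```

For a single row such as <tr class="row_odd"><td>08:00</td><td>08:10</td><td>TEST1</td></tr>, the array should contain the three cell values in order.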
Several ways you could do this. A couple are:
a) Use a regular expression to look for what you want in the string. Use a named group so you can access the matches directly.
http://www.regular-expressions.info/dotnet.html
b) Split the string at the location where your substrings are (e.g. split on "act="). You'll have to do a bit more parsing to get what you want, but that won't be too difficult, since the value will be at the beginning of each split piece (and you will have other pieces that don't contain your substring).
Use a combination of IndexOf and Substring... something like this would work (not sure how much your string varies). This will probably be quicker than any Regex you come up with. Although, looking at the length of your string, it might not really be an issue.
public static List<string> GetList(string data)
{
data = data.Replace("\"", ""); // get rid of annoying "'s
string[] S = data.Split(new string[] { "act=" }, StringSplitOptions.None);
var results = new List<string>();
foreach (string s in S)
{
if (!s.Contains("<tr"))
{
string output = s.Substring(0, s.IndexOf(">"));
results.Add(output);
}
}
return results;
}
Split your string on HTML tags like "<tr>", "</tr>", "<td>", "</td>", "<a>", "</a>" using the string's Split() method. This gives you an array of strings.
Split html row into string array

How to obtain the substrings that are within curly braces in a string

I think this is a really silly question, but I have not been successful at extracting the strings that are within curly braces in a sentence.
var str = "MR. {Name} of {Department} department stood first in {Subjectname}"
I need to obtain the substrings (as an array) that are within the curly braces,
so strArray should contain {Name,Department,Subjectname} for the string given above.
Noting the use of var in your question, I will assume that you are using .NET 3.5.
The one line of code below should do the trick.
string[] result = Regex.Matches(str, @"\{([^\}]*)\}").Cast<Match>().Select(o => o.Groups[1].Value).ToArray();
List<string> fields = new List<string>();
foreach(Match match in Regex.Matches(str, @"\{([^\}]*)\}")) {
fields.Add(match.Groups[1].Value);
}
Or for formatting (filling in the blanks) - see this example.
Use String.IndexOf("{") to get the index of the first open tag and String.IndexOf("}") to get the index of the first close tag. Then use the other string functions to get it out (substring, remove etc)...while there are still tags
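A minimal sketch of that IndexOf loop is shown below. PlaceholderFinder and Find are hypothetical names, and stopping at an unmatched "{" is an assumption about how malformed input should be handled.

```csharp
using System;
using System.Collections.Generic;

public static class PlaceholderFinder
{
    // Collects the text between each "{" and the next "}".
    public static List<string> Find(string text)
    {
        var names = new List<string>();
        int open = text.IndexOf('{');
        while (open >= 0)
        {
            int close = text.IndexOf('}', open + 1);
            if (close < 0)
            {
                break; // assumption: an unmatched "{" ends the search
            }
            names.Add(text.Substring(open + 1, close - open - 1));
            // Continue searching after the closing brace.
            open = text.IndexOf('{', close + 1);
        }
        return names;
    }
}
```

For the sentence in the question, Find should return Name, Department and Subjectname, in that order.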
