Regex Split - with pattern for start and end characters - c#

If I use a regex to split a string, and my pattern defines the start and end characters for the split, what happens to partial data at the end of the string that does not match the pattern? Will it still be included in the array, as with a normal split, or will it be skipped because I am telling it what the start and end characters should be?
This is a rare occurrence, but it will happen, and they expect it to be accounted for. I need to know whether the partial data will be included or ignored. Nothing I have read so far talks about a pattern for the entire string; it all just talks about the delimiter. If the data that does not match is not included, is there a way to get it, or do I need to do custom parsing?
Thanks in advance.
string pattern = byteStart + @"[\s.\S]*?" + byteEnd;
string[] str = Regex.Split(e.Text, pattern);
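
Regex.Split treats every match as a delimiter, so the matched start-to-end frames are removed and only the text around them is returned. A trailing partial frame therefore does end up in the array, while the complete frames do not. To get the frames themselves, Regex.Matches is probably the better fit. The following is a minimal sketch, assuming the hypothetical delimiters `<` and `>` in place of byteStart and byteEnd (their real values are not shown in the question); the class and method names are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class FrameSplitDemo
{
    // Hypothetical stand-ins for byteStart / byteEnd from the question.
    private const string ByteStart = "<";
    private const string ByteEnd = ">";

    private static readonly string Pattern =
        Regex.Escape(ByteStart) + @"[\s\S]*?" + Regex.Escape(ByteEnd);

    // Regex.Split drops the matched frames and keeps whatever lies between
    // and after them, so a trailing partial frame is the last element.
    public static string[] SplitOnFrames(string text)
    {
        return Regex.Split(text, Pattern);
    }

    // Regex.Matches returns the complete frames; the text after the last
    // match is the unmatched tail (empty if the input ends on a full frame).
    public static List<string> Frames(string text, out string leftover)
    {
        var frames = new List<string>();
        int consumed = 0;
        foreach (Match m in Regex.Matches(text, Pattern))
        {
            frames.Add(m.Value);
            consumed = m.Index + m.Length;
        }
        leftover = text.Substring(consumed);
        return frames;
    }

    public static void Main()
    {
        string input = "<abc><def><gh";

        // Two empty strings (before and between the frames), then the tail.
        Console.WriteLine(string.Join("|", SplitOnFrames(input)));  // ||<gh

        string tail;
        List<string> frames = Frames(input, out tail);
        Console.WriteLine(string.Join(",", frames));                // <abc>,<def>
        Console.WriteLine(tail);                                    // <gh
    }
}
```

Note that this sketch only recovers the tail after the last frame; stray text between two frames would need the same index bookkeeping applied per match.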

Related

Find all character after N match until next N match

I have a long string. From it I need to find one part and cut it off after a certain character. For better understanding I have added a code example with the output I want from this string.
string myStr = "/NETGEAR-N300-WiFi-Range-Extender/dp/B00L0YLRUW/ref=sr_1_1?keywords=0606449104899&qid=1548142454&sr=8-1";
So basically I need to find /dp/ and grab everything after it until the next /. This is the main pattern I want to use to get my output.
// the output I want: B00L0YLRUW

From what you described, there is no need to get fancy; you can use an old-school split.
var result = myStr.Split('/')[3];
However, if your string format is not that regular, regex is your friend.
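
For the regex route, one possible sketch is to anchor on the literal /dp/ and capture everything up to the next slash. The class and method names here are hypothetical, chosen only for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DpSegmentDemo
{
    // Returns the path segment that follows "/dp/", or null if there is none.
    public static string ExtractAfterDp(string url)
    {
        // ([^/]+) captures one or more characters that are not a slash.
        Match m = Regex.Match(url, @"/dp/([^/]+)");
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        string myStr = "/NETGEAR-N300-WiFi-Range-Extender/dp/B00L0YLRUW/ref=sr_1_1?keywords=0606449104899&qid=1548142454&sr=8-1";
        Console.WriteLine(ExtractAfterDp(myStr)); // B00L0YLRUW
    }
}
```

Unlike the index-based split, this does not depend on how many segments come before /dp/.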

Failure To Get Specific Text From Regex Group

My example works fine with a greedy pattern when I capture the whole string plus a group (in Groups[1] only) enclosed by a single pair of single quotes.
But when the group is enclosed by multiple pairs of single quotes, it captures only the text enclosed by the last pair, not the text between the first and last single quotes.
string val1 = "Content:abc'23'asad";
string val2 = "Content:'Scale['#13212']'ta";
Match match1 = Regex.Match(val1, @".*'(.*)'.*");
Match match2 = Regex.Match(val2, @".*'(.*)'.*");
if (match1.Success)
{
    string value1 = match1.Value;
    string GroupValue1 = match1.Groups[1].Value;
    Console.WriteLine(value1);
    Console.WriteLine(GroupValue1);
    string value2 = match2.Value;
    string GroupValue2 = match2.Groups[1].Value;
    Console.WriteLine(value2);
    Console.WriteLine(GroupValue2);
    Console.ReadLine();
    // With the greedy pattern, val1 gives exactly what I want:
    // value1--->Content:abc'23'asad
    // GroupValue1--->23
    // BUT for val2 I get the text enclosed by the last pair of single quotes:
    // value2--->Content:'Scale['#13212']'ta
    // GroupValue2---> ]
    // But I want GroupValue2--->Scale['#13212']
}

The problem with your existing regex is that you are using too many greedy quantifiers. The first one grabs everything it can until it runs into the second-to-last apostrophe in the string. That's why the result for your second example is just the text within the last pair of quotes.
There are a few ways to approach this. The simplest is to use Slai's suggestion: a pattern that grabs anything and everything between the outermost apostrophes:
'(.*)'
A more explicitly defined approach would be to slightly tweak the pattern you are currently using. Just change the first greedy modifier into a lazy one:
.*?'(.*)'.*
Alternatively, you could change the dot in that first and last section to instead match every character other than an apostrophe:
[^']*'(.*)'[^']*
Which one you end up using depends on what you're going after. One thing of note, though: according to Regex101, the first option takes the fewest steps, so it should be the most efficient. However, it also leaves the rest of the string out of the match; I don't know if that matters to you.
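
To compare the variants side by side, here is a small sketch that runs the original pattern and the three alternatives against val2 from the question. The class and helper names are made up for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuoteGroupDemo
{
    // Returns the text captured by group 1, or null when the pattern fails.
    public static string Group1(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        string val2 = "Content:'Scale['#13212']'ta";
        string[] patterns =
        {
            @".*'(.*)'.*",        // original: greedy prefix eats the early quotes
            @"'(.*)'",            // outermost pair of apostrophes
            @".*?'(.*)'.*",       // lazy prefix stops at the first apostrophe
            @"[^']*'(.*)'[^']*",  // prefix and suffix cannot contain apostrophes
        };

        foreach (string p in patterns)
        {
            Console.WriteLine(p + "  ->  " + Group1(val2, p));
        }
    }
}
```

The first pattern yields `]`, while the other three all yield `Scale['#13212']`.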

First off, use named capture groups such as (?<Data> ... ); then you can access the group by name in C#, e.g. match1.Groups["Data"].Value.
Secondly, try not to use *, which means zero or more. Is there really going to be no data? In the majority of cases the answer is no, there is data.
Use +, one or more, instead.
IMHO * screws up more patterns because it is allowed to match zero characters, and when it does, it can skip ungodly amounts of data. When you know there is data, use +.
It is better to match on what is known than on what is unknown, so we will build the pattern from what is known. In that light, use the negated set [^ ] to capture text, such as [^']+, which says: capture everything that is not a ', one or more times.
Pattern
Content:\x27?[^\x27?]+\x27(?<Data>[^\x27]+?)\x27
The results on your two sets of data are 23 and #13212, available as Groups[1] and Groups["Data"].
Note that \x27 is the hex escape for the single quote '. \x22 is the one for the double quote ", which I bet is what you are really running into.
I use the hex escapes when dealing with quotes so that I don't have to worry about the C# compiler treating them as string delimiters.
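
As a rough illustration of the named-group approach, the sketch below applies the pattern (with the quote written as \x27 throughout) to both strings from the question. The class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NamedGroupDemo
{
    // \x27 is the single quote; (?<Data>...) names the capture group.
    private const string Pattern =
        @"Content:\x27?[^\x27?]+\x27(?<Data>[^\x27]+?)\x27";

    // Returns the named capture, or null when the input does not match.
    public static string Data(string input)
    {
        Match m = Regex.Match(input, Pattern);
        return m.Success ? m.Groups["Data"].Value : null;
    }

    public static void Main()
    {
        Console.WriteLine(Data("Content:abc'23'asad"));          // 23
        Console.WriteLine(Data("Content:'Scale['#13212']'ta"));  // #13212
    }
}
```

Note that for the second string this returns #13212, not the Scale['#13212'] the question asked for; which of the two is wanted decides whether this pattern or one of the outermost-quote patterns is appropriate.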

parsing a method Signature using regular expressions

I am trying to use regular expressions to parse a method in the following format from a text:
mvAddSell[value, type1, reference(Moving, 60)]
So I am doing the following:
tokensizedStrs = Regex.Split(target, "([A-Za-z ]+[\\[ ][A-Za-z0-9 ]+[ ,][A-Za-z0-9 ]+[ ,][A-Za-z0-9 ]+[\\( ][A-Za-z0-9 ]+[, ].+[\\) ][\\] ])");
It is working, but the problem is that it always gives me an empty entry at the beginning of the array if the string starts with a method in the given format, and the same happens if one comes at the end. Also, if two methods appear in the string, it catches only the first one! Why is that?
I think what keeps the parser from catching two methods is the ".+" in my pattern. What I wanted was to tell it that there will be a number or a date in that location, so I told it there will be a sequence of any characters. Is that wrong?
It worked for me =D ... I replaced ".+" with ".+?", which means as few as possible of any characters ;)

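Your goal is quite unclear to me. What do you want as a result? If you split on that method pattern, you get the part before the match and the part after the match in an array; the method itself is a delimiter and only shows up because your whole pattern sits inside a capturing group, which Regex.Split also returns. The empty entries appear when a match sits at the very start or end of the string.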
Answer to your question
To answer your concrete question: your .+ is greedy, that means it will match anything till the last )] (in the same line, . does not match newline characters by default).
You can change this behaviour by adding a ? after the quantifier to make it lazy, then it matches only till the first )].
tokensizedStrs = Regex.Split(target, "([A-Za-z ]+[\\[ ][A-Za-z0-9 ]+[ ,][A-Za-z0-9 ]+[ ,][A-Za-z0-9 ]+[\\( ][A-Za-z0-9 ]+[, ].+?[\\) ][\\] ])");
Problems in your regex
There are several other problems in your regex.
I think you misunderstood character classes when you write e.g. [\\[ ]. This construct matches either a [ or a space. If you want to allow optional whitespace after the [ (which would be logical to me), do it this way: \\[\\s*
Use a verbatim string (with a leading @) to define your regex to avoid excessive escaping.
tokensizedStrs = Regex.Split(target, @"([A-Za-z ]+\[\s*[A-Za-z0-9 ]+\s*,\s*[A-Za-z0-9 ]+\s*,\s*[A-Za-z0-9 ]+\(\s*[A-Za-z0-9 ]+\s*,\s*.+?\)\s*\]\s*)");
You can simplify your regex by avoiding repeated parts:
tokensizedStrs = Regex.Split(target, @"([A-Za-z ]+\[\s*[A-Za-z0-9 ]+(?:\s*,\s*[A-Za-z0-9 ]+){2}\(\s*[A-Za-z0-9 ]+\s*,\s*.+?\)\s*\]\s*)");
Here (?:\s*,\s*[A-Za-z0-9 ]+){2} is a non-capturing group repeated two times.
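
The greedy versus lazy difference can be checked with Regex.Matches, which may also be closer to the goal than Regex.Split if the method texts themselves are wanted. This is a sketch only: the second method name mvAddBuy and the surrounding sentence are invented test data, and the pattern is the simplified one without the outer capturing group.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MethodMatchDemo
{
    // Everything up to and including the comma before the last argument.
    private const string Head =
        @"[A-Za-z ]+\[\s*[A-Za-z0-9 ]+(?:\s*,\s*[A-Za-z0-9 ]+){2}\(\s*[A-Za-z0-9 ]+\s*,\s*";

    // The closing ")]" with optional whitespace in between.
    private const string Tail = @"\)\s*\]";

    // Finds method signatures; 'lazy' switches the last argument
    // between ".+?" (stop at the first ")]") and ".+" (run to the last).
    public static MatchCollection Find(string target, bool lazy)
    {
        string pattern = Head + (lazy ? ".+?" : ".+") + Tail;
        return Regex.Matches(target, pattern);
    }

    public static void Main()
    {
        // Hypothetical input containing two methods on one line.
        string target =
            "mvAddSell[value, type1, reference(Moving, 60)] and " +
            "mvAddBuy[value, type2, reference(Fixed, 30)]";

        Console.WriteLine(Find(target, lazy: false).Count); // 1
        Console.WriteLine(Find(target, lazy: true).Count);  // 2

        foreach (Match m in Find(target, lazy: true))
        {
            Console.WriteLine(m.Value.Trim());
        }
    }
}
```

With the greedy quantifier the single match runs from the first method name to the final )], which is why only one method seemed to be caught.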

How to parse a text file with c#?

How do I parse a text file like:
{:block1:}
%param1%= value1
%param2% = value2
%paramn% =valuen
{:block2:}
1st html - sourcecode Just copy 1:1
{:block3:}
2nd html - sourcecode Just copy 1:1
...{:block4:}
3rd html - sourcecode Just copy 1:1
I would like to convert the data to an XmlDocument.
Blocks are identified by {::} and params are identified by %%=
Thanks a lot.
What I'm looking for is more an idea than complete code. I have found many examples that read ini files using RegEx and a TextReader to get some lines. The problem is that more than one {:block:} may appear within a line, and there are lots of whitespace characters and line breaks.

If the problem is that more than one {:block:} can appear within a line, could you replace every "{" with "\r\n{" to guarantee that every block starts on its own line? (In other words, replace every "{" with "newline{".) Would the extra line breaks cause a problem? Otherwise, you could write a regular expression that identifies only those blocks where you need to insert a line break.
Whitespace and line breaks are both matched by the regex character class \s. A common way to use it is either as "\s+" or as "\s*", depending on whether the whitespace is required or optional.
It would also help if you were more specific about particular problems.
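
One way to sidestep the line-based reading entirely is to run the regex over the whole text. The sketch below is only an idea under assumptions: block names are taken to be word characters, params are taken to sit on their own lines, and the class and method names are invented. It relies on the fact that Regex.Split also returns the text of capturing groups.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class BlockParserDemo
{
    // Splits on {:name:} markers. Because the name is in a capturing group,
    // Regex.Split returns it too: [textBefore, name1, body1, name2, body2, ...].
    public static Dictionary<string, string> ParseBlocks(string text)
    {
        string[] parts = Regex.Split(text, @"\{:(\w+):\}");
        var blocks = new Dictionary<string, string>();
        for (int i = 1; i + 1 < parts.Length; i += 2)
        {
            blocks[parts[i]] = parts[i + 1]; // body is kept 1:1, untrimmed
        }
        return blocks;
    }

    // Reads "%name% = value" entries; \s* tolerates spaces around the '='.
    public static Dictionary<string, string> ParseParams(string body)
    {
        var result = new Dictionary<string, string>();
        foreach (Match m in Regex.Matches(body, @"%(\w+)%\s*=\s*(.*)"))
        {
            result[m.Groups[1].Value] = m.Groups[2].Value.Trim();
        }
        return result;
    }

    public static void Main()
    {
        // Note that block3 starts in the middle of a line.
        string text =
            "{:block1:}\n%param1%= value1\n%param2% = value2\n" +
            "{:block2:}\n<p>first</p>...{:block3:}\n<p>second</p>";

        Dictionary<string, string> blocks = ParseBlocks(text);
        foreach (KeyValuePair<string, string> p in ParseParams(blocks["block1"]))
        {
            Console.WriteLine(p.Key + " = " + p.Value);
        }
        Console.WriteLine(blocks["block3"].Trim());
    }
}
```

From the two dictionaries, building the XmlDocument is a matter of creating one element per block and one child or attribute per param; that part is left out here.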

Replacing numbers in strings with C#

I thought I'd do a regex replace
Regex r = new Regex("[0-9]");
return r.Replace(sz, "#");
on a file named aa514a3a.4s5. It works exactly as I expect: it replaces all the digits, including the ones in the extension. How do I make it NOT replace the digits in the extension? I have tried numerous regex strings, but I am beginning to think it is an all-or-nothing pattern and I can't do this. Do I need to separate the extension from the string, or can I use regex?

This one does it for me:
(?<!\.[0-9a-z]*)[0-9]
This does a negative lookbehind (the given pattern must not occur immediately before the matched text) on a period followed by zero or more alphanumeric characters. This ensures that only digits outside your extension are matched.
Obviously, the [0-9a-z] must be replaced by whichever characters you expect in your extension.
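
A quick sketch of the lookbehind in use, with the file name from the question. The class and method names are made up; the pattern works because .NET allows variable-length lookbehind.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookbehindReplaceDemo
{
    // Replaces every digit that is NOT preceded by ".<extension chars>".
    // Assumes a lowercase alphanumeric extension, as in the answer.
    public static string MaskDigits(string fileName)
    {
        return Regex.Replace(fileName, @"(?<!\.[0-9a-z]*)[0-9]", "#");
    }

    public static void Main()
    {
        Console.WriteLine(MaskDigits("aa514a3a.4s5")); // aa###a#a.4s5
    }
}
```

One caveat: with more than one period in the name, every digit after the first period is left alone, not just the ones in the final extension.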

I don't think you can do that with a single regular expression.
Probably best to split the original string into base and extension; do the replace on the base; then join them back up.

Yes, I think you'd be better off separating the extension.
If you are sure there is always a 3-character extension at the end of your string, the easiest, most readable and maintainable solution is to perform the replace only on
yourString.Substring(0, yourString.Length-4)
...and then append
yourString.Substring(yourString.Length-4, 4)

Why not run the regex on the substring?
String filename = "aa514a3a.4s5";
String nameonly = filename.Substring(0,filename.Length-4);
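
If the extension is not always three characters long, the split can be left to System.IO.Path instead of a fixed Substring length. A minimal sketch, assuming the input is a bare file name without a directory part; the class and method names are hypothetical.

```csharp
using System;
using System.IO;
using System.Text.RegularExpressions;

public static class SplitExtensionDemo
{
    // Masks digits in the base name only and re-attaches the extension.
    // Assumes a bare file name; any directory part would be dropped.
    public static string MaskBaseName(string fileName)
    {
        string ext = Path.GetExtension(fileName);                  // e.g. ".4s5"
        string name = Path.GetFileNameWithoutExtension(fileName);  // e.g. "aa514a3a"
        return Regex.Replace(name, "[0-9]", "#") + ext;
    }

    public static void Main()
    {
        Console.WriteLine(MaskBaseName("aa514a3a.4s5")); // aa###a#a.4s5
    }
}
```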
