How does this regex find triangular numbers? - c#

Part of a series of educational regex articles, this is a gentle introduction to the concept of nested references.
The first few triangular numbers are:
1 = 1
3 = 1 + 2
6 = 1 + 2 + 3
10 = 1 + 2 + 3 + 4
15 = 1 + 2 + 3 + 4 + 5
There are many ways to check if a number is triangular. There's this interesting technique that uses regular expressions as follows:
Given n, we first create a string of length n filled with the same character
We then match this string against the pattern ^(\1.|^.)+$
n is triangular if and only if this pattern matches the string
Here are some snippets to show that this works in several languages:
PHP (on ideone.com)
$r = '/^(\1.|^.)+$/';
foreach (range(0,50) as $n) {
if (preg_match($r, str_repeat('o', $n))) {
print("$n ");
}
}
Java (on ideone.com)
for (int n = 0; n <= 50; n++) {
String s = new String(new char[n]);
if (s.matches("(\\1.|^.)+")) {
System.out.print(n + " ");
}
}
C# (on ideone.com)
Regex r = new Regex(#"^(\1.|^.)+$");
for (int n = 0; n <= 50; n++) {
if (r.IsMatch("".PadLeft(n))) {
Console.Write("{0} ", n);
}
}
So this regex seems to work, but can someone explain how?
Similar questions
How to determine if a number is a prime with regex?

Explanation
Here's a schematic breakdown of the pattern:
from beginning…
| …to end
| |
^(\1.|^.)+$
\______/|___match
group 1 one-or-more times
The (…) brackets define capturing group 1, and this group is matched repeatedly with +. This subpattern is anchored with ^ and $ to see if it can match the entire string.
Group 1 tries to match this|that alternates:
\1., that is, what group 1 matched (self reference!), plus one of "any" character,
or ^., that is, just "any" one character at the beginning
Note that in group 1, we have a reference to what group 1 matched! This is a nested/self reference, and is the main idea introduced in this example. Keep in mind that when a capturing group is repeated, generally it only keeps the last capture, so the self reference in this case essentially says:
"Try to match what I matched last time, plus one more. That's what I'll match this time."
Similar to a recursion, there has to be a "base case" with self references. At the first iteration of the +, group 1 had not captured anything yet (which is NOT the same as saying that it starts off with an empty string). Hence the second alternation is introduced, as a way to "initialize" group 1, which is that it's allowed to capture one character when it's at the beginning of the string.
So as it is repeated with +, group 1 first tries to match 1 character, then 2, then 3, then 4, etc. The sum of these numbers is a triangular number.
Further explorations
Note that for simplification, we used strings that consists of the same repeating character as our input. Now that we know how this pattern works, we can see that this pattern can also match strings like "1121231234", "aababc", etc.
Note also that if we find that n is a triangular number, i.e. n = 1 + 2 + … + k, the length of the string captured by group 1 at the end will be k.
Both of these points are shown in the following C# snippet (also seen on ideone.com):
Regex r = new Regex(#"^(\1.|^.)+$");
Console.WriteLine(r.IsMatch("aababc")); // True
Console.WriteLine(r.IsMatch("1121231234")); // True
Console.WriteLine(r.IsMatch("iLoveRegEx")); // False
for (int n = 0; n <= 50; n++) {
Match m = r.Match("".PadLeft(n));
if (m.Success) {
Console.WriteLine("{0} = sum(1..{1})", n, m.Groups[1].Length);
}
}
// 1 = sum(1..1)
// 3 = sum(1..2)
// 6 = sum(1..3)
// 10 = sum(1..4)
// 15 = sum(1..5)
// 21 = sum(1..6)
// 28 = sum(1..7)
// 36 = sum(1..8)
// 45 = sum(1..9)
Flavor notes
Not all flavors support nested references. Always familiarize yourself with the quirks of the flavor that you're working with (and consequently, it almost always helps to provide this information whenever you're asking regex-related questions).
In most flavors, the standard regex matching mechanism tries to see if a pattern can match any part of the input string (possibly, but not necessarily, the entire input). This means that you should remember to always anchor your pattern with ^ and $ whenever necessary.
Java is slightly different in that String.matches, Pattern.matches and Matcher.matches attempt to match a pattern against the entire input string. This is why the anchors can be omitted in the above snippet.
Note that in other contexts, you may need to use \A and \Z anchors instead. For example, in multiline mode, ^ and $ match the beginning and end of each line in the input.
One last thing is that in .NET regex, you CAN actually get all the intermediate captures made by a repeated capturing group. In most flavors, you can't: all intermediate captures are lost and you only get to keep the last.
Related questions
(Java) method matches not work well - with examples on how to do prefix/suffix/infix matching
Is there a regex flavor that allows me to count the number of repetitions matched by * and + (.NET!)
Bonus material: Using regex to find power of twos!!!
With very slight modification, you can use the same techniques presented here to find power of twos.
Here's the basic mathematical property that you want to take advantage of:
1 = 1
2 = (1) + 1
4 = (1+2) + 1
8 = (1+2+4) + 1
16 = (1+2+4+8) + 1
32 = (1+2+4+8+16) + 1
The solution is given below (but do try to solve it yourself first!!!!)
(see on ideone.com in PHP, Java, and C#):
^(\1\1|^.)*.$

Related

String validations c#

I have a 9 character string I am trying to provide multiple checks on. I want to first check if the first 1 - 7 characters are numbers and then say for example the first 3 characters are numbers how would I check the 5th character for a letter range of G through T.
I am using c# and have tried this so far...
string checkString = "123H56789";
Regex charactorSet = new Regex("[G-T]");
Match matchSetOne = charactorSetOne.Match(checkString, 3);
if (Char.IsNumber(checkString[0]) && Char.IsNumber(checkString[1]) && Char.IsNumber(checkString[2]))
{
if (matchSetOne.Success)
{
Console.WriteLine("3th char is a letter");
}
}
But am not sure if this is the best way to handle the validations.
UPDATE:
The digits can be 0 - 9, but can concatenate from one number to seven. Like this "12345T789" or "1R3456789" etc.
It'a easy with LINQ:
check if the first 1 - 7 characters are numbers :
var condition1 = input.Take(7).All(c => Char.IsDigit(c));
check the 5th character for a letter range of G through T
var condition2 = input.ElementAt(4) >= 'G' && input.ElementAt(4) <= 'T';
As it is, both conditions can't be true at the same time (if the first 7 chars are digits, then the 5th char can't be a letter).

Filtering on full string match but not on substrings

So I've got a long string of numbers and characters and I'd like to filter out a substring. The thing I'm struggling with is that I need a full match on a certain value (starting with S) but this may not be matched in another value.
Input:
S10 1+0000000297472+00EURS100 1+0000000297472+00EURS1023P 1+0000000816072+00EUR
The input is exactly like this.
Breakdown of input:
S10 1+0000000297472+00EUR
Every part starts with a tag S and ends with EUR
There are spaces in between because every part has a fixed length
=>
index 0 : tag 'S' with length 1
index 1 : code with length 7
index 8 : numbertype with length 1
index 9 : sign with length 1
index 10 : value with length 13
index 23 : sign with length 1
index 24 : exponent with length 2
index 26 : unit with length 3
I need to match on for example S10 and I only want this substring till EUR. I don't want it to match on S100 or S1023P or any other combination. Only on exactly S10
Output:
S10 1+0000000297472+00EUR
I'm trying to use Regex to find my match on 'S + code'. I'm doing a full match on my search query and then as soon as anything follows I don't want it anymore. But doing it like this also discards the actual match as after the S10 the value will follow which will match with [^\d|^\D])+\w
foreach (var field in fieldList)
{
var query = "S" + field.BallanceCode;
var index = Regex.Match(values, Regex.Escape(query) + #"([^\d|^\D])+\w").Index;
}
For example when looking for S10
needs to match:
S10 1+0000000297472+00EUR
may not match:
S10/15 1+0000001748447+00EUR
S1023P 1+0000000816072+00EUR
S10000001+0000000546546+00EUR
Update:
Using this code
var index = Regex.Match(values, Regex.Escape(query) + #"\p{Zs}.*?EUR").Index;
wil yield S10, S10/15, etc when looked for. However looking for S1000000 in the string doesn't work because there is no whitespace between the code and 1+
S10000001+0000000546546+00EUR
For example when looking for S1000000
needs to match:
S10000001+0000000297472+00EUR
may not match:
S10 1+0000001748447+00EUR
S1023P 1+0000000816072+00EUR
S10/15 1+0000000546546+00EUR
You can use a regex that requires a space (or whitespace) to appear right after the field.BallanceCode:
var index = Regex.Match(values, Regex.Escape(query) + (field.BallanceCode.Length < 7 ? #"\p{Zs}" : "") + ".*?EUR").Index;
The regex will match the S10, then any horizontal whitespace (\p{Zs}), then any 0 or more characters other than a newline (as few as possible due to *?) up to the first EUR.
The (field.BallanceCode.Length < 7 ? #"\p{Zs}" : "") check is necessary to support a 7-digit BallanceCode. If it contains 7 digits or more, we do not check if there is a whitespace after it. If the length is less than 7, we check for a space.
So you just want the start (S...) and end (...EUR) of each line and skip everything in between?
^([sS]\d+).*?([\d\+]+EUR)$
http://regexr.com/3c1ob

Regex help - match any number of characters

I have following kind of string-sets in a text file:
<< /ImageType 1
/Width 986 /Height 1
/BitsPerComponent 8
/Decode [0 1 0 1 0 1]
/ImageMatrix [986 0 0 -1 0 1]
/DataSource <
803fe0503824160d0784426150b864361d0f8844625138a4562d178c466351b8e4763d1f904864523924964d27944a6552b964b65d2f984c665339a4d66d379c4e6753b9e4f67d3fa05068543a25168d47a4526954ba648202
> /LZWDecode filter >> image } def
There are 100s of Images defined like above.
I need to find all such images defined in the document.
Here is my code -
string txtFile = #"text file path";
string fileContents = File.ReadAllText(txtFile);
string pattern = #"<< /ImageType 1.*(\n|\r|\r\n)*image } def"; //match any number of characters between `<< /ImageType 1` and `image } def`
MatchCollection matchCollection = Regex.Matches(fileContents, pattern, RegexOptions.Singleline);
int count = matchCollection.Count; // returns 1
However, I am getting just one match - whereas there are around 600 images defined.
But it seems they all are matched in one because of 'newline' character used in pattern.
Can anyone please guide what do I need to modify the correct result of regex match as 600.
The reason is that regular expressions are usually greedy, i.e. the matches are always as long as possible. Thus, the image } def is contained in the .*. I think the best approach here would be to perform two separate regex queries, one for << /ImageType 1 and one for image } def. Every match of the first pattern would correspond to exactly one match of the second one and as these matches carry their indices in the original string, you can reconstruct the image by accessing the appropriate substring.
Instead of .* you should use the non-greedy quantifier .*?:
string pattern = #"<< /ImageType 1.*?image } def";
Here is a site that can help you out with REGEX that I use. http://webcheatsheet.com/php/regular_expressions.php.
if(preg_match('/^/[a-z]/i', $string, $matches)){
echo "Match was found <br />";
echo $matches[0];
}

Regex masking of words that contain a digit

Trying to come up with a 'simple' regex to mask bits of text that look like they might contain account numbers.
In plain English:
any word containing a digit (or a train of such words) should be matched
leave the last 4 digits intact
replace all previous part of the matched string with four X's (xxxx)
So far
I'm using the following:
[\-0-9 ]+(?<m1>[\-0-9]{4})
replacing with
xxxx${m1}
But this misses on the last few samples below
sample data:
123456789
a123b456
a1234b5678
a1234 b5678
111 22 3333
this is a a1234 b5678 test string
Actual results
xxxx6789
a123b456
a1234b5678
a1234 b5678
xxxx3333
this is a a1234 b5678 test string
Expected results
xxxx6789
xxxxb456
xxxx5678
xxxx5678
xxxx3333
this is a xxxx5678 test string
Is such an arrangement possible with a regex replace?
I think I"m going to need some greediness and lookahead functionality, but I have zero experience in those areas.
This works for your example:
var result = Regex.Replace(
input,
#"(?<!\b\w*\d\w*)(?<m1>\s?\b\w*\d\w*)+",
m => "xxxx" + m.Value.Substring(Math.Max(0, m.Value.Length - 4)));
If you have a value like 111 2233 33, it will print xxxx3 33. If you want this to be free from spaces, you could turn the lambda into a multi-line statement that removes whitespace from the value.
To explain the regex pattern a bit, it's got a negative lookbehind, so it makes sure that the word behind it does not have a digit in it (with optional word characters around the digit). Then it's got the m1 portion, which looks for words with digits in them. The last four characters of this are grabbed via some C# code after the regex pattern resolves the rest.
I don't think that regex is the best way to solve this problem and that's why I am posting this answer. For so complex situations, building the corresponding regex is too difficult and, what is worse, its clarity and adaptability is much lower than a longer-code approach.
The code below these lines delivers the exact functionality you are after, it is clear enough and can be easily extended.
string input = "this is a a1234 b5678 test string";
string output = "";
string[] temp = input.Trim().Split(' ');
bool previousNum = false;
string tempOutput = "";
foreach (string word in temp)
{
if (word.ToCharArray().Where(x => char.IsDigit(x)).Count() > 0)
{
previousNum = true;
tempOutput = tempOutput + word;
}
else
{
if (previousNum)
{
if (tempOutput.Length >= 4) tempOutput = "xxxx" + tempOutput.Substring(tempOutput.Length - 4, 4);
output = output + " " + tempOutput;
previousNum = false;
}
output = output + " " + word;
}
}
if (previousNum)
{
if (tempOutput.Length >= 4) tempOutput = "xxxx" + tempOutput.Substring(tempOutput.Length - 4, 4);
output = output + " " + tempOutput;
previousNum = false;
}
Have you tried this:
.*(?<m1>[\d]{4})(?<m2>.*)
with replacement
xxxx${m1}${m2}
This produces
xxxx6789
xxxx5678
xxxx5678
xxxx3333
xxxx5678 test string
You are not going to get 'a123b456' to match ... until 'b' becomes a number. ;-)
Here is my really quick attempt:
(\s|^)([a-z]*\d+[a-z,0-9]+\s)+
This will select all of those test cases. Now as for C# code, you'll need to check each match to see if there is a space at the beginning or end of the match sequence (e.g., the last example will have the space before and after selected)
here is the C# code to do the replace:
var redacted = Regex.Replace(record, #"(\s|^)([a-z]*\d+[a-z,0-9]+\s)+",
match => "xxxx" /*new String("x",match.Value.Length - 4)*/ +
match.Value.Substring(Math.Max(0, match.Value.Length - 4)));

Regex for checking for compass directions

I'm looking to match the 8 main directions as might appear in a street or location prefix or suffix, such as:
N Main
south I-22
124 Grover Ave SE
This is easy to code using a brute force list of matches and cycle through every match possibility for every street address, matching once with a start-of-string anchor and once with a end-of-string anchor. My blunt starting point is shown farther down, if you want to see it.
My question is if anyone has some clever ideas for compact, fast-executing patterns to accomplish the same thing. You can assume:
Compound directions always start with the north / south component. So I need to match South East but not EastSouth
The pattern should not match [direction]-ern words, like "Northern" or "Southwestern"
The match will always be at the very beginning or very end of the string.
I'm using C#, but I'm just looking for a pattern so I'm not emphasizing the language. /s(outh)?/ is just as good as #"s(outh)?" for me or future readers.
SO emphasizes real problems, so FYI this is one. I'm parsing a few hundred thousand nasty, unvalidated user-typed address strings. I want to check if the start or end of the "street" field (which is free-form jumble of PO boxes, streets, apartments, and straight up invalid junk) begins or ends with a compass direction. I'm trying to deconstruct these free form strings to find similar addresses which may be accidental or intentional variations and obfuscations.
My blunt attempt
Core pattern: /n(orth)?|e(ast)?|s(outh)?|w(est)?|n(orth\s*east|e|orth\s*west|w)|s(outh\s*east|e|outh\s*west|w)/
In a function:
public static Tuple<Match, Match> MatchDirection(String value) {
string patternBase = #"n(orth)?|e(ast)?|s(outh)?|w(est)?|n(orth\s*east|e|orth\s*west|w)|s(outh\s*east|e|outh\s*west|w)";
Match[] matches = new Match[2];
string[] compassPatterns = new[] { #"^(" + patternBase + #")\b", #"\b(" + patternBase + #")$" };
for (int i = 0; i < 2; i++) { matches[i] = Regex.Match(value, compassPatterns[i], RegexOptions.IgnoreCase); }
return new Tuple<Match, Match>(matches[0], matches[1]);
}
In use, where sourceDt is a table with all the addresses:
var parseQuery = sourceDt.AsEnumerable()
.Select((DataRow row) => {
string addr = ((string)row["ADDR_STREET"]).Trim();
Tuple<Match, Match> dirMatches = AddressParser.MatchDirection(addr);
return new string[] { addr, dirMatches.Item1.Value, dirMatches.Item2.Value };
})
Edit: Actually this is probably wrong answer - so keeping it just so people not suggest the same thing - figuring out tokenization for "South East" is task in itself. Also I still doubt RegExp will be very usable either.
Original answer:
Don't... your initial RegExp attempt is already non-readable.
Dictionary look up for each word you want from the tokenized string ("brute force approach") already gives you linear time on length and constant time per word. And it is very easy to customize with new words.
(^[nesw][^n\s]*)|([nesw][^n\s]*$)
So this will match a line that:
begins or ends with a word that:
Begins with a cardinal direction
Doesn't have an n otherwise in it (to get rid of the "-ern"s)
Perl/PCRE compatible expression:
(?xi)
(^)?
\b
(?:
n(?:orth)?
(?:\s* (?: e(?:ast)? | w(?:est)? ))?
|
s(?:outh)?
(?:\s* (?: e(?:ast)? | w(?:est)? ))?
|
e(?:ast)?
|
w(?:est)?
)
\b
(?(1)|$)
I think C# supports all the features used here.

Categories

Resources