weird regex behavior in the tokenization

I am using the following regex to tokenize:
reg = new Regex("([ \\t{}%$^&*():;_–`,\\-\\d!\"?\n])");
The regex is supposed to filter everything out later. However, the input string format I am having a problem with has the following form:
; "string1"; "string2"; "string...n";
As far as I know, the result for the string ; "social life"; "city life"; "real life" should look like this:
; White " social White life " ; White " city White life " ; White " real White life "
However, the output I actually get has the following form:
; empty White empty " social White life " empty ; empty White empty " city White life " empty ; empty White empty " real White life " empty
White: means a whitespace character,
empty: means an empty entry in the split array.
My code for the split is as follows:
string[] ret = reg.Split(input);
for (int i = 0; i < ret.Length; i++)
{
if (ret[i] == "")
Response.Write("empty<br>");
else
if (ret[i] == " ")
Response.Write("White<br>");
else
Response.Write(ret[i] + "<br>");
}
Why do I get these empty entries? In particular, when there is a ; followed by a space followed by a ", the result looks like the following:
; empty White empty "
Can I get an explanation of why the command adds empty entries, and how to remove them without any additional O(n) pass or a second data structure besides ret?

In my experience, splitting at regex matches is almost always not the best idea. You'll get much better results through plain matching.
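As for where the empty entries come from: Regex.Split returns the text between consecutive matches, and because the pattern is wrapped in a capturing group it also returns each matched delimiter. When two delimiters sit next to each other (here ;, a space and "), the text between them is the empty string, and the same happens when a delimiter is the first or last character of the input. A minimal sketch with a simplified pattern (the class and method names are made up for illustration):

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SplitDemo
{
    // Simplified stand-in for the question's pattern: space, semicolon and
    // double quote are delimiters; the capturing group keeps them in the result.
    private static readonly Regex Delimiters = new Regex("([ ;\"])");

    // Raw result, including the empty strings between adjacent delimiters.
    public static string[] SplitRaw(string input)
    {
        return Delimiters.Split(input);
    }

    // Skips the empty entries while iterating, without building a second array.
    public static IEnumerable<string> SplitNonEmpty(string input)
    {
        foreach (string part in Delimiters.Split(input))
        {
            if (part.Length > 0)
                yield return part;
        }
    }
}
```

For the input ; "a" SplitRaw returns nine entries: empty, ;, empty, a space, empty, ", a, ", empty. Skipping the empty entries inside the loop that already walks the array, as SplitNonEmpty does, needs no extra pass over the data.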
Regexes are also very well suited to tokenization, since they let you implement a state machine really easily. Just take a look at this:
\G(?:
(?<string> "(?>[^"\\]+|\\.)*" )
| (?<separator> ; )
| (?<whitespace> \s+ )
| (?<invalid> . )
)
Demo - use this with RegexOptions.IgnorePatternWhitespace of course.
Here, each match will have the following properties:
It will start at the end of the previous match, so there will be no unmatched text
It will contain exactly one matching group
The name of the group tells you the token type
You can ignore the whitespace group, and you should raise an error if you ever encounter a matching invalid group.
The string group will match an entire quoted string, it can handle escapes such as \" inside the string.
The invalid group should always be last in the pattern. You may add rules for other token types.
Some example code:
var regex = new Regex(@"
\G(?:
(?<string> ""(?>[^""\\]+|\\.)*"" )
| (?<separator> ; )
| (?<whitespace> \s+ )
| (?<invalid> . )
)
", RegexOptions.IgnorePatternWhitespace);
var input = "; \"social life\"; \"city life\"; \"real life\"";
var groupNames = regex.GetGroupNames().Skip(1).ToList();
foreach (Match match in regex.Matches(input))
{
var groupName = groupNames.Single(name => match.Groups[name].Success);
var group = match.Groups[groupName];
Console.WriteLine("{0}: {1}", groupName, group.Value);
}
This produces the following:
separator: ;
whitespace:
string: "social life"
separator: ;
whitespace:
string: "city life"
separator: ;
whitespace:
string: "real life"
See how much easier it is to deal with these results rather than using split?
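The advice above about ignoring whitespace and raising an error on an invalid match could be wrapped up roughly as follows. This is only a sketch: the Tokenizer class, the Tokenize method and the choice of FormatException are assumptions made for illustration, while the pattern is the one from the example code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class Tokenizer
{
    private static readonly Regex TokenRegex = new Regex(@"
        \G(?:
            (?<string>     ""(?>[^""\\]+|\\.)*"" )
          | (?<separator>  ;   )
          | (?<whitespace> \s+ )
          | (?<invalid>    .   )
        )", RegexOptions.IgnorePatternWhitespace);

    // Group 0 is the whole match, so skip it and keep the named groups.
    private static readonly string[] GroupNames =
        TokenRegex.GetGroupNames().Skip(1).ToArray();

    public static List<(string Type, string Value)> Tokenize(string input)
    {
        var tokens = new List<(string Type, string Value)>();
        foreach (Match match in TokenRegex.Matches(input))
        {
            // Exactly one named group succeeds per match; its name is the token type.
            string type = GroupNames.Single(name => match.Groups[name].Success);
            if (type == "whitespace")
                continue; // whitespace carries no meaning here
            if (type == "invalid")
                throw new FormatException(
                    "Unexpected character '" + match.Value + "' at index " + match.Index);
            tokens.Add((type, match.Value));
        }
        return tokens;
    }
}
```

Calling Tokenizer.Tokenize on the sample input returns three separator tokens and three string tokens, with no empty entries to filter out.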

Related

Removal of colon and carriage returns and replace with colon

I'm working on a project where I have an HTML fragment that needs to be cleaned up. The HTML has been removed, and as a result of a table being removed there are some strange line ends where they shouldn't be :-)
The characters, as they appear, are:
a space at the beginning of a line
a colon, carriage return and linefeed at the end of the line, which needs to be replaced simply with the colon
I am presently using regex as follows:
s = Regex.Replace(s, @"(:[\r\n])", ":", RegexOptions.Multiline | RegexOptions.IgnoreCase);
// gets rid of the leading space
s = Regex.Replace(s, @"(^[( )])", "", RegexOptions.Multiline | RegexOptions.IgnoreCase);
Example of what I am dealing with:
Tomas Adams
Solicitor
APLawyers
p:
1800 995 718
f:
07 3102 9135
a:
22 Fultam Street
PO Box 132, Booboobawah QLD 4113
which should look like:
Tomas Adams
Solicitor
APLawyers
p:1800 995 718
f:07 3102 9135
a:22 Fultam Street
PO Box 132, Booboobawah QLD 4313
The regex above is my attempt to clean the string, but the result is far from perfect. Can someone help me correct the error and achieve my goal?
[EDIT]
the offending characters
f:\r\n07 3102 9135\r\na:\r\n22
the combination of :\r\n should be replaced by a single colon.
MTIA
Darrin

You may use
var result = Regex.Replace(s, @"(?m)^\s+|(?<=:)(?:\r?\n)+|(\r?\n){2,}", "$1");
See the .NET regex demo.
Details
(?m) - equal to RegexOptions.Multiline - makes ^ match the start of any line here
^ - start of a line
\s+ - 1+ whitespaces
| - or
(?<=:)(?:\r?\n)+ - a position that is immediately preceded with : (matched with (?<=:) positive lookbehind) followed with 1+ occurrences of an optional CR and LF (those are removed)
| - or
(\r?\n){2,} - two or more consecutive occurrences of an optional CR followed with an LF symbol. Only the last occurrence is saved in Group 1 memory buffer, thus the $1 replacement pattern inserts that last, single, occurrence.
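To see the three alternatives at work, the replacement can be wrapped in a small helper and run on a shortened version of the sample data. The LineCleaner class and Clean method are names invented for this sketch; the pattern and the $1 replacement are the ones given above.

```csharp
using System.Text.RegularExpressions;

public static class LineCleaner
{
    public static string Clean(string s)
    {
        // 1st alternative: leading whitespace on a line (removed, $1 is empty)
        // 2nd alternative: line breaks right after a colon (removed, $1 is empty)
        // 3rd alternative: runs of line breaks (collapsed to the last one via $1)
        return Regex.Replace(s, @"(?m)^\s+|(?<=:)(?:\r?\n)+|(\r?\n){2,}", "$1");
    }
}
```

For example, Clean("Tomas Adams\r\n Solicitor\r\np:\r\n1800 995 718") gives "Tomas Adams\r\nSolicitor\r\np:1800 995 718".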

A basic solution without Regex:
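// Split on both CRLF and LF so that no stray "\r" is left at the end of a line
var lines = input.Split(new[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries);
var output = new StringBuilder();
for (var i = 0; i < lines.Length; i++)
{
    // Merge a label line such as "p:" into the line that follows it, if there is one
    if (lines[i].EndsWith(":") && i + 1 < lines.Length) // feel free to also check for the size
    {
        lines[i + 1] = lines[i] + lines[i + 1];
        continue;
    }
    output.AppendLine(lines[i].Trim()); // remove space before or after a line
}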

I tried to use your regular expression. I was able to replace "\n" and ":" with the following regular expression. It removes the ":" and "\n" at the end of the line.
@"([:\r\n])"

A Linq solution without Regex:
var tmp = string.Empty;
var output = input.Split(new[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries).Aggregate(new StringBuilder(), (a, b) => {
if (b.EndsWith(":")) { // feel free to also check for the size
tmp = b;
}
else {
a.AppendLine((tmp + b).Trim()); // remove space before or after a line
tmp = string.Empty;
}
return a;
});

string manipulation : how to split and join a string with delimiters include

I want to split a string and join certain parts of it at the same time. The string to be split is an SQL query.
I set the split delimiters: {". ", ",", ", ", " "}
For example:
select id, name, age, status from tb_test where age > 20 and status = 'Active'
I want it to produce a result something like this:
select
id
,
name
,
age
,
status
from
tb_test
where
age > 20
and
status = 'Active'
But what I get from string split is only word by word.
What should I do to get a result like the above?
Thanks in advance.

First create a list of all the SQL keywords that you want to split on:
List<string> sql = new List<string>() {
"select",
"where",
"and",
"or",
"from",
","
};
After that, loop over this list and replace each keyword with itself surrounded by $.
The $ sign will be the character to split on later.
string query = "select id, name, age, status from tb_test where age > 20 and status = 'Active'";
foreach (string s in sql)
{
//Replace is case-sensitive, so this assumes the query is in lower case like the keywords
query = query.Replace(s.ToLower(), "$" + s.ToLower() + "$");
}
Now do the split and remove the leading and trailing spaces using Trim():
string[] splits = query.Split(new char[] { '$' }, StringSplitOptions.RemoveEmptyEntries);
foreach (string s in splits) Console.WriteLine(s.Trim() + "\r\n");
This will split on the SQL commands. Now you can further customize it to your needs.
Result:
select
id
,
name
,
age
,
status
from
tb_test
where
age > 20
and
status = 'Active'
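One caveat with the replace-then-split approach is that it matches keywords as plain substrings, so a column named brand would be split at and. A hedged alternative is to let Regex.Split do the work with word boundaries and a capturing group, which keeps the keywords and commas in the result. The SqlSplitter class and SplitQuery method below are illustrative names, and the sketch assumes that no keyword or comma appears inside a quoted string.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SqlSplitter
{
    // The capturing group makes Regex.Split return the keyword or comma itself;
    // \b stops "and" or "or" from matching inside longer identifiers.
    private static readonly Regex Separators = new Regex(
        @"\s*(\b(?:select|from|where|and|or)\b|,)\s*",
        RegexOptions.IgnoreCase);

    public static List<string> SplitQuery(string query)
    {
        var parts = new List<string>();
        foreach (string piece in Separators.Split(query))
        {
            string trimmed = piece.Trim();
            if (trimmed.Length > 0) // drop the empty entry before a leading keyword
                parts.Add(trimmed);
        }
        return parts;
    }
}
```

For the sample query this yields the same 14 entries as the result listed above, with age > 20 and status = 'Active' kept as single entries.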

Here's a pure-regex solution:
(?:(?=,)|(?<![<>=]) +(?! *[<>=])|(?:(?<=,)))(?=(?:(?:[^'"]*(?P<s>['"])(?:(?!(?P=s)).)*(?P=s)))*[^'"]*$)
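I made it so it can deal with the usual pitfalls, like strings, but there's probably still some stuff that'll break it. See demo. Note that the named-group syntax used here, (?P<s>...) and (?P=s), is the PCRE/Python form; in .NET the same constructs are written (?<s>...) and \k<s>.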
Explanation:
(?:
(?=,) # split before a comma.
|
(?<! # if not preceded by an operator, ...
[<>=]
)
+ #...split at a space...
(?! *[<>=]) #...unless there's an operator after the space.
|
(?: # also split after a comma.
(?<=,)
)
)
# HOWEVER, make sure this isn't inside of a string.
(?= # assert that there's an even number of quotes left in the text.
(?: # consume pairs of quotes.
[^'"]* # all text up to a quote
(?P<s>['"]) # capture the quote
(?: # consume everything up to the next quote.
(?!
(?P=s)
)
.
)*
(?P=s)
)*
[^'"]* # then make sure there are no more quotes until the end of the text.
$
)

The first split separates on the keywords SELECT, FROM and WHERE.
The second split separates the columns using your delimiters.

One approach using regex:
// A single capturing group, so that $1 always holds whatever was matched
string strRegex = @"(select|from|where|[,.])";
Regex myRegex = new Regex(strRegex, RegexOptions.IgnoreCase | RegexOptions.Multiline);
string strTargetString = @"select id, name, age, status from tb_test where age > 20 and status = 'Active'";
string strReplace = "$1\r\n";
return myRegex.Replace(strTargetString, strReplace);
This should output:
select
 id,
 name,
 age,
 status from
 tb_test where
 age > 20 and status = 'Active'
You may want to perform another replacement to trim the leading space on each line.
And also use "\r\n$1\r\n" only for SQL keywords (select, from, where, ...)
Hope this help.

Regex masking of words that contain a digit

Trying to come up with a 'simple' regex to mask bits of text that look like they might contain account numbers.
In plain English:
any word containing a digit (or a train of such words) should be matched
leave the last 4 digits intact
replace all previous part of the matched string with four X's (xxxx)
So far
I'm using the following:
[\-0-9 ]+(?<m1>[\-0-9]{4})
replacing with
xxxx${m1}
But this misses on the last few samples below
sample data:
123456789
a123b456
a1234b5678
a1234 b5678
111 22 3333
this is a a1234 b5678 test string
Actual results
xxxx6789
a123b456
a1234b5678
a1234 b5678
xxxx3333
this is a a1234 b5678 test string
Expected results
xxxx6789
xxxxb456
xxxx5678
xxxx5678
xxxx3333
this is a xxxx5678 test string
Is such an arrangement possible with a regex replace?
I think I'm going to need some greediness and lookahead functionality, but I have zero experience in those areas.

This works for your example:
var result = Regex.Replace(
    input,
    @"(?<!\b\w*\d\w*)(?<m1>\s?\b\w*\d\w*)+",
    m => "xxxx" + m.Value.Substring(Math.Max(0, m.Value.Length - 4)));
If you have a value like 111 2233 33, it will print xxxx3 33. If you want this to be free from spaces, you could turn the lambda into a multi-line statement that removes whitespace from the value.
To explain the regex pattern a bit, it's got a negative lookbehind, so it makes sure that the word behind it does not have a digit in it (with optional word characters around the digit). Then it's got the m1 portion, which looks for words with digits in them. The last four characters of this are grabbed via some C# code after the regex pattern resolves the rest.
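The multi-line lambda mentioned above might look roughly like this. It is a sketch rather than the answer's own code: the AccountMasker class and Mask method are invented names, and keeping a leading space from the match in front of the xxxx is an added assumption, made so that the surrounding words do not run together.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AccountMasker
{
    private static readonly Regex DigitWords =
        new Regex(@"(?<!\b\w*\d\w*)(?<m1>\s?\b\w*\d\w*)+");

    public static string Mask(string input)
    {
        return DigitWords.Replace(input, m =>
        {
            // The match may begin with the space that precedes the first word; keep it.
            string leading = char.IsWhiteSpace(m.Value[0]) ? m.Value.Substring(0, 1) : "";
            // Remove all whitespace so the last four characters come from the words only.
            string compact = Regex.Replace(m.Value, @"\s+", "");
            string tail = compact.Substring(Math.Max(0, compact.Length - 4));
            return leading + "xxxx" + tail;
        });
    }
}
```

With this version the value 111 2233 33 becomes xxxx3333 rather than xxxx3 33.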

I don't think that regex is the best way to solve this problem, and that's why I am posting this answer. For situations this complex, building the corresponding regex is too difficult and, what is worse, the result is much less clear and adaptable than a longer-code approach.
The code below delivers the exact functionality you are after, is clear enough, and can be easily extended.
string input = "this is a a1234 b5678 test string";
string output = "";
string[] temp = input.Trim().Split(' ');
bool previousNum = false;
string tempOutput = "";
foreach (string word in temp)
{
    if (word.Any(char.IsDigit)) // requires using System.Linq
    {
        // Collect consecutive words that contain a digit
        previousNum = true;
        tempOutput = tempOutput + word;
    }
    else
    {
        if (previousNum)
        {
            if (tempOutput.Length >= 4) tempOutput = "xxxx" + tempOutput.Substring(tempOutput.Length - 4, 4);
            output = output + " " + tempOutput;
            previousNum = false;
            tempOutput = ""; // start a fresh run for the next group of words
        }
        output = output + " " + word;
    }
}
if (previousNum)
{
    // Flush a run that reaches the end of the input
    if (tempOutput.Length >= 4) tempOutput = "xxxx" + tempOutput.Substring(tempOutput.Length - 4, 4);
    output = output + " " + tempOutput;
    previousNum = false;
}
output = output.Trim(); // drop the leading space added by the concatenation

Have you tried this:
.*(?<m1>[\d]{4})(?<m2>.*)
with replacement
xxxx${m1}${m2}
This produces
xxxx6789
xxxx5678
xxxx5678
xxxx3333
xxxx5678 test string
You are not going to get 'a123b456' to match ... until 'b' becomes a number. ;-)

Here is my really quick attempt:
(\s|^)([a-z]*\d+[a-z,0-9]+\s)+
This will select all of those test cases. Now as for C# code, you'll need to check each match to see if there is a space at the beginning or end of the match sequence (e.g., the last example will have the space before and after selected)
here is the C# code to do the replace:
var redacted = Regex.Replace(record, @"(\s|^)([a-z]*\d+[a-z,0-9]+\s)+",
    match => "xxxx" /*new string('x', match.Value.Length - 4)*/ +
    match.Value.Substring(Math.Max(0, match.Value.Length - 4)));

String Manipulation Using RegEx

Given the following scenario, I am wondering if a better solution could be written with regular expressions, which I am not very familiar with yet. I am seeing holes in my basic C# string manipulation, even though it somewhat works. Your thoughts and ideas are most appreciated.
Thanks much,
Craig
Given the string "story" below, write a script to do the following:
Variable text is enclosed by { }.
If the variable text is blank, remove any other text enclosed in [ ].
Text to be removed can be nested deep with [ ].
Format:
XYZ Company [- Phone: [({404}) ]{321-4321} [Ext: {6789}]]
Examples:
All variable text filled in.
XYZ Company - Phone: (404) 321-4321 Ext: 6789
No Extension entered, remove "Ext:".
XYZ Company - Phone: (404) 321-4321
No Extension and no area code entered, remove "Ext:" and "( ) ".
XYZ Company - Phone: 321-4321
No extension, no phone number, and no area code, remove "Ext:" and "( ) " and "- Phone: ".
XYZ Company
Here is my solution with plain string manipulation.
private string StoryManipulation(string theStory)
{
// Loop through story while there are still curly brackets
while (theStory.IndexOf("{") > 0)
{
// Extract the first curly text area
string lcCurlyText = StringUtils.ExtractString(theStory, "{", "}");
// Look for surrounding brackets and blank all text between
if (String.IsNullOrWhiteSpace(lcCurlyText))
{
for (int lnCounter = theStory.IndexOf("{"); lnCounter >= 0; lnCounter--)
{
if (theStory.Substring(lnCounter - 1, 1) == "[")
{
string lcSquareText = StringUtils.ExtractString(theStory.Substring(lnCounter - 1), "[", "]");
theStory = StringUtils.ReplaceString(theStory, ("[" + lcSquareText + "]"), "", false);
break;
}
}
}
else
{
// Replace current curly brackets surrounding the text
theStory = StringUtils.ReplaceString(theStory, ("{" + lcCurlyText + "}"), lcCurlyText, false);
}
}
// Replace all brackets with blank (-1 all instances)
theStory = StringUtils.ReplaceStringInstance(theStory, "[", "", -1, false);
theStory = StringUtils.ReplaceStringInstance(theStory, "]", "", -1, false);
return theStory.Trim();
}

Dealing with nested structures is generally beyond the scope of regular expressions. But I think there is a solution, if you run the regex replacement in a loop, starting from the inside out. You will need a callback-function though (a MatchEvaluator):
string ReplaceCallback(Match match)
{
    // Drop the whole bracketed section when its variable text is blank
    if (String.IsNullOrWhiteSpace(match.Groups[2].Value))
        return "";
    else
        return match.Groups[1].Value + match.Groups[2].Value + match.Groups[3].Value;
}
Then you can create the evaluator:
MatchEvaluator evaluator = new MatchEvaluator(ReplaceCallback);
And then you can call this in a loop until the replacement does not change anything any more:
newString = Regex.Replace(
oldString,
@"
\[ # a literal [
( # start a capturing group. this is what we access with match.Groups[1]
[^{}[\]]
# a negated character class, that matches anything except {, }, [ and ]
* # arbitrarily many of those
) # end of the capturing group
\{ # a literal {
([^{}[\]]*)
# the same thing as before, we will access this with match.Groups[2]
} # a literal }
([^{}[\]]*)
# match.Groups[3]
] # a literal ]
",
evaluator,
RegexOptions.IgnorePatternWhitespace
);
Here is the whitespace-free version of the regex:
\[([^{}[\]]*)\{([^{}[\]]*)}([^{}[\]]*)]
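Putting the pieces together, the loop that repeats the replacement until nothing changes could be sketched as below. The StoryTemplate class and Resolve method are hypothetical names, and the sketch assumes, as in the example format, that each bracketed section ends up containing exactly one { } variable once its inner sections have been resolved.

```csharp
using System.Text.RegularExpressions;

public static class StoryTemplate
{
    // Matches an innermost [ ... {variable} ... ] section with no nested brackets.
    private static readonly Regex InnermostSection =
        new Regex(@"\[([^{}[\]]*)\{([^{}[\]]*)}([^{}[\]]*)]");

    public static string Resolve(string story)
    {
        string previous;
        do
        {
            previous = story;
            story = InnermostSection.Replace(story, m =>
                string.IsNullOrWhiteSpace(m.Groups[2].Value)
                    ? "" // blank variable: remove the whole section
                    : m.Groups[1].Value + m.Groups[2].Value + m.Groups[3].Value);
        } while (story != previous); // stop once a pass changes nothing
        return story.Trim();
    }
}
```

Each pass resolves the innermost sections, so the example format needs two passes: the first handles the area code and extension, the second handles the surrounding phone section.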

How can I exclude the first match in a regular expression?

I have the following regex, so far:
([0-9]+){1}\s*[xX]\s*([A-Za-z\./%\$\s\*]+)
to be used on strings such as:
2x Soup, 2x Meat Balls, 4x Iced Tea
My intent was to capture the number of times something was ordered, as well as the name of item ordered.
In this regular expression however, the multiplier 'x' gets captured before the title.
How can I make it so that the x is ignored, and what comes after the x (and a space) is captured?

You can't ignore something in the middle of the pattern. That is what capturing groups are for.
([0-9]+){1}\s*[xX]\s*([A-Za-z\./%\$\s\*]+)
^^^^^^^^             ^^^^^^^^^^^^^^^^^^^^^
The marked parts of your pattern are stored in capturing groups because of the parentheses around them.
Your number is in group 1 and the name is in group 2. The "x" is not captured in a group.
How you now access your groups depends on the language you are using.
Btw. the {1} is redundant.
So for C# try this:
string text = "2x Soup, 2x Meat Balls, 4x Iced Tea";
MatchCollection result = Regex.Matches(text, @"([0-9]+)\s*[xX]\s*([A-Za-z\./%\$\s\*]+)");
int counter = 0;
foreach (Match m in result)
{
counter++;
Console.WriteLine("Order {0}: " + m.Groups[1] + " " + m.Groups[2], counter);
}
Console.ReadLine();
Furthermore, I would change the regex to this, since it seems you want to match every character up to the next comma as the name:
@"([0-9]+)\s*x\s*([^,]+)"
and use RegexOptions.IgnoreCase to avoid having to write [xX].
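As a sketch of how the simplified pattern could be used, the helper below collects each quantity and item name from the two capturing groups. The OrderParser class and Parse method are names made up for this example.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class OrderParser
{
    // Group 1 is the quantity, group 2 the item name; the x itself is not captured.
    private static readonly Regex OrderRegex =
        new Regex(@"([0-9]+)\s*x\s*([^,]+)", RegexOptions.IgnoreCase);

    public static List<(int Quantity, string Name)> Parse(string text)
    {
        var items = new List<(int Quantity, string Name)>();
        foreach (Match m in OrderRegex.Matches(text))
        {
            int quantity = int.Parse(m.Groups[1].Value);
            string name = m.Groups[2].Value.Trim(); // drop any space before the comma
            items.Add((quantity, name));
        }
        return items;
    }
}
```

For "2x Soup, 2x Meat Balls, 4x Iced Tea" this returns three items: 2 of Soup, 2 of Meat Balls and 4 of Iced Tea.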
