String manipulation: how to split a string and keep the delimiters in the result (C#)

I want to split a string and, at the same time, keep certain parts joined together. The string to be split is an SQL query.
I set the split delimiters to {". ", ",", ", ", " "}.
For example, given:
select id, name, age, status from tb_test where age > 20 and status = 'Active'
I want it to produce a result like this:
select
id
,
name
,
age
,
status
from
tb_test
where
age > 20
and
status = 'Active'
But what I get from string.Split is only the query word by word.
What should I do to get a result like the one above?
Thanks in advance.

First, create a list of all the SQL keywords (and the comma) that you want to split on:
List<string> sql = new List<string>() {
"select",
"where",
"and",
"or",
"from",
","
};
After that, loop over this list and replace each keyword with itself surrounded by $.
The dollar sign is the character we will split on later.
string query = "select id, name, age, status from tb_test where age > 20 and status = 'Active'";
foreach (string s in sql)
{
    // Replace is case-sensitive: the keywords are lowercased here, so the query
    // itself must also be lowercase for them to be found.
    query = query.Replace(s.ToLower(), "$" + s.ToLower() + "$");
}
Now do the split and remove the leading and trailing spaces using Trim():
string[] splits = query.Split(new char[] { '$' }, StringSplitOptions.RemoveEmptyEntries);
foreach (string s in splits) Console.WriteLine(s.Trim());
This splits on the SQL keywords; you can customize it further to your needs. Be aware that Replace also matches inside longer words (for example "or" in "order"), and that a $ in the query itself would break the split.
Result:
select
id
,
name
,
age
,
status
from
tb_test
where
age > 20
and
status = 'Active'
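If the substring problem matters, the same idea can be sketched with Regex.Split instead of Replace. This is only a sketch: the class and method names are made up for illustration, the keyword list is not exhaustive, and keywords inside quoted strings are not protected. It relies on the documented behaviour that text matched by a capturing group is included in the array that Regex.Split returns.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class SqlSplitter
{
    // Hypothetical helper; the keyword list is illustrative only.
    public static string[] SplitKeepingDelimiters(string query)
    {
        // The outer capturing group makes Regex.Split return the delimiters too.
        // \b stops "or" from matching inside words such as "order".
        const string pattern = @"(\b(?:select|from|where|and|or)\b|,)";

        return Regex.Split(query, pattern, RegexOptions.IgnoreCase)
                    .Select(part => part.Trim())      // drop surrounding spaces
                    .Where(part => part.Length > 0)   // drop empty pieces
                    .ToArray();
    }
}
```

Calling SqlSplitter.SplitKeepingDelimiters on the query from the question should give the same pieces as the listing above, and RegexOptions.IgnoreCase means the query does not have to be lowercase.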

Here's a pure-regex solution:
(?:(?=,)|(?<![<>=]) +(?! *[<>=])|(?:(?<=,)))(?=(?:(?:[^'"]*(?P<s>['"])(?:(?!(?P=s)).)*(?P=s)))*[^'"]*$)
I made it so it can deal with the usual pitfalls, like strings, but there's probably still some stuff that'll break it. Note that (?P<s>...) and (?P=s) are PCRE/Python syntax; .NET writes them as (?<s>...) and \k<s>.
Explanation:
(?:
    (?=,)              # split before a comma.
|
    (?<![<>=])         # if not preceded by an operator, ...
     +                 # ...split at one or more spaces...
    (?! *[<>=])        # ...unless an operator follows the spaces.
|
    (?<=,)             # also split after a comma.
)
# HOWEVER, make sure this isn't inside of a string.
(?=                    # assert that there's an even number of quotes left in the text.
    (?:                # consume pairs of quotes.
        [^'"]*         # all text up to a quote
        (?P<s>['"])    # capture the quote
        (?:            # consume everything up to the next matching quote.
            (?!(?P=s))
            .
        )*
        (?P=s)
    )*
    [^'"]*             # then make sure there are no more quotes until the end of the text.
    $
)
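For C#, here is a rough adaptation of that pattern for Regex.Split. It is a sketch, not a drop-in port: the class name is invented, and the named group with its backreference is swapped for two explicit alternatives ('...' or "..."), because Regex.Split would otherwise add the captured quote character to the result array.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class QuoteAwareSplitter
{
    // First part: where to split (before/after a comma, or at spaces that do
    // not touch a comparison operator).
    // Second part: only split when the quotes remaining in the text pair up,
    // i.e. the current position is not inside a quoted string.
    private const string Pattern =
        @"(?:(?=,)|(?<![<>=]) +(?! *[<>=])|(?<=,))" +
        @"(?=(?:[^'""]*(?:'[^']*'|""[^""]*""))*[^'""]*$)";

    public static string[] Split(string query)
    {
        // No capturing groups, so only the pieces between matches are returned.
        return Regex.Split(query, Pattern)
                    .Where(piece => piece.Length > 0)
                    .ToArray();
    }
}
```

As with the original, escaped quotes inside strings and operators such as != or LIKE are not handled.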

Do it in two passes. The first split separates the keywords SELECT, FROM and WHERE.
The second split breaks up the column list using your delimiters.

One approach using regex:
// A single capturing group is used so that $1 always refers to whatever matched.
string strRegex = @"(select|from|where|[,.])";
Regex myRegex = new Regex(strRegex, RegexOptions.IgnoreCase | RegexOptions.Multiline);
string strTargetString = @"select id, name, age, status from tb_test where age > 20 and status = 'Active'";
string strReplace = "$1\r\n";
return myRegex.Replace(strTargetString, strReplace);
This should output:
select
 id,
 name,
 age,
 status from
 tb_test where
 age > 20 and status = 'Active'
Every line after the first starts with a space, so you may want to perform another replacement to trim it.
You could also use "\r\n$1\r\n" for the SQL keywords only (select, from, where, ...) so that each keyword ends up on its own line.
Hope this helps.

Related

How can I remove certain characters from a string only if it comes before a specific character

I have a string that has parentheses in it. I want to write a function that removes a pair of parentheses, together with what's inside, only if it comes directly before a comma. Sometimes there is more than one set of parentheses in the string, but I only want to remove the set just before the comma.
var string1 = "Dog (big), 0";
var string2 = "Dog (medium) (black), 1";
var string3 = "Dog (small) (brown), 1";
What I want:
string1 = "Dog, 0"
string2 = "Dog (medium), 1"
string3 = "Dog (small), 1"

If you don't want to use regex:
var string2 = "Dog (medium) (black), 1";
// Remove the last "(...)" group together with the space in front of it.
string2 = string2.Replace(string2.Substring(string2.LastIndexOf('(') - 1, string2.LastIndexOf(')') - string2.LastIndexOf('(') + 2), "");
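The one-liner above assumes the string always contains a parenthesised group with a space in front of it. A slightly more defensive sketch of the same index-based idea follows. The class and method names are invented for illustration, and it assumes that only the first comma in the string matters.

```csharp
using System;

public static class ParenTrimmer
{
    // Hypothetical helper: removes the "(...)" group that sits directly
    // before the first comma, plus one space in front of it if present.
    public static string RemoveGroupBeforeComma(string s)
    {
        int comma = s.IndexOf(',');
        if (comma <= 0 || s[comma - 1] != ')')
            return s;                              // nothing to remove

        int close = comma - 1;
        int open = s.LastIndexOf('(', close);      // search backwards from ')'
        if (open < 0)
            return s;                              // unbalanced, leave as is

        int start = (open > 0 && s[open - 1] == ' ') ? open - 1 : open;
        return s.Remove(start, close - start + 1);
    }
}
```

Strings without a group directly before the comma, such as "Dog, 0", are returned unchanged instead of throwing.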

You can use Regex to perform this kind of string manipulation. For example:
Regex regex = new Regex(@"\s\([^\(\)]+\)(?=,)");
var inputs = new[] {
"Dog (small) (brown), 1",
"Dog (medium) (black), 1",
"Dog (big), 0"
};
foreach (var input in inputs)
Console.WriteLine(regex.Replace(input, ""));
// will output:
// Dog (small), 1
// Dog (medium), 1
// Dog, 0
This is what the pattern means:
\s matches white space, to remove the space before the bracket.
\( matches the opening bracket (the backslash escapes it, because brackets have a meaning in regex).
[^\(\)]+ is a character set. The ^ negates it, so it means anything except an opening or closing bracket. This matches the content of one pair of brackets and avoids matching across both pairs. The + is a quantifier meaning one or more times.
\) matches the closing bracket.
(?=,) is a positive lookahead. It says that what was matched so far must be followed by a comma, without making the comma part of the match.

Regular expression replace (C#)

How do I use Regex.Replace on the following texts?
1) "Name's", "Sex", "Age", "Height_(in)", "Weight (lbs)"
2) " LatD", "LatM ", 'LatS', "NS", "LonD", "LonM", "LonS", "EW", "City", "State"
Result:
1) Name's, Sex, Age, Height (in), Weight (lbs)
2) LatD, LatM, LatS, NS, LonD, LonM, LonS, EW, City, State
The spacing between the quoted values can be any size (example 1). There may also be stray spaces inside the quotes (example 2). An underscore may be used instead of a space (example 1), and single quotes may be used instead of double quotes (example 2).
In the result, the words must be separated by a comma and a space.
A snippet of my code:
StreamReader fileReader = new StreamReader(...);
var fileRow = fileReader.ReadLine();
fileRow = Regex.Replace(fileRow, "_", " ");
fileRow = Regex.Replace(fileRow, "\"", "");
var fileDataField = fileRow.Split(',');

I don't know C# syntax well, but this regex does the job:
Find: (?:_|^["']\h*|\h*["']$|\h*["']\h*,\h*["']\h*)
Replace: a space
(\h is PCRE shorthand for horizontal whitespace; .NET does not support it, so use [ \t] there instead.)
Explanation:
(?: # non-capturing group
_ # underscore
| # OR
^["']\h* # beginning of line, quote or apostrophe, 0 or more horizontal spaces
| # OR
\h*["']$ # 0 or more horizontal spaces, quote or apostrophe, end of line
| # OR
\h*["']\h* # 0 or more horizontal spaces, quote or apostrophe, 0 or more horizontal spaces
, # a comma
\h*["']\h* # 0 or more horizontal spaces, quote or apostrophe, 0 or more horizontal spaces
) # end group

How about plain string manipulation instead?
using System;
using System.Linq;
class Program
{
    static void Main(string[] args)
    {
        string dirty1 = "\"Name's\", \"Sex\", \"Age\", \"Height_(in)\", \"Weight (lbs)\"";
        string dirty2 = "\" LatD\", \"LatM \", 'LatS', \"NS\", \"LonD\", \"LonM\", \"LonS\", \"EW\", \"City\", \"State\"";
        Console.WriteLine(Clean(dirty1));
        Console.WriteLine(Clean(dirty2));
        Console.ReadKey();
    }
    private static string Clean(string dirty)
    {
        // Trim spaces and both kinds of quotes from each item,
        // turn "_" into a space, then re-join with ", ".
        return string.Join(", ", dirty.Split(',').Select(item => item.Trim(' ', '"', '\'').Replace('_', ' ')));
    }
    private static string CleanNoLinQ(string dirty)
    {
        string[] items = dirty.Split(',');
        for (int i = 0; i < items.Length; i++)
        {
            items[i] = items[i].Trim(' ', '"', '\'').Replace('_', ' ');
        }
        return String.Join(", ", items);
    }
}
The LINQ version can also be written as a plain loop followed by string.Join(), as CleanNoLinQ shows.
Easier to understand, easier to maintain.

How to write a regular expression that captures tags in a comma-separated list?

Here is my input:
#
tag1, tag with space, !##%^, 🦄
I would like to match it with a regex and yield the following elements easily:
tag1
tag with space
!##%^
🦄
I know I could do it this way:
var match = Regex.Match(input, @"^#[\n](?<tags>[\S ]+)$");
// if match is a success
var tags = match.Groups["tags"].Value.Split(',').Select(x => x.Trim());
But that's cheating, as it involves messing around with C#. There must be a neat way to do this with a regex. Just must be... right? ;D
The question is: how to write a regular expression that would allow me to iterate through captures and extract tags, without the need of splitting and trimming?

This works: (?ms)^\#\s+(?:\s*((?:(?!,|^\#\s+).)*?)\s*(?:,|$))+
It uses C#'s CaptureCollection to collect a variable number of fields from a single record.
You could extend the regex to get all records at once, where each record contains its own variable number of fields.
The regex has trimming built in as well.
Expanded:
(?ms) # Inline modifiers: multi-line, dot-all
^ \# \s+ # Beginning of record
(?: # Quantified group, 1 or more times, get all fields of record at once
\s* # Trim leading wsp
( # (1 start), # Capture collector for variable fields
(?: # One char at a time, but not comma or begin of record
(?!
,
| ^ \# \s+
)
.
)*?
) # (1 end)
\s*
(?: , | $ ) # End of this field, comma or EOL
)+
C# code:
string sOL = @"
#
tag1, tag with space, !##%^, 🦄";
Regex RxOL = new Regex(@"(?ms)^\#\s+(?:\s*((?:(?!,|^\#\s+).)*?)\s*(?:,|$))+");
Match _mOL = RxOL.Match(sOL);
while (_mOL.Success)
{
CaptureCollection ccOL1 = _mOL.Groups[1].Captures;
Console.WriteLine("-------------------------");
for (int i = 0; i < ccOL1.Count; i++)
Console.WriteLine(" '{0}'", ccOL1[i].Value );
_mOL = _mOL.NextMatch();
}
Output:
-------------------------
'tag1'
'tag with space'
'!##%^'
'??'
''
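A smaller variation on the same capture-collection idea, in case the trailing empty capture is unwanted. This is a sketch with invented names. It assumes a single record whose header line is exactly "#", and it requires every tag to contain at least one character, so empty fields (as in "a,,b") end the match early.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class TagParser
{
    public static string[] Parse(string input)
    {
        // Each repetition of the outer group stores one more capture in "tag".
        // The lazy +? together with \s* on both sides trims each tag.
        Match m = Regex.Match(
            input,
            @"^#\r?\n(?:\s*(?<tag>[^,\r\n]+?)\s*(?:,|$))+");

        if (!m.Success)
            return new string[0];

        return m.Groups["tag"].Captures
                .Cast<Capture>()
                .Select(c => c.Value)
                .ToArray();
    }
}
```

Group.Captures keeps every value the group matched during the repetition, whereas Group.Value only holds the last one.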

Nothing wrong with cheating ;]
string input = @"#
tag1, tag with space, !##%^, 🦄";
string[] tags = Array.ConvertAll(input.Split('\n').Last().Split(','), s => s.Trim());

You can pretty much do it without regex. Just split it like this:
var result = input.Split(new []{'\n','\r'}, StringSplitOptions.RemoveEmptyEntries).Skip(1).SelectMany(x=> x.Split(new []{','},StringSplitOptions.RemoveEmptyEntries).Select(y=> y.Trim()));

weird regex behavior in the tokenization

I am using the following regex to tokenize:
reg = new Regex("([ \\t{}%$^&*():;_–`,\\-\\d!\"?\n])");
The tokens are supposed to be filtered later on. However, the input string format I am having a problem with has the following form:
; "string1"; "string2"; "string...n";
As far as I know, the result for the string ; "social life"; "city life"; "real life" should be the following:
; White " social White life " ; White " city White life " ; White " real White life "
However, the output I actually get has the following form:
; empty White empty " social White life " empty ; empty White empty " city White life " empty ; empty White empty " real White life " empty
White: means a white-space entry,
empty: means an empty entry in the split array.
My code for the split is as follows:
string[] ret = reg.Split(input);
for (int i = 0; i < ret.Length; i++)
{
if (ret[i] == "")
Response.Write("empty<br>");
else
if (ret[i] == " ")
Response.Write("White<br>");
else
Response.Write(ret[i] + "<br>");
}
Why do I get these empty entries? In particular, when a ; is followed by a space and then a ", the result looks like this:
; empty White empty "
Can I get an explanation of why the split adds empty entries, and how to remove them without an additional O(n) pass or another data structure besides ret?

In my experience, splitting at regex matches is rarely the best idea. You'll get much better results through plain matching.
Regexes are well suited to tokenization, as they let you implement a state machine really easily. Take a look at this:
\G(?:
(?<string> "(?>[^"\\]+|\\.)*" )
| (?<separator> ; )
| (?<whitespace> \s+ )
| (?<invalid> . )
)
Use this with RegexOptions.IgnorePatternWhitespace, of course.
Here, each match will have the following properties:
It will start at the end of the previous match, so there will be no unmatched text.
It will contain exactly one matching group.
The name of the group tells you the token type.
You can ignore the whitespace group, and you should raise an error if you ever encounter a matching invalid group.
The string group will match an entire quoted string, and it can handle escapes such as \" inside the string.
The invalid group should always be last in the pattern. You may add rules for other token types.
Some example code:
var regex = new Regex(@"
\G(?:
(?<string> ""(?>[^""\\]+|\\.)*"" )
| (?<separator> ; )
| (?<whitespace> \s+ )
| (?<invalid> . )
)
", RegexOptions.IgnorePatternWhitespace);
var input = "; \"social life\"; \"city life\"; \"real life\"";
var groupNames = regex.GetGroupNames().Skip(1).ToList();
foreach (Match match in regex.Matches(input))
{
var groupName = groupNames.Single(name => match.Groups[name].Success);
var group = match.Groups[groupName];
Console.WriteLine("{0}: {1}", groupName, group.Value);
}
This produces the following:
separator: ;
whitespace:
string: "social life"
separator: ;
whitespace:
string: "city life"
separator: ;
whitespace:
string: "real life"
See how much easier it is to deal with these results rather than using split?
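As for why the empty entries show up in the first place: when the pattern contains a capturing group, Regex.Split returns the delimiters as well as the text between them, and when two delimiters are adjacent the text between them is an empty string. The sketch below reproduces this with a cut-down character class, which is an assumption made for brevity (the question's class is larger), and shows the simplest filter.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class SplitDemo
{
    // Splits on space, semicolon and double quote, keeping the delimiters
    // because the character class sits inside a capturing group.
    public static string[] Tokenize(string input, bool dropEmpty)
    {
        string[] parts = Regex.Split(input, @"([ ;""])");

        // Adjacent delimiters, or a delimiter at either end of the input,
        // leave "" between them.
        return dropEmpty
            ? parts.Where(p => p.Length > 0).ToArray()
            : parts;
    }
}
```

For the input ; "social life" the unfiltered array has eleven entries, four of them empty; the filtered one has seven. The filter is one extra linear pass; if that has to be avoided, matching tokens as shown above sidesteps the empty entries entirely.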

What RegEx string will find the last (rightmost) group of digits in a string?

I'm looking for a regex that will let me find the rightmost group of digits (if any) embedded in a string. We only care about contiguous digits. We don't care about signs, commas, decimal points, etc. Those, if found, should simply be treated as non-digits, just like a letter.
This is for replacement/incrementing purposes, so we also need to grab everything before and after the detected number in order to reconstruct the string after incrementing the value. In other words, we need a regex with capture groups.
Here are examples of what we are looking for:
"abc123def456ghi" should identify the '456'
"abc123def456ghi789jkl" should identify the '789'
"abc123def" should identify the '123'
"123ghi" should identify the '123'
"abc123,456ghi" should identify the '456'
"abc-654def" should identify the '654'
"abcdef" shouldn't return any match
As an example of what we want: starting with the name 'Item 4-1a', we would extract the '1', with everything before it being the prefix and everything after it being the suffix. Using that, we can generate the values 'Item 4-2a', 'Item 4-3a' and 'Item 4-4a' in a code loop.
If I were looking for the first set, this would be easy. I'd just find the first contiguous block of 0 or more non-digits for the prefix, then the block of 1 or more contiguous digits for the number, and everything else to the end would be the suffix.
The issue I'm having is how to define the prefix as including all numbers (if any) except the last set. Everything I try for the prefix keeps swallowing that last set, even when I've tried anchoring it to the end by basically reversing the above.

How about:
^(.*?)(\d+)(\D*)$
Then increment the second group and concatenate all three.
Explanation:
^ : beginning of string
( : start of 1st capture group
.*? : any number of any char, non-greedy
) : end group
( : start of 2nd capture group
\d+ : one or more digits
) : end group
( : start of 3rd capture group
\D* : any number of non digit char
) : end group
$ : end of string
The first capture group will match all characters up to the first digit of the last group of digits before the end of the string.
Or, if you can use named groups:
^(?<prefix>.*?)(?<number>\d+)(?<suffix>\D*)$
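To show how the named groups feed the loop described in the question, here is a minimal sketch. The class and method names are invented, and it assumes the number fits in a long and that losing leading zeros ("007" becomes 8) is acceptable.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NameSequence
{
    private static readonly Regex LastNumber =
        new Regex(@"^(?<prefix>.*?)(?<number>\d+)(?<suffix>\D*)$");

    // Yields the next `count` names after `name`, incrementing the last
    // group of digits. Yields nothing if `name` contains no digits.
    public static IEnumerable<string> Next(string name, int count)
    {
        Match m = LastNumber.Match(name);
        if (!m.Success)
            yield break;

        string prefix = m.Groups["prefix"].Value;
        string suffix = m.Groups["suffix"].Value;
        long start = long.Parse(m.Groups["number"].Value);

        for (int i = 1; i <= count; i++)
            yield return prefix + (start + i) + suffix;
    }
}
```

For example, NameSequence.Next("Item 4-1a", 3) produces 'Item 4-2a', 'Item 4-3a' and 'Item 4-4a'.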

Try this regex:
(\d+)(?!.*\d)
Explanation:
(\d+) # One or more digits.
(?!.*\d) # (zero-width) Negative lookahead: no digit may appear anywhere after this point.
EDIT (off-topic for the question): this answer is incorrect, but the question has already been answered in other posts, so rather than delete it I will show another use for the same regex. In Perl, for example, it can be used like this to get the same result as in C# (incrementing the last number):
s/(\d+)(?!.*\d)/$1 + 1/e;
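A rough C# counterpart of that Perl one-liner uses Regex.Replace with a MatchEvaluator lambda, which plays the role of the /e flag. This is a sketch with an invented class name; it assumes a single-line input (without RegexOptions.Singleline the dot does not cross newlines) and a number that fits in a long.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LastNumberIncrementer
{
    public static string Increment(string s)
    {
        // \d+(?!.*\d) matches the last run of digits; the lambda computes
        // the replacement text for that match.
        return Regex.Replace(
            s,
            @"\d+(?!.*\d)",
            m => (long.Parse(m.Value) + 1).ToString());
    }
}
```

No prefix or suffix groups are needed here, because Regex.Replace leaves the rest of the string untouched.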

You can also try a slightly simpler version:
(\d+)[^\d]*$

This should do it:
Regex regexObj = new Regex(@"
# Grab last set of digits, prefix and suffix.
^ # Anchor to start of string.
(.*) # $1: Stuff before last set of digits.
(?<!\d) # Anchor start of last set of digits.
(\d+) # $2: Last set of one or more digits.
(\D*) # $3: Zero or more trailing non digits.
$ # Anchor to end of string.
", RegexOptions.IgnorePatternWhitespace);

What about not using Regex? Here's a code snippet (for a console app):
string[] myStringArray = new string[] { "abc123def456ghi", "abc123def456ghi789jkl", "abc123def", "123ghi", "abcdef", "abc-654def", "abc123,456ghi" };
char[] numberSet = new char[] { '0', '1', '2', '3', '4', '5', '6', '7', '8', '9' };
foreach (string myString in myStringArray)
{
    Console.WriteLine("your string - {0}", myString);
    // Index of the last digit in the string, i.e. the end of the last number.
    int end = myString.LastIndexOfAny(numberSet);
    if (end == -1)
    {
        Console.WriteLine("no number");
        continue;
    }
    // Walk backwards while the previous character is still a digit,
    // so that any non-digit (letter, comma, sign, ...) ends the number.
    int start = end;
    while (start > 0 && myString[start - 1] >= '0' && myString[start - 1] <= '9')
        start--;
    Console.WriteLine("prefix - {0}", myString.Substring(0, start));
    Console.WriteLine("number - {0}", myString.Substring(start, end - start + 1));
    Console.WriteLine("suffix - {0}", myString.Substring(end + 1));
    Console.WriteLine("_________________");
}
Console.Read();
