Replace Single WhiteSpace without Replacing Multiple WhiteSpace

Replace Single WhiteSpace without Replacing Multiple WhiteSpace - c#

I have a string in the format:
abc def ghi xyz
I would like to end with it in format:
abcdefghi xyz
What is the best way to do this? In this particular case, I could just strip off the last three characters, remove spaces, and then add them back at the end, but this won't work for cases in which the multiple spaces are in the middle of the string.
In Short, I want to remove all single whitespaces, and then replace all multiple whitespaces with a single. Each of those steps is easy enough by itself, but combining them seems a bit less straightforward.
I'm willing to use regular expressions, but I would prefer not to.

This approach uses regular expressions but hopefully in a way that's still fairly readable. First, split your input string on multiple spaces
var pattern = #" +"; // match two or more spaces
var groups = Regex.Split(input, pattern);
Next, remove the (individual) spaces from each token:
var tokens = groups.Select(group => group.Replace(" ", String.Empty));
Finally, join your tokens with single spaces
var result = String.Join(' ', tokens.ToArray());
This example uses a literal space character rather than 'whitespace' (which includes tabs, linefeeds, etc.) - substitute \s for ' ' if you need to split on multiple whitespace characters rather than actual spaces.

Well, Regular Expressions would probably be the fastest here, but you could implement some algorithm that uses a lookahead for single spaces and then replaces multiple spaces in a loop:
// Replace all single whitespaces
for (int i = 0; i < sourceString.Length; i++)
{
if (sourceString[i] = ' ')
{
if (i < sourceString.Length - 1 && sourceString[i+1] != ' ')
sourceString = sourceString.Delete(i);
}
}
// Replace multiple whitespaces
while (sourceString.Contains(" ")) // Two spaces here!
sourceString = sourceString.Replace(" ", " ");
But hey, that code is pretty ugly and slow compared to a proper regular expression...

For a Non-REGEX option you can use:
string str = "abc def ghi xyz";
var result = str.Split(); //This will remove single spaces from the result
StringBuilder sb = new StringBuilder();
bool ifMultipleSpacesFound = false;
for (int i = 0; i < result.Length;i++)
{
if (!String.IsNullOrWhiteSpace(result[i]))
{
sb.Append(result[i]);
ifMultipleSpacesFound = false;
}
else
{
if (!ifMultipleSpacesFound)
{
ifMultipleSpacesFound = true;
sb.Append(" ");
}
}
}
string output = sb.ToString();
The output would be:
output = "abcdefghi xyz"

Here's an approach which uses some fairly subtle logic:
public static string RemoveUnwantedSpaces(string text)
{
var sb = new StringBuilder();
char lhs = '\0';
char mid = '\0';
foreach (char rhs in text)
{
if (rhs != ' ' || (mid == ' ' && lhs != ' '))
sb.Append(rhs);
lhs = mid;
mid = rhs;
}
return sb.ToString().Trim();
}
How it works:
We will examine each possible three-character subsequence linearly across the string (in a kind of three-character sliding window). These three characters will be represented, in order, by the variables lhs, mid and rhs.
For each rhs character in the string:
If it's not a space we should output it.
If it is a space, and the previous character was also space but the one before that isn't, then this is the second in a sequence of at least two spaces, and therefore we should output one space.
Otherwise, don't output a space because this is either the first or the third (or later) space in a sequence of two or more spaces and in either case we don't want to output a space: If this happens to be the first in a sequence of two or more spaces, a space will be output when the second space comes along. If this is the third or later, we've already output a space for it.
The subtlety here is that I've avoided special casing the beginning of the sequence by initialising the lhs and mid variables with non-space characters. It doesn't matter what those values are, as long as they are not spaces, but I made them \0 to indicate that they are special values.

After second thought here is one line regex solution:
Regex.Replace("abc def ghi xyz", "( )( )*([^ ])", "$2$3")
the result of this is "abcdefghi xyz"
ORIGINAL ANSWER:
Two lines of code regex solution:
var tmp = Regex.Replace("abc def ghi xyz", "( )([^ ])", "$2")
tmp is "abcdefghi xyz"
then:
var result = Regex.Replace(tmp, "( )+", " ");
result is "abcdefghi xyz"
Explanation:
The first line of code removes single whitespaces and removes one whitespace for multiple whitespaces (so there are 3 spaces in tmp between letters i and x).
The second line just replace multiple whitespaces with one.
In-depth explanation of first line:
We match input string to regex that matches one space and non-space next to it. We also put this two characters in separate groups (we use ( ) for anonymous grouping).
So for "abc def ghi xyz" string we have this matches and groups:
match: " d" group1: " " group2: "d"
match: " g" group1: " " group2: "g"
match: " x" group1: " " group2: "x"
We are using substitution syntax for Regex.Replace method to replace match with the content of second group (which is non-whitespace character)

Related

C# Regex split() without removing the split condition character

I am splitting a string with regex using its Split() method.
var splitRegex = new Regex(#"[\s|{]");
string input = "/Tests/ShowMessage { 'Text': 'foo' }";
//second version of the input:
//string input = "/Tests/ShowMessage{ 'Text': 'foo' }";
string[] splittedText = splitRegex.Split(input, 2);
The string is just a sample pattern of the input. There are two different structures of input, once with a space before the { or without the space. I want to split the input on the { bracket in order to get the following result:
/Tests/ShowMessage
{ 'Text': 'foo' }
If there is a space, the string gets splitted there (space gets removed) and i get my desired result. But if there isnt a space i split the string on the {, so the { gets removed, what i dont want though. How can i use Regex.Split() without removing the split condition character?

The square brackets create a character set, so you want it to match exactly one of those inner characters. For your desire start off by removing them.
So to match it a random count of whitespaces you have to add *, the result is this one\s*.
\s is a whitespace
* means zero-or-more
That you don't remove the split condition character, you can use lookahead assertion (?=...).
(?=...) or (?!...) is a lookahead assertion
The combined Regex looks like this: \s*(?={)
This is a really good and detailed documentation of all the different Regex parts, you might have a look at it. Furthermore you can test your Regex easy and for free here.

In order to not include the curly brace in the match you can put it into a look ahead
\s*(?={)
That will match any number of white spaces up to the position before a open curly brace.

You can use regular string split, on "{" and trim the spaces off:
var bits = "/Tests/ShowMessage { 'Text': 'foo' }".Split("{", StringSplitOptions.RemoveEmptyEntries);
bits[0] = bits[0].TrimEnd();
bits[1] = "{" + bits[1];
If you want to use the RegEx route, you can add the { back if you change the regex a bit:
var splitRegex = new Regex(#"\s*{");
string input = "/Tests/ShowMessage { 'Text': 'foo' }";
//second version of the input:
//string input = "/Tests/ShowMessage{ 'Text': 'foo' }";
string[] splittedText = splitRegex.Split(input, 2);
splittedText[1] = "{" + splittedText[1];
It means "split at occurrence of (zero or more whitespace followed by {)" - so the split operation nukes your spaces (you want), and your { (you don't want) but you can put the { back with certainty that it will mean you get what you want

var splitedList = srt.Text.Replace(".", ".#").Replace("?", "?#").Replace("!", "!#").Split(new[] { "#"}, StringSplitOptions.RemoveEmptyEntries).ToList();
This will split text for .!? and will not remove condition chars. For better result just replace # with some uniq char. Like this one for example '®' That is all. Simple as it is. No regex.split which is slow and difficult due to many different task criterias, etc...
passing-> "Hello. I'am dev!"
result (split condition character exist )
"Hello."
"I'am dev!"

Replace one character but not two in a string

I want to replace single occurrences of a character but not two in a string using C#.
For example, I want to replace & by an empty string but not when the ocurrence is &&. Another example, a&b&&c would become ab&&c after the replacement.
If I use a regex like &[^&], it will also match the character after the & and I don't want to replace it.
Another solution I found is to iterate over the string characters.
Do you know a cleaner solution to do that?

To only match one & (not preceded or followed by &), use look-arounds (?<!&) and (?!&):
(?<!&)&(?!&)
See regex demo
You tried to use a negated character class that still matches a character, and you need to use a look-ahead/look-behind to just check for some character absence/presence, without consuming it.
See regular-expressions.info:
Negative lookahead is indispensable if you want to match something not followed by something else. When explaining character classes, this tutorial explained why you cannot use a negated character class to match a q not followed by a u. Negative lookahead provides the solution: q(?!u).
Lookbehind has the same effect, but works backwards. It tells the regex engine to temporarily step backwards in the string, to check if the text inside the lookbehind can be matched there. (?<!a)b matches a "b" that is not preceded by an "a", using negative lookbehind. It doesn't match cab, but matches the b (and only the b) in bed or debt.

You can match both & and && (or any number of repetition) and only replace the single one with an empty string:
str = Regex.Replace(str, "&+", m => m.Value.Length == 1 ? "" : m.Value);

You can use this regex: #"(?<!&)&(?!&)"
var str = Regex.Replace("a&b&&c", #"(?<!&)&(?!&)", "");
Console.WriteLine(str); // ab&&c

You can go with this:
public static string replacement(string oldString, char charToRemove)
{
string newString = "";
bool found = false;
foreach (char c in oldString)
{
if (c == charToRemove && !found)
{
found = true;
continue;
}
newString += c;
}
return newString;
}
Which is as generic as possible

I would use something like this, which IMO should be better than using Regex:
public static class StringExtensions
{
public static string ReplaceFirst(this string source, char oldChar, char newChar)
{
if (string.IsNullOrEmpty(source)) return source;
int index = source.IndexOf(oldChar);
if (index < 0) return source;
var chars = source.ToCharArray();
chars[index] = newChar;
return new string(chars);
}
}

I'll contribute to this statement from the comments:
in this case, only the substring with odd number of '&' will be replaced by all the "&" except the last "&" . "&&&" would be "&&" and "&&&&" would be "&&&&"
This is a pretty neat solution using balancing groups (though I wouldn't call it particularly clean nor easy to read).
Code:
string str = "11&222&&333&&&44444&&&&55&&&&&";
str = Regex.Replace(str, "&((?:(?<2>&)(?<-2>&)?)*)", "$1$2");
Output:
11222&&333&&44444&&&&55&&&&
ideone demo
It always matches the first & (not captured).
If it's followed by an even number of &, they're matched and stored in $1. The second group is captured by the first of the pair, but then it's substracted by the second.
However, if there's there's an odd number of of &, the optional group (?<-2>&)? does not match, and the group is not substracted. Then, $2 will capture an extra &
For example, matching the subject "&&&&", the first char is consumed and it isn't captured (1). The second and third chars are matched, but $2 is substracted (2). For the last char, $2 is captured (3). The last 3 chars were stored in $1, and there's an extra & in $2.
Then, the substitution "$1$2" == "&&&&".

Detect Two Consecutive Single Quotes Inside Single Quotes

I'm struggling to get this regex pattern exactly right, and am open to other options outside of regex if someone has a better alternative.
The situation:
I'm basically looking to parse a T-SQL "in" clause against a text column in C#. So, I need to take a string value like this:
"'don''t', 'do', 'anything', 'stupid'"
And interpret that as a list of values (I'll take care of the double single quotes later):
"don''t"
"do"
"anything"
"stupid"
I have a regex that works for most cases, but I'm struggling to generalize it to the point where it will accept any character OR a doubled-up single quote inside my group: (?:')([a-z0-9\s(?:'(?='))]+)(?:')[,\w]*
I'm fairly experienced with regexes, but have rarely, if ever, found a need for look-arounds (so downgrade my assessment of my regex experience accordingly).
So, to put this another way, I'm wanting to take a string of comma-delimited values, each enclosed in single quotes but can contain doubled single quotes, and output each such value.
EDIT
Here's a non-working example with my current regex (my problem is I need to handle all characters in my grouping and stop when I encounter a single quote not followed by a second single quote):
"'don''t', 'do?', 'anything!', '#stupid$'"

If you still think about a regex-based solution, you can use the following regex:
'(?:''|[^'])*'
Or an "un-rolled" version suggested by #sln:
'[^']*(?:''[^']*)*'
See demo
It is fairly simple, it captures double single quotation marks OR anything that is not a single quotation mark. No need using any look-behinds or look-aheads. It does not take care of any escaped entities, but I do not see this requirement in your question.
Moreover, this regex will return matches that are easy to access and deal with:
var text = "'don''t', 'do', 'anything', 'stupid'";
var re = new Regex(#"'[^']*(?:''[^']*)*'"); // Updated thanks to #sln, previous (#"'(?:''|[^'])*'");
var match_values = re.Matches(text).Cast<Match>().Select(p => p.Value).ToList();
Output:

If you want to use the Capture Collection feature, you can grab them all in a
single pass.
# #"""\s*(?:'([^']*(?:''[^']*)*)'\s*(?:,\s*|(?="")))+"""
"
\s*
(?:
'
( # (1 start)
[^']*
(?:
'' [^']*
)*
) # (1 end)
'
\s*
(?:
, \s*
| (?= " )
)
)+
"
C# code:
string strSrc = "\"'don''t', 'do', 'anything', 'stupid'\"";
Regex rx = new Regex(#"""\s*(?:'([^']*(?:''[^']*)*)'\s*(?:,\s*|(?="")))+""");
Match srcMatch = rx.Match(strSrc);
if (srcMatch.Success)
{
CaptureCollection cc = srcMatch.Groups[1].Captures;
for (int i = 0; i < cc.Count; i++)
Console.WriteLine("{0} = '{1}'", i, cc[i].Value);
}
Output:
0 = 'don''t'
1 = 'do'
2 = 'anything'
3 = 'stupid'
Press any key to continue . . .

Why don't you split on ', ':
Regex regex = new Regex(#"'\s*,\s*'");
string[] substrings = regex.Split(str);
And then take care of the extra single quotes by Trimming

Looks to me like you're over-thinking the problem. A quoted string with an escaped quote looks just like two strings without escaped quotes, one right after the other (not even spaces between them).
(?:'[^']*')+
Of course, you'll have to remove the enclosing quotes, but you probably had to do some post-processing anyway, to unescape the escaped quotes.
Also note that I'm not trying to validate the input or work around possible errors; for example, I don't bother matching the commas between the strings. If the input is well formed, this regex should be all you need.

In the interest of maintainability, I decided against a regex and followed the advice of using a state machine. Here's the crux of my implementation:
string currentTerm = string.Empty;
State currentState = State.BetweenTerms;
foreach (char c in valueToParse)
{
switch (currentState)
{
// if between terms, only need to do something if we encounter a single quote, signalling to start a new term
// encloser is client-specified char to look for (e.g. ')
case State.BetweenTerms:
if (c == encloser)
{
currentState = State.InTerm;
}
break;
case State.InTerm:
if (c == encloser)
{
if (valueToParse.Length > index + 1 && valueToParse[index + 1] == encloser && valueToParse.Length > index + 2)
{
// if next character is also encloser then add it and move on
currentTerm += c;
}
else if (currentTerm.Length > 0 && currentTerm[currentTerm.Length - 1] != encloser)
{
// on an encloser and didn't just add encloser, so we are done
// converterFunc is a client-specified Func<string,T> to return terms in the specified type (to allow for converting to int, for example)
yield return converterFunc(currentTerm);
currentTerm = string.Empty;
currentState = State.BetweenTerms;
}
}
else
{
currentTerm += c;
}
break;
}
index++;
}
if (currentTerm.Length > 0)
{
yield return converterFunc(currentTerm);
}

How to remove only certain substrings from a string?

Using C#, I have a string that is a SQL script containing multiple queries. I want to remove sections of the string that are enclosed in single quotes. I can do this using Regex.Replace, in this manner:
string test = "Only 'together' can we turn him to the 'dark side' of the Force";
test = Regex.Replace(test, "'[^']*'", string.Empty);
Results in: "Only can we turn him to the of the Force"
What I want to do is remove the substrings between quotes EXCEPT for substrings containing a specific substring. For example, using the string above, I want to remove the quoted substrings except for those that contain "dark," such that the resulting string is:
Results in: "Only can we turn him to the 'dark side' of the Force"
How can this be accomplished using Regex.Replace, or perhaps by some other technique? I'm currently trying a solution that involves using Substring(), IndexOf(), and Contains().
Note: I don't care if the single quotes around "dark side" are removed or not, so the result could also be: "Only can we turn him to the dark side of the Force." I say this because a solution using Split() would remove all the single quotes.
Edit: I don't have a solution yet using Substring(), IndexOf(), etc. By "working on," I mean I'm thinking in my head how this can be done. I have no code, which is why I haven't posted any yet. Thanks.
Edit: VKS's solution below works. I wasn't escaping the \b the first attempt which is why it failed. Also, it didn't work unless I included the single quotes around the whole string as well.
test = Regex.Replace(test, "'(?![^']*\\bdark\\b)[^']*'", string.Empty);

'(?![^']*\bdark\b)[^']*'
Try this.See demo.Replace by empty string.You can use lookahead here to check if '' contains a word dark.
https://www.regex101.com/r/rG7gX4/12

While vks's solution works, I'd like to demonstrate a different approach:
string test = "Only 'together' can we turn him to the 'dark side' of the Force";
test = Regex.Replace(test, #"'[^']*'", match => {
if (match.Value.Contains("dark"))
return match.Value;
// You can add more cases here
return string.Empty;
});
Or, if your condition is simple enough:
test = Regex.Replace(test, #"'[^']*'", match => match.Value.Contains("dark")
? match.Value
: string.Empty
);
That is, use a lambda to provide a callback for the replacement. This way, you can run arbitrary logic to replace the string.

some thing like this would work. you can add all strings you want to keep into the excludedStrings array
string test = "Only 'together' can we turn him to the 'dark side' of the Force";
var excludedString = new string[] { "dark side" };
int startIndex = 0;
while ((startIndex = test.IndexOf('\'', startIndex)) >= 0)
{
var endIndex = test.IndexOf('\'', startIndex + 1);
var subString = test.Substring(startIndex, (endIndex - startIndex) + 1);
if (!excludedString.Contains(subString.Replace("'", "")))
{
test = test.Remove(startIndex, (endIndex - startIndex) + 1);
}
else
{
startIndex = endIndex + 1;
}
}

Another method through regex alternation operator |.
#"('[^']*\bdark\b[^']*')|'[^']*'"
Then replace the matched character with $1
DEMO
string str = "Only 'together' can we turn him to the 'dark side' of the Force";
string result = Regex.Replace(str, #"('[^']*\bdark\b[^']*')|'[^']*'", "$1");
Console.WriteLine(result);
IDEONE
Explanation:
(...) called capturing group.
'[^']*\bdark\b[^']*' would match all the single quoted strings which contains the substring dark . [^']* matches any character but not of ', zero or more times.
('[^']*\bdark\b[^']*'), because the regex is within a capturing group, all the matched characters are stored inside the group index 1.
| Next comes the regex alternation operator.
'[^']*' Now this matches all the remaining (except the one contains dark) single quoted strings. Note that this won't match the single quoted string which contains the substring dark because we already matched those strings with the pattern exists before to the | alternation operator.
Finally replacing all the matched characters with the chars inside group index 1 will give you the desired output.

I made this attempt that I think you were thinking about (some solution using split, Contain, ... without regex)
string test = "Only 'together' can we turn him to the 'dark side' of the Force";
string[] separated = test.Split('\'');
string result = "";
for (int i = 0; i < separated.Length; i++)
{
string str = separated[i];
str = str.Trim(); //trim the tailing spaces
if (i % 2 == 0 || str.Contains("dark")) // you can expand your condition
{
result += str+" "; // add space after each added string
}
}
result = result.Trim(); //trim the tailing space again

Go to each white space in a string. C#

Is it possible to pass over a string, finding the white spaces?
For example a data set of:
string myString = "aa bbb cccc dd";
How could I loop through and detect each white space, and manipulate that space?
I need to do this in the most effecient way possible.
Thanks.
UPDATE:
I need to manipulate the space by increasing the white space from an integer value. So for instance increase the space to have 3 white spaces instead of one. I'd like to make it go through each white space in one loop, any method of doing this already in .NET? By white space I mean a ' '.

You can use the Regex.Replace method. This will replace any group of white space character with a dash:
myString = Regex.Replace(myString, "(\s+)", m => "-");
Update:
This will find groups of space characters and replace with the tripple amount of spaces:
myString = Regex.Replace(
myString,
"( +)",
m => new String(' ', m.Groups[1].Value.Length * 3)
);
However, that's a bit too simple to make use of regular expressions. You can do the same with a regular replace:
myString = myString.Replace(" ", " ");
This will replace each space intead of replace groups of spaces, but the regular replace is much simpler than Regex.Replace, so it should still be at least as fast, and the code is simpler.

If you want to replace all whitespace in one swoop, you can do:
// changes all strings to dashes
myString.Replace(' ', '-');
If you want to go case by case (that is, not just a mass replace), you can loop through IndexOf():
int pos = myString.IndexOf(' ');
while (pos >= 0)
{
// do whatever you want with myString # pos
// find next
pos = myString.IndexOf(' ', pos + 1);
}
UPDATE
As per your update, you could replace single spaces with the number of spaces specified by a variable (such as numSpaces) as follows:
myString.Replace(" ", new String(' ', numSpaces));

If you just want to replace all spaces with some other character:
myString = myString.Replace(' ', 'x');
If you need the possibility of doing something different to each:
foreach(char c in myString)
{
if (c == ' ')
{
// do something
}
}
Edit:
Per your comment clarifying your question:
To change each space to three spaces, you can do this:
myString = myString.Replace(" ", " ");
However note that this doesn't take into account instances where your input string already has two or more spaces. If that is a possibility you will want to use a regex.

Depending on what you're tring to do:
for(int k = 0; k < myString.Length; k++)
{
if(myString[k].IsWhiteSpace())
{
// do something with it
}
}
The above is a single pass through the string, so it's O(n). You can't really get more efficient that that.
However, if you want to manipulate the original string your best bet is to Use a StringBuilder to process the changes:
StringBuilder sb = new StringBuilder(myString);
for(int k = 0; k < myString.Length; k++)
{
if(myString[k].IsWhiteSpace())
{
// do something with sb
}
}
Finally, don't forget about Regular Expressions. It may not always be the most efficient method in terms of code run-time complexity but as far as efficiency of coding it may be a good trade-off.
For instance, here's a way to match all white spaces:
var rex = new System.Text.RegularExpressions.Regex("[^\\s](\\s+)[^\\s]");
var m = rex.Match(myString);
while(m.Success)
{
// process the match here..
m.NextMatch();
}
And here's a way to replace all white spaces with an arbitrary string:
var rex = new System.Text.RegularExpressions.Regex("\\s+");
String replacement = "[white_space]";
// replaces all occurrences of white space with the string [white_space]
String result = rex.Replace(myString, replacement);

Use string.Replace().
string newString = myString.Replace(" ", " ");

LINQ query below returns a set of anonymous type items with two properties - "sybmol" represents a white space character, and "index" - index in the input sequence. After that you have all whitespace characters and a position in the input sequence, now you can do what you want with this.
string myString = "aa bbb cccc dd";
var res = myString.Select((c, i) => new { symbol = c, index = i })
.Where(c => Char.IsWhiteSpace(c.symbol));
EDIT: For educational purposes below is implementation you are looking for, but obviously in real system use built in string constructor and String.Replace() as shown in other answers
string myString = "aa bbb cccc dd";
var result = this.GetCharacters(myString, 5);
string output = new string(result.ToArray());
public IEnumerable<char> GetCharacters(string input, int coeff)
{
foreach (char c in input)
{
if (Char.IsWhiteSpace(c))
{
int counter = coeff;
while (counter-- > 0)
{
yield return c;
}
}
else
{
yield return c;
}
}
}

var result = new StringBuilder();
foreach(Char c in myString)
{
if (Char.IsWhiteSpace(c))
{
// you can do what you wish here. strings are immutable, so you can only make a copy with the results you want... hence the "result" var.
result.Append('_'); // for example, replace space with _
}
else result.Append(c);
}
myString = result.ToString();

If you want to replace the white space with, e.g. '_', you can using String.Replace.
Example:
string myString = "aa bbb cccc dd";
string newString = myString.Replace(" ", "_"); // gives aa_bbb_cccc_dd

In case you want to left/right justify your string
int N=10;
string newstring = String.Join(
"",
myString.Split(' ').Select(s=>s.PadRight(N-s.Length)));

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Replace Single WhiteSpace without Replacing Multiple WhiteSpace - c#

Related

C# Regex split() without removing the split condition character

Replace one character but not two in a string

Detect Two Consecutive Single Quotes Inside Single Quotes

How to remove only certain substrings from a string?

Go to each white space in a string. C#

Categories

Resources