I need to remove all additional spaces in a string.
I use regex for matching strings and matched strings i replace with some others.
For better understanding please see examples below:
3 input strings:
Hello, how are you?
Hello , how are you?
Hello , how are you ?
This are 3 strings that should match by one pattern-regex.
It looks something like this:
Hello\s*,\s+how\s+are\s+you\s*?
It works fine but there is a perfomance problem.
If I have a lot of patterns (~20k) and try to execute each pattern it runs very slow (3-5 minutes).
Maybe there is better way for doing this?
for example use some 3d-party libs?
UPD: Folks, this question is not about how to do this. It's about how to do this with best perfomance. :)
Let me explain more detailed. The main goal is tokenize text. (replace some token with special symbols)
For example I have a token "nice try".
Then I input text "this is nice try".
result: "this is #tokenizedtext#" where #tokenizedtext# some special symbols. It doesen't matter in this case.
Next I have string "Mike said it was a nice try".
result should be "Mike said it was a #tokenizedtext#".
I think the main idea is clear.
So I can have a lot of tokens. When I process it I convert my token from "nice try" to pattern "nice\s+try". and try to replace with this pattern input text.
It works fine. But if in tokens there is more spaces and there is also punctuation then my regexes became bigger and works very slow.
Do you have some suggestions (technical or logic) for solving this problem?
I can suggest a few solutions.
First of all, avoid the static Regex method. Create an instance of it (and store it, don't call the constructor for each replacement!) and, if possible, use RegexOptions.Compiled. It should improve your performance.
Second, you can try to review your pattern. I'll do some profiling, but I'm currently undecisive between:
#"(?<=\s)\s+"
With replacement being an empty string or:
#"\s+"
With a space as a replacement. You can try this code, in the meanwhile:
var s = "Hello , how are you?";
var pattern = #"\s+";
var regex = new Regex(pattern, RegexOptions.Compiled);
var replaced = regex.Replace(s, " ");
EDIT: After having done some measurement, the second pattern seems to be faster. I'm editing my sample to adapt it.
EDIT 2: I've written an unsafe method. It's much faster than the other ones presented here, including the Regex ones, but, as the word itself says, it's unsafe. I don't think that there's any problem with the code I've written but I may be wrong -- So please, check it again and again in case there's a bug in the method.
static unsafe string TrimInternal(string input)
{
var length = input.Length;
var array = stackalloc char[length];
fixed (char* fix = input)
{
var ptr = fix;
var counter = 0;
var lastWasSpace = false;
while (*ptr != '\x0')
{
//Current char is a space?
var isSpace = *ptr == ' ';
//If it's a space but the last one wasn't
//Or if it's not a space
if (isSpace && !lastWasSpace || !isSpace)
//Write into the result array
array[counter++] = *ptr;
//The last character (before the next loop) was a space
lastWasSpace = isSpace;
//Increase the pointer
ptr++;
}
return new string(array, 0, counter);
}
}
Usage (compile with /unsafe):
var s = TrimInternal("Hello , how are you?");
Profiling made in Release build, optimizations on, 1000000 iterations:
My above solution with Regex: 00:00:03.2130121
The unsafe solution: 00:00:00.2063467
This might work for you. It should be pretty fast. Note that it also removes spaces at the end of the string; that might not be what you want...
using System;
namespace Demo
{
class Program
{
static void Main(string[] args)
{
Console.WriteLine(">{0}<", RemoveExtraSpaces("Hello, how are you?"));
Console.WriteLine(">{0}<", RemoveExtraSpaces("Hello , how are you?"));
Console.WriteLine(">{0}<", RemoveExtraSpaces("Hello , how are you ?"));
}
public static string RemoveExtraSpaces(string text)
{
var buffer = new char[text.Length];
bool isSpaced = false;
int n = 0;
foreach (char c in text)
{
if (c == ' ')
{
isSpaced = true;
}
else
{
if (isSpaced)
{
if ((c != ',') && (c != '?'))
{
buffer[n++] = ' ';
}
isSpaced = false;
}
buffer[n++] = c;
}
}
return new string(buffer, 0, n);
}
}
}
Something of my own :
find all the position of WhiteSpacechar in string;
private static IEnumerable<int> GetWhiteSpacePos(string input)
{
int iPos = -1;
while ((iPos = input.IndexOf(" ", iPos + 1, StringComparison.Ordinal)) > -1)
{
yield return iPos;
}
}
Remove all whitespace that are in in sequence Returned from GetWhiteSpacePos
string original_string = "Hello , how are you ?";
var poss = GetWhiteSpacePos(original_string).ToList();
int startPos;
int endPos;
StringBuilder builder = new StringBuilder(original_string);
for (int i = poss.Count -1; i > 1; i--)
{
endPos = poss[i];
while ((poss[i] == poss[i - 1] + 1) && i > 1)
{
i--;
}
startPos = poss[i];
if (endPos - startPos > 1)
{
builder.Remove(startPos, endPos - startPos);
}
}
string new_string = builder.ToString();
You are using a very complex regex..simplify the regex and that would definitely increasre the performance
Use \s+ and replace it with a single space
Well, these kind of problems really trouble us. Use this code, and I'm sure you're getting the result for what you've asked. This command removes any extra white space between any string.
cleanString= Regex.Replace(originalString, #"\s", " ");
Hope thar works for you. Thanks.
And since this is a single Instruction. It will utilize less CPU resource and hence less CPU time, which ultimately increases your performance. Therefore A/C to me this method works the best when compared in terms of performance.
if its just a matter of SPACE;
try this
Source : http://www.codeproject.com/Articles/10890/Fastest-C-Case-Insenstive-String-Replace
private static string ReplaceEx(string original,
string pattern, string replacement)
{
int count, position0, position1;
count = position0 = position1 = 0;
string upperString = original.ToUpper();
string upperPattern = pattern.ToUpper();
int inc = (original.Length / pattern.Length) *
(replacement.Length - pattern.Length);
char[] chars = new char[original.Length + Math.Max(0, inc)];
while ((position1 = upperString.IndexOf(upperPattern,
position0)) != -1)
{
for (int i = position0; i < position1; ++i)
chars[count++] = original[i];
for (int i = 0; i < replacement.Length; ++i)
chars[count++] = replacement[i];
position0 = position1 + pattern.Length;
}
if (position0 == 0) return original;
for (int i = position0; i < original.Length; ++i)
chars[count++] = original[i];
return new string(chars, 0, count);
}
Usage:
string original_string = "Hello , how are you ?";
while (original_string.Contains(" "))
{
original_string = ReplaceEx(original_string, " ", " ");
}
Replacing the regex way:
string resultString = null;
try {
resultString = Regex.Replace(subjectString, #"\s+", " ", RegexOption.Compiled);
} catch (ArgumentException ex) {
// Syntax error in the regular expression
}
Related
I am a complete newbie when it comes to Regular Expressions, and was wondering if somebody could help me out. I'm not sure if using a regEx is the correct approach here, so please feel free to chime in if you have a better idea. (I will be looping thru many strings).
Basically, I'd like to find/replace on a string, wrapping the matches with {} and keeping the original case of the string.
Example:
Source: "The CAT sat on the mat."
Find/Replace: "cat"
Result: "The {CAT} sat on the mat."
I would like the find/replace to work on only the first occurance, and I also need to know whether the find/replace did indeed match or not.
I hope I've explained things clearly enough.
Thank you.
Regex theRegex =
new Regex("(" + Regex.Escape(FindReplace) + ")", RegexOptions.IgnoreCase);
theRegex.Replace(Source, "{$1}", 1);
If you want word boundary tolerance:
Regex theRegex =
(#"([\W_])(" + Regex.Escape(FindReplace) + #")([\W_])", RegexOptions.IgnoreCase)
theRegex.Replace(str, "$1{$2}$3", 1)
If you will be looping through many strings, then perhaps Regex might not be the best idea - it's a great tool, but not the fastest.
Here's a sample code that would also work:
var str = "The Cat ate a mouse";
var search = "cat";
var index = str.IndexOf(search, StringComparison.CurrentCultureIgnoreCase);
if (index == -1)
throw new Exception("String not found"); //or do something else in this case here
var newStr = str.Substring(0, index) + "{" + str.Substring(index, search.Length) + "}" + str.Substring(index + search.Length);
EDIT:
As noted in the comments, the above code has some issues.
So I decided to try and find a way to make it work without using Regex. Don't get me wrong, I love Regex as much as the next guy. I did this mostly out of curiosity. ;)
Here's what I came upon:
public static class StringExtendsionsMethods
{
public static int IndexOfUsingBoundary(this String s, String word)
{
var firstLetter = word[0].ToString();
StringBuilder sb = new StringBuilder();
bool previousWasLetterOrDigit = false;
int i = 0;
while (i < s.Length - word.Length + 1)
{
bool wordFound = false;
char c = s[i];
if (c.ToString().Equals(firstLetter, StringComparison.CurrentCultureIgnoreCase))
if (!previousWasLetterOrDigit)
if (s.Substring(i, word.Length).Equals(word, StringComparison.CurrentCultureIgnoreCase))
{
wordFound = true;
bool wholeWordFound = true;
if (s.Length > i + word.Length)
{
if (Char.IsLetterOrDigit(s[i + word.Length]))
wholeWordFound = false;
}
if (wholeWordFound)
return i;
sb.Append(word);
i += word.Length;
}
if (!wordFound)
{
previousWasLetterOrDigit = Char.IsLetterOrDigit(c);
sb.Append(c);
i++;
}
}
return -1;
}
}
But I can't take credit for this! I found this after some Googling here, on StackOverflow and then modified it. ;)
Use this method instead of the standard IndexOf in the above code.
Try this:
class Program
{
const string FindReplace = "cat";
static void Main(string[] args)
{
var input = "The CAT sat on the mat as a cat.";
var result = Regex
.Replace(
input,
"(?<=.*)" + FindReplace + "(?=.*)",
m =>
{
return "{" + m.Value.ToUpper() + "}";
},
RegexOptions.IgnoreCase);
Console.WriteLine(result);
}
}
I'm trying to have a suggestion feature for the search function in my program eg I type janw doe in the search section and it will output NO MATCH - did you mean jane doe? I'm not sure what the problem is, maybe something to do with char/string comparison..I've tried comparing both variables as type char eg char temp -->temp.Contains ...etc but an error appears (char does not contain a definition for Contains). Would love any help on this! 8)
if (found == false)
{
Console.WriteLine("\n\nMATCH NOT FOUND");
int charMatch = 0, charCount = 0;
string[] checkArray = new string[26];
//construction site /////////////////////////////////////////////////////////////////////////////////////////////////////////////
for (int controlLoop = 0; controlLoop < contPeople.Length; controlLoop++)
{
foreach (char i in userContChange)
{
charCount = charCount + 1;
}
for (int i = 0; i < userContChange.Length; )
{
string temp = contPeople[controlLoop].name;
string check=Convert.ToString(userContChange[i]);
if (temp.Contains(check))
{
charMatch = charMatch + 1;
}
}
int half = charCount / 2;
if (charMatch >= half)
{
checkArray[controlLoop] = contPeople[controlLoop].name;
}
}
///////////////////////////////////////////////////////////////////////////////////////////////////////////
Console.WriteLine("Did you mean: ");
for (int a = 0; a < checkArray.Length; a++)
{
Console.WriteLine(checkArray[a]);
}
///////////////////////////////////////////////////////////////////////////////////////////////////
A string is made up of many characters. A character is a primitive, likewise, it doesn't "contain" any other items. A string is basically an array of characters.
For comparing string and characters:
char a = 'A';
String alan = "Alan";
Debug.Assert(alan[0] == a);
Or if you have a single digit string.. I suppose
char a = 'A';
String alan = "A";
Debug.Assert(alan == a.ToString());
All of these asserts are true
But, the main reason I wanted to comment on your question, is to suggest an alternative approach for suggesting "Did you mean?". There's an algorithm called Levenshtein Distance which calculates the "number of single character edits" required to convert one string to another. It can be used as a measure of how close two strings are. You may want to look into how this algorithm works because it could help you.
Here's an applet that I found which demonstrates: Approximate String Matching with k-differences
Also the wikipedia link Levenshtein distance
Char type cannot have .Contains() because is only 1 char value type.
In your case (if i understand), maybe you need to use .Equals() or the == operator.
Note: for compare String correctly, use .Equals(),
the == operator does not work good in this case because String is reference type.
Hope this help!
char type dosen't have the Contains() method, but you can use iit like this: 'a'.ToString().Contains(...)
if do not consider the performance, another simple way:
var input = "janw doe";
var people = new string[] { "abc", "123", "jane", "jane doe" };
var found = Array.BinarySearch<string>(people, input);//or use FirstOrDefault(), FindIndex, search engine...
if (found < 0)//not found
{
var i = input.ToArray();
var target = "";
//most similar
//target = people.OrderByDescending(p => p.ToArray().Intersect(i).Count()).FirstOrDefault();
//as you code:
foreach (var p in people)
{
var count = p.ToArray().Intersect(i).Count();
if (count > input.Length / 2)
{
target = p;
break;
}
}
if (!string.IsNullOrWhiteSpace(target))
{
Console.WriteLine(target);
}
}
Is there an easy method to insert spaces between the characters of a string? I'm using the below code which takes a string (for example ( UI$.EmployeeHours * UI.DailySalary ) / ( Month ) ) . As this information is getting from an excel sheet, i need to insert [] for each columnname. The issue occurs if user avoids giving spaces after each paranthesis as well as an operator. AnyOne to help?
text = e.Expression.Split(Splitter);
string expressionString = null;
for (int temp = 0; temp < text.Length; temp++)
{
string str = null;
str = text[temp];
if (str.Length != 1 && str != "")
{
expressionString = expressionString + "[" + text[temp].TrimEnd() + "]";
}
else
expressionString = expressionString + str;
}
User might be inputing something like (UI$.SlNo-UI+UI$.Task)-(UI$.Responsible_Person*UI$.StartDate) while my desired output is ( [UI$.SlNo-UI] + [UI$.Task] ) - ([UI$.Responsible_Person] * [UI$.StartDate] )
Here is a short way to insert spaces after every single character in a string (which I know isn't exactly what you were asking for):
var withSpaces = withoutSpaces.Aggregate(string.Empty, (c, i) => c + i + ' ');
This generates a string the same as the first, except with a space after each character (including the last character).
You can do that with regular expressions:
using System.Text.RegularExpressions;
class Program {
static void Main() {
string expression = "(UI$.SlNo-UI+UI$.Task)-(UI$.Responsible_Person*UI$.StartDate) ";
string replaced = Regex.Replace(expression, #"([\w\$\.]+)", " [ $1 ] ");
}
}
If you are not familiar with regular expressions this might look rather cryptic, but they are a powerful tool, and worth learning. In case, you may check how regular expressions work, and use a tool like Expresso to test your regular expressions.
Hope this helps...
Here is an algorithm that does not use regular expressions.
//Applies dobule spacing between characters
public static string DoubleSpace(string s)
{
if (string.IsNullOrEmpty(s))
{
return string.Empty;
}
char[] a = s.ToCharArray();
char[] b = new char[ (a.Length * 2) - 1];
int bIndex = 0;
for(int i = 0; i < a.Length; i++)
{
b[bIndex++] = a[i];
//Insert a white space after the char
if(i < (a.Length - 1))
{
b[bIndex++] = ' ';
}
}
return new string(b);
}
Well, you can do this by using Regular expressions, search for specific paterns and add brackets where needed. You could also simply Replace every Paranthesis with the same Paranthesis but with spaces on each end.
I would also advice you to use StringBuilder aswell instead of appending to an existing string (this creates a new string for each manipulation, StringBuilder has a smaller memory footprint when doing this kind of manipulation)
i have a string like this:
some_string = "A simple demo of SMS text messaging.\r\n+CMGW: 3216\r\n\r\nOK\r\n\"
im coming from vb.net and i need to know in c#, if i know the position of CMGW, how do i get "3216" out of there?
i know that my start should be the position of CMGW + 6, but how do i make it stop as soon as it finds "\r" ??
again, my end result should be 3216
thank you!
Find the index of \r from the start of where you're interested in, and use the Substring overload which takes a length:
// Production code: add validation here.
// (Check for each index being -1, meaning "not found")
int cmgwIndex = text.IndexOf("CMGW: ");
// Just a helper variable; makes the code below slightly prettier
int startIndex = cmgwIndex + 6;
int crIndex = text.IndexOf("\r", startIndex);
string middlePart = text.Substring(startIndex, crIndex - startIndex);
If you know the position of 3216 then you can just do the following
string inner = some_string.SubString(positionOfCmgw+6,4);
This code will take the substring of some_string starting at the given position and only taking 4 characters.
If you want to be more general you could do the following
int start = positionOfCmgw+6;
int endIndex = some_string.IndexOf('\r', start);
int length = endIndex - start;
string inner = some_string.SubString(start, length);
One option would be to start from your known index and read characters until you hit a non-numeric value. Not the most robust solution, but it will work if you know your input's always going to look like this (i.e., no decimal points or other non-numeric characters within the numeric part of the string).
Something like this:
public static int GetNumberAtIndex(this string text, int index)
{
if (index < 0 || index >= text.Length)
throw new ArgumentOutOfRangeException("index");
var sb = new StringBuilder();
for (int i = index; i < text.Length; ++i)
{
char c = text[i];
if (!char.IsDigit(c))
break;
sb.Append(c);
}
if (sb.Length > 0)
return int.Parse(sb.ToString());
else
throw new ArgumentException("Unable to read number at the specified index.");
}
Usage in your case would look like:
string some_string = #"A simple demo of SMS text messaging.\r\n+CMGW: 3216\r\n...";
int index = some_string.IndexOf("CMGW") + 6;
int value = some_string.GetNumberAtIndex(index);
Console.WriteLine(value);
Output:
3216
If you're looking to extract the number portion of 'CMGW: 3216' then a more reliable method would be to use regular expressions. That way you can look for the entire pattern, and not just the header.
var some_string = "A simple demo of SMS text messaging.\r\n+CMGW: 3216\r\n\r\nOK\r\n";
var match = Regex.Match(some_string, #"CMGW\: (?<number>[0-9]+)", RegexOptions.Multiline);
var number = match.Groups["number"].Value;
More general, if you don't know the start position of CMGW but the structure remains as before.
String s;
char[] separators = {'\r'};
var parts = s.Split(separators);
parts.Where(part => part.Contains("CMGW")).Single().Reverse().TakeWhile(c => c != ' ').Reverse();
I'm a little surprised that there isn't some information on this on the web, and I keep finding that the problem is a little stickier than I thought.
Here's the rules:
You are starting with delimited/escaped data to split into an array.
The delimiter is one arbitrary character
The escape character is one arbitrary character
Both the delimiter and the escape character could occur in data
Regex is fine, but a good-performance solution is best
Edit: Empty elements (including leading or ending delimiters) can be ignored
The code signature (in C# would be, basically)
public static string[] smartSplit(
string delimitedData,
char delimiter,
char escape) {}
The stickiest part of the problem is the escaped consecutive escape character case, of course, since (calling / the escape character and , the delimiter): ////////, = ////,
Am I missing somewhere this is handled on the web or in another SO question? If not, put your big brains to work... I think this problem is something that would be nice to have on SO for the public good. I'm working on it myself, but don't have a good solution yet.
A simple state machine is usually the easiest and fastest way. Example in Python:
def extract(input, delim, escape):
# states
parsing = 0
escaped = 1
state = parsing
found = []
parsed = ""
for c in input:
if state == parsing:
if c == delim:
found.append(parsed)
parsed = ""
elif c == escape:
state = escaped
else:
parsed += c
else: # state == escaped
parsed += c
state = parsing
if parsed:
found.append(parsed)
return found
void smartSplit(string const& text, char delim, char esc, vector<string>& tokens)
{
enum State { NORMAL, IN_ESC };
State state = NORMAL;
string frag;
for (size_t i = 0; i<text.length(); ++i)
{
char c = text[i];
switch (state)
{
case NORMAL:
if (c == delim)
{
if (!frag.empty())
tokens.push_back(frag);
frag.clear();
}
else if (c == esc)
state = IN_ESC;
else
frag.append(1, c);
break;
case IN_ESC:
frag.append(1, c);
state = NORMAL;
break;
}
}
if (!frag.empty())
tokens.push_back(frag);
}
private static string[] Split(string input, char delimiter, char escapeChar, bool removeEmpty)
{
if (input == null)
{
return new string[0];
}
char[] specialChars = new char[]{delimiter, escapeChar};
var tokens = new List<string>();
var token = new StringBuilder();
for (int i = 0; i < input.Length; i++)
{
var c = input[i];
if (c.Equals(escapeChar))
{
if (i >= input.Length - 1)
{
throw new ArgumentException("Uncompleted escape sequence has been encountered at the end of the input");
}
var nextChar = input[i + 1];
if (nextChar != escapeChar && nextChar != delimiter)
{
throw new ArgumentException("Unknown escape sequence has been encountered: " + c + nextChar);
}
token.Append(nextChar);
i++;
}
else if (c.Equals(delimiter))
{
if (!removeEmpty || token.Length > 0)
{
tokens.Add(token.ToString());
token.Length = 0;
}
}
else
{
var index = input.IndexOfAny(specialChars, i);
if (index < 0)
{
token.Append(c);
}
else
{
token.Append(input.Substring(i, index - i));
i = index - 1;
}
}
}
if (!removeEmpty || token.Length > 0)
{
tokens.Add(token.ToString());
}
return tokens.ToArray();
}
The implementation of this kind of tokenizer in terms of a FSM is fairly straight forward.
You do have a few decisions to make (like, what do I do with leading delimiters? strip or emit NULL tokens).
Here is an abstract version which ignores leading and multiple delimiters, and doesn't allow escaping the newline:
state(input) action
========================
BEGIN(*): token.clear(); state=START;
END(*): return;
*(\n\0): token.emit(); state=END;
START(DELIMITER): ; // NB: the input is *not* added to the token!
START(ESCAPE): state=ESC; // NB: the input is *not* added to the token!
START(*): token.append(input); state=NORM;
NORM(DELIMITER): token.emit(); token.clear(); state=START;
NORM(ESCAPE): state=ESC; // NB: the input is *not* added to the token!
NORM(*): token.append(input);
ESC(*): token.append(input); state=NORM;
This kind of implementation has the advantage of dealing with consecutive excapes naturally, and can be easily extended to give special meaning to more escape sequences (i.e. add a rule like ESC(t) token.appeand(TAB)).
Here's my ported function in C#
public static void smartSplit(string text, char delim, char esc, ref List<string> listToBuild)
{
bool currentlyEscaped = false;
StringBuilder fragment = new StringBuilder();
for (int i = 0; i < text.Length; i++)
{
char c = text[i];
if (currentlyEscaped)
{
fragment.Append(c);
currentlyEscaped = false;
}
else
{
if (c == delim)
{
if (fragment.Length > 0)
{
listToBuild.Add(fragment.ToString());
fragment.Remove(0, fragment.Length);
}
}
else if (c == esc)
currentlyEscaped = true;
else
fragment.Append(c);
}
}
if (fragment.Length > 0)
{
listToBuild.Add(fragment.ToString());
}
}
Hope this helps someone in the future. Thanks to KenE for pointing me in the right direction.
Here's a more idiomatic and readable way to do it:
public IEnumerable<string> SplitAndUnescape(
string encodedString,
char separator,
char escape)
{
var inEscapeSequence = false;
var currentToken = new StringBuilder();
foreach (var currentCharacter in encodedString)
if (inEscapeSequence)
{
currentToken.Append(currentCharacter);
inEscapeSequence = false;
}
else
if (currentCharacter == escape)
inEscapeSequence = true;
else
if (currentCharacter == separator)
{
yield return currentToken.ToString();
currentToken.Clear();
}
else
currentToken.Append(currentCharacter);
yield return currentToken.ToString();
}
Note that this doesn't remove empty elements. I don't think that should be the responsibility of the parser. If you want to remove them, just call Where(item => item.Any()) on the result.
I think this is too much logic for a single method; it gets hard to follow. If someone has time, I think it would be better to break it up into multiple methods and maybe its own class.
You'ew looking for something like a "string tokenizer". There's a version I found quickly that's similar. Or look at getopt.