Remove whitespace in C# without any built-in functions

Hi I am a beginner in C# and I was trying to remove the whitespaces in a string.
I use the following code:
public String RemoveSpace(string str1)
{
char[] source = str1.ToCharArray();
int oldIndex = 0;
int newIndex = 0;
while (oldIndex < source.Length)
{
if (source[oldIndex] != ' ' && source[oldIndex] != '\t')
{
source[newIndex] = source[oldIndex];
newIndex++;
}
oldIndex++;
}
source[oldIndex] = '\0';
return new String(source);
}
But the problem I'm facing is when I give the
input string as "H e l"
the output shows "Hel l"
which is because at the last iteration the element at arr[2] is being replaced by arr[4], while the last character 'l' is left in place. Can someone point out the mistake I'm making?
Note: There should not be any use of Regex, trim or replace functions.
Thanks.

There's a String constructor which allows you to control the length
So just change the last line to
return new String(source, 0, newIndex);
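Note that .NET doesn't care about NUL characters (strings can contain them just fine), so source[oldIndex] = '\0'; does nothing useful and you can remove it. As written, oldIndex equals source.Length at that point, so the assignment is also out of range.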
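To make the suggested change concrete, here is a minimal sketch of the question's method with the length-limited constructor applied and the NUL assignment dropped. The class name WhitespaceDemo and the Main method are only there for illustration and are not part of the original code.

```csharp
using System;

public static class WhitespaceDemo
{
    // Copies every non-space, non-tab character toward the front of the
    // buffer, then builds the result from only the first newIndex characters.
    public static string RemoveSpace(string str1)
    {
        char[] source = str1.ToCharArray();
        int newIndex = 0;
        for (int oldIndex = 0; oldIndex < source.Length; oldIndex++)
        {
            if (source[oldIndex] != ' ' && source[oldIndex] != '\t')
            {
                source[newIndex] = source[oldIndex];
                newIndex++;
            }
        }
        // String(char[], int startIndex, int length) ignores the stale tail
        // that is still sitting in the buffer after newIndex.
        return new string(source, 0, newIndex);
    }

    public static void Main()
    {
        Console.WriteLine(RemoveSpace("H e l")); // prints "Hel"
    }
}
```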

Some key learning points:
Incrementally concatenating strings is relatively slow. Since you know you're going to be doing a 'lot' (indeterminate) number of character-by-character operations, use a char array for the working string.
The fastest way to iterate through characters in C# is to use the built-in string indexer.
If you need to check additional characters besides space, tab, carriage return, and line feed, then add additional conditions in the if statement:
public static string RemoveWhiteSpace(string input) {
int len = input.Length;
int ixOut = 0;
char[] outBuffer = new char[len];
for(int i = 0; i < len; i++) {
char c = input[i];
if(!(c == ' ' || c == '\t' || c == '\r' || c == '\n'))
outBuffer[ixOut++] = c;
}
return new string(outBuffer, 0, ixOut);
}

You can use LINQ for that:
var output = new string(input.Where(x => !char.IsWhiteSpace(x)).ToArray());
Your mistake is that you are skipping the spaces, but your source array still contains the leftover chars at the end. With that logic you will never get the correct result, because you are not removing anything; you are just overwriting chars. After your while loop you can try this:
return new String(source.Take(newIndex).ToArray());
The Take method returns the first newIndex elements of your source array and ignores the rest.
Here is another alternative way of doing this:
var output = string.Concat(input.Split());

You should note that much depends on how you define "whitespace". Unicode and the CLR define whitespace as a rather exhaustive list of characters: char.IsWhiteSpace() returns true for quite a few characters.
The "classic" definition of whitespace is the following set of characters: HT, LF, VT, FF, CR and SP (and some might include BS as well).
Myself, I'd probably do something like this:
public static class StringHelpers
{
public static string StripWhitespace( this string s )
{
StringBuilder sb = new StringBuilder() ;
foreach ( char c in s )
{
switch ( c )
{
//case '\b' : continue ; // U+0008, BS uncomment if you want this
case '\t' : continue ; // U+0009, HT
case '\n' : continue ; // U+000A, LF
case '\v' : continue ; // U+000B, VT
case '\f' : continue ; // U+000C, FF
case '\r' : continue ; // U+000D, CR
case ' ' : continue ; // U+0020, SP
}
sb.Append(c) ;
}
string stripped = sb.ToString() ;
return stripped ;
}
}
You could use your approach thusly. However, it's important to READ THE DOCUMENTATION: you'll note the use of a string constructor overload that lets you specify a range within an array as the initialization vector for the string:
public static string StripWhitespace( string s )
{
char[] buf = s.ToCharArray() ;
int j = 0 ; // target pointer
for ( int i = 0 ; i < buf.Length ; ++i )
{
char c = buf[i] ;
if ( !IsWs(c) )
{
buf[j++] = c ;
}
}
string stripped = new string(buf,0,j) ;
return stripped ;
}
private static bool IsWs( char c )
{
bool ws = false ;
switch ( c )
{
//case '\b' : // U+0008, BS uncomment if you want BS as whitespace
case '\t' : // U+0009, HT
case '\n' : // U+000A, LF
case '\v' : // U+000B, VT
case '\f' : // U+000C, FF
case '\r' : // U+000D, CR
case ' ' : // U+0020, SP
ws = true ;
break ;
}
return ws ;
}
You could also use Linq, something like:
public static string StripWhitespace( this string s )
{
return new string( s.Where( c => !char.IsWhiteSpace(c) ).ToArray() ) ;
}
Though, I'm willing to bet that the Linq approach will be significantly slower than the other two. It's elegant, though.
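To illustrate why the definition of "whitespace" matters, here is a small sketch that strips the same input with the "classic" set and with char.IsWhiteSpace. The names WhitespaceDefinitions, IsClassicWs and Strip are made up for this example; the only assumption about the framework is that char.IsWhiteSpace treats U+00A0 (NO-BREAK SPACE) as whitespace, which it does.

```csharp
using System;
using System.Text;

public static class WhitespaceDefinitions
{
    // "Classic" set from the answer above: HT, LF, VT, FF, CR and SP.
    public static bool IsClassicWs(char c)
    {
        switch (c)
        {
            case '\t': case '\n': case '\v': case '\f': case '\r': case ' ':
                return true;
            default:
                return false;
        }
    }

    // Removes every character for which the supplied predicate returns true.
    public static string Strip(string s, Func<char, bool> isWs)
    {
        var sb = new StringBuilder(s.Length);
        foreach (char c in s)
        {
            if (!isWs(c)) sb.Append(c);
        }
        return sb.ToString();
    }

    public static void Main()
    {
        string input = "a\u00A0b c"; // 5 chars; U+00A0 is NO-BREAK SPACE
        // The classic set removes only the ordinary space: 4 chars remain.
        Console.WriteLine(Strip(input, IsClassicWs).Length);
        // char.IsWhiteSpace also removes the no-break space: 3 chars remain.
        Console.WriteLine(Strip(input, c => char.IsWhiteSpace(c)).Length);
    }
}
```

Which of the two behaviours is "right" depends entirely on where the input comes from, so it is worth deciding explicitly.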

Related

Remove specific characters except last

I have a text string and I want to replace the dots with underscores except for the last dot found in the string.
Example:
input = "video.coffee.example.mp4"
result = "video_coffee_example.mp4"
I have a code but this replaces everything including the last character
first option failed
static string replaceForUnderScore(string file)
{
return file = file.Replace(".", "_");
}
I implemented a second option that works for me but I find that it is very extensive and not very optimized
static string replaceForUnderScore(string file)
{
string result = "";
var splits = file.Split(".");
var extension = splits.LastOrDefault();
splits = splits.Take(splits.Count() - 1).ToArray();
foreach (var strItem in splits)
{
result = result + "_" + strItem;
}
result = result.Substring(1, result.Length-1);
string finalResult = result + "."+extension;
return finalResult;
}
Is there a better way to do it?
Since you work with files, I suggest using Path class: all
we want is to change file name only while keeping extension intact:
static string replaceForUnderScore(string file) =>
Path.GetFileNameWithoutExtension(file).Replace('.', '_') + Path.GetExtension(file);
You can replace all the dots with an underscore except for the last dot by asserting that there is still a dot present to the right when matching one.
string result = Regex.Replace(input, @"\.(?=[^.]*\.)", "_");
The result will be
video_coffee_example.mp4
Regex will help you to do this.
Add the namespace using System.Text.RegularExpressions;
And use this code:
var regex = new Regex(Regex.Escape("."));
var newText = regex.Replace("video.coffee.example.mp4", "_", 2);
Here we specified the maximum number of times to replace the .
The output would be the following:
video_coffee_example.mp4
Additionally, you can update the code to replace any number of dots excluding the last one.
var replaceChar = '.';
var regex = new Regex(Regex.Escape(replaceChar.ToString()));
var replaceWith = "_";
// The text to process
var text = "video.coffee.example.mp4";
// Count how many chars to replace excluding extension
var replaceCount = text.Count(s => s == replaceChar) - 1;
var newText = regex.Replace(text, replaceWith, replaceCount);
Off the top of my head but this might work.
return $"{file.Replace(".mp4","").Replace(".","_")}.mp4";
The simplest (and probably fastest) way is just to iterate over the string:
static string replaceForUnderScore(string file)
{
StringBuilder sb = new StringBuilder( file.Length ) ;
int lastDot = -1 ;
for ( int i = 0 ; i < file.Length ; ++i )
{
char c = file[i] ;
// if we found a '.', replace it with '_' and save its position
if ( c == '.' )
{
c = '_' ;
lastDot = i ;
}
sb.Append( c ) ;
}
// if we changed any '.' to '_', convert the last such replacement back to '.'
if ( lastDot >= 0 )
{
sb.Replace ( '_' , '.' , lastDot, 1 );
}
return sb.ToString();
}
Another approach would be to use System.IO.Path. It's certainly the most succinct:
static string replaceForUnderScore( string file )
{
string ext = Path.GetExtension( file ) ;
string name = Path
.GetFileNameWithoutExtension( file )
.Replace( '.' , '_' )
;
return Path.ChangeExtension( name , ext ) ;
}
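For input that is not necessarily a file path, a plain LastIndexOf split is another option. This is only a sketch; the names DotReplacer and ReplaceAllButLastDot are invented for the example.

```csharp
using System;

public static class DotReplacer
{
    // Replaces every '.' before the last one with '_' and keeps the
    // last dot (and everything after it) untouched.
    public static string ReplaceAllButLastDot(string file)
    {
        int lastDot = file.LastIndexOf('.');
        if (lastDot < 0)
        {
            return file; // no dot at all, nothing to replace
        }
        string head = file.Substring(0, lastDot).Replace('.', '_');
        string tail = file.Substring(lastDot); // starts with the last '.'
        return head + tail;
    }

    public static void Main()
    {
        // prints "video_coffee_example.mp4"
        Console.WriteLine(ReplaceAllButLastDot("video.coffee.example.mp4"));
    }
}
```

Unlike the Path-based versions, this does not treat directory separators specially, so a dot inside a directory name would be replaced as well.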

c# read file content code optimization

I have a large string read from a text file (e.g. a 1 MB text file) and I want to process it. Processing the string takes nearly 10 minutes.
Basically, the string is read character by character and a counter is incremented for each character. Some characters (space, comma, colon and semicolon) are all counted as a space, so the space counter is incremented for them; all remaining characters are simply ignored.
Code:
string fileContent = "....." // a large string
int min = 0;
int max = fileContent.Length;
Dictionary<char, int> occurrence // example c=>0, m=>4, r=>8 etc....
// Note: occurrence has only the letters a-z and a space. Comma, colon and semicolon are counted as space and the remaining characters are ignored.
for (int i = min; i <= max; i++) // run loop to end
{
try // increment counter for alphabets and space
{
occurrence[fileContent[i]] += 1;
}
catch (Exception e) // comma, colon and semi-colon are spaces
{
if (fileContent[i] == ' ' || fileContent[i] == ',' || fileContent[i] == ':' || fileContent[i] == ';')
{
occurrence[' '] += 1;
//new_file_content += ' ';
}
else continue;
}
totalFrequency++; // increment total frequency
}
Try this:
string input = "test string here";
Dictionary<char, int> charDict = new Dictionary<char, int>();
foreach(char c in input.ToLower()) {
if(c < 97 || c > 122) { // not in 'a'..'z'
if(c == ' ' || c == ',' || c == ':' || c == ';') {
// start at 1 on first sight, otherwise add 1 to the stored count
charDict[' '] = (charDict.ContainsKey(' ')) ? charDict[' '] + 1 : 1;
}
} else {
charDict[c] = (charDict.ContainsKey(c)) ? charDict[c] + 1 : 1;
}
}
Given that your loop iterates a large number of times, you want to minimize the checks inside the loop and remove the catch, as pointed out in the comments. There should never be a reason to control flow logic with a try/catch block. I assume you initialize the dictionary first so that every counted character starts at 0; otherwise you have to add the character to the dictionary when it is not there yet. In the loop you can test the character with something like char.IsLetter() or other checks, as D. Stewart is suggesting. I would not call ToLower on the large string if you are going to iterate over every character anyway (this would iterate twice). You can do that conversion inside the loop if needed.
Try something like the code below. You could also initialize all 256 single-byte character values in the dictionary and completely remove the if statement (this assumes the text contains no characters above 255), then remove the items you don't care about and add the counts of the 4 space items to the space character's entry after the counting is complete.
foreach (char c in fileContent)
{
if (char.IsLetter(c))
{
occurrence[c] += 1;
}
else
{
if (c == ' ' || c == ',' || c == ':' || c == ';')
{
occurrence[' '] += 1;
}
}
}
You could initialize the entire dictionary in advance like this also:
for (int i = 0; i < 256; i++)
{
occurrence.Add((char)i, 0);
}
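If you would rather not pre-fill the dictionary, a sketch along these lines avoids both the exception and the ContainsKey double lookup by using TryGetValue, and it lowercases each character inside the loop as suggested above. The names CharCounter and Count are invented for this example, and restricting the letters to a-z is taken from the question.

```csharp
using System;
using System.Collections.Generic;

public static class CharCounter
{
    // Counts a-z (case-insensitively) and folds ' ', ',', ':' and ';'
    // into a single ' ' entry. All other characters are ignored.
    public static Dictionary<char, int> Count(string text)
    {
        var occurrence = new Dictionary<char, int>();
        foreach (char raw in text)
        {
            char c = char.ToLowerInvariant(raw);
            char key;
            if (c >= 'a' && c <= 'z')
            {
                key = c;
            }
            else if (c == ' ' || c == ',' || c == ':' || c == ';')
            {
                key = ' ';
            }
            else
            {
                continue; // not a counted character
            }

            int current;
            // TryGetValue leaves current at 0 when the key is absent,
            // so no exception and no separate ContainsKey call is needed.
            occurrence.TryGetValue(key, out current);
            occurrence[key] = current + 1;
        }
        return occurrence;
    }

    public static void Main()
    {
        Dictionary<char, int> counts = Count("Ab, b;");
        Console.WriteLine(counts['a']); // 1
        Console.WriteLine(counts['b']); // 2
        Console.WriteLine(counts[' ']); // 3 (comma, space, semicolon)
    }
}
```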
There are several issues with that code snippet (i <= max, accessing dictionary entries without initializing them, etc.), but of course the performance bottleneck is relying on exceptions, since throwing and catching exceptions is extremely slow (especially when done in an inner loop).
I would start with putting the counts into a separate array.
Then I would either prepare a char to count index map and use it inside the loop w/o any ifs:
var indexMap = new Dictionary<char, int>();
int charCount = 0;
// Map the valid characters to be counted
for (var ch = 'a'; ch <= 'z'; ch++)
indexMap.Add(ch, charCount++);
// Map the "space" characters to be counted
foreach (var ch in new[] { ' ', ',', ':', ';' })
indexMap.Add(ch, charCount);
charCount++;
// Allocate count array
var occurences = new int[charCount];
// Process the string
foreach (var ch in fileContent)
{
int index;
if (indexMap.TryGetValue(ch, out index))
occurences[index]++;
}
// Not sure about this, but including it for consistency
totalFrequency = occurences.Sum();
or not use dictionary at all:
// Allocate array for char counts
var occurences = new int['z' - 'a' + 1];
// Separate count for "space" chars
int spaceOccurences = 0;
// Process the string
foreach (var ch in fileContent)
{
if ('a' <= ch && ch <= 'z')
occurences[ch - 'a']++;
else if (ch == ' ' || ch == ',' || ch == ':' || ch == ';')
spaceOccurences++;
}
// Not sure about this, but including it for consistency
totalFrequency = spaceOccurences + occurences.Sum();
The former is more flexible (you can add more mappings); the latter is a bit faster. But both are fast enough (they complete in milliseconds for a 1M-character string).
OK, it's a little late, but it should be the fastest solution:
using System.Collections.Generic;
using System.Linq;
namespace ConsoleApplication99
{
class Program
{
static void Main(string[] args)
{
string fileContent = "....."; // a large string
// --- high perf section to count all chars ---
var charCounter = new int[char.MaxValue + 1];
for (int i = 0; i < fileContent.Length; i++)
{
charCounter[fileContent[i]]++;
}
// --- combine results with linq (all actions consume less than 1 ms) ---
var allResults = charCounter.Select((count, index) => new { count, charValue = (char)index }).Where(c => c.count > 0).ToArray();
var spaceChars = new HashSet<char>(" ,:;");
int countSpaces = allResults.Where(c => spaceChars.Contains(c.charValue)).Sum(c => c.count);
var usefulChars = new HashSet<char>("abcdefghijklmnopqrstuvwxyz");
int countLetters = allResults.Where(c => usefulChars.Contains(c.charValue)).Sum(c => c.count);
}
}
}
For very large text files, it's better to use a StreamReader...
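As a rough sketch of what the streaming variant could look like, the method below reads fixed-size chunks from any TextReader instead of loading the whole file into one string. The names StreamingCounter and CountLetters and the 4096-char buffer size are assumptions made for the example; the demo uses a StringReader as a stand-in for new StreamReader(path).

```csharp
using System;
using System.IO;

public static class StreamingCounter
{
    // Returns the a-z counts (index 0 = 'a') and reports how many
    // "space" characters (' ', ',', ':', ';') were seen.
    public static int[] CountLetters(TextReader reader, out int spaceCount)
    {
        var letters = new int['z' - 'a' + 1];
        spaceCount = 0;
        var buffer = new char[4096]; // arbitrary chunk size
        int read;
        // Read returns 0 once the end of the input has been reached.
        while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < read; i++)
            {
                char ch = buffer[i];
                if ('a' <= ch && ch <= 'z')
                    letters[ch - 'a']++;
                else if (ch == ' ' || ch == ',' || ch == ':' || ch == ';')
                    spaceCount++;
            }
        }
        return letters;
    }

    public static void Main()
    {
        // For a real file this would be: new StreamReader(path)
        using (var reader = new StringReader("a b,c:a"))
        {
            int spaces;
            int[] letters = CountLetters(reader, out spaces);
            Console.WriteLine(letters[0]); // 2 occurrences of 'a'
            Console.WriteLine(spaces);     // 3 "space" characters
        }
    }
}
```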

Keep only numeric value from a string?

I have some strings like this
string phoneNumber = "(914) 395-1430";
I would like to strip out the parentheses and the dash; in other words, just keep the numeric values.
So the output could look like this
9143951430
How do I get the desired output ?
You can do any of the following:
Use regular expressions. You can use a regular expression with either
A negative character class that defines the characters that are what you don't want (those characters other than decimal digits):
private static readonly Regex rxNonDigits = new Regex( @"[^\d]+");
In which case, you can take either of these approaches:
// simply replace the offending substrings with an empty string
private string CleanStringOfNonDigits_V1( string s )
{
if ( string.IsNullOrEmpty(s) ) return s ;
string cleaned = rxNonDigits.Replace(s, "") ;
return cleaned ;
}
// split the string into an array of good substrings
// using the bad substrings as the delimiter. Then use
// String.Join() to splice things back together.
private string CleanStringOfNonDigits_V2( string s )
{
if (string.IsNullOrEmpty(s)) return s;
string cleaned = String.Join( "" , rxNonDigits.Split(s) ); // Join needs a separator; use the empty string
return cleaned ;
}
a positive character set that defines what you do want (decimal digits):
private static Regex rxDigits = new Regex( @"[\d]+") ;
In which case you can do something like this:
private string CleanStringOfNonDigits_V3( string s )
{
if ( string.IsNullOrEmpty(s) ) return s ;
StringBuilder sb = new StringBuilder() ;
for ( Match m = rxDigits.Match(s) ; m.Success ; m = m.NextMatch() )
{
sb.Append(m.Value) ;
}
string cleaned = sb.ToString() ;
return cleaned ;
}
You're not required to use a regular expression, either.
You could use LINQ directly, since a string is an IEnumerable<char>:
private string CleanStringOfNonDigits_V4( string s )
{
if ( string.IsNullOrEmpty(s) ) return s;
string cleaned = new string( s.Where( char.IsDigit ).ToArray() ) ;
return cleaned;
}
If you're only dealing with western alphabets where the only decimal digits you'll see are ASCII, skipping char.IsDigit will likely buy you a little performance:
private string CleanStringOfNonDigits_V5( string s )
{
if (string.IsNullOrEmpty(s)) return s;
string cleaned = new string(s.Where( c => c >= '0' && c <= '9' ).ToArray() ) ; // check both bounds: c-'0' < 10 alone also accepts chars below '0'
return cleaned;
}
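The difference between the two checks is easy to see with a non-ASCII digit. The following sketch (DigitDemo, KeepUnicodeDigits and KeepAsciiDigits are names made up here) appends U+0663, ARABIC-INDIC DIGIT THREE, to the phone number: char.IsDigit keeps it, the ASCII range check drops it.

```csharp
using System;
using System.Linq;

public static class DigitDemo
{
    // Keeps every character that Unicode classifies as a decimal digit.
    public static string KeepUnicodeDigits(string s)
    {
        return new string(s.Where(c => char.IsDigit(c)).ToArray());
    }

    // Keeps only the ASCII digits '0'..'9'.
    public static string KeepAsciiDigits(string s)
    {
        return new string(s.Where(c => c >= '0' && c <= '9').ToArray());
    }

    public static void Main()
    {
        string input = "(914) 395-1430 \u0663"; // trailing ARABIC-INDIC DIGIT THREE
        Console.WriteLine(KeepUnicodeDigits(input).Length); // 11
        Console.WriteLine(KeepAsciiDigits(input));          // 9143951430
    }
}
```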
Finally, you can simply iterate over the string, chucking the characters you don't want, like this:
private string CleanStringOfNonDigits_V6( string s )
{
if (string.IsNullOrEmpty(s)) return s;
StringBuilder sb = new StringBuilder(s.Length) ;
for (int i = 0; i < s.Length; ++i)
{
char c = s[i];
if ( c < '0' ) continue ;
if ( c > '9' ) continue ;
sb.Append(s[i]);
}
string cleaned = sb.ToString();
return cleaned;
}
Or this:
private string CleanStringOfNonDigits_V7(string s)
{
if (string.IsNullOrEmpty(s)) return s;
StringBuilder sb = new StringBuilder(s);
int j = 0 ;
int i = 0 ;
while ( i < sb.Length )
{
bool isDigit = char.IsDigit( sb[i] ) ;
if ( isDigit )
{
sb[j++] = sb[i++];
}
else
{
++i ;
}
}
sb.Length = j;
string cleaned = sb.ToString();
return cleaned;
}
From a standpoint of clarity and cleanness of code, the version 1 is what you want. It's hard to beat a one liner.
If performance matters, my suspicion is that version 7, the last version, is the winner. It creates one temporary, a StringBuilder, and does the transformation in place within the StringBuilder's own buffer.
The other options all do more work.
Use a regular expression:
string result = Regex.Replace(phoneNumber, @"[^\d]", "");
try something like this
return new String(input.Where(Char.IsDigit).ToArray());
string phoneNumber = "(914) 395-1430";
var numbers = String.Join("", phoneNumber.Where(char.IsDigit));
He means everything @gleng
Regex rgx = new Regex(@"\D");
str = rgx.Replace(str, "");
Instead of a regular expression, you can use a LINQ method:
phoneNumber = String.Concat(phoneNumber.Where(c => c >= '0' && c <= '9'));
or:
phoneNumber = String.Concat(phoneNumber.Where(Char.IsDigit));

Convert a word into character array

How do I convert a word into a character array?
Let's say I have the word "Pneumonoultramicroscopicsilicovolcanoconiosis" (yes, this is a word!). I would like to take this word and assign a numerical value to it.
a = 1
b = 2
... z = 26
int Alpha = 1;
int Bravo = 2;
basic code
if (testvalue == "a")
{
Debug.WriteLine("TRUE A was found in the string"); // true
FinalNumber = Alpha + FinalNumber;
Debug.WriteLine(FinalNumber);
}
if (testvalue == "b")
{
Debug.WriteLine("TRUE B was found in the string"); // true
FinalNumber = Bravo + FinalNumber;
Debug.WriteLine(FinalNumber);
}
My question is: how do I get the word "Pneumonoultramicroscopicsilicovolcanoconiosis" into a char array so that I can loop over the letters one by one?
thanks in advance
what about
char[] myArray = myString.ToCharArray();
But you don't actually need to do this if you want to iterate the string. You can simply do
for( int i = 0; i < myString.Length; i++ ){
if( myString[i] ... ){
//do what you want here
}
}
This works since the string class implements its own indexer.
string word = "Pneumonoultramicroscopicsilicovolcanoconiosis";
char[] characters = word.ToCharArray();
Voilà!
You can use a simple for loop.
string word = "Pneumonoultramicroscopicsilicovolcanoconiosis";
int wordCount = word.Length;
for(int wordIndex=0;wordIndex<wordCount; wordIndex++)
{
char c = word[wordIndex];
// your code
}
You can use the Linq Aggregate function to do this:
"wordsto".ToLower().Aggregate(0, (running, c) => running + c - 97);
(This particular example assumes you want to treat upper- and lower-case identically.)
The subtraction of 97 translates the ASCII value of the letters such that 'a' is zero. (Obviously subtract 96 if you want 'a' to be 1.)
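For comparison, here is the same idea written as a plain loop, as a sketch with 'a' counting as 1 (the scheme from the question). The names LetterValues and Sum are invented for the example, and skipping anything outside a-z is an assumption; the Aggregate one-liner above does not do that.

```csharp
using System;

public static class LetterValues
{
    // Adds up a = 1, b = 2, ... z = 26, ignoring case and skipping
    // any character that is not a letter in a-z.
    public static int Sum(string word)
    {
        int total = 0;
        foreach (char raw in word)
        {
            char c = char.ToLowerInvariant(raw);
            if (c >= 'a' && c <= 'z')
            {
                total += c - 'a' + 1; // 'a' - 'a' + 1 == 1
            }
        }
        return total;
    }

    public static void Main()
    {
        Console.WriteLine(Sum("abc")); // 6
        Console.WriteLine(Sum("Az"));  // 27
    }
}
```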
You can use the ToCharArray() method of the string class:
string strWord = "Pneumonoultramicroscopicsilicovolcanoconiosis";
char[] characters = strWord.ToCharArray();

What is the best algorithm for arbitrary delimiter/escape character processing?

I'm a little surprised that there isn't some information on this on the web, and I keep finding that the problem is a little stickier than I thought.
Here are the rules:
You are starting with delimited/escaped data to split into an array.
The delimiter is one arbitrary character
The escape character is one arbitrary character
Both the delimiter and the escape character could occur in data
Regex is fine, but a good-performance solution is best
Edit: Empty elements (including leading or ending delimiters) can be ignored
The code signature (in C#) would be, basically:
public static string[] smartSplit(
string delimitedData,
char delimiter,
char escape) {}
The stickiest part of the problem is the escaped consecutive escape character case, of course, since (calling / the escape character and , the delimiter): ////////, = ////,
Am I missing somewhere this is handled on the web or in another SO question? If not, put your big brains to work... I think this problem is something that would be nice to have on SO for the public good. I'm working on it myself, but don't have a good solution yet.
A simple state machine is usually the easiest and fastest way. Example in Python:
def extract(input, delim, escape):
    # states
    parsing = 0
    escaped = 1
    state = parsing
    found = []
    parsed = ""
    for c in input:
        if state == parsing:
            if c == delim:
                found.append(parsed)
                parsed = ""
            elif c == escape:
                state = escaped
            else:
                parsed += c
        else:  # state == escaped
            parsed += c
            state = parsing
    if parsed:
        found.append(parsed)
    return found
void smartSplit(string const& text, char delim, char esc, vector<string>& tokens)
{
enum State { NORMAL, IN_ESC };
State state = NORMAL;
string frag;
for (size_t i = 0; i<text.length(); ++i)
{
char c = text[i];
switch (state)
{
case NORMAL:
if (c == delim)
{
if (!frag.empty())
tokens.push_back(frag);
frag.clear();
}
else if (c == esc)
state = IN_ESC;
else
frag.append(1, c);
break;
case IN_ESC:
frag.append(1, c);
state = NORMAL;
break;
}
}
if (!frag.empty())
tokens.push_back(frag);
}
private static string[] Split(string input, char delimiter, char escapeChar, bool removeEmpty)
{
if (input == null)
{
return new string[0];
}
char[] specialChars = new char[]{delimiter, escapeChar};
var tokens = new List<string>();
var token = new StringBuilder();
for (int i = 0; i < input.Length; i++)
{
var c = input[i];
if (c.Equals(escapeChar))
{
if (i >= input.Length - 1)
{
throw new ArgumentException("Uncompleted escape sequence has been encountered at the end of the input");
}
var nextChar = input[i + 1];
if (nextChar != escapeChar && nextChar != delimiter)
{
throw new ArgumentException("Unknown escape sequence has been encountered: " + c + nextChar);
}
token.Append(nextChar);
i++;
}
else if (c.Equals(delimiter))
{
if (!removeEmpty || token.Length > 0)
{
tokens.Add(token.ToString());
token.Length = 0;
}
}
else
{
var index = input.IndexOfAny(specialChars, i);
if (index < 0)
{
token.Append(c);
}
else
{
token.Append(input.Substring(i, index - i));
i = index - 1;
}
}
}
if (!removeEmpty || token.Length > 0)
{
tokens.Add(token.ToString());
}
return tokens.ToArray();
}
The implementation of this kind of tokenizer in terms of an FSM is fairly straightforward.
You do have a few decisions to make (like, what do I do with leading delimiters? strip or emit NULL tokens).
Here is an abstract version which ignores leading and multiple delimiters, and doesn't allow escaping the newline:
state(input) action
========================
BEGIN(*): token.clear(); state=START;
END(*): return;
*(\n\0): token.emit(); state=END;
START(DELIMITER): ; // NB: the input is *not* added to the token!
START(ESCAPE): state=ESC; // NB: the input is *not* added to the token!
START(*): token.append(input); state=NORM;
NORM(DELIMITER): token.emit(); token.clear(); state=START;
NORM(ESCAPE): state=ESC; // NB: the input is *not* added to the token!
NORM(*): token.append(input);
ESC(*): token.append(input); state=NORM;
This kind of implementation has the advantage of dealing with consecutive escapes naturally, and can be easily extended to give special meaning to more escape sequences (i.e. add a rule like ESC(t) token.append(TAB)).
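A C# sketch of the table above, including the ESC(t) extension, might look like the following. The names EscapeTokenizer, State and Split are invented for the example, the START and NORM states are merged into one, and the guard that disables the tab rule when 't' is itself the delimiter or escape character is my own assumption. Like the table, it drops empty tokens.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class EscapeTokenizer
{
    private enum State { Normal, Escaped }

    public static List<string> Split(string input, char delimiter, char escape)
    {
        var tokens = new List<string>();
        var token = new StringBuilder();
        var state = State.Normal;
        // Assumption: the "escape + t means TAB" rule only applies when
        // 't' is not already used as the delimiter or escape character.
        bool tabRule = delimiter != 't' && escape != 't';

        foreach (char c in input)
        {
            if (state == State.Escaped)
            {
                // ESC(t) appends a TAB; ESC(*) appends the input verbatim.
                token.Append(tabRule && c == 't' ? '\t' : c);
                state = State.Normal;
            }
            else if (c == escape)
            {
                state = State.Escaped; // the escape itself is not added
            }
            else if (c == delimiter)
            {
                if (token.Length > 0) // ignore empty elements
                {
                    tokens.Add(token.ToString());
                    token.Clear();
                }
            }
            else
            {
                token.Append(c);
            }
        }
        if (token.Length > 0)
        {
            tokens.Add(token.ToString());
        }
        return tokens;
    }

    public static void Main()
    {
        // Eight escape characters followed by a delimiter give one token
        // of four literal escape characters, as described in the question.
        Console.WriteLine(Split("////////,", ',', '/')[0]); // prints "////"
    }
}
```

A trailing, unfinished escape is silently dropped here; whether that should be an error instead is one of the decisions mentioned above.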
Here's my ported function in C#
public static void smartSplit(string text, char delim, char esc, ref List<string> listToBuild)
{
bool currentlyEscaped = false;
StringBuilder fragment = new StringBuilder();
for (int i = 0; i < text.Length; i++)
{
char c = text[i];
if (currentlyEscaped)
{
fragment.Append(c);
currentlyEscaped = false;
}
else
{
if (c == delim)
{
if (fragment.Length > 0)
{
listToBuild.Add(fragment.ToString());
fragment.Remove(0, fragment.Length);
}
}
else if (c == esc)
currentlyEscaped = true;
else
fragment.Append(c);
}
}
if (fragment.Length > 0)
{
listToBuild.Add(fragment.ToString());
}
}
Hope this helps someone in the future. Thanks to KenE for pointing me in the right direction.
Here's a more idiomatic and readable way to do it:
public IEnumerable<string> SplitAndUnescape(
string encodedString,
char separator,
char escape)
{
var inEscapeSequence = false;
var currentToken = new StringBuilder();
foreach (var currentCharacter in encodedString)
if (inEscapeSequence)
{
currentToken.Append(currentCharacter);
inEscapeSequence = false;
}
else
if (currentCharacter == escape)
inEscapeSequence = true;
else
if (currentCharacter == separator)
{
yield return currentToken.ToString();
currentToken.Clear();
}
else
currentToken.Append(currentCharacter);
yield return currentToken.ToString();
}
Note that this doesn't remove empty elements. I don't think that should be the responsibility of the parser. If you want to remove them, just call Where(item => item.Any()) on the result.
I think this is too much logic for a single method; it gets hard to follow. If someone has time, I think it would be better to break it up into multiple methods and maybe its own class.
You're looking for something like a "string tokenizer". There's a version I found quickly that's similar. Or look at getopt.
