C#: how to split a string using the default whitespace characters plus a set of additional delimiters? - c#

I need to split a string in C# using a set of delimiter characters. This set should include the default whitespace characters (i.e. what you effectively get when you call String.Split(null, StringSplitOptions.RemoveEmptyEntries)) plus some additional characters that I specify, like '.', ',', ';', etc. So if I have a char array of those additional characters, how do I add all the default whitespace characters to it, in order to then feed that expanded array to String.Split? Or is there a better way of splitting using my custom delimiter set plus whitespace? Thanks.

Just use the appropriate overload of string.Split if you're at least on .NET 2.0:
char[] separator = new[] { ' ', '.', ',', ';' };
string[] parts = text.Split(separator, StringSplitOptions.RemoveEmptyEntries);
I guess I was downvoted because the answer was incomplete. The OP asked for a way to split on all whitespace characters (there are 25 on my PC) as well as on other delimiters:
public static class StringExtensions
{
static StringExtensions()
{
var whiteSpaceList = new List<char>();
for (int i = char.MinValue; i <= char.MaxValue; i++)
{
char c = Convert.ToChar(i);
if (char.IsWhiteSpace(c))
{
whiteSpaceList.Add(c);
}
}
WhiteSpaces = whiteSpaceList.ToArray();
}
public static readonly char[] WhiteSpaces;
public static string[] SplitWhiteSpacesAndMore(this string str, IEnumerable<char> otherDelimiters, StringSplitOptions options = StringSplitOptions.None)
{
var separatorList = new List<char>(WhiteSpaces);
separatorList.AddRange(otherDelimiters);
return str.Split(separatorList.ToArray(), options);
}
}
Now you can use this extension method in this way:
string str = "word1 word2\tword3.word4,word5;word6";
char[] separator = { '.', ',', ';' };
string[] split = str.SplitWhiteSpacesAndMore(separator, StringSplitOptions.RemoveEmptyEntries);

The simpler answers here do not use all whitespace characters as delimiters, as you state in your request, only the ones listed explicitly in the code. In those examples that is only SPACE, not TAB, CR, LF, or the other Unicode-defined whitespace characters.
I have not found a way to retrieve the default whitespace characters from String. However, they are defined in Regex as \s, and you can use that instead of String. In your case, adding period and comma to the Regex whitespace set:
Regex regex = new Regex(@"[\s\.,]+"); // The "+" treats a run of delimiters as one, so no blank entries appear between tokens
string input = @"1.2 3, 4";
string[] tokens = regex.Split(input);
will produce
tokens[0] "1"
tokens[1] "2"
tokens[2] "3"
tokens[3] "4"
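One caveat with the Regex approach, shown here as a minimal sketch: Regex.Split still returns an empty string when the input begins or ends with a delimiter, so a filter is needed to get the same effect as StringSplitOptions.RemoveEmptyEntries. The names RegexSplitSketch and Tokenize are illustrative only, and the delimiter set (whitespace plus '.', ',', ';') is an assumption taken from the question's examples.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class RegexSplitSketch
{
    // \s covers Unicode whitespace; '.', ',' and ';' are the extra delimiters.
    private static readonly Regex Delimiters = new Regex(@"[\s.,;]+");

    // Splits on runs of delimiters and drops the empty strings that
    // Regex.Split produces for leading or trailing delimiters.
    public static string[] Tokenize(string input) =>
        Delimiters.Split(input).Where(token => token.Length > 0).ToArray();
}
```

For example, RegexSplitSketch.Tokenize(" 1.2 3, 4.") yields the four tokens "1", "2", "3", "4", whereas the unfiltered Regex.Split result would also contain an empty string at each end.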

str.Split(" .,;".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);

I use something like the following to ensure I'm always splitting on Split's default whitespace characters:
public static string[] SplitOnWhitespaceAnd(this string value,
char[] separator, StringSplitOptions options = StringSplitOptions.RemoveEmptyEntries)
=> value.Split().SelectMany(s => s.Split(separator, options)).ToArray();
Note that to be consistent with Microsoft's naming conventions, you'd want to use WhiteSpace rather than Whitespace.
Refer to Microsoft's Char.IsWhiteSpace documentation to see the whitespace characters split on by default.
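Since the default delimiters are exactly the characters for which Char.IsWhiteSpace returns true, another option is to test each character directly instead of building a delimiter array. The following is a rough sketch under that assumption; PredicateSplitSketch and SplitOnWhiteSpaceAnd are hypothetical names, and empty entries are always dropped.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class PredicateSplitSketch
{
    // Splits on any Unicode whitespace character or any of the extra
    // delimiters, skipping empty entries.
    public static List<string> SplitOnWhiteSpaceAnd(string text, params char[] extraDelimiters)
    {
        var extras = new HashSet<char>(extraDelimiters);
        var tokens = new List<string>();
        var current = new StringBuilder();

        foreach (char c in text)
        {
            if (char.IsWhiteSpace(c) || extras.Contains(c))
            {
                // A delimiter ends the current token, if there is one.
                if (current.Length > 0)
                {
                    tokens.Add(current.ToString());
                    current.Clear();
                }
            }
            else
            {
                current.Append(c);
            }
        }

        // Flush the last token when the text does not end with a delimiter.
        if (current.Length > 0)
        {
            tokens.Add(current.ToString());
        }
        return tokens;
    }
}
```

Calling PredicateSplitSketch.SplitOnWhiteSpaceAnd("word1 word2\tword3.word4,word5;word6", '.', ',', ';') returns the six words. This makes a single pass over the string and avoids both the 65,536-character scan and the nested Split calls, at the cost of a little more code.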

string[] splitSentence(string sentence)
{
return sentence
.Replace(",", " , ")
.Replace(".", " . ")
.Split(' ', StringSplitOptions.RemoveEmptyEntries);
}
or
string[] result = test.Split(new string[] {"\n", "\r\n"},
StringSplitOptions.RemoveEmptyEntries);

Related

Remove multiple values from a string c#

I have an example string:
#5/r/n#12/r/n#23/r/n#43/r/n#54/r/n#23/r/n#77/r/n
I need to put these values into a list, keeping only the text between # and /r/n.
So far I have the following code:
List<string> result = Regex.Split(String, @"/r/n").ToList();
This separates the values but leaves the # in place. How can I remove the # from each value in the list?
You can do this in one line using LINQ:
List<string> result = Regex.Split(String, @"/r/n").Select(s => s.Replace("#", "")).ToList();
You can use the trim function to remove special characters from the front and end of your strings.
myString.Trim( new Char[] { '#', ' '} )
Use String.IsNullOrEmpty to filter out any empty strings as well:
List<string> result = Regex.Split(myString, @"/r/n").Select(a => a.Trim(new Char[] { '#', ' ' })).Where(b => !String.IsNullOrEmpty(b)).ToList();
You can use Trim:
char[] charsToTrim = { '#' };
List<string> result = Regex.Split(String, @"/r/n")
.Select(x => x.Trim(charsToTrim))
.ToList();
You could also split on the # and trim the other part, whichever makes sense.
I believe that Trim will be faster than Replace -- but I did not test it.
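As an alternative to splitting and then cleaning up, the values can be captured directly. This is a minimal sketch, assuming the input really contains the literal text /r/n (as in the question) rather than carriage return and line feed characters; BetweenMarkersSketch and Extract are illustrative names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class BetweenMarkersSketch
{
    // Captures whatever sits between a '#' and the literal text "/r/n".
    public static List<string> Extract(string input) =>
        Regex.Matches(input, @"#(.*?)/r/n")
             .Cast<Match>()                    // MatchCollection is non-generic on older frameworks
             .Select(m => m.Groups[1].Value)   // group 1 is the text between the markers
             .ToList();
}
```

For the input "#5/r/n#12/r/n#23/r/n" this returns the list "5", "12", "23", with no empty trailing entry to filter out.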

Removing words from text with separators in front (using Regex)

I need to remove words from the text with separators next to them. I already removed words but I don't know how I can remove separators at the same time. Any suggestions?
At the moment I have:
static void Main(string[] args)
{
Program p = new Program();
string text = "";
text = p.ReadText("Duomenys.txt", text);
string[] wordsToDelete = { "Hello", "Thanks", "kinda" };
char[] separators = { ' ', '.', ',', '!', '?', ':', ';', '(', ')', '\t' };
p.DeleteWordsFromText(text, wordsToDelete, separators);
}
public string ReadText(string file, string text)
{
text = File.ReadAllText(file);
return text;
}
public void DeleteWordsFromText(string text, string[] wordsToDelete, char[] separators)
{
Console.WriteLine(text);
for (int i = 0; i < wordsToDelete.Length; i++)
{
text = Regex.Replace(text, wordsToDelete[i], String.Empty);
}
Console.WriteLine("-------------------------------------------");
Console.WriteLine(text);
}
The results should be:
how are you?
I am good.
I have:
, how are you?
, I am . good.
Duomenys.txt
Hello, how are you?
Thanks, I am kinda. good.
You may build a regex like
\b(?:Hello|Thanks|kinda)\b[ .,!?:;() ]*
where \b(?:Hello|Thanks|kinda)\b will match any words to delete as whole words and [ .,!?:;() ]* will match all your separators 0 or more times after the words to delete.
The C# solution:
char[] separators = { ' ', '.', ',', '!', '?', ':', ';', '(', ')', '\t' };
string[] wordsToDelete = { "Hello", "Thanks", "kinda" };
string SepPattern = new String(separators).Replace(@"\", @"\\").Replace("^", @"\^").Replace("-", @"\-").Replace("]", @"\]");
var pattern = $@"\b(?:{string.Join("|", wordsToDelete.Select(Regex.Escape))})\b[{SepPattern}]*";
// => \b(?:Hello|Thanks|kinda)\b[ .,!?:;() ]*
Regex rx = new Regex(pattern, RegexOptions.Compiled);
// RegexOptions.IgnoreCase can be added to the above flags for case insensitive matching: RegexOptions.IgnoreCase | RegexOptions.Compiled
DeleteWordsFromText("Hello, how are you?", rx);
DeleteWordsFromText("Thanks, I am kinda. good.", rx);
Here is the DeleteWordsFromText method:
public static void DeleteWordsFromText(string text, Regex p)
{
Console.WriteLine($"---- {text} ----");
Console.WriteLine(p.Replace(text, ""));
}
Output:
---- Hello, how are you? ----
how are you?
---- Thanks, I am kinda. good. ----
I am good.
Notes:
string SepPattern = new String(separators).Replace(@"\", @"\\").Replace("^", @"\^").Replace("-", @"\-").Replace("]", @"\]"); - this is the separator pattern that will be used inside a character class; since only the ^, -, \ and ] chars require escaping inside a character class, only these chars are escaped.
$@"\b(?:{string.Join("|", wordsToDelete.Select(Regex.Escape))})\b" - this builds the alternation from the words to delete and will only match them as whole words.
Pattern details
\b - word boundary
(?: - start of a non-capturing group:
Hello - Hello word
| - or
Thanks - Thanks word
| - or
kinda - kinda word
) - end of the group
\b - word boundary
[ .,!?:;() ]* - any 0+ chars inside the character class.
See the regex demo.
You can build the regex like follows:
var regex = new Regex(@"\b("
+ string.Join("|", wordsToDelete.Select(Regex.Escape)) + ")("
+ string.Join("|", separators.Select(c => Regex.Escape(new string(c, 1)))) + ")?");
Explanation:
the \b at the start matches a word boundary. Just in case you get "XYZThanks"
the next part builds a regex construct matching any of the wordsToDelete
the last part builds a regex construct matching any of the separators; the trailing "?" is there because you said you want to replace the word also if no separator follows
I would not use Regex. Three months from now you will no longer understand the regex, and fixing bugs in it will be hard.
I would use simple loops. Everyone will understand:
public void DeleteWordsFromText(string text, string[] wordsToDelete, char[] separators)
{
Console.WriteLine(text);
foreach (string word in wordsToDelete)
{
foreach(char separator in separators)
{
text = text.Replace(word + separator, String.Empty);
}
}
Console.WriteLine("-------------------------------------------");
Console.WriteLine(text);
}

Getting punctuation from the end of a string only

I'm looking for a C# snippet to remove and store any punctuation from the end of a string only.
Example:
Test! would return !
Test;; would return ;;
Test?:? would return ?:?
!!Test!?! would return !?!
I have a rather clunky solution at the moment but wondered if anybody could suggest a more succinct way to do this.
My punctuation list is
new char[] { '.', ':', '-', '!', '?', ',', ';' }
You could use the following regular expression:
\p{P}*$
This breaks down to:
\p{P} - Unicode punctuation
* - Any number of times
$ - End of line anchor
If you know that there will always be some punctuation at the end of the string, use + for efficiency.
And use it like this in order to get the punctuation:
string punctuation = Regex.Match(myString, @"\p{P}*$").Value;
To actually remove it:
string noPunctuation = Regex.Replace(myString, @"\p{P}*$", string.Empty);
Use a regex:
resultString = Regex.Replace(subjectString, @"[.:!?,;-]+$", "");
Explanation:
[.:!?,;-] # Match a character that's one of the enclosed characters
+ # Do this once or more (as many times as possible)
$ # Assert position at the end of the string
As Oded suggested, use \p{P} instead of [.:!?,;-] if you want to remove all punctuation characters, not just the ones from your list.
To also "store" the punctuation, you could split the string:
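splitArray = Regex.Split(subjectString, @"(?<!\p{P})(?=\p{P}+$)");
Then splitArray[0] contains the part before the punctuation and splitArray[1] the punctuation characters, if there are any. The lookbehind (?<!\p{P}) is needed so that the split happens only at the start of the trailing run; without it, the lookahead also matches between the trailing punctuation characters and each one ends up in its own array element.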
Using Linq:
var punctuationMap = new HashSet<char>(new char[] { '.', ':', '-', '!', '?', ',', ';' });
var endPunctuationChars = aString.Reverse().
TakeWhile(ch => punctuationMap.Contains(ch));
var result = new string(endPunctuationChars.Reverse().ToArray());
The HashSet is not mandatory, you can use Linq's Contains on the array directly.
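Since the question asks to both remove and store the trailing punctuation, a plain backwards scan can return the two parts at once. This is only a sketch: TrailingPunctuationSketch and SplitOff are hypothetical names, the punctuation list is the one from the question, and the tuple return type assumes C# 7 or later.

```csharp
using System;

public static class TrailingPunctuationSketch
{
    private static readonly char[] Punctuation = { '.', ':', '-', '!', '?', ',', ';' };

    // Walks back from the end while the characters are in the punctuation
    // list, then cuts the string at that point.
    public static (string Body, string Punctuation) SplitOff(string text)
    {
        int end = text.Length;
        while (end > 0 && Array.IndexOf(Punctuation, text[end - 1]) >= 0)
        {
            end--;
        }
        return (text.Substring(0, end), text.Substring(end));
    }
}
```

With this, TrailingPunctuationSketch.SplitOff("!!Test!?!") gives ("!!Test", "!?!"), and a string without trailing punctuation comes back unchanged with an empty second part.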

Split a string by word using one of any or all delimiters?

I may have just hit the point where I'm overthinking it, but I'm wondering: is there a way to designate a list of special characters that should all be considered delimiters, and then split a string using that list? Example:
"battlestar.galactica-season 1"
should be returned as
battlestar galactica season 1
I'm thinking regex, but I'm kind of flustered at the moment; I've been staring at it for too long.
EDIT:
Thanks, guys, for confirming my suspicion that I was overthinking it. Here is what I ended up with:
//remove the delimiter
string[] tempString = fileTitle.Split(@"\/.-<>".ToCharArray());
fileTitle = "";
foreach (string part in tempString)
{
fileTitle += part + " ";
}
return fileTitle;
I suppose I could just replace the delimiters with spaces as well... I will select an answer as soon as the timer is up!
The built-in String.Split method can take a collection of characters as delimiters.
string s = "battlestar.galactica-season 1";
string[] words = s.Split('.', '-');
The standard split method does that for you. It takes an array of characters:
public string[] Split(
params char[] separator
)
You can just call an overload of split:
myString.Split(new char[] { '.', '-', ' ' }, StringSplitOptions.RemoveEmptyEntries);
The char array is a list of delimiters to split on.
"battlestar.galactica-season 1".Split(new string[] { ".", "-" }, StringSplitOptions.RemoveEmptyEntries);
This may not be complete but something like this.
string value = "battlestar.galactica-season 1";
char[] delimiters = new char[] { '\r', '\n', '.', '-' };
string[] parts = value.Split(delimiters,
StringSplitOptions.RemoveEmptyEntries);
for (int i = 0; i < parts.Length; i++)
{
Console.WriteLine(parts[i]);
}
Are you trying to split the string (make multiple strings) or do you just want to replace the special characters with a space as your example might also suggest (make 1 altered string).
For the first option just see the other answers :)
If you want to replace you could use
string title = "battlestar.galactica-season 1".Replace('.', ' ').Replace('-', ' ');
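If the goal is a single space-separated string, splitting and rejoining handles runs of delimiters more cleanly than chained Replace calls, because consecutive delimiters collapse to one space. A small sketch, assuming the delimiters are '.', '-' and space; DelimiterNormalizeSketch and ToSpaceSeparated are illustrative names.

```csharp
using System;

public static class DelimiterNormalizeSketch
{
    private static readonly char[] Delimiters = { '.', '-', ' ' };

    // Splits on any delimiter, drops empty pieces, and rejoins with single spaces.
    public static string ToSpaceSeparated(string title) =>
        string.Join(" ", title.Split(Delimiters, StringSplitOptions.RemoveEmptyEntries));
}
```

DelimiterNormalizeSketch.ToSpaceSeparated("battlestar.galactica-season 1") returns "battlestar galactica season 1", and unlike the loop in the question it leaves no trailing space.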
For more information on Split, with easy examples, see the following URL:
It also covers splitting on words (multiple chars).
C# Split Function explained

Filter a String

I want to make sure a string has only characters in this range
[a-z] && [A-Z] && [0-9] && [-]
so all letters and numbers plus the hyphen.
I tried this...
C# App:
char[] filteredChars = { ',', '!', '@', '#', '$', '%', '^', '&', '*', '(', ')', '_', '+', '=', '{', '}', '[', ']', ':', ';', '"', '\'', '?', '/', '.', '<', '>', '\\', '|' };
string s = str.TrimStart(filteredChars);
This TrimStart() only seems to work with letters, not other characters like $, %, etc.
Did I implement it wrong?
Is there a better way to do it?
I just want to avoid looping through each string's index checking because there will be a lot of strings to do...
Thoughts?
Thanks!
This seems like a perfectly valid reason to use a regular expression.
bool stringIsValid = Regex.IsMatch(inputString, @"^[a-zA-Z0-9\-]*?$");
In response to miguel's comment, you could do this to remove all unwanted characters:
string cleanString = Regex.Replace(inputString, @"[^a-zA-Z0-9\-]", "");
Note that the caret (^) is now placed inside the character class, thus negating it (matching any non-allowed character).
Here's a fun way to do it with LINQ - no ugly loops, no complicated RegEx:
private string GetGoodString(string input)
{
var allowedChars =
Enumerable.Range('0', 10).Concat(
Enumerable.Range('A', 26)).Concat(
Enumerable.Range('a', 26)).Concat(
Enumerable.Range('-', 1));
var goodChars = input.Where(c => allowedChars.Contains(c));
return new string(goodChars.ToArray());
}
Feed it "Hello, world? 123!" and it will return "Helloworld123".
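The same idea can be written with direct range comparisons, which avoids re-enumerating the allowed set for every character. This is a hedged sketch rather than a drop-in replacement: AllowedCharsSketch and its members are hypothetical names. It sticks to ASCII ranges on purpose, since char.IsLetterOrDigit would also accept non-ASCII letters and digits, which the question's ranges exclude.

```csharp
using System;
using System.Linq;

public static class AllowedCharsSketch
{
    // True only for a-z, A-Z, 0-9 and the hyphen.
    public static bool IsAllowed(char c) =>
        (c >= 'a' && c <= 'z') ||
        (c >= 'A' && c <= 'Z') ||
        (c >= '0' && c <= '9') ||
        c == '-';

    // Validation: every character must be allowed.
    public static bool IsValid(string input) => input.All(IsAllowed);

    // Filtering: keep only the allowed characters.
    public static string Filter(string input) =>
        new string(input.Where(IsAllowed).ToArray());
}
```

AllowedCharsSketch.Filter("Hello, world? 123!") returns "Helloworld123", and AllowedCharsSketch.IsValid("abc-123") is true while AllowedCharsSketch.IsValid("abc_123") is false.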
Why not just use replace instead? Trimstart will only remove the leading characters in your list...
Try the following
public bool isStringValid(string input) {
if ( null == input ) {
throw new ArgumentNullException("input");
}
return System.Text.RegularExpressions.Regex.IsMatch(input, @"^[A-Za-z0-9\-]*$");
}
I'm sure that with a bit more time you can come up with something better, but this will give you a good idea:
public string NumberOrLetterOnly(string s)
{
string rtn = s;
for (int i = 0; i < s.Length; i++)
{
if (!char.IsLetterOrDigit(rtn[i]) && rtn[i] != '-')
{
rtn = rtn.Replace(rtn[i].ToString(), " ");
}
}
return rtn.Replace(" ", "");
}
I have tested these two solutions in Linqpad 5. The benefit of these is that they can be used not only for integers, but also decimals / floats with a number decimal separator, which is culture dependent. For example, in Norway we use the comma as the decimal separator, whereas in the US, the dot is used. The comma is used there as a thousands separator. Anyways, first the Linq version and then the Regex version. The most terse bit is accessing the Thread's static property for number separator, but you can compress this a bit using static at the top of the code, or better - put such functionality into C# extension methods, preferably having overloads with arbitrary Regex patterns.
string crappyNumber = @"40430dfkZZZdfldslkggh430FDFLDEFllll340-DIALNOWFORCHRISTSAKE.,CAKE-FORFIRSTDIAL920932903209032093294faøj##R#KKL##K";
string.Join("", crappyNumber.Where(c => char.IsDigit(c)|| c.ToString() == Thread.CurrentThread.CurrentCulture.NumberFormat.NumberDecimalSeparator)).Dump();
new String(crappyNumber.Where(c => new Regex($"[\\d]+{Thread.CurrentThread.CurrentUICulture.NumberFormat.NumberDecimalSeparator}\\d+").IsMatch(c.ToString())).ToArray()).Dump();
Note to the code above, the Dump() method dumps the results to Linqpad. Your code will of course skip this very last part. Also note that we got it down to a one liner, but it is a bit verbose still and can be put into C# extension methods as suggested.
Also, instead of string.Join, constructing a new String object is more compact syntax and less error prone.
We got a crappy number as input, but we managed to get our number in the end! And it is Culture aware in C#!
