How can I strip punctuation from a string? - c#

For the hope-to-have-an-answer-in-30-seconds part of this question, I'm specifically looking for C#.
But in the general case, what's the best way to strip punctuation in any language?
I should add: Ideally, the solutions won't require you to enumerate all the possible punctuation marks.
Related: Strip Punctuation in Python

new string(myCharCollection.Where(c => !char.IsPunctuation(c)).ToArray());

Why not simply:
string s = "sxrdct?fvzguh,bij.";
var sb = new StringBuilder();
foreach (char c in s)
{
if (!char.IsPunctuation(c))
sb.Append(c);
}
s = sb.ToString();
The usage of RegEx is normally slower than simple char operations. And those LINQ operations look like overkill to me. And you can't use such code in .NET 2.0...

Describes intent, easiest to read (IMHO) and best performing:
s = s.StripPunctuation();
to implement:
public static class StringExtension
{
public static string StripPunctuation(this string s)
{
var sb = new StringBuilder();
foreach (char c in s)
{
if (!char.IsPunctuation(c))
sb.Append(c);
}
return sb.ToString();
}
}
This is using Hades32's algorithm which was the best performing of the bunch posted.

Assuming "best" means "simplest" I suggest using something like this:
String stripped = input.replaceAll("\\p{Punct}+", "");
This example is for Java, but all sufficiently modern Regex engines should support this (or something similar).
Edit: the Unicode-Aware version would be this:
String stripped = input.replaceAll("\\p{P}+", "");
The first version only looks at punctuation characters contained in ASCII.

You can use the regex.replace method:
replace(YourString, RegularExpressionWithPunctuationMarks, Empty String)
Since this returns a string, your method will look something like this:
string s = Regex.Replace("Hello!?!?!?!", "[?!]", "");
You can replace "[?!]" with something more sophisticated if you want:
(\p{P})
This should find any punctuation.
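As a rough sketch of the \p{P} approach in C#: the class and method names below are made up for illustration. The Unicode punctuation categories do not include symbols such as '$', '+', '^' or '~', so a second, hypothetical helper also strips the symbol categories (\p{S}) for cases where those should go too.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, not part of any answer above.
public static class PunctuationDemo
{
    // Removes every character in the Unicode punctuation categories (\p{P}).
    public static string StripPunctuation(string input)
    {
        return Regex.Replace(input, @"\p{P}+", "");
    }

    // Also removes symbols (\p{S}), e.g. '$', '+', '^' and '~',
    // which Unicode does not classify as punctuation.
    public static string StripPunctuationAndSymbols(string input)
    {
        return Regex.Replace(input, @"[\p{P}\p{S}]+", "");
    }
}
```

With this sketch, PunctuationDemo.StripPunctuation("Hello, world!?") gives "Hello world", while "a$b+c" comes back unchanged unless StripPunctuationAndSymbols is used.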

This thread is so old, but I'd be remiss not to post a more elegant (IMO) solution.
string inputSansPunc = input.Where(c => !char.IsPunctuation(c)).Aggregate("", (current, c) => current + c);
It's LINQ sans WTF.

Based off GWLlosa's idea, I was able to come up with the supremely ugly, but working:
string s = "cat!";
s = s.ToCharArray().ToList<char>()
.Where<char>(x => !char.IsPunctuation(x))
.Aggregate<char, string>(string.Empty, new Func<string, char, string>(
delegate(string acc, char c) { return acc + c; })); // parameter renamed: "s" would clash with the outer local s

The most braindead simple way of doing it would be using string.replace
The other way I would imagine is a regex.replace and have your regular expression with all the appropriate punctuation marks in it.

Here's a slightly different approach using linq. I like AviewAnew's but this avoids the Aggregate
string myStr = "Hello there..';,]';';., Get rid of Punction";
var s = from ch in myStr
where !Char.IsPunctuation(ch)
select ch;
// Build the result directly; round-tripping through ASCII bytes would turn non-ASCII characters into '?'
var stringResult = new string(s.ToArray());

If you want to use this for tokenizing text you can use:
new string(myText.Select(c => char.IsPunctuation(c) ? ' ' : c).ToArray())

For anyone who would like to do this via RegEx:
This code shows the full RegEx replace process and gives a sample Regex that only keeps letters, numbers, and spaces in a string - replacing ALL other characters with an empty string:
//Regex to remove all non-alphanumeric characters
System.Text.RegularExpressions.Regex TitleRegex = new
System.Text.RegularExpressions.Regex("[^a-z0-9 ]+",
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
string ParsedString = TitleRegex.Replace(stringToParse, String.Empty);
return ParsedString;

I faced the same issue and was concerned about the performance impact of calling the IsPunctuation for every single check.
I found this post: http://www.dotnetperls.com/char-ispunctuation.
Reading between the lines: char.IsPunctuation also handles Unicode on top of ASCII.
The method matches a bunch of characters including control characters. By definition, this method is heavy and expensive.
The bottom line is that I finally didn't go for it because of its performance impact on my ETL process.
I went for the custom implementation from dotnetperls.
And just FYI, here is some code deduced from the previous answers to get the list of all punctuation characters (excluding the control ones):
var punctuationCharacters = new List<char>();
for (int i = char.MinValue; i <= char.MaxValue; i++)
{
var character = Convert.ToChar(i);
if (char.IsPunctuation(character) && !char.IsControl(character))
{
punctuationCharacters.Add(character);
}
}
var commaSeparatedValueOfPunctuationCharacters = string.Join("", punctuationCharacters);
Console.WriteLine(commaSeparatedValueOfPunctuationCharacters);
Cheers,
Andrew

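// ereg_replace was removed in PHP 7; preg_replace with the same POSIX class is the current equivalent
$newstr = preg_replace('/[[:punct:]]/', '', $oldstr);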

For long strings I use this:
var normalized = input
.Where(c => !char.IsPunctuation(c))
.Aggregate(new StringBuilder(),
(current, next) => current.Append(next), sb => sb.ToString());
performs much better than using string concatenations (though I agree it's less intuitive).

This is simple code for removing punctuation from strings given by the user:
# Import required library
import string
# Ask input from user in string format
strs = str(input('Enter your string:'))
for c in string.punctuation:
    strs = strs.replace(c, "")
print(f"\n Your String without punctuation:{strs}")

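#include <iostream>
#include <string>
#include <cctype>
using namespace std;
int main() {
    string strOne = "H,e.l/l!o W#o#r^l&d!!!";
    int punct_count = 0;
    cout << "before : " << strOne << endl;
    for (string::size_type ix = 0; ix < strOne.size(); )
    {
        // cast to unsigned char: passing a negative char to ispunct is undefined behaviour
        if (ispunct(static_cast<unsigned char>(strOne[ix])))
        {
            ++punct_count;
            strOne.erase(ix, 1); // do not advance: the next character has moved into position ix
        }
        else
        {
            ++ix;
        }
    }
    cout << "after : " << strOne << endl;
    cout << "removed : " << punct_count << endl;
    return 0;
}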

Related

convert non alphanumeric glyphs to unicode while preserving alphanumeric

I need to convert non alpha-numeric glyphs in a string to their unicode value, while preserving the alphanumeric characters. Is there a method to do this in C#?
As an example, I need to convert this string:
"hello world!"
To this:
"hello_x0020_world_x0021_"
To get a string that is safe for an XML node name you should use XmlConvert.EncodeName.
Note that if you need to encode all non-alphanumeric characters you'd need to write it yourself as "_" is not encoded by that method.
You could start with this code using LINQ Select extension method:
string str = "hello world!";
string a = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
a += a.ToLower();
char[] alphabet = a.ToCharArray();
str = string.Join("",
str.Select(ch => alphabet.Contains(ch) ?
ch.ToString() : String.Format("_x{0:x4}_", (int)ch)).ToArray() // cast to int so the x4 hex format is applied
);
Now clearly it has some problems:
it does linear search in the list of characters
missed numeric...
if we add numeric need to decide if first character is ok to be digit (assuming yes)
code creates large number of strings that are immediately discarded (one per character)
alphanumeric is limited to ASCII (assuming ok, if not Char.IsLetterOrDigit to help)
does too much work for pure alphanumeric strings
The first two are easy: we can use a HashSet (O(1) Contains) initialized with the full list of characters (if any alphanumeric characters are OK, it is more readable to use the existing method, Char.IsLetterOrDigit):
public static HashSet<char> asciiAlphaNum = new HashSet<char>
("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789");
To avoid ch.ToString() that really pointlessly produces strings for immediate GC we need to figure out how to construct string from mix of char and string. String.Join does not work because it wants strings to start with, regular new string(...) does not have option for mix of char and string. So we are left with StringBuilder that happily takes both to Append. Consider starting with initial size str.Length if most strings don't have other characters.
So for each character we just need to call either builder.Append(ch) or builder.AppendFormat("_x{0:x4}_", (int)ch). To perform the iteration it is easier to just use a regular foreach, but if one really wants LINQ, Enumerable.Aggregate is the way to go.
string ReplaceNonAlphaNum(string str)
{
var builder = new StringBuilder();
foreach (var ch in str)
{
if (asciiAlphaNum.Contains(ch))
builder.Append(ch);
else
builder.AppendFormat("_x{0:x4}_", (int)ch);
}
return builder.ToString();
}
string ReplaceNonAlphaNumLinq(string str)
{
return str.Aggregate(new StringBuilder(), (builder, ch) =>
asciiAlphaNum.Contains(ch) ?
builder.Append(ch) : builder.AppendFormat("_x{0:x4}_", (int)ch)
).ToString();
}
To the last point - we don't really need to do anything if there is nothing to convert - so some check like check alphanumeric characters in string in c# would help to avoid extra strings.
Thus final version (LINQ as it is a bit shorter and fancier):
private static readonly Regex asciiAlphaNumRx = new Regex(@"^[a-zA-Z0-9]*$");
public static HashSet<char> asciiAlphaNum = new HashSet<char>
("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789");
string ReplaceNonAlphaNumLinq(string str)
{
return asciiAlphaNumRx.IsMatch(str) ? str :
str.Aggregate(new StringBuilder(), (builder, ch) =>
asciiAlphaNum.Contains(ch) ?
builder.Append(ch) : builder.AppendFormat("_x{0:x4}_", (int)ch)
).ToString();
}
Alternatively whole thing could be done with Regex - see Regex replace: Transform pattern with a custom function for starting point.
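A minimal sketch of that Regex route, assuming the same "_xNNNN_" encoding as above and the Regex.Replace overload that takes a MatchEvaluator. The class name XmlishEncoder is invented for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical class name; the encoding format follows the question's example.
public static class XmlishEncoder
{
    // Matches any single character that is not an ASCII letter or digit.
    private static readonly Regex NonAlphaNum = new Regex("[^a-zA-Z0-9]");

    public static string ReplaceNonAlphaNum(string str)
    {
        // The evaluator runs once per match and returns the replacement text.
        return NonAlphaNum.Replace(str, m => string.Format("_x{0:x4}_", (int)m.Value[0]));
    }
}
```

XmlishEncoder.ReplaceNonAlphaNum("hello world!") returns "hello_x0020_world_x0021_". Characters outside the Basic Multilingual Plane are stored as two UTF-16 code units, so this sketch would encode each half separately.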

Using string.ToUpper on substring

Have an assignment to allow a user to input a word in C# and then display that word with the first and third characters changed to uppercase. Code follows:
namespace Capitalizer
{
class Program
{
static void Main(string[] args)
{
string text = Console.ReadLine();
char[] delimiterChars = { ' ' };
string[] words = text.Split(delimiterChars);
string Upper = text.ToUpper();
Console.WriteLine(Upper);
Console.ReadKey();
}
}
}
This of course generates the entire word in uppercase, which is not what I want. I can't seem to make text.ToUpper(0,2) work, and even then that'd capitalize the first three letters. Only solution I can think of now that would make the word appear on one line (and I don't know if it works) is to move the capitalized letters and lowercase letters into a character array and try to get that to print all values in a modified order.
The simplest way I can think of to address your exact question as described — to convert to upper case the first and third characters of the input — would be something like the following:
StringBuilder sb = new StringBuilder(text);
sb[0] = char.ToUpper(sb[0]);
sb[2] = char.ToUpper(sb[2]);
text = sb.ToString();
The StringBuilder class is essentially a mutable string object, so when doing these kinds of operations is the most fluid way to approach the problem, as it provides the most straightforward conversions to and from, as well as the full range of string operations. Changing individual characters is easy in many data structures, but insertions, deletions, appending, formatting, etc. all also come with StringBuilder, so it's a good habit to use that versus other approaches.
But frankly, it's hard to see how that's a useful operation. I can't help but wonder if you have stated the requirements incorrectly and there's something more to this question than is seen here.
You could use LINQ:
var upperCaseIndices = new[] { 0, 2 };
var message = "hello";
var newMessage = new string(message.Select((c, i) =>
upperCaseIndices.Contains(i) ? Char.ToUpper(c) : c).ToArray());
Here is how it works. message.Select (inline LINQ query) selects characters from message one by one and passes into selector function:
upperCaseIndices.Contains(i) ? Char.ToUpper(c) : c
written as C# ?: shorthand syntax for if. It reads as "If index is present in the array, then select upper case character. Otherwise select character as is."
(c, i) => condition
is a lambda expression. See also:
Understand Lambda Expressions in 3 minutes
The rest is very simple - represent result as array of characters (.ToArray()), and create a new string based off that (new string(...)).
Only solution I can think of now that would make the word appear on one line (and I don't know if it works) is to move the capitalized letters and lowercase letters into a character array and try to get that to print all values in a modified order.
That seems a lot more complicated than necessary. Once you have a character array, you can simply change the elements of that character array. In a separate function, it would look something like
string MakeFirstAndThirdCharacterUppercase(string word) {
var chars = word.ToCharArray();
chars[0] = char.ToUpper(chars[0]); // ToUpper is a static method on char, not an instance method
chars[2] = char.ToUpper(chars[2]);
return new string(chars);
}
My simple solution:
string text = Console.ReadLine();
char[] delimiterChars = { ' ' };
string[] words = text.Split(delimiterChars);
foreach (string s in words)
{
char[] chars = s.ToCharArray();
chars[0] = char.ToUpper(chars[0]);
if (chars.Length > 2)
{
chars[2] = char.ToUpper(chars[2]);
}
Console.Write(new string(chars));
Console.Write(' ');
}
Console.ReadKey();

How to get the count of only special character in a string using Regex?

If my input string is ~!@#$%^&*()_+{}:"<>?
How do I get the count of each special character using Regex? For example:
Regex.Matches(inputText, "each special character").Count;
This should be the answer to your question:
Regex.Matches("Little?~ birds! like to# sing##", "[~!@#$%^&*()_+{}:\"<>?]").Count
Count should return 6 matches; replace the sample sentence with your own variable or string.
You can find more info about regex expressions here:
http://www.zytrax.com/tech/web/regex.htm
Best Regards!
Instead of thinking of every special characters and adding them up, do it the other way; count every letters/digits and subtract them from the count.
You can do that with a simple one-liner :
string input = "abc?&;3";
int numberOfSpecialCharacters = input.Length - input.Count(char.IsLetterOrDigit); //Gives 3
Which you can also change to
int numberOfSpecialCharacters = input.Count(c => !char.IsLetterOrDigit(c));
Regex is not the best way to do this. here is the Linq based solution
string chars = "~!@#$%^&*()_+{}:\"<>?";
foreach (var item in chars.Where(x=> !char.IsLetterOrDigit(x)).GroupBy(x => x))
{
Console.WriteLine(string.Format("{0},{1}",item.Key,item.Count()));
}
I understand that you need the count of each special character. Correct me if I am mistaken.
The non-regex way (which sounds much easier) is to make a list of the characters you want to check and use Linq to find the count of those characters.
string inputString = "asdf1!%jkl(!*";
List<char> charsToCheckFor = new List<char>() { '!', '@', '#', ..... };
int charCount = inputString.Count(x => charsToCheckFor.Contains(x));
I am making you write in all the characters you need to check for, because you need to figure out what you want.
If you want to follow other approach then you can use.
string str = "#123:*&^789'!##$*()_+=";
int count = 0;
foreach (char c in str)
{
if (!char.IsLetterOrDigit(c))
{
count++;
}
}
MessageBox.Show(count.ToString());
It's been a while and I needed a similar answer for handling password validation. Pretty much what VITA said, but here was my specific take for others needing it for the same thing:
var pwdSpecialCharacterCount = Regex.Matches(item, "[~!@#$%^&*()_+{}:\"<>?]").Count;
var pwdMinNumericalCharacters = Regex.Matches(item, "[0-9]").Count;
var pwdMinUpperCaseCharacters = Regex.Matches(item, "[A-Z]").Count;
var pwdMinLowerCaseCharacters = Regex.Matches(item, "[a-z]").Count;

What's the best way to merge strings?

Let's say I have a foreach-loop with strings like this:
String newStr = "";
String str = "a b c d e";
foreach (String strChar in str.Split(' ')) {
newStr += strChar + ",";
}
the result would be something like: a,b,c,d,e, but what I want is a,b,c,d,e without the last comma. I normally split the last comma out but this seems ugly and overweight. Is there any lightweight way to do this?
Additional to this question: Is there any easy solution to add an "and" to the constellation that the result is something like: a, b, c, d and e for user output?
p.s.: I know that I can use the replace-method in the example but this is not what I'm looking because in most cases you can't use it (for example when you build a sql string).
I would use string.Join:
string newStr = string.Join(",", str.Split(' '));
Alternatively, you could add the separator at the start of the body of the loop, but not on the first time round.
I'd suggest using StringBuilder if you want to keep doing this by hand though. In fact, with a StringBuilder you could just unconditionally append the separator, and then decrement the length at the end to trim that end.
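The "append unconditionally, then shrink" idea could look roughly like the following. JoinDemo and JoinWithCommas are placeholder names for this sketch; the only API it relies on is the settable StringBuilder.Length property.

```csharp
using System;
using System.Text;

// Placeholder names for illustration only.
public static class JoinDemo
{
    public static string JoinWithCommas(string[] parts)
    {
        var sb = new StringBuilder();
        foreach (string part in parts)
        {
            // Always append the separator; the last one is trimmed below.
            sb.Append(part).Append(',');
        }
        if (sb.Length > 0)
        {
            sb.Length--; // shrinking Length drops the trailing comma
        }
        return sb.ToString();
    }
}
```

JoinDemo.JoinWithCommas("a b c d e".Split(' ')) yields "a,b,c,d,e", the same result string.Join gives in one call.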
You also wrote:
for example when you build a sql string
It's very rarely a good idea to build a SQL string like this. In particular, you should absolutely not use strings from user input here - use parameterized SQL instead. Building SQL is typically the domain of ORM code... in which case it's usually better to use an existing ORM than to roll your own :)
You're characterizing the problem as appending a comma after every string except the last. Consider characterizing it as prepending a comma before every string but the first. It's an easier problem.
As for your harder version there are several dozen solutions on my blog and in this question.
Eric Lippert's challenge "comma-quibbling", best answer?
string.Join may be your friend:
string str = "a b c d e";
var newStr = string.Join(",", str.Split(' '));
Here's how you can do it where you have "and" before the last value.
var vals = str.Split(' ');
var ans = vals.Length == 1 ?
str :
string.Join(", ", vals.Take(vals.Length - 1)) + ", and " + vals.Last();
newStr = String.Join(",", str.Split(' '));
You can use Regex and replace whitespaces with commas
string newst = Regex.Replace(input, " ", ",");
First, you should be using a StringBuilder for string manipulations of this sort. Second, it's just an if conditional on the insert.
System.Text.StringBuilder newStr = new System.Text.StringBuilder("");
string oldStr = "a b c d e";
foreach(string c in oldStr.Split(' ')) {
if (newStr.Length > 0) newStr.Append(",");
newStr.Append(c);
}

Does C# have a String Tokenizer like Java's?

I'm doing simple string input parsing and I am in need of a string tokenizer. I am new to C# but have programmed Java, and it seems natural that C# should have a string tokenizer. Does it? Where is it? How do I use it?
You could use String.Split method.
class ExampleClass
{
public ExampleClass()
{
string exampleString = "there is a cat";
// Split string on spaces. This will separate all the words in a string
string[] words = exampleString.Split(' ');
foreach (string word in words)
{
Console.WriteLine(word);
// there
// is
// a
// cat
}
}
}
For more information see Sam Allen's article about splitting strings in c# (Performance, Regex)
I just want to highlight the power of C#'s Split method and give a more detailed comparison, particularly from someone who comes from a Java background.
Whereas StringTokenizer in Java only allows a single delimiter, we can actually split on multiple delimiters making regular expressions less necessary (although if one needs regex, use regex by all means!) Take for example this:
str.Split(new char[] { ' ', '.', '?' })
This splits on three different delimiters, returning an array of tokens. We can also remove empty entries with what would be a second parameter for the above example:
str.Split(new char[] { ' ', '.', '?' }, StringSplitOptions.RemoveEmptyEntries)
One thing Java's StringTokenizer does have that I believe C#'s Split is lacking (at least Java 7 has this feature) is the ability to keep the delimiter(s) as tokens. C#'s Split will discard the delimiters. This could be important in, say, some NLP applications, but for more general-purpose applications this might not be a problem.
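For what it's worth, delimiters can be kept when going through Regex.Split instead of String.Split: when the pattern contains a capturing group, the captured text is included in the resulting array. A small sketch, using the same three delimiters as above and a made-up class name:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Made-up class name for illustration.
public static class TokenizerDemo
{
    public static string[] SplitKeepingDelimiters(string input)
    {
        // The parentheses make the delimiter a captured group,
        // so Regex.Split returns it alongside the surrounding text.
        return Regex.Split(input, @"([ .?])")
                    .Where(token => token.Length > 0) // drop empty strings between adjacent delimiters
                    .ToArray();
    }
}
```

For the input "Is it? Yes." this returns the seven tokens "Is", " ", "it", "?", " ", "Yes" and ".".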
The Split method of a string is what you need. In fact, Java's StringTokenizer is a legacy class whose use is discouraged in new code in favor of Java's String.split method.
I think the nearest in the .NET Framework is
string.Split()
For complex splitting you could use a regex creating a match collection.
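One way to read "a regex creating a match collection" is to describe what a token looks like and collect the matches, rather than describing the separators. The pattern and names below are assumptions chosen for this sketch (letters, decimal digits and apostrophes count as word characters).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Assumed names and token pattern, for illustration only.
public static class MatchTokenizer
{
    public static List<string> Words(string text)
    {
        var words = new List<string>();
        // Regex.Matches returns a MatchCollection; each Match is one token.
        foreach (Match m in Regex.Matches(text, @"[\p{L}\p{Nd}']+"))
        {
            words.Add(m.Value);
        }
        return words;
    }
}
```

MatchTokenizer.Words("Don't panic, it's 42!") gives "Don't", "panic", "it's" and "42".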
_words = new List<string>(YourText.ToLower().Trim('\n', '\r').Split(' ').
Select(x => new string(x.Where(Char.IsLetter).ToArray())));
Or
_words = new List<string>(YourText.Trim('\n', '\r').Split(' ').
Select(x => new string(x.Where(Char.IsLetterOrDigit).ToArray())));
The similar to Java's method is:
Regex.Split(string, pattern);
where
string - the text you need to split
pattern - string type pattern, what is splitting the text
use Regex.Split(string,"#|#");
Read this: the Split function has an overload that takes an array of separators
http://msdn.microsoft.com/en-us/library/system.stringsplitoptions.aspx
If you're trying to do something like splitting command line arguments in a .NET Console app, you're going to have issues because .NET is either broken or is trying to be clever (which means it's as good as broken). I needed to be able to split arguments by the space character, preserving any literals that were quoted so they didn't get split in the middle. This is the code I wrote to do the job:
private static List<String> Tokenise(string value, char seperator)
{
List<string> result = new List<string>();
value = value.Replace("  ", " ").Replace("  ", " ").Trim();
StringBuilder sb = new StringBuilder();
bool insideQuote = false;
foreach(char c in value.ToCharArray())
{
if(c == '"')
{
insideQuote = !insideQuote;
}
if((c == seperator) && !insideQuote)
{
if (sb.ToString().Trim().Length > 0)
{
result.Add(sb.ToString().Trim());
sb.Clear();
}
}
else
{
sb.Append(c);
}
}
if (sb.ToString().Trim().Length > 0)
{
result.Add(sb.ToString().Trim());
}
return result;
}
If you are using C# 3.0 (.NET 3.5) or later you could write an extension method on System.String that does the splitting you need. You can then use the syntax:
string.SplitByMyTokens();
More info and a useful example from MS here http://msdn.microsoft.com/en-us/library/bb383977.aspx
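Such an extension method might be sketched as follows. The delimiter set is purely an assumption for the example; SplitByMyTokens is the name used above.

```csharp
using System;

public static class StringSplitExtensions
{
    // Assumed delimiter set; replace with whatever your input needs.
    private static readonly char[] MyTokens = { ' ', ',', ';' };

    // The "this" modifier makes the method callable as someString.SplitByMyTokens().
    public static string[] SplitByMyTokens(this string s)
    {
        return s.Split(MyTokens, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

With this in scope, "a, b;c".SplitByMyTokens() returns the three tokens "a", "b" and "c".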
