String normalisation

String normalisation - c#

I'm writing some code which needs to do string normalisation: I want to turn a given string into a camel-case representation (well, to the best guess at least). Example:
"the quick brown fox" => "TheQuickBrownFox"
"the_quick_brown_fox" => "TheQuickBrownFox"
"123The_quIck bROWN FOX" => "TheQuickBrownFox"
"the_quick brown fox 123" => "TheQuickBrownFox123"
"thequickbrownfox" => "Thequickbrownfox"
I think you should be able to get the idea from those examples. I want to strip out all special characters (', ", !, #, ., etc.), capitalise every word (words are delimited by a space, _ or -) and drop any leading numbers (trailing and internal numbers are OK, but this requirement isn't vital, depending on the difficulty really).
I'm trying to work out what would be the best way to achieve this. My first guess would be with a regular expression, but my regex skills are bad at best so I wouldn't really know where to start.
My other idea would be to loop and parse the data: break it down into words, parse each one, and rebuild the string that way.
Or is there another way in which I could go about it?

How about a simple solution using Strings.StrConv in the Microsoft.VisualBasic namespace?
(Don't forget to add a Project Reference to Microsoft.VisualBasic):
using System;
using VB = Microsoft.VisualBasic;
namespace ConsoleApplication1
{
class Program
{
static void Main(string[] args)
{
Console.WriteLine(VB.Strings.StrConv("QUICK BROWN", VB.VbStrConv.ProperCase, 0));
Console.ReadLine();
}
}
}

This regex matches all words. Then, we Aggregate them with a method that capitalizes the first chars, and ToLowers the rest of the string.
Regex regex = new Regex(@"[a-zA-Z]*", RegexOptions.Compiled);
private string CamelCase(string str)
{
return regex.Matches(str).OfType<Match>().Aggregate("", (s, match) => s + CamelWord(match.Value));
}
private string CamelWord(string word)
{
if (string.IsNullOrEmpty(word))
return "";
return char.ToUpper(word[0]) + word.Substring(1).ToLower();
}
This method ignores numbers, by the way. To add them, you can change the regex to @"[a-zA-Z]*|[0-9]*", I suppose - but I haven't tested it.
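For comparison, here is a minimal sketch that follows the rules from the question more literally: drop everything before the first letter, strip special characters, split on space, _ and -, then capitalise each word. The names CamelCaseSketch and ToCamelCase are made up for this example, and it only handles ASCII letters and digits, so treat it as a starting point rather than a finished implementation.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not taken from any of the answers above.
static class CamelCaseSketch
{
    public static string ToCamelCase(string input)
    {
        // Drop everything before the first ASCII letter (this removes leading digits).
        string s = Regex.Replace(input, @"^[^a-zA-Z]+", "");
        // Strip special characters, keeping letters, digits and the three word separators.
        s = Regex.Replace(s, @"[^a-zA-Z0-9 _-]", "");
        // Split on the separators named in the question.
        string[] words = s.Split(new[] { ' ', '_', '-' }, StringSplitOptions.RemoveEmptyEntries);
        // Upper-case the first character of each word and lower-case the rest.
        return string.Concat(words.Select(
            w => char.ToUpperInvariant(w[0]) + w.Substring(1).ToLowerInvariant()));
    }

    static void Main()
    {
        Console.WriteLine(ToCamelCase("123The_quIck bROWN FOX"));  // TheQuickBrownFox
        Console.WriteLine(ToCamelCase("the_quick brown fox 123")); // TheQuickBrownFox123
    }
}
```

Stripping special characters before splitting means "don't stop" becomes "DontStop" rather than "DonTStop".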

Any solution that involves matching particular characters may not work well with some character encodings, particularly if Unicode representation is being used, which has dozens of space characters, thousands of 'symbols', thousands of punctuation characters, thousands of 'letters', etc. It would be better wherever possible to use built-in Unicode-aware functions. In terms of what is a 'special character', well you could decide based on Unicode categories. For instance, it would include 'Punctuation' but would it include 'Symbols'?
ToLower(), IsLetter(), etc should be fine, and take into account all possible letters in Unicode. Matching against dashes and slashes should probably take into account some of the dozens of space and dash characters in Unicode.
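As a rough illustration of that idea, the sketch below decides everything through the built-in Unicode-aware methods: char.IsLetter, char.IsDigit, char.IsWhiteSpace and char.GetUnicodeCategory. Treating the DashPunctuation and ConnectorPunctuation categories as word separators is my own assumption about what the question wants, and the class and method names are hypothetical.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical helper illustrating category-based decisions.
static class UnicodeCamelCase
{
    public static string ToCamelCase(string input, CultureInfo culture)
    {
        var sb = new StringBuilder();
        bool startOfWord = true;
        foreach (char c in input)
        {
            if (char.IsLetter(c))
            {
                // The first letter of a word is upper-cased, the rest lower-cased.
                sb.Append(startOfWord ? char.ToUpper(c, culture) : char.ToLower(c, culture));
                startOfWord = false;
            }
            else if (char.IsDigit(c))
            {
                // Digits are kept only after at least one letter has been written.
                if (sb.Length > 0) sb.Append(c);
            }
            else
            {
                UnicodeCategory category = char.GetUnicodeCategory(c);
                // Assumption: any white space, dash or connector (such as '_') ends a word.
                if (char.IsWhiteSpace(c)
                    || category == UnicodeCategory.DashPunctuation
                    || category == UnicodeCategory.ConnectorPunctuation)
                {
                    startOfWord = true;
                }
                // Every other character is dropped without ending the word.
            }
        }
        return sb.ToString();
    }

    static void Main()
    {
        Console.WriteLine(ToCamelCase("123The_quIck bROWN FOX", CultureInfo.InvariantCulture));
    }
}
```

Because it walks UTF-16 code units one at a time, characters outside the Basic Multilingual Plane are not classified correctly by this version.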

You could wear ruby slippers to work :)
def camelize str
str.gsub(/^[^a-zA-Z]*/, '').split(/[^a-zA-Z0-9]/).map(&:capitalize).join
end

thought it'd be fun to try it, here's what i came up with:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
namespace ConsoleApplication2
{
class Program
{
static void Main(string[] args)
{
StringBuilder sb = new StringBuilder();
string sentence = "123The_quIck bROWN FOX1234";
sentence = sentence.ToLower();
char[] s = sentence.ToCharArray();
bool atStart = true;
char pChar = ' ';
char[] spaces = { ' ', '_', '-' };
char a;
foreach (char c in s)
{
if (atStart && char.IsDigit(c)) continue;
if (char.IsLetter(c))
{
a = c;
if (spaces.Contains(pChar))
a = char.ToUpper(a);
sb.Append(a);
atStart = false;
}
else if(char.IsDigit(c))
{
sb.Append(c);
}
pChar = c;
}
Console.WriteLine(sb.ToString());
Console.ReadLine();
}
}
}

Related

How do I get a non lowercase string after quotes in the titlecase condition

In my article titles, I use CultureInfo.CurrentCulture.TextInfo.ToTitleCase(str.ToLower()); but I think it does not work correctly after double quotes, at least for Turkish.
For example, an article's title like this:
KİRA PARASININ ÖDENMEMESİ NEDENİYLE YAPILAN "İLAMSIZ TAHLİYE"
TAKİPLERİNDE "TAKİP TALEBİ"NİN İÇERİĞİ.
After using the method like this:
private static string TitleCase(this string str)
{
return CultureInfo.CurrentCulture.TextInfo.ToTitleCase(str.ToLower());
}
var art_title = textbox1.Text.TitleCase();
It returns:
Kira Parasının Ödenmemesi Nedeniyle Yapılan "İlamsız Tahliye"
Takiplerinde "Takip Talebi"Nin İçeriği.
The problem is here. Because it must be like this:
... "Takip Talebi"nin ...
but it is like this:
... "Takip Talebi"Nin ...
What's more, in MS Word, when I choose "Capitalize Each Word", it transforms the text the same way:
... "Takip Talebi"Nin ...
But it is absolutely wrong. How can I fix this problem?
EDIT: First I split the sentence on spaces to obtain the words. If a word includes a double quote, the text after the second double quote should stay lowercase up to the next space. Here is the idea:
private static string _TitleCase(this string str)
{
return CultureInfo.CurrentCulture.TextInfo.ToTitleCase(str.ToLower());
}
public static string TitleCase(this string str)
{
var words = str.Split(' ');
string sentence = null;
var i = 1;
foreach (var word in words)
{
var space = i < words.Length ? " " : null;
if (word.Contains("\""))
{
// After every second quotes, it would get a lowercase string until the first space after the second double quote... But how?
}
else
sentence += word._TitleCase() + space;
i++;
}
return sentence?.Trim();
}
Edit 2 (after 3 hours): After 9 hours, I found a way to solve the problem, although I admit it is not a principled one, so please don't condemn me for it. Since the whole problem is the double quotes, I replace them with a placeholder that should not otherwise occur in the text (a number I believe is unique, or a letter unused in Turkish such as alpha, beta or omega) before passing the string to ToTitleCase. ToTitleCase then performs the conversion without any problems, and afterwards I replace the placeholder with double quotes again. That achieves the goal. If you have a more programmatic or principled solution, please share it here.
Here is my non-programmatic solution:
public static string TitleCase(this string str)
{
str = str.Replace("\"", "9900099");
str = CultureInfo.CurrentCulture.TextInfo.ToTitleCase(str.ToLower());
return str.Replace("9900099", "\"").Trim();
}
var art_title = textbox1.Text.TitleCase();
And the result:
Kira Parasının Ödenmemesi Nedeniyle Yapılan "İlamsız Tahliye" Takiplerinde "Takip Talebi"nin İçeriği

Indeed, Microsoft's documentation for ToTitleCase states that ToTitleCase is (at least currently) not linguistically correct. In fact, it is REALLY hard to do this correctly (see these blog posts of the great Michael Kaplan: Sometimes, uppercasing sucks and "Michael, why does ToTitleCase suck so much?").
I'm not aware of any service or library providing a linguistically correct version.
So - unless you want to spend a lot of effort - you probably have to live with this inaccuracy.

You can find the apostrophe or quote character with RegEx and replace the character after it.
For apostrophe
Regex.Replace(str, "’(?:.)", m => m.Value.ToLower());
or
Regex.Replace(str, "'(?:.)", m => m.Value.ToLower());
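Building on the same idea, a variant for the double-quote case in the question might look like the sketch below. It title-cases first and then lower-cases any run of letters glued to a closing quote. The assumption that a quote preceded by a word character is a closing quote is mine, as are the names TitleCaseSketch, LowerSuffixAfterClosingQuote and TitleCase.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper, a sketch rather than a linguistically correct solution.
static class TitleCaseSketch
{
    // Letters that directly follow a quote which itself directly follows a word
    // character, i.e. a suffix attached to what is assumed to be a closing quote.
    static readonly Regex SuffixAfterClosingQuote = new Regex(@"(?<=\w"")\p{L}+");

    public static string LowerSuffixAfterClosingQuote(string titled, CultureInfo culture)
    {
        return SuffixAfterClosingQuote.Replace(titled, m => culture.TextInfo.ToLower(m.Value));
    }

    public static string TitleCase(string text, CultureInfo culture)
    {
        // Same call as in the question, followed by the suffix fix-up.
        string titled = culture.TextInfo.ToTitleCase(culture.TextInfo.ToLower(text));
        return LowerSuffixAfterClosingQuote(titled, culture);
    }

    static void Main()
    {
        var tr = new CultureInfo("tr-TR");
        Console.WriteLine(TitleCase("TAK\u0130PLER\u0130NDE \"TAK\u0130P TALEB\u0130\"N\u0130N \u0130\u00C7ER\u0130\u011E\u0130", tr));
    }
}
```

Unlike the placeholder trick, this leaves the string untouched if it happens to contain the placeholder text, but it will misfire on an opening quote written directly after a letter.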

Using string.ToUpper on substring

Have an assignment to allow a user to input a word in C# and then display that word with the first and third characters changed to uppercase. Code follows:
namespace Capitalizer
{
class Program
{
static void Main(string[] args)
{
string text = Console.ReadLine();
char[] delimiterChars = { ' ' };
string[] words = text.Split(delimiterChars);
string Upper = text.ToUpper();
Console.WriteLine(Upper);
Console.ReadKey();
}
}
}
This of course generates the entire word in uppercase, which is not what I want. I can't seem to make text.ToUpper(0,2) work, and even then that'd capitalize the first three letters. Only solution I can think of now that would make the word appear on one line (and I don't know if it works) is to move the capitalized letters and lowercase letters into a character array and try to get that to print all values in a modified order.

The simplest way I can think of to address your exact question as described — to convert to upper case the first and third characters of the input — would be something like the following:
StringBuilder sb = new StringBuilder(text);
sb[0] = char.ToUpper(sb[0]);
sb[2] = char.ToUpper(sb[2]);
text = sb.ToString();
The StringBuilder class is essentially a mutable string, which makes it a natural fit for this kind of operation: conversion to and from string is straightforward, and it supports the full range of string operations. Changing individual characters is easy in many data structures, but StringBuilder also gives you insertion, deletion, appending, formatting, and so on, so it's a good habit to reach for it over other approaches.
But frankly, it's hard to see how that's a useful operation. I can't help but wonder if you have stated the requirements incorrectly and there's something more to this question than is seen here.
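One thing the snippet above does not cover is input shorter than three characters, where sb[2] throws. A small sketch with a bounds check, using a made-up helper named UpperAt, could look like this:

```csharp
using System;
using System.Text;

// Hypothetical helper: upper-cases the characters at the given positions.
static class CapitalizerSketch
{
    public static string UpperAt(string text, params int[] indices)
    {
        var sb = new StringBuilder(text);
        foreach (int i in indices)
        {
            // Skip positions that do not exist in a short word.
            if (i >= 0 && i < sb.Length)
                sb[i] = char.ToUpper(sb[i]);
        }
        return sb.ToString();
    }

    static void Main()
    {
        Console.WriteLine(UpperAt("hello", 0, 2)); // HeLlo
        Console.WriteLine(UpperAt("hi", 0, 2));    // Hi
    }
}
```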

You could use LINQ:
var upperCaseIndices = new[] { 0, 2 };
var message = "hello";
var newMessage = new string(message.Select((c, i) =>
upperCaseIndices.Contains(i) ? Char.ToUpper(c) : c).ToArray());
Here is how it works. message.Select (inline LINQ query) selects characters from message one by one and passes into selector function:
upperCaseIndices.Contains(i) ? Char.ToUpper(c) : c
written as C# ?: shorthand syntax for if. It reads as "If index is present in the array, then select upper case character. Otherwise select character as is."
(c, i) => condition
is a lambda expression. See also:
Understand Lambda Expressions in 3 minutes
The rest is very simple - represent result as array of characters (.ToArray()), and create a new string based off that (new string(...)).

Only solution I can think of now that would make the word appear on one line (and I don't know if it works) is to move the capitalized letters and lowercase letters into a character array and try to get that to print all values in a modified order.
That seems a lot more complicated than necessary. Once you have a character array, you can simply change the elements of that character array. In a separate function, it would look something like
string MakeFirstAndThirdCharacterUppercase(string word) {
    var chars = word.ToCharArray();
    // char has no instance ToUpper(); use the static char.ToUpper(char).
    chars[0] = char.ToUpper(chars[0]);
    chars[2] = char.ToUpper(chars[2]);
    return new string(chars);
}

My simple solution:
string text = Console.ReadLine();
char[] delimiterChars = { ' ' };
string[] words = text.Split(delimiterChars);
foreach (string s in words)
{
char[] chars = s.ToCharArray();
chars[0] = char.ToUpper(chars[0]);
if (chars.Length > 2)
{
chars[2] = char.ToUpper(chars[2]);
}
Console.Write(new string(chars));
Console.Write(' ');
}
Console.ReadKey();

C# string to sentence

Is there a way to convert a string without spaces to a proper sentence?
E.g. "WhoAmI" needs to be converted to "Who Am I"

A regex replacement would do this, if you're just talking about inserting a space before each capital letter:
using System;
using System.Text.RegularExpressions;
class Test
{
static void Main()
{
var input = "WhoAmI";
var output = Regex.Replace(input, @"\p{Lu}", " $0").TrimStart();
Console.WriteLine(output);
}
}
However, I suspect there will be significant corner cases. Note that the above uses \p{Lu} instead of just [A-Z] to cope with non-ASCII capital letters; you may find A-Z simpler if you only need to deal with ASCII. The TrimStart() call is to remove the leading space you'd get otherwise.
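One of those corner cases is a run of capitals such as "HTMLParser", which the pattern above turns into "H T M L Parser". A possible refinement, sketched here with lookarounds so that nothing is consumed, inserts a space only where a lower-case letter meets an upper-case one, or where a run of capitals ends before a capitalised word. The names SentenceSketch and SplitWords are placeholders.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for splitting PascalCase text into words.
static class SentenceSketch
{
    // Zero-width positions: lower-to-upper transitions, and the point before
    // the last capital of a run when that capital starts a lower-case word.
    static readonly Regex WordBoundary = new Regex(
        @"(?<=\p{Ll})(?=\p{Lu})|(?<=\p{Lu})(?=\p{Lu}\p{Ll})");

    public static string SplitWords(string input)
    {
        return WordBoundary.Replace(input, " ");
    }

    static void Main()
    {
        Console.WriteLine(SplitWords("WhoAmI"));     // Who Am I
        Console.WriteLine(SplitWords("HTMLParser")); // HTML Parser
    }
}
```

Since the matches are zero-width and never occur at the start of the string, no TrimStart() call is needed.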

If every word in the string is starting with uppercase you may just convert each part that is starting with uppercase to a space separated string.

You can use LINQ
string words = "WhoAmI";
string sentence = String.Concat(words.Select(letter => Char.IsUpper(letter) ? " " + letter
: letter.ToString()))
.TrimStart();

Regular Expression To Split On Comma Except If Quoted

What is the regular expression to split on comma (,) except if surrounded by double quotes? For example:
max,emily,john = ["max", "emily", "john"]
BUT
max,"emily,kate",john = ["max", "emily,kate", "john"]
Looking to use in C#: Regex.Split(string, "PATTERN-HERE");
Thanks.

Situations like this often call for something other than regular expressions. They are nifty, but patterns for handling this kind of thing are more complicated than they are useful.
You might try something like this instead:
public static IEnumerable<string> SplitCSV(string csvString)
{
var sb = new StringBuilder();
bool quoted = false;
foreach (char c in csvString) {
if (quoted) {
if (c == '"')
quoted = false;
else
sb.Append(c);
} else {
if (c == '"') {
quoted = true;
} else if (c == ',') {
yield return sb.ToString();
sb.Length = 0;
} else {
sb.Append(c);
}
}
}
if (quoted)
throw new ArgumentException("Unterminated quotation mark.", "csvString");
yield return sb.ToString();
}
It probably needs a few tweaks to follow the CSV spec exactly, but the basic logic is sound.
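One such tweak is the common CSV convention of writing a literal quote inside a quoted field as two quotes. The sketch below is a variation on the loop above that handles that case. It returns a list instead of using yield, and the names CsvSketch and SplitCsvLine are made up, so read it as an illustration rather than a spec-complete parser.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical variation on the splitter above.
static class CsvSketch
{
    public static List<string> SplitCsvLine(string line)
    {
        var fields = new List<string>();
        var sb = new StringBuilder();
        bool quoted = false;
        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (quoted)
            {
                if (c != '"')
                {
                    sb.Append(c);
                }
                else if (i + 1 < line.Length && line[i + 1] == '"')
                {
                    sb.Append('"'); // doubled quote inside a quoted field
                    i++;            // skip the second quote of the pair
                }
                else
                {
                    quoted = false; // closing quote
                }
            }
            else if (c == '"')
            {
                quoted = true;
            }
            else if (c == ',')
            {
                fields.Add(sb.ToString());
                sb.Clear();
            }
            else
            {
                sb.Append(c);
            }
        }
        if (quoted)
            throw new FormatException("Unterminated quotation mark.");
        fields.Add(sb.ToString());
        return fields;
    }

    static void Main()
    {
        foreach (string field in SplitCsvLine("max,\"emily,kate\",john"))
            Console.WriteLine(field);
    }
}
```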

This is a clear-cut case for a CSV parser, so you should be using .NET's own CSV parsing capabilities or cdhowie's solution.
Purely for your information and not intended as a workable solution, here's what contortions you'd have to go through using regular expressions with Regex.Split():
You could use the regex (please don't!)
(?<=^(?:[^"]*"[^"]*")*[^"]*) # assert that there is an even number of quotes before...
\s*,\s* # the comma to be split on...
(?=(?:[^"]*"[^"]*")*[^"]*$) # as well as after the comma.
if your quoted strings never contain escaped quotes, and you don't mind the quotes themselves becoming part of the match.
This is horribly inefficient, a pain to read and debug, works only in .NET, and it fails on escaped quotes (at least if you're not using "" to escape a single quote). Of course the regex could be modified to handle that as well, but then it's going to be perfectly ghastly.
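For the ".NET's own CSV parsing capabilities" route mentioned at the top of this answer, one option I know of is TextFieldParser from the Microsoft.VisualBasic.FileIO namespace (on .NET Framework it needs a reference to the Microsoft.VisualBasic assembly). A minimal sketch, with a made-up wrapper name, might be:

```csharp
using System;
using System.IO;
using Microsoft.VisualBasic.FileIO;

// Hypothetical wrapper around TextFieldParser for a single line of input.
static class TextFieldParserSketch
{
    public static string[] SplitCsvLine(string line)
    {
        using (var parser = new TextFieldParser(new StringReader(line)))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(",");
            parser.HasFieldsEnclosedInQuotes = true;
            // ReadFields returns null when there is nothing left to read.
            return parser.ReadFields();
        }
    }

    static void Main()
    {
        foreach (string field in SplitCsvLine("max,\"emily,kate\",john"))
            Console.WriteLine(field);
    }
}
```

Note that an empty input line yields null rather than an empty array, so callers may want to check for that.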

A little late maybe but I hope I can help someone else
String[] cols = Regex.Split("max, emily, john", @"\s*,\s*");
foreach ( String s in cols ) {
Console.WriteLine(s);
}

Justin, resurrecting this question because it had a simple regex solution that wasn't mentioned. This situation sounds straight out of Match (or replace) a pattern except in situations s1, s2, s3 etc.
Here's our simple regex:
"[^"]*"|(,)
The left side of the alternation matches complete "quoted strings" tags. We will ignore these matches. The right side matches and captures commas to Group 1, and we know they are the right commas because they were not matched by the expression on the left. We replace these commas with SplitHere, then we split on SplitHere.
This program shows how to use the regex (see the results at the bottom of the online demo):
using System;
using System.Text.RegularExpressions;
using System.Collections.Specialized;
class Program
{
static void Main() {
string s1 = @"max,""emily,kate"",john";
var myRegex = new Regex(@"""[^""]*""|(,)");
string replaced = myRegex.Replace(s1, delegate(Match m) {
if (m.Groups[1].Value == "") return m.Value;
else return "SplitHere";
});
string[] splits = Regex.Split(replaced,"SplitHere");
foreach (string split in splits) Console.WriteLine(split);
Console.WriteLine("\nPress Any Key to Exit.");
Console.ReadKey();
} // END Main
} // END Program
Reference
How to match (or replace) a pattern except in situations s1, s2, s3...

How can I strip punctuation from a string?

For the hope-to-have-an-answer-in-30-seconds part of this question, I'm specifically looking for C#
But in the general case, what's the best way to strip punctuation in any language?
I should add: Ideally, the solutions won't require you to enumerate all the possible punctuation marks.
Related: Strip Punctuation in Python

new string(myCharCollection.Where(c => !char.IsPunctuation(c)).ToArray());

Why not simply:
string s = "sxrdct?fvzguh,bij.";
var sb = new StringBuilder();
foreach (char c in s)
{
if (!char.IsPunctuation(c))
sb.Append(c);
}
s = sb.ToString();
The usage of RegEx is normally slower than simple char operations. And those LINQ operations look like overkill to me. And you can't use such code in .NET 2.0...

Describes intent, easiest to read (IMHO) and best performing:
s = s.StripPunctuation();
to implement:
public static class StringExtension
{
public static string StripPunctuation(this string s)
{
var sb = new StringBuilder();
foreach (char c in s)
{
if (!char.IsPunctuation(c))
sb.Append(c);
}
return sb.ToString();
}
}
This is using Hades32's algorithm which was the best performing of the bunch posted.
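One caveat worth knowing: char.IsPunctuation only covers the Unicode punctuation categories, so characters such as $ and + (which Unicode classifies as symbols) are left in place. If those should go too, a sketch of the same loop with char.IsSymbol added could look like this. The method name StripPunctuationAndSymbols is made up for the example.

```csharp
using System;
using System.Text;

// Hypothetical extension of the StripPunctuation idea above.
static class StripSketch
{
    public static string StripPunctuationAndSymbols(string s)
    {
        var sb = new StringBuilder(s.Length);
        foreach (char c in s)
        {
            // '$' and '+' are symbols, not punctuation, so both checks are needed.
            if (!char.IsPunctuation(c) && !char.IsSymbol(c))
                sb.Append(c);
        }
        return sb.ToString();
    }

    static void Main()
    {
        Console.WriteLine(StripPunctuationAndSymbols("Cost: $5 + tax!"));
    }
}
```

Spaces are kept, so removing a symbol that stood between two spaces leaves both spaces behind.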

Assuming "best" means "simplest" I suggest using something like this:
String stripped = input.replaceAll("\\p{Punct}+", "");
This example is for Java, but all sufficiently modern Regex engines should support this (or something similar).
Edit: the Unicode-Aware version would be this:
String stripped = input.replaceAll("\\p{P}+", "");
The first version only looks at punctuation characters contained in ASCII.

You can use the regex.replace method:
replace(YourString, RegularExpressionWithPunctuationMarks, Empty String)
Since this returns a string, your method will look something like this:
string s = Regex.Replace("Hello!?!?!?!", "[?!]", "");
You can replace "[?!]" with something more sophisticated if you want:
(\p{P})
This should find any punctuation.

This thread is so old, but I'd be remiss not to post a more elegant (IMO) solution.
string inputSansPunc = input.Where(c => !char.IsPunctuation(c)).Aggregate("", (current, c) => current + c);
It's LINQ sans WTF.

Based off GWLlosa's idea, I was able to come up with the supremely ugly, but working:
string s = "cat!";
s = s.ToCharArray().ToList<char>()
.Where<char>(x => !char.IsPunctuation(x))
.Aggregate<char, string>(string.Empty, new Func<string, char, string>(
delegate(string acc, char c) { return acc + c; }));

The most braindead simple way of doing it would be using string.replace
The other way I would imagine is a regex.replace and have your regular expression with all the appropriate punctuation marks in it.

Here's a slightly different approach using linq. I like AviewAnew's but this avoids the Aggregate
string myStr = "Hello there..';,]';';., Get rid of Punction";
var s = from ch in myStr
where !Char.IsPunctuation(ch)
select ch;
// Build the string directly; a round trip through ASCII bytes would turn non-ASCII characters into '?'.
var stringResult = new string(s.ToArray());

If you want to use this for tokenizing text you can use:
new string(myText.Select(c => char.IsPunctuation(c) ? ' ' : c).ToArray())

For anyone who would like to do this via RegEx:
This code shows the full RegEx replace process and gives a sample Regex that only keeps letters, numbers, and spaces in a string - replacing ALL other characters with an empty string:
//Regex to remove all non-alphanumeric characters
System.Text.RegularExpressions.Regex TitleRegex = new
System.Text.RegularExpressions.Regex("[^a-z0-9 ]+",
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
string ParsedString = TitleRegex.Replace(stringToParse, String.Empty);
return ParsedString;

I faced the same issue and was concerned about the performance impact of calling the IsPunctuation for every single check.
I found this post: http://www.dotnetperls.com/char-ispunctuation.
Reading between the lines: char.IsPunctuation also handles Unicode on top of ASCII.
The method matches a bunch of characters including control characters. By definition, this method is heavy and expensive.
The bottom line is that I finally didn't go for it because of its performance impact on my ETL process.
I went for the custom implementation from dotnetperls.
And just FYI, here is some code deduced from the previous answers to get the list of all punctuation characters (excluding the control ones):
var punctuationCharacters = new List<char>();
for (int i = char.MinValue; i <= char.MaxValue; i++)
{
var character = Convert.ToChar(i);
if (char.IsPunctuation(character) && !char.IsControl(character))
{
punctuationCharacters.Add(character);
}
}
var commaSeparatedValueOfPunctuationCharacters = string.Join("", punctuationCharacters);
Console.WriteLine(commaSeparatedValueOfPunctuationCharacters);
Cheers,
Andrew
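If the per-character call really is a bottleneck, one possible middle ground (a sketch, not something I have benchmarked) is to compute the answers once with the same loop as above and then look them up in an array. The names PunctuationTable and Strip are placeholders, and whether this is actually faster should be measured on your own data.

```csharp
using System;
using System.Text;

// Hypothetical lookup-table variant; performance is unverified.
static class PunctuationTable
{
    // One flag per UTF-16 code unit, filled once when the type is initialised.
    static readonly bool[] IsPunct = BuildTable();

    static bool[] BuildTable()
    {
        var table = new bool[char.MaxValue + 1];
        for (int i = char.MinValue; i <= char.MaxValue; i++)
            table[i] = char.IsPunctuation((char)i);
        return table;
    }

    public static string Strip(string s)
    {
        var sb = new StringBuilder(s.Length);
        foreach (char c in s)
        {
            // Array lookup instead of a char.IsPunctuation call per character.
            if (!IsPunct[c])
                sb.Append(c);
        }
        return sb.ToString();
    }

    static void Main()
    {
        Console.WriteLine(Strip("sxrdct?fvzguh,bij.")); // sxrdctfvzguhbij
    }
}
```

The table gives the same answers as char.IsPunctuation, so it keeps the Unicode coverage that a hand-written ASCII check would lose.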

$newstr=ereg_replace("[[:punct:]]",'',$oldstr);

For long strings I use this:
var normalized = input
.Where(c => !char.IsPunctuation(c))
.Aggregate(new StringBuilder(),
(current, next) => current.Append(next), sb => sb.ToString());
performs much better than using string concatenations (though I agree it's less intuitive).

This is simple code for removing punctuation from a string given by the user:
# Import the required library
import string
# Ask the user for a string
strs = str(input('Enter your string:'))
# Remove each punctuation character in turn
for c in string.punctuation:
    strs = strs.replace(c, "")
print(f"\n Your String without punctuation:{strs}")

#include<iostream>
#include<string>
#include<cctype>
using namespace std;
int main(int a, char* b[]){
string strOne = "H,e.l/l!o W#o#r^l&d!!!";
int punct_count = 0;
cout<<"before : "<<strOne<<endl;
for(string::size_type ix = 0 ;ix < strOne.size();++ix)
{
if(ispunct(strOne[ix]))
{
++punct_count;
strOne.erase(ix,1);
ix--;
}//if
}
cout<<"after : "<<strOne<<endl;
return 0;
}//main
