convert non alphanumeric glyphs to unicode while preserving alphanumeric

I need to convert non alpha-numeric glyphs in a string to their unicode value, while preserving the alphanumeric characters. Is there a method to do this in C#?
As an example, I need to convert this string:
"hello world!"
To this:
"hello_x0020_world_x0021_"

To get a string that is safe to use as an XML node name, you should use XmlConvert.EncodeName.
Note that if you need to encode all non-alphanumeric characters you'd need to write it yourself as "_" is not encoded by that method.
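A minimal sketch of that call, assuming the standard System.Xml namespace. The wrapper class and method names (XmlNameDemo, ToXmlName, FromXmlName) are only illustrative and not part of any library:

```csharp
using System;
using System.Xml;

public static class XmlNameDemo
{
    // Escapes characters that are not valid in an XML name as _xHHHH_.
    public static string ToXmlName(string raw) => XmlConvert.EncodeName(raw);

    // Reverses the escaping produced by EncodeName.
    public static string FromXmlName(string encoded) => XmlConvert.DecodeName(encoded);
}
```

For the string in the question, XmlNameDemo.ToXmlName("hello world!") gives "hello_x0020_world_x0021_", and FromXmlName turns it back into the original text.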

You could start with this code using LINQ Select extension method:
string str = "hello world!";
string a = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
a += a.ToLower();
char[] alphabet = a.ToCharArray();
str = string.Join("",
str.Select(ch => alphabet.Contains(ch) ?
ch.ToString() : String.Format("_x{0:x4}_", (int)ch)).ToArray() // cast to int so the x4 hex format is applied
);
Now clearly it has some problems:
it does linear search in the list of characters
missed numeric...
if we add numeric need to decide if first character is ok to be digit (assuming yes)
code creates large number of strings that are immediately discarded (one per character)
alphanumeric is limited to ASCII (assuming ok, if not Char.IsLetterOrDigit to help)
does too much work for pure alpha-numeric strings
The first two are easy: we can use a HashSet (O(1) Contains) initialized with the full list of characters (if any alphanumeric characters are acceptable, it is more readable to use the existing method, Char.IsLetterOrDigit):
public static HashSet<char> asciiAlphaNum = new HashSet<char>
("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789");
To avoid ch.ToString() that really pointlessly produces strings for immediate GC we need to figure out how to construct string from mix of char and string. String.Join does not work because it wants strings to start with, regular new string(...) does not have option for mix of char and string. So we are left with StringBuilder that happily takes both to Append. Consider starting with initial size str.Length if most strings don't have other characters.
So for each character we just need to either builder.Append(ch) or builder.AppendFormat("_x{0:x4}_", (int)ch). To perform the iteration it is easier to just use a regular foreach, but if one really wants LINQ, Enumerable.Aggregate is the way to go.
string ReplaceNonAlphaNum(string str)
{
var builder = new StringBuilder();
foreach (var ch in str)
{
if (asciiAlphaNum.Contains(ch))
builder.Append(ch);
else
builder.AppendFormat("_x{0:x4}_", (int)ch);
}
return builder.ToString();
}
string ReplaceNonAlphaNumLinq(string str)
{
return str.Aggregate(new StringBuilder(), (builder, ch) =>
asciiAlphaNum.Contains(ch) ?
builder.Append(ch) : builder.AppendFormat("_x{0:x4}_", (int)ch)
).ToString();
}
To the last point: we don't really need to do anything if there is nothing to convert, so a check like the one in the question "check alphanumeric characters in string in c#" would help to avoid extra strings.
Thus final version (LINQ as it is a bit shorter and fancier):
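// \z instead of $ so that a string ending in a newline is not treated as purely alphanumeric
private static readonly Regex asciiAlphaNumRx = new Regex(@"^[a-zA-Z0-9]*\z");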
public static HashSet<char> asciiAlphaNum = new HashSet<char>
("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789");
string ReplaceNonAlphaNumLinq(string str)
{
return asciiAlphaNumRx.IsMatch(str) ? str :
str.Aggregate(new StringBuilder(), (builder, ch) =>
asciiAlphaNum.Contains(ch) ?
builder.Append(ch) : builder.AppendFormat("_x{0:x4}_", (int)ch)
).ToString();
}
Alternatively, the whole thing could be done with Regex; see "Regex replace: Transform pattern with a custom function" for a starting point.
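A rough sketch of the Regex route, assuming ASCII-only alphanumerics as in the code above. The class name RegexEscaper is made up for this example; Regex.Replace with a MatchEvaluator lambda is the standard overload:

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexEscaper
{
    // Matches any single character outside ASCII letters and digits.
    private static readonly Regex NonAlphaNum = new Regex("[^a-zA-Z0-9]");

    public static string ReplaceNonAlphaNum(string str) =>
        // The lambda runs once per match and returns the replacement text.
        NonAlphaNum.Replace(str, m => $"_x{(int)m.Value[0]:x4}_");
}
```

Like the foreach and Aggregate versions, this escapes each UTF-16 code unit separately, so a surrogate pair becomes two _xHHHH_ groups. A string with nothing to replace comes back unchanged.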

Related

Moving the first char in a string to the end of the string using a method. C#

I know there are a lot of similar questions asked, and I've looked over those, but I still can't figure out my solution.
I'm trying to write a method that takes the first character of an inputted string and moves it to the back, then I can add additional characters if needed.
Basically if the input is Hello the output would be elloH + "whatever." I hope that makes sense.
As proof that I'm just not being lazy, here is the rest of the source code for the other parts of what I am working on. It all works, I just don't know where to begin with the last part.
Thanks for looking and thanks for the help!
private string CaseSwap(string str)//method for swaping cases
{
string result = ""; //create blank var
foreach (var c in str)
if (char.IsUpper(c)) //find uppers
result += char.ToLower(c); //change to lower
else
result += char.ToUpper(c); //all other lowers changed to upper
str = result; //assign var to str
return str; //return string to method
}
private string Reverse(string str)//method for reversing string
{
char[] revArray = str.ToCharArray(); //copy into an array
Array.Reverse(revArray); //reverse the array
return new string(revArray); //return the new string
}
private string Latin(string str)//method for latin
{
}
}
}

If you want to move first character to the end of string, then you can try below
public string MoveFirstCharToEnd(string str, string whateverStr="")
{
if(string.IsNullOrEmpty(str))
return str;
string result = str.Substring(1) + str[0] + whateverStr;
return result;
}
Note: I added whateverStr as an optional parameter, so that it can support only moving first character to the end and also it supports concatenating extra string to the result.
String.Substring(Int32):
Retrieves a substring from this instance. The substring starts at a
specified character position and continues to the end of the string.

Why not just take the 1st char and combine it with the rest of the string? E.g.
Hello
^^  ^
||  |
|Substring(1) - rest of the string (substring starting from 1)
|
value[0] - first character
Code:
public static string Rotate(string value) => string.IsNullOrEmpty(value)
? value
: $"{value.Substring(1)}{value[0]}";
Generalized implementation for arbitrary rotation (either positive or negative):
public static string Rotate(string value, int count = 1) {
if (string.IsNullOrWhiteSpace(value))
return value;
return string.Concat(Enumerable
.Range(0, value.Length)
.Select(i => value[(i + count % value.Length + value.Length) % value.Length]));
}
You can simplify your current implementation with a help of Linq
using System.Linq;
...
private static string CaseSwap(string value) =>
string.Concat(value.Select(c => char.IsUpper(c)
? char.ToLower(c)
: char.ToUpper(c)));
private static string Reverse(string value) =>
string.Concat(value.Reverse());

You can try to get the first character of a string with the String.Substring(int startPosition, int length) method. With this method you can also get the rest of your text starting from position 1 (skipping the first character). When you have these 2 pieces, you can concatenate them.
Don't forget to check for empty strings, this can be done with the String.IsNullOrEmpty(string text) method.
public static string RemoveAndConcatFirstChar(string text){
if (string.IsNullOrEmpty(text)) return "";
return text.Substring(1) + text.Substring(0,1);
}

Appending multiple characters to a string is inefficient due to the number of string objects allocated, which is not just memory intensive it's also slow. There's a reason we have StringBuilder and other such options available to us, like working with char[]s.
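Here's a fairly quick method for rotating a string left one character (moving the first character to the end):
string RotateLeft(string source)
{
// Nothing to rotate for null or empty input (chars[0] would throw)
if (string.IsNullOrEmpty(source))
return source;
var chars = source.ToCharArray();
var initial = chars[0];
// Shift everything after the first character one slot to the left
Array.Copy(chars, 1, chars, 0, chars.Length - 1);
chars[^1] = initial;
return new String(chars);
}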
Sadly we can't do that in-place in the string itself since they're immutable, so there's no avoiding the temporary array and string construction at the end.
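That said, on runtimes that provide string.Create (.NET Core 2.1 and later, as far as I know), the intermediate char[] can be skipped by writing straight into the new string's buffer. A sketch under that assumption; the class name SpanRotation is made up here:

```csharp
using System;

public static class SpanRotation
{
    // Assumes a runtime with string.Create and C# 8 index-from-end support.
    public static string RotateLeft(string source)
    {
        if (string.IsNullOrEmpty(source))
            return source;

        return string.Create(source.Length, source, (span, src) =>
        {
            src.AsSpan(1).CopyTo(span); // everything after the first character
            span[^1] = src[0];          // first character goes to the end
        });
    }
}
```

The string is still a new object, since strings stay immutable, but there is only one allocation instead of two.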
Based on the fact that you called the method Latin(...) and the bit of the question where you said: "Basically if the input is Hello the output would be elloH + "whatever."... I'm assuming that you're writing a Pig Latin translation. If that's the case, you're going to need a bit more.
Pig Latin is a slightly tricky problem because it's based on the sound of the word, not the letters. For example, onto becomes ontohay (or variants thereof) while one becomes unway because the word is pronounced the same as won (with a u to capture the vowel pronunciation correctly). Phonetic operations on English are quite annoying because of all the variations with silent and implied initial letters. And don't even get me started on pseudo-vowels like y.
Special cases aside, the most common rules of Pig Latin translation code appear to be as follows:
Words starting with a single consonant followed by a vowel: move the consonant to the end and append ay.
Words starting with a pair of consonants followed by a vowel: move the consonant pair to the end and append ay.
Words that start with a vowel: append hay, yay, tay, etc.
That third one is a bit difficult since choosing the right suffix is a matter of what makes the result easiest to say... which code can't really decide all that easily. Just pick one and go with that.
Of course there are plenty of words that don't fit those rules. Anything starting with a consonant triplet for example (Christmas being the first that came to mind, followed shortly by strip... and others). Pseudo-vowels like y mess things up (cry for instance). And of course the ever-present problem of correctly representing the initial vowel sounds when you've stripped context: won is converted to un-way vocally, so rendering it as on-way in text is a little bit wrong. Same with word, whose Pig Latin version is pronounced erd-way.
For a simple first pass though... just follow the rules, treating y as a consonant if it's the first letter and as a vowel in the second or third spots.
And since this is so often a homework problem, I'm going to stop here and let you play with it for a bit. Just in case :P
(Oh, and don't forget to preserve the case of your first character just in case you're working on a capitalized word. Latin should become Atinlay, not atinLay. Just saying.)

Using string.ToUpper on substring

Have an assignment to allow a user to input a word in C# and then display that word with the first and third characters changed to uppercase. Code follows:
namespace Capitalizer
{
class Program
{
static void Main(string[] args)
{
string text = Console.ReadLine();
char[] delimiterChars = { ' ' };
string[] words = text.Split(delimiterChars);
string Upper = text.ToUpper();
Console.WriteLine(Upper);
Console.ReadKey();
}
}
}
This of course generates the entire word in uppercase, which is not what I want. I can't seem to make text.ToUpper(0,2) work, and even then that'd capitalize the first three letters. Only solution I can think of now that would make the word appear on one line (and I don't know if it works) is to move the capitalized letters and lowercase letters into a character array and try to get that to print all values in a modified order.

The simplest way I can think of to address your exact question as described — to convert to upper case the first and third characters of the input — would be something like the following:
StringBuilder sb = new StringBuilder(text);
sb[0] = char.ToUpper(sb[0]);
sb[2] = char.ToUpper(sb[2]);
text = sb.ToString();
The StringBuilder class is essentially a mutable string object, so for these kinds of operations it is the most fluid way to approach the problem: it provides the most straightforward conversions to and from string, as well as the full range of string operations. Changing individual characters is easy in many data structures, but insertions, deletions, appending, formatting, etc. all also come with StringBuilder, so it's a good habit to use that versus other approaches.
But frankly, it's hard to see how that's a useful operation. I can't help but wonder if you have stated the requirements incorrectly and there's something more to this question than is seen here.

You could use LINQ:
var upperCaseIndices = new[] { 0, 2 };
var message = "hello";
var newMessage = new string(message.Select((c, i) =>
upperCaseIndices.Contains(i) ? Char.ToUpper(c) : c).ToArray());
Here is how it works. message.Select (inline LINQ query) selects characters from message one by one and passes into selector function:
upperCaseIndices.Contains(i) ? Char.ToUpper(c) : c
written as C# ?: shorthand syntax for if. It reads as "If index is present in the array, then select upper case character. Otherwise select character as is."
(c, i) => condition
is a lambda expression. See also:
Understand Lambda Expressions in 3 minutes
The rest is very simple - represent result as array of characters (.ToArray()), and create a new string based off that (new string(...)).

Only solution I can think of now that would make the word appear on one line (and I don't know if it works) is to move the capitalized letters and lowercase letters into a character array and try to get that to print all values in a modified order.
That seems a lot more complicated than necessary. Once you have a character array, you can simply change the elements of that character array. In a separate function, it would look something like
string MakeFirstAndThirdCharacterUppercase(string word) {
var chars = word.ToCharArray();
chars[0] = char.ToUpper(chars[0]); // ToUpper is a static method on char
chars[2] = char.ToUpper(chars[2]);
return new string(chars);
}
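Note that indexing position 2 directly throws for words shorter than three characters. A guarded variant might look like the sketch below; WordCasing and UpperAt are hypothetical names chosen for illustration:

```csharp
using System;

public static class WordCasing
{
    // Uppercases the characters at the given zero-based positions,
    // skipping any position that falls outside the word.
    public static string UpperAt(string word, params int[] positions)
    {
        if (string.IsNullOrEmpty(word))
            return word;

        var chars = word.ToCharArray();
        foreach (var i in positions)
        {
            if (i >= 0 && i < chars.Length)
                chars[i] = char.ToUpper(chars[i]);
        }
        return new string(chars);
    }
}
```

WordCasing.UpperAt("hello", 0, 2) gives "HeLlo", while a two-letter word such as "hi" simply becomes "Hi".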

My simple solution:
string text = Console.ReadLine();
char[] delimiterChars = { ' ' };
string[] words = text.Split(delimiterChars);
foreach (string s in words)
{
char[] chars = s.ToCharArray();
chars[0] = char.ToUpper(chars[0]);
if (chars.Length > 2)
{
chars[2] = char.ToUpper(chars[2]);
}
Console.Write(new string(chars));
Console.Write(' ');
}
Console.ReadKey();

Is there a better way to trim whitespace and other characters from a string?

For example, if I want to remove whitespace and trailing commas from a string, I can do this:
String x = "abc,\n";
x.Trim().Trim(new char[] { ',' });
which outputs abc correctly. I could easily wrap this in an extension method, but I'm wondering if there is an in-built way of doing this with a single call to Trim() that I'm missing. I'm used to Python, where I could do this:
import string
x = "abc,\n"
x.strip(string.whitespace + ",")
The documentation states that all Unicode whitespace characters, with a few exceptions, are stripped (see Notes to Callers section), but I'm wondering if there is a way to do this without manually defining a character array in an extension method.
Is there an in-built way to do this? The number of non-whitespace characters I want to strip may vary and won't necessarily include commas, and I want to remove all whitespace, not just \n.

Yes, you can do this:
x.Trim(new char[] { '\n', '\t', ' ', ',' });
Because newline is technically a character, you can add it to the array and avoid two calls to Trim.
EDIT
.NET 4.0 uses this method to determine if a character is considered whitespace. Earlier versions maintain an internal list of whitespace characters (Source).
If you really want to only use one Trim call, then your application could do the following:
On startup, scan the range of Unicode whitespace characters, calling Char.IsWhiteSpace on each character.
If the method call returns true, then push the character onto an array.
Add your custom characters to the array as well
Now you can use a single Trim call, by passing the array you constructed.
I'm guessing that Char.IsWhiteSpace depends on the current locale, so you'll have to pay careful attention to locale.
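The startup scan described above could be sketched roughly as follows. TrimHelper and TrimWhitespaceAnd are made-up names, and the scan simply walks every UTF-16 code unit rather than a hand-picked range:

```csharp
using System;
using System.Collections.Generic;

public static class TrimHelper
{
    // Built once: every char for which Char.IsWhiteSpace returns true.
    private static readonly char[] Whitespace = BuildWhitespace();

    private static char[] BuildWhitespace()
    {
        var list = new List<char>();
        for (int i = char.MinValue; i <= char.MaxValue; i++)
        {
            if (char.IsWhiteSpace((char)i))
                list.Add((char)i);
        }
        return list.ToArray();
    }

    // Trims whitespace plus the caller's extra characters in a single Trim call.
    public static string TrimWhitespaceAnd(this string s, params char[] extra)
    {
        var all = new char[Whitespace.Length + extra.Length];
        Whitespace.CopyTo(all, 0);
        extra.CopyTo(all, Whitespace.Length);
        return s.Trim(all);
    }
}
```

With this in place, "abc,\n".TrimWhitespaceAnd(',') returns "abc".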

Using regex makes this simple:
text = Regex.Replace(text, @"^[\s,]+|[\s,]+$", "");
This will match Unicode whitespace characters as well.

You can have following Strip Extension method
public static class ExtensionMethod
{
public static string Strip(this string str, char[] otherCharactersToRemove)
{
List<char> charactersToRemove = (from s in str
where char.IsWhiteSpace(s)
select s).ToList();
charactersToRemove.AddRange(otherCharactersToRemove);
string str2 = str.Trim(charactersToRemove.ToArray());
return str2;
}
}
And then you can call it like:
static void Main(string[] args)
{
string str = "abc\n\t\r\n , asdfadf , \n \r \t";
string str2 = str.Strip(new char[]{','});
}
Output would be:
str2 = "abc\n\t\r\n , asdfadf"
The Strip Extension method will first get all the WhiteSpace characters from the string in a list. Add other characters to remove in the list and then call trim on it.

c# Best way to break up a long string

This question is not related to:
Best way to break long strings in C# source code
Which is about source, this is about processing long outputs. If someone enters:
WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW
As a comment, it breaks the container and makes the entire page really wide. Is there any clever regexp that can say, define a maximum word length of 20 chars and then force a whitespace character?
Thanks for any help!

There's probably no need to involve regexes in something this simple. Take this extension method:
public static string Abbreviate(this string text, int length) {
if (text.Length <= length) {
return text;
}
char[] delimiters = new char[] { ' ', '.', ',', ':', ';' };
int index = text.LastIndexOfAny(delimiters, length - 3);
if (index > (length / 2)) {
return text.Substring(0, index) + "...";
}
else {
return text.Substring(0, length - 3) + "...";
}
}
If the string is short enough, it's returned as-is. Otherwise, if a "word boundary" is found in the second half of the string, it's "gracefully" cut off at that point. If not, it's cut off the hard way at just under the desired length.
If the string is cut off at all, an ellipsis ("...") is appended to it.
If you expect the string to contain non-natural-language constructs (such as URLs), you'd need to tweak this to ensure nice behavior in all circumstances. In that case, working with a regex might be better.

You could try using a regular expression that uses a positive look-ahead like this:
string outputStr = Regex.Replace(inputStr, @"([\S]{20}(?=\S+))", "$1\n");
This should "insert" a line break into all words that are longer than 20 characters.
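To see what the look-ahead does, here is the same idea wrapped in a small helper with the run length as a parameter. LongWordBreaker and BreakLongWords are hypothetical names, and maxRun is assumed to be at least 1:

```csharp
using System;
using System.Text.RegularExpressions;

public static class LongWordBreaker
{
    public static string BreakLongWords(string input, int maxRun = 20)
    {
        // maxRun non-whitespace characters, matched only when at least
        // one more non-whitespace character follows them.
        string pattern = @"(\S{" + maxRun + @"})(?=\S)";
        return Regex.Replace(input, pattern, "$1\n");
    }
}
```

A run of 45 W characters comes out as 20 + 20 + 5 separated by line breaks, while a run of exactly 40 gets only one break, because the look-ahead keeps a break from being added at the very end of a word.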

Yes, you can use this regex:
string pattern = @"^([\w]{1,20})$";
This regex allows no more than 20 characters to be entered:
string strRegex = @"^([\w]{1,20})$";
string strTargetString = @"asdfasfasfasdffffff";
if(Regex.IsMatch(strTargetString, strRegex))
{
//do something
}
If you need only a length constraint, you should use this regex:
^(.{1,20})$
because \w matches only alphanumeric characters and the underscore.

Does C# have a String Tokenizer like Java's?

I'm doing simple string input parsing and I am in need of a string tokenizer. I am new to C# but have programmed Java, and it seems natural that C# should have a string tokenizer. Does it? Where is it? How do I use it?

You could use String.Split method.
class ExampleClass
{
public ExampleClass()
{
string exampleString = "there is a cat";
// Split string on spaces. This will separate all the words in a string
string[] words = exampleString.Split(' ');
foreach (string word in words)
{
Console.WriteLine(word);
// there
// is
// a
// cat
}
}
}
For more information see Sam Allen's article about splitting strings in c# (Performance, Regex)

I just want to highlight the power of C#'s Split method and give a more detailed comparison, particularly from someone who comes from a Java background.
Whereas StringTokenizer in Java only allows a single delimiter, we can actually split on multiple delimiters making regular expressions less necessary (although if one needs regex, use regex by all means!) Take for example this:
str.Split(new char[] { ' ', '.', '?' })
This splits on three different delimiters, returning an array of tokens. We can also remove empty entries with what would be a second parameter for the above example:
str.Split(new char[] { ' ', '.', '?' }, StringSplitOptions.RemoveEmptyEntries)
One thing Java's StringTokenizer does have that I believe C#'s Split is lacking (at least Java 7 has this feature) is the ability to keep the delimiter(s) as tokens. C#'s Split will discard the delimiters. This could be important in, say, some NLP applications, but for more general-purpose applications this might not be a problem.
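If the delimiters do need to survive, one workaround outside String.Split is Regex.Split: when the pattern contains a capturing group, the captured text is included in the result. A small sketch with made-up names (DelimiterKeepingSplit, SplitKeepingDelimiters):

```csharp
using System;
using System.Text.RegularExpressions;

public static class DelimiterKeepingSplit
{
    // The parentheses make Regex.Split return the delimiters as tokens too.
    public static string[] SplitKeepingDelimiters(string text) =>
        Regex.Split(text, @"([ .?])");
}
```

For "a b.c" this yields the five tokens "a", " ", "b", ".", "c". Adjacent delimiters produce empty strings between them, so those may need filtering.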

The Split method of a string is what you need. In fact, Java's StringTokenizer is a legacy class whose use is discouraged in new code in favor of Java's String.split method.

I think the nearest in the .NET Framework is
string.Split()

For complex splitting you could use a regex creating a match collection.
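As a rough illustration of the match-collection approach, the sketch below pulls out tokens instead of splitting on separators. The pattern and the names MatchTokenizer and Tokenize are assumptions made for this example:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class MatchTokenizer
{
    // A token is either a run of word characters or one non-space symbol.
    public static string[] Tokenize(string text) =>
        Regex.Matches(text, @"\w+|[^\s\w]")
             .Cast<Match>()          // MatchCollection is non-generic on older frameworks
             .Select(m => m.Value)
             .ToArray();
}
```

MatchTokenizer.Tokenize("x = 42 + y") returns "x", "=", "42", "+", "y".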

_words = new List<string>(YourText.ToLower().Trim('\n', '\r').Split(' ').
Select(x => new string(x.Where(Char.IsLetter).ToArray())));
Or
_words = new List<string>(YourText.Trim('\n', '\r').Split(' ').
Select(x => new string(x.Where(Char.IsLetterOrDigit).ToArray())));

The method most similar to Java's is:
Regex.Split(string, pattern);
where
string - the text you need to split
pattern - the regular expression pattern (a string) on which the text is split

use Regex.Split(string,"#|#");

Read this: the Split function has an overload that takes an array of separators.
http://msdn.microsoft.com/en-us/library/system.stringsplitoptions.aspx

If you're trying to do something like splitting command line arguments in a .NET Console app, you're going to have issues because .NET is either broken or is trying to be clever (which means it's as good as broken). I needed to be able to split arguments by the space character, preserving any literals that were quoted so they didn't get split in the middle. This is the code I wrote to do the job:
private static List<String> Tokenise(string value, char seperator)
{
List<string> result = new List<string>();
value = value.Replace("  ", " ").Replace("  ", " ").Trim(); // collapse double spaces into single ones
StringBuilder sb = new StringBuilder();
bool insideQuote = false;
foreach(char c in value.ToCharArray())
{
if(c == '"')
{
insideQuote = !insideQuote;
}
if((c == seperator) && !insideQuote)
{
if (sb.ToString().Trim().Length > 0)
{
result.Add(sb.ToString().Trim());
sb.Clear();
}
}
else
{
sb.Append(c);
}
}
if (sb.ToString().Trim().Length > 0)
{
result.Add(sb.ToString().Trim());
}
return result;
}

If you are using C# 3.0 (.NET 3.5) or later, you could write an extension method on System.String that does the splitting you need. You can then use the syntax:
string.SplitByMyTokens();
More info and a useful example from MS here http://msdn.microsoft.com/en-us/library/bb383977.aspx
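A sketch of what such an extension method might look like. The delimiter set (space, comma, semicolon) is an assumption chosen for illustration, and the class name StringTokenExtensions is made up:

```csharp
using System;

public static class StringTokenExtensions
{
    // Hypothetical delimiter set; adjust to whatever your input uses.
    private static readonly char[] Delimiters = { ' ', ',', ';' };

    public static string[] SplitByMyTokens(this string text) =>
        text.Split(Delimiters, StringSplitOptions.RemoveEmptyEntries);
}
```

With this in scope, "a, b;c".SplitByMyTokens() returns "a", "b", "c".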
