Reverse a string with accent chars? - c#

So I saw Jon Skeet's video and there was a code sample:
There should have been a problem with the é after reversing, but I guess it only fails on .NET 2 (IMHO). Anyway, it did work for me and I did see the correctly reversed string.
char[] a="Les Misérables".ToCharArray();
Array.Reverse(a);
string n= new string(a);
Console.WriteLine (n); //selbarésiM seL
But I took it further:
In Hebrew there is the "Alef" char : א
and I can add punctuation like : אֳ ( which I believe consists of 2 chars - yet displayed as one.)
But now look what happens :
char[] a="Les Misאֳrables".ToCharArray();
Array.Reverse(a);
string n= new string(a);
Console.WriteLine (n); //selbarֳאsiM seL
There was a split...
I can understand why it is happening :
Console.WriteLine ("אֳ".Length); //2
So I was wondering if there's a workaround for this kind of issue in C# ( or should I build my own mechanism....)

The problem is that Array.Reverse isn't aware that certain sequences of char values may combine to form a single character, or "grapheme", and thus shouldn't be split apart when reversing. You have to use something that understands Unicode combining character sequences, like TextElementEnumerator:
// using System.Globalization;
TextElementEnumerator enumerator =
StringInfo.GetTextElementEnumerator("Les Misאֳrables");
List<string> elements = new List<string>();
while (enumerator.MoveNext())
elements.Add(enumerator.GetTextElement());
elements.Reverse();
string reversed = string.Concat(elements); // selbarאֳsiM seL
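The same idea, reversing whole text elements instead of individual code units, can be sketched in Python, the language one of the related answers further down uses. This is only an approximation and not what TextElementEnumerator does internally: it assumes that attaching every character with a non-zero canonical combining class to the preceding character is good enough, which covers the Hebrew point here but not every grapheme cluster. The helper names text_elements and reverse_text_elements are made up for this illustration.

```python
import unicodedata

def text_elements(s):
    """Group each base character with the combining marks that follow it."""
    elements = []
    for ch in s:
        # combining() returns a non-zero class for combining marks
        if elements and unicodedata.combining(ch):
            elements[-1] += ch
        else:
            elements.append(ch)
    return elements

def reverse_text_elements(s):
    """Reverse the order of the text elements, keeping each one intact."""
    return "".join(reversed(text_elements(s)))

# Alef (U+05D0) followed by a Hebrew point (assumed here to be U+05B3)
word = "Les Mis\u05d0\u05b3rables"
print(reverse_text_elements(word))
```

The point stays attached to the alef because both code points travel together as one list element.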

If you made the extension
public static IEnumerable<string> ToTextElements(this string source)
{
var e = StringInfo.GetTextElementEnumerator(source);
while (e.MoveNext())
{
yield return e.GetTextElement();
}
}
you could do,
const string a = "AnyStringYouLike";
var aReversed = string.Concat(a.ToTextElements().Reverse());

Related

grouping adjacent similar substrings

I am writing a program in which I want to group adjacent repeated substrings, e.g. ABCABCBC can be compressed as 2ABC1BC or 1ABCA2BC.
Among all the possible options, I want to find the resulting string with the minimum length.
Here is the code I have written so far, but it does not do the job. Kindly help me in this regard.
using System;
using System.Collections.Generic;
using System.Linq;
namespace EightPrgram
{
class Program
{
static void Main(string[] args)
{
string input;
Console.WriteLine("Please enter the set of operations: ");
input = Console.ReadLine();
char[] array = input.ToCharArray();
List<string> list = new List<string>();
string temp = "";
string firstTemp = "";
foreach (var x in array)
{
if (temp.Contains(x))
{
firstTemp = temp;
if (list.Contains(firstTemp))
{
list.Add(firstTemp);
}
temp = "";
list.Add(firstTemp);
}
else
{
temp += x;
}
}
/*foreach (var item in list)
{
Console.WriteLine(item);
}*/
Console.ReadLine();
}
}
}
You can do this with recursion. I cannot give you a C# solution, since I do not have a C# compiler here, but the general idea together with a Python solution should do the trick, too.
So you have an input string ABCABCBC, and you want to transform this into an advanced variant of run-length encoding (let's call it advanced RLE).
My approach starts from one basic step, to which I then apply recursion:
The overall target is to find the shortest representation of the string using advanced RLE, so let's create a function shortest_repr(string).
You can divide the string into a prefix and a suffix and then check if the prefix can be found at the beginning of the suffix. For your input example this would be:
(A, BCABCBC)
(AB, CABCBC)
(ABC, ABCBC)
(ABCA, BCBC)
...
Each pair can be put into a function shorten_prefix, which checks how often the suffix starts with the prefix. For example, for the prefix ABC and the suffix ABCBC, the prefix occurs only once at the beginning of the suffix, making a total of 2 ABCs following each other. So we can compact this prefix/suffix combination to the output (2ABC, BC).
This function shorten_prefix will be used on each of the above tuples in a loop.
After using shorten_prefix once, there is still a suffix left for most of the combinations. E.g. in the output (2ABC, BC), the string BC remains as suffix. So we need to find the shortest representation for this remaining suffix. Well, we already have a function for this, shortest_repr, so let's just call it on the remaining suffix.
This image displays how this recursion works (I only expanded one of the nodes after the 3rd level, but in fact all of the orange circles would go through recursion):
We start at the top with a call of shortest_repr on the string ABABB (I selected a shorter sample for the image). Then we split this string at all possible split positions and get a list of prefix/suffix pairs in the second row. On each element of this list we first call the prefix/suffix optimization (shorten_prefix) and get back a shortened prefix/suffix combination, which already has the run-length number in the prefix (third row). Now, on each of the suffixes, we call our recursive function shortest_repr.
I did not display the upward direction of the recursion. When a suffix is the empty string, we pass an empty string into shortest_repr. Of course, the shortest representation of the empty string is the empty string, so we can return the empty string immediately.
Once the calls to shortest_repr have returned inside our loop, we select the shortest candidate string and return it.
This is some quickly hacked code that does the trick:
# Called shorten_prefix in the description above
def shorten_beginning(beginning, ending):
    count = 1
    while ending.startswith(beginning):
        count += 1
        ending = ending[len(beginning):]
    return str(count) + beginning, ending

# Called shortest_repr in the description above
def find_shortest_repr(string):
    possible_variants = []
    if not string:
        return ''
    for i in range(1, len(string) + 1):
        beginning = string[:i]
        ending = string[i:]
        shortened, new_ending = shorten_beginning(beginning, ending)
        shortest_ending = find_shortest_repr(new_ending)
        possible_variants.append(shortened + shortest_ending)
    return min([(len(x), x) for x in possible_variants])[1]

print(find_shortest_repr('ABCABCBC'))
print(find_shortest_repr('ABCABCABCABCBC'))
print(find_shortest_repr('ABCABCBCBCBCBCBC'))
Open issues
I think this approach has the same problem as the naive recursive Levenshtein distance calculation: it computes the result for the same suffixes multiple times. So it would be a nice exercise to implement this with dynamic programming.
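One way to avoid recomputing the same suffixes is to cache the result per suffix. The following is a minimal sketch of that idea, assuming memoization through functools.lru_cache is acceptable in place of a hand-written dynamic programming table. The name find_shortest_repr_memo is made up here, and shorten_beginning is repeated only so the block runs on its own.

```python
from functools import lru_cache

def shorten_beginning(beginning, ending):
    """Count how often `beginning` repeats at the start of `ending`."""
    count = 1
    while ending.startswith(beginning):
        count += 1
        ending = ending[len(beginning):]
    return str(count) + beginning, ending

@lru_cache(maxsize=None)
def find_shortest_repr_memo(string):
    """Same recursion as above, but each distinct suffix is solved only once."""
    if not string:
        return ''
    variants = []
    for i in range(1, len(string) + 1):
        shortened, new_ending = shorten_beginning(string[:i], string[i:])
        variants.append(shortened + find_shortest_repr_memo(new_ending))
    # Shortest candidate wins; ties are broken alphabetically, as in the original
    return min(variants, key=lambda v: (len(v), v))

print(find_shortest_repr_memo('ABCABCBC'))
print(find_shortest_repr_memo('AB' * 30))
```

Since every recursive call receives a suffix of the original input, there are at most len(string) + 1 distinct arguments, so the cache stays small even for inputs where the uncached version would take very long.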
If this is not a school assignment or performance critical part of the code, RegEx might be enough:
string input = "ABCABCBC";
var re = new Regex(@"(.+)\1+|(.+)", RegexOptions.Compiled); // RegexOptions.Compiled is optional; it only pays off if the regex is used more than once
string output = re.Replace(input,
m => (m.Length / m.Result("$1$2").Length) + m.Result("$1$2")); // "2ABC1BC" (case sensitive by default)

Casting HexNumber as character to string

I need to process a numeral as a string.
My value is 0x28 and this is the ascii code for '('.
I need to assign this to a string.
The following lines do this.
char c = (char)0x28;
string s = c.ToString();
string s2 = ((char)0x28).ToString();
My use case is a function that only accepts strings.
My call ends up looking cluttered:
someCall( ((char)0x28).ToString() );
Is there a way of simplifying this and making it more readable without writing '('?
The hex number in the code is always paired with a variable that contains that hex value in its name, so "translating" it would destroy that visible connection.
Edit:
A list of tuples is initialised with this, where the first item has the character in its name and the second item results from a call with that character.
One of the answers below is exactly what I am looking for, so I have incorporated it here now.
{ existingStaticVar0x28, someCall("\u0028") }
The reader can now instinctively see the connection between item1 and item2 and is less likely to run into a trap when this gets refactored.
You can use a Unicode character escape sequence in place of the hex cast (note that \u requires exactly four hex digits):
string s2 = '\u0028'.ToString();
or
someCall("\u0028");
Well, supposing that your input is not fixed, you could write an extension method:
namespace MyExtensions
{
public static class MyStringExtensions
{
public static string ConvertFromHex(this string hexData)
{
int c = Convert.ToInt32(hexData, 16); // base 16 parsing accepts an optional "0x" prefix
return new string(new char[] {(char)c});
}
}
}
Now you could call it in your code with
string hexNumber = "0x28"; // or whatever hexcode you need to convert
string result = hexNumber.ConvertFromHex();
A bit of error handling should be added to the above conversion.

TextElement Enumerator Class Bug or (Tamil) Unicode Bug

Why is the TextElementEnumerator not properly parsing the Tamil Unicode characters?
using System;
using System.Collections.Generic;
using System.Globalization;
namespace Glyphtest
{
internal class Program
{
private static void Main()
{
const string unicodetxt1 = "ஊரவர் கெளவை";
List<string> output = Syllabify(unicodetxt1);
Console.WriteLine(output.Count);
const string unicodetxt2 = "கௌவை";
output = Syllabify(unicodetxt2);
Console.WriteLine(output.Count);
}
public static List<string> Syllabify(string unicodetext)
{
if (string.IsNullOrEmpty(unicodetext)) return null;
TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(unicodetext);
var data = new List<string>();
while (enumerator.MoveNext())
data.Add(enumerator.Current.ToString());
return data;
}
}
}
The above code sample deals with the following Unicode characters:
'கௌ' -> 0x0b95 (க) + 0x0bcc (ௌ) (correct form)
'கௌ' -> 0x0b95 (க) + 0x0bc6 (ெ) + 0x0bb3 (ள) (incorrect form)
Is this a bug in the TextElementEnumerator class?
Why does it not enumerate the string properly?
i.e.
கெளவை => 'கெள' + 'வை' is how it should be enumerated (correct form)
கெளவை => 'கெ' + 'ள' + 'வை' is how it should not be enumerated (incorrect form)
If so, how can I overcome this issue?
This is not a bug in Unicode or in the TextElementEnumerator class.
It is specific to the language (Tamil): a letter can be formed by a Tamil consonant followed by a visual glyph,
e.g.
க - \u0b95
ெ - \u0bc6
ள - \u0bb3
form the Tamil character 'கெள', which looks the same as the character formed by
க - \u0b95
ௌ - \u0bcc
and the latter is the right form.
Hence, before enumerating Tamil text, we have to replace the irregular formation of the character.
According to a rule of Tamil grammar (ஔகாரக் குறுக்கம்),
the visual glyph (ௌ) occurs in the starting letter of a word,
so the above code should be changed to:
internal class Program
{
private static void Main()
{
const string unicodetxt1 = "ஊரவர் கெளவை";
List<string> output = Syllabify(unicodetxt1);
Console.WriteLine(output.Count);
const string unicodetxt2 = "கௌவை";
output = Syllabify(unicodetxt2);
Console.WriteLine(output.Count);
}
public static string CheckVisualGlyphPattern(string txt)
{
string[] data = txt.Split(new[] { ' ', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
string list = string.Empty;
var rx = new Regex("^(.*?){1}(\u0bc6){1}(\u0bb3){1}");
foreach (string s in data)
{
var matches = new List<Match>();
string outputs = rx.Replace(s, match =>
{
matches.Add(match);
return string.Format("{0}\u0bcc", match.Groups[1].Value);
});
list += string.Format("{0} ", outputs);
}
return list.Trim();
}
public static List<string> Syllabify(string unicodetext)
{
var processdata = CheckVisualGlyphPattern(unicodetext);
if (string.IsNullOrEmpty(processdata)) return null;
TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(processdata);
var data = new List<string>();
while (enumerator.MoveNext())
data.Add(enumerator.Current.ToString());
return data;
}
}
It produces the appropriate visual glyph while enumerating.
U+0BB3 ᴛᴀᴍɪʟ ʟᴇᴛᴛᴇʀ ʟʟᴀ has Grapheme_Cluster_Break=XX (Other). This makes the grapheme clusters <U+0B95 U+0BC6><U+0BB3> the correct ones, since there is always a grapheme cluster break before characters with Grapheme_Cluster_Break equal to Other.
<U+0B95 U+0BCC> has no internal grapheme cluster breaks because U+0BCC has Grapheme_Cluster_Break=SpacingMark and there are usually no breaks before such characters (exceptions are at the start of text or when preceded by a control character).
Well, at least this is what the Unicode standard has to say (http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries).
Now, I have no idea of how Tamil works, so take what follows with a pinch of salt.
U+0BCC decomposes into <U+0BC6 U+0BD7>, meaning the two sequences (<U+0B95 U+0BC6 U+0BB3> and <U+0B95 U+0BCC>) are not canonically equivalent, so there is no requirement for grapheme cluster segmentation to yield the same results.
When I look at it with my Tamil-ignorant eyes, it seems U+0BD7 ᴛᴀᴍɪʟ ᴀᴜ ʟᴇɴɢᴛʜ ᴍᴀʀᴋ and U+0BB3 ᴛᴀᴍɪʟ ʟᴇᴛᴛᴇʀ ʟʟᴀ look exactly the same. However, U+0BD7 is a spacing mark, but U+0BB3 isn't. If you use U+0BD7 in the input instead of U+0BB3, the result is what you expected.
Going out on a limb, I will say that you are using the wrong character but, again, I don't know Tamil at all so I can't be sure.
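The character properties discussed in this answer can be checked against the Unicode character database. Here is a small sketch in Python (the language of the recursive answer earlier on this page), using only the standard unicodedata module. The constant names and the describe helper are made up for illustration, and the general category it prints is related to, but not the same thing as, the Grapheme_Cluster_Break property.

```python
import unicodedata

LETTER_KA = "\u0b95"
LETTER_LLA = "\u0bb3"
VOWEL_SIGN_E = "\u0bc6"
VOWEL_SIGN_AU = "\u0bcc"
AU_LENGTH_MARK = "\u0bd7"

def describe(ch):
    """Return (code point, Unicode name, general category) for one character."""
    return f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch)

for ch in (LETTER_LLA, VOWEL_SIGN_E, VOWEL_SIGN_AU, AU_LENGTH_MARK):
    print(*describe(ch))

# The AU vowel sign canonically decomposes into the E sign plus the AU length mark
print([f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", VOWEL_SIGN_AU)])

# With the letter LLA in place of the length mark, normalization cannot
# turn the three-character sequence into KA + AU
three_chars = LETTER_KA + VOWEL_SIGN_E + LETTER_LLA
print(unicodedata.normalize("NFC", three_chars) == three_chars)
```

The letter reports category Lo while the vowel signs and the length mark report Mc, which is why only the sequence typed with real marks stays in one text element.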

How can I make nested string splits?

I have what seemed at first to be a trivial problem but turned out to become something I can't figure out how to easily solve. I need to be able to store lists of items in a string. Then those items in turn can be a list, or some other value that may contain my separator character. I have two different methods that unpack the two different cases but I realized I need to encode the contained value from any separator characters used with string.Split.
To illustrate the problem:
string[] nested = { "mary;john;carl", "dog;cat;fish", "plainValue" };
string list = string.Join(";", nested);
string[] unnested = list.Split(';'); // EEK! returns 7 items, expected 3!
This would produce a list "mary;john;carl;dog;cat;fish;plainValue", a value I can't split to get the three original nested strings from. Indeed, instead of the three original strings, I'd get 7 strings on split and this approach thus doesn't work at all.
What I want is to allow the values in my string to be encoded so I can unpack/split the contents just the way before I packed/join them. I assume I might need to go away from string.Split and string.Join and that is perfectly fine. I might just have overlooked some useful class or method.
How can I allow any string values to be packed / unpacked into lists?
I prefer neat, simple solutions over bulky if possible.
For the curious mind, I am making extensions for PlayerPrefs in Unity3D, and I can only work with ints, floats and strings. Thus I chose strings to be my data carrier. This is why I am making this nested list of strings.
try:
const char joinChar = '╗'; // make char const
string[] nested = { "mary;john;carl", "dog;cat;fish", "plainValue" };
string list = string.Join(Convert.ToString(joinChar), nested);
string[] unnested = list.Split(joinChar); // eureka returns 3!
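Using a character that is unlikely to appear in your data (here a box-drawing character, which is not actually ASCII) lets you join and split without interfering with the values that are themselves separated by the ; char. It breaks as soon as a value contains that character, though.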
Encode your strings with base64 encoding before joining.
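A minimal sketch of this suggestion, written in Python like the recursive answer earlier on this page (in C# the corresponding calls would be Convert.ToBase64String and Convert.FromBase64String). The function names pack_base64 and unpack_base64 are made up for illustration. The approach relies on the Base64 alphabet never containing the separator.

```python
import base64

def pack_base64(items, sep=";"):
    """Base64-encode every item, then join; the encoded text never contains ';'."""
    encoded = [base64.b64encode(s.encode("utf-8")).decode("ascii") for s in items]
    return sep.join(encoded)

def unpack_base64(packed, sep=";"):
    """Split on the separator and decode every item back to its original text."""
    if packed == "":
        # Caveat of this sketch: an empty list and a list holding one empty
        # string both pack to "", so "" is read back as an empty list
        return []
    return [base64.b64decode(part).decode("utf-8") for part in packed.split(sep)]

nested = ["mary;john;carl", "dog;cat;fish", "plainValue"]
packed = pack_base64(nested)
print(packed)
print(unpack_base64(packed))
```

Because the output of pack_base64 is itself an ordinary string, it can be passed in as an item of an outer list, which gives the nesting asked for, at the cost of roughly a third more characters per level.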
You get 7 items because you're splitting on the ; char, which also occurs inside the values. I would suggest changing your code to:
string[] nested = { "mary;john;carl", "dog;cat;fish", "plainValue" };
string list = string.Join("#", nested);
string[] unnested = list.Split('#'); // 3 strings again
Have you considered using a different separator, e.g. "|"?
This way the joined string will be "mary;john;carl|dog;cat;fish|plainValue", and when you call list.Split('|') it will return the three original strings.
Use some other value than semicolon (;) for joining. For example - you can use comma (,) and you will get "mary;john;carl,dog;cat;fish,plainValue". When you again split it based on (,) as a separator, you should get back your original string value.
I came up with a solution of my own as well.
I could encode the length of an item, followed with the contents of an item. It would not use string.Split and string.Join at all, but it would solve my problem. The content would be untouched, and any content that need encoding could in turn use this encoding in its content space.
To illustrate the format (constant length header):
< content length > < raw content >
To illustrate the format (variable length header):
< content length > < header stop character > < raw content >
In the former, a fixed length of characters are used to describe the length of the contents. This could be plain text, hexadecimal, base64 or some other encoding.
Example with 4 hexadecimals (ffff/65535 max length):
0005Hello0005World
In the latter example, we can reduce this to:
5:Hello5:World
Then I could look for the first occurrence of : and parse the length first, to extract the substring that follows. After that comes the next item of the list.
A nested example could look like:
e:5:Hello5:Worlda:2:Hi4:John
(List - 14 characters including headers)
Hello (5 characters)
World (5 characters)
(List - 10 characters including headers)
Hi (2 characters)
John (4 characters)
A drawback is that it explicitly requires the length of all items, even if no "shared separator" character is present in them (this solution uses no separators at all with a fixed-length header).
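The variable-length-header format can be sketched as follows, in Python like the recursive answer earlier on this page. It assumes the length is written in hexadecimal, as in the nested example above, and that : is the header stop character. The names pack_lengths and unpack_lengths are made up for illustration.

```python
def pack_lengths(items):
    """Write each item as <hex length>:<raw content>."""
    return "".join(f"{len(item):x}:{item}" for item in items)

def unpack_lengths(packed):
    """Read items back by parsing each length header, then slicing the content."""
    items = []
    pos = 0
    while pos < len(packed):
        stop = packed.index(":", pos)        # end of the length header
        length = int(packed[pos:stop], 16)   # header is a hexadecimal number
        start = stop + 1
        items.append(packed[start:start + length])
        pos = start + length                 # the next header starts right here
    return items

inner_a = pack_lengths(["Hello", "World"])
inner_b = pack_lengths(["Hi", "John"])
outer = pack_lengths([inner_a, inner_b])
print(outer)
print([unpack_lengths(part) for part in unpack_lengths(outer)])
```

The content is never scanned for separators, so items may contain : or any other character. Only the header is parsed, and the stated length decides where the next item begins.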
Maybe not as nice as you wanted, but here goes :)
static void Main(string[] args)
{
string[] str = new string[] {"From;niklas;to;lasse", "another;day;at;work;", "Bobo;wants;candy"};
string compiledString = GetAsString(str);
string[] backAgain = BackToStringArray(compiledString);
}
public static string GetAsString(string[] strings)
{
string returnString = string.Empty;
using (MemoryStream ms = new MemoryStream())
{
using (BinaryWriter writer = new BinaryWriter(ms))
{
writer.Write(strings.Length);
for (int i = 0; i < strings.Length; ++i)
{
writer.Write(strings[i]);
}
}
ms.Flush();
byte[] array = ms.ToArray();
// The bytes are binary data, not valid UTF-8 text, so use Base64 to keep the round trip lossless
returnString = Convert.ToBase64String(array);
}
return returnString;
}
public static string[] BackToStringArray(string encodedString)
{
string[] returnStrings = new string[0];
byte[] toBytes = Convert.FromBase64String(encodedString);
using (MemoryStream stream = new MemoryStream(toBytes))
{
using (BinaryReader reader = new BinaryReader(stream))
{
int numStrings = reader.ReadInt32();
returnStrings = new string[numStrings];
for (int i = 0; i < numStrings; ++i)
{
returnStrings[i] = reader.ReadString();
}
}
}
return returnStrings;
}

C# preg_replace?

What is the equivalent of PHP's preg_replace in C#?
I have an array of strings that I would like to replace with another array of strings. Here is an example in PHP. How can I do something like that in C# without using .Replace("old","new")?
$patterns[0] = '/=C0/';
$patterns[1] = '/=E9/';
$patterns[2] = '/=C9/';
$replacements[0] = 'à';
$replacements[1] = 'é';
$replacements[2] = 'é';
return preg_replace($patterns, $replacements, $text);
Real men use regular expressions, but here is an extension method that adds it to String if you wanted it:
public static class ExtensionMethods
{
public static String PregReplace(this String input, string[] pattern, string[] replacements)
{
if (replacements.Length != pattern.Length)
throw new ArgumentException("Replacement and Pattern Arrays must be balanced");
for (var i = 0; i < pattern.Length; i++)
{
input = Regex.Replace(input, pattern[i], replacements[i]);
}
return input;
}
}
You use it like this:
class Program
{
static void Main(string[] args)
{
String[] pattern = new String[4];
String[] replacement = new String[4];
pattern[0] = "Quick";
pattern[1] = "Fox";
pattern[2] = "Jumped";
pattern[3] = "Lazy";
replacement[0] = "Slow";
replacement[1] = "Turtle";
replacement[2] = "Crawled";
replacement[3] = "Dead";
String DemoText = "The Quick Brown Fox Jumped Over the Lazy Dog";
Console.WriteLine(DemoText.PregReplace(pattern, replacement));
}
}
You can use .Select() (in .NET 3.5 and C# 3) to ease applying functions to members of a collection.
stringsList.Select( s => replacementsList.Select( r => s.Replace(s,r) ) );
You don't need regexp support, you just want an easy way to iterate over the arrays.
public static class StringManipulation
{
public static string PregReplace(string input, string[] pattern, string[] replacements)
{
if (replacements.Length != pattern.Length)
throw new ArgumentException("Replacement and Pattern Arrays must be balanced");
for (int i = 0; i < pattern.Length; i++)
{
input = Regex.Replace(input, pattern[i], replacements[i]);
}
return input;
}
}
Here is what I will use. It is some of Jonathan Holland's code, but written for C# 2.0 instead of C# 3.0 :)
Thanks, all.
You are looking for System.Text.RegularExpressions:
using System.Text.RegularExpressions;
Regex r = new Regex("=C0");
string output = r.Replace(text, "à");
To get PHP's array behaviour the way you have it, you need multiple instances of Regex.
However, in your example you'd be much better served by .Replace(old, new); it's much faster than compiling state machines.
Edit: Ugh, I just realized this question was for 2.0, but I'll leave it in case you do have access to 3.5.
Just another take on the LINQ thing. I used List<Char> instead of Char[], but that's just to make it look a little cleaner. Arrays have no IndexOf instance method (only the static Array.IndexOf), but List has one. Why did I need this? Well, from what I am guessing, there is no direct correlation between the replacement list and the list of characters to be replaced, just the index.
So with that in mind, you can do this with Char[] just fine. But where you see the IndexOf method, you have to add a .ToList() before it.
Like this: someArray.ToList().IndexOf
String text;
List<Char> patternsToReplace;
List<Char> patternsToUse;
patternsToReplace = new List<Char>();
patternsToReplace.Add('a');
patternsToReplace.Add('c');
patternsToUse = new List<Char>();
patternsToUse.Add('X');
patternsToUse.Add('Z');
text = "This is a thing to replace stuff with";
var allAsAndCs = text.ToCharArray()
.Select
(
currentItem => patternsToReplace.Contains(currentItem)
? patternsToUse[patternsToReplace.IndexOf(currentItem)]
: currentItem
)
.ToArray();
text = new String(allAsAndCs);
This just converts the text to a character array, selects through each one. If the current character is not in the replacement list, just send back the character as is. If it is in the replacement list, return the character in the same index of the replacement characters list. Last thing is to create a string from the character array.
using System;
using System.Collections.Generic;
using System.Linq;
