TextElement Enumerator Class Bug or (Tamil) Unicode Bug

TextElement Enumerator Class Bug or (Tamil) Unicode Bug - c#

why the TextElementEnumerator not properly parsing the Tamil Unicode character.
using System;
using System.Collections.Generic;
using System.Globalization;
namespace Glyphtest
{
internal class Program
{
private static void Main()
{
const string unicodetxt1 = "ஊரவர் கெளவை";
List<string> output = Syllabify(unicodetxt1);
Console.WriteLine(output.Count);
const string unicodetxt2 = "கௌவை";
output = Syllabify(unicodetxt2);
Console.WriteLine(output.Count);
}
public static List<string> Syllabify(string unicodetext)
{
if (string.IsNullOrEmpty(unicodetext)) return null;
TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(unicodetext);
var data = new List<string>();
while (enumerator.MoveNext())
data.Add(enumerator.Current.ToString());
return data;
}
}
}
Following above code sample deals with Unicode character
'கௌ'-> 0x0bc8 (க) +0xbcc(ௌ). (Correct Form)
'கௌ'->0x0bc8 (க) +0xbc6(ெ) + 0xbb3(ள) (In Correct Form)
Is it bug in Text Element Enumerator Class ,
why its not to Enumerate it properly from the string.
i.e
கெளவை => 'கெள'+ 'வை' has to enumerated in Correct form
கெளவை => 'கெ' +'ள' +'வை' not to be enumerated in Incorrect form.
If so how to overcome this issue.

Its not been bug with Unicode character or TextElementEnumerator Class,
As specific to the lanaguage (Tamil)
letter made by any Tamil consonants followed by visual glyph
for eg-
க -\u0b95
ெ -\u0bc6
ள -\u0bb3
form Tamil character 'கெள' while its seems similar to formation of visual glyph
க -\u0b95
ௌ-\u0bcc
and its right form to solution.
hence before enumerating Tamil character we have replace irregular formation of character.
As with rule of Tamil Grammar (ஔகாரக் குறுக்கம்)
the visual glyph (ௌ) will come as starting letter of a word.
so that. the above code is to be should processed as
internal class Program
{
private static void Main()
{
const string unicodetxt1 = "ஊரவர் கெளவை";
List<string> output = Syllabify(unicodetxt1);
Console.WriteLine(output.Count);
const string unicodetxt2 = "கௌவை";
output = Syllabify(unicodetxt2);
Console.WriteLine(output.Count);
}
public static string CheckVisualGlyphPattern(string txt)
{
string[] data = txt.Split(new[] { ' ', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
string list = string.Empty;
var rx = new Regex("^(.*?){1}(\u0bc6){1}(\u0bb3){1}");
foreach (string s in data)
{
var matches = new List<Match>();
string outputs = rx.Replace(s, match =>
{
matches.Add(match);
return string.Format("{0}\u0bcc", match.Groups[1].Value);
});
list += string.Format("{0} ", outputs);
}
return list.Trim();
}
public static List<string> Syllabify(string unicodetext)
{
var processdata = CheckVisualGlyphPattern(unicodetext);
if (string.IsNullOrEmpty(processdata)) return null;
TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(processdata);
var data = new List<string>();
while (enumerator.MoveNext())
data.Add(enumerator.Current.ToString());
return data;
}
}
It produce the appropriate visual glyph while enumerating.

U+0BB3 ᴛᴀᴍɪʟ ʟᴇᴛᴛᴇʀ ʟʟᴀ has Grapheme_Cluster_Break=XX (Other). This makes the grapheme clusters <U+0BC8 U+0BC6><U+0BB3> the correct ones since there is always a grapheme cluster break before characters with Grapheme_Cluster_Break equal to Other.
<U+0BC8 U+0BCC> has no internal grapheme cluster breaks because U+0BCC has Grapheme_Cluster_Break=SpacingMark and there are usually no breaks before such characters (exceptions are at the start of text or when preceded by a control character).
Well, at least this is what the Unicode standard has to say (http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries).
Now, I have no idea of how Tamil works, so take what follows with a pinch of salt.
U+0BCC decomposes into <U+0BC6 U+0BD7>, meaning the two sequences (<U+0BC8 U+0BC6 U+0BB3> and <U+0BC8 U+0BCC>) not canonically equivalent, so there is no requirement for grapheme cluster segmentation to yield the same results.
When I look at it with my Tamil-ignorant eyes, it seems U+0BCC ᴛᴀᴍɪʟ ᴀᴜ ʟᴇɴɢᴛʜ ᴍᴀʀᴋ and U+0BB3 ᴛᴀᴍɪʟ ʟᴇᴛᴛᴇʀ ʟʟᴀ look exactly the same. However, U+0BCC is a spacing mark, but U+0BB3 isn't. If you use U+0BCC in the input instead of U+0BB3, the result is what you expected.
Going on a limb, I will say that you are using the wrong character but, again, I don't know Tamil at all so I can't be sure.

Related

c# Convert string with Emoji to unicode

Im getting a string from client side like this:
This is a face :grin:
And i need to convert the :grin: to unicode in order to send it to other service.
Any clue how to do that?

Here is a link to a quite good json file with relevant information. It contains huge array (about 1500 entries) with emojis, and we are interested in 2 properties: "short_name" which represents name like "grin", and "unified" property, which contains unicode representation like "1F601".
I built a helper class to replace short names like ":grin:" with their unicode equivalent:
public static class EmojiParser {
static readonly Dictionary<string, string> _colonedEmojis;
static readonly Regex _colonedRegex;
static EmojiParser() {
// load mentioned json from somewhere
var data = JArray.Parse(File.ReadAllText(#"C:\path\to\emoji.json"));
_colonedEmojis = data.OfType<JObject>().ToDictionary(
// key dictionary by coloned short names
c => ":" + ((JValue)c["short_name"]).Value.ToString() + ":",
c => {
var unicodeRaw = ((JValue)c["unified"]).Value.ToString();
var chars = new List<char>();
// some characters are multibyte in UTF32, split them
foreach (var point in unicodeRaw.Split('-'))
{
// parse hex to 32-bit unsigned integer (UTF32)
uint unicodeInt = uint.Parse(point, System.Globalization.NumberStyles.HexNumber);
// convert to bytes and get chars with UTF32 encoding
chars.AddRange(Encoding.UTF32.GetChars(BitConverter.GetBytes(unicodeInt)));
}
// this is resulting emoji
return new string(chars.ToArray());
});
// build huge regex (all 1500 emojies combined) by join all names with OR ("|")
_colonedRegex = new Regex(String.Join("|", _colonedEmojis.Keys.Select(Regex.Escape)));
}
public static string ReplaceColonNames(string input) {
// replace match using dictoinary
return _colonedRegex.Replace(input, match => _colonedEmojis[match.Value]);
}
}
Usage is obvious:
var target = "This is a face :grin: :hash:";
target = EmojiParser.ReplaceColonNames(target);
It's quite fast (except first run, because of static constructor initialization). On your string it takes less than 1ms (was not able to measure with stopwatch, always shows 0ms). On huge string which you will never meet in practice (1MB of text) it takes 300ms on my machine.

grouping adjacent similar substrings

I am writing a program in which I want to group the adjacent substrings, e.g ABCABCBC can be compressed as 2ABC1BC or 1ABCA2BC.
Among all the possible options I want to find the resultant string with the minimum length.
Here is code what i have written so far but not doing job. Kindly help me in this regard.
using System;
using System.Collections.Generic;
using System.Linq;
namespace EightPrgram
{
class Program
{
static void Main(string[] args)
{
string input;
Console.WriteLine("Please enter the set of operations: ");
input = Console.ReadLine();
char[] array = input.ToCharArray();
List<string> list = new List<string>();
string temp = "";
string firstTemp = "";
foreach (var x in array)
{
if (temp.Contains(x))
{
firstTemp = temp;
if (list.Contains(firstTemp))
{
list.Add(firstTemp);
}
temp = "";
list.Add(firstTemp);
}
else
{
temp += x;
}
}
/*foreach (var item in list)
{
Console.WriteLine(item);
}*/
Console.ReadLine();
}
}
}

You can do this with recursion. I cannot give you a C# solution, since I do not have a C# compiler here, but the general idea together with a python solution should do the trick, too.
So you have an input string ABCABCBC. And you want to transform this into an advanced variant of run length encoding (let's called it advanced RLE).
My idea consists of a general first idea onto which I then apply recursion:
The overall target is to find the shortest representation of the string using advanced RLE, let's create a function shortest_repr(string).
You can divide the string into a prefix and a suffix and then check if the prefix can be found at the beginning of the suffix. For your input example this would be:
(A, BCABCBC)
(AB, CABCBC)
(ABC, ABCBC)
(ABCA, BCBC)
...
This input can be put into a function shorten_prefix, which checks how often the suffix starts with the prefix (e.g. for the prefix ABC and the suffix ABCBC, the prefix is only one time at the beginning of the suffix, making a total of 2 ABC following each other. So, we can compact this prefix / suffix combination to the output (2ABC, BC).
This function shorten_prefix will be used on each of the above tuples in a loop.
After using the function shorten_prefix one time, there still is a suffix for most of the string combinations. E.g. in the output (2ABC, BC), there still is the string BC as suffix. So, need to find the shortest representation for this remaining suffix. Wooo, we still have a function for this called shortest_repr, so let's just call this onto the remaining suffix.
This image displays how this recursion works (I only expanded one of the node after the 3rd level, but in fact all of the orange circles would go through recursion):
We start at the top with a call of shortest_repr to the string ABABB (I selected a shorter sample for the image). Then, we split this string at all possible split positions and get a list of prefix / suffix pairs in the second row. On each of the elements of this list we first call the prefix/suffix optimization (shorten_prefix) and retrieve a shortened prefix/suffix combination, which already has the run-length numbers in the prefix (third row). Now, on each of the suffix, we call our recursion function shortest_repr.
I did not display the upward-direction of the recursion. When a suffix is the empty string, we pass an empty string into shortest_repr. Of course, the shortest representation of the empty string is the empty string, so we can return the empty string immediately.
When the result of the call to shortest_repr was received inside our loop, we just select the shortest string inside the loop and return this.
This is some quickly hacked code that does the trick:
def shorten_beginning(beginning, ending):
count = 1
while ending.startswith(beginning):
count += 1
ending = ending[len(beginning):]
return str(count) + beginning, ending
def find_shortest_repr(string):
possible_variants = []
if not string:
return ''
for i in range(1, len(string) + 1):
beginning = string[:i]
ending = string[i:]
shortened, new_ending = shorten_beginning(beginning, ending)
shortest_ending = find_shortest_repr(new_ending)
possible_variants.append(shortened + shortest_ending)
return min([(len(x), x) for x in possible_variants])[1]
print(find_shortest_repr('ABCABCBC'))
print(find_shortest_repr('ABCABCABCABCBC'))
print(find_shortest_repr('ABCABCBCBCBCBCBC'))
Open issues
I think this approach has the same problem as the recursive levenshtein distance calculation. It calculates the same suffices multiple times. So, it would be a nice exercise to try to implement this with dynamic programming.

If this is not a school assignment or performance critical part of the code, RegEx might be enough:
string input = "ABCABCBC";
var re = new Regex(#"(.+)\1+|(.+)", RegexOptions.Compiled); // RegexOptions.Compiled is optional if you use it more than once
string output = re.Replace(input,
m => (m.Length / m.Result("$1$2").Length) + m.Result("$1$2")); // "2ABC1BC" (case sensitive by default)

Reverse a string with accent chars?

So I saw Jon's skeet video and there was a code sample :
There should have been a problem with the é - after reversing but I guess it fails on .net2 (IMHO), anyway it did work for me and I did see the correct reversed string.
char[] a="Les Misérables".ToCharArray();
Array.Reverse(a);
string n= new string(a);
Console.WriteLine (n); //selbarésiM seL
But I took it further:
In Hebrew there is the "Alef" char : א
and I can add punctuation like : אֳ ( which I believe consists of 2 chars - yet displayed as one.)
But now look what happens :
char[] a="Les Misאֳrables".ToCharArray();
Array.Reverse(a);
string n= new string(a);
Console.WriteLine (n); //selbarֳאsiM seL
There was a split...
I can understand why it is happening :
Console.WriteLine ("אֳ".Length); //2
So I was wondering if there's a workaround for this kind of issue in C# ( or should I build my own mechanism....)

The problem is that Array.Reverse isn't aware that certain sequences of char values may combine to form a single character, or "grapheme", and thus shouldn't be reversed. You have to use something that understands Unicode combining character sequences, like TextElementEnumerator:
// using System.Globalization;
TextElementEnumerator enumerator =
StringInfo.GetTextElementEnumerator("Les Misאֳrables");
List<string> elements = new List<string>();
while (enumerator.MoveNext())
elements.Add(enumerator.GetTextElement());
elements.Reverse();
string reversed = string.Concat(elements); // selbarאֳsiM seL

If you made the extension
public static IEnumerable<string> ToTextElements(this string source)
{
var e = StringInfo.GetTextElementEnumerator(source)
while (e.MoveNext())
{
yield return e.GetTextElement();
}
}
you could do,
const string a = "AnyStringYouLike";
var aReversed = string.Concat(a.ToTextElements().Reverse());

C# Regex Split To Java Pattern split

I have to port some C# code to Java and I am having some trouble converting a string splitting command.
While the actual regex is still correct, when splitting in C# the regex tokens are part of the resulting string[], but in Java the regex tokens are removed.
What is the easiest way to keep the split-on tokens?
Here is an example of C# code that works the way I want it:
using System;
using System.Text.RegularExpressions;
class Program
{
static void Main()
{
String[] values = Regex.Split("5+10", #"([\+\-\*\(\)\^\\/])");
foreach (String value in values)
Console.WriteLine(value);
}
}
Produces:
5
+
10

I don't know how C# does it, but to accomplish it in Java, you'll have to approximate it. Look at how this code does it:
public String[] split(String text) {
if (text == null) {
text = "";
}
int last_match = 0;
LinkedList<String> splitted = new LinkedList<String>();
Matcher m = this.pattern.matcher(text);
// Iterate trough each match
while (m.find()) {
// Text since last match
splitted.add(text.substring(last_match,m.start()));
// The delimiter itself
if (this.keep_delimiters) {
splitted.add(m.group());
}
last_match = m.end();
}
// Trailing text
splitted.add(text.substring(last_match));
return splitted.toArray(new String[splitted.size()]);
}

This is because you are capturing the split token. C# takes this as a hint that you wish to retain the token itself as a member of the resulting array. Java does not support this.

C# preg_replace?

What is the PHP preg_replace in C#?
I have an array of string that I would like to replace by an other array of string. Here is an example in PHP. How can I do something like that in C# without using .Replace("old","new").
$patterns[0] = '/=C0/';
$patterns[1] = '/=E9/';
$patterns[2] = '/=C9/';
$replacements[0] = 'à';
$replacements[1] = 'é';
$replacements[2] = 'é';
return preg_replace($patterns, $replacements, $text);

Real men use regular expressions, but here is an extension method that adds it to String if you wanted it:
public static class ExtensionMethods
{
public static String PregReplace(this String input, string[] pattern, string[] replacements)
{
if (replacements.Length != pattern.Length)
throw new ArgumentException("Replacement and Pattern Arrays must be balanced");
for (var i = 0; i < pattern.Length; i++)
{
input = Regex.Replace(input, pattern[i], replacements[i]);
}
return input;
}
}
You use it like this:
class Program
{
static void Main(string[] args)
{
String[] pattern = new String[4];
String[] replacement = new String[4];
pattern[0] = "Quick";
pattern[1] = "Fox";
pattern[2] = "Jumped";
pattern[3] = "Lazy";
replacement[0] = "Slow";
replacement[1] = "Turtle";
replacement[2] = "Crawled";
replacement[3] = "Dead";
String DemoText = "The Quick Brown Fox Jumped Over the Lazy Dog";
Console.WriteLine(DemoText.PregReplace(pattern, replacement));
}
}

You can use .Select() (in .NET 3.5 and C# 3) to ease applying functions to members of a collection.
stringsList.Select( s => replacementsList.Select( r => s.Replace(s,r) ) );
You don't need regexp support, you just want an easy way to iterate over the arrays.

public static class StringManipulation
{
public static string PregReplace(string input, string[] pattern, string[] replacements)
{
if (replacements.Length != pattern.Length)
throw new ArgumentException("Replacement and Pattern Arrays must be balanced");
for (int i = 0; i < pattern.Length; i++)
{
input = Regex.Replace(input, pattern[i], replacements[i]);
}
return input;
}
}
Here is what I will use. Some code of Jonathan Holland but not in C#3.5 but in C#2.0 :)
Thx all.

You are looking for System.Text.RegularExpressions;
using System.Text.RegularExpressions;
Regex r = new Regex("=C0");
string output = r.Replace(text);
To get PHP's array behaviour the way you have you need multiple instances of `Regex
However, in your example, you'd be much better served by .Replace(old, new), it's much faster than compiling state machines.

Edit: Uhg I just realized this question was for 2.0, but I'll leave it in case you do have access to 3.5.
Just another take on the Linq thing. Now I used List<Char> instead of Char[] but that's just to make it look a little cleaner. There is no IndexOf method on arrays but there is one on List. Why did I need this? Well from what I am guessing, there is no direct correlation between the replacement list and the list of ones to be replaced. Just the index.
So with that in mind, you can do this with Char[] just fine. But when you see the IndexOf method, you have to add in a .ToList() before it.
Like this: someArray.ToList().IndexOf
String text;
List<Char> patternsToReplace;
List<Char> patternsToUse;
patternsToReplace = new List<Char>();
patternsToReplace.Add('a');
patternsToReplace.Add('c');
patternsToUse = new List<Char>();
patternsToUse.Add('X');
patternsToUse.Add('Z');
text = "This is a thing to replace stuff with";
var allAsAndCs = text.ToCharArray()
.Select
(
currentItem => patternsToReplace.Contains(currentItem)
? patternsToUse[patternsToReplace.IndexOf(currentItem)]
: currentItem
)
.ToArray();
text = new String(allAsAndCs);
This just converts the text to a character array, selects through each one. If the current character is not in the replacement list, just send back the character as is. If it is in the replacement list, return the character in the same index of the replacement characters list. Last thing is to create a string from the character array.
using System;
using System.Collections.Generic;
using System.Linq;

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

TextElement Enumerator Class Bug or (Tamil) Unicode Bug - c#

Related

c# Convert string with Emoji to unicode

grouping adjacent similar substrings

Reverse a string with accent chars?

C# Regex Split To Java Pattern split

C# preg_replace?

Categories

Resources