How to count unique characters in a string [duplicate] - c#

Let's say we have a variable myString = "blabla" or myString = "998769".
myString.Length; // will get you the total length
myString.Count(char.IsLetter); // if you only want the count of letters
How do I get the unique character count? I mean, for "blabla" the result must be 3, and for "998769" it must be 4. Is there a ready-to-go function? Any suggestions?

You can use LINQ:
var count = myString.Distinct().Count();
This works because string implements IEnumerable<char>.
Without LINQ, you can do what Distinct does internally yourself and use a HashSet<char>:
var count = (new HashSet<char>(myString)).Count;
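For the examples from the question, both approaches give the expected counts:
Console.WriteLine("blabla".Distinct().Count());       // 3 (b, l, a)
Console.WriteLine(new HashSet<char>("998769").Count); // 4 (9, 8, 7, 6)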

If you handle only English ANSI text (or, more precisely, only characters from the BMP) then 80% of the time, if you write:
myString.Distinct().Count()
you will live happily and never have any trouble. Let me post this answer only for those who really need to handle this the proper way. I'd say everyone should, but I know that's not true (quote from Wikipedia):
Because the most commonly used characters are all in the Basic Multilingual Plane, handling of surrogate pairs is often not thoroughly tested. This leads to persistent bugs and potential security holes, even in popular and well-reviewed application software (e.g. CVE-2008-2938, CVE-2012-2135)
The problem with our first naïve solution is that it doesn't handle Unicode properly, and it also doesn't consider what the user perceives as a character. Let's try "𠀑".Distinct().Count(): your code will wrongly return... 2, because that character's UTF-16 representation is 0xD840 0xDC11 (BTW, neither of them alone is a valid Unicode character, because they are a high and a low surrogate, respectively).
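A quick way to see this for yourself (the counts below follow from the UTF-16 representation described above):
string s = "𠀑";
Console.WriteLine(s.Length);              // 2 (code units, not characters)
Console.WriteLine(s.Distinct().Count());  // 2, although the user sees 1 character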
Here I won't be very strict about terms and definitions, so please refer to www.unicode.org as a reference. For a (much) broader discussion please read How can I perform a Unicode aware character by character comparison?; encoding isn't the only issue you have to consider.
1) It doesn't take into account that .NET System.Char doesn't represent a character (or, more precisely, a grapheme) but a code unit of UTF-16 encoded text (they can differ, for example, with ideographic characters). Often they coincide, but not always.
2) If you're counting what the user thinks of (or perceives) as a character then this will fail again, because it doesn't check for combining character sequences like ا́ (there are many examples of this in Arabic). There are duplicates that exist for historical reasons: for example é is both a single Unicode code point and a combination (so that code will fail).
3) We're talking about a Western/American definition of a character. If you're counting characters for end users you may need to change your definition to what they expect (for example, in Korean the definition of a character may not be so obvious; another example is the Czech digraph ch, which is always counted as a single character). Finally, don't forget some strange things that happen when you convert characters to upper/lower case (for example, in German ß is SS in upper case; see also this post).
Encoding
C# strings are encoded as UTF-16 (char is two bytes), but UTF-16 isn't a fixed-size encoding, and char should properly be called a code unit. What does that mean? That you may have a string where Length is 2 but the user sees (and there actually is) just one character (so the count should be 1).
If you need to handle this properly then you have to make things much more complicated (and slower). Fortunately, the Char class has some helpful methods for handling surrogates.
The following code is untested (and for illustration purposes, so absolutely not optimized; I'm sure it can be done much better than this), so take it just as a starting point for further investigation:
int CountCharacters(string text)
{
    HashSet<string> characters = new HashSet<string>();
    string currentCharacter = "";

    for (int i = 0; i < text.Length; ++i)
    {
        if (Char.IsHighSurrogate(text, i))
        {
            // Do not count this one; the next iteration will add the full pair
            currentCharacter = text[i].ToString();
            continue;
        }
        else if (Char.IsLowSurrogate(text, i))
        {
            // Our "character" is encoded as the previous code unit plus this one
            currentCharacter += text[i];
        }
        else
        {
            currentCharacter = text[i].ToString();
        }

        // HashSet<T>.Add is a no-op for duplicates
        characters.Add(currentCharacter);
    }

    return characters.Count;
}
Note that this example doesn't handle duplicates (where the same character may have different code representations, or can be either a single code point or a combining sequence).
Combining Characters
If you have to handle combining characters (and of course encoding) then the best way to do it is to use the StringInfo class. You'll enumerate (and then count) both combining sequences and encoded characters:
StringInfo.GetTextElementEnumerator(text).Walk()
.Distinct().Count();
Walk() is a trivial-to-implement extension method that simply walks through all the IEnumerator elements (we need it because GetTextElementEnumerator() returns an IEnumerator instead of an IEnumerable).
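A minimal sketch of such a helper (Walk is not a framework method; this name and shape are just assumptions):
using System.Collections;
using System.Collections.Generic;

static class EnumeratorExtensions
{
    // Wraps a non-generic IEnumerator (such as TextElementEnumerator)
    // into an IEnumerable<string> so LINQ operators can be applied.
    public static IEnumerable<string> Walk(this IEnumerator enumerator)
    {
        while (enumerator.MoveNext())
            yield return (string)enumerator.Current;
    }
}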
Please note that after the text has been properly split it can be counted with our first solution (the point is that the building block isn't a char but a sequence of chars, for simplicity returned here as a string). Again, this code doesn't handle duplicates.
Culture
There is not much you can do to handle the issues listed in point 3. Each language has its own rules, and supporting them all can be a pain. More examples of culture issues are in this longer, more specific post.
It's important to be aware of them (so you have to know a little bit about the languages you're targeting), and don't forget that Unicode and a few translated .resx files won't make your application global.
If text processing is important in your application, you can solve many issues by using specialized DLLs for each locale you support (to count characters, to count words and so on), like word processors do. For example, the issues I listed can be solved using dictionaries. What I usually do is avoid the standard .NET string functions (also because of some bugs); instead I create a Unicode class with static methods for everything I need (character counting, conversions, comparison) and many specialized derived classes for each supported language. At run time those static methods use the current thread's culture name to pick the proper implementation from a dictionary and delegate the work to it. A skeleton may look like this:
abstract class Unicode
{
    public static int CountCharacters(string text)
    {
        return GetConcreteClass().CountCharactersCore(text);
    }

    protected virtual int CountCharactersCore(string text)
    {
        // Default implementation, overridden in derived classes if needed
        return StringInfo.GetTextElementEnumerator(text).Walk()
            .Distinct().Count();
    }

    private static Dictionary<string, Unicode> _implementations;

    private static Unicode GetConcreteClass()
    {
        string cultureName = Thread.CurrentThread.CurrentCulture.Name;

        // Check if concrete class has been loaded and put in dictionary
        ...

        return _implementations[cultureName];
    }
}

If you're using C# then LINQ comes nicely to the rescue - again:
"blabla".Distinct().Count()
will do it.

Related

How to map C# compiler error location (line, column) onto the SyntaxTree produced by Roslyn API?

So:
The C# compiler outputs (line, column) style locations.
The Roslyn API expects sequential text positions.
How do I map the former to the latter?
The C# code could be UTF-8 with or without the BOM, or even UTF-16. It could contain all kinds of characters in the form of comments or embedded strings.
Let us assume we know the encoding and have the respective Encoding object handy. I can convert the file bytes to a char[]. The problem is that some chars may contribute zero to the final sequential position. I know that the BOM character does. I have no idea whether others may too.
Now, if we knew for sure that the BOM is the only character that contributes 0 to the length, then I could skip it, count the characters, and my question would become trivial. This is what I do today - I just assume that the BOM is the only "bad" player.
But maybe there is a better way? Maybe the Roslyn API contains some hidden gem that accepts (line, column) and spits out the sequential position? Or maybe some of the Microsoft.Build libraries do?
EDIT 1
As per the accepted answer, the following gives the location:
var srcText = SourceText.From(File.ReadAllText(err.FilePath));
int location = srcText.Lines[err.Line - 1].Start + err.Column - 1;
You have uncovered the reason that the SourceText type exists in the Roslyn APIs. Its entire purpose is to handle the encoding of strings and perform calculations of lines, columns, and spans.
Due to the way .NET handles Unicode, and depending on which code pages are installed in your OS, there could be cases where SourceText does not do what you need. It has generally proven "good enough" for our purposes though.
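For completeness, a sketch of the same mapping that also lets SourceText deal with the encoding/BOM question by reading from the stream (err with its FilePath/Line/Column is the compiler-error object from the edit above; 1-based coordinates are assumed):
using System.IO;
using Microsoft.CodeAnalysis.Text;

using (var stream = File.OpenRead(err.FilePath))
{
    // SourceText detects the encoding and BOM from the stream itself.
    SourceText srcText = SourceText.From(stream);
    // LinePosition is 0-based, so subtract 1 from the compiler's line/column.
    int location = srcText.Lines.GetPosition(
        new LinePosition(err.Line - 1, err.Column - 1));
}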

C# parse Apple formatted strings (printf)

I'm dealing with some Apple-formatted strings, for example:
[%d] fsPurgeable type: %#, count: %lld bytes for %lld files
According to Apple's documentation here, the format string specification follows the IEEE printf specification, though with some modifications, it appears.
I need to parse these strings and replace the %-style placeholders with the actual data that belongs there. My initial thought was to use a regex; however, as these claim to adhere to the printf specification, I was wondering if there was anything already in .NET that might help with this.
I've done some reading, but nothing immediately jumped out as useful.
Any suggestions?
To somewhat answer my own question, I ended up writing a regex to handle the parsing:
var regex = new Regex(@"(?i)%[dspublicxfegahztj{}#%]*");
(?i) = case-insensitive flag
% = literal percent sign
[dspublicxfegahztj{}#%] = all valid characters that could follow the %
* = match zero or more of those characters, e.g. %lld
I'm still curious whether there is anything in .NET to handle this sort of thing.
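For what it's worth, one way to drive the actual substitution from that regex is a MatchEvaluator. A rough sketch, not a full printf implementation; FormatAppleString and the positional args array are my own inventions for illustration:
using System;
using System.Text.RegularExpressions;

static string FormatAppleString(string format, params object[] args)
{
    var regex = new Regex(@"(?i)%[dspublicxfegahztj{}#%]*");
    int index = 0;
    // Replace each placeholder, in order, with the matching argument;
    // "%%" is an escaped literal percent sign.
    return regex.Replace(format, match =>
        match.Value == "%%" ? "%" : Convert.ToString(args[index++]));
}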

Why C# Unicode range cover limited range (up to 0xFFFF)?

I'm getting confused about C# UTF8 encoding...
Assuming those "facts" are right:
Unicode is the "protocol" which defines each character.
UTF-8 defines the "implementation" - how to store those characters.
Unicode define character range from 0x0000 to 0x10FFFF (source)
According to the C# reference, the accepted range for each char is 0x0000 to 0xFFFF. I don't understand what happens with the other characters, above 0xFFFF, that are defined in the Unicode protocol.
In contrast to C#, when I use Python to write UTF-8 text it covers the whole expected range (0x0000 to 0x10FFFF). For example:
u"\U00010000" #WORKING!!!
which doesn't work in C#. What's more, when I write the string u"\U00010000" (a single character) from Python to a text file and then read it from C#, this single-character document becomes 2 characters in C#!
# Python (write):
import codecs
with codecs.open("file.txt", "w+", encoding="utf-8") as f:
    f.write(text)  # len(text) -> 1
// C# (read):
string text = File.ReadAllText("file.txt", Encoding.UTF8); // How I read this text from the file.
Console.WriteLine(text.Length); // 2
Why? How to fix?
According to the C# reference, the accepted range for each char is 0x0000 to 0xFFFF. I don't understand what happens with the other characters, above 0xFFFF, that are defined in the Unicode protocol.
Unfortunately, a C#/.NET char does not represent a Unicode character.
A char is a 16-bit value in the range 0x0000 to 0xFFFF which represents one "UTF-16 code unit". Characters in the ranges U+0000–U+D7FF and U+E000–U+FFFF are represented by the code unit of the same number, so everything's fine there.
The less-often-used other characters, in the range U+010000 to U+10FFFF, are squashed into the remaining space 0xD800–0xDFFF by representing each character as two UTF-16 code units together (a surrogate pair), so the equivalent of the Python string "\U00010000" is the C# string "\uD800\uDC00".
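You can verify this in C#; char.ConvertFromUtf32 and char.ConvertToUtf32 are the framework helpers for crossing the code point/code unit boundary:
string s = "\uD800\uDC00";                                   // U+10000 as a surrogate pair
Console.WriteLine(s.Length);                                 // 2 (code units)
Console.WriteLine(char.ConvertToUtf32(s, 0).ToString("X"));  // 10000
Console.WriteLine(char.ConvertFromUtf32(0x10000) == s);      // True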
Why?
The reason for this craziness is that the Windows NT series itself uses UTF-16LE as the native string encoding, so for interoperability convenience .NET chose the same. Windows NT chose that encoding (at the time thought of as UCS-2, without any of the pesky surrogate code unit pairs) because in the early days Unicode only had characters up to U+FFFF, and the thinking was that this would be all anyone was going to need.
How to fix?
There isn't really a good fix. Some other languages that were unfortunate enough to have based their string type on UTF-16 code units (Java, JavaScript) are starting to add methods to their strings to do operations on them counting a code point at a time; but there is no such functionality in .NET at present.
Often you don't actually need to count/find/split/order/etc. strings using proper code point items and indexes. But when you really do, in .NET, you're in for a bad time. You end up having to re-implement each normally-trivial method by manually walking over each char and checking whether it is part of a two-char surrogate pair, or by converting the string to an array of code point ints and back. This isn't a lot of fun, either way.
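As an illustration, a minimal sketch of one such re-implemented method, counting code points rather than code units:
static int CountCodePoints(string s)
{
    int count = 0;
    for (int i = 0; i < s.Length; i++)
    {
        count++;
        // IsSurrogatePair checks the bounds itself; skip the low half of a pair.
        if (char.IsSurrogatePair(s, i))
            i++;
    }
    return count;
}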
A more elegant and altogether more practical option is to invent a time machine, so we can send the UTF-8 design back to 1988 and prevent UTF-16 from ever having existed.
Unicode has so-called planes (wiki).
As you can see, C#'s char type only supports the first plane, plane 0, the basic multilingual plane.
I know for a fact that C# uses UTF-16 encoding, so I'm a bit surprised to see that it doesn't support code points beyond the first plane in the char datatype. (I haven't run into this issue myself...)
This is an artificial restriction in char's implementation, but an understandable one. The designers of .NET probably didn't want to tie the abstraction of their own character datatype to the abstraction that Unicode defines, in case that standard did not survive (it had already superseded others). This is just my guess, of course. It just "uses" UTF-16 for its memory representation.
UTF-16 uses a trick to squash the code points higher than 0xFFFF into pairs of 16-bit code units, as you can read about here. Technically those code points consist of 2 "characters", the so-called surrogate pair. In that sense it breaks the "one code point = one character" abstraction.
You can definitely get around this by working with string and maybe arrays of char. If you have more specific problems, you can find plenty of information on StackOverflow and elsewhere about working with all of Unicode's code points in .NET.

How to arrange the matched output in c#?

I'm matching words to create a simple lexical analyzer.
Here is my example code and output.
Example code:
public class
{
    public static void main (String args[])
    {
        System.out.println("Hello");
    }
}
output:
public = identifier
void = identifier
main = identifier
class = identifier
As you can see, my output is not arranged in the order the input comes. In the input, void and main come after class, but in the output class comes at the end. I want to print the results in the order in which the input is matched.
C# code:
private void button1_Click(object sender, EventArgs e)
{
    if (richTextBox1.Text.Contains("public"))
        richTextBox2.AppendText("public = identifier\n");
    if (richTextBox1.Text.Contains("void"))
        richTextBox2.AppendText("void = identifier\n");
    if (richTextBox1.Text.Contains("class"))
        richTextBox2.AppendText("class = identifier\n");
    if (richTextBox1.Text.Contains("main"))
        richTextBox2.AppendText("main = identifier\n");
}
Your code is asking the following questions:
Does the input contain the text "public"? If so, write down "public = identifier".
Does the input contain the text "void"? If so, write down "void = identifier".
Does the input contain the text "class"? If so, write down "class = identifier".
Does the input contain the text "main"? If so, write down "main = identifier".
The answer to all of these questions is yes, and since they're executed in that exact order, the output you get should not be surprising. Note: public, void, class and main are keywords, not identifiers.
Splitting on whitespace?
So your approach is not going to help you tokenize that input. Something slightly more in the right direction would be input.Split() - that will cut up the input at whitespace boundaries and give you an array of strings. Still, there are a lot of empty entries in there.
input.Split(new char[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries) is a little better, giving us the following output: public, class, {, public, static, void, main, (String, args[]), {, System.out.println("Hello");, } and }.
But you'll notice that some of these strings contain multiple 'tokens': (String, args[]) and System.out.println("Hello");. And if you had a string literal with whitespace in it, it would get split into multiple tokens. Apparently, just splitting on whitespace is not sufficient.
Tokenizing
At this point, you would start writing a loop that goes over every character in the input, checking whether it's whitespace or a punctuation character (such as (, ), {, }, [, ], ., ;, and so on). Those characters should be treated as the end of the previous token, and punctuation characters should also be treated as tokens of their own. Whitespace can be skipped.
You'll also have to take things like string literals and comments into account: anything in between two double quotes should not be tokenized but treated as part of a single 'string' token (including whitespace). Also, strings can contain escape sequences, such as \", that produce a single character (that double quote should not be treated as the end of the string, but as part of its content).
Anything that comes after two forward slashes should be ignored (or parsed as a single 'comment' token, if you want to process comments somehow), until the next newline (newline characters/sequences differ across operating systems). Anything after a /* should be ignored until you encounter a */ sequence.
Numbers can optionally start with a minus sign, can contain a dot (or start with a dot) and an exponent part for scientific notation (which can also be negative), and there are type suffixes...
In other words, you're writing a state machine, with different behaviour depending on what state you're in: 'string', 'comment', 'block comment', 'numeric literal', and so on.
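To make the shape of that loop concrete, here is a bare-bones sketch of the whitespace/punctuation stage only (no string, comment or number states yet; adding those is what turns it into the state machine described above):
using System.Collections.Generic;
using System.Text;

static IEnumerable<string> Tokenize(string input)
{
    var token = new StringBuilder();
    foreach (char c in input)
    {
        if (char.IsWhiteSpace(c) || "(){}[].;,".IndexOf(c) >= 0)
        {
            if (token.Length > 0)
            {
                yield return token.ToString();  // end of the previous token
                token.Clear();
            }
            if (!char.IsWhiteSpace(c))
                yield return c.ToString();      // punctuation is its own token
        }
        else
        {
            token.Append(c);
        }
    }
    if (token.Length > 0)
        yield return token.ToString();          // trailing token, if any
}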
Lexing
It's useful to assign a type to each token, either while tokenizing or as a separate step (lexing). public is a keyword, main is an identifier, 1234 is an integer literal, "Hello" is a string literal, and so on. This will help during the next step.
Parsing
You can now move on to parsing: turning a list of tokens into an abstract syntax tree (AST). At this point you can check if a list of tokens is actually valid code. You basically repeat the above step, but at a higher level.
For example, public, protected and private are keyword tokens, and they're all access modifiers. As soon as you encounter one of these, you know that either a class, a function, a field or a property definition must follow. If the next token is a while keyword, then you signal an error: public while is not a valid C# construct. If, however, the next token is a class keyword, then you know it's a class definition and you continue parsing.
So you've got a state machine once again, but this time you've got states like 'class definition', 'function definition', 'expression', 'binary expression', 'unary expression', 'statement', 'assignment statement', and so on.
Conclusion
This is by no means complete, but hopefully it'll give you a better idea of all the steps involved and how to approach this. There are also tools available that can generate parsing code from a grammar specification, which can ease the job somewhat (though you still need to learn how to write such grammars).
You may also want to read the C# language specification, specifically the part about its grammar and lexical structure. The spec can be downloaded for free from one of Microsoft's websites.
CodeCaster is right: you are not on the right path.
I have a lexical analyzer that I made some time ago as a project.
I know, I know, I'm not supposed to put things on a plate here, but the analyzer is for C++, so you'll have to change a few things.
Take a look at the source code and please try to understand at least how it works: C++ Lexical Analyzer
In the strictest sense, the reason for the described behaviour is that in the evaluating code the check for void comes before the check for class. However, the approach as a whole seems far too simple for lexical analysis, as it simply checks for substrings. I totally second the comments above: depending on what you are trying to achieve in the big picture, a more sophisticated approach might be necessary.

Is there a regex to test if a string is for a locale? [closed]

I don't know anything about regular expressions, but I think I have to use one for my problem. I've got some file names that look like:
MyResource
MyResource.en-GB
MyResource.en-US
MyResource.fr-FR
MyResource.de-DE
The idea is to test whether my strings end with "[letter][letter]-[letter][letter]".
I know this is a very noob question, but I just have no idea how to do it, even though I know exactly what I want to do... :(
To cater for basic variants:
^[A-Za-z]{2,4}([_-][A-Za-z]{4})?([_-]([A-Za-z]{2}|[0-9]{3}))?$
which consists of:
Language code: ISO 639, 2 or 3 alpha (or 4 alpha, reserved for future use).
Optional script code: ISO 15924, 4 alpha.
Optional country code: ISO 3166-1, 2 alpha or 3 digits.
Separated by underscores or dashes.
Valid examples are:
de
en-US
zh-Hant-TW
En-au
aZ_cYrl-aZ.
For the OP's specific question, this would need to be prefixed by /^MyResource[.] and suffixed by $/ to ensure the whole file name is for a valid resource file that ends in a locale.
Note that some programming languages' functions may only accept particular forms, like only underscores and an uppercase country code. PHP's intl functions accept either case and either separator. PayPal accepts only the language, or the la_CY form, where la is the language and CY is the country/region. The PHP locale_canonicalize function can be used to standardise to this format.
IETF RFC 5646, which governs internet usage of these tags, recommends a capitalisation and separation format like az-Cyrl-AZ, as used in the first three examples above, though it says processors should accept any mix of case and either separator, as per the last two examples. When displaying locales, using - as the separator allows finer-grained line-wrapping, which might otherwise produce mostly empty lines, as when the non-wrapping _ is used, especially in table cells.
The regex for the recommended basic format is:
^[a-z]{2,4}(-[A-Z][a-z]{3})?(-([A-Z]{2}|[0-9]{3}))?$
The regexp only covers the basic format. There are variants for extras, like local region. RFC 5646 allows for such variants, along with private extensions and backwards-compatibility forms. It all depends upon the granularity required. The CLDR Unicode database, which is used by PHP's intl functions and other programs, may include such variants from version to version, though they can also disappear at a later time.
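In C# (the OP's context), combining the file-name prefix suggested earlier with the basic-variants regex might look like this (a sketch):
using System;
using System.Text.RegularExpressions;

var localeFile = new Regex(
    @"^MyResource[.][A-Za-z]{2,4}([_-][A-Za-z]{4})?([_-]([A-Za-z]{2}|[0-9]{3}))?$");
Console.WriteLine(localeFile.IsMatch("MyResource.en-GB"));      // True
Console.WriteLine(localeFile.IsMatch("MyResource.zh-Hant-TW")); // True
Console.WriteLine(localeFile.IsMatch("MyResource.notalocale")); // False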
If using a CLDR-based function set, like PHP's intl extension, you can check if a locale exists in the intl database using a function like:
<?php
function is_locale($locale=''){
    // STANDARDISE INPUT
    $locale=locale_canonicalize($locale);
    // LOAD ARRAY WITH LOCALES
    $locales=resourcebundle_locales('');
    // RETURN WHETHER FOUND
    return (array_search($locale,$locales)!==FALSE);
}
?>
It takes about half a millisecond to load and search the data, so it won't be too much of a performance hit.
Of course, it will only find those in the database of the CLDR version supplied with the PHP version used, but will be updated with each subsequent PHP release.
Note that some locales are not for countries, but for regions, and these are typically numeric, like 001 for 'World', 150 for 'Europe' and 419 for 'Latin America'. So there are now en-001, en-150, ar-001, and es-419, which can be used for generic language purposes. For example, en-001 was designed to decouple dependence upon en-US as an ersatz English, especially since US date formats and spellings are radically different from those of the 100 other regional en variants. The en-150 locale is the same as en-001 except for numbering separators and other Europe-specific formats.
In general, a regexp is a good front-end sanity check to filter out illegal characters, and especially to reserve the format for possible future additions. It also helps to prevent malicious character combinations being sent to the lookup facility, especially if text-based lookup command mechanisms, like SQL or XPath, are used.
That would be testing your input against:
\.[a-z]{2}-[A-Z]{2}$
This is really very literal: "match a dot (\., the dot being a special character in regexes), followed by exactly two of any characters from a to z ([a-z]{2}, where [...] is a character class), followed by a dash (-), followed by two of any characters from A to Z ([A-Z]{2}), followed by the end of input ($)".
http://www.dotnetperls.com/regex-match <-- how to apply this regex in C# against an input. It means the code would look like (UNTESTED):
// Post edit: this will really return a boolean
if (Regex.Match(input, @"\.[a-z]{2}-[A-Z]{2}$").Success) {
    // there is a match
}
http://regex.info <-- buy that and read it, it is the BEST resource for regular expressions in the universe
http://regular-expressions.info <-- the second best resource
Rather than use Regex, I suggest you use the built-in support for cultures in .NET, i.e., the System.Globalization.CultureInfo class; the constructor recognizes valid culture strings and gives you an object that can be used for culture-specific operations:
try
{
    string fileName = "MyResource.en-GB";
    string cultureName = System.IO.Path.GetExtension(fileName).TrimStart('.');
    CultureInfo cultureInfo = new CultureInfo(cultureName);
}
catch (ArgumentException)
{
    // Invalid culture.
}
You could try something like this:
[a-z]{2}-[a-z]{2}
You almost answered it in the question. Try:
// This basically grabs the locale.
string x = MyResource.whatever.... // Whatever it might be.
string locale = x.Substring(x.Length - 5); // Assuming the locale is 5 characters long.
// Now you have a 'locale' that is ready for comparisons.
if (locale == "en-GB") { .... }
if (locale == "fr-FR") { .... }
etc....
On a similar note, here is a useful list of two-letter country codes:
http://en.wikipedia.org/wiki/ISO_3166-1_alpha-2
I know this isn't really regex, but you didn't seem sure about absolutely needing to use it.
var cultures = CultureInfo.GetCultures(System.Globalization.CultureTypes.AllCultures);
// Note: skip the invariant culture, whose empty Name would match everything.
bool hasLocale = cultures.Any(o => o.Name.Length > 0 && filename.EndsWith("." + o.Name));
This might not be an answer to this question, but someone may pass by looking for exactly this.
To match locales like en_GB you can use this expression:
/^[a-z]{2}_[A-Z]{2}$/
I'll try to explain it here:
^[a-z]{2} means the string starts with lower-case letters, and {2} means you expect exactly 2 of those
followed by _
[A-Z]{2}$ means the string ends with upper-case letters, matching exactly 2 of those; $ anchors them to the end of the string
An extension to the great answer by Patanjali, but also including named groups and support for private-use as defined in RFC 4647. For example: de-DE-x-goethe or zh-Hant-CN-x-private1-private2.
^(?<language>[A-Za-z]{2,4})([_-](?<script>[A-Za-z]{4}|[0-9]{3}))?([_-](?<country>[A-Za-z]{2}|[0-9]{3}))?([_-]x[_-](?<private>[A-Za-z0-9-_]+))?$
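A quick sanity check of that pattern and its named groups (expected output shown as comments):
var pattern = @"^(?<language>[A-Za-z]{2,4})([_-](?<script>[A-Za-z]{4}|[0-9]{3}))?([_-](?<country>[A-Za-z]{2}|[0-9]{3}))?([_-]x[_-](?<private>[A-Za-z0-9-_]+))?$";
var m = Regex.Match("zh-Hant-CN-x-private1-private2", pattern);
Console.WriteLine(m.Groups["language"].Value); // zh
Console.WriteLine(m.Groups["script"].Value);   // Hant
Console.WriteLine(m.Groups["country"].Value);  // CN
Console.WriteLine(m.Groups["private"].Value);  // private1-private2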
^[a-z]{2}([_])?([A-Za-z]{2})?$
I used this regex and it works for locales with only an optional '_'.
For example:
en,
de,
en_us,
en_US
So the regex works if the locale has only two fixed chars (lower case only),
or if it has two chars (lower case only) + _ + two chars (which can be upper case).
