C# parse Apple formatted strings (printf) - c#

I'm dealing with some Apple-formatted strings, for example:
[%d] fsPurgeable type: %#, count: %lld bytes for %lld files
According to Apple's documentation here, the format string specification follows the IEEE printf specification, with some modifications it appears.
I need to parse these strings and replace the %-type placeholders with the actual data that belongs there. My initial thought was to use a regex; however, as these claim to adhere to the printf specification, I was wondering if there is anything already in .NET that might help with this.
I've done some reading, but nothing immediately jumped out as useful.
Any suggestions?

To somewhat answer my own question, I ended up writing a regex to handle the parsing:
var regex = new Regex(@"(?i)%[dspublicxfegahztj{}#%]*");
(?i) = case insensitive flag
% = literal percent sign
[dspublicxfegahztj{}#%] = all valid characters that could follow the %
* = match zero or more of those characters, e.g. %lld
Still curious if there is anything in .NET to handle this sort of thing.
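For what it's worth, here is a minimal sketch (my own, not anything built into .NET) of how that regex could drive the actual substitution, pairing each placeholder with the next available value via a MatchEvaluator. The sample values and the plain ToString() conversion are assumptions, not Apple's full conversion rules:
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

var regex = new Regex(@"(?i)%[dspublicxfegahztj{}#%]*");
var values = new Queue<object>(new object[] { 3, "MyType", 1024L, 2L });

string format = "[%d] fsPurgeable type: %#, count: %lld bytes for %lld files";
string result = regex.Replace(format, m =>
{
    // "%%" stays a literal percent sign; every other placeholder consumes the next value.
    if (m.Value == "%%")
        return "%";
    return values.Count > 0 ? values.Dequeue().ToString() : m.Value;
});
// result: "[3] fsPurgeable type: MyType, count: 1024 bytes for 2 files"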

Related

Conditional Regex replace to add dashes

Ok, so I need to design a regex to insert dashes. I'm tasked with building a web API function that returns a specifically formatted string based upon input parameters. For some reason that hasn't been made clear to me, the source data isn't properly formatted, and I need to reformat the data with dashes in the correct place.
Depending on the first two characters and the string length there is an optional third dash. Fortunately I'm not concerned with what those characters are. This system is a passthrough, so garbage in, garbage out. However, I do need to make sure the dashes are spaced appropriately based on length.
Structure Types
XX-9999999999-XX AB
XX-9999999999-99 CD, EF
XX-9999999999-XXX-99 GH
XX-9999999999-XX-99 IJ, KL
For Example:
AB123456789044 should be AB-01234567890-44 and
GH1234567890YYY99 becomes GH-01234567890-YYY-99.
Thus far I've gotten to this point.
^(\w\w)(\d{10})(\w{2,3})(\d\d)?$
Which leads to my question(s):
1) I'm attempting to replace with $1-$2-$3-$4. However, whenever there is a fourth section of digits, as is the case with IJ, it's hard to distinguish between that and AB in the replacement.
I've gotten GH-01234567890-YY-99 and GH-01234567890-YY-.
How do I reference a conditional capture group in a replace string such that the dash relating to it only shows up if the grouping exists?
The problem is that you need conditional replacements, and .NET replacement patterns don't support those. So you've got to do the replacement programmatically. Something like:
string resultString = null;
try {
Regex regexObj = new Regex(@"([A-Z]{2})-?(\d{10})-?(?:([A-Z]{2,3})|(\d{2}))-?(\d{2})?", RegexOptions.IgnorePatternWhitespace | RegexOptions.Multiline);
resultString = regexObj.Replace(subjectString, new MatchEvaluator(ComputeReplacement));
} catch (ArgumentException ex) {
// Error handling
}
public String ComputeReplacement(Match m) {
    // Vary the replacement text in C# as needed. Note that "$1"-style
    // substitutions are not expanded inside a MatchEvaluator, so the
    // captured groups must be referenced explicitly:
    return m.Groups[1].Value + "-" + m.Groups[2].Value + "-" +
           m.Groups[3].Value + "-" + m.Groups[4].Value + "-" + m.Groups[5].Value;
}
I haven't paid too much attention to the actual RegEx here, as it seems like you know what you're doing with it. I just included some conditional hyphens in case the data are quite dirty (partially formatted). Obviously you have to edit the "return" part of this, using conditionals in case any of the captures are blank. I haven't worked out that logic for you, as C# isn't my strength.
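As a rough sketch of that conditional logic (my own addition, written against the OP's original four-group pattern ^(\w\w)(\d{10})(\w{2,3})(\d\d)?$ rather than the regex above), Group.Success can be checked so the final dash only appears when the optional group actually matched:
public string ComputeReplacement(Match m) {
    // The three mandatory sections are always present.
    string result = m.Groups[1].Value + "-" + m.Groups[2].Value + "-" + m.Groups[3].Value;
    // Append the last dash and digits only when the optional (\d\d)? group matched.
    if (m.Groups[4].Success)
        result += "-" + m.Groups[4].Value;
    return result;
}
With that pattern, GH1234567890YYY99 becomes GH-1234567890-YYY-99 and AB123456789044 becomes AB-1234567890-44.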

Find unicode numbers using regex (.NET)

I am attempting to find numbers from any numeral system in strings. I found that the .NET regular expression language supports matching Unicode character categories, so I figured I could use that to capture my numbers (at the moment I can reasonably expect the strings I am reading to come from a UTF-8 encoded file).
The problem is that I can't seem to correctly identify all numerals. Here is a fiddle where I have attempted to identify a few numerals as such, but some are not identified as Unicode numbers (the same results come from running a console app with the same code locally on .NET version 4.6.2). I have taken each of the test numerals in the fiddle from one of the Unicode number category lists here.
Given this fiddle, it seems like the .NET regex language does not recognize all Unicode numbers in the standard as numbers. Is this correct? It seems to get most cases right, so I can probably still use this for what I am doing, but I'd like to know whether I am doing something wrong, or whether Microsoft has a statement I can't find which is relevant to this problem.
EDIT: Per commenter request, here is the code from the fiddle:
string[] numbers = new string[] { "1", "¼", "㆓", "⑱", "២", "꘩", "꤁", "〺", "፷", "𐌢", "𑁜","𑇩", "𒐘"};
string pattern = @"\p{N}";
foreach (string num in numbers ) {
Console.WriteLine(string.Format("{0}, {1}", num, Regex.IsMatch(num, pattern)));
}
And the output:
1, True
¼, True
㆓, True
⑱, True
២, True
꘩, True
꤁, True
〺, True
፷, True
𐌢, False
𑁜, False
𑇩, False
𒐘, False
The reason this happens is because strings in .NET are UTF-16 encoded.
Only characters in the Basic Multilingual Plane can be represented with 16 bit numbers equal to their code points.
Any characters in the supplementary planes (U+10000 to U+10FFFF) have to be represented using surrogate pairs (they are encoded as a pair of 16 bit numbers).
For this reason, .NET will categorise any of the characters in these supplementary planes as a "Surrogate", rather than one of the other categories such as "LetterNumber", "OtherNumber", etc. This prevents them from matching the Number categories in the regex.
You can check which category .NET thinks a particular character belongs to by calling "Char.GetUnicodeCategory()".
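For illustration, a minimal sketch (my own, using two of the numerals from the question) that shows the surrogate-pair encoding and the Surrogate categorisation described above:
using System;

string bmp = "⑱";    // a numeral inside the Basic Multilingual Plane
string astral = "𐌢"; // a numeral from a supplementary plane

// A BMP numeral is a single char and reports its real category.
Console.WriteLine(bmp.Length);                         // 1
Console.WriteLine(Char.GetUnicodeCategory(bmp[0]));    // OtherNumber

// A supplementary-plane numeral is stored as a surrogate pair, and each
// half is categorised as Surrogate, so \p{N} never sees a "Number" there.
Console.WriteLine(astral.Length);                      // 2
Console.WriteLine(Char.GetUnicodeCategory(astral[0])); // Surrogate
Console.WriteLine(Char.GetUnicodeCategory(astral[1])); // Surrogate

// The full code point can still be recovered explicitly if needed.
Console.WriteLine("U+{0:X}", Char.ConvertToUtf32(astral, 0));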

How to count unique characters in string [duplicate]

This question already has answers here:
How to get distinct characters?
(9 answers)
Closed 8 years ago.
Let's say we have the variable myString = "blabla" or myString = "998769".
myString.Length; //will get you your result
myString.Count(char.IsLetter); //if you only want the count of letters:
How do I get the unique character count? I mean for "blabla" the result must be 3, and for "998769" it will be 4. Is there a ready-to-go function? Any suggestions?
You can use LINQ:
var count = myString.Distinct().Count();
It relies on the fact that string implements IEnumerable<char>.
Without LINQ, you can do the same stuff Distinct does internally and use HashSet<char>:
var count = (new HashSet<char>(myString)).Count;
If you handle only ANSI text in English (or characters from the BMP), then 80% of the time, if you write:
myString.Distinct().Count()
you will live happily and won't ever have any trouble. Let me post this answer only for those who really need to handle this in the proper way. I'd say everyone should, but I know it's not true (quote from Wikipedia):
Because the most commonly used characters are all in the Basic Multilingual Plane, handling of surrogate pairs is often not thoroughly tested. This leads to persistent bugs and potential security holes, even in popular and well-reviewed application software (e.g. CVE-2008-2938, CVE-2012-2135)
The problem with our first naïve solution is that it doesn't handle Unicode properly and it also doesn't consider what the user perceives as a character. Try "𠀑".Distinct().Count() and your code will wrongly return... 2, because its UTF-16 representation is 0xD840 0xDC11 (by the way, each of them alone is not a valid Unicode character, because they're a high and a low surrogate, respectively).
Here I won't be very strict about terms and definitions, so please refer to www.unicode.org as a reference. For a (much) broader discussion please read How can I perform a Unicode aware character by character comparison?; encoding isn't the only issue you have to consider.
1) It doesn't take into account that .NET System.Char doesn't represent a character (or more specifically a grapheme) but a code unit of UTF-16 encoded text. Often they coincide, but not always (for example, with ideographic characters).
2) If you're counting what the user thinks of (or perceives) as a character then this will fail again, because it doesn't check combining characters like ا́ (there are many examples of this in Arabic). There are also duplicates that exist for historical reasons: é, for example, is both a single Unicode code point and a combination (so that code will fail).
3) We're talking about a western/American definition of character. If you're counting characters for end users you may need to change your definition to what they expect (for example, in Korean the definition of a character may not be so obvious; another example is the Czech digraph ch, which is always counted as a single character). Finally, don't forget that some strange things happen when you convert characters to upper/lower case (for example, in German ß becomes SS in upper case; see also this post).
Encoding
C# strings are encoded as UTF-16 (char is two bytes), but UTF-16 isn't a fixed-size encoding, and char should properly be called a code unit. What does that mean? That you may have a string whose Length is 2 but that the user will see (and that actually is) just one character (so the count should be 1).
If you need to handle this properly then you have to make things much more complicated (and slower). Fortunately the Char class has some helpful methods for handling surrogates.
The following code is untested (and for illustration purposes, so absolutely not optimized; I'm sure it can be done much better than this), so take it just as a starting point for further investigation:
int CountCharacters(string text)
{
HashSet<string> characters = new HashSet<string>();
string currentCharacter = "";
for (int i = 0; i < text.Length; ++i)
{
if (Char.IsHighSurrogate(text, i))
{
// Do not count this, next one will give the full pair
currentCharacter = text[i].ToString();
continue;
}
else if (Char.IsLowSurrogate(text, i))
{
// Our "character" is encoded as previous one plus this one
currentCharacter += text[i];
}
else
currentCharacter = text[i].ToString();
if (!characters.Contains(currentCharacter))
characters.Add(currentCharacter);
}
return characters.Count;
}
Note that this example doesn't handle duplicates (when same character may have different codes or can be a single code point or a combined character).
Combined Characters
If you have to handle combined characters (and of course encoding) then the best way to do it is to use the StringInfo class. You'll enumerate (and then count) both combined and encoded characters:
StringInfo.GetTextElementEnumerator(text).Walk()
.Distinct().Count();
Walk() is a trivial-to-implement extension method that simply walks through all the IEnumerator elements (we need it because GetTextElementEnumerator() returns an IEnumerator instead of an IEnumerable).
Please note that after the text has been properly split, it can be counted with our first solution (the point is that the building block isn't a char but a sequence of chars, for simplicity returned here as a string). Again, this code doesn't handle duplicates.
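A minimal sketch of that Walk() helper (my own; shown only so the snippet above compiles):
using System.Collections.Generic;
using System.Globalization;

static class TextElementExtensions
{
    // Adapts the non-generic TextElementEnumerator into an IEnumerable<string>
    // of text elements so LINQ operators such as Distinct() can be applied.
    public static IEnumerable<string> Walk(this TextElementEnumerator enumerator)
    {
        while (enumerator.MoveNext())
            yield return enumerator.GetTextElement();
    }
}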
Culture
There is not much you can do to handle the issues listed in point 3. Each language has its own rules, and supporting them all can be a pain. More examples of culture issues are in this longer, more specific post.
It's important to be aware of them (so you have to know a little bit about the languages you're targeting), and don't forget that Unicode and a few translated resx files won't make your application global.
If text processing is important in your application you can solve many issues using specialized DLLs for each locale you support (to count characters, to count words and so on), like word processors do. For example, the issues I listed can be solved simply using dictionaries. What I usually do is avoid the standard .NET string functions (also because of some bugs); I create a Unicode class with static methods for everything I need (character counting, conversions, comparison) and many specialized derived classes for each supported language. At run time those static methods use the current thread's culture name to pick the proper implementation from a dictionary and delegate the work to it. A skeleton may look something like this:
abstract class Unicode
{
    public static int CountCharacters(string text)
    {
        return GetConcreteClass().CountCharactersCore(text);
    }
    protected virtual int CountCharactersCore(string text)
    {
        // Default implementation, overridden in derived classes if needed
        return StringInfo.GetTextElementEnumerator(text).Walk()
            .Distinct().Count();
    }
    private static Dictionary<string, Unicode> _implementations;
    private static Unicode GetConcreteClass()
    {
        string cultureName = Thread.CurrentThread.CurrentCulture.Name;
        // Check if the concrete class has been loaded and put in the dictionary
        ...
        return _implementations[cultureName];
    }
}
If you're using C# then Linq comes nicely to the rescue - again:
"blabla".Distinct().Count()
will do it.

Is there a regex to test if a string is for a locale? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 3 years ago.
I don't know anything about regular expressions, but I think I have to use one for my problem. I have some filenames that look like:
MyResource
MyResource.en-GB
MyResource.en-US
MyResource.fr-FR
MyResource.de-DE
The idea is to test if my strings end with "[letter][letter]-[letter][letter]"
I know this is a very noob question, but I just have no idea how to do it, even though I know exactly what I want to do... :(
To cater for basic variants:
^[A-Za-z]{2,4}([_-][A-Za-z]{4})?([_-]([A-Za-z]{2}|[0-9]{3}))?$
which consists of:
Language code: ISO 639, 2 or 3 alpha characters (or 4, reserved for future use).
Optional script code: ISO 15924, 4 alpha characters.
Optional country code: ISO 3166-1, 2 alpha characters or 3 digits.
Separated by underscores or dashes.
Valid examples are:
de
en-US
zh-Hant-TW
En-au
aZ_cYrl-aZ.
For the OP's specific question, this would need to be prefixed by /^MyResource[.] and suffixed by $/ to ensure the whole file name is for a valid resource file that ends in a locale.
Note that some programming languages' functions may only accept particular forms, like only underscores and an uppercase country code. PHP's intl functions accept either case and either separator. PayPal accepts only the language, or the la_CY form, where la is the language and CY is the country/region. The PHP locale_canonicalize function can be used to standardise to this format.
IETF RFC 5646, which governs internet usage of these tags, recommends a capitalisation and separation format like az-Cyrl-AZ, as used in the first three examples above, though it says processors should accept any mix of case and either separator, as per the last two examples. When displaying locales, using - as the separator allows finer-grained line-wrapping, which might otherwise produce significantly empty lines when the non-wrapping _ is used, especially in table cells.
The regex for the recommended basic format is:
^[a-z]{2,4}(-[A-Z][a-z]{3})?(-([A-Z]{2}|[0-9]{3}))?$
The regexp only covers the basic format. There are variants for extras, like local region. RFC 5646 allows for such variants, along with private extensions and backwards-compatibility forms. It all depends upon the granularity required. The CLDR Unicode database, which is used by PHP's intl functions and other programs, may include such variants from version to version, though they can also disappear at a later time.
If using a CLDR-based function set, like PHP's intl extension, you can check if a locale exists in the intl database using a function like:
<?php
function is_locale($locale=''){
// STANDARDISE INPUT
$locale=locale_canonicalize($locale);
// LOAD ARRAY WITH LOCALES
$locales=resourcebundle_locales('');
// RETURN WHETHER FOUND
return (array_search($locale,$locales)!==FALSE);
}
?>
It takes about half a millisecond to load and search the data, so it won't be too much of a performance hit.
Of course, it will only find those in the database of the CLDR version supplied with the PHP version used, but will be updated with each subsequent PHP release.
Note that some locales are not for countries, but regions, and these are typically numeric, like 001 for 'World', 150 for 'Europe' and 419 for 'Latin America'. So there are now en-001, en-150, ar-001, and es-419, which can be used for generic language purposes. For example, en-001 was designed to decouple dependence upon en-us as an ersatz English, especially since its date formats and spellings are radically different from the 100 other regional en variants. The en-150 locale is the same as en-001 except for numbering separators and other Europe-specific formats.
In general, a regexp is a good front-end sanity check to filter out illegal characters, and especially to reserve the format for possible future additions. It also helps to prevent malicious character combinations being sent to the lookup facility, especially if text-based lookup command mechanisms, like SQL or Xpath, are used.
That would be testing your input against:
\.[a-z]{2}-[A-Z]{2}$
This is really very literal: "match a dot (\., the dot being a special character in regexes), followed by exactly two of any characters from a to z ([a-z]{2} -- [...] is a character class), followed by a dash (-), followed by two of any characters from A to Z ([A-Z]{2}), followed by the end of input ($)."
http://www.dotnetperls.com/regex-match <-- how to apply this regex in C# against an input. It means the code would look like (UNTESTED):
// Post edit: this will really return a boolean
if (Regex.Match(input, @"\.[a-z]{2}-[A-Z]{2}$").Success) {
// there is a match
}
http://regex.info <-- buy that and read it, it is the BEST resource for regular expressions in the universe
http://regular-expressions.info <-- the second best resource
Rather than use a regex, I suggest you use the built-in support for cultures in .NET, i.e., the System.Globalization.CultureInfo class; the constructor recognizes valid culture strings, and gives you an object that can be used for culture-specific operations:
try
{
string fileName = "MyResource.en-GB";
string cultureName = System.IO.Path.GetExtension(fileName).TrimStart('.');
CultureInfo cultureInfo = new CultureInfo(cultureName);
}
catch (ArgumentException)
{
// Invalid culture.
}
You could try something like this:
[a-z]{2}-[a-z]{2}
You almost answered it in the question. Try:
// This basically grabs the locale.
string x = MyResource.whatever.... //Whatever it might be.
string locale = x.Substring(x.Length - 5); // Assuming the locale is 5 characters long.
// Now you have a 'locale' that is ready for comparisons.
if (locale == "en-GB") { .... }
if (locale == "fr-FR") { .... }
etc....
On a similar note, here is a useful list of two letter country codes.
http://en.wikipedia.org/wiki/ISO_3166-1_alpha-2
I know this isn't really regex, but you didn't seem sure about needing to use it absolutely.
var cultures = CultureInfo.GetCultures(System.Globalization.CultureTypes.AllCultures);
var matches = cultures.Where(o => filename.EndsWith(o.Name));
This might not be an answer to this question, but someone passing by may be looking for this answer.
To match locales like en_GB you can use this expression:
/^[a-z]{2}_[A-Z]{2}$/
I'll try to explain it here:
^[a-z] means start with lowercase letters, and {2} means you expect exactly 2 of those
follow with _
[A-Z]{2}$ means end with uppercase letters and match exactly 2 of those; $ means that these letters have to be at the end of the string.
An extension to the great answer by Patanjali, but also including named groups and support for private-use as defined in RFC 4647. For example: de-DE-x-goethe or zh-Hant-CN-x-private1-private2.
^(?<language>[A-Za-z]{2,4})([_-](?<script>[A-Za-z]{4}|[0-9]{3}))?([_-](?<country>[A-Za-z]{2}|[0-9]{3}))?([_-]x[_-](?<private>[A-Za-z0-9-_]+))?$
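For instance, a small usage sketch in C# (my own; the two sample tags are illustrative values only):
using System;
using System.Text.RegularExpressions;

var pattern = new Regex(
    @"^(?<language>[A-Za-z]{2,4})([_-](?<script>[A-Za-z]{4}|[0-9]{3}))?" +
    @"([_-](?<country>[A-Za-z]{2}|[0-9]{3}))?([_-]x[_-](?<private>[A-Za-z0-9-_]+))?$");

foreach (var tag in new[] { "de-DE-x-goethe", "zh-Hant-CN-x-private1-private2" })
{
    Match m = pattern.Match(tag);
    // Named groups make it easy to pull out the individual subtags.
    Console.WriteLine("{0}: language={1}, script={2}, country={3}, private={4}",
        tag,
        m.Groups["language"].Value,
        m.Groups["script"].Value,
        m.Groups["country"].Value,
        m.Groups["private"].Value);
}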
^[a-z]{2}([_])?([A-Za-z]{2})?$
I used this regex and it works for locales where the '_' part is optional.
For example:
en,
de,
en_us,
en_US
So the regex works if the locale has only the fixed two chars (lowercase only),
or it has two chars (lowercase only) + _ + two chars (which can be uppercase).

Best method of Textfile Parsing in C#?

I want to parse a config file sorta thing, like so:
[KEY:Value]
[SUBKEY:SubValue]
Now I started with a StreamReader, converting lines into character arrays, when I figured there's gotta be a better way. So I ask you, humble reader, to help me.
One restriction is that it has to work in a Linux/Mono environment (1.2.6 to be exact). I don't have the latest 2.0 release (of Mono), so try to restrict language features to C# 2.0 or C# 1.0.
I considered it, but I'm not going to use XML. I am going to be writing this stuff by hand, and hand editing XML makes my brain hurt. :')
Have you looked at YAML?
You get the benefits of XML without all the pain and suffering. It's used extensively in the Ruby community for things like config files, pre-prepared database data, etc.
Here's an example:
customer:
  name: Orion
  age: 26
  addresses:
    - type: Work
      number: 12
      street: Bob Street
    - type: Home
      number: 15
      street: Secret Road
There appears to be a C# library here, which I haven't used personally, but yaml is pretty simple, so "how hard can it be?" :-)
I'd say it's preferable to inventing your own ad-hoc format (and dealing with parser bugs)
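For illustration only, a minimal sketch of reading YAML like the above from C#. It assumes a current library such as YamlDotNet, which is not the library linked in the answer and which the question's Mono 1.2.6 constraint would likely rule out:
using System;
using System.Collections.Generic;
using YamlDotNet.Serialization;

string yaml =
    "customer:\n" +
    "  name: Orion\n" +
    "  age: 26\n";

// Deserialize the nested mapping into plain dictionaries of strings.
var deserializer = new DeserializerBuilder().Build();
var root = deserializer.Deserialize<Dictionary<string, Dictionary<string, string>>>(yaml);

Console.WriteLine(root["customer"]["name"]); // Orion
Console.WriteLine(root["customer"]["age"]);  // 26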
I was looking at almost this exact problem the other day: this article on string tokenizing is exactly what you need. You'll want to define your tokens as something like:
@"(?<level>\s) | " +
@"(?<term>[^:\s]) | " +
@"(?<separator>:)"
The article does a pretty good job of explaining it. From there you just start eating up tokens as you see fit.
Protip: For an LL(1) parser (read: easy), tokens cannot share a prefix. If you have abc as a token, you cannot have ace as a token
Note: The article's missing the | characters in its examples, just throw them in.
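As a rough illustration (my own), here is how those token definitions could be used with named groups in C#; IgnorePatternWhitespace is assumed so the spaces around the | are ignored, and as written each term token is a single character:
using System;
using System.Text.RegularExpressions;

var tokenizer = new Regex(
    @"(?<level>\s) | " +
    @"(?<term>[^:\s]) | " +
    @"(?<separator>:)",
    RegexOptions.IgnorePatternWhitespace);

foreach (Match token in tokenizer.Matches("\t[SUBKEY:SubValue]"))
{
    // Report which named group matched for each token.
    if (token.Groups["level"].Success)
        Console.WriteLine("level");
    else if (token.Groups["separator"].Success)
        Console.WriteLine("separator");
    else
        Console.WriteLine("term: " + token.Value);
}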
There is another YAML library for .NET which is under development. Right now it supports reading YAML streams and has been tested on Windows and Mono. Write support is currently being implemented.
Using a library is almost always preferable to rolling your own. Here's a quick list of "Oh I'll never need that/I didn't think about that" points which will end up coming back to bite you later down the line:
Escaping characters. What if you want a : in the key or ] in the value?
Escaping the escape character.
Unicode
Mix of tabs and spaces (see the problems with Python's white space sensitive syntax)
Handling different return character formats
Handling syntax error reporting
Like others have suggested, YAML looks like your best bet.
You can also use a stack and a push/pop algorithm. This one matches opening/closing tags.
public string check()
{
    ArrayList tags = getTags();
    int stackSize = tags.Count;
    Stack<string> stack = new Stack<string>(stackSize);
    foreach (string tag in tags)
    {
        if (!tag.Contains("/"))
        {
            stack.Push(tag);
        }
        else
        {
            if (stack.Count > 0)
            {
                string startTag = stack.Pop();
                startTag = startTag.Substring(1, startTag.Length - 1);
                string endTag = tag.Substring(2, tag.Length - 2);
                if (!startTag.Equals(endTag))
                {
                    return "Error: no matching end tag";
                }
            }
            else
            {
                return "Error: no matching opening tag";
            }
        }
    }
    if (stack.Count > 0)
    {
        return "Error: no matching end tag";
    }
    return "Xml is valid";
}
You can probably adapt this so you can read the contents of your file. Regular expressions are also a good idea.
It looks to me that you would be better off using an XML based config file as there are already .NET classes which can read and store the information for you relatively easily. Is there a reason that this is not possible?
@Bernard: It is true that hand editing XML is tedious, but the structure that you are presenting already looks very similar to XML.
Then yes, has a good method there.
@Gishu
Actually, once I'd accommodated escaped characters, my regex ran slightly slower than my hand-written top-down recursive parser, and that's without the nesting (linking sub-items to their parents) and error reporting the hand-written parser had.
The regex was slightly faster to write (though I do have a bit of experience with hand parsers), but that's without good error reporting. Once you add that, it becomes slightly harder and longer to do.
I also find the intention of the hand-written parser easier to understand. For instance, here is a snippet of the code:
private static Node ParseNode(TextReader reader)
{
    Node node = new Node();
    int indentation = ParseWhitespace(reader);
    Expect(reader, '[');
    node.Key = ParseTerminatedString(reader, ':');
    node.Value = ParseTerminatedString(reader, ']');
    return node;
}
Regardless of the persisted format, using a Regex would be the fastest way of parsing.
In Ruby it'd probably be a few lines of code.
\[KEY:(.*)\]
\[SUBKEY:(.*)\]
These two would get you the Value and SubValue in the first group. Check out MSDN on how to match a regex against a string.
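For instance, a minimal sketch (my own) of applying those patterns in C#:
using System;
using System.Text.RegularExpressions;

string line = "[KEY:Value]";
Match m = Regex.Match(line, @"\[KEY:(.*)\]");
if (m.Success)
{
    // Group 1 holds the Value part of [KEY:Value].
    Console.WriteLine(m.Groups[1].Value); // prints "Value"
}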
This is something everyone should have in their kitty. Pre-Regex days would seem like the Ice Age.
