Why isn't string.Normalize consistent depending on the context? - c#

I have the following code:
string input = "ç";
string normalized = input.Normalize(NormalizationForm.FormD);
char[] chars = normalized.ToCharArray();
I build this code with Visual Studio 2010, .NET 4, on 64-bit Windows 7.
I run it in a unit-test project (platform: Any CPU) in several contexts and check the content of chars:
Visual Studio unit tests : chars contains { 231 }.
ReSharper : chars contains { 231 }.
NCrunch : chars contains { 99, 807 }.
In the MSDN documentation, I could not find anything describing different behaviors.
So, why do I get different behaviors? For me the NCrunch behavior is the expected one, but I would expect the same from the others.
Edit:
I switched back to .NET 3.5 and still have the same issue.

The String.Normalize(NormalizationForm) documentation says that the
binary representation is in the normalization form specified by the
normalizationForm parameter.
which means you'd be using FormD normalization in both cases, so CurrentCulture and the like should not really matter.
The only thing I can think of that could change, then, is the "ç" character itself. That character is interpreted according to whatever encoding is assumed or configured for Visual Studio source code files. In short, I think NCrunch is assuming a different source-file encoding than the others.
Based on a quick search of the NCrunch forum, there was a mention of a UTF-8 -> UTF-16 conversion, so I would check that.
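One way to take source-file encoding out of the picture is to build the string from an explicit escape sequence instead of a literal "ç"; a minimal sketch (the expected values are the ones NCrunch reports above):
using System;
using System.Text;

class Program
{
    static void Main()
    {
        // U+00E7 LATIN SMALL LETTER C WITH CEDILLA, written as an escape so the
        // compiler's guess about the file encoding cannot change its meaning.
        string input = "\u00E7";
        string normalized = input.Normalize(NormalizationForm.FormD);
        foreach (char c in normalized)
            Console.WriteLine((int)c); // FormD should yield 99 ('c') then 807 (combining cedilla)
    }
}
If every runner prints 99 and 807 with the escape but not with the literal, the difference really is in how each tool reads the source file.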

Related

Does the line-feed style of a C# here-document match Environment.NewLine?

I have some unit-tests that appear to be failing because they depend on the assumption that the line-break inside a locally-compiled here-document string should match Environment.NewLine.
I.e.
var testString = @"a multiline
string";
//this fails
testString.Should().Be("a multiline" + Environment.NewLine + "string");
The real test is significantly more complicated and is being compiled and run on a machine I don't control so it's difficult to determine the exact cause but it does appear to boil down to this kind of mismatch.
The C# specification (see the section on string literals) says that "A verbatim string literal may span multiple lines" but doesn't actually say what line-break style will be used.
My working theory is that there's nothing magic about the line-breaks and that the compiler is just using whatever style happens to be in the source code file as it is checked out on the local machine. Therefore if someone has got their RCS line-translation wrong somewhere along the line it's possible that the string as compiled has Unix style line-endings even though it's being compiled on a Windows machine or vice-versa.
So my question is really:
Is there a formal description of how line-breaks within a multiline verbatim string literal should be encoded?
If not, is there an alternative explanation for why the test above might sometimes fail?
I am aware that Environment.NewLine is a runtime variable so the test could fail if compiled on one machine then run on a different one but in this case it appears it is being compiled and run on the same machine.
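If the theory is right, the compiled literal simply contains whatever bytes were checked out, so one hedged workaround is to normalize line endings on both sides before asserting; a sketch in the question's own FluentAssertions style:
var testString = @"a multiline
string";
// Map CRLF and lone CR to LF so the assertion no longer depends on
// how the source file's line endings were translated at checkout.
var normalized = testString.Replace("\r\n", "\n").Replace("\r", "\n");
normalized.Should().Be("a multiline\nstring");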

System.Uri.ToString behaviour change after VS2012 install

After installing VS2012 Premium on a dev machine, a unit test failed, so the developer fixed the issue. When the changes were pushed to TeamCity, the unit test failed there too. The project has not changed, other than the solution file being upgraded to be compatible with VS2012. It still targets .NET Framework 4.0.
I've isolated the problem to an issue with Unicode characters being escaped when calling Uri.ToString. The following code replicates the behavior.
Imports NUnit.Framework

<TestFixture()>
Public Class UriTest
    <Test()>
    Public Sub UriToStringUrlDecodes()
        Dim uri = New Uri("http://www.example.org/test?helloworld=foo%B6bar")
        Assert.AreEqual("http://www.example.org/test?helloworld=foo¶bar", uri.ToString())
    End Sub
End Class
Running this in VS2010 on a machine that does not have VS2012 installed succeeds; running it in VS2010 on a machine with VS2012 installed fails. Both machines use the latest versions of NCrunch and NUnit from NuGet.
The messages from the failed assert are
Expected string length 46 but was 48. Strings differ at index 42.
Expected: "http://www.example.org/test?helloworld=foo¶bar"
But was: "http://www.example.org/test?helloworld=foo%B6bar"
-----------------------------------------------------^
The documentation on MSDN for both .NET 4 and .NET 4.5 shows that ToString should not encode this character, meaning that the old behavior should be the correct one.
A String instance that contains the unescaped canonical representation of the Uri instance. All characters are unescaped except #, ?, and %.
After installing VS2012, that unicode character is being escaped.
The file version of System.dll on the machine with VS2012 is 4.0.30319.17929
The file version of System.dll on the build server is 4.0.30319.236
Ignoring the merits of why we are using uri.ToString(), what we are testing, and any potential workarounds: can anyone explain why this behavior seems to have changed, or is this a bug?
Edit, here is the C# version:
using System;
using NUnit.Framework;

namespace SystemUriCSharp
{
    [TestFixture]
    public class UriTest
    {
        [Test]
        public void UriToStringDoesNotEscapeUnicodeCharacters()
        {
            var uri = new Uri(@"http://www.example.org/test?helloworld=foo%B6bar");
            Assert.AreEqual(@"http://www.example.org/test?helloworld=foo¶bar", uri.ToString());
        }
    }
}
A bit of further investigation: if I target .NET 4.0 or .NET 4.5, the tests fail; if I switch to .NET 3.5, they succeed.
There are some changes introduced in .NET Framework 4.5, which is installed along with VS2012, and which is also (to the best of my knowledge) a so-called "in-place upgrade". This means that it actually upgrades .NET Framework 4.
Furthermore, there are breaking changes documented in System.Uri. One of them says Unicode normalization form C (NFC) will no longer be performed on non-host portions of URIs. I am not sure whether this is applicable to your case, but it could serve as a good starting point in your investigation of the error.
The change is related to problems with earlier .NET versions, which have now been fixed to be more compliant with the standards. %B6 is UTF-16, but according to the standards UTF-8 should be used in the URI, meaning that it should be %C2%B6. So, as %B6 is not valid UTF-8, it is now correctly ignored and not decoded.
More details from the Connect report, quoted verbatim below.
.NET 4.5 has enhanced and more compatible application of RFC 3987
which supports IRI parsing rules for URI's. IRIs are International
Resource Identifiers. This allows for non-ASCII characters to be in a
URI/IRI string to be parsed.
Prior to .NET 4.5, we had some inconsistent handling of IRIs. We had
an app.config entry with a default of false that you could turn on,
which did some IRI handling/parsing. However, it had some problems. In
particular it allowed for incorrect percent encoding handling.
Percent-encoded items in a URI/IRI string are supposed to be
percent-encoded UTF-8 octets according to RFC 3987. They are not
interpreted as percent-encoded UTF-16. So, handling “%B6” is incorrect
according to UTF-8 and no decoding will occur. The correct UTF-8
encoding for ¶ is actually “%C2%B6”.
If your string was this instead:
string strUri = @"http://www.example.com/test?helloworld=foo%C2%B6bar";
Then it will get normalized in the ToString() method and the
percent-encoding decoded and removed.
Can you provide more information about your application needs and the
use of ToString() method? Usually, we recommend the AbsoluteUri
property of the Uri object for most normalization needs.
If this issue is blocking your application development and business
needs then please let us know via the "netfx45compat at Microsoft dot
com" email address.
Thx,
Networking Team
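The quoted explanation is easy to check directly; a minimal sketch (assuming the .NET 4.5 IRI behavior described above):
var utf16Style = new Uri("http://www.example.org/test?helloworld=foo%B6bar");
var utf8Style = new Uri("http://www.example.org/test?helloworld=foo%C2%B6bar");
Console.WriteLine(utf16Style.ToString()); // %B6 is not valid UTF-8: stays "foo%B6bar"
Console.WriteLine(utf8Style.ToString());  // %C2%B6 is UTF-8 for ¶: decoded to "foo¶bar"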
In that situation you can't do it like that.
The main issue is the character "¶": .NET has known problems handling it; you can research that separately.
Take the URI's components one by one, add them together, and compare that instead. You may also need a method that creates or replaces the "¶" character.
For example:
Dim uri = New Uri("http://www.example.org/test?helloworld=foo%B6bar")
Assert.AreEqual("http://www.example.org/test?helloworld=foo¶bar",
                uri.Scheme & "://" & uri.Host & uri.AbsolutePath & uri.Query)
That'll work. The individual components are:
uri.Scheme: http
uri.Host: www.example.org
uri.AbsolutePath: /test
uri.Query: ?helloworld=foo¶bar (note that Query includes the leading "?")

What version of Unicode is supported by which .NET platform and on which version of Windows in regards to character classes?

Updated question ¹
With regards to character classes, comparison, sorting, normalization and collations, what Unicode version or versions are supported by which .NET platforms?
Original question
I remember somewhat vaguely having read that .NET supported Unicode version 3.0 and that the internal UTF-16 encoding is not really UTF-16, but actually uses UCS-2, which is not the same. It seems, for instance, that characters above U+FFFF are not possible, i.e. consider:
string s = "\u1D7D9"; // intended: U+1D7D9 ("Mathematical double-struck digit one")
and it stores the string "ᵽ9" (the \u escape takes exactly four hex digits, so this is U+1D7D followed by the literal character '9').
I'm basically looking for definitive references of answers to the following:
If it isn't true UTF-16 in .NET, what is it?
What version of Unicode is supported by .NET?
If recent versions are not supported or planned in the near future, does anybody know of a (non)commercial library or how I can workaround this issue?
¹) I updated the question because, with time passing, it seemed more appropriate with respect to the answers and to the larger community. I left the original question in place; parts of it have been answered in the comments. Also: the old UCS-2 (no surrogates) was used in now-ancient 32-bit Windows versions, while .NET has always used UTF-16 (with surrogates) internally.
Internally, .NET is UTF-16. In some cases, e.g. when ASP.NET writes to a response, by default it uses UTF-8. Both of them can handle higher planes.
The reason people sometimes refer to .NET as UCS-2 is (I think, because I see few other reasons) that Char is strictly 16 bits and a single Char can't be used to represent the upper planes. Char does, however, have static method overloads (e.g. Char.IsLetter) that can operate on high-plane UTF-16 characters inside a string. Strings are stored as true UTF-16.
You can address high Unicode codepoints directly using uppercase \U - e.g. "\U0001D7D9" - but again, only inside strings, not chars.
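A small sketch of both points, the eight-digit \U escape and the surrogate-pair storage:
string s = "\U0001D7D9";     // U+1D7D9 MATHEMATICAL DOUBLE-STRUCK DIGIT ONE
Console.WriteLine(s.Length); // 2: one codepoint, stored as a UTF-16 surrogate pair
Console.WriteLine(char.IsSurrogatePair(s[0], s[1]));        // True
Console.WriteLine(char.ConvertToUtf32(s, 0).ToString("X")); // 1D7D9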
As for Unicode version, from the MSDN documentation:
"In the .NET Framework 4, sorting, casing, normalization, and Unicode character information is synchronized with Windows 7 and conforms to the Unicode 5.1 standard."
Update 1: It's worth noting, however, that this does not imply that the entirety of Unicode 5.1 is supported - neither in Windows 7 nor in .NET 4.0
Windows 8 targets Unicode 6.0; I'm guessing that .NET Framework 4.5 might synchronize with that, but I have found no sources confirming it. And once again, that doesn't mean the entire standard is implemented.
Update 2: This note on Roslyn confirms that the underlying platform defines the Unicode support for the compiler, and in the link to the code it explains that C# 6.0 supports Unicode 6.0 and up (with a breaking change for C# identifiers as a result).
Update 3: Since .NET 4.5, a new class SortVersion is introduced; you can get the supported Unicode version by calling the static property SortVersion.FullVersion. On the same page, Microsoft explains that .NET 4.0 supports Unicode 5.0 on all platforms, and .NET 4.5 supports Unicode 5.0 on Windows 7 and Unicode 6.0 on Windows 8. This contrasts slightly with the official "what's new" statement, which talks of versions 5.x and 6.0 respectively. From my own (editor: Abel) experience, in most cases it seems that in .NET 4.0, Unicode 5.1 is supported, at least for character classes, but I didn't test sorting, normalization and collations. This seems in line with what is said in MSDN as quoted above.
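For what it's worth, a hedged way to read that information at runtime (assuming .NET 4.5+, where CompareInfo exposes its SortVersion; the invariant culture is just an example):
using System;
using System.Globalization;

SortVersion v = CultureInfo.InvariantCulture.CompareInfo.Version;
Console.WriteLine(v.FullVersion); // the full sort/collation version number in use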
That character is supported. One thing to note is that for Unicode characters above U+FFFF, you must declare them with an uppercase '\U' and eight hex digits, like this:
string text = "\U0001D7D9";
If you create a WPF app with that character in a text block, it should render the double-one character perfectly.
MSDN covers it briefly here: http://msdn.microsoft.com/en-us/library/9b1s4yhz(v=vs.90).aspx
I tried this:
using System;
using System.IO;
using System.Text;

static void Main(string[] args) {
    string someText = char.ConvertFromUtf32(0x1D7D9);
    using (var stream = new MemoryStream()) {
        using (var writer = new StreamWriter(stream, Encoding.UTF32)) {
            writer.Write(someText);
            writer.Flush();
        }
        var bytes = stream.ToArray();
        foreach (var oneByte in bytes) {
            Console.WriteLine(oneByte.ToString("x"));
        }
    }
}
And got a dump of a byte array containing a correct BOM and the correct representation of the U+1D7D9 codepoint, for these encodings:
UTF8
UTF32
Unicode (UTF-16)
So my guess is that higher planes are supported, and that UTF-16 is really UTF-16 (and not UCS-2)
.NET Framework 3.0, 3.5, 4, 4.5 and 4.6
- The Unicode Standard, Version 5.0
.NET Framework 1.1 and 2.0
- The Unicode Standard, Version 3.1
The complete answer can be found here, under the Remarks section.

Can I rely on OSVERSIONINFO.szCSDVersion or Environment.OSVersion.ServicePack to always be of the form "Service Pack X"?

The title basically says it all. I need to determine the Windows Service Pack number (in numeric form), and Environment.OSVersion.ServicePack (which basically just returns OSVERSIONINFO.szCSDVersion) just returns a string.
In all my tests, this string turned out to be in the form "" (no service pack) or "Service Pack X", with X being a number. So the algorithm to parse this should be quite simple.
My question: Can I rely on this string to always have this format?
(One part of me says no, because it's not documented. The other part says yes, because surely a lot of existing code would break if MS decided to return, say, "SP 2 (x86)" for Windows 7 SP2. Thus, they won't do it. Does anyone have more information on that?)
No, you can't: some versions use translated strings! If you look at the strings from the image in that link, you see that you might get away with just using the first number you find in the string.
OSVERSIONINFOEX was added in NT4 SP6; if you call GetVersionEx, you only need to deal with the string on Win9x and pre-SP6 NT4, and can use OSVERSIONINFOEX.wServicePackMajor on other systems.
You should use BuildLabEx instead. It has a specified format which has held since early builds of Windows. Not sure if you can find it in WMI (you should be able to), but it's in the registry:
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\BuildLabEx
Example:
7601.17640.amd64fre.win7sp1_gdr.110632-1508
If it makes you feel more comfortable, you could rely initially on CSDVersion matching a certain regex for simplicity, and fall back to BuildLabEx if it doesn't match.
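A sketch of that combined strategy; the helper name is hypothetical, and the assumption that the "spN" token inside BuildLabEx carries the service pack number is mine, not a documented contract:
using System;
using System.Text.RegularExpressions;
using Microsoft.Win32;

static class ServicePackInfo
{
    public static int GetServicePackNumber()
    {
        // First try the familiar "Service Pack X" shape of CSDVersion.
        Match m = Regex.Match(Environment.OSVersion.ServicePack, @"^Service Pack (\d+)$");
        if (m.Success)
            return int.Parse(m.Groups[1].Value);

        // Fall back to BuildLabEx, e.g. "7601.17640.amd64fre.win7sp1_gdr.110632-1508".
        using (RegistryKey key = Registry.LocalMachine.OpenSubKey(
            @"SOFTWARE\Microsoft\Windows NT\CurrentVersion"))
        {
            string buildLabEx = key == null ? "" : (key.GetValue("BuildLabEx") as string ?? "");
            Match sp = Regex.Match(buildLabEx, @"sp(\d+)", RegexOptions.IgnoreCase);
            return sp.Success ? int.Parse(sp.Groups[1].Value) : 0; // 0 = no service pack found
        }
    }
}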

Best approach to render MediaWiki in C#?

Question:
I want to render MediaWiki syntax (and I mean MediaWiki syntax as used by Wikipedia, not some other wiki format from some other engine such as WikiPlex), and I need to do it in C#.
Input: MediaWiki Markup string
Output: HTML string
There are some alternative MediaWiki parsers, but nothing in C#, and additionally P/Invoking C/C++ looks bleak because of the structure of those libraries.
As syntax guidance, I use
http://en.wikipedia.org/wiki/Wikipedia:Cheatsheet
My first goal is to render that page's markup correctly.
Markup can be seen here:
http://en.wikipedia.org/w/index.php?title=Wikipedia:Cheatsheet&action=edit
Now, if I use Regex, it's not of much use, because one can't exactly say which tag ends which starting one, especially when some elements, such as italic, become an attribute of the parent element.
On the other hand, parsing character by character is not a good approach either, because, for example, ''' means bold, '' means italic, and ''''' means bold and italic...
I looked into porting some of the other parsers' code, but the Java implementations are obscure, and the Python implementations have a very different regex syntax.
The best approach I see so far would be to port mwlib to IronPython
http://www.mediawiki.org/wiki/Alternative_parsers
But frankly, I'm not looking forward to having the IronPython runtime added as a dependency to my application, and even if I would want to, the documentation is bad at best.
Update as of 2017:
You can use ParseoidSharp to get a fully compatible MediaWiki renderer.
It uses the official Wikipedia Parsoid library via NodeServices.
(NetStandard 2.0)
Since Parsoid is GPL 2.0, and the GPL code is invoked in Node.js in a separate process via the network, you can even use any license you like ;)
Pre-2017
Problem solved.
As originally assumed, the solution lies in using one of the existing alternative parsers in C#.
WikiModel (Java) works well for that purpose.
First attempt was to P/Invoke kiwi.
It worked at first, but failed because:
kiwi uses char* (it fails on anything non-English/ASCII)
it is not thread-safe
it requires a native DLL in the build for every architecture
(I added x86 and amd64, then it went kaboom on my ARM processor)
Second attempt was mwlib.
That failed because somehow IronPython doesn't work as it should.
Third attempt was Swebele, which essentially turned out to be academic vaporware.
The fourth attempt was using the original mediawiki renderer, using Phalanger. That failed because the MediaWiki renderer is not really modular.
The fifth attempt was using Wiky.php via Phalanger, which worked, but was slow, and Wiky.php implements MediaWiki only incompletely.
The sixth attempt was using bliki via ikvmc, which failed because of the excessive use of 3rd-party libraries: it compiles, but yields only null-reference exceptions.
The seventh attempt was using JavaScript in C#, which worked but was very slow, plus the MediaWiki functionality implemented was very incomplete.
The 8th attempt was writing my own "parser" via Regex.
But the time required to make it work was just excessive, so I stopped.
The 9th attempt was successful.
Using ikvmc on WikiModel yields a useful dll.
The problem there was that the example code was hopelessly out of date, but using Google and the WikiModel source code I was able to piece it together.
The end-result can be found here:
https://github.com/ststeiger/MultiWikiParser
Why shouldn't this be possible with regular expressions?
inputString = Regex.Replace(inputString, @"(?:''''')(.*?)(?:''''')", @"<strong><em>$1</em></strong>");
inputString = Regex.Replace(inputString, @"(?:''')(.*?)(?:''')", @"<strong>$1</strong>");
inputString = Regex.Replace(inputString, @"(?:'')(.*?)(?:'')", @"<em>$1</em>");
This will, as far as I can see, render all 'Bold and italic', 'Bold' and 'Italic' text.
Here is how I once implemented a solution (a sketch follows the steps):
define your regular expressions for Markup -> HTML conversion
regular expressions must be non-greedy
collect the regular expressions in a Dictionary<char, List<Regex>>
The char is the first (markup) character in each Regex, and the Regexes must be sorted by markup keyword length descending, e.g. === before ==.
Iterate through the characters of the input string and check whether Dictionary.ContainsKey(char). If it does, search the List for a matching Regex. The first matching Regex wins.
As MediaWiki allows recursive markup (except for <pre> and a few others), the string inside the markup must also be processed recursively in this fashion.
If there is a match, skip ahead by the number of characters matched in the input string; otherwise proceed to the next character.
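A minimal sketch of that approach; the rules shown are illustrative, not the original implementation, and only the bold/italic family is wired up:
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

class MiniWikiRenderer
{
    // Rules keyed by the first markup character; longest markup first (''''' before ''').
    // \G anchors each pattern at the position where matching starts.
    static readonly Dictionary<char, List<(Regex Pattern, string Open, string Close)>> Rules =
        new Dictionary<char, List<(Regex, string, string)>>
        {
            ['\''] = new List<(Regex, string, string)>
            {
                (new Regex(@"\G'''''(.+?)'''''"), "<strong><em>", "</em></strong>"),
                (new Regex(@"\G'''(.+?)'''"),     "<strong>",     "</strong>"),
                (new Regex(@"\G''(.+?)''"),       "<em>",         "</em>"),
            },
        };

    public static string Render(string input)
    {
        var output = new StringBuilder();
        for (int i = 0; i < input.Length; i++)
        {
            if (Rules.TryGetValue(input[i], out var candidates))
            {
                Match match = null;
                string open = null, close = null;
                foreach (var rule in candidates) // first matching regex wins
                {
                    Match m = rule.Pattern.Match(input, i);
                    if (m.Success) { match = m; open = rule.Open; close = rule.Close; break; }
                }
                if (match != null)
                {
                    // MediaWiki markup nests, so recurse into the inner text.
                    output.Append(open).Append(Render(match.Groups[1].Value)).Append(close);
                    i += match.Length - 1; // skip ahead past the matched markup
                    continue;
                }
            }
            output.Append(input[i]);
        }
        return output.ToString();
    }

    static void Main()
    {
        Console.WriteLine(Render("'''''bold italic''''' and ''italic'' text"));
        // <strong><em>bold italic</em></strong> and <em>italic</em> text
    }
}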
Kiwi (https://github.com/aboutus/kiwi, mentioned on http://mediawiki.org/wiki/Alternative_parsers) may be a solution. Since it is C-based and I/O is simply done via stdin/stdout, it should not be too hard to create a "PInvoke"-able DLL from it.
As with the accepted solution, I found Parsoid is the best way forward, as it's the official library and has the greatest support for the MediaWiki markup. That said, I found ParseoidSharp to be using obsolete methods such as Microsoft.AspNetCore.NodeServices, and really it's just a wrapper for a fairly old version of Parsoid's npm package.
Since there is a fairly current version of Parsoid in Node.js, you can use Jering.Javascript.NodeJS to do the same thing as ParseoidSharp; the steps are fairly similar too:
Install Node.js.
Download Parsoid (https://www.npmjs.com/package/parsoid) and place the required files in your project.
In PowerShell, cd to your project and run npm install.
Then it's as simple as:
Dim output = Await StaticNodeJSService.InvokeFromFileAsync(Of String)(
    HttpContext.Current.Request.PhysicalApplicationPath & "./NodeScripts/parsee.js",
    args:=New Object() {Markup})
Bonus: it's now much easier than with ParseoidSharp's method to add the options required; for example, you'll probably want to set the domain to your own domain.
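For reference, a hedged C# equivalent of the VB call above (the "NodeScripts/parsee.js" path and the markup argument are this answer's own; StaticNodeJSService comes from the Jering.Javascript.NodeJS package):
using System.Threading.Tasks;
using Jering.Javascript.NodeJS;

static class WikiRenderer
{
    public static async Task<string> RenderAsync(string markup)
    {
        // Invokes the function exported by parsee.js and returns its string result.
        return await StaticNodeJSService.InvokeFromFileAsync<string>(
            "NodeScripts/parsee.js", args: new object[] { markup });
    }
}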
