After installing VS2012 Premium on a dev machine, a unit test failed, so the developer fixed the issue. When the changes were pushed to TeamCity, the unit test failed again. Nothing in the project has changed other than the solution file being upgraded to be compatible with VS2012; it still targets .NET Framework 4.0.
I've isolated the problem to an issue with Unicode characters being escaped when calling Uri.ToString(). The following code replicates the behavior:
Imports NUnit.Framework

<TestFixture()>
Public Class UriTest
    <Test()>
    Public Sub UriToStringUrlDecodes()
        Dim uri = New Uri("http://www.example.org/test?helloworld=foo%B6bar")
        Assert.AreEqual("http://www.example.org/test?helloworld=foo¶bar", uri.ToString())
    End Sub
End Class
Running this in VS2010 on a machine that does not have VS2012 installed succeeds; running it in VS2010 on a machine with VS2012 installed fails. Both use the latest versions of NCrunch and NUnit from NuGet.
The messages from the failed assert are
Expected string length 46 but was 48. Strings differ at index 42.
Expected: "http://www.example.org/test?helloworld=foo¶bar"
But was: "http://www.example.org/test?helloworld=foo%B6bar"
-----------------------------------------------------^
The MSDN documentation for both .NET 4 and .NET 4.5 says that ToString should not encode this character, meaning that the old behavior should be the correct one:
A String instance that contains the unescaped canonical representation of the Uri instance. All characters are unescaped except #, ?, and %.
After installing VS2012, that unicode character is being escaped.
The file version of System.dll on the machine with VS2012 is 4.0.30319.17929
The file version of System.dll on the build server is 4.0.30319.236
Ignoring the merits of why we are using uri.ToString(), what we are testing, and any potential workarounds: can anyone explain why this behavior seems to have changed, or is this a bug?
Edit: here is the C# version

using System;
using NUnit.Framework;

namespace SystemUriCSharp
{
    [TestFixture]
    public class UriTest
    {
        [Test]
        public void UriToStringDoesNotEscapeUnicodeCharacters()
        {
            var uri = new Uri(@"http://www.example.org/test?helloworld=foo%B6bar");
            Assert.AreEqual(@"http://www.example.org/test?helloworld=foo¶bar", uri.ToString());
        }
    }
}
A bit of further investigation: if I target .NET 4.0 or .NET 4.5, the tests fail; if I switch to .NET 3.5, they succeed.
There are changes introduced in .NET Framework 4.5, which is installed alongside VS2012 and which is also (to the best of my knowledge) a so-called "in-place upgrade", meaning that it actually replaces .NET Framework 4.
Furthermore, there are breaking changes documented for System.Uri. One of them says that Unicode normalization form C (NFC) will no longer be performed on non-host portions of URIs. I am not sure whether this applies to your case, but it could serve as a good starting point in investigating the error.
The change addresses problems in earlier .NET versions, which have now been fixed to be more standards-compliant. %B6 is ¶ percent-encoded as a single UTF-16/Latin-1 code unit, but according to the standards percent-encoded octets in a URI must be UTF-8, meaning ¶ should be %C2%B6. Since %B6 is not valid UTF-8, it is now correctly left alone and not decoded.
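The UTF-8 rule is easy to check outside .NET; here is a quick sketch in Python (the percent-encoding rules come from the RFCs and are the same in any language):

```python
from urllib.parse import quote, unquote

# RFC 3987 requires percent-encoded octets to be UTF-8.
# The pilcrow sign ¶ (U+00B6) is two UTF-8 bytes, 0xC2 0xB6.
assert "¶".encode("utf-8") == b"\xc2\xb6"
assert quote("¶") == "%C2%B6"       # the correct escaping
assert unquote("%C2%B6") == "¶"     # decodes back to ¶

# A lone 0xB6 byte is not valid UTF-8, so %B6 cannot be decoded as ¶.
try:
    bytes([0xB6]).decode("utf-8")
except UnicodeDecodeError:
    print("lone %B6 is not valid UTF-8")
```

This is exactly why a standards-compliant parser leaves %B6 untouched: there is no UTF-8 decoding for it.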
More details from the Connect report, quoted verbatim below.
.NET 4.5 has enhanced and more compatible application of RFC 3987
which supports IRI parsing rules for URI's. IRIs are International
Resource Identifiers. This allows for non-ASCII characters to be in a
URI/IRI string to be parsed.
Prior to .NET 4.5, we had some inconsistent handling of IRIs. We had
an app.config entry with a default of false that you could turn on:
which did some IRI handling/parsing. However, it had some problems. In
particular it allowed for incorrect percent encoding handling.
Percent-encoded items in a URI/IRI string are supposed to be
percent-encoded UTF-8 octets according to RFC 3987. They are not
interpreted as percent-encoded UTF-16. So, handling “%B6” is incorrect
according to UTF-8 and no decoding will occur. The correct UTF-8
encoding for ¶ is actually “%C2%B6”.
If your string was this instead:
string strUri = @"http://www.example.com/test?helloworld=foo%C2%B6bar";
Then it will get normalized in the ToString() method and the
percent-encoding decoded and removed.
Can you provide more information about your application needs and the
use of ToString() method? Usually, we recommend the AbsoluteUri
property of the Uri object for most normalization needs.
If this issue is blocking your application development and business
needs then please let us know via the "netfx45compat at Microsoft dot
com" email address.
Thx,
Networking Team
In that situation you can't do it like that.
The main issue is the character "¶"; in .NET there is a known problem with handling it, and you can research that further.
Instead, take the URI's parameters one by one, add them back together, and compare those. You may need a helper method to create or replace the "¶" character.
For example:
Dim uri = New Uri("http://www.example.org/test?helloworld=foo%B6bar")
Assert.AreEqual("http://www.example.org/test?helloworld=foo¶bar", uri.Host + uri.AbsolutePath + "?" + uri.Query)
That will work, where:
uri.Host: www.example.org
uri.AbsolutePath: /test
uri.Query: helloworld=foo¶bar
Related
I have a C#/.NET library, which works fine in my own environment, but not in a customer's for some reason. In this specific case, .NET Framework 4.8 is used, but they have tried .NET 6 as well with the same results.
I am converting a double with value 0.25 to a string like this:
doubleValue.ToString("E16", CultureInfo.InvariantCulture);
In my environment, I get the expected string
2.5000000000000000E-001
with a hyphen-minus sign (U+002D) in the scientific notation. As seen in the code, I am using InvariantCulture in order to avoid any confusions regarding decimal point signs and minus signs.
In the customer's environment, with the same code they get the string
2.5000000000000000E−001
with a mathematical minus sign (U+2212) in the scientific notation.
We are both running Windows, with the en-SV culture active. I am printing out the details of InvariantCulture and CurrentCulture in a test program, and in both environments, the negative sign for both cultures is hyphen-minus. Not that current culture should affect anything, since I'm explicitly using the InvariantCulture for the conversion.
The customer has tried setting the environment variable DOTNET_SYSTEM_GLOBALIZATION_USENLS to true, in case there were issues with ICU, but it didn't help. Not that it was likely, since ICU isn't used in .NET Framework. I just couldn't find anything else to try.
What else could affect .NET's choice of minus sign in ToString, apart from culture and NLS/ICU?
EDIT: Additional information: This was not an issue in the previous release of my library. I just released a new version where this became a problem. Since the previous release, I have not touched this conversion code at all. I have added support for .NET 6 (new code that the customer is not running above), and migrated my code from VS2019 to VS2022.
EDIT: Clarified the Unicode characters used.
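The two characters are easy to confuse visually, so it helps to make the difference mechanical. A small sketch (in Python, since the code points in question are the same in every language; the helper name is my own):

```python
# U+002D HYPHEN-MINUS is the ASCII character InvariantCulture is documented to use;
# U+2212 MINUS SIGN is what some ICU locale data substitutes for negative signs.
def minus_kind(s: str) -> str:
    """Report which minus character a formatted number string contains."""
    if "\u2212" in s:
        return "U+2212 MINUS SIGN"
    if "-" in s:
        return "U+002D HYPHEN-MINUS"
    return "none"

expected = "2.5000000000000000E-001"        # hyphen-minus
observed = "2.5000000000000000E\u2212001"   # mathematical minus sign

assert minus_kind(expected) == "U+002D HYPHEN-MINUS"
assert minus_kind(observed) == "U+2212 MINUS SIGN"
```

Running a check like this on both machines' output pins down exactly which character each environment produces, independent of how fonts render it.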
I am working on some C# code which uses the .NET class Microsoft.VisualBasic.CompilerServices.LikeOperator. (Some additional context: I am porting this code to target .NET Standard 2.0, which does not support that class. It is missing from the .NET Standard 2.0 flavor of the Microsoft.VisualBasic nuget package. So I would like to replace the usage of LikeOperator, but unfortunately there is other code and maybe even end users which depend on the pattern language.)
While playing with LikeOperator.LikeString to make sure that I understand exactly what it does, I got an unexpected result. I used the pattern "foo*?bar" to express that I want to match any string which starts with "foo", ends with "bar", and has one or more characters in between.
However, the following call unexpectedly returns true:
LikeOperator.LikeString("foobar", "foo*?bar", CompareMethod.Text)
As far as I understand, the ? wildcard should force at least one character to be present between foo and bar, so I don't understand this result. Are these adjacent *? wildcards a special case?
edit: the like operator does seem to work as expected when tested in the VB.NET implementation of mono 6.12.0. So there seems to be a difference between the .NET Framework and mono here. Could this actually be a bug in Microsoft's implementation?
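For comparison, the documented semantics (* matches zero or more characters, ? matches exactly one) translate naturally into a regular expression. A sketch of what a Text-mode match of "foo*?bar" should do, written in Python since the pattern logic is language-agnostic:

```python
import re

# In the Like pattern language, * matches zero or more characters and
# ? matches exactly one, so "foo*?bar" should require at least one
# character between "foo" and "bar".  Translated: * -> .*, ? -> .
pattern = re.compile(r"foo.*.bar", re.DOTALL)

assert pattern.fullmatch("fooXbar") is not None   # one char in between: match
assert pattern.fullmatch("fooXYbar") is not None  # several chars: match
assert pattern.fullmatch("foobar") is None        # nothing in between: no match
```

By these semantics "foobar" should not match, which is what mono's implementation returns; the .NET Framework result of true for the adjacent *? looks like a deviation from the documented behavior.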
The following code fragment
var x = Path.GetFullPath(@"C:\test:");
throws this (expected, as the path is invalid) exception when run with .NET 4.6.2:
System.NotSupportedException: 'The given path's format is not supported.'
But when I run the same code with .NET Core 3.2.1, the method simply returns the input without throwing an exception. AFAICT the docs (MSDN) do not state that there should be such a behavior change.
So my questions are:
Am I missing something in the docs etc.?
Can somebody else reproduce this behavior?
Should I probably report this as an issue to the dotnet/runtime repository?
This is interesting. I can reproduce it perfectly.
It seems that in .NET Framework, it manages to get the full path successfully, but then demands the necessary File I/O code access permissions. In emulating that, it goes out of its way to check for a colon after the drive separator and throws an exception.
On .NET Core, it has a vastly different implementation, but it only does the first bit. It gets the full path. It doesn't deal with code access permissions, because these don't exist in .NET Core and the APIs are just stubs for compatibility purposes. They're somewhat deprecated in Framework already anyway.
However, if we turn to the documentation, there's no differentiation. The Framework docs say that Path.GetFullPath can throw a NotSupportedException if:
path contains a colon (":") that is not part of a volume identifier (for example, "c:\").
Strangely, the documentation for .NET Core says the exact same thing, despite not actually throwing the exception in this scenario.
I'd suggest that at the very least this is a documentation bug, if not a runtime bug.
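If you need the old Framework behavior on .NET Core, you can guard the call yourself. A minimal sketch of the colon check (the helper name is my own, and it's written in Python purely to illustrate the rule Framework enforced):

```python
# On Windows, a colon is only legal as the second character of a path
# (the drive separator, e.g. "C:").  .NET Framework's GetFullPath
# rejected any colon found elsewhere with NotSupportedException.
def has_stray_colon(path: str) -> bool:
    """Return True if `path` contains a colon outside the drive specifier."""
    for i, ch in enumerate(path):
        if ch == ":" and i != 1:
            return True
    return False

assert has_stray_colon(r"C:\test:") is True     # Framework threw here
assert has_stray_colon(r"C:\test") is False     # valid path
assert has_stray_colon(r"relative:name") is True
```

Note this also flags NTFS alternate-data-stream style names such as "file.txt:stream", which Framework's check rejected in the same way; adjust if you need to allow those.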
I have the following code:
string input = "ç";
string normalized = input.Normalize(NormalizationForm.FormD);
char[] chars = normalized.ToCharArray();
I build this code with Visual Studio 2010, .NET 4, on 64-bit Windows 7.
I run it in a unit test project (platform: Any CPU) in three contexts and check the content of chars:
Visual Studio unit tests : chars contains { 231 }.
ReSharper : chars contains { 231 }.
NCrunch : chars contains { 99, 807 }.
In the MSDN documentation, I could not find any information describing different behaviors.
So why do I get different behaviors? To me the NCrunch behavior is the expected one, but I would expect the same from the others.
Edit:
I switched back to .NET 3.5 and still have the same issue.
In String.Normalize(NormalizationForm) documentation it says that
binary representation is in the normalization form specified by the
normalizationForm parameter.
which means you'd be using FormD normalization in both cases, so CurrentCulture and such should not really matter.
The only remaining variable I can think of is the "ç" character itself. That character is interpreted according to whatever character encoding is assumed or configured for the Visual Studio source code files. In short, I think NCrunch is assuming a different source file encoding than the others.
Based on quick searching on NCrunch forum, there was a mention of some UTF-8 -> UTF-16 conversion, so I would check that.
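The two observed results are exactly the precomposed and NFD-decomposed forms of ç, which is easy to confirm (shown here in Python; the underlying Unicode data is the same as what .NET uses):

```python
import unicodedata

composed = "\u00e7"  # ç, LATIN SMALL LETTER C WITH CEDILLA

# NFD decomposes it into a base letter plus a combining cedilla:
decomposed = unicodedata.normalize("NFD", composed)
assert [ord(c) for c in decomposed] == [99, 807]   # 'c' + COMBINING CEDILLA

# NFC recomposes back to the single precomposed character:
recomposed = unicodedata.normalize("NFC", decomposed)
assert [ord(c) for c in recomposed] == [231]       # ç again
```

So { 99, 807 } is what FormD should produce, and seeing { 231 } means the input literal was effectively different, which is consistent with a source-file encoding mix-up rather than a Normalize bug.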
Updated question ¹
With regards to character classes, comparison, sorting, normalization and collations, what Unicode version or versions are supported by which .NET platforms?
Original question
I remember somewhat vaguely having read that .NET supported Unicode version 3.0 and that the internal UTF-16 encoding is not really UTF-16 but actually uses UCS-2, which is not the same. It seems, for instance, that characters above U+FFFF are not possible, i.e. consider:
string s = "\u1D7D9"; // ("Mathematical double-struck digit one")
and it stores the string "ᵽ9", because the \u escape consumes exactly four hex digits (\u1D7D, i.e. ᵽ) and leaves the trailing 9 as a literal character.
I'm basically looking for definitive references of answers to the following:
If it isn't true UTF-16 in .NET, what is it?
What version of Unicode is supported by .NET?
If recent versions are not supported or planned in the near future, does anybody know of a (non)commercial library or how I can workaround this issue?
¹) I updated the question because, with passing time, it seemed more appropriate with respect to the answers and the larger community. I left the original question in place; parts of it have been answered in the comments. Also, the old UCS-2 (no surrogates) was only used in now-ancient 32-bit Windows versions; .NET has always used UTF-16 (with surrogates) internally.
Internally, .NET is UTF-16. In some cases, e.g. when ASP.NET writes to a response, by default it uses UTF-8. Both of them can handle higher planes.
The reason people sometimes refer to .NET as UCS-2 is (I think, because I see few other reasons) that Char is strictly 16 bits and a single Char can't be used to represent the upper planes. Char does, however, have static method overloads (e.g. Char.IsLetter) that can operate on high-plane UTF-16 characters inside a string. Strings are stored as true UTF-16.
You can address high Unicode codepoints directly using uppercase \U - e.g. "\U0001D7D9" - but again, only inside strings, not chars.
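The surrogate-pair mechanics can be checked directly. A sketch in Python (UTF-16 surrogate encoding is defined by the Unicode standard, not by .NET, so the byte values are the same):

```python
# U+1D7D9 MATHEMATICAL DOUBLE-STRUCK DIGIT ONE lies above U+FFFF,
# so in UTF-16 it is stored as a surrogate pair: D835 DFD9.
s = "\U0001D7D9"
assert len(s) == 1                                 # one code point
assert s.encode("utf-16-be").hex() == "d835dfd9"   # two UTF-16 code units

# The lowercase \u escape takes exactly four hex digits, so "\u1D7D9"
# parses as U+1D7D followed by the literal digit '9' -- the "ᵽ9"
# result described in the question.
t = "\u1D7D9"
assert [hex(ord(c)) for c in t] == ["0x1d7d", "0x39"]
```

In .NET, the same string has Length 2 (two Char code units), which is exactly the UTF-16-with-surrogates behavior rather than UCS-2.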
As for Unicode version, from the MSDN documentation:
"In the .NET Framework 4, sorting, casing, normalization, and Unicode character information is synchronized with Windows 7 and conforms to the Unicode 5.1 standard."
Update 1: It's worth noting, however, that this does not imply that the entirety of Unicode 5.1 is supported - neither in Windows 7 nor in .NET 4.0
Windows 8 targets Unicode 6.0 - I'm guessing that .NET Framework 4.5 might synchronize with that, but have found no sources confirming it. And once again, that doesn't mean the entire standard is implemented.
Update 2: This note on Roslyn confirms that the underlying platform defines the Unicode support for the compiler, and in the link to the code it explains that C# 6.0 supports Unicode 6.0 and up (with a breaking change for C# identifiers as a result).
Update 3: Since .NET 4.5, a new class SortVersion is introduced; you can get the supported Unicode version by calling the static property SortVersion.FullVersion. On the same page, Microsoft explains that .NET 4.0 supports Unicode 5.0 on all platforms, and .NET 4.5 supports Unicode 5.0 on Windows 7 and Unicode 6.0 on Windows 8. This contrasts slightly with the official "what's new" statement here, which talks of versions 5.x and 6.0 respectively. From my own (editor: Abel) experience, in most cases it seems that in .NET 4.0, Unicode 5.1 is supported at least for character classes, but I didn't test sorting, normalization, and collations. This seems in line with what is said on MSDN as quoted above.
That character is supported. One thing to note is that for Unicode characters above U+FFFF (those that don't fit in two bytes), you must declare them with an uppercase '\U', like this:
string text = "\U0001D7D9";
If you create a WPF app with that character in a text block, it should render the double-one character perfectly.
MSDN covers it briefly here: http://msdn.microsoft.com/en-us/library/9b1s4yhz(v=vs.90).aspx
I tried this:
using System;
using System.IO;
using System.Text;

static void Main(string[] args) {
    string someText = char.ConvertFromUtf32(0x1D7D9);
    using (var stream = new MemoryStream()) {
        using (var writer = new StreamWriter(stream, Encoding.UTF32)) {
            writer.Write(someText);
            writer.Flush();
        }
        var bytes = stream.ToArray();
        foreach (var oneByte in bytes) {
            Console.WriteLine(oneByte.ToString("x"));
        }
    }
}
And got a dump of a byte array containing a correct BOM and the correct representation of the U+1D7D9 code point, for these encodings:
UTF8
UTF32
Unicode (UTF-16)
So my guess is that higher planes are supported, and that UTF-16 is really UTF-16 (and not UCS-2).
.NET Framework 4.6, 4.5, 4, 3.5, and 3.0: The Unicode Standard, Version 5.0
.NET Framework 2.0 and 1.1: The Unicode Standard, Version 3.1
The complete answers can be found here under the Remarks section.