Converting HTML entities to Unicode Characters in C#


I found similar questions and answers for Python and JavaScript, but not for C# or any other WinRT-compatible language.
I think I need this because I'm displaying text I get from websites in a Windows 8 Store app. E.g. &eacute; should become é.
Or is there a better way? I'm not displaying websites or RSS feeds, just a list of websites and their titles.

I recommend using System.Net.WebUtility.HtmlDecode and NOT HttpUtility.HtmlDecode.
This is due to the fact that the System.Web reference does not exist in Winforms/WPF/Console applications and you can get the exact same result using this class (which is already added as a reference in all those projects).
Usage:
string s = System.Net.WebUtility.HtmlDecode("&eacute;"); // Returns é
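As a slightly fuller illustration of the call above, here is a minimal sketch showing that the same method handles named, decimal and hexadecimal character references in one pass. The class and method names (EntityDecodingSketch, DecodeTitle) are hypothetical and only serve the example.

```csharp
using System;
using System.Net;

public static class EntityDecodingSketch
{
    // Hypothetical helper: decodes named (&eacute;), decimal (&#232;)
    // and hexadecimal (&#xE9;) character references in a page title.
    public static string DecodeTitle(string rawTitle)
    {
        // WebUtility lives in System.Net, so no System.Web reference is needed.
        return WebUtility.HtmlDecode(rawTitle);
    }
}
```

Calling EntityDecodingSketch.DecodeTitle("Caf&eacute; &amp; Cr&#232;me") should give back the plain text "Café & Crème".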

Use HttpUtility.HtmlDecode(); see its documentation on MSDN.
string decodedString = HttpUtility.HtmlDecode(myEncodedString);

This might be useful; it replaces all named entities (as far as my requirements go) with their numeric Unicode equivalent (&#NNN;).
public string EntityToUnicode(string html) {
var replacements = new Dictionary<string, string>();
var regex = new Regex("(&[a-z]{2,5};)");
foreach (Match match in regex.Matches(html)) {
if (!replacements.ContainsKey(match.Value)) {
var unicode = HttpUtility.HtmlDecode(match.Value);
if (unicode.Length == 1) {
replacements.Add(match.Value, string.Concat("&#", Convert.ToInt32(unicode[0]), ";"));
}
}
}
foreach (var replacement in replacements) {
html = html.Replace(replacement.Key, replacement.Value);
}
return html;
}

Different coding/encoding of HTML entities and HTML numbers in Metro App and WP8 App.
With Windows Runtime Metro App
{
string inStr = "ó";
string auxStr = System.Net.WebUtility.HtmlEncode(inStr);
// auxStr == ó
string outStr = System.Net.WebUtility.HtmlDecode(auxStr);
// outStr == ó
string outStr2 = System.Net.WebUtility.HtmlDecode("ó");
// outStr2 == ó
}
With Windows Phone 8.0
{
string inStr = "ó";
string auxStr = System.Net.WebUtility.HtmlEncode(inStr);
// auxStr == ó
string outStr = System.Net.WebUtility.HtmlDecode(auxStr);
// outStr == ó
string outStr2 = System.Net.WebUtility.HtmlDecode("ó");
// outStr2 == ó
}
To solve this, in WP8, I have implemented the table in HTML ISO-8859-1 Reference before calling System.Net.WebUtility.HtmlDecode().

This worked for me; it replaces both named and numeric entities.
private static readonly Regex HtmlEntityRegex = new Regex("&(#)?([a-zA-Z0-9]*);");
public static string HtmlDecode(this string html)
{
if (string.IsNullOrEmpty(html)) return html;
return HtmlEntityRegex.Replace(html, x => x.Groups[1].Value == "#"
? ((char)int.Parse(x.Groups[2].Value)).ToString()
: HttpUtility.HtmlDecode(x.Groups[0].Value));
}
[Test]
[TestCase(null, null)]
[TestCase("", "")]
[TestCase("&#39;fark&#39;", "'fark'")]
[TestCase("&quot;fark&quot;", "\"fark\"")]
public void should_remove_html_entities(string html, string expected)
{
html.HtmlDecode().ShouldEqual(expected);
}

An improved version of Zumey's method (I can't comment there). The longest entity name is &exclamation; (11 characters). Upper-case letters are also possible in entity names, e.g. &Agrave; (source: Wikipedia).
public string EntityToUnicode(string html) {
var replacements = new Dictionary<string, string>();
var regex = new Regex("(&[a-zA-Z]{2,11};)");
foreach (Match match in regex.Matches(html)) {
if (!replacements.ContainsKey(match.Value)) {
var unicode = HttpUtility.HtmlDecode(match.Value);
if (unicode.Length == 1) {
replacements.Add(match.Value, string.Concat("&#", Convert.ToInt32(unicode[0]), ";"));
}
}
}
foreach (var replacement in replacements) {
html = html.Replace(replacement.Key, replacement.Value);
}
return html;
}

Related

How to fill object from contents of a string and populate a List?

I have a string that I have sent through a HTTP Web Request compressing with GZIP with the following data:
[Route("Test")]
public IActionResult Test()
{
var data = "[0].meetingDate=2019-07-12&[0].courseId=12&[0].raceNumber=1&[0].horseCode=000000331213&[1].meetingDate=2019-07-12&[1].courseId=12&[1].raceNumber=1&[1].horseCode=000000356650";
try
{
var req = WebRequest.Create("https://localhost:44374/HorseRacingApi/Prices/GetPriceForEntries");
req.Method = "POST";
req.Headers.Add(HttpRequestHeader.AcceptEncoding, "gzip, deflate");
req.Headers.Add(HttpRequestHeader.ContentEncoding, "gzip");
if (!string.IsNullOrEmpty(data))
{
var dataBytes = Encoding.ASCII.GetBytes(data);
using (var requestDS = req.GetRequestStream())
{
using (var zipStream = new GZipStream(requestDS, CompressionMode.Compress))
{
zipStream.Write(dataBytes, 0, dataBytes.Length);
}
requestDS.Flush();
}
}
HttpWebResponse response = (HttpWebResponse)req.GetResponse();
Stream receiveStream = response.GetResponseStream();
StreamReader readStream = new StreamReader(receiveStream, Encoding.UTF8);
Debug.WriteLine("Response stream received.");
Debug.WriteLine(readStream.ReadToEnd());
response.Close();
readStream.Close();
return Ok("Sent!");
}
catch(Exception ex)
{
throw ex;
}
}
I am receiving the http data in this function and decompressing it:
[HttpPost]
[Route("GetPriceForEntries")]
[DisableRequestSizeLimit]
public JsonResult GetPriceForEntries(bool? ShowAll)
{
string contents = null;
using (GZipStream zip = new GZipStream(Request.Body, CompressionMode.Decompress))
{
using (StreamReader unzip = new StreamReader(zip))
{
contents = unzip.ReadToEnd();
}
}
//CONVERT CONTENTS TO LIST HERE?
return Json("GOT");
}
I have a object/model setup:
public class JsonEntryKey
{
public DateTime meetingDate { get; set; }
public int courseId { get; set; }
public int raceNumber { get; set; }
public string horseCode { get; set; }
}
How do I convert this 'string' to the List object above?
The reason I am sending this data by compressing is because sometimes the data will be very big.
Cheers
EDIT: Here is my attempt at creating my own 'Converter'
//Convert string to table.
string[] unzipString = contents.Split('=','&');
List<Core.Models.JsonEntryKey> entries = new List<Core.Models.JsonEntryKey>();
// Values sit at the odd indexes; each entry spans 8 tokens (4 names + 4 values)
for (int i = 1; i + 6 < unzipString.Length; i += 8)
{
DateTime meetingDate = Convert.ToDateTime(unzipString[i]);
int courseId = int.Parse(unzipString[i + 2]);
int raceNumber = int.Parse(unzipString[i + 4]);
string horseCode = unzipString[i + 6];
entries.Add(new Core.Models.JsonEntryKey
{
meetingDate = meetingDate,
courseId = courseId,
raceNumber = raceNumber,
horseCode = horseCode
});
}
Is there a better way?

The basic parsing can be done in 3 steps.
1) Split the entire string by '&':
string[] parts = data.Split('&');
You end up with the single parts:
[0].meetingDate=2019-07-12
[0].courseId=12
[0].raceNumber=1
[0].horseCode=000000331213
[1].meetingDate=2019-07-12
[1].courseId=12
[1].raceNumber=1
[1].horseCode=000000356650
2) Now you can GroupBy the number in the square brackets, since it seems to denote the index of the object ([0], [1], ...). Split by the '.' and take the first element:
var items = parts.GroupBy(x => x.Split('.').First());
3) now for each group (which is basically a collection of property information about each object) you need to iterate through the properties, find the corresponding property via reflection and set the value. In the end: don't forget to collect your newly created objects into a collection:
List<JsonEntryKey> collection = new List<JsonEntryKey>();
foreach (var item in items)
{
var entry = new JsonEntryKey();
foreach (var property in item)
{
// here the position propInfo[1] has the property name and propInfo[2] has the value
string [] propInfo = property.Split(new string[] {"].", "="}, StringSplitOptions.RemoveEmptyEntries);
// extract here the corresponding property information
PropertyInfo info = typeof(JsonEntryKey).GetProperties().Single(x => x.Name == propInfo[1]);
info.SetValue(entry, Convert.ChangeType(propInfo[2], info.PropertyType));
}
collection.Add(entry);
}
The outcome from your string looks in a LINQPad Dump like this:

An alternative solution that I wanted to share is a Regex based one. The regular expression that I have built for this string will work after appending the & character at the end of the string and based on the regex logic, the required data will be parsed out from the string. This is just an example of how you can use regular expressions for handling string scenarios. Regarding the performance as per the official specs:
The regular expression engine in .NET is a powerful, full-featured tool that processes text based on pattern matches rather than on comparing and matching literal text. In most cases, it performs pattern matching rapidly and efficiently. However, in some cases, the regular expression engine can appear to be very slow. In extreme cases, it can even appear to stop responding as it processes a relatively small input over the course of hours or even days.
The performance of a regular expression is based on the length of the string and the complexity of the regular expression. Regarding your string data, I have prepared a DEMO here.
The code looks like:
using System;
using System.Text.RegularExpressions;
public class Program
{
public static void Main()
{
var data = "[0].meetingDate=2019-07-12&[0].courseId=12&[0].raceNumber=1&[0].horseCode=000000331213&[1].meetingDate=2019-07-12&[1].courseId=12&[1].raceNumber=1&[1].horseCode=000000356650";
var dataRegex=data+"&";
//Console.WriteLine(dataRegex);
showMatch(dataRegex, @"(?<==)(.*?)(?=&)");
}
private static void showMatch(string text, string expr) {
MatchCollection mc = Regex.Matches(text, expr);
foreach (Match m in mc) {
Console.WriteLine(m);
}
}
}
And the output is:
2019-07-12
12
1
000000331213
2019-07-12
12
1
000000356650
Regular expression used: (?<==)(.*?)(?=&)
Explanation:
Positive Lookbehind (?<==): Matches the character = literally (case sensitive)
1st Capturing Group (.*?): .*? matches any character (except for line terminators). *? Quantifier — Matches between zero and unlimited times, as few times as possible, expanding as needed.
Positive Lookahead (?=&): Matches the character & literally (case sensitive)
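The lookaround pattern above yields only the values. As a variation, here is a sketch that uses named groups to capture the index and the property name as well, so the values can be grouped per object without appending a trailing '&'. The class name IndexedFormParser and the group names are assumptions made for this example, not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class IndexedFormParser
{
    // Matches "[index].name=value"; the value runs up to the next '&' or the end of input.
    private static readonly Regex PairRegex =
        new Regex(@"\[(?<index>\d+)\]\.(?<name>\w+)=(?<value>[^&]*)");

    // Returns one name/value dictionary per object index found in the string.
    public static SortedDictionary<int, Dictionary<string, string>> Parse(string data)
    {
        var result = new SortedDictionary<int, Dictionary<string, string>>();
        foreach (Match m in PairRegex.Matches(data))
        {
            int index = int.Parse(m.Groups["index"].Value);
            if (!result.TryGetValue(index, out var fields))
            {
                fields = new Dictionary<string, string>();
                result[index] = fields;
            }
            fields[m.Groups["name"].Value] = m.Groups["value"].Value;
        }
        return result;
    }
}
```

Each inner dictionary can then be mapped onto a JsonEntryKey by hand or by reflection, whichever fits the rest of the code.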

Converting Arabic Words to Unicode format in C#

I am designing an API where the API user needs Arabic text to be returned in Unicode format, to do so I tried the following:
public static class StringExtensions
{
public static string ToUnicodeString(this string str)
{
StringBuilder sb = new StringBuilder();
foreach (var c in str)
{
sb.Append("\\u" + ((int)c).ToString("X4"));
}
return sb.ToString();
}
}
The issue with the above code is that it returns the Unicode value of each letter regardless of its position in the word.
Example: let us assume we have the following word:
"سمير" which consists of:
'س' which is written like 'سـ' because it is the first letter in word.
'م' which is written like 'ـمـ' because it is in the middle of word.
'ي' which is written like 'ـيـ' because it is in the middle of word.
'ر' which is written like 'ـر' because it is last letter of word.
The above code returns unicode of { 'س', 'م' , 'ي' , 'ر'} which is:
\u0633\u0645\u064A\u0631
instead of { 'سـ' , 'ـمـ' , 'ـيـ' , 'ـر'} which is
\uFEB3\uFEE4\uFEF4\uFEAE
Any ideas on how to update code to get correct Unicode?
Helpful link

The string is just a sequence of Unicode code points; it does not know the rules of Arabic. You're getting out exactly the data you put in; if you want different data out, then put different data in!
Try this:
Console.WriteLine("\u0633\u0645\u064A\u0631");
Console.WriteLine("\u0633\u0645\u064A\u0631".ToUnicodeString());
Console.WriteLine("\uFEB3\uFEE4\uFEF4\uFEAE");
Console.WriteLine("\uFEB3\uFEE4\uFEF4\uFEAE".ToUnicodeString());
As expected the output is
سمير
\u0633\u0645\u064A\u0631
ﺳﻤﻴﺮ
\uFEB3\uFEE4\uFEF4\uFEAE
Those two sequences of Unicode code points render the same in the browser, but they're different sequences. If you want to write out the second sequence, then don't pass in the first sequence.
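To make the relationship between the two sequences concrete, here is a small sketch. It relies on the fact that the positional glyphs in the Arabic Presentation Forms blocks carry compatibility decompositions to the base letters, so compatibility normalization (FormKC) maps the second sequence back to the first. The class and method names are hypothetical. Going the other way (base letters to positional forms) is shaping, which normalization does not do and which needs a lookup table.

```csharp
using System;
using System.Text;

public static class PresentationFormsSketch
{
    // Maps positional presentation forms (e.g. U+FEB3) back to base letters (e.g. U+0633).
    public static string ToBaseLetters(string text)
    {
        return text.Normalize(NormalizationForm.FormKC);
    }
}
```

With this, PresentationFormsSketch.ToBaseLetters("\uFEB3\uFEE4\uFEF4\uFEAE") should compare equal to "\u0633\u0645\u064A\u0631", while the two input strings themselves remain different.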

Based on Eric's answer I knew how to solve my problem, I have created a solution on Github.
You will find a simple tool to run on Windows, and if you want to use the code in your projects then just copy paste UnicodesTable.cs and Unshaper.cs.
Basically you need a table of Unicodes for each Arabic letter then you can use something like the following extension method.
public static string GetUnShapedUnicode(this string original)
{
original = Regex.Unescape(original.Trim());
var words = original.Split(' ');
StringBuilder builder = new StringBuilder();
var unicodesTable = UnicodesTable.GetArabicGliphes();
foreach (var word in words)
{
string previous = null;
for (int i = 0; i < word.Length; i++)
{
string shapedUnicode = @"\u" + ((int)word[i]).ToString("X4");
if (!unicodesTable.ContainsKey(shapedUnicode))
{
builder.Append(shapedUnicode);
previous = null;
continue;
}
else
{
if (i == 0 || previous == null)
{
builder.Append(unicodesTable[shapedUnicode][1]);
}
else
{
if (i == word.Length - 1)
{
if (!string.IsNullOrEmpty(previous) && unicodesTable[previous][4] == "2")
{
builder.Append(unicodesTable[shapedUnicode][0]);
}
else
builder.Append(unicodesTable[shapedUnicode][3]);
}
else
{
bool previouChar = unicodesTable[previous][4] == "2";
if (previouChar)
builder.Append(unicodesTable[shapedUnicode][1]);
else
builder.Append(unicodesTable[shapedUnicode][2]);
}
}
}
previous = shapedUnicode;
}
if (words.ToList().IndexOf(word) != words.Length - 1)
builder.Append(@"\u" + ((int)' ').ToString("X4"));
}
return builder.ToString();
}

Ignore special characters in Examine

In Umbraco, I use Examine to search the website, but the content is in French. Everything works fine except that a search for "Français" does not give the same results as "Francais". Is there a way to ignore those French characters? I tried to find a French analyzer for Lucene/Examine but did not find anything. I use Fuzzy so it returns results even if the words are not identical.
Here's the code of my search :
public static ISearchResults Search(string searchTerm)
{
var provider = ExamineManager.Instance.SearchProviderCollection["ExternalSearcher"];
var criteria = provider.CreateSearchCriteria(BooleanOperation.Or);
var crawl = criteria.GroupedOr(BoostedSearchableFields, searchTerm.Boost(15))
.Or().GroupedOr(BoostedSearchableFields, searchTerm.Fuzzy(Fuzziness))
.Or().GroupedOr(SearchableFields, searchTerm.Fuzzy(Fuzziness))
.Not().Field("umbracoNavHide", "1");
return provider.Search(crawl.Compile());
}

We ended up using a custom analyzer based on the SnowballAnalyzer:
public class CustomAnalyzer : SnowballAnalyzer
{
public CustomAnalyzer() : base("French") { }
public override TokenStream TokenStream(string fieldName, TextReader reader)
{
TokenStream result = base.TokenStream(fieldName, reader);
result = new ISOLatin1AccentFilter(result);
return result;
}
}

Try using Regex like this below:
var strInput ="Français";
var strToReplace = string.Empty;
var sNewString = Regex.Replace(strInput, "[^A-Za-z0-9]", strToReplace);
I've used the pattern "[^A-Za-z0-9]" to replace every non-alphanumeric character with an empty string.
Hope it helps.

You can actually convert Unicode characters with diacritics to their unaccented equivalents using the following method. That will enable you to search for "Français" with the search term "Francais".
public static string RemoveDiacritics(this string text)
{
if (string.IsNullOrWhiteSpace(text))
return text;
text = text.Normalize(NormalizationForm.FormD);
var chars = text.Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark).ToArray();
return new string(chars).Normalize(NormalizationForm.FormC);
}
Use it on any string like this:
var converted = unicodeString.RemoveDiacritics();

Extract Common Name from Distinguished Name

Is there a call in .NET that parses the CN from a rfc-2253 encoded distinguished name? I know there are some third-party libraries that do this, but I would prefer to use native .NET libraries if possible.
Examples of a string encoded DN
CN=L. Eagle,O=Sue\, Grabbit and Runn,C=GB
CN=Jeff Smith,OU=Sales,DC=Fabrikam,DC=COM

If you are working with an X509Certificate2, there is a native method that you can use to extract the Simple Name. The Simple Name is equivalent to the Common Name RDN within the Subject field of the main certificate:
x5092Cert.GetNameInfo(X509NameType.SimpleName, false);
Alternatively, X509NameType.DnsName can be used to retrieve the Subject Alternative Name, if present; otherwise, it will default to the Common Name:
x5092Cert.GetNameInfo(X509NameType.DnsName, false);

After digging around in the .NET source code it looks like there is an internal utility class that can parse Distinguished Names into their different components. Unfortunately the utility class is not made public, but you can access it using reflection:
string dn = "CN=TestGroup,OU=Groups,OU=UT-SLC,OU=US,DC=Company,DC=com";
Assembly dirsvc = Assembly.Load("System.DirectoryServices");
Type asmType = dirsvc.GetType("System.DirectoryServices.ActiveDirectory.Utils");
MethodInfo mi = asmType.GetMethod("GetDNComponents", BindingFlags.NonPublic | BindingFlags.Static);
string[] parameters = { dn };
var test = mi.Invoke(null, parameters);
//test.Dump("test1");//shows details when using Linqpad
//Convert Distinguished Name (DN) to Relative Distinguished Names (RDN)
MethodInfo mi2 = asmType.GetMethod("GetRdnFromDN", BindingFlags.NonPublic | BindingFlags.Static);
var test2 = mi2.Invoke(null, parameters);
//test2.Dump("test2");//shows details when using Linqpad
The results would look like this:
//test1 is array of internal "Component" struct that has name/values as strings
Name Value
CN TestGroup
OU Groups
OU UT-SLC
OU US
DC company
DC com
//test2 is a string with CN=RDN
CN=TestGroup
Please note this is an internal utility class and could change in a future release.

I had the same question, myself, when I found yours. Didn't find anything in the BCL; however, I did stumble across this CodeProject article that hit the nail squarely on the head.
I hope it helps you out, too.
http://www.codeproject.com/Articles/9788/An-RFC-2253-Compliant-Distinguished-Name-Parser

Do Win32 functions count? You can use PInvoke with DsGetRdnW. For code, see my answer to another question: https://stackoverflow.com/a/11091804/628981.

You can extract the common name from an ASN.1-encoded distinguished name using AsnEncodedData class:
var distinguishedName= new X500DistinguishedName("CN=TestGroup,OU=Groups,OU=UT-SLC,OU=US,DC=Company,DC=com");
var commonNameData = new AsnEncodedData("CN", distinguishedName.RawData);
var commonName = commonNameData.Format(false);
A downside of this approach is that if you specify an unrecognized OID or the field identified with the OID is missing in the distinguished name, Format method will return a hex string with the encoded value of full distinguished name so you may want to verify the result.
Also the documentation does not seem to specify if the rawData parameter of the AsnEncodedData constructor is allowed to contain other OIDs besides the one specified as the first argument so it may break on non-Windows OS or in a future version of .NET Framework.

If you are on Windows, @MaxKiselev's answer works perfectly. On non-Windows platforms, it returns the ASN.1 dumps of each attribute.
.NET 5+ includes an ASN.1 parser, so you can access the RDNs in a cross-platform manner by using AsnReader.
Helper class:
public static class X509DistinguishedNameExtensions
{
public static IEnumerable<KeyValuePair<string, string>> GetRelativeNames(this X500DistinguishedName dn)
{
var reader = new AsnReader(dn.RawData, AsnEncodingRules.BER);
var snSeq = reader.ReadSequence();
if (!snSeq.HasData)
{
throw new InvalidOperationException();
}
// Many types are allowable. We're only going to support the string-like ones
// (This excludes IPAddress, X400 address, and other weird stuff)
// https://www.rfc-editor.org/rfc/rfc5280#page-37
// https://www.rfc-editor.org/rfc/rfc5280#page-112
var allowedRdnTags = new[]
{
UniversalTagNumber.TeletexString, UniversalTagNumber.PrintableString,
UniversalTagNumber.UniversalString, UniversalTagNumber.UTF8String,
UniversalTagNumber.BMPString, UniversalTagNumber.IA5String,
UniversalTagNumber.NumericString, UniversalTagNumber.VisibleString,
UniversalTagNumber.T61String
};
while (snSeq.HasData)
{
var rdnSeq = snSeq.ReadSetOf().ReadSequence();
var attrOid = rdnSeq.ReadObjectIdentifier();
var attrValueTagNo = (UniversalTagNumber)rdnSeq.PeekTag().TagValue;
if (!allowedRdnTags.Contains(attrValueTagNo))
{
throw new NotSupportedException($"Unknown tag type {attrValueTagNo} for attr {attrOid}");
}
var attrValue = rdnSeq.ReadCharacterString(attrValueTagNo);
var friendlyName = new Oid(attrOid).FriendlyName;
yield return new KeyValuePair<string, string>(friendlyName ?? attrOid, attrValue);
}
}
}
Example usage:
// Subject: CN=Example, O=Organization
var cert = new X509Certificate2("foo.cer");
var names = cert.SubjectName.GetRelativeNames().ToArray();
// names has [ { "CN": "Example" }, { "O": "Organization" } ]
Since this does not involve any string parsing, no escape or injections can be mishandled. It doesn't support decoding DN's that contain non-string elements, but those seem exceedingly rare.

How about this one:
string cnPattern = @"^CN=(?<cn>.+?)(?<!\\),";
string dn = @"CN=Doe\, John,OU=My OU,DC=domain,DC=com";
Regex re = new Regex(cnPattern);
Match m = re.Match(dn);
if (m.Success)
{
// Item with index 1 returns the first group match.
string cn = m.Groups[1].Value;
}
Adapted from Powershell Regular Expression for Extracting Parts of an Active Directory Distiniguished Name.

Just adding my two cents here. This implementation works "best" if you first learn what business rules are in place that will ultimately dictate how much of the RFC will ever be implemented at your company.
private static string ExtractCN(string distinguishedName)
{
// CN=...,OU=...,OU=...,DC=...,DC=...
string[] parts;
parts = distinguishedName.Split(new[] { ",DC=" }, StringSplitOptions.None);
var dc = parts.Skip(1);
parts = parts[0].Split(new[] { ",OU=" }, StringSplitOptions.None);
var ou = parts.Skip(1);
parts = parts[0].Split(new[] { ",CN=" }, StringSplitOptions.None);
var cnMulti = parts.Skip(1);
var cn = parts[0];
if (!Regex.IsMatch(cn, "^CN="))
throw new CustomException(string.Format("Unable to parse distinguishedName for commonName ({0})", distinguishedName));
return Regex.Replace(cn, "^CN=", string.Empty);
}

You could use regular expressions to do this. Here's a regex pattern that can parse the whole DN; then you can just take the parts you are interested in:
(?:^|,\s?)(?:(?<name>[A-Z]+)=(?<val>"(?:[^"]|"")+"|(?:\\,|[^,])+))+
Here it is formatted a bit nicer, and with some comments:
(?:^|,\s?) <-- Start or a comma
(?:
(?<name>[A-Z]+)
=
(?<val>
"(?:[^"]|"")+" <-- Quoted strings
|
(?:\\,|[^,])+ <-- Unquoted strings
)
)+
This regex will give you name and val capture groups for each match.
DN strings can optionally be quoted (e.g. "Hello"), which allows them to contain unescaped commas. Alternatively, if not quoted, commas must be escaped with a backslash (e.g. Hello\, there!). This regex handles both quoted and unquoted strings.
Here's a link so you can see it in action: https://regex101.com/r/7vhdDz/1
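For reference, here is a sketch of how the pattern above might be used from C#. The wrapper class DnRegexSketch and its Parse method are hypothetical. Note that values are returned as written in the DN, so escape sequences such as \, are not unescaped.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class DnRegexSketch
{
    // Same pattern as above, written as a verbatim string (quotes are doubled).
    private static readonly Regex RdnRegex = new Regex(
        @"(?:^|,\s?)(?:(?<name>[A-Z]+)=(?<val>""(?:[^""]|"""")+""|(?:\\,|[^,])+))+");

    // Returns the name/value pairs in the order they appear in the DN.
    public static List<KeyValuePair<string, string>> Parse(string dn)
    {
        var parts = new List<KeyValuePair<string, string>>();
        foreach (Match m in RdnRegex.Matches(dn))
        {
            parts.Add(new KeyValuePair<string, string>(
                m.Groups["name"].Value, m.Groups["val"].Value));
        }
        return parts;
    }
}
```

The common name is then the value of the first pair whose key is "CN".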

If the order is uncertain, I do this:
private static string ExtractCN(string dn)
{
string[] parts = dn.Split(new char[] { ',' });
for (int i = 0; i < parts.Length; i++)
{
var p = parts[i];
var elems = p.Split(new char[] { '=' });
var t = elems[0].Trim().ToUpper();
var v = elems[1].Trim();
if (t == "CN")
{
return v;
}
}
return null;
}

This is my almost RFC-compliant fail-safe DN parser derived from https://www.codeproject.com/Articles/9788/An-RFC-2253-Compliant-Distinguished-Name-Parser and an example of its usage (extract subject name as CN and O, both optional, concatenated with comma):
private static string GetCertificateString(X509Certificate2 certificate)
{
var subjectComponents = certificate.Subject.ParseDistinguishedName();
var subjectName = string.Join(", ", subjectComponents
.Where(m => (m.Item1 == "CN") || (m.Item1 == "O"))
.Select(n => n.Item2)
.Distinct());
return $"{certificate.SerialNumber} {certificate.NotBefore:yyyy.MM.dd}-{certificate.NotAfter:yyyy.MM.dd} {subjectName}";
}
private enum DistinguishedNameParserState
{
Component,
QuotedString,
EscapedCharacter,
};
public static IEnumerable<Tuple<string, string>> ParseDistinguishedName(this string value)
{
var previousState = DistinguishedNameParserState.Component;
var currentState = DistinguishedNameParserState.Component;
var currentComponent = new StringBuilder();
var previousChar = char.MinValue;
var position = 0;
Func<StringBuilder, Tuple<string, string>> parseComponent = sb =>
{
var s = sb.ToString();
sb.Clear();
var index = s.IndexOf('=');
if (index == -1)
{
return null;
}
var item1 = s.Substring(0, index).Trim().ToUpper();
var item2 = s.Substring(index + 1).Trim();
return Tuple.Create(item1, item2);
};
while (position < value.Length)
{
var currentChar = value[position];
switch (currentState)
{
case DistinguishedNameParserState.Component:
switch (currentChar)
{
case ',':
case ';':
// Separator found, yield parsed component
var component = parseComponent(currentComponent);
if (component != null)
{
yield return component;
}
break;
case '\\':
// Escape character found
previousState = currentState;
currentState = DistinguishedNameParserState.EscapedCharacter;
break;
case '"':
// Quotation mark found
if (previousChar == currentChar)
{
// Double quotes inside quoted string produce single quote
currentComponent.Append(currentChar);
}
currentState = DistinguishedNameParserState.QuotedString;
break;
default:
currentComponent.Append(currentChar);
break;
}
break;
case DistinguishedNameParserState.QuotedString:
switch (currentChar)
{
case '\\':
// Escape character found
previousState = currentState;
currentState = DistinguishedNameParserState.EscapedCharacter;
break;
case '"':
// Quotation mark found
currentState = DistinguishedNameParserState.Component;
break;
default:
currentComponent.Append(currentChar);
break;
}
break;
case DistinguishedNameParserState.EscapedCharacter:
currentComponent.Append(currentChar);
currentState = previousState;
currentChar = char.MinValue;
break;
}
previousChar = currentChar;
position++;
}
// Yield last parsed component, if any
if (currentComponent.Length > 0)
{
var component = parseComponent(currentComponent);
if (component != null)
{
yield return component;
}
}
}

Sorry for being a bit late to the party, but I was able to call the Name attribute directly from c#
UserPrincipal p
and then I was able to call
p.Name
and that gave me the full name (Common Name)
Sample code:
string name;
foreach(UserPrincipal p in PSR)
{
//PSR refers to PrincipalSearchResult
name = p.Name;
Console.WriteLine(name);
}
Obviously, you will have to fill in the blanks. But this should be easier than parsing regex.

Could you not just retrieve the CN attribute values?
As you correctly note, use someone else's class: there are lots of fun edge cases (escaped commas, other escaped characters) that make parsing a DN look easy when it is actually reasonably tricky.
I usually use a Java class that comes with the Novell (now NetIQ) Identity Manager, so that is not helpful here.

using System.Linq;
var dn = "CN=Jeff Smith,OU=Sales,DC=Fabrikam,DC=COM";
var cn = dn.Split(',').Where(i => i.Contains("CN=")).Select(i => i.Replace("CN=", "")).FirstOrDefault();

Well, Here I am another person late to the party. Here is my Solution:
var dn = new X500DistinguishedName("CN=TestGroup,OU=Groups,OU=UT-SLC,OU=US,DC=\"Company, inc\",DC=com");
foreach(var part in dn.Format(true).Split("\r\n"))
{
if(part == "") continue;
var parts = part.Split('=', 2);
var key = parts[0];
var value = parts[1];
// use your key and value as you see fit here.
}
Basically it's leveraging the X500DistinguishedName.Format method to put each component on its own line, then splitting by line, then splitting each line into key and value.

What's the simplest way to encoding List<String> into plain String and decode it back?

I think I've come across this requirement a dozen times, but I could never find a satisfying solution. For instance, there is a collection of strings that I want to serialize (to disk or over the network) through a channel where only a plain string is allowed.
I almost always end up using "split" and "join" with a ridiculous separator like
":::==--==:::".
Like this:
public static string encode(System.Collections.Generic.List<string> data)
{
return string.Join(" :::==--==::: ", data.ToArray());
}
public static string[] decode(string encoded)
{
return encoded.Split(new string[] { " :::==--==::: " }, StringSplitOptions.None);
}
But this simple solution apparently has some flaws. The strings cannot contain the separator string, and consequently the encoded string cannot be encoded again.
AFAIK, a comprehensive solution should involve escaping the separator on encoding and unescaping it on decoding. While the problem sounds simple, I believe the complete solution can take a significant amount of code. I wonder if there is any trick that would allow me to build an encoder and decoder in very few lines of code?

Add a reference and using to System.Web, and then:
public static string Encode(IEnumerable<string> strings)
{
return string.Join("&", strings.Select(s => HttpUtility.UrlEncode(s)).ToArray());
}
public static IEnumerable<string> Decode(string list)
{
return list.Split('&').Select(s => HttpUtility.UrlDecode(s));
}
Most languages have a pair of utility functions that do Url "percent" encoding, and this is ideal for reuse in this kind of situation.

You could call the .ToArray() method on the List<> and then serialize the array. That could then be dumped to disk or network, and reconstituted with a deserialization on the other end.
Not too much code, and you get to use the serialization techniques already tested and coded in the .net framework.

You might like to look at the way CSV files are formatted.
Escape all instances of a delimiter, e.g. ", in the string.
Wrap each item in the list in quotes: "item".
Join using a simple separator like ,
I don't believe there is a silver bullet solution to this problem.
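The CSV-style steps above can be sketched roughly as follows. This is only an illustration under one assumption: quotes inside an item are escaped by doubling them, as CSV does. The class name QuotedListCodec is made up for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class QuotedListCodec
{
    // Double every quote inside an item, wrap the item in quotes, join with commas.
    public static string Encode(IEnumerable<string> items)
    {
        return string.Join(",", items.Select(s => "\"" + s.Replace("\"", "\"\"") + "\""));
    }

    public static List<string> Decode(string encoded)
    {
        var items = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;
        for (int i = 0; i < encoded.Length; i++)
        {
            char c = encoded[i];
            if (!inQuotes)
            {
                // Outside an item: an opening quote starts one, separators are skipped.
                if (c == '"') inQuotes = true;
            }
            else if (c == '"')
            {
                if (i + 1 < encoded.Length && encoded[i + 1] == '"')
                {
                    current.Append('"'); // doubled quote is a literal quote
                    i++;
                }
                else
                {
                    items.Add(current.ToString()); // closing quote ends the item
                    current.Clear();
                    inQuotes = false;
                }
            }
            else
            {
                current.Append(c);
            }
        }
        return items;
    }
}
```

Because every item is quoted, items may contain commas, quotes, or the " :::==--==::: " separator from the question, and the output of Encode can itself be encoded again as a list item.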

Here's an old-school technique that might be suitable -
Serialise by storing the width of each string[] as a fixed-width prefix in each line.
So
string[0]="abc"
string[1]="defg"
string[2]=" :::==--==::: "
becomes
0003abc0004defg0014 :::==--==:::
...where the size of the prefix is large enough to cater for the string maximum length

You could use an XmlDocument to handle the serialization. That will handle the encoding for you.
public static string encode(System.Collections.Generic.List<string> data)
{
var xml = new XmlDocument();
xml.AppendChild(xml.CreateElement("data"));
foreach (var item in data)
{
var xmlItem = (XmlElement)xml.DocumentElement.AppendChild(xml.CreateElement("item"));
xmlItem.InnerText = item;
}
return xml.OuterXml;
}
public static string[] decode(string encoded)
{
var items = new System.Collections.Generic.List<string>();
var xml = new XmlDocument();
xml.LoadXml(encoded);
foreach (XmlElement xmlItem in xml.SelectNodes("/data/item"))
items.Add(xmlItem.InnerText);
return items.ToArray();
}

I would just prefix every string with its length and a terminator indicating the end of the length.
abc
defg
hijk
xyz
546
4.X
becomes
3: abc 4: defg 4: hijk 3: xyz 3: 546 3: 4.X
No restriction or limitations at all and quite simple.
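A minimal sketch of this length-prefix idea in C# could look like the following. It assumes ':' as the terminator and writes no spaces between items, so its output is "3:abc4:defg" rather than the spaced form shown above. The class name LengthPrefixCodec is hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

public static class LengthPrefixCodec
{
    // Writes each item as "<length>:<item>" with nothing between items.
    public static string Encode(IEnumerable<string> items)
    {
        var sb = new StringBuilder();
        foreach (string item in items)
        {
            sb.Append(item.Length.ToString(CultureInfo.InvariantCulture));
            sb.Append(':');
            sb.Append(item);
        }
        return sb.ToString();
    }

    public static List<string> Decode(string encoded)
    {
        var items = new List<string>();
        int pos = 0;
        while (pos < encoded.Length)
        {
            // The first ':' after pos ends the length; the item itself may contain ':'.
            int colon = encoded.IndexOf(':', pos);
            if (colon < 0) throw new FormatException("Missing length terminator.");
            int length = int.Parse(encoded.Substring(pos, colon - pos), CultureInfo.InvariantCulture);
            items.Add(encoded.Substring(colon + 1, length));
            pos = colon + 1 + length;
        }
        return items;
    }
}
```

The decoder never searches inside an item, which is why items may contain digits, colons, or any separator-like text without escaping.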

Json.NET is a very easy way to serialize about any object you can imagine. JSON keeps things compact and can be faster than XML.
List<string> foo = new List<string>() { "1", "2" };
string output = JsonConvert.SerializeObject(foo);
List<string> fooToo = (List<string>)JsonConvert.DeserializeObject(output, typeof(List<string>));

It can be done much simpler if you are willing to use a separator of 2 characters long:
In java code:
StringBuilder builder = new StringBuilder();
for(String s : list) {
if(builder.length() != 0) {
builder.append("||");
}
builder.append(s.replace("|", "|p"));
}
And back:
// String.split takes a regular expression, so the pipes must be escaped
for(String item : encodedList.split("\\|\\|")) {
list.add(item.replace("|p", "|"));
}

You shouldn't need to do this manually. As the other answers have pointed out, there are plenty of ways, built-in or otherwise, to serialize/deserialize.
However, if you did decide to do the work yourself, it doesn't require that much code:
public static string CreateDelimitedString(IEnumerable<string> items)
{
StringBuilder sb = new StringBuilder();
foreach (string item in items)
{
sb.Append(item.Replace("\\", "\\\\").Replace(",", "\\,"));
sb.Append(",");
}
return (sb.Length > 0) ? sb.ToString(0, sb.Length - 1) : string.Empty;
}
This will delimit the items with a comma (,). Any existing commas will be escaped with a backslash (\) and any existing backslashes will also be escaped.
public static IEnumerable<string> GetItemsFromDelimitedString(string s)
{
bool escaped = false;
StringBuilder sb = new StringBuilder();
foreach (char c in s)
{
if ((c == '\\') && !escaped)
{
escaped = true;
}
else if ((c == ',') && !escaped)
{
yield return sb.ToString();
sb.Length = 0;
}
else
{
sb.Append(c);
escaped = false;
}
}
yield return sb.ToString();
}

Why not use Xstream to serialise it, rather than reinventing your own serialisation format?
It's pretty simple:
new XStream().toXML(yourobject)

Include the System.Linq library in your file and change your functions to this:
public static string encode(System.Collections.Generic.List<string> data, out string delimiter)
{
delimiter = ":";
// Lengthen the delimiter until no item contains it as a substring
while (data.Any(s => s.Contains(delimiter))) delimiter += ":";
return string.Join(delimiter, data.ToArray());
}
public static string[] decode(string encoded, string delimiter)
{
return encoded.Split(new string[] { delimiter }, StringSplitOptions.None);
}

There are loads of textual markup languages out there; any would work.
Many would work trivially given the simplicity of your input. It all depends on:
how human-readable you want the encoding to be
how resilient to API changes it should be
how easy it is to parse
how easy it is to write or get a parser for it.
If the last one is the most important, then just use the existing XML libraries Microsoft supplies for you:
class TrivialStringEncoder
{
private readonly XmlSerializer ser = new XmlSerializer(typeof(string[]));
public string Encode(IEnumerable<string> input)
{
using (var s = new StringWriter())
{
ser.Serialize(s, input.ToArray());
return s.ToString();
}
}
public IEnumerable<string> Decode(string input)
{
using (var s = new StringReader(input))
{
return (string[])ser.Deserialize(s);
}
}
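public static void Main(string[] args)
{
// Encode and Decode are instance methods, so create an encoder first
var encoder = new TrivialStringEncoder();
var encoded = encoder.Encode(args);
Console.WriteLine(encoded);
var decoded = encoder.Decode(encoded);
foreach(var x in decoded)
Console.WriteLine(x);
}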
}
running on the inputs "A", "<", ">" you get (edited for formatting):
<?xml version="1.0" encoding="utf-16"?>
<ArrayOfString
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<string>A</string>
<string>&lt;</string>
<string>&gt;</string>
</ArrayOfString>
A
<
>
Verbose, slow but extremely simple and requires no additional libraries
