Ignore special characters in Examine - c#

In Umbraco, I use Examine to search the website, but the content is in French. Everything works fine, except that searching for "Français" does not return the same results as "Francais". Is there a way to ignore those French characters? I tried to find a FrenchAnalyzer for Lucene/Examine but did not find anything. I use fuzzy matching, so it returns results even if the words are not identical.
Here's the code of my search :
public static ISearchResults Search(string searchTerm)
{
var provider = ExamineManager.Instance.SearchProviderCollection["ExternalSearcher"];
var criteria = provider.CreateSearchCriteria(BooleanOperation.Or);
var crawl = criteria.GroupedOr(BoostedSearchableFields, searchTerm.Boost(15))
.Or().GroupedOr(BoostedSearchableFields, searchTerm.Fuzzy(Fuzziness))
.Or().GroupedOr(SearchableFields, searchTerm.Fuzzy(Fuzziness))
.Not().Field("umbracoNavHide", "1");
return provider.Search(crawl.Compile());
}

We ended up using a custom analyzer based on the SnowballAnalyzer:
public class CustomAnalyzer : SnowballAnalyzer
{
public CustomAnalyzer() : base("French") { }
public override TokenStream TokenStream(string fieldName, TextReader reader)
{
TokenStream result = base.TokenStream(fieldName, reader);
result = new ISOLatin1AccentFilter(result);
return result;
}
}

Try using Regex like this below:
var strInput ="Français";
var strToReplace = string.Empty;
var sNewString = Regex.Replace(strInput, "[^A-Za-z0-9]", strToReplace);
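I've used the pattern "[^A-Za-z0-9]" to replace every non-alphanumeric character with an empty string. Note that this removes accented letters rather than converting them, so "Français" becomes "Franais".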
Hope it helps.

You can actually convert Unicode characters with diacritics to their unaccented equivalents using the following method. That will enable you to search for "Français" with the search term "Francais".
public static string RemoveDiacritics(this string text)
{
if (string.IsNullOrWhiteSpace(text))
return text;
text = text.Normalize(NormalizationForm.FormD);
var chars = text.Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark).ToArray();
return new string(chars).Normalize(NormalizationForm.FormC);
}
Use it on any string like this:
var converted = unicodeString.RemoveDiacritics();
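As a minimal sketch of how this could be applied to a search term before it is handed to the searcher: the class and method names below are hypothetical (they are not part of Examine), and the lower-casing step is an added assumption. Stripping accents from the query only helps if the indexed text is accent-free as well, for example through an accent-folding analyzer like the one in the other answer.

```csharp
using System.Globalization;
using System.Linq;
using System.Text;

// Hypothetical helper, not part of Examine or Lucene.
public static class SearchTermCleaner
{
    // Trims the term, strips combining marks and lower-cases the result,
    // so that "Français" and "francais" end up identical.
    public static string Clean(string term)
    {
        if (string.IsNullOrWhiteSpace(term))
            return string.Empty;

        // FormD splits "ç" into "c" plus a combining cedilla.
        var decomposed = term.Trim().Normalize(NormalizationForm.FormD);

        // Keep everything except the combining (non-spacing) marks.
        var kept = decomposed
            .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            .ToArray();

        return new string(kept).Normalize(NormalizationForm.FormC).ToLowerInvariant();
    }
}
```

The cleaned value would then be passed to the query in place of the raw searchTerm.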

Related

Sprache how to handle multiple line continuation values

I am using Sprache to parse a legacy file.
The file has the following structure very similar to a key and value dictionary:
Entity
{
propertyA simple
propertyB 10-1
propertyC "first"
propertyD "I am a line that spawns
to another line"
propertyE "second"
propertyF 1,2,3,4,5,6,\
7,8,9,10,11,\
12,13,14
propertyG "one","two","three",\
"four","five","six","seven",\
"eight","nine"
}
I am able to process the file correctly, but not when it contains the "\" line continuation.
The only dirty hack I came up with is to strip the line continuations from the input string before passing it to the parser:
public static Document ParseLegsacyFile(string input)
{
// HACK
return Document.Parse(input.Replace("\\\r\n", string.Empty));
}
I don't want to carry this technical debt...
Is there any way to instruct the parser to ignore the pattern "\" followed by "\r\n" and replace it with an empty string?
I already tried Except (with Or), Return and Then without much success.
Here is part of the parsers I am using. The following ones are just for the "value" part:
public static readonly Parser<GenericObject> Value =
from value in Parse.AnyChar.Until(Parse.LineEnd).Text()
select new GenericObject(value);
private static readonly Parser<GenericString> SingleString =
from result in (from open in Parse.Char(Quote)
from content in Parse.CharExcept(Quote).Many().Text()
from close in Parse.Char(Quote)
select content).Token()
select new GenericString(result);
public static readonly Parser<GenericString> StringValue =
from value in SingleString.DelimitedBy(Parse.Char(Char.Parse(Comma)))
select new StringLiteral(string.Join(Comma, value));
Old question, but an answer may help someone:
You can remove the continuation character "\" and join the continued lines using Sprache as below:
var text = @"...text here...";
var result= RemoveSlash(text).ToList();
foreach (var l in result)
Console.WriteLine(l);
IEnumerable<string> RemoveSlash(string text)
{
// return;
Parser<string> Eol = Parse.String("\\" + Environment.NewLine).Text();
var oneLine = Parse.AnyChar.Until(Parse.LineEnd).Text();
var multiLine =
from l in Parse.AnyChar.Until(Eol).Text().Many()
from c in oneLine.Once()
let m = string.Join("", l.Concat(c))
select m;
var lines = multiLine.Or(oneLine);
var result = lines.Many().Parse(text);
return result;
}
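For comparison, if pre-processing the input turns out to be acceptable after all, the replacement from the question can be made independent of the line-ending style. This is only a sketch that does not use Sprache; the class name is hypothetical.

```csharp
using System.Text.RegularExpressions;

// Hypothetical pre-processing helper; it does not involve Sprache.
public static class ContinuationJoiner
{
    // A backslash immediately followed by a line break (\r\n or \n).
    private static readonly Regex Continuation = new Regex(@"\\\r?\n");

    // Glues continued lines together before the text is parsed.
    public static string Join(string input)
    {
        return Continuation.Replace(input, string.Empty);
    }
}
```

The joined text can then be passed to Document.Parse in place of the raw input.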

Extract multiple values from string using C#

I'm creating my own forum and have a problem with quoting messages. I know how to add a quoted message to the text box, but I cannot figure out how to extract values from the string after the post is submitted. In the text box I have something like this:
[quote IdPost=8] Some quoting text [/quote]
[quote IdPost=15] Second quoting text [/quote]
Could you tell me the easiest way to extract all the "IdPost" numbers from the string after the form is posted?
By using a regex:
@"\[quote IdPost=(\d+)\]"
Something like:
Regex reg = new Regex(@"\[quote IdPost=(\d+)\]");
foreach (Match match in reg.Matches(text))
{
...
}
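The loop body above is left open; one way it could be filled in is sketched below. The class and method names are hypothetical. The first capture group holds the digits, which are converted to integers.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper built around the regex from the answer above.
public static class QuoteParser
{
    // Captures the digits between "[quote IdPost=" and the closing "]".
    private static readonly Regex QuoteTag = new Regex(@"\[quote IdPost=(\d+)\]");

    public static List<int> ExtractPostIds(string text)
    {
        var ids = new List<int>();
        foreach (Match match in QuoteTag.Matches(text))
        {
            // Groups[1] is the first capture group, i.e. the number.
            // TryParse skips values too large to fit in an int.
            int id;
            if (int.TryParse(match.Groups[1].Value, out id))
                ids.Add(id);
        }
        return ids;
    }
}
```

Called on the two-line sample from the question, this yields the ids 8 and 15.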
var originalstring = "[quote IdPost=8] Some quoting text [/quote]";
//"[quote IdPost=" and "8] Some quoting text [/quote]"
var splits = originalstring.Split('=');
if(splits.Count() == 2)
{
//"8" and "] Some quoting text [/quote]"
var splits2 = splits[1].Split(']');
int id;
if(int.TryParse(splits2[0], out id))
{
return id;
}
}
I don't know exactly what your string looks like, but here is a regex-free solution using Substring:
using System;
public class Program
{
public static void Main()
{
string source = "[quote IdPost=8] Some quoting text [/quote]";
Console.WriteLine(ExtractNum(source, "=", "]"));
Console.WriteLine(ExtractNum2(source, "[quote IdPost="));
}
public static string ExtractNum(string source, string start, string end)
{
int index = source.IndexOf(start) + start.Length;
return source.Substring(index, source.IndexOf(end) - index);
}
// just another solution for fun
public static string ExtractNum2(string source, string junk)
{
source = source.Substring(junk.Length, source.Length - junk.Length); // erase start
return source.Remove(source.IndexOf(']')); // erase end
}
}

Converting HTML entities to Unicode Characters in C#

I found similar questions and answers for Python and Javascript, but not for C# or any other WinRT compatible language.
The reason I think I need it is that I'm displaying text I get from websites in a Windows 8 store app. E.g. &eacute; should become é.
Or is there a better way? I'm not displaying websites or rss feeds, but just a list of websites and their titles.
I recommend using System.Net.WebUtility.HtmlDecode and NOT HttpUtility.HtmlDecode.
This is due to the fact that the System.Web reference does not exist in Winforms/WPF/Console applications and you can get the exact same result using this class (which is already added as a reference in all those projects).
Usage:
string s = System.Net.WebUtility.HtmlDecode("&eacute;"); // Returns é
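For the use case in the question (a list of website titles), a small sketch could look like the following. The class and method names are hypothetical; WebUtility.HtmlDecode handles named references such as &eacute; as well as decimal and hexadecimal numeric ones.

```csharp
using System.Collections.Generic;
using System.Net;

// Hypothetical helper for decoding a batch of scraped titles.
public static class TitleDecoder
{
    public static List<string> DecodeTitles(IEnumerable<string> rawTitles)
    {
        var decoded = new List<string>();
        foreach (var title in rawTitles)
        {
            // Null entries are skipped rather than decoded.
            if (title != null)
                decoded.Add(WebUtility.HtmlDecode(title));
        }
        return decoded;
    }
}
```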
Use HttpUtility.HtmlDecode(); see its documentation on MSDN.
decodedString = HttpUtility.HtmlDecode(myEncodedString);
This might be useful: it replaces all named entities (as far as my requirements go) with their numeric Unicode equivalents.
public string EntityToUnicode(string html) {
var replacements = new Dictionary<string, string>();
var regex = new Regex("(&[a-z]{2,5};)");
foreach (Match match in regex.Matches(html)) {
if (!replacements.ContainsKey(match.Value)) {
var unicode = HttpUtility.HtmlDecode(match.Value);
if (unicode.Length == 1) {
replacements.Add(match.Value, string.Concat("&#", Convert.ToInt32(unicode[0]), ";"));
}
}
}
foreach (var replacement in replacements) {
html = html.Replace(replacement.Key, replacement.Value);
}
return html;
}
HTML entities and HTML numeric codes are encoded and decoded differently in a Metro app and a WP8 app.
With Windows Runtime Metro App
{
string inStr = "ó";
string auxStr = System.Net.WebUtility.HtmlEncode(inStr);
// auxStr == ó
string outStr = System.Net.WebUtility.HtmlDecode(auxStr);
// outStr == ó
string outStr2 = System.Net.WebUtility.HtmlDecode("ó");
// outStr2 == ó
}
With Windows Phone 8.0
{
string inStr = "ó";
string auxStr = System.Net.WebUtility.HtmlEncode(inStr);
// auxStr == ó
string outStr = System.Net.WebUtility.HtmlDecode(auxStr);
// outStr == ó
string outStr2 = System.Net.WebUtility.HtmlDecode("ó");
// outStr2 == ó
}
To solve this, in WP8, I have implemented the table in HTML ISO-8859-1 Reference before calling System.Net.WebUtility.HtmlDecode().
This worked for me, replaces both common and unicode entities.
private static readonly Regex HtmlEntityRegex = new Regex("&(#)?([a-zA-Z0-9]*);");
public static string HtmlDecode(this string html)
{
if (html.IsNullOrEmpty()) return html;
return HtmlEntityRegex.Replace(html, x => x.Groups[1].Value == "#"
? ((char)int.Parse(x.Groups[2].Value)).ToString()
: HttpUtility.HtmlDecode(x.Groups[0].Value));
}
[Test]
[TestCase(null, null)]
[TestCase("", "")]
[TestCase("&#39;fark&#39;", "'fark'")]
[TestCase("&quot;fark&quot;", "\"fark\"")]
public void should_remove_html_entities(string html, string expected)
{
html.HtmlDecode().ShouldEqual(expected);
}
An improved version of Zumey's method (I can't comment there). The longest entity name is &exclamation; (11 characters). Upper-case letters are also possible in entity names, e.g. &Agrave; (source: wiki).
public string EntityToUnicode(string html) {
var replacements = new Dictionary<string, string>();
var regex = new Regex("(&[a-zA-Z]{2,11};)");
foreach (Match match in regex.Matches(html)) {
if (!replacements.ContainsKey(match.Value)) {
var unicode = HttpUtility.HtmlDecode(match.Value);
if (unicode.Length == 1) {
replacements.Add(match.Value, string.Concat("&#", Convert.ToInt32(unicode[0]), ";"));
}
}
}
foreach (var replacement in replacements) {
html = html.Replace(replacement.Key, replacement.Value);
}
return html;
}

String escape into XML

Is there any C# function which could be used to escape and un-escape a string, which could be used to fill in the content of an XML element?
I am using VSTS 2008 + C# + .Net 3.0.
EDIT 1: I am concatenating simple and short XML file and I do not use serialization, so I need to explicitly escape XML character by hand, for example, I need to put a<b into <foo></foo>, so I need escape string a<b and put it into element foo.
SecurityElement.Escape(string s)
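Applied to the example from the question (putting a<b into <foo></foo>), a minimal sketch could look like this. The helper name is hypothetical, and the element name is assumed to be a valid XML name because it is not checked.

```csharp
using System.Security;

// Hypothetical helper; only the element content is escaped.
public static class XmlText
{
    // Builds <name>escaped content</name>. The name itself is not validated.
    public static string Element(string name, string content)
    {
        return "<" + name + ">" + SecurityElement.Escape(content) + "</" + name + ">";
    }
}
```

XmlText.Element("foo", "a<b") returns <foo>a&lt;b</foo>.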
public static string XmlEscape(string unescaped)
{
XmlDocument doc = new XmlDocument();
XmlNode node = doc.CreateElement("root");
node.InnerText = unescaped;
return node.InnerXml;
}
public static string XmlUnescape(string escaped)
{
XmlDocument doc = new XmlDocument();
XmlNode node = doc.CreateElement("root");
node.InnerXml = escaped;
return node.InnerText;
}
EDIT: You say "I am concatenating simple and short XML file and I do not use serialization, so I need to explicitly escape XML character by hand".
I would strongly advise you not to do it by hand. Use the XML APIs to do it all for you - read in the original files, merge the two into a single document however you need to (you probably want to use XmlDocument.ImportNode), and then write it out again. You don't want to write your own XML parsers/formatters. Serialization is somewhat irrelevant here.
If you can give us a short but complete example of exactly what you're trying to do, we can probably help you to avoid having to worry about escaping in the first place.
Original answer
It's not entirely clear what you mean, but normally XML APIs do this for you. You set the text in a node, and it will automatically escape anything it needs to. For example:
LINQ to XML example:
using System;
using System.Xml.Linq;
class Test
{
static void Main()
{
XElement element = new XElement("tag",
"Brackets & stuff <>");
Console.WriteLine(element);
}
}
DOM example:
using System;
using System.Xml;
class Test
{
static void Main()
{
XmlDocument doc = new XmlDocument();
XmlElement element = doc.CreateElement("tag");
element.InnerText = "Brackets & stuff <>";
Console.WriteLine(element.OuterXml);
}
}
Output from both examples:
<tag>Brackets &amp; stuff &lt;&gt;</tag>
That's assuming you want XML escaping, of course. If you're not, please post more details.
Thanks to @sehe for the one-line escape:
var escaped = new System.Xml.Linq.XText(unescaped).ToString();
I add to it the one-line un-escape:
var unescapedAgain = System.Xml.XmlReader.Create(new StringReader("<r>" + escaped + "</r>")).ReadElementString();
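The two one-liners can be wrapped into a round-trip pair, sketched below. The names are hypothetical, and the un-escape step uses XElement.Parse in place of the XmlReader call above, so it is a variation and not the same code. Note that new XText(null) throws, so null input is not handled here.

```csharp
using System.Xml.Linq;

// Hypothetical wrapper around the LINQ to XML one-liners.
public static class XmlRoundTrip
{
    public static string Escape(string unescaped)
    {
        // XText.ToString() serializes the text node, escaping markup characters.
        return new XText(unescaped).ToString();
    }

    public static string Unescape(string escaped)
    {
        // Parse the escaped text inside a throwaway element and read its value back.
        return XElement.Parse("<r>" + escaped + "</r>", LoadOptions.PreserveWhitespace).Value;
    }
}
```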
George, it's simple. Always use the XML APIs to handle XML. They do all the escaping and unescaping for you.
Never create XML by appending strings.
And if you want, like me when I found this question, to escape XML node names, like for example when reading from an XML serialization, use the easiest way:
XmlConvert.EncodeName(string nameToEscape)
It will also escape spaces and any non-valid characters for XML elements.
http://msdn.microsoft.com/en-us/library/system.security.securityelement.escape%28VS.80%29.aspx
Another take based on Jon Skeet's answer that doesn't return the tags:
void Main()
{
XmlString("Brackets & stuff <> and \"quotes\"").Dump();
}
public string XmlString(string text)
{
return new XElement("t", text).LastNode.ToString();
}
This returns just the value passed in, in XML encoded format:
Brackets &amp; stuff &lt;&gt; and "quotes"
WARNING: Necromancing
Still Darin Dimitrov's answer + System.Security.SecurityElement.Escape(string s) isn't complete.
In XML 1.1, the simplest and safest way is to just encode EVERYTHING.
Like &#9; for \t.
It isn't supported at all in XML 1.0.
For XML 1.0, one possible workaround is to base-64 encode the text containing the character(s).
//string EncodedXml = SpecialXmlEscape("привет мир");
//Console.WriteLine(EncodedXml);
//string DecodedXml = XmlUnescape(EncodedXml);
//Console.WriteLine(DecodedXml);
public static string SpecialXmlEscape(string input)
{
//string content = System.Xml.XmlConvert.EncodeName("\t");
//string content = System.Security.SecurityElement.Escape("\t");
//string strDelimiter = System.Web.HttpUtility.HtmlEncode("\t"); // XmlEscape("\t"); //XmlDecode(" ");
//strDelimiter = XmlUnescape(";");
//Console.WriteLine(strDelimiter);
//Console.WriteLine(string.Format("&#{0};", (int)';'));
//Console.WriteLine(System.Text.Encoding.ASCII.HeaderName);
//Console.WriteLine(System.Text.Encoding.UTF8.HeaderName);
string strXmlText = "";
if (string.IsNullOrEmpty(input))
return input;
System.Text.StringBuilder sb = new StringBuilder();
for (int i = 0; i < input.Length; ++i)
{
sb.AppendFormat("&#{0};", (int)input[i]);
}
strXmlText = sb.ToString();
sb.Clear();
sb = null;
return strXmlText;
} // End Function SpecialXmlEscape
XML 1.0:
public static string Base64Encode(string plainText)
{
var plainTextBytes = System.Text.Encoding.UTF8.GetBytes(plainText);
return System.Convert.ToBase64String(plainTextBytes);
}
public static string Base64Decode(string base64EncodedData)
{
var base64EncodedBytes = System.Convert.FromBase64String(base64EncodedData);
return System.Text.Encoding.UTF8.GetString(base64EncodedBytes);
}
Following functions will do the work. Didn't test against XmlDocument, but I guess this is much faster.
public static string XmlEncode(string value)
{
System.Xml.XmlWriterSettings settings = new System.Xml.XmlWriterSettings
{
ConformanceLevel = System.Xml.ConformanceLevel.Fragment
};
StringBuilder builder = new StringBuilder();
using (var writer = System.Xml.XmlWriter.Create(builder, settings))
{
writer.WriteString(value);
}
return builder.ToString();
}
public static string XmlDecode(string xmlEncodedValue)
{
System.Xml.XmlReaderSettings settings = new System.Xml.XmlReaderSettings
{
ConformanceLevel = System.Xml.ConformanceLevel.Fragment
};
using (var stringReader = new System.IO.StringReader(xmlEncodedValue))
{
using (var xmlReader = System.Xml.XmlReader.Create(stringReader, settings))
{
xmlReader.Read();
return xmlReader.Value;
}
}
}
Using a third-party library (Newtonsoft.Json) as alternative:
public static string XmlEscape(string unescaped)
{
if (unescaped == null) return null;
return JsonConvert.SerializeObject(unescaped);
}
public static string XmlUnescape(string escaped)
{
if (escaped == null) return null;
return JsonConvert.DeserializeObject(escaped, typeof(string)).ToString();
}
Examples of escaped string:
a<b ==> "a<b"
<foo></foo> ==> "<foo></foo>"
NOTE:
In newer versions, the code written above may not work with escaping, so you need to specify how the strings will be escaped:
public static string XmlEscape(string unescaped)
{
if (unescaped == null) return null;
return JsonConvert.SerializeObject(unescaped, new JsonSerializerSettings()
{
StringEscapeHandling = StringEscapeHandling.EscapeHtml
});
}
Examples of escaped string:
a<b ==> "a\u003cb"
<foo></foo> ==> "\u003cfoo\u003e\u003c/foo\u003e"
SecurityElement.Escape does this job for you.
Use this method to replace invalid characters in a string before using the string in a SecurityElement. If invalid characters are used in a SecurityElement without being escaped, an ArgumentException is thrown.
The following table shows the invalid XML characters and their escaped equivalents.
Invalid XML character    Replaced with
<                        &lt;
>                        &gt;
"                        &quot;
'                        &apos;
&                        &amp;
https://learn.microsoft.com/en-us/dotnet/api/system.security.securityelement.escape?view=net-5.0

What's the simplest way to encode List<String> into a plain String and decode it back?

I think I've come across this requirement a dozen times, but I could never find a satisfying solution. For instance, there is a collection of strings which I want to serialize (to disk or over the network) through a channel where only a plain string is allowed.
I almost always end up using "split" and "join" with a ridiculous separator like
":::==--==:::".
Like this:
public static string encode(System.Collections.Generic.List<string> data)
{
return string.Join(" :::==--==::: ", data.ToArray());
}
public static string[] decode(string encoded)
{
return encoded.Split(new string[] { " :::==--==::: " }, StringSplitOptions.None);
}
But this simple solution obviously has some flaws. The strings cannot contain the separator string, and consequently the encoded string cannot be encoded again.
AFAIK, a comprehensive solution should involve escaping the separator on encoding and unescaping it on decoding. While the problem sounds simple, I believe a complete solution can take a significant amount of code. I wonder if there is any trick that would let me build the encoder and decoder in very few lines of code?
Add a reference and using to System.Web, and then:
public static string Encode(IEnumerable<string> strings)
{
return string.Join("&", strings.Select(s => HttpUtility.UrlEncode(s)).ToArray());
}
public static IEnumerable<string> Decode(string list)
{
return list.Split('&').Select(s => HttpUtility.UrlDecode(s));
}
Most languages have a pair of utility functions that do Url "percent" encoding, and this is ideal for reuse in this kind of situation.
You could call the .ToArray() method on the List<> and then serialize the array; that could then be dumped to disk or network, and reconstituted by deserializing it on the other end.
Not too much code, and you get to use the serialization techniques already tested and coded in the .net framework.
You might like to look at the way CSV files are formatted.
escape all instances of a delimiter, e.g. ", in the string
wrap each item in the list in quotes: "item"
join using a simple separator like ,
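The three steps above can be sketched as follows, assuming the usual CSV convention that a quote inside an item is escaped by doubling it. The class name is hypothetical, and the decoder only understands what this encoder produces; it is not a general CSV parser.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical CSV-style codec for a list of strings.
public static class CsvListCodec
{
    public static string Encode(IEnumerable<string> items)
    {
        // Double embedded quotes, wrap each item in quotes, join with commas.
        return string.Join(",", items.Select(s => "\"" + s.Replace("\"", "\"\"") + "\""));
    }

    public static List<string> Decode(string encoded)
    {
        var result = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;

        for (int i = 0; i < encoded.Length; i++)
        {
            char c = encoded[i];
            if (!inQuotes)
            {
                // Between items only an opening quote matters; commas are skipped.
                if (c == '"') inQuotes = true;
            }
            else if (c == '"' && i + 1 < encoded.Length && encoded[i + 1] == '"')
            {
                current.Append('"'); // a doubled quote is one literal quote
                i++;
            }
            else if (c == '"')
            {
                // Closing quote: the item is complete.
                result.Add(current.ToString());
                current.Clear();
                inQuotes = false;
            }
            else
            {
                current.Append(c);
            }
        }
        return result;
    }
}
```

Because every item is quoted, commas and even the old " :::==--==::: " separator can appear inside items without confusing the decoder.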
I don't believe there is a silver bullet solution to this problem.
Here's an old-school technique that might be suitable -
Serialise by storing the length of each string as a fixed-width prefix in front of it.
So
string[0]="abc"
string[1]="defg"
string[2]=" :::==--==::: "
becomes
0003abc0004defg0014 :::==--==:::
...where the size of the prefix is large enough to cater for the string maximum length
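A minimal sketch of this fixed-width scheme, assuming a four-digit prefix as in the example (so no item may exceed 9999 characters). The class name is hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

// Hypothetical codec: each item is preceded by its length as four digits.
public static class FixedWidthPrefixCodec
{
    private const int PrefixWidth = 4; // assumption: items are at most 9999 characters

    public static string Encode(IEnumerable<string> items)
    {
        var sb = new StringBuilder();
        foreach (var item in items)
        {
            if (item.Length > 9999)
                throw new ArgumentException("Item too long for a 4-digit prefix.");
            // "D4" pads the length with leading zeros, e.g. 3 becomes 0003.
            sb.Append(item.Length.ToString("D4", CultureInfo.InvariantCulture)).Append(item);
        }
        return sb.ToString();
    }

    public static List<string> Decode(string encoded)
    {
        var result = new List<string>();
        int pos = 0;
        while (pos < encoded.Length)
        {
            // Read the prefix, then exactly that many characters.
            int length = int.Parse(encoded.Substring(pos, PrefixWidth), CultureInfo.InvariantCulture);
            pos += PrefixWidth;
            result.Add(encoded.Substring(pos, length));
            pos += length;
        }
        return result;
    }
}
```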
You could use an XmlDocument to handle the serialization. That will handle the encoding for you.
public static string encode(System.Collections.Generic.List<string> data)
{
var xml = new XmlDocument();
xml.AppendChild(xml.CreateElement("data"));
foreach (var item in data)
{
var xmlItem = (XmlElement)xml.DocumentElement.AppendChild(xml.CreateElement("item"));
xmlItem.InnerText = item;
}
return xml.OuterXml;
}
public static string[] decode(string encoded)
{
var items = new System.Collections.Generic.List<string>();
var xml = new XmlDocument();
xml.LoadXml(encoded);
foreach (XmlElement xmlItem in xml.SelectNodes("/data/item"))
items.Add(xmlItem.InnerText);
return items.ToArray();
}
I would just prefix every string with its length and a terminator indicating the end of the length.
abc
defg
hijk
xyz
546
4.X
becomes
3: abc 4: defg 4: hijk 3: xyz 3: 546 3: 4.X
No restriction or limitations at all and quite simple.
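A sketch of this variable-length variant, with ':' as the terminator. Unlike the example above, it writes no space after the colon or between items, which is a simplification of mine. The class name is hypothetical, and malformed input is not handled gracefully.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Text;

// Hypothetical codec: each item is written as <length>:<item>.
public static class LengthPrefixCodec
{
    public static string Encode(IEnumerable<string> items)
    {
        var sb = new StringBuilder();
        foreach (var item in items)
            sb.Append(item.Length.ToString(CultureInfo.InvariantCulture)).Append(':').Append(item);
        return sb.ToString();
    }

    public static List<string> Decode(string encoded)
    {
        var result = new List<string>();
        int pos = 0;
        while (pos < encoded.Length)
        {
            // The digits up to the next ':' give the length of the item that follows.
            int colon = encoded.IndexOf(':', pos);
            int length = int.Parse(encoded.Substring(pos, colon - pos), CultureInfo.InvariantCulture);
            result.Add(encoded.Substring(colon + 1, length));
            pos = colon + 1 + length;
        }
        return result;
    }
}
```

Since the decoder counts characters and never searches the item text, items may contain colons or digits freely.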
Json.NET is a very easy way to serialize about any object you can imagine. JSON keeps things compact and can be faster than XML.
List<string> foo = new List<string>() { "1", "2" };
string output = JsonConvert.SerializeObject(foo);
List<string> fooToo = (List<string>)JsonConvert.DeserializeObject(output, typeof(List<string>));
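If adding a third-party package is not an option, the same idea can be sketched with System.Text.Json, which is built into newer .NET versions. This swaps in a different library from the one named above, and the class name is hypothetical.

```csharp
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical wrapper using System.Text.Json in place of Json.NET.
public static class JsonListCodec
{
    public static string Encode(List<string> items)
    {
        // Produces a compact JSON array; the serializer does the escaping.
        return JsonSerializer.Serialize(items);
    }

    public static List<string> Decode(string json)
    {
        return JsonSerializer.Deserialize<List<string>>(json);
    }
}
```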
It can be done much simpler if you are willing to use a separator of 2 characters long:
In Java:
StringBuilder builder = new StringBuilder();
for(String s : list) {
if(builder.length() != 0) {
builder.append("||");
}
builder.append(s.replace("|", "|p"));
}
And back:
// String.split takes a regular expression, so the pipes must be escaped.
for(String item : encodedList.split("\\|\\|")) {
list.add(item.replace("|p", "|"));
}
You shouldn't need to do this manually. As the other answers have pointed out, there are plenty of ways, built-in or otherwise, to serialize/deserialize.
However, if you did decide to do the work yourself, it doesn't require that much code:
public static string CreateDelimitedString(IEnumerable<string> items)
{
StringBuilder sb = new StringBuilder();
foreach (string item in items)
{
sb.Append(item.Replace("\\", "\\\\").Replace(",", "\\,"));
sb.Append(",");
}
return (sb.Length > 0) ? sb.ToString(0, sb.Length - 1) : string.Empty;
}
This will delimit the items with a comma (,). Any existing commas will be escaped with a backslash (\) and any existing backslashes will also be escaped.
public static IEnumerable<string> GetItemsFromDelimitedString(string s)
{
bool escaped = false;
StringBuilder sb = new StringBuilder();
foreach (char c in s)
{
if ((c == '\\') && !escaped)
{
escaped = true;
}
else if ((c == ',') && !escaped)
{
yield return sb.ToString();
sb.Length = 0;
}
else
{
sb.Append(c);
escaped = false;
}
}
yield return sb.ToString();
}
Why not use XStream to serialise it, rather than inventing your own serialisation format?
It's pretty simple:
new XStream().toXML(yourobject)
Add a using directive for System.Linq to your file and change your functions to this:
public static string encode(System.Collections.Generic.List<string> data, out string delimiter)
{
delimiter = ":";
// Grow the delimiter until no item contains it as a substring.
while(data.Any(s => s.Contains(delimiter))) delimiter += ":";
return string.Join(delimiter, data.ToArray());
}
public static string[] decode(string encoded, string delimiter)
{
return encoded.Split(new string[] { delimiter }, StringSplitOptions.None);
}
There are loads of textual markup languages out there, and any of them would work.
Many would work trivially given the simplicity of your input; it all depends on:
how human-readable you want the encoding to be
how resilient to API changes it should be
how easy it is to parse
how easy it is to write or get a parser for it
If the last one is the most important, then just use the existing XML libraries Microsoft supplies:
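using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Xml.Serialization;
class TrivialStringEncoder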
{
private readonly XmlSerializer ser = new XmlSerializer(typeof(string[]));
public string Encode(IEnumerable<string> input)
{
using (var s = new StringWriter())
{
ser.Serialize(s, input.ToArray());
return s.ToString();
}
}
public IEnumerable<string> Decode(string input)
{
using (var s = new StringReader(input))
{
return (string[])ser.Deserialize(s);
}
}
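public static void Main(string[] args)
{
// Encode and Decode are instance methods, so an instance is needed here.
var encoder = new TrivialStringEncoder();
var encoded = encoder.Encode(args);
Console.WriteLine(encoded);
var decoded = encoder.Decode(encoded);
foreach(var x in decoded)
Console.WriteLine(x);
}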
}
running on the inputs "A", "<", ">" you get (edited for formatting):
<?xml version="1.0" encoding="utf-16"?>
<ArrayOfString
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<string>A</string>
<string>&lt;</string>
<string>&gt;</string>
</ArrayOfString>
A
<
>
Verbose and slow, but extremely simple, and it requires no additional libraries.
