How to get the unescaped length of XElement inner text?

I am trying to parse the following Java resources file, which is XML.
I am parsing it with C# and the XDocument tools, so this is not a Java question.
<?xml version="1.0" encoding="utf-8"?>
<resources>
<string name="problem">&#160;test&#160;</string>
<string name="no_problem"> test </string>
</resources>
The problem is that the XDocument.Load(string path) method loads this as an XDocument with 2 XElements whose text looks identical.
I load the file.
string filePath = @"c:\res.xml"; // whatever
var xDocument = XDocument.Load(filePath);
When I parse the XDocument object, here is the problem.
foreach (var node in xDocument.Root.Nodes())
{
if (node.NodeType == XmlNodeType.Element)
{
var xElement = node as XElement;
if (xElement != null) // just to be sure
{
var elementText = xElement.Value;
Console.WriteLine("Text = '{0}', Length = {1}",
elementText, elementText.Length);
}
}
}
This produces the following 2 lines:
"Text = ' test ', Length = 6"
"Text = ' test ', Length = 6"
I want to get the following 2 lines:
"Text = ' test ', Length = 6"
"Text = '&#160;test&#160;', Length = 16"
The document encoding is UTF-8, in case that is relevant.

string filePath = @"c:\res.xml"; // whatever
var xDocument = XDocument.Load(filePath);
String one = (xDocument.Root.Nodes().ElementAt(0) as XElement).Value; // <&#160;test&#160;>
String two = (xDocument.Root.Nodes().ElementAt(1) as XElement).Value; // < test >
Console.WriteLine(one == two); // false
Console.WriteLine(String.Format("{0} {1}", (int)one[0], (int)two[0])); // 160 32
You have two different strings, and the &#160; is there, but as the decoded Unicode character (U+00A0, no-break space) rather than as a character reference.
One possible way to get things back is to manually replace the non-breaking space with "&#160;":
String result = one.Replace(((char) 160).ToString(), "&#160;");

Thanks to Dmitry. Following his suggestion, I wrote a function that does this for a list of Unicode character codes.
private static readonly List<int> UnicodeCharCodesReplace =
new List<int>() { 160 }; // put integers here
public static string UnicodeUnescape(this string input)
{
var chars = input.ToCharArray();
var sb = new StringBuilder();
foreach (var c in chars)
{
if (UnicodeCharCodesReplace.Contains(c))
{
// Append &#code; instead of character
sb.Append("&#");
sb.Append(((int) c).ToString());
sb.Append(";");
}
else
{
// Append character itself
sb.Append(c);
}
}
return sb.ToString();
}
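As a rough end-to-end sketch of the idea above: parse the XML, re-escape the chosen code points, and measure the resulting length. The names EscapedLengthSketch, ReEscape and EscapedLengths are hypothetical and not part of the thread, and the sketch assumes that only the listed code points were written as numeric character references in the file, which the parser cannot confirm.

```csharp
using System;
using System.Linq;
using System.Text;
using System.Xml.Linq;

public static class EscapedLengthSketch
{
    // Hypothetical helper: writes each listed code point back as a
    // decimal character reference (&#N;) and copies everything else.
    public static string ReEscape(string input, params int[] codePoints)
    {
        var sb = new StringBuilder();
        foreach (char c in input)
        {
            if (codePoints.Contains((int)c))
                sb.Append("&#").Append((int)c).Append(';');
            else
                sb.Append(c);
        }
        return sb.ToString();
    }

    // Returns the re-escaped text length of every child element of the root.
    // Assumption: only U+00A0 (160) needs to be re-escaped.
    public static int[] EscapedLengths(string xml)
    {
        return XDocument.Parse(xml).Root.Elements()
            .Select(e => ReEscape(e.Value, 160).Length)
            .ToArray();
    }
}
```

For the resources file from the question, EscapedLengths should give 16 for the "problem" element and 6 for the "no_problem" element. Note that this reconstructs a plausible escaped form; XDocument does not keep the original spelling of the text.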

Related

c# comparing XML format

I am trying to compare XML user input against a valid XML template. I remove the values from the user input and compare what is left to the valid XML; my code is at the bottom. As the XML examples show, the goodslines element in the user input has two goodsline children, and the second one has fewer children than the template. How can I alter my code so that this case returns true when compared? Thanks in advance.
Valid XML
<?xml version="1.0" encoding="Windows-1252"?>
<goodslines>
<goodsline>
<unitamount></unitamount>
<unit_id matchmode="1"></unit_id>
<product_id matchmode="1"></product_id>
<weight></weight>
<loadingmeter></loadingmeter>
<volume></volume>
<length></length>
<width></width>
<height></height>
</goodsline>
</goodslines>
User input
<?xml version="1.0" encoding="Windows-1252"?>
<goodslines>
<goodsline>
<unitamount>5</unitamount>
<unit_id matchmode="1">colli</unit_id>
<product_id matchmode="1">1109</product_id>
<weight>50</weight>
<loadingmeter>0.2</loadingmeter>
<volume>0.036</volume>
<length>20</length>
<width>20</width>
<height>90</height>
</goodsline>
<goodsline>
<unitamount>12</unitamount>
<unit_id matchmode="1">drums</unit_id>
<product_id matchmode="1">1109</product_id>
<weight>345</weight>
</goodsline>
</goodslines>
Code
public static string Format(string xml)
{
try
{
var stringBuilder = new StringBuilder();
var element = XDocument.Parse(xml);
var settings = new XmlWriterSettings
{
OmitXmlDeclaration = true,
Indent = true,
IndentChars = new string(' ', 3),
NewLineChars = Environment.NewLine,
NewLineOnAttributes = false,
NewLineHandling = NewLineHandling.Replace
};
using (var xmlWriter = XmlWriter.Create(stringBuilder, settings))
element.Save(xmlWriter);
return stringBuilder.ToString();
}
catch(Exception ex)
{
return "Unable to format XML" + ex;
}
}
public static bool Compare(string xmlA, string xmlB)
{
if(xmlA == null || xmlB == null)
return false;
var xmlFormattedA = Format(xmlA);
var xmlFormattedB = Format(xmlB);
return xmlFormattedA.Equals(xmlFormattedB, StringComparison.InvariantCultureIgnoreCase);
}
public static string NoText(string request)
{
string pattern = @"<.*?>";
Regex rg = new Regex(pattern);
var noTextArr = rg.Matches(request)
.Cast<Match>()
.Select(m => m.Value)
.ToArray();
string noText = string.Join("", noTextArr);
return noText;
}

C# XDocument right way to escape symbols

Hello, I'm struggling with escaping in XML. The problem is that my output is escaped twice, and I don't understand why.
Code below:
private static string FixSingleEncoding(string data)
{
//data?.Replace("&", "&amp;").Replace("<", "&lt;").Replace(">", "&gt;").Replace("\"", "&quot;").Replace("'", "&apos;");
return System.Net.WebUtility.HtmlEncode(data); //SecurityElement.Escape(data);//
}
private static XDocument FixEncoding(XDocument instance)
{
XNamespace naming = instance.Root.Name.Namespace;
var result = instance.Descendants(naming + "dataset").ToList();
var count = result.Count;
for (int i = 0; i < count; i++)
{
result[i].Value = FixSingleEncoding(result[i].Value);
}
return instance;
}
public static bool CreateNewDataset(string path, string data)
{
Debug.WriteLine("CALL");
XDocument xdoc = XDocument.Load(Path.Combine(MasterLocation, path));
xdoc = FixEncoding(xdoc);
XNamespace df = xdoc.Root.Name.Namespace;
XElement root = new XElement(df+"changeSet");
root.Add(new XAttribute("id", "My Name"));
root.Add(new XAttribute("author", "Test"));
string final = data;
XElement innerelement = new XElement(df + "data", final);
innerelement.Add(new XAttribute("endDelimiter", "GO"));
root.Add(innerelement);
xdoc.Root.Add(root);
xdoc.Save(Path.Combine(MasterLocation, path));
return true;
}
The problem: when I load the XML file for the first time and call CreateNewDataset, it reads all the data from the file and unescapes the old data. So I added the FixEncoding method, but then another problem showed up: the data is now escaped twice. I know it is exactly twice because, in VS Code, converting the XML entities to a string has to be done two times before the text is readable. CreateNewDataset is called only once, yet the data is escaped twice. What am I missing?
Entered data:
IF EXISTS ( SELECT *
FROM sysobjects
WHERE id = object_id(N'[dbo].[table1]')
and OBJECTPROPERTY(id, N'IsProcedure') = 0)
Original file before CreateNewDataset:
<changeSet id="Test" author="My Name">
<data endDelimiter="GO">
IF EXISTS ( SELECT *
FROM sysobjects
WHERE id = object_id(N&apos;[dbo].[table1]&apos;)
and OBJECTPROPERTY(id, N&apos;IsProcedure&apos;) = 0)
</data>
</changeSet>
After CreateNewDataset (without FixEncoding):
<changeSet id="Test" author="My Name">
<data endDelimiter="GO">
IF EXISTS ( SELECT *
FROM sysobjects
WHERE id = object_id(N'[dbo].[table1]')
and OBJECTPROPERTY(id, N'IsProcedure') = 0)
</data>
</changeSet>
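A likely explanation, offered as a sketch rather than a confirmed diagnosis of the code above: LINQ to XML already escapes markup characters when it serializes text, so encoding the value first with WebUtility.HtmlEncode makes the serializer escape the ampersand of each entity a second time. The class and method names below are hypothetical.

```csharp
using System;
using System.Net;
using System.Xml.Linq;

public static class EscapingSketch
{
    // The raw text is stored as-is; escaping happens once, on serialization.
    public static string EscapedOnce(string raw)
    {
        return new XElement("data", raw).ToString();
    }

    // Pre-encoding turns '<' into "&lt;", and the serializer then escapes
    // the '&' of that entity again, giving "&amp;lt;".
    public static string EscapedTwice(string raw)
    {
        return new XElement("data", WebUtility.HtmlEncode(raw)).ToString();
    }
}
```

EscapedOnce("a < b & c") gives `<data>a &lt; b &amp; c</data>`, while EscapedTwice("a < b") gives `<data>a &amp;lt; b</data>`. Under this reading, dropping the FixEncoding step and assigning the raw string should be enough. The file still changes on a round trip, because an apostrophe is legal in element text and is written back unescaped, but the parsed content is the same.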

C# programming for trimming first three lines and last four lines

Below is my file, which contains XML. I want to get the node "name" from the XML using C#.
'EventObjectsRead' ('73')
message attributes:
SATRCFG_OBJECT [xml] =
<ConfData>
<CfgAgentGroup>
<CfgGroup>
<DBID value="225"/>
<tenantDBID value="101"/>
<name value="CBD"/>
<routeDNDBIDs>
<DBID value="825"/>
</routeDNDBIDs>
<capacityTableDBID value="0"/>
<quotaTableDBID value="0"/>
<state value="1"/>
<capacityRuleDBID value="0"/>
<siteDBID value="0"/>
<contractDBID value="0"/>
</CfgGroup>
<agentDBIDs>
<DBID value="128"/>
<DBID value="133"/>
<DBID value="135"/>
<DBID value="385"/>
<DBID value="433"/>
</agentDBIDs>
</CfgAgentGroup>
</ConfData>
IATRCFG_TOTALCOUNT [int] = 1
IATRCFG_OBJECTCOUNT [int] = 1
IATRCFG_OBJECTTYPE [int] = 5
IATRCFG_REQUESTID [int] = 3
Is there a way to get the node "name" directly from the text above, or do I need to trim the first three lines and the last four lines? If so, how can I do it?

You could extract the node you are looking for using Regex on the original string (where str is your string data):
// Use Regex to match the exact string and parse that to XElement.
string nameXML = Regex.Match(str, @"<name +value="".*"" */>").Groups[0].Value;
XElement name = XElement.Parse(nameXML);
Or here is an example where you can strip the invalid lines, parse the XML and then access the data from an XML object:
// Split the string into groups using newline as a delimiter.
string[] groups = str.Split(new[] { Environment.NewLine }, StringSplitOptions.None);
// Use skip and take to trim the first 3 and last 4 elements.
// Rejoin the remainder back together with empty strings and parse the XElement.
string xmlString = string.Join(string.Empty, groups.Take(groups.Length - 4).Skip(3));
XElement xml = XElement.Parse(xmlString);
// Use Descendants and First to get the first node called 'name' in the XML.
XElement name = xml.Descendants("name").First();

Here are two ways to achieve this. Either with string operation or with RegEx:
class Program
{
static void Main(string[] args)
{
Console.WriteLine("Name: {0}", GetNameFromFileString(File.ReadAllText("file.txt")));
Console.WriteLine("Name: {0}", GetNameFromFile("file.txt"));
}
private static string GetNameFromFileString(string filecontent)
{
Regex r = new Regex("(?<Xml><ConfData>.*</ConfData>)", RegexOptions.Singleline);
var match = r.Match(filecontent);
var xmlString = match.Groups["Xml"].ToString();
return GetNameFromXmlString(xmlString);
}
private static string GetNameFromFile(string filename)
{
var lines = File.ReadAllLines(filename);
var xml = new StringBuilder();
var isXml = false;
foreach (var line in lines)
{
if (line.Contains("<ConfData>"))
isXml = true;
if (isXml)
xml.Append(line.Trim());
if (line.Contains("</ConfData>"))
isXml = false;
}
var text = xml.ToString();
return GetNameFromXmlString(text);
}
private static string GetNameFromXmlString(string text)
{
var xDocument = XDocument.Parse(text);
var cfgAgentGroupt = xDocument.Root.Element("CfgAgentGroup");
var cfgGroup = cfgAgentGroupt.Element("CfgGroup");
var name = cfgGroup.Element("name");
var nameValue = name.Attribute("value");
var value = nameValue.Value;
return value;
}
}

From the string you provided and the description of what you want to do, I assume that you want to extract the XML from the file. I would do this in the following way:
string text = System.IO.File.ReadAllText(@"C:\docs\myfile.txt");
Regex r = new Regex("<ConfData>(.|\r\n)*?</ConfData>");
var v = r.Match(text);
// Groups[0] is the whole match and already includes both <ConfData> tags.
string myResult = v.Groups[0].ToString();

Regex without escaping Characters - Problems

I found some solutions for my problem, which is quite simple:
I have a string, which is looking like this:
"\r\nContent-Disposition: form-data; name=\"ctl00$cphMainContent$grid$ctl03$ucPicture$ctl00\""
My goal is to break it down, so I have a Dictionary of values, like:
Key = "name", Value = "ctl..."
My approach was: Split it by "\r\n" and then by the equal or the colon sign.
This worked fine, but then some funny tester uploaded a file with all allowed characters, which made the string look like this:
"\r\nContent-Disposition: form-data; name=\"ctl00_cphMainContent_grid_ctl03_ucPicture_btnUpload$fileUpload\"; filename=\"C:\\Users\\matthias.mueller\\Desktop\\- ie+![]{}_-´;,.$¨@#ç %&()=~^`'.jpg\"\r\nContent-Type: image/jpeg"
Of course, the simple splitting doesn't work anymore, since it now splits the filename.
I corrected this by reading out "filename=", escaping the characters I split on, and then creating a regex.
Now comes my problem: I found two regex samples that could do the work for the equals sign, the semicolon and the colon. One is:
[^\\]=
The other one I found was:
(?<!\\\\)=
The problem is that the first one doesn't only split on the equals sign: it also removes the character before it, which means my key in the Dictionary is "nam" instead of "name".
The second one works fine in this respect, but it still splits on the escaped equals sign in the filename.
Is my approach for this problem even working? Would there be a better solution for this? And why is the first Regex cutting a character?
Edit: To avoid confusion, my escaped string looks like this:
"Content-Disposition: form-data; name=\"ctl00_cphMainContent_grid_ctl03_ucPicture_btnUpload$fileUpload\"; filename=\"C\:\Users\matthias.mueller\Desktop\- ie+![]{}_-´\;,.$¨@#ç %&()\=~^`'.jpg\""
So I want basically: Split by equal Sign EXCEPT the escaped ones. By the way: The string here shows only one \, but there are 2.
Edit 2: OK seems like I have a working solution, but it's so ugly:
Dictionary<string, string> ParseHeader(byte[] bytes, int pos)
{
Dictionary<string, string> items;
string header;
string[] headerLines;
int start;
int end;
string input = _encoding.GetString(bytes, pos, bytes.Length - pos);
start = input.IndexOf("\r\n", 0);
if (start < 0) return null;
end = input.IndexOf("\r\n\r\n", start);
if (end < 0) return null;
WriteBytes(false, bytes, pos, end + 4 - 0); // Write the header to the form content
header = input.Substring(start, end - start);
items = new Dictionary<string, string>();
headerLines = Regex.Split(header, "\r\n");
Regex regLineParts = new Regex(@"(?<!\\\\);");
Regex regColon = new Regex(@"(?<!\\\\):");
Regex regEqualSign = new Regex(@"(?<!\\\\)=");
foreach (string hl in headerLines)
{
string workString = hl;
//Escape the Semicolon in filename
if (hl.Contains("filename"))
{
String orig = hl.Substring(hl.IndexOf("filename=\"") + 10);
orig = orig.Substring(0, orig.IndexOf('"'));
string toReplace = orig;
toReplace = toReplace.Replace(toReplace, toReplace.Replace(";", @"\\;"));
toReplace = toReplace.Replace(toReplace, toReplace.Replace(":", @"\\:"));
toReplace = toReplace.Replace(toReplace, toReplace.Replace("=", @"\\="));
workString = hl.Replace(orig, toReplace);
}
string[] lineParts = regLineParts.Split(workString);
for (int i = 0; i < lineParts.Length; i++)
{
string[] p;
if (i == 0)
p = regColon.Split(lineParts[i]);
else
p = regEqualSign.Split(lineParts[i]);
if (p.Length == 2)
{
string orig = p[0];
orig = orig.Replace(@"\\;", ";");
orig = orig.Replace(@"\\:", ":");
orig = orig.Replace(@"\\=", "=");
p[0] = orig;
orig = p[1];
orig = orig.Replace(@"\\;", ";");
orig = orig.Replace(@"\\:", ":");
orig = orig.Replace(@"\\=", "=");
p[1] = orig;
items.Add(p[0].Trim(), p[1].Trim());
}
}
}
return items;
}
Needs some further testing.

I had a go at writing a parser for you. It handles literal strings, like "here is a string", as the values in name-value pairs. I've also written a few tests, and the last shows an '=' character inside a literal string. It also handles escaping quotes (") inside literal strings by escaping as \" -- I'm not sure if this is right, but you could change it.
A quick explanation. I first find anything that looks like a literal string and replace it with a value like PLACEHOLDER8230498234098230498. This means the whole thing is now literal name-value pairs; eg
key="value"
becomes
key=PLACEHOLDER8230498234098230498
The original string value is stored off in the literalStrings dictionary for later.
So now we split on semicolons (to get key=value strings) and then on equals, to get the proper key/value pairs.
Then I substitute the placeholder values back in before returning the result.
public class HttpHeaderParser
{
public NameValueCollection Parse(string header)
{
var result = new NameValueCollection();
// 'register' any string values;
var stringLiteralRx = new Regex(@"""(?<content>(\\""|[^\""])+?)""", RegexOptions.IgnorePatternWhitespace);
var equalsRx = new Regex("=", RegexOptions.IgnorePatternWhitespace);
var semiRx = new Regex(";", RegexOptions.IgnorePatternWhitespace);
Dictionary<string, string> literalStrings = new Dictionary<string, string>();
var cleanedHeader = stringLiteralRx.Replace(header, m =>
{
var replacement = "PLACEHOLDER" + Guid.NewGuid().ToString("N");
var stringLiteral = m.Groups["content"].Value.Replace("\\\"", "\"");
literalStrings.Add(replacement, stringLiteral);
return replacement;
});
// now it's safe to split on semicolons to get name-value pairs
var nameValuePairs = semiRx.Split(cleanedHeader);
foreach(var nameValuePair in nameValuePairs)
{
var nameAndValuePieces = equalsRx.Split(nameValuePair);
var name = nameAndValuePieces[0].Trim();
var value = nameAndValuePieces[1];
string replacementValue;
if (literalStrings.TryGetValue(value, out replacementValue))
{
value = replacementValue;
}
result.Add(name, value);
}
return result;
}
}
There's every chance there are some proper bugs in it.
Here are some unit tests you should incorporate, too:
[TestMethod]
public void TestMethod1()
{
var tests = new[] {
new { input=@"foo=bar; baz=quux", expected = @"foo|bar^baz|quux"},
new { input=@"foo=bar;baz=""quux""", expected = @"foo|bar^baz|quux"},
new { input=@"foo=""bar"";baz=""quux""", expected = @"foo|bar^baz|quux"},
new { input=@"foo=""b,a,r"";baz=""quux""", expected = @"foo|b,a,r^baz|quux"},
new { input=@"foo=""b;r"";baz=""quux""", expected = @"foo|b;r^baz|quux"},
new { input=@"foo=""b\""r"";baz=""quux""", expected = @"foo|b""r^baz|quux"},
new { input=@"foo=""b=r"";baz=""quux""", expected = @"foo|b=r^baz|quux"},
};
var parser = new HttpHeaderParser();
foreach(var test in tests)
{
var actual = parser.Parse(test.input);
var actualAsString = String.Join("^", actual.Keys.Cast<string>().Select(k => string.Format("{0}|{1}", k, actual[k])));
Assert.AreEqual(test.expected, actualAsString);
}
}

Looks to me like you'll need a bit more of a solid parser for this than a regex split. According to this page the name/value pairs can either be 'raw';
x=1
or quoted;
x="foo bar baz"
So you'll need to look for a solution that not only splits on the equals, but ignores any equals inside;
x="y=z"
It might be that there is a better or more managed way for you to access this info. If you are using a classic ASP.NET WebForms FileUpload control, you can access the filename using the properties of the control, like
FileUpload1.HasFile
FileUpload1.FileName
If you're using MVC, you can use the HttpPostedFileBase class as a parameter to the action method. See this answer
[HttpPost]
public ActionResult Index(HttpPostedFileBase file)
{
// Verify that the user selected a file
if (file != null && file.ContentLength > 0)
{
// extract only the filename
var fileName = Path.GetFileName(file.FileName);
// store the file inside ~/App_Data/uploads folder
var path = Path.Combine(Server.MapPath("~/App_Data/uploads"), fileName);
file.SaveAs(path);
}
// redirect back to the index action to show the form once again
return RedirectToAction("Index");
}

This:
(?<!\\\\)=
matches = not preceded by \\.
It should be:
(?<!\\)=
(Make sure you use @ (verbatim) strings for the regex, to avoid confusion.)
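To illustrate the difference between the two patterns from the question, here is a small sketch. The class and method names are hypothetical. The lookbehind is zero-width, so it only checks the preceding character, whereas the character class consumes that character as part of the match, which is why "name" loses its last letter.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SplitSketch
{
    // Negative lookbehind: match '=' only when it is not preceded by a
    // backslash. Nothing besides the '=' itself is removed by the split.
    public static string[] SplitOnUnescapedEquals(string input)
    {
        return Regex.Split(input, @"(?<!\\)=");
    }

    // Character class: matches one non-backslash character plus the '=',
    // so that preceding character is removed along with the '='.
    public static string[] SplitWithCharacterClass(string input)
    {
        return Regex.Split(input, @"[^\\]=");
    }
}
```

SplitOnUnescapedEquals(@"filename=a\=b") yields "filename" and @"a\=b", while SplitWithCharacterClass("name=x") yields "nam" and "x".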

how to parse this text in c#

abc = tamaz feeo maa roo key gaera porla
Xyz = gippaza eka jaguar ammaz te sanna.
I want to make a struct:
public struct word
{
public string Word;
public string Definition;
}
How can I parse them and make a list of <word> in C#?
Thanks for the help, but this is free text and I can't be sure whether an entry takes one line or more, so what do I do about newlines?

Read the input line by line and split by the equal sign.
class Entry
{
private string term;
private string definition;
public Entry(string term, string definition)
{
this.term = term;
this.definition = definition;
}
}
// ...
string[] data = line.Split('=');
string word = data[0].Trim();
string definition = data[1].Trim();
Entry entry = new Entry(word, definition);

This can also be done using a very simple LINQ query:
var definitions =
from line in File.ReadAllLines(file)
let parts = line.Split('=')
select new word
{
Word = parts[0].Trim(),
Definition = parts[1].Trim()
};

Using a regular expression you can proceed in two ways, depending on your source input.
Example 1
Assuming you have read your source and saved every single line in an array or list:
string[] input = { "abc = tamaz feeo maa roo key gaera porla", "Xyz = gippaza eka jaguar ammaz te sanna." };
Regex mySplit = new Regex("(\\w+)\\s*=\\s*((\\w+).*)");
List<word> mylist = new List<word>();
foreach (string wordDef in input)
{
Match myMatch = mySplit.Match(wordDef);
word myWord;
myWord.Word = myMatch.Groups[1].Captures[0].Value;
myWord.Definition = myMatch.Groups[2].Captures[0].Value;
mylist.Add(myWord);
}
Example 2
Assuming you have read your source into a single variable (and every line is terminated with the line break character '\n'), you can use the same regexp "(\w+)\s*=\s*((\w+).*)", but in this way:
string inputs = "abc = tamaz feeo maa roo, key gaera porla\r\nXyz = gippaza eka jaguar; ammaz: te sanna.";
MatchCollection myMatches = mySplit.Matches(inputs);
foreach (Match singleMatch in myMatches)
{
word myWord;
myWord.Word = singleMatch.Groups[1].Captures[0].Value;
myWord.Definition = singleMatch.Groups[2].Captures[0].Value;
mylist.Add(myWord);
}
Lines that match or do not match the regexp "(\w+)\s*=\s*((\w+).*)":
"abc = tamaz feeo maa roo key gaera porla,qsdsdsqdqsd\n" --> Match!
"Xyz= gippaza eka jaguar ammaz te sanna. sdq=sqds \n" --> Match! you can insert description that includes spaces too.
"qsdqsd=\nsdsdsd\n" --> Match a multiline pair too!
"sdqsd=\n" --> DO NOT Match! (lacking descr)
"= sdq sqdqsd.\n" --> DO NOT Match! (lacking word)

// Split at an = sign. Take at most two parts (word and definition);
// ignore any = signs in the definition
string[] parts = line.Split(new[] { '=' }, 2);
word w = new word();
w.Word = parts[0].Trim();
// If the definition is missing then parts.Length == 1
if (parts.Length == 1)
w.Definition = string.Empty;
else
w.Definition = parts[1].Trim();
words.Add(w);
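For the follow-up about definitions that may span several lines, here is a minimal sketch that builds on the Split approach above. It assumes, and this is only an assumption about the input, that a line without an '=' continues the previous definition. A continuation line that itself contains '=' would be treated as a new entry. The class name WordParserSketch is hypothetical; the word struct is the one from the question.

```csharp
using System;
using System.Collections.Generic;

public struct word
{
    public string Word;
    public string Definition;
}

public static class WordParserSketch
{
    public static List<word> Parse(string text)
    {
        var words = new List<word>();
        var lines = text.Split(new[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries);
        foreach (var line in lines)
        {
            // At most two parts, so '=' inside a definition is kept.
            var parts = line.Split(new[] { '=' }, 2);
            if (parts.Length == 2)
            {
                words.Add(new word { Word = parts[0].Trim(), Definition = parts[1].Trim() });
            }
            else if (words.Count > 0)
            {
                // No '=': treat the line as a continuation. The list indexer
                // returns a copy of the struct, so modify it and store it back.
                var last = words[words.Count - 1];
                last.Definition = (last.Definition + " " + line.Trim()).Trim();
                words[words.Count - 1] = last;
            }
        }
        return words;
    }
}
```

Calling WordParserSketch.Parse("abc = tamaz feeo\nmaa roo\nXyz = gippaza") returns two entries, with the first definition joined as "tamaz feeo maa roo".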

Use Regular Expressions
