I want to convert this:
<translation>
1 Sənədlər
</translation>
to
<translation>1 Sənədlər</translation> in XML using C#.
Please help me. Only translation tags.
I tried this:
XDocument xdoc = XDocument.Load(path);
xdoc.Save(path, SaveOptions.DisableFormatting);
But it does not remove the new lines between <translation> tags.
What you have should work. You can verify it by dumping the XDocument to a string variable and checking whether SaveOptions.DisableFormatting removes the formatting.
For example, I tried the code below, and content has no formatting, including newlines and whitespace.
XDocument xmlDoc = new XDocument(new XElement("Team", new XElement("Developer", "Sam")));
var content = xmlDoc.ToString(SaveOptions.DisableFormatting);
A new line is represented in code by "\n", possibly preceded by "\r". You can simply remove these:
string xmlString = "<translation>\r\n1 Sənədlər\r\n</translation>"; // With the 'new lines'
xmlString = xmlString.Replace("\r", "").Replace("\n", "");
This turns:
<translation>
1 Sənədlər
</translation>
into:
<translation>1 Sənədlər</translation>
I hope this helps.
You can strip out newlines manually in an environment-sensitive way by using
var content = xmlString.Replace(Environment.NewLine, string.Empty);
XML defines two types of whitespace: significant and insignificant:
Insignificant whitespace is the whitespace between elements where text content doesn't occur, whereas significant whitespace is the whitespace within elements that contain text content. You might find the graphic in this article useful to show the difference.
What you have in your translation element is significant whitespace; the element contains text so it is assumed to be part of the element contents. Without a schema or DTD that says it can be collapsed, no amount of changing the whitespace handling on read or write is going to remove this. These options only relate to the insignificant whitespace.
What you can do is apply your own processing: using LINQ to XML, you can trim the whitespace of all elements that contain only text using something like this:
var textElements = doc.Descendants()
.Where(element => element.Nodes().All(node => node is XText));
foreach (var element in textElements)
{
element.Value = element.Value.Trim();
}
See this fiddle for a demo.
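As a self-contained illustration of the point above, here is a minimal sketch that contrasts SaveOptions.DisableFormatting with manual trimming. The class and method names (TrimDemo, TrimTextElements) are made up for this example. It also deviates slightly from the snippet above: it skips empty elements and materializes the query with ToList() before modifying the tree, which are cautious assumptions rather than requirements stated in the answer.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class TrimDemo
{
    // Hypothetical helper: trims leading/trailing whitespace from every
    // element whose child nodes are all text nodes.
    public static string TrimTextElements(string xml)
    {
        XDocument doc = XDocument.Parse(xml);

        // Materialize the list first so the tree is not modified while it is being enumerated.
        var textElements = doc.Descendants()
            .Where(e => e.Nodes().Any() && e.Nodes().All(n => n is XText))
            .ToList();

        foreach (var element in textElements)
        {
            element.Value = element.Value.Trim();
        }

        return doc.ToString(SaveOptions.DisableFormatting);
    }

    public static void Main()
    {
        string xml = "<root><translation>\n1 Sənədlər\n</translation></root>";

        // DisableFormatting alone keeps the line breaks inside <translation>,
        // because they are part of the element's text content.
        Console.WriteLine(XDocument.Parse(xml).ToString(SaveOptions.DisableFormatting));

        // After trimming: <root><translation>1 Sənədlər</translation></root>
        Console.WriteLine(TrimTextElements(xml));
    }
}
```

Elements with mixed content (text plus child elements) are left untouched by this sketch, so it only addresses simple cases like the translation element in the question.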
I have an XML document, which I want to serialize with normalized indentations (using two spaces), but preserving any additional line-breaks between Elements. I am using C#. Preferably I would normalize the line-break characters (so they are all \r\n), but crucially I would like to keep the presence of multiple consecutive line-breaks.
For example, given the input document:
<root>
<elementOne>Hello</elementOne>
<elementTwo>I am misaligned</elementTwo>
<elementThree>I am indented with a Tab character</elementThree>
<!-- Here is a comment preceeding another element -->
<elementFour />
</root>
I would like to produce the output document:
<root>
<elementOne>Hello</elementOne>
<elementTwo>I am slightly misaligned</elementTwo>
<elementThree>I am indented with a Tab character</elementThree>
<!-- Here is a comment preceeding another element -->
<elementFour />
</root>
If I parse the input document to an XElement and then serialise it, I get the output with normalized spacing, but the extra line-break removed:
<root>
<elementOne>Hello</elementOne>
<elementTwo>I am slightly misaligned</elementTwo>
<elementThree>I am indented with a Tab character</elementThree>
<!-- Here is a comment preceeding another element -->
<elementFour />
</root>
I have tried using XDocument.Load with LoadOptions.PreserveWhitespace, but then I cannot find a way to get indentation normalization. I have also tried using XmlWriterSettings as follows:
XmlWriterSettings settings = new XmlWriterSettings {
Indent = true,
IndentChars = "  ",
NewLineChars = "\r\n",
NewLineHandling = NewLineHandling.None
};
But tweaking these settings seems to either normalize both line-breaks and indentation, or neither.
The reason I want this behavior is that I want to "pretty-print" a large user-editable XML document so that the indentation is correct, but I don't want to remove the line-breaks added by the user for readability.
It's not possible to preserve only parts of the whitespace between elements: either the whitespace is considered significant or it isn't.
A different solution is to replace all the blank lines with a placeholder comment, format the document in the usual way, and then remove the comments (but leaving the empty lines).
For example:
using System;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class XmlFormatting {
// Unique placeholder comment that stands in for each blank line while the document is reformatted
static readonly string sNewLineComment = new XComment($"x-newline-placeholder-{Guid.NewGuid()}").ToString();
static readonly Regex sNewLineCommentRegex = new Regex($@"^\s*{sNewLineComment}\s*$", RegexOptions.Compiled | RegexOptions.Multiline);
static readonly Regex sEmptyLineRegex = new Regex(@"^\s*$", RegexOptions.Compiled | RegexOptions.Multiline);
public static string PrettyPrintXml(string inputXml) {
string newlinesReplacedWithComments = sEmptyLineRegex.Replace(inputXml, sNewLineComment);
string formattedDocument = XDocument.Parse(newlinesReplacedWithComments, LoadOptions.None).ToString();
return sNewLineCommentRegex.Replace(formattedDocument, string.Empty);
}
}
I am trying to convert a file that contains some special characters to XML, but the conversion fails because of those special characters in the data.
I already have this regex code, but it's still not working for me. Please help.
The code I have tried:
string filedata = @"D:\readwrite\test11.txt";
string input = ReadForFile(filedata);
string re1 = @"[^\u0000-\u007F]+";
string re5 = @"\p{Cs}";
data = Regex.Replace(input, re1, "");
data = Regex.Replace(input, re5, "");
XmlDocument xmlDocument = new XmlDocument();
try
{
xmlDocument = (XmlDocument)JsonConvert.DeserializeXmlNode(data);
var Xdoc = XDocument.Parse(xmlDocument.OuterXml);
}
catch (Exception ex)
{
Console.WriteLine(ex);
}
0x04 is a transmission control character and cannot appear in an XML text string, so XmlDocument is right to reject it if it really does appear in your data. This does suggest that the regex you have doesn't do what you think it does: the first pattern only matches characters outside the ASCII range, so 0x04 is never removed, and the second Regex.Replace call starts again from input, which throws away the result of the first. The real question for me is why this non-text 'character' appears in data intended as XML in the first place.
I have other questions. I've never seen JsonConvert.DeserializeXmlNode before - I had to look up what it does. Why are you using a JSON function against the root of a document which presumably therefore contains no JSON? Why are you then taking that document, converting it back to a string, and then creating an XDocument from it? Why not just create an XDocument to start with?
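If the control characters really cannot be removed at the source, one option is to filter the text before parsing it. The sketch below is only an illustration: XmlSanitizer and RemoveInvalidXmlChars are hypothetical names, and it relies on XmlConvert.IsXmlChar and XmlConvert.IsXmlSurrogatePair to decide what XML 1.0 allows, instead of a hand-written regex.

```csharp
using System;
using System.Text;
using System.Xml;

public static class XmlSanitizer
{
    // Hypothetical helper: returns the input without the characters
    // that are not allowed in an XML 1.0 document (for example 0x04).
    public static string RemoveInvalidXmlChars(string text)
    {
        var sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (XmlConvert.IsXmlChar(c))
            {
                sb.Append(c);
            }
            else if (i + 1 < text.Length && XmlConvert.IsXmlSurrogatePair(text[i + 1], c))
            {
                // Keep valid surrogate pairs (characters outside the BMP) intact.
                sb.Append(c).Append(text[i + 1]);
                i++;
            }
            // Anything else is silently dropped.
        }
        return sb.ToString();
    }

    public static void Main()
    {
        string dirty = "1 S\u0259n\u0259dl\u0259r\u0004";
        Console.WriteLine(RemoveInvalidXmlChars(dirty)); // 1 Sənədlər
    }
}
```

Unlike the ASCII-only pattern in the question, this keeps legitimate non-ASCII text. Silently dropping characters can still hide an upstream data problem, so logging what was removed may be worthwhile.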
I have a requirement where I need to read an XML file that may contain special characters. But I need to keep those special characters "as-is". However, after calling XDocument.Load(), &apos; is turned into ' and &amp; into &.
Here is what the XML file may look like:
<root>
<child>This is a text with special character such as &apos; and &amp;</child>
</root>
XDocument xDocument = null;
xDocument = XDocument.Load("myFile.xml", LoadOptions.SetBaseUri | LoadOptions.SetLineInfo | LoadOptions.PreserveWhitespace);
I've tried with encoding, but with no success. For example:
using (StreamReader oReader = new StreamReader("myFile.xml", Encoding.GetEncoding("utf-8")))
{
xDocument = XDocument.Load(oReader);
}
or
xDocument = XDocument.Parse(File.ReadAllText("myFile.xml", Encoding.UTF8));
Is there anything else that I can try?
Thanks.
I have some HTML and I want to convert it to plain text, keeping only the tags for colored text.
For example, given the HTML below:
<body>
This is a <b>sample</b> html text.
<p align="center" style="color:#ff9999">this is only a sample<p>
....
and some other tags...
</body>
</html>
I want the output:
this is a sample html text.
<#ff9999>this is only a sample<>
....
and some other tags...
I'd use an HTML parser such as HtmlAgilityPack to parse the HTML, and a regular expression to find the color value in the attributes.
First, find all the nodes that have a style attribute with a color defined in it, using XPath:
var doc = new HtmlDocument();
doc.LoadHtml(html);
var nodes = doc.DocumentNode
.SelectNodes("//*[contains(@style, 'color')]")
.ToArray();
Then the simplest regex to match a color value: (?<=color:\s*)#?\w+.
var colorRegex = new Regex(@"(?<=color:\s*)#?\w+", RegexOptions.IgnoreCase);
Then iterate through these nodes and if there is a regex match, replace the inner html of the node with html encoded tags (you'll understand why a little bit later):
foreach (var node in nodes)
{
var style = node.Attributes["style"].Value;
if (colorRegex.IsMatch(style))
{
var color = colorRegex.Match(style).Value;
node.InnerHtml =
HttpUtility.HtmlEncode("<" + color + ">") +
node.InnerHtml +
HttpUtility.HtmlEncode("</" + color + ">");
}
}
And finally get the inner text of the document and perform html decoding on it (this is because inner text strips all the tags):
var txt = HttpUtility.HtmlDecode(doc.DocumentNode.InnerText);
This should return something like this:
This is a sample html text.
<#ff9999>this is only a sample</#ff9999>
....
and some other tags...
Of course you could improve it for your needs.
It is possible to do it using regular expressions but... You should not parse (X)HTML with regex.
The first regexp I came up with to solve the problem is:
<p(\w|\s|[="])+color:(#([0-9a-f]{6}|[0-9a-f]{3}))">(\w|\s)+</p>
With this pattern, group 2 is the colour including the leading # (3 or 6 hexadecimal digits), group 3 is the same value without the #, and group 4 matches the text inside the tag one character at a time, so in .NET the full text has to be read from that group's Captures rather than its Value.
Obviously, it's not the best solution, as I'm not a regexp master, and it needs some testing and probably generalisation... But it's still a good starting point.
I have some code like the one below.
//Reading from a file and assign to the variable named "s"
string s = "<item><name> Foo </name></item>";
XmlDocument doc = new XmlDocument();
doc.LoadXml(s);
But it stops working if the content contains characters such as "<" or ">".
string s = "<item><name> Foo > Bar </name></item>";
I know I have to escape those characters before loading, but if I do it like this:
doc.LoadXml(System.Security.SecurityElement.Escape(s));
the tags (<, >) are also escaped and, as a result, an error occurs.
How can I solve this problem?
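A tricky solution:
string s = "<item><name> Foo > Bar </name></item>";
// Encode the real tags, then swap the angle brackets left in the content for placeholders
s = Regex.Replace(s, @"<[^>]+?>", m => HttpUtility.HtmlEncode(m.Value)).Replace("<", "ojlovecd").Replace(">", "cdloveoj");
// Restore the tags, then turn the placeholders into XML entities
s = HttpUtility.HtmlDecode(s).Replace("ojlovecd", "&lt;").Replace("cdloveoj", "&gt;");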
XmlDocument doc = new XmlDocument();
doc.LoadXml(s);
Assuming your content will never contain the characters "]]>", you can use CDATA.
string s = "<item><name><![CDATA[ Foo > Bar ]]></name></item>";
Otherwise, you'll need to HTML-encode your special characters, and decode them before you use/display them (unless it's in a browser).
string s = "<item><name> Foo &gt; Bar </name></item>";
Assign the content of string to the InnerXml property of node.
var node = doc.CreateElement("root");
node.InnerXml = s;
Take a look at - Different ways how to escape an XML string in C#
It looks like the strings you have generated are plain strings, not valid XML. You can either have them generated as valid XML or, if you know that each string is always going to be the name, leave the XML <item> and <name> tags out of the data.
Then, when you create the XmlDocument, call CreateElement and assign your string before saving the results.
XmlDocument doc = new XmlDocument();
XmlElement root = doc.CreateElement("item");
doc.AppendChild(root);
XmlElement name = doc.CreateElement("name");
name.InnerText = "the contents from your file";
root.AppendChild(name);
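To make the effect of this approach concrete, the sketch below wraps the snippet above in a small method and round-trips a value containing "<" and "&". InnerTextDemo and BuildItem are illustrative names, not part of the original answer.

```csharp
using System;
using System.Xml;

public static class InnerTextDemo
{
    // Hypothetical helper: builds <item><name>...</name></item> from raw, unescaped text.
    public static string BuildItem(string rawName)
    {
        var doc = new XmlDocument();
        XmlElement root = doc.CreateElement("item");
        doc.AppendChild(root);

        XmlElement name = doc.CreateElement("name");
        name.InnerText = rawName; // stored as plain text; escaping happens on serialization
        root.AppendChild(name);

        return doc.OuterXml;
    }

    public static void Main()
    {
        string xml = BuildItem(" Foo < Bar & Baz ");
        Console.WriteLine(xml); // <item><name> Foo &lt; Bar &amp; Baz </name></item>

        // Loading the generated XML gives back the original text.
        var reloaded = new XmlDocument();
        reloaded.LoadXml(xml);
        Console.WriteLine(reloaded.DocumentElement.InnerText == " Foo < Bar & Baz "); // True
    }
}
```

Because the DOM handles the escaping, the raw string never has to pass through SecurityElement.Escape or a regex.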