I have an XML file which contains images. The images are of the form [[ID]], where ID is the unique identifier for an image stored in the database. I want to parse this XML file for [[ ]] tags and replace the tags with the images, which will be fetched from the database. Up to this point, this is what I have done:
string xmlUrl = Server.MapPath("/Model Report.xsl");
XmlDocument doc = new XmlDocument();
doc.Load(xmlUrl);
Now I want to parse doc for [[ ]] tags. The image in the XML file looks like this (this is only a part of the XML file):
<table-graphics.as-jpeg datatype="string" display-name="As JPEG">[[ID]]</table-graphics.as-jpeg>
Your help is truly appreciated.
It's still not clear to me what you are asking. Sounds like you are trying to treat the XML document like a plain text document; in which case:
string xmlUrl = Server.MapPath("/Model Report.xsl");
string raw = new WebClient().DownloadString(xmlUrl);
var output = new List<string>();
int left = 0;
int right = 0;
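while ((left = raw.IndexOf("[[", right)) >= 0) {
// Reuse the outer variable; redeclaring it inside the loop does not compile.
right = raw.IndexOf("]]", left);
if (right < 0) break; // no closing "]]" after this "[[": stop scanning
output.Add(raw.Substring(left+2, right-left-2));
}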
should give all the [[something]] patterns in the document.
If you know more about the document, like all the IDs will be in table-graphics.as-jpeg tags, you should use the XPath support as Mark said:
string xmlUrl = Server.MapPath("/Model Report.xsl");
XmlDocument doc = new XmlDocument();
doc.Load(xmlUrl);
foreach (XmlNode node in doc.SelectNodes("//table-graphics.as-jpeg"))
{
string placeholder = node.InnerText; // the element's text, e.g. "[[ID]]"
}
Take a look at the XPath support in C# to help with the parsing.
You'll want to look for the table-graphics.as-jpeg elements, then grab their text content. (Don't have the exact syntax in my head -- I always have to look it up myself.)
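Putting the two suggestions together, here is a minimal sketch of the replacement step. It assumes the placeholders only occur as text inside table-graphics.as-jpeg elements. The PlaceholderReplacer class and the lookup callback are hypothetical names standing in for whatever code fetches the image (or its URL or encoded data) from the database.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.Xml;

public static class PlaceholderReplacer
{
    // Matches [[ID]] and captures the ID between the brackets.
    private static readonly Regex Placeholder = new Regex(@"\[\[(.+?)\]\]");

    // lookup is a hypothetical callback that stands in for the database fetch.
    public static void ReplacePlaceholders(XmlDocument doc, Func<string, string> lookup)
    {
        // Copy the matches first so the document is not modified while it is being queried.
        var nodes = new List<XmlNode>();
        foreach (XmlNode node in doc.SelectNodes("//table-graphics.as-jpeg"))
        {
            nodes.Add(node);
        }

        foreach (XmlNode node in nodes)
        {
            // Replace every [[ID]] in the element text with whatever lookup returns for that ID.
            node.InnerText = Placeholder.Replace(node.InnerText, m => lookup(m.Groups[1].Value));
        }
    }
}
```

You would call it as PlaceholderReplacer.ReplacePlaceholders(doc, id => ...) after doc.Load(xmlUrl), then save the document.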
I use "application 1" to create and edit xhtml files.
It has an option to enter annotations into the content of non-empty elements like p, h1, h2, td etc ... which results in mixed xml code sections like this:
<p>Hello <NS1:annotation [...SomeAttributes...]>everybody</NS1:annotation> out there!</p>
For translational purposes I have to export these xhtml files into "application 2" which can't deal with these internal elements. As the annotations are not part of the desired content in the translations removing them before exporting them to application 2 would be a perfect workaround:
<p>Hello everybody out there!</p>
Removing nodes from an XmlDocument reliably finds and removes the internal XML elements, but it also deletes the content of the annotation element, losing the word "everybody" in the example above:
<p>Hello out there!</p>
What I need is rather to "unbind" the content of these internal elements into the content of the parent element. But so far I haven't found a method in the C# XML tools that does the job.
So far I first save the xhtml file, re-open it as a text file and use regular expressions to remove the annotation. I can even use C# methods for it:
TextFile txt = new TextFile();
string s = txt.ReadFile(filename);
string pattern = @"<NS1:annotation[^>]*>(.+?)</NS1:annotation>";
string input = s;
string replacement = "$1";
Regex rgx = new Regex(pattern);
string result = rgx.Replace(input, replacement);
TextFile.Write(filename, result);
This is doubtlessly a better solution as it doesn't lose the content of the annotation, but I wonder whether there really is no solution based on the C# XML tools that does the job.
Anybody out there who knows it?
I think I found an answer using XmlDocument.
The key is that in mixed XML nodes the text surrounding the node can be addressed as XML nodes too. I wasn't aware of this ...
The following function unbinds the content of the mixed node and releases it into the content of the parent node. I haven't tested it for nodes containing multiple annotations, but that's enough for me at the moment ...
private void removeAnnotations(XmlDocument doc)
{
XmlNamespaceManager manager = new XmlNamespaceManager(new NameTable());
manager.AddNamespace("NS1","http://www.someurl.net");
XmlNodeList annotations = doc.SelectNodes("//NS1:annotation", manager);
int i = 0;
while (i < annotations.Count)
{
//in mixed xml the Siblings are xml text nodes. Therefore we write them into buffers:
string s0 = "";
if(annotations[i].PreviousSibling != null) s0 = annotations[i].PreviousSibling.InnerText;
string s2 = "";
if(annotations[i].NextSibling != null) s2 = annotations[i].NextSibling.InnerText;
//buffer the content of the annotation itself
string s1 = annotations[i].InnerText;
//buffer the link to the parent node before we remove the annotation,
XmlNode parent = annotations[i].ParentNode;
//now remove the annotation
parent.RemoveChild(annotations[i]);
//and apply the new Text to the parent element
parent.InnerText = s0 + s1 + s2;
i++;
}
}
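For what it's worth, here is an alternative sketch that avoids rebuilding the parent's text by hand. Instead of buffering sibling text, it moves the annotation's child nodes into the parent at the annotation's position and then removes the empty annotation element, so other child elements of the parent are left alone. The AnnotationUnwrapper class name is made up for this example, and the namespace URI passed in is assumed to be the one the document binds to the annotation prefix.

```csharp
using System.Collections.Generic;
using System.Xml;

public static class AnnotationUnwrapper
{
    // Replaces every annotation element with its own children ("unwrapping" it).
    public static void Unwrap(XmlDocument doc, string namespaceUri)
    {
        var manager = new XmlNamespaceManager(doc.NameTable);
        manager.AddNamespace("NS1", namespaceUri);

        // Copy the matches first so the tree can be modified safely afterwards.
        var annotations = new List<XmlNode>();
        foreach (XmlNode n in doc.SelectNodes("//NS1:annotation", manager))
        {
            annotations.Add(n);
        }

        foreach (XmlNode annotation in annotations)
        {
            XmlNode parent = annotation.ParentNode;

            // InsertBefore detaches the child from the annotation and
            // re-attaches it to the parent, just in front of the annotation.
            while (annotation.FirstChild != null)
            {
                parent.InsertBefore(annotation.FirstChild, annotation);
            }

            parent.RemoveChild(annotation);
        }

        // Merge the now-adjacent text nodes back into single text nodes.
        doc.Normalize();
    }
}
```

Because the children are moved rather than flattened to a string, this should also cope with several annotations in one paragraph, although I have only reasoned it through on small inputs.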
<TestCase Name="DEBUG">
<ActionEnvironment Name="Carved records indication">
<Define Name="_TestedVersionPath" Value="{CustomParam {paramName=PA tested version installer folder path}, {appName=PA installer}, {hint=\\ptnas1\builds\Temp Builds\Forensic\Physical Analyzer\PA.Test\UFED_Analyzer_17.02.05_03-00_6.0.0.128\EncryptedSetup}}"/>
<Define Name="_PathOfdata" Value="SharedData\myfolder\mydata.xml"/>
<ActionSet Name="DEBUG">
<Actions>
<SpecialAction ActionName="myactionname">
<CaseName>123</CaseName>
<UaeSendQueryValues>
<URL>192.168.75.133</URL>
<RestURL></RestURL>
<UserName>user1</UserName>
<Password>aaa</Password>
<PathOfQuery>_PathOfdata</PathOfQuery>
<Method>GET</Method>
<ParamsFromFile></ParamsFromFile>
</UaeSendQueryValues>
</SpecialAction>
</Actions>
</ActionSet>
</ActionEnvironment>
</TestCase>
I have the above XML. I need to find every PathOfQuery tag, get its text (in the example, _PathOfdata), and then go up the XML tree and find the first Define tag whose Name equals the text of the PathOfQuery tag and get its Value (in the example, "SharedData\myfolder\mydata.xml").
Then I would like to replace that value with another string.
I need to do this for each PathOfQuery tag that appears in the XML (there could be more than one), and I always want the first occurrence of the Define tag (there could be more than one) that I meet when travelling up the tree from the point where the PathOfQuery tag was found.
I want to do this in C#.
Any help will be appreciated.
Let's assume string s holds the above Xml. Then the following code will work for you:
XmlDocument xDoc = new XmlDocument();
xDoc.LoadXml(s);
XmlNode pathOfQuery = xDoc.SelectSingleNode("//PathOfQuery");
string pathOfQueryValue = pathOfQuery.InnerText;
Console.WriteLine(pathOfQueryValue);
XmlNode define = xDoc.SelectSingleNode("//Define[@Name='" + pathOfQueryValue + "']");
if(define!=null)
{
string defineTagValue = define.Attributes["Value"].Value;
Console.WriteLine(defineTagValue);
pathOfQuery.InnerText = defineTagValue;
Console.WriteLine(pathOfQuery.InnerText);
}
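The code above handles a single PathOfQuery and searches the whole document for the Define. A rough sketch of the "every PathOfQuery, nearest Define when walking up" variant could look like the following. DefineResolver and its method names are invented for this example, and it assumes that "first" means the first matching Define child, in document order, of the closest ancestor that has one.

```csharp
using System.Xml;

public static class DefineResolver
{
    // Walks up from start and returns the first Define child of the nearest
    // ancestor whose Name attribute equals name, or null if there is none.
    public static XmlElement FindNearestDefine(XmlNode start, string name)
    {
        for (XmlNode ancestor = start.ParentNode; ancestor != null; ancestor = ancestor.ParentNode)
        {
            foreach (XmlNode child in ancestor.ChildNodes)
            {
                if (child is XmlElement e && e.Name == "Define" && e.GetAttribute("Name") == name)
                {
                    return e;
                }
            }
        }
        return null;
    }

    // Sets Value on the Define resolved for each PathOfQuery and
    // returns how many PathOfQuery elements could be resolved.
    public static int ReplaceDefineValues(XmlDocument doc, string newValue)
    {
        int replaced = 0;
        foreach (XmlNode pathOfQuery in doc.SelectNodes("//PathOfQuery"))
        {
            XmlElement define = FindNearestDefine(pathOfQuery, pathOfQuery.InnerText.Trim());
            if (define != null)
            {
                define.SetAttribute("Value", newValue);
                replaced++;
            }
        }
        return replaced;
    }
}
```

After calling DefineResolver.ReplaceDefineValues(xDoc, "some other string") you would save the document as usual.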
I have a variable in my program that contains HTML data as a string. The variable, htmlText, contains something like the following:
<ul><li><u>Mode selector </u></li><li><u>LAND ALT</u></li>
I'd like to iterate through this HTML, using the HtmlAgilityPack, but every example I see tries to load the HTML as a document. I already have the HTML that I want to parse within the variable htmlText. Can someone show me how to parse this, without loading it as a document?
The example I'm looking at right now looks like this:
static void Main(string[] args)
{
var web = new HtmlWeb();
var doc = web.Load("http://www.stackoverflow.com");
var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
foreach (var node in nodes)
{
Console.WriteLine(node.InnerHtml);
}
}
I want to convert this to use my htmlText and find all underline elements within. I just don't want to load this as a document since I already have the HTML that I want to parse stored in a variable.
You can use the LoadHtml method of the HtmlDocument class.
"Document" is simply a name; it's not really a document (or doesn't have to be).
var doc = new HtmlAgilityPack.HtmlDocument();
string myHTML = "<ul><li><u>Mode selector </u></li><li><u>LAND ALT</u></li>";
doc.LoadHtml(myHTML);
// "//u" selects the underline elements; "//a[@href]" would find nothing in this snippet
foreach (var node in doc.DocumentNode.SelectNodes("//u")) {
Console.WriteLine(node.InnerHtml);
}
I've used this exact same thing to parse html chunks in variables.
So I'm working on a speech recognition program in C# and while trying to implement the YAHOO News API into the program I am getting no response.
I won't copy/paste my whole code as it would be very long so here are the main bits.
private void GetNews()
{
string query = String.Format("http://news.yahoo.com/rss/");
XmlDocument wData = new XmlDocument();
wData.Load(query);
XmlNamespaceManager manager = new XmlNamespaceManager(wData.NameTable);
manager.AddNamespace("media", "http://search.yahoo.com/mrss/");
XmlNode channel = wData.SelectSingleNode("rss").SelectSingleNode("channel");
XmlNodeList nodes = wData.SelectNodes("rss/channel/item/description", manager);
FirstStory = channel.SelectSingleNode("item").SelectSingleNode("title", manager).Attributes["alt"].Value;
}
I believe I have done something wrong here:
XmlNode channel = wData.SelectSingleNode("rss").SelectSingleNode("channel");
XmlNodeList nodes = wData.SelectNodes("rss/channel/item/description", manager);
FirstStory = channel.SelectSingleNode("item").SelectSingleNode("title", manager).Attributes["alt"].Value;
Here is the full XML Document: http://news.yahoo.com/rss/
If any more info is required let me know.
I have implemented my own code to get news from Yahoo. I read each news title (located at rss/channel/item/title) and short story (located at rss/channel/item/description).
The short story is the tricky part: we need to get the whole inner text of the description node as a string and then parse it as XML. The text is in this format, and the short story comes right after </p>:
<p><a><img /></a></p>"Short Story"<br clear="all"/>
That fragment has several top-level elements (p and br), so it is not well-formed XML on its own. To load it, we wrap it in an extra root element <me>:
string ShStory=null;
string Title = null;
//Creating a XML Document
XmlDocument doc = new XmlDocument();
//Loading rss on it
doc.Load("http://news.yahoo.com/rss/");
//Looping every item in the XML
foreach (XmlNode node in doc.SelectNodes("rss/channel/item"))
{
//Reading Title which is simple
Title = node.SelectSingleNode("title").InnerText;
//Putting all description text in string ndd
string ndd = node.SelectSingleNode("description").InnerText;
XmlDocument xm = new XmlDocument();
//Loading modified string as XML in xm with the root <me>
xm.LoadXml("<me>"+ndd+"</me>");
//Selecting node <p> which has the text
XmlNode nodds = xm.SelectSingleNode("/me/p");
//Putting inner text in the string ShStory
ShStory= nodds.InnerText;
//Showing the message box with the loaded data
MessageBox.Show(Title+ " "+ShStory);
}
Choose this as the right answer or vote it up if the code works for you. If there are any issues, you can ask me. Cheers
It's likely that you are passing that namespace manager to those attributes, but I'm not 100% certain. Those are definitely not in that .../mrss/ namespace, so I would guess that is your problem.
I would try it without passing the namespace (if possible) or using the GetElementsByTagName method to avoid namespace issues.
The description tag contains text rather than XML.
Here is an example to display text news:
foreach (XmlElement node in nodes)
{
Console.WriteLine(Regex.Match(node.InnerXml,
"(?<=(/a>)).+(?=(</p))"));
Console.WriteLine();
}
I want to parse an HTML page to get some data.
First, I convert it to XML document using SgmlReader.
Then, I load the result to XMLDocument and then navigate through XPath:
//contains html document
var loadedFile = LoadWebPage();
...
Sgml.SgmlReader sgmlReader = new Sgml.SgmlReader();
sgmlReader.DocType = "HTML";
sgmlReader.WhitespaceHandling = WhitespaceHandling.All;
sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower;
sgmlReader.InputStream = new StringReader(loadedFile);
XmlDocument doc = new XmlDocument();
doc.PreserveWhitespace = true;
doc.XmlResolver = null;
doc.Load(sgmlReader);
This code works fine for most cases, except on this site - www.arrow.com (try to search something like OP295GS). I can get a table with result using the following XPath:
var node = doc.SelectSingleNode(".//*[@id='results-table']");
This gives me a node with several child nodes:
[0] {Element, Name="thead"}
[1] {Element, Name="tbody"}
[2] {Element, Name="tbody"}
FirstChild {Element, Name="thead"}
Ok, let's try to get some child nodes using XPath. But this doesn't work:
var childNodes = node.SelectNodes("tbody");
//childnodes.Count = 0
Neither does this:
var childNode = node.SelectSingleNode("thead");
// childNode = null
And even this:
var childNode = doc.SelectSingleNode(".//*[@id='results-table']/thead");
What can be wrong with the XPath queries?
I've just tried to parse that HTML page with Html Agility Pack and my XPath queries work fine. But my application uses XmlDocument internally, so Html Agility Pack doesn't suit me.
I even tried the following trick with Html Agility Pack, but the XPath queries don't work either:
//let's parse and convert HTML document using HTML Agility Pack and then load
//the result to XmlDocument
HtmlDocument xmlDocument = new HtmlDocument();
xmlDocument.OptionOutputAsXml = true;
xmlDocument.Load(new StringReader(webPage));
XmlDocument document = new XmlDocument();
document.LoadXml(xmlDocument.DocumentNode.InnerHtml);
Perhaps the web page contains errors (not all tags are closed and so on), but in spite of this I can see the child nodes (through Quick Watch in Visual Studio), yet cannot access them through XPath.
My XPath queries work correctly in Firefox + FirePath + XPather plugins, but don't work in .NET XmlDocument :(
I have not used SgmlReader, but every time I have seen this problem it has been due to namespaces. A quick look at the HTML on www.arrow.com shows that this node has a namespace (note the xmlns:javaurlencoder):
<form name="CatSearchForm" method="post" action="http://components.arrow.com/part/search/OP295GS" xmlns:javaurlencoder="java.net.URLEncoder">
This code is how I loop through all nodes in a document to see which ones have namespaces and which don't. If the node you are looking for or any of its parents have namespaces, you must create a XmlNamespaceManager and pass it along with your call to SelectNodes().
This is kind of annoying, so another idea might be to strip all the xmlns: attributes out of the XML before loading it into a XmlDocument. Then, you won't need to fool with XmlNamespaceManager!
XmlDocument doc = new XmlDocument();
doc.Load(@"C:\temp\X.loadtest.xml");
Dictionary<string, string> namespaces = new Dictionary<string, string>();
XmlNodeList nlAllNodes = doc.SelectNodes("//*");
foreach (XmlNode n in nlAllNodes)
{
if (n.NodeType != XmlNodeType.Element) continue;
if (!String.IsNullOrEmpty(n.NamespaceURI) && !namespaces.ContainsKey(n.Name))
{
namespaces.Add(n.Name, n.NamespaceURI);
}
}
// Inspect the namespaces dictionary to write the code below
XmlNamespaceManager nMgr = new XmlNamespaceManager(doc.NameTable);
// Sometimes this works
nMgr.AddNamespace("ns1", doc.DocumentElement.NamespaceURI);
// You can make the first param whatever you want, it just must match in XPath queries
nMgr.AddNamespace("javaurlencoder", "java.net.URLEncoder");
XmlNodeList iter = doc.SelectNodes("//ns1:TestProfile", nMgr);
foreach (XmlNode n in iter)
{
// Do stuff
}
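To illustrate why the unprefixed queries in the question can come back empty, here is a small self-contained sketch. It assumes (I have not verified this for the arrow.com page) that the converted document declares the XHTML default namespace on its root element. In XPath 1.0 an unprefixed name such as tbody only matches elements in no namespace, so elements in a default namespace need a prefix that is registered in an XmlNamespaceManager. The NamespaceXPathDemo class and the "x" prefix are made up for the example.

```csharp
using System.Xml;

public static class NamespaceXPathDemo
{
    private const string XhtmlNs = "http://www.w3.org/1999/xhtml";

    private static XmlNode FindTable(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        // The wildcard * matches elements in any namespace, which is why this query works.
        return doc.SelectSingleNode("//*[@id='results-table']");
    }

    // Unprefixed name: only matches tbody elements that are in no namespace.
    public static int CountUnprefixed(string xml)
    {
        return FindTable(xml).SelectNodes("tbody").Count;
    }

    // Prefixed name: "x" is bound to the XHTML namespace, so namespaced tbody elements match.
    public static int CountPrefixed(string xml)
    {
        XmlNode table = FindTable(xml);
        var manager = new XmlNamespaceManager(table.OwnerDocument.NameTable);
        manager.AddNamespace("x", XhtmlNs);
        return table.SelectNodes("x:tbody", manager).Count;
    }
}
```

If node.NamespaceURI in the debugger shows a non-empty value for the results table, this is most likely what is happening, and registering that URI under any prefix should make the child queries work.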
To be honest, when I am trying to get information from a website I use regex.
OK, Kore Nordmann (in his PHP blog) thinks this is not a good idea, but some of the comments disagree.
http://kore-nordmann.de/blog/0081_parse_html_extract_data_from_html.html
http://kore-nordmann.de/blog/do_NOT_parse_using_regexp.html
It is about PHP, so sorry for that =) Hope it helps anyway.