I need to parse an aspx file (from disk, not the one rendered in the browser), make a list of all the server-side ASP.NET controls present on the page, and then create an xml file from it. What would be the best way to do it? Also, are there any available libraries for this?
For example, if my aspx file contains
<asp:label ID="lbl1" runat="server" Text="Hi"></asp:label>
my xml file would be
<controls>
<ID>lbl1</ID>
<runat>server</runat>
<Text>Hi</Text>
</controls>
Xml parsers wouldn't understand the ASP directives: <%# <%= etc.
You'll probably be best off using regular expressions to do this, likely in 3 stages:
1. Match any tag elements from the entire page.
2. For each tag, match the tag and control type.
3. For each tag that matches (2), match any attributes.
So, starting from the top, we can use the following regex:
(?<tag><[^%/](?:.*?)>)
This will match any tag that doesn't start with <% or </, and does so lazily (we don't want a greedy expression, as it would swallow everything up to the last > it can reach and we wouldn't read the content correctly). The following could be matched:
<asp:Content ID="ph_PageContent" ContentPlaceHolderID="ph_MainContent" runat="server">
<asp:Image runat="server" />
<img src="/test.png" />
For each of those captured tags, we want to then extract the tag and type:
<(?<tag>[a-z][a-z0-9]*):(?<type>[a-z][a-z0-9]*)
Named capture groups make it easy to extract the tag and type. This expression only matches server tags (those with a prefix such as asp:), so standard html tags are dropped at this point.
<asp:Content ID="ph_PageContent" ContentPlaceHolderID="ph_MainContent" runat="server">
Will yield:
{ tag = "asp", type = "Content" }
With that same tag, we can then match any attributes:
(?<name>\S+)=["']?(?<value>(?:.(?!["']?\s+(?:\S+)=|[>"']))+.)["']?
Which yields:
{ name = "ID", value = "ph_PageContent" },
{ name = "ContentPlaceHolderID", value = "ph_MainContent" },
{ name = "runat", value = "server" }
So putting that all together, we can create a quick function that can create an XmlDocument for us:
// Requires: using System; using System.Text.RegularExpressions; using System.Xml;
public XmlDocument CreateDocumentFromMarkup(string content)
{
    if (string.IsNullOrEmpty(content))
        throw new ArgumentException("'content' must have a value.", "content");

    RegexOptions options = RegexOptions.CultureInvariant | RegexOptions.Compiled | RegexOptions.IgnoreCase;
    // Stage 1: any opening tag that is not a <% directive or a closing tag
    Regex tagExpr = new Regex("(?<tag><[^%/](?:.*?)>)", options);
    // Stage 2: the prefix and control type of a server tag, e.g. asp:Content
    Regex serverTagExpr = new Regex("<(?<tag>[a-z][a-z0-9]*):(?<type>[a-z][a-z0-9]*)", options);
    // Stage 3: name/value pairs of the attributes
    Regex attributeExpr = new Regex("(?<name>\\S+)=[\"']?(?<value>(?:.(?![\"']?\\s+(?:\\S+)=|[>\"']))+.)[\"']?", options);

    XmlDocument document = new XmlDocument();
    XmlElement root = document.CreateElement("controls");
    // Attach the root element; without this the returned document would be empty
    document.AppendChild(root);

    // Helper that builds <name>value</name>. The parameter is called 'doc' because
    // a lambda parameter may not reuse the name of the enclosing local 'document'.
    Func<XmlDocument, string, string, XmlElement> creator = (doc, name, value) => {
        XmlElement element = doc.CreateElement(name);
        element.InnerText = value;
        return element;
    };

    foreach (Match tagMatch in tagExpr.Matches(content)) {
        Match serverTagMatch = serverTagExpr.Match(tagMatch.Value);
        if (serverTagMatch.Success) {
            XmlElement controlElement = document.CreateElement("control");
            controlElement.AppendChild(
                creator(document, "tag", serverTagMatch.Groups["tag"].Value));
            controlElement.AppendChild(
                creator(document, "type", serverTagMatch.Groups["type"].Value));

            XmlElement attributeElement = document.CreateElement("attributes");
            foreach (Match attributeMatch in attributeExpr.Matches(tagMatch.Value)) {
                if (attributeMatch.Success) {
                    attributeElement.AppendChild(
                        creator(document, attributeMatch.Groups["name"].Value, attributeMatch.Groups["value"].Value));
                }
            }
            controlElement.AppendChild(attributeElement);
            root.AppendChild(controlElement);
        }
    }
    return document;
}
The resultant document could look like this:
<controls>
<control>
<tag>asp</tag>
<type>Content</type>
<attributes>
<ID>ph_PageContent</ID>
<ContentPlaceHolderID>ph_MainContent</ContentPlaceHolderID>
<runat>server</runat>
</attributes>
</control>
</controls>
Hope that helps!
I used the three regular expressions below with the above code, and they give me the html tags as well. I can also obtain the value between the opening and closing tags.
Regex tagExpr = new Regex("(?<tag><[^%/](?:.*?)>[^/<]*)", options);
Regex serverTagExpr = new Regex("<(?<type>[a-z][a-z0-9:]*)[^>/]*(?:/>|[>/])(?<value>[^</]*)", options);
Regex attributeExpr = new Regex("(?<name>\\S+)=[\"']?(?<value>(?:.(?![\"']?\\s+(?:\\S+)=|[>\"']))+.)[\"']?", options);
Func<XmlDocument, string, string, XmlElement> creator = (doc, name, value) => {
    XmlElement element = doc.CreateElement(name);
    element.InnerText = value;
    return element;
};
The Func delegate and lambda syntax above work in .NET 3.5 and above. If you are using an earlier version, create a regular method instead:
public XmlElement creator(XmlDocument document, string name, string value)
{
    XmlElement element = document.CreateElement(name);
    element.InnerText = value;
    return element;
}
This will work.
If your ASPX files are well-formed XML, XSLT may be a good solution. (Pages that contain <% ... %> blocks or unclosed HTML tags will not load as XML.) The W3 Schools site has a good introduction and reference. You could then call this XSLT from a simple program to pick the required file(s).
Alternatively, you could use LINQ to XML to load the ASPX file(s) and iterate over the controls in a LINQ style.
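A minimal sketch of the LINQ to XML route, assuming the markup is well-formed and contains no <% ... %> blocks. An .aspx file never declares the asp: prefix, so the sketch wraps the markup in a hypothetical root element that binds the prefix to a made-up namespace URI (urn:asp). The class and method names are illustrative only.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class AspxLinqSketch
{
    // Made-up namespace URI; .aspx files do not declare the asp: prefix themselves.
    private static readonly XNamespace Asp = "urn:asp";

    public static XElement ListControls(string markup)
    {
        // Wrap the markup so the XML parser accepts the asp: prefix.
        // XElement.Parse throws XmlException if the markup is not well-formed XML.
        XElement wrapper = XElement.Parse(
            "<root xmlns:asp=\"" + Asp.NamespaceName + "\">" + markup + "</root>");

        // One <control> per element in the asp namespace, with its attributes as child elements.
        return new XElement("controls",
            wrapper.Descendants()
                   .Where(e => e.Name.Namespace == Asp)
                   .Select(e => new XElement("control",
                       new XElement("type", e.Name.LocalName),
                       new XElement("attributes",
                           e.Attributes().Select(a => new XElement(a.Name.LocalName, a.Value))))));
    }

    public static void Main()
    {
        string markup = "<asp:label ID=\"lbl1\" runat=\"server\" Text=\"Hi\"></asp:label>";
        Console.WriteLine(ListControls(markup));
    }
}
```

Real pages usually start with a <%@ Page ... %> directive, so this only applies to fragments or to files that have been cleaned up first.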
If the markup for a tag is written across more than one line, extracting the tag data may fail. To avoid that, I removed the tab and newline characters from the source string (content) before passing it to the above function:
string contentRemovedNewLines = Regex.Replace(content, @"\t|\n|\r", "");
Then we can use contentRemovedNewLines instead of content.
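As an alternative to stripping the line breaks, the tag expression can be compiled with RegexOptions.Singleline, which makes '.' match newline characters too. The sketch below only demonstrates this for the first-stage tag expression; the class and method names are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MultiLineTagSketch
{
    public static string[] MatchTags(string content)
    {
        // Singleline lets '.' match '\n', so a tag split over several lines still matches.
        Regex tagExpr = new Regex("<[^%/].*?>",
            RegexOptions.CultureInvariant | RegexOptions.IgnoreCase | RegexOptions.Singleline);

        MatchCollection matches = tagExpr.Matches(content);
        string[] tags = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
            tags[i] = matches[i].Value;
        return tags;
    }

    public static void Main()
    {
        string content = "<asp:Label ID=\"lbl1\"\r\n    runat=\"server\"\r\n    Text=\"Hi\" />";
        foreach (string tag in MatchTags(content))
            Console.WriteLine(tag);
    }
}
```

This keeps the original text intact, which matters if attribute values or inner text contain meaningful whitespace.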
The above code works as I wanted. One more thing can be added: you can call the above method as shown below and then save the result as an xml file, so we can check whether the expected result is there.
XmlDocument xmlDocWithWebContent = CreateDocumentFromMarkup(sourceToRead);
// Path.Combine inserts the directory separator that plain string concatenation would miss
string xmlfileLocation = Path.Combine(Path.GetDirectoryName(Assembly.GetExecutingAssembly().Location), "tempXmlOutputFileOfWebSource.xml");
xmlDocWithWebContent.Save(xmlfileLocation);
Saving requires the document to have a root element, so make sure the root is appended to the document (the XML declaration is optional):
XmlDocument document = new XmlDocument();
XmlDeclaration declaration = document.CreateXmlDeclaration("1.0", "utf-8", null);
document.AppendChild(declaration);
XmlElement root = document.CreateElement("controls");
document.AppendChild(root);
I used the above fix for that.
What I need to do: Extract the From, To, Cc and Subject information from an HTML file and remove it from the file, without using any third-party library (HtmlAgilityPack, etc.).
What I am having trouble with: What should my approach be to get the From, To, Subject and Cc values from the html tags?
Steps I tried: I tried to get the index of <p class=MsoNormal> and the last index of the email @sampleemail.com, but I think that is a bad approach since some html files contain a lot of "<p class=MsoNormal>" tags. To remove the From, To, Cc and Subject I just used string.Remove(indexOf, count), where count is the number of characters from indexOf to lastIndexOf, and it worked.
Sample tag containing information of from:
<p class=MsoNormal style='margin-left:120.0pt;text-indent:-120.0pt;tab-stops:120.0pt;mso-layout-grid align:none;text-autospace:none'><b><span style='color:black'>From:<span style='mso-tab-count:1'></span></span></b><span style='color:black'>1234@sampleemail.com<o:p></o:p></span></p>
HtmlAgilityPack is your friend. Simply use an XPath expression like //p[@class='MsoNormal'] to get the content of the matching tags in the HTML:
public static void Main()
{
    var html =
        @"<p class=MsoNormal style='margin-left:120.0pt;text-indent:-120.0pt;tab-stops:120.0pt;mso-layout-grid align:none;text-autospace:none'><b><span style='color:black'>From:<span style='mso-tab-count:1'></span></span></b><span style='color:black'>1234@sampleemail.com<o:p></o:p></span></p> ";
    var htmlDoc = new HtmlDocument();
    htmlDoc.LoadHtml(html);
    // SelectNodes returns null when nothing matches
    var nodes = htmlDoc.DocumentNode.SelectNodes("//p[@class='MsoNormal']");
    if (nodes != null)
        foreach (var node in nodes)
            Console.WriteLine(node.InnerText);
}
Result:
From:1234@sampleemail.com
Update
We can use Regex to write a simple parser that strips the tags. But remember that it cannot handle every case in a complicated html document.
public static void MainFunc()
{
    string str = @"<p class='MsoNormal' style='margin-left:120.0pt;text-indent:-120.0pt;tab-stops:120.0pt;mso-layout-grid align:none;text-autospace:none'><b><span style='color:black'>From:<span style='mso-tab-count:1'></span></span></b><span style='color:black'>1234@sampleemail.com<o:p></o:p></span></p> ";
    // Remove every tag, leaving only the text content
    var result = Regex.Replace(str, "<(?:\"[^\"]*\"['\"]*|'[^']*'['\"]*|[^'\">])+>", "");
    Console.WriteLine(result);
}
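Building on the tag-stripping idea, the following is a rough sketch of how the From, To, Cc and Subject paragraphs could be both extracted and removed without a third-party library. It assumes each header sits in its own <p> element and that its text starts with the label followed by a colon, as in the sample tag. The class and method names are hypothetical, and HTML entities are not decoded.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MailHeaderSketch
{
    // Returns the header fields found; 'remaining' receives the HTML without those paragraphs.
    public static Dictionary<string, string> ExtractHeaders(string html, out string remaining)
    {
        var headers = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);

        // Each <p ...>...</p> element; Singleline lets the body span several lines.
        Regex paragraph = new Regex(@"<p\b[^>]*>(?<body>.*?)</p>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        // Simplified tag stripper: assumes no '>' inside attribute values.
        Regex anyTag = new Regex(@"<[^>]+>");
        // Text of the form "From: value", "To: value", and so on.
        Regex header = new Regex(@"^\s*(?<name>From|To|Cc|Subject):\s*(?<value>.*?)\s*$",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        remaining = paragraph.Replace(html, m =>
        {
            string text = anyTag.Replace(m.Groups["body"].Value, "");
            Match h = header.Match(text);
            if (!h.Success)
                return m.Value;   // not a header paragraph: keep it unchanged
            headers[h.Groups["name"].Value] = h.Groups["value"].Value;
            return "";            // header paragraph: drop it from the output
        });
        return headers;
    }

    public static void Main()
    {
        string html = "<p class=MsoNormal><b>From:</b><span>1234@sampleemail.com</span></p><p>Body text</p>";
        string rest;
        foreach (KeyValuePair<string, string> pair in ExtractHeaders(html, out rest))
            Console.WriteLine(pair.Key + " = " + pair.Value);
        Console.WriteLine(rest);
    }
}
```

A body paragraph that happens to start with "To:" would also be treated as a header, so a real implementation would probably want to stop after the first non-header paragraph.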
I use "application 1" to create and edit xhtml files.
It has an option to enter annotations into the content of non-empty elements like p, h1, h2, td etc ... which results in mixed xml code sections like this:
<p>Hello <NS1:annotation [...SomeAttributes...]>everybody</NS1:annotation> out there!</p>
For translation purposes I have to export these xhtml files into "application 2", which can't deal with these internal elements. As the annotations are not part of the desired content in the translations, removing them before exporting to application 2 would be a perfect workaround:
<p>Hello everybody out there!</p>
Removing nodes from an XmlDocument reliably finds and removes the internal xml elements, but it also deletes the content of the annotation element, losing the word "everybody" in the example above:
<p>Hello out there!</p>
What I need is rather to "unbind" the content of these internal elements into the content of the parent element. But so far I haven't found a method in the C# xml tools that does the job.
So far I first save the xhtml file, re-open it as a text file and use regular expressions to remove the annotation. I can even use C# methods for it:
TextFile txt = new TextFile();
string s = txt.ReadFile(filename);
// [^>]* and the lazy (.+?) keep each match within a single annotation element
string pattern = @"<NS1:annotation[^>]*>(.+?)</NS1:annotation>";
string input = s;
string replacement = "$1";
Regex rgx = new Regex(pattern);
string result = rgx.Replace(input, replacement);
TextFile.Write(filename, result);
This is doubtlessly a better solution as it doesn't lose the content of the annotation, but I wonder whether there is really no solution based on the C# xml tools that does the job.
Anybody out there who knows it?
I think I found an answer using XmlDocument.
The key is that in mixed xml nodes the text surrounding the node can be addressed as xml nodes too. I wasn't aware of this ...
The following function unbinds the content of the mixed node and releases it into the content of the parent node. I haven't tested it for nodes containing multiple annotations, but that's enough for me at the moment ...
private void removeAnnotations(XmlDocument doc)
{
    // Use the document's own name table for the namespace manager
    XmlNamespaceManager manager = new XmlNamespaceManager(doc.NameTable);
    manager.AddNamespace("NS1", "http://www.someurl.net");
    XmlNodeList annotations = doc.SelectNodes("//NS1:annotation", manager);
    int i = 0;
    while (i < annotations.Count)
    {
        // in mixed xml the siblings are xml text nodes, so we write them into buffers
        string s0 = "";
        if (annotations[i].PreviousSibling != null) s0 = annotations[i].PreviousSibling.InnerText;
        string s2 = "";
        if (annotations[i].NextSibling != null) s2 = annotations[i].NextSibling.InnerText;
        // buffer the content of the annotation itself
        string s1 = annotations[i].InnerText;
        // buffer the link to the parent node before we remove the annotation
        XmlNode parent = annotations[i].ParentNode;
        // now remove the annotation
        parent.RemoveChild(annotations[i]);
        // and apply the new text to the parent element (this replaces all of its children)
        parent.InnerText = s0 + s1 + s2;
        i++;
    }
}
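For parents that contain several annotations or other child elements, a different technique may be more robust than rebuilding the parent's text: move the children of each annotation in front of it and then remove the now empty annotation. The sketch below is one possible way to do that; the class and method names are made up, and the namespace URI is passed in because the real one is not known here.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class AnnotationSketch
{
    public static void UnwrapAnnotations(XmlDocument doc, string namespaceUri)
    {
        XmlNamespaceManager manager = new XmlNamespaceManager(doc.NameTable);
        manager.AddNamespace("NS1", namespaceUri);

        // Copy the matches into a list first, so the document can be changed while looping.
        List<XmlNode> annotations = new List<XmlNode>();
        foreach (XmlNode node in doc.SelectNodes("//NS1:annotation", manager))
            annotations.Add(node);

        foreach (XmlNode annotation in annotations)
        {
            XmlNode parent = annotation.ParentNode;
            // Move every child (text or element) in front of the annotation, keeping the order.
            while (annotation.HasChildNodes)
                parent.InsertBefore(annotation.FirstChild, annotation);
            parent.RemoveChild(annotation);
        }
    }

    public static void Main()
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml("<p xmlns:NS1=\"http://www.someurl.net\">Hello " +
                    "<NS1:annotation id=\"1\">everybody</NS1:annotation> out there!</p>");
        UnwrapAnnotations(doc, "http://www.someurl.net");
        Console.WriteLine(doc.DocumentElement.InnerText);
    }
}
```

Because the sibling nodes are never touched, other elements inside the parent survive, and nested markup inside an annotation is kept as well.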
<TestCase Name="DEBUG">
  <ActionEnvironment Name="Carved records indication">
    <Define Name="_TestedVersionPath" Value="{CustomParam {paramName=PA tested version installer folder path}, {appName=PA installer}, {hint=\\ptnas1\builds\Temp Builds\Forensic\Physical Analyzer\PA.Test\UFED_Analyzer_17.02.05_03-00_6.0.0.128\EncryptedSetup}}"/>
    <Define Name="_PathOfdata" Value="SharedData\myfolder\mydata.xml"/>
    <ActionSet Name="DEBUG">
      <Actions>
        <SpecialAction ActionName="myactionname">
          <CaseName>123</CaseName>
          <UaeSendQueryValues>
            <URL>192.168.75.133</URL>
            <RestURL></RestURL>
            <UserName>user1</UserName>
            <Password>aaa</Password>
            <PathOfQuery>_PathOfdata</PathOfQuery>
            <Method>GET</Method>
            <ParamsFromFile></ParamsFromFile>
          </UaeSendQueryValues>
        </SpecialAction>
      </Actions>
    </ActionSet>
  </ActionEnvironment>
</TestCase>
I have the above xml. I need to find every PathOfQuery tag, get its text (in the example, _PathOfdata), then go up the xml tree and find the first Define tag whose Name equals the text of the PathOfQuery tag, and get its Value (in the example, "SharedData\myfolder\mydata.xml").
Then I would like to replace that value with another string.
I need to do this for each PathOfQuery tag that appears in the xml (there could be more than one), and I always want the first occurrence of the Define tag (there could be more than one) that I meet when travelling up the tree from the point where the PathOfQuery tag was found.
I want to do this in C#.
Any help will be appreciated.
Let's assume string s holds the above Xml. Then the following code will work for you:
XmlDocument xDoc = new XmlDocument();
xDoc.LoadXml(s);
XmlNode pathOfQuery = xDoc.SelectSingleNode("//PathOfQuery");
string pathOfQueryValue = pathOfQuery.InnerText;
Console.WriteLine(pathOfQueryValue);
// Look up the Define element whose Name attribute equals the PathOfQuery text
XmlNode define = xDoc.SelectSingleNode("//Define[@Name='" + pathOfQueryValue + "']");
if (define != null)
{
    string defineTagValue = define.Attributes["Value"].Value;
    Console.WriteLine(defineTagValue);
    pathOfQuery.InnerText = defineTagValue;
    Console.WriteLine(pathOfQuery.InnerText);
}
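The question also asks to handle every PathOfQuery tag and to pick the Define that is met first when walking up the tree. A possible sketch of that walk is shown below. It assumes "first" means the matching Define child of the closest ancestor, and that the replacement is written to the Define's Value attribute. The class and method names are illustrative only.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class DefineLookupSketch
{
    // Walks up from 'start' and returns the first <Define Name="name"> child of the closest ancestor.
    public static XmlElement FindNearestDefine(XmlNode start, string name)
    {
        for (XmlNode ancestor = start.ParentNode; ancestor != null; ancestor = ancestor.ParentNode)
        {
            foreach (XmlNode child in ancestor.ChildNodes)
            {
                XmlElement element = child as XmlElement;
                if (element != null && element.Name == "Define" && element.GetAttribute("Name") == name)
                    return element;
            }
        }
        return null; // no matching Define on the way up
    }

    // Replaces the Value of the nearest matching Define for every PathOfQuery; returns the number of replacements.
    public static int ReplaceDefineValues(XmlDocument doc, string newValue)
    {
        // Copy the matches first, so the document can be modified while looping.
        List<XmlNode> queries = new List<XmlNode>();
        foreach (XmlNode node in doc.SelectNodes("//PathOfQuery"))
            queries.Add(node);

        int replaced = 0;
        foreach (XmlNode pathOfQuery in queries)
        {
            XmlElement define = FindNearestDefine(pathOfQuery, pathOfQuery.InnerText.Trim());
            if (define == null)
                continue;
            define.SetAttribute("Value", newValue);
            replaced++;
        }
        return replaced;
    }

    public static void Main()
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml("<TestCase><Define Name=\"_PathOfdata\" Value=\"old.xml\"/>" +
                    "<Actions><PathOfQuery>_PathOfdata</PathOfQuery></Actions></TestCase>");
        Console.WriteLine(ReplaceDefineValues(doc, "new.xml"));
        Console.WriteLine(doc.OuterXml);
    }
}
```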
So I have an HTML snippet that I want to modify using C#.
<div>
This is a specialSearchWord that I want to link to
<img src="anImage.jpg" />
A hyperlink
Some more text and that specialSearchWord again.
</div>
and I want to transform it to this:
<div>
This is a <a class="special" href="http://mysite.com/search/specialSearchWord">specialSearchWord</a> that I want to link to
<img src="anImage.jpg" />
A hyperlink
Some more text and that <a class="special" href="http://mysite.com/search/specialSearchWord">specialSearchWord</a> again.
</div>
I'm going to use HTML Agility Pack based on the many recommendations here, but I don't know where to start. In particular:
1. How do I load a partial snippet as a string, instead of a full HTML document?
2. How do I edit it?
3. How do I then return the text string of the edited object?
1. The same as a full HTML document. It doesn't matter.
2. There are 2 options: you may edit the InnerHtml property directly (or Text on text nodes), or modify the DOM tree by using e.g. AppendChild, PrependChild etc.
3. You may use the HtmlDocument.DocumentNode.OuterHtml property or the HtmlDocument.Save method (personally I prefer the second option).
As to parsing, I select the text nodes which contain the search term inside your div, and then just use string.Replace method to replace it:
var doc = new HtmlDocument();
doc.LoadHtml(html);
var textNodes = doc.DocumentNode.SelectNodes("/div/text()[contains(.,'specialSearchWord')]");
if (textNodes != null)
    foreach (HtmlTextNode node in textNodes)
        node.Text = node.Text.Replace("specialSearchWord", "<a class='special' href='http://mysite.com/search/specialSearchWord'>specialSearchWord</a>");
And saving the result to a string:
string result = null;
using (StringWriter writer = new StringWriter())
{
    doc.Save(writer);
    result = writer.ToString();
}
Answers:
1. There may be a way to do this but I don't know how. I suggest loading the entire document.
2. Use a combination of XPath and regular expressions.
3. See the code below for a contrived example. You may have other constraints not mentioned, but this code sample should get you started.
Note that your XPath expression may need to be more complex to find the div that you want.
HtmlDocument doc = new HtmlDocument();
doc.Load(yourHtmlFile);
HtmlNode divNode = doc.DocumentNode.SelectSingleNode("//div[2]");
string newDiv = Regex.Replace(divNode.InnerHtml, @"specialSearchWord",
    "<a class='special' href='http://etc'>specialSearchWord</a>");
divNode.InnerHtml = newDiv;
Console.WriteLine(doc.DocumentNode.OuterHtml);
How can I select every paragraph in a div tag? For example:
<div id="body_text">
<p>Hi</p>
<p>Help Me Please</p>
<p>Thankyou</p>
</div>
I have downloaded Html Agility Pack and referenced it in my program. All I need is the paragraphs. There may be a variable number of paragraphs, and there are loads of different div tags, but I only need the content within body_text. I assume this can then be stored as a string, which I want to write to a .txt file for later reference. Thank you.
The valid XPath for your case is //div[@id='body_text']/p
foreach (HtmlNode node in yourHTMLAgilityPackDocument.DocumentNode.SelectNodes("//div[@id='body_text']/p"))
{
    string text = node.InnerText; // that's the text you are looking for
}
Here's a solution that grabs the paragraphs as an enumeration of HtmlNodes:
HtmlDocument doc = new HtmlDocument();
doc.Load("your.html");
var div = doc.GetElementbyId("body_text");
var paragraphs = div.ChildNodes.Where(item => item.Name == "p");
Without explicit Linq:
var paragraphs = doc.GetElementbyId("body_text").Elements("p");