I'm having trouble with some loops.
I'm using HtmlAgilityPack. I have a TXT file with several links (one per line); for each link in that file I want to navigate to the page, extract an element with XPath, and write it to a memo.
The problem I'm having is that the code only carries out the procedure for the last line of the TXT. Where am I wrong?
var Webget = new HtmlWeb();
foreach (string line in File.ReadLines("c:\\test.txt"))
{
    var doc = Webget.Load(line);
    foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//*[@id='title-article']"))
    {
        memoEdit1.Text = node.ChildNodes[0].InnerHtml + "\r\n";
        break;
    }
}
Try changing
memoEdit1.Text = node.ChildNodes[0].InnerHtml + "\r\n";
to
memoEdit1.Text += node.ChildNodes[0].InnerHtml + "\r\n";
You're overwriting memoEdit1.Text every time. Try
memoEdit1.Text += node.ChildNodes[0].InnerHtml + "\r\n";
instead. Note the += instead of =, which appends the new text each time instead of replacing it.
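For completeness, here is the original loop with only that one change (everything else stays as it was):

var Webget = new HtmlWeb();
foreach (string line in File.ReadLines("c:\\test.txt"))
{
    var doc = Webget.Load(line);
    foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//*[@id='title-article']"))
    {
        // += appends to the memo instead of replacing its contents
        memoEdit1.Text += node.ChildNodes[0].InnerHtml + "\r\n";
        break;
    }
}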
Incidentally, constantly appending strings together isn't really the best way. Something like this might be better:
var Webget = new HtmlWeb();
var builder = new StringBuilder();
foreach (string line in File.ReadLines("c:\\test.txt"))
{
    var doc = Webget.Load(line);
    foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//*[@id='title-article']"))
    {
        builder.AppendFormat("{0}\r\n", node.ChildNodes[0].InnerHtml);
        break;
    }
}
memoEdit1.Text = builder.ToString();
Or, using LINQ:
var Webget = new HtmlWeb();
memoEdit1.Text = string.Join(
    "\r\n",
    File.ReadAllLines("c:\\test.txt")
        .Select(line => Webget.Load(line).DocumentNode.SelectNodes("//*[@id='title-article']").First().ChildNodes[0].InnerHtml));
If you are only selecting one node in the inner loop, use SelectSingleNode instead. Also, the better practice when concatenating strings in a loop is to use a StringBuilder:
StringBuilder builder = new StringBuilder();
var Webget = new HtmlWeb();
foreach (string line in File.ReadLines("c:\\test.txt"))
{
    var doc = Webget.Load(line);
    builder.AppendLine(doc.DocumentNode.SelectSingleNode("//*[@id='title-article']").InnerHtml);
}
memoEdit1.Text = builder.ToString();
Using LINQ it will look like this:
var Webget = new HtmlWeb();
var result = File.ReadLines("c:\\test.txt")
    .Select(line => Webget.Load(line).DocumentNode.SelectSingleNode("//*[@id='title-article']").InnerHtml);
memoEdit1.Text = string.Join(Environment.NewLine, result);
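One caveat, not in the original answer: SelectSingleNode returns null when a page has no matching node, so if any of the listed pages might lack the title-article element, a defensive sketch would filter those out first:

var Webget = new HtmlWeb();
var result = File.ReadLines("c:\\test.txt")
    .Select(line => Webget.Load(line).DocumentNode.SelectSingleNode("//*[@id='title-article']"))
    .Where(node => node != null)  // skip pages without the element
    .Select(node => node.InnerHtml);
memoEdit1.Text = string.Join(Environment.NewLine, result);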
I have an xmlString that I am parsing to an XDocument:
xmlString =
    "<TestXml>" +
      "<Data>" +
        "<leadData>" +
          "<Email>testEmail@yahoo.ca</Email>" +
          "<FirstName>John</FirstName>" +
          "<LastName>Doe</LastName>" +
          "<Phone>555-555-5555</Phone>" +
          "<AddressLine1>123 Fake St</AddressLine1>" +
          "<AddressLine2></AddressLine2>" +
          "<City>Metropolis</City>" +
          "<State>DC</State>" +
          "<Zip>20016</Zip>" +
        "</leadData>" +
      "</Data>" +
    "</TestXml>";
I parse the string to an XDocument, and then try and iterate through the nodes:
XDocument xDoc = XDocument.Parse(xmlString);
Dictionary<string, string> xDict = new Dictionary<string, string>();
//Convert xDocument to Dictionary
foreach (var child in xDoc.Root.Elements())
{
    //xDict.Add();
}
This will only iterate once, and the one iteration seems to have all of the data in it. I realize I am doing something wrong, but after googling around I have no idea what.
Try xDoc.Root.Descendants() instead of xDoc.Root.Elements() in your foreach loop.
Your root has only one child, Data, which is why the loop iterates only once.
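A small sketch to illustrate the difference, assuming the xmlString above: Elements() returns only the direct children of a node, while Descendants() walks the entire subtree.

XDocument xDoc = XDocument.Parse(xmlString);

// Direct children of the root: just <Data>, so one iteration
foreach (var child in xDoc.Root.Elements())
    Console.WriteLine(child.Name);   // Data

// Every element below the root: Data, leadData, Email, FirstName, ...
foreach (var child in xDoc.Root.Descendants())
    Console.WriteLine(child.Name);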
var xDict = XDocument.Parse(xmlString)
    .Descendants("leadData")
    .Elements()
    .ToDictionary(e => e.Name.LocalName, e => (string)e);
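Run against the sample document, that dictionary maps each leaf element's name to its text:

foreach (var pair in xDict)
    Console.WriteLine("{0} = {1}", pair.Key, pair.Value);
// Email = testEmail@yahoo.ca
// FirstName = John
// LastName = Doe
// ...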
I'm able to read and download a list of .jpg files on a page using this regular expression:
MatchCollection match = Regex.Matches(htmlText, @"http://.*?\b.jpg\b", RegexOptions.RightToLeft);
Output example: http://somefiles.jpg from this line
<img src="http://somefiles.jpg"/> in html
Question: how could I read files in this kind of format?
I just want to extract files with .exe on the page. So in the example above ^ I just want to get the datavoila-setup.exe file. Sorry, I'm a little noob and confused about how to do it T_T. Thanks in advance to anyone who can help me. :)
This is my updated code, but I'm getting an error on the HtmlDocument doc = new HtmlDocument(); part ("No Source Available"), and I'm getting a null value for list :(
protected void Button2_Click(object sender, EventArgs e)
{
    //Get the url given by the user
    string urls;
    urls = txtSiteAddress.Text;
    StringBuilder result = new StringBuilder();
    //Give request to the url given
    HttpWebRequest requesters = (HttpWebRequest)HttpWebRequest.Create(urls);
    requesters.UserAgent = "";
    //Check for the web response
    WebResponse response = requesters.GetResponse();
    Stream streams = response.GetResponseStream();
    //reads the url as html codes
    StreamReader readers = new StreamReader(streams);
    string htmlTexts = readers.ReadToEnd();
    HtmlDocument doc = new HtmlDocument();
    doc.Load(streams);
    var list = doc.DocumentNode.SelectNodes("//a[@href]")
        .Select(p => p.Attributes["href"].Value)
        .Where(x => x.EndsWith("exe"))
        .ToList();
    doc.Save("list");
}
This is Flipbed's answer; it works, but I'm not getting a clean catch :( I think there is something to edit in how the HTML is split into text.
protected void Button2_Click(object sender, EventArgs e)
{
    //Get the url given by the user
    string urls;
    urls = txtSiteAddress.Text;
    StringBuilder result = new StringBuilder();
    //Give request to the url given
    HttpWebRequest requesters = (HttpWebRequest)HttpWebRequest.Create(urls);
    requesters.UserAgent = "";
    //Check for the web response
    WebResponse response = requesters.GetResponse();
    Stream streams = response.GetResponseStream();
    //reads the url as html codes
    StreamReader readers = new StreamReader(streams);
    string htmlTexts = readers.ReadToEnd();
    WebClient webclient = new WebClient();
    string checkurl = webclient.DownloadString(urls);
    List<string> list = new List<string>();
    //Splits the html on quote characters
    string[] parts = htmlTexts.Split(new string[] { "\"" },
        StringSplitOptions.RemoveEmptyEntries);
    //Compares the split text with valid file extension
    foreach (string part in parts)
    {
        if (part.EndsWith(".exe"))
        {
            list.Add(part);
            //Download the data into a Byte array
            byte[] fileData = webclient.DownloadData(this.txtSiteAddress.Text + '/' + part);
            //Create FileStream that will write the byte array to
            FileStream file =
                File.Create(this.txtDownloadPath.Text + "\\" + list);
            //Write the full byte array to the file
            file.Write(fileData, 0, fileData.Length);
            //Download message complete
            lblMessage.Text = "Download Complete!";
            //Clears the textfields content
            txtSiteAddress.Text = "";
            txtDownloadPath.Text = "";
            //Close the file so other processes can access it
            file.Close();
            break;
        }
    }
}
This is not an answer but too long for a comment. (I'll delete it later.)
To settle the "it works" / "it doesn't work" back and forth, here is complete code for those who may want to check:
string html = @"";
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);

//Anirudh's Solution
var itemList = doc.DocumentNode.SelectNodes("//a//@href")//get all hrefs
    .Select(p => p.InnerText)
    .Where(x => x.EndsWith("exe"))
    .ToList();
//returns empty list

//correct one
var itemList2 = doc.DocumentNode.SelectNodes("//a[@href]")
    .Select(p => p.Attributes["href"].Value)
    .Where(x => x.EndsWith("exe"))
    .ToList();
//returns download/datavoila-setup.exe
Regex is not a good choice for parsing HTML files.
HTML is not strict, nor is it regular in its format.
Use HtmlAgilityPack.
You can use this code to retrieve all the exe links using HtmlAgilityPack:
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://yourWebSite.com");
var itemList = doc.DocumentNode.SelectNodes("//a[@href]")//get all hrefs
    .Select(p => p.Attributes["href"].Value)
    .Where(x => x.EndsWith("exe"))
    .ToList();
itemList now contains all the exe links.
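Since the goal of the thread is to download those files, a possible next step (a sketch; hrefs are often relative, like download/datavoila-setup.exe above, so resolve them against the page URL first):

using (var client = new WebClient())
{
    foreach (string href in itemList)
    {
        // The Uri constructor handles both relative and absolute hrefs
        var uri = new Uri(new Uri("http://yourWebSite.com"), href);
        client.DownloadFile(uri, Path.GetFileName(uri.LocalPath));
    }
}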
I would use FizzlerEx, which adds jQuery-like syntax to HtmlAgilityPack. Use the ends-with selector to test the href attribute:
using HtmlAgilityPack;
using Fizzler.Systems.HtmlAgilityPack;

var web = new HtmlWeb();
var document = web.Load("http://example.com/page.html");
var page = document.DocumentNode;

foreach (var item in page.QuerySelectorAll("a[href$='exe']"))
{
    var file = item.Attributes["href"].Value;
}
And an explanation of why it is bad to parse HTML with RegEx: http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html
Instead of using regular expressions you could just use normal code.
List<string> files = new List<string>();
string[] parts = htmlText.Split(new string[] { "\"" },
    StringSplitOptions.RemoveEmptyEntries);
foreach (string part in parts)
{
    if (part.EndsWith(".exe"))
        files.Add(part);
}
In this case you would have all the found files in the files list.
EDIT:
You could do:
List<string> files = new List<string>();
string[] hrefs = htmlText.Split(new string[] { "href=\"" },
    StringSplitOptions.RemoveEmptyEntries);
foreach (string href in hrefs)
{
    string[] possibleFile = href.Split(new string[] { "\"" },
        StringSplitOptions.RemoveEmptyEntries);
    if (possibleFile.Length > 0 && possibleFile[0].EndsWith(".exe"))
        files.Add(possibleFile[0]);
}
This would also check that the exe file is within a href.
XDocument coordinates = XDocument.Load("http://feeds.feedburner.com/TechCrunch");
System.IO.StreamWriter StreamWriter1 = new System.IO.StreamWriter(DestFile);
XNamespace nsContent = "http://purl.org/rss/1.0/modules/content/";
string pchild = null;
foreach (var item in coordinates.Descendants("item"))
{
    string link = item.Element("guid").Value;
    //string content = item.Element(nsContent + "encoded").Value;
    foreach (var child in item.Descendants(nsContent + "encoded"))
    {
        pchild = pchild + child.Element("p").Value;
    }
    StreamWriter1.WriteLine(link + Environment.NewLine + Environment.NewLine + pchild + Environment.NewLine);
}
StreamWriter1.Close();
If I use the commented line (string content = item.Element(nsContent + "encoded").Value;) instead of the inner foreach loop, it fetches the value of the <content:encoded> element, but that contains all the links, images, etc., and I want only the text.
To filter those out I tried the inner foreach loop, but it's showing this error:
Object reference not set to an instance of an object.
Please suggest code so that I can store only the text and strip all the links, <img> tags, etc.
The content of item.Element(nsContent + "encoded").Value is HTML, not XML. You should parse it accordingly, for example with Html Agility Pack.
See the example below:
string content = item.Element(nsContent + "encoded").Value;

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.Load(new StringReader(content));

var text = String.Join(Environment.NewLine + Environment.NewLine,
    doc.DocumentNode
       .Descendants("p")
       .Select(n => "\t" + System.Web.HttpUtility.HtmlDecode(n.InnerText))
);
Firstly, I would start by using a StringBuilder:
StringBuilder sb = new StringBuilder();
Then, I suspect that sometimes, the "child" doesn't have a "p" element, so you can check before using it:
foreach (var child in item.Descendants(nsContent + "encoded"))
{
    if (child.Element("p") != null)
    {
        sb.Append(child.Element("p").Value);
    }
}
StreamWriter1.WriteLine(link + Environment.NewLine + Environment.NewLine + sb.ToString() + Environment.NewLine);
Does that work for you?
I want to go to multiple pages using ASP.NET 4.0, copy all the HTML, and finally paste it into a text box. From there I would like to run my parsing function. What is the best way to handle this?
protected void goButton_Click(object sender, EventArgs e)
{
    if (datacenterCombo.Text == "BL2")
    {
        fwURL = "http://website1.com/index.html";
        l2URL = "http://website2.com/index.html";
        lbURL = "http://website3.com/index.html";
        l3URL = "http://website4.com/index.html";
        coreURL = "http://website5.com/index.html";

        WebRequest objRequest = HttpWebRequest.Create(fwURL);
        WebRequest layer2 = HttpWebRequest.Create(l2URL);
        objRequest.Credentials = CredentialCache.DefaultCredentials;

        using (StreamReader layer2Reader = new StreamReader(layer2.GetResponse().GetResponseStream()))
        using (StreamReader objReader = new StreamReader(objRequest.GetResponse().GetResponseStream()))
        {
            originalBox.Text = objReader.ReadToEnd();
        }

        objRequest = HttpWebRequest.Create(l2URL);

        //Read all lines of file
        String[] crString = { "<BR> " };
        String[] aLines = originalBox.Text.Split(crString, StringSplitOptions.RemoveEmptyEntries);
        String noHtml = String.Empty;
        for (int x = 0; x < aLines.Length; x++)
        {
            if (aLines[x].Contains(ipaddressBox.Text))
            {
                noHtml += (RemoveHTML(aLines[x]) + "\r\n");
            }
        }
        //Print results to textbox
        resultsBox.Text = String.Join(Environment.NewLine, noHtml);
    }
}

public static string RemoveHTML(string text)
{
    text = text.Replace("&nbsp;", " ").Replace("<br>", "\n");
    var oRegEx = new System.Text.RegularExpressions.Regex("<[^>]+>");
    return oRegEx.Replace(text, string.Empty);
}
Instead of doing all this manually, you should probably use HtmlAgilityPack; then you could do something like this:
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://google.com");
var targetNodes = doc.DocumentNode
    .Descendants()
    .Where(x => x.ChildNodes.Count == 0
             && x.InnerText.Contains(someIpAddress));
foreach (var node in targetNodes)
{
    //do something
}
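To mirror the original goal of collecting matching lines into the results textbox, the loop body might look like this (a sketch using the question's resultsBox):

var sb = new StringBuilder();
foreach (var node in targetNodes)
{
    sb.AppendLine(node.InnerText.Trim());
}
resultsBox.Text = sb.ToString();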
If HtmlAgilityPack is not an option for you, simplify at least the download portion of your code and use a WebClient:
using (WebClient wc = new WebClient())
{
    string html = wc.DownloadString("http://google.com");
}
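Since the question involves several URLs, the WebClient approach extends naturally to a loop; a sketch reusing the RemoveHTML helper and the controls from the question:

string[] urls = { fwURL, l2URL, lbURL, l3URL, coreURL };
var noHtml = new StringBuilder();
using (WebClient wc = new WebClient())
{
    foreach (string url in urls)
    {
        string html = wc.DownloadString(url);
        foreach (string aLine in html.Split(new[] { "<BR> " }, StringSplitOptions.RemoveEmptyEntries))
        {
            if (aLine.Contains(ipaddressBox.Text))
                noHtml.AppendLine(RemoveHTML(aLine));
        }
    }
}
resultsBox.Text = noHtml.ToString();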
I have a MatchCollection, and I need group index 1.
Right now I pull the data out through a large number of casts, which I would like to avoid.
Example: startTag = <a>, endTag = </a>,
Html = <a>texttexttext</a>.
I need to get "texttexttext" without the <a> and </a>.
var regex = new Regex(startTag + "(.*?)" + endTag, RegexOptions.IgnoreCase);
var matchCollection = regex.Matches(html);
foreach (var item in matchCollection)
{
    string temp = ((Match)(((Group)(item)).Captures.SyncRoot)).Groups[1].Value;
}
I would recommend using Html Agility Pack to parse HTML instead of regex, for various reasons.
So, applied to your example of finding all anchor text inside an HTML document:
using System;
using System.Net;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        string html = "";
        using (var client = new WebClient())
        {
            html = client.DownloadString("http://stackoverflow.com");
        }

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a"))
        {
            // Will print all text contained inside all anchors
            // on http://stackoverflow.com
            Console.WriteLine(link.InnerText);
        }
    }
}
You could use a capture group. You might also want to use a named group. Notice the parentheses I added to the regex.
var html = "<a>xx yyy</a> <a>bbb cccc</a>";
var startTag = "<a>";
var endTag = "</a>";
var regex = new Regex(startTag + "((.*?))" + endTag, RegexOptions.IgnoreCase);
var matchCollection = regex.Matches(html);
foreach (Match item in matchCollection)
{
    var data = item.Groups[1];
    Console.WriteLine(data);
}
This is even a little nicer, because a named group is a little easier to grab.
var html = "<a>xx yyy</a> <a>bbb cccc</a>";
var startTag = "<a>";
var endTag = "</a>";
var regex = new Regex(startTag + "(?<txt>(.*?))" + endTag, RegexOptions.IgnoreCase);
var matchCollection = regex.Matches(html);
foreach (Match item in matchCollection)
{
    var data = item.Groups["txt"];
    Console.WriteLine(data);
}
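One caveat: because startTag and endTag are concatenated straight into the pattern, tags containing regex metacharacters would break it. Regex.Escape guards against that (a sketch):

var regex = new Regex(Regex.Escape(startTag) + "(?<txt>.*?)" + Regex.Escape(endTag),
                      RegexOptions.IgnoreCase);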