Scrape website with HTML Agility Pack, find class - C#

I am trying to get some data from an HTML string using HTML Agility Pack.
The row I am trying to get the data from returns InnerHtml like this:
<td class="street">Riksdagen</td>
<td class="number"> </td>
<td class="number"> </td>
<td class="postalcode">100 12</td>
<td class="locality">Stockholm</td>
<td class="region_code">018001</td>
<td class="county">Stockholm</td>
<td class="namnkommun">Stockholm</td>
How can I assign each class to the right AddressDataModel property?
var row = doc.DocumentNode.SelectNodes("//*[@id='thetable']/tr");
foreach (var rowItem in row)
{
var addressDataModel = new AddressDataModel
{
street = rowItem.FirstChild.InnerText,
zipCodeFrom = // Next item,
zipCodeTo = // Next item,
zipCode = // Next item,
locality = // Next item,
regionCode = // Next item,
state = // Next item,
county = // Next item
};
}

You can write something like this (make sure the node exists before you use its InnerText property, because SelectSingleNode returns null when nothing matches):
var addressDataModel = new AddressDataModel
{
street = rowItem.SelectSingleNode("./td[@class='street']").InnerText,
zipCodeFrom = // Next item,
zipCodeTo = // Next item,
zipCode = // Next item,
locality = // Next item,
regionCode = // Next item,
state = // Next item,
county = rowItem.SelectSingleNode("./td[@class='county']").InnerText
};
Reference: http://www.w3schools.com/xpath/xpath_syntax.asp
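The null check mentioned above can be pulled into a small helper. This is only a sketch: CellText and AddressRowDemo are hypothetical names, the sample row is shortened from the question, and it assumes the HtmlAgilityPack NuGet package is referenced.

```csharp
using System;
using HtmlAgilityPack;

public static class AddressRowDemo
{
    // Hypothetical helper: returns the trimmed text of the first <td> in the row
    // whose class attribute equals cssClass, or "" when the row has no such cell.
    public static string CellText(HtmlNode row, string cssClass)
    {
        HtmlNode cell = row.SelectSingleNode("./td[@class='" + cssClass + "']");
        return cell == null ? string.Empty : cell.InnerText.Trim();
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(
            "<table id='thetable'><tr>" +
            "<td class='street'>Riksdagen</td>" +
            "<td class='postalcode'>100 12</td>" +
            "<td class='locality'>Stockholm</td>" +
            "</tr></table>");

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection rows = doc.DocumentNode.SelectNodes("//*[@id='thetable']/tr");
        if (rows == null) return;

        foreach (HtmlNode row in rows)
        {
            Console.WriteLine("street:   " + CellText(row, "street"));
            Console.WriteLine("zipCode:  " + CellText(row, "postalcode"));
            Console.WriteLine("locality: " + CellText(row, "locality"));
            // The sample row has no 'county' cell, so this prints an empty value
            // instead of throwing a NullReferenceException.
            Console.WriteLine("county:   " + CellText(row, "county"));
        }
    }
}
```

In the object initializer from the answer, each property could then be assigned as `street = AddressRowDemo.CellText(rowItem, "street")` and so on, one class name per property.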

You can also use this approach if you don't want to use XPath:
HtmlAgilityPack.HtmlDocument htmlContent = new HtmlAgilityPack.HtmlDocument();
htmlContent.LoadHtml(htmlCode);
if (htmlContent.DocumentNode != null)
{
foreach (HtmlNode n in htmlContent.DocumentNode.Descendants("div"))
{
if (n.HasAttributes && n.Attributes["class"] != null)
{
if (n.Attributes["class"].Value == "className")
{
// Do something
}
}
}
}

Related

PartialView(); The name does not exist in the current context

I'm a bit stuck. I'm still learning all this, but I had to add a CSV parser to my application, which should display the results on my alerts page. If I do
return PartialView(model, pin, tDate, stat);
the compiler tells me that pin, tDate and stat do not exist in the current context. If I take them out, the app runs but doesn't display the intended result.
I declared pin, tDate and stat in UserADInfoModel.cs.
Here is the controller:
public ActionResult _Alerts(UserADInfoModel model, List<string> groups)
{
DataRowCollection drColl = Core.GetAlerts(model);
ViewBag.Alerts = drColl;
var path = @"Exchange Migration data 10-25-17.csv";
using (TextFieldParser csvParser = new TextFieldParser(path))
{
csvParser.CommentTokens = new string[] { "#" };
csvParser.SetDelimiters(new string[] { "," });
csvParser.HasFieldsEnclosedInQuotes = true;
// Skip the row with the column names
csvParser.ReadLine();
// Read the lines
while (!csvParser.EndOfData)
{
string[] fields = csvParser.ReadFields();
string pin = fields[0];
string tDate = fields[2];
string stat = fields[6];
}
}
return PartialView(model, pin, tDate, stat);
}
And here is the view:
@if (@ViewBag.pin == Model.SAM)
{
<tr style="background-color : #ff3333; color: #ffffff">
<td style="padding-left :10px; padding-right: 10px;padding-top:2px; padding-bottom: 2px">
<p>Critical</p>
</td>
<td style="padding-left :10px; padding-right: 10px;padding-top:2px; padding-bottom: 2px">
<p>Exchange Migration</p>
</td>
<td style="padding-left :10px; padding-right: 10px;padding-top:2px; padding-bottom: 2px">
<p>Caller was set to migrate on (@ViewBag.tDate). The status of the migration is (@ViewBag.stat). Please contact Hypercare</p>
</td>
</tr>
}
@foreach (var x in ViewBag.Alerts)
{
var uClass = (x["Weight"].Contains("Warning")) ? "#ff8c1a, " : (x["Weight"].Contains("Critical")) ? "#ff3333" : "";
<tr @if (x["Weight"].Contains("Warning")) {
@MvcHtmlString.Create("style=\"background-color: #ff8c1a\"")
}
else if(x["Weight"].Contains("Critical")){
@MvcHtmlString.Create("style=\"background-color: #ff3333; color: #ffffff\"")
}>
What am I doing wrong? TIA
You declared pin, tDate and stat inside the while loop, so they are out of scope at the return statement after it.
Decide whether you want to use the model (recommended) or ViewBag to pass the data:
public ActionResult _Alerts(UserADInfoModel model, List<string> groups)
{
DataRowCollection drColl = Core.GetAlerts(model);
ViewBag.Alerts = drColl;
var path = @"Exchange Migration data 10-25-17.csv";
using (TextFieldParser csvParser = new TextFieldParser(path))
{
// ...
while (!csvParser.EndOfData)
{
string[] fields = csvParser.ReadFields();
model.pin = fields[0];
model.tDate = fields[2];
model.stat = fields[6];
}
}
return PartialView(model);
}
_Alerts.cshtml:
@model UserADInfoModel
@if (Model.pin == Model.SAM)
// ...
// ... @Model.tDate ... @Model.stat ...
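The scoping point can be seen in isolation in the sketch below. It also addresses a side effect of the loop above: assigning inside the loop on every pass keeps only the values of the last CSV row. If the intent is to find the row for the current user (an assumption, the question does not say so), the loop can stop at the first match. ScopeDemo, FindRow and the sample rows are hypothetical, and plain string arrays stand in for the output of TextFieldParser.ReadFields().

```csharp
using System;
using System.Collections.Generic;

public static class ScopeDemo
{
    // Hypothetical helper: returns the first row whose first field equals sam,
    // or null when no row matches.
    public static string[] FindRow(IEnumerable<string[]> rows, string sam)
    {
        // Declared before the loop, so it is still in scope at the return below.
        string[] match = null;
        foreach (string[] fields in rows)
        {
            if (fields[0] == sam)
            {
                match = fields;
                break; // stop at the first hit instead of overwriting on every pass
            }
        }
        return match;
    }

    public static void Main()
    {
        // Made-up sample data: field 0 is the pin, field 2 the date, field 6 the status.
        var rows = new List<string[]>
        {
            new[] { "jdoe", "-", "2017-10-25", "-", "-", "-", "Pending" },
            new[] { "asmith", "-", "2017-10-26", "-", "-", "-", "Done" },
        };

        string[] hit = FindRow(rows, "jdoe");
        if (hit != null)
        {
            Console.WriteLine(hit[0] + " " + hit[2] + " " + hit[6]);
        }
    }
}
```

In the controller, the matching fields would then be copied to model.pin, model.tDate and model.stat once, after the loop.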

How to display audit trail on a balloon

I have created an audit trail and it works fine: you can see all the changes and even the name of the table that was edited. Now I want to display the changes in a popup. Once the user edits some fields, I want to display fieldname, oldvalue, newvalue, username and date, but only when the balloon is clicked.
I have no idea how to do that. I have implemented a balloon which you can click to view the changes, but it doesn't work.
How can I display all the changes in the balloon?
The balloon I have created, customized the way I want it to look when there is an edit:
<div class="small-chat-box fadeInRight animated">
<div class="left">
<div class="author-name">
Claim Edit <small class="chat-date">
<br>09-06-2016 12:00 pm
</small>
</div>
<div class="chat-message active">
Username: JJ
<br>fieldname: Surname
<br>oldvalue: small
<br>newvalue: big
<br>date: @DateTime.Now
</div>
</div>
</div>
My audit method, which works fine:
public static List<ClaimAudit> GetClaimLog(SidDbContext db, int claimId)
{
--------------------
foreach (var entry in changeTrack)
{
if (entry.Entity != null)
{
string entityName = string.Empty;
string state = string.Empty;
switch (entry.State)
{
case EntityState.Modified:
entityName = ObjectContext.GetObjectType(entry.Entity.GetType()).Name;
state = entry.State.ToString();
foreach (string prop in entry.OriginalValues.PropertyNames)
{
object currentValue = entry.CurrentValues[prop];
object originalValue = entry.OriginalValues[prop];
if (!currentValue.Equals(originalValue))
{
ClaimLogs.Add(new ClaimAudit
{
tableName = entityName,
state = state,
fieldName = prop,
oldValue = Convert.ToString(originalValue),
newValue = Convert.ToString(currentValue),
userId = "JJ",
UpdateDate = DateTime.Now,
CommentsId = 1,
ClaimId = claimId
});
}
}
break;
}
}
}
Calling my update function on save:
public async Task<bool> UpdateClaimVehicle(ClaimVehicle vehicleModel)
{
List<ClaimAudit> ClaimLogs = new List<ClaimAudit>();
using (IBaseDataRepository<ClaimVehicle> bs = new BaseRepository<ClaimVehicle>())
{
using (SidDbContext db = new SidDbContext())
{
if (vehicleModel != null)
{
var oldClaim = db.ClaimVehicles.FirstOrDefault(x => x.ClaimId == vehicleModel.ClaimId);
oldClaim.VehicleMakeModelId = vehicleModel.VehicleMakeModelId;
oldClaim.VehicleRegistrationNumber = vehicleModel.VehicleRegistrationNumber;
oldClaim.VehicleYearModel = vehicleModel.VehicleYearModel;
oldClaim.VinNumber = vehicleModel.VinNumber;
ClaimLogs = GetClaimLog(db, oldClaim.ClaimId);
db.claimAuditLog.AddRange(ClaimLogs);
db.SaveChanges();
}
return true;
}
}
}
Here I was trying to display it in the view first, but it says the model does not exist:
@model Business.Models.ClaimDetailsModel
@foreach (var item in Model) //error here
{
<tr>
<td>
@Html.DisplayFor(modelItem => item.ClaimId)
</td>
</tr>
}

C#: find a varying column in an HTML table

How can I find out the sixth column in this html table (using for example HTML Agility Pack or Regex)?
<tr><td>So, 22.05.16</td><td>1</td><td>D</td><td>E</td><td>190</td><td>DifferentThings</td></tr>
The last column can contain anything, and this is only one row of many, so I want the full last column with every entry.
Edit:
If there is a blank
<td></td>
in the 6th row I always get a
System.NullReferenceException
What should I do now?
innerTextOfLastCell = lastTdCell.InnerText.Trim();
is causing the error.
Edit:
Solved it!
I just added a null check:
if (lastTdCell != null) //Not lastTdCell.InnerText.Trim()!
{
innerTextOfLastCell = lastTdCell.InnerText.Trim();
s = s + innerTextOfLastCell + "\n";
run.Text = s;
}
else
{
s = s + "\n\n";
run.Text = s;
}
Using HtmlAgilityPack, this should work regardless of the number of columns the table has.
var html = new HtmlDocument();
html.LoadHtml("<table><tr><td>So, 22.05.16</td><td>1</td><td>D</td><td>E</td><td>190</td><td>DifferentThings</td></tr></table>");
var root = html.DocumentNode;
var tableNodes = root.Descendants("table");
var innerTextOfLastCell = string.Empty;
foreach (var tbs in tableNodes.Select((tbNodes, i) => new { tbNodes = tbNodes, i = i }))
{
var trs = tbs.tbNodes.Descendants("tr");
foreach (var tr in trs.Select((trNodes, j) => new { trNodes = trNodes, j = j }))
{
var tds = tr.trNodes.Descendants("td");
var lastTdCell = tds.LastOrDefault();
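// LastOrDefault() returns null for a row without <td> cells, so guard the access.
innerTextOfLastCell = lastTdCell?.InnerText.Trim() ?? string.Empty;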
}
}
[edit]
If you want to use the other option from "How to get the value from a specific cell C# Html-Agility-Pack", you could try the following code:
HtmlNode lastTdnode = root.SelectSingleNode("//table[1]/tr[last()]/td[last()]");
This gives you the last <td> of the last <tr> of the first <table>.
If you want the sixth cell you can use something like this, which gives the same result as above for this table:
HtmlNode sixthTdNode = root.SelectSingleNode("//table[1]/tr[last()]/td[6]");
To make the cell index something you can vary, build the XPath string from it:
HtmlNode nthTdNode = root.SelectSingleNode("//table[1]/tr[last()]/td[" + 6 + "]");

LIKE statement or removal of trailing blanks in HTML Agility Pack?

I'm trying to download data from a website into a DataTable. The problem is that I cannot access the right node because there seem to be blank spaces. Here is my code so far:
public static DataTable downloadtable()
{
DataTable dt = new DataTable();
string htmlCode = "";
using (WebClient client = new WebClient())
{
client.Headers.Add(HttpRequestHeader.UserAgent, "AvoidError");
htmlCode = client.DownloadString("https://www.eex.com/en/Market%20Data/Trading%20Data/Power/Hour%20Contracts%20%7C%20Spot%20Hourly%20Auction/Area%20Prices/spot-hours-area-table/2013-08-22");
}
//this is just to check the file structure from text file
System.IO.StreamWriter file = new System.IO.StreamWriter("c:\\temp\\test.txt");
file.WriteLine(htmlCode);
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(htmlCode);
dt = new DataTable();
foreach (HtmlNode table in doc.DocumentNode.SelectNodes("//table[@class='list electricity']/tr/th[@class='title'][.='Market Area']"))
{
//This is the problem name where I get the error
foreach (HtmlNode row in table.SelectNodes("//td[@class='title'][.=' 00-01 ']"))
{
foreach (var cell in row.SelectNodes("//td"))
{
//this is to check for correct result, final result would be to dump it into datatable
Console.WriteLine(cell.InnerText);
}
}
}
return dt;
}
I'm trying to download the hourly prices from the link in the code, but it seems to fail because of trailing blanks (I think).
Is there a LIKE statement for the name of a node? Or can you remove trailing blanks?
I believe your problem is that you are trying to retrieve tds from inside a td node, which obviously doesn't contain any more tds.
<tr>
<td class="title"> 00-01 </td>
<td class="spacer"></td>
<td class="r">€/MWh</td>
<td class="spacer"></td>
<td>35.34</td>
<td class="spacer"></td>
<td>34.02</td>
<td class="spacer"></td>
<td>34.02</td>
</tr>
So if you iterate over the result of table.SelectNodes("//td[@class='title'][.=' 00-01 ']"), the td you get back has no tds inside it.
If you want all the rows starting from 00-01 you can use this one:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(htmlCode);
foreach (HtmlNode row in doc.DocumentNode.SelectNodes("//td[@class='title'][(normalize-space(.)='00-01')]/ancestor::table"))
{
foreach (var cell in row.SelectNodes("./tr/td"))
{
if (string.IsNullOrEmpty(cell.InnerText.Trim()))
continue;
Console.WriteLine(cell.InnerText.Trim());
}
}
If you want only the 00-01 row you can use this one:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(htmlCode);
foreach (HtmlNode row in doc.DocumentNode.SelectNodes("//td[@class='title']"))
{
if (row.InnerText.Trim() == "00-01")
{
foreach (var cell in row.ParentNode.ChildNodes)
{
if (string.IsNullOrEmpty(cell.InnerText.Trim()))
continue;
Console.WriteLine(cell.InnerText.Trim());
}
}
}
Or you can use it as:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(htmlCode);
foreach (HtmlNode row in doc.DocumentNode.SelectNodes("//td[@class='title'][(normalize-space(.)='00-01')]"))
{
foreach (var cell in row.ParentNode.ChildNodes)
{
if (string.IsNullOrEmpty(cell.InnerText.Trim()))
continue;
Console.WriteLine(cell.InnerText.Trim());
}
}
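On the original "LIKE" question: XPath 1.0, which HtmlAgilityPack evaluates, offers normalize-space() for whitespace-insensitive equality and contains() for substring matching. The sketch below uses a cut-down version of the row shown earlier. WhitespaceMatchDemo and ValuesFor are hypothetical names, and the label is concatenated into the XPath without escaping, so it must not contain an apostrophe.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class WhitespaceMatchDemo
{
    // Hypothetical helper: finds the title cell whose whitespace-normalized text
    // equals label and returns the non-empty texts of the cells that follow it
    // in the same row. Returns an empty array when no title cell matches.
    public static string[] ValuesFor(HtmlDocument doc, string label)
    {
        string xpath = "//td[@class='title'][normalize-space(.)='" + label + "']" +
                       "/following-sibling::td";
        HtmlNodeCollection cells = doc.DocumentNode.SelectNodes(xpath);

        var values = new List<string>();
        if (cells == null) return values.ToArray(); // SelectNodes returns null on no match

        foreach (HtmlNode cell in cells)
        {
            string text = cell.InnerText.Trim();
            if (text.Length > 0) values.Add(text); // skip the empty spacer cells
        }
        return values.ToArray();
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(
            "<table><tr>" +
            "<td class='title'> 00-01 </td><td class='spacer'></td>" +
            "<td>35.34</td><td class='spacer'></td><td>34.02</td>" +
            "</tr></table>");

        Console.WriteLine(string.Join(", ", ValuesFor(doc, "00-01")));

        // contains() is the closest XPath 1.0 equivalent of a LIKE '%00-01%' test.
        HtmlNode like = doc.DocumentNode.SelectSingleNode("//td[contains(., '00-01')]");
        Console.WriteLine(like != null);
    }
}
```

The following-sibling axis keeps the selection inside the matching row, so the ParentNode.ChildNodes walk is not needed in this variant.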

Html Agility Pack loop through table rows and columns

I have a table like this
<table border="0" cellpadding="0" cellspacing="0" id="table2">
<tr>
<th>Name
</th>
<th>Age
</th>
</tr>
<tr>
<td>Mario
</td>
<th>Age: 78
</td>
</tr>
<tr>
<td>Jane
</td>
<td>Age: 67
</td>
</tr>
<tr>
<td>James
</td>
<th>Age: 92
</td>
</tr>
</table>
I want to use HTML Agility Pack to parse it. I have tried this code to no avail:
foreach (HtmlNode row in doc.DocumentNode.SelectNodes("//table[@id='table2']//tr"))
{
foreach (HtmlNode col in row.SelectNodes("//td"))
{
Response.Write(col.InnerText);
}
}
What am I doing wrong?
Why don't you just select the tds directly?
foreach (HtmlNode col in doc.DocumentNode.SelectNodes("//table[@id='table2']//tr//td"))
Response.Write(col.InnerText);
Alternatively, if you really need the trs separately for some other processing, drop the // and do:
foreach (HtmlNode row in doc.DocumentNode.SelectNodes("//table[@id='table2']//tr"))
foreach (HtmlNode col in row.SelectNodes("td"))
Response.Write(col.InnerText);
Of course that will only work if the tds are direct children of the trs but they should be, right?
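The reason dropping the // matters: an XPath that starts with // is evaluated from the document root even when SelectNodes is called on a row node, so the inner loop in the question visits every td in the document once per row. A small sketch that makes the difference visible (RelativeXPathDemo and Count are hypothetical names; the sample table is a shortened, valid version of the one in the question):

```csharp
using System;
using HtmlAgilityPack;

public static class RelativeXPathDemo
{
    // Hypothetical helper: counts the nodes an XPath selects when evaluated on
    // the given node. SelectNodes returns null when nothing matches.
    public static int Count(HtmlNode context, string xpath)
    {
        HtmlNodeCollection found = context.SelectNodes(xpath);
        return found == null ? 0 : found.Count;
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(
            "<table id='table2'>" +
            "<tr><td>Mario</td><td>Age: 78</td></tr>" +
            "<tr><td>Jane</td><td>Age: 67</td></tr>" +
            "</table>");

        HtmlNode firstRow = doc.DocumentNode.SelectSingleNode("//table[@id='table2']/tr");

        Console.WriteLine(Count(firstRow, "//td"));  // absolute: every td in the document
        Console.WriteLine(Count(firstRow, "td"));    // relative: td children of this row
        Console.WriteLine(Count(firstRow, ".//td")); // relative: td descendants of this row
    }
}
```

With this two-row sample the absolute path counts all four cells, while both relative forms count only the two cells of the first row.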
EDIT:
var cols = doc.DocumentNode.SelectNodes("//table[@id='table2']//tr//td");
for (int ii = 0; ii < cols.Count; ii += 2)
{
string name = cols[ii].InnerText.Trim();
int age = int.Parse(cols[ii+1].InnerText.Split(' ')[1]);
}
There's probably a more impressive way to do this with LINQ.
I've run the code and it displays only the Names, which is correct, because the Ages are defined using invalid HTML: <th></td> (probably a typo).
By the way, the code can be simplified to only one loop:
foreach (var cell in doc.DocumentNode.SelectNodes("//table[@id='table2']/tr/td"))
{
Response.Write(cell.InnerText);
}
Here's the code I used to test: http://pastebin.com/euzhUAAh
I had to provide the full XPath. I got it by using Firebug, following a suggestion by @Coda (https://stackoverflow.com/a/3104048/1238850), and I ended up with this code:
foreach (HtmlNode row in doc.DocumentNode.SelectNodes("/html/body/table/tbody/tr/td/table[@id='table2']/tbody/tr"))
{
HtmlNodeCollection cells = row.SelectNodes("td");
for (int i = 0; i < cells.Count; ++i)
{
if (i == 0)
{ Response.Write("Person Name : " + cells[i].InnerText + "<br>"); }
else {
Response.Write("Other attributes are: " + cells[i].InnerText + "<br>");
}
}
}
I am sure it can be written way better than this but it is working for me now.
I did the same thing in my project with this:
private List<PhrasalVerb> ExtractVerbsFromMainPage(string content)
{
var verbs = new List<PhrasalVerb>();
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(content);
var rows = doc.DocumentNode.SelectNodes("//table[@class='idioms-table']//tr");
rows.RemoveAt(0); //remove header
foreach (var row in rows)
{
var cols = row.SelectNodes("td");
verbs.Add(new PhrasalVerb {
Uid = Guid.NewGuid(),
Name = cols[0].InnerHtml,
Definition = cols[1].InnerText,
// TryParse already yields the parsed value, so there is no need to convert a second time.
Count = int.TryParse(cols[2].InnerText, out var count) ? count : 0
});
}
return verbs;
}
private List<Table1> getTable1Data(string result)
{
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(result);
var table1 = htmlDoc.DocumentNode.SelectNodes("//table").First();
var tbody = table1.ChildNodes["tbody"];
var lst = new List<Table1>();
foreach (var row in tbody.ChildNodes.Where(r => r.Name == "tr"))
{
var tbl1 = new Table1();
var columnsArray = row.ChildNodes.Where(c => c.Name == "td").ToArray();
for (int i = 0; i < columnsArray.Length; i++)
{
if (i == 0)
tbl1.Course = columnsArray[i].InnerText.Trim();
if (i == 1)
tbl1.Count = columnsArray[i].InnerText.Trim();
if (i == 2)
tbl1.Correct = columnsArray[i].InnerText.Trim();
}
lst.Add(tbl1);
}
return lst;
}
public class Table1
{
public string Course { get; set; }
public string Count { get; set; }
public string Correct { get; set; }
}
