I want to make a desktop weather application in C#. I want it to pull the weather from weather.com. I am very new to this subject. I am using the HtmlAgilityPack.dll. I have tried the following code to pull today's weather (degrees):
string webUrl = "http://www.weather.com/weather/today/l/90025:4:US";
HtmlWeb HTMLweb = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = HTMLweb.Load(webUrl);
string degrees = doc.DocumentNode.SelectNodes("//*[@id=\"wx-local-wrap\"]/div[2]/div[2]/div/div/div/div/section/div/div/div[1]/div/section/section[1]/div[2]/span[1]/span")[0].InnerText;
MessageBox.Show("{0}°F", degrees);
However, when I run this code it throws a NullReferenceException. What am I doing wrong and how can I fix it?
Thank you.
Scraping webpages like this is an exhausting task, and any change to the page by its developers will render your application useless.
Therefore, use XML or an API to retrieve weather data instead. This can be a good place to start:
http://openweathermap.org/current
It supports XML and JSON: you provide parameters such as a city ID, city name, or geographic coordinates, and it returns results in clear, structured XML that is easy to parse with XmlReader.
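For example, here is a minimal sketch of calling the current-weather endpoint and reading the temperature; the query parameters and the <temperature> element are my reading of the OpenWeatherMap docs, and you need to sign up for your own API key:
using System;
using System.Net;
using System.Xml;

class WeatherExample
{
    static void Main()
    {
        string apiKey = "YOUR_API_KEY"; // placeholder: use your own key
        string url = "http://api.openweathermap.org/data/2.5/weather"
                   + "?q=Los+Angeles,US&mode=xml&units=imperial&appid=" + apiKey;
        using (var client = new WebClient())
        {
            string xml = client.DownloadString(url);
            var doc = new XmlDocument();
            doc.LoadXml(xml);
            // The <temperature> element carries the reading in its "value" attribute.
            XmlNode temp = doc.SelectSingleNode("//temperature");
            Console.WriteLine("{0}°F", temp.Attributes["value"].Value);
        }
    }
}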
Hope that helped :)
I'm writing a simple web scraping application to retrieve information on certain PC components.
I'm using Best Buy as my test website and I'm using the HTMLAgilityPack as my scraper.
I'm able to retrieve the title and the price; however, I can't seem to get the availability.
So, I'm trying to read the Add to Cart button element's text. If it's available, it'll read "Add to Cart", otherwise, it'll read "Unavailable".
But, when I get the XPath and try to save it to a variable, it returns null. Can someone please help me out?
Here's my code.
var url = "https://www.bestbuy.com/site/pny-nvidia-geforce-gt-710-verto-2gb-ddr3-pci-express-2-0-graphics-card-black/5092306.p?skuId=5092306";
HtmlWeb web = new HtmlWeb();
HtmlDocument pageDocument = web.Load(url);
string titleXPath = "/html/body/div[3]/main/div[2]/div[3]/div[1]/div[1]/div/div/div[1]/h1";
string priceXPath = "/html/body/div[3]/main/div[2]/div[3]/div[2]/div/div/div[1]/div/div/div/div/div[2]/div/div/div/span[1]";
string availabilityXPath = "/html/body/div[3]/main/div[2]/div[3]/div[2]/div/div/div[7]/div[1]/div/div/div[1]/button";
var title = pageDocument.DocumentNode.SelectSingleNode(titleXPath);
var price = pageDocument.DocumentNode.SelectSingleNode(priceXPath);
bool availability = pageDocument.DocumentNode.SelectSingleNode(availabilityXPath) != null ? true : false;
Console.WriteLine(title.InnerText);
Console.WriteLine(price.InnerText);
Console.WriteLine(availability);
It correctly outputs the title and price, but the availability node is always null, so availability always comes out false.
Try string availabilityXPath = "//button[. = 'Add to Cart']"
In web scraping, a long generated XPath will keep working on the same static page, but when you're dealing with multiple pages across the same store, the location of elements can drift and break your XPaths. Yours is breaking at /html/body/div[3]/main/div[2]/div[3]/div[2]/div/div/div[7]/div[1]/div, and I suspect that's what's happening here.
Learning to write one from scratch will be invaluable (and much easier to debug!).
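As a sketch, the suggested selector drops straight into your existing code; I've swapped in contains() to tolerate surrounding whitespace, and the exact button text on the live page is an assumption:
// Select the button by its text instead of by position; a missing node
// simply means the product is not purchasable right now.
var addToCart = pageDocument.DocumentNode
    .SelectSingleNode("//button[contains(., 'Add to Cart')]");
bool availability = addToCart != null;
Console.WriteLine(availability ? "Add to Cart" : "Unavailable");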
I have a Google Sheets form that has been set up to take form data, format it into a QBXML Invoice Add Request, then save it as a text document (.gdoc). My issue is that the QBFC C# sample code that I have found is all based around building the QBXML request and then sending that; I haven't been able to figure out how to send the ready-made QBXML document to Quickbooks Desktop as a request.
For example, this code doesn't work because DoRequests() needs to be passed an IMsgSetRequest and won't accept a string:
String xmlDoc = File.ReadAllText("J:\\My Drive\\XML Test Doc.gdoc");
IMsgSetResponse responseMsgSet = sessionManager.DoRequests(xmlDoc);
And this won't work either, because you can't convert from a string to an IMsgSetRequest:
String xmlDoc = File.ReadAllText("J:\\My Drive\\XML Test Doc.gdoc");
IMsgSetRequest requestMsgSet = xmlDoc;
IMsgSetResponse responseMsgSet = sessionManager.DoRequests(requestMsgSet);
I'm assuming (and hoping) that there's a simple solution that I'm just overlooking. But if there is, it's eluded me for long enough that I've decided that it's worth reaching out to you folks for assistance. Thanks in advance.
The QBSessionManager has a .DoRequestsFromXMLString that you could pass the full XML string to (which you read from your file).
Its definition is:
IMsgSetResponse DoRequestsFromXMLString(string qbXMLRequest);
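Applied to your snippet, that would look something like this (assuming sessionManager is the QBSessionManager you already have open):
// Read the ready-made QBXML and hand the raw string to QuickBooks.
String xmlDoc = File.ReadAllText("J:\\My Drive\\XML Test Doc.gdoc");
IMsgSetResponse responseMsgSet = sessionManager.DoRequestsFromXMLString(xmlDoc);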
You need to use QBXML instead of QBFC. QBFC is a wrapper that generates the QBXML. Since you already have the QBXML generated you can bypass QBFC. Include a reference to QBXMLRP2Lib and the following code should allow you to send the data to QuickBooks.
String xmlDoc = File.ReadAllText("J:\\My Drive\\XML Test Doc.gdoc");
QBXMLRP2Lib.IRequestProcessor5 rp = new QBXMLRP2Lib.RequestProcessor3();
rp.OpenConnection2("AppID", "AppName", QBXMLRP2Lib.QBXMLRPConnectionType.localQBD);
string ticket = rp.BeginSession("", QBXMLRP2Lib.QBFileMode.qbFileOpenDoNotCare);
string response = rp.ProcessRequest(ticket, xmlDoc);
// Clean up when finished: end the session and close the connection.
rp.EndSession(ticket);
rp.CloseConnection();
You may find that this tool helps: SDKTestPlus3. The tool handles the connection to QuickBooks Desktop, and you can then pass your XML file over.
I'm trying to learn Spanish and making some flash cards (for my personal use) to help me learn the verbs.
Here is an example page. Near the top of the page you will see the past participle (bloqueado) and the gerund (bloqueando). It is these two values that I wish to obtain in my code and use for my flash cards.
If this is possible I will use a C# console application. I am aware that scraping data from a website is not ideal; however, this is a once-off.
Any guidance on how to start something like this and pitfalls to avoid would be very helpful!
I know this isn't an exact answer, but here is the process I would suggest.
1. Use wget (https://www.gnu.org/software/wget/) to mirror the website to a folder. Wget is a web spider and will follow the links on the site until it has downloaded everything. You'll have to run it with a few different parameters until you figure out the settings you want.
2. Use C# to run through each file in the folder and extract the words from <section class="verb-mood-section"> in each file. It's your choice whether to output them to the console or store them in a database or flat file. A sketch of this step follows below.
Should be that easy, in theory.
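A rough sketch of step 2, assuming HtmlAgilityPack (used elsewhere on this page), a hypothetical mirror folder C:\mirror, and the verb-mood-section class name from the linked page:
using System;
using System.IO;
using HtmlAgilityPack;

class VerbExtractor
{
    static void Main()
    {
        // Walk every mirrored page and print the text of each verb-mood section.
        foreach (string file in Directory.GetFiles(@"C:\mirror", "*.htm*", SearchOption.AllDirectories))
        {
            var doc = new HtmlDocument();
            doc.Load(file);
            var sections = doc.DocumentNode.SelectNodes("//section[@class='verb-mood-section']");
            if (sections == null) continue; // page has no verb tables
            foreach (var section in sections)
                Console.WriteLine(section.InnerText.Trim());
        }
    }
}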
Use SGMLReader. SGMLReader is a versatile and robust component that will stream HTML to an XmlReader:
XmlDocument FromHtml(TextReader reader) {
    // set up SgmlReader
    Sgml.SgmlReader sgmlReader = new Sgml.SgmlReader();
    sgmlReader.DocType = "HTML";
    sgmlReader.WhitespaceHandling = WhitespaceHandling.All;
    sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower;
    sgmlReader.InputStream = reader;

    // create the document over the reader
    XmlDocument doc = new XmlDocument();
    doc.PreserveWhitespace = true;
    doc.XmlResolver = null;
    doc.Load(sgmlReader);
    return doc;
}
You can see that you need to create a TextReader first. In reality this would be a StreamReader, as TextReader is an abstract class.
Then you create the XmlDocument over that. Once you've got the HTML into the XmlDocument you can use its various methods to isolate and extract the nodes you need. I'll leave you to explore that aspect of it.
You might try using the XDocument class, as it's a lot easier to handle than the XmlDocument, especially if you're a newbie. It also supports LINQ.
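For instance, a short sketch of the XDocument route; the file path is hypothetical, and the verb-mood-section class name comes from the question's page (XDocument.Load accepts any XmlReader, including SgmlReader):
using System;
using System.IO;
using System.Linq;
using System.Xml.Linq;

class XDocumentExample
{
    static XDocument FromHtmlFile(string path)
    {
        using (var reader = new StreamReader(path))
        {
            var sgmlReader = new Sgml.SgmlReader();
            sgmlReader.DocType = "HTML";
            sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower;
            sgmlReader.InputStream = reader;
            return XDocument.Load(sgmlReader); // SgmlReader streams the HTML as XML
        }
    }

    static void Main()
    {
        XDocument doc = FromHtmlFile(@"C:\mirror\bloquear.html"); // hypothetical file
        // LINQ over the parsed HTML: every <section class="verb-mood-section">.
        var sections = doc.Descendants("section")
                          .Where(e => (string)e.Attribute("class") == "verb-mood-section");
        foreach (var s in sections)
            Console.WriteLine(s.Value.Trim());
    }
}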
Background Info
I am currently working on a C# web API that returns selected image URLs as base64. I already have the functionality that performs the base64 conversion; however, I am receiving a large amount of text that also includes image URLs, which I need to crop out of the string and hand to my function to convert the images to base64. I read up on a library ("HtmlAgilityPack") that should make this task easy, but when I use it I get "HtmlDocument.cs" not found. However, I am not submitting a document; I am sending it a string of HTML. I read the docs and it is supposed to work with a string as well, but it is not working for me. This is the code using HtmlAgilityPack.
NON WORKING CODE
foreach (var item in returnList)
{
    if (item.Content.Contains("~~/picture~~"))
    {
        HtmlDocument doc = new HtmlDocument();
        doc.Load(item.Content);
    }
}
Error Message From HtmlAgilityPack
Question
I am receiving a string of HTML from SharePoint. This HTML string may be tokenized with heading tokens and/or picture tokens. I am trying to isolate and retrieve the URL from the img src attribute. I understand that regex may be impractical, but I would consider working with a regex expression if one is available to retrieve the URL from img src.
Sample String
Bullet~~Increased Cash Flow</li><li>~~/Document Text Bullet~~Tax Efficient Organizational Structures</li><li>~~/Document Text Bullet~~Tax Strategies that Closely Align with Business Strategies</li><li>~~/Document Text Bullet~~Complete Knowledge of State and Local Tax Obligations</li></ul><p>~~/Document Heading 2~~is the firm of choice</p><p>~~/Document Text~~When it comes to accounting and advisory services is the unique firm of choice. As a trusted advisor to our clients, we bring an integrated client service approach with dedicated industry experience. Dixon Hughes Goodman respects the value of every client relationship and provides clients throughout the U.S. with an unwavering commitment to hands-on, personal attention from our partners and senior-level professionals.</p><p>~~/Document Text~~of choice for clients in search of a trusted advisor to deal with their state and local tax needs. Through our leading best practices and experience, our SALT professionals offer quality and ease to the client engagement. We are proud to provide highly comprehensive services.</p>
<p>~~/picture~~<br></p><p>
<img src="/sites/ContentCenter/Graphics/map-al.jpg" alt="map al" style="width:611px;height:262px;" />
<br></p><p><br></p><p>
~~/picture~~<br></p><p>
<img src="/sites/ContentCenter/Graphics/Firm_Telescope_Illustration.jpg" alt="Firm_Telescope_Illustration.jpg" style="margin:5px;width:155px;height:155px;" /> </p><p></div><div class="ExternalClassAF0833CB235F437993D7BEE362A1A88A"><br></div><div class="ExternalClassAF0833CB235F437993D7BEE362A1A88A"><br></div><div class="ExternalClassAF0833CB235F437993D7BEE362A1A88A"><br></div>
Important
I am working with an HTML string, not a file.
The issue you are having is that C# is looking for a file, and since it is not finding it, it tells you. This is not an error that will break your app; it is just telling you that the file was not found, and the library will then read the string you gave it. The documentation can be found here: https://htmlagilitypack.codeplex.com/SourceControl/latest#Trunk/HtmlAgilityPackDocumentation.shfbproj. The code below is a cookie-cutter model that anyone can use.
Important
C# reports that it cannot find a file because a string is being supplied; that is the message you are getting. However, in accordance with the documentation above, it will still work and will not affect your code.
Example Code
HtmlAgilityPack.HtmlDocument htmlDocument = new HtmlAgilityPack.HtmlDocument();
htmlDocument.LoadHtml("YourContent"); // LoadHtml takes the HTML string itself; use Load for a file path
foreach (HtmlNode img in htmlDocument.DocumentNode.SelectNodes("//img"))
{
    HtmlAttribute att = img.Attributes["src"];
    Uri imgUrl = new System.Uri("Url" + att.Value); // build your url
}
string matchString = Regex.Match(original_text, "<img.+?src=[\"'](.+?)[\"'].+?>", RegexOptions.IgnoreCase).Groups[1].Value;
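If you need every image rather than just the first, Regex.Matches applies the same pattern across the whole string; a sketch reusing original_text from above:
using System.Text.RegularExpressions;

// Collect every img src in the HTML string.
foreach (Match m in Regex.Matches(original_text, "<img.+?src=[\"'](.+?)[\"'].+?>", RegexOptions.IgnoreCase))
{
    Console.WriteLine(m.Groups[1].Value);
}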
I have just begun scraping basic text off web pages, and am currently using the HTMLAgilityPack C# library. I had some success with boxscores off rivals.yahoo.com (sports is my thing so why not scrape something interesting?) but I am stuck on NHL's game summary pages. I think this is kind of an interesting problem so I would post it here.
The page I am testing is:
http://www.nhl.com/scores/htmlreports/20102011/GS020079.HTM
Upon first glance, it seems like basic text with no ajax or stuff to mess up a basic scraper. Then I realize I can't right click due to some javascript, so I work around that. I right click in firefox and get the xpath of the home team using XPather and I get:
/html/body/table[@id='MainTable']/tbody/tr[1]/td/table[@id='StdHeader']/tbody/tr/td/table/tbody/tr/td[3]/table[@id='Home']/tbody/tr[3]/td
When I try to grab that node / inner text, htmlagilitypack won't find it. Does anyone see anything strange in the page's source code that might be stopping me?
I am new to this and still learning how people might stop me from scraping, any tips or tricks are gladly appreciated!
p.s. I observe all site rules regarding bots, etc, but I noticed this strange behavior and saw it as a challenge.
OK, so it appears that my XPaths have tbodys in them. When I remove these tbodys manually from the XPath, HtmlAgilityPack handles it fine.
I'd still like to know why I am getting invalid XPaths, but for now I have answered my question.
I think, unless my XPath knowledge is heaps flawed (probably), the problem is with the /tbody node in your XPath expression.
When I do
string test = string.Empty;
StreamReader sr = new StreamReader(#"C:\gs.htm");
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.Load(sr);
sr.Close();
sr = null;
string xpath = #"//table[#id='Home']/tr[3]/td";
test = doc.DocumentNode.SelectSingleNode(xpath).InnerText;
That works fine and returns a
"COLUMBUS BLUE JACKETSGame 5 Home Game 3"
which I hope is the string you wanted.
Examining the raw HTML, I couldn't find a /tbody anywhere: browsers like Firefox insert <tbody> elements into the DOM they render (which is why XPather reports them), but HtmlAgilityPack parses the raw source, where they don't exist.
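For completeness, the same tbody-free XPath works against the live page when you load it with HtmlWeb, as in the other threads here (a sketch, assuming the report URL is still up):
// Load the live report and select the home-team cell with the tbody-free XPath.
HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load("http://www.nhl.com/scores/htmlreports/20102011/GS020079.HTM");
string home = doc.DocumentNode.SelectSingleNode("//table[@id='Home']/tr[3]/td").InnerText;
Console.WriteLine(home);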