Scraping data from a website with a C# console application

I'm trying to learn Spanish and am making some flash cards (for my personal use) to help me learn the verbs.
Here is an example page. Near the top of the page you will see the past participle (bloqueado) and the gerund (bloqueando). These are the two values I want to obtain in my code and use for my flash cards.
If this is possible, I will use a C# console application. I am aware that scraping data from a website is not ideal, but this is a one-off.
Any guidance on how to start something like this and pitfalls to avoid would be very helpful!

I know this isn't an exact answer, but here is the process I would suggest:
1. Download wget (https://www.gnu.org/software/wget/) and mirror the website to a folder. Wget is a web spider and will follow the links on the site until it has downloaded everything. You'll have to run it with a few different parameters until you find the settings you want.
2. Use C# to run through each file in the folder and extract the words from <section class="verb-mood-section"> in each file. It's your choice whether to output them to the console or store them in a database or flat file.
It should be that easy, in theory.

Use SgmlReader. SgmlReader is a versatile and robust component that streams HTML to an XmlReader:
    XmlDocument FromHtml(TextReader reader) {
        // setup SgmlReader
        Sgml.SgmlReader sgmlReader = new Sgml.SgmlReader();
        sgmlReader.DocType = "HTML";
        sgmlReader.WhitespaceHandling = WhitespaceHandling.All;
        sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower;
        sgmlReader.InputStream = reader;
        // create document
        XmlDocument doc = new XmlDocument();
        doc.PreserveWhitespace = true;
        doc.XmlResolver = null;
        doc.Load(sgmlReader);
        return doc;
    }
You can see that you need to create a TextReader first. In practice this would be a StreamReader, since TextReader is an abstract class.
Then you create the XmlDocument over that. Once the HTML is loaded into the XmlDocument, you can use its methods to isolate and extract the nodes you need. I'll leave you to explore that aspect of it.
You might try the XDocument class instead, as it's a lot easier to handle than XmlDocument, especially if you're a newbie. It also supports LINQ.
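As a rough illustration of the "isolate and extract the nodes" step, here is a minimal sketch that runs an XPath query over an XmlDocument. The markup inside Main is hypothetical: only the verb-mood-section class name comes from this thread, while the span elements and their class names are assumptions made for the example, so the XPath would need adjusting to the real page.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class VerbExtractor
{
    // Returns the trimmed text of every node matched by the XPath expression.
    public static List<string> ExtractText(XmlDocument doc, string xpath)
    {
        var results = new List<string>();
        foreach (XmlNode node in doc.SelectNodes(xpath))
        {
            results.Add(node.InnerText.Trim());
        }
        return results;
    }

    public static void Main()
    {
        // Hypothetical markup standing in for the page after conversion to XML.
        var doc = new XmlDocument();
        doc.LoadXml(
            "<html><body><section class=\"verb-mood-section\">" +
            "<span class=\"participle\">bloqueado</span>" +
            "<span class=\"gerund\">bloqueando</span>" +
            "</section></body></html>");

        foreach (string word in ExtractText(doc, "//section[@class='verb-mood-section']/span"))
        {
            Console.WriteLine(word);
        }
    }
}
```

The same helper should work on the document returned by FromHtml above, provided the XPath matches the element names and casing that SgmlReader produces.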

Related

Reading an xml file 50 lines at a time

I am currently trying to write a method that reads XML files 50 lines at a time; this limit will be increased later to allow larger files to be used in the program.
At the moment I am trying to accomplish this with the following code.
    List<dataclass.DataRecord> list = new List<dataclass.DataRecord>();
    string filename = "FileLocation";
    XmlDocument testing = new XmlDocument();
    //using (StreamReader streamreader = new StreamReader(filename))
    using (XmlTextReader reader = new XmlTextReader(new StringReader(filename)))
    {
        while (reader.Read() != null)
        {
            for (int i = 0; i < 50; i++)
            {
                testing.Load(reader);
                //list.add(line);
                Console.WriteLine(testing);
                //testing.Load(reader);
            }
        }
    }
The commented-out lines are left over from earlier attempts, and the filename has been removed because I prefer not to post it online.
At the moment I keep getting the following error:
Data at the root level is invalid. Line 1, position 1.
So I don't know:
A. Whether I am going about this the right way.
B. Whether the only way to fix this error is to surround the "testing.Load" call with "root" and "/root" tags.
I hope someone can help, thanks.
As I explained in my comment, XML consists of nodes, whereas you are treating it as though it were a flat file with lines.
There are a couple of Stack Overflow questions with answers that match what you are trying to do. The real question is "How can you load a large XML file?" The answer is to use a stream rather than loading the file in one big chunk. From there you can find lots of resources about using XmlReader.
A couple of pointers to other SO articles:
C# and Reading Large XML Files
Reading large XML documents in .net
Hope that helps!
If you are only trying to load XML into an XmlDocument, why not just:
    XmlDocument testing = new XmlDocument();
    testing.Load(filename);
If your XML file is really big, you're better off using some sort of pull parser (which parses tag by tag, attribute by attribute, etc.) rather than a DOM parser (which loads the whole document during parsing and keeps it in memory).
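To make the pull-parser suggestion concrete, here is a minimal sketch that streams through XML with XmlReader and hands back elements in batches (50 in the question; 2 in the demo). The element name "record", the ReadBatches helper, and the sample data are assumptions for illustration, not part of the original code. Batching is done per element rather than per line, since XML is made of nodes.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Linq;

public static class RecordStreamer
{
    // Lazily yields lists of up to batchSize elements named elementName.
    // Only the current batch is held in memory, not the whole document.
    public static IEnumerable<List<XElement>> ReadBatches(TextReader input, string elementName, int batchSize)
    {
        var batch = new List<XElement>(batchSize);
        using (XmlReader reader = XmlReader.Create(input))
        {
            reader.MoveToContent();
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == elementName)
                {
                    // ReadFrom consumes the whole element and leaves the reader on the node after it.
                    batch.Add((XElement)XNode.ReadFrom(reader));
                    if (batch.Count == batchSize)
                    {
                        yield return batch;
                        batch = new List<XElement>(batchSize);
                    }
                }
                else
                {
                    reader.Read();
                }
            }
        }
        if (batch.Count > 0)
        {
            yield return batch; // final, possibly smaller, batch
        }
    }

    public static void Main()
    {
        // Hypothetical sample data; a real program would pass a StreamReader over the file.
        string xml = "<records><record id=\"1\"/><record id=\"2\"/><record id=\"3\"/></records>";
        foreach (List<XElement> batch in ReadBatches(new StringReader(xml), "record", 2))
        {
            Console.WriteLine("Batch with " + batch.Count + " record(s)");
        }
    }
}
```

For a file on disk, passing new StreamReader(filename) instead of the StringReader should give the same behaviour.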

Read an XML file from http address

I need to read an XML file using C#/.NET from a source like this: https://10.1.12.15/xmldata?item=all
That is basically just an XML file.
StreamReader does not like that.
What's the best way to read the contents of that link?
The file looks like this:
    <RIMP>
      <HSI>
        <SBSN>CZ325000123</SBSN>
        <SPN>ProLiant DL380p Gen8</SPN>
        <UUID>BBBBBBGGGGHHHJJJJ</UUID>
        <SP>1</SP>
        <cUUID>0000-000-222-22222-333333333333</cUUID>
        <VIRTUAL>...
You'll want to use LINQ to XML to process the XML file. The XDocument.Load method supports loading an XML document from a URI:
    var document = XDocument.Load("https://10.1.12.15/xmldata?item=all");
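Once the document is loaded, individual values can be pulled out with LINQ to XML. The sketch below is only an illustration: it parses a shortened copy of the sample from the question as a string (the truncated VIRTUAL element is left out), and the RimpReader class and GetValue helper are names made up for the example.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class RimpReader
{
    // Returns the text of the first descendant element with the given name, or null if there is none.
    public static string GetValue(XDocument document, string elementName)
    {
        // The explicit string conversion on XElement returns null for a null element.
        return (string)document.Root.Descendants(elementName).FirstOrDefault();
    }

    public static void Main()
    {
        // Shortened copy of the sample from the question.
        // In the real program this would come from XDocument.Load(url).
        XDocument document = XDocument.Parse(
            "<RIMP><HSI>" +
            "<SBSN>CZ325000123</SBSN>" +
            "<SPN>ProLiant DL380p Gen8</SPN>" +
            "<SP>1</SP>" +
            "</HSI></RIMP>");

        Console.WriteLine(GetValue(document, "SBSN"));
        Console.WriteLine(GetValue(document, "SPN"));
    }
}
```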
Another way to do this is to use the XmlDocument class. Many servers are still running .NET Framework versions older than 3.5, where XDocument is not available, so it's good to know that this class still exists alongside XDocument in case you're developing an application that will run on such a server.
    string url = @"https://10.1.12.15/xmldata?item=all";
    XmlDocument xmlDoc = new XmlDocument();
    xmlDoc.Load(url);
Perhaps the best starting point is the original question itself: how to read an XML file from a URL (in this case, an HTTP address).
I think the following simple demos would be the most useful for you:
(This one uses XmlTextReader, but today you can use XmlReader instead.)
http://support.microsoft.com/en-us/kb/307643
(You could also read this documentation alongside it.)
https://msdn.microsoft.com/en-us/library/system.xml.xmlreader(v=vs.110).aspx
Regards

How to load an XML file in C# on Windows Phone?

I'm making a game where I generate the map from an XML file.
Does WP7 not support regular XML? I also tried wrapping it inside an XnaContent XML file, but all my nodes are invalid.
How do I load a regular XML file into my C# WP7 project?
I think Jon Skeet hit the nail on the head, but I have an example that reads an XML file from isolated storage, which I thought I would share.
    // Requires: System.IO, System.IO.IsolatedStorage, System.Xml, System.Xml.Linq
    private const string filePath = "TimeKeeperData.xml";

    private static XDocument ReadDataFromIsolatedStorageXmlDoc()
    {
        using (IsolatedStorageFile storage = IsolatedStorageFile.GetUserStoreForApplication())
        {
            // Return an empty document if nothing has been saved yet.
            if (!storage.FileExists(filePath))
            {
                return new XDocument();
            }
            using (var isoFileStream = new IsolatedStorageFileStream(filePath, FileMode.OpenOrCreate, storage))
            {
                using (XmlReader reader = XmlReader.Create(isoFileStream))
                {
                    return XDocument.Load(reader);
                }
            }
        }
    }
Yes, Windows Phone 7 definitely supports regular XML, and it works fine using LINQ to XML to load XML data either from isolated storage, or fetched from the web, or fetched from a resource.
It's unclear exactly what you're trying to do or what's going wrong (partly because you've shown no code) but you can certainly use XML in Windows Phone 7. Not being an XNA developer, I don't know about XnaContent, but you should potentially try loading it as an XDocument first just to check whether that works, and go from there.
When you say all your nodes are invalid, it makes me think you might not be handling xml namespaces correctly. Are you taking them into account, if required?
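To show what the namespace point means in practice, here is a small sketch, assuming a made-up map format with a default namespace (the http://example.com/map URI, the map and tile elements, and the CountTiles helper are all hypothetical). Querying by local name alone finds nothing; prefixing the name with the XNamespace finds the elements.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class NamespaceDemo
{
    // Counts <tile> elements that belong to the given namespace.
    public static int CountTiles(XDocument doc, XNamespace ns)
    {
        return doc.Descendants(ns + "tile").Count();
    }

    public static void Main()
    {
        // Hypothetical map file: the default xmlns puts every element in that namespace.
        XDocument doc = XDocument.Parse(
            "<map xmlns=\"http://example.com/map\">" +
            "<tile x=\"0\" y=\"0\"/><tile x=\"1\" y=\"0\"/>" +
            "</map>");

        XNamespace ns = "http://example.com/map";

        // Without the namespace the query matches no elements.
        Console.WriteLine(CountTiles(doc, XNamespace.None));
        // With the namespace both tiles are found.
        Console.WriteLine(CountTiles(doc, ns));
    }
}
```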

Display xml data into html

I have an XML structure stored in an XDocument.
I want to present it as an HTML document (or something similar); the main idea is that a web browser should be able to display it.
Is XSLT the right technology here?
Are there any examples of how to do this?
Thanks for your help.
Yes, XSLT is good for this. I recently had to do this using the following code:
    var xslt = new XslCompiledTransform(true);
    xslt.Load(styleSheetFile, XsltSettings.TrustedXslt, new XmlUrlResolver());
    xslt.Transform(xmlFile, outputFile);
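Since the question starts from an in-memory XDocument rather than a file, here is a minimal sketch of the same idea without touching disk. The people/person structure, the stylesheet, and the HtmlRenderer class are invented for the example; it assumes the stylesheet is available as a string.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Linq;
using System.Xml.Xsl;

public static class HtmlRenderer
{
    // Applies an XSLT stylesheet (given as a string) to an XDocument and returns the result as text.
    public static string ToHtml(XDocument source, string stylesheet)
    {
        var xslt = new XslCompiledTransform();
        using (XmlReader stylesheetReader = XmlReader.Create(new StringReader(stylesheet)))
        {
            xslt.Load(stylesheetReader);
        }

        using (var output = new StringWriter())
        using (XmlReader input = source.CreateReader())
        {
            xslt.Transform(input, null, output);
            return output.ToString();
        }
    }

    public static void Main()
    {
        // Hypothetical data and stylesheet, for illustration only.
        XDocument people = XDocument.Parse(
            "<people><person name=\"Ada\"/><person name=\"Linus\"/></people>");

        string stylesheet = @"<xsl:stylesheet version=""1.0"" xmlns:xsl=""http://www.w3.org/1999/XSL/Transform"">
  <xsl:output method=""html""/>
  <xsl:template match=""/people"">
    <html><body><ul>
      <xsl:for-each select=""person""><li><xsl:value-of select=""@name""/></li></xsl:for-each>
    </ul></body></html>
  </xsl:template>
</xsl:stylesheet>";

        Console.WriteLine(ToHtml(people, stylesheet));
    }
}
```

The returned string can be written to a file or served to a browser as is.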
You can use XSLT or LINQ to XML. There are many examples out there, but you can start at http://msdn.microsoft.com/en-us/library/bb387098.aspx

How to trigger an executable upon update of an RSS feed

I have an RSS feed URL, that I can view in any Feed Reader.
This RSS feed is not controlled by me, it is only consumed by me.
This RSS Feed (Office of Inspector General's Excluded Provider List) links to a page with download-able files.
These files are updated approximately once a month, and the RSS feed displays new "unread" items.
What I want to do is write something (in C#) that checks this RSS Feed once a week, and when a new item (i.e. a new download-able file) is available, triggers off an executable.
This is essentially like a very scaled-down RSS Reader, with the sole purpose of triggering an executable when a new item appears.
Any guidance, advice would be greatly appreciated.
Edit:
I need help in determining when a new item becomes available for download.
The running of an executable I can do.
The executable that will run will process the downloaded file.
As a commenter already noted, this question is quite broad, but here's an attempt to answer:
You can either write a Windows Service (use a template that comes with VS/MonoDevelop) or you can write a simple console app that would be called by Windows Scheduler or Cron.
The main code will use one of the many RSS feed parsers available:
There are plenty of examples here on SO. IMO, the simplest LINQ-based is here
I personally like this approach, also using LINQ.
Once you parse the feed, you need to look for the value of the Link element, found by doing this from the SO example above:
....
    var feeds = from feed in feedXML.Descendants("item")
                select new
                {
                    Title = feed.Element("title").Value,
                    Link = feed.Element("link").Value,
                    Description = feed.Element("description").Value
                };
....
So, now that you have the link to the executable, you'll need to download it to your machine. I suggest you look into this example from MSDN:
Now that you have the file downloaded, simply use Process.Start("Path to EXE"); to execute it.
Watch out for viruses in the exes!!!
If you are using .NET 3.5 or above, you can use the classes in the System.ServiceModel.Syndication namespace, specifically the SyndicationFeed class, which exposes a LastUpdatedTime property. You can compare that against the last time you checked to decide when to launch your executable with the Process.Start method in the System.Diagnostics namespace.
    using (XmlReader reader = XmlReader.Create(path))
    {
        SyndicationFeed feed = SyndicationFeed.Load(reader);
        // feedLastUpdated is the timestamp saved from the previous check.
        if ((feed != null) && (feed.LastUpdatedTime > feedLastUpdated))
        {
            // Launch Process
        }
    }
So you have to read the RSS feed from the URL, and then parse the data to determine whether a new item is available.
To read the feed, you'll want to use a WebClient. The simplest way:
    var MyClient = new WebClient();
    string rssData = MyClient.DownloadString("http://whatever");
You can then create an XML document from the returned string. Note that LoadXml parses XML text, whereas Load expects a file name or URL.
    var feedXML = new XmlDocument();
    feedXML.LoadXml(rssData);
@dawebber shows how to parse the XML with LINQ. You'll want to check the date on each item to see if it's newer than the last date checked. Or perhaps you have a database of items that you've already seen and you want to check whether the items you received are in the database.
Whenever you find a new item, you can fire off your executable using Process.Start.
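The date check could look roughly like the sketch below. It assumes a standard RSS 2.0 layout in which each item has link and pubDate children, and that pubDate is in a form DateTimeOffset.TryParse accepts; the FeedChecker class, the sample feed, and the example.com URLs are made up for illustration. Items with a missing or unparseable date are skipped rather than treated as new.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Xml.Linq;

public static class FeedChecker
{
    // Returns the links of all items published after lastChecked.
    public static List<string> FindNewLinks(string rssXml, DateTimeOffset lastChecked)
    {
        XDocument feed = XDocument.Parse(rssXml);
        return feed.Descendants("item")
            .Where(item => IsNewer((string)item.Element("pubDate"), lastChecked))
            .Select(item => (string)item.Element("link"))
            .ToList();
    }

    // True when the text parses as a date later than lastChecked; false for null or bad input.
    private static bool IsNewer(string pubDate, DateTimeOffset lastChecked)
    {
        DateTimeOffset published;
        return DateTimeOffset.TryParse(pubDate, CultureInfo.InvariantCulture,
                                       DateTimeStyles.AssumeUniversal, out published)
               && published > lastChecked;
    }

    public static void Main()
    {
        // Hypothetical feed content; the real string would come from DownloadString.
        string rss =
            "<rss version=\"2.0\"><channel>" +
            "<item><title>May file</title><link>http://example.com/may.zip</link>" +
            "<pubDate>Mon, 05 May 2014 09:00:00 GMT</pubDate></item>" +
            "<item><title>June file</title><link>http://example.com/june.zip</link>" +
            "<pubDate>Mon, 02 Jun 2014 09:00:00 GMT</pubDate></item>" +
            "</channel></rss>";

        var lastChecked = new DateTimeOffset(2014, 5, 20, 0, 0, 0, TimeSpan.Zero);
        foreach (string link in FindNewLinks(rss, lastChecked))
        {
            Console.WriteLine("New item: " + link);
        }
    }
}
```

The program would need to persist lastChecked between runs (for example in a small file) and update it after each successful check.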
You could write a System Tray application. I've done several that screen scrape/monitor sites on a scheduled basis. Here is a VERY simple start. I think you could do what you're looking for in a few hours.
http://alperguc.blogspot.com/2008/11/c-system-tray-minimize-to-tray-with.html
