I've got a few web pages that have static data in HTML markup tables. By this I mean manually maintained text like:
<table border="1" >
<tr><th>Number</th><th>Date</th><th>BW</th><th>WW</th><th>%</th><th>Type</th><th>CED</th><th>BW</th><th>WW</th><th>YW</th><th>Mlk</th><th>Me</th></tr>
<tr><td>313</td><td>9/16/2013</td><td>74</td><td>512</td><td>100</td><td>861U</td><td>3</td><td>-1.1</td><td>54</td><td>85</td><td>16</td><td></td></tr>
<tr><td>315</td><td>10/6/2013</td><td>-</td><td>-</td><td>-</td><td>W179</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td></tr>
<tr><td>316</td><td>10/102013</td><td>72</td><td>595</td><td>94.2</td><td>W179</td><td>7</td><td>-2.3</td><td>53</td><td>80</td><td>21</td><td>-3</td></tr>
<tr><td>350</td><td>10/11/2013</td><td>71</td><td>703</td><td>100</td><td>W179</td><td>7</td><td>-2.3</td><td>46</td><td>72</td><td>20</td><td>-5</td></tr>
<tr><td>392</td><td>3/8/2013</td><td>61</td><td>651</td><td>100</td><td>RANGER</td><td>7</td><td>-2.3</td><td>52</td><td>82</td><td>20</td><td>-2</td></tr>
<tr><td>303</td><td>7/3/2013</td><td>63</td><td>-</td><td>97.1</td><td>W179</td><td>8</td><td>-3.2</td><td>N/A</td><td>82</td><td>21</td><td>-8</td></tr>
<tr><td>304</td><td>7/8/2013</td><td>62</td><td>-</td><td>97.1</td><td>W179</td><td>7</td><td>-3.9</td><td>N/A</td><td>69</td><td>20</td><td>-4</td></tr>
<tr><td>397</td><td>3/18/2013</td><td>78</td><td>621</td><td>100</td><td>STATEMENT</td><td>6</td><td>-2.7</td><td>55</td><td>84</td><td>19</td><td>5</td></tr>
<tr><td>395</td><td>3/17/2013</td><td>63</td><td>716</td><td>94.2</td><td>STATEMENT</td><td>5</td><td>-2.7</td><td>54</td><td>85</td><td>19</td><td>5</td></tr>
<tr><td>390</td><td>3/6/2013</td><td>66</td><td>583</td><td>94.2</td><td>ENVY</td><td>2</td><td>-0.6</td><td>55</td><td>80</td><td>23</td><td>2</td></tr>
<tr><td>388</td><td>3/4/2013</td><td>53</td><td>621</td><td>100</td><td>STATEMENT</td><td>10</td><td>-5.1</td><td>49</td><td>82</td><td>20</td><td>2</td></tr>
<tr><td>300</td><td>3/22/2013</td><td>61</td><td>633</td><td>100</td><td>RANGER</td><td>8</td><td>-2.8</td><td>49</td><td>81</td><td>19</td><td>-2</td></tr>
<tr><td>379</td><td>2/1/2013</td><td>55</td><td>518</td><td>100</td><td>STATEMENT</td><td>8</td><td>-4.1</td><td>61</td><td>98</td><td>18</td><td>1</td></tr>
<tr><td>398</td><td>3/20/2013</td><td>62</td><td>664</td><td>100</td><td>RANGER</td><td>6</td><td>-2.3</td><td>53</td><td>83</td><td>20</td><td>0</td></tr>
<tr><td>384</td><td>2/10/2013</td><td>61</td><td>650</td><td>100</td><td>ENVY</td><td>3</td><td>-1</td><td>50</td><td>70</td><td>19</td><td>4</td></tr>
<tr><td>369</td><td>1/30/2013</td><td>76</td><td>651</td><td>100</td><td>STATEMENT</td><td>5</td><td>-2.4</td><td>60</td><td>99</td><td>20</td><td>8</td></tr>
<tr><td>373</td><td>1/21/2013</td><td>71</td><td>433</td><td>100</td><td>STATEMENT</td><td>4</td><td>-1.6</td><td>55</td><td>89</td><td>17</td><td>3</td></tr>
<tr><td>393</td><td>3/10/2013</td><td>63</td><td>717</td><td>100</td><td>STATEMENT</td><td>3</td><td>-4.6</td><td>51</td><td>91</td><td>20</td><td>5</td></tr>
<tr><td>389</td><td>3/8/2013</td><td>72</td><td>723</td><td>88.3</td><td>ENVY</td><td>4</td><td>-0.6</td><td>54</td><td>76</td><td>24</td><td>2</td></tr>
<tr><td>364</td><td>10/1/2012</td><td>60</td><td>574</td><td>100</td><td>RANGER</td><td>1</td><td>0.4</td><td>56</td><td>84</td><td>21</td><td>2</td></tr>
</table>
Currently, I am contemplating using WebClient.DownloadString to pull all of the text in and then creating an XML file out of it by parsing each <tr> row.
That sounds tedious, and I would rather not reinvent the wheel. Besides, a few good solutions would give me something to look at for ideas on how best to approach writing my own version.
Has anyone come across some code that can do this?
I've started, to give you an idea of what I'm working on:
private const string XML_DATA = "App_Data/page_data.xml";
private const string TABLE_START = "<table>";
private const string TABLE_STOP = "</table>";
private string[] TABLE_ROW = { "<tr>", "</tr>" };
private string[] TABLE_HEAD = { "<th>", "</th>" };
private string[] TABLE_DET = { "<td>", "</td>" };

private void load_data() {
    if (!File.Exists(XML_DATA)) {
        string HtmlText;
        using (var client = new WebClient()) {
            HtmlText = client.DownloadString(Server.MapPath("/Sales.aspx"));
        }
        if (!String.IsNullOrEmpty(HtmlText)) {
            var lcTxt = HtmlText.ToLower();
            int len0 = TABLE_START.Length;
            int tStart = lcTxt.IndexOf(TABLE_START) + len0;
            int tStop = lcTxt.IndexOf(TABLE_STOP);
            if ((len0 < tStart) && (tStart < tStop)) {
                var tableString = HtmlText.Substring(tStart, tStop - tStart);
                var tableRows = tableString.Split(TABLE_ROW, StringSplitOptions.RemoveEmptyEntries);
                foreach (var row in tableRows) {
                    if (-1 < row.IndexOf(TABLE_HEAD[0])) {
                        // header row
                    } else {
                        // data row
                    }
                }
            }
        }
    }
}
Of course, you can see that this is already going to fail, because the markup uses <table border="1"> rather than a plain <table>.
Yes, easy to fix, but I'd rather have a working guide that has already been through a lot of debugging steps.
UPDATE: I tried using XmlDocument's LoadXml method, but it can't seem to read basic HTML.
You definitely shouldn't be trying to parse that manually. Other people have already solved that problem.
If your markup is valid XML (and from what you've shown us, it looks like it is), then you can just parse it as XML:
XmlDocument doc = new XmlDocument();
doc.LoadXml(HtmlString);
doc.Save("myfile.xml");
But for that matter, if it's already valid XML markup, and all you need to do is save it as a file, then you don't need to parse it. Just save it:
File.WriteAllText("myfile.xml", HtmlString);
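If the real pages turn out not to be quite valid XML (unclosed tags, unencoded ampersands, and so on), an HTML parser such as HtmlAgilityPack saves you from hand-rolling the string slicing. Here is a rough, untested sketch assuming the HtmlAgilityPack NuGet package is installed, the page holds a single table whose first row is the header, and the URL and output path are placeholders:

using System.Linq;
using System.Net;
using System.Xml;
using System.Xml.Linq;
using HtmlAgilityPack;

public static class TableScraper
{
    // Downloads a page, parses it with HtmlAgilityPack, and rewrites the first table as simple XML.
    public static void SaveTableAsXml(string pageUrl, string outputPath)
    {
        string html;
        using (var client = new WebClient())
        {
            html = client.DownloadString(pageUrl);
        }

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var table = doc.DocumentNode.SelectSingleNode("//table");

        // Header names become element names; EncodeLocalName handles headers like "%".
        var headers = table.SelectNodes(".//th")
                           .Select(th => XmlConvert.EncodeLocalName(th.InnerText.Trim()))
                           .ToList();

        // One <row> element per data <tr>, one child element per column.
        var xml = new XElement("rows",
            table.SelectNodes(".//tr")
                 .Where(tr => tr.SelectNodes("td") != null)   // skip the header row
                 .Select(tr => new XElement("row",
                     tr.SelectNodes("td").Select((td, i) =>
                         new XElement(headers[i], td.InnerText.Trim())))));

        xml.Save(outputPath);
    }
}

Usage would be something like SaveTableAsXml("http://example.com/Sales.aspx", "App_Data/page_data.xml"); the URL here is an example, not the asker's real one.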
Related
I've tried to check other answers on this site, but none of them worked for me. I have the following HTML code:
<h3 class="x-large lheight20 margintop5">
<strong>some textstring</strong>
</h3>
I am trying to get the href value from this document with the following code:
string adUrl = Doc.DocumentNode.SelectSingleNode("//*[@id=\"offers_table\"]/tbody/tr["+i+ "]/td/table/tbody/tr[1]/td[2]/div/h3/a/@href").InnerText;
I've also tried it without @href, and with a[contains(@href, 'searchString')], but all of these lines gave me just the text of the link - some textstring
Attributes don't have InnerText. You have to use the Attributes collection instead:
string adUrl = Doc.DocumentNode.SelectSingleNode("//*[@id=\"offers_table\"]/tbody/tr["+i+ "]/td/table/tbody/tr[1]/td[2]/div/h3/a")
    .Attributes["href"].Value;
Why not just use the XDocument class?
private string GetUrl(string filename)
{
    var doc = XDocument.Load(filename);
    foreach (var h3Element in doc.Descendants("h3")
                                 .Where(e => (string)e.Attribute("class") == "x-large lheight20 margintop5"))
    {
        var anchor = h3Element.Element("a");
        if (anchor != null)
        {
            return (string)anchor.Attribute("href");
        }
    }
    return null;
}
The code is not tested so use with caution.
Hi, I am trying to parse some XML from an odd document produced by an iCalendar export. I had a lot of trouble just loading the data, but thanks to the help of people from Stack Overflow I have been able to parse it. Now I need some help getting at the individual nodes. Here is a link to the XML file I am parsing: http://datastore.unm.edu/events/events.xml
I am using the pivot app template from Visual Studio 2010 to create this app. In MainViewModel.cs I am modifying the following code in the hope that the event title will print out in place of "LineOne" (code listed below). For example, from the XML file linked above, I would like LineOne = Lobo's Got Talent.
I need help figuring out the best way to achieve this; I will also need LineTwo to contain the date and time, and LineThree to contain the description.
Thank you for your time and help, it has been greatly appreciated!
public void LoadData()
{
    var webClient = new WebClient();
    webClient.OpenReadAsync(new Uri("http://datastore.unm.edu/events/events.xml"));
    webClient.OpenReadCompleted += new OpenReadCompletedEventHandler(webClient_OpenReadCompleted);
}

public void webClient_OpenReadCompleted(object sender, OpenReadCompletedEventArgs e)
{
    XDocument unmXdoc = XDocument.Load(e.Result, LoadOptions.None);
    this.Items.Add(new ItemViewModel() { LineOne = unmXdoc.ToString(),
        LineTwo = "", LineThree = "" });
}
Thank you for looking and helping!
The XML is fine; I think you are running into a namespace issue here. You have two options: strip the namespace from the XML if you are sure you do not need it, or (the preferred option) work with the namespace and specify it in the fully qualified element names. See here.
private readonly XNamespace dataNamespace = "urn:ietf:params:xml:ns:icalendar-2.0";

public void webClient_OpenReadCompleted(object sender, OpenReadCompletedEventArgs e)
{
    XDocument unmXdoc = XDocument.Load(e.Result, LoadOptions.None);
    this.Items = from p in unmXdoc.Descendants(dataNamespace + "vevent").Elements(dataNamespace + "properties")
                 select new ItemViewModel
                 {
                     LineOne = this.GetElementValue(p, "summary"),
                     LineTwo = this.GetElementValue(p, "description"),
                     LineThree = this.GetElementValue(p, "categories"),
                 };
    lstData.ItemsSource = this.Items;
}

private string GetElementValue(XElement element, string fieldName)
{
    var childElement = element.Element(dataNamespace + fieldName);
    return childElement != null ? childElement.Value : String.Empty;
}
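For completeness, here is a rough sketch of the first option (stripping the namespaces) so that elements can be addressed by local name only. This is untested, assumes System.Linq and System.Xml.Linq are imported, and is only worth doing if you are sure you never need the namespace information:

// Rebuild the tree using local names only, dropping namespace declarations.
private static XElement StripNamespaces(XElement element)
{
    var stripped = new XElement(element.Name.LocalName);
    stripped.Add(element.Attributes().Where(a => !a.IsNamespaceDeclaration));
    if (element.HasElements)
        stripped.Add(element.Elements().Select(StripNamespaces));
    else
        stripped.Value = element.Value;
    return stripped;
}

// Usage: var cleaned = StripNamespaces(unmXdoc.Root);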
.NET / C#. I am trying to parse JSON from a web service. I have done it with the text fields, but I am having a problem parsing the image. Here is the URL I am getting the JSON from:
http://collectionking.com/rest/view/items_in_collection.json?args=122
And this is my code to parse it:
using (var wc = new WebClient()) {
    JavaScriptSerializer js = new JavaScriptSerializer();
    var result = js.Deserialize<ck[]>(wc.DownloadString("http://collectionking.com/rest/view/items_in_collection.json args=122"));
    foreach (var i in result) {
        lblTitle.Text = i.node_title;
        imgCk.ImageUrl = i.["main image"];
        lblNid.Text = i.nid;
    }
}
Any help would be great.
Thanks in advance.
PS: It returns the Title and Nid but not the Image.
My class is as follows:
public class ck
{
    public string node_title;
    public string main_image;
    public string nid;
}
Your problem is that you are setting ImageUrl to something like <img typeof="foaf:Image" src="http://... rather than an actual URL. You will need to further parse the main image value and extract the URL to show it correctly.
Edit
This was a tough nut to crack because of the whitespace in the property name. The only solution I could find was to rewrite the property name before parsing the string. It's not a very nice solution, but I couldn't find any other way using the built-in classes. You might be able to solve it properly using JSON.NET or some other library, though.
I also added a regular expression to extract the URL for you, though there is no error checking whatsoever here, so you'll need to add that yourself.
using (var wc = new WebClient()) {
    JavaScriptSerializer js = new JavaScriptSerializer();
    // Replace the name "main image" with "main_image" so it deserializes properly; also fixed the missing ? in the url.
    var result = js.Deserialize<ck[]>(
        wc.DownloadString("http://collectionking.com/rest/view/items_in_collection.json?args=122")
          .Replace("\"main image\":", "\"main_image\":"));
    foreach (var i in result) {
        lblTitle.Text = i.node_title;
        // Extract the value of the src attribute to get the actual url; this will throw if there isn't a src attribute.
        string realImageUrl = Regex.Match(i.main_image, @"src=""(.*?)""").Groups[1].Value;
        imgCk.ImageUrl = realImageUrl;
        lblNid.Text = i.nid;
    }
}
Try This
// Pulls the value of the src attribute out of an <img ...> tag string.
private static string ExtractImageFromTag(string tag)
{
    int start = tag.IndexOf("src=\""),
        end = tag.IndexOf("\"", start + 6);
    return tag.Substring(start + 5, end - start - 5);
}

// Pulls the inner text out of a simple tag string, e.g. <a ...>text</a>.
private static string ExtractTitleFromTag(string tag)
{
    int start = tag.IndexOf(">"),
        end = tag.IndexOf("<", start + 1);
    return tag.Substring(start + 1, end - start - 1);
}
It may help
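One possible way to wire these into the earlier loop, assuming the deserialization uses the "main_image" rename from the answer above and that main_image holds the full <img ...> tag string (an untested sketch):

foreach (var i in result) {
    lblTitle.Text = i.node_title;
    // Use the helper above instead of a regular expression.
    imgCk.ImageUrl = ExtractImageFromTag(i.main_image);
    lblNid.Text = i.nid;
}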
I have created a simple web crawler, but I want to add recursion so that for every page that is opened I can also collect the URLs on that page. I have no idea how to do that, and I also want to include threads to make it faster.
Here is my code
namespace Crawler
{
    public partial class Form1 : Form
    {
        String Rstring;

        public Form1()
        {
            InitializeComponent();
        }

        private void button1_Click(object sender, EventArgs e)
        {
            WebRequest myWebRequest;
            WebResponse myWebResponse;
            String URL = textBox1.Text;
            myWebRequest = WebRequest.Create(URL);
            myWebResponse = myWebRequest.GetResponse();                 // Returns a response from an Internet resource
            Stream streamResponse = myWebResponse.GetResponseStream();  // Returns the data stream from the internet and saves it in the stream
            StreamReader sreader = new StreamReader(streamResponse);    // Reads the data stream
            Rstring = sreader.ReadToEnd();                              // Reads it to the end
            String Links = GetContent(Rstring);                         // Gets the links only
            textBox2.Text = Rstring;
            textBox3.Text = Links;
            streamResponse.Close();
            sreader.Close();
            myWebResponse.Close();
        }

        private String GetContent(String Rstring)
        {
            String sString = "";
            HTMLDocument d = new HTMLDocument();
            IHTMLDocument2 doc = (IHTMLDocument2)d;
            doc.write(Rstring);
            IHTMLElementCollection L = doc.links;
            foreach (IHTMLElement links in L)
            {
                sString += links.getAttribute("href", 0);
                sString += "\n";
            }
            return sString;
        }
    }
}
I fixed your GetContent method as follows to get new links from the crawled page:
public ISet<string> GetNewLinks(string content)
{
    Regex regexLink = new Regex("(?<=<a\\s*?href=(?:'|\"))[^'\"]*?(?=(?:'|\"))");
    ISet<string> newLinks = new HashSet<string>();
    foreach (var match in regexLink.Matches(content))
    {
        if (!newLinks.Contains(match.ToString()))
            newLinks.Add(match.ToString());
    }
    return newLinks;
}
Updated
Fixed: regex should be regexLink. Thanks @shashlearner for pointing this out (my typo).
I have created something similar using Reactive Extensions:
https://github.com/Misterhex/WebCrawler
I hope it can help you.
Crawler crawler = new Crawler();
IObservable observable = crawler.Crawl(new Uri("http://www.codinghorror.com/"));
observable.Subscribe(onNext: Console.WriteLine,
onCompleted: () => Console.WriteLine("Crawling completed"));
The following includes an answer/recommendation.
I believe you should use a DataGridView instead of a TextBox, since it is much easier to see the links (URLs) found when you look at them in the GUI.
You could change:
textBox3.Text = Links;
to
dataGridView.DataSource = Links;
Now for a question of my own: you haven't included the using System...; directives. Which ones were used? It would be appreciated if you could list them, as I can't figure them out.
From a design standpoint, I've written a few web crawlers. Basically, you want to implement a depth-first search using a Stack data structure. You can use breadth-first search as well, but you'll likely run into memory issues. Good luck.
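To make the depth-first idea concrete, here is a rough, untested sketch of a non-recursive crawl loop built around a Stack and the GetNewLinks method from the earlier answer. The page limit and WebClient usage are my own assumptions, and note that the regex returns raw href values, so relative links would still need to be resolved against the page's URL:

// Depth-first crawl: push discovered links onto a stack and pop until empty or a page limit is hit.
public ISet<string> Crawl(string startUrl, int maxPages = 100)
{
    var visited = new HashSet<string>();
    var pending = new Stack<string>();
    pending.Push(startUrl);

    using (var client = new WebClient())
    {
        while (pending.Count > 0 && visited.Count < maxPages)
        {
            string url = pending.Pop();
            if (!visited.Add(url))
                continue; // already crawled

            string content;
            try { content = client.DownloadString(url); }
            catch (WebException) { continue; } // skip pages that fail to download

            // Queue up every link found on this page that we have not seen yet.
            foreach (var link in GetNewLinks(content))
            {
                if (!visited.Contains(link))
                    pending.Push(link);
            }
        }
    }
    return visited;
}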
I have looked all over for this; it could be that I'm just typing the wrong thing into search, I'm not sure. So if you know a good tutorial or example of this, please share. I'm trying to learn.
I have a C# Windows Forms app I'm working on. I have information (movies in this case) saved in an XML file. I save the XML file like this:
// Now we add a new movie.
XmlElement nodRoot = doc.DocumentElement;
string allMyChildren = nodRoot.InnerText;
string capitalized = CultureInfo.CurrentCulture.TextInfo.ToTitleCase(movieEditNameTextbox.Text);
int indexLookForNewMake = allMyChildren.IndexOf(capitalized);
if (indexLookForNewMake >= 0)
{
    MessageBox.Show("Movie is already saved.", "Error");
}
else
{
    XmlElement el = doc.CreateElement("Name");
    el.InnerText = capitalized;
    doc.DocumentElement.AppendChild(el);
    // Check if Year is really a number.
    if (movieEditYearTextbox.Text.All(Char.IsDigit))
    {
        // Remove ' because it gives errors.
        string capitalizedFixed = capitalized.Replace("'", "");
        string capitalizedFinalFixed = capitalizedFixed.Replace("\"", "");
        // Assign attributes to each new one.
        el.SetAttribute("Name", capitalizedFinalFixed);
        el.SetAttribute("Type", movieEditTypeDropdown.Text);
        el.SetAttribute("Year", movieEditYearTextbox.Text);
        // Reset all fields; they don't need data now.
        movieEditNameTextbox.Text = "";
        movieEditYearTextbox.Text = "";
        movieEditTypeDropdown.SelectedIndex = -1;
        removeMovieTextbox.Text = "";
        doc.Save("movie.xml");
        label4.Text = "Movie Has been Edited";
        loadXml();
    }
    else
    {
        // Error out. Year is not a number.
        MessageBox.Show("Check movie year. Seems it isn't a number.", "Error");
    }
}
That all works fine. Now what I'm trying to do is let you choose a directory, have it search that directory and its subdirectories, get the file names, and save them into the XML file.
I used the code below to try to accomplish this. It does pull the list, but it doesn't save it; the new information never makes it into the file.
I can't use LINQ, as it causes a conflict for some reason with other code.
DirectoryInfo dirCustom = new DirectoryInfo(@"D:\Video");
FileInfo[] filCustom;
filCustom = dirCustom.GetFiles("*", SearchOption.AllDirectories);
// Open XML file.
XmlDocument doc = new XmlDocument();
doc.Load("movie.xml");
XmlElement el = doc.CreateElement("Name");
string fulCustoms = filCustom.ToString();
foreach (FileInfo filFile in filCustom)
{
    string capitalized = CultureInfo.CurrentCulture.TextInfo.ToTitleCase(filFile.Name);
    string capitalizedFixed = capitalized.Replace("\"", "");
    el.SetAttribute("Name", capitalizedFixed);
    el.SetAttribute("Type", "EDIT TYPE");
    el.SetAttribute("Year", "EDIT YEAR");
    richTextBox1.AppendText(capitalizedFixed + "\r\n");
}
doc.Save("movie.xml");
label4.Text = "Movie Has been Edited";
loadXml();
Now, the richTextBox does display the information correctly, but it doesn't get saved.
The loadXml() is just my noobish way to refresh the datagridview.
I'm completely lost and don't know where to turn. I know my coding is probably horrible, lol. I'm new to this; this is the first more complex application I have worked on.
I can't think of any more information that would help you understand what I mean. I hope you do.
Thank you so much for your help.
Not sure exactly what your loadXml() method does, but my only piece of advice is to change the way you are implementing this functionality.
Create an object called Movie
public class Movie
{
    public Movie() { }
    public String Title { get; set; }
    // blah... blah...
}
Then create a MovieList
public class MovieList : List<Movie> { }
Then implement the following two methods inside MovieList:
public static void Serialize(String path, MovieList movieList)
{
    XmlSerializer serializer = new XmlSerializer(typeof(MovieList));
    using (StreamWriter streamWriter = new StreamWriter(path))
    {
        serializer.Serialize(streamWriter, movieList);
    }
}

public static MovieList Deserialize(String path)
{
    XmlSerializer serializer = new XmlSerializer(typeof(MovieList));
    using (StreamReader streamReader = new StreamReader(path))
    {
        return (MovieList)serializer.Deserialize(streamReader);
    }
}
That's it. You now have your object serialized, and you can retrieve the data to populate the UI through binding or whatever other method you choose.
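To tie this back to the directory-scanning part of the question, a rough usage sketch (untested; the Type and Year values are placeholders, and it assumes Movie also has Type and Year string properties, which the class above leaves as "blah... blah..."):

// Build a MovieList from the files in a folder and save it as XML.
var movies = new MovieList();
foreach (var file in new DirectoryInfo(@"D:\Video").GetFiles("*", SearchOption.AllDirectories))
{
    movies.Add(new Movie
    {
        Title = CultureInfo.CurrentCulture.TextInfo.ToTitleCase(file.Name),
        Type = "EDIT TYPE",   // placeholder
        Year = "EDIT YEAR"    // placeholder
    });
}
MovieList.Serialize("movie.xml", movies);

// Later, read it back:
MovieList loaded = MovieList.Deserialize("movie.xml");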