How can I slice my data into separate lists? - C#

I need to take some data from a URL, but the div contains three <a> elements. I need to split them up and put each value into a separate list. The HTML data at the URL looks like this:
<div class="nesne row nobetciDiv ">
<div class="col-md-4 tablo yuksek">
<div class="hucre hucre-ortala">
ARAT ECZANESİ // I need this in the first list.
<br>
0(242) 237-67-22 // I need this in the second list.
</div>
</div>
<div class="col-md-8 tablo yuksek">
<div class="hucre hucre-ortala">
<a href="https://maps.google.com/maps?q=36.8905816274,30.6800764847" class="nadres" target="_blank">
<img src="/Resim/Upload/mapi.png" class="mapi">
K.KARABEKIR CD.EGITIM ARASTIRMA HASTANESI ACIL KARSISI </a> // I need this in the third list.
</div>
</div>
</div>
I tried this:
List<string> pharmacyName = new List<string>();
List<string> pharmacyAdress = new List<string>();
List<string> pharmacyNumber = new List<string>();
Uri url = new Uri("https://www.antalyaeo.org.tr/tr/nobetci-eczaneler");
WebClient client = new WebClient();
client.Headers.Add("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)");
string html = client.DownloadString(url);
HtmlAgilityPack.HtmlDocument document = new HtmlAgilityPack.HtmlDocument();
document.LoadHtml(html);
var xpath = "//text()[not(normalize-space())]";
var emptyNodes = document.DocumentNode.SelectNodes(xpath);
foreach (HtmlNode emptyNode in emptyNodes)
{
emptyNode.ParentNode
.ReplaceChild(HtmlTextNode.CreateNode(""), emptyNode);
}
HtmlNodeCollection title = document.DocumentNode.SelectNodes("//div[contains(@class,'nobetciDiv')]");
foreach (var item in title)
{
    HtmlNode x = item.SelectSingleNode("//div[contains(@class,'col-md-4')]");
    HtmlNode a = x.SelectSingleNode("//a");
    pharmacyName.Add(a.InnerText);
    // It gives me something like: ARAT ECZANESİ0(242) 237-67-22. I can't separate it, and I can't get the third one.
}
Sorry for my bad English; I tried to describe the problem mostly with code. Thanks for any help!

First, <div class="col-md-4 tablo yuksek"></div> contains two <a></a> elements, so a single SelectSingleNode call cannot return both of them. Second, the third <a></a> is in <div class="col-md-8 tablo yuksek"></div> rather than in <div class="col-md-4 tablo yuksek"></div>. Try changing
foreach (var item in title)
{
    HtmlNode x = item.SelectSingleNode("//div[contains(@class,'col-md-4')]");
    HtmlNode a = x.SelectSingleNode("//a");
    pharmacyName.Add(a.InnerText);
    // It gives me something like: ARAT ECZANESİ0(242) 237-67-22. I can't separate it, and I can't get the third one.
}
to
foreach (var item in title)
{
    // The leading "." makes each XPath relative to the current node; a bare "//" searches the whole document.
    HtmlNode x = item.SelectSingleNode(".//div[contains(@class,'col-md-4')]");
    HtmlNodeCollection links = x.SelectNodes(".//a[@href]");
    pharmacyName.Add(links[0].InnerText);
    pharmacyNumber.Add(links[1].InnerText); // the second link holds the phone number
    string s3 = item.SelectSingleNode(".//div[contains(@class,'col-md-8')]").SelectSingleNode(".//a[contains(@class,'nadres')]").InnerText.Replace("\r\n", "").Trim();
    pharmacyAdress.Add(s3); // the 'nadres' link holds the address
}
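For the HTML exactly as posted in the question, where the name and the phone number are plain text separated by a <br> rather than links, a minimal sketch could read the text nodes instead. It assumes the HtmlAgilityPack NuGet package; the PharmacyParser class and its Parse method are hypothetical names used only for illustration, and the live page may be structured differently.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class PharmacyParser
{
    public static (List<string> Names, List<string> Numbers, List<string> Addresses) Parse(string html)
    {
        var names = new List<string>();
        var numbers = new List<string>();
        var addresses = new List<string>();

        var document = new HtmlDocument();
        document.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var rows = document.DocumentNode.SelectNodes("//div[contains(@class,'nobetciDiv')]");
        if (rows == null) return (names, numbers, addresses);

        foreach (HtmlNode row in rows)
        {
            // The leading "." keeps both searches inside the current row.
            HtmlNode cell = row.SelectSingleNode(
                ".//div[contains(@class,'col-md-4')]//div[contains(@class,'hucre')]");
            HtmlNode link = row.SelectSingleNode(".//a[contains(@class,'nadres')]");
            if (cell == null || link == null) continue;

            // The <br> splits the cell into separate text nodes: name first, phone second.
            List<string> parts = cell.ChildNodes
                .Where(n => n.NodeType == HtmlNodeType.Text)
                .Select(n => HtmlEntity.DeEntitize(n.InnerText).Trim())
                .Where(t => t.Length > 0)
                .ToList();
            if (parts.Count < 2) continue;

            names.Add(parts[0]);
            numbers.Add(parts[1]);
            addresses.Add(HtmlEntity.DeEntitize(link.InnerText).Trim());
        }

        return (names, numbers, addresses);
    }
}
```

Calling PharmacyParser.Parse(html) with the downloaded page string returns the three lists side by side, so the entry at a given index in each list belongs to the same pharmacy.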

Related

Remove duplicate rel attributes in HTML using c#

I have an HTML document with a number of links. Some of these links contain two identical rel attributes, and I would like to iterate over all of the a-tags, check if there are more than one rel attribute and remove one if there is, but I can't figure out how to do it.
Example:
<a rel="nofollow" rel="nofollow" href="blah"> link</a>
Should be:
<a rel="nofollow" href="blah"> link</a>
Thanks for any help.
If your attributes are identical, this is a possible solution using the HtmlAgilityPack NuGet package:
using System.IO;
using System.Text;
using System.Linq;
using HtmlAgilityPack;
public class Program
{
public static void Main()
{
var html =
@"
<div>
<a rel='nofollow' rel='nofollow' href='lah'> link1</a>
<a rel='qwerty' rel='qwerty' href='lah'> link2</a>
<a rel='asdf' rel='asdf' href='lah'> link3</a>
</div>";
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
var htmlNodes = htmlDoc.DocumentNode.SelectNodes("//a");
foreach (var node in htmlNodes)
{
var attr = node.Attributes["rel"];
while(node.Attributes.Contains("rel")) node.Attributes.Remove("rel");
node.Attributes.Add("rel", attr.Value);
}
using var ms = new MemoryStream();
htmlDoc.Save(ms, Encoding.UTF8);
var result = Encoding.UTF8.GetString(ms.ToArray());
System.Console.WriteLine(result);
}
}

C# Get iFrame source within the current webrequest

So I was wondering how I would get the source of an iFrame within the Page Source of the webrequest that has been made.
Example of what I mean:
string text = streamReader.ReadToEnd(); // Sets the string Text to the source of the page.
Now the string text holds the source of the page.
And within that page of the source is
Authenticator
</h3>
</div>
<div class="RaggedBoxContainer"><div class="RaggedBoxBg"><div class="RaggedBoxTop"></div><div class="RaggedBoxContent">
<iframe src="https://secure.runescape.com/m=totp-authenticator/a=13/c=zBsBJTw2E0M/accountinfo" allowtransparency="true" frameborder="0"></iframe>
</div><div class="RaggedBoxBottom"></div></div></div>
</div>
And I need it to read the source of the iFrame which is:
<h2 class="accountSettingsTitle">RuneScape Authenticator is enabled</h2>
<p>Your account is protected from hijackers. You will need your code generator each time you log in to RuneScape.</p>
<p>It's also really important to keep your email account secure. <a target="_top" href="https://support.runescape.com/hc/en-gb/articles/207258145">Find out how to do this.</a>
<p>You can <a target="_top" href="cape.com/m=totp-authenticator/a=13/c=zBsBJTw2E0M/disableTOTPRequest">disable</a> Authenticator - but remember this will make your account much less secure.</p>
How would I do that?
I know that in the webbrowser it will be:
foreach (HtmlElement elm in webBrowser1.Document.GetElementsByTagName("iframe"))
{
string src = elm.GetAttribute("src");
if (src != null && src != "")
{
string content = new System.Net.WebClient().DownloadString(src); //or using HttpWebRequest
MessageBox.Show(content);
}
}
Please help me out, I'm confused.
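One possible approach, sketched under assumptions: parse the page source that is already in the string with HtmlAgilityPack, collect each iframe's src attribute, and resolve it against the address of the page. The IframeHelper class and GetIframeSources method are hypothetical names, and the HtmlAgilityPack NuGet package is assumed to be installed.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class IframeHelper
{
    // pageHtml: the page source already read from the response stream.
    // pageUri: the address that source was downloaded from, used to resolve relative src values.
    public static List<Uri> GetIframeSources(string pageHtml, Uri pageUri)
    {
        var result = new List<Uri>();

        var doc = new HtmlDocument();
        doc.LoadHtml(pageHtml);

        // SelectNodes returns null when the page contains no iframe with a src attribute.
        var frames = doc.DocumentNode.SelectNodes("//iframe[@src]");
        if (frames == null) return result;

        foreach (HtmlNode frame in frames)
        {
            string src = HtmlEntity.DeEntitize(frame.GetAttributeValue("src", "")).Trim();
            if (src.Length == 0) continue;

            // Handles both absolute and relative src values.
            if (Uri.TryCreate(pageUri, src, out Uri absolute))
                result.Add(absolute);
        }

        return result;
    }
}
```

Each returned address can then be fetched with a second request, the same way the outer page was fetched. If the outer page required a login, the second request will probably need to carry the same cookies, which this sketch does not handle.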

Can't get innertext html c#

OK. So I found this code online. Everything in it works, and it shows me the div class I am searching for, but it removes all the text. Any idea why? Here's an example of what it outputs:
<div class="marquee"><img src="logo.png" /></div>
<div id="joke">
<div id="setup" class="exit-left"></div>
<div id="punchline">
<div class="question"></div>
<div id="zing" class="exit-right"></div>
</div>
</div>
<div id="buttons">
<input id="tell-me" class="exit-bottom no-select" type="button" value="Tell Me!" />
<!--<input id="another" class="exit-bottom" type="button" value="Another!" />-->
<table class="another exit-bottom no-select">
<tr>
<td class="another" colspan="3">Another</td>
<td class="share"><div class="share-img"></div>Share</td>
</tr>
</table>
</div>
And the inner text is not shown at all.
And here is my code in VS:
var doc = new HtmlAgilityPack.HtmlDocument();
HtmlAgilityPack.HtmlNode.ElementsFlags["br"] = HtmlAgilityPack.HtmlElementFlag.Empty;
doc.OptionWriteEmptyNodes = true;
try
{
var webRequest = HttpWebRequest.Create("http://dadjokegenerator.com/");
Stream stream = webRequest.GetResponse().GetResponseStream();
doc.Load(stream);
stream.Close();
}
catch (System.UriFormatException uex)
{
throw;
}
catch (System.Net.WebException wex)
{
throw;
}
//get the div by id and then get the inner HTML
var divString = doc.GetElementbyId("content").InnerHtml;
await e.Channel.SendMessage("test " + divString);
Although your code correctly downloads the content of the page http://dadjokegenerator.com/, InnerHtml is empty because the page itself doesn't contain the joke you are looking for (you can see this if you display the page's source code in your web browser, e.g. in Firefox press CTRL+U). The joke is added to the page later by JavaScript. If you look at the source of that JavaScript at http://dadjokegenerator.com/js/main.js, you can see that individual jokes are downloaded from the URL http://dadjokegenerator.com/api/api.php?a=j&lt=r&vj=0
Here is a minimal sample that downloads a joke from this URL. I omitted all error checks for simplicity and used the free Json.NET library for JSON deserialization:
public class Joke
{
public int Id;
public string Setup;
public string Punchline;
public override string ToString()
{
return Setup + " " + Punchline;
}
}
public static Joke GetJoke()
{
var request = HttpWebRequest.Create("http://dadjokegenerator.com/api/api.php?a=j&lt=r&vj=0");
using (var response = request.GetResponse())
{
using (var stream = response.GetResponseStream())
{
using (var reader = new StreamReader(stream))
{
var jokeString = reader.ReadToEnd();
Joke[] jokes = JsonConvert.DeserializeObject<Joke[]>(jokeString);
return jokes.FirstOrDefault();
}
}
}
}
Usage is e.g.
GetJoke().ToString();
These links show how to read a web page.
Html Agility Pack. Load and scrape webpage
Get HTML code from website in C#

How to use HTMLAgilityPack to extract HTML data

I am learning to write web crawler and found some great examples to get me started but since I am new to this, I have a few questions in regards to the coding method.
The search result for example can be found here: Search Result
When I look at the HTML source for the result I can see the following:
<HR><CENTER><H3>License Information *</H3></CENTER><HR>
<P>
<CENTER> 06/03/2014 </CENTER> <BR>
<B>Name : </B> WILLIAMS AJAYA L <BR>
<B>Address : </B> NEW YORK NY <BR>
<B>Profession : </B> ATHLETIC TRAINER <BR>
<B>License No: </B> 001475 <BR>
<B>Date of Licensure : </B> 01/12/07 <BR>
<B>Additional Qualification : </B> Not applicable in this profession <BR>
<B> Status :</B> REGISTERED <BR>
<B>Registered through last day of : </B> 08/15 <BR>
How can I use the HTMLAgilityPack to scrape this data from the site?
I was trying to implement an example as shown below, but not sure where to make the edit to get it working to crawl the page:
private void btnCrawl_Click(object sender, EventArgs e)
{
foreach (SHDocVw.InternetExplorer ie in shellWindows)
{
filename = Path.GetFileNameWithoutExtension( ie.FullName ).ToLower();
if ( filename.Equals( "iexplore" ) )
txtURL.Text = "Now Crawling: " + ie.LocationURL.ToString();
}
string url = ie.LocationURL.ToString();
string xmlns = "{http://www.w3.org/1999/xhtml}";
Crawler cl = new Crawler(url);
XDocument xdoc = cl.GetXDocument();
var res = from item in xdoc.Descendants(xmlns + "div")
where item.Attribute("class") != null && item.Attribute("class").Value == "folder-news"
&& item.Element(xmlns + "a") != null
//select item;
select new
{
Link = item.Element(xmlns + "a").Attribute("href").Value,
Image = item.Element(xmlns + "a").Element(xmlns + "img").Attribute("src").Value,
Title = item.Elements(xmlns + "p").ElementAt(0).Element(xmlns + "a").Value,
Desc = item.Elements(xmlns + "p").ElementAt(1).Value
};
foreach (var node in res)
{
MessageBox.Show(node.ToString());
tb.Text = node + "\n";
}
//Console.ReadKey();
}
The Crawler helper class:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Xml.Linq;
namespace CrawlerWeb
{
public class Crawler
{
public string Url
{
get;
set;
}
public Crawler() { }
public Crawler(string Url)
{
this.Url = Url;
}
public XDocument GetXDocument()
{
HtmlAgilityPack.HtmlWeb doc1 = new HtmlAgilityPack.HtmlWeb();
doc1.UserAgent = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)";
HtmlAgilityPack.HtmlDocument doc2 = doc1.Load(Url);
doc2.OptionOutputAsXml = true;
doc2.OptionAutoCloseOnEnd = true;
doc2.OptionDefaultStreamEncoding = System.Text.Encoding.UTF8;
XDocument xdoc = XDocument.Parse(doc2.DocumentNode.SelectSingleNode("html").OuterHtml);
return xdoc;
}
}
}
tb is a multiline textbox... So I would like it to display the following:
Name WILLIAMS AJAYA L
Address NEW YORK NY
Profession ATHLETIC TRAINER
License No 001475
Date of Licensure 1/12/07
Additional Qualification Not applicable in this profession
Status REGISTERED
Registered through last day of 08/15
I would like the second argument to be added to an array because next step would be to write to a SQL database...
I am able to get the URL from the IE which has the search result but how can I code it in my script?
This little snippet should get you started:
HtmlDocument doc = new HtmlDocument();
WebClient client = new WebClient();
string html = client.DownloadString("http://www.nysed.gov/coms/op001/opsc2a?profcd=67&plicno=001475&namechk=WIL");
doc.LoadHtml(html);
HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//div");
You basically use the WebClient class to download the HTML file and then you load that HTML into the HtmlDocument object. Then you need to use XPath to query the DOM tree and search for nodes. In the above example "nodes" will include all the div elements in the document.
Here's a quick reference about the XPath syntax: http://msdn.microsoft.com/en-us/library/ms256086(v=vs.110).aspx
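For the license page shown in the question, each label sits in a <B> element and its value is the text that follows it, so one way to collect the pairs is to select the <B> nodes and read each one's next sibling. This is a minimal sketch assuming the HtmlAgilityPack NuGet package and the markup exactly as posted; the LicenseParser class and Parse method are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class LicenseParser
{
    // Returns label/value pairs such as ("Name", "WILLIAMS AJAYA L") in document order.
    public static List<KeyValuePair<string, string>> Parse(string html)
    {
        var fields = new List<KeyValuePair<string, string>>();

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // HtmlAgilityPack lower-cases element names, so "//b" also matches <B>.
        var labels = doc.DocumentNode.SelectNodes("//b");
        if (labels == null) return fields;

        foreach (HtmlNode label in labels)
        {
            // "Name : " becomes "Name".
            string name = HtmlEntity.DeEntitize(label.InnerText).Trim().TrimEnd(':').Trim();

            // The value is the text node placed directly after the closing </B>.
            HtmlNode next = label.NextSibling;
            if (next == null || next.NodeType != HtmlNodeType.Text) continue;

            string value = HtmlEntity.DeEntitize(next.InnerText).Trim();
            fields.Add(new KeyValuePair<string, string>(name, value));
        }

        return fields;
    }
}
```

The values alone can be copied into an array with fields.Select(f => f.Value).ToArray() (after adding using System.Linq) before they are written to the database.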

Selenium - Get elements html rather Text Value

With this code I have extracted all the desired text from an HTML document:
private void RunThroughSearch(string url)
{
IWebDriver driver = new FirefoxDriver();
INavigation nav = driver.Navigate();
nav.GoToUrl(url);
var div = driver.FindElement(By.Id("results"));
var element = driver.FindElements(By.ClassName("sa_wr"));
}
However, I need to refine the results extracted from the document:
Container
HEADER -> Title of a given block
Url -> Link to the relevant block
text -> body of a given block
/Container
As you can see in my code, I am able to get the value of the text part
as a text value. That was fine, but what if I want to have
the value of the container as HTML and not the extracted text?
<div class="container">
<div class="Header"> Title...</div>
<div class="Url"> www.example.co.il</div>
<div class="ResConent"> bla.. </div>
</div>
The container appears about 10 times in a page,
and I need to extract its innerHTML.
Any ideas? (Using Selenium.)
This seemed to work for me, and is less code:
var element = driver.FindElement(By.ClassName("sa_wr"));
var innerHtml = element.GetAttribute("innerHTML");
Find the elements first, then use IJavaScriptExecutor to get the inner HTML of each one.
var elements = driver.FindElements(By.ClassName("sa_wr"));
IJavaScriptExecutor js = driver as IJavaScriptExecutor;
if (js != null) {
    foreach (IWebElement element in elements) {
        // arguments[0] must be a single element, not the whole collection
        string innerHtml = (string)js.ExecuteScript("return arguments[0].innerHTML;", element);
    }
}
I found the solution on SQA-SO:
// driver is an IWebDriver that has already been created and navigated to the page
IJavaScriptExecutor js = driver as IJavaScriptExecutor;
js.ExecuteScript("document.getElementById('title').innerHTML = 'New text!';");
