C# - Get variable from webbrowser generated by javascript

I have downloaded a page with the WebBrowser control and need to get an e-mail address from it, but the address is generated by JavaScript. In the page source I can find this script:
<script type="text/javascript" charset="utf-8">var i='ma'+'il'+'to';var a='impexta#impexta.sk';document.write(''+a+'');</script>
Everything I have read explains how to invoke a script function, but this script has no function name that I could call. What I want is the value of the variable "a".
EDIT: Code before:
...
WebBrowser wb = new WebBrowser();
wb.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(wb_DocumentCompleted);
wb.Navigate(url);
while (wb.ReadyState != WebBrowserReadyState.Complete)
{
    System.Windows.Forms.Application.DoEvents();
}
...
void wb_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    WebBrowser wb = sender as WebBrowser;
    if (wb != null)
    {
        if (wb.ReadyState == WebBrowserReadyState.Complete)
        {
            HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
            doc.Load(wb.DocumentStream);
        }
    }
}

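I found an easy solution: just locate the right part of the string in the HTML code:
// root is assumed to be the HtmlAgilityPack document node and outMail a string declared earlier.
foreach (HtmlNode script in root.SelectNodes("//script"))
{
    if (script.InnerText.Contains("+a+"))
    {
        // Cut out the text between "var a='" and "';document.write".
        string[] separators = new string[] { "var a='", "';document.write" };
        string[] parts = script.InnerText.Split(separators, StringSplitOptions.None);
        if (parts.Length < 2)
        {
            continue; // the expected pattern is not in this script
        }
        outMail = System.Net.WebUtility.HtmlDecode(parts[1]);
        if (outMail != "")
        {
            break;
        }
    }
}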
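As a variation on the string-splitting approach above, the same value could be pulled out with a regular expression. This is only a minimal sketch: ScriptVariableReader and ReadStringVariable are hypothetical names, and it assumes the variable is assigned one single-quoted literal, as in the script from the question. It does not evaluate JavaScript, so a concatenated value such as 'ma'+'il'+'to' would yield only its first piece.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class ScriptVariableReader
{
    // Returns the HTML-decoded value of the single-quoted literal assigned to
    // "var <name>" in the script text, or null when no such assignment exists.
    public static string ReadStringVariable(string scriptText, string name)
    {
        // Regex.Escape keeps the variable name from being read as a pattern.
        string pattern = @"\bvar\s+" + Regex.Escape(name) + @"\s*=\s*'([^']*)'";
        Match match = Regex.Match(scriptText, pattern);
        return match.Success ? WebUtility.HtmlDecode(match.Groups[1].Value) : null;
    }
}
```

With HtmlAgilityPack this could be called as ScriptVariableReader.ReadStringVariable(script.InnerText, "a") for each script node, stopping at the first non-null result.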

Related

C# Wait for Web Page to Load Before Scraping

I am trying to make a Windows Forms app that logs into another web application, navigates through a few steps (clicks) until it reaches a specific page, and then scrapes some information (names and addresses).
The problem is that I am using the DocumentCompleted event to make sure a page has loaded before I run the code that navigates to the next page (in order to reach the final web page).
The DocumentCompleted handler fires multiple times for each page.
When I reach the login page, it enters the credentials and then the message "Page loaded!" appears multiple times.
I press Enter and it appears again.
Then it navigates to the next page, and with that new page I have the same problem.
How can I make the DocumentCompleted handler fire only once per page instead of multiple times?
private void loadEvent(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    MessageBox.Show("Page loaded!");
}
private void loadLogin(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    var inputElements = webBrowser1.Document.GetElementsByTagName("input");
    foreach (HtmlElement i in inputElements)
    {
        if (i.GetAttribute("name").Equals("utilizator"))
        {
            i.InnerText = textBox1.Text;
        }
        if (i.GetAttribute("name").Equals("parola"))
        {
            i.Focus();
            i.InnerText = textBox2.Text;
        }
    }
    var buttonElements = webBrowser1.Document.GetElementsByTagName("input");
    foreach (HtmlElement b in buttonElements)
    {
        if (b.GetAttribute("name").Equals("Intra"))
        {
            b.InvokeMember("Click");
        }
    }
    webBrowser1.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(loadEvent);
    var inputElements1 = webBrowser1.Document.GetElementsByTagName("input");
    foreach (HtmlElement i1 in inputElements1)
    {
        if (i1.GetAttribute("id").Equals("headerqstext"))
        {
            i1.Focus();
            i1.InnerText = textBox3.Text;
        }
    }
    var buttonElements1 = webBrowser1.Document.GetElementsByTagName("button");
    foreach (HtmlElement b1 in buttonElements1)
    {
        if (b1.GetAttribute("title").Equals("Caută"))
        {
            b1.InvokeMember("Click");
        }
    }
    webBrowser1.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(loadEvent);
}
private void Button1_Click(object sender, EventArgs e)
{
    webBrowser1.Navigate("http://10.1.104.23/ecris_cdms/");
    webBrowser1.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(loadLogin);
}
}
}
Try this :)
Uri last = null;
private void CompleteResponse(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    // Skip repeated DocumentCompleted events raised for the same URL.
    if (last != null && last == e.Url)
        return;
    last = e.Url;
    //your code here
}
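Another commonly used guard compares the URL reported by the event with the browser's own top-level URL, since DocumentCompleted is also raised for each frame of a page. The sketch below shows only that comparison; DocumentCompletedFilter and IsTopLevelDocument are hypothetical names, and it assumes the extra events in the question come from frames.

```csharp
using System;

public static class DocumentCompletedFilter
{
    // True when the completed URL is the same as the browser's top-level URL,
    // i.e. the event belongs to the main document rather than to a frame.
    // Uri.Equals ignores the fragment (#...) part of both URLs.
    public static bool IsTopLevelDocument(Uri completedUrl, Uri browserUrl)
    {
        if (completedUrl == null || browserUrl == null)
        {
            return false;
        }
        return completedUrl.Equals(browserUrl);
    }
}
```

Inside a handler this might be used as: if (!DocumentCompletedFilter.IsTopLevelDocument(e.Url, webBrowser1.Url)) return;. Note that every += on DocumentCompleted adds one more subscription, so a handler attached several times also runs several times per event, independently of frames.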

Getting <img src=""> attribute from AliExpress error

Today I was trying to load images from AliExpress product listings.
I was using this code: string NowImage = HJ.GetElementsByTagName("img")[0].GetAttribute("src");
It worked for the first 8 images but not for the rest; for those it returned an empty string.
I checked the AliExpress HTML and, as far as I can tell, it should work.
Can someone help me? Thanks for reading.
public bool Search()
{
    WB.DocumentCompleted += WB_SearchCompleted;
    WB.Navigate(URL);
    while (WB.ReadyState != WebBrowserReadyState.Complete)
        Application.DoEvents();
    return true;
}
private void WB_SearchCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    HtmlElementCollection HEC = WB.Document.GetElementsByTagName("li");
    foreach (HtmlElement HJ in HEC)
    {
        if (HJ.GetAttribute("qrdata") == "")
            continue;
        NowImage = HJ.GetElementsByTagName("img")[0].GetAttribute("src");
        // For the first 8 images this loaded perfectly; after that it
        // returned an empty string.
    }
}

Navigating to a next page after documentCompleted in webbrowser

I'm new to web scraping. I would like to get an "id", have the WebBrowser navigate to the next page using this "id", and then do what I need to do there. For example:
Go to https://www.example.com/
then get "id"
navigate https://www.example.com/id
then get title name
Go to https://www.example.com/
then get "second id"
navigate https://www.example.com/id
then get title name
...
How can I achieve this in C#?
Note: this website uses HTTPS (Hypertext Transfer Protocol Secure).
EDIT: DocumentCompleted fires twice when navigating to www.example.com/id
[STAThread]
static void Main(string[] args)
{
    WB = new WebBrowser();
    WB.AllowNavigation = true;
    WB.ScriptErrorsSuppressed = true;
    WB.Navigate("https://www.example.com/page");
    WB.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(WB_DocumentCompleted);
    while (completed)
    {
        Application.DoEvents();
    }
}
static async void WB_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    System.Windows.Forms.HtmlDocument doc = WB.Document;
    var docy = new HtmlAgilityPack.HtmlDocument();
    docy.Load(new StringReader(WB.Document.Body.InnerHtml));
    HtmlElementCollection divs = doc.GetElementsByTagName("td");
    if (WB.Url.ToString().IndexOf("page") > -1)
    {
        HtmlNodeCollection prens = docy.DocumentNode.SelectNodes(".//*[@id='mid']//div//div//table//tbody//tr//td//a");
        for (int i = 0; i < prens.Count; i++)
        {
            HtmlNode nodes = docy.DocumentNode.SelectSingleNode(".//*[@id='mid']//div[6]//div//table[2]//tbody//tr[" + satir + "]//td[2]//a");
            HtmlNode links = nodes;
            if (links != null)
            {
                hrefValue = links.GetAttributeValue("href", string.Empty);
                string[] gelenevent = hrefValue.Split('.');
                eventid = gelenevent[0].Remove(0, 1);
                satir++;
                while (WB.IsBusy)
                    Application.DoEvents();
                for (int y = 0; y < 500; y++)
                    if (WB.ReadyState != WebBrowserReadyState.Complete)
                    {
                        Application.DoEvents();
                        await Task.Delay(5000);
                        Thread.Sleep(10);
                    }
                    else
                        break;
                Application.DoEvents();
                WB.Navigate(new Uri("https://www.example.com/" + eventid + ".html"));
                break;
            }
        }
    }
    if (WB.Url.ToString().IndexOf(eventid) > -1)
    {
        var node = docy.DocumentNode.SelectSingleNode(".//*[@id='mid']//div[6]//div//table[4]//tbody//tr[2]//td[1]");
        while (WB.IsBusy)
            Application.DoEvents();
        for (int i = 0; i < 500; i++)
            if (WB.ReadyState != WebBrowserReadyState.Complete)
            {
                Application.DoEvents();
                await Task.Delay(5000);
                Thread.Sleep(10);
            }
            else
                break;
        Application.DoEvents();
        WB.Navigate(new Uri("https://www.example.com/page"));
    }
}

Navigate URLs using WebBrowser DocumentCompleted

This is the scenario:
1-Navigate to the admin page.
2-Enter username and password.
3-Navigate to a new page.
4-Fill in some text in textareas etc. and post.
5-Repeat steps 3 and 4 until the loop ends.
The code below successfully does steps 1 and 2. But it reaches step 3 before the new page has loaded and generates the error "Object reference not set to an instance of an object" on this line: doc.GetElementById("title").SetAttribute("value", "check1");
I have been trying to get this working for the last 3 days but still cannot get past step 3. Any help will be appreciated.
bool AdminPagework = false;
bool postnavigationdone = false;
public Form1()
{
    InitializeComponent();
    webBrowser1.DocumentCompleted +=
        new WebBrowserDocumentCompletedEventHandler(AdminPageCredentials);
    webBrowser1.Navigate("www.website.com/admin");
}
private void AdminPageCredentials(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    if (AdminPagework == false && (webBrowser1.ReadyState == WebBrowserReadyState.Complete))
    {
        HtmlDocument doc = webBrowser1.Document;
        doc.GetElementById("login").SetAttribute("value", "ADMIN");
        doc.GetElementById("pass").SetAttribute("value", "123");
        doc.GetElementById("submit").InvokeMember("click");
        AdminPagework = true;
        webBrowser1.DocumentCompleted +=
            new WebBrowserDocumentCompletedEventHandler(RedirectToPostPage);
        webBrowser1.Navigate("http://www.website.com/admin/post.php");
    }
}
public void RedirectToPostPage(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    if ((postnavigationdone == false) && (webBrowser1.ReadyState == WebBrowserReadyState.Complete))
    {
        HtmlDocument doc = webBrowser1.Document;
        doc.GetElementById("title").SetAttribute("value", "check1");
        doc.GetElementById("content").SetAttribute("value", textBox2.Text);
        doc.GetElementById("post-format-video").InvokeMember("click");
        doc.GetElementById("in-category-64").InvokeMember("click");
        webBrowser1.Document.GetElementById("mm").SetAttribute("value", "01");
        webBrowser1.Document.GetElementById("jj").SetAttribute("value", "01");
        webBrowser1.Document.GetElementById("aa").SetAttribute("value", "2013");
        webBrowser1.Document.GetElementById("hh").SetAttribute("value", "01");
        webBrowser1.Document.GetElementById("mm").SetAttribute("value", "01");
        doc.GetElementById("publish").InvokeMember("click");
        postnavigationdone = true;
    }
}
var titleElement = doc.GetElementById("title");
if (titleElement != null)
    titleElement.SetAttribute("value", "check1");
Try that and see whether the title element is found at all, since the most likely reason it fails is that there is no element with the id "title".
I like using the ScrapySharp framework (you'll find it on NuGet) for web automation.
// doc is an HtmlAgilityPack.HtmlDocument here; CssSelect comes from ScrapySharp.Extensions.
var titleNodes = doc.DocumentNode.CssSelect("div#title").ToList();
foreach (var titleNode in titleNodes)
{
    titleNode.SetAttributeValue("value", "check1");
}
By the way, why would you change this attribute anyway? Just curious...

C# stopping an infinite foreach loop

This foreach loop checks a web page for images and downloads them. How do I stop it? When I press the button, the loop continues forever.
private void button1_Click(object sender, EventArgs e)
{
    WebBrowser browser = new WebBrowser();
    browser.DocumentCompleted += browser_DocumentCompleted;
    browser.Navigate(textBox1.Text);
}
void browser_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    WebBrowser browser = sender as WebBrowser;
    HtmlElementCollection imgCollection = browser.Document.GetElementsByTagName("img");
    WebClient webClient = new WebClient();
    int count = 0; //if available
    int maximumCount = imgCollection.Count;
    try
    {
        foreach (HtmlElement img in imgCollection)
        {
            string url = img.GetAttribute("src");
            webClient.DownloadFile(url, url.Substring(url.LastIndexOf('/')));
            count++;
            if (count >= maximumCount)
                break;
        }
    }
    catch { MessageBox.Show("errr"); }
}
Use the break keyword to break out of a loop.
You do not have an infinite loop; you have an exception that is being thrown because of how you are writing the file to disk.
private void button1_Click(object sender, EventArgs e)
{
    WebBrowser browser = new WebBrowser();
    browser.DocumentCompleted += browser_DocumentCompleted;
    browser.Navigate("www.google.ca");
}
void browser_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    WebBrowser browser = sender as WebBrowser;
    HtmlElementCollection imgCollection = browser.Document.GetElementsByTagName("img");
    WebClient webClient = new WebClient();
    foreach (HtmlElement img in imgCollection)
    {
        string url = img.GetAttribute("src");
        string name = System.IO.Path.GetFileName(url);
        string path = System.IO.Path.Combine(Environment.CurrentDirectory, name);
        webClient.DownloadFile(url, path);
    }
}
That code works fine in my environment. The issue you seemed to be having was with the DownloadFile file path: you were setting it to a value like '\myimage.png', and the WebClient could not find that path, so it threw an exception.
The code above saves each image into the current directory under its original file name and extension.
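One case the file-name step above may not cover is an image URL that carries a query string (for example logo.png?v=2), because the query would end up in the local file name. A minimal sketch of one way to handle that is shown here; ImagePathBuilder, BuildLocalPath, and the "image" fallback name are hypothetical, and the URL is assumed to be absolute.

```csharp
using System;
using System.IO;

public static class ImagePathBuilder
{
    // Builds a local file path for a remote image URL. Going through
    // Uri.LocalPath drops the query string before the file name is taken.
    public static string BuildLocalPath(string url, string directory)
    {
        Uri uri = new Uri(url); // throws UriFormatException for relative or empty URLs
        string name = Path.GetFileName(uri.LocalPath);
        if (string.IsNullOrEmpty(name))
        {
            name = "image"; // hypothetical fallback when the URL ends with '/'
        }
        return Path.Combine(directory, name);
    }
}
```

In the loop above, the result of ImagePathBuilder.BuildLocalPath(url, Environment.CurrentDirectory) could be passed to webClient.DownloadFile as the target path.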
Maybe the browser.DocumentCompleted event causes the problem: if the page refreshes, the event is fired again. You could try to unregister the handler.
void browser_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    WebBrowser browser = sender as WebBrowser;
    browser.DocumentCompleted -= browser_DocumentCompleted;
    HtmlElementCollection imgCollection = browser.Document.GetElementsByTagName("img");
    WebClient webClient = new WebClient();
    foreach (HtmlElement img in imgCollection)
    {
        string url = img.GetAttribute("src");
        string name = System.IO.Path.GetFileName(url);
        string path = System.IO.Path.Combine(Environment.CurrentDirectory, name);
        webClient.DownloadFile(url, path);
    }
}
