Cell Number Scraping from web - c#

I'm pretty new to C#/.NET and I want to build a mobile number extractor for websites.
Say I have a site like olx.com.pk and have already extracted all the links from it; now I want to extract the phone numbers from the pages behind those links.
How can I do this? The link extraction is working; what I need help with is the mobile numbers.
Here is my link extractor code:
private void button1_Click(object sender, EventArgs e)
{
WebBrowser wb = new WebBrowser();
wb.ScriptErrorsSuppressed = true;
wb.Url = new Uri(textBox1.Text);
wb.DocumentCompleted += wb_DocumentCompleted;
}
void wb_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
HtmlDocument code = ((WebBrowser)sender).Document;
extract(code);
}
private void extract(HtmlDocument code)
{
HtmlElementCollection anchorList = code.GetElementsByTagName("a");
foreach (var item in anchorList)
{
listBox1.Items.Add(((HtmlElement)item).GetAttribute("href"));
}
}
Here is a regular expression for Pakistani mobile numbers:
^((\+92)|(0092))-{0,1}\d{3}-{0,1}\d{7}$|^\d{11}$|^\d{4}-\d{7}$

You've got the link-collection part right. All you have to do now is open every link and match the page text using Matches():
Regex re = new Regex(@"(\+92|0092)-?\d{3}-?\d{7}|\d{11}|\d{4}-\d{7}");
foreach(string link in listBox1.Items){
// Load data to `HtmlDocument code`
string text = ((mshtml.IHTMLDocument3)code.DomDocument).documentElement.innerHTML;
foreach( Match match in re.Matches(text)){
// do what you need
}
}
To keep the pattern from matching part of a longer run of digits, wrap it in negative lookbehind/lookahead assertions:
Regex re = new Regex(@"(?<!\d)((\+92|0092)-?\d{3}-?\d{7}|\d{11}|\d{4}-\d{7})(?!\d)");
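As a rough illustration of the lookaround version, the sketch below runs the pattern over a hard-coded sample string instead of a downloaded page. The class name PhoneExtractor, the method name Extract and the sample text are made up for this example; only the pattern comes from the answer.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PhoneExtractor
{
    // Pattern from the answer: the lookbehind/lookahead reject candidates
    // that are immediately preceded or followed by another digit.
    private static readonly Regex MobilePattern = new Regex(
        @"(?<!\d)((\+92|0092)-?\d{3}-?\d{7}|\d{11}|\d{4}-\d{7})(?!\d)");

    // Returns every substring of text that matches the pattern, in order.
    public static List<string> Extract(string text)
    {
        var numbers = new List<string>();
        foreach (Match match in MobilePattern.Matches(text))
        {
            numbers.Add(match.Value);
        }
        return numbers;
    }
}
```

For the input "Call 0300-1234567 or +92-300-1234567, ref 123456789012345" this returns the two phone numbers and skips the 15-digit reference, because every 11-digit window inside it touches another digit.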

Related

Open multiple pages in WebBrowser and send a command to all of them

I have a WinForms app with the following functionality:
It has a multiline textbox that contains one URL per line, about 30 URLs in total (each URL is different, but the webpage is the same; only the domain differs).
It has another textbox in which I can write a command, and a button that sends that command to an input field on the webpage.
It has a WebBrowser control (I would like to do everything in one control).
The webpage consists of a textbox and a button, which I want to click after I insert a command into that textbox.
My code so far:
//get path for the text file to import the URLs to my textbox to see them
private void button1_Click(object sender, EventArgs e)
{
OpenFileDialog fbd1 = new OpenFileDialog();
fbd1.Title = "Open Dictionary(only .txt)";
fbd1.Filter = "TXT files|*.txt";
fbd1.InitialDirectory = @"M:\";
if (fbd1.ShowDialog(this) == DialogResult.OK)
path = fbd1.FileName;
}
//import the content of .txt to my textbox
private void button2_Click(object sender, EventArgs e)
{
textBox1.Lines = File.ReadAllLines(path);
}
//click the button from webpage
private void button3_Click(object sender, EventArgs e)
{
this.webBrowser1.Document.GetElementById("_act").InvokeMember("click");
}
//parse the value of the textbox and press the button from the webpage
private void webBrowser1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
newValue = textBox2.Text;
HtmlDocument doc = this.webBrowser1.Document;
doc.GetElementById("_cmd").SetAttribute("Value", newValue);
}
Now, how can I load all 30 URLs from my textbox into the same WebBrowser control, so that I can send the same command to the textbox on each page and then press the button on each of them?
EDIT 1
So, I have adapted @Setsu's method and created the following:
public IEnumerable<string> GetUrlList()
{
string f = File.ReadAllText(path); ;
List<string> lines = new List<string>();
using (StreamReader r = new StreamReader(f))
{
string line;
while ((line = r.ReadLine()) != null)
lines.Add(line);
}
return lines;
}
Now, is this returning what it should return, in order to parse each URL ?
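One thing stands out in the method above: File.ReadAllText returns the file's contents, while the StreamReader(string) constructor expects a file path, so the reader is being handed the text of the file rather than its location. A minimal sketch of a version that reads one URL per line is shown below. The class name UrlListReader and the choice to trim whitespace and skip blank lines are assumptions made for this example.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class UrlListReader
{
    // Reads the file at path and returns one entry per non-empty line.
    public static List<string> GetUrlList(string path)
    {
        return File.ReadLines(path)
                   .Select(line => line.Trim())      // drop stray whitespace
                   .Where(line => line.Length > 0)   // skip blank lines
                   .ToList();
    }
}
```

The result can be passed straight to a foreach loop over the URLs. Validating that each entry is a well-formed URL is still left to the caller.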
If you want to keep using just 1 WebBrowser control, you'd have to sequentially navigate to each URL. Note, however, that the Navigate method of the WebBrowser class is asynchronous, so you can't just naively call it in a loop. Your best bet is to implement an async/await pattern detailed in this answer here.
Alternatively, you CAN have 30 WebBrowser controls and have each one navigate on its own; this is roughly equivalent to having 30 tabs open in modern browsers. Since each WebBrowser is doing identical work, you can just have 1 DocumentCompleted event written to handle a single WebBrowser, and then hook up the others to the same event. Do note that the WebBrowser control has a bug that will cause it to gradually leak memory, and the only way to solve this is to restart the application. Thus, I would recommend going with the async/await solution.
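For the single-control route, one way to sketch the async/await idea is to wrap DocumentCompleted in a TaskCompletionSource and await it once per URL. This is untested and only an outline: the names SequentialNavigator, NavigateAsync and SendCommandToAllAsync are invented here, the element ids _cmd and _act are taken from the question, and using ReadyState to skip frame events is an assumption about how the target pages load.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Windows.Forms;

public static class SequentialNavigator
{
    // Starts a navigation and returns a task that completes when the
    // control reports the whole document as loaded.
    public static Task NavigateAsync(WebBrowser browser, string url)
    {
        var tcs = new TaskCompletionSource<bool>();
        WebBrowserDocumentCompletedEventHandler handler = null;
        handler = (sender, e) =>
        {
            // DocumentCompleted also fires for frames; wait for the full page.
            if (browser.ReadyState != WebBrowserReadyState.Complete) return;
            browser.DocumentCompleted -= handler; // one-shot subscription
            tcs.TrySetResult(true);
        };
        browser.DocumentCompleted += handler;
        browser.Navigate(url);
        return tcs.Task;
    }

    // Visits each URL in turn, fills in the command and clicks the button.
    // Call this from the UI thread, e.g. from an async button click handler.
    public static async Task SendCommandToAllAsync(
        WebBrowser browser, IEnumerable<string> urls, string command)
    {
        foreach (string url in urls)
        {
            await NavigateAsync(browser, url);
            HtmlElement input = browser.Document.GetElementById("_cmd");
            HtmlElement button = browser.Document.GetElementById("_act");
            if (input == null || button == null) continue; // page looks different
            input.SetAttribute("value", command);
            button.InvokeMember("click");
            // Note: if the click itself triggers a navigation, the next
            // Navigate call may cancel it; waiting for that page to load
            // as well would need a second awaited completion here.
        }
    }
}
```

Because the awaits resume on the WinForms UI thread, the loop reads like a plain sequential loop while the control still navigates asynchronously underneath.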
UPDATE:
Here's a brief code sample of how to do the 30 WebBrowsers way (untested as I don't have access to VS right now):
List<WebBrowser> myBrowsers = new List<WebBrowser>();
public void btnDoWork(object sender, EventArgs e)
{
//This method starts navigation.
//It will call a helper function that gives us a list
//of URLs to work with, and naively create as many
//WebBrowsers as necessary to navigate all of them
IEnumerable<string> urlList = GetUrlList();
//note: be sure to sanitize the URLs in this method call
foreach (string url in urlList)
{
WebBrowser browser = new WebBrowser();
browser.DocumentCompleted += webBrowserDocumentCompleted;
browser.Navigate(url);
myBrowsers.Add(browser);
}
}
private void webBrowserDocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
//check that the full document is finished
if (e.Url.AbsolutePath != (sender as WebBrowser).Url.AbsolutePath)
return;
//get our browser reference
WebBrowser browser = sender as WebBrowser;
//get the string command from form TextBox
string command = textBox2.Text;
//enter the command string
browser.Document.GetElementById("_cmd").SetAttribute("Value", command);
//invoke click
browser.Document.GetElementById("_act").InvokeMember("click");
//detach the event handler from the browser
//note: necessary to stop endlessly setting strings and clicking buttons
browser.DocumentCompleted -= webBrowserDocumentCompleted;
//attach second DocumentCompleted event handler to destroy browser
browser.DocumentCompleted += webBrowserDestroyOnCompletion;
}
private void webBrowserDestroyOnCompletion(object sender, WebBrowserDocumentCompletedEventArgs e)
{
//check that the full document is finished
if (e.Url.AbsolutePath != (sender as WebBrowser).Url.AbsolutePath)
return;
//I just destroy the WebBrowser, but you might want to do something
//with the newly navigated page
WebBrowser browser = sender as WebBrowser;
browser.Dispose();
myBrowsers.Remove(browser);
}

C# winforms webbrowser not going to url's asked for

I was asked by a friend to develop a winform app to be able to extract data. I figured it would be easy enough - how wrong I was!
In my WinForms app, I have included a WebBrowser control and some buttons. The URL for the WebBrowser is http://www.racingpost.com/greyhounds/card.sd and, as you can imagine, it is the place to get data for greyhounds. On that page there are a number of links, each specific to a race time. If you click on any of these, it takes you to that race, and it's this data that I need to extract. So my initial thought was to get ALL the links from the page above, store them in a list, and have a button that takes the next link and sends the WebBrowser to that location. Once there, I can extract the data and store it as needed.
So, in the first instance, I use
//url = link above
wb1.Url = new Uri(url);
grab the data (which are links for each race on that day)
once I have this, use a further button to go to the specific race
wb1.Url = new Uri("http://www.racingpost.com/greyhounds/card.sd#resultday=2015-01-17&raceid=1344640");
then, once there, click another button to capture the data, after which, return to the original link above.
The problem is that it will not go to the location given in the link. BUT if I click the link manually within the WebBrowser, it goes there with no problem.
I have looked at the properties of the WebBrowser and they all look fine, although I can't really vouch for that, to be honest.
I know that I can reach the links manually, but when I try to do it through code it just won't budge. I can only assume I have done something wrong in the code.
I hope some of that makes sense. This is my first post, so apologies if I have made a mess of it. I will happily provide all the code, but I can't figure out how to post it in 'code format'.
//here is the code
public partial class Form1 : Form
{
Uri _url;
public Form1()
{
InitializeComponent();
wb1.Url = new Uri("http://www.racingpost.com/greyhounds/card.sd");
wb1.Navigated +=new WebBrowserNavigatedEventHandler(wb1_Navigated);
}
classmodules.trackUrl tu;
private void btnGrabData_Click(object sender, EventArgs e)
{
classmodules.utility u = new classmodules.utility();
rtb1.Text = u.GetWebData("http://www.racingpost.com/greyhounds/card.sd");
HtmlDocument doc = wb1.Document;
string innerText = (((mshtml.HTMLDocument)(doc.DomDocument)).documentElement).outerHTML;
innerText = Regex.Replace(innerText, @"\r\n?|\n", "");
rtb1.Text = innerText;
tu = new classmodules.trackUrl();
u.splitOLs(ref tu, innerText);
classmodules.StaticUtils su = new classmodules.StaticUtils();
su.SerializeObject(tu, typeof(classmodules.trackUrl)).Save(@"d:\dogsUTL.xml");
classmodules.ExcelProcessor xl = new classmodules.ExcelProcessor();
xl.createExcel(tu);
}
private void wb1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
WebBrowser wb1 = sender as WebBrowser;
this.Text = wb1.Url.ToString();
}
void wb1_Navigated(object sender, WebBrowserNavigatedEventArgs e)
{
_url = e.Url;
}
private void btnGoBack_Click(object sender, EventArgs e)
{
goBack();
}
private void goBack()
{
wb1.Url = new Uri("http://www.racingpost.com/greyhounds/card.sd");
}
private void btnGetRaceData_Click(object sender, EventArgs e)
{
HtmlDocument doc = wb1.Document;
string innerText = (((mshtml.HTMLDocument)(doc.DomDocument)).documentElement).outerHTML;
rtb2.Text = innerText;
}
//###############################
// OK, here is the point where I want to take in the URL and click a button
// to instruct the webbrowser to go to that location. I initialise a counter
// to 0, get the first URL from the list, and increment the counter; when I
// click the button again, urlNo will be 1, so it tries the second URL.
int urlNo = 0;
private void btnUseData_Click(object sender, EventArgs e)
{
if (tu.race.Count > urlNo)
{
string url = tu.race[urlNo].url;
wb1.Url = new Uri(url);
lblUrl.Text = url;
urlNo++;
}
else
{
lblUrl.Text = "No More";
}
}
Did you try the Navigate(...) method? In theory, setting Url and calling Navigate should behave the same, but in practice they may differ slightly.
http://msdn.microsoft.com/en-us/library/system.windows.forms.webbrowser.navigate(v=vs.110).aspx
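Something else that may be worth checking (this is a guess at the cause, not a confirmed diagnosis): the race URL in the question differs from the card page only in the part after the # sign. Browsers generally treat a fragment-only change as a move within the current document rather than a new page load, so the control may stay on the same document. The small helper below, whose names FragmentCheck and DiffersOnlyByFragment are hypothetical, shows how to detect that case with System.Uri.

```csharp
using System;

public static class FragmentCheck
{
    // True when both URIs share scheme, host, path and query
    // and differ only in the fragment (the part after '#').
    public static bool DiffersOnlyByFragment(Uri a, Uri b)
    {
        return a.GetLeftPart(UriPartial.Query) == b.GetLeftPart(UriPartial.Query)
            && a.Fragment != b.Fragment;
    }
}
```

For the card page and the race URL quoted in the question this returns true, which suggests that the race data is loaded by script on the same page and that a plain navigation to the new URL might not trigger it.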

How to disable every navigation in WebBrowser?

I have a WebBrowser control whose URL I dynamically refresh or change based on user input. I don't want to let the user navigate, so I set AllowNavigation to false. This seems to work, but the link below is still "active":
Close Page (a link whose target is javascript:window.close())
The issue is this: if the user clicks it and confirms the closure in the pop-up window, I can't manage the WebBrowser anymore. It looks like it is closed, though the last page is still visible. I also can't remove this link, as the site is not managed by me.
Disable the control? Nope, I have to allow the user to highlight and copy text from the webpage.
Do I have any other option to disable literally ALL links?
@TaW: here is my code, based on yours. I have to set the URL from my code and call a custom function:
button_click()
{
webBrowser1_load_URL("http://website/somecheck.php?compname=" + textBoxHost.Text);
}
Here it is the function:
private void webBrowser1_load_URL(string url)
{
string s = GetDocumentText(url.ToString());
s = s.Replace(@"javascript:window.close()", "");
webBrowser1.AllowNavigation = true;
webBrowser1.DocumentText = s;
}
The rest is exactly what's in your answer:
private void webBrowser1_DocumentCompleted(object sender,
WebBrowserDocumentCompletedEventArgs e)
{
webBrowser1.AllowNavigation = false;
}
public string GetDocumentText(string s)
{
WebBrowser dummy = new WebBrowser(); //(*)
dummy.Url = new Uri(s);
return dummy.DocumentText;
}
Still it's not working. Please help me to spot the issue with my code.
If you have control over the loading of the pages, you could grab each page's text and change the code to disable rogue scripts. The one you showed can simply be deleted. Of course, you might have to foresee more than that one.
Obviously this would be easier if you could do without JavaScript altogether, but if that is not an option, target the scripts that do real or pseudo-navigation.
private void webBrowser1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
webBrowser1.AllowNavigation = false;
}
private void loadURL_Click(object sender, EventArgs e)
{
webBrowser1.AllowNavigation = true;
string s = File.ReadAllText(textBox_URL.Text);
s = s.Replace("javascript:window.close()", "");
webBrowser1.DocumentText = s;
}
If the pages are not in the file system, the same trick should work, for instance by loading the URL into a dummy WebBrowser like this:
private void cb_loadURL_Click(object sender, EventArgs e)
{
string s = GetDocumentText(tb_URL.Text);
s = s.Replace("javascript:window.close()", "");
webBrowser1.AllowNavigation = true;
webBrowser1.DocumentText = s;
}
public string GetDocumentText(string s)
{
WebBrowser dummy = new WebBrowser(); //(*)
dummy.Url = new Uri(s);
return dummy.DocumentText;
}
Note: According to this post, you can't set DocumentText quite as freely as one would think; this is probably a bug. Instead of creating the dummy each time, you can also move the (*) line to class level. Then, no matter how many changes you have to make, you always have an unchanged version that the user could, for example, save somewhere.
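A likely reason GetDocumentText comes back empty is that setting Url only starts the navigation, so DocumentText is read before the page has arrived. One alternative, sketched below, is to download the HTML directly and strip the script before handing it to the visible control. The names PageLoader, StripCloseScript and DownloadWithoutCloseScript are invented for this example, and swapping the dummy WebBrowser for WebClient is a substitution, not what the answer above does.

```csharp
using System.Net;

public static class PageLoader
{
    // Pure string step: remove the script that closes the window.
    public static string StripCloseScript(string html)
    {
        return html.Replace("javascript:window.close()", "");
    }

    // Downloads the raw HTML synchronously, then strips the close script.
    public static string DownloadWithoutCloseScript(string url)
    {
        using (var client = new WebClient())
        {
            return StripCloseScript(client.DownloadString(url));
        }
    }
}
```

The result can be assigned to webBrowser1.DocumentText as in the answer. Keep in mind that this fetches only the server's HTML, so anything the page builds with script, and any relative links or images, may not behave as they do when the control navigates to the URL itself.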

.NET C# - webBrowser.Navigate to next URL doesnt work always

I have a list of URLs in a text file which I want to visit using the C# WebBrowser class, saving the content of every website somewhere. The problem is that the program doesn't always visit the new URL.
Links 1 and 2 are visited correctly, then the browser window doesn't refresh on link 3. Link 4 works again, while 5, 6 and 7 fail. Link 8 works, 9 to 15 fail, 16 works, and so on.
Here is an example list of URLs:
http://www.example.com/somefile_7.html*SomeOtherText1*SomeAdditionalText1
http://www.example.com/somefile_12.html*SomeOtherText1*SomeAdditionalText2
static int counter_getURL = 0;
private void Form1_Load(object sender, EventArgs e)
{
nextTurn();
}
void startBrowser(string url)
{
webBrowser1.Navigate(new Uri(url), "_self");
webBrowser1.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(get_browser_string);
}
void get_browser_string(object sender, WebBrowserDocumentCompletedEventArgs e)
{
// Display the content of the website in textBox1
textBox1.Text = webBrowser1.Document.Body.InnerText;
MessageBox.Show("Next");
nextTurn();
}
public void nextTurn()
{
startBrowser(getURL());
}
public string getURL()
{
string url = "";
string[] input = System.IO.File.ReadAllLines(@"C:\Users\WORKSTATION01\Desktop\url_list.txt", Encoding.Default);
// Get the URL only
string[] splitted = input[counter_getURL].Split(new char[] { '*' });
url = splitted[0];
counter_getURL++;
return url;
}
DocumentCompleted also fires for FRAMEs inside a webpage. My guess is that some webpages of your URLs have FRAMEs and that interferes with your code.
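Another detail in the posted code would also explain the 1, 2, 4, 8, 16 pattern: startBrowser adds a new DocumentCompleted handler on every call, so each finished page runs get_browser_string once per previous subscription, and each run starts another navigation that cancels the one before it. The sketch below reproduces the effect with a stand-in class, so it runs without a form; FakeBrowser and SubscriptionDemo are hypothetical names, not part of WebBrowser.

```csharp
using System;

// Stand-in for the control: just an event that can be raised on demand.
public class FakeBrowser
{
    public event EventHandler DocumentCompleted;

    public void RaiseDocumentCompleted()
    {
        EventHandler handler = DocumentCompleted;
        if (handler != null) handler(this, EventArgs.Empty);
    }
}

public static class SubscriptionDemo
{
    // Returns how often the handler runs for one completed page after
    // the given number of "startBrowser" calls.
    public static int HandlerRunsAfter(int navigations, bool subscribeEveryTime)
    {
        var browser = new FakeBrowser();
        int runs = 0;
        EventHandler onCompleted = (sender, e) => runs++;

        // Subscribing once up front is the usual fix.
        if (!subscribeEveryTime) browser.DocumentCompleted += onCompleted;

        for (int i = 0; i < navigations; i++)
        {
            // Mirrors the question: one more subscription per navigation.
            if (subscribeEveryTime) browser.DocumentCompleted += onCompleted;
        }

        browser.RaiseDocumentCompleted();
        return runs;
    }
}
```

With three navigations, the subscribe-every-time variant runs the handler three times for a single page, while the subscribe-once variant runs it once. In the question's code, moving the += line from startBrowser to Form1_Load should give the second behaviour.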

C#: How do I get the document title from a WebBrowser element?

I'm having issues trying to get the document title from a WebBrowser in C#. It works fine in VB.NET, but it won't give me any properties in C#.
When I type in MyBrowser.Document., the only options I get are 4 methods: Equals, GetHashCode, GetType, and ToString - no properties.
I think it's because I have to assign the document to a new instance first, but I can't find the HTMLDocument class that exists in VB.NET.
Basically what I'm wanting to do is return the Document.Title each time the WebBrowser loads/reloads a page.
Can someone help please? It will be much appreciated!
Here is the code I have at the moment...
private void Link_Click(object sender, RoutedEventArgs e)
{
WebBrowser tempBrowser = new WebBrowser();
tempBrowser.HorizontalAlignment = HorizontalAlignment.Left;
tempBrowser.Margin = new Thickness(-4, -4, -4, -4);
tempBrowser.Name = "MyBrowser";
tempBrowser.VerticalAlignment = VerticalAlignment.Top;
tempBrowser.LoadCompleted += new System.Windows.Navigation.LoadCompletedEventHandler(tempBrowser_LoadCompleted);
tempTab.Content = tempBrowser; // this is just a TabControl that contains the WebBrowser
Uri tempURI = new Uri("http://www.google.com");
tempBrowser.Navigate(tempURI);
}
private void tempBrowser_LoadCompleted(object sender, EventArgs e)
{
if (sender is WebBrowser)
{
MessageBox.Show("Test");
currentBrowser = (WebBrowser)sender;
System.Windows.Forms.HtmlDocument tempDoc = (System.Windows.Forms.HtmlDocument)currentBrowser.Document;
MessageBox.Show(tempDoc.Title);
}
}
This code doesn't give me any errors, but I never see the second MessageBox. I do see the first one though (the "Test" message), so the program is getting to that code block.
Add reference to Microsoft.mshtml
Add event receiver for LoadCompleted
webbrowser.LoadCompleted += new LoadCompletedEventHandler(webbrowser_LoadCompleted);
That way the document is guaranteed to be loaded by the time you read values from it:
void webbrowser_LoadCompleted(object sender, NavigationEventArgs e)
{
// Get the document title and display it
if (webbrowser.Document != null)
{
mshtml.IHTMLDocument2 doc = webbrowser.Document as mshtml.IHTMLDocument2;
Informative.Text = doc.title;
}
}
You are not using the Windows Forms WebBrowser control. I think you got the COM wrapper for ieframe.dll, its name is AxWebBrowser. Verify that by opening the References node in the Solution Explorer window. If you see AxSHDocVw then you got the wrong control. It is pretty unfriendly, it just gives you an opaque interface pointer for the Document property. You'll indeed only get the default object class members.
Look in the toolbox. Pick WebBrowser instead of "Microsoft Web Browser".
string title = ((HTMLDocument)MyBrowser.Document).Title;
Or
HTMLDocument doc = (HTMLDocument)MyBrowser.Document;
string title = doc.Title;
LoadCompleted doesn't fire. You should handle the Navigated event instead.
webBrowser.Navigated += new NavigatedEventHandler(WebBrowser_Navigated);
(...)
private void WebBrowser_Navigated(object sender, NavigationEventArgs e)
{
HTMLDocument doc = ((WebBrowser)sender).Document as HTMLDocument;
foreach (IHTMLElement elem in doc.all)
{
(...)
}
// you may have to dispose WebBrowser object on exit
}
This finally worked for me:
using System.Windows.Forms;
...
WebBrowser CtrlWebBrowser = new WebBrowser();
...
CtrlWebBrowser.Document.Title = "Hello World";
MessageBox.Show( CtrlWebBrowser.Document.Title );
