I'm trying to develop a desktop app to be used as a website scraping tool. My requirement is that the user should be able to specify a URL in the desktop app. The desktop app should then invoke the ASP.NET script to scrape data from the website and return the records to the desktop app.
Should I use a web service or the ASP.NET runtime for this?
Any help is appreciated :)
Additional details
The scraping part is already done. I used the HtmlAgilityPack package. This is my scraping code to extract a list of company names from a web page:
public static String getPageHTML(String URL)
{
    String totalCompanies = null;
    HttpWebRequest httpWebRequest = (HttpWebRequest)WebRequest.Create(URL);

    // Use the default proxy (if any) with the current user's credentials.
    IWebProxy myProxy = httpWebRequest.Proxy;
    if (myProxy != null)
    {
        myProxy.Credentials = CredentialCache.DefaultCredentials;
    }

    httpWebRequest.Method = "GET";
    HttpWebResponse res = (HttpWebResponse)httpWebRequest.GetResponse();

    // Load the response into HtmlAgilityPack and select the cell holding the company list.
    HtmlDocument doc1 = new HtmlDocument();
    doc1.Load(res.GetResponseStream());
    HtmlNode node = doc1.DocumentNode.SelectSingleNode("//td[@class='mainbody']/table/tr[last()]/td");

    try
    {
        totalCompanies = node.InnerText;
        return totalCompanies;
    }
    catch (NullReferenceException)
    {
        // SelectSingleNode returns null when the node is not found.
        totalCompanies = "No records found";
        return totalCompanies;
    }
}
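For context, here is a minimal sketch of calling this helper from the desktop app. The UI names are hypothetical (a text box txtUrl and a button btnScrape on a WinForms form), and the helper is assumed to be accessible from the form class:
// Hypothetical WinForms handler: the user types a URL into txtUrl and clicks btnScrape.
private void btnScrape_Click(object sender, EventArgs e)
{
    string totalCompanies = getPageHTML(txtUrl.Text);
    MessageBox.Show(totalCompanies);
}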
You can use HttpWebRequest within your desktop app; I've done this before (WinForms). For example:
HttpWebRequest req = (HttpWebRequest)WebRequest.Create("url");
var response = new StreamReader(req.GetResponse().GetResponseStream()).ReadToEnd();
You can then use HtmlAgilityPack to parse the data from the response:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(response);
//Sample query
var node = doc.DocumentNode.Descendants("div")
.Where(d => d.Attributes.Contains("id")).ToList();
(It would be helpful to include more details / be more specific.)
If your ASP.NET page already does all the scraping, and all you need to do is access that ASP.NET page, you can simply use HttpWebRequest.
http://msdn.microsoft.com/en-us/library/456dfw4f.aspx - short description & tutorial
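For example, here's a rough sketch of calling such a page from the desktop app. The page name (Scrape.aspx), its query-string parameter, and the userSuppliedUrl variable are hypothetical, not part of the original question:
// Hypothetical ASP.NET page that does the scraping server-side and returns the records as plain text.
string target = "http://yourserver/Scrape.aspx?url=" + Uri.EscapeDataString(userSuppliedUrl);
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(target);
using (var response = (HttpWebResponse)request.GetResponse())
using (var reader = new StreamReader(response.GetResponseStream()))
{
    string records = reader.ReadToEnd();
    // Hand the records to the desktop UI here.
}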
If that URL is the website TO BE SCRAPED, and you need to include that ASP.NET script in your project, then you need to add it as a web service.
You can do it with either, but you can also do it by adding a WebBrowser control to your desktop application. I don't know why, but the result is much faster.
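A rough sketch of that approach, assuming a WinForms app (the control wiring and URL are illustrative only):
// Let the WebBrowser control render the page, then read its contents once loading completes.
var browser = new WebBrowser { ScriptErrorsSuppressed = true };
browser.DocumentCompleted += (s, e) =>
{
    string html = browser.DocumentText;   // page source as loaded by the control
    // ...parse html with HtmlAgilityPack as shown above...
};
browser.Navigate("http://example.com");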
Related
I want to get some data from Instagram users.
So I've used the Instagram Basic Display API, and the profile data I could receive was:
username
media count
account type
but I want these data:
username
name
media count
Profile Image
followers count
following count
I don't know how I can get this data without the Instagram Graph API (in any way) in C#.
Or is there any way to get this data with the WebClient class or anything like that?
Update for @Eehab's answer: I used RestClient and WebClient in this example, and both of them give the same result.
Now see WebClient example:
WebClient client = new WebClient();
string page = client.DownloadString("https://www.instagram.com/instagram/?__a=1");
Console.WriteLine(page);
Console.ReadKey();
I've also found that this link is only accessible to logged-in users. I'm already logged into my Instagram account in Chrome, but I think WebClient needs to log in too.
Edit (following @Eehab's answer):
In this case, to use this URL (https://www.instagram.com/{username}/?__a=1), we can't do it without a logged-in Instagram browser profile. So we should log in to Instagram with Selenium and reuse the logged-in session cookies for the URL requests. First install the Selenium WebDriver package, and then write the following code (untested):
// Log in to Instagram with Selenium, then reuse the session cookies for direct HTTP requests.
var driver = new ChromeDriver();

// Go to Instagram
driver.Url = "https://www.instagram.com/";

// Log in (the credentials below are placeholders)
var userNameElement = driver.FindElement(By.Name("username"));
userNameElement.SendKeys("Username");
var passwordElement = driver.FindElement(By.Name("password"));
passwordElement.SendKeys("Password");
var loginButton = driver.FindElement(By.Id("login"));
loginButton.Click();

// Get the cookies from the logged-in browser session
var cookies = driver.Manage().Cookies.AllCookies.ToList();

// Send a request with the collected cookies :)
var url = "https://www.instagram.com/{username}/?__a=1";
var httpRequest = (HttpWebRequest)WebRequest.Create(url);
foreach (var cookie in cookies)
{
    httpRequest.Headers["Cookie"] += $"{cookie.Name}={cookie.Value}; ";
}
var httpResponse = (HttpWebResponse)httpRequest.GetResponse();
using (var streamReader = new StreamReader(httpResponse.GetResponseStream()))
{
    var result = streamReader.ReadToEnd();
}
//...
If anyone can improve this question to make it more useful, please edit it; I'd really appreciate it :)
You could do that using the open API, for example:
https://www.instagram.com/instagram/?__a=1
Example code generated by Postman:
var client = new RestClient("https://www.instagram.com/instagram/?__a=1");
client.Timeout = -1;
var request = new RestRequest(Method.GET);
IRestResponse response = client.Execute(request);
Console.WriteLine(response.Content);
You could also use the HttpClient class. If you want to use WebClient, you could do it with the WebClient.DownloadString method, although I don't recommend WebClient for this kind of scraping. Keep in mind that Instagram may block you; if it does, you will need residential proxies to bypass the block.
The response will be JSON data; use Json.NET or a similar library to deserialize it.
Just replace instagram with any username you want in the given URL.
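For instance, here is a minimal sketch of pulling a few fields out of that JSON with Json.NET. The graphql.user.* property paths are an assumption about the undocumented __a=1 payload and may change at any time:
using Newtonsoft.Json.Linq;

// Parse the JSON returned by /?__a=1 and read a few profile fields.
// The "graphql.user.*" paths are an assumption about the payload shape.
JObject json = JObject.Parse(response.Content);
string username = (string)json.SelectToken("graphql.user.username");
string fullName = (string)json.SelectToken("graphql.user.full_name");
int? followers = (int?)json.SelectToken("graphql.user.edge_followed_by.count");
Console.WriteLine($"{username} ({fullName}) has {followers} followers");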
I am trying to get a table from the web page https://www.belastingdienst.nl/rekenhulpen/wisselkoersen/ using HtmlAgilityPack.
My code so far is
WebClient webClient = new WebClient();
string page = webClient.DownloadString("https://www.belastingdienst.nl/rekenhulpen/wisselkoersen/");
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(page);
List<List<string>> table = doc.DocumentNode.SelectSingleNode("//table[@class='list_result Result']")
.Descendants("tr")
.Skip(1)
.Where(tr => tr.Elements("td").Count() > 1)
.Select(tr => tr.Elements("td").Select(td => td.InnerText.Trim()).ToList())
.ToList();
My problem is that the web page builds the table with JavaScript, so when I try to read it I get a null reference exception, because the HTML I download only says that I must enable JavaScript.
I also tried the "GET" method:
string Url = "https://www.belastingdienst.nl/rekenhulpen/wisselkoersen/";
HttpWebRequest myRequest = (HttpWebRequest)WebRequest.Create(Url);
myRequest.Method = "GET";
WebResponse myResponse = myRequest.GetResponse();
StreamReader sr = new StreamReader(myResponse.GetResponseStream(), System.Text.Encoding.UTF8);
string result = sr.ReadToEnd();
sr.Close();
myResponse.Close();
with the same results.
I already enabled JavaScript in Internet Explorer and changed the registry as well:
if (Environment.Is64BitOperatingSystem)
    Regkey = Microsoft.Win32.Registry.LocalMachine.OpenSubKey(@"SOFTWARE\Wow6432Node\Microsoft\Internet Explorer\MAIN\FeatureControl\FEATURE_BROWSER_EMULATION", true);
else // For 32 bit machine
    Regkey = Microsoft.Win32.Registry.LocalMachine.OpenSubKey(@"SOFTWARE\Microsoft\Internet Explorer\Main\FeatureControl\FEATURE_BROWSER_EMULATION", true);
If I use a WebBrowser component I can see the web page without problems, but I still can't get the table into a list.
F12 is your friend in any browser.
Select the Network tab and you'll notice that all of the info is in this file:
https://www.belastingdienst.nl/data/douane_wisselkoersen/wks.douane.wisselkoersen.dd201806.xml
(I suppose that the data for July 2018 will be held in a URL named *.dd201807.xml.)
Using C#, you will need to do a GET for that URL and parse it as XML; there's no need for HtmlAgilityPack. You will need to concatenate the current year and month to build the right URL, as sketched below.
Leuker kan ik het niet maken! ("I can't make it any more fun!")
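A minimal sketch of that approach, assuming the monthly file keeps the wks.douane.wisselkoersen.ddYYYYMM.xml naming pattern (the element handling is illustrative; inspect the XML for the real element names):
using System;
using System.Net;
using System.Xml.Linq;

// Build the URL for the current year/month, e.g. ...dd201806.xml for June 2018.
string url = string.Format(
    "https://www.belastingdienst.nl/data/douane_wisselkoersen/wks.douane.wisselkoersen.dd{0:yyyyMM}.xml",
    DateTime.Now);

// Download and parse the XML; dump the entries to see which elements hold the rates.
using (var client = new WebClient())
{
    string xml = client.DownloadString(url);
    XDocument doc = XDocument.Parse(xml);
    foreach (XElement element in doc.Root.Elements())
    {
        Console.WriteLine(element);
    }
}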
WebClient is an HTTP client, not a web browser, so it won't execute JavaScript. What is needed is a headless web browser. See this page for a list of headless web browsers; I have not tried any of them, though, so I cannot give you a recommendation here:
Headless browser for C# (.NET)?
I'm trying to complete a PUT request to the IIS Media Services API, to try to set a publishing point to the "stopped" state.
I've read the following link, which hasn't helped me very much!
https://msdn.microsoft.com/en-us/library/hh206014%28VS.90%29.aspx
My current code throws an exception on the httpWebRequest1.GetResponse() call; it indicates the web server is returning a 401 Unauthorized error code:
string url = "http://localhost/LiveStream.isml/State";
var httpWebRequest1 = (HttpWebRequest)WebRequest.Create(url);
httpWebRequest1.ContentType = "application/atom+xml";
httpWebRequest1.Method = "PUT";
httpWebRequest1.Headers.Add("Authorization", "USERNAME:PASSWORD");
using (var streamWriter = new StreamWriter(httpWebRequest1.GetRequestStream()))
{
XmlDocument document = new XmlDocument();
document.Load("Resources/XMLFile1.xml");
string test = GetXMLAsString(document);
streamWriter.Write(test);
}
var httpResponse = (HttpWebResponse)httpWebRequest1.GetResponse();
using (var streamReader = new StreamReader(httpResponse.GetResponseStream()))
{
var responseText = streamReader.ReadToEnd();
}
My username/password have been removed from the snippet, but they work fine when visiting the page in a browser and entering them in the username/password prompt that appears.
My script essentially PUTs an XML document that is a copy of the XML document returned when visiting the State page in a browser.
Any help would be appreciated.
I need your help!
I'm trying to insert a new photo into a Picasa album using OAuth 2.0 and a plain HTTP request. The result is that I can't insert a new photo into my Picasa web album after following the instructions listed at https://developers.google.com/picasa-web/docs/2.0/developers_guide_protocol#Auth
I should also say that I tried using the .NET library that they provide, with the same results.
The implementation that I'm using now is the following:
public static string PostImage(string data)
{
    // data is the image content converted to a string.
    string url = string.Format(
        "https://picasaweb.google.com/data/feed/api/user/{0}/albumid/{1}",
        "username@gmail.com", "idAlbum");

    HttpWebRequest request = WebRequest.Create(url) as HttpWebRequest;
    request.ContentType = "image/jpeg";
    request.ContentLength = Encoding.UTF8.GetByteCount(data);
    request.Method = "POST";
    request.Headers.Add("GData-Version", "2");
    request.Headers.Add("Slug", "cute_baby_kitten.jpg");
    request.Headers.Add("Authorization", "Bearer " + GetToken());

    if (data != null)
    {
        using (StreamWriter writer = new StreamWriter(request.GetRequestStream()))
        {
            writer.Write(data);
        }
    }

    HttpWebResponse response = request.GetResponse() as HttpWebResponse;
    string result = string.Empty;
    using (StreamReader reader = new StreamReader(response.GetResponseStream()))
    {
        result = reader.ReadToEnd();
    }
    return result;
}
private static string GetToken() {
const string ServiceAccountEmail = "someid@developer.gserviceaccount.com";
var servicio = new PicasaService(null);
var certificate = new X509Certificate2(HttpContext.Current.Server.MapPath("/key2.p12"), "notasecret", X509KeyStorageFlags.Exportable);
var serviceAccountCredentialInitializer =
new ServiceAccountCredential.Initializer(ServiceAccountEmail)
{
Scopes = new[] { "https://picasaweb.google.com/data/" }
}.FromCertificate(certificate);
var credential = new ServiceAccountCredential(serviceAccountCredentialInitializer);
if (!credential.RequestAccessTokenAsync(System.Threading.CancellationToken.None).Result)
throw new InvalidOperationException("Access token request failed.");
return credential.Token.AccessToken;
}
Any help is welcome!!
(403) Forbidden
means that you are trying to use a method (insert) which requires authorization.
You are connecting as the service account someid@developer.gserviceaccount.com, which should then give you access to someid@developer.gserviceaccount.com's pictures.
You appear to be trying to access username@gmail.com. Unless you have given someid@developer.gserviceaccount.com access to insert pictures on behalf of username@gmail.com (which I am not even sure is possible), you are not going to have permission to do this.
Remember, a service account is a pseudo user: it has its own Drive account, Calendar account, and so on. It does not have access to a random user's data unless that user has granted it access, as they would for any other user.
Note: the Google .NET client library does not support the GData APIs, and Picasa is a GData API. I like how you are trying to merge the two; I will have to test this.
Your best (IMHO) approach would be to forget libraries and forget service accounts. Get a refresh token for the Google user account you're trying to insert into, and use the raw HTTP REST API to invoke Picasa.
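A rough sketch of that flow, assuming you have already obtained a refresh token for the target Google account through the normal OAuth 2.0 consent flow. The client ID, client secret, and refresh token values below are placeholders; the token endpoint is the standard Google OAuth 2.0 one:
// Exchange a stored refresh token for a fresh access token, then call Picasa with it.
using (var client = new WebClient())
{
    var form = new System.Collections.Specialized.NameValueCollection
    {
        { "client_id", "YOUR_CLIENT_ID" },
        { "client_secret", "YOUR_CLIENT_SECRET" },
        { "refresh_token", "STORED_REFRESH_TOKEN" },
        { "grant_type", "refresh_token" }
    };
    byte[] tokenResponse = client.UploadValues("https://accounts.google.com/o/oauth2/token", form);
    string json = Encoding.UTF8.GetString(tokenResponse);
    // Parse "access_token" out of the JSON (e.g. with Json.NET) and use it on the request:
    // request.Headers.Add("Authorization", "Bearer " + accessToken);
}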
I need to write a simple C# app that should receive entire contents of a web page currently opened in Firefox. Is there any way to do it directly from C#? If not, is it possible to develop some kind of plug-in that would transfer page contents? As I am a total newbie in Firefox plug-ins programming, I'd really appreciate any info on getting me started quickly. Maybe there are some sources I can use as a reference? Doc links? Recommendations?
Update: I actually need to communicate with a running Firefox instance, not get the contents of a web page from a given URL.
It would help if you elaborated on what you are trying to achieve. Maybe plugins already out there, such as Firebug, can help.
Anyway, if you really want to develop both the plugin and the C# application:
Check out this tutorial on firefox extension:
http://robertnyman.com/2009/01/24/how-to-develop-a-firefox-extension/
Otherwise, you can use the WebRequest or HttpWebRequest class in .NET to get the HTML source of any URL.
I think you'd almost certainly need to write a Firefox plugin for that. However, there are certainly ways to request a web page and receive its HTML response within C#; it depends on what your requirements are.
If your requirement is simply to receive the source from any website, leave a comment and I'll point you towards the code.
// _UserAgent, _RequestTimeout, _CookieContainer and doc are presumably fields of the surrounding crawler class.
Uri uri = new Uri(url);
System.Net.HttpWebRequest req = (System.Net.HttpWebRequest)System.Net.WebRequest.Create(uri.AbsoluteUri);
req.AllowAutoRedirect = true;
req.MaximumAutomaticRedirections = 3;
//req.UserAgent = _UserAgent; //"Mozilla/6.0 (MSIE 6.0; Windows NT 5.1; Searcharoo.NET)";
req.KeepAlive = true;
req.Timeout = _RequestTimeout * 1000; //prefRequestTimeout

// SIMONJONES http://codeproject.com/aspnet/spideroo.asp?msg=1421158#xx1421158xx
req.CookieContainer = new System.Net.CookieContainer();
req.CookieContainer.Add(_CookieContainer.GetCookies(uri));

System.Net.HttpWebResponse webresponse = null;
try
{
    webresponse = (System.Net.HttpWebResponse)req.GetResponse();
}
catch (Exception ex)
{
    webresponse = null;
    Console.Write("request for url failed: {0} {1}", url, ex.Message);
}

if (webresponse != null)
{
    webresponse.Cookies = req.CookieContainer.GetCookies(req.RequestUri);
    // handle cookies (need to do this in case we have any session cookies)
    foreach (System.Net.Cookie retCookie in webresponse.Cookies)
    {
        bool cookieFound = false;
        foreach (System.Net.Cookie oldCookie in _CookieContainer.GetCookies(uri))
        {
            if (retCookie.Name.Equals(oldCookie.Name))
            {
                oldCookie.Value = retCookie.Value;
                cookieFound = true;
            }
        }
        if (!cookieFound)
        {
            _CookieContainer.Add(retCookie);
        }
    }

    string enc = "utf-8"; // default
    if (webresponse.ContentEncoding != String.Empty)
    {
        // Use the HttpHeader Content-Type in preference to the one set in META
        doc.Encoding = webresponse.ContentEncoding;
    }
    else if (doc.Encoding == String.Empty)
    {
        doc.Encoding = enc; // default
    }

    //http://www.c-sharpcorner.com/Code/2003/Dec/ReadingWebPageSources.asp
    System.IO.StreamReader stream = new System.IO.StreamReader(
        webresponse.GetResponseStream(), System.Text.Encoding.GetEncoding(doc.Encoding));
    string html = stream.ReadToEnd(); // read the body before closing the response
    webresponse.Close();
}
This does what you want.
using System.Net;
var cli = new WebClient();
string data = cli.DownloadString("http://www.heise.de");
Console.WriteLine(data);
Native messaging enables an extension to exchange messages with a native application installed on the user's computer.
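As a rough sketch, the C# side of a native messaging host reads each message from stdin as a 4-byte length (in native byte order) followed by that many bytes of UTF-8 JSON. The class name and logging below are illustrative; the host manifest and the extension side are not shown:
using System;
using System.IO;
using System.Text;

class NativeMessagingHost
{
    static void Main()
    {
        using (Stream stdin = Console.OpenStandardInput())
        {
            // Each message is prefixed with its length as a 32-bit integer in native byte order.
            byte[] lengthBytes = new byte[4];
            if (stdin.Read(lengthBytes, 0, 4) != 4) return;
            int length = BitConverter.ToInt32(lengthBytes, 0);

            // The body is UTF-8 encoded JSON sent by the extension (e.g. the page contents).
            byte[] body = new byte[length];
            int read = 0;
            while (read < length)
                read += stdin.Read(body, read, length - read);

            string json = Encoding.UTF8.GetString(body);
            Console.Error.WriteLine("Received: " + json);
        }
    }
}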