The scenario: an email arrives in an inbox with an HTML file attached. The user opens the attachment in a browser, then clicks a link on the page which opens a PDF file online.
What I want to achieve programmatically with C# is to save the attached HTML file to disk, open it, find the link, click it, and save the file that opens to disk.
I have gotten as far as programmatically opening the email and saving the attached HTML file to disk, but now I'm stuck at opening the file programmatically.
I've created a FileWebRequest to open the file, but I don't know how to find the link (an "a" tag, the only one in the whole page) and programmatically click it in C# so the PDF opens and I can download it and save it to disk.
What needs to be done after the filewebrequest?
FileWebRequest req = (FileWebRequest)WebRequest.Create(pathToHtmlFile);
FileWebResponse res = (FileWebResponse)req.GetResponse();
// What now..?
First, extract the PDF URL from the HTML content using a regex, then download it with WebClient:
private static string FindPdfFileDownloadLink(string htmlContent)
{
    // Verbatim string (@"...") keeps the backslashes in the pattern intact.
    // No ^/$ anchors: the URL sits inside the HTML, not on its own line.
    return Regex.Match(htmlContent,
        @"(https?:\/\/)?www\.([\da-z\.-]+)\.([a-z\.]{2,6})\/[\w \.-]+?\.pdf").Value;
}
public static int Main(string[] args)
{
    string htmlContent = File.ReadAllText("1.html");
    string pdfUrl = FindPdfFileDownloadLink(htmlContent);
    using (WebClient wClient = new WebClient())
    {
        wClient.DownloadFile(pdfUrl, "1.pdf");
    }
    Console.Read();
    return 0;
}
If you really want to click the link for some reason, you can load the HTML into a hidden WebBrowser control, find the element you want, and click it.
To load the content into the WebBrowser control:
webBrowser1.Navigate(@"1.html");
and to find and click the element:
HtmlElement link = webBrowser1.Document.GetElementById("link_id_58547");
link.InvokeMember("Click");
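Navigate is asynchronous, so looking the element up immediately after calling it can fail because the document is not loaded yet. A minimal sketch (reusing the placeholder id from above) that defers the click until the page has finished loading:

```csharp
// Navigate() returns before the page has loaded, so do the lookup and
// click inside the DocumentCompleted handler, not right after Navigate().
webBrowser1.DocumentCompleted += (s, e) =>
{
    // "link_id_58547" is the placeholder id from the snippet above; if the
    // page has only one link, webBrowser1.Document.GetElementsByTagName("a")[0]
    // works without knowing the id.
    HtmlElement link = webBrowser1.Document.GetElementById("link_id_58547");
    if (link != null)
        link.InvokeMember("Click");
};
webBrowser1.Navigate(@"1.html");
```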
When I open a link that downloads a file in a WebBrowser, I'm asked to select a path to save the file. How can I save it automatically to a path I specify, without being prompted?
WebBrowser wb = new WebBrowser();
Uri uri = new Uri(url);
wb.Url = uri;
You can use the WebBrowser control's Navigating event together with WebClient's DownloadFileCompleted event. In the Navigating handler, check the file type with an if condition.
See Open/Save WebBrowser Control Dialog Box for an example of downloading a zip file.
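The Navigating part can be sketched as follows; this is a sketch under assumptions, not the linked article's code — the file extensions and the target folder are placeholders:

```csharp
// Cancel in-browser navigation for downloadable file types and fetch the
// file ourselves, so the user is never shown the Open/Save dialog.
private void webBrowser1_Navigating(object sender, WebBrowserNavigatingEventArgs e)
{
    string url = e.Url.ToString();
    if (url.EndsWith(".zip", StringComparison.OrdinalIgnoreCase) ||
        url.EndsWith(".pdf", StringComparison.OrdinalIgnoreCase))
    {
        e.Cancel = true; // suppress the WebBrowser's own download dialog
        using (var client = new System.Net.WebClient())
        {
            // @"C:\Downloads" is a placeholder; substitute your own folder.
            string target = System.IO.Path.Combine(@"C:\Downloads",
                System.IO.Path.GetFileName(e.Url.LocalPath));
            client.DownloadFile(e.Url, target);
        }
    }
}
```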
I'm looking for a method that replicates a web browser's Save Page As function (Save as type = Text Files) in C#.
Dilemma: I've attempted to use WebClient and HttpWebRequest to download all the text from a web page. Both methods only return the HTML of the page, which does not include dynamic content.
Sample code:
string url = @"https://www.canadapost.ca/cpotools/apps/track/personal/findByTrackNumber?trackingNumber=" + package.Item2 + "&LOCALE=en";
try
{
    System.Net.ServicePointManager.SecurityProtocol = System.Net.SecurityProtocolType.Tls11 | System.Net.SecurityProtocolType.Tls12;
    using (WebClient client = new WebClient())
    {
        string content = client.DownloadString(url);
    }
}
catch (WebException ex)
{
    // A try block needs a catch (or finally); handle or log the failure here.
    Console.WriteLine(ex.Message);
}
The above example returns the HTML without the tracking events from the page.
When I display the page in Firefox, right-click the page, select Save Page As, and save as a text file, all of the raw text is saved in the file. I would like to mimic this feature.
If you are scraping a web page that shows dynamic content, you basically have two options:
Use something to render the page first. The simplest in C# would be to use a WebBrowser control and listen for the DocumentCompleted event. Note that there is some nuance to when this event fires if the page contains multiple documents (frames).
Figure out what service the page is calling to get the extra data, and see if you can access it directly. It may well be the case that the Canada Post website is accessing an API that you can also call directly.
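For the second option, the usual approach is to watch the browser's dev tools (Network tab) while the page loads and find the request the page makes; the endpoint below is hypothetical, just to show the shape of a direct call:

```csharp
using System;
using System.Net;

class DirectApiCall
{
    static void Main()
    {
        // Hypothetical endpoint -- find the real one in the Network tab of
        // the browser's dev tools while the tracking page loads.
        ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls12;
        using (var client = new WebClient())
        {
            // Such endpoints typically return JSON, which already contains
            // the "dynamic" data missing from the static HTML.
            string json = client.DownloadString(
                "https://example.com/api/track?trackingNumber=123456");
            Console.WriteLine(json);
        }
    }
}
```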
On my PC this works fine, but on some PCs the file would not open: the WebBrowser displays an error, and the file opens in the default PDF program instead of the WebBrowser.
My code:
Uri GuideURI = new Uri(String.Format("file:///{0}/../PDFs/{1}.pdf", Directory.GetCurrentDirectory(), link));
PDF_Web_Browser.Navigate(GuideURI);
One way to resolve this issue might be to not rely on the PC's PDF reader software.
You can use MuPDF as a library to extract the text from PDF and maybe write the content of it in XML format, then navigate to the file.
If you don't want to go this far, you can show an error message when trying to display a PDF file on a PC that doesn't have the required features to open it in the WebBrowser (source).
private void webBrowser1_Navigating(object sender, WebBrowserNavigatingEventArgs e)
{
    string url = e.Url.ToString();
    if (url.StartsWith("res://ieframe.dll/navcancl.htm") && url.EndsWith("pdf"))
    {
        e.Cancel = true;
        MessageBox.Show("Cannot open PDF!");
    }
}
Or you can combine the two: if the WebBrowser can't open the PDF file, show a message like "PDF add-on not detected" and then display the XML file generated with the MuPDF library.
It may be because the WebBrowser control uses Internet Explorer's engine. If the user doesn't have the PDF plug-in installed for IE, or has an older version of IE, they won't be able to open the PDF in the WebBrowser.
I'm trying to retrieve data from a webpage, but I cannot do it by making a web request and parsing the resulting HTML file, because the text I'm trying to retrieve is not in the HTML file! I imagine this text is pulled in by some script, which is why it's not in the HTML. For all I know I'm looking at the wrong data, but assuming my theory is correct, is there a straightforward way to retrieve whatever text is displayed by the browser (Firefox or IE) rather than attempting to fetch it from the HTML file?
Assuming you are referring to text that has been generated using JavaScript in the browser:
You can use PhantomJS to achieve this: http://phantomjs.org/
It is essentially a headless browser that will process JavaScript.
You may need to run it as an external program, but I'm sure you can do that through C#.
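Running PhantomJS as an external program from C# can be sketched like this; that `phantomjs` is on PATH and that `render.js` (a PhantomJS script that loads its URL argument and prints the rendered page text to stdout) exists are both assumptions:

```csharp
using System;
using System.Diagnostics;

class PhantomRunner
{
    static void Main()
    {
        // Assumes "phantomjs" is on PATH and render.js loads its URL
        // argument and writes the rendered page text to stdout.
        var psi = new ProcessStartInfo
        {
            FileName = "phantomjs",
            Arguments = "render.js https://example.com/page",
            RedirectStandardOutput = true,
            UseShellExecute = false
        };
        using (Process process = Process.Start(psi))
        {
            string renderedText = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            Console.WriteLine(renderedText);
        }
    }
}
```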
Your other option would be to open the web page in a WebBrowser object which should execute the scripts, and then you can get the HtmlDocument object and go from there.
Take a look at this example...
private void test()
{
    WebBrowser wBrowser1 = new WebBrowser();
    wBrowser1.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(wBrowser1_DocumentCompleted);
    wBrowser1.Url = new Uri("Web Page URL");
}

void wBrowser1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    HtmlDocument document = (sender as WebBrowser).Document;
    // get elements and values accordingly.
}
I am attempting to create links in my view so the end user can download the files in my model. In Internet Explorer I can right-click and download from the link, but left-clicking does not open the file. Firefox shows a message when I click the file that it doesn't know how to open this address, because the protocol (d) isn't associated with any program.
Here is how I am creating the link.
@{
    foreach (var EpubFile in item.files)
    {
        if (File.Exists(System.Configuration.ConfigurationManager.AppSettings["UploadFileDirectory"] + EpubFile.FileReference))
        {
            string link = System.Configuration.ConfigurationManager.AppSettings["UploadFileDirectory"] + EpubFile.FileReference;
            <a href="@link">@EpubFile.OriginalFileName</a>
        }
    }
}
Make sure the link is a URL, not a filename: either a full address prefixed with http:// or a root-relative path. E.g., c:\inetpub\wwwroot\foo\files\myfile.txt should become /files/myfile.txt. Server.MapPath maps a relative URL under your web application root to its physical path, so you can use it as the reference point for the reverse conversion.
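The mapping from physical path to site-relative URL can be sketched as plain string manipulation, with `webRoot` standing in for whatever Server.MapPath("~") returns for your application:

```csharp
using System;

static class PathToUrl
{
    // Turn a physical path under the web root into a site-relative URL,
    // e.g. c:\inetpub\wwwroot\foo\files\myfile.txt -> /files/myfile.txt.
    public static string ToRelativeUrl(string physicalPath, string webRoot)
    {
        if (!physicalPath.StartsWith(webRoot, StringComparison.OrdinalIgnoreCase))
            throw new ArgumentException("Path is not under the web root.");
        return physicalPath.Substring(webRoot.Length).Replace('\\', '/');
    }
}
```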