Get Document OuterHTML of MVC Application in C#

We need to export the entire page of an MVC application to PDF, and for that we need all of the page's HTML contents, including dynamically added content.
To get the contents of the page we used the following code:
string contents = File.ReadAllText(path);
but it only gives the static content of the page (i.e. the original page source), not the nodes added to the DOM at runtime.
We then tried the following code, but it also gives only static content:
// WebClient object, disposed when done
using (WebClient client = new WebClient())
// Retrieve the resource as a stream
using (Stream data = client.OpenRead(new Uri("xxxx.html")))
// Retrieve the text
using (StreamReader reader = new StreamReader(data))
{
    string htmlContent = reader.ReadToEnd();
}
So I want to get the entire outerHTML of the document in C# without using any third-party DLL. I googled many links, and everyone suggests using the WebBrowser control to get the content.
I don't see how that helps our application. Our application is MVC 4; we need to export the entire page to PDF, so we need the entire HTML content (dynamic content too).
How can I use the code below in our MVC application to get the document's outerHTML?
mshtml.HTMLDocument doc = webBrowser1.Document.DomDocument as mshtml.HTMLDocument;
string html = doc.documentElement.outerHTML;
or
var documentAsIHtmlDocument3 = (mshtml.IHTMLDocument3)webBrowser.Document.DomDocument;
StringReader sr = new StringReader(documentAsIHtmlDocument3.documentElement.outerHTML);
htmlDoc.Load(sr);
Any help on this.

You haven't mentioned what the PDF is intended for. Most likely it is for the visitor of the page to download. If that is true, maybe you could use jsPDF and generate the PDF on the client. That way you get around the problem of not having access to the rendered page server-side.
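If the PDF really must be built server-side, the WebBrowser control quoted in the question can be hosted on a dedicated STA thread (it is a WinForms control and cannot run directly on an ASP.NET request thread). What follows is only a minimal sketch, assuming the page's scripts have finished by the time DocumentCompleted fires; the helper name and URL are illustrative:
using System;
using System.Threading;
using System.Windows.Forms; // reference System.Windows.Forms and Microsoft.mshtml

static string GetRenderedHtml(string url)
{
    string html = null;
    var thread = new Thread(() =>
    {
        using (var browser = new WebBrowser { ScriptErrorsSuppressed = true })
        {
            browser.DocumentCompleted += (s, e) =>
            {
                // The document (and its scripts) have loaded; grab the full DOM.
                var doc = (mshtml.IHTMLDocument3)browser.Document.DomDocument;
                html = doc.documentElement.outerHTML;
                Application.ExitThread(); // stop the message loop started below
            };
            browser.Navigate(url);
            Application.Run(); // pump messages so the control can load the page
        }
    });
    thread.SetApartmentState(ApartmentState.STA); // WebBrowser requires STA
    thread.Start();
    thread.Join();
    return html;
}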

Related

Extracting string from Html page using C#

I have a source HTML page and I want to do the following:
extract a specific string from the whole HTML page and save that chosen string in a new HTML page;
create a MySQL database with 4 columns;
import the data from the HTML page into the table on MySQL.
I would be pretty thankful and grateful if someone could help me with this, since I don't have perfect knowledge of C#.
You could use this code:
// Must run inside an async method, since it awaits.
HttpClient http = new HttpClient();
// I have put eBay here; you could use any absolute URL.
var response = await http.GetByteArrayAsync("https://www.ebay.com");
string source = Encoding.UTF8.GetString(response);
source = WebUtility.HtmlDecode(source);
HtmlDocument nodes = new HtmlDocument();
nodes.LoadHtml(source);
In the nodes object, you will have all the DOM elements of the HTML page.
You can use LINQ to filter out whatever you need.
Example:
List<HtmlNode> requiredNodes = nodes.DocumentNode.Descendants()
    .Where(x => x.Attributes["class"] != null && x.Attributes["class"].Value.Contains("List-Item"))
    .ToList();
You will probably need to install the Html Agility Pack NuGet package.
Hope this helps.
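For the MySQL half of the question (creating the 4-column table and importing the data), here is a minimal sketch using the MySql.Data connector, feeding it the requiredNodes list from the LINQ example above; the connection string, table, and column names are all hypothetical:
using MySql.Data.MySqlClient; // NuGet: MySql.Data

var connectionString = "server=localhost;user=root;password=secret;database=scrape";
using (var conn = new MySqlConnection(connectionString))
{
    conn.Open();

    // Create a 4-column table once (schema is illustrative).
    var create = new MySqlCommand(
        @"CREATE TABLE IF NOT EXISTS items (
              id INT AUTO_INCREMENT PRIMARY KEY,
              title VARCHAR(255),
              price VARCHAR(64),
              url VARCHAR(512))", conn);
    create.ExecuteNonQuery();

    // Insert one row per extracted node, parameterized to avoid SQL injection.
    foreach (HtmlNode node in requiredNodes)
    {
        var insert = new MySqlCommand(
            "INSERT INTO items (title, price, url) VALUES (@t, @p, @u)", conn);
        insert.Parameters.AddWithValue("@t", node.InnerText.Trim());
        insert.Parameters.AddWithValue("@p", ""); // fill these from your parsed data
        insert.Parameters.AddWithValue("@u", "");
        insert.ExecuteNonQuery();
    }
}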

Getting contents of Media Library Item in Sitecore

Using Sitecore 7.5, I am trying to store several HTML files inside the Media Library. Then, in my sublayout code-behind, I am attempting to grab the inner content of those HTML files.
I had this working when I was storing the HTML file on the server: I would upload the file into the Media Library using 'upload as file', and then use the following code to read the content:
string filename = htmlMediaItem.Fields["File Path"].ToString();
string path = Server.MapPath(filename);
string content = System.IO.File.ReadAllText(path);
However, I would now like to do this without storing the files on the server, keeping them only in the media library. Is there any way I can do this?
So far I have had a hard time finding information on the subject.
Thank you.
From what I understand, you want to read the content of an HTML file stored in the Media Library:
// MediaManager lives in Sitecore.Resources.Media
Sitecore.Data.Items.Item sampleItem = Sitecore.Context.Database.GetItem("/sitecore/media library/Files/yourhtmlfile");
Sitecore.Data.Items.MediaItem sampleMedia = new Sitecore.Data.Items.MediaItem(sampleItem);
// Read the media blob straight from the library; no file on disk is involved.
using (var reader = new StreamReader(MediaManager.GetMedia(sampleMedia).GetStream().Stream))
{
    string text = reader.ReadToEnd();
}
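If the goal is then to render that markup from the sublayout, a short usage sketch; litHtmlContent is a hypothetical asp:Literal control on the sublayout:
// Inside the using block above, after ReadToEnd():
litHtmlContent.Text = text; // emits the stored markup into the page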

Get video(s) location from raw HTML text

I'm loading a web page into my WebView, and I can access its raw HTML as text. The page has several video elements embedded within it, and I want to get their locations as a list of strings so I can download them separately.
How would I go about doing this?
You can use the Html Agility Pack for parsing:
HtmlDocument document = new HtmlDocument();
document.LoadHtml(rawText);
// SelectNodes returns null when nothing matches, so guard against that.
var videoSourceNodes = document.DocumentNode.SelectNodes("//video/source");
var paths = new List<string>();
if (videoSourceNodes != null)
{
    foreach (var node in videoSourceNodes)
    {
        paths.Add(node.Attributes["src"].Value);
    }
}
Converting relative paths to absolute URLs is up to you; a sketch follows.
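A minimal version of that conversion with System.Uri, assuming you know the page's base address (the URL here is illustrative):
var baseUri = new Uri("https://example.com/videos/page.html");
foreach (var path in paths)
{
    // Uri resolves "clip.mp4" or "/media/clip.mp4" against the base address;
    // URLs that are already absolute pass through unchanged.
    var absolute = new Uri(baseUri, path);
    Console.WriteLine(absolute.AbsoluteUri);
}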

What's the most efficient way to visit a .html page?

I have a .html page that just has 5 characters on it (4 numbers and a period).
The only way I know of is to make a WebBrowser control that navigates to a URL, then use
browser.GetElementByID();
However, that uses IE, so I'm sure it's slow. Is there any better way (without using an external API, just what is built into C#/.NET) to simply visit a web page in a fashion that lets you read from it?
Try these 2 lines:
var wc = new System.Net.WebClient();
string html = wc.DownloadString("http://google.com"); // Your page will be in that html variable
It appears that you want to download a URL, parse it as HTML, then find an element and read its inner text, right? Use NuGet to grab a reference to HtmlAgilityPack, then:
using (var wc = new System.Net.WebClient())
{
    string html = wc.DownloadString("http://foo.com");
    var doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(html);
    var el = doc.GetElementbyId("foo");
    if (el != null)
    {
        var text = el.InnerText;
        Console.WriteLine(text);
    }
}
Without using any APIs? You're in the .NET Framework, so you're already using an abstraction layer to some extent. But if you want pure C# with no add-ons, you could just open a TCP socket to the site, download the contents (it's just a formatted string, after all), and read the data.
Here's a similar question: How to get page via TcpClient?
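A minimal sketch of that raw-socket approach with TcpClient, assuming a plain HTTP (not HTTPS) server; the host and path are illustrative:
using System;
using System.IO;
using System.Net.Sockets;

using (var client = new TcpClient("example.com", 80))
using (var stream = client.GetStream())
using (var writer = new StreamWriter(stream))
using (var reader = new StreamReader(stream))
{
    // Hand-written HTTP/1.1 request; Connection: close makes the read terminate.
    writer.Write("GET /page.html HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n");
    writer.Flush();

    // Everything comes back raw: status line, headers, blank line, then the body.
    string response = reader.ReadToEnd();
    Console.WriteLine(response);
}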

How do I recover the full HTML of a page, including what is generated by JavaScript

How do I recover the full HTML of a page, including what is generated by JavaScript? The problem is that I want to access the contents of a select tag, but it comes back empty, probably because it is generated dynamically. Please help, I'm about to give up!
I am posting only a piece of the code because the whole thing is very large; I can post the rest if necessary.
// Partial snippet: req, cookieContainer, and the rx* regexes are declared elsewhere.
res = (HttpWebResponse)req.GetResponse();
res.Cookies = req.CookieContainer.GetCookies(req.RequestUri);
cookieContainer.Add(res.Cookies);
sr = new StreamReader(res.GetResponseStream());
getHtml = sr.ReadToEnd();
// Scrape the WebForms state fields out of the raw HTML with regexes.
viewstate = rxViewstate.Match(getHtml).Groups[1].Value;
EventValdidation = rxEventValidation.Match(getHtml).Groups[1].Value;
viewstate = HttpUtility.UrlEncode(viewstate);
EventValdidation = HttpUtility.UrlEncode(EventValdidation);
// Here I should take the contents of the select tag.
getHtml = rxDropDownMenu.Match(getHtml).Groups[2].Value;
You can't do this with HttpWebRequest alone: all it does is download the raw HTML, none of the linked JavaScript files.
It also won't run the JavaScript or give you any kind of DOM to inspect.
You'd really need to use the WebBrowser control (see the sketch under the first question above), or perhaps something like Awesomium.
