C# extracting html only


I have a web page with embedded CSS and JavaScript, and I want to extract only the HTML itself: text, tables, images and so on.
So far I have the whole page stored in a string called "html". The page is just the Facebook homepage, for example, but as you will see it contains scripts and other embedded content that I don't want.
HTMLEdit = //webpage I chose to store in here//
string html = HTMLEdit.DocumentText;
string result = "this I want to contain only the <head>, <body>, <foot>";
I am only interested in displaying the result, which should contain only HTML; I don't want the JavaScript, CSS or anything else.
I have looked at the HTML Agility Pack, but there is no documentation on their website for doing this. This is my first ever C# project, so excuse my ignorance if I don't make sense.

See this question:
HTML Agility Pack strip tags NOT IN whitelist
Maybe adapt that answer so that it drops the link and script tags.
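As a rough illustration of dropping script, style and link tags, here is a minimal sketch that uses only regular expressions from the standard library. It is a stand-in for the HTML Agility Pack approach suggested above, not that library's API; the class and method names (HtmlCleaner, StripScriptsAndStyles) are made up for this example, and a regex will not cope with every malformed page a real parser can handle.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlCleaner
{
    // Matches <script>...</script> and <style>...</style> blocks, contents included.
    private static readonly Regex Blocks = new Regex(
        @"<(script|style)\b[^>]*>.*?</\1\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Matches standalone <link ...> tags (typically external stylesheets).
    private static readonly Regex Links = new Regex(
        @"<link\b[^>]*>", RegexOptions.IgnoreCase);

    // Hypothetical helper: returns the markup without script, style and link tags.
    public static string StripScriptsAndStyles(string html)
    {
        string withoutBlocks = Blocks.Replace(html, string.Empty);
        return Links.Replace(withoutBlocks, string.Empty);
    }
}

public static class Program
{
    public static void Main()
    {
        string html = "<html><head><style>p{color:red}</style>" +
                      "<link rel=\"stylesheet\" href=\"a.css\">" +
                      "<script>alert(1);</script></head>" +
                      "<body><p>Hello</p></body></html>";

        // Prints: <html><head></head><body><p>Hello</p></body></html>
        Console.WriteLine(HtmlCleaner.StripScriptsAndStyles(html));
    }
}
```

For pages you do not control, a real HTML parser is likely the safer choice; the regex version is only meant to show which elements need to go.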

Related

Selenium Chrome driver find text on the page

I want to load a web page in the Selenium Chrome driver and search for a few pieces of text on the loaded HTML page. I want to check programmatically whether the search text is found or not. I found a way to search in the source code but couldn't find a way to search on the rendered HTML page. An XPath search would not work, as I want to use the solution with different HTML pages. Could someone suggest how this can be done? I am using C#.

I think you can do this with XPath. The following returns a collection of all elements on the page whose text is exactly "text you want to search":
driver.FindElements(By.XPath("//*[text() = 'text you want to search']"));
Or, the following will grab any elements containing "my text":
driver.FindElements(By.XPath("//*[text()[contains(., 'my text')]]"));
You can then check the Count of that collection, or interact with the elements individually. As long as you are just searching for that text, I think this should work on any web page.
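If the search text is not known in advance, it may contain quotes that would break the XPath expression. The sketch below builds the expression from arbitrary text; the XPathText class and its methods are hypothetical helpers written for this example, and the result would be passed to driver.FindElements(By.XPath(...)) as in the answer above.

```csharp
using System;
using System.Linq;

public static class XPathText
{
    // XPath 1.0 has no escape character inside string literals, so text that
    // contains both quote types has to be assembled with concat().
    public static string ToLiteral(string value)
    {
        if (!value.Contains("'")) return "'" + value + "'";
        if (!value.Contains("\"")) return "\"" + value + "\"";

        // Split on single quotes and rejoin the pieces with a double-quoted "'".
        var parts = value.Split('\'').Select(p => "'" + p + "'");
        return "concat(" + string.Join(", \"'\", ", parts) + ")";
    }

    // Builds an expression matching any element with a text node containing searchText.
    public static string ContainsText(string searchText)
    {
        return "//*[text()[contains(., " + ToLiteral(searchText) + ")]]";
    }
}

public static class Program
{
    public static void Main()
    {
        // Prints: //*[text()[contains(., 'my text')]]
        Console.WriteLine(XPathText.ContainsText("my text"));

        // Prints: //*[text()[contains(., "it's here")]]
        Console.WriteLine(XPathText.ContainsText("it's here"));
    }
}
```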

how to remove tags in text while displaying on a web page

I am using the TFS API and I am trying to access the "Description" property of a "WorkItem" object.
I want to display this property on a web page.
When I display it on the web page, this is what I see:
<p>This task is created for our SSRS team Sesame project.</p>
Firstly, I wanted to know whether this is an HTML tag or whether it means something else in TFS.
And secondly, is there a way I can display this as plain text?
This task is created for our SSRS team Sesame project.
Please let me know.

If you just want to remove the tags, see this simple solution (it uses regular expressions):
Remove HTML tags in String
It could be even better if you implement it as an extension method.
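A minimal sketch of what such an extension method might look like, assuming a simple "anything between angle brackets" pattern is good enough for the field in question. The names HtmlStringExtensions and StripHtmlTags are invented for this example; the linked answer may use a different pattern.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlStringExtensions
{
    // Matches anything that looks like a tag: "<", one or more non-">" characters, ">".
    private static readonly Regex Tag = new Regex("<[^>]+>");

    // Hypothetical extension method: removes tags and decodes entities such as &amp;.
    public static string StripHtmlTags(this string input)
    {
        if (string.IsNullOrEmpty(input)) return string.Empty;

        string withoutTags = Tag.Replace(input, string.Empty);
        return WebUtility.HtmlDecode(withoutTags).Trim();
    }
}

public static class Program
{
    public static void Main()
    {
        string description = "<p>This task is created for our SSRS team Sesame project.</p>";

        // Prints: This task is created for our SSRS team Sesame project.
        Console.WriteLine(description.StripHtmlTags());
    }
}
```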

This is HTML, because the System.Description field is an HTML field.
If you write the field out to the page without encoding it, the text should appear without the tags: the browser hosting the web page will render them like any other HTML tags.

How to copy all data from a HTML doc and save it to a string using C#

I need to create a data index of HTML pages provided to a service by essentially grabbing all text on them and putting them in a string to go into a storage system.
If this were GUI based, I would simply Ctrl+A on the HTML page, copy it, then go to Notepad and Ctrl+V. Simples. If I can do it via good old point n' click, then surely there must be a way to do it programmatically, but I'm struggling to find anything useful.
The HTML docs in question are currently being loaded for rendering using the System.Windows.Controls.WebBrowser class, so I wonder if it's somehow possible to grab the data from there?
I'm going to keep hunting, but any pointers would be very appreciated.
Note: We don't want the HTML source code, and would also really rather not have to parse all the source code to get the text unless we absolutely have to.

If I understand your problem correctly, you will have to do a bit of work to get the data.
WebBrowser browser = new WebBrowser(); // This is what you have
HtmlDocument doc = browser.Document; // This gives you the browser contents
string content = ((mshtml.HTMLDocumentClass)doc.DomDocument).documentElement.innerText;
That last line is the browser's view of the rendered content.

This looks like it might be quite helpful.

How to programmatically load a HTML document in order to add to the document's <head>?

We are supplied with HTML 'wrapper' files from the client, which we need to insert our content into, and then render the HTML.
Before we render the HTML with our content inserted, I need to add a few tags to the <head> section of the client's wrapper, such as references to our script files, css and some meta tags.
So what I'm doing is
string html = File.ReadAllText(wrapperLocation, Encoding.GetEncoding("iso-8859-1"));
and now I have the complete HTML. I then search for a pre-defined content well in that string and insert our content into that, and render it.
How can I create an instance of a HTML document and modify the <head> section as required?
edit: I don't want to reference System.Windows.Forms so WebBrowser is not an option.

I haven't tried this library myself, but this would probably fit the bill: http://htmlagilitypack.codeplex.com/

You can use https://github.com/jamietre/CsQuery to edit an HTML DOM.
var dom = CQ.Create(html);
// or load the document straight from a URL instead:
// var dom = CQ.CreateFromUrl("http://www.jquery.com");
dom.Select("div > span")
    .Eq(1)
    .Text("Change the text content of the 2nd span child of each div");
Just select the head and add to it.

I use the WebBrowser control as host, and navigate/alter the document through its Document property.
Nice documentation and samples at the link above.

Are you using MasterPages?
This seems like the most obvious use of them.
The MasterPage has <asp:ContentPlaceHolder>'s for all the points where you want the content to go.
In our app we have a base controller that overrides all the View() overloads so that it reads the name of the MasterPage from the web.config. That way, customising the app is as simple as adding a new MasterPage, and from a controller's point of view there is no code change, since our base class handles the MasterPage/web.config stuff.

I couldn't get an automated solution to this, so it came down to a hack:
public virtual void PopulateCssTag(string tags)
{
    // tags is a pre-composed string containing all the tags I need.
    this.Wrapper = this.Wrapper.Replace("</head>", tags + "</head>");
}
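One weakness of the plain Replace call is that it only matches a lower-case "</head>", and client-supplied wrappers may use different casing. The following is a sketch of a case-insensitive variant, assuming the wrapper contains at most one closing head tag; HeadInjector and InsertBeforeHeadClose are names made up for this example.

```csharp
using System;

public static class HeadInjector
{
    // Hypothetical helper: inserts tags immediately before the first closing
    // head tag, whatever its casing. Returns html unchanged if none is found.
    public static string InsertBeforeHeadClose(string html, string tags)
    {
        int index = html.IndexOf("</head>", StringComparison.OrdinalIgnoreCase);
        if (index < 0) return html;

        return html.Insert(index, tags);
    }
}

public static class Program
{
    public static void Main()
    {
        string wrapper = "<html><HEAD><title>t</title></HEAD><body></body></html>";
        string tags = "<meta charset=\"utf-8\">";

        // Prints: <html><HEAD><title>t</title><meta charset="utf-8"></HEAD><body></body></html>
        Console.WriteLine(HeadInjector.InsertBeforeHeadClose(wrapper, tags));
    }
}
```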

HttpWebRequest reads only homepage

Hi, I tried to read a page using HttpWebRequest like this:
string lcUrl = "http://www.greatandhra.com";
HttpWebRequest loHttp = (HttpWebRequest)WebRequest.Create(lcUrl);
loHttp.Timeout = 10000; // 10 secs
loHttp.UserAgent = "Code Sample Web Client";
HttpWebResponse loWebResponse = (HttpWebResponse)loHttp.GetResponse();
Encoding enc = Encoding.GetEncoding(1252); // Windows default Code Page
StreamReader loResponseStream =
new StreamReader(loWebResponse.GetResponseStream(), enc);
string lcHtml = loResponseStream.ReadToEnd();
mydiv.InnerHtml = lcHtml;
// Response.Write(lcHtml);
loWebResponse.Close();
loResponseStream.Close();
I am able to read that page and bind it to mydiv. But when I click on any of the links in that div, nothing is displayed, because my application doesn't contain the entire site. What should I do?
Can somebody copy my code and test it, please?
Nagu

I'm fairly sure you can't insert a full page in a DIV without breaking something. In fact the whole head tag may be getting skipped altogether (and any javascript code there may not be run). Considering what you seem to want to do, I suggest you use an IFRAME with a dynamic src, which will also hopefully lift some pressure off your server (which wouldn't be in charge of fetching the html to be mirrored anymore).

If you really want a whole page of HTML embedded in another, then the IFRAME tag is probably the one to use, rather than the DIV.
Rather than having to create a web request and have all that code to retrieve the remote page, you can just set the src attribute of the IFRAME to point to the page you want it to display.
For example, something like this in markup:
<iframe src="<%=LcUrl %>" frameborder="0"></iframe>
where LcUrl is a property on your code-behind page, that exposes your string lcUrl from your sample.
Alternatively, you could make the IFRAME runat="server" and set its src property programmatically (or even inject the innerHTML in a way similar to your code sample, if you really wanted to).

The code you are putting inside .InnerHtml of the div contains the entire page (including <html>, <body>, </html> and </body>), which can cause a myriad of problems in any number of browsers.
I would either move to an iframe, or consider parsing the HTML of the remote site and displaying a transformed version (i.e. strip the HTML, BODY and META tags, replace some link URLs, etc.).

But when i click on any one of links in that div it is not displaying any result
Probably because the links in the downloaded page are relative. If you just copy the HTML into a DIV in your page, the browser treats the links as relative to the current URL: it doesn't know about the origin of this content. I think the solution is to parse the downloaded HTML and convert the relative URLs in href attributes to absolute URLs.
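A rough sketch of that conversion, using System.Uri to resolve each href against the address of the page it came from. The regular expression only handles double-quoted href attributes, and LinkRewriter and MakeLinksAbsolute are names invented for this example; a proper HTML parser would be more robust.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LinkRewriter
{
    // Matches href="..." with a double-quoted value; group 1 is the URL.
    private static readonly Regex Href = new Regex(
        "href\\s*=\\s*\"([^\"]*)\"", RegexOptions.IgnoreCase);

    // Hypothetical helper: resolves every href against baseUri.
    // URLs that are already absolute are left pointing where they were.
    public static string MakeLinksAbsolute(string html, Uri baseUri)
    {
        return Href.Replace(html, match =>
        {
            string original = match.Groups[1].Value;
            Uri absolute;
            if (Uri.TryCreate(baseUri, original, out absolute))
                return "href=\"" + absolute.AbsoluteUri + "\"";

            return match.Value; // leave anything unparseable untouched
        });
    }
}

public static class Program
{
    public static void Main()
    {
        var baseUri = new Uri("http://www.greatandhra.com");
        string html = "<a href=\"/news/today.html\">News</a>";

        // Prints: <a href="http://www.greatandhra.com/news/today.html">News</a>
        Console.WriteLine(LinkRewriter.MakeLinksAbsolute(html, baseUri));
    }
}
```

The same idea would apply to src attributes on images and scripts if those also need to keep working.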

If you want to embed it, you need to strip everything but the body part. That means you have to parse your string lcHtml for <body ...> and remove everything up to and including the body tag. You must also strip away everything from </body> onwards. Then you need to parse the string for all occurrences of <a href="..."> that do not start with http:// and prepend http://www.greatandhra.com, or set <base href="http://www.greatandhra.com/"> in your head section.
If you don't want to embed, simply clear the response buffer and stream the lcHtml string back to the browser.

Sounds like what you are trying to do is display a different site embedded in your site. For this to work by dropping it into a div you would have to extract the code between the body tags as it wouldn't be valid with html and head in the middle of another page.
The links won't work because you've taken that page out of context by putting it in your site. You would have to rewrite any relative links on the page (i.e. those that don't start with http) to point to a page on your site, which then fetches the other site's page and displays it back in your site. Alternatively, you could add the URL of the site you're grabbing to the beginning of all the relative links, so that they link back to that site.
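Extracting the markup between the body tags, as several of the answers above suggest, could be sketched like this. It assumes the page has a single body element and uses a regular expression rather than a parser; BodyExtractor and GetBodyContent are names made up for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BodyExtractor
{
    // Captures everything between the opening <body ...> tag and </body>.
    private static readonly Regex Body = new Regex(
        @"<body\b[^>]*>(.*?)</body\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Hypothetical helper: returns the body's inner markup,
    // or the input unchanged when no body element is found.
    public static string GetBodyContent(string html)
    {
        Match match = Body.Match(html);
        return match.Success ? match.Groups[1].Value : html;
    }
}

public static class Program
{
    public static void Main()
    {
        string page = "<html><head><title>t</title></head>" +
                      "<body class=\"x\"><p>Hi</p></body></html>";

        // Prints: <p>Hi</p>
        Console.WriteLine(BodyExtractor.GetBodyContent(page));
    }
}
```

The result is what would then be assigned to the div's InnerHtml instead of the full page.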
