Is it possible to scrape all the text from a site that was navigated to by WebBrowser control without looking at the source?
David Walker's method is great when one don't need any info from the header nor non main part of the webpage. if one need something outside inner text, there is only two options, one is to parse with "getElement".
the other one is issue commands (Document.ExecCommand) to webbrowser to select all and copy to clipboard:
wb.Document.ExecCommand("SelectAll", false, null);
wb.Document.ExecCommand("Copy", false, null);
then finally string content=clipboard.getText();
Please note the spelling and syntax may not be correct, I'm recalling from my memory
string browserContents = webBrowser.Document.Body.InnerText;
You use the DocumentText property or the WebBrowser control.
This property is what holds the HTML of the site you have navigated to.
Update: (following comments)
If you want to parse the HTML and get the text parts of it, I suggest you use the HTML Agility Pack.
Related
I am using TFS API and I try am trying to access a property "Description" of an object "WorkItem"
I want to display this property on a web page.
When I display this on the web page , this is what I see :
<p>This task is created for our SSRS team Sesame project.</p>
Firstly , I wanted to know if this is a html tag or does it mean something else in TFS .
And secondly , is there a way I can display this in plain text ?
This task is created for our SSRS team Sesame project.
Please let me know.
If you just want to remove tags, see this simple solution (it uses Regular Expressions):
Remove HTML tags in String
It could be even better if you create it as a extension method.
This is HTML as the System.Description field is an HTML one.
If you just write the field out to a tag it should render as text without the tags. The browser hosting the webpage will render them like any other HTML tags.
I want to assign the html content to iframe control from asp.net code behind page;
this is working fine
myIframe.Attributes.Add("src", "pathtofilewith.html");
but, i don't want to give the path of html file to display i just want to assign some html content which comes from database to iframe control.
i want some thing like this(Ashok) to be displayed in iframe control
i tried the bellow ways but nothing is succesful
myIframe.Attributes["innerHTML"] = "<h1>Thank You..</h1>";
myIframe.Attributes.Add("innerHTML", "<h1>Ashok</h1>");
A way to communicate between two different pages where one is in an IFrame on the other, is to post data using JQuery. An example is given in this StackOverflow question
It is also discussed in this other StackOverflow question
On this page, you will also find a short and simple example of how you can put content in an IFrame without using a separate web-page for it (note the lacking src attribute!).
Hope some of this helps!
You can't. That's not how an IFRAME works - it is for use with the src attribute as you've already discovered.
Perhaps you want to create a DIV instead
There is no way to insert HTML content into iframe tag directly, but you can create a page which gets the content form the database and then you can view it using iframe,
or
You can create a page for example called getContent.aspx which request value from the URL e.g. getContent.aspx?content=<h1>Thank You..</h1> and display it wherever you like, and then call it from iframe.
http://booking.travel24.com/index.php?KID=610000&&id=lmpergebnis&showresult=1&detail=zielgebiet®ion=-1&ziel=-1&termin=20.02.2011&ruecktermin=17.03.2011&dauer=-1&abflughafen=46&personen=25;25&kategorie=-1&verpflegung=-1&zimmer=-1
I am trying to parse some HTML parts of this page, but when I check the source code I can not find this: "Tunesien, Marokko".
If I check with xdeveloper I can see this as html:
<a class="reglreg" href="javascript:s_hliste(20009);">Tunesien, Marokko</a>
but if i check source code of the page I can't find this. Why?
If you view the source and search for "Marokko" you will see there are several places where it occurs (loaded as data in several JavaScript arrays).
It appears as if some of the content is produced dynamically through the JavaScript loaded onto the page. That JavaScript builds HTML and changes the page to include the content you are looking for.
To answer your first real question
Why?
Because when you check the source code inside a browser, you'll get the original html code. Then javascript comes along and modify the DOM which you can follow in any modern browser's console.
can i get somehow whole source code
then? if i can not see it in browser
how can i see it?
To make it simple, it depends how you're trying to parse it. With what language?
maybe the data is coming via AJAX call, so it's not there on the html at the start, but dynamically added to it.
if you need to parse this, you can try to "emulate" the ajax call yourself.
I need to create a data index of HTML pages provided to a service by essentially grabbing all text on them and putting them in a string to go into a storage system.
If this were GUI based, I would simply Ctrl+A on the HTML page, copy it, then go to Notepad and Ctrl+V. Simples. If I can do it via good old point n' click, then surely there must be a way to do it programmatically, but I'm struggling to find anything useful.
The HTML docs in question are being loaded for rendering currently using the System.Windows.Controls.WebBrowser class, so I wonder if its somehow possible to grab the data from there?
I'm going to keep hunting, but any pointers would be very appreciated.
Note: We don't want the HTML source code, and would also really rather not have to parse all the source code to get the text unless we absolutely have to.
If I understand your problem correctly, you will have to do a bit of work to get the data.
WebBrowser browser=new WebBrowser(); // This is what you have
HtmlDocument doc = browser.Document; // This gives you the browser contents
String content =
(((mshtml.HTMLDocumentClass)(doc.DomDocument)).documentElement).innerText;
That last line is the browser's view of the rendered content.
This looks like it might be quite helpful.
We are supplied with HTML 'wrapper' files from the client, which we need to insert out content into, and then render the HTML.
Before we render the HTML with our content inserted, I need to add a few tags to the <head> section of the client's wrapper, such as references to our script files, css and some meta tags.
So what I'm doing is
string html = File.ReadAllText(wrapperLocation, Encoding.GetEncoding("iso-8859-1"));
and now I have the complete HTML. I then search for a pre-defined content well in that string and insert our content into that, and render it.
How can I create an instance of a HTML document and modify the <head> section as required?
edit: I don't want to reference System.Windows.Forms so WebBrowser is not an option.
I haven't tried this library myself, but this would probably fit the bill: http://htmlagilitypack.codeplex.com/
You can use https://github.com/jamietre/CsQuery to edit an html dom.
var dom = CQ.Create(html);
var dom = CQ.CreateFromUrl("http://www.jquery.com");
dom.Select("div > span")
.Eq(1)
.Text("Change the text content of the 2nd span child of each div");
Just select the head and add to it.
I use the WebBrowser control as host, and navigate/alter the document through its Document property.
Nice documentation and samples at the link above.
Are you using MasterPages?
This seems like the most obvious use of them.
The MasterPage has <asp:ContentPlaceHolder>'s for all the points where you want the content to go.
In our app we have a base controller that overrides all the View() overloads so that it reads in the name of the MasterPage from the web.config. That way customising the app is as simple as a new MasterPage and from a Controllers point of view there is no code change since our base class handles the MasterPage/web.config stuff.
I couldn't get an automated solution to this, so it came down to a hack:
public virtual void PopulateCssTag(string tags)
{
// tags is a pre-compsed string containing all the tags I need.
this.Wrapper = this.Wrapper.Replace("</head>", tags + "</head>");
}