How to get all Display text from a webpage in C#

How to get all Display text from a webpage in C# - c#

Hi I am working on data scraping application in C#.
Actually I want to get all the Display text but not the html tags.
Here's My code
HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.
Load(#"http://dawateislami.net/books/bookslibrary.do#!section:bookDetail_521.tr");
string str = doc.DocumentNode.InnerText;
This inner html is returning some tags and scripts as well but I want to only get the Display text that's visible to user.
Please help me.
Thanks

[I believe this will solve ur problem][1]
Method 1 – In Memory Cut and Paste
Use WebBrowser control object to process the web page, and then copy the text from the control…
Use the following code to download the web page:
Collapse | Copy Code
//Create the WebBrowser control
WebBrowser wb = new WebBrowser();
//Add a new event to process document when download is completed
wb.DocumentCompleted +=
new WebBrowserDocumentCompletedEventHandler(DisplayText);
//Download the webpage
wb.Url = urlPath;
Use the following event code to process the downloaded web page text:
Collapse | Copy Code
private void DisplayText(object sender, WebBrowserDocumentCompletedEventArgs e)
{
WebBrowser wb = (WebBrowser)sender;
wb.Document.ExecCommand(“SelectAll”, false, null);
wb.Document.ExecCommand(“Copy”, false, null);
textResultsBox.Text = CleanText(Clipboard.GetText());
}
Method 2 – In Memory Selection Object
This is a second method of processing the downloaded web page text. It seems to take just a bit longer (very minimal difference). However, it avoids using the clipboard and the limitations associated with that.
Collapse | Copy Code
private void DisplayText(object sender, WebBrowserDocumentCompletedEventArgs e)
{ //Create the WebBrowser control and IHTMLDocument2
WebBrowser wb = (WebBrowser)sender;
IHTMLDocument2 htmlDocument =
wb.Document.DomDocument as IHTMLDocument2;
//Select all the text on the page and create a selection object
wb.Document.ExecCommand(“SelectAll”, false, null);
IHTMLSelectionObject currentSelection = htmlDocument.selection;
//Create a text range and send the range’s text to your text box
IHTMLTxtRange range = currentSelection.createRange() as IHTMLTxtRange
textResultsBox.Text = range.text;
}
Method 3 – The Elegant, Simple, Slower XmlDocument Approach
A good friend shared this example with me. I am a huge fan of simple, and this example wins the simplicity contest hands down. It was unfortunately very slow compared to the other two approaches.
The XmlDocument object will load / process HTML files with only 3 simple lines of code:
Collapse | Copy Code
XmlDocument document = new XmlDocument();
document.Load(“www.yourwebsite.com”);
string allText = document.InnerText;
There you have it! Three simple ways to scrape only displayed text from web pages with no external “packages” involved.
Packages

To remove javascript and css:
foreach(var script in doc.DocumentNode.Descendants("script").ToArray())
script.Remove();
foreach(var style in doc.DocumentNode.Descendants("style").ToArray())
style.Remove();
To remove comments (untested):
foreach(var comment in doc.DocumentNode.Descendants("//comment()").ToArray())
comment.Remove()

For removing all html tags from a string you can use:
String output = inputString.replaceAll("<[^>]*>", "");
For removing a specific tag:
String output = inputString.replaceAll("(?i)<td[^>]*>", "");
Hope it helps :)

Related

how to get the input element by class using mshtml.IHTMLDocument2

I'm new in using mshtml package library from .NET and, as such, I'm struggling to read the input element rendered in WPF web browser control.
I have researched through the internet and stumbled on this link. How to get the value of an input box of the webbrowser control in WPF?
But in my case, I want to read/get the input element by class, not by name attribute. Unfortunately, I know that GetElementByClass is not an option in WebBrowser.Document
This is my code - When I tried to run the debugger, the code doesn't continue inside the foreach loop ):
private void browser_LoadCompleted(object sender, NavigationEventArgs e)
{
IHTMLElementCollection elementCollection;
IHTMLDocument3 dom = (IHTMLDocument3)browser.Document;
elementCollection = dom.getElementsByTagName("input");
foreach (HtmlElement item in elementCollection)
{
if (item.OuterHtml.Contains("input_search"))
{
//Do something else here...
}
}
}
A screen capture of the DOM of the page
Any help is greatly appreciated.

Ghost cell in IE9 with Repeater Control

IE9 Generate blank cell or you can say Ghost Cell, with ASP.Net Repeater control.
I try javascript regural expression. Render function to run reg. exp. but the page holds few update controls and generate error.
Error: sys.webforms.pagerequestmanagerservererrorexception the message
received from the server could not be parsed. ScriptResource.axd
I try all the well known links for this error.
Please suggest me if you really have...
Thank You
protected override void Render(HtmlTextWriter writer)
{
using (HtmlTextWriter htmlwriter = new HtmlTextWriter(new System.IO.StringWriter()))
{
base.Render(htmlwriter);
string html = htmlwriter.InnerWriter.ToString();
if ((ConfigurationManager.AppSettings.Get("RemoveWhitespace") + string.Empty).Equals("true", StringComparison.OrdinalIgnoreCase))
{
//html = Regex.Replace(html, #"(?<=[^])\t{2,}|(?<=[>])\s{2,}(?=[<])|(?<=[>])\s{2,11}(?=[<])|(?=[\n])\s{2,}", string.Empty);
html = Regex.Replace(html, #"(?<=<td[^>]*>)(?>\s+)(?!<table)|(?<!</table>\s*)\s+(?=</td>)", string.Empty);
html = html.Replace(";\n", ";");
}
writer.Write(html.Trim());
}
another Solution is, but fail for Repeater
var expr = new RegExp('>[ \t\r\n\v\f]*<', 'g');
document.body.innerHTML = document.body.innerHTML.replace(expr, '><');

You can access the Repeater control directly (before it's written to the page and rendered by IE) and remove the cells based on their index.

Need to remove spaces between "< /td >" and "< td >".

Found a very useful script to prevent unwanted cells in your html table while rendering in IE9 browser.
function removeWhiteSpaces()
{
$('#myTable').html(function(i, el) {
return el.replace(/>\s*</g, '><');
});
}
This javascript function you should call when the page loads (i.e. onload event)

Windows Phone Emulator Functions Properly, Debugging or Deploying to Device Does Not

I am developing a very simple application that parses an XML feed, does some formatting and then displays it in a TextBlock. I've added a hyperLink (called "More..) to the bottom of the page (ideally this would be added to the end of the TextBlock after the XML has been parsed) to add more content by changing the URL of the XML feed to the next page.
The issue I'm experiencing is an odd one as the program works perfectly when in the Windows Phone 7 Emulator, but when I deploy it to the device or debug on the device, it works for the first click of the "More..." button, but the ones after the first click just seem to add empty space into the application when deployed or debugged from the device.
I'm using a Samsung Focus (NoDo) and originally thought this may have had to do with the fact that I may not have had the latest developer tools. I've made sure that I am running the latest version of Visual Studio and am still running into the issue.
Here are some snippets of my code to help out.
I've declared the clickCount variable here:
public partial class MainPage : PhoneApplicationPage
//set clickCount to 2 for second page
int clickCount = 2;
Here is the snippet of code I use to parse the XML file:
void client_DownloadStringCompleted(object sender, DownloadStringCompletedEventArgs e)
{
if (e.Error == null)
{
ListBoxItem areaItem = null;
StringReader stream = new StringReader(e.Result);
XmlReader reader = XmlReader.Create(stream);
string areaName = String.Empty;
while (reader.Read())
{
if (reader.NodeType == XmlNodeType.Element)
{
if (reader.Name == "example")
{
areaName = reader.ReadElementContentAsString();
areaItem = new ListBoxItem();
areaItem.Content = areaName;
textBlock1.Inlines.Add(areaName);
textBlock1.Inlines.Add(new LineBreak());
}
}
}
}
}
and the code for when the hyperLink button is clicked:
private void hyperlinkButton1_Click(object sender, RoutedEventArgs e)
{
int stringNum = clickCount;
//URL is being incremented each time hyperlink is clicked
string baseURL = "http://startofURL" + stringNum + ".xml";
Uri url = new Uri(baseURL, UriKind.Absolute);
WebClient client = new WebClient();
client.DownloadStringCompleted += new DownloadStringCompletedEventHandler(client_DownloadStringCompleted);
client.DownloadStringAsync(url);
//increment page number
clickCount = clickCount + 1;
}

It feels like there's a little more debugging to do here.
Can you test where exactly this is going wrong?
is it the click that is not working on subsequent attempts?
is it the HTTP load which is failing?
is it the adding of inline text which is failing?
Looking at it, I suspect it's the last thing. Can you check that your TextBlock is expecting Multiline text? Also, given what you've written (where you don't really seem to be making use of the Inline's from the code snippet I've seen), it might be easier to append add the new content to a ListBox or a StackPanel rather than to the inside of the TextBlock - ListBox's especially have some benefit in terms of Virtualizing the display of their content.

How to display the string html contents into webbrowser control?

I have a C# win app program. I save the text with html format in my database but I want to show it in a webbrowser to my user.How to display the string html contents into webbrowser control?

Try this:
webBrowser1.DocumentText =
"<html><body>Please enter your name:<br/>" +
"<input type='text' name='userName'/><br/>" +
"<a href='http://www.microsoft.com'>continue</a>" +
"</body></html>";

Instead of navigating to blank, you can do
webBrowser1.DocumentText="0";
webBrowser1.Document.OpenNew(true);
webBrowser1.Document.Write(theHTML);
webBrowser1.Refresh();
No need to wait for events or anything else. You can check the MSDN for OpenNew, while I have tested the initial DocumentText assignment in one of my projects and it works.

As commented by Thomas W. - I almost missed this comment but I had the same issues so it's worth rewriting as an answer I think.
The main issue being that after the first assignment of webBrowser1.DocumentText to some html, subsequent assignments had no effect.
The solution as linked by Thomas can be found in detail at http://weblogs.asp.net/gunnarpeipman/archive/2009/08/15/displaying-custom-html-in-webbrowser-control.aspx however I will summarize below in case this page becomes unavailable in the future.
In short, due to the way the webBrowser control works, you must navigate to a new page each time you wish to change the content. Therefore the author proposes a method to update the control as:
private void DisplayHtml(string html)
{
webBrowser1.Navigate("about:blank");
if (webBrowser1.Document != null)
{
webBrowser1.Document.Write(string.Empty);
}
webBrowser1.DocumentText = html;
}
I have however found that in my current application I get a CastException from the line if(webBrowser1.Document != null). I'm not sure why this is, but I've found that if I wrap the whole if block in a try catch the desired effect still works. See:
private void DisplayHtml(string html)
{
webBrowser1.Navigate("about:blank");
try
{
if (webBrowser1.Document != null)
{
webBrowser1.Document.Write(string.Empty);
}
}
catch (CastException e)
{ } // do nothing with this
webBrowser1.DocumentText = html;
}
So every time the function to DisplayHtml is executed I receive a CastException from the if statement, so the contents of the if statement are never reached. However if I comment out the if statement so as not to receive the CastException, then the browser control doesn't get updated. I suspect there is another side effect of the code behind the Document property which causes this effect despite the fact that it also throws an exception.
Anyway I hope this helps people.

For some reason the code supplied by m3z (with the DisplayHtml(string) method) is not working in my case (except first time). I'm always displaying html from string. Here is my version after the battle with the WebBrowser control:
webBrowser1.Navigate("about:blank");
while (webBrowser1.Document == null || webBrowser1.Document.Body == null)
Application.DoEvents();
webBrowser1.Document.OpenNew(true).Write(html);
Working every time for me. I hope it helps someone.

Simple solution, I've tested is
webBrowser1.Refresh();
var str = "<html><head></head><body>" + sender.ToString() + "</body></html>";
webBrowser1.DocumentText = str;

webBrowser.NavigateToString(yourString);

Here is a little code. It works (for me) at any subsequent html code change of the WebBrowser control. You may adapt it to your specific needs.
static public void SetWebBrowserHtml(WebBrowser Browser, string HtmlText)
{
if (Browser != null)
{
if (string.IsNullOrWhiteSpace(HtmlText))
{
// Putting a div inside body forces control to use div instead of P (paragraph)
// when the user presses the enter button
HtmlText =
#"<html>
<head>
<meta http-equiv=""Content-Type"" content=""text/html; charset=UTF-8"" />
</head>
<div></div>
<body>
</body>
</html>";
}
if (Browser.Document == null)
{
Browser.Navigate("about:blank");
//Wait for document to finish loading
while (Browser.ReadyState != WebBrowserReadyState.Complete)
{
Application.DoEvents();
System.Threading.Thread.Sleep(5);
}
}
// Write html code
dynamic Doc = Browser.Document.DomDocument;
Doc.open();
Doc.write(HtmlText);
Doc.close();
// Add scripts here
/*
dynamic Doc = Document.DomDocument;
dynamic Script = Doc.getElementById("MyScriptFunctions");
if (Script == null)
{
Script = Doc.createElement("script");
Script.id = "MyScriptFunctions";
Script.text = JavascriptFunctionsSourcecode;
Doc.appendChild(Script);
}
*/
// Enable contentEditable
/*
if (Browser.Document.Body != null)
{
if (Browser.Version.Major >= 9)
Browser.Document.Body.SetAttribute("contentEditable", "true");
}
*/
// Attach event handlers
// Browser.Document.AttachEventHandler("onkeyup", BrowserKeyUp);
// Browser.Document.AttachEventHandler("onkeypress", BrowserKeyPress);
// etc...
}
}

Old question, but here's my go-to for this operation.
If browser.Document IsNot Nothing Then
browser.Document.OpenNew(True)
browser.Document.Write(My.Resources.htmlTemplate)
Else
browser.DocumentText = My.Resources.htmlTemplate
End If
And be sure that any browser.Navigating event DOES NOT cancel "about:blank" URLs. Example event below for full control of WebBrowser navigating.
Private Sub browser_Navigating(sender As Object, e As WebBrowserNavigatingEventArgs) Handles browser.Navigating
Try
Me.Cursor = Cursors.WaitCursor
Select Case e.Url.Scheme
Case Constants.App_Url_Scheme
Dim query As Specialized.NameValueCollection = System.Web.HttpUtility.ParseQueryString(e.Url.Query)
Select Case e.Url.Host
Case Constants.Navigation.URLs.ToggleExpander.Host
Dim nodeID As String = query.Item(Constants.Navigation.URLs.ToggleExpander.Parameters.NodeID)
:
:
<other operations here>
:
:
End Select
Case Else
e.Cancel = (e.Url.ToString() <> "about:blank")
End Select
Catch ex As Exception
ExceptionBox.Show(ex, "Operation failed.")
Finally
Me.Cursor = Cursors.Default
End Try
End Sub

The DisplayHtml(string html) recommended by m3z worked for me.
In case it helps somebody, I would also like to mention that initially there were some spaces in my HTML that invalidated the HTML and so the text appeared as a string. The spaces were introduced (around the angular brackets) when I pasted the HTML into Visual Studio. So if your text is still appearing as text after you try the solutions mentioned in this post, then it may be worth checking that the HTML syntax is correct.

C# Navigate to Anchors in WebBrowser control

We have a web browser in our Winforms app to nicely display history of a selected item rendered by xslt.
The xslt is writing out <a> tags in the outputted html to allow the webBrowser control to navigate to the selected history entry.
As we are not 'navigating' to the html in the strict web sense, rather setting the html by the DocumentText, I can't 'navigate' to desired anchors with a #AnchorName, as the webBrowser's Url is null (edit: actually on completion it is about:blank).
How can I dynamically navigate to Anchor tags in the html of the Web Browser control in this case?
EDIT:
Thanks sdolphion for the tip, this is the eventual code I used
void _history_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
_completed = true;
if (!string.IsNullOrEmpty(_requestedAnchor))
{
JumpToRequestedAnchor();
return;
}
}
private void JumpToRequestedAnchor()
{
HtmlElementCollection elements = _history.Document.GetElementsByTagName("A");
foreach (HtmlElement element in elements)
{
if (element.GetAttribute("Name") == _requestedAnchor)
{
element.ScrollIntoView(true);
return;
}
}
}

I am sure someone has a better way of doing this but here is what I used to accomplish this task.
HtmlElementCollection elements = this.webBrowser.Document.Body.All;
foreach(HtmlElement element in elements){
string nameAttribute = element.GetAttribute("Name");
if(!string.IsNullOrEmpty(nameAttribute) && nameAttribute == section){
element.ScrollIntoView(true);
break;
}
}

I know this question is old and has a great answer, but this hasn't been suggested yet, so it might be useful for others that come here looking for an answer.
Another way to do it is use the element id in the HTML.
<p id="section1">This is a test section</p>
Then you can use
HtmlElement sectionAnchor = webBrowserPreview.Document.GetElementById("section1");
if (sectionAnchor != null)
{
sectionAnchor.ScrollIntoView(true);
}
where webBrowserPreview is your WebBrowser control.
Alternatively, sectionAnchor.ScrollIntoView(false) will only bring the element on screen instead of aligning it with the top of the page

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

How to get all Display text from a webpage in C# - c#

For removing all html tags from a string you can use: String output = inputString.replaceAll("<[^>]>", ""); For removing a specific tag: String output = inputString.replaceAll("(?i)<td[^>]>", ""); Hope it helps :)

Related

how to get the input element by class using mshtml.IHTMLDocument2

Ghost cell in IE9 with Repeater Control

Windows Phone Emulator Functions Properly, Debugging or Deploying to Device Does Not

How to display the string html contents into webbrowser control?

C# Navigate to Anchors in WebBrowser control

Categories

Resources

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

How to get all Display text from a webpage in C# - c#

For removing all html tags from a string you can use: String output = inputString.replaceAll("<[^>]*>", ""); For removing a specific tag: String output = inputString.replaceAll("(?i)<td[^>]*>", ""); Hope it helps :)

Related

how to get the input element by class using mshtml.IHTMLDocument2

Ghost cell in IE9 with Repeater Control

Windows Phone Emulator Functions Properly, Debugging or Deploying to Device Does Not

How to display the string html contents into webbrowser control?

C# Navigate to Anchors in WebBrowser control

Categories

Resources

For removing all html tags from a string you can use: String output = inputString.replaceAll("<[^>]>", ""); For removing a specific tag: String output = inputString.replaceAll("(?i)<td[^>]>", ""); Hope it helps :)