Get simple text in body - c#

I want to get plain text in the body tag.
Markup:
**simple text 1**
<div>------</div>
<font>-------</font>
**simple text 2**
Code:
foreach (HtmlElement elm in webBrowser1.Document.Body.All)
{
//get simple text
}

Simply:
string plainText = webBrowser1.Document.Body.InnerText;

I found this easy way:
HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
htmlDoc.LoadHtml(webBrowser1.Document.Body.InnerHtml);
foreach (var elm in htmlDoc.DocumentNode.Descendants())
{
if (elm.NodeType == HtmlNodeType.Text)
{
//simple text is #text
var innerText=elm.InnerText;
}
}
have a good time.
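Note that Descendants() also visits text nested inside the div and font elements. If only the text sitting directly under the body is wanted, a minimal sketch could restrict the loop to ChildNodes. This assumes the HtmlAgilityPack package is referenced; the BodyText class and DirectText method are names made up for illustration.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class BodyText
{
    // Hypothetical helper: returns the text nodes that are direct children of
    // the parsed fragment, skipping whitespace-only nodes and anything nested
    // inside elements such as <div> or <font>.
    public static List<string> DirectText(string bodyInnerHtml)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(bodyInnerHtml);

        var result = new List<string>();
        foreach (HtmlNode node in doc.DocumentNode.ChildNodes)
        {
            if (node.NodeType != HtmlNodeType.Text)
                continue;

            // Decode entities such as &amp; and drop surrounding whitespace.
            string text = HtmlEntity.DeEntitize(node.InnerText).Trim();
            if (text.Length > 0)
                result.Add(text);
        }
        return result;
    }
}
```

It could be called as BodyText.DirectText(webBrowser1.Document.Body.InnerHtml), which for the markup in the question should give just the two plain-text pieces.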

Try the following approach: it returns all the text that is shown in the browser preview.
string plainText = StripHTML(webBrowser1); // call it like this
public string StripHTML(WebBrowser webp)
{
try
{
Clipboard.Clear();
webp.Document.ExecCommand("SelectAll", true, null);
webp.Document.ExecCommand("Copy", false, null);
}
catch (Exception ep)
{
MessageBox.Show(ep.Message);
}
return Clipboard.GetText();
}


OpenXML: insert italics at bookmarks (several characters in a string are italic)

Here is the code for my work.
public void InsertValue(WordprocessingDocument doc, string bookMark, string txt)
{
try
{
RemoveBookMarkContent(doc, bookMark);
var bmStart = FindBookMarkStart(doc, bookMark);
if (bmStart == null)
return;
var run = new Run();
run.Append(GetRunProperties());
run.Append(new Text(txt));
bmStart.Parent.InsertAfter(run, bmStart);
}
catch (Exception c)
{
//not Exception
}
}
private void RemoveBookMarkContent(WordprocessingDocument doc, string bmName)
{
BookmarkStart bmStart = FindBookMarkStart(doc, bmName);
if (bmStart == null)
return;
BookmarkEnd bmEnd = FindBookMarkEnd(doc, bmStart.Id);
while (true)
{
var run = bmStart.NextSibling();
if (run == null)
{
break;
}
if (run is BookmarkEnd && (BookmarkEnd)run == bmEnd)
{
break;
}
run.Remove();
}
}
Several helper methods are not shown. The process is: first find the bookmark location, delete the content at the bookmark, and then add the new content. I've also tried adding a Paragraph at the bookmark location, but that doesn't work.
Text to insert at the bookmark, e.g.: 露点:U=0.15℃(k=2);相对湿度:U=1.0%RH(k=2) (in English: dew point: U=0.15℃ (k=2); relative humidity: U=1.0%RH (k=2)). Both U and k must be italic. Any help will be appreciated. Thanks.
I tried a different component, [Spire.Office][1].
At first I couldn't think of a solution, but then I used a global search for the string and checked whether each match sits at a bookmark, which solved the problem perfectly.
Here is my code.
var selection = document.FindAllString("U", false, true);
foreach (var sec in selection)
{
var range = sec.GetAsOneRange();
if (range?.Owner?.LastChild?.DocumentObjectType == DocumentObjectType.BookmarkEnd)
{
range.CharacterFormat.Italic = true;
}
}
I didn't try this with OpenXML, but I think the principle should be the same.
[1]: https://www.e-iceblue.cn/Buy/Spire-PDF-NET.html

Exception loading the XML

I'm using this code to save and restore values in XML, but I'm having trouble. Saving works fine; the problem occurs when I try to load the XML. I get the exception shown in the image.
line 105 : string text = el.Attribute("Text").Value;
void SaveData() {
XDocument xmlDocument = new XDocument(new XElement("Pages"));
List<XElement> xmlPages = new List<XElement>();
foreach(KeyValuePair<string, string> doc in documents)
xmlDocument.Root.Add(
new XElement("Page",
new XAttribute("nodeName", GetNodeName(doc.Key)),
new XAttribute("pageGuid", doc.Key),
new XAttribute("Rtf", doc.Value)));
xmlDocument.Root.Add(
new XElement("TextEdit",
new XAttribute("Text", textBox1.Text)));
xmlDocument.Save(GetPathToFile());
}
void LoadData() {
try {
XDocument xmlDocument = XDocument.Load(GetPathToFile());
rootNode.Nodes.Clear();
documents.Clear();
foreach(XElement el in xmlDocument.Root.Elements()) {
string nodeName = el.Attribute("nodeName").Value;
string pageGuid = el.Attribute("pageGuid").Value;
string rtf = el.Attribute("Rtf").Value;
string text = el.Attribute("Text").Value;
rootNode.Nodes.Add(new DataNode(nodeName, pageGuid));
documents.Add(pageGuid, rtf);
textBox1.Text = text;
}
} catch(Exception ex) {
MessageBox.Show("No data loaded. Check XML file" + ex.ToString());
}
treeList1.RefreshDataSource();
}
The exception is clear: there is no such attribute on that element, so el.Attribute("Text") returns null and you can't read its Value. Check that the attribute exists before getting its value.
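One way to do that check, as a minimal sketch: casting an XAttribute to string yields null when the attribute is missing, instead of throwing. The AttributeDemo class, the ReadAttribute helper and the sample XML below are assumptions for illustration, modelled on what SaveData writes.

```csharp
using System;
using System.Xml.Linq;

public static class AttributeDemo
{
    // Hypothetical helper: returns the attribute value, or the fallback
    // when the element has no attribute with that name.
    public static string ReadAttribute(XElement el, string name, string fallback)
    {
        // Casting a null XAttribute to string gives null rather than throwing.
        return (string)el.Attribute(name) ?? fallback;
    }

    public static void Main()
    {
        // Sample data shaped like the file produced by SaveData.
        var root = XElement.Parse(
            "<Pages>" +
            "<Page nodeName=\"n1\" pageGuid=\"g1\" Rtf=\"r1\" />" +
            "<TextEdit Text=\"hello\" />" +
            "</Pages>");

        foreach (XElement el in root.Elements())
        {
            Console.WriteLine(el.Name.LocalName + ": Text=" + ReadAttribute(el, "Text", "(none)"));
        }
    }
}
```

With this, the Page elements no longer throw when they lack a Text attribute; they just produce the fallback value.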
After some research I was able to solve it.
Solution:
void LoadData() {
try {
XDocument xmlDocument = XDocument.Load(GetPathToFile());
rootNode.Nodes.Clear();
documents.Clear();
foreach(XElement el in xmlDocument.Root.Elements()) {
switch(el.Name.LocalName) {
case "Page":
string nodeName = el.Attribute("nodeName").Value;
string pageGuid = el.Attribute("pageGuid").Value;
string rtf = el.Attribute("Rtf").Value;
rootNode.Nodes.Add(new DataNode(nodeName, pageGuid));
documents.Add(pageGuid, rtf);
break;
case "TextEdit": // element name written by SaveData
textEdit1.Text = el.Attribute("Text").Value; // attribute names are case-sensitive
break;
}
}
} catch(Exception ex) {
MessageBox.Show("No data loaded. Check XML file");
}
treeList1.RefreshDataSource();
}

Scraping HTML DOM elements using HtmlAgilityPack in ASP.NET

I am scraping HTML DOM elements using HtmlAgilityPack in ASP.NET. Currently my code loads all the href links, which means it also follows the sublinks of sublinks. But I need only the URLs that belong to my own domain. I don't know how to write the code for that. Can anyone help me do this?
Here is my code:
public void GetURL(string strGetURL)
{
var getHtmlSource = new HtmlWeb();
var document = new HtmlDocument();
try
{
document = getHtmlSource.Load(strGetURL);
var aTags = document.DocumentNode.SelectNodes("//a");
if (aTags != null)
{
outputurl.Text = string.Empty;
int _count = 0;
foreach (var aTag in aTags)
{
string strURLTmp;
strURLTmp = aTag.Attributes["href"].Value;
if (_count != 0)
{
if (!CheckDuplicate(strURLTmp))
{
lstResults.Add(strURLTmp);
outputurl.Text += strURLTmp + "\n";
counter++;
GetURL(strURLTmp);
}
}
_count++;
}
}
}
catch (Exception)
{
// error handling omitted
}
}
If you meant to get only the URLs that contain a specific domain, you can change the XPath to:
//a[contains(@href, 'your domain here')]
Or, if you prefer LINQ to XPath:
var aTags = document.DocumentNode.SelectNodes("//a");
if (aTags != null)
{
....
var relevantLinks = aTags.Where(o => o.GetAttributeValue("href", "")
.Contains("your domain here")
);
....
}
GetAttributeValue() is a safer way to read an attribute with HAP. Instead of returning null, which may cause an exception, it returns the second argument when the attribute is not found on the context node.
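A plain Contains() check also matches the domain when it appears elsewhere in the URL, and it misses relative links such as /about. As a hedged alternative, the href can be resolved against the page URL and the hosts compared. The LinkFilter class and IsSameHost method are illustrative names, not part of HAP; only System.Uri is used.

```csharp
using System;

public static class LinkFilter
{
    // Hypothetical helper: true when href, resolved against the page URL,
    // points at the same host as the page itself.
    public static bool IsSameHost(Uri pageUri, string href)
    {
        if (string.IsNullOrWhiteSpace(href))
            return false;

        // Resolves relative links such as "/about" against the page URL;
        // absolute links are kept as they are.
        if (!Uri.TryCreate(pageUri, href, out Uri resolved))
            return false;

        // Skip mailto:, javascript: and other non-web schemes.
        if (resolved.Scheme != Uri.UriSchemeHttp && resolved.Scheme != Uri.UriSchemeHttps)
            return false;

        return string.Equals(resolved.Host, pageUri.Host, StringComparison.OrdinalIgnoreCase);
    }
}
```

Inside the loop it could be used as LinkFilter.IsSameHost(new Uri(strGetURL), aTag.GetAttributeValue("href", "")). Note that this sketch treats www.example.com and example.com as different hosts.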

HTML Agility Pack - Filter Href Value Results

I'm working on a web scraper. The following text shows the results of the code given at the end of this question, which gets the values of all hrefs from a page.
I only want to get values that contain docid=
index.php?pageid=a45475a11ec72b843d74959b60fd7bd64556e8988583f
#
summary_of_documents.php
index.php?pageid=a45475a11ec72b843d74959b60fd7bd64579b861c1d7b
#
index.php?pageid=a45475a11ec72b843d74959b60fd7bd64579e0509c7f0&apform=judiciary
decisions.php?doctype=Decisions / Signed Resolutions&docid=1263778435388003271#sam
decisions.php?doctype=Decisions / Signed Resolutions&docid=12637789021669321156#sam
?doctype=Decisions / Signed Resolutions&year=1986&month=January#head
?doctype=Decisions / Signed Resolutions&year=1986&month=February#head
Here's the code:
string url = urlTextBox.Text;
string sourceCode = Extractor.getSourceCode(url);
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(sourceCode);
List<string> links = new List<string>();
if (links != null)
{
foreach (HtmlAgilityPack.HtmlNode nd in doc.DocumentNode.SelectNodes("//a[@href]"))
{
links.Add(nd.Attributes["href"].Value);
}
}
else
{
MessageBox.Show("No Links Found");
}
if (links != null)
{
foreach (string str in links)
{
richTextBox9.Text += str + "\n";
}
}
else
{
MessageBox.Show("No Link Values Found");
}
How can I do this?
Why not just replace this:
links.Add(nd.Attributes["href"].Value);
with this:
if (nd.Attributes["href"].Value.Contains("docid="))
links.Add(nd.Attributes["href"].Value);

Does WebBrowser.DocumentText also contain the text of all frame documents?

I am not sure whether WebBrowser.DocumentText contains only the top document's source or whether the text of the frame documents is included as well. I could not find that on the MSDN page.
No, it does not. I tried the following:
DocumentText:
File.WriteAllText(@"C:\doc.txt", webBrowser1.DocumentText, Encoding.UTF8);
GetElementsByTagName("HTML")
HtmlElement elem;
if (webBrowser1.Document != null)
{
HtmlElementCollection elems = webBrowser1.Document.GetElementsByTagName("HTML");
if (elems.Count == 1)
{
elem = elems[0];
string pageSource = elem.OuterHtml;
File.WriteAllText(@"C:\doc.txt", pageSource, Encoding.UTF8);
}
}
IOleCommandTarget
public void ShowSource()
{
IOleCommandTarget cmdt = null;
object o = null;
object oIE = null;
try {
cmdt = (IOleCommandTarget)this.Document.DomDocument;
cmdt.Exec(cmdGUID, oCommands.ViewSource, 1, o, o);
} catch (Exception ex) {
throw new Exception(ex.Message.ToString(), ex.InnerException);
} finally {
cmdt = null;
}
}
The only way is to go through all the frame documents yourself.
Update: if an iframe has a different URL, you will get an UnauthorizedAccessException when trying to retrieve the iframe's document.
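Going through the frame documents could look roughly like the sketch below. It walks HtmlWindow.Frames recursively and skips frames whose document cannot be read. The FrameSource class and Collect method are names invented for illustration, and the code has to run inside a Windows Forms application after the page has finished loading.

```csharp
using System;
using System.Collections.Generic;
using System.Windows.Forms;

public static class FrameSource
{
    // Hypothetical helper: adds the HTML of the window's document, and of
    // every nested frame that can be read, to the given list.
    public static void Collect(HtmlWindow window, List<string> sources)
    {
        try
        {
            HtmlDocument doc = window.Document;
            if (doc == null)
                return;

            // Same technique as above: take the OuterHtml of the HTML element.
            HtmlElementCollection html = doc.GetElementsByTagName("HTML");
            if (html.Count > 0)
                sources.Add(html[0].OuterHtml);

            // Recurse into child frames and iframes.
            foreach (HtmlWindow frame in window.Frames)
                Collect(frame, sources);
        }
        catch (UnauthorizedAccessException)
        {
            // The frame's document is not accessible (see the update above),
            // so it is skipped.
        }
    }
}
```

A call might look like FrameSource.Collect(webBrowser1.Document.Window, list) from the DocumentCompleted handler, where list is a List<string> created beforehand.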
