CsQuery remove comments - c#

I'm fetching a URL and getting all of its content by calling:
CQ dom = ...;
string content = dom.Text();
I noticed that the "Text()" method also extracts HTML comments like:
<html>
<body>
<!-- This is comment - Ignore me -->
</body>
</html>
I'm looking for an option to remove all those comments. Something like this:
dom["comment"].remove();
Is this possible?
Thanks

Found the solution.
The creation of the dom should be done like this:
CQ.Create(stream, Encoding.UTF8, HtmlParsingMode.Auto, HtmlParsingOptions.IgnoreComments);
HtmlParsingOptions.IgnoreComments was what I was looking for.
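For context, a minimal end-to-end sketch of that approach might look like the following; the URL, variable names, and the assumption that a single using CsQuery; brings the parsing enums into scope are all illustrative, not from the original question:
using System.IO;
using System.Net;
using System.Text;
using CsQuery;

class IgnoreCommentsExample
{
    static void Main()
    {
        using (var client = new WebClient())
        using (Stream stream = client.OpenRead("http://example.com")) // hypothetical URL
        {
            // IgnoreComments drops <!-- ... --> nodes during parsing,
            // so they never show up in dom.Text().
            CQ dom = CQ.Create(stream, Encoding.UTF8, HtmlParsingMode.Auto,
                               HtmlParsingOptions.IgnoreComments);
            string content = dom.Text();
        }
    }
}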

Related

can you find html element by attribute with csquery

Can I use CsQuery to find an HTML element with a certain attribute and a certain value?
So if I have a page with something like this:
<html>
<body>
<div align="left">something</div>
</body>
</html>
Can I then get the whole line out by searching for a div whose align attribute has the value left? Or even just the HTML element, and then read the value of that attribute?
As always, thanks for the help and time.
I haven't used CsQuery myself, but looking at the docs you can use CSS selectors, so this should work:
div[align='left']
EDIT:
After being assured that this is in response to a client-side operation, in the script it should look like this:
var rows = query["div[align='left']"];
This is how you look up elements by tag and attribute selectors: put the attribute you want in brackets, followed by the value, as shown above.
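For a complete server-side example, a minimal CsQuery sketch might look like this (the variable names and the inline HTML string are illustrative):
using System;
using CsQuery;

class AttributeSelectorExample
{
    static void Main()
    {
        string html = "<html><body><div align=\"left\">something</div></body></html>";
        CQ dom = CQ.Create(html);

        // jQuery-style attribute selector: tag name, attribute in brackets, quoted value.
        CQ matches = dom["div[align='left']"];

        Console.WriteLine(matches.Text());        // "something"
        Console.WriteLine(matches.Attr("align")); // "left"
    }
}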

Cannot add text to <p>

I'm trying to insert some text just after the first <p> tag:
<body id="tinymce" spellcheck="false">
<p>
// I want to insert text here
<br>
</p>
</body>
My attempt so far hasn't worked:
IElement tinymice;
string testText = "some text here";
string xPath = string.Format("//body[@id='{0}']/p", "tinymce");
tinymice = GetElementByXPath(xPath);
tinymice.SendKeys(testText);
Put the text into a span. Paragraph tags are intended to be separators.
From the docs here: http://jwebunit.sourceforge.net/apidocs/net/sourceforge/jwebunit/api/IElement.html
I infer that you can call:
tinymice.setTextContent("text to insert");
I googled "getelementbyxpath selenium". I hope these are the right docs; the page does list Selenium at the bottom: http://jwebunit.sourceforge.net/apidocs/net/sourceforge/jwebunit/api/class-use/IElement.html
I realize this is probably a Java package, but these may be the same methods in the C# DLL as well.
I don't think the above person was aware that tinyMCE is tinyMCE and not Tiny Mice.
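If GetElementByXPath is a wrapper around Selenium WebDriver, a plain WebDriver version of the same idea might look like the sketch below. The driver setup, the URL, and the iframe index are assumptions based on the markup above, since tinyMCE renders its editable body inside an iframe:
using OpenQA.Selenium;
using OpenQA.Selenium.Firefox;

class InsertTextExample
{
    static void Main()
    {
        IWebDriver driver = new FirefoxDriver();
        try
        {
            driver.Navigate().GoToUrl("http://example.com/editor"); // hypothetical page hosting the editor

            // The editable <body id="tinymce"> usually lives inside an iframe,
            // so switch into it first (frame index is an assumption).
            driver.SwitchTo().Frame(0);

            IWebElement paragraph = driver.FindElement(By.XPath("//body[@id='tinymce']/p"));
            paragraph.SendKeys("some text here");
        }
        finally
        {
            driver.Quit();
        }
    }
}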

Displaying html links in a different order on page load

I have a bunch of static links currently within a div, but I want to change the order of the links on page load.
I've considered using a Literal and looping through the links in code-behind, but I'm stumped. Maybe a Repeater... I need a push in the right direction, please!
I'm fairly new to this, so any help would be much appreciated. Thanks.
(C# or VB.NET)
OK, so let's say you store the links in a Links.txt file, because there could be hundreds of links, for example:
1. Create the .txt file.
2. Make a note of the file location.
3. Save the file with the links in it.
4. Use this code in your project to load the links:
List<string> strUrlLinks = new List<string>(File.ReadAllLines(filePath + fileName));
In the debugger you can then hover over strUrlLinks and you will see the links.
To access an individual link, use its [index] position.
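Building on that idea, a code-behind sketch for rendering the links in a different order could look like the following. It assumes an ASP.NET Web Forms page with a Literal control named litLinks and an App_Data/Links.txt file containing one URL per line; all of those names are placeholders, not from the original question:
using System;
using System.IO;
using System.Linq;
using System.Text;

public partial class LinksPage : System.Web.UI.Page
{
    protected void Page_Load(object sender, EventArgs e)
    {
        // One URL per line in the text file.
        string[] links = File.ReadAllLines(Server.MapPath("~/App_Data/Links.txt"));

        // Reorder however you like; here, alphabetically descending.
        var ordered = links.OrderByDescending(l => l);

        var sb = new StringBuilder();
        foreach (string url in ordered)
            sb.AppendFormat("<a href=\"{0}\">{0}</a><br />", url);

        litLinks.Text = sb.ToString();
    }
}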
I know you asked for an answer in C# or VB.NET and may have some technical constraints there, but creating this functionality in jQuery is almost trivial, considering the links are already hardcoded on the page.
See the example below, which sorts the links in descending order by their text.
<html>
<head>
<title></title>
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.7.1/jquery.min.js"></script>
</head>
<body>
<div id="linksContainer">
Line 1
Line 3
Line 2
</div>
<script type='text/javascript'>
$(window).load(function(){
var orderDivLinks = function(desc) {
$('div#linksContainer').children().detach().sort(function(a,b) {
var compA = $(a).text();
var compB = $(b).text();
return (compA < compB) ? -1 : (compA > compB) ? 1 : 0;
}).appendTo(document.body);
}
orderDivLinks(0);
});
</script>
</body>
</html>

Extract content with XPath?

I have HTML content that I am storing as an XML document (using Html Agility Pack). I know some XPath, but I am not able to zero in on the exact content I need.
In my example below, I am trying to extract the "src" and "alt" text from the large image. This is my example:
<html>
<body>
....
<div id="large_image_display">
<img class="photo" src="images/KC0763_l.jpg" alt="Circles t-shirt - Navy" />
</div>
....
<div id="small_image_display">
<img class="photo" src="images/KC0763_s.jpg" alt="Circles t-shirt - Navy" />
</div>
</body>
</html>
What is the XPath to get "images/KC0763_l.jpg" and "Circles t-shirt - Navy"? This is how far I got, but it is wrong; it's mostly pseudo-code at this point:
\\div[@class='large_image_display']\img[1][@class='photo']@src
\\div[@class='large_image_display']\img[1][@class='photo']@alt
Any help in getting this right would be greatly appreciated.
The following XPath will get you to the src attributes for the img tags:
'//html/body/div/img[@class="photo"]/@src'
And similarly this will get you to the alt attributes:
'//html/body/div/img[@class="photo"]/@alt'
From there you can get to the attribute text. If you want to only find the ones that match 'large_image_display' then you would filter it further like this:
'//html/body/div[@id="large_image_display"]/img[@class="photo"]/@src'
Use the following XPath expressions:
/html/body/div[@id='large_image_display']/img/@src
and
/html/body/div[@id='large_image_display']/img/@alt
Always try to avoid using the // abbreviation, because it may result in very inefficient evaluation (causes the whole (sub)tree to be scanned).
In this particular case we know that the html element is the top element of the document and we can simply select it by /html -- not //html.
Your major problem was that in your expressions you were using \ and \\, and there are no such operators in XPath. The correct XPath operators you were trying to use are / and the // abbreviation.
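If it helps, here is a short Html Agility Pack sketch built around that XPath. Selecting the img element and reading its attributes with GetAttributeValue is usually more convenient than selecting the attribute nodes directly; the file name and variable names are just placeholders:
using System;
using HtmlAgilityPack;

class LargeImageExample
{
    static void Main()
    {
        string html = System.IO.File.ReadAllText("page.html"); // hypothetical file holding the markup above

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Select the <img> inside the large_image_display div.
        HtmlNode img = doc.DocumentNode.SelectSingleNode(
            "/html/body/div[@id='large_image_display']/img[@class='photo']");

        string src = img.GetAttributeValue("src", string.Empty); // "images/KC0763_l.jpg"
        string alt = img.GetAttributeValue("alt", string.Empty); // "Circles t-shirt - Navy"

        Console.WriteLine("{0} | {1}", src, alt);
    }
}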

Iteration through the HtmlDocument.All collection stops at the referenced stylesheet?

Since "bug in .NET" is often not the real cause of a problem, I wonder if I'm missing something here.
What I'm doing feels pretty simple. I'm iterating through the elements of an HtmlDocument called doc like this:
System.Diagnostics.Debug.WriteLine("*** " + doc.Url + " ***");
foreach (HtmlElement field in doc.All)
System.Diagnostics.Debug.WriteLine(string.Format("Tag = {0}, ID = {1} ", field.TagName, field.Id));
I then discovered the debug window output was this:
Tag = !, ID =
Tag = HTML, ID =
Tag = HEAD, ID =
Tag = TITLE, ID =
Tag = LINK, ID =
... when the actual HTML document looks like this:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
<title>Protocol</title>
<link rel="Stylesheet" type="text/css" media="all" href="ProtocolStyle.css">
</head>
<body onselectstart="return false">
<table>
<!-- Misc. table elements and cell values -->
</table>
</body>
</html>
Commenting out the LINK tag solves the issue for me, and the document is then completely parsed. The ProtocolStyle.css file exists on disk and is loaded properly, if that matters. Is this a bug in .NET 3.5 SP1, or what? For such a web-oriented framework, I find it hard to believe there would be such a major bug in it.
Update: By the way, this iteration was done in the WebBrowser control's Navigated event.
After a few years, I returned to this code and finally discovered that the problem was that I walked through the HtmlDocument.All collection in the WebBrowser.Navigated event handler. The proper way is to walk through the elements in the WebBrowser.DocumentCompleted event.
This mistake also made embedded script code seem to "halt" parsing, exactly like the aforementioned LINK tag. In reality, nothing was halting; the browser simply hadn't finished loading the entire document yet.
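In other words, the walk belongs in a DocumentCompleted handler. A minimal WinForms sketch of that fix might look like the following; the form class, the programmatic WebBrowser setup, and the file path are assumptions for illustration:
using System;
using System.Windows.Forms;

public class ProtocolForm : Form
{
    private readonly WebBrowser browser = new WebBrowser();

    public ProtocolForm()
    {
        browser.Dock = DockStyle.Fill;
        Controls.Add(browser);

        // Walk the DOM only once the document has fully loaded,
        // not in the Navigated event.
        browser.DocumentCompleted += OnDocumentCompleted;
        browser.Navigate(@"C:\temp\Protocol.html"); // hypothetical path to the document above
    }

    private void OnDocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
    {
        HtmlDocument doc = browser.Document;
        System.Diagnostics.Debug.WriteLine("*** " + doc.Url + " ***");
        foreach (HtmlElement field in doc.All)
            System.Diagnostics.Debug.WriteLine(string.Format("Tag = {0}, ID = {1}", field.TagName, field.Id));
    }

    [STAThread]
    static void Main()
    {
        Application.Run(new ProtocolForm());
    }
}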
