Can you find an HTML element by attribute with CsQuery? - C#

Can I use CsQuery to find an HTML element that has a certain attribute with a certain value?
Say I have a page containing something like this:
<html>
<body>
<div align="left">something</div>
</body>
</html>
Can I then get the whole line by searching for a div with the attribute align set to left? Or even just find the HTML element and then read the value of the attribute?
As always, thanks for the help and time.

I haven't used CsQuery myself, but according to the docs it accepts CSS selectors, so this should work:
div[align='left']
EDIT:
After being assured that this is in response to a client side operation, in the script it should look like this:
var rows = query["div[align='left']"];
This is how you look up elements by tag and attribute: put the attribute name in square brackets right after the tag name, followed by = and the quoted value, as in the selector above.
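To also cover the second half of the question (getting the element's markup and the attribute's value), here is a minimal sketch. It assumes the CsQuery NuGet package is referenced and that the page is already available as a string; the sample markup is the one from the question, and the variable names are only illustrative.

```csharp
using System;
using CsQuery;

class Program
{
    static void Main()
    {
        // Sample markup taken from the question.
        const string html =
            "<html><body><div align=\"left\">something</div></body></html>";

        CQ dom = CQ.Create(html);

        // Attribute-equals selector: every <div> whose align attribute is "left".
        CQ matches = dom["div[align='left']"];

        foreach (IDomObject element in matches)
        {
            // Render() returns the element, including its own tag, as markup.
            Console.WriteLine(element.Render());

            // GetAttribute() returns the value stored in the attribute.
            Console.WriteLine(element.GetAttribute("align"));
        }

        // Attribute-exists selector: any <div> that has an align attribute,
        // whatever its value. Attr() reads it from the first matched element.
        CQ withAlign = dom["div[align]"];
        Console.WriteLine(withAlign.Attr("align"));
    }
}
```

The second selector is useful when you only know the attribute name and want to read its value afterwards.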

Related

CsQuery remove comments

I'm fetching a URL and getting all of its content by calling:
CQ dom = ...;
string content = dom.Text();
I noticed that the Text() method also extracts HTML comments like:
<html>
<body>
<!-- This is comment - Ignore me -->
</body>
</html>
I'm looking for some option to remove all those comments from the result. Something like this:
dom["comment"].remove();
Is this possible?
Thanks
Found the solution.
The creation of the dom should be done like this:
CQ.Create(stream, Encoding.UTF8, HtmlParsingMode.Auto, HtmlParsingOptions.IgnoreComments);
HtmlParsingOptions.IgnoreComments was what I was looking for.

Any DOM parsers that do not modify the DOM?

I need to write a page, can use PHP or .NET, that will display the unmodified html for an element of another page.
The other page may not have valid HTML, but we want it to be returned unmodified. We will not be selecting based on the invalid elements, but will select their parent element and need them returned unmodified.
An example HTML page that my page will be fetching:
<body>
<div>
<p>test1</p>
<br>
<p>test2
<p>test3</p>
</div>
</body>
So far everything I have tried attempts to fix the HTML: it makes the br in the example self-closing and closes the second paragraph tag.
Is there anything out there that can do this?
Thanks!

How should I apply CSS selectors on a parsed HTML like this?

I'm trying to apply CSS selectors on HTML sources where each DOM element is parsed like this:
Destination URL, Text, (DOM Path in format \TagName, ClassName, Id, ChildRank)
For example:
\TagName=A&ClassName=&Id=&ChildRank=1
\TagName=LI&ClassName=&Id=&ChildRank=1
\TagName=UL&ClassName=&Id=&ChildRank=1
\TagName=DIV&ClassName=header-wrap&Id=&ChildRank=1
\TagName=DIV&ClassName=header-classified&Id=&ChildRank=1
\TagName=DIV&ClassName=header-container&Id=&ChildRank=3
\TagName=DIV&ClassName=main-container&Id=&ChildRank=1
\TagName=DIV&ClassName=ody-custom%20cobrand&Id=&ChildRank=2
\TagName=DIV&ClassName=ody-skin&Id=&ChildRank=10
\TagName=CENTER&ClassName=&Id=&ChildRank=2
\TagName=BODY&ClassName=&Id=&ChildRank=3
\TagName=HTML&ClassName=&Id=&ChildRank=0
I'd like to know whether there is an easy way to do this in C#: given a list of CSS selectors, I want to check whether any of them actually selects one of these DOM elements.
Sorry if the question seems unclear.

Extract content with XPath?

I have html content that I am storing as an XML document (using HTML Agility Pack). I know some XPath, but I am not able to zero into the exact content I need.
In my example below, I am trying to extract the "src" and "alt" text from the large image. This is my example:
<html>
<body>
....
<div id="large_image_display">
<img class="photo" src="images/KC0763_l.jpg" alt="Circles t-shirt - Navy" />
</div>
....
<div id="small_image_display">
<img class="photo" src="images/KC0763_s.jpg" alt="Circles t-shirt - Navy" />
</div>
</body>
</html>
What is the XPath to get "images/KC0763_l.jpg" and "Circles t-shirt - Navy"? This is how far I got but it is wrong. Mostly pseudo code at this point:
\\div[@class='large_image_display']\img[1][@class='photo']@src
\\div[@class='large_image_display']\img[1][@class='photo']@alt
Any help in getting this right would be greatly appreciated.
The following XPath will get you to the src attributes of the img tags:
'//html/body/div/img[@class="photo"]/@src'
And similarly this will get you to the alt attributes:
'//html/body/div/img[@class="photo"]/@alt'
From there you can get to the attribute text. If you only want the ones inside 'large_image_display', filter further like this:
'//html/body/div[@id="large_image_display"]/img[@class="photo"]/@src'
Use the following XPath expressions:
/html/body/div[@id='large_image_display']/img/@src
and
/html/body/div[@id='large_image_display']/img/@alt
Always try to avoid using the // abbreviation, because it may result in very inefficient evaluation (causes the whole (sub)tree to be scanned).
In this particular case we know that the html element is the top element of the document and we can simply select it by /html -- not //html.
Your major problem was that in your expressions you were using \ and \\ and there are no such operators in XPath. The correct XPath operators you were trying to use are / and the // abbreviation.
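Since the question mentions HTML Agility Pack, here is a minimal sketch of how the corrected path could be used from C#. It assumes the HtmlAgilityPack NuGet package is referenced, and it trims the markup from the question by dropping the elided "...." parts. Instead of selecting the attribute with /@src, it selects the img element and reads the attributes from the node, a common pattern with this library because SelectSingleNode hands back nodes rather than plain strings.

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Trimmed version of the markup from the question.
        const string html = @"<html><body>
<div id=""large_image_display"">
  <img class=""photo"" src=""images/KC0763_l.jpg"" alt=""Circles t-shirt - Navy"" />
</div>
<div id=""small_image_display"">
  <img class=""photo"" src=""images/KC0763_s.jpg"" alt=""Circles t-shirt - Navy"" />
</div>
</body></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Forward slashes and @ for attributes; the predicate on @id keeps
        // only the div that wraps the large image.
        HtmlNode img = doc.DocumentNode.SelectSingleNode(
            "/html/body/div[@id='large_image_display']/img");

        if (img != null)
        {
            // The second argument is the fallback used when the attribute is missing.
            string src = img.GetAttributeValue("src", "");
            string alt = img.GetAttributeValue("alt", "");

            Console.WriteLine(src);
            Console.WriteLine(alt);
        }
    }
}
```

SelectSingleNode returns null when nothing matches, hence the null check before reading the attributes.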

How to determine the CSS text of each node in HTML

How can I iterate over HTML nodes of a web page and get the CSS Text of each node in it? I need something like what Firebug is doing, if you click on a Node, it gives you complete list of all CSS Texts associated with that Node (even inherited styles).
My main problem is not actually iterating over HTML nodes. I am doing it with Html Agility Pack library. I just need to get complete CSS for each node.
p.s. I am sorry, I should have explained that I want to do this in C# (not javascript)
I found the following code snippet useful. It works for any element in the page, and each element's currentStyle property shows its computed style:
HTMLDocument doc = (HTMLDocument)axWebBrowser1.Document;
var body = (HTMLBody)doc.body;//current style
var childs = (IHTMLDOMChildrenCollection)body.childNodes;
var currentelementType = (HTMLBody)childs.item(0);
var width = currentelementType.currentStyle.width;
Note that, as in my previous post, axWebBrowser1 is a WebBrowser control.
If you want the current styles for an element, look into getComputedStyle(), but if you want the inheritance too then you may have to implement the style cascade. Firebug does quite a lot of work behind the scenes to generate what you see!
You can get the CSS text from the style attribute like this:
node.getAttribute('style')
Or, if you want the individual style properties, you can iterate through the keys and values in
node.style
If you want to grab the entire computed style of the element and not just the CSS applied in the style attribute, read this article on computed and cascaded styles.
You can use the WebBrowser control in C# to access the HTML document object and cast its body tag as follows:
HTMLDocument doc = (HTMLDocument)axWebBrowser1.Document;
var body = (HTMLBody)doc.body;
But before that, you should add a COM reference to MSHTML to your project.
From there you can access body.currentStyle, which shows all of its styles, whether they come from CSS or from inline styles.
You can try for (property in objName) operator as seen here.
I'm not sure you can simply get "all" CSS properties using JavaScript, to be honest. You could look into [DOMNode].currentStyle, [DOMNode].style and document.defaultView.getComputedStyle; they should contain the current style of the node. What you could then do is keep an array of all the CSS properties you want to test and loop them through a function of your own that reads each property using the aforementioned methods (depending on the browser). I usually try DOMNode.style[property] first, as this is the inline style and always overrides everything else; then I sniff whether the browser uses .currentStyle or .getComputedStyle and use the correct one.
It's not perfect and you might need to clean up some things (height: auto; instead of the actual current height, some browsers returning RGB colours instead of HEX, etc.).
So, yes, I don't know of anything prefab that you can use in JavaScript.
