Get links of a html document in order

I want to get all the links in an HTML document. Getting them isn't a problem, but apparently they are sorted alphabetically before being stored in the array one by one. I want the links in their original document order, not in alphabetical order.
So is there a way to get the first link found, store it, then the second one, and so on? I already tried HtmlAgilityPack and the WebBrowser control methods, but both order them alphabetically. The original order is important for later processing.
I heard that it might be possible with Regex, but I've found plenty of answers saying that you shouldn't use it for HTML parsing. So how can I do it?
Here's the WebBrowser control code I tried to use to get the links and store them in an array:
private void btnGet_Click(object sender, EventArgs e)
{
    HtmlWindow mainFrame = webFl.Document.Window.Frames["mainFrame"];
    HtmlElementCollection links = mainFrame.Document.Links;
    foreach (HtmlElement link in links)
    {
        string linkText = link.OuterHtml;
        if (linkText.Contains("puzzle"))
        {
            arr[i] = linkText;
            i++;
        }
    }
}
Thank you in advance,
Opak

You can get the correct order by walking the DOM tree using the HTML DOM API. The following code does this. Note that I use dynamic to access the DOM API. That's because WebBrowser's HtmlElement.FirstChild/HtmlElement.NextSibling don't work for this purpose, as they return null for DOM text nodes.
private void btnGet_Click(object sender, EventArgs e)
{
    Action<object> walkTheDom = null;
    var links = new List<object>();
    // element.FirstChild / NextSibling don't work as they stop at DOM text nodes
    walkTheDom = (element) =>
    {
        dynamic domElement = element;
        if (domElement.tagName == "A")
            links.Add(domElement);
        for (dynamic child = domElement.firstChild; child != null; child = child.nextSibling)
        {
            if (child.nodeType == 1) // Element node?
                walkTheDom(child);
        }
    };
    walkTheDom(this.webBrowser.Document.Body.DomElement);
    string html = links.Aggregate(String.Empty, (a, b) => a + ((dynamic)b).outerHtml + Environment.NewLine);
    MessageBox.Show(html);
}
[UPDATE] If you really need to get a list of HtmlElement objects for <A> tags, instead of dynamic native elements, that's still possible with a little trick using GetElementById:
private void btnGet_Click(object sender, EventArgs e)
{
    // element.FirstChild / NextSibling don't work because they stop on DOM text nodes
    var links = new List<HtmlElement>();
    var document = this.webBrowser.Document;
    dynamic domDocument = document.DomDocument;
    Action<dynamic> walkTheDom = null;
    walkTheDom = (domElement) =>
    {
        if (domElement.tagName == "A")
        {
            // get HtmlElement for the found <A> tag:
            // temporarily assign a unique id, look it up, then restore the old id
            string savedId = domElement.id;
            string uniqueId = domDocument.uniqueID;
            domElement.id = uniqueId;
            links.Add(document.GetElementById(uniqueId));
            // an element without an id may report null or an empty string
            if (!String.IsNullOrEmpty(savedId))
                domElement.id = savedId;
            else
                domElement.removeAttribute("id");
        }
        for (var child = domElement.firstChild; child != null; child = child.nextSibling)
        {
            if (child.nodeType == 1) // is an Element node?
                walkTheDom(child);
        }
    };
    // walk the DOM for <A> tags
    walkTheDom(domDocument.body);
    // show the found tags
    string combinedHtml = links.Aggregate(String.Empty, (html, element) => html + element.OuterHtml + Environment.NewLine);
    MessageBox.Show(combinedHtml);
}
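The traversal idea itself can be tried outside the WebBrowser control. The following is only a minimal sketch, assuming well-formed XHTML parsed with System.Xml.Linq rather than the MSHTML DOM used above; LinkOrderSketch and CollectHrefs are hypothetical names. It visits each element before its children and the children in sibling order, so the links come out in document order.

```csharp
using System;
using System.Collections.Generic;
using System.Xml.Linq;

// Hypothetical helper, not part of the answer above.
public static class LinkOrderSketch
{
    // Returns the href of every <a> element in document order.
    public static List<string> CollectHrefs(XElement root)
    {
        var hrefs = new List<string>();
        Walk(root, hrefs);
        return hrefs;
    }

    private static void Walk(XElement element, List<string> hrefs)
    {
        if (string.Equals(element.Name.LocalName, "a", StringComparison.OrdinalIgnoreCase))
        {
            // The cast yields null when the attribute is missing
            var href = (string)element.Attribute("href");
            if (href != null)
                hrefs.Add(href);
        }
        // Elements() skips text nodes, like the nodeType == 1 check above
        foreach (XElement child in element.Elements())
            Walk(child, hrefs);
    }
}
```

Real-world HTML is often not well-formed XML, so this only illustrates the walk; for a live page the DOM-based code above is still the way to go.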

Related

Cant get Innertexts from webpage using html agility pack xpath in c#

So this is my code.
I'm trying to get the text inside a span and store it locally. I'm using HTML Agility Pack and trying to retrieve the text with XPath, but the node selection doesn't return anything and comes back as null.
This is the page I'm trying to get the text from: https://siat.sat.gob.mx/app/qr/faces/pages/mobile/validadorqr.jsf?D1=10&D2=1&D3=15030267855_SDS150309FC7
Specifically the "Denominación o razón social" text.
namespace ObtencionDatosSatBeta
{
    public partial class Form1 : Form
    {
        DataTable table;
        public Form1()
        {
            InitializeComponent();
        }
        private void InitTable()
        {
            table = new DataTable("tabladedatosTable");
            table.Columns.Add("Variable", typeof(string));
            table.Columns.Add("Contenido", typeof(string));
            //table.Rows.Add("Super Mario 64", "84%");
            tabladedatos.DataSource = table;
        }
        private async void Form1_Load(object sender, EventArgs e)
        {
            InitTable();
            HtmlWeb web = new HtmlWeb();
            var doc = await Task.Factory.StartNew(() => web.Load("https://siat.sat.gob.mx/app/qr/faces/pages/mobile/validadorqr.jsf?D1=10&D2=1&D3=15030267855_SDS150309FC7"));
            var nodes = doc.DocumentNode.SelectNodes("//*[@id=\"ubicacionForm: j_idt12:0:j_idt13: j_idt17_data\"]//tr//td//span");
            var innerTexts = nodes.Select(node => node.InnerText);
        }
        private void tabladedatos_CellContentClick(object sender, DataGridViewCellEventArgs e)
        {
        }
    }
}
Any idea?
var nodes = doc.DocumentNode.SelectNodes("//*[@id=\"ubicacionForm: j_idt12:0:j_idt13: j_idt17_data\"]//tr//td//span");
The line of code above is the one that appears as null.

Use this XPath, which gets the first span under the element with the following ID: ubicacionForm:j_idt10:0:j_idt11:j_idt14_data
(//*[@id='ubicacionForm:j_idt10:0:j_idt11:j_idt14_data']//span)[1]
You can work out a selector in several ways by copying the HTML from Chrome DevTools (Ctrl + Shift + J on Windows/Linux, Cmd + Option + J on macOS)
and then pasting it into Xpather (Xpather.com), where you can play around with your XPath.
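To see how the `@id` predicate and the `(...)[1]` wrapper behave, here is a small sketch that runs the same kind of XPath against an inline snippet. It assumes well-formed markup and uses System.Xml.Linq with System.Xml.XPath instead of HTML Agility Pack so it runs without extra packages; XPathSketch, FirstSpanText and the sample id are made up for illustration.

```csharp
using System;
using System.Xml.Linq;
using System.Xml.XPath;

// Hypothetical helper for experimenting with the XPath shape used above.
public static class XPathSketch
{
    // Returns the text of the first <span> below the element with the given id,
    // or null when nothing matches. The id must not contain an apostrophe,
    // because it is pasted into a single-quoted XPath literal.
    public static string FirstSpanText(string xhtml, string id)
    {
        XDocument doc = XDocument.Parse(xhtml);
        // @id selects the id attribute; (...)[1] keeps the first match in document order
        XElement span = doc.XPathSelectElement("(//*[@id='" + id + "']//span)[1]");
        return span == null ? null : span.Value;
    }
}
```

The id has to match character for character, so a stray space inside it (as in the question's XPath) yields no match. With HTML Agility Pack the same expression string would be passed to SelectSingleNode or SelectNodes on doc.DocumentNode.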

Can't set TreeView.SelectedNode Property

I'm trying to set the selected node after clearing and refilling my TreeView. Here's the code I tried:
private TreeNode selectednode;
private void ElementTextChanged(object sender, EventArgs e) // saves changes to the XElements displayed in the textboxes
{
    BusinessLayer.ElementName = (sender as TextBox).Tag.ToString();
    string Type = (sender as TextBox).Name;
    string Value = (sender as TextBox).Text;
    if (TView_.SelectedNode != null)
    {
        selectednode = TView_.SelectedNode;
    }
    string NodePath = TView_.SelectedNode.FullPath.Replace("\\", "/");
    Telementchange.Stop();
    Telementchange.Interval = 2000;
    Telementchange.Tick += (object _sender, EventArgs _e) => {
        if (Type == "Value")
        {
            BusinessLayer.ChangeElementValue(NodePath, Value); // not sure this is the right way to call this
        }
        else
        {
            BusinessLayer.ChangeElementName(NodePath, Value);
            BusinessLayer.ElementName = Value;
        }
        FillTree(BusinessLayer.Doc);
        TView_.SelectedNode = selectednode; // this is the part that doesn't work!
        TView_.Select();
        Telementchange.Stop();
    };
    Telementchange.Start();
}
For some reason, the TView_.SelectedNode property is null after I set it.
Thank you for helping!

Looking at the code you show, you seem to do this:
1. Store the currently selected node in a variable
2. Clear and refill the TreeView
3. Select the stored node
This is bound to fail: after the refill, the stored node is no longer part of the TreeView's node collection unless you added that same instance again in the fill routine.
I don't think you do that.
If you want to re-select a node, you need to identify it in the new collection of nodes. If the Text is good enough for that, do a recursive TreeView search like the one in L.B's answer in this post (not the accepted answer, though!).
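A recursive search of that kind could look roughly like the sketch below. It is not L.B's code; it assumes the node is identified by its chain of Text values, and SketchNode is a hypothetical stand-in for System.Windows.Forms.TreeNode so the example runs without WinForms.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for TreeNode: a text plus child nodes.
public class SketchNode
{
    public string Text;
    public List<SketchNode> Nodes = new List<SketchNode>();
    public SketchNode(string text) { Text = text; }
}

public static class TreeSearchSketch
{
    // Follows pathParts level by level, starting at the given siblings.
    // Returns the matching node from the current tree, or null if the path is gone.
    public static SketchNode FindByPath(IEnumerable<SketchNode> siblings, string[] pathParts, int depth = 0)
    {
        if (depth >= pathParts.Length) return null;
        foreach (SketchNode node in siblings)
        {
            if (node.Text != pathParts[depth]) continue;
            if (depth == pathParts.Length - 1) return node;
            // Keep looking below this node; fall through to other siblings on a miss
            SketchNode hit = FindByPath(node.Nodes, pathParts, depth + 1);
            if (hit != null) return hit;
        }
        return null;
    }
}
```

With a real TreeView the idea would be to save SelectedNode.FullPath before the refill, split it on the TreeView's PathSeparator afterwards, run the same search starting from the TreeView's Nodes collection, and assign the result to SelectedNode.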

I couldn't solve my problem by setting the SelectedNode property, so I made a workaround.
private void RefreshTreeView()
{
    FillTree(BusinessLayer.Doc);
    TView_.SelectedNode = _selectednode;
    ExpandToPath(TView_.TopNode, _selectedPath);
}
void ExpandToPath(TreeNode relativeRoot, string path)
{
    char delimiter = '\\';
    List<string> elements = path.Split(delimiter).ToList();
    elements.RemoveAt(0); // drop the part that names relativeRoot itself
    relativeRoot.Expand();
    if (elements.Count == 0)
    {
        // end of the path reached: this is the node to select
        TView_.SelectedNode = relativeRoot;
        return;
    }
    foreach (TreeNode node in relativeRoot.Nodes)
    {
        if (node.Text == elements[0])
        {
            ExpandToPath(node, string.Join(delimiter.ToString(), elements));
        }
    }
}

How to invoke element in a WebBrowser by the class name(s)?

I'm trying to make a simple Facebook client. One of the features should allow the user to post content on the homepage/his profile.
It logs the user in (works fine, all of the elements have ids on Facebook) and then inserts the data into the corresponding field (works fine as well), but then it needs to click the "Post" button. However, this button doesn't have an id. It only has a class.
<li><button value="1" class="_42ft _4jy0 _11b _4jy3 _4jy1 selected _51sy" data-ft="{&quot;tn&quot;:&quot;+{&quot;}" type="submit">Posten</button></li>
('Posten' is 'Post' in German.)
I've been looking around the internet for a few hours now and tried different solutions. My current solution is to search for the item by its inner content ("Posten") and then invoke it. It doesn't work: it inserts the text but doesn't click the button. Here's the code:
private void webBrowser1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    if (postHomepage)
    {
        webBrowser1.Document.GetElementById("u_0_z").SetAttribute("value", metroTextBox1.Text);
        GetButtonByInnerText("Posten").InvokeMember("click");
        postHomepage = false;
    }
}
HtmlElement GetButtonByInnerText(string SearchString)
{
    String data = webBrowser1.DocumentText;
    //Is the string contained in the website
    int indexOfText = data.IndexOf(SearchString);
    if (indexOfText < 0)
    {
        return null;
    }
    data = data.Remove(indexOfText); //Remove all text after the found text
    //These strings are a list of website elements
    //NOTE: These need to be updated with ALL elements from list such as:
    // http://www.w3.org/TR/REC-html40/index/elements.html
    string[] strings = { "<button" };
    //Split the string with these elements.
    //subtract 2 because -1 for index -1 for elements being 1 bigger than wanted
    int index = (data.Split(strings, StringSplitOptions.None).Length - 2);
    HtmlElement item = webBrowser1.Document.All[index];
    //If the element is a div (which contains the search string)
    //we actually need the next item.
    if (item.OuterHtml.Trim().ToLower().StartsWith("<li"))
        item = webBrowser1.Document.All[index + 1];
    //txtDebug.Text = index.ToString();
    return item;
}
(This is a quick solution which I edited for my use, not very clean).
What's wrong here?

It does not look like your GetButtonByInnerText() method is searching for the button element correctly.
Here is a simple replacement for you to try:
HtmlElement GetButtonByInnerText(string SearchString)
{
    // return the first element whose inner text matches exactly
    foreach (HtmlElement el in webBrowser1.Document.All)
        if (el.InnerText == SearchString)
            return el;
    return null; // nothing matched
}
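Since the question title asks about class names, matching on the class attribute is another option. The sketch below only covers the string-matching part and assumes the class list is available as a plain string; ClassMatchSketch and HasAllClasses are hypothetical names.

```csharp
using System;
using System.Linq;

// Hypothetical helper: checks a space-separated class attribute for required class names.
public static class ClassMatchSketch
{
    public static bool HasAllClasses(string classAttribute, params string[] required)
    {
        if (string.IsNullOrWhiteSpace(classAttribute)) return false;
        // A null separator array makes Split break on any whitespace
        string[] tokens = classAttribute.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        // Compare whole tokens so "_4" does not match "_42ft"
        return required.All(r => tokens.Contains(r));
    }
}
```

In the WebBrowser control this would presumably be fed from a loop over webBrowser1.Document.GetElementsByTagName("button"), passing el.GetAttribute("className") for each element and calling InvokeMember("click") on the first match. Generated class names like those in the question can change between page versions, so treat any hard-coded list as fragile.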

How to compare previous selected node with current selected node on asp.net treeview

I want to compare the last selected node with the currently selected node of the TreeView using JavaScript.
Please suggest some code samples for comparing the last selection with the current selection in the TreeView.
If both selections are the same node, we need to deselect that node.
Thanks. Please help with this.
I have resolved it with server-side code:
protected void TreeView1_PreRender(object sender, EventArgs e)
{
    if (TreeView1.SelectedNode != null)
    {
        if (!string.IsNullOrEmpty(ADUtility.treenodevalue))
        {
            if (ADUtility.treenodevalue == TreeView1.SelectedNode.ValuePath)
            {
                TreeView1.SelectedNode.Selected = false;
            }
            else
            {
                ADUtility.treenodevalue = TreeView1.SelectedNode.ValuePath;
            }
        }
        else
        {
            ADUtility.treenodevalue = TreeView1.SelectedNode.ValuePath;
        }
    }
}

I am just giving you pseudocode for this; after that you can implement it on your own.
Make two global variables, CurrentselectedNode and PreviousselectedNode,
and make an ArrayList of nodes:
Arraylist<Object> nodeCollection;
var PreviousselectedNode;
var CurrentselectedNode;
if (nodeCollection.Current != null)
{
    PreviousselectedNode = nodeCollection.Current;
    var tempselectedItem = Products_Data.selectedNodeID.value;
    CurrentselectedNode = document.getElementById(tempselectedItem);
    // Here do what you want with the current node and the previous node
    nodeCollection.Add(CurrentselectedNode);
}
else
{
    var tempselectedItem = Products_Data.selectedNodeID.value;
    var tempselectedNode = document.getElementById(tempselectedItem);
    nodeCollection.Add(tempselectedNode);
}

C# - XmlNodeList - Getting inner xml/text between description tags without HTML

Right now I've got a list box that shows RSS article titles/urls of an RSS feed. The title and URL extraction were no problem, but now I'm trying to have the description appear in a rich text box whenever the article title is selected in the list box. I can successfully get the description to show up in the text box, but it's always followed by a bunch of extra html. Example:
There's a silly rumor exploding on the Internet this weekend, alleging that Facebook is shutting down on March 15 because CEO Mark Zuckerberg "wants his old life back," and desires to "put an end to all the madness."<div class="feedflare">
<img src="http://feeds.feedburner.com/~ff/rss/cnn_topstories?d=yIl2AUoC8zA" border="0"></img> <img src="http://feeds.feedburner.com/~ff/rss/cnn_topstories?d=7Q72WNTAKBA" border="0"></img> <img src="http://feeds.feedburner.com/~ff/rss/cnn_topstories?i=at7OdUE16Y0:jsXll_RkIzI:V_sGLiPBpWU" border="0"></img> <img src="http://feeds.feedburner.com/~ff/rss/cnn_topstories?d=qj6IDK7rITs" border="0"></img> <img src="http://feeds.feedburner.com/~ff/rss/cnn_topstories?i=at7OdUE16Y0:jsXll_RkIzI:gIN9vFwOqvQ" border="0"></img>
Code:
private void button1_Click(object sender, EventArgs e)
{
    XmlTextReader rssReader = new XmlTextReader(txtUrl.Text);
    XmlDocument rssDoc = new XmlDocument();
    rssDoc.Load(rssReader);
    XmlNodeList titleList = rssDoc.GetElementsByTagName("title");
    XmlNodeList urlList = rssDoc.GetElementsByTagName("link");
    descList = rssDoc.GetElementsByTagName("description");
    for (int i = 0; i < titleList.Count; i++)
    {
        lvi = rowNews.Items.Add(titleList[i].InnerXml);
        lvi.SubItems.Add(urlList[i].InnerXml);
    }
}
private void rowNews_SelectedIndexChanged(object sender, EventArgs e)
{
    if (rowNews.SelectedIndices.Count <= 0)
    {
        return;
    }
    int intselectedindex = rowNews.SelectedIndices[0]; // Get index of article title
    // Get description array index that matched list index
    txtDesc.Text = (descList[intselectedindex].InnerText);
}

You can strip the HTML using the approach from "Using C# regular expressions to remove HTML tags".
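A rough sketch of that approach follows. It assumes the description text contains simple markup like the feedflare block in the question; DescriptionTextSketch and StripTags are hypothetical names, and a regex like this is only a best-effort cleanup, not an HTML parser.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper for cleaning up an RSS description before display.
public static class DescriptionTextSketch
{
    public static string StripTags(string html)
    {
        if (string.IsNullOrEmpty(html)) return string.Empty;
        // Replace anything that looks like a tag with a space so words don't run together
        string withoutTags = Regex.Replace(html, "<[^>]+>", " ");
        // Turn entities such as &amp; back into plain characters
        string decoded = WebUtility.HtmlDecode(withoutTags);
        // Collapse the whitespace left behind by the removed tags
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }
}
```

In the handler from the question this would be used as txtDesc.Text = DescriptionTextSketch.StripTags(descList[intselectedindex].InnerText);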

You can use InnerText instead of InnerHtml. This will only get the content of your child nodes without any markup.
