Can I parse HTML tables by giving only the column names?
That is, only the data under the column names I specify should be extracted from the table.
For example, I have a table with the columns serial no., name, address, phone no and total Rs.
I want to extract only the name, phone no and total Rs. columns. How can I do that?
Take a look at Html Agility Pack. It provides a LINQ API for searching HTML content.
Yes, you can. You can use XPath to scan your HTML document (google for "screen scraping").
Another technique is to use a UI testing framework such as WatiN, which lets you use CSS selectors and more to find elements on an HTML page and get their contents.
You can also use Data Extracting SDK, which has an HtmlProcessor class with a Tables property that exposes HTML tables as DataTable objects.
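The column-name filtering asked about above can be sketched as follows. This is a minimal, language-neutral illustration in Python using only the standard library, not the Html Agility Pack API; the same steps (read the header row, find the index of each wanted column, pick those cells from every data row) carry over to an XPath or LINQ query in C#. The names TableRows, select_columns and the sample data are made up for the example, and it assumes a single, non-nested table whose first row is the header.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell text of every <tr> as a list of strings."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # row being built, or None outside <tr>
        self._cell = None   # text pieces of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def select_columns(html, wanted):
    """Return one dict per data row, keeping only the wanted columns."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    header, *body = parser.rows  # assumption: the first row holds the column names
    index = {name: header.index(name) for name in wanted}
    return [{name: row[i] for name, i in index.items()} for row in body]


# Hypothetical sample data, shaped like the table in the question.
SAMPLE = """
<table>
  <tr><th>serial no.</th><th>name</th><th>address</th><th>phone no</th><th>total Rs.</th></tr>
  <tr><td>1</td><td>Asha</td><td>12 Hill Rd</td><td>555-0101</td><td>1200</td></tr>
  <tr><td>2</td><td>Ravi</td><td>9 Lake St</td><td>555-0102</td><td>850</td></tr>
</table>
"""

records = select_columns(SAMPLE, ["name", "phone no", "total Rs."])
```

A column name that is not in the header raises ValueError here, which is probably what you want while developing; real pages may also need trimming or case-insensitive matching of header text.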
Related
Here is my HTML document from Belarc Advisor. I want to read some specific data from it, such as Operating System, Drives and Memory, into a DataTable.
Note: I have done similar tasks before, such as reading Excel files into a DataGridView, but I don't know how to read specific data from an HTML document.
So is it possible to read only specific data from an HTML document?
You might want to take a look at: https://html-agility-pack.net/?z=codeplex
Also, the default XmlDocument class should be able to parse the HTML, provided the document is well-formed XML (XHTML); it will throw on typical loosely written HTML.
Or regular expressions are an option: http://mahmoud-alam.blogspot.com/2008/02/some-times-we-need-to-extract.html
What that means is you have to load the page and then look for the datatable you would like to extract.
How do I search through HTML entities in Lucene.Net?
All the text in my index is stored as numeric HTML entities, so if I search for "34", for example, the result comes back as &#<b>34</b>;
I would also like to know how to search across different fields with different words, as in SQL. For example, for the search phrase "word1 word2":
SELECT * FROM table WHERE
title LIKE 'word1%' OR title LIKE 'word2%' OR
description LIKE 'word1%' OR description LIKE 'word2%'
It comes down to how you store it. When you store your document, it appears you're storing your HTML and searching on it.
I recommend that you have two separate fields:
One stores the raw HTML, but it is not analyzed (there's no need to search on the markup, is there?)
One contains the HTML that is processed for searching. This field is not stored but it is analyzed.
In order to populate the second field, you should run the HTML through something like HTML Agility Pack to get the inner text of the HTML nodes you're storing/processing, and then run that text through the HttpUtility.HtmlDecode method to get the text that the HTML entities represent which you can actually analyze and search on.
Then, you can search on the analyzed field for whatever you wish without doing anything special, and then retrieve the content from the field that stores the raw HTML.
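As a rough illustration of the decoding step, here is a minimal sketch in Python, where html.unescape plays the role that HttpUtility.HtmlDecode plays in .NET. It assumes the tags have already been removed (that is, you are starting from the inner text), and the function name make_fields and the keys "raw" and "searchable" are hypothetical stand-ins for the two Lucene fields described above.

```python
import html


def make_fields(raw_text):
    """Build the two values: the raw text to store and the decoded text to analyze."""
    return {
        "raw": raw_text,                        # stored, not analyzed
        "searchable": html.unescape(raw_text),  # analyzed, not stored
    }


fields = make_fields("caf&#233; &#34;menu&#34; &amp; prices")
```

After decoding, the searchable value no longer contains the digits of the entity codes, so a search for "34" cannot match inside &#34; any more.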
With regard to wildcard searches, they are supported; you just have to build your query appropriately (assuming you are using a QueryParser). Trailing wildcards such as word1* work out of the box, but note that leading wildcards (a term that starts with * or ?) are not enabled by default.
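For the multi-field part of the question, one possible approach is to build a query string in the classic Lucene query syntax and hand it to the QueryParser. The sketch below (Python, only string handling, no Lucene calls) shows the shape of such a query for the SQL example; build_prefix_query is a hypothetical helper, and it assumes the words contain no characters that are special in the query syntax and would need escaping.

```python
def build_prefix_query(phrase, fields):
    """Return a query string matching any word of phrase as a prefix in any field."""
    clauses = [
        f"{field}:{word}*"           # field:term* is a trailing-wildcard (prefix) clause
        for field in fields
        for word in phrase.split()
    ]
    return " OR ".join(clauses)


query = build_prefix_query("word1 word2", ["title", "description"])
```

The resulting string mirrors the LIKE 'word1%' conditions in the question, one clause per field and word.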
I am really new to C# programming and would like some help, if possible. There is a shopping website with data such as products, prices and descriptions. Since the website has a search capability, I would like to get data from it by querying the search link and keeping only the important data (product ID, name, price and description). When I perform a search I get many pages, and every time I press "next" I get a new page with a further list of products. How can I automate these tasks in a simple way?
I searched a lot on the internet and found that I need to use WebClient with regular expressions, and I thought that a loop over the page content and over the search result pages might be necessary.
What do you think?
Website Example.
I'll appreciate any effort on your part.
What you're describing is called scraping.
What you'll want is to use something like HtmlAgilityPack to get the website. Then you find the nodes you're interested in by using the DOM, and reading their inner text.
The whole process is rather complicated, but at least I've sent you off in the right direction. For the most part, search urls tend to have the same format.
In your link for instance
http://cdon.se/hemelektronik/advanced-search?manufacturer-id=&title=.&title-matchtype=1&genre-id=&page-size=15&sort-order=142&page=2
You can change 'page' to something else, and you can go through all the pages that way.
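Stepping through the pages by changing that parameter might look like the sketch below. It is written in Python with the standard library only and just builds the URLs; downloading and parsing each one is left out. The helper name page_url is made up, and how many pages exist is something you would have to detect yourself (for example, stop when a page returns no products).

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def page_url(url, page):
    """Return url with its 'page' query parameter set to the given number."""
    parts = urlsplit(url)
    # keep_blank_values preserves empty parameters such as manufacturer-id=
    query = parse_qsl(parts.query, keep_blank_values=True)
    query = [(key, str(page) if key == "page" else value) for key, value in query]
    return urlunsplit(parts._replace(query=urlencode(query)))


BASE = ("http://cdon.se/hemelektronik/advanced-search?manufacturer-id=&title=."
        "&title-matchtype=1&genre-id=&page-size=15&sort-order=142&page=2")

# URLs for pages 1 to 3; each one would then be downloaded and parsed in turn.
urls = [page_url(BASE, n) for n in range(1, 4)]
```

Only the parameter named exactly page is changed; page-size and the other parameters are passed through untouched.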
Added:
Also, don't try to use regex to parse HTML. It drove one particular person mad...
RegEx match open tags except XHTML self-contained tags
I want to take all the values from this link
http://economictimes.indiatimes.com/indices/nifty_50_companies.cms
and put the company name and LTP of those NIFTY 50 companies into a SQL table.
Please help me with some pointers.
I want to use C# and ASP.NET :)
Use HTML Agility Pack to get the values from the HTML page.
Find tutorials from here on how to use HTML Agility Pack.
Searcharoo has a class called HtmlDocument. You can split the downloaded HTML content (use System.Net.WebClient to download the document from the URL) into meta title, meta description, meta keywords and content (words).
You can find the Searcharoo download here (it is open source):
http://www.searcharoo.net/SearcharooV7/
Assuming you already have a table in SQL,
You may do it like this:
Download the page
Parse its content
Bulk Insert the values in the table
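The three steps above can be sketched end to end as follows. This is a Python stand-in rather than C#/ASP.NET code: the download step is replaced by a hard-coded sample so the sketch runs offline, sqlite3 stands in for your SQL database, and the table name nifty50, the class CellCollector, the function load_quotes and the company names are all invented for the example. The real page's markup will differ, so the parsing step in particular is an assumption.

```python
import sqlite3
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect <td> text, grouped per <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            self._in_td = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False

    def handle_data(self, data):
        if self._in_td and self.rows:
            self.rows[-1].append(data.strip())


def load_quotes(page_html, conn):
    """Parse (company, LTP) pairs out of page_html and bulk insert them."""
    parser = CellCollector()
    parser.feed(page_html)
    parser.close()
    # Header rows contain only <th> cells, so they end up empty and are skipped.
    pairs = [(r[0], float(r[1])) for r in parser.rows if len(r) == 2]
    conn.executemany("INSERT INTO nifty50 (company, ltp) VALUES (?, ?)", pairs)
    conn.commit()
    return len(pairs)


# Step 1 (download) is replaced by this made-up sample page.
PAGE = """
<table>
  <tr><th>Company</th><th>LTP</th></tr>
  <tr><td>Example Bank</td><td>1520.35</td></tr>
  <tr><td>Sample Motors</td><td>412.10</td></tr>
</table>
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nifty50 (company TEXT, ltp REAL)")
inserted = load_quotes(PAGE, conn)
stored = conn.execute("SELECT company, ltp FROM nifty50 ORDER BY company").fetchall()
```

Passing all rows to one executemany call with parameter placeholders keeps the insert in a single batch and avoids building SQL strings by hand.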
In my database MYDB I have a table called MYTABLE with a column called Description. I am saving a long description in there that contains multiple HTML tags.
How can I return the values without all the HTML tags?
Is this even possible? What would be the best way of doing it: in the SQL statement or in the code-behind? And how would I do it?
See following
Best way to strip html tags from a string in sql server?
http://blog.sqlauthority.com/2007/06/16/sql-server-udf-user-defined-function-to-strip-html-parse-html-no-regular-expression/
You are going to have to use some form of HTML parsing. If it is for basic HTML parsing then some form of Regex should suffice. However, for more advanced parsing you should look at something like HtmlAgilityPack. I use this to parse emails and I must say it works pretty well.
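To give an idea of the parser-based route done in code rather than in the SQL statement, here is a minimal sketch in Python using the standard library's HTMLParser. It is not HtmlAgilityPack, but the idea is the same: walk the document and keep only the text nodes. The names TextOnly and strip_tags and the sample string are made up for the example.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Keep text nodes and drop every tag."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(markup):
    """Return markup with tags removed and runs of whitespace collapsed."""
    parser = TextOnly()
    parser.feed(markup)
    parser.close()
    return " ".join("".join(parser.parts).split())


plain = strip_tags("<p>A <b>long</b> description with <a href='#'>HTML</a> &amp; tags.</p>")
```

Two caveats with this simple version: tags such as <br> add no whitespace of their own, so words on either side of them can run together, and the contents of script or style elements are kept as text unless you filter them out.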