Lucene.Net. How to search through HTML entities - c#

How do I search through HTML entities in Lucene.Net?
Everything in my index is stored as numeric HTML entities, so if I search for "34", for example, the match lands inside an entity: &#<b>34</b>;
I would also like to know how to search different fields for different words, as in SQL. For example, for the search phrase "word1 word2":
SELECT * FROM table WHERE
title LIKE 'word1%' OR title LIKE 'word2%' OR
description LIKE 'word1%' OR description LIKE 'word2%'

It comes down to how you store it. When you store your document, it appears you're storing your HTML and searching on it.
I recommend that you have two separate fields:
One stores the raw HTML but is not analyzed (there's no need to search on the markup, is there?)
One contains the HTML that is processed for searching. This field is not stored but it is analyzed.
In order to populate the second field, you should run the HTML through something like HTML Agility Pack to get the inner text of the HTML nodes you're storing/processing, and then run that text through the HttpUtility.HtmlDecode method to get the text that the HTML entities represent which you can actually analyze and search on.
Then, you can search on the analyzed field for whatever you wish without doing anything special, and then retrieve the content from the field that stores the raw HTML.
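The decoding step can be sketched as follows. This is a minimal illustration, not the answerer's code: it uses WebUtility.HtmlDecode from System.Net (a close relative of the HttpUtility.HtmlDecode method mentioned above that needs no System.Web reference), and the class and method names are made up for the example. It assumes the markup has already been stripped, for instance with HTML Agility Pack.

```csharp
using System.Net;

// Hypothetical helper: the names are illustrative, not part of Lucene.Net.
public static class EntityText
{
    // Converts entity-encoded text (e.g. "&#34;") into the characters it
    // represents, so the analyzer sees real text instead of digits.
    public static string ToSearchableText(string encodedText)
    {
        return WebUtility.HtmlDecode(encodedText);
    }
}
```

The decoded string is what would go into the analyzed, non-stored field; the original entity-encoded HTML goes into the stored field untouched.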
As for wildcard searches, they are supported; you just have to build your query appropriately (assuming you are using a QueryParser). Note that leading wildcards (for example *word) are not enabled by default.
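As a rough equivalent of the SQL in the question, the query text handed to a QueryParser could be assembled like this. It is only a sketch: PrefixQuery.Build is a hypothetical helper, it relies on the standard Lucene query syntax (field:term* for a prefix match, clauses combined with OR by default), and it assumes the words contain no characters that are special to the query syntax.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper that only builds the query string; parsing and
// searching are left to Lucene.Net.
public static class PrefixQuery
{
    // For every field and every word, emit a "field:word*" prefix clause.
    // Assumption: words contain no query-syntax characters needing escaping.
    public static string Build(IEnumerable<string> fields, string phrase)
    {
        string[] words = phrase.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        var clauses = new List<string>();
        foreach (string field in fields)
        {
            foreach (string word in words)
            {
                clauses.Add(field + ":" + word + "*");
            }
        }
        return string.Join(" ", clauses);
    }
}
```

Calling PrefixQuery.Build(new[] { "title", "description" }, "word1 word2") gives one prefix clause per field and word, which mirrors the four LIKE conditions above.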

Related

C# implement raven db full text search by the part of word

I have a grid and I need to support full text search. Matching on the start or end of a word is not enough; I also need to match on any part of a word. For example, if I have "MyWord", a search for "wor" should find it. If I try to use string.Contains() I get the following error:
Contains is not supported, doing a substring match over a text field is a very slow operation, and is not allowed using the Linq API.
The recommended method is to use full text search (mark the field as Analyzed and use the Search() method to query it.
If I build a RavenDB index and mark the field as Analyzed, Contains still does not work. StartsWith() and EndsWith() work, but Contains does not. Using .Search() I get the same results. Another option is to use Lucene syntax:
.Where("Name:*partOfWord*")
This works fine, but I don't want to mix LINQ with Lucene syntax; I want to solve it using RavenDB indexes.
Do you have any ideas on how to implement full text search in RavenDB using indexes?
You want to be using an NGram analyzer, as described here. It's an analyzer you can add to your RavenDB server by dropping its DLL in the Analyzers folder.
You really don't want to do any *substr Lucene queries ("ending with" clauses, that is), because the performance is terrible. The inconsistency in coding style is a lesser problem.
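To see why an n-gram analyzer makes "wor" findable, here is a small stand-alone sketch of what n-gram tokenization produces. It is not the RavenDB or Lucene analyzer itself; NGramDemo is a hypothetical class that only imitates the idea of indexing every substring of a given length.

```csharp
using System;
using System.Collections.Generic;

// Illustration only: imitates what an n-gram tokenizer emits for one term.
public static class NGramDemo
{
    // Returns every lower-cased substring of `term` whose length is
    // between minSize and maxSize (inclusive).
    public static List<string> NGrams(string term, int minSize, int maxSize)
    {
        string lower = term.ToLowerInvariant();
        var grams = new List<string>();
        for (int size = minSize; size <= maxSize; size++)
        {
            for (int start = 0; start + size <= lower.Length; start++)
            {
                grams.Add(lower.Substring(start, size));
            }
        }
        return grams;
    }
}
```

With NGramDemo.NGrams("MyWord", 3, 3) the tokens are "myw", "ywo", "wor" and "ord", so a plain term query for "wor" hits an indexed token and no wildcard scan is needed.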
I use this query to search for people's full names by typing just part of the name. It is recommended to enforce a minimum length for the search string.
.Search(x => x.Name, "word to search" + "*", escapeQueryOptions: EscapeQueryOptions.AllowPostfixWildcard)

Query a website and retrieve public data from it

I am really new to C# programming and would appreciate some help. I have a website (a shopping website) with data: products, prices, descriptions, etc. The website has a search capability, so I would like to get data from it by querying the search link and keeping only the important fields (product id, name, price and description). When I perform the search I get many pages, and every time I press next I get a new page with a further list of products. How can I automate these tasks?
I searched the internet a lot and found that I need to use WebClient with regular expressions, and I thought that a loop over the page content and over the search result pages might be necessary.
What do you think?
Website Example.
I'll appreciate any effort from your side.
What you're describing is called scraping.
What you'll want is to use something like HtmlAgilityPack to get the website. Then you find the nodes you're interested in by using the DOM, and reading their inner text.
The whole process is rather complicated, but at least I've sent you off in the right direction. For the most part, search urls tend to have the same format.
In your link for instance
http://cdon.se/hemelektronik/advanced-search?manufacturer-id=&title=.&title-matchtype=1&genre-id=&page-size=15&sort-order=142&page=2
You can change 'page' to something else and go through all the pages that way.
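Walking the result pages can be sketched like this. The base URL is the one from the question with the page parameter removed; SearchPages and its methods are hypothetical names, and downloading and parsing each page (for example with HtmlAgilityPack) is deliberately left out.

```csharp
using System.Collections.Generic;

// Hypothetical helper that only generates the page URLs to visit.
public static class SearchPages
{
    // Search URL from the question, without the trailing page parameter.
    private const string BaseUrl =
        "http://cdon.se/hemelektronik/advanced-search?manufacturer-id=&title=.&title-matchtype=1&genre-id=&page-size=15&sort-order=142";

    // Builds the URL of a single result page.
    public static string PageUrl(int page)
    {
        return BaseUrl + "&page=" + page;
    }

    // Lazily yields the URLs for firstPage..lastPage (inclusive).
    public static IEnumerable<string> PageUrls(int firstPage, int lastPage)
    {
        for (int page = firstPage; page <= lastPage; page++)
        {
            yield return PageUrl(page);
        }
    }
}
```

A scraper would loop over SearchPages.PageUrls(1, n), fetch each URL, and stop when a page comes back with no products; how many pages exist is something the site itself has to tell you.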
Added:
Also, don't try to use regex to parse HTML. It drove one particular person mad...
RegEx match open tags except XHTML self-contained tags

ASP.NET MVC / C# - String to valid URL characters?

I don't know how to ask this, and I don't know what it is called either, so I'll just describe what I want to achieve.
In the database, some article titles originally contain spaces:
my title with spaces
But in the url, spaces are replaced by other characters such as plus sign (+) or underscore (_)
http://www.mydomain.com/mycontroller/myaction/my_title_with_spaces
or
http://www.mydomain.com/mycontroller/myaction/my+title+with+spaces
Now, how do you do that in C#? Or is there a helper in ASP.NET MVC that can do something like that?
Let's say we have achieved the said URL. Is there any risk that two unique titles become the same in the URL? Please consider these titles:
Title's
Titles
After parsing, they become the same:
Titles
Titles
This will be a problem when retrieving the article from the database, since I'll get two results, one for "Titles" and one for "Title's".
I would implement that functionality like this:
1. When creating a new article, generate the URL representation based on the title.
Use a function that converts the title for a suitable representation.
For example, the title "This is an example" might generate something like "This_is_an_example".
This is up to you. You can write a function that parses the title with rules you define, or use an existing one if it suits your problem better.
2. Ensure the URL representation is unique
If it's going to be an ID, it must be unique. So, when creating new articles you must query your database for the resulting URL representation. If you get a result from the database, it means the newly created article generated the same representation as one of the already created articles. Add something to it so it remains unique.
This could be something like "This_is_an_example_2". In this case, we added the "_2" to the end of the generated representation so it differs from the already existing one. Once more, with each change you have to ensure this representation remains unique.
3. Save the created ID in the database, along with the article data
In the database be sure to save the "This_is_an_example" ID and relate it to the article. Maybe even as the table primary key?
4. Query the database for the correct article
Now, about showing a site visitor the correct article:
When a visitor asks for the following resource, for example:
http://www.mydomain.com/mycontroller/myaction/this_is_an_example_2
Extract the URL part that identifies the article, in this case "this_is_an_example_2".
When you have that, you have the identifier of the article in the database. So, you can query the database for the article with the "this_is_an_example_2" ID and show the article's content to the user.
This might involve some URL rewriting. Unfortunately I'm unable to help you with that in ASP.NET. Some searching on the subject will surely help you.
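Steps 1 and 2 above could look roughly like this in C#. It is one possible set of rules, not a standard: SlugHelper is a hypothetical class, the rule "drop punctuation, turn whitespace into underscores" is just the example used in this answer, and the in-memory set stands in for the database lookup.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper illustrating steps 1 and 2.
public static class SlugHelper
{
    // Step 1: build the URL representation from the title.
    public static string ToSlug(string title)
    {
        // Drop everything except letters, digits and whitespace.
        string cleaned = Regex.Replace(title, @"[^\p{L}\p{Nd}\s]", "");
        // Collapse each run of whitespace into a single underscore.
        return Regex.Replace(cleaned.Trim(), @"\s+", "_");
    }

    // Step 2: append _2, _3, ... until the slug is not already taken.
    // `existing` stands in for a query against the articles table.
    public static string MakeUnique(string slug, ISet<string> existing)
    {
        if (!existing.Contains(slug))
        {
            return slug;
        }
        int suffix = 2;
        while (existing.Contains(slug + "_" + suffix))
        {
            suffix++;
        }
        return slug + "_" + suffix;
    }
}
```

With these rules both "Title's" and "Titles" produce "Titles", which is exactly the collision from the question; MakeUnique then turns the second one into "Titles_2".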

Sharepoint Custom Search Results Page using FullTextSQLQuery

I'm trying to create a customized search that displays results based on my FullTextSQLQuery results (i.e. the user types 'Foo' and clicks Search, and my server-side code performs a FullTextSQLQuery that brings back PDF documents containing 'Foo' in their text).
My question is: what do I need to do after getting the results from my query in order to display them to the user? Do I need to provide my own results aspx page, or does SharePoint have something out of the box that I can pass my results to?
I'm not aware of anything OOTB, but this is a simple matter of transforming the XML results into HTML using an XSL.
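A transformation of that kind can be sketched with XslCompiledTransform from System.Xml.Xsl. Everything specific here is an assumption made for illustration: the Results/Result/Title/Url element names are invented and are not SharePoint's actual result schema, and ResultsRenderer is a hypothetical class.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Xsl;

// Hypothetical renderer; the XML shape it expects is assumed, not SharePoint's.
public static class ResultsRenderer
{
    // Stylesheet that turns each <Result> into a list item with a link.
    private const string Stylesheet = @"<?xml version='1.0'?>
<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
  <xsl:output method='html' />
  <xsl:template match='/Results'>
    <ul>
      <xsl:for-each select='Result'>
        <li><a href='{Url}'><xsl:value-of select='Title' /></a></li>
      </xsl:for-each>
    </ul>
  </xsl:template>
</xsl:stylesheet>";

    // Applies the stylesheet to the results XML and returns the HTML.
    public static string ToHtml(string resultsXml)
    {
        var xslt = new XslCompiledTransform();
        using (XmlReader styleReader = XmlReader.Create(new StringReader(Stylesheet)))
        {
            xslt.Load(styleReader);
        }
        using (XmlReader input = XmlReader.Create(new StringReader(resultsXml)))
        using (var output = new StringWriter())
        {
            xslt.Transform(input, null, output);
            return output.ToString();
        }
    }
}
```

The returned HTML could then be written into a literal control on your own results page; the stylesheet would need to be adapted to whatever XML your query results are actually serialized to.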

Parsing or Extracting the content of html table

Can I parse HTML tables by giving only the column names?
That is, only the data under the column names I give should be extracted from the table.
For example, I have a table with columns like serial no., name, address, phone no, total Rs.
I want to extract only the name, phone no and total Rs. values. How can I do it?
Take a look at Html Agility Pack. It provides a LINQ API for searching HTML content.
Yes you can. You can use XPath to scan your HTML document (google for screen scraping).
Another technique is to use a UI testing framework like WatiN, which lets you use CSS selectors and more to find elements on an HTML page and get their contents.
You can use Data Extracting SDK which has HtmlProcessor class with Tables property which handles HTML tables as DataTable objects.
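The general idea behind all of these (find the header row, note the positions of the columns you want, then read those positions from every data row) can be sketched with nothing but System.Xml.Linq. This only works under a strong assumption: the table must be well-formed XHTML, which real-world pages often are not, so for those a tolerant parser such as Html Agility Pack is the better fit. TableExtractor is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper; assumes the table markup is well-formed XHTML
// and that the first row holds the column names.
public static class TableExtractor
{
    public static List<Dictionary<string, string>> Extract(string xhtmlTable, params string[] wanted)
    {
        var records = new List<Dictionary<string, string>>();
        List<XElement> rows = XElement.Parse(xhtmlTable).Descendants("tr").ToList();
        if (rows.Count == 0)
        {
            return records;
        }

        // Header cells (th or td) of the first row, in column order.
        List<string> headers = rows[0].Elements().Select(c => c.Value.Trim()).ToList();

        // Position of each wanted column, or -1 when the table lacks it.
        Dictionary<string, int> positions = wanted.ToDictionary(
            name => name,
            name => headers.FindIndex(h => string.Equals(h, name, StringComparison.OrdinalIgnoreCase)));

        foreach (XElement row in rows.Skip(1))
        {
            List<string> cells = row.Elements("td").Select(c => c.Value.Trim()).ToList();
            var record = new Dictionary<string, string>();
            foreach (KeyValuePair<string, int> column in positions)
            {
                if (column.Value >= 0 && column.Value < cells.Count)
                {
                    record[column.Key] = cells[column.Value];
                }
            }
            records.Add(record);
        }
        return records;
    }
}
```

Calling TableExtractor.Extract(tableMarkup, "name", "phone no", "total Rs.") returns one dictionary per data row containing only those three columns.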
