Is there any way to do full-text search on Redis with C#?
Yes, see http://playnice.ly/blog/2010/05/05/a-fast-fuzzy-full-text-index-using-redis/.
The code is about 100 lines of Python and can be ported to C#.
The code uses the metaphone Python library; you can find C# implementations online, such as this one: http://code.google.com/p/doublemetaphone/
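A minimal sketch of the general idea, in Python like the linked post. The post itself is not reproduced here, so this assumes each word is reduced to a phonetic key and the ids of the items containing it are kept in one set per key. Plain Python sets stand in for Redis sets, and `phonetic_key` is a crude hypothetical stand-in, not the real metaphone algorithm.

```python
import re
from collections import defaultdict


def phonetic_key(word):
    """Crude stand-in for metaphone (an assumption, NOT the real algorithm):
    lowercase, keep the first letter, drop later vowels, and skip a letter
    that equals the last letter kept."""
    letters = re.sub(r"[^a-z]", "", word.lower())
    if not letters:
        return ""
    key = letters[0]
    for ch in letters[1:]:
        if ch in "aeiouy":
            continue
        if ch != key[-1]:
            key += ch
    return key


class FuzzyIndex:
    """In-memory stand-in for Redis: one set of item ids per phonetic key."""

    def __init__(self):
        self.sets = defaultdict(set)

    def add(self, item_id, text):
        for word in text.split():
            key = phonetic_key(word)
            if key:
                self.sets[key].add(item_id)  # with Redis: SADD <key> <id>

    def search(self, query):
        keys = [k for k in map(phonetic_key, query.split()) if k]
        if not keys:
            return set()
        # With Redis this intersection would be a single SINTER call.
        result = set(self.sets.get(keys[0], set()))
        for key in keys[1:]:
            result &= self.sets.get(key, set())
        return result
```

In a C# port the two set operations would map to whatever set-add and set-intersect calls your Redis client exposes, and `phonetic_key` would be replaced by a real double metaphone implementation.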
RediSearch implements a search engine on top of Redis.
This also enables more advanced features, like exact phrase matching and numeric filtering for text queries, that are not possible or efficient with traditional Redis search approaches.
RediSearch has client libraries for most of the leading programming languages, including C#; see: https://libraries.io/nuget/NRediSearch
Related
Is there any way to implement the functionality of this SQL 2008 function into a C# library? I need a parser that is able to take a string, parse it and show me the noise words, exact matches, and inflectional forms - based on this I am trying to build a kind of rank for the text (used for ordering the results of a search)
No, not directly. You can either implement custom natural language processing yourself (at least 10 person-years of work) or use third-party tools and libraries, for example Lucene.NET.
The Problem:
I need a good free library or algorithm to determine whether a text is related to a search pattern or not. The search pattern can be an ordered or unordered list of words.
For some searches the order is relevant, for some it is not. Additionally I need the ability to define aliases for searched words (e.g. "(C#|C sharp) code").
I doubt that there is a free or cheap C# library meeting all my requirements.
Which libraries/algorithms would you use to implement that functionality?
I'm grateful for any tips.
EDIT:
I need this to filter search results from multiple specialized search services. The resulting program must be VERY strict, so false negatives are no problem. False positives should be avoided (as far as possible).
For free, start with the built-in Regex class in the System.Text.RegularExpressions namespace:
http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regex.aspx
More sophisticated search is unlikely to come for free (cf. Google Search Appliance or similar).
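A minimal sketch of how far plain regular expressions get you with the requirements in the question, written in Python for brevity (the same patterns work with the .NET Regex class). The function names and the convention that aliases are written as `(a|b)` are assumptions taken from the example in the question.

```python
import re


def term_regex(term):
    """Turn one search term such as '(C#|C sharp)' or 'code' into a regex.
    Aliases are separated by '|' inside parentheses."""
    if term.startswith("(") and term.endswith(")"):
        aliases = term[1:-1].split("|")
    else:
        aliases = [term]
    alternatives = "|".join(re.escape(a.strip()) for a in aliases)
    # Lookarounds instead of \b, so that terms ending in '#' still match
    # and 'code' does not match inside 'encode'.
    return r"(?<!\w)(?:" + alternatives + r")(?!\w)"


def matches(text, terms, ordered):
    """Strict match: every term must be present; if ordered, in that order."""
    parts = [term_regex(t) for t in terms]
    if ordered:
        pattern = ".*?".join(parts)
        return re.search(pattern, text, re.IGNORECASE | re.DOTALL) is not None
    return all(re.search(p, text, re.IGNORECASE) for p in parts)
```

For example, `matches("code in C#", ["(C#|C sharp)", "code"], ordered=True)` is false because the order is wrong, while the same call with `ordered=False` is true. Requiring whole-word matches for every term keeps the filter on the strict side, at the cost of more false negatives.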
Is there any C# library which can detect the language of a particular piece of text? i.e. for an input text "This is a sentence", it should detect the language as "English". Or for "Esto es una sentencia" it should detect the language as "Spanish".
I understand that language detection from text is not a deterministic problem. But both Google Translate and Bing Translator have an "Auto detect" option, which best-guesses the input language. Is there something similar available publicly, preferably in C#?
Yes indeed, TextCat is very good for language identification. And it has a lot of implementations in different languages.
There were no .NET ports, so I wrote one: NTextCat (NuGet, Online Demo).
It is a pure .NET Standard 2.0 DLL plus a command-line interface to it. By default, it uses a profile of 14 languages.
Any feedback is much appreciated! New ideas and feature requests are welcome too :)
Language detection is a pretty hard thing to do.
Some languages are much easier to detect than others simply due to the diacritics and digraphs/trigraphs used. For example, double-acute accents are used almost exclusively in Hungarian. The dotless i ‘ı’ is used almost exclusively in Turkish and closely related Turkic languages such as Azerbaijani, t-comma (not t-cedilla) is used only in Romanian, and the eszett ‘ß’ occurs only in German.
Some digraphs, trigraphs and tetragraphs are also a good giveaway. For example, you'll most likely find ‘eeuw’ and ‘ieuw’ primarily in Dutch, and ‘tsch’ and ‘dsch’ primarily in German, etc.
More giveaways would include common words or common prefixes/suffixes used in a particular language. Sometimes even the punctuation that is used can help determine a language (quote-style and use, etc).
If such a library exists I would like to know about it, since I'm working on one myself.
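The giveaways described above can be turned into a first, very rough filter. The sketch below is in Python for brevity and uses only the examples named in this answer; the table, the function name and the simple hit count are assumptions, and a real detector would need far more hints per language.

```python
import unicodedata

# Giveaway characters and letter groups taken from the examples above.
HINTS = {
    "Hungarian": ["\u0151", "\u0171"],       # o and u with double acute
    "Turkish": ["\u0131"],                   # dotless i
    "Romanian": ["\u021b"],                  # t with comma below
    "German": ["\u00df", "tsch", "dsch"],    # eszett and two tetragraphs
    "Dutch": ["eeuw", "ieuw"],
}


def guess_by_hints(text):
    """Return the languages whose hints occur in the text, most hits first.
    An empty list means that none of the hints were found."""
    # NFC so that letters typed as base letter + combining mark still match.
    lowered = unicodedata.normalize("NFC", text).lower()
    scores = {}
    for language, hints in HINTS.items():
        hits = sum(lowered.count(h) for h in hints)
        if hits:
            scores[language] = hits
    return sorted(scores, key=scores.get, reverse=True)
```

Because most short texts contain none of these hints, this only works as a shortcut in front of a statistical method, not as a detector on its own.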
You can find a C# implementation based on trigram analysis here:
http://idsyst.hu/development/language_detector.html
Here is a simple detector based on bigram statistics. Basically, you learn from a large training set which bigrams occur most frequently in each language, then count the bigrams in a piece of text and compare them against those learned frequencies:
http://allantech.blogspot.com/2007/07/automatic-language-detection.html
This is probably good enough for many (most?) applications and doesn't require Internet access.
Of course it will perform worse than Google's or Bing's algorithms (which themselves aren't great). If you need excellent detection accuracy, you would have to do a lot of hard work over huge amounts of data.
The other option would be to leverage Google's or Bing's APIs if your app has Internet access.
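The bigram approach described above fits in a few lines. This is a sketch in Python for brevity, not the code from the linked post; the two training sentences are toy placeholders (an assumption), and a usable detector needs a large corpus per language.

```python
from collections import Counter

# Toy training data, for illustration only.
SAMPLES = {
    "English": "this is a sentence and that is the other one "
               "with the words of the language",
    "Spanish": "esto es una frase y esta es la otra con las "
               "palabras de la lengua española",
}


def bigrams(text):
    """Lowercase, replace non-letters with spaces, return all 2-char windows."""
    letters = "".join(c if c.isalpha() else " " for c in text.lower())
    cleaned = " ".join(letters.split())
    return [cleaned[i:i + 2] for i in range(len(cleaned) - 1)]


def train(samples):
    """Learn, per language, the relative frequency of each bigram."""
    profiles = {}
    for language, text in samples.items():
        counts = Counter(bigrams(text))
        total = sum(counts.values())
        profiles[language] = {bg: n / total for bg, n in counts.items()}
    return profiles


def detect(text, profiles):
    """Sum the learned frequency of every bigram in the text, per language,
    and return the best language together with all scores."""
    grams = bigrams(text)
    scores = {language: sum(profile.get(bg, 0.0) for bg in grams)
              for language, profile in profiles.items()}
    return max(scores, key=scores.get), scores
```

With these toy profiles, `detect("This is a sentence", train(SAMPLES))` picks English and `detect("Esto es una sentencia", train(SAMPLES))` picks Spanish. The profiles can be computed once and shipped as data, which is why no Internet access is needed at run time.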
You'll want a machine learning algorithm based on hidden Markov models: train it by processing a bunch of texts in different languages.
Then, when it gets an unidentified text, the language whose model gives the closest 'score' is the winner.
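As a simpler stand-in for the hidden Markov approach suggested above, here is a sketch using a plain character-level Markov chain: one model per language, trained on sample text, with the unidentified text assigned to the model that gives it the highest log-likelihood. It is in Python for brevity; the class name and the `alphabet_size` smoothing constant are assumptions.

```python
import math
from collections import Counter, defaultdict


class CharMarkovModel:
    """First-order Markov chain over characters with add-one smoothing."""

    def __init__(self, training_text, alphabet_size=32):
        # alphabet_size is an assumed constant, used only for smoothing.
        self.alphabet_size = alphabet_size
        self.transitions = defaultdict(Counter)
        text = training_text.lower()
        for prev, cur in zip(text, text[1:]):
            self.transitions[prev][cur] += 1

    def log_likelihood(self, text):
        """Sum of log P(cur | prev) over all adjacent character pairs."""
        text = text.lower()
        total = 0.0
        for prev, cur in zip(text, text[1:]):
            row = self.transitions.get(prev, Counter())
            seen = sum(row.values())
            total += math.log((row[cur] + 1) / (seen + self.alphabet_size))
        return total


def classify(text, models):
    """Return the language whose model scores the text highest."""
    return max(models, key=lambda lang: models[lang].log_likelihood(text))
```

Here `models` is a dictionary mapping a language name to a `CharMarkovModel` trained on text in that language. Scores are only comparable between models for the same input text, since longer texts always get lower log-likelihoods.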
There is a simple tool to identify text language:
http://www.detectlanguage.com/
I've found that "textcat" is very useful for this. I've used a PHP implementation, PHP Text Cat, based on this original implementation, and found it reliable. If you have a look at the sources, you'll find it's not a terrifyingly difficult thing to implement in the language of your choice. The hard work -- the letter combinations that are relevant to a particular language -- is all in there as data.
Having just finished writing a Regex replacement and match function and TVF for SQLCLR for the fifth time, I sat and pondered whether there was a set of common community extensions for SQLCLR for the most common things you want in a database but are never provided.
PowerShell, for example, has an excellent set of community extensions that cover a plethora of additional functionality not included in the box. I wouldn't use PowerShell without it.
I thought maybe SQLCLR had something similar. I'm looking for things like:
Regular expression support (isMatch, Replace, Match)
Base64 encode/decode support
String formatting (datetimes, byte arrays, ints, floats and decimals, etc.)
Hashing, encryption with arbitrary algorithms (I know SQL 2k5, 2k8 support some basic stuff but no SHA2? What is up with that?)
Common additional aggregations; OR bits, AND bits, cat strings (String.Join)
Compression/decompression
Does anyone know of a library that has common routine functionality like this that we all write over and over again?
Peter, take a look at SQL#, a SQLCLR assembly created by Solomon Rutzky that comes in two versions: a free edition and a paid version. You will find that the free edition includes a number of the items you mentioned above.
http://www.sqlsharp.com/
SplitString() and many others could be added. Perhaps we could create a SqlClrContrib site, where people can post their ideas to an open source project and we can add such functions?
I have a website that has over 400,000 items. Some are similar, some vastly different. We want to provide the best possible way to search these items. When the website was delivered to us, it was using full-text indexing. That solution is basic at best, woefully inadequate at worst.
So what is the best way to search these items? They are stored in a SQL Server Database (2005). Our website is designed in C# 2.0.
Currently here is the process:
1. User enters a value into a text box.
2. We 'clean' this entry by removing 'scary' characters that could be an attempted hack, and by removing keywords (and, or, etc.).
3. Pass the value into a stored procedure to return results.
4. Return the results.
Look at Lucene.NET. I think it's a vast improvement over full-text search in SQL Server.
SQL Server Central has a nice article on creating a Google-like full-text search using SQL Server. Unfortunately you have to register to view the full article, but registration is free and they post a lot of good information. Here is the link:
http://www.sqlservercentral.com/articles/Full-Text+Search+(2008)/64248/
Excerpt:
...
Google Style
The key to a successful application is to make it easy to use but powerful. Google has done this with their Web search engine. The syntax for queries is simple and intuitive, but full-featured. Though the basic building blocks of a Google query are simple you can combine them in powerful ways. I'll begin with basic Google query syntax and add some additional operators to take advantage of the power of SQL Server CONTAINS predicate syntax. The full Google syntax is defined in the Google Help:Cheat Sheet at http://www.google.com/help/cheatsheet.html.
...
The article has full example code and even a link to download it. It's an interesting read even if you don't plan on implementing it.
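To give a feel for the idea without reproducing the article's code, here is a sketch that turns a small Google-style query into a search condition for the CONTAINS predicate. It is in Python for brevity, and the supported operators (quoted phrases and a leading minus for exclusion) as well as the function name are assumptions, not the article's grammar.

```python
import re

# Either a (possibly negated) quoted phrase or a (possibly negated) word.
TOKEN = re.compile(r'(-?)"([^"]+)"|(-?)(\S+)')


def to_contains_condition(query):
    """Build a CONTAINS search condition such as
    '"sql server" AND "index" AND NOT "oracle"'.
    Returns None when the query has no positive term, because the
    condition cannot consist of NOT terms alone."""
    include, exclude = [], []
    for m in TOKEN.finditer(query):
        negated = (m.group(1) or m.group(3)) == "-"
        term = (m.group(2) or m.group(4)).replace('"', "").strip()
        if not term:
            continue
        (exclude if negated else include).append('"%s"' % term)
    if not include:
        return None
    condition = " AND ".join(include)
    for term in exclude:
        condition += " AND NOT " + term
    return condition
```

The resulting string should be passed to the stored procedure as a parameter and used as the second argument of CONTAINS, rather than concatenated into the SQL text. That also removes most of the need to strip 'scary' characters by hand.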
You can have a look at Lucene.NET; it will minimize the calls to the database for search queries.
The following is from http://incubator.apache.org/lucene.net/:
Lucene.Net is a source code, class-per-class, API-per-API and algorithmatic port of the Java Lucene search engine to the C# and .NET platform utilizing Microsoft .NET Framework.
Lucene.Net sticks to the APIs and classes used in the original Java implementation of Lucene. The API names as well as class names are preserved with the intention of giving Lucene.Net the look and feel of the C# language and the .NET Framework. For example, the method Hits.length() in the Java implementation now reads Hits.Length() in the C# port.
In addition to the APIs and classes port to C#, the algorithm of Java Lucene is ported to C# Lucene. This means an index created with Java Lucene is back-and-forth compatible with the C# Lucene; both at reading, writing and updating. In fact a Lucene index can be concurrently searched and updated using Java Lucene and C# Lucene processes.
You could use Google site search to deliver your search results. It doesn't always give you the flexibility to display the results as you want, but for many sites it is good enough.
The second step is quite controversial: what do you consider 'scary'? If you use SQL Server's built-in full-text search, then instead of manually removing keywords from the input query you can set up lists of noise/stop words inside SQL Server.
Here is one feature I want to see here on StackOverflow as well as on any other site that provides search functionality:
give more priority (weight) to some fields of your documents
(in case of stackoverflow - search should prioritize topic title)
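A minimal sketch of such field weighting, in Python for brevity. The field names, the weights and the plain term-count scoring are all assumptions chosen for illustration; engines such as Lucene expose the same idea as per-field boosts.

```python
# Assumed weights: a hit in the title counts three times as much as in the body.
FIELD_WEIGHTS = {"title": 3.0, "tags": 2.0, "body": 1.0}


def score(document, query, weights=FIELD_WEIGHTS):
    """Weighted count of query terms across the document's fields."""
    terms = query.lower().split()
    total = 0.0
    for field, weight in weights.items():
        words = document.get(field, "").lower().split()
        total += weight * sum(words.count(term) for term in terms)
    return total


def search(documents, query):
    """Return the matching documents, highest weighted score first."""
    scored = [(score(doc, query), doc) for doc in documents]
    hits = [(s, doc) for s, doc in scored if s > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in hits]
```

With these weights, a document that mentions the query term once in its title outranks one that mentions it once or twice in its body.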
Also consider using a third-party solution for FTS such as Lucene or Sphinx; they can provide a much better user experience than the built-in functionality.
Some advantages of 3rd party FTS components are: reduced database load, better relevance of search results, better indexing speed, smaller size of database.