How do I change how Solr scores a document? - c#

I'm using SolrNet to map Solr index documents and results to classes, and I use the server for a desktop search application. What I need from Solr is to pass a query string and get back a list of documents with two details: the unique id of each document and its score.
But the score I want is not the score that Solr calculates by itself. I need a score that reflects only the frequency of that string in the document (in other words, the hit count in that document). How do I change how Solr scores documents so that the score generated for each document is either equal to or proportional to the hit count?

Have you looked at function queries? Specifically, termfreq may be helpful for you.
http://wiki.apache.org/solr/FunctionQuery#termfreq
You can sort by termfreq alone using http://solrurl/?q=myterm&sort=termfreq(text,'myterm') desc
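As a rough sketch of how that request could be assembled from C#, the helper below builds the query string by hand. The class name, the select URL, and the "freq" alias are illustrative assumptions, not part of SolrNet. It also assumes a Solr version that accepts function queries as pseudo-fields in fl, and a single plain token as the term (termfreq counts one indexed term, not a phrase).

```csharp
using System;

// Hypothetical helper: builds a Solr select URL that sorts by raw term frequency
// and asks Solr to return that frequency next to the document id.
public static class SolrTermFreqUrl
{
    public static string Build(string selectUrl, string field, string term)
    {
        // termfreq(field,'term') is the function query named in the answer above.
        string function = "termfreq(" + field + ",'" + term + "')";

        return selectUrl
            + "?q=" + Uri.EscapeDataString(field + ":" + term)
            // "freq" is an assumed alias for the returned pseudo-field.
            + "&fl=" + Uri.EscapeDataString("id,freq:" + function)
            + "&sort=" + Uri.EscapeDataString(function + " desc");
    }
}
```

Calling SolrTermFreqUrl.Build("http://localhost:8983/solr/docs/select", "text", "myterm") yields a URL whose results carry the hit count in the "freq" field, so the application can read it directly instead of relying on the relevance score.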

Related

How to get Total document count from Firestore

How to get the total document count from Firestore in Unity c#?
In the below picture is my FireStore DB. I want to know two things.
I want to get the total count of documents. How do I get the total count of Documents from the collection "users" in unity C#?
How to filter based on the school. And get the name of the person in unity C#?
For the total count, you have at least 2 choices:
a) Retrieve all documents and count them. This is simple but will cost you as many reads as there are documents (not viable if you have many documents!)
b) Keep a counter in a separate document and increment/decrement it on each document creation/deletion. This will cost you some writes but only 1 read to get the count. It is a bit more complex to set up; just make sure the document creation/deletion and the increment/decrement are done in the same batch operation to avoid inconsistencies in case of errors.
To filter by school, perform a simple query such as collection("users").where("school","==", "XXX").get(). That is the JavaScript form; in the Unity C# SDK the equivalent is db.Collection("users").WhereEqualTo("school", "XXX").GetSnapshotAsync(), where db is assumed to be your FirebaseFirestore instance. You can then read the name field from each returned document.

Elasticsearch query to search only words mentioned more than 3 times in document

Is it possible to search in Elasticsearch for documents in which a certain word occurs more than n times?
For example, I want to match "Bill Gates" only if "Bill Gates" is present more than 3 times in a document. I have a field named "content" that stores the full text of the story. Within this field I want to search for "Bill Gates" and return only those documents in which "Bill Gates" occurs 3 or more times.
I need to build a query that will be generated dynamically, based on the number of occurrences the user wants to require for specific words within the content field.
Also, is there any Lucene syntax available to do that?
I am using Elasticsearch version 6.2.
Any help would be greatly appreciated.

How can I deal with slow performance on Contains query in Entity Framework / MS-SQL?

I'm building a proof of concept data analysis app, using C# & Entity Framework. Part of this app is calculating TF*IDF scores, which means getting a count of documents that contain every word.
I have a SQL query (to a remote database with about 2,000 rows) wrapped in a foreach loop:
idf = db.globalsets.Count(t => t.text.Contains("myword"));
Depending on my dataset, this loop would run 50-1,000+ times for a single report. On a sample set where it only has to run about 50 times, it takes nearly a minute, so about 1 second per query. So I'll need much faster performance to continue.
Is 1 second per query slow for an MSSQL contains query on a remote machine?
What paths could be used to dramatically improve that? Should I look at upgrading the web host the database is on? Running the queries async? Running the queries ahead of time and storing the result in a table (I'm assuming a WHERE = query would be much faster than a CONTAINS query?)
You can do much better than full text search in this case, by making use of your local machine to store the idf scores, and writing back to the database once the calculation is complete. There aren't enough words in all the languages of the world for you to run out of RAM:
1. Create a dictionary Dictionary<string,int> documentFrequency
2. Load each document in the database in turn, and split into words, then apply stemming. Then, for each distinct stem in the document, add 1 to the value in the documentFrequency dictionary.
3. Once all documents are processed this way, write the document frequencies back to the database.
Calculating a tf-idf for a given term in a given document can now be done just by:
1. Loading the document.
2. Counting the number of instances of the term.
3. Loading the correct idf score from the idf table in the database.
4. Doing the tf-idf calculation.
This should be thousands of times faster than your original, and hundreds of times faster than full-text-search.
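The steps above can be sketched in memory roughly as follows. This is a minimal illustration, not the answerer's code: the class and method names are made up, lower-casing stands in for a real stemmer, and tf * log(N / df) is just one common tf-idf variant. Loading documents from the database and writing the frequencies back are left out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical sketch of the in-memory document-frequency approach.
public static class TfIdfSketch
{
    // Stand-in for "split into words, then apply stemming":
    // lower-cases the text and extracts runs of word characters.
    public static List<string> Tokenize(string text)
    {
        return Regex.Matches(text.ToLowerInvariant(), @"\w+")
            .Cast<Match>()
            .Select(m => m.Value)
            .ToList();
    }

    // For each distinct word in each document, add 1 to its document frequency.
    public static Dictionary<string, int> BuildDocumentFrequency(IEnumerable<string> documents)
    {
        var documentFrequency = new Dictionary<string, int>();
        foreach (string document in documents)
        {
            foreach (string word in Tokenize(document).Distinct())
            {
                int count;
                documentFrequency.TryGetValue(word, out count); // count stays 0 if absent
                documentFrequency[word] = count + 1;
            }
        }
        return documentFrequency;
    }

    // tf-idf for one term in one document, using the variant tf * ln(N / df).
    public static double TfIdf(string document, string term,
        Dictionary<string, int> documentFrequency, int totalDocuments)
    {
        string normalized = term.ToLowerInvariant();
        int termFrequency = Tokenize(document).Count(w => w == normalized);

        int df;
        if (!documentFrequency.TryGetValue(normalized, out df) || df == 0)
        {
            return 0.0; // term never seen in the corpus
        }
        return termFrequency * Math.Log((double)totalDocuments / df);
    }
}
```

With about 2,000 rows, the whole corpus is read once to build the dictionary, after which each lookup is a dictionary access rather than a remote Contains query.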
As others have recommended, I think you should implement that query on the db side. Take a look at this article about SQL Server Full Text Search, that should be the way to solve your problem.
Running a Contains query in a loop is an extremely bad idea. It kills performance and puts heavy load on the database. You should change your approach, and I strongly suggest you create Full Text Search indexes and query against them. You can retrieve the texts of the records that match your query strings.
select t.Id, t.SampleColumn from containstable(Student, SampleColumn, 'word or sampleword') C
inner join Student t ON C.[KEY] = t.Id
Perform just one query, combine the words you are searching for with operators (or, and, etc.), and retrieve the matched texts. Then you can calculate the TF-IDF scores in memory.
Streaming the texts from SQL Server into memory might still take a while, but it is a better option than running N Contains queries in a loop.

Indexed databases in Random Access Memory

I'm currently writing a small test web application for a jobs search system.
I have a table Vacancies (the main table to talk about).
I need to make a rapid AJAX update of vacancies (in a suggest list below the input control) that match a user query. Different DBMSs provide powerful extensions such as Full-Text Search in Microsoft SQL Server, but I think that scanning a physical file takes plenty of time. My idea is to transfer the whole Vacancies table into RAM, which in my view makes sense because retrieving data from memory takes less time.
So if a client types in a textbox something like "pro" - the suggest list shows up with suggestions:
-product manager
-professional designer
-programmer
-programmer C#
-programmer Java
-property administrator
-provision expert
when a user types another letter "g", the value of a textbox widens to "prog"
and the list is refreshed:
-programmer
-programmer C#
-programmer Java
To make that possible I plan to create a tree index with values saved in its nodes, where a vacancy prefix plays the role of the index key and the node values are the vacancy names. The index is built and populated only once with data from the data table. See what I mean below:
"pro" -> {
"product manager",
"professional designer",
"programmer",
"programmer C#",
"programmer Java",
"property administrator",
"provision expert"
}
So the index builder must analyze the list of strings and find the shortest prefixes of the vacancy names.
Then, when the builder finds a string with one more letter after a previously found prefix, it creates a child tree node ("prog") and attaches it to the parent node ("pro"). The number of values in each child node shrinks, because the list is filtered further at every level.
"prog" -> {
"programmer",
"programmer C#",
"programmer Java"
}
Can you advise me on the types of tree indexes that naturally fit to solve this problem?
What's the best of them by the seek time?
Thanks
This problem was solved years ago, you are recreating Lucene:
For what it's worth the type of tree you want is a Patricia Tree or a Radix Tree. In terms of storing all data in RAM, this is a bad idea because there are other applications that use RAM not just your index. Currently I am ripping out someone's custom database that they thought was a good idea to implement this way and replacing it with a real database solution.
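For illustration only, here is a minimal sketch of the prefix lookup the question describes. It is a plain, uncompressed trie rather than a true Patricia or radix tree (which would merge single-child chains into one edge), and the class and member names are invented for this example.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical uncompressed prefix tree for suggest-as-you-type lookups.
public sealed class PrefixTree
{
    private sealed class Node
    {
        // Sorted children so suggestions come back in alphabetical order.
        public readonly SortedDictionary<char, Node> Children = new SortedDictionary<char, Node>();
        // Non-null when a complete entry ends at this node; keeps the original casing.
        public string Value;
    }

    private readonly Node root = new Node();

    // Inserts an entry; keys are lower-cased so lookups are case-insensitive.
    public void Add(string entry)
    {
        Node node = root;
        foreach (char c in entry.ToLowerInvariant())
        {
            Node next;
            if (!node.Children.TryGetValue(c, out next))
            {
                next = new Node();
                node.Children[c] = next;
            }
            node = next;
        }
        node.Value = entry;
    }

    // Returns every stored entry that starts with the given prefix.
    public List<string> Suggest(string prefix)
    {
        var results = new List<string>();
        Node node = root;
        foreach (char c in prefix.ToLowerInvariant())
        {
            Node next;
            if (!node.Children.TryGetValue(c, out next))
            {
                return results; // no entry has this prefix
            }
            node = next;
        }
        Collect(node, results);
        return results;
    }

    // Depth-first walk that gathers all entries below a node.
    private static void Collect(Node node, List<string> results)
    {
        if (node.Value != null)
        {
            results.Add(node.Value);
        }
        foreach (Node child in node.Children.Values)
        {
            Collect(child, results);
        }
    }
}
```

After adding the seven vacancy names from the question, Suggest("pro") returns all of them and Suggest("prog") returns only the three "programmer" entries. Lookup cost depends on the prefix length plus the number of matches, not on the size of the table.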

Retrieving metadata from xml files

I need to retrieve metadata from multiple XML files. The structure of each XML file is the following:
<songs>
  <song_title> some title </song_title>
  <classification> some classification </classification>
  <song_text> some text </song_text>
  <singer>
    <sing> singer's name </sing>
    <gender> gender </gender>
    <bornYear> year </bornYear>
    <livePlace> live place </livePlace>
    <liveArea> live area </liveArea>
  </singer>
</songs>
The user chooses the search criteria - live place or live area. Then he enters the name of the place or area, that he searches for. I need to find and display links to songs, which have in its metadata the place or area, that user has entered. I am using .NET 3.5
This answer is more of a pointer...
You can use LINQ to XML to accomplish this task.
What Is LINQ to XML?
LINQ to XML is a LINQ-enabled, in-memory XML programming interface that enables you to work with XML from within the .NET Framework programming languages.
LINQ to XML is like the Document Object Model (DOM) in that it brings the XML document into memory. You can query and modify the document, and after you modify it you can save it to a file or serialize it and send it over the Internet. However, LINQ to XML differs from DOM: It provides a new object model that is lighter weight and easier to work with, and that takes advantage of language improvements in Visual C# 2008.
You can then search and manipulate any XML document element using LINQ query expressions like the following example:
IEnumerable<XElement> partNos =
from item in purchaseOrder.Descendants("Item")
where (int) item.Element("Quantity") *
(decimal) item.Element("USPrice") > 100
orderby (string)item.Element("PartNumber")
select item;
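Applied to the song files from the question, a query might look like the sketch below. The class and method names are made up for illustration, it returns song titles rather than links (how a title maps to a link depends on the application), and it assumes values should be compared after trimming whitespace and ignoring case.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper: finds song titles whose singer matches a place or area.
public static class SongSearch
{
    // criterion is the element name under <singer>: "livePlace" or "liveArea".
    public static List<string> FindTitles(XDocument document, string criterion, string value)
    {
        return document.Descendants("songs")
            .Where(song =>
            {
                XElement singer = song.Element("singer");
                XElement field = singer == null ? null : singer.Element(criterion);
                // Trim because the sample file pads element values with spaces.
                return field != null
                    && string.Equals(field.Value.Trim(), value.Trim(), StringComparison.OrdinalIgnoreCase);
            })
            .Select(song => ((string)song.Element("song_title") ?? string.Empty).Trim())
            .ToList();
    }
}
```

Each file would be loaded with XDocument.Load(path) and passed to FindTitles together with the criterion the user picked and the text they entered.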
You can use XPath to easily get whatever you want if you have an aversion to LINQ.
http://msdn.microsoft.com/en-us/library/ms256086%28VS.85%29.aspx
node.SelectNodes("songs[singer/livePlace='California']")
This gets all songs nodes that have a singer node with a livePlace node whose value is California. XPath is case-sensitive, so the element names must match the file exactly.
