Lucene as data store

Lucene as data store - c#

Is it possible to use Lucene as full fledged data store (like other(mongo,couch) nosql variants).
I know there are some limitations like newly updated documents by one indexer will not be shown in other indexer. So we need to restart the indexer to get the updates.
But i stumble upon solr lately, it seems these problems are avoided by some kind of snapshot replication.
So i thought i could use lucene as a data store since this also uses same kind of documents(JSON based) used by mongo and couch internally to manage documents, and its proven indexing algorithm fetches the records super fast.
But i am curious has anybody tried that before..? if not what are reasons not choosing this approach.

There is also the problem of durability. While a Lucene index should not get corrupted ever, I've seen it happen. And the approach Lucene takes to repairing a broken index is "throw it away and rebuild from the original data". Which makes perfect sense for an indexing tool. But it does require you to have the data stored somewhere else.

I've only worked with Solr, the Lucene derivative (and I would recommend using Solr to just about anyone) so my opinion may be a little biased but it should be possible to use Solr as a datastore yes, however it wouldn't be very useful without something more permanent in the background.
The problem you may encounter is that entering data into Solr does not guarantee you will get it back when you expect it. Baring the use of pretty strict faceting you may encounter problems retrieving your data simply because the indexer has decided to lump your results in a certain way.
I've experimented a little with this approach but the only real benefit I saw was in situations where you want the search index on the client side so that they can search quickly internally a then query the database for extended information.
My suggestion is to use solr for search and then have it return a short sample of the data you may want as well as an index for further querying in a traditional data store.
TL;DR: Yes, but I wouldn't recommend it.

The Guardian uses Solr as their data store. You can see some of their reasons in that slideshow.
In any case, I think their website is very heavily trafficked (certainly more so than anything I work on), so I think I would feel comfortable saying that Solr will probably work for you., since it scales to their requirements.

Related

Is it possible for Lucene to monitor a Sql Table and keep itself updated?

I am trying to understand some basics of Lucene, the full text search engine. More specifically I am looking at Lucene.Net.
Today I have an old legacy .NET 4.8 web app. Some is MVC, but the newer parts follow a pretty nice API first pattern. The app holds a lot of records (app half a million) with tons of different fields. The search functionality there is outdated to say the least. It is a ton of old Linq2SQL queries that fan out in like queries.
I would like to introduce a new and better way to search records, so I started looking at Lucene.Net. But I am trying to understand one key concept, and I can't seem to find the answer anywhere, and I think it might be because it cannot be done, but I would like to make sure.
Is it possible to set up Lucene to monitor a SQL table or view so I don't have to maintain the Lucene index from within my code. The code of this app does not lend itself to easily keeping a Lucene index updated when things are added, changed or deleted. But the database is good source of truth. I can live with a small delay on having the index up to date. But basically I would like define for each business model what fields are part of the index and what the id is, and then be able to query with that index from the C# server side code of my Web App.
Is such a scenario even possible or am I asking too much?

It's totally possible, but not out of the box. You have to implement it if you want it. Fundamentally you need to implement three things.
A way to know every time a piece of relevant data in the sql database changes
A place to capture information about that change, call it a change log.
A routine that reads the change log, applies those changes to the
LuceneNet index and than marks the record in the change log has processed.
There are of course lots of different ways to handle each of these.
This SO answer Lucene.Net index updates, when a manual change is done in SQL Database provides more details on one way this can be accomplished.

Schema Migration Scripts in NoSQL Databases

I have a active project that has always used C#, Entity Framework, and SQL Server. However, with the feasibility of NoSQL alternatives daily increasing, I am researching all the implications of switching the project to use MongoDB.
It is obvious that the major transition hurdles would be due to being "schema-less". A good summary of what that implies for languages like C# is found here in the official MongoDB documentation. Here are the most helpful relevant paragraphs (bold added):
Just because MongoDB is schema-less does not mean that your code can
handle a schema-less document. Most likely, if you are using a
statically typed language like C# or VB.NET, then your code is not
flexible and needs to be mapped to a known schema.
There are a number of different ways that a schema can change from one
version of your application to the next.
How you handle these is up to you. There are two different strategies:
Write an upgrade script. Incrementally update your documents as they
are used. The easiest strategy is to write an upgrade script. There is
effectively no difference to this method between a relational database
(SQL Server, Oracle) and MongoDB. Identify the documents that need to
be changed and update them.
Alternatively, and not supportable in most relational databases, is
the incremental upgrade. The idea is that your documents get updated
as they are used. Documents that are never used never get updated.
Because of this, there are some definite pitfalls you will need to be
aware of.
First, queries against a schema where half the documents are version 1
and half the documents are version 2 could go awry. For instance, if
you rename an element, then your query will need to test both the old
element name and the new element name to get all the results.
Second, any incremental upgrade code must stay in the code-base until
all the documents have been upgraded. For instance, if there have been
3 versions of a document, [1, 2, and 3] and we remove the upgrade code
from version 1 to version 2, any documents that still exist as version
1 are un-upgradeable.
The tooling for managing/creating such an initialization or upgrade scripts in SQL ecosystem is very mature (e.g. Entity Framework Migrations)
While there are similar tools and homemade scripts available for such upgrades in the NoSQL world (though some believe there should not be), there seems to be less consensus on "when" and "how" to run these upgrade scripts. Some suggest after deployment. Unfortunately this approach (when not used in conjunction with incremental updating) can leave the application in an unusable state when attempting to read existing data for which the C# model has changed.
If
"The easiest strategy is to write an upgrade script."
is truly the easiest/recommended approach for static .NET languages like C#, are there existing tools for code-first schema migration in NoSql Databases for those languages? or is the NoSql ecosystem not to that point of maturity?
If you disagree with MongoDB's suggestion, what is a better implementation, and can you give some reference/examples of where I can see that implementation in use?

Short version
Is "The easiest strategy is to write an upgrade script." is truly the easiest/recommended approach for static .NET languages like C#?
No. You could do that, but that's not the strength of NoSQL. Using C# does not change that.
are there existing tools for code-first schema migration in NoSql Databases for those languages?
Not that I'm aware of.
or is the NoSql ecosystem not to that point of maturity?
It's schemaless. I don't think that's the goal or measurement of maturity.
Warnings
First off, I'm rather skeptical that just pushing an existing relational model to NoSql would in a general case solve more problems than it would create.
SQL is for working with relations and on sets of data, noSQL is targeted for working with non-relational data: "islands" with few and/or soft relations. Both are good at what what they are targeting, but they are good at different things. They are not interchangeable. Not without serious effort in data redesign, team mindset and application logic change, possibly invalidating most previous technical design decision and having impact run up to architectural system properties and possibly up to user experience.
Obviously, it may make sense in your case, but definitely do the ROI math before committing.
Dealing with schema change
Assuming you really have good reasons to switch, and schema change management is a key in that, I would suggest to not fight the schemaless nature of NoSQL and embrace it instead. Accept that your data will have different schemas.
Don't do upgrade scripts
.. unless you know your application data set will never-ever grow or change notably. The other SO post you referenced explains it really well. You just can't rely on being able to do this in long term and hence you need a plan B anyway. Might as well start with it and only use schema update scripts if it really is the simpler thing to do for that specific case.
I would maybe add to the argumentation that a good NoSQL-optimized data model is usually optimized for single-item seeks and writes and mass-updates can be significantly heavier compared to SQL, i.e. to update a single field you may have to rewrite a larger portion of the document + maybe handle some denormalizations introduced to reduce the need of lookups in noSQL (and it may not even be transactional). So "large" in NoSql may happen to be significantly smaller and occur faster than you would expect, when measuring in upgrade down-time.
Support multiple schemas concurrently
Having different concurrently "active" schema versions is in practice expected since there is no enforcement anyway and that's the core feature you are buying into by switching to NoSQL in the first place.
Ideally, in noSQL mindset, your logic should be able to work with any input data that meets the requirements a specific process has. It should depend on its required input not your storage model (which also makes universally sense for dependency management to reduce complexity). Maybe logic just depends on a few properties in a single type of document. It should not break if some other fields have changed or there is some extra data added as long as they are not relevant to given specific work to be done. Definitely it should not care if some other model type has had changes. This approach usually implies working on some soft value bags (JSON/dynamic/dictionary/etc).
Even if the storage model is schema-less, then each business logic process has expectations about input model (schema subset) and it should validate it can work with what it's given. Persisted schema version number along model also helps in trickier cases.
As a C# guy, I personally avoid working with dynamic models directly and prefer creating a strongly typed objects to wrap each dynamic storage type. To avoid having to manage N concurrent schema version models (with minimal differences) and constantly upgrade logic layer to support new schema versions, I would implement it as a superset of all currently supported schema versions for given entity and implement any interfaces you need. Of course you could add N more abstraction layers ;) Once some old schema versions have eventually phased out from data, you can simplify your model and get strongly typed support to reach all dependents.
Also, it's important for logic layer should have a fallback or reaction plan should the input model NOT match the requirements for carrying out the intended logic. It's up to app when and where you can auto-upgrade, accept a discard, partial reset or have to direct to some trickier repair queue (up to manual fix if no automatics can cut it) or have to just outright reject the request due to incompatibility.
Yes, there's the problem of querying across sets of models with different versions, so you should always consider those cases as well. You may have to adjust querying logic to query different versions separately and merge results (or accept partial results if acceptable).
There definitely are tradeoffs to consider, sure.
So, migrations?
A downside (if you consider migrations tool set availability) is that you don't have one true schema to auto generate the model or it's changes as the C# model IS the source-of-truth schema you're currently supporting. Actually, quite similar to code-first mindset, but without migrations.
You could implement an incoming model pipe which auto-upgrades the models as they are read and hence reduce the number schema versions you need to support upstream. I would say this is as close to migrations as you get. I don't know any tools to do this for you automatically and I'm not sure I would want it to. There are trade-offs to consider, for example some clients consuming the data may get upgraded with different time-line etc. Upgrade to latest may not always be what you want.
Conclusion
NoSQL is by definition not SQL. Both are cool, but expecting equivalency or interchangeability is bound for trouble.
You still have to consider and manage schema in NoSQL, but if you want one true enforced & guaranteed schema, then consider SQL instead.

While Imre's answer is really great and I agree with it in every detail I would like to add more to it but also trying to not duplicate information.
Short version
If you plan to migrate your existing C#/EF/SQL project to MongoDB it is a high chance that you shouldn't. It probably works quite well for some time, the team knows it and probably hundreds or more bugs have been already fixed and users are more or less happy with it. This is the real value that you already have. And I mean it. For reasons why you should not replace old code with new code see here:
https://www.joelonsoftware.com/2000/04/06/things-you-should-never-do-part-i/.
Also more important than existence of tools for any technology is that it brings value and it works as promised (tooling is secondary).
Disclaimers
I do not like the explanation from mongoDB you cited that claims that statically typed language is an issue here. It is true but only on a basic, superficial level. More on this later.
I do not agree that EF Code First Migration is very mature - though it is really great for development and test environments and it is much, much better than previous .NET database-first approaches but still you have to have your own careful approach for production deployments.
Investing in your own tooling should not be a blocker for you. In fact if the engine you choose would be really great it is worthwhile to write some specific tooling around it. I believe that great teams rarely use tooling "off the shelves". They rather choose technologies wisely and then customize tools to their needs or build new tools around it (probably selling the tool a year or two years later).
Where the front line lays
It is not between statically and dynamically typed languages. This difference is highly overrated.
It is more about problem at hand and nature of your schema.
Part of the schema is quite static and it will play nicely both in static and dynamic "world" but other part can be naturally changing with time and it fits better for dynamically typed languages but not in the essence of it.
You can easily write code in C# that has a list of pairs (key, value) and thus have dynamism under control. What dynamically typed languages gives you is impression that you call properties directly while in C# you access it by "key". While being easier and prettier to use for developer it does not save you from bigger problems like deploy schema changes, access different versions of schemas etc.
So static/dynamic languages case is not an issue here at all.
It is rather drawing a line between data that you want to control from your code (that is involved in any logic) and the other part that you do not have to control strictly. The second part do not have to be explicitly and minutely expressed in schema in your code (it can be rather list or dictionary than named fields/properties because maintaining such fields costs you but does not brings any value).
My Use Case
Once upon a time my team has made a project that uses three different databases:
SQL for "usual" configuration and evidence stuff
Graph database to make it natural to build wide network of arbitrarily connected objects
Document database tuned for searching (Elastic Search in fact) to make searching instant and really modern (like dealing with typos or the like)
Of course it is a challenge to deploy such wide technology stack but each part of it brings its best to the whole solution.
The aim of the project is to search through a knowledge base of literally anything (projects, peoples, books, products, documents, simply anything).
That's why SQL is here only to record a list of available "knowledge databases" and users assigned to them. The schema here is obvious, stable and trivial. There is low probability of changes in the future.
Next, graph database allows to literally "throw" anything into the database from different sources around and connect things with each other. The idea, to put it simply, is to have objects accessible by ID.
Next, Elastic search is here to accumulate IDs and a selected subset of properties to make them searchable in the instant. Here the schema contains only ID and list of pairs (key, value).
As the final step, to put it simply, the solution calls Elastic Search, gets Ids and displays details (schema is irrelevant as we treat it as a list of pairs key x value, so GUI is prepared to build screens dynamically).
Though the way to the solution was really painful.
We tested a few graph databases by running concept proofs to find that most of them simply does not work in operations like updating data! (ugh!!!) Finally we have found one good enough DB.
On the other hand finding and using Elastic Search was a great pleasure! Though being great you have to be aware that under pressure of uploading massive data it can break so you have to adjust your tooling to adapt to it.
(so no silver bullet here).
Going into more widely used direction
Apart from my use case which is kind of extreme usually you have sth "in-between".
For example a database for documents.
It can have almost static "header" of fields like ID, name, author, and so on and your code can manage it "traditionally" but all other fields could be managed in a way that it can exists or not and can have different contents or structure.
"The header" is the part you decided to make it relevant for the project and controllable by the project. The rest is rather accompanying than crucial (from the project logic point of view).
Different approaches
I would rather recommend to learn about strengths of particular NoSQL database types, find answers why were they created, why are they popular and useful. Then answer in which way they can bring benefits to your project.
BTW. This is interesting why you have indicated MongoDB?
The other way around would be to answer what are your project's current greatest weaknesses or greatest challenges from technological point of view - being it performance, cost of support changes, need to scale significantly or other. Then try to answer if some NoSQL DB would be great at resolving the issue.
Conclusion
I'm sure you can find benefits of NoSQL databases to your project either by replacing part of it or by bringing new values to users (searching for example?). Either way I would prefer a really good technology that brings what it promises rather than looking if it is fully supported by tools around it.
And also concept proof is a really good tool to check technologies in a scenario that is very simple but at the same time meaningful for you. But the approach should be not to play with technologies but aggressively and quickly prove or disprove quality of them.
There are so much promises and advertises around that we should protect ourselves by focusing of the real things that works.

What are some "mental steps" a developer must take to begin moving from SQL to NO-SQL (CouchDB, FathomDB, MongoDB, etc)?

I have my mind firmly wrapped around relational databases and how to code efficiently against them. Most of my experience is with MySQL and SQL. I like many of the things I'm hearing about document-based databases, especially when someone in a recent podcast mentioned huge performance benefits. So, if I'm going to go down that road, what are some of the mental steps I must take to shift from SQL to NO-SQL?
If it makes any difference in your answer, I'm a C# developer primarily (today, anyhow). I'm used to ORM's like EF and Linq to SQL. Before ORMs, I rolled my own objects with generics and datareaders. Maybe that matters, maybe it doesn't.
Here are some more specific:
How do I need to think about joins?
How will I query without a SELECT statement?
What happens to my existing stored objects when I add a property in my code?
(feel free to add questions of your own here)

Firstly, each NoSQL store is different. So it's not like choosing between Oracle or Sql Server or MySQL. The differences between them can be vast.
For example, with CouchDB you cannot execute ad-hoc queries (dynamic queries if you like). It is very good at online - offline scenarios, and is small enough to run on most devices. It has a RESTful interface, so no drivers, no ADO.NET libraries. To query it you use MapReduce (now this is very common across the NoSQL space, but not ubiquitous) to create views, and these are written in a number of languages, though most of the documentation is for Javascript. CouchDB is also designed to crash, which is to say if something goes wrong, it just restarts the process (the Erlang process, or group of linked processes that is, not the entire CouchDB instance typically).
MongoDB is designed to be highly performant, has drivers, and seems like less of a leap for a lot of people in the .NET world because of this. I believe though that in crash situations it is possible to lose data (it doesn't offer the same level of transactional guarantees around writes that CouchDB does).
Now both of these are document databases, and as such they share in common that their data is unstructured. There are no tables, no defined schema - they are schemaless. They are not like a key-value store though, as they do insist that the data you persist is intelligible to them. With CouchDB this means the use of JSON, and with MongoDB this means the use of BSON.
There are many other differences between MongoDB and CouchDB and these are considered in the NoSQL space to be very close in their design!
Other than document databases, their are network oriented solutions like Neo4J, columnar stores (column oriented rather than row oriented in how they persist data), and many others.
Something which is common across most NoSQL solutions, other than MapReduce, is that they are not relational databases, and that the majority do not make use of SQL style syntax. Typcially querying follows an imperative mode of programming rather than the declarative style of SQL.
Another typically common trait is that absolute consistency, as typically provided by relational databases, is traded for eventual models of consistency.
My advice to anyone looking to use a NoSQL solution would be to first really understand the requirements they have, understand the SLAs (what level of latency is required; how consistent must that latency remain as the solutions scales; what scale of load is anticipated; is the load consistent or will it spike; how consistent does a users view of the data need to be, should they always see their own writes when they query, should their writes be immediately visible to all other users; etc...). Understand that you can't have it all, read up on Brewers CAP theorum, which basically says you can't have absolute consistence, 100% availability, and be partition tolerant (cope when nodes can't communicate). Then look into the various NoSQL solutions and start to eliminate those which are not designed to meet your requirements, understand that the move from a relational database is not trivial and has a cost associated with it (I have found the cost of moving an organisation in that direction, in terms of meetings, discussions, etc... itself is very high, preventing focus on other areas of potential benefit). Most of the time you will not need an ORM (the R part of that equation just went missing), sometimes just binary serialisation may be ok (with something like DB4O for example, or a key-value store), things like the Newtonsoft JSON/BSON library may help out, as may automapper. I do find that working with C#3 theere is a definite cost compared to working with a dynamic language like, say Python. With C#4 this may improve a little with things like the ExpandoObject and Dynamic from the DLR.
To look at your 3 specific questions, with all it depends on the NoSQL solution you adopt, so no one answer is possible, however with that caveat, in very general terms:
If persisting the object (or aggregate more likely) as a whole, your joins will typically be in code, though you can do some of this through MapReduce.
Again, it depends, but with Couch you would execute a GET over HTTP against either a specific resource, or against a MapReduce view.
Most likely nothing. Just keep an eye-out for the serialisation, deserialisation scenarios. The difficulty I have found comes in how you manage versions of your code. If the property is purely for pushing to an interface (GUI, web service) then it tends to be less of an issue. If the property is a form of internal state which behaviour will rely on, then this can get more tricky.
Hope it helps, good luck!

Just stop thinking about the database.
Think about modeling your domain. Build your objects to solve the problem at hand following good patterns and practices and don't worry about persistence.

Methods for storing searchable data in C#

In a desktop application, I need to store a 'database' of patient names with simple information, which can later be searched through. I'd expect on average around 1,000 patients total. Each patient will have to be linked to test results as well, although these can/will be stored seperately from the patients themselves.
Is a database the best solution for this, or overkill? In general, we'll only be searching based on a patient's first/last name, or ID numbers. All data will be stored with the application, and not shared outside of it.
Any suggestions on the best method for keeping all such data organized? The method for storing the separate test data is what seems to stump me when not using databases, while keeping it linked to the patient.
Off the top of my head, given a List<Patient>, I can imagine several LINQ commands to make searching a breeze, although with a list of 1,000 - 10,000 patients, I'm unsure if there's any performance concerns.

Use a database. Mainly because what you expect and what you get (especially over the long term) tend be two totally different things.

This is completely unrelated to your question on a technical level, but are you doing this for a company in the United States? What kind of patient data are you storing?
Have you looked into HIPAA requirements and checked to see if you're a covered entity? Be sure that you're complying with all legal regulations and requirements!

I think 1000 is to much to try to store in XML. I'd go with a simple db type, like access or Sqlite. Yes, as a matter of fact, I'd probably use Sqlite. Sql Server Express is probably overkill for it. http://sqlite.phxsoftware.com/ is the .net provider.

I would recommend a database. You can use SQL Server Express for something like that. Trying to use XML or something similar would probably get out of hand with that many rows.
For smaller databases/apps like this I've yet to notice any performance hits from using LINQ to SQL or Entity Framework.

I would use SQL Server Express because it has the best tool support (IDE integration) from Microsoft. I don't see any reason to consider it overkill.
Here's an article on how to embed it directly in your application (no separate installation needed).

If you had read-only files provided by another party in some kind of standard format which were meant to be used by the application, then I would consider simply indexing them according to your use cases and running your searches and UI against that. But that's still some customized work.
Relational databases are great for storing data in tables, and for representing the relationships between tables. Typically there are also good tools for getting the data in and out.
There are other systems you could use to store your data, but none which would so quickly be mapped to your input (you didn't mention how your data would get into this system) and then be queryable against with least effort.
Now, which database to choose...

Use Database...but maybe just SQLite, instead of a fully fledged database like MS SQL (Express).

Create a Search Engine with SQL 2000 and ASP.NET C#

I am looking to create a search engine that will be based on 5 columns in a SQL 2000 DB. I have looked into Lucene.NET and read the documentation on it, but wondering if anyone has any previous experience with this?
Thanks

IMHO it's not so much about performance, but about maintainability. In order to index your content using Lucene.NET you'll have to create some mechanism (service of triggered) which will add new rows (and remove deleted rows) from the Lucene index.
From a beginner's perspective I think it's probably easier to use the SQL Server built-in full text search engine.

i haven't dealt with Lucene yet but a friend of mine has and he said that their performance was 4 to 5 times better with lucene than full text indexing.

Performance better? I think that largely depends on volume and how you expect the data to scale.
SQL Server Full Text is far superior in my opinion. To get this to work with lucene you will need a process to maintain the index by extracting data from the SQL database.

You cam either use a Lucene Index or SQL FTS Index. I personally lean toward Lucene from a simplicity standpoint. It is also not a black box. Alot of which solution will work (and they both may work) depnds on query load, data size and data update frequency. Lucene does provide a well worn path to building very scalable search solutions for websites. In the future please include some more information about your problem.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.