Using Sphinx Search With Azure Table Storage - C#

I currently have Sphinx running against an MS SQL Server database, and it has worked great for the past few years. The table Sphinx indexes has recently grown a lot, and we need to leverage the speed we'd gain by moving it to Azure Table Storage.
What options do I have to allow Sphinx to index this table from Azure? I know it supports MS SQL, but the Azure Table Storage offering is a different beast. I have also found that Sphinx supports an XML input, but it would be very hard to export all of this data into a file to be read every 5 minutes. Has anyone conquered this issue using Azure Table Storage?
thanks

Well, XMLpipe (or even TSVpipe) would be the way to connect to the table store, lacking a native SQL-based driver.
... but yes, a simple implementation might well load all the data, which is actually what you're probably doing with MS SQL anyway. It's just that the data is small enough that it's reasonably practical.
Loading all the data from MS SQL would be similarly "expensive".
So really your question is more about how to index a 'large' dataset: some sort of incremental update system, so you only need to load the 'changes'. (The fact that you're running against a Storage Table then becomes just a trivial detail of the implementation.)
One concept you might see quite a bit with Sphinx is the so-called 'main'+'delta' scheme:
http://www.sphinxconsultant.com/sphinx-search-delta-indexing/
That works quite well with XMLpipe too, so it can work with Azure. You just need to come up with a couple of scripts: one to download a large quantity of data (to initially commission the 'main' index; it doesn't get used often)
... then a second script to only fetch the new records, by running some sort of query.
You just need some sort of script to stream from Azure and output either XML or TSV:
https://www.google.com/search?q=Azure+Table+Storage+stream
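In case it helps, here is a minimal sketch of such a script in C#, assuming the legacy Microsoft.WindowsAzure.Storage client and a hypothetical `Documents` table with `Title` and `Body` properties; it streams the whole table and writes Sphinx's xmlpipe2 format to stdout:

```csharp
using System;
using System.Security;
using Microsoft.WindowsAzure.Storage;        // legacy Azure Storage client (assumption)
using Microsoft.WindowsAzure.Storage.Table;

// Hypothetical entity shape -- substitute your real table's properties.
public class Doc : TableEntity
{
    public string Title { get; set; }
    public string Body { get; set; }
}

class XmlPipe
{
    static void Main()
    {
        var table = CloudStorageAccount
            .Parse(Environment.GetEnvironmentVariable("AZURE_STORAGE_CONNECTION_STRING"))
            .CreateCloudTableClient()
            .GetTableReference("Documents");

        Console.WriteLine("<?xml version=\"1.0\" encoding=\"utf-8\"?>");
        Console.WriteLine("<sphinx:docset>");
        Console.WriteLine("<sphinx:schema><sphinx:field name=\"title\"/><sphinx:field name=\"body\"/></sphinx:schema>");

        long id = 1;
        // ExecuteQuery transparently follows continuation tokens, so this
        // streams the whole table segment by segment.
        foreach (Doc d in table.ExecuteQuery(new TableQuery<Doc>()))
        {
            Console.WriteLine("<sphinx:document id=\"{0}\">", id++);
            Console.WriteLine("<title>{0}</title>", SecurityElement.Escape(d.Title));
            Console.WriteLine("<body>{0}</body>", SecurityElement.Escape(d.Body));
            Console.WriteLine("</sphinx:document>");
        }
        Console.WriteLine("</sphinx:docset>");
    }
}
```

In sphinx.conf you'd then point an xmlpipe2 source's xmlpipe_command at this executable. Note the sequential id counter is a simplification; for a main+delta setup you'd want stable document IDs derived from the entity keys.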

Related

RavenDB - synchronize with Sql Server DB

I was thinking about utilizing RavenDB for some of the look-up scenarios I have in a high-throughput application. This would replace all of the look-up calls I need to make to the DB to get things like site location, etc. I'm looking at a couple of options really (also .NET caching). I know that you can replicate indexes from RavenDB to SQL Server, but I'm wondering if anyone has done the reverse, where they sync RavenDB with SQL Server?
Any suggestions / comments would be appreciated.
--S
I've done a similar scenario where data needed to be transferred in batch from a SQL Server system nightly into our RavenDB instance.
I couldn't find an off-the-shelf tool to do what I wanted, as typically you should optimise the model you give RavenDB differently to SQL Server.
I wrote a custom console app that put the data into my RavenDB instance.
For example, my console app:
Compacted several relationships into one document
Dealt with the different data types
TLDR: I wrote my own console app as I couldn't find a generic product that could do it.
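For what it's worth, here is a minimal sketch of that kind of one-way sync, assuming an older RavenDB client API and made-up Orders/OrderLines tables; it compacts a one-to-many SQL relationship into a single document per order:

```csharp
using System.Collections.Generic;
using System.Data.SqlClient;
using Raven.Client.Document;   // RavenDB client (older API surface; assumption)

// Hypothetical document model: one order plus its lines, compacted
// from two relational tables into a single RavenDB document.
public class Order
{
    public string Id { get; set; }
    public string Customer { get; set; }
    public List<string> Lines { get; set; }
}

class SqlToRavenSync
{
    static void Main()
    {
        using (var store = new DocumentStore { Url = "http://localhost:8080" }.Initialize())
        using (var sql = new SqlConnection("Server=.;Database=Sales;Trusted_Connection=True"))
        {
            sql.Open();
            var cmd = new SqlCommand(
                "SELECT o.Id, o.Customer, l.Description " +
                "FROM Orders o JOIN OrderLines l ON l.OrderId = o.Id " +
                "ORDER BY o.Id", sql);

            using (var session = store.OpenSession())
            using (var reader = cmd.ExecuteReader())
            {
                Order current = null;
                while (reader.Read())
                {
                    var id = "orders/" + reader.GetInt32(0);
                    if (current == null || current.Id != id)
                    {
                        current = new Order { Id = id,
                                              Customer = reader.GetString(1),
                                              Lines = new List<string>() };
                        session.Store(current);   // folds each order's rows into one document
                    }
                    current.Lines.Add(reader.GetString(2));
                }
                session.SaveChanges();             // one batched round-trip to RavenDB
            }
        }
    }
}
```

One caveat: RavenDB sessions track everything you Store, so for large batches you'd open a fresh session and call SaveChanges every few hundred documents rather than holding one session for the whole run.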
So far the only available solution is to write your own sync process.
I was looking for ways to improve the search scenarios using RavenDB; the RavenDB instance will be filled from my SQL Server relational database.
I think there should be a better way; however, the only one I can think of right now is to use an ETL process that keeps updating the NoSQL version of your structured data.

Windows Azure persistent storage tip

I've just activated my Azure account, created my first ASP.NET MVC 3.0 project (just the template) and deployed it :). Wonderful!
However, I'm about to create a small app (just to get to learn Azure) and have hit a minor issue.
Here's what I want to do:
Create an MVC app which displays my music library and allows for searching, sorting, adding new albums, etc.
There are probably about 3000 albums.
What kind of storage should I use, and does anyone know of a good tutorial example of how to do this in C# with MVC?
Please note I do not want to use SQL Azure; that would be too easy. I need to dig in and learn blob/table/? types.
I just need a sound recommendation on which storage type I should start studying, and more importantly where I should study it :).
Azure Storage Tables are different from SQL in that they are managed by Azure and not by a DBMS. They have key fields through which you can find data within a table, and you can use LINQ to access it. That said, there are performance considerations when choosing where each kind of data should go. SQL Azure provides better relational access, so if you are going to have a high number of tables and expect a lot of join operations, I'd go with it. But if you have simply structured data that you need to maintain in your application, you can choose tables.
Of course, since storage is 10x cheaper than SQL Azure, you will always want to design applications that make good use of storage, but remember to check for any performance issues you might have.
The Windows Azure Platform Training Kit has a few labs, under Exploring Windows Azure Storage. That should give you a good start understanding the table and entity approach. Pay specific attention to partition and row keys. Storage is optimized to be colocated around partition key, and indexed within a partition via row key. You'll need to carefully plan your row key for searching. If you need to search on multiple properties within a table, you'll need to consider either additional tables (each containing a row key that you'd search), or maybe a NoSQL database like MongoDB (or a relational database like SQL Azure, but you said you want to avoid that approach).
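To make the partition key / row key point concrete, here is a minimal sketch against the music-library scenario, assuming the legacy Microsoft.WindowsAzure.Storage client; the Album entity, table name, and key choices are illustrative only:

```csharp
using System;
using Microsoft.WindowsAzure.Storage;        // legacy Azure Storage client (assumption)
using Microsoft.WindowsAzure.Storage.Table;

// Hypothetical entity: partition by artist so all of one artist's albums
// are colocated; the row key (album title) is unique within a partition.
public class Album : TableEntity
{
    public Album() { }                        // required by the serializer
    public Album(string artist, string title)
    {
        PartitionKey = artist;
        RowKey = title;
    }
    public int Year { get; set; }
    public string Genre { get; set; }
}

class Program
{
    static void Main()
    {
        var table = CloudStorageAccount
            .Parse(Environment.GetEnvironmentVariable("AZURE_STORAGE_CONNECTION_STRING"))
            .CreateCloudTableClient()
            .GetTableReference("Albums");
        table.CreateIfNotExists();

        table.Execute(TableOperation.InsertOrReplace(
            new Album("Miles Davis", "Kind of Blue") { Year = 1959, Genre = "Jazz" }));

        // Queries scoped to a single partition key are fast; anything else scans.
        var byArtist = new TableQuery<Album>().Where(
            TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, "Miles Davis"));
        foreach (var a in table.ExecuteQuery(byArtist))
            Console.WriteLine("{0} ({1})", a.RowKey, a.Year);
    }
}
```

Partitioning by artist means the common query ("all albums by X") stays within one partition; searching by, say, genre would scan the whole table, which is where the additional-tables advice above comes in.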
Also, take a look at this blog post by David Pallman - he has a complete set of code snippets for every single type of storage operation. This could save you many hours of time as you try to figure out all the ways to interact with Table Storage.
Then, look at this MSDN post that talks about storage transactions, which will be relevant when you move beyond simple examples and shift focus into production code.

.NET Data Storage - Database vs single file

I have a C# application that allows one user to enter information about customers and job sites. The information is very basic.
Customer: Name, number, address, email, associated job site.
Job Site: Name, location.
Here are my specs I need for this program.
No limit on amount of data entered.
Single user per application. No concurrent activity or multiple users.
Allow user entries/data to be exported to an external file that can be easily shared between applications/users.
Allows for user queries to display customers based on different combinations of customer information/job site information.
The data will never be viewed or manipulated outside of the application.
The program will be running almost always, minimized to the task bar.
Startup time is not very important, however I would like the queries to be considerably fast.
This all seems to point me towards a database, but a very lightweight one. However I also need it to have no limitations as far as data storage. If you agree I should use a database, please let me know what would be best suited for my needs. If you don't think I should use a database, please make some other suggestions on what you think would be best.
My suggestion would be to use SQLite. You can find it here: http://sqlite.org/. And you can find the C# wrapper version here: http://sqlite.phxsoftware.com/
SQLite is very lightweight and has some pretty powerful stuff for such a lightweight engine. Another option you can look into is Microsoft Access.
You're asking the wrong question again :)
The better question is "how do I build an application that lets me change the data storage implementation?"
If you apply the repository pattern and properly interface it, you can build interchangeable persistence layers. So you could start with one implementation and change it as needed without re-engineering the business or application layers.
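A minimal sketch of what that might look like; the Customer shape and member names are made up for illustration:

```csharp
using System.Collections.Generic;
using System.Linq;

public class Customer
{
    public int Id { get; set; }
    public string Name { get; set; }
    public string Email { get; set; }
    public string JobSite { get; set; }
}

// The rest of the app codes against this interface only, so the backing
// store (XML, SQLite, SQL Express...) can be swapped without touching callers.
public interface ICustomerRepository
{
    Customer GetById(int id);
    IEnumerable<Customer> Find(string nameContains, string jobSite);
    void Save(Customer customer);
    void Delete(int id);
}

// Trivial in-memory implementation; an XML- or SQLite-backed one would
// implement the same four members.
public class InMemoryCustomerRepository : ICustomerRepository
{
    private readonly Dictionary<int, Customer> _store = new Dictionary<int, Customer>();

    public Customer GetById(int id)
    {
        Customer c;
        _store.TryGetValue(id, out c);
        return c;
    }

    public IEnumerable<Customer> Find(string nameContains, string jobSite)
    {
        return _store.Values.Where(c =>
            (nameContains == null || c.Name.Contains(nameContains)) &&
            (jobSite == null || c.JobSite == jobSite));
    }

    public void Save(Customer customer) { _store[customer.Id] = customer; }
    public void Delete(int id) { _store.Remove(id); }
}
```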
Once you have a repository interface like that, you could try implementations in a lot of different approaches:
Flat File - You could persist the data as XML, and provided that it's not a lot of data you could store the full contents in-memory (just read the file at startup, write the file at shutdown). With in-memory XML you can get very high throughput without concern for database indexes, etc.
Distributable DB - SQLite or SQL Compact work great; they offer many DB benefits, and require no installation
Local DB - SQL Express is a good middle-ground between a lightweight and full-featured DB. Access, when used carefully, can suffice. The main benefit is that it's included with MS Office (although not installed by default), and some IT groups are more comfortable having Access installed on machines than SQL Express.
Full DB - MySQL, SQL Server, PostgreSQL, et al.
Given your specific requirements I would advise you towards an XML-based flat file--with the only condition being that you are OK with the memory-usage of the application directly correlating to the size of the file (since your data is text, even with the weight of XML, this would take a lot of entries to become very large).
Here's the pros/cons--listed by your requirements:
Cons
No limit on amount of data entered.
using in-memory XML would mean your application would not scale. It could easily handle a 10 MB data file; 100 MB shouldn't be an issue (unless your system is low on RAM); above that you have to seriously question "can I afford this much memory?".
Pros
Single user per application. No concurrent activity or multiple users.
XML can be read into memory and held by the process (AppDomain, really). It's perfectly suited for single-user scenarios where concurrency is a very narrow concern.
Allow user entries/data to be exported to an external file that can be easily shared between applications/users.
XML is perfect for exporting, and also easy to import to Excel, databases, etc...
Allows for user queries to display customers based on different combinations of customer information/job site information.
Linq-to-XML is your friend :D (see the sketch after this list)
The data will never be viewed or manipulated outside of the application.
....then holding it entirely in-memory doesn't cause any issues
The program will be running almost always, minimized to the task bar.
so loading the XML at startup and writing at shutdown will be acceptable (if the file is very large it could take a while)
Startup time is not very important, however I would like the queries to be considerably fast
Reading the XML would be relatively slow at startup; but when it's loaded in-memory it will be hard to beat. Any given DB will require that the DB engine be started, that interop/cross-process/cross-network calls be made, that the results be loaded from disk (if not cached by the engine), etc...
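As referenced above, here is a minimal Linq-to-XML sketch of the load/query/save cycle; the file name and element names are hypothetical:

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

class Program
{
    static void Main()
    {
        // Load once at startup; all queries afterwards run purely in memory.
        XDocument db = XDocument.Load("customers.xml");   // hypothetical file name

        var matches =
            from c in db.Root.Elements("Customer")
            where (string)c.Element("JobSite") == "Riverside Plant"
               && ((string)c.Element("Name") ?? "").Contains("Smith")
            select new
            {
                Name = (string)c.Element("Name"),
                Email = (string)c.Element("Email")
            };

        foreach (var m in matches)
            Console.WriteLine("{0} <{1}>", m.Name, m.Email);

        // Write back at shutdown (or after each mutation, if you prefer).
        db.Save("customers.xml");
    }
}
```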
It sounds to me like a database is 100% what you need. It offers both the data storage, data retrieval (including queries) and the ability to export data to a standard format (either direct from the database, or through your application.)
For a light database, I suggest SQLite (pronounced 'SQL Lite' ;) ). You can google for tutorials on how to set it up, and then how to interface with it via your C# code. I also found a reference to this C# wrapper for SQLite, which may be able to do much of the work for you!
How about SQLite? It sounds like it is a good fit for your application.
You can use System.Data.SQLite as the .NET wrapper.
You can get SQL Server Express for free. I would say the question is not so much why should you use a database, more why shouldn't you? This type of problem is exactly what databases are for, and SQL Server is a very powerful and widely used database, so if you are going to go for some other solution you need to provide a good reason why you wouldn't go with a database.
A database would be a good fit. SQLite is good as others have mentioned.
You could also use a local instance of SQL Server Express to take advantage of improved integration with other pieces of the Microsoft development stack (since you mention C#).
A third option is a document database like Raven which may fit from the sounds of your data.
edit
A fourth option would be to try Lightswitch when the beta comes out in a few days. (8-23-2010)
/edit
There is always going to be a limitation on data storage (the empty space of the hard disk). According to Wikipedia, SQL Server Express 2008 R2 is limited to 10 GB per database.

Getting started with Azure Storage coming from a relational database point of view

I'm designing a new system, and I need to store a pretty large volume of different types of data, with relatively few rows per type.
I know that if I were doing this with SQL Server (I don't want to use a SQL Azure database for this.) I'd make a new table for each type of data and make the correct relationships. I'm wondering if anybody has resources for people like me who are thinking in relational terms to begin designing for more "flat" storage like Azure or even S3.
I'll be using .NET as the consumer of said storage, possibly with an Azure Compute instance, but more likely with a remote client using the REST or SOAP API. So any guidance with respect to that is also greatly appreciated.
The main thing to consider is whether you need relational database capabilities (joins, group by, etc.). If so, you'll have to put some thought into how to accomplish those using a non-relational storage solution.
If, however, your access looks like "store row #12345" and "retrieve row #12345", you should have an easy time using something like Windows Azure tables.
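For example, a point lookup with the legacy Azure Storage client might look like this (the entity shape, table name, and keys are made up):

```csharp
using System;
using Microsoft.WindowsAzure.Storage;        // legacy Azure Storage client (assumption)
using Microsoft.WindowsAzure.Storage.Table;

public class Row : TableEntity
{
    public string Payload { get; set; }
}

class Program
{
    static void Main()
    {
        var table = CloudStorageAccount
            .Parse(Environment.GetEnvironmentVariable("AZURE_STORAGE_CONNECTION_STRING"))
            .CreateCloudTableClient()
            .GetTableReference("Rows");

        // "Retrieve row #12345": a point lookup on (PartitionKey, RowKey)
        // is exactly the access pattern table storage is optimised for.
        TableResult result = table.Execute(TableOperation.Retrieve<Row>("rows", "12345"));
        var row = (Row)result.Result;        // null when the key is absent
        Console.WriteLine(row == null ? "not found" : row.Payload);
    }
}
```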
I would recommend Episode 10 of Cloud Cover (a weekly show I'm on) which covers Windows Azure's table storage API: http://channel9.msdn.com/shows/Cloud+Cover/Cloud-Cover-Episode-10-Table-Storage-API/

Any considerations before jumping into SQLite?

I have a WCF application that at present is using XML based file storage to store data that gets used to generate reports. Besides this processing decisions are made based on information stored in these XML files.
I'm now hitting volumes of around 30 000 text files. This is incredibly taxing, and the application at times comes to a grinding halt.
I've always wanted to swap out the XML DAL in favor of an RDBMS, but project managers simply won't allow it. They would, however, be willing to look at a serverless solution, for example SQLite. I am really tempted to just dive right in and start using it as a replacement DAL (Data Access Layer).
I would need no more than around 20 tables in the whole solution, and I would expect to get no more than around 20 000 - 100 000 transactions a day; however, this is the extreme, and the real volumes would be less than this in most cases.
Update
I am not expecting a great deal of simultaneous connections, when I say transactions, I essentially mean 1 or 2 clients that make calls and execute against the database in order. At times there might be a possibility of external clients making quick calls to the DB. But the bulk of DB connections will be done by my WCF service, which is a back end scheduled task, not serving 100's of people across an organization.
Another good point is that I only need to retain data for 90 days, so the DB shouldn't grow too big.
My main concerns are:
How reliable is SQLite? What if the DB file gets corrupted, will I lose all processing data? How easy is the DB to back up? Will it handle my volumes? And lastly, how well does the .NET provider work (located here: http://sourceforge.net/projects/sqlite-dotnet2/)?
If you have any experience with SQLite, please post your experiences so I can make an informed decision on whether to switch or not.
Thanks in advance...
SQLite is as reliable as your OS and hardware.
Its transaction rate is similar to SQL Server's, and often faster because it's all in-process.
The .NET ADO provider works great.
To back up the DB, stop the service and copy the file. If the journal file is present copy it too.
EDIT: SQLite uses UTF-8 by default, so with the ADO.NET provider you should be able to avoid losing accents (just so long as you follow the typical XML-in-string rules).
You could consider Microsoft's SQL Compact Edition.
It's like SQLite in terms of being a single-file embedded database, but it has better integration with the .NET Framework :)
SQLite seems reliable, and even with Microsoft's offering, don't expect to receive much support in the case of a corrupted database.
Given your transaction volume I'd say the fact that the DB itself is a single monolithic file with only file system locking available could be a problem.
There is no row-based locking, as far as I know.
I used SQLite with the .NET provider without problems in a single-user environment, except for one concern: accents, which didn't show correctly. Backup is quite simple: the SQLite database is a single file. Simply copy it.
I use SQLite for storing XML config data and have had no problems with it. I use the System.Data.SQLite provider: http://sqlite.phxsoftware.com/. It's solid and has a good support forum. It also includes a LINQ provider, and it integrates with VS 2008 so you can use Server Explorer to query tables. The examples and documentation also show how to use parameterized commands and transactions for increased performance.
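For illustration, a minimal sketch of both with System.Data.SQLite; the table, schema, and connection string are made up:

```csharp
using System;
using System.Data.SQLite;   // System.Data.SQLite provider

class Program
{
    static void Main()
    {
        using (var conn = new SQLiteConnection("Data Source=app.db;Version=3;"))
        {
            conn.Open();

            using (var create = new SQLiteCommand(
                "CREATE TABLE IF NOT EXISTS Config (Key TEXT PRIMARY KEY, Value TEXT)", conn))
                create.ExecuteNonQuery();

            // Wrapping many inserts in one transaction avoids a disk sync per
            // statement -- the single biggest SQLite write speed-up.
            using (var tx = conn.BeginTransaction())
            using (var cmd = new SQLiteCommand(
                "INSERT OR REPLACE INTO Config (Key, Value) VALUES (@k, @v)", conn, tx))
            {
                cmd.Parameters.Add(new SQLiteParameter("@k"));
                cmd.Parameters.Add(new SQLiteParameter("@v"));
                for (int i = 0; i < 1000; i++)
                {
                    // Parameterized values: no string concatenation, no injection.
                    cmd.Parameters["@k"].Value = "key" + i;
                    cmd.Parameters["@v"].Value = "value" + i;
                    cmd.ExecuteNonQuery();
                }
                tx.Commit();
            }
        }
    }
}
```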
The release candidate for LINQPad now supports SQLite: http://www.linqpad.net/Beta.aspx.
SQLite stores everything in a single file, which can be backed up like any other binary file.
SQLite only supports file-level locking, but that shouldn't present a performance problem, since it doesn't sound like you'll have a large number of simultaneous transactions.
Unicode shouldn't be a problem. This link in the forum addresses an area where someone was trying to read Unicode characters with an incompatible utility: http://sqlite.phxsoftware.com/forums/t/954.aspx.
This site shows how to do case-insensitive UTF-8 comparisons using System.Data.SQLite via a custom collator, with Russian characters as an example: http://www.codeproject.com/KB/database/SQLiteUTF8CIComparison.aspx.
