I am creating a very simple database in C# which I use to store playlists and an overview of all my music. I want to keep this C compatible in the future, so I plan to make it completely text based. The idea is that every text file is a table, and the contents are in JSON format, where every line of text is a record.
I don't want loose files for each database, so I was thinking about something like a zip file. I don't want to extract and compress every time I access a file. Is there some way I can use a stream reader/writer in C# on different files while Windows only sees one file?
I'm not completely convinced that this is the way to go, so I'm open to suggestions.
Update:
I'm currently messing around with the "Local Database" item in C#. I never paid any attention to it before. It could very well be the solution.
Update 2:
SQLite seems to be very simple. I have some experience with MySQL from past PHP projects, so that will give me a head start.
You want to use a single file as a container for different files? If so, there are a lot of ways to accomplish this. These are techniques I have used in the past:
Zip:
A compressed archive such as a zip file is known to behave that way and can be used as a solution here. It is capable of storing virtual files, which can vary in size up to at least 1 gigabyte (tested, but I don't know whether there are implementation-specific size limits).
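If you go the zip route, a minimal sketch of reading/writing an entry as a stream without unpacking the archive, assuming .NET 4.5+ where System.IO.Compression.ZipArchive is available (the file and entry names are made up):

    using System.IO;
    using System.IO.Compression; // ZipFile/ZipArchive, .NET 4.5+ (System.IO.Compression.FileSystem)

    class ZipContainerSketch
    {
        static void Main()
        {
            // Open (or create) the container; nothing is extracted to disk.
            // "music.container" and "playlists.table" are placeholder names.
            using (var archive = ZipFile.Open("music.container", ZipArchiveMode.Update))
            {
                var entry = archive.GetEntry("playlists.table")
                            ?? archive.CreateEntry("playlists.table");

                // Append one JSON record as a new line of text.
                using (var stream = entry.Open())
                using (var writer = new StreamWriter(stream))
                {
                    stream.Seek(0, SeekOrigin.End);
                    writer.WriteLine("{\"name\":\"Road trip\",\"tracks\":42}");
                }
            }
        }
    }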
SQLite:
SQLite sounds old school, but it stores everything related to the database in one physical file. Creating a database with a table for each virtual file should do the trick. This approach is useful if you know that your virtual files won't be very large or hit any limits of SQLite's field data types. Since your virtual files consist of text lines, you may be able to map them onto attributes and tuples. That way you can even use SQL statements to query and filter your data however you wish.
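For illustration, a minimal sketch using the Microsoft.Data.Sqlite package (the file name, table name and record are my own placeholders, not part of the original answer):

    using Microsoft.Data.Sqlite; // assumes the Microsoft.Data.Sqlite NuGet package

    class SqliteContainerSketch
    {
        static void Main()
        {
            // One physical file ("music.db", a placeholder name) holds every table.
            using (var connection = new SqliteConnection("Data Source=music.db"))
            {
                connection.Open();

                var create = connection.CreateCommand();
                create.CommandText =
                    "CREATE TABLE IF NOT EXISTS playlists (id INTEGER PRIMARY KEY, record TEXT)";
                create.ExecuteNonQuery();

                var insert = connection.CreateCommand();
                insert.CommandText = "INSERT INTO playlists (record) VALUES ($record)";
                insert.Parameters.AddWithValue("$record", "{\"name\":\"Road trip\",\"tracks\":42}");
                insert.ExecuteNonQuery();
            }
        }
    }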
There are still more ways to implement that kind of container format on your own, but it would probably take more time and work than it's worth. Stay tuned for better ideas and maybe some ready-to-use implementations :-)
Will you ever need to search your data? Then use a real database manager; in C# the built-in local database file is the simplest choice (if you are familiar with SQL).
A zip file is a good choice for data space and compactness (a single file instead of many files), but it is very slow: for each database operation the whole zip file will be reorganized. Even a tar file (without compression) needs continuous reallocation when the content changes, and a zip file needs extra computation and relocation on top of that.
If you want something that is compressed and still standard, you can use OpenXML (ODS or XLSX, it does not matter) to store your data, but the save operation will be slow and will only get slower as your database grows.
I'm working on refactoring a document storage service's site to go from a proprietary storage system to SQL. Everything is going fairly well, but I need to find a way to search through our repository for specific strings of text. We use a multitude of different file types (.xls, .xlsx, .doc, .txt, etc.). They're displayed to the user by first converting them to PDF, rebuilding them line by line with PDFSharp.
The speed isn't a consideration for viewing/searching a single file, but I have concerns about scalability. I was able to make a functioning text search by copying and then hooking into our conversion process, but I am fairly sure that this will not work for searching through a customer's entire document list (thousands and thousands of documents). If these were all of a uniform file type, it might be easier to do, but they aren't.
Is there an efficient way to do this of which I am unaware?
EDIT: The documents are stored on the server and referenced via document URLs in the DB
My recommendation is to build an index, either in SQL or in a file, that maps each file to all the possible search terms of interest it contains.
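As a rough sketch of the idea (the names and the tokenisation below are my own, not a prescribed implementation), an in-memory inverted index built from the text you already extract during conversion could look like this; the same shape maps naturally onto a two-column SQL table (term, document_id):

    using System;
    using System.Collections.Generic;

    class SearchIndexSketch
    {
        // Maps each search term to the set of document IDs that contain it.
        static readonly Dictionary<string, HashSet<int>> Index =
            new Dictionary<string, HashSet<int>>(StringComparer.OrdinalIgnoreCase);

        // Call once per document, feeding it the text already extracted
        // while converting the document for display.
        static void AddDocument(int documentId, string extractedText)
        {
            var separators = new[] { ' ', '\t', '\r', '\n', '.', ',', ';', ':' };
            foreach (var term in extractedText.Split(separators, StringSplitOptions.RemoveEmptyEntries))
            {
                if (!Index.TryGetValue(term, out var documents))
                {
                    documents = new HashSet<int>();
                    Index[term] = documents;
                }
                documents.Add(documentId);
            }
        }

        static IEnumerable<int> Search(string term)
        {
            return Index.TryGetValue(term, out var documents)
                ? (IEnumerable<int>)documents
                : Array.Empty<int>();
        }
    }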
OK, I have searched around and read a few points of view about storing binary data in a MySQL database. Generally I consider this a bad idea and try to avoid it, favouring traditional file transfers and just storing a reference to the file in a database.
However, I am working on a project which requires database synchronisation with a remote/cloud database, not just for files, but also for settings and other user content. For this, and other reasons, I felt this might be an appropriate situation for binary storage in a database.
I have written a general system for the database sync which works well, using Reflection and XML. I have also (against my instincts) integrated the file storage into this system. Again, it works well: I chop files into 64 KB BLOBs and store them in a table with a file_id reference (linked to a separate table which contains metadata such as file name/size/MIME type).
This enables me to send bits and pieces as and when a connection is available, and also allows me to limit each request size to keep things running smoothly.
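For illustration only, the chunking half of such a scheme might look roughly like this (the StoreChunk call stands in for the actual MySQL insert and is not from the original post):

    using System;
    using System.IO;

    class ChunkingSketch
    {
        const int ChunkSize = 64 * 1024; // 64 KB per BLOB row

        static void ChunkFile(string path, long fileId)
        {
            using (var stream = File.OpenRead(path))
            {
                var buffer = new byte[ChunkSize];
                int bytesRead, sequence = 0;

                while ((bytesRead = stream.Read(buffer, 0, buffer.Length)) > 0)
                {
                    // Copy only the bytes actually read; the last chunk is usually shorter.
                    var chunk = new byte[bytesRead];
                    Array.Copy(buffer, chunk, bytesRead);

                    StoreChunk(fileId, sequence++, chunk);
                }
            }
        }

        // Placeholder for the parameterised insert into the BLOB table
        // (file_id, sequence, data).
        static void StoreChunk(long fileId, int sequence, byte[] data) { }
    }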
So far I have not found any issues with this, and have successfully imported and transferred over 1 GB of data in both directions (across about 10-15 files / 16,000 rows), but I worry about its scalability: will it slow down once there is 20 GB+ of data in there, or can MySQL handle it provided my queries are well structured?
Another reason for my decision to store the data in the database was that I figured I could simply add another HDD/storage device to MySQL if space ran low, in the hope of efficient scaling/replication/etc.
I would very much appreciate any views or comments as to whether this is a good or bad approach, and have I missed any obvious problems I'm likely to see once used in a production environment?
Edit: I forgot to mention that the file sizes could range from 1 KB to ~1 GB.
[Rough] Conclusion
Firstly: thanks very much to those who contributed a considered answer. Choosing the accepted answer here has been quite difficult as each has something decent to offer.
In the end (despite my hopes), I have decided that a pure MySQL storage server is at best only an ok solution (I still can't help wondering why they bother including the BLOB types though).
As the alternative, I am torn between @Nick Coons' file system approach and @tadman's suggestion of a hybrid using a lightweight key/value database engine such as LevelDB. Provided the practicalities of using LevelDB in this project are not an issue, this is most likely the approach I will work towards.
I have accepted tadman's answer on this basis; his answer was also most applicable and useful to my situation.
That being said, for those who are interested: I have enjoyed quite a lot of success using only MySQL so far. I have tested a table storing over 15 GB of binary data without any noticeable negative side effects when inserting into/retrieving from large tables (with careful queries). However, I am certain this is still very inefficient, and either of the alternative methods mentioned will be significantly better.
I have to wonder why you're even bothering with a database at all, when the layer you've added on top to chunk, store, retrieve and reassemble would work just as well on a well-defined filesystem structure. MySQL wants all of its data on a single volume, so it's not a case of adding another drive whenever you feel like it, and replication of large amounts of binary data is going to be cripplingly slow as the binary logs will end up duplicating the amount of data you need to store.
The simplest approach is often the best one. Storing this in the filesystem directly is probably the best way to do it. If you need to keep an index of what's stored where, maybe you'd use a database like MySQL, but there are many ways to accomplish this same task. The more low-tech, the better. For example, don't rule out SQLite: an embedded database performs very well under light read and write load, and has the advantage of being "just a file" when it comes to backing up and restoring.
That being said, what you're doing sounds suspiciously similar to LevelDB, so before you commit to your approach, you'd have to see how it's significantly different than a key-value document store of that variety.
Short Answer:
I'm not sure there's a hard and fast way to answer this. You mentioned files ranging from 1 KB to 1 GB. I wouldn't store binary data in a DB even if it's only around 1 KB, let alone 1 GB. I may store a few bytes of binary data in a DB if it's incidental, but any large amount of data, especially data that doesn't need to be searched, should be stored in the filesystem:
When you store data in a DB, you're storing it on a filesystem anyway, you've just added another layer (the DB) to the mix. There's a cost to this layer, so there ought to be a benefit to make up the difference. If you're storing the data so that you can search based on it or join it to other data, then this makes sense. But file data, binary or not, is typically not used in that way.
Example Implementation:
There are better methods to distribute file data than to put it into a DB, such as a distributed filesystem (check out GlusterFS or MooseFS, both of which will scale by simply adding additional hard drives, whereas MySQL will not).
Typically, I'll store file data in the filesystem using an SHA1 hash of the data as the name of the file. If the hash is 98a75af529f07b1ef7be7400f51344b9f07b1ef7, then I'll store it in this directory structure:
./98/a7/98a75af529f07b1ef7be7400f51344b9f07b1ef7
That is, a top-level directory made up of the first two characters, a second-level directory made up of the second two characters, and then finally the file with the name of the total string. In this way, I can literally have billions of files without having so many in a single directory that the system is too slow to function.
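A small sketch of that layout in C# (the root directory parameter is a placeholder):

    using System;
    using System.IO;
    using System.Security.Cryptography;

    class HashPathSketch
    {
        // Hash the file contents and build the two-level directory path described above.
        static string GetStoragePath(string rootDirectory, string sourceFile)
        {
            using (var sha1 = SHA1.Create())
            using (var stream = File.OpenRead(sourceFile))
            {
                var hash = BitConverter.ToString(sha1.ComputeHash(stream))
                                       .Replace("-", "")
                                       .ToLowerInvariant();

                return Path.Combine(rootDirectory,
                                    hash.Substring(0, 2),
                                    hash.Substring(2, 2),
                                    hash);
            }
        }
    }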
Then I create a DB table with these columns to hold the metadata:
file_id, an auto_increment field
created, a field with a default value of current_timestamp
prev_id, more on this below
hash, the SHA1 hash on the filesystem
name, a textual name of the file (such as the original name the file would have had on disk).
When I need a hierarchical directory structure, I would also create a directory table and add a dir_id to the list of columns above.
If I edit the file represented by ./98/a7/98a75af529f07b1ef7be7400f51344b9f07b1ef7, I don't actually change that file on disk; I create a new one (because the new file contents would be represented by a new SHA1 hash) and create a new entry in the files table where prev_id equals the file_id of the file I edited. In other words, I now have versioning.
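A sketch of that versioned save, assuming the MySql.Data connector and the table layout above (the column names follow the list; the connection and everything else are placeholders):

    using System;
    using MySql.Data.MySqlClient; // assumes the MySQL Connector/NET package

    class FileVersionSketch
    {
        // The new file contents are assumed to be already written under the new hash's path;
        // this only records the new row and links it back to the version it replaces.
        static void SaveNewVersion(MySqlConnection connection,
                                   string newHash, string name, long? previousFileId)
        {
            var command = new MySqlCommand(
                "INSERT INTO files (hash, name, prev_id) VALUES (@hash, @name, @prev)",
                connection);
            command.Parameters.AddWithValue("@hash", newHash);
            command.Parameters.AddWithValue("@name", name);
            command.Parameters.AddWithValue("@prev", (object)previousFileId ?? DBNull.Value);
            command.ExecuteNonQuery();
        }
    }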
If I need this to be available in a distributed fashion, I set up MySQL replication and then use GlusterFS to replicate the filesystem across multiple servers.
I think you will find a fair amount of debate on this as I did when I began looking into this. I tend to lean toward storing in the file system and maintaining a reference. However, that is not to say that there is never a time to store binary data in a database.
I would say that simply keeping things in sync is not, in itself, an argument for storing binary data in a database. There are certainly ways to keep a file system in sync alongside a database.
The bottom line is that there is a fair amount of debate on this topic and you have to go with what works for you. If what you have set up works, use it. Do performance and load testing to make sure it holds up. If it doesn't, change it.
We have a web application that must allow the user to upload files with zip codes; these files are .csv files. Any user will be able to upload the file from their computer; the issue is that the file may contain thousands of records. Right now I am getting the file and making sure it has the right headers, but I am pushing the records one by one into the database.
I am using C# ASP.NET. Is there a better, more efficient way to do this in code? We can't use any external importers or tools like SQL Server Business Intelligence. How can I do this? I was reading something about loading it into memory and then pushing it to the database. Any URLs, examples or suggestions would be much appreciated.
Regards
Firstly, I'm pretty sure that what you are asking is actually "How do you process a large file and insert the processed data into the database?".
Now, assuming I am correct, I would say the question is akin to 'how long is a piece of string?'. The reality is that an implementation for processing large files into a database is highly specific to your requirements.
However, at the simplest end of the spectrum, you could simply upload the file straight into a table (or a folder) and create a Windows service that runs every x minutes, traverses the table, picks up each file, and processes your data using bulk inserts and prepared statements (which may give you some performance benefits).
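To illustrate the bulk-insert part only, a sketch using SqlBulkCopy (the connection string, table and column contents are made up):

    using System.Data;
    using System.Data.SqlClient;

    class BulkInsertSketch
    {
        static void BulkInsertZipCodes(string connectionString, DataTable zipCodes)
        {
            // zipCodes is assumed to already contain the parsed CSV rows,
            // with columns matching the destination table.
            using (var connection = new SqlConnection(connectionString))
            {
                connection.Open();
                using (var bulkCopy = new SqlBulkCopy(connection))
                {
                    bulkCopy.DestinationTableName = "dbo.ZipCodes"; // hypothetical table
                    bulkCopy.BatchSize = 5000;
                    bulkCopy.WriteToServer(zipCodes);
                }
            }
        }
    }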
Alternatively, you could look at something like MSMQ (Microsoft Message Queuing) and save any uploaded files directly to a queue, which is then completely independent of your application, can be processed at any point in time, and can easily be scaled out.
At the end of the day, though, I honestly don't think anyone here can give you a 'correct' answer to your question, because there really isn't one; you'll only find improvements to your implementation through experimentation.
If this contains up to a million records, the best approach is to create a service to manage inserting the records into the database, to avoid timeouts and keep stress off the web/IIS process.
If you make it a Windows service, you can notify the service to process the zip files in the directory where they were uploaded.
Also, I would suggest using bulk insert for faster database transactions.
If there is validation, you could stage the data into a different database, validate it, and then push it to the final database.
Since these records go into the same table and are not related to each other, Parallel.ForEach may be a valid answer here. Assuming you have a static method (it may not necessarily need to be static) that inserts an individual record into the DB, you can run a Parallel.ForEach loop over an array where each index represents a line of the CSV.
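A rough sketch of that loop (the CSV layout and the InsertRecord method are placeholders):

    using System.IO;
    using System.Linq;
    using System.Threading.Tasks;

    class ParallelImportSketch
    {
        static void ImportCsv(string csvPath)
        {
            // Skip the header row, then insert the remaining lines concurrently.
            var lines = File.ReadAllLines(csvPath);

            Parallel.ForEach(lines.Skip(1), line =>
            {
                var fields = line.Split(',');
                InsertRecord(fields);
            });
        }

        // Placeholder for the single-record insert. Each call should use its own
        // connection (or one from the pool), since connections are not thread-safe.
        static void InsertRecord(string[] fields) { }
    }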
This assumes that uploading the large file to the server isn't the initial issue. If that is also part of the problem, I would recommend zipping the file and then using something like SharpZipLib to unzip it once it is uploaded. Since text compresses very well, this may be the biggest boon to performance from the user's perspective.
How would you store a PDF document in a field in MySQL?
Currently I have a list of customers and each customer has a certificate with information about their account that they can give to other companies to prove that they're our customer. Currently their certificate is exported as a PDF and e-mailed to someone here at work (the customer gets a physical copy as well), and that person's mailbox is filled with these e-mails. I'd much prefer to just have it in the customer's record - allowing it to be accessed via the customer's file in our in-house CRM.
I considered putting the PDFs in a folder and storing their location as a varchar in the customer's record, but if the PDFs get moved/deleted/etc. then we're up a creek.
My understanding is that a BLOB or MEDIUMBLOB is the type of field that I'd use to store it, but I'm a little ignorant in this regard. I'm not sure how to store something like that in the field (what C# datatype to give it), and then how to get it and open it via a PDF reader.
Put it in the database, but the BLOB datatype probably won't cut it. The MEDIUMBLOB is normally sufficient.
MySQL data types:
BLOB, TEXT: L + 2 bytes, where L < 2^16
MEDIUMBLOB, MEDIUMTEXT: L + 3 bytes, where L < 2^24
LONGBLOB, LONGTEXT: L + 4 bytes, where L < 2^32
I've used this several times with very good results. Be sure to save the filesize too, as it makes it easier to retrieve it. Not sure if it applies to C# as it does to PHP.
If using prepared statements with parameters the data will automatically be escaped AFAIK.
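For the C# side, a parameterised insert with Connector/NET might look roughly like this (the table, column names and connection string are placeholders, and a MEDIUMBLOB column is assumed):

    using System.IO;
    using MySql.Data.MySqlClient; // assumes the MySQL Connector/NET package

    class PdfBlobSketch
    {
        static void StoreCertificate(string connectionString, int customerId, string pdfPath)
        {
            byte[] pdfBytes = File.ReadAllBytes(pdfPath);

            using (var connection = new MySqlConnection(connectionString))
            {
                connection.Open();
                var command = new MySqlCommand(
                    "UPDATE customers SET certificate = @pdf, certificate_size = @size " +
                    "WHERE customer_id = @id", connection);
                command.Parameters.AddWithValue("@pdf", pdfBytes);        // stored in a MEDIUMBLOB column
                command.Parameters.AddWithValue("@size", pdfBytes.Length); // file size saved as suggested above
                command.Parameters.AddWithValue("@id", customerId);
                command.ExecuteNonQuery();
            }
        }
    }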
Also, I can see no real reason why the database itself would get slow when storing this type of data in it. The main bottleneck will of course be the transfer of the data. Also, MySQL is sometimes restrictive about the maximum length of queries, and of the responses in particular.
Once you have it running, it's pretty neat, especially when dealing with lots of small files. For a small number of large files this approach does not make sense; better to use some backup system to deal with moved/deleted files.
http://www.phpriot.com/articles/images-in-mysql is a good tutorial with some background information, and an implementation of storing images in MySQL
Honestly, I think that going with links instead of actually inserting the file into the database is the best way to go. Storing the files in the database will make it very slow, and will be more trouble than it's worth.
I would upload the files to a designated folder like "certificates" and name the certificates with a client number so they are easy to find, edit, remove, etc. I have seen people store images in databases, but even that is advised against.
If the method you wish is definitely a must, check out this article:
http://www.wellho.net/mouth/1001_-pdf-files-upload-via-PHP-store-in-MySQL-retrieve.html
It explains how to store and retrieve PDF files in a MySQL database.
Best of luck!
"I considered putting the PDFs in a folder and storing their location as a varchar in the customer's record, but if the PDFs get moved/deleted/etc. then we're up a creek."
That's the approach I would take. Then, using some logic (perhaps some BPEL-type tooling), detect if any of the files are moved or deleted and fire off a trigger to your DB to update or remove the stored location.
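If the PDFs sit on a path the application can watch, a FileSystemWatcher is one way to get that detection; a rough sketch (the folder and the two database calls are placeholders):

    using System.IO;

    class CertificateWatcherSketch
    {
        static FileSystemWatcher StartWatching(string certificateFolder)
        {
            var watcher = new FileSystemWatcher(certificateFolder, "*.pdf");

            // On delete or rename, update (or clear) the stored location in the DB.
            watcher.Deleted += (sender, e) => MarkMissingInDatabase(e.FullPath);
            watcher.Renamed += (sender, e) => UpdatePathInDatabase(e.OldFullPath, e.FullPath);

            watcher.EnableRaisingEvents = true;
            return watcher;
        }

        static void MarkMissingInDatabase(string path) { /* placeholder DB update */ }
        static void UpdatePathInDatabase(string oldPath, string newPath) { /* placeholder DB update */ }
    }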
I am creating an RSS reader as a hobby project, and I'm at the point where the user adds his own URLs.
I was thinking of two things.
A plain text file where each URL is a single line
SQLite, where I can have unique IDs and descriptions alongside the URL
Is the SQLite idea too much overhead, or is there a better way to do things like this?
What about an OPML file? It's XML, so if you need to store more data than the OPML specification supplies, you can always add your own namespace.
Additionally, importing and exporting from other RSS readers is all done via OPML, and there is often library support for it. If you're interested in having users switch, then you have to support OPML. Thanks to jamesh for bringing that point up.
Why not XML?
If you're dealing with RSS anyway, you may as well :)
Do you plan just to store URLs, or do you plan to add data like last_fetch_time?
If it's just a simple URL list that your program will read line by line and download data from, store it in a file, or even better as a serialized object written to a file.
If you plan to extend it (add comments, time of last fetch, etc.), I'd go for SQLite; it's not that much overhead.
If it's a single user application that only has one instance, SQLite might be overkill.
You've got a few options as I see it:
SQLite / database layer. Increases the dependencies your code needs to run, but allows concurrent access.
Roll your own text parser. Complexity increases as you want to save more data, and you're re-inventing the wheel. On the other hand, there are fewer dependencies, and initially, while your data is simple, it's trivial for a novice user of your application to edit.
Use XML. It's well formed and defined, and text editable. It could be overkill for storing just a URL, though.
Use something like pickle to serialize your objects and save them to disk. Changes to your data structure mean "upgrading" the pickle files. Not very intuitive for a novice user to edit, but extremely easy to implement (a rough C# analogue is sketched after this list).
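For a C# reader, the rough analogue of pickling would be any object serializer; a minimal sketch with System.Text.Json (the FeedEntry type is made up):

    using System.Collections.Generic;
    using System.IO;
    using System.Text.Json;

    class FeedEntry
    {
        public string Url { get; set; }
        public string Description { get; set; }
    }

    class SerializedStoreSketch
    {
        // Write the whole feed list to one file, and read it back on startup.
        static void Save(string path, List<FeedEntry> feeds) =>
            File.WriteAllText(path, JsonSerializer.Serialize(feeds));

        static List<FeedEntry> Load(string path) =>
            File.Exists(path)
                ? JsonSerializer.Deserialize<List<FeedEntry>>(File.ReadAllText(path))
                : new List<FeedEntry>();
    }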
I'd go with the XML text file option. You can use the XSD tool built into Visual Studio to create a DataTable out of the XML data, and it easily serializes back into the file when needed.
The other caveat is that I'm sure you're going to want the end user to be able to categorize their RSS feeds and potentially search/sort them, and having that kind of DataTable structure will help with this.
You'll get easy file storage and access, plus the benefit of a "database" structure, without quite the overhead of SQLite.
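A minimal sketch of that DataTable-backed XML store (the table and column names are only illustrative):

    using System.Data;

    class XmlFeedStoreSketch
    {
        static void Main()
        {
            var feeds = new DataSet("Feeds");
            var table = feeds.Tables.Add("Feed");
            table.Columns.Add("Url", typeof(string));
            table.Columns.Add("Category", typeof(string));

            table.Rows.Add("http://example.com/rss.xml", "News");

            // Persist to an XML file (schema included) and read it back later.
            feeds.WriteXml("feeds.xml", XmlWriteMode.WriteSchema);

            var loaded = new DataSet();
            loaded.ReadXml("feeds.xml");
        }
    }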