Store a PDF file in MySQL - C#

How would you store a PDF document in a field in MySQL?
Currently I have a list of customers and each customer has a certificate with information about their account that they can give to other companies to prove that they're our customer. Currently their certificate is exported as a PDF and e-mailed to someone here at work (the customer gets a physical copy as well), and that person's mailbox is filled with these e-mails. I'd much prefer to just have it in the customer's record - allowing it to be accessed via the customer's file in our in-house CRM.
I considered putting the PDFs in a folder and storing their location as a varchar in the customer's record, but if the PDFs get moved/deleted/etc. then we're up a creek.
My understanding is that a BLOB or MEDIUMBLOB is the type of field that I'd use to store it, but I'm a little ignorant in this regard. I'm not sure how to store something like that in the field (what C# datatype to give it), and then how to get it and open it via a PDF reader.

Put it in the database, but the BLOB datatype probably won't cut it. The MEDIUMBLOB is normally sufficient.
MySQL Datatypes
BLOB, TEXT: L + 2 bytes, where L < 2^16
MEDIUMBLOB, MEDIUMTEXT: L + 3 bytes, where L < 2^24
LONGBLOB, LONGTEXT: L + 4 bytes, where L < 2^32
I've used this several times with very good results. Be sure to save the filesize too, as it makes it easier to retrieve it. Not sure if it applies to C# as it does to PHP.
If using prepared statements with parameters the data will automatically be escaped AFAIK.
Also, I can see no real reason why the database itself would get slow when storing this type of data in it. The main bottleneck will of course be the transfer of the data. Note as well that MySQL restricts the maximum size of a single query and, in particular, of the response packets (see max_allowed_packet), so very large values may require that setting to be raised.
Once you have it running, it's pretty neat, especially when dealing with lots of small files. For a small number of large files this approach does not make sense; better to use some backup system to deal with moved/deleted files.
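To make this concrete in C#: the natural datatype for a BLOB column is simply byte[]. Below is a minimal sketch using the MySql.Data connector; the certificates table and its column names are made up for illustration.

```csharp
using System.Diagnostics;
using System.IO;
using MySql.Data.MySqlClient;

class CertificateStore
{
    // Store a PDF: the C# type for a BLOB column is simply byte[].
    public static void SavePdf(string connStr, int customerId, string pdfPath)
    {
        byte[] pdfBytes = File.ReadAllBytes(pdfPath);
        using (var conn = new MySqlConnection(connStr))
        {
            conn.Open();
            var cmd = new MySqlCommand(
                "INSERT INTO certificates (customer_id, pdf_data) VALUES (@id, @pdf)", conn);
            cmd.Parameters.AddWithValue("@id", customerId);
            cmd.Parameters.AddWithValue("@pdf", pdfBytes); // parameterized, so no manual escaping
            cmd.ExecuteNonQuery();
        }
    }

    // Retrieve the PDF, write it to a temp file, and open the user's default PDF reader.
    public static void OpenPdf(string connStr, int customerId)
    {
        using (var conn = new MySqlConnection(connStr))
        {
            conn.Open();
            var cmd = new MySqlCommand(
                "SELECT pdf_data FROM certificates WHERE customer_id = @id", conn);
            cmd.Parameters.AddWithValue("@id", customerId);
            byte[] pdfBytes = (byte[])cmd.ExecuteScalar();

            string tempFile = Path.Combine(Path.GetTempPath(), customerId + ".pdf");
            File.WriteAllBytes(tempFile, pdfBytes);
            Process.Start(tempFile); // shell-executes, opening the default PDF reader
        }
    }
}
```

For files larger than the server's max_allowed_packet you would need to raise that setting or stream the data in chunks.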

http://www.phpriot.com/articles/images-in-mysql is a good tutorial with some background information and an implementation of storing images in MySQL.

Honestly, I think that going with links instead of actually inserting the file into the database is the best way to go. Inserting the files will make the database very slow, and will be more trouble than it's worth.
I would upload the files to a designated folder like "certificates" and name each certificate after its client number so they are easy to find, edit, remove, etc. I have seen people store images in databases, but even that is advised against.
If the method you describe is definitely a must, check out this article:
http://www.wellho.net/mouth/1001_-pdf-files-upload-via-PHP-store-in-MySQL-retrieve.html
It explains how to store and retrieve .pdf files in a MySQL database.
Best of luck!

I considered putting the PDFs in a folder and storing their location as a varchar in the customer's record, but if the PDFs get moved/deleted/etc. then we're up a creek.
That's the approach I would take. Then, using some logic - perhaps a file-system watcher or some BPEL-type process - detect if any of the files are moved or deleted and fire off a trigger to your DB to update or remove the stored location.
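As a sketch of that idea in C# (using a FileSystemWatcher rather than BPEL; the two DB helper methods are placeholders you would wire to your own update code):

```csharp
using System.IO;

class CertificateFolderWatcher
{
    public static FileSystemWatcher Watch(string folder)
    {
        var watcher = new FileSystemWatcher(folder, "*.pdf");

        // A PDF deleted (or moved out of the watched folder) raises Deleted:
        // clear the stored location in the customer's record.
        watcher.Deleted += (s, e) => RemoveLocationFromDb(e.FullPath);

        // A PDF renamed/moved within the folder raises Renamed: update the path.
        watcher.Renamed += (s, e) => UpdateLocationInDb(e.OldFullPath, e.FullPath);

        watcher.EnableRaisingEvents = true;
        return watcher;
    }

    // Placeholder: e.g. UPDATE customers SET cert_path = NULL WHERE cert_path = @path
    static void RemoveLocationFromDb(string path) { }

    // Placeholder: e.g. UPDATE customers SET cert_path = @new WHERE cert_path = @old
    static void UpdateLocationInDb(string oldPath, string newPath) { }
}
```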

Related

C# combined files directly accessible

I am creating a very simple database in C# which I use to store playlists and an overview of all my music. Because I want to make this C-compatible in the future, I plan to make it completely text-based. The idea is that every text file is a table, and the contents are in JSON format, where every line of text is a record.
I don't want to have loose files for each database, so I was thinking about something like a zip file. I don't want to extract and compress every time I access a file. Is there someway I can use a stream reader/writer in C# on different files where windows only see one file?
I'm not completely convinced that this is the way to go. So I'm open to suggestions.
Update,
I'm currently messing around with the "Local database" item in C#. I had never paid any attention to it. It could very well be the solution.
Update2,
SQLite seems to be very simple. I have some experience with MySQL from some PHP projects in the past, so that will give me a head start.
You want to use a file as a container containing different files? If so, there are a lot of ways to accomplish this. These are techniques I have used in the past:
Zip:
A compressed file format such as zip is known to behave that way and can be used as a solution here. It is capable of storing virtual files, which can vary in size up to at least 1 GB (tested, but I currently don't know if there are implementation-based size limits).
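As a sketch, .NET's System.IO.Compression can read and write individual zip entries through streams without extracting the archive to disk; the entry and method names here are made up:

```csharp
using System.IO;
using System.IO.Compression;

class ZipContainer
{
    // Append one JSON record to a "table" (an entry) inside a single zip file.
    public static void AppendRecord(string zipPath, string tableName, string jsonLine)
    {
        // Update mode lets us open entries for both reading and writing.
        using (var archive = ZipFile.Open(zipPath, ZipArchiveMode.Update))
        {
            var entry = archive.GetEntry(tableName) ?? archive.CreateEntry(tableName);
            using (var stream = entry.Open())
            {
                stream.Seek(0, SeekOrigin.End); // append after existing records
                using (var writer = new StreamWriter(stream))
                    writer.WriteLine(jsonLine);
            }
        }
        // Note: on dispose the whole archive is rewritten, so frequent small
        // writes are convenient to code but not cheap at runtime (a point the
        // answer further below also makes).
    }
}
```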
SQLite:
SQLite sounds old-school, but it stores all database-related stuff in one physical file. Creating a database with a table for each virtual file should do the trick. This approach is useful if you know that your virtual files won't be very large or won't hit the limits of SQLite's field datatypes. Since your virtual files are going to be lines of text, you may be able to map them onto attributes and tuples. That way you can even use SQL statements to query and filter your data as you wish.
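A minimal sketch of that approach using the Microsoft.Data.Sqlite package (the table layout is just an assumption):

```csharp
using Microsoft.Data.Sqlite;

class PlaylistDb
{
    // One physical file on disk; each "virtual file" becomes rows in a table.
    public static void AddTrack(string dbPath, string playlist, string trackJson)
    {
        using (var conn = new SqliteConnection($"Data Source={dbPath}"))
        {
            conn.Open();
            var cmd = conn.CreateCommand();
            cmd.CommandText =
                "CREATE TABLE IF NOT EXISTS tracks (playlist TEXT, record TEXT);" +
                "INSERT INTO tracks (playlist, record) VALUES (@p, @r);";
            cmd.Parameters.AddWithValue("@p", playlist);
            cmd.Parameters.AddWithValue("@r", trackJson);
            cmd.ExecuteNonQuery();
        }
    }
}
```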
There are still more ways to implement that kind of container format on your own, but that probably requires more time and work than you would get out of it. Stay tuned for better ideas and, perhaps, ready-to-use implementations :-)
Will you ever try to search within your data? Then use a real database manager; in C#, the built-in local database file is the simplest choice (if you are familiar with SQL).
The zip file is a good choice for data space and compactness (a single file instead of many files), but it is very slow: for each database operation the whole zip file has to be reorganized. Even a tar file (without compression) needs continuous reallocation when the content changes, and a zip file needs extra computation and relocation on top of that.
If you want something that is compressed and still standard, you can use OpenXML (ods or xlsx, it does not matter) to store your data, but the save operation will be slow, and slower still as your database grows.

What would be the optimal way to store large amounts of Unicode text?

I am developing a project where I need to store around 15k Unicode characters. What would be the best way to store this?
The main application is in C#, and some other data is stored in a SQL Server DB. This huge amount of text needs to be identifiable in some way by a randomly generated entry key and a category key. Obviously, there may (and should) be more than one entry with the same category key.
These entries will be added, retrieved, and also searched using keywords by category key.
I am currently looking at the following 2 ways: (Other ideas more than welcome)
Files
Each category key represented as a folder and each entry as a file using the entry key as the file name.
To search, I would use the Apache Lucene.Net project to build an index and search against it.
SQL Server
Just stored as another column of type NVARCHAR(MAX) in a table.
Which of these ways is best? I am looking for other options, and pros/cons about these.
To answer your question, you have to answer these questions:
Will you store more than 2 GB of data? The maximum size of nvarchar(max) is 2 GB.
Will you manipulate this data inside SQL Server (full-text search, grouping, and so on)? You can't join or group by data stored in files.
Do you need transactional operations? You could add a file and then fail to add the record to the DB, and vice versa.
Once you have answers to these questions, you can decide.
My advice: store large data in files or other blob storage (Azure Blob, Amazon S3, and so on) and keep a table listing those files.
Pros:
Small database size - easy to back up, easy to restore
Fast queries against the file-list table (counts, joins, grouping and so on)
Cons:
You need to keep your database and file storage in sync
Operations are non-transactional, but that can be worked around by ordering them: save (or delete) the file first, then make the change in the DB. If the DB step fails, just restart the operation from the first step (sketched below).
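A sketch of that ordering in C#; the folder layout and the DB helper are placeholders:

```csharp
using System.IO;

class FileFirstStore
{
    // Save the file first; only if that succeeds, record it in the DB.
    // If the DB insert fails, remove the orphaned file and retry later.
    public static void SaveEntry(string folder, string entryKey, string text)
    {
        string path = Path.Combine(folder, entryKey + ".txt");
        File.WriteAllText(path, text); // step 1: file storage

        try
        {
            InsertFileListRow(entryKey, path); // step 2: DB row
        }
        catch
        {
            File.Delete(path); // undo step 1 by hand, then surface the error
            throw;
        }
    }

    // Placeholder: e.g. INSERT INTO files (entry_key, path) VALUES (@k, @p)
    static void InsertFileListRow(string entryKey, string path) { }
}
```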
It's much easier having all data in one datastore, so I would go with the SQL Server solution.
However, if you are primarily concerned with storage space and the text is mainly ASCII, then encoding it as UTF-8 would save roughly 50%. SQL Server does not support UTF-8 storage, only UTF-16 (UCS-2), so saving a separate file could have benefits.
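The roughly 50% figure is easy to check, since .NET exposes byte counts for both encodings:

```csharp
using System;
using System.Text;

class EncodingSizes
{
    static void Main()
    {
        string text = new string('a', 15000); // 15k mostly-ASCII characters

        // NVARCHAR stores UTF-16: 2 bytes per ASCII character.
        Console.WriteLine(Encoding.Unicode.GetByteCount(text)); // 30000

        // UTF-8 in a file: 1 byte per ASCII character, half the size.
        Console.WriteLine(Encoding.UTF8.GetByteCount(text));    // 15000
    }
}
```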

Database Relational Records Archive & Restore

Years back, I created a small system against a requirement where an image snapped from Android was uploaded onto a server along with its respective custom data, then stored on disk, and the custom data describing the image was further broken up and stored in the database. Each of the snapped images was actually part of a campaign. Over time the system has grown, and there are now over 10,000 campaigns with 500-1000 images per campaign. Though the performance is not all that bad, I believe it's just a matter of time. We are now thinking of archiving the past campaigns in another database called Archive. Here is what I am planning to do.
1) The Archive database will have the exact same structure, and the archive functionality may have a search mechanism; however, retrieval speed is not much of a concern here, as this will happen very rarely.
2) I was thinking of removing records from one database and cloning them in the other, however the identity column probably will not let me do that very seamlessly. (and I may be wrong too.)
3) There needs to be a restore option too. (This is probably the most challenging part)
4) If I just make the records blank (except for the identity) in the original database and copy them to the other with no identity constraint, it is probably not going to help, and I think it will lose the purpose of the exercise.
Any advise over this? Is there any known strategy or pattern or literature or even a link that may guide me on this?
Thank you in advance for your help.
I say: as long as you don't run out of space on your server, leave it as it is.
Over the period, the system went on growing enough and now there are now over 10,000 campaigns already and over 500-1000 images per campaign.
→ That's 5-10 million rows (created over several years).
For SQL Server, that's not that much.
Yes, I know...we're talking about image files stored in the database, not "regular" rows. Still, if your server has reasonably sized hardware, it shouldn't really matter.
I'm talking from experience here - at work, we have a SQL Server database which we use to store PDF files and images.
In our case, we're using a "regular" image column - since you're using SQL Server 2008, you could even use FILESTREAM (maybe you already do, but I don't know - you didn't say exactly how you're storing the images in the database).
We started the project on SQL Server 2005, where FILESTREAM wasn't available yet. In the meantime, we upgraded to SQL Server 2012, but never changed the data type in the table where we're storing the files.
If you still prefer creating a separate archive database and moving old data there, one piece of advice concerning this:
2) I was thinking of removing records from one database and cloning them in the other, however the identity column probably will not let me do that very seamlessly. (and I may be wrong too.)
[...]
4) If I just make the records blank (except for the identity) in the original database and copy them to the other with no identity constraint, it is probably not going to help, and I think it will lose the purpose of the exercise.
You don't need to set the column to identity in the archive database as well.
Just leave everything as it is in the main database, but remove the identity setting from the primary key in the archive database.
The archive database doesn't ever need to generate new keys (hence no need for identity), you're just copying rows with already existing keys from the main database.
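A sketch of what the copy could look like, assuming SQL Server, both databases on the same server, and made-up table/column names; because the archive's key column is not an identity, the existing keys can be inserted directly:

```csharp
using System.Data.SqlClient;

class CampaignArchiver
{
    public static void ArchiveCampaign(string connStr, int campaignId)
    {
        using (var conn = new SqlConnection(connStr))
        {
            conn.Open();
            using (var tx = conn.BeginTransaction())
            {
                // Copy rows, keeping their existing keys (no identity in Archive).
                var copy = new SqlCommand(
                    "INSERT INTO Archive.dbo.Campaigns (CampaignId, Name, CreatedOn) " +
                    "SELECT CampaignId, Name, CreatedOn FROM dbo.Campaigns " +
                    "WHERE CampaignId = @id", conn, tx);
                copy.Parameters.AddWithValue("@id", campaignId);
                copy.ExecuteNonQuery();

                // Then remove them from the main database.
                var del = new SqlCommand(
                    "DELETE FROM dbo.Campaigns WHERE CampaignId = @id", conn, tx);
                del.Parameters.AddWithValue("@id", campaignId);
                del.ExecuteNonQuery();

                tx.Commit();
            }
        }
    }
}
```

Restoring is the same copy in the opposite direction, again with the original keys intact.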
I think a good solution for your case is SSIS. This technology can load big volumes of data into your Archive system quickly. In addition, you can use table partitioning to improve the performance of manipulating big data in the Archive system. Also check out columnstore indexes (though availability depends on your version of SQL Server).
I created such solution with following steps:
1) Switch the partition holding the oldest rows from the main table t to another table t_1 in the production system
2) Load the data from table t_1 into the Archive system
3) Drop or truncate table t_1
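Roughly, steps 1 and 3 could look like this (a sketch with assumed table names; SWITCH requires t_1 to have an identical schema and to sit on the same filegroup as the switched partition):

```csharp
using System.Data.SqlClient;

class PartitionArchiver
{
    // Assumes an already-open connection; dbo.t is partitioned, and dbo.t_1
    // is an empty staging table matching dbo.t's schema.
    public static void SwitchOutOldestPartition(SqlConnection conn)
    {
        // Step 1: a metadata-only operation, fast even for very large tables.
        using (var sw = new SqlCommand(
            "ALTER TABLE dbo.t SWITCH PARTITION 1 TO dbo.t_1", conn))
            sw.ExecuteNonQuery();

        // Step 2 happens here: bulk-load dbo.t_1 into the Archive system (e.g. via SSIS).

        // Step 3: empty the staging table for the next run.
        using (var tr = new SqlCommand("TRUNCATE TABLE dbo.t_1", conn))
            tr.ExecuteNonQuery();
    }
}
```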

Storing large files / binary data in a mysql database: when is it ok?

Ok, I have searched about this and read a few points of view about storing binary data in a [MySQL] database. Generally I consider this a bad idea and try to avoid it, favouring traditional file transfers and just storing a reference to the file in a database.
However, I am working on a project which requires database synchronisation with a remote/cloud database, not just for files, but also for settings and other user content. For this, and other reasons, I felt this might be an appropriate situation for binary storage in a database.
I have written a general system for the database sync which works well using Reflection and XML. I have also (against my instincts) integrated the file storage into this system. Again, it works well - I chop files into 64KB BLOBs and store them in a table, with a file_id reference (linked to a separate table which contains metadata such as file name/size/MIME type).
This enables me to send bits and pieces as and when a connection is available, and also allows me to limit each request size to keep things running smoothly.
So far I have not found any issues with this, and have successfully imported and transferred over 1GB of data in both directions (across about 10-15 files / 16,000 rows), but I worry about its scalability - will it slow down once there is 20GB+ of data in there, or can MySQL handle it provided my queries are well structured?
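For reference, the chunked storage described above might look roughly like this in C# (a sketch; the file_chunks table and its columns are assumptions):

```csharp
using System;
using System.IO;
using MySql.Data.MySqlClient;

class ChunkedBlobStore
{
    const int ChunkSize = 64 * 1024; // 64KB per BLOB row, as described above

    public static void StoreFile(MySqlConnection conn, long fileId, string path)
    {
        using (var stream = File.OpenRead(path))
        {
            var buffer = new byte[ChunkSize];
            int read, seq = 0;
            while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
            {
                // Copy only the bytes actually read for the final short chunk.
                var chunk = new byte[read];
                Array.Copy(buffer, chunk, read);

                var cmd = new MySqlCommand(
                    "INSERT INTO file_chunks (file_id, seq, data) VALUES (@f, @s, @d)", conn);
                cmd.Parameters.AddWithValue("@f", fileId);
                cmd.Parameters.AddWithValue("@s", seq++); // ordering key for reassembly
                cmd.Parameters.AddWithValue("@d", chunk);
                cmd.ExecuteNonQuery();
            }
        }
    }
}
```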
Another reason for my decision to store the data in the database was that I figured I could simply add another HDD/storage device to MySQL if space ran low, in the hope of efficient scaling/replication/etc.
I would very much appreciate any views or comments as to whether this is a good or bad approach, and have I missed any obvious problems I'm likely to see once used in a production environment?
edit: I forgot to mention, the file sizes could range from 1KB to ~1GB
[Rough] Conclusion
Firstly: thanks very much to those who contributed a considered answer. Choosing the accepted answer here has been quite difficult as each has something decent to offer.
In the end (despite my hopes), I have decided that a pure MySQL storage server is at best only an ok solution (I still can't help wondering why they bother including the BLOB types though).
As the alternative, I am torn between @Nick Coons' file system approach and @tadman's suggestion of a hybrid using a lightweight key/value database engine such as LevelDB. Provided the practicalities of using LevelDB in this project are not an issue, this is most likely the approach I will work towards.
I have accepted tadman's answer on this basis; his answer was also most applicable and useful to my situation.
That being said, and for those that are interested: I have enjoyed quite a lot of success using only MySQL so far. I have tested a table storing over 15GB of binary data without any noticeable negative side effects when inserting/retrieving data from large tables (with careful queries). However, I am certain this is still very inefficient, and either of the alternative methods mentioned will be significantly better.
I have to wonder why you're even bothering with a database at all, when the layer you've added on top to chunk, store, retrieve and reassemble would work just as well on a well-defined filesystem structure. MySQL wants all of its data on a single volume, so it's not a case of adding another drive whenever you feel like it, and replication of large amounts of binary data is going to be cripplingly slow as the binary logs will end up duplicating the amount of data you need to store.
The simplest approach is often the best one. Storing this in the filesystem directly is probably the best way to do it. If you need to keep an index of what's stored where, maybe you'd use a database like MySQL, but there's many ways to accomplish this same task. The more low-tech, the better. For example, don't rule out SQLite because an embedded database performs very well under light read and write load, and has the advantage of being "just a file" when it comes to backing up and restoring.
That being said, what you're doing sounds suspiciously similar to LevelDB, so before you commit to your approach, you'd have to see how it's significantly different than a key-value document store of that variety.
Short Answer:
I'm not sure there's a hard-lined way to answer this. You mentioned files being from 1KB to 1GB. I wouldn't store binary data in a DB if it's going to be anywhere near 1KB, let alone 1GB. I may store a few bytes of binary data in a DB if it's incidental, but any large amount of data, especially data that doesn't need to be searched, should be stored in the filesystem:
When you store data in a DB, you're storing it on a filesystem anyway, you've just added another layer (the DB) to the mix. There's a cost to this layer, so there ought to be a benefit to make up the difference. If you're storing the data so that you can search based on it or join it to other data, then this makes sense. But file data, binary or not, is typically not used in that way.
Example Implementation:
There are better methods to distribute file data than to put it into a DB, such as distributed filesystems (check into GlusterFS and MooseFS, both of which will scale by simply adding additional hard drives, whereas MySQL will not).
Typically, I'll store file data in the filesystem using an SHA1 hash of the data as the name of the file. If the hash is 98a75af529f07b1ef7be7400f51344b9f07b1ef7, then I'll store it in this directory structure:
./98/a7/98a75af529f07b1ef7be7400f51344b9f07b1ef7
That is, a top-level directory made up of the first two characters, a second-level directory made up of the second two characters, and then finally the file with the name of the total string. In this way, I can literally have billions of files without having so many in a single directory that the system is too slow to function.
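A sketch of that layout in C#:

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

class HashStore
{
    // Store a file under ./xx/yy/<full-sha1>, named by the SHA1 of its contents.
    public static string Store(string rootDir, byte[] contents)
    {
        string hash;
        using (var sha1 = SHA1.Create())
            hash = BitConverter.ToString(sha1.ComputeHash(contents))
                               .Replace("-", "").ToLowerInvariant();

        // Two directory levels from the first four hex characters keeps any
        // single directory small enough to stay fast.
        string dir = Path.Combine(rootDir, hash.Substring(0, 2), hash.Substring(2, 2));
        Directory.CreateDirectory(dir); // creates both levels if missing

        File.WriteAllBytes(Path.Combine(dir, hash), contents);
        return hash; // this goes into the metadata table described below
    }
}
```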
Then I create a DB table with these columns to hold the meta data:
file_id, an auto_increment field
created, a field with a default value of current_timestamp
prev_id, more on this below
hash, the SHA1 hash on the filesystem
name, a textual name of the file (such as the original name that the file would have taken on disk)
When I need a hierarchical directory structure, I would also create a directory table and add a dir_id to the list of columns above.
If I edit the file represented by ./98/a7/98a75af529f07b1ef7be7400f51344b9f07b1ef7, I don't actually change that file on disk, I create a new one (because the new file contents would be represented by a new SHA1 hash), and create a new entry in the files table where prev_id equals the file_id of the file I edited. In other words, I now have versioning.
If I need this to be available in a distributed fashion, I set up MySQL replication and then use GlusterFS to replicate the filesystem across multiple servers.
I think you will find a fair amount of debate on this as I did when I began looking into this. I tend to lean toward storing in the file system and maintaining a reference. However, that is not to say that there is never a time to store binary data in a database.
I would say that simply keeping things in sync is not, by itself, an argument for storing binary data in a database. There certainly are wayss to keep a file system in sync with a database.
The bottom line is that there is a fair amount of debate on this topic and you have to go with what works for you. If what you have set up works, use it. Do performance and load testing to make sure it works. If it doesn't hold up, change it.

Should I dynamically recreate a PDF, rather than store it in either the database or the filesystem?

I need customers to be able to download PDFs of letters that have been sent to them.
I have read the threads about database versus filesystem storage of documents or images, and it does sound like the consensus is that, for anything more than just a few images, filesystem is the way to go.
What I want to know:
would a reasonable alternative be to just store the letter details in the database, and recreate the PDF 'on the fly' when it is requested?
Is that approach superior or inferior to fetching the PDF from the filesystem?
If it is for archival purposes, I would definitely store the PDF because in future, your PDF generation script may change and then the letter will not be exactly the same as what was originally sent. The customer will be expecting it to be exactly the same.
It doesn't matter what approach is superior, sometimes it is better to go for what approach is safer.
I'd store it off for two reasons
1) If you ever change how you generate the PDF, you probably don't want historical items to change. If you generate them every time, either they will change or you need to keep compatibility code to generate "old-style" records
2) Disk space is cheap. User's patience isn't. Unless you're really pressed for storage or pulling out of storage is harder than generating the PDF, be kind to your users and store it off.
Obviously if you create thousands of these an hour from a sparse dataset, you may not have the storage. But if you have the space, I'd vote for "use it"
Is there a forensics reason why you have to maintain records of letters sent to customers? If you are going to regenerate on the fly, how do you know that future code changes won't rewrite the letter (or, at least, the customer can make that argument in court if the information is used in a lawsuit)...
I'm inclined to say "it depends".
When one document is requested many times, it may be a saving to compose it on the first request and retrieve it subsequently.
OTOH, if most requests for a document are of the just-once type, and the creation process doesn't eat up most of your server capacity, on-the-fly generation has a clear advantage.
If you're using ASP.NET, why not cache the PDF? Your cache can be stored in the database if you like, or kept in memory for as long as you need it. The Enterprise Library implements this for you in the Caching Application Block, and it's remarkably simple to use. If you cache the object - create a store in the database using the block and then load it when you need it - you won't have to worry about re-creating it.
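If the Enterprise Library isn't at hand, the same idea can be sketched with System.Runtime.Caching's MemoryCache (the letter id and generator delegate here are assumptions):

```csharp
using System;
using System.Runtime.Caching;

class PdfCache
{
    static readonly MemoryCache Cache = MemoryCache.Default;

    // Return the cached PDF bytes, regenerating only on a cache miss.
    public static byte[] GetPdf(string letterId, Func<byte[]> generatePdf)
    {
        var cached = Cache.Get(letterId) as byte[];
        if (cached != null)
            return cached;

        byte[] pdf = generatePdf();
        Cache.Set(letterId, pdf, new CacheItemPolicy
        {
            SlidingExpiration = TimeSpan.FromMinutes(30) // keep while actively used
        });
        return pdf;
    }
}
```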
A few things to consider: is the PDF generated based on data as it existed at some point in time, e.g. a bill based on data from the prior month?
If so, Would you use the same template each month to generate this letter?
What happens if/when the letter format changes? If you regenerate on the fly, it is no longer the same letter that was sent to them.
Is storing the PDF stream into the database a possibility?
I guess what I am getting at, do you need an exact representation of what was sent to the user, or is that flexible?
The question of whether to generate the pdfs dynamically or store them statically sounds more like a question of law than a question of programming.
If you don't have access to legal counsel that can provide guidance on this then it is going to be far safer to err on the side of caution and store them statically.
As long as the PDF document is of a permanent nature (not just a working document, but something official, signed and sent elsewhere in the company or outside it), you should have a copy of this PDF file on your network, and a link to this file in your database.
You cannot rely on the available data to reproduce the very same document at a different time mainly because:
Data can be changed (yes! Suppose the letter is supposed to be signed by the Head of Department, and the staff has changed?)
Your report format will change (header, footer, logo, etc)
The document you produced is kept by somebody else who will make use of the data available in the document.
