I'm using the TFS API to manage versions of my application's data.
On first use I'm trying to convert all of the database data into the TFS workspace, and then the check-in gets stuck for a long time (it can take more than an hour, if it doesn't hang forever). I'm dealing with 100,000-200,000 files to check in.
Is there any limit in TFS on the number of files in a check-in? If not, what could be the bottleneck of this operation?
Would splitting the check-in into smaller batches of files help? If so, is there a recommended batch size?
The number of changes in a changeset is stored as the CLR's int type.
So there's definitely an upper limit of int.MaxValue or 2,147,483,647.
For more details, you can refer to Edward's answer to this question: Is there a limit on the number of files in a changeset in TFS?
In other words, you are far from the check-in limit. A check-in that stalls or aborts is more likely related to the network connection and the current load on the server.
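If you do want to try splitting the check-in into smaller batches, a rough sketch against the version-control client API could look like the following. It assumes you already have a mapped Workspace with all adds pended, and the batch size is just a number to experiment with, not an official recommendation.

using System;
using System.Linq;
using Microsoft.TeamFoundation.VersionControl.Client;

public static class BatchCheckIn
{
    // Checks in all pending changes of an already-mapped workspace in batches.
    public static void CheckInInBatches(Workspace workspace, int batchSize)
    {
        PendingChange[] pending = workspace.GetPendingChanges();

        for (int i = 0; i < pending.Length; i += batchSize)
        {
            PendingChange[] batch = pending.Skip(i).Take(batchSize).ToArray();

            // Each call creates its own changeset; smaller changesets keep each
            // server round trip shorter and make a failed batch easy to retry.
            int changesetId = workspace.CheckIn(
                batch,
                string.Format("Initial data import, batch {0}", (i / batchSize) + 1));

            Console.WriteLine("Checked in changeset {0} ({1} files).", changesetId, batch.Length);
        }
    }
}

Smaller batches won't raise any limit, but they make it easier to see where the time is being spent and to retry a batch that fails.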
Moreover, as mentioned in the comment above, it's not recommended to check in and version-control database data files in TFS; it's better to generate scripts instead. There is also a discussion about this: Do you use source control for your database items?
The databases themselves? No.
The scripts that create them, including static data inserts, stored procedures and the like; of course. They're text files, they are included in the project and are checked in and out like everything else.
Years back, I created a small system in which an image snapped on an Android device was uploaded to a server along with its custom data; the image was stored on disk, and the custom data describing it was broken up and stored in the database. Each snapped image is actually part of a campaign. Over time the system has kept growing, and there are now over 10,000 campaigns, with roughly 500-1,000 images per campaign. The performance is not all that bad yet, but I believe it's just a matter of time. We are now thinking of archiving past campaigns in another database called Archive. Here is what I am planning to do:
1) The Archive database will have exactly the same structure, and the archive functionality may have a search mechanism; however, retrieval speed is not much of a concern here, as this will happen very rarely.
2) I was thinking of removing records from one database and cloning them in the other; however, the identity column probably will not let me do that very seamlessly. (And I may be wrong, too.)
3) There needs to be a restore option too. (This is probably the most challenging part.)
4) If I just blank out the records (except for the identity) in the original database and copy them to the other with no identity constraint, it is probably not going to help, and I think it would defeat the purpose of the exercise.
Any advice on this? Is there any known strategy, pattern, literature, or even a link that could guide me here?
Thank you in advance for your help.
I say: as long as you don't run out of space on your server, leave it as it is.
Over time the system has kept growing, and there are now over 10,000 campaigns, with roughly 500-1,000 images per campaign.
→ That's 5-10 million rows (created over several years).
For SQL Server, that's not that much.
Yes, I know...we're talking about image files stored in the database, not "regular" rows. Still, if your server has reasonably sized hardware, it shouldn't really matter.
I'm talking from experience here - at work, we have a SQL Server database which we use to store PDF files and images.
In our case, we're using a "regular" image column - since you're using SQL Server 2008, you could even use FILESTREAM (maybe you already do, but I don't know - you didn't say anything about how exactly you're storing the images in the database).
We started the project on SQL Server 2005, where FILESTREAM wasn't available yet. In the meantime, we upgraded to SQL Server 2012, but never changed the data type in the table where we're storing the files.
If you still prefer creating a separate archive database and moving old data there, one piece of advice concerning this:
2) I was thinking of removing records from one database and cloning them in the other; however, the identity column probably will not let me do that very seamlessly. (And I may be wrong, too.)
[...]
4) If I just blank out the records (except for the identity) in the original database and copy them to the other with no identity constraint, it is probably not going to help, and I think it would defeat the purpose of the exercise.
You don't need to set the column to identity in the archive database as well.
Just leave everything as it is in the main database, but remove the identity setting from the primary key in the archive database.
The archive database doesn't ever need to generate new keys (hence no need for identity), you're just copying rows with already existing keys from the main database.
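A minimal sketch of that copy-then-delete move is below, run over a single connection so both statements share one transaction. The database, table, and column names (ArchiveDb, dbo.Campaign, CampaignId, Name, CreatedOn) and the cutoff logic are placeholders for illustration only.

using System;
using System.Data.SqlClient;

public static class CampaignArchiver
{
    // Copies campaigns older than the cutoff into the archive database
    // (same schema, but the key column is NOT an identity column there),
    // then removes them from the main database in the same transaction.
    public static void ArchiveOlderThan(string connectionString, DateTime cutoff)
    {
        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();
            using (SqlTransaction tx = connection.BeginTransaction())
            {
                var copy = new SqlCommand(
                    "INSERT INTO ArchiveDb.dbo.Campaign (CampaignId, Name, CreatedOn) " +
                    "SELECT CampaignId, Name, CreatedOn FROM dbo.Campaign WHERE CreatedOn < @cutoff;",
                    connection, tx);
                copy.Parameters.AddWithValue("@cutoff", cutoff);
                copy.ExecuteNonQuery();

                // Only delete from the main database after the copy succeeded.
                var delete = new SqlCommand(
                    "DELETE FROM dbo.Campaign WHERE CreatedOn < @cutoff;", connection, tx);
                delete.Parameters.AddWithValue("@cutoff", cutoff);
                delete.ExecuteNonQuery();

                tx.Commit();
            }
        }
    }
}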
I think a good solution for your case is SSIS. It can load large volumes of data into your archive system quickly. In addition, you can use table partitioning to make manipulating large amounts of data in the archive system faster. Also look into columnstore indexes (availability depends on your SQL Server version).
I built such a solution with the following steps (a sketch of steps 1 and 3 follows the list):
1) Switch the partition containing the oldest rows from the main table t to a staging table t_1 in the production system.
2) Load the data from table t_1 into the archive system.
3) Drop or truncate table t_1.
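Here is a rough sketch of steps 1 and 3 driven from C#; the table names (dbo.t, dbo.t_1), the partition number, and the connection string are placeholders, and the exact SWITCH statement depends on how your partition function is defined. Step 2 would be the SSIS package or a bulk copy into the archive.

using System.Data.SqlClient;

public static class PartitionArchiveJob
{
    // Step 1: switch one partition of the main table into the staging table.
    // Step 2 (loading t_1 into the archive) is the SSIS package / bulk copy.
    // Step 3: empty the staging table once the archive load has completed.
    public static void SwitchAndClean(string connectionString, int partitionNumber)
    {
        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();

            // SWITCH is a metadata-only operation, so it is nearly instant even for
            // millions of rows, as long as t_1 has a structure identical to t.
            string switchSql = string.Format(
                "ALTER TABLE dbo.t SWITCH PARTITION {0} TO dbo.t_1;", partitionNumber);
            using (var switchCmd = new SqlCommand(switchSql, connection))
            {
                switchCmd.ExecuteNonQuery();
            }

            // ... run the archive load here (step 2) ...

            using (var truncateCmd = new SqlCommand("TRUNCATE TABLE dbo.t_1;", connection))
            {
                truncateCmd.ExecuteNonQuery();
            }
        }
    }
}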
I have some doubts about which approach is better in terms of performance and best practices.
My system will do:
Per document insert and update, or
Batch document insert
As I found in previous systems, #2 is straightforward:
Bulk delete old docs and add new ones, 10k docs max with no more than 20 fields per doc
Commit
Optimize
But #1 still puzzles me, as some customers will add docs one by one.
What is the penalty of committing and optimizing on every insert and update? Or can I just ignore it, as it only happens about 20 times per day?
The Java version is 3.5; the .NET version is 3.0.3.
I just saw a blog post about this and want to know what the community has to say.
I see no need to call .Optimize() at all. Lucene will handle segment merges automatically, and you can provide your own logic to change how the merges are calculated. You could write something that merges away deleted documents once 10% of your documents are marked for deletion. There's no need to force Lucene to merge away every single deleted document.
Sure, you'll end up with more segment files and they will consume file descriptors, but have you ever run into problems where you had too many files open? I tried googling for the maximum number of open files on a Windows server installation, but the answers vary from several thousand to "limited by available memory".
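As a minimal sketch of per-document updates that commit without ever calling Optimize() (written against the Lucene.NET 3.0.x-style API; the index path and field names are placeholders):

using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;

public static class DocUpdater
{
    // Adds or replaces a single document and commits, without calling Optimize().
    // In a real system you would keep one IndexWriter open instead of opening
    // a new one per call.
    public static void AddOrUpdate(string id, string body)
    {
        var directory = FSDirectory.Open(@"C:\indexes\docs");
        var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);

        using (var writer = new IndexWriter(directory, analyzer, IndexWriter.MaxFieldLength.UNLIMITED))
        {
            var doc = new Document();
            doc.Add(new Field("id", id, Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.Add(new Field("body", body, Field.Store.YES, Field.Index.ANALYZED));

            // UpdateDocument deletes any earlier document with the same id term
            // and adds the new one; the deleted doc is merged away later by the
            // merge policy, not by an explicit Optimize().
            writer.UpdateDocument(new Term("id", id), doc);

            // Commit makes the change durable and visible to newly opened readers.
            writer.Commit();
        }
    }
}

Segment merges are left entirely to the default merge policy here; if you want the "10% deleted" behaviour described above, you would plug in your own merge policy instead.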
I'm saving exceptions generated by Elmah as XML files.
Is there any way to configure it so that it automatically removes files older than X days? Or perhaps a maximum number of files in the directory? Or do I need to create a custom batch job that does this?
From the Elmah project site's page on ErrorLogImplementations (emphasis added):
XmlErrorLog
The XmlFileErrorLog stores errors into loose XML files in a configurable directory. Each error gets its own file containing all of its details. The files can easily be copied around, deleted, compressed or mailed to someone for further diagnostics. It does not require any database engine or setup, like with SQL Server and Oracle, so there is very little management overhead and you do not need to worry about additional costs when it comes to hosting plans. Although simple, it relies on the file system performance for shredding through the directory, reading files and sorting through them. A smart way of keeping logs based on XmlFileErrorLog running smoothly is to limit the number of files by scheduling a task to periodically archive the old logs and clean up the folder.
You will need to create a custom batch job that does this.
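A minimal sketch of such a job is below; the log folder path is a placeholder, and you would point it at the logPath configured for XmlFileErrorLog and run it from a scheduled task.

using System;
using System.IO;

public static class ElmahLogCleaner
{
    // Deletes Elmah XML error files older than maxAgeDays from the log folder.
    public static void DeleteOldLogs(string logFolder, int maxAgeDays)
    {
        DateTime cutoff = DateTime.UtcNow.AddDays(-maxAgeDays);

        foreach (string file in Directory.EnumerateFiles(logFolder, "*.xml"))
        {
            if (File.GetLastWriteTimeUtc(file) < cutoff)
            {
                File.Delete(file);
            }
        }
    }
}

// e.g. from a console app run by Task Scheduler:
// ElmahLogCleaner.DeleteOldLogs(@"C:\sites\myapp\App_Data\errors", 30);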
We have a web application that must allow users to upload files with zip codes; these files are .csv files. Any user will be able to upload a file from their computer; the issue is that the file may contain thousands of records. Right now I am receiving the file and making sure it has the right headers, but I am pushing the records into the database one by one.
I am using C# ASP.NET. Is there a better, more efficient way to do this in code? We can't use any external importers or tools like SQL Server Business Intelligence. How can I do this? I was reading something about loading the data into memory and then pushing it to the database. Any URLs, examples, or suggestions would be much appreciated.
Regards
Firstly, I'm pretty sure that what you are asking is actually "How do you process a large file and insert the processed data into the database?".
Now, assuming I am correct, I would say the question is akin to 'How long is a piece of string?'. The reality is that an implementation for processing large files into a database is highly specific to your requirements.
However, at the simplest end of the spectrum, you could simply upload the file straight into a table (or folder) and create a Windows service that runs every x minutes, traverses the table, picks up each file, and processes your data using bulk inserts and the prepare method (which may give you some performance benefits).
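For the bulk-insert part, a rough sketch using SqlBulkCopy might look like the following; it assumes a simple two-column zip-code CSV with a header row, and the table and column names (dbo.ZipCode, Code, City) are placeholders.

using System.Data;
using System.Data.SqlClient;
using System.IO;
using System.Linq;

public static class ZipCodeImporter
{
    // Reads the CSV into a DataTable and pushes it to the database in one shot.
    public static void Import(string csvPath, string connectionString)
    {
        var table = new DataTable();
        table.Columns.Add("Code", typeof(string));
        table.Columns.Add("City", typeof(string));

        foreach (string line in File.ReadLines(csvPath).Skip(1)) // skip the header row
        {
            string[] parts = line.Split(',');
            table.Rows.Add(parts[0].Trim(), parts[1].Trim());
        }

        using (var bulk = new SqlBulkCopy(connectionString))
        {
            bulk.DestinationTableName = "dbo.ZipCode";
            bulk.ColumnMappings.Add("Code", "Code");
            bulk.ColumnMappings.Add("City", "City");
            bulk.BatchSize = 5000;
            bulk.WriteToServer(table);
        }
    }
}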
Alternatively, you could look at something like MSMQ (Microsoft Message Queuing) and save any uploaded files directly to a queue, which is then completely independent of your application, can be processed at any point in time, and can easily be scaled out.
At the end of the day, though, I honestly don't think anyone here can give you a 'correct' answer to your question, because there really isn't one; you'll only find improvements to your implementation through experimentation.
If the file contains up to a million records, the best approach is to create a service to manage inserting the records into the database, to avoid timeouts and reduce stress on the IIS web process.
If you make it a Windows service, you can notify the service to process the uploaded files in the directory where they were placed.
Also, I would suggest using bulk insert for faster database transactions.
If there is validation to do, you can stage the data in a separate database, validate it, and then push it to the final database.
Since these records go into the same table and are not related to each other, Parallel.ForEach may be a valid answer here. Assuming you have a static method (it may not necessarily need to be static) that inserts an individual record into the db, you can run a Parallel.ForEach loop over an array where each element represents a line of the CSV (see the sketch below).
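Here is a sketch of that idea; the table and column names are placeholders, and SaveZipCode stands in for whatever per-record insert method you already have.

using System.Data.SqlClient;
using System.Linq;
using System.Threading.Tasks;

public static class ParallelImporter
{
    // Inserts each CSV data line concurrently; each insert uses its own connection.
    public static void Import(string[] csvLines, string connectionString)
    {
        Parallel.ForEach(csvLines.Skip(1), line =>   // skip the header row
        {
            string[] parts = line.Split(',');
            SaveZipCode(connectionString, parts[0].Trim(), parts[1].Trim());
        });
    }

    // Stand-in for whatever per-record insert method you already have.
    private static void SaveZipCode(string connectionString, string code, string city)
    {
        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(
            "INSERT INTO dbo.ZipCode (Code, City) VALUES (@code, @city);", connection))
        {
            command.Parameters.AddWithValue("@code", code);
            command.Parameters.AddWithValue("@city", city);
            connection.Open();
            command.ExecuteNonQuery();
        }
    }
}

Note that a set-based bulk insert will usually still beat many parallel single-row inserts, so it's worth measuring both.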
This assumes that uploading the large file to the server isn't the initial issue. If that is also part of the issue, I would recommend zipping the file and then using something like SharpZipLib to unzip it once it is uploaded. Since text compresses very well, this may be the biggest boon to performance from the user's perspective.
A CMS we use called Kentico stores Media Library files on the file system, and also stores a record in the database for each file's metadata (title, description, etc.). When you use a Media Library control to list those items, it reads the files from the file system to display them. Is it faster to read from the file system than to query the database? Or would it be faster to run a simple query on the media file metadata table?
Assumptions:
Kentico is an ASP.NET application, so the code is in C#. They use simple DataSets for passing their data around.
Only metadata, such as the filename and size, would be read directly from the files.
At most 100 files per folder.
The database query would be indexed correctly.
The query would be something like:
SELECT *
FROM Media_File
WHERE FilePath LIKE 'Path/To/Current/Media/Folder/%'
The short answer is, it depends on a number of variable factors, but the file system will generally be faster than a DB.
The longer answer is: scanning the local filesystem at a known location is generally fast, because the resource is close to home and computers are designed to do these operations very efficiently.
HOWEVER, whether it's FASTER than a database depends on the database implementation, where it's located, and how much data we're talking about. On the whole, DBMSes are optimized to very effectively store and query large datasets, while a "flat" filesystem can only scan the drive as fast as the hardware goes. How fast they are depends on the implementation (SQLite isn't going to be as fast overall as MS SQL Server or Oracle), the communication scheme (transferring files over a network is the slowest thing your computer does regularly; by contrast, named pipes provide very fast inter-process communication), and how much hardware you're throwing at it (a quad-Xeon blade server with SATA RAID striping is going to be much faster than your Celeron laptop).
In addition to what others have said here, caching can come into play too, depending on your cache settings. Don't forget to take those into account, as Kentico, SQL Server, and IIS all have many different levels of caching that are used at different times depending on your setup, configuration, and which use case(s) you are optimizing.
When it comes to performance issues at this level, the answer is often: it depends. So benchmark your own solution to see which option best serves your particular users' needs.
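For example, a quick-and-dirty benchmark sketch like the one below can give you a first impression in your own environment; the folder path and connection string are placeholders, the query mirrors the example above, and it's worth running it several times so caching effects show up.

using System;
using System.Data.SqlClient;
using System.Diagnostics;
using System.IO;

public static class MediaReadBenchmark
{
    public static void Run(string folder, string connectionString)
    {
        // File system: read only metadata (name and size) for every file in the folder.
        var sw = Stopwatch.StartNew();
        long totalBytes = 0;
        foreach (FileInfo file in new DirectoryInfo(folder).GetFiles())
        {
            totalBytes += file.Length;   // touching Length forces the metadata read
        }
        Console.WriteLine("File system: {0} ms ({1} bytes)", sw.ElapsedMilliseconds, totalBytes);

        // Database: run the folder query and consume all rows.
        sw.Restart();
        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(
            "SELECT * FROM Media_File WHERE FilePath LIKE @path + '%'", connection))
        {
            command.Parameters.AddWithValue("@path", "Path/To/Current/Media/Folder/");
            connection.Open();
            using (SqlDataReader reader = command.ExecuteReader())
            {
                while (reader.Read()) { }
            }
        }
        Console.WriteLine("Database: {0} ms", sw.ElapsedMilliseconds);
    }
}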
Kentico did release a couple of performance guides (one for 5.0 and another for 5.5) that may help, but they still won't give you a definitive answer until you test it yourself.