sync local files with server files - c#

Scenario: I want to develop an application that connects to my remote server and downloads data to the local disk. While downloading, it should check for new files and download only those, creating any required new folders along the way.
Problem: I have no idea how to compare the files on the server with the ones on the local disk. How do I download only the new files from the server to the local disk?
What I am thinking: I want to sync the files on the local machine with the ones on the server. I am planning to use rsync for syncing, but I have no idea how to use it with ASP.NET.
Kindly let me know if my approach is wrong or if there is a better way to accomplish this.

First compare the file names, then the file sizes, and only when both match compare the hashes of the files.
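The name, size, then hash order can be sketched as follows. This is a minimal illustration in Python rather than C# (the same steps map onto .NET's FileInfo and SHA256 classes), and it assumes both trees are reachable as local paths, for example through a mounted share. The function names are hypothetical; for a truly remote server, the hashes would have to be computed on the server side.

```python
import hashlib
from pathlib import Path


def file_hash(path: Path, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_download(remote: Path, remote_root: Path, local_root: Path) -> bool:
    """Cheapest check first: name, then size, then content hash."""
    local = local_root / remote.relative_to(remote_root)
    if not local.exists():                              # name missing locally
        return True
    if local.stat().st_size != remote.stat().st_size:   # size differs
        return True
    return file_hash(local) != file_hash(remote)        # same size: compare hashes


def files_to_download(remote_root: Path, local_root: Path) -> list:
    """List the relative paths of remote files that are new or changed."""
    return sorted(
        p.relative_to(remote_root)
        for p in remote_root.rglob("*")
        if p.is_file() and needs_download(p, remote_root, local_root)
    )
```

Hashing is the expensive step, so it only runs for files whose name and size already match.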

I call this kind of a problem a "data mastering" problem. I synchronize our databases with a Fortune 100 company throughout the week and have handled a number of business process issues.
The first rule of handling production data is not to do your users' data entry. They must be responsible for putting any business process into motion which touches production. They must understand the process and have access to logs showing what data was changed, otherwise they cannot handle issues. If you're doing this for them, then you are assuming these responsibilities. They will expect you to fix everything when problems happen, which you cannot feasibly do because IT cannot interpret business data or its relevance. For example, I handle delivery records but had to be taught that a duplicate key indicated a carrier change.
I inherited several mismanaged scenarios where IT simply dumped "newer" data into production without any further concern. Sometimes I get junk data, where I have to manually exclude incoming records from the mastering process because they have invalid negative quantities. Some of my on-hand records are more complete than incoming data, and so I have to skip synchronizing specific columns. When one application's import process simply failed, I had to put an end to complaints by creating a working update script. These are issues you need to think ahead about, because they will encourage you to organize control of each step of the synchronization process.
Synchronization steps:
1. Log what is there before you update
2. Download and compare local vs remote copies for differences; you cannot compare the two without a) having them both in the same physical location or b) controlling the other system
3. Log what you're updating with, and timestamp when you're updating it
4. Save and close the logs
5. Only when 1-4 are done should you post an update to production
Now as far as organizing a "mastering" process goes, which is what I call comparing the data and producing the lists of what's different, I have more experience to share. For one application, I had to restructure (decentralize) tables and reports before I could reliably compare both sources. This implies a need to understand the business data and know it is in proper form. You don't say if you're comparing PDFs, spreadsheets or images. For data, you must write a separate mastering process for each table (or worksheet), because the mastering process's comparison step may be specially shaped by business needs. Do not write one process which masters everything. Make each process controllable.
Not all information is compared the same way when imported. We get in PO and delivery data and therefore compare tens of thousands of records to determine which data points have changed, but some invoice information is simply imported without any future checks or synchronization. Business needs can even override updates and keep stale data on your end.
Each mastering process's comparer module can then be customized as needed. You'll want specific APIs when comparing file types like PDFs and spreadsheets. I use EPPlus for workbooks. Anything you cannot open has to be binary compared, of course.
A mastering process should not clean or transform the data, especially financial data. Those steps need to occur prior to mastering so that these issues are caught before mastering is begun.
My tools organize the data in 3 tabs -- Creates, Updates and Deletes -- each with DataGridViews showing the relevant records. Then I can log, review and commit changes or hand the responsibility to someone willing.
Mastering process steps:
(Clean / transform data externally)
Load data sources
Compare external to local data
Hydrate datasets indicating Creates, Updates and Deletes
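The compare-and-hydrate steps above could look roughly like this. It is a sketch in Python under stated assumptions, not the tooling described in this answer: records are assumed to be dictionaries keyed by a primary key, and `compare`, `ChangeSet` and `skip_columns` are hypothetical names. The `skip_columns` parameter stands in for the case where on-hand columns must not be overwritten by incoming data.

```python
from typing import NamedTuple


class ChangeSet(NamedTuple):
    creates: dict   # in external only
    updates: dict   # in both, but the compared columns differ
    deletes: dict   # in local only


def compare(local: dict, external: dict, skip_columns=()) -> ChangeSet:
    """Compare external to local records and sort the differences into three sets."""
    skip = set(skip_columns)

    def compared(row: dict) -> dict:
        # Drop the columns that business rules exclude from synchronization.
        return {col: val for col, val in row.items() if col not in skip}

    creates = {key: row for key, row in external.items() if key not in local}
    deletes = {key: row for key, row in local.items() if key not in external}
    updates = {
        key: row
        for key, row in external.items()
        if key in local and compared(row) != compared(local[key])
    }
    return ChangeSet(creates, updates, deletes)
```

Nothing is written back here; the three sets are only produced so they can be logged and reviewed before anything is committed.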

Related

What is the best Method for monitoring a large number of clients reliably with good performance

This is more of a programming strategy and direction question, than the actual code itself.
I am programming in C-Sharp.
I have an application that remotely starts processes on many different clients on the network, could be up to 1000 clients in theory.
It then monitors the status of the remote processes by reading a log file on each client.
I currently do this by running one thread that loops through all of the clients in a list, and reading the log file. It works fine for 10 or 20 machines, but 1000 would probably be untenable.
There are several problems with this approach:
First, if the thread doesn’t finish reading all of the client statuses before it’s called again, the client statuses at the end of the list might not be read and updated.
Secondly, if any client in the list goes offline during this period, the updating hangs, until that client is back online again.
So I require a different approach, and have thought up a few possible ways to resolve this.
1. Spawn a separate thread for each client, to read its log file and update its progress.
   a. However, I’m not sure if having 1000 threads running on my machine is something that would be acceptable.
2. Test the connection for each machine first, before trying to read the file, and if it cannot connect, then just ignore it for that iteration and move on to the next client in the list.
   a. This still has the same problem of not getting through the list before the next call, and it adds delay because it tests the connection via a port first. With 1000 clients, this would be noticeable.
3. Have each client send the data to the machine running the application whenever there is an update.
   a. This could create a lot of chatter with 1000 machines trying to send data repeatedly.
So I’m trying to figure if there is another more efficient and reliable method, that I haven’t considered, or which one of these would be the best.
Right now I’m leaning towards having the clients send updates to the application, instead of having the application pulling the data.
Looking for thoughts, concerns, ideas and recommendations.
In my opinion, you are doing this monitoring the wrong way. Instead of keeping the logs in text files on each client, you'd be better off storing them in a central data repository, which can be of any kind. Since you are monitoring those systems, your design and the mechanism behind it must not degrade the performance of the target systems; with the current design, disk and CPU can be involved heavily enough in certain cases to cause a performance issue in itself.
I recommend creating a log repository server using a fast in-memory database like Redis, and sending logged data directly to that server. Keep in mind that this database must be running on a different virtual machine. You can then configure Redis to persist the received data to disk once a given number of keys have changed within a given interval. The in-memory design is an advantage here, as a monitoring application like this may need to query the data often. Redis is also fast enough to handle millions of keys comfortably.
The blueprint for you is that:
1- Centralize all log data in a single repository.
2- Configure clients to send monitored information to the centralized repository.
3- Read the data from the centralized repository by the main server (monitoring system) when required.
I'm not trying to advertise a particular tool here; I'm only sharing my own experience. There are many other tools you can use for this purpose, such as Elasticsearch.
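The three-step blueprint can be illustrated with a small push-model sketch in Python. The `StatusRepository` class is a hypothetical in-memory stand-in for the central store (Redis or similar in a real deployment), and the method names are assumptions made for the example. Clients call `report` when something changes; the monitor reads from the repository and never touches a client directly, so an offline client cannot block it.

```python
import time


class StatusRepository:
    """In-memory stand-in for the centralized log/status repository."""

    def __init__(self):
        self._latest = {}   # client id -> (status, time last reported)

    def report(self, client_id, status, now=None):
        """Called by a client whenever its status changes (push, not poll)."""
        self._latest[client_id] = (status, time.time() if now is None else now)

    def status_of(self, client_id):
        """Return the last reported status, or None if the client never reported."""
        entry = self._latest.get(client_id)
        return entry[0] if entry else None

    def stale_clients(self, max_age, now=None):
        """Clients that have not reported within max_age seconds."""
        now = time.time() if now is None else now
        return sorted(
            client for client, (_, seen) in self._latest.items()
            if now - seen > max_age
        )
```

A client that goes offline simply stops reporting and shows up in `stale_clients`.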

Backup algorithm for windows service

I have to design a backup algorithm for some files used by a Windows Service and I already have some ideas, but I would like to hear the opinion of the wiser ones, in order to try and improve what I have in mind.
The software that I am dealing with follows a client-server architecture.
On the server side, we have a Windows Service that performs some tasks such as monitoring folders, etc, and it has several xml configuration files (around 10). These are the files that I want to backup.
On the client side, the user has a graphical interface that allows him to modify these configuration files, although this shouldn't happen very often. Communication with the server is done using WCF.
So the config files might be modified remotely by the user, but the administrator might also modify them manually on the server (the windows service monitors these changes).
And for the moment, this is what I have in mind for the backup algorithm (quite simple though):
When - backups will be performed in two situations:
Periodically: a parallel thread on the server application will perform a copy of the configuration files every XXXX months/weeks/whatever (configurable parameter). That is, it does not perform the backup each time the files are modified by user action, but only when the client app is launched.
Every time the user launches the client: every time the server detects that a user has launched the application, the server side will perform a backup.
How:
There will be a folder named Backup on the Program Data folder of the Windows Service. There, each time a backup is performed, a sub-folder named BackupYYYYMMDDHHmm will be created, containing all the concerned files.
Maintenance: Backup folders won't be kept forever. Periodically, all of those older than XXXX weeks/months/year (configurable parameter) will be deleted. Alternatively, I might only maintain N backup sub-folders (configurable parameter). I still haven't chosen an option, but I think I'll go for the first one.
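As a rough illustration of the "How" and "Maintenance" parts, here is a Python sketch (the service itself would presumably be written in .NET). The function names are hypothetical, and the retention rule shown is the first option: delete folders older than a configurable age, judged by the timestamp in the folder name.

```python
import shutil
from datetime import datetime, timedelta
from pathlib import Path

PREFIX = "Backup"
STAMP = "%Y%m%d%H%M"   # YYYYMMDDHHmm


def make_backup(files, backup_root: Path, now: datetime) -> Path:
    """Copy the given files into a new BackupYYYYMMDDHHmm sub-folder."""
    target = backup_root / (PREFIX + now.strftime(STAMP))
    target.mkdir(parents=True, exist_ok=True)
    for source in files:
        shutil.copy2(source, target / Path(source).name)
    return target


def prune_backups(backup_root: Path, max_age: timedelta, now: datetime) -> list:
    """Delete backup sub-folders older than max_age; return the removed names."""
    removed = []
    for folder in sorted(backup_root.glob(PREFIX + "*")):
        try:
            taken = datetime.strptime(folder.name[len(PREFIX):], STAMP)
        except ValueError:
            continue   # name does not follow the backup pattern; leave it alone
        if folder.is_dir() and now - taken > max_age:
            shutil.rmtree(folder)
            removed.append(folder.name)
    return removed
```

Reading the age from the folder name rather than from file system timestamps keeps the rule stable if the folders are ever copied or restored.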
So, this is it. Comments are very welcome. Thanks!!
I think your design is viable. Just a few comments:
Do you need to back up to a separate place other than the server? I don't feel it's safe to keep backups of important data on the same server; I would rather back them up to a separate disk (perhaps a network location).
You need to implement the monitoring, backup, retention, etc. yourself, and it sounds complicated. How long do you wish to spend on this?
Personally, I would use a simple trick for the backup. Since the data are light, plain-text files (XML), I might simply back them up to a source control system: make the folder an SVN checkout (or use some other tool), write a simple script that detects changes and checks them in, and schedule the script to run every few hours (or more often, depending on your needs; it could also be triggered on demand by your service or app). This eliminates unnecessary copies of the data, since only changes are checked in, and it is much more trackable because SVN keeps the full history.
Hope the above helps a bit...

How to get real time update of data to main warehouse

All,
Need some info.
We have stores at multiple locations and use an installed client-server app for sales activity.
Sales data is stored in a database that is set up in every store.
At end of day, a batch pulls data from all of the store locations and updates the main warehouse database.
We want a real-time implementation, so that whenever there is a transaction at any store, the data is immediately updated in the main warehouse repository.
Any clue as to how we can achieve real-time updates of data to the main warehouse?
Thanks in advance...
One approach to this is called replication. There are several ways to do it in SQL Server. You're probably looking for transactional replication or merge replication.
Here's a place to start in the SQL Server 2012 documentation.
And here's a fairly recent overview that might be helpful.
You should make sure you understand what "real time" means, and how real-time you really need to be. If you are not pre-aggregating data before storing it in the warehouse, then you should be able to set up replication between the database servers (if they can talk to each other). If you are loading an aggregate, it gets tricky, because you have to merge the measures (facts) into the warehouse's existing measures, which is tough. If you don't need true real time, just a slow trickle, then consider simply running your current process on a schedule in SQL Server Agent.
First off, why not run the batch multiple times a day? It would not really be "real-time", but it might yield good enough real-world results.
One option would be to implement master-master replication provided by the SQL engine in use. Though this probably means that some steps need to be taken to guard against duplicate IDs, auto increment mismatch etc. For example we have a master-master system set up so that one produces entries with odd IDs, the other with even.
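The odd/even scheme generalizes to any number of masters: each node starts at its own offset and steps by the node count, so the ID ranges never overlap. A tiny Python sketch of the idea (the function name is hypothetical; in practice this would be configured in the database engine's auto-increment settings rather than in application code):

```python
import itertools


def id_allocator(offset: int, node_count: int):
    """Yield offset, offset + node_count, offset + 2 * node_count, ...

    With node_count=2, offset 1 produces odd IDs and offset 2 produces even IDs.
    """
    if not 1 <= offset <= node_count:
        raise ValueError("offset must be between 1 and node_count")
    return itertools.count(offset, node_count)
```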
Another approach could be that all reads are performed against local databases, and all writes go to a single remote master. Data would be replicated in a master-slave setup. This would provide the best data consistency, but a slow network would make every write slow. We have this kind of setup implemented on top of the master-master replication, as most interactions are reads.
One real-world use case I have actually come across for a similar stores/warehouse setup was based on Firebird SQL. Every single table had triggers that recorded every action on the local database in so-called log tables. A replication application ran at all times, regularly checking these log tables, pushing the data to a remote database and pulling in new data from the remote (which had its own log tables). The downside was that it was a horror to maintain: triggers needed to be updated whenever something changed in the database setup, and the replication application would fail or hang at times. Data consistency was maintained well, with ID conflicts avoided by using negative IDs for the local databases and positive IDs for the master/remote. But in the end it did not really provide real "real-time".
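The trigger-and-log-table pattern can be sketched in miniature as follows. This is an illustration only: it uses Python with SQLite as a stand-in for Firebird, a single hypothetical `sales` table, and replication in one direction (store to warehouse), whereas the system described above replicated both ways.

```python
import sqlite3

STORE_SCHEMA = """
CREATE TABLE sales (id INTEGER PRIMARY KEY, item TEXT, qty INTEGER);
CREATE TABLE sales_log (
    seq INTEGER PRIMARY KEY AUTOINCREMENT,
    action TEXT, id INTEGER, item TEXT, qty INTEGER
);
CREATE TRIGGER sales_ins AFTER INSERT ON sales BEGIN
    INSERT INTO sales_log (action, id, item, qty) VALUES ('I', NEW.id, NEW.item, NEW.qty);
END;
CREATE TRIGGER sales_upd AFTER UPDATE ON sales BEGIN
    INSERT INTO sales_log (action, id, item, qty) VALUES ('U', NEW.id, NEW.item, NEW.qty);
END;
CREATE TRIGGER sales_del AFTER DELETE ON sales BEGIN
    INSERT INTO sales_log (action, id, item, qty) VALUES ('D', OLD.id, OLD.item, OLD.qty);
END;
"""


def open_store() -> sqlite3.Connection:
    """Local store database: every change to sales is recorded in sales_log."""
    db = sqlite3.connect(":memory:")
    db.executescript(STORE_SCHEMA)
    return db


def open_warehouse() -> sqlite3.Connection:
    """Remote warehouse database (no triggers in this one-way sketch)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, item TEXT, qty INTEGER)")
    return db


def replicate(store: sqlite3.Connection, warehouse: sqlite3.Connection) -> int:
    """Replay logged actions on the warehouse in order, then clear the log."""
    rows = store.execute(
        "SELECT seq, action, id, item, qty FROM sales_log ORDER BY seq"
    ).fetchall()
    for _seq, action, row_id, item, qty in rows:
        if action == "D":
            warehouse.execute("DELETE FROM sales WHERE id = ?", (row_id,))
        else:
            warehouse.execute(
                "INSERT OR REPLACE INTO sales (id, item, qty) VALUES (?, ?, ?)",
                (row_id, item, qty),
            )
    if rows:
        store.execute("DELETE FROM sales_log WHERE seq <= ?", (rows[-1][0],))
    warehouse.commit()
    store.commit()
    return len(rows)
```

A replication application would call `replicate` on a timer; how short that interval can be is what bounds how close to real time the setup gets.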
In the end, there is no one-size-fits-all answer, and books could probably be written on the topic. Research and Google are your friends.

Looking for solution ideas on how to update files in real time that may be locked by other software

I'm interested in getting solution ideas for a problem we have.
Background:
We have software tools that run on laptops and flash data onto hardware components. This software reads in a series of data files in order to do the programming on the hardware. It's in a manufacturing environment and is running continuously throughout the day.
Problem:
Currently, there's a central repository that the software connects to in order to read the data files. The software reads the files and retains a lock on them throughout the entire flashing process. This is running all throughout the day on different hardware components, so it's feasible that these files could be "locked" for most of the day.
There are new requirements stating that the data files the software reads need to be updated in real time, with minimal impact on the end user who is doing the flashing. We will be writing the service that drops the files out there in real time.
The software is developed by a third party vendor and is not modifiable by us. However, it expects a location to look for the data files, so everything up until the point of flashing is our process that we're free to change.
Question:
What approach would you take to solve this from a solution programming standpoint? We're not sure how to drop files out there in real time given the locks that will be present on them throughout the day. We'll settle for an "as soon as possible" solution if that is significantly easier.
The only way out of this conundrum seems to be the introduction of an extra file repository, along with a service-like piece of logic in charge of keeping these repositories synchronized.
In other words, file uploads take place in one of the repositories (call it the "input repository"), and the flashing process uses the other (call it the "output repository"). The synchronization logic continuously polls the input repository for new files (based on file timestamp or some other criterion) and, when it finds any, attempts to copy them to the output repository. The copy either takes place immediately, when the flashing logic hasn't locked the corresponding file in the output repository, or is deferred until the file is unlocked.
Note: During the file copy, the synchronization logic can (and should) lock the file, very briefly preventing it from being overwritten by new uploads but ensuring the full integrity of the copied file. The difference from the existing system is that the lock is held for a much shorter time.
The drawback of this system is the full duplication of the repository, which could be a problem if the repository is very big. However, there don't appear to be many alternatives, since we do not have control over the flashing process.
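One polling pass of such a synchronization service might look like the Python sketch below. The names are hypothetical, both repositories are assumed to be flat directories, and a locked destination is assumed to surface as a PermissionError (which is how a sharing violation typically appears on Windows). The `copy` parameter exists only so the copy step can be swapped out, for example in tests.

```python
import shutil
from pathlib import Path


def sync_once(input_dir: Path, output_dir: Path, copy=shutil.copy2):
    """Copy new or newer files to the output repository.

    Returns (copied, deferred): names copied in this pass, and names skipped
    because the destination was locked and must be retried on a later pass.
    """
    copied, deferred = [], []
    for src in sorted(input_dir.iterdir()):
        if not src.is_file():
            continue
        dst = output_dir / src.name
        if dst.exists() and dst.stat().st_mtime >= src.stat().st_mtime:
            continue   # the output copy is already up to date
        try:
            copy(src, dst)   # copy2 keeps the timestamp, so the check above holds next time
            copied.append(src.name)
        except PermissionError:
            deferred.append(src.name)   # locked by the flashing tool; try again later
    return copied, deferred
```

Calling this in a loop with a short sleep gives the "as soon as possible" behavior: a deferred file is picked up on the first pass after the flashing tool releases it.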
"As soon as possible" is your only option. You can't update a file that's locked, that's the whole point of a lock.
Edit:
Would it be possible to put the new file in a different location and then tell the 3rd party service to look in that location the next time it needs the file?

Synchronizing filesystem and cached data on program startup

I have a program that needs to retrieve some data about a set of files (that is, a directory and all files of certain types within it and its subdirectories). The data is (very) expensive to calculate, so rather than traversing the filesystem and calculating it on program startup, I keep a cache of the data in a SQLite database and use a FileSystemWatcher to monitor changes to the filesystem. This works great while the program is running, but the question is how to refresh/synchronize the data during program startup. If files have been added (or changed -- I presume I can detect this via last modified/size) the data needs to be recomputed in the cache, and if files have been removed, the data needs to be removed from the cache (since the interface traverses the cache instead of the filesystem).
So the question is: what's a good algorithm to do this? One way I can think of is to traverse the filesystem and gather the path and last modified/size of all files in a dictionary. Then I go through the entire list in the database. If there is not a match, then I delete the item from the database/cache. If there is a match, then I delete the item from the dictionary. Then the dictionary contains all the items whose data needs to be refreshed. This might work, however it seems it would be fairly memory-intensive and time-consuming to perform on every startup, so I was wondering if anyone had better ideas?
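For reference, the dictionary-based approach described above can be written down in a few lines. This is a Python sketch of the idea rather than C#, with hypothetical function names; the cache is represented as a plain dictionary of path to (last modified, size), standing in for the rows read from the database.

```python
import os


def scan(root):
    """Map each file path under root to its (last modified, size) signature."""
    found = {}
    for folder, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            info = os.stat(path)
            found[path] = (info.st_mtime, info.st_size)
    return found


def reconcile(on_disk: dict, cached: dict):
    """Return (stale, to_refresh).

    stale:      cached paths with no exact match on disk (removed or changed),
                to be deleted from the cache
    to_refresh: files on disk that are new or changed, to be recomputed
    """
    to_refresh = dict(on_disk)
    stale = []
    for path, signature in cached.items():
        if on_disk.get(path) == signature:
            del to_refresh[path]   # match: nothing to do for this file
        else:
            stale.append(path)     # no match: drop the cached item
    return stale, to_refresh
```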
If it matters: the program is Windows-only written in C# on .NET CLR 3.5, using the SQLite for ADO.NET thing which is being accessed via the entity framework/LINQ for ADO.NET.
Our application is cross-platform C++ desktop application, but has very similar requirements. Here's a high-level description of what I did:
In our SQLite database there is a Files table that stores file_id, name, hash (currently we use last modified date as the hash value) and state.
Every other record refers back to a file_id. This makes it easy to remove "dirty" records when the file changes.
Our procedure for checking the filesystem and refreshing the cache is split into several distinct steps to make things easier to test and to give us more flexibility as to when the caching occurs (the names in italics are just what I happened to pick for class names):
On 1st Launch
The database is empty. The Walker recursively walks the filesystem and adds the entries into the Files table. The state is set to UNPARSED.
Next, the Loader iterates through the Files table looking for UNPARSED files. These are handed off to the Parser (which does the actual parsing and inserting of data)
This takes a while, so 1st launch can be a bit slow.
There's a big testability benefit because you can test the walking the filesystem code independently from the loading/parsing code. On subsequent launches the situation is a little more complicated:
n+1 Launch
The Scrubber iterates over the Files table and looks for files that have been deleted and files that have been modified. It sets the state to DIRTY if the file exists but has been modified or DELETED if the file no longer exists.
The Deleter (not the most original name) then iterates over the Files table looking for DIRTY and DELETED files. It deletes other related records (related via the file_id). Once the related records are removed, the original File record is either deleted or set back to state=UNPARSED
The Walker then walks the filesystem to pick-up new files.
Finally the Loader loads all UNPARSED files
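The four steps can be sketched with plain dictionaries standing in for the Files table, the related records and the filesystem. This is a Python illustration, not the C++ implementation described here; the PARSED state name and the function signatures are assumptions made for the example.

```python
UNPARSED, PARSED, DIRTY, DELETED = "UNPARSED", "PARSED", "DIRTY", "DELETED"


def scrub(files, disk):
    """Scrubber: flag files that were modified (DIRTY) or removed (DELETED)."""
    for name, row in files.items():
        if name not in disk:
            row["state"] = DELETED
        elif disk[name] != row["hash"]:
            row["state"] = DIRTY


def delete(files, records):
    """Deleter: drop related records, then remove or reset the file row."""
    for name in list(files):
        state = files[name]["state"]
        if state in (DIRTY, DELETED):
            records.pop(name, None)
            if state == DELETED:
                del files[name]
            else:
                files[name]["state"] = UNPARSED


def walk(files, disk):
    """Walker: add files that are on disk but not yet in the table."""
    for name, stamp in disk.items():
        if name not in files:
            files[name] = {"hash": stamp, "state": UNPARSED}


def load(files, disk, records, parse):
    """Loader: parse every UNPARSED file and store the result."""
    for name, row in files.items():
        if row["state"] == UNPARSED:
            records[name] = parse(name)
            row["hash"] = disk[name]   # remember the version that was parsed
            row["state"] = PARSED


def refresh(files, disk, records, parse):
    """Run the n+1 launch sequence; on an empty table this is the 1st launch."""
    scrub(files, disk)
    delete(files, records)
    walk(files, disk)
    load(files, disk, records, parse)
```

Because each step only reads and writes the state column, each one can be tested on its own with a hand-built table.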
Currently the "worst case scenario" (every file changes) is very rare, so we do this every time the application starts up. But by splitting the process into these steps, we could easily extend the implementation:
The Scrubber/Deleter could be refactored to leave the dirty records in place until after the new data is loaded (so the application "keeps working" while new data is cached into the database)
The Loader could load/parse on a background thread during an idle time in the main application
If you know something about the data files ahead of time you could assign a 'weight' to the files and load/parse the really-important files immediately and queue-up the less-important files for processing at a later time.
Just some thoughts / suggestions. Hope they help!
Windows has a change journal mechanism, which does what you want: you subscribe to changes in some part of the filesystem and upon startup can read a list of changes which happened since last time you read them. See: http://msdn.microsoft.com/en-us/library/aa363798(VS.85).aspx
EDIT: I think it requires rather high privileges, unfortunately
The first obvious thing that comes to mind is creating a separate small application that would always run (as a service, perhaps) and create a kind of "log" of changes in the file system (no need to work with SQLite, just write them to a file). Then, when the main application starts, it can look at the log and know exactly what has changed (don't forget to clear the log afterwards :-).
However, if that is unacceptable to you for some reason, let us try to look at the original problem.
First of all, you have to accept that, in the worst case scenario, when all the files have changed, you will need to traverse the whole tree. And that may (although not necessarily will) take a long time. Once you realize that, you have to think about doing the job in background, without blocking the application.
Second, if you have to make a decision about each file that only you know how to make, there is probably no other way than going through all files.
Putting the above in other words, you might say that the problem is inherently complex (and any given problem cannot be solved with an algorithm that is simpler than the problem itself).
Therefore, your only hope is reducing the search space by using tweaks and hacks. And I have two of those on my mind.
First, it's better to query the database separately for every file instead of building a dictionary of all files first. If you create an index on the file path column in your database, it should be quicker, and of course, less memory-intensive.
Second, you don't actually have to query the database at all :-)
Just store the exact time when your application was last running somewhere (in a .settings file?) and check every file to see if it's newer than that time. If it is, you know it's changed. If it's not, you know you caught its change last time (with your FileSystemWatcher).
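A minimal sketch of that check, in Python rather than C# and with a hypothetical function name; `last_run` is assumed to be the stored time as seconds since the epoch:

```python
import os


def changed_since(root, last_run):
    """Return paths under root modified after last_run (epoch seconds)."""
    changed = []
    for folder, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            if os.stat(path).st_mtime > last_run:
                changed.append(path)
    return sorted(changed)
```

This finds added and modified files only. Files deleted while the application was not running have no timestamp left to inspect, so the cached paths would still need to be checked for existence.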
Hope this helps. Have fun.
