Using Azure to process large amounts of data - C#

We have an application that over time stores immense amounts of data for our users (we're talking hundreds of TB or more here). Due to new EU directives, should a user decide to discontinue using our services, all their data must be available for export for the next 80 days, after which it MUST be eradicated completely. The data is stored in Azure Storage block blobs, and the metadata in a SQL database.
Sadly, the data cannot be exported as-is (it is in a proprietary format), so it would need to be processed and converted to PDF for export. A file is approximately 240 KB in size, so imagine the number of PDFs for the TB figures stated above.
We tried using Azure Functions to split the job into small chunks of 50 items each, but at some point it went haywire, spun out of control, and ran up enormous costs.
So what we're looking for is this:
Can be run on demand from a web trigger/queue/db entry
Is pay-per-use, as this will occur at random times and (we hope) rarely.
Can process extreme amounts of data fairly effectively at minimum cost
Is easy to maintain and keep track of. The Functions jobs were fire-and-pray (utter chaos) due to their sheer number and parallel processing.
Does anyone know of a service fitting our requirements?

Here's a getting-started link for .NET, Python or Node.js:
https://learn.microsoft.com/en-us/azure/batch/batch-dotnet-get-started
The concept in Batch is pretty simple, although it takes a bit of fiddling to get it working the first time, in my experience. I'll try to explain what it involves, to the best of my ability. Any suggestions or comments are welcome.
The following concepts are important:
The Pool. This is an abstraction of all the nodes (i.e. virtual machines) that you provision to do work. These could be running Linux, Windows Server or any of the other OS offerings that Azure has. You can provision a pool through the API (a minimal provisioning sketch follows below).
The Jobs. This is an abstraction where you place the 'Tasks' you need executed. Each task is a command-line execution of your executable, possibly with some arguments.
Each task is picked up by an available node in the pool, which executes the command the task specifies. Available on the node are your executables and a file that you assigned to the task, containing data that identifies, in your case, which users should be processed by that task.
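To give a feel for the API, here is a minimal pool-provisioning sketch using the Microsoft.Azure.Batch SDK. The credentials, pool id, VM size, OS family and node count are all placeholders, and the exact CreatePool overload can differ slightly between SDK versions, so treat this as a sketch rather than the definitive call:

using System.Threading.Tasks;
using Microsoft.Azure.Batch;
using Microsoft.Azure.Batch.Auth;

public static class PoolSetup
{
    public static async Task CreatePoolAsync()
    {
        // Placeholder credentials, taken from your Batch account in the portal.
        var credentials = new BatchSharedKeyCredentials(
            "https://<account>.<region>.batch.azure.com", "<accountName>", "<accountKey>");

        using (BatchClient batchClient = BatchClient.Open(credentials))
        {
            // A small Windows pool; VM size and node count are just examples.
            CloudPool pool = batchClient.PoolOperations.CreatePool(
                poolId: "ProcessUserDataPool",
                virtualMachineSize: "standard_d2_v3",
                cloudServiceConfiguration: new CloudServiceConfiguration(osFamily: "5"),
                targetDedicatedComputeNodes: 4);

            await pool.CommitAsync();
        }
    }
}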
So suppose in your case that you need to perform the processing for 100 users. Each individual processing job is an execution of some executable you create, e.g. ProcessUserData.exe.
As an example, suppose your executable takes, in addition to a userId, an argument specifying whether this should be performed in test or prod, so e.g.
ProcessUserData.exe "path to file containing user ids to process" --environment test.
We'll assume that your executable doesn't need other input than the user id and the environment in which to perform the processing.
You upload all the application files to a blob container (called the "application" container in the following). This consists of your main executable along with any dependencies. It will all end up in a folder on each node (virtual machine) in your pool, once provisioned. The folder is identified through an environment variable created on each node in your pool so that you can find it easily.
See https://learn.microsoft.com/en-us/azure/batch/batch-compute-node-environment-variables
In this example, you create 10 input files, each containing 10 user ids (100 user ids total) that should be processed, one file for each of the command-line tasks. Each file could contain 1 user id or 10 user ids; it's entirely up to you how you want your main executable to parse the file and process the input. You upload these to the 'input' blob container.
These will also end up in a directory identified by an environment variable on each node, so it's easy to construct a path to them in the command line for each task.
When uploaded to the input container, you will receive a reference (ResourceFile) to each input file. One input file should be associated with one "Task" and each task is passed to an available node as the job executes.
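As a rough sketch of that upload step, here is one way to do it with the older WindowsAzure.Storage client and a read-only SAS, which is roughly what the getting-started sample does. The container name and paths are placeholders, and older Batch SDKs use new ResourceFile(url, fileName) instead of ResourceFile.FromUrl:

using System;
using System.IO;
using System.Threading.Tasks;
using Microsoft.Azure.Batch;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

public static class InputUpload
{
    // Uploads one input file and returns a ResourceFile referencing it via a read-only SAS.
    public static async Task<ResourceFile> UploadInputFileAsync(string connectionString, string localPath)
    {
        CloudStorageAccount account = CloudStorageAccount.Parse(connectionString);
        CloudBlobContainer container = account.CreateCloudBlobClient().GetContainerReference("input");
        await container.CreateIfNotExistsAsync();

        string fileName = Path.GetFileName(localPath);
        CloudBlockBlob blob = container.GetBlockBlobReference(fileName);
        await blob.UploadFromFileAsync(localPath);

        // Read-only SAS so the Batch nodes can download the file.
        string sas = blob.GetSharedAccessSignature(new SharedAccessBlobPolicy
        {
            Permissions = SharedAccessBlobPermissions.Read,
            SharedAccessExpiryTime = DateTime.UtcNow.AddDays(1)
        });

        // Older Batch SDKs use "new ResourceFile(url, fileName)" instead of FromUrl.
        return ResourceFile.FromUrl(blob.Uri + sas, fileName);
    }
}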
The details of how to do this are clear from the getting started link, I'm trying to focus on the concepts, so I won't go into much detail.
You now create the tasks (CloudTask) to be executed, specify what it should run on the command line, and add them to the job. Here you reference the input file that each task should take as input.
An example (assuming Windows cmd):
cmd /c %AZ_BATCH_NODE_SHARED_DIR%\ProcessUserdata.exe %AZ_BATCH_TASK_DIR%\userIds1.txt --environment test
Here, userIds1.txt is the filename your first ResourceFile returned when you uploaded the input files. The next command will specify userIds2.txt, etc.
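Putting that together, here is a sketch of building the task list. The job and task ids are made up, and I've pointed the input file at AZ_BATCH_TASK_WORKING_DIR, which is where resource files are downloaded; adjust the paths if your layout differs:

using System.Collections.Generic;
using Microsoft.Azure.Batch;

public static class TaskSetup
{
    // Builds one CloudTask per input ResourceFile, mirroring the command line shown above.
    public static List<CloudTask> BuildTasks(IList<ResourceFile> inputFiles)
    {
        var tasks = new List<CloudTask>();

        for (int i = 0; i < inputFiles.Count; i++)
        {
            string commandLine =
                $@"cmd /c %AZ_BATCH_NODE_SHARED_DIR%\ProcessUserdata.exe %AZ_BATCH_TASK_WORKING_DIR%\{inputFiles[i].FilePath} --environment test";

            var task = new CloudTask($"processUsersTask{i + 1}", commandLine)
            {
                // The ResourceFile is downloaded to the task's working directory before the task runs.
                ResourceFiles = new List<ResourceFile> { inputFiles[i] }
            };

            tasks.Add(task);
        }

        return tasks;
    }
}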
When you've created your list of CloudTask objects containing the commands, you add them to the job, e.g. in C#:
await batchClient.JobOperations.AddTaskAsync(jobId, tasks);
And now you wait for the job to finish.
What happens now is that Azure Batch looks at the nodes in the pool and, while there are tasks remaining in the task list, assigns a task to an available (idle) node.
Once completed (which you can poll for through the API), you can delete the pool and the job, and you pay only for the compute that you've used.
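A sketch of waiting for completion and cleaning up afterwards; TaskStateMonitor is the helper the getting-started sample uses for this, and the timeout and ids here are placeholders:

using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Azure.Batch;
using Microsoft.Azure.Batch.Common;

public static class JobMonitoring
{
    public static async Task WaitAndCleanUpAsync(
        BatchClient batchClient, string jobId, string poolId, IEnumerable<CloudTask> addedTasks)
    {
        // Block until every task has reached the Completed state (or the timeout expires).
        TaskStateMonitor monitor = batchClient.Utilities.CreateTaskStateMonitor();
        await monitor.WhenAll(addedTasks, TaskState.Completed, TimeSpan.FromHours(2));

        // Delete the job and the pool so you stop paying for the nodes.
        await batchClient.JobOperations.DeleteJobAsync(jobId);
        await batchClient.PoolOperations.DeletePoolAsync(poolId);
    }
}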
One final note: your tasks may depend on external packages, i.e. an execution environment that is not installed by default on the OS you've selected. There are a few ways of resolving this:
1. Upload an application package, which will be distributed to each node as it enters the pool (again, there's an environment variable pointing to it). This can be done through the Azure Portal.
2. Use a command line tool to get what you need, e.g. apt-get install on Ubuntu.
Hope that gives you an overview of what Batch is. In my opinion the best way to get started is to do something very simple, i.e. print environment variables in a single task on a single node.
You can inspect the stdout and stderr of each node while the execution is underway, again through the portal.
There's obviously a lot more to it than this, but this is a basic guide. You can create linked tasks and a lot of other nifty things, but you can read up on that if you need it.

Since a lot of people seem to be looking for a solution to this kind of requirement: the new release of ADLA (Azure Data Lake Analytics) now supports the Parquet format, alongside U-SQL. With fewer than 100 lines of code you can combine these small files into larger ones, and with fewer resources (vertices) you can compress the data into Parquet files. For example, you can store 3 TB of data in 10,000 Parquet files. Reading these files back is also very simple, and you can generate CSV files from them as required in no time. This should save you a great deal of cost and time.

Related

TPL Dataflow - How to ensure one block of work is complete before moving onto next?

I'm still relatively new to TPL Dataflow, and not 100% sure if I am using it correctly or if I'm even supposed to use it.
I'm trying to employ this library to help out with file copying + file uploading.
Basically the structure/process of handling files in our application is as follows:
1) The end user will select a bunch of files from their disk and choose to import them into our system.
2) Some files have higher priority, while the others can complete at their own pace.
3) When a bunch of files is imported, here is the process:
Queue these import requests, one request maps to one file
These requests are stored into a local sqlite db
These requests also explicitly indicate if it demands higher priority or not
We currently have two active threads running (one to manage higher priority and one for lower)
They go into a waiting state until signalled.
When new requests come in, they get signalled to dig into the local db to process the requests.
Both threads are responsible for copying the file to a separate cached location, so it's just a simple File.Copy call. The difference is that one thread does the File.Copy call immediately, while the other just enqueues the copies onto the ThreadPool to run.
Once a file is copied, the request gets updated; the request has a Status enum property with states like Copying, Copied, etc.
The request also requires a ServerTimestamp to be set. The ServerTimestamp is important because a user may be saving changes to what is essentially the same file in different versions, so the order matters.
Another separate thread is running that gets signalled to fetch requests from the local DB where the status is Copied. It will then ping an endpoint to ask for a ServerTimestamp and update the request with it.
Lastly once the request has had the file copy complete and the server ticket is set, we can now upload the physical file to the server.
So I'm toying around with using TransformBlocks.
1 - File Copy TransformBlock
I'm thinking there could be two file copy TransformBlocks: one for higher priority and one for lower priority.
My understanding is that it uses TaskScheduler.Current, which uses the ThreadPool behind the scenes. I was thinking maybe a custom TaskScheduler that spawns a new thread on the fly; this scheduler could be used for the higher-priority file copy block.
2 - ServerTimestamp TransformBlock
This one will be linked to the 1st block, take in all the copied files, get the server timestamp, and set it in the request.
3 - UploadFile TransformBlock
This will upload the file.
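Here's a rough sketch of the pipeline I have in mind; FileRequest, GetServerTimestampAsync and UploadAsync are just placeholders for our own types and calls:

using System;
using System.IO;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public class FileRequest // hypothetical request type
{
    public string SourcePath { get; set; }
    public string CachedPath { get; set; }
    public DateTime? ServerTimestamp { get; set; }
}

public static class UploadPipeline
{
    public static (ITargetBlock<FileRequest> head, IDataflowBlock tail) Build()
    {
        // 1 - copy the file to the cache location
        var copyBlock = new TransformBlock<FileRequest, FileRequest>(request =>
        {
            File.Copy(request.SourcePath, request.CachedPath, overwrite: true);
            return request;
        },
        new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 });

        // 2 - ask the server for a timestamp; parallelism 1 keeps versions of the same file in order
        var timestampBlock = new TransformBlock<FileRequest, FileRequest>(async request =>
        {
            request.ServerTimestamp = await GetServerTimestampAsync(request);
            return request;
        },
        new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 1 });

        // 3 - upload the physical file
        var uploadBlock = new ActionBlock<FileRequest>(request => UploadAsync(request));

        var linkOptions = new DataflowLinkOptions { PropagateCompletion = true };
        copyBlock.LinkTo(timestampBlock, linkOptions);
        timestampBlock.LinkTo(uploadBlock, linkOptions);

        return (copyBlock, uploadBlock);
    }

    private static Task<DateTime> GetServerTimestampAsync(FileRequest request)
        => Task.FromResult(DateTime.UtcNow); // stand-in for the real endpoint call

    private static Task UploadAsync(FileRequest request)
        => Task.CompletedTask; // stand-in for the real upload
}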
Problems I'm facing:
Say for example we have 5 file requests enqueued in the local db.
File1
File2
File3-v1
File3-v2
File3-v3
We Post/SendAsync all 5 requests to the first block.
If File1, File2, File3-v1 and File3-v3 succeed but File3-v2 fails, I want the block not to flow on to the next ServerTimestamp block, because it's important that the File3 versions are completely copied before proceeding, or else they will go out of order.
But that leads to the question of how it would retry correctly and still have the other four files that were already copied move on to the next block with it.
I'm not sure if I am structuring this correctly or if TPL Dataflow even supports my use case.

Backup algorithm for a Windows service

I have to design a backup algorithm for some files used by a Windows Service and I already have some ideas, but I would like to hear the opinion of the wiser ones, in order to try and improve what I have in mind.
The software that I am dealing with follows a client-server architecture.
On the server side, we have a Windows Service that performs some tasks such as monitoring folders, etc, and it has several xml configuration files (around 10). These are the files that I want to backup.
On the client side, the user has a graphical interface that allows him to modify these configuration files, although this shouldn't happen very often. Communication with the server is handled using WCF.
So the config files might be modified remotely by the user, but the administrator might also modify them manually on the server (the windows service monitors these changes).
And for the moment, this is what I have in mind for the backup algorithm (quite simple though):
When - backups will be performed in two situations:
Periodically: a parallel thread in the server application will copy the configuration files every XXXX months/weeks/whatever (configurable parameter). That is, a backup is not performed each time the files are modified by user action, only on this schedule (and when the client app is launched, as described below).
Every time the user launches the client: every time the server detects that a user has launched the application, the server side will perform a backup.
How:
There will be a folder named Backup in the ProgramData folder of the Windows Service. There, each time a backup is performed, a sub-folder named BackupYYYYMMDDHHmm will be created, containing all the concerned files.
Maintenance: backup folders won't be kept forever. Periodically, all of those older than XXXX weeks/months/years (configurable parameter) will be deleted. Alternatively, I might only keep the N most recent backup sub-folders (configurable parameter). I still haven't chosen an option, but I think I'll go for the first one.
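Just to make the "How" concrete, here is a minimal sketch of the copy and retention steps I have in mind (paths and the retention period are placeholders):

using System;
using System.IO;

public static class ConfigBackup
{
    // Copies all config XML files into a timestamped sub-folder under the Backup folder.
    public static void BackupConfigFiles(string configFolder, string backupRoot)
    {
        string target = Path.Combine(backupRoot, "Backup" + DateTime.Now.ToString("yyyyMMddHHmm"));
        Directory.CreateDirectory(target);

        foreach (string file in Directory.GetFiles(configFolder, "*.xml"))
        {
            File.Copy(file, Path.Combine(target, Path.GetFileName(file)), overwrite: true);
        }
    }

    // Deletes backup sub-folders older than the configured retention period.
    public static void CleanOldBackups(string backupRoot, TimeSpan retention)
    {
        foreach (string dir in Directory.GetDirectories(backupRoot, "Backup*"))
        {
            if (Directory.GetCreationTime(dir) < DateTime.Now - retention)
            {
                Directory.Delete(dir, recursive: true);
            }
        }
    }
}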
So, this is it. Comments are very welcome. Thanks!!
I think your design is viable. Just a few comments:
Do you need to back up to a separate place other than the server? I don't feel it's safe to back up important data on the same server, and I would rather back it up to a separate disk (perhaps a network location).
You need to implement the monitoring/backup/retention/etc. yourself, and it sounds complicated - how long do you wish to spend on this?
Personally, I would use a simple trick to achieve the backup. For example, since the data is plain text files (XML format) and light, I might simply back them up to a source control system: make the folder an SVN checkout (or use some other tool), create a simple script that detects and checks in changes to SVN, and schedule the script to run every few hours (or more often, depending on your needs, or triggered by your service/app on demand). This eliminates unnecessary copying of data (only changes are checked in), and it's much more traceable since SVN keeps the full history.
Hope the above helps a bit...

sync local files with server files

Scenario: I want to develop an application. The application should be able to connect to my remote server and download data to the local disk; while downloading, it should check for new files and only download the new ones, creating the required (new) folders as it goes.
Problem: I have no idea how to compare the files on the server with the ones on the local disk. How do I download only the new files from the server to the local disk?
What am I thinking?: I want to sync the files on the local machine with the ones on the server. I am planning to use rsync for syncing, but I have no idea how to use it with ASP.NET.
Kindly let me know if my approach is wrong or if there is a better way to accomplish this.
First you can compare the file names, then the file sizes, and when both match, you can compare the hashes of the files.
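A minimal sketch of that comparison, assuming both copies are reachable as file paths (if the remote side is only reachable over HTTP/FTP, you would compare against a hash the server exposes instead):

using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

public static class FileComparer
{
    // Returns true when the two files appear to be the same: same name, same size, same hash.
    public static bool AreSame(string localPath, string remotePath)
    {
        var local = new FileInfo(localPath);
        var remote = new FileInfo(remotePath);

        if (!local.Exists || !remote.Exists) return false;
        if (!string.Equals(local.Name, remote.Name, StringComparison.OrdinalIgnoreCase)) return false;
        if (local.Length != remote.Length) return false;

        // Only hash when name and size already match - hashing is the expensive step.
        return ComputeHash(localPath).SequenceEqual(ComputeHash(remotePath));
    }

    private static byte[] ComputeHash(string path)
    {
        using (var sha = SHA256.Create())
        using (var stream = File.OpenRead(path))
        {
            return sha.ComputeHash(stream);
        }
    }
}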
I call this kind of problem a "data mastering" problem. I synchronize our databases with a Fortune 100 company throughout the week and have handled a number of business process issues.
The first rule of handling production data is not to do your users' data entry. They must be responsible for putting any business process into motion which touches production. They must understand the process and have access to logs showing what data was changed, otherwise they cannot handle issues. If you're doing this for them, then you are assuming these responsibilities. They will expect you to fix everything when problems happen, which you cannot feasibly do because IT cannot interpret business data or its relevance. For example, I handle delivery records but had to be taught that a duplicate key indicated a carrier change.
I inherited several mismanaged scenarios where IT simply dumped "newer" data into production without any further concern. Sometimes I get junk data, where I have to manually exclude incoming records from the mastering process because they have invalid negative quantities. Some of my on-hand records are more complete than incoming data, and so I have to skip synchronizing specific columns. When one application's import process simply failed, I had to put an end to complaints by creating a working update script. These are issues you need to think ahead about, because they will encourage you to organize control of each step of the synchronization process.
Synchronization steps:
Log what is there before you update
Download and compare local vs remote copies for differences; you cannot compare the two without a) having them both in the same physical location or b) controlling the other system
Log what you're updating with, and timestamp when you're updating it
Save and close the logs
Only when 1-4 are done should you post an update to production
Now as far as organizing a "mastering" process goes, which is what I call comparing the data and producing the lists of what's different, I have more experience to share. For one application, I had to restructure (decentralize) tables and reports before I could reliably compare both sources. This implies a need to understand the business data and know it is in proper form. You don't say if you're comparing PDFs, spreadsheets or images. For data, you must write a separate mastering process for each table (or worksheet), because the mastering process's comparison step may be specially shaped by business needs. Do not write one process which masters everything. Make each process controllable.
Not all information is compared the same way when imported. We get in PO and delivery data and therefore compare tens of thousands of records to determine which data points have changed, but some invoice information is simply imported without any future checks or synchronization. Business needs can even override updates and keep stale data on your end.
Each mastering process's comparer module can then be customized as needed. You'll want specific APIs when comparing file types like PDFs and spreadsheets. I use EPPlus for workbooks. Anything you cannot open has to be binary compared, of course.
A mastering process should not clean or transform the data, especially financial data. Those steps need to occur prior to mastering so that these issues are caught before mastering is begun.
My tools organize the data in 3 tabs -- Creates, Updates and Deletes -- each with DataGridViews showing the relevant records. Then I can log, review and commit changes or hand the responsibility to someone willing.
Mastering process steps:
(Clean / transform data externally)
Load data sources
Compare external to local data
Hydrate datasets indicating Creates, Updates and Deletes

Hints and tips for a Windows service I am creating in C# and Quartz.NET

I have a project ongoing at the moment which is to create a Windows Service that essentially moves files around multiple paths. A job may be to, every 60 seconds, get all files matching a regular expression from an FTP server and transfer them to a network path, and so on. These jobs are stored in a SQL database.
Currently, the service takes the form of a console application for ease of development. Jobs are added using an ASP.NET page and can be edited using another ASP.NET page.
I have some issues though, some relating to Quartz.NET and some general issues.
Quartz.NET:
1: This is the biggest issue I have. Seeing as I'm developing the application as a console application for the time being, I'm having to create a new Quartz.NET scheduler in all my files/pages. This is causing multiple confusing errors; I just don't know how to instantiate the scheduler in one global place and access it from my ASP.NET pages (so I can get details into a grid view to edit, for example).
2: My manager suggested I could look into having multiple 'configurations' inside Quartz.NET. By this I mean that at any given time, an administrator can change the application's configuration so that only specifically chosen applications run. What would be the easiest way of doing this in Quartz.NET?
General:
1: One thing that's crucial in this application is assurance that the file has been moved and is actually on the target path (after the move the original file is deleted, so it would be disastrous if the file were deleted when it hadn't actually been copied!). I also need to make sure that the file's contents match on the initial path and the target path, for peace of mind that what has been copied is right. I'm currently doing this by MD5-hashing the initial file, copying the file, and, before deleting it, making sure that the file exists on the server. Then I hash the file on the server and make sure the hashes match up. Is there a simpler way of doing this? I'm concerned that the hashing may put strain on the system.
2: This relates to the above question, but isn't as important; not even my manager has any idea how I'd do this, but I'd love to implement it. An issue would arise if a job were executed while a file is being written to: a half-written file would be transferred, making it totally useless, and the initial file would also be destroyed while it's being written to! Is there a way of checking for this?
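This is roughly what the check looks like at the moment, simplified (the paths are placeholders):

using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

public static class VerifiedMove
{
    // Copies a file, verifies the copy via MD5, and only then deletes the original.
    public static void MoveWithVerification(string sourcePath, string targetPath)
    {
        byte[] sourceHash = ComputeMd5(sourcePath);

        File.Copy(sourcePath, targetPath, overwrite: true);

        if (!File.Exists(targetPath) || !ComputeMd5(targetPath).SequenceEqual(sourceHash))
        {
            throw new IOException($"Verification failed for '{targetPath}'; original not deleted.");
        }

        File.Delete(sourcePath);
    }

    private static byte[] ComputeMd5(string path)
    {
        using (var md5 = MD5.Create())
        using (var stream = File.OpenRead(path))
        {
            return md5.ComputeHash(stream);
        }
    }
}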
As you've discovered, running the Quartz scheduler inside an ASP.NET application presents many problems. Check out Marko Lahma's response to your question about running the scheduler inside of an ASP.NET web app:
Quartz.Net scheduler works locally but not on remote host
As far as preventing race conditions between your jobs (e.g. trying to delete a file that hasn't actually been copied to the file system yet), what you need to implement is some sort of job chaining:
http://quartznet.sourceforge.net/faq.html#howtochainjobs
In the past I've used the TriggerListeners and JobListeners to do something similar to what you need. Basically, you register event listeners that wait to execute certain jobs until after another job is completed. It's important that you test out those listeners, and understand what's happening when those events are fired. You can easily find yourself implementing a solution that seems to work fine in development (false positive) and then fails to work in production, without understanding how and when the scheduler does certain things with regards to asynchronous job execution.
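As a rough illustration of a job listener, written against Quartz.NET 3.x (the job keys here are made up, and older 2.x versions use synchronous void methods instead):

using System.Threading;
using System.Threading.Tasks;
using Quartz;

// Triggers a follow-up "delete original" job only after the copy job finished without errors.
public class CopyThenDeleteListener : IJobListener
{
    public string Name => "CopyThenDeleteListener";

    public Task JobToBeExecuted(IJobExecutionContext context, CancellationToken ct = default)
        => Task.CompletedTask;

    public Task JobExecutionVetoed(IJobExecutionContext context, CancellationToken ct = default)
        => Task.CompletedTask;

    public async Task JobWasExecuted(IJobExecutionContext context,
        JobExecutionException jobException, CancellationToken ct = default)
    {
        if (jobException == null)
        {
            // Only chain to the next job when the copy succeeded.
            await context.Scheduler.TriggerJob(new JobKey("deleteOriginalJob"), ct);
        }
    }
}

// Registration, scoped to the copy job only (using Quartz.Impl.Matchers):
// scheduler.ListenerManager.AddJobListener(
//     new CopyThenDeleteListener(),
//     KeyMatcher<JobKey>.KeyEquals(new JobKey("copyFileJob")));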
Good luck! Schedulers are fun!

Looking for solution ideas on how to update files in real time that may be locked by other software

I'm interested in getting solution ideas for a problem we have.
Background:
We have software tools that run on laptops and flash data onto hardware components. This software reads in a series of data files in order to do the programming on the hardware. It's in a manufacturing environment and is running continuously throughout the day.
Problem:
Currently, there's a central repository that the software connects to in order to read the data files. The software reads the files and retains a lock on them throughout the entire flashing process. This is running all throughout the day on different hardware components, so it's feasible that these files could be "locked" for most of the day.
There are new requirements stating that the data files the software reads need to be updated in real time, with minimal impact to the end user who is doing the flashing. We will be writing the service that drops the files out there in real time.
The software is developed by a third party vendor and is not modifiable by us. However, it expects a location to look for the data files, so everything up until the point of flashing is our process that we're free to change.
Question:
What approach would you take to solve this from a solution programming standpoint? We're not sure how to drop files out there in real time given the locks that will be present on them throughout the day. We'll settle for an "as soon as possible" solution if that is significantly easier.
The only way out of this conundrum seems to be the introduction of an extra file repository, along with a service-like piece of logic in charge of keeping these repositories synchronized.
In other words, the file upload takes place in one of the repositories (call it the "input repository"), and the flashing process uses the other repository (call it the "output repository"). The synchronization logic continuously polls the input repository for new files (based on file timestamp or otherwise) and, when it finds such new files, attempts to copy them to the output repository; the copy either takes place instantly, when the flashing logic hasn't locked the corresponding file in the output repository, or it is deferred until the file gets unlocked.
Note: During the file copy, the synchronization logic can/should lock the file, hence very temporarily preventing it from being overwritten by new uploads, but ensuring full integrity of the copied file. The difference from the existing system is that the lock is held for a much shorter amount of time.
The drawback of this system is the full duplication of the repository, which could be a problem if the repository is very big. However, there don't appear to be many alternatives since we do not have control over the flashing process.
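A minimal sketch of that synchronization loop, assuming both repositories are plain folders; the paths and polling interval are placeholders, and a production version would want logging and better error handling:

using System;
using System.IO;
using System.Threading;

public static class RepositorySync
{
    // Polls the input repository and copies new/changed files to the output repository,
    // deferring any file that the flashing process currently has locked.
    public static void Run(string inputRoot, string outputRoot, CancellationToken token)
    {
        while (!token.IsCancellationRequested)
        {
            foreach (string source in Directory.GetFiles(inputRoot, "*", SearchOption.AllDirectories))
            {
                string relative = source.Substring(inputRoot.Length).TrimStart(Path.DirectorySeparatorChar);
                string target = Path.Combine(outputRoot, relative);

                // Only copy when the input copy is newer than what's already in the output repository.
                if (File.Exists(target) && File.GetLastWriteTimeUtc(source) <= File.GetLastWriteTimeUtc(target))
                    continue;

                try
                {
                    Directory.CreateDirectory(Path.GetDirectoryName(target));
                    File.Copy(source, target, overwrite: true);
                }
                catch (IOException)
                {
                    // The flashing process still holds a lock; skip and retry on the next pass.
                }
            }

            Thread.Sleep(TimeSpan.FromSeconds(5)); // polling interval - tune as needed
        }
    }
}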
"As soon as possible" is your only option. You can't update a file that's locked, that's the whole point of a lock.
Edit:
Would it be possible to put the new file in a different location and then tell the 3rd party service to look in that location the next time it needs the file?
