I am using Azure Cosmos DB. I have some collections as shown in the snapshot below.
Now, I want to create a scheduled task that will retrieve all the data from the collection "CurrentDay", do some calculation, and store the result in another collection, "datewise". Similarly, I will need to retrieve all the data from "datewise" and, based on some calculation, store the data in "monthwise" and then in "yearly".
I looked for an option in Scheduler in the Azure portal and tried creating a scheduler, but it seems I don't have sufficient permissions/licensing to use that feature. I haven't used it before, so I am not sure it would work anyway.
Had it been in SQL Server I could have done that using custom code in C#. The only option I currently have is to use REST API calls to fetch data, calculate in C# and Post it back to Azure Cosmos DB. Is there any better way of doing this?
Please let me know if I can provide any details.
I think using a scheduled task (on Azure) and getting the data via the REST API is probably what you want to do. There are several reasons why this isn't as bad as you might think:
Your server and your database are right next to each other in the data centre, so you don't need to pay for data transfer.
Latency is very low and bandwidth is very high, so you'll be limited by the database performance more than anything else (you can run parallel tasks in your scheduled task to make sure of this).
The REST API has a very well supported official C# client library.
Of course it depends on the scale of data we're talking about as to how you should provision your scheduled task.
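For illustration, here is a minimal sketch of that read-calculate-write loop using the current official client library (Microsoft.Azure.Cosmos). The database name, the document shape and the aggregation itself are placeholders; the point is just the query-iterate-upsert pattern.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

// Placeholder document shape for items in "CurrentDay".
public class Reading
{
    public string id { get; set; }
    public double Value { get; set; }
}

public class DailyAggregator
{
    public async Task RunAsync(CosmosClient client)
    {
        Container source = client.GetContainer("mydb", "CurrentDay");
        Container target = client.GetContainer("mydb", "datewise");

        double total = 0;
        FeedIterator<Reading> iterator = source.GetItemQueryIterator<Reading>("SELECT * FROM c");
        while (iterator.HasMoreResults)
        {
            foreach (Reading reading in await iterator.ReadNextAsync())
            {
                total += reading.Value; // your calculation goes here
            }
        }

        // Assumes the "datewise" container is partitioned on /id.
        var summary = new { id = DateTime.UtcNow.ToString("yyyy-MM-dd"), total };
        await target.UpsertItemAsync(summary, new PartitionKey(summary.id));
    }
}
```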
I'd encapsulate your logic in an Azure WebJob method and mark the method with a TimerTrigger. The TimerTrigger will call your given method on the schedule that you specify. This has fewer moving parts: if you were to go the scheduler route, you would still have to have the scheduler call some endpoint in order to perform the work. Packaging up your logic and schedule in a WebJob simplifies things a bit.
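A minimal sketch of what that could look like (assuming WebJobs SDK 3.x with timers enabled on the host; the CRON expression and the body are placeholders):

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public class AggregationJobs
{
    // Runs every day at 00:30 UTC; adjust the CRON expression to your schedule.
    public Task AggregateCurrentDay([TimerTrigger("0 30 0 * * *")] TimerInfo timer, ILogger log)
    {
        log.LogInformation("Daily aggregation started at {time}", DateTime.UtcNow);
        // ... read "CurrentDay", calculate, and write to "datewise" here ...
        return Task.CompletedTask;
    }
}
```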
On a side note, if all data lived in the same collection, I'd suggest writing a stored procedure to perform these calculations. But alas, stored procedures in Cosmos are bounded at the collection level.
I have a query on bi-directional data sync.
The scenario is that we have ERP software running on a local network, developed in PowerBuilder with a SQL Anywhere 16 database. We also have our cloud software, developed in .NET 6 with an Azure SQL database, and a middleware developed in .NET that interacts with our API and the local DB. After an operation like invoice generation, we need to keep the quantity of a product the same in the local DB and the cloud DB, whether the operation happened in the cloud or on the local network. Please share your thoughts.
The approach would depend on whether you are willing to sacrifice consistency or concurrency.
If you are willing to sacrifice consistency, which in your case would be acceptable I believe for some use cases like syncing local invoices to the cloud, your middleware could asynchronously ensure that both the databases are in sync on the side.
If you are willing to sacrifice concurrency, which would be needed to ensure the quantity of a product is accurate before checking out, you would essentially use a lock to ensure the product is available and is not being checked out by someone else. This has the downside of slowing down the system, since multiple requests would be waiting while previous requests are being processed.
As for the actual implementation itself: for the first option, you could push each transaction onto a queue that the middleware receives and uses to sync the other database on the side. For the second option, you would need some kind of distributed lock per product that your API/middleware must acquire before committing changes.
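As a hypothetical illustration of the first option, the side that performed the operation could enqueue an inventory-change message that the middleware drains and applies to the other database (an Azure Storage queue is just one possible transport; the message shape is made up):

```csharp
using System;
using System.Text.Json;
using System.Threading.Tasks;
using Azure.Storage.Queues;

// Made-up message shape: which product changed, by how much, and where the change originated.
public record InventoryChange(string ProductId, int QuantityDelta, string SourceSystem);

public class InventorySyncQueue
{
    private readonly QueueClient _queue;

    public InventorySyncQueue(string connectionString)
    {
        _queue = new QueueClient(connectionString, "inventory-sync");
        _queue.CreateIfNotExists();
    }

    // Called by whichever side performed the operation (cloud or local, via the middleware).
    public Task PublishAsync(InventoryChange change) =>
        _queue.SendMessageAsync(JsonSerializer.Serialize(change));

    // Called periodically by the middleware to bring the other database in sync.
    public async Task DrainAsync(Func<InventoryChange, Task> applyToOtherDb)
    {
        foreach (var message in (await _queue.ReceiveMessagesAsync(maxMessages: 16)).Value)
        {
            var change = JsonSerializer.Deserialize<InventoryChange>(message.MessageText);
            await applyToOtherDb(change);
            await _queue.DeleteMessageAsync(message.MessageId, message.PopReceipt);
        }
    }
}
```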
Depending on the scale and business KPIs of your application, you would have to decide how to approach the second option. Here is another SO thread which has a decent discussion on various approaches based on what you are willing to sacrifice.
Right now I have a WebAPI application that after receiving a request dynamically creates a specific pipeline in C# to do a specific task.
However, because the number of pipelines and datasets is limited to 5000, the application's requests will eventually hit this limit. I'm thinking about a way to automatically delete a pipeline and its datasets, but I'm not sure how. Manual deletion is out of the question, unfortunately.
Is there maybe a way to execute a "self-destruction" of a pipeline after completion? Or maybe a trigger that removes old pipelines periodically?
There is no built-in mechanism in ADF to clean up these resources directly; however, you could use an Azure Function time trigger to do it on a schedule (a rough sketch follows the list). Please refer to my thoughts:
1. Create a time-triggered Azure Function (for example, triggered every day) that queries pipeline runs with the REST API or SDK.
2. Loop through the results and filter on Status == Succeeded and runEnd < today to get the list of pipeline names.
3. Delete them one by one by name using the Delete API (REST API: https://learn.microsoft.com/en-us/rest/api/datafactory/pipelines/delete).
4. Deleting datasets is a little more trouble. Although you can get the pipeline name, the activities in each pipeline are not necessarily the same, so they reference different datasets. For example, for a copy activity you can get referenceName from the inputs and outputs arrays. If it is feasible to clear all datasets and have them re-created, you can simply use the List Datasets API and delete them all.
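Here is a rough sketch of steps 1-3 as a timer-triggered Azure Function using the Microsoft.Azure.Management.DataFactory SDK. The subscription, resource group, factory name and the way credentials are obtained are placeholders; treat it as an outline rather than a finished implementation.

```csharp
using System;
using System.Linq;
using Microsoft.Azure.Management.DataFactory;
using Microsoft.Azure.Management.DataFactory.Models;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;
using Microsoft.Rest;

public static class PipelineCleanup
{
    [FunctionName("PipelineCleanup")]
    public static void Run([TimerTrigger("0 0 2 * * *")] TimerInfo timer, ILogger log)
    {
        // In practice, obtain this token via a service principal or managed identity.
        var creds = new TokenCredentials("<access-token>");
        var client = new DataFactoryManagementClient(creds) { SubscriptionId = "<subscription-id>" };

        // Steps 1-2: query pipeline runs and keep those that succeeded and ended before today.
        var filter = new RunFilterParameters(DateTime.UtcNow.AddDays(-30), DateTime.UtcNow);
        var runs = client.PipelineRuns.QueryByFactory("<resource-group>", "<factory-name>", filter);
        var namesToDelete = runs.Value
            .Where(r => r.Status == "Succeeded" && r.RunEnd < DateTime.UtcNow.Date)
            .Select(r => r.PipelineName)
            .Distinct();

        // Step 3: delete the pipelines one by one.
        foreach (var name in namesToDelete)
        {
            log.LogInformation("Deleting pipeline {name}", name);
            client.Pipelines.Delete("<resource-group>", "<factory-name>", name);
        }
    }
}
```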
I have a list of files in a blob in a storage account that I need to move to another storage account. Is there a way to specifically select blob files and move only the selected subset to a different storage account? If so, how can I do it?
Edit: the list of blobs that need to be moved will keep being updated, and the process will need to run on an ongoing basis.
The most rudimentary approach that I would recommend if you want to use Azure Functions for this is based on the fact that this problem is really about I/O more than it is about compute. So while there are patterns you can use to scale out work with Azure functions, those probably don't make much sense for this kind of problem.
The simplest approach here is to use a single timer trigger based function. You'll schedule this function to run as frequently as you need. Its job will be to execute your sproc, enumerate the results and then queue up each result for copying via a TransferManager from the Azure Blob Storage SDK.
If you're not familiar with the TransferManager class already, it takes care of tracking and optimizing the concurrent throughput of I/O operations for you. You would likely want to create a single TransferContext representing the batch of work the function is working on, so you can keep track of progress, deal with failures, handle overwrite situations, etc. You would be utilizing the CopyAsync method and, again if you're not familiar with this API, there is a parameter on this method named isServiceCopy. Since you're copying between two Azure Storage accounts, you definitely want to use this so that it is a pure server-to-server copy and the I/O doesn't pass through the server your function instance is running on at all; your function ends up being little more than an orchestrator of the copying.
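A minimal sketch of such a copy, assuming the Azure Storage Data Movement library (Microsoft.Azure.Storage.DataMovement); the exact overloads differ a little between versions, but the key point is the service-copy flag so the bytes never flow through the function:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.Storage.Auth;
using Microsoft.Azure.Storage.Blob;
using Microsoft.Azure.Storage.DataMovement;

public static class BlobMover
{
    public static async Task CopyAsync(Uri sourceBlobUri, StorageCredentials sourceCreds,
                                       Uri destBlobUri, StorageCredentials destCreds)
    {
        var sourceBlob = new CloudBlockBlob(sourceBlobUri, sourceCreds);
        var destBlob = new CloudBlockBlob(destBlobUri, destCreds);

        // One context per batch lets you observe progress, failures and overwrite decisions.
        var context = new SingleTransferContext
        {
            ProgressHandler = new Progress<TransferStatus>(
                p => Console.WriteLine("Bytes transferred: " + p.BytesTransferred))
        };

        // true = service-side copy: the storage service copies account-to-account directly.
        await TransferManager.CopyAsync(sourceBlob, destBlob, true, null, context,
                                        CancellationToken.None);
    }
}
```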
Now, like I said, this is the most rudimentary approach I would suggest. There are other things to consider, such as remaining idempotent in the face of any failures. For example, if the stored procedure you're calling only returns a particular blob URI once (e.g. a poor man's queue in SQL Server) and your Azure Function fails for some reason, then you would lose that work. I would really need to understand more details to prescribe a more durable alternative, but you'd definitely want to decouple the actual copying from the execution of the stored procedure to reduce the likelihood of failure there.
You can implement it with a Recurrence Logic App:
Runs every X time
Invoke your Stored Procedure to get the list of the files
For each file, use the Copy Blob component to move the source blob to the destination blob
I have this scenario, and I don't really know where to start. Suppose there's a web-service-like app (it might be an API, though) hosted on a server. That app receives a request to process some data (through some method we will call processData(data theData)).
On the other side, there's a robot (which might be installed on the same server) that processes the data. So, the web service inserts the request into a common database (both programs have access to it), and it's supposed to wait for that row to change and send the results back.
The robot periodically checks the database for new rows, processes the data and sets some sort of flag on that row, indicating that the data was processed.
So the main problem here is: what should the method processData(...) do to check for changes to the data row?
I know one way to do it: I can build an iteration block that checks the row every x seconds. But I don't want to do that. What I want to do is build some sort of event listener that triggers when the row changes. I know it might involve some asynchronous programming.
I might be dreaming, but is that even possible in a web environment?
I've been reading about the SqlDependency class, async/await, etc.
Depending on how much control you have over design of this distributed system, it might be better for its architecture if you take a step back and try to think outside the domain of solutions you have narrowed the problem down to so far. You have identified the "main problem" to be finding a way for the distributed services to communicate with each other through the common database. Maybe that is a thought you should challenge.
There are many potential ways for these components to communicate, and if your design goal is to reduce latency and thus avoid polling, it might in fact be right for the service that needs to be informed of completion of a work item to be informed of it right away. However, if in the future the throughput of this system has to increase, processing work items in bulk and polling for the information instead might become the only feasible option. This is also why I have chosen to word my answer a bit more generically and discuss the design of this distributed system more abstractly.
If after this consideration your answer remains the same and you do want immediate notification, consider having the component that processes a work item to notify the component(s) that need to be notified. As a general design principle for distributed systems, it is best to have the component that is most authoritative for a given set of data to also be the component to answer requests about that data. In this case, the data you have is the completion status of your work items, so the best component to act on this would be the component completing the work items. It might be better for that component to inform calling clients and components of that completion. Here it's also important to know if you only write this data to the database for the sake of communication between components or if those rows have any value beyond the completion of a given work item, such as for reporting purposes or performance indicators (KPIs).
I think there can be valid reasons, though, why you would not want to have such a call, such as reducing coupling between components or lack of access to communicate with the other component in a direct manner. There are many communication primitives that allow such notification, such as MSMQ under Windows, or Queues in Windows Azure. There are also reasons against it, such as dependency on a third component for communication within your system, which could reduce the availability of your system and lead to outages. The questions you might want to ask yourself here are: "How much work can my component do when everything around it goes down?" and "What are my design priorities for this system in terms of reliability and availability?"
So I think the main problem you might want to really try to solve first is a bit more abstract: what should the interface through which the components of this distributed system communicate look like?
If after all of this you remain set on having the interface of communication between those components be the SQL database, you could explore using INSERT and UPDATE triggers in SQL. You can easily look up the syntax of those commands and specify stored procedures that then get executed. In those stored procedures you would want to check the completion flag of any new rows, and possibly restrict the number of rows you check by date or by keeping an ID of the last processed work item. To then notify the other component, you could go as far as using the built-in stored procedure xp_cmdshell to execute command lines under Windows. The command you execute could be a simple tool that pings your service about completion of the task.
I'm sorry to have initially overlooked your suggestion to use SQL Query Notifications. That is also a feasible way and works through the Service Broker component. You would define a SqlCommand, as if normally querying your database, pass this to an instance of SqlDependency and then subscribe to the event called OnChange. Once you execute the SqlCommand, you should get calls to the event handler you added to OnChange.
I am not sure, however, how to get the exact changes to the database out of the SqlNotificationEventArgs object that will be passed to your event handler, so your query might need to be specific enough for the application to tell that the work item has completed whenever the query changes, or you might have to do another round-trip to the database from your application every time you are notified to be able to tell what exactly has changed.
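A minimal sketch of that SqlDependency flow, assuming Service Broker is enabled on the database; the table and column names are placeholders:

```csharp
using System;
using System.Data.SqlClient;

public class WorkItemWatcher
{
    private readonly string _connectionString;

    public WorkItemWatcher(string connectionString)
    {
        _connectionString = connectionString;
        SqlDependency.Start(_connectionString); // call SqlDependency.Stop(...) on shutdown
    }

    public void Subscribe()
    {
        using (var connection = new SqlConnection(_connectionString))
        using (var command = new SqlCommand(
            // Query notification queries must use two-part table names and explicit columns.
            "SELECT Id, IsProcessed FROM dbo.WorkItems WHERE IsProcessed = 1", connection))
        {
            var dependency = new SqlDependency(command);
            dependency.OnChange += OnWorkItemsChanged;

            connection.Open();
            using (var reader = command.ExecuteReader())
            {
                while (reader.Read()) { /* handle rows that are already completed, if any */ }
            }
        }
    }

    private void OnWorkItemsChanged(object sender, SqlNotificationEventArgs e)
    {
        // The event args only tell you that something changed (e.Type/e.Info), not what changed,
        // so re-run the query; also note that a notification fires only once, so re-subscribe.
        Console.WriteLine("Change detected: " + e.Type + " / " + e.Info);
        Subscribe();
    }
}
```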
Are you referring to a Message Queue? The .Net framework already provides this facility. I would say let the web service manage an application level queue. The robot will request the same web service for things to do. Assuming that the data needed for the jobs are small, you can keep the whole thing in memory. I would rather not involve a database, if you don't already have one.
I have a C# service application which interacts with a database. It was recently migrated from .NET 2.0 to .NET 4.0 so there are plenty of new tools we could use.
I'm looking for pointers to programming approaches or tools/libraries to handle defining tasks, configuring which tasks they depend on, queueing, prioritizing, cancelling, etc.
There are various types of services:
Data (for retrieving and updating)
Calculation (populate some table with the results of a calculation on the data)
Reporting
These services often depend on one another and are triggered on demand, i.e., a Reporting task will probably have code within it such as
if (IsSomeDependentCalculationRequired())
PerformDependentCalculation(); // which may trigger further calculations
GenerateRequestedReport();
Also, any Data modification is likely to set the Required flag on some of the Calculation or Reporting services, (so the report could be out of date before it's finished generating). The tasks vary in length from a few seconds to a couple of minutes and are performed within transactions.
This has worked OK up until now, but it is not scaling well. There are fundamental design problems and I am looking to rewrite this part of the code. For instance, if two users request the same report at similar times, the dependent tasks will be executed twice. Also, there's currently no way to cancel a task in progress. It's hard to maintain the dependent tasks, etc.
I'm NOT looking for suggestions on how to implement a fix. Rather I'm looking for pointers to what tools/libraries I would be using for this sort of requirement if I were starting in .NET 4 from scratch. Would this be a good candidate for Windows Workflow? Is this what Futures are for? Are there any other libraries I should look at or books or blog posts I should read?
Edit: What about Rx Reactive Extensions?
I don't think your requirements fit into any of the built-in stuff. Your requirements are too specific for that.
I'd recommend that you build a task queueing infrastructure around a SQL database. Your tasks are pretty long-running (seconds) so you don't need particularly high throughput in the task scheduler. This means you won't encounter performance hurdles. It will actually be a pretty manageable task from a programming perspective.
You should probably build a Windows service or some other process that continuously polls the database for new tasks or requests. This service can then enforce arbitrary rules on the requested tasks. For example, it can detect that a reporting task is already running and not schedule a new computation.
My main point is that your requirements are so specific that you need to use C# code to encode them. You cannot make an existing tool fit your needs. You need the Turing completeness of a programming language to do this yourself.
Edit: You should probably separate a task-request from a task-execution. This allows multiple parties to request a refresh of some reports while only one actual computation is running. Once this single computation is completed, all task-requests are marked as completed. When a request is cancelled, the execution does not need to be cancelled; only when the last request is cancelled is the task-execution cancelled as well.
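A rough sketch of the polling service combined with this request-vs-execution separation; the delegates stand in for your own data access and task execution, so treat it purely as an illustration of the rule "one execution satisfies every pending request for the same task":

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;

// Hypothetical shape of a task-request row.
public class TaskRequest
{
    public int Id { get; set; }
    public string TaskKey { get; set; } // e.g. "Report:MonthlySales:2013-06"
}

public class TaskSchedulerService
{
    private readonly Func<IList<TaskRequest>> _fetchPendingRequests;  // e.g. a SQL query
    private readonly Action<TaskRequest> _execute;                    // runs the actual work
    private readonly Action<IEnumerable<TaskRequest>> _markCompleted;

    public TaskSchedulerService(Func<IList<TaskRequest>> fetchPendingRequests,
                                Action<TaskRequest> execute,
                                Action<IEnumerable<TaskRequest>> markCompleted)
    {
        _fetchPendingRequests = fetchPendingRequests;
        _execute = execute;
        _markCompleted = markCompleted;
    }

    public void Run(CancellationToken token)
    {
        while (!token.IsCancellationRequested)
        {
            // Collapse duplicate requests: one execution satisfies every request for the same task.
            foreach (var group in _fetchPendingRequests().GroupBy(r => r.TaskKey))
            {
                _execute(group.First());
                _markCompleted(group);
            }

            Thread.Sleep(TimeSpan.FromSeconds(5)); // coarse polling is fine for multi-second tasks
        }
    }
}
```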
Edit 2: I don't think workflows are the solution. Workflows usually operate separately from each other. But you don't want that. You want to have rules which span multiple tasks/workflows. You would be working against the system with a workflow based model.
Edit 3: A few words about the TPL (Task Parallel Library). You mentioned it ("Futures"). If you want some inspiration on how tasks could work together, how dependencies could be created and how tasks could be composed, look at the Task Parallel Library (in particular the Task and TaskFactory classes). You will find some nice design patterns there because it is very well designed. Here is how you model a sequence of tasks: you call Task.ContinueWith, which registers a continuation function as a new task. Here is how you model dependencies: TaskFactory.ContinueWhenAll starts a task that only runs when all its input tasks are completed.
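For illustration only, here is how those two patterns look in plain TPL code (this is the inspiration, not the persisted task system you would actually build):

```csharp
using System;
using System.Threading.Tasks;

class TplCompositionDemo
{
    static void Main()
    {
        // Sequence: run a calculation, then continue with report generation.
        Task<int> calculation = Task.Factory.StartNew(() => 42);
        Task<string> report = calculation.ContinueWith(t => "Report based on " + t.Result);

        // Dependency: a task that only runs once all of its inputs have completed.
        Task<int>[] inputs =
        {
            Task.Factory.StartNew(() => 1),
            Task.Factory.StartNew(() => 2),
        };
        Task<int> combined = Task.Factory.ContinueWhenAll(inputs,
            completed => completed[0].Result + completed[1].Result);

        Console.WriteLine(report.Result);   // Report based on 42
        Console.WriteLine(combined.Result); // 3
    }
}
```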
BUT: The TPL itself is probably not well suited for you because its tasks cannot be saved to disk. When you reboot your server or deploy new code, all existing tasks are cancelled and the process is aborted. This is likely to be unacceptable. Please just use the TPL as inspiration. Learn from it what a "task/future" is and how they can be composed. Then implement your own form of tasks.
Does this help?
I would try to use the state machine package Stateless to model the workflow. Using a package will provide a consistent way to advance the state of the workflow across the various services. Each of your services would hold an internal state machine implementation and expose methods for advancing it. Stateless will be responsible for triggering actions based on the state of the workflow, and it forces you to explicitly set up the various states the workflow can be in; this will be particularly useful for maintenance, and it will probably help you understand the domain better.
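A tiny illustration of what that could look like with the Stateless package; the states and triggers here are made up to mirror the calculation/reporting workflow in the question:

```csharp
using System;
using Stateless;

public enum ReportState { Pending, Calculating, Generating, Completed, Cancelled }
public enum ReportTrigger { DependenciesReady, CalculationDone, ReportDone, Cancel }

public class ReportWorkflow
{
    private readonly StateMachine<ReportState, ReportTrigger> _machine;

    public ReportWorkflow()
    {
        _machine = new StateMachine<ReportState, ReportTrigger>(ReportState.Pending);

        _machine.Configure(ReportState.Pending)
            .Permit(ReportTrigger.DependenciesReady, ReportState.Calculating)
            .Permit(ReportTrigger.Cancel, ReportState.Cancelled);

        _machine.Configure(ReportState.Calculating)
            .Permit(ReportTrigger.CalculationDone, ReportState.Generating)
            .Permit(ReportTrigger.Cancel, ReportState.Cancelled);

        _machine.Configure(ReportState.Generating)
            .OnEntry(() => Console.WriteLine("Generating report..."))
            .Permit(ReportTrigger.ReportDone, ReportState.Completed);
    }

    public void Advance(ReportTrigger trigger) => _machine.Fire(trigger);
    public ReportState State => _machine.State;
}
```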
If you want to solve this fundamental problem properly and in a scalable way, you should probably look at the SOA (service-oriented architecture) style.
Your services will receive commands and generate events that you can handle in order to react to things that happen in your system.
And, yes, there are tools for it. For example NServiceBus is a wonderful tool to build SOA systems.
You could use a SQL Server Agent job to run SQL queries at a timed interval. Otherwise it looks like you have to write the application yourself: a long-running program that checks the time and does something. I don't think there are clear-cut tools out there to do what you are trying to do. Build a C# application or a WCF service; the data automation can be done in SQL itself.
If I understand you right, you want to cache the generated reports and not do the work again. As other commenters have pointed out, this can be solved elegantly with multiple producer/consumer queues and some caches.
First you enqueue your report request. Based on the report generation parameters, you can check the cache first to see whether a previously generated report is already available and simply return that one. If the report becomes obsolete due to changes in the database, you need to take care that the cache is invalidated in a reliable manner.
Now, if the report has not been generated yet, you need to schedule it for generation. The report scheduler needs to check whether the same report is already being generated. If so, register an event to notify you when it is completed and return the report once it is finished. Make sure that you do not access the data via the caching layer, since that could produce races (the report is generated, the data is changed, and the finished report would be immediately discarded by the cache, leaving nothing for you to return).
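One hedged way to implement the "is this report already being generated?" check is a dictionary of lazily started tasks, so every caller for the same key awaits the single in-flight generation (Report and the key format are made up; cache invalidation on data changes is still up to you):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public class Report { public string Content { get; set; } }

public class ReportScheduler
{
    private readonly ConcurrentDictionary<string, Lazy<Task<Report>>> _inFlight =
        new ConcurrentDictionary<string, Lazy<Task<Report>>>();

    public Task<Report> GetReportAsync(string reportKey, Func<Task<Report>> generate)
    {
        // Only the first caller for a given key starts the generation; everyone else awaits it.
        var lazy = _inFlight.GetOrAdd(reportKey,
            _ => new Lazy<Task<Report>>(generate, isThreadSafe: true));
        return lazy.Value;
    }

    // Call this when the underlying data changes so the next request regenerates the report.
    public void Invalidate(string reportKey) => _inFlight.TryRemove(reportKey, out _);
}
```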
Or, if you do want to prevent returning outdated reports, you can let the caching layer become your main data provider, which keeps producing reports until one is generated that is not already outdated by the time it finishes. But be aware that if your database changes constantly, you might enter an endless loop here, constantly generating invalid reports, if the report generation time is longer than the average time between two changes to your DB.
As you can see, you have plenty of options here without actually talking about .NET, the TPL or SQL Server. First you need to set your goals for how fast, scalable and reliable your system should be; then you need to choose the appropriate architectural design, as described above, for your particular problem domain. I cannot do it for you because I do not have your full domain knowledge of what is acceptable and what is not.
The tricky part is the handover between the different queues with the proper reliability and correctness guarantees. Depending on your specific report generation needs, you can put this logic into the cloud, or use a single thread, by putting all work into the proper queues and working on them concurrently, one by one, or something in between.
The TPL and SQL Server can help there for sure, but they are only tools. If used wrongly, due to insufficient experience with one or the other, it might turn out that a different approach (like using only in-memory queues and persisting reports in the file system) is better suited to your problem.
From my current understanding, I would not misuse SQL Server as a cache, but if you want a database I would use something like RavenDB or RaportDB, which look stable and much more lightweight compared to a full-blown SQL Server.
But if you already have a SQL server running then go ahead and use it.
I am not sure if I understood you correctly, but you might want to have a look at JAMS Scheduler: http://www.jamsscheduler.com/. It's not free, but it's a very good system for scheduling dependent tasks and reporting. I have used it with success at my previous company. It's written in .NET and there is a .NET API for it, so you can write your own apps communicating with JAMS. They also have very good support and are eager to implement new features.