I am new to API development and I want to create a Web API endpoint that will receive a large amount of log data, which I then want to send to an Amazon S3 bucket via an Amazon Kinesis Firehose delivery stream. Below is a sample application that works fine, but I have no clue how to ingest a large inbound volume of data, what format my API should receive it in, or what my endpoint should look like.
[HttpPost]
public async Task Post() // HOW do I allow it to receive a large chunk of data?
{
    await WriteToStream();
}

private async Task WriteToStream()
{
    const string myStreamName = "test";
    Console.Error.WriteLine("Putting records in stream : " + myStreamName);
    // Write 10,000 UTF-8 encoded records to the stream.
    for (int j = 0; j < 10000; ++j)
    {
        // I AM HARDCODING DATA HERE FROM THE LOOP COUNTER!!!
        byte[] dataAsBytes = Encoding.UTF8.GetBytes("testdata-" + j);
        using (MemoryStream memoryStream = new MemoryStream(dataAsBytes))
        {
            PutRecordRequest putRecord = new PutRecordRequest();
            putRecord.DeliveryStreamName = myStreamName;
            Record record = new Record();
            record.Data = memoryStream;
            putRecord.Record = record;
            await kinesisClient.PutRecordAsync(putRecord);
        }
    }
}
P.S.: In a real-world app I will not have that for loop. I want my API to ingest large data; what should the definition of my API be? Do I need to use something like multipart/form-data or a file upload? Please guide me.
Here is my thought process. As you are exposing an API for logging, your input should contain the attributes below (a sketch of a matching request model follows the list):
Log level (info, debug, warn, fatal)
Log message (string)
Application ID
Application instance ID
Application IP
Host (the machine on which the error was logged)
User ID (for whom the error occurred)
Timestamp in UTC (the time at which the error occurred)
Additional data (customizable as XML/JSON)
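If it helps, here is a minimal sketch of what such a request model could look like in C# (the class and property names are illustrative, not taken from the original post):

using System;

// Hypothetical request model for the log-ingestion endpoint.
public class LogEntryDto
{
    public string LogLevel { get; set; }            // info, debug, warn, fatal
    public string Message { get; set; }
    public string ApplicationId { get; set; }
    public string ApplicationInstanceId { get; set; }
    public string ApplicationIp { get; set; }
    public string Host { get; set; }                // machine on which the error was logged
    public string UserId { get; set; }              // user for whom the error occurred
    public DateTime TimestampUtc { get; set; }      // time at which the error occurred
    public string AdditionalData { get; set; }      // customizable XML/JSON payload
}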
I suggest exposing the API as an AWS Lambda function behind API Gateway, as that will help it scale out as the load increases.
For a sample of how to build the API and use model binding, you may refer to https://learn.microsoft.com/en-us/aspnet/web-api/overview/formats-and-model-binding/model-validation-in-aspnet-web-api
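As a rough illustration only (assuming ASP.NET Core model binding, Newtonsoft.Json, the LogEntryDto sketch above, and the same Firehose client field as in the question), the endpoint could accept a batch of entries in the request body and forward them with PutRecordBatchAsync:

[HttpPost]
public async Task<IActionResult> Post([FromBody] List<LogEntryDto> entries)
{
    // Model binding deserializes the JSON request body into the DTO list.
    if (entries == null || entries.Count == 0)
        return BadRequest();

    var records = entries.Select(e => new Record
    {
        Data = new MemoryStream(Encoding.UTF8.GetBytes(JsonConvert.SerializeObject(e)))
    }).ToList();

    // Firehose PutRecordBatch accepts up to 500 records per call, so chunk larger batches.
    await kinesisClient.PutRecordBatchAsync(new PutRecordBatchRequest
    {
        DeliveryStreamName = "test",
        Records = records
    });

    return Ok();
}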
I don't have much context, so I will try to answer based on how I see it.
First, instead of sending the data to the Web API, I would send it directly to S3. In Azure there are Shared Access Signatures: you send a request to your API and it gives you a URL to upload the file to (there are many options; you can limit by time, or limit which IPs can upload). So to upload a file: 1. call your API to get an upload URL, 2. PUT the file to that URL. In Amazon the equivalent appears to be a pre-signed URL (signed policy).
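In AWS terms that means handing out a pre-signed PUT URL from your API. A minimal sketch with the AWS SDK for .NET (the bucket name and key are placeholders):

using System;
using Amazon.S3;
using Amazon.S3.Model;

public static string CreateUploadUrl(IAmazonS3 s3Client, string bucketName, string objectKey)
{
    // Pre-signed URL that allows a single PUT to this key for the next 15 minutes.
    var request = new GetPreSignedUrlRequest
    {
        BucketName = bucketName,
        Key = objectKey,
        Verb = HttpVerb.PUT,
        Expires = DateTime.UtcNow.AddMinutes(15)
    };
    return s3Client.GetPreSignedURL(request);
}

The client then PUTs the raw file body to the returned URL, so the bytes never pass through your API servers.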
After that, write a Lambda function that is triggered by the S3 upload. This function will send an event (again, I don't know exactly how it works in AWS, but in Azure I would send a blob queue message); this event will contain the URL of the file and a start position.
Write a second Lambda that listens to those events and does the actual processing. In my apps I sometimes know that processing N items takes 10 seconds, so I usually choose N so that a batch takes no longer than 10-20 seconds, due to the nature of deployments. If you have processed N rows and are not yet finished, send the same event again, but now with start position = the starting position plus N. (See the documentation on how to read a byte range for more info.)
Designing it this way, you can process large files; even better, you can be smarter, because you can send multiple events that specify a start line and an end line, so you can process the file across multiple instances.
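For the read-a-range part, S3 supports ranged GETs. A sketch with the AWS SDK for .NET, assuming the event carries the bucket, key and byte offsets:

using System.IO;
using System.Threading.Tasks;
using Amazon.S3;
using Amazon.S3.Model;

public static async Task<string> ReadRangeAsync(
    IAmazonS3 s3Client, string bucketName, string objectKey, long start, long end)
{
    // Fetch only the requested byte range instead of downloading the whole object.
    var request = new GetObjectRequest
    {
        BucketName = bucketName,
        Key = objectKey,
        ByteRange = new ByteRange(start, end)
    };

    using (GetObjectResponse response = await s3Client.GetObjectAsync(request))
    using (var reader = new StreamReader(response.ResponseStream))
    {
        return await reader.ReadToEndAsync();
    }
}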
P.S. The reason I would not recommend uploading files to the Web API is that those files will be held in memory; say you have 1 GB files being sent from multiple sources, and you will kill your servers in minutes.
P.S.2. The file format depends on your needs. It could be JSON, since that is the easiest format to read, but keep in mind that for large files it is expensive to read the whole file into memory; read it as a stream instead (a sketch follows). The other option is a flat file, which is also easy to read, since you can fetch a byte range and process it.
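A sketch of reading a large JSON log file as a stream (assuming Newtonsoft.Json and a file that is one big JSON array), so only one entry is in memory at a time:

using System.IO;
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;

public static void ProcessLargeJson(string path)
{
    using (var file = File.OpenText(path))
    using (var reader = new JsonTextReader(file))
    {
        // Walk the file token by token instead of loading the whole document.
        while (reader.Read())
        {
            if (reader.TokenType == JsonToken.StartObject)
            {
                JObject logEntry = JObject.Load(reader);
                // process logEntry here
            }
        }
    }
}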
P.S.3. In Azure I would use Azure Batch jobs for this.
I have a C# REST API with an upload endpoint whose sole purpose is to process a binary file and add its metadata (as an Attachment model) to a List<Attachment> property of a different entity.
When I call the endpoint from my web application in a sequential manner like below (pseudo code), the endpoint does as intended and processes each binary file and adds a single Attachment to the provided entity.
const attachments = [Attachment, Attachment, Attachment];
for (const attachment of attachments) {
  await this.api.upload(attachment);
}
But when I try to upload the attachments in a parallel manner like below (pseudo code), each binary file gets processed properly, but only one Attachment metadata object gets added to the entity.
const attachments = [Attachment, Attachment, Attachment];
const requests = attachments.map((a) => this.api.upload(a));
await Promise.all(requests);
The endpoint basically does the following (simplified):
var attachment = new Attachment()
{
    // Metadata is collected from the binary (FormFile)
};

using (var session = Store.OpenAsyncSession())
{
    var entity = await session.LoadAsync<Entity>(entityId);
    entity.Attachments.Add(attachment);
    await session.StoreAsync(entity);
    await session.SaveChangesAsync();
}
I suspect that the problem is that the endpoint is called multiple times at once. Both requests open a database session (at the same time) and load the entity into memory. Each adds its Attachment to the entity and updates it in the database. The attachment you see saved in the database is from the request that finishes last, i.e. the request that takes the longest.
I've tried to recreate the issue by creating this example. When you open the link, the example runs right away. You can see the created entities on this database server.
Open the Hogwarts database and then open the contact Harry Potter; you will see two attachments added. When you open the contact Hermione Granger you only see one attachment (Second.txt), although it should also have both attachments.
What is the best approach to solve this issue? I prefer not having to send the files as a batch to the endpoint. Appreciate any help!
PS: You might need to run the example manually by clicking on Run. If the database doesn't exist on the server (as the server gets emptied automatically) you can create it manually with the Hogwarts name. And because it looks like a race condition, sometimes both Attachment items are added properly. So you might need to run the example a few times.
That is a fairly classic example of a race condition when writing to the database; you are correct.
The sequence of events is:
Req 1 loads doc, Attachments = []
Req 2 loads doc, Attachments = []
Req 1 Attachments.Push()
Req 2 Attachments.Push()
Req 1 SaveChanges()
Req 2 SaveChanges()
The second SaveChanges() overwrites the first, so you are losing data.
There are two ways to handle this. The first is to enable optimistic concurrency for this particular scenario; see the documentation on the topic:
https://ravendb.net/docs/article-page/4.2/csharp/client-api/session/configuration/how-to-enable-optimistic-concurrency#enabling-for-a-specific-session
Basically, you can do session.Advanced.UseOptimisticConcurrency = true; to cause the transaction to fail if the document was updated behind the scenes.
You can then retry the transaction to make it work (make sure to create a new session).
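A minimal sketch of that retry pattern, based on the endpoint code from the question (RavenDB 4.x; the retry count is arbitrary):

// Retry the whole unit of work when another request wins the race.
var retries = 3;
while (true)
{
    try
    {
        using (var session = Store.OpenAsyncSession())
        {
            session.Advanced.UseOptimisticConcurrency = true;

            var entity = await session.LoadAsync<Entity>(entityId);
            entity.Attachments.Add(attachment);
            await session.SaveChangesAsync();
        }
        break;
    }
    catch (Raven.Client.Exceptions.ConcurrencyException) when (retries-- > 0)
    {
        // The document changed behind our back; a fresh session will load the
        // latest version on the next attempt.
    }
}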
Alternatively, you can use the patching API, which will allow you to add an item to the document concurrently safely.
Here is the relevant documentation:
https://ravendb.net/docs/article-page/4.2/csharp/client-api/operations/patching/single-document#add-item-to-array
Note that there is one consideration here: this only works if you don't care about the order of the operations (because they can happen in any order).
If there is a business use case behind the order, you probably cannot use the patch API easily and need to go with the full transaction route.
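For completeness, a sketch of the patching approach using the same names as the endpoint in the question (RavenDB 4.x session patching API); the add happens server-side, so concurrent requests don't overwrite each other:

using (var session = Store.OpenAsyncSession())
{
    // Appends the attachment to the Attachments array as a server-side patch.
    session.Advanced.Patch<Entity, Attachment>(
        entityId,
        x => x.Attachments,
        attachments => attachments.Add(attachment));

    await session.SaveChangesAsync();
}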
The Revit add-in is working perfectly, and I have also converted it correctly for Design Automation. I have debugged it with the local debugger and it worked perfectly.
So I can say the app bundle is doing well.
Now, coming to the web application code, it works until the last line, the one that assigns "workItemStatus".
I need an RFA file and a big JSON file as input files to run the code; together they are about 1 MB in size. But the code is stuck (endlessly waiting) when uploading the file, and the workitem does not start.
I read in another Stack Overflow post that Forge does not allow uploads of more than 16 KB to an OSS bucket by.....
Url = string.Format("https://developer.api.autodesk.com/oss/v2/buckets/{0}/objects/{1}", bucketKey, inputFileNameOSS)
That post says I will need to upload bigger files to another cloud service and use a signed URL instead of the Forge OSS bucket.
The code looks correct while debugging, and it gets stuck when it reaches the line
WorkItemStatus workItemStatus = await _designAutomation.CreateWorkItemAsync(workItemSpec);
I have debugged the code; it looks like it works perfectly up to the "workItemStatus" assignment in "StartWorkItem" in DesignAutomationController.cs.
Every key and value looks like it is passed correctly.
Is it because of the file size? As the JSON file is big, I am uploading it the same way as the other input (.rfa/.rvt) files.
string callbackUrl = string.Format(
    "{0}/api/forge/callback/designautomation?id={1}&outputFileName={2}",
    OAuthController.GetAppSetting("FORGE_WEBHOOK_URL"), browerConnectionId, outputFileNameOSS);

WorkItem workItemSpec = new WorkItem()
{
    ActivityId = activityName,
    Arguments = new Dictionary<string, IArgument>()
    {
        { "inputFile", inputFileArgument },
        { "inputJsonFile", inputFileArgument1 },
        { "outputFile", outputFileArgument },
        { "onComplete", new XrefTreeArgument { Verb = Verb.Post, Url = callbackUrl } }
    }
};

WorkItemStatus workItemStatus = await _designAutomation.CreateWorkItemAsync(workItemSpec);
return Ok(new { WorkItemId = workItemStatus.Id });
I read in another Stack Overflow post that Forge does not allow more than a 16 KB upload to the OSS bucket by..
The 16 KB limit is on the payload of Design Automation endpoints, including the workitem. The limits are defined here. If the workitem payload exceeds 16 KB you will see an HTTP 413 Payload Too Large error.
To send large JSON inputs to Design Automation, you may first upload the JSON to OSS (or even another storage service such as Amazon S3), then call the workitem with a signed URL to the JSON file (similar to the signed URL for the RFA file).
Edit:
1. Large JSON files can be uploaded to OSS using the Data Management upload endpoint.
2. A signed URL with read access can then be obtained for that object using the corresponding endpoint.
3. The signed URL can then be passed to the Design Automation workitem payload as an input argument, instead of embedding the JSON contents into the payload (see the sketch below).
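A sketch of step 3 with the Design Automation .NET client, adapted from the workitem code in the question (jsonSignedUrl stands for the read-signed URL obtained in step 2):

// Pass the signed URL of the uploaded JSON instead of embedding its contents.
XrefTreeArgument inputJsonArgument = new XrefTreeArgument
{
    Url = jsonSignedUrl,   // signed URL with read access, obtained in step 2
    Verb = Verb.Get
};

WorkItem workItemSpec = new WorkItem
{
    ActivityId = activityName,
    Arguments = new Dictionary<string, IArgument>
    {
        { "inputFile", inputFileArgument },
        { "inputJsonFile", inputJsonArgument },
        { "outputFile", outputFileArgument },
        { "onComplete", new XrefTreeArgument { Verb = Verb.Post, Url = callbackUrl } }
    }
};

WorkItemStatus workItemStatus = await _designAutomation.CreateWorkItemAsync(workItemSpec);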
I am currently writing a Discord bot in C#. I have most of the bot done, but for the next update I want to add the ability to check whether a streamer has gone live. Currently I am polling the Twitch API, pulling the JSON it returns, and checking whether the JSON stream object is null or not. But it takes 3-5 minutes after the streamer goes live before it finally sees that the stream is not null, even though I poll the JSON every 5 seconds. Is there any way to do this more efficiently? My code is below:
private const string Url = "https://api.twitch.tv/kraken/streams/streamer";

var request = (HttpWebRequest)WebRequest.Create(Url);
request.Method = "GET";
request.Timeout = 12000;
request.ContentType = "application/vnd.twitchtv.v5+json";
request.Headers.Add("Client-ID", "ID");

using (var s = request.GetResponse().GetResponseStream())
{
    using (var sr = new System.IO.StreamReader(s))
    {
        var jsonObject = JObject.Parse(sr.ReadToEnd());
        var jsonStream = jsonObject["stream"];
        // The Twitch channel is online if "stream" is not null.
        LastTwitchStatus = jsonStream.Type != JTokenType.Null;
    }
}
It looks like this is intended behavior of the Twitch API.
They are clearly more focused on putting their horsepower into streaming rather than immediate data delivery through the API.
While there might be a limitation like this, you can try scraping the page if timing is crucial and you don't want to wait 3-5 minutes for something that has already happened.
One idea is to poll the page every 5 seconds or so and then query the HTML document for something characteristic that distinguishes an offline channel from an online one.
Idea for scraping in JavaScript (just replicate it in .NET):
For example, I have tried to query user pages (https://www.twitch.tv/username) in JavaScript with:
$(".recent-past-broadcast").length > 0
and for a user that is not broadcasting it yields true, while for a broadcasting user it yields false. That might be a problem for a user with no recent broadcast history, though.
You can also try checking the videos page (https://www.twitch.tv/username/videos/all) for the live indicator, like:
$(".cn-livestatus__circle").length > 0
It will yield true for a streaming user and false for one that is not streaming (even if he/she is online).
Of course this is the least efficient way of doing it and requires a lot more downloading compared to just polling, but... it still seems more up to date than asking the API every 5 seconds and getting the actual state delayed by 3-5 minutes.
Just replicate the queries above in .NET and you're there.
You could also mix the two approaches: once you see that someone has started streaming, disable page scraping and switch to API calls only to check whether you're still up to date.
Useful tooling for scraping:
For parsing HTML documents in .NET, use a parser like AngleSharp:
https://github.com/AngleSharp/AngleSharp
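A rough .NET equivalent of the JavaScript checks above using AngleSharp (the CSS selector is the one from the answer; Twitch's markup can change at any time, and pages rendered client-side may not contain it in the initial HTML, so treat this as an assumption):

using System.Threading.Tasks;
using AngleSharp;

public static async Task<bool> IsLiveByScrapingAsync(string username)
{
    // Download and parse the channel's videos page, then look for the live indicator.
    var context = BrowsingContext.New(Configuration.Default.WithDefaultLoader());
    var document = await context.OpenAsync($"https://www.twitch.tv/{username}/videos/all");
    return document.QuerySelectorAll(".cn-livestatus__circle").Length > 0;
}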
When I execute my QueryRequest object, I get a totalRows of around 110,000 while the response rows are around 38,000. So I am not receiving the entire result and must do paging.
But I see that my QueryRequest object has no startIndex property.
How can I receive the entire result set?
I am using a Premium version of Google Analytics. Does Google still return 10MB of data with each request?
UPDATE: I don't think my question is a duplicate. What I meant by the question was how can I get a specific page of results when my QueryRequest has no startIndex property.
JobsResource j = null;
QueryRequest qr = null;
...
qr.Query = "SELECT examplecolumns FROM myTable";
QueryResponse response = j.Query(qr, projectId).Execute();
Call the getQueryResults() method to fetch the rest of the results.
https://cloud.google.com/bigquery/docs/reference/v2/jobs/getQueryResults
You cannot return the entire result set in one response because Google has general quota limits for all of their APIs:
General Quota Limits (All APIs)
The following quota limits are shared between the Management API, Core Reporting API, MCF Reporting API, Metadata API, and Real Time Reporting API.
50,000 requests per project per day – can be increased
10 queries per second (QPS) per IP.
In the Developers Console this quota is referred to as the per-user limit. By default, it is set to 1 query per second (QPS) and can be adjusted to a maximum value of 10. If the per-user limit is set to a value larger than 10 QPS, the Google Analytics quota policy will still take effect and limit per-user requests to 10 QPS.
If your application makes all API requests from a single IP address (i.e. on behalf of your users) you should consider using the userIP or quotaUser parameters with each request to get full QPS quota for each user. See the query parameters summary for details.
For more information have a look at this link: Configuration and Reporting API Limits and Quotas
You can also find more information on the subject here: Querying Data
The following additional limits apply for querying data.
Maximum tables per query: 1,000
Maximum query length: 256 KB
The query / getQueryResults methods are used to push some of the waiting for job completion onto the server. Clients may see faster notification of job completion when using this mechanism, and will receive the first page of the query results in that response, avoiding one additional round trip to fetch the data.
The general mechanism for using these APIs, in pseudo code, is:
response = query(...)

while (!response.jobComplete) {
  response = getQueryResults(response.jobReference)
}

moreData = false
do {
  // consume response.rows
  moreData = response.pageToken != null
  if (moreData) {
    response = getQueryResults(response.jobReference, response.pageToken)
  }
} while (moreData)
Note that in some type-safe languages this will be more awkward to code, since the response consumed in that do loop may be either a QueryResponse or a GetQueryResultsResponse, depending on whether the query job finished within the initial timeout on the query() call or in the while (!response.jobComplete) polling loop.
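A sketch of that loop in C# with the Google.Apis.Bigquery.v2 client used in the question (property and method names as I understand that library; treat the details as assumptions):

// Requires: using System.Collections.Generic; using Google.Apis.Bigquery.v2;
//           using Google.Apis.Bigquery.v2.Data;
// Run the query, wait for the job to complete, then page through all rows.
QueryResponse queryResponse = j.Query(qr, projectId).Execute();
string jobId = queryResponse.JobReference.JobId;

var allRows = new List<TableRow>();
string pageToken = null;
bool jobComplete;

do
{
    var getResults = j.GetQueryResults(projectId, jobId);
    getResults.PageToken = pageToken;
    GetQueryResultsResponse page = getResults.Execute();

    jobComplete = page.JobComplete ?? false;
    if (jobComplete && page.Rows != null)
        allRows.AddRange(page.Rows);

    pageToken = page.PageToken;
} while (!jobComplete || pageToken != null);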
I have an application which is intended to stream videos back from our local DB. I spent a lot of time yesterday attempting to return the data as either a RangeFileContentResult or a RangeFileStreamResult, without success.
In short, when I return the file as either of these two results I cannot seem to get a video to stream correctly (or play at all).
The request from the browser gets sent with the following headers:
Range: bytes=0-
And the response comes back with these headers, as an example:
Accept-Ranges: bytes
Content-Range: bytes 0-5103295/5103296
In terms of network traffic, I get a series of 206s for partial results, then a 200 at the end (according to Fiddler), which seems correct.
Chrome's network tab disagrees with this and sees an initial request (always 13 bytes, which I assume is a handshake), then a couple more requests which have a status of either cancelled or pending.
As far as I understand, this is more or less correct: 206 - cancel, 206 - cancel, etc. But the video never plays.
If I switch the result from my controller to a FileResult, the video plays in Chrome, IE10 and Firefox, and appears to begin playing before the download is complete (which feels a little like it's streaming! although I suspect it's not).
But with the range result I get nothing in Chrome or IE, and the entire video downloads in one drop in Firefox.
As far as I understood, the RangeFileContentResult should handle responding to the client with a range of bytes to download (which mine doesn't seem to do; it just tells it to get the whole file, as illustrated by the response above), and the client should respond to that, which it doesn't seem to do.
Does anyone have any thoughts in this area? Specifically:
a) Should RangeFileContentResult be sending a range of bytes back to the client?
b) Is there any way I can explicitly control the range of bytes requested from the client side?
c) Is there any reason or anything I'm doing wrong here which would cause browsers not to load the video at all, when requesting a RangeFileContentResult?
EDIT: Added a diagram to help describe what I'm seeing:
EDIT 2: OK, so the plot thickens. Whilst playing around with the ranged-file gubbins, we needed to push out another system test version, and I left the RangeFileContentResult on my controller action, as below:
private ActionResult RetrieveVideo(MediaItem media)
{
    return new RangeFileContentResult(
        media.Content,
        media.MimeType,
        media.Id.ToString(),
        DateTime.Now);
}
Rather oddly, this now seems to work as expected on our Azure system test environment, but still not on my local machine. I wonder if there's something IIS-based that works happily on Azure's IIS 8 but not on my local IIS 7.5 instance?
The reason for the issue described here is the value passed to the modificationDate parameter of the RangeFileContentResult constructor:
return new RangeFileContentResult(media.Content, media.MimeType, media.Id.ToString(), DateTime.Now);
This date is used by RangeFileResult to create two headers:
ETag - an identifier used by the browser and server to make sure that they are talking about the same entity.
Last-Modified - informs the browser of the last modification date of the entity.
The fact that DateTime.Now is passed every time the browser makes a partial request means the ETag and Last-Modified header values can change before the client has received the whole entity (usually when the entire process takes longer than one second).
In the case described above, the browser sends an If-Range header with the request. This header tells the server that the entire entity should be resent if the entity tag (or modification date; If-Range can carry either of those two values) doesn't match. This is what happens in this case.
The fact that the modification date is "dynamic" may also cause further issues if the client decides to use one of the following headers for validation: If-Modified-Since, If-Unmodified-Since, If-Match, If-None-Match.
The solution in this situation is to store a modification date in the database with the file, so that it stays consistent.
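A minimal sketch of that fix, assuming the MediaItem entity is extended with a LastModifiedUtc value stored alongside the file (the property name is hypothetical):

private ActionResult RetrieveVideo(MediaItem media)
{
    // Use the stored modification date so the ETag/Last-Modified pair stays
    // stable across all partial (Range) requests for the same item.
    return new RangeFileContentResult(
        media.Content,
        media.MimeType,
        media.Id.ToString(),
        media.LastModifiedUtc);   // hypothetical column persisted with the file
}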
There is also room for optimization here. Instead of grabbing the whole video from the DB every time a partial request is made, one can either cache it or grab only the relevant part (if the database engine the application is using allows such an operation). Such a mechanism can be implemented as a specialized action result by deriving from RangeFileResult and overriding the WriteEntireEntity and WriteEntityRange methods.
OK, so I didn't have enough time to look at RangeFileResult in detail, but I have just downloaded the file (RangeFileContentResult) from
RangeFileContentResult
and modified my code so it looks like this:
public ActionResult Movie()
{
    byte[] file = System.IO.File.ReadAllBytes(@"C:\HOME\asp\Java\Java EE. Programming Spring 3.0\01.avi");
    return new RangeFileContentResult(file, "video/x-msvideo", "01.avi", DateTime.Now);
}
and again it works. However, I noticed that when I stop the video I get an exception, and it happens in RangeFileResult:
if (context.HttpContext.Response.IsClientConnected)
{
    WriteEntityRange(context.HttpContext.Response, RangesStartIndexes[i], RangesEndIndexes[i]);

    if (MultipartRequest)
        context.HttpContext.Response.Write("\r\n");

    context.HttpContext.Response.Flush();
}
So you had better modify the code to handle the case where the user has already disconnected but you are still trying to send them a response.
Again, technically there is not much difference between passing a byte[] and a Stream, because even when you pass a Stream, the code working with it
using (FileStream)
{
    FileStream.Seek(rangeStartIndex, SeekOrigin.Begin);
    int bytesRemaining = Convert.ToInt32(rangeEndIndex - rangeStartIndex) + 1;
    byte[] buffer = new byte[_bufferSize];
    while (bytesRemaining > 0)
    {
        int bytesRead = FileStream.Read(buffer, 0, _bufferSize < bytesRemaining ? _bufferSize : bytesRemaining);
        response.OutputStream.Write(buffer, 0, bytesRead);
        bytesRemaining -= bytesRead;
    }
}
again reads the data into a byte[] buffer... so it's up to you!
But I suggest you pay attention to the content type that you provide: your browser must be able to handle it, so if you provide something unknown you will definitely have problems. To find your content type string, please check
mime-types-by-content-type
Again, I only gave this a quick look; if you have problems I will help you later when I get home.
mofi, please just copy these two files into your MVC project:
RangeFileResult
RangeFileStreamResult
public ActionResult Movie()
{
    var path = new FileStream(@"C:\temp\01.avi", FileMode.Open);
    return new RangeFileStreamResult(path, "video/x-msvideo", "01.avi", DateTime.Now);
}
Now run your project and open it in Chrome (for example: http://youraddress.com:45454/Main/Movie); you should see your file playing in the standard Chrome video player. It is streaming, and you can see that if you put a breakpoint at
return new RangeFileStreamResult(path, "video/x-msvideo", "01.avi", DateTime.Now);
Again, the source is easy to modify if you want to change the buffer size used for streaming.