You can create a CSV load job to load data from a CSV file in Google Cloud Storage by using the BigQueryClient in Google.Cloud.BigQuery.V2, which has a CreateLoadJob method.
How can you guarantee idempotency with this API, so that if, say, the network dropped before you got a response and you kicked off a retry, you would not end up with the same data being loaded into BigQuery multiple times?
Example API usage
private void LoadCsv(string sourceUri, string tableId, string timePartitionField)
{
    var tableReference = new TableReference()
    {
        DatasetId = _dataSetId,
        ProjectId = _projectId,
        TableId = tableId
    };
    var options = new CreateLoadJobOptions
    {
        WriteDisposition = WriteDisposition.WriteAppend,
        CreateDisposition = CreateDisposition.CreateNever,
        SkipLeadingRows = 1,
        SourceFormat = FileFormat.Csv,
        TimePartitioning = new TimePartitioning
        {
            Type = _partitionByDayType,
            Field = timePartitionField
        }
    };
    BigQueryJob loadJob = _bigQueryClient.CreateLoadJob(sourceUri: sourceUri,
                                                        destination: tableReference,
                                                        schema: null,
                                                        options: options);
    // PollUntilCompleted returns the job with its final status; the original
    // BigQueryJob is not updated in place, so capture the returned job.
    loadJob = loadJob.PollUntilCompleted();
    if (loadJob.Status.Errors == null || !loadJob.Status.Errors.Any())
    {
        // Log success
        return;
    }
    // Log error
}
You can achieve idempotency by generating your own job ID based on, for example, the file location you loaded from and the target table:
job_id = 'my_load_job_{}'.format(hashlib.md5((sourceUri + _projectId + _dataSetId + tableId).encode()).hexdigest())
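The line above is Python; a minimal C# sketch of the same idea (assuming the fields from the example method, plus the System, System.Security.Cryptography and System.Text namespaces) might look like:
string jobId;
using (var md5 = MD5.Create())
{
    // Hash the source and destination so the same load always yields the same job ID.
    byte[] hash = md5.ComputeHash(Encoding.UTF8.GetBytes(sourceUri + _projectId + _dataSetId + tableId));
    jobId = "my_load_job_" + BitConverter.ToString(hash).Replace("-", string.Empty).ToLowerInvariant();
}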
var options = new CreateLoadJobOptions
{
    WriteDisposition = WriteDisposition.WriteAppend,
    CreateDisposition = CreateDisposition.CreateNever,
    SkipLeadingRows = 1,
    JobId = jobId, // add this
    SourceFormat = FileFormat.Csv,
    TimePartitioning = new TimePartitioning
    {
        Type = _partitionByDayType,
        Field = timePartitionField
    }
};
In this case, if you try to re-insert the same job ID, you get an error. You can also regenerate this job ID later to check on the job's status in case polling failed.
There are two places you could end up losing the response:
When creating the job to start with
When polling for completion
The first one is relatively tricky to recover from without a job ID; you could list all the jobs in the project and try to find one that looks like the one you'd otherwise create.
However, the C# client library generates a job ID so that it can retry, or you can specify your own job ID via CreateLoadJobOptions.
The second failure time is much simpler: keep the returned BigQueryJob so you can retry the polling if that fails. (You could store the job name so that you can recover even if your process dies while waiting for it to complete, for example.)
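Putting the two failure points together, here is a hedged sketch of a retry-safe flow, reusing the deterministic jobId, tableReference and options from above; treating an HTTP 409 Conflict as "job ID already exists" is an assumption to verify against your client library version:
BigQueryJob job;
try
{
    job = _bigQueryClient.CreateLoadJob(sourceUri, tableReference, null, options);
}
catch (Google.GoogleApiException e) when (e.HttpStatusCode == System.Net.HttpStatusCode.Conflict)
{
    // The job ID already exists, so an earlier attempt reached the server:
    // re-attach to the existing job instead of loading the data again.
    job = _bigQueryClient.GetJob(jobId);
}
// Polling can be retried safely; with the stored job ID, even a new
// process can recover by calling GetJob before polling.
job = job.PollUntilCompleted().ThrowOnAnyError();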
Related
Since collection.InsertOne(document) returns void, how do I know that the document was written to the database for sure? I have a function which needs to run exactly after the document is written to the database.
How can I check that without running a new query?
"Since collection.InsertOne(document) returns void" - is wrong, see db.collection.insertOne():
Returns: A document containing:
A boolean acknowledged as true if the operation ran with write concern or false if write concern was disabled.
A field insertedId with the _id value of the inserted document.
So, run
ret = db.collection.insertOne({your document})
print(ret.acknowledged);
or
print(ret.insertedId);
to directly get the _id of the inserted document.
The write concern can be configured on either the connection string or the MongoClientSettings which are both passed in to the MongoClient object on creation.
var client = new MongoClient(new MongoClientSettings
{
    WriteConcern = WriteConcern.W1
});
More information on write concern can be found on the MongoDB documentation - https://docs.mongodb.com/manual/reference/write-concern/
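The same write concern can also be expressed in the connection string; for example (the host is a placeholder):
// "w=1" requests acknowledgement from the primary; adjust to your needs.
var client = new MongoClient("mongodb://localhost:27017/?w=1");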
If the document is not saved, the C# driver will throw an exception (MongoWriteException).
Also, with an acknowledged write concern (Acknowledged/W1 or stronger), you'll also get back the Id of the document you've just saved.
var client = new MongoClient(new MongoClientSettings
{
    WriteConcern = WriteConcern.W1
});
var db = client.GetDatabase("test");
var orders = db.GetCollection<Order>("orders");

var newOrder = new Order { Name = $"Order-{Guid.NewGuid()}" };
await orders.InsertOneAsync(newOrder);
Console.WriteLine($"Order Id: {newOrder.Id}");

// Output
// Order Id: 5f058d599f1f033f3507c368

public class Order
{
    public ObjectId Id { get; set; }
    public string Name { get; set; }
}
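To act on a failed write explicitly, a small sketch of catching the exception around the insert above (the duplicate-key check is just an illustrative assumption):
try
{
    await orders.InsertOneAsync(newOrder);
    // Safe to run the follow-up function here: the write was acknowledged.
}
catch (MongoWriteException ex)
{
    // Inspect the server error, e.g. a unique-index violation.
    if (ex.WriteError.Category == ServerErrorCategory.DuplicateKey)
        Console.WriteLine("Duplicate key");
    throw;
}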
I have JSON documents in a document DB (~30k documents), where each document has a unique ID such as AA123 or AA124. There is a tool we use to pull those documents from the document DB, and it has a restriction of 500 documents per GET request. So it takes 60 GET requests to fetch the full result, which takes some time. I am looking to optimize this to run quickly (running requests in parallel) so that I can get the data faster. Below is a sample of how I am pulling the data from the DB as of now.
private const int maxItemsPerCall = 500; // const so the static SearchRequest below can use it

public override async Task<IEnumerable<docClass>> Getdocuments()
{
    string accessToken = "token";
    SearchResponse<docClass> docs = await db.SearchDocuments<docClass>(initialload, accessToken); // Gets top 500
    List<docClass> routeRules = new List<docClass>();
    routeRules.AddRange(docs.Documents);
    var remainingCalls = (docs.TotalDocuments / maxItemsPerCall);
    while (remainingCalls > 0 && docs.TotalDocuments > maxItemsPerCall)
    {
        docs = await db.SearchDocuments<docClass>(GetFollowUp(docs.Documents.LastOrDefault().Id.Id), accessToken);
        routeRules.AddRange(docs.Documents);
        remainingCalls--;
    }
    return routeRules;
}
private static SearchRequest initialload = new SearchRequest()
{
    Filter = new SearchFilterGroup(
        new[]
        {
            new SearchFilter(Field.Type, FilterOperation.Equal, "documentRule")
        },
        GroupOperator.And),
    OrderBy = Field.Id,
    Top = maxItemsPerCall,
    Descending = false
};
private static SearchRequest GetFollowUp(string lastId)
{
    SearchRequest followUpRequest = new SearchRequest()
    {
        Filter = new SearchFilterGroup(
            new[]
            {
                new SearchFilter(Field.Type, FilterOperation.Equal, "documentRule"),
                new SearchFilter(Field.Id, FilterOperation.GreaterThan, lastId)
            },
            GroupOperator.And),
        OrderBy = Field.Id,
        Top = maxItemsPerCall,
    };
    return followUpRequest;
}
Help needed: since each GET request fetches 500 documents based on the last ID of the previous call, how can I run this in parallel (at least 5 parallel requests at a time, fetching 500 records each, i.e. 2,500 in total across the 5 requests)? I am not familiar with threading, so it would be helpful if someone could suggest how to do this. One possible approach is sketched below.
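A hedged sketch: because each follow-up request depends on the last ID of the previous page, the paging as written is inherently sequential. To parallelize, you can partition the ID space up front and page each partition independently. This assumes the search API also supports a LessThanOrEqual filter operation (not shown in the question), and the boundary IDs are hypothetical placeholders you would derive from your real ID distribution.
// Build a request for one ID range: (afterId, upToId].
private static SearchRequest GetRangeRequest(string afterId, string upToId) => new SearchRequest()
{
    Filter = new SearchFilterGroup(
        new[]
        {
            new SearchFilter(Field.Type, FilterOperation.Equal, "documentRule"),
            new SearchFilter(Field.Id, FilterOperation.GreaterThan, afterId),
            new SearchFilter(Field.Id, FilterOperation.LessThanOrEqual, upToId) // assumed operation
        },
        GroupOperator.And),
    OrderBy = Field.Id,
    Top = maxItemsPerCall
};

public async Task<IEnumerable<docClass>> GetDocumentsParallel(IReadOnlyList<string> boundaries, string accessToken)
{
    // boundaries is a sorted list like ["AA000", "AG000", "AM000", ...]:
    // n+1 boundary values define n non-overlapping ranges.
    var throttle = new SemaphoreSlim(5); // at most 5 concurrent requests
    var tasks = new List<Task<List<docClass>>>();
    for (int i = 0; i < boundaries.Count - 1; i++)
    {
        tasks.Add(FetchRangeAsync(boundaries[i], boundaries[i + 1], accessToken, throttle));
    }
    var ranges = await Task.WhenAll(tasks);
    return ranges.SelectMany(r => r);
}

private async Task<List<docClass>> FetchRangeAsync(string afterId, string upToId, string accessToken, SemaphoreSlim throttle)
{
    await throttle.WaitAsync();
    try
    {
        var results = new List<docClass>();
        SearchResponse<docClass> page;
        do
        {
            // Page sequentially *within* the range; the ranges themselves run in parallel.
            page = await db.SearchDocuments<docClass>(GetRangeRequest(afterId, upToId), accessToken);
            results.AddRange(page.Documents);
            var last = page.Documents.LastOrDefault();
            if (last == null) break;
            afterId = last.Id.Id;
        } while (page.Documents.Count() == maxItemsPerCall);
        return results;
    }
    finally
    {
        throttle.Release();
    }
}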
I am trying to create a job using the C# API and the DataLakeAnalyticsJobManagementClient, and every attempt fails with the error message "Invalid job definition." There is no other useful information about what is invalid about it. The job is a U-SQL job; I began by creating it in the Azure portal, where it worked just fine and ran correctly with no errors.
I am building up the JobInformation and JobProperties using the same information the portal test job used, and I know the U-SQL statements are valid.
JobProperties props = new JobProperties(File.ReadAllText(@"C:\myusqlscript.usql"));
var myId = Guid.NewGuid();
JobInformation jobNfo = new JobInformation("mysamplejob", JobType.USql, props, myId) { DegreeOfParallelism = 1, Priority = 1000 };
jobNfo.Validate(); // <-- this doesn't throw an exception either
var jobs = await _adlaJobClient.Job.ListAsync("myanalyticsaccountname");
var adlaJob = await _adlaJobClient.Job.CreateAsync("myanalyticsaccountname", myId, jobNfo);
I have tried various combinations of constructors and property settings, including just using defaults for some of the properties, and I get the same result: "Invalid job definition." There is no other info indicating missing information, formatting issues, or anything like that.
Anyone out there created Azure Data Lake Analytics jobs with the C# API?
You need to use USqlJobProperties instead of JobProperties.
var props = new USqlJobProperties(File.ReadAllText(@"C:\myusqlscript.usql"));
The official "Get started with Data Lake Analytics using .NET SDK" document is no longer available, but we can still get some useful sample code from that document's revision history.
public static Guid SubmitJobByPath(string scriptPath, string jobName)
{
    var script = File.ReadAllText(scriptPath);
    var jobId = Guid.NewGuid();
    var properties = new USqlJobProperties(script);
    var parameters = new JobInformation(jobName, JobType.USql, properties, priority: 1, degreeOfParallelism: 1, jobId: jobId);
    var jobInfo = _adlaJobClient.Job.Create(_adlaAccountName, jobId, parameters);
    return jobId;
}
We are having an issue with searching a custom record through SuiteTalk. Below is a sample of what we are calling. The issue is in setting up the search using the internalId of the record: in our initial development account the internal ID of this custom record is 482, but when we deployed it through our bundle, the record was assigned the internal ID 314. It stands to reason that this internal ID is not static across per-site installs, so we wondered which property to use to reference the custom record. When we made the record we assigned its scriptId to be 'customrecord_myCustomRecord', but through SuiteTalk we do not have a scriptId. What is the best way to allow this code to work in all environments and not just a specific one? If possible, could you give an example of how it might be used?
Code (C#) that we are attempting to make the call from. We are using the 2013.2 endpoints at this time.
private SearchResult NetSuite_getPackageContentsCustomRecord(string sParentRef)
{
    List<object> PackageSearchResults = new List<object>();
    CustomRecord custRec = new CustomRecord();
    CustomRecordSearch customRecordSearch = new CustomRecordSearch();

    SearchMultiSelectCustomField searchFilter1 = new SearchMultiSelectCustomField();
    searchFilter1.internalId = "customrecord_myCustomRecord_sublist";
    searchFilter1.@operator = SearchMultiSelectFieldOperator.anyOf;
    searchFilter1.operatorSpecified = true;

    ListOrRecordRef lRecordRef = new ListOrRecordRef();
    lRecordRef.internalId = sParentRef;
    searchFilter1.searchValue = new ListOrRecordRef[] { lRecordRef };

    CustomRecordSearchBasic customRecordBasic = new CustomRecordSearchBasic();
    customRecordBasic.recType = new RecordRef();
    customRecordBasic.recType.internalId = "314"; // "482"; // THIS LINE IS GIVING US THE TROUBLE
    //customRecordBasic.recType.name = "customrecord_myCustomRecord";
    customRecordBasic.customFieldList = new SearchCustomField[] { searchFilter1 };

    customRecordSearch.basic = customRecordBasic;

    // Search for the custom record
    SearchResult results = _service.search(customRecordSearch);
    return results;
}
I searched all over for a solution to avoid hardcoding internalIds. Even NetSuite support failed to give me a solution. Finally I stumbled upon a solution in NetSuite's knowledge base: getCustomizationId.
This returns the internalId, scriptId, and name for all custom records (or customRecordTypes in NetSuite terms, which is what made it hard to find).
public string GetCustomizationId(string scriptId)
{
    // Perform getCustomizationId on custom record types
    CustomizationType ct = new CustomizationType();
    ct.getCustomizationTypeSpecified = true;
    ct.getCustomizationType = GetCustomizationType.customRecordType;

    // Retrieve active custom record type IDs. The includeInactives param is set to false.
    GetCustomizationIdResult getCustIdResult = _service.getCustomizationId(ct, false);

    foreach (var customizationRef in getCustIdResult.customizationRefList)
    {
        if (customizationRef.scriptId == scriptId) return customizationRef.internalId;
    }
    return null;
}
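With that helper, the hardcoded value in the question's search could then be resolved at runtime, along these lines:
// Resolve the record type's internalId from its stable scriptId
// instead of hardcoding "314" / "482".
customRecordBasic.recType = new RecordRef();
customRecordBasic.recType.internalId = GetCustomizationId("customrecord_myCustomRecord");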
You can make the internalId an external (configuration) property so that you can change it per environment.
The internalId changes only when you first install the bundle into an environment; on subsequent deployments to that environment, the internalId will not change unless you choose the Add/Rename option during deployment.
I'm trying to rewrite a search from System.DirectoryServices to System.DirectoryServices.Protocols.
In S.DS I get all the requested attributes back, but in S.DS.P I don't get the GUID or the HomePhone.
The rest of it works for one user.
Any ideas?
public static List<AllAdStudentsCV> GetUsersDistinguishedName(string domain, string distinguishedName)
{
    try
    {
        NetworkCredential credentials = new NetworkCredential(ConfigurationManager.AppSettings["AD_User"], ConfigurationManager.AppSettings["AD_Pass"]);
        LdapDirectoryIdentifier directoryIdentifier = new LdapDirectoryIdentifier(domain + ":389");

        using (LdapConnection connection = new LdapConnection(directoryIdentifier, credentials))
        {
            SearchRequest searchRequest = new SearchRequest();
            searchRequest.DistinguishedName = distinguishedName;
            searchRequest.Filter = "(&(objectCategory=person)(objectClass=user)(sn=Afcan))"; //"(&(objectClass=user))";
            searchRequest.Scope = SearchScope.Subtree;
            searchRequest.Attributes.Add("name");
            searchRequest.Attributes.Add("sAMAccountName");
            searchRequest.Attributes.Add("uid");
            searchRequest.Attributes.Add("telexNumber"); // studId
            searchRequest.Attributes.Add("HomePhone");   // ctrId
            searchRequest.SizeLimit = Int32.MaxValue;
            searchRequest.TimeLimit = new TimeSpan(0, 0, 45, 0); // 45 min - EWB

            SearchResponse searchResponse = connection.SendRequest(searchRequest) as SearchResponse;
            if (searchResponse == null) return null;

            List<AllAdStudentsCV> users = new List<AllAdStudentsCV>();
            foreach (SearchResultEntry entry in searchResponse.Entries)
            {
                AllAdStudentsCV user = new AllAdStudentsCV();
                user.Active = "Y";
                user.CenterName = "";
                user.StudId = GetstringAttributeValue(entry.Attributes, "telexNumber");
                user.CtrId = GetstringAttributeValue(entry.Attributes, "HomePhone");
                user.Guid = GetstringAttributeValue(entry.Attributes, "uid");
                user.Username = GetstringAttributeValue(entry.Attributes, "sAMAccountName");
                users.Add(user);
            }
            return users;
        }
    }
    catch (Exception ex)
    {
        throw;
    }
}
Also, if I want to fetch EVERY user in AD so I can sync data with my SQL DB, how do I do that? I kept getting "maximum size exceeded" errors, even though I set the size limit to Int32.MaxValue. Is there an "ignore size" option?
Thanks,
Eric-
I think the standard way is to use System.DirectoryServices, not System.DirectoryServices.Protocols. Why do you want to use the latter?
Concerning your second question about the "max size exceeded" error: it may be because you are trying to fetch too many entries at once.
Active Directory limits the number of objects returned per query (the limit is something like 1000 objects) in order not to overload the directory. The standard way to fetch all the users is to use paged searches.
The algorithm is like this (a sketch in code follows):
Construct the query that will fetch all the users.
Specify a Paged Result Control on this query, indicating a paged search with 500 users per page.
Launch the query, fetch the first page, and parse the first 500 entries in that page.
Ask AD for the next page and parse the next 500 entries.
Repeat until there are no pages left.
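A minimal sketch of that loop with System.DirectoryServices.Protocols, reusing the connection and searchRequest from the question (OfType requires System.Linq):
var pageControl = new PageResultRequestControl(500); // 500 entries per page
searchRequest.Controls.Add(pageControl);

while (true)
{
    var response = (SearchResponse)connection.SendRequest(searchRequest);
    foreach (SearchResultEntry entry in response.Entries)
    {
        // parse the entry here
    }

    var pageResponse = response.Controls.OfType<PageResultResponseControl>().FirstOrDefault();
    if (pageResponse == null || pageResponse.Cookie.Length == 0)
        break; // an empty cookie means there are no more pages

    // Pass the server's cookie back to request the next page.
    pageControl.Cookie = pageResponse.Cookie;
}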