Currently we are using clustered Redis.
In some instances we have to delete hundreds of thousands of cached objects, and I am wondering what the best approach to do so is.
At the moment we are executing a Lua script that uses SCAN and UNLINK:
private const string ClearCacheLuaScript = @"
    local cursor = 0
    local calls = 0
    local dels = 0
    repeat
        local result = redis.call('SCAN', cursor, 'MATCH', ARGV[1])
        calls = calls + 1
        for _, key in ipairs(result[2]) do
            redis.call('UNLINK', key)
            dels = dels + 1
        end
        cursor = tonumber(result[1])
    until cursor == 0";
public async Task FlushAllAsync(string section, string group, string prefix)
{
    var cacheKey = GetCacheKey(section, group, prefix);
    await _cache.ScriptEvaluateAsync(
        ClearCacheLuaScript,
        values: new RedisValue[]
        {
            cacheKey + "*",
        });
}
I am wondering whether the next best alternative would be to iterate over the keys and delete them one by one. Something like:
public async Task FlushAllKeys(IEnumerable<RedisKey> keys)
{
    foreach (var key in keys)
    {
        await _cache.KeyDeleteAsync(key);
    }
}
My main question is: should I be deleting by pattern with the Lua script, or should I grab all of the keys and delete them iteratively without using a Lua script?
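For reference, a fuller sketch of the non-Lua route with StackExchange.Redis might look like this; the _multiplexer field (a ConnectionMultiplexer) and the batch size are assumptions, not part of the code above:
public async Task FlushAllByPatternAsync(string pattern, int batchSize = 1000)
{
    foreach (var endpoint in _multiplexer.GetEndPoints())
    {
        var server = _multiplexer.GetServer(endpoint);
        if (!server.IsConnected || server.IsReplica) continue;

        var pending = new List<Task>(batchSize);
        foreach (var key in server.Keys(pattern: pattern, pageSize: batchSize)) // pages through SCAN under the hood
        {
            // One delete per key keeps this cluster-safe (no cross-slot multi-key command);
            // the deletes are pipelined and awaited in batches.
            pending.Add(_cache.KeyDeleteAsync(key));
            if (pending.Count == batchSize)
            {
                await Task.WhenAll(pending);
                pending.Clear();
            }
        }

        if (pending.Count > 0)
        {
            await Task.WhenAll(pending);
        }
    }
}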
We use MongoDB as our database. In local development we run the official image, but in production we use Cosmos DB with the Mongo API.
We use change streams to watch for changes in the database.
After we process a created or updated record, we need to modify the record again (set flags).
As you may have guessed, this leads to an infinite loop.
I first designed a solution that works locally with the official Mongo image: when I set the flags, I also set a Guid #changeId on the document and build a filter that checks whether the #changeId changed during the update; if it did, processing is not triggered.
I do it like this:
public async Task Watch(CancellationToken token)
{
    logger.LogInformation("Watching database for changes...");
    token.Register(() => this.logger.LogInformation("Stop Polling. Stop requested."));

    var filter = Builders<ChangeStreamDocument<BsonDocument>>
        .Filter.Where(change =>
            change.OperationType == ChangeStreamOperationType.Insert
            || change.OperationType == ChangeStreamOperationType.Update
            || change.OperationType == ChangeStreamOperationType.Replace
        );

    var updateExistsFilter = Builders<ChangeStreamDocument<BsonDocument>>.Filter.Exists("updateDescription.updatedFields");
    var changeIdNotChangedFilter = Builders<ChangeStreamDocument<BsonDocument>>.Filter.Not(
        Builders<ChangeStreamDocument<BsonDocument>>.Filter.Exists("updateDescription.updatedFields.#changeId")
    );
    var relevantUpdateFilter = updateExistsFilter & changeIdNotChangedFilter;

    // if no updateDescription exists, it's an insert, so the data should be processed anyway
    var insertFilter = Builders<ChangeStreamDocument<BsonDocument>>.Filter.Not(
        Builders<ChangeStreamDocument<BsonDocument>>.Filter.Exists("updateDescription.updatedFields")
    );

    filter &= relevantUpdateFilter | insertFilter;

    var definition = new EmptyPipelineDefinition<ChangeStreamDocument<BsonDocument>>()
        .Match(filter)
        .AppendStage<ChangeStreamDocument<BsonDocument>, ChangeStreamDocument<BsonDocument>, BsonDocument>(
            "{ $project: { '_id': 1, 'fullDocument': 1, 'ns': 1, 'documentKey': 1 }}"
        );

    var options = new ChangeStreamOptions
    {
        FullDocument = ChangeStreamFullDocumentOption.UpdateLookup
    };

    while (!token.IsCancellationRequested)
    {
        using (var cursor = await base.collection.WatchAsync(definition, options, token))
        {
            await cursor.ForEachAsync(async (doc) =>
            {
                ProcessData(doc);
            }, token);
        }
        await Task.Delay(TimeSpan.FromSeconds(1), token);
    }
}
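The flag update itself is not shown above; a sketch of what it could look like follows. The "processed" field name and the collection variable are assumptions; only the #changeId idea comes from the description above.
// Sketch only: marks a document as processed and stamps a new #changeId,
// so the change-stream filter above can skip this self-inflicted update.
public async Task SetProcessedFlagsAsync(BsonValue documentId)
{
    var filter = Builders<BsonDocument>.Filter.Eq("_id", documentId);
    var update = Builders<BsonDocument>.Update
        .Set("processed", true)                       // assumed flag field
        .Set("#changeId", Guid.NewGuid().ToString()); // field checked by the change-stream filter

    // "collection" is assumed to be the same IMongoCollection<BsonDocument> used in Watch.
    await collection.UpdateOneAsync(filter, update);
}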
Unfortunately, it wasn't until testing on QA with a Cosmos DB attached that I noticed Cosmos doesn't support updateDescription (see here).
So it goes into an infinite loop.
Is there a way to get the updated fields?
Or is it possible to make sure that the change stream does not react when the update is made through the Mongo driver's update function?
Or is there any other way to avoid this infinite loop?
You can create a CSV load job to load data from a CSV file in Google Cloud Storage by using BigQueryClient in Google.Cloud.BigQuery.V2, which has a CreateLoadJob method.
How can you guarantee idempotency with this API, so that if, say, the network dropped before you got a response and you kicked off a retry, you would not end up with the same data being loaded into BigQuery multiple times?
Example API usage:
private void LoadCsv(string sourceUri, string tableId, string timePartitionField)
{
    var tableReference = new TableReference()
    {
        DatasetId = _dataSetId,
        ProjectId = _projectId,
        TableId = tableId
    };

    var options = new CreateLoadJobOptions
    {
        WriteDisposition = WriteDisposition.WriteAppend,
        CreateDisposition = CreateDisposition.CreateNever,
        SkipLeadingRows = 1,
        SourceFormat = FileFormat.Csv,
        TimePartitioning = new TimePartitioning
        {
            Type = _partitionByDayType,
            Field = timePartitionField
        }
    };

    BigQueryJob loadJob = _bigQueryClient.CreateLoadJob(
        sourceUri: sourceUri,
        destination: tableReference,
        schema: null,
        options: options);

    loadJob.PollUntilCompletedAsync().Wait();

    if (loadJob.Status.Errors == null || !loadJob.Status.Errors.Any())
    {
        // Log success
        return;
    }

    // Log error
}
You can achieve idempotency by generating your own job ID, based on e.g. the file location you loaded and the target table.
job_id = 'my_load_job_{}'.format(hashlib.md5(sourceUri+_projectId+_datasetId+tableId).hexdigest())
var options = new CreateLoadJobOptions
{
    WriteDisposition = WriteDisposition.WriteAppend,
    CreateDisposition = CreateDisposition.CreateNever,
    SkipLeadingRows = 1,
    JobId = job_id, // add this
    SourceFormat = FileFormat.Csv,
    TimePartitioning = new TimePartitioning
    {
        Type = _partitionByDayType,
        Field = timePartitionField
    }
};
In this case, if you try to re-insert the same job_id you get an error.
You can also easily regenerate this job_id to check on the job in case polling failed.
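Since the job_id line above is Python-flavoured, a C# version of the same idea might look roughly like this; the MD5-based naming mirrors that line, and the duplicate handling via GoogleApiException is a sketch, not part of the original answer:
// Inside LoadCsv, after building tableReference:
// a deterministic job ID derived from the inputs, so a retry reuses the same ID.
string jobId;
using (var md5 = System.Security.Cryptography.MD5.Create())
{
    var bytes = System.Text.Encoding.UTF8.GetBytes(sourceUri + _projectId + _dataSetId + tableId);
    jobId = "my_load_job_" + BitConverter.ToString(md5.ComputeHash(bytes)).Replace("-", "").ToLowerInvariant();
}

try
{
    var loadJob = _bigQueryClient.CreateLoadJob(
        sourceUri, tableReference, schema: null,
        options: new CreateLoadJobOptions { JobId = jobId /*, other options as above */ });
    loadJob.PollUntilCompleted();
}
catch (Google.GoogleApiException e) when (e.HttpStatusCode == System.Net.HttpStatusCode.Conflict)
{
    // A job with this ID already exists: the first attempt got through, so look it up and poll it instead.
    _bigQueryClient.GetJob(jobId).PollUntilCompleted();
}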
There are two places you could end up losing the response:
When creating the job to start with
When polling for completion
The first one is relatively tricky to recover from without a job ID; you could list all the jobs in the project and try to find one that looks like the one you'd otherwise create.
However, the C# client library generates a job ID so that it can retry, or you can specify your own job ID via CreateLoadJobOptions.
The second failure time is much simpler: keep the returned BigQueryJob so you can retry the polling if that fails. (You could store the job name so that you can recover even if your process dies while waiting for it to complete, for example.)
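For example, a sketch of that second case, reusing the names from the question:
// Keep the job (or at least its ID) from the first call so polling can be retried.
BigQueryJob loadJob = _bigQueryClient.CreateLoadJob(
    sourceUri: sourceUri, destination: tableReference, schema: null, options: options);
string jobId = loadJob.Reference.JobId; // persist this if the process might die while waiting

// ... later, after a failed poll or even a process restart ...
BigQueryJob recovered = _bigQueryClient.GetJob(jobId);
recovered.PollUntilCompleted();

if (recovered.Status.Errors == null || !recovered.Status.Errors.Any())
{
    // Log success
}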
I have an SSIS package with a transformation script component. It loads about 460 rows, then it stops and runs the script component again (I don't know why it does this). Of course it recreates my C# class variables, "forgets" where it was the first time it ran, and pops out nulls for the variables.
Is there any way to make the script component not run itself again after 460 rows? The batch I am pulling is 10,000, so it can't be that.
And the weirdest thing of all is that after running the package 3 times (without changing anything), it does everything right...
public class ScriptMain : UserComponent
{
    string MarkToRem;
    string TypeToRem;
    string SerToRem;
    int IDCnt;

    public override void PreExecute()
    {
        base.PreExecute();
    }

    public override void PostExecute()
    {
        base.PostExecute();
    }

    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        MyOutputBuffer.AddRow();

        if (Row.IncomingPrice == "Mark")
        {
            MarkToRem = Row.IncomingCode; // Setting var to remember the mark we are in
            MyOutputBuffer.ID = Row.IncomingID.ToString();
            MyOutputBuffer.Mark = MarkToRem;
            MyOutputBuffer.Type = "";
            MyOutputBuffer.Series = "";
            MyOutputBuffer.Code = "";
            MyOutputBuffer.Price = "";
            MyOutputBuffer.Description = "Mark Verander";
        }
        else if (Row.IncomingPrice == "Sub")
        {
            TypeToRem = Row.IncomingCode; // Save our current Type
            SerToRem = Row.IncomingCode;  // Save our current Series
            // ============ Output ========================
            MyOutputBuffer.ID = Row.IncomingID.ToString();
            MyOutputBuffer.Mark = MarkToRem;
            MyOutputBuffer.Type = "";
            MyOutputBuffer.Series = "";
            MyOutputBuffer.Code = "";
            MyOutputBuffer.Price = "";
            MyOutputBuffer.Description = "Sub en series verander";
        }
        else if (Row.IncomingPrice == "Series")
        {
            SerToRem = Row.IncomingCode; // Save our current Series
            // ============ Output ========================
            MyOutputBuffer.ID = Row.IncomingID.ToString();
            MyOutputBuffer.Mark = MarkToRem;
            MyOutputBuffer.Type = "";
            MyOutputBuffer.Series = SerToRem;
            MyOutputBuffer.Code = "";
            MyOutputBuffer.Price = "";
            MyOutputBuffer.Description = "Series verander";
        }
        else
        {
            MyOutputBuffer.ID = Row.IncomingID.ToString();
            MyOutputBuffer.Mark = MarkToRem;
            MyOutputBuffer.Type = TypeToRem;
            MyOutputBuffer.Series = SerToRem;
            MyOutputBuffer.Code = Row.IncomingCode;
            MyOutputBuffer.Price = Row.IncomingPrice;
            MyOutputBuffer.Description = Row.IncomingDiscription;
        }

        IDCnt = IDCnt + 1;
    }
}
The first 9 rows of the incoming data look like this:
ID Code Price Discription
1 184pin DDR Mark
2 DDR - Non-ECC Sub
3 ME-A1GDV4 388 Adata AD1U400A1G3-R 1Gb ddr-400 ( pc3200 ) , CL3 - 184pin - lifetime warranty
4 ME-C512DV4 199 Corsair Valueselect VS512MB400 512mb ddr-400 ( pc3200 ) , CL2.5 - 184pin - lifetime warranty
5 ME-C1GDV4 399 Corsair Valueselect VS1GB400C3 1Gb ddr-400 ( pc3200 ) , CL3 - 184pin - lifetime warranty
6 240pin DDR2 Mark
7 DDR2 - Non-ECC Sub
8 Adata - lifetime warranty Series
9 ME-A2VD26C5 345 Adata AD2U667B2G5 Valuselect , 2Gb ddr2-667 ( pc2-5400 ) , CL5 , 1.8v - 240pin - lifetime warranty
Solved it.
Avoid Asynchronous Transformation wherever possible
The SSIS runtime executes every task other than the data flow task in the defined sequence. Whenever the SSIS runtime engine encounters a data flow task, it hands execution of that task over to the data flow pipeline engine.
The data flow pipeline engine breaks the execution of a data flow task into one or more execution trees and may execute two or more execution trees in parallel to achieve high performance.
Synchronous transformations get a record, process it and pass it on to the next transformation or destination in the sequence. The processing of a record does not depend on the other incoming rows.
An asynchronous transformation, by contrast, requires additional buffers for its output and does not reuse the incoming input buffers. It also waits for all incoming rows to arrive before processing; that is why asynchronous transformations perform slower and should be avoided wherever possible. For example, instead of using a Sort transformation you can get sorted results from the source itself by using an ORDER BY clause.
Currently, I'm sending some data to Parse.com. All works well; however, I would like to add a row if it's a new user or update the existing row if it's a returning user.
So what I need to do is check whether the current Facebook ID (the key I'm using) shows up anywhere in the fbid column, and then update it if need be.
How can I check if the key exists in the column?
Also, I'm using C#/Unity.
static void sendToParse()
{
    ParseObject currentUser = new ParseObject("Game");
    currentUser["name"] = fbname;
    currentUser["email"] = fbemail;
    currentUser["fbid"] = FB.UserId;
    Task saveTask = currentUser.SaveAsync();
    Debug.LogError("Sent to Parse");
}
Okay, I figured it out.
First, I check whether there is any Facebook ID in the table that matches the current ID, then get the number of matches.
public static void getObjectID()
{
    var query = ParseObject.GetQuery("IdealStunts")
        .WhereEqualTo("fbid", FB.UserId);

    query.FirstAsync().ContinueWith(t =>
    {
        ParseObject obj = t.Result;
        objectID = obj.ObjectId;
        Debug.LogError(objectID);
    });
}
If there is a key matching the current Facebook ID, don't do anything. If there isn't, just add a new user.
public static void sendToParse()
{
    if (count != 0)
    {
        Debug.LogError("Already exists");
    }
    else
    {
        ParseObject currentUser = new ParseObject("IdealStunts");
        currentUser["name"] = fbname;
        currentUser["email"] = fbemail;
        currentUser["fbid"] = FB.UserId;
        Task saveTask = currentUser.SaveAsync();
        Debug.LogError("New User");
    }
}
You will have to do a StartCoroutine for sendToParse, so getObjectID has time to look through the table.
It may be a crappy implementation, but it works.
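Where count gets set is not shown above; one way it could be populated, assuming the Parse .NET SDK's CountAsync and a static count field, is sketched here:
public static int count;

public static void getMatchCount()
{
    // Hypothetical helper: counts how many "IdealStunts" objects already use this Facebook ID.
    var query = ParseObject.GetQuery("IdealStunts")
        .WhereEqualTo("fbid", FB.UserId);

    query.CountAsync().ContinueWith(t =>
    {
        count = t.Result;
        Debug.LogError("Matches found: " + count);
    });
}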
What you need to do is create a query for the fbid. If the query returns an object, you update it. If not, you create a new one.
I'm not proficient with C#, but here is an example in Objective-C:
PFQuery *query = [PFQuery queryWithClassName:@"Yourclass"]; // Name of your class in Parse
query.cachePolicy = kPFCachePolicyNetworkOnly;
[query whereKey:@"fbid" equalTo:theFBid]; // Variable containing the fb id

NSArray *users = [query findObjects];
self.currentFacebookUser = [users lastObject]; // Array should contain only 1 object

if (self.currentFacebookUser) { // Might have to test for NULL, but probably not
    // Update the object and save it
} else {
    // Create a new object
}
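Since the question is C#/Unity, the same check-then-update-or-create flow could be sketched like this, assuming the Parse .NET SDK's FirstOrDefaultAsync and the names from the question:
public static async Task SendToParseAsync()
{
    // Look for an existing object with this Facebook ID.
    var existing = await ParseObject.GetQuery("Game")
        .WhereEqualTo("fbid", FB.UserId)
        .FirstOrDefaultAsync(); // null when there is no match

    // Update the match if there is one, otherwise create a new object.
    var user = existing ?? new ParseObject("Game");
    user["name"] = fbname;
    user["email"] = fbemail;
    user["fbid"] = FB.UserId;

    await user.SaveAsync();
    Debug.LogError(existing == null ? "New User" : "Updated existing user");
}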
I'm building a console application that has to process a bunch of documents.
To keep it simple, the process is:
for each year between X and Y, query the DB to get a list of document references to process
for each of these references, process a local file
The process method is, I think, independent and should be parallelizable as soon as the input args are different:
private static bool ProcessDocument(
    DocumentsDataset.DocumentsRow d,
    string langCode
)
{
    try
    {
        var htmFileName = d.UniqueDocRef.Trim() + langCode + ".htm";
        var htmFullPath = Path.Combine(@"x:\path", htmFileName);
        var missingHtmlFile = !File.Exists(htmFullPath);

        if (!missingHtmlFile)
        {
            var html = File.ReadAllText(htmFullPath);
            // ProcessHtml is quite long: it uses a regex search for a list of references
            // which are other documents, then sends the result to a custom WS
            ProcessHtml(ref html);
            File.WriteAllText(htmFullPath, html);
        }

        return true;
    }
    catch (Exception exc)
    {
        Trace.TraceError("{0,8}Fail processing {1} : {2}", "[FATAL]", d.UniqueDocRef, exc.ToString());
        return false;
    }
}
In order to enumerate my documents, I have this method:
private static IEnumerable<DocumentsDataset.DocumentsRow> EnumerateDocuments()
{
    return Enumerable.Range(1990, 2020 - 1990).AsParallel().SelectMany(year =>
    {
        return Document.FindAll((short)year).Documents;
    });
}
Document is a business class that wraps the retrieval of documents. The output of this method is a typed dataset (I'm returning the Documents table). The method takes a year, and I'm sure a document can't be returned by more than one year (the year is part of the key, actually).
Note the use of AsParallel() here; I have never had an issue with this one.
Now, my main method is :
var documents = EnumerateDocuments();

var result = documents.Select(d =>
{
    bool success = true;
    foreach (var langCode in new string[] { "-e", "-f" })
    {
        success &= ProcessDocument(d, langCode);
    }
    return new
    {
        d.UniqueDocRef,
        success
    };
});
using (var sw = File.CreateText("summary.csv"))
{
    sw.WriteLine("Level;UniqueDocRef");
    foreach (var item in result)
    {
        string level;
        if (!item.success) level = "[ERROR]";
        else level = "[OK]";

        sw.WriteLine(
            "{0};{1}",
            level,
            item.UniqueDocRef
        );
        //sw.WriteLine(item);
    }
}
This method works as expected in this form. However, if I replace
var documents = EnumerateDocuments();
with
var documents = EnumerateDocuments().AsParallel();
it stops working, and I don't understand why.
The error appears exactly here (in my process method):
File.WriteAllText(htmFullPath, html);
It tells me that the file is already open in another program.
I don't understand what can cause my program not to work as expected. Since my documents variable is an IEnumerable returning unique values, why is my process method breaking?
Thanks for any advice.
[Edit] Code for retrieving documents:
/// <summary>
/// Get all documents in data store
/// </summary>
public static DocumentsDS FindAll(short? year)
{
    Database db = DatabaseFactory.CreateDatabase(connStringName); // MS Entlib
    DbCommand cm = db.GetStoredProcCommand("Document_Select");
    if (year.HasValue) db.AddInParameter(cm, "Year", DbType.Int16, year.Value);

    string[] tableNames = { "Documents", "Years" };

    DocumentsDS ds = new DocumentsDS();
    db.LoadDataSet(cm, ds, tableNames);

    return ds;
}
[Edit2] Possible source of my issue, thanks to mquander. If I write:
var test = EnumerateDocuments().AsParallel().Select(d => d.UniqueDocRef);
var testGr = test.GroupBy(d => d).Select(d => new { d.Key, Count = d.Count() }).Where(c => c.Count > 1);
var testLst = testGr.ToList();

Console.WriteLine(testLst.Where(x => x.Count == 1).Count());
Console.WriteLine(testLst.Where(x => x.Count > 1).Count());
I get this result :
0
1758
Removing the AsParallel returns the same output.
Conclusion: my EnumerateDocuments has something wrong and returns each document twice.
I'll have to dive in here, I think; my source enumeration is probably the cause.
I suggest you have each task put the file data into a global queue and have a dedicated thread take writing requests from the queue and do the actual writing.
Anyway, the performance of writing in parallel to a single disk is much worse than writing sequentially, because the disk needs to spin to seek the next writing location, so you are just bouncing the disk around between seeks. It's better to do the writes sequentially.
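A minimal sketch of that queue idea, assuming a BlockingCollection (System.Collections.Concurrent) as the global queue and a single writer task; the names are illustrative, not from the original code:
// Global queue of pending writes, filled by the parallel workers.
var writeQueue = new BlockingCollection<(string Path, string Html)>();

// Single consumer: drains the queue and performs all disk writes sequentially.
var writer = Task.Run(() =>
{
    foreach (var item in writeQueue.GetConsumingEnumerable())
    {
        File.WriteAllText(item.Path, item.Html);
    }
});

// In ProcessDocument, instead of calling File.WriteAllText directly:
// writeQueue.Add((htmFullPath, html));

// Once every document has been processed:
writeQueue.CompleteAdding();
writer.Wait();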
Is Document.FindAll((short)year).Documents threadsafe? Because the difference between the first and the second version is that in the second (broken) version, this call is running multiple times concurrently. That could plausibly be the cause of the issue.
It sounds like you're trying to write to the same file. Only one thread/program can write to a file at a given time, so you can't use Parallel for that.
If you're reading from the same file, then you need to open the file with only read permissions so as not to put a write lock on it.
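For example, a shared read that takes no write lock could be opened like this (a sketch, not from the original code):
// Open for reading only, and allow other readers/writers to access the file at the same time.
using (var stream = new FileStream(htmFullPath, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
using (var reader = new StreamReader(stream))
{
    var html = reader.ReadToEnd();
    // ... process html ...
}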
The simplest way to fix the issue is to place a lock around your File.WriteAllText, assuming the writing is fast and it's worth parallelizing the rest of the code.
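A sketch of that lock, assuming a static lock object shared by all worker threads:
private static readonly object WriteLock = new object();

// In ProcessDocument, let only one thread at a time hit the disk for this write:
lock (WriteLock)
{
    File.WriteAllText(htmFullPath, html);
}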