C# Optimized Merge Algorithm

I have a data pull service through which my C# application pulls data. Data is pulled in using multiple jobs and once the data request is complete, the data pull service calls the notify method which I have implemented in my application class.
The following is the notify method code. It just checks that results is non-empty and then calls mergeResults on a new thread.
public override void notify(List<IFields> results)
{
if (!results.IsNullOrEmpty())
{
Task.Run(() => { mergeResults(results); });
}
}
I am using a List to store final merge results.
List<IFields> mergedResults;
I am using object mergeLock for mutual exclusion.
Here's the merge logic I am using:
public void mergeResults(List<IFieldsByPrePost> results)
{
lock (mergeLock)
{
foreach (var result in results)
{
if (mergedResults.Count > 0)
{
var properties = mergedResults.First().getDiffProperties();
bool isMatch = false;
foreach (var mergedResult in mergedResults)
{
isMatch = true;
foreach (var property in properties)
{
var value1 = mergedResult.GetType().GetProperty(property).GetValue(mergedResult).ToString();
var value2 = result.GetType().GetProperty(property).GetValue(result).ToString();
if (value1 != value2) { isMatch = false; break; }
}
if (isMatch)
{
mergedResult.Count += result.Count;
break;
}
}
if (!isMatch)
{
mergedResults.Add(result);
}
}
else
{
mergedResults.Add(result);
}
}
}
}
The above logic works, but it is very slow whenever a large set of results is passed to the method.
Also, the notify method is called multiple times by the data pull service with different result sets, further slowing it down.
I am looking for a better approach to solve this problem.
TLDR; This algorithm is slow, can anyone show me a way to make it run faster?

I'd suggest that IFields and/or IFieldsByPrePost derive from
IEquatable<IFields> and/or IEquatable<IFieldsByPrePost>.
Then you can test equality directly with
IFields fields1;
IFieldsByPrePost fields2;
bool equal = fields1.Equals(fields2);
This way you avoid reflection, which is what is slowing your code down.
Then it's just
foreach (var result in results)
{
if (!mergedResults.Any(x => x.Equals(result)))
{
mergedResults.Add(result);
}
}
I don't know what you are doing with mergedResult.Count, so I am omitting it.

The thing that sticks out to me first is that the mergeResults method isn't generic, so I'm not sure why reflection is necessary. Removing the lines:
var value1 = mergedResult.GetType().GetProperty(property).GetValue(mergedResult).ToString();
var value2 = result.GetType().GetProperty(property).GetValue(result).ToString();
if (value1 != value2) { isMatch = false; break; }
and using the direct property:
if (mergedResult.Property1 != result.Property1) { isMatch = false; break; }
could help.
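Even without reflection, the nested loop still compares every incoming result against every merged result. A minimal sketch of a dictionary-keyed merge follows, assuming the properties that decide a match can be combined into a single key. Field, Region, Product and ResultMerger are hypothetical stand-ins, not the types from the question.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the question's result type.
public sealed class Field
{
    public string Region { get; set; }
    public string Product { get; set; }
    public int Count { get; set; }
}

public sealed class ResultMerger
{
    private readonly object _mergeLock = new object();

    // Keyed by the properties that decide whether two results match.
    private readonly Dictionary<(string, string), Field> _merged =
        new Dictionary<(string, string), Field>();

    public void Merge(IEnumerable<Field> results)
    {
        lock (_mergeLock)
        {
            foreach (var result in results)
            {
                var key = (result.Region, result.Product);
                if (_merged.TryGetValue(key, out var existing))
                    existing.Count += result.Count; // match: accumulate
                else
                    _merged.Add(key, result);       // first time this key is seen
            }
        }
    }

    // Copies the merged values so callers never enumerate while a merge runs.
    public List<Field> Snapshot()
    {
        lock (_mergeLock)
        {
            return new List<Field>(_merged.Values);
        }
    }
}
```

Each incoming result then costs one hash lookup instead of a scan over everything merged so far, so the time spent holding the lock grows with the size of the batch rather than with the size of the merged list.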

Related

C# best practice: "break" from within Action<> callback called in a loop?

I had to write my own foreach method for various reasons. This resembles an IEnumerable foreach statement:
public void ForEachEdge(in Vertex vertex, Action<Edge> callback)
{
var edge = GetEdge(vertex.BaseEdgeIndex);
do
{
callback.Invoke(edge);
edge = GetEdge(edge.GetNext(vertex.Index));
} while (edge.Index != vertex.BaseEdgeIndex);
}
I'm using it like so but I wish to be able to "break" out of the entire loop:
ForEachEdge(edge.Vertex0Index, (e) =>
{
if (inEdge.AreConnectingSameVertices(e))
{
// break out of inner while loop here ...
}
});
What would be best practice to break?
Return a status value?
Pass a "ref bool stopEnumerating" parameter in? (requires class instance to wrap it in, right?)
Your thoughts ...
I'm mostly concerned about what end users (developers) would expect in such a case.
The ref parameter method won't be as clean as a return value indicating continuation status. You would have to switch to a Func<Edge, bool>:
Func<Edge, bool> callback;
...
if (callback.Invoke(edge)) {
// do your break logic
}
I decided that (for now) I'll go with a Predicate<> rather than Action<>:
public void ForEachEdge(in Vertex vertex, Predicate<Edge> callback)
{
var edge = GetEdge(vertex.BaseEdgeIndex);
do
{
if (callback.Invoke(edge))
break;
edge = GetEdge(edge.GetNextRadialEdgeIndex(vertex.Index));
} while (edge.IsValid && edge.Index != vertex.BaseEdgeIndex);
}
Which makes the user's code look like this:
ForEachEdge(edge.Vertex0Index, e =>
{
if (inEdge.AreConnectingSameVertices(e))
{
// found it, do something, then exit loop
return true;
}
// continue with next item
return false;
});
The nicest thing about this solution: both Predicate<> and Action<> variants can exist side-by-side! User either returns true/false from the predicate, or does not return anything and thus uses the Action<> version, like so:
ForEachEdge(edge.Vertex0Index, e =>
{
if (inEdge.AreConnectingSameVertices(e))
{
// do stuff
}
});
Purrfect! :)
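To illustrate how the two overloads coexist, here is a small self-contained sketch that uses a plain list of ints instead of the mesh types. Ring and ForEachItem are made-up names for the example only.

```csharp
using System;
using System.Collections.Generic;

public static class Ring
{
    // Visits every item; the callback has no way to stop the loop.
    public static void ForEachItem(IReadOnlyList<int> items, Action<int> callback)
    {
        foreach (var item in items)
            callback(item);
    }

    // Stops as soon as the callback returns true.
    public static void ForEachItem(IReadOnlyList<int> items, Predicate<int> callback)
    {
        foreach (var item in items)
        {
            if (callback(item))
                break;
        }
    }
}
```

A block-bodied lambda that returns a bool can only convert to Predicate<int>, and one that returns nothing can only convert to Action<int>, so the compiler picks the overload from the shape of the lambda body.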

How to bulk insert/update lots of data in the correct way?

Following thing boggles my mind:
I have to bulk insert a lot of changes, some are inserts some are updates. I am not sure how to do it the best way.
Logic looks something like this:
public class Worker
{
public void Run(){
var mailer = new Mailer();
HashSet<DbElement> dbElementsLookUp = new HashSet<DbElement>(dbContext.DbElements);
List<ElementDto> elements = GetSomeChangesFromSomewhere();
var dbElementsToSave = new List<DbElement>();
foreach(var element in elements)
{
CreateOrUpdateDbElement(element, dbElementsLookUp, dbElementsToSave);
// Sends some data based on the element - due to legacy implementation it uses its own context
mailer.SendSomeLogging(element);
}
try
{
dbContext.ChangeTracker.DetectChanges();
dbContext.Set<DbElement>().AddRange(dbElementsToSave);
dbContext.SaveChanges();
}
catch (Exception e)
{
LogErrors(e);
}
}
private void CreateOrUpdateDbElement(ElementDto element, HashSet<DbElement> lookUp, List<DbElement> dbElementsToSave)
{
var entity = lookUp.FirstOrDefault(e => e.Id == element.Id);
if(entity is not null)
{
entity.SomeProperty = element.SomeProperty;
dbContext.Configuration.AutoDetectChangesEnabled = false;
dbContext.Entry(entity).State = EntityState.Modified;
dbContext.Configuration.AutoDetectChangesEnabled = true;
}
else
{
dbElementsToSave.Add(new DbElement
{
SomeProperty = element.SomeProperty,
CreationDate = DateTime.Now
});
}
}
}
I'm not sure what the best way to do this is, especially for DetectChanges. Is it safe to disable AutoDetectChangesEnabled and call DetectChanges once, outside of the foreach? I am working with a lot of data, and due to the legacy implementation it is pretty slow, because for each mail there is a write operation on the database. That write actually runs on another instance of the context, so it does not interfere with saving the DbElement objects.
Is it better to add the entities to update to another list and handle them the same way as the added entities?

Queue to ConcurrentQueue

I have a regular Queue object in C# (4.0) and I'm using BackgroundWorkers that access this Queue.
The code I was using is as follows:
do
{
while (dataQueue.Peek() == null // nothing waiting yet
&& isBeingLoaded == true // and worker 1 still actively adding stuff
)
System.Threading.Thread.Sleep(100);
// otherwise ready to do something:
if (dataQueue.Peek() != null) // because maybe the queue is complete and also empty
{
string companyId = dataQueue.Dequeue();
processLists(companyId);
// use up the stuff here //
} // otherwise nothing was there yet, it will resolve on the next loop.
} while (isBeingLoaded == true // still have stuff coming at us
|| dataQueue.Peek() != null); // still have stuff we haven’t done
However, I guess when dealing with threads I should be using a ConcurrentQueue.
I was wondering if there were examples of how to use a ConcurrentQueue in a Do While Loop like above?
Everything I tried with the TryPeek wasn't working..
Any ideas?
You can use a BlockingCollection<T> as a producer-consumer queue.
My answer makes some assumptions about your architecture, but you can probably mold it as you see fit:
public void Producer(BlockingCollection<string> ids)
{
// assuming this.CompanyRepository exists
foreach (var id in this.CompanyRepository.GetIds())
{
ids.Add(id);
}
ids.CompleteAdding(); // nothing left for our workers
}
public void Consumer(BlockingCollection<string> ids)
{
while (true)
{
string id = null;
try
{
id = ids.Take();
} catch (InvalidOperationException) {
// Take() throws once the collection is marked complete and is empty; id stays null
}
if (id == null) break;
processLists(id);
}
}
You could spin up as many consumers as you need:
var companyIds = new BlockingCollection<string>();
Producer(companyIds);
Action process = () => Consumer(companyIds);
// 2 workers
Parallel.Invoke(process, process);
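As a variation on the consumer above, BlockingCollection<T> also offers GetConsumingEnumerable(), which blocks while the collection is empty and ends the loop once CompleteAdding() has been called and the collection is drained. The sketch below is self-contained and makes a few assumptions: Pipeline and Run are made-up names, and adding to a ConcurrentBag stands in for the processLists(id) call.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class Pipeline
{
    public static List<string> Run(IEnumerable<string> companyIds, int workerCount)
    {
        var queue = new BlockingCollection<string>();
        var processed = new ConcurrentBag<string>();

        // Start the consumers first so they work while the producer is still adding.
        var consumers = new Task[workerCount];
        for (int i = 0; i < workerCount; i++)
        {
            consumers[i] = Task.Factory.StartNew(() =>
            {
                // Blocks while the queue is empty; exits after CompleteAdding() and drain.
                foreach (var id in queue.GetConsumingEnumerable())
                    processed.Add(id); // stand-in for processLists(id)
            });
        }

        foreach (var id in companyIds)
            queue.Add(id);
        queue.CompleteAdding(); // nothing left for the workers

        Task.WaitAll(consumers);
        return new List<string>(processed);
    }
}
```

This removes the need for the Sleep polling loop and for the isBeingLoaded flag, since completion is signalled through the collection itself.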

Doing locking in ASP.NET correctly

I have an ASP.NET site with a fairly slow search function, and I want to improve performance by adding the results to the cache for one hour using the query as the cache-key:
using System;
using System.Web;
using System.Web.Caching;
public class Search
{
private static object _cacheLock = new object();
public static string DoSearch(string query)
{
string results = "";
if (HttpContext.Current.Cache[query] == null)
{
lock (_cacheLock)
{
if (HttpContext.Current.Cache[query] == null)
{
results = GetResultsFromSlowDb(query);
HttpContext.Current.Cache.Add(query, results, null, DateTime.Now.AddHours(1), Cache.NoSlidingExpiration, CacheItemPriority.Normal, null);
}
else
{
results = HttpContext.Current.Cache[query].ToString();
}
}
}
else
{
results = HttpContext.Current.Cache[query].ToString();
}
return results;
}
private static string GetResultsFromSlowDb(string query)
{
return "Hello World!";
}
}
Let’s say visitor A does a search. The cache is empty, the lock is set and the result is requested from the database. Now visitor B comes along with a different search: Won’t visitor B have to wait by the lock until visitor A’s search has completed? What I really wanted was for B to call the database immediately, because the results will be different and the database can handle multiple requests – I just don’t want to repeat expensive unnecessary queries.
What would be the correct approach for this scenario?
Unless you're absolutely certain that it's critical to have no redundant queries then I would avoid locking altogether. The ASP.NET cache is inherently thread-safe, so the only drawback to the following code is that you might temporarily see a few redundant queries racing each other when their associated cache entry expires:
public static string DoSearch(string query)
{
var results = (string)HttpContext.Current.Cache[query];
if (results == null)
{
results = GetResultsFromSlowDb(query);
HttpContext.Current.Cache.Insert(query, results, null,
DateTime.Now.AddHours(1), Cache.NoSlidingExpiration);
}
return results;
}
If you decide that you really must avoid all redundant queries then you could use a set of more granular locks, one lock per query:
public static string DoSearch(string query)
{
var results = (string)HttpContext.Current.Cache[query];
if (results == null)
{
object miniLock = _miniLocks.GetOrAdd(query, k => new object());
lock (miniLock)
{
results = (string)HttpContext.Current.Cache[query];
if (results == null)
{
results = GetResultsFromSlowDb(query);
HttpContext.Current.Cache.Insert(query, results, null,
DateTime.Now.AddHours(1), Cache.NoSlidingExpiration);
}
object temp;
if (_miniLocks.TryGetValue(query, out temp) && (temp == miniLock))
_miniLocks.TryRemove(query, out temp);
}
}
return results;
}
private static readonly ConcurrentDictionary<string, object> _miniLocks =
new ConcurrentDictionary<string, object>();
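A different way to get per-query granularity, sketched here as an alternative and not as part of the code above, is to store a Lazy<string> per query in a ConcurrentDictionary. Different queries then load concurrently, while the loader for any one query runs only once. SearchCache and its loader delegate are hypothetical names, and the sketch deliberately leaves out the one-hour expiry that the ASP.NET cache provides.

```csharp
using System;
using System.Collections.Concurrent;

public sealed class SearchCache
{
    private readonly ConcurrentDictionary<string, Lazy<string>> _entries =
        new ConcurrentDictionary<string, Lazy<string>>();

    private readonly Func<string, string> _loader;

    public SearchCache(Func<string, string> loader)
    {
        _loader = loader;
    }

    public string Get(string query)
    {
        // GetOrAdd may build more than one Lazy under contention, but only one is
        // stored, and only the stored one ever has its Value evaluated.
        var lazy = _entries.GetOrAdd(query, q => new Lazy<string>(() => _loader(q)));
        return lazy.Value; // the first caller runs the loader; others wait for its result
    }
}
```

One caveat with this approach: with the default Lazy<T> thread-safety mode, an exception thrown by the loader is cached and rethrown on later calls, so a failed query would need its entry removed before it can be retried.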
Your code has a potential race condition:
if (HttpContext.Current.Cache[query] == null)
{
...
}
else
{
// When you get here, another thread may have removed the item from the cache
// so this may still return null.
results = HttpContext.Current.Cache[query].ToString();
}
In general I wouldn't use locking, and would do it as follows to avoid the race condition:
results = HttpContext.Current.Cache[query];
if (results == null)
{
results = GetResultsFromSomewhere();
HttpContext.Current.Cache.Add(query, results,...);
}
return results;
In the above case, multiple threads might attempt to load data if they detect a cache miss at about the same time. In practice this is likely to be rare, and in most cases unimportant, because the data they load will be equivalent.
But if you want to use a lock to prevent it you can do so as follows:
results = HttpContext.Current.Cache[query];
if (results == null)
{
lock(someLock)
{
results = HttpContext.Current.Cache[query];
if (results == null)
{
results = GetResultsFromSomewhere();
HttpContext.Current.Cache.Add(query, results,...);
}
}
}
return results;
Your code is correct. You are also using a double-checked lock (an if on either side of the lock), which prevents a race condition that is a common pitfall when it is left out. It does not lock access to items that are already in the cache.
The only problem is that when many clients insert into the cache at the same time, they queue behind the lock. What I would do is move results = GetResultsFromSlowDb(query); outside the lock:
public static string DoSearch(string query)
{
string results = "";
if (HttpContext.Current.Cache[query] == null)
{
results = GetResultsFromSlowDb(query); // HERE
lock (_cacheLock)
{
if (HttpContext.Current.Cache[query] == null)
{
HttpContext.Current.Cache.Add(query, results, null, DateTime.Now.AddHours(1), Cache.NoSlidingExpiration, CacheItemPriority.Normal, null);
}
else
{
results = HttpContext.Current.Cache[query].ToString();
}
}
}
else
{
results = HttpContext.Current.Cache[query].ToString();
}
return results;
}
If this is slow, your problem is elsewhere.

Returning the first method that works, more elegant way?

Recently I've found myself writing methods which call other methods in succession and setting some value based on whichever method returns an appropriate value first. What I've been doing is setting the value with one method, then checking the value and if it's not good then I check the next one. Here's a recent example:
private void InitContent()
{
if (!String.IsNullOrEmpty(Request.QueryString["id"]))
{
Content = GetContent(Convert.ToInt64(Request.QueryString["id"]));
ContentMode = ContentFrom.Query;
}
if (Content == null && DefaultId != null)
{
Content = GetContent(DefaultId);
ContentMode = ContentFrom.Default;
}
if (Content == null) ContentMode = ContentFrom.None;
}
Here the GetContent method should be returning null if the id isn't in the database. This is a short example, but you can imagine how this might get clunky if there were more options. Is there a better way to do this?
The null coalescing operator might have the semantics you want.
q = W() ?? X() ?? Y() ?? Z();
That's essentially the same as:
if ((temp = W()) == null && (temp = X()) == null && (temp = Y()) == null)
temp = Z();
q = temp;
That is, q is the first non-null of W(), X(), Y(), or if all of them are null, then Z().
You can chain as many as you like.
The exact semantics are not quite like I sketched out; the type conversion rules are tricky. See the spec if you need the exact details.
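A small runnable sketch of the short-circuit behaviour, using made-up names (CoalesceDemo, Source, Calls): each source records that it was called, so you can see that operands to the right of the first non-null value are never evaluated.

```csharp
using System;
using System.Collections.Generic;

public static class CoalesceDemo
{
    // Records which sources were actually evaluated, in order.
    public static readonly List<string> Calls = new List<string>();

    private static string Source(string name, string value)
    {
        Calls.Add(name);
        return value;
    }

    public static string FirstAvailable()
    {
        // W yields null, so X is tried; X is non-null, so Y and Z never run.
        return Source("W", null)
            ?? Source("X", "from X")
            ?? Source("Y", "from Y")
            ?? Source("Z", "from Z");
    }
}
```

Calling CoalesceDemo.FirstAvailable() returns "from X" and leaves only W and X in Calls.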
You could also do something a little more sneaky, along the lines of this:
private Int64? GetContentIdOrNull(string id)
{
return string.IsNullOrEmpty(id) ? null : (Int64?)Convert.ToInt64(id);
}
private Int64? GetContentIdOrNull(DefaultIdType id)
{
return id;
}
private void InitContent()
{
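// Attempt to get content from multiple sources in order of preference.
// Note: Dictionary does not guarantee enumeration order; use a List of pairs if the order matters.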
var contentSources = new Dictionary<ContentFrom, Func<Int64?>> {
{ ContentFrom.Query, () => GetContentIdOrNull(Request.QueryString["id"]) },
{ ContentFrom.Default, () => GetContentIdOrNull(DefaultId) }
};
foreach (var source in contentSources) {
var id = source.Value();
if (!id.HasValue) {
continue;
}
Content = GetContent(id.Value);
ContentMode = source.Key;
if (Content != null) {
return;
}
}
// Default
ContentMode = ContentFrom.None;
}
That would help if you had many more sources, at the cost of increased complexity.
Personally, I find when I have lots of statements that are seemingly disparate, it's time to make some functions.
private ContentMode GetContentMode(){
}
private Content GetContent(int id){
}
private Content GetContent(HttpRequest request){
return GetContent(Convert.ToInt64(request.QueryString["id"]));
}
private void InitContent(){
ContentMode mode = GetContentMode();
Content = null;
switch(mode){
case ContentMode.Query:
Content = GetContent(Request);
break;
case ContentMode.Default:
Content = GetContent(DefaultId);
break;
case ContentMode.None:
... handle none case...
break;
}
}
This way, you separate your intentions - first step, determine the content mode. Then, get the content.
I suggest you try some kind of Factory design pattern for this case. You can abstract the content creation procedure by registering different creators. Moreover, you can assign a preference to each creator to suit your own logic. I also suggest you encapsulate all data related to Content, just like the "ContentDefinition" class from another post.
In general, you need to know that there is always a trade-off between flexibility and efficiency. Sometimes your first solution is good enough :)
OK, because I noticed a bit late that you actually wanted the ContentFrom mode as well, I've done my best to come up with a translation of your sample below my original answer.
In general I use the following paradigm for cases like this. Search and replace your specific methods here and there :)
IEnumerable<T> ValueSources()
{
yield return _value?? _alternative;
yield return SimpleCalculationFromCache();
yield return ComplexCalculation();
yield return PromptUIInputFallback("Please help by entering a value for X:");
}
T EffectiveValue { get { return ValueSources().FirstOrDefault(v => v!=null); } }
Note how you can now make v!=null arbitrarily 'interesting' for your purposes.
Note also how lazy evaluation makes sure that the calculations are never done when _value or _alternative are set to 'interesting' values.
Here is my initial attempt at putting your sample into this mold. Note how I added quite a lot of plumbing to make sure this actually compiles into standalone C# exe:
using System.Collections.Generic;
using System.Linq;
using System;
using T=System.String;
namespace X { public class Y
{
public static void Main(string[]args)
{
var content = Sources().FirstOrDefault(c => c); // trick: uses operator bool()
}
internal protected struct Content
{
public T Value;
public ContentFrom Mode;
//
public static implicit operator bool(Content specimen) { return specimen.Mode!=ContentFrom.None && null!=specimen.Value; }
}
private static IEnumerable<Content> Sources()
{
// mock
var Request = new { QueryString = new [] {"id"}.ToDictionary(a => a) };
if (!String.IsNullOrEmpty(Request.QueryString["id"]))
yield return new Content { Value = GetContent(Convert.ToInt64(Request.QueryString["id"])), Mode = ContentFrom.Query };
if (DefaultId != null)
yield return new Content { Value = GetContent((long) DefaultId), Mode = ContentFrom.Default };
yield return new Content();
}
public enum ContentFrom { None, Query, Default };
internal static T GetContent(long id) { return "dummy"; }
internal static readonly long? DefaultId = 42;
} }
private void InitContent()
{
Int64? id = !String.IsNullOrEmpty(Request.QueryString["id"])
? (Int64?)Convert.ToInt64(Request.QueryString["id"])
: null;
if (id != null && (Content = GetContent(id.Value)) != null)
ContentMode = ContentFrom.Query;
else if(DefaultId != null && (Content = GetContent(DefaultId)) != null)
ContentMode = ContentFrom.Default;
else
ContentMode = ContentFrom.None;
}
