How to combine IObservable sequences in UniRx/Rx.NET? - c#

I'm using the UniRx flavor of Reactive Extensions for the Unity3D game engine.
Unity uses C#, so I guess it's similar to Rx.NET.
I need a more beautiful way of checking when several observable sequences complete.
In the example below, one of the sequences is dependent on the outcome of the first (since it needs an integer for processID).
The observables are both of type IObservable<string>.
var processListObservable = APIBuilder
.GetProcessList(authInfo.Token, authInfo.PlatformURL, (int)product.Id)
.Subscribe(listJson =>
{
processList = ProcessList.FromJson(listJson);
int processID = (int)processList.Processes[0].ProcessDataId;
//Retrieve Detailed information of the first entry
var processDetailsObservable = APIBuilder
.GetProcessDetails(token, platformURL, product.Id, processID)
.Subscribe(detailsJson =>
{
processData = ProcessData.FromJson(detailsJson);
SetupPlotView();
});
});
Any hint would be highly appreciated. Also some suggestions to solve the same scenario minus the dependency on the result of the first sequence.

Instead of putting your code into the Subscribe handler, you could make it part of the sequence. You could use the Select operator to project each listJson to an IObservable<string> (resulting in a nested IObservable<IObservable<string>>), and then flatten the sequence by using either the Concat or the Merge operator, depending on whether you want to prevent or allow concurrency.
var processListObservable = APIBuilder
.GetProcessList(authInfo.Token, authInfo.PlatformURL, (int)product.Id)
.Select(listJson =>
{
var processList = ProcessList.FromJson(listJson);
int processID = (int)processList.Processes[0].ProcessDataId;
return APIBuilder.GetProcessDetails(token, platformURL, product.Id, processID);
})
.Concat() // or .Merge() to allow concurrency
.ObserveOn(SynchronizationContext.Current) // Optional
.Do(detailsJson =>
{
var processData = ProcessData.FromJson(detailsJson);
SetupPlotView(processData);
});
await processListObservable.DefaultIfEmpty(); // Start and await the operation
The await in the final line will cause an implicit subscription to the processListObservable, and your code will execute as a side-effect of this subscription.
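As a minimal sketch of the Select-then-flatten idea in isolation, the snippet below assumes Rx.NET (the System.Reactive package) rather than UniRx, and replaces the two API calls with hypothetical stand-ins, GetList and GetDetails, that return canned values. None of these names come from the question.

```csharp
using System;
using System.Reactive.Linq;

public static class FlattenDemo
{
    // Hypothetical stand-ins for the two API calls in the question.
    public static IObservable<string> GetList()
    {
        return Observable.Return("42");
    }

    public static IObservable<string> GetDetails(int id)
    {
        return Observable.Return("details for " + id);
    }

    public static IObservable<string> Pipeline()
    {
        return GetList()
            // Project each list result to an inner sequence:
            // IObservable<IObservable<string>>
            .Select(listJson => GetDetails(int.Parse(listJson)))
            // Flatten, subscribing to one inner sequence at a time.
            .Concat();
    }

    public static void Main()
    {
        // Nothing runs until something subscribes.
        Pipeline().Subscribe(
            Console.WriteLine,
            () => Console.WriteLine("completed"));
    }
}
```

When concurrency between inner sequences is acceptable, SelectMany is the usual shorthand for Select followed by Merge.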

Related

LINQ To SQL in a parallel loop: How to prevent duplicate insertions?

I'm running into some trouble in trying to parallelize a computationally expensive API integration.
The integration queries an API in parallel and populates a ConcurrentBag collection. Some processing is done, and then it is passed to Parallel.ForEach() in which it is interfaced with the database by using LINQ To Sql.
There is:
one outer loop which runs in parallel for Courses
an inner loop through Disciplines
inside it, another loop iterating through Lessons.
The problem I'm running into is: as any one lesson may belong to more than one course, looping over courses in parallel means that sometimes a lesson will be inserted more than once.
The code currently looks like this:
(externalCourseList is the collection of type ConcurrentBag<ExternalCourse>.)
Parallel.ForEach(externalCourseList, externalCourse =>
{
using ( var context = new DataClassesDataContext() )
{
var dbCourse = context.Courses.Single(
x => x.externalCourseId == externalCourse.courseCode.ToString());
dbCourse.ShortDesc = externalCourse.Summary;
//dbCourse.LongDesc = externalCourse.MoreInfo;
//(etc)
foreach (var externalDiscipline in externalCourse.Disciplines)
{
var dbDiscipline = context.Disciplines.Where(
x => x.ExternalDisciplineID == externalDiscipline.DisciplineCode
.ToString())
.SingleOrDefault();
if (dbDiscipline == null)
dbDiscipline = new Linq2SQLEntities.Discipline();
dbDiscipline.Title = externalDiscipline.Name;
//(etc)
dbDiscipline.ExternalDisciplineID = externalDiscipline.DisciplineCode
.ToString();
if (!dbDiscipline.IsLoaded)
context.Disciplines.InsertOnSubmit(dbDiscipline);
// relational table used as one-to-many relationship for legacy reasons
var courseDiscipline = dbDiscipline.Course_Disciplines.SingleOrDefault(
x => x.CourseID == dbCourse.CourseID);
if (courseDiscipline == null)
{
courseDiscipline = new Course_Discipline
{
Course = dbCourse,
Discipline = dbDiscipline
};
context.Course_Disciplines.InsertOnSubmit(courseDiscipline);
}
foreach (var externalLesson in externalDiscipline.Lessons)
{
/// The next statement throws an exception
var dbLesson = context.Lessons.Where(
x => x.externalLessonID == externalLesson.LessonCode)
.SingleOrDefault();
if (dbLesson == null)
dbLesson = new Linq2SQLEntities.Lesson();
dbLesson.Title = externalLesson.Title;
//(etc)
dbLesson.externalLessonID = externalLesson.LessonCode;
if (!dbLesson.IsLoaded)
context.Lessons.InsertOnSubmit(dbLesson);
var disciplineLesson = dbLesson.Discipline_Lessons.SingleOrDefault(
x => x.DisciplineID == dbDiscipline.DisciplineID
&& x.LessonID == dbLesson.LessonID);
if (disciplineLesson == null)
{
disciplineLesson = new Discipline_Lesson
{
Discipline = dbDiscipline,
Lesson = dbLesson
};
context.Discipline_Lessons.InsertOnSubmit(disciplineLesson);
}
}
}
context.SubmitChanges();
}
});
(IsLoaded is implemented as described here.)
An exception is thrown at the line preceded with /// because the same lesson is often inserted multiple times and calling .SingleOrDefault() on context.Lessons.Where(x => x.externalLessonID == externalLesson.LessonCode) fails.
What would be the best way to solve this?
One approach could be to separate the insertion of the lessons in the database from the other work that has to be done in parallel. I haven't studied your code deeply, so I am not sure if this approach is feasible, but I'll give an example anyway. The basic idea is to serialize the insertion of the lessons, in order to avoid the problems caused by the parallelization:
IEnumerable<Lesson[]> query = externalCourseList
.AsParallel()
.AsOrdered() // Optional
.Select(externalCourse =>
{
using DataClassesDataContext context = new();
List<Lesson> results = new();
// Here do the work that adds lessons in the results list.
return results.ToArray();
})
.AsSequential();
This is a parallel query (PLINQ), that does the parallel work while it is enumerated. So at this point it hasn't started yet. Now let's enumerate it:
using DataClassesDataContext context = new();
foreach (Lesson lesson in query.SelectMany(x => x))
{
// Here insert the lesson in the DB.
}
The work of inserting the lessons in the DB will be done exclusively on the current thread. This thread will also participate in the work inside the parallel query, along with ThreadPool threads. In case this is a problem, you could offload the enumeration of the query on a ThreadPool thread, freeing the current thread from doing anything else than the lesson-inserting work. I've posted an OffloadEnumeration extension method here, that you could use just before starting the enumeration of the query:
query = OffloadEnumeration(query);
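The pattern above can be sketched without a database. In this hedged example, ExtractLessonCodes is a hypothetical stand-in for the expensive per-course work, and a HashSet stands in for the "does this lesson already exist?" check. Because the consuming loop runs on a single thread, the set needs no locking.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ParallelThenSequentialDemo
{
    // Hypothetical stand-in for the expensive per-course work.
    // Every course shares "L-common", mimicking a lesson that belongs to several courses.
    public static string[] ExtractLessonCodes(int courseId)
    {
        return new[] { "L-" + courseId, "L-common" };
    }

    public static List<string> Run(IEnumerable<int> courseIds)
    {
        // The parallel part: nothing executes until the query is enumerated.
        IEnumerable<string[]> query = courseIds
            .AsParallel()
            .AsOrdered()
            .Select(id => ExtractLessonCodes(id))
            .AsSequential();

        var seen = new HashSet<string>();  // only touched by the enumerating thread
        var inserted = new List<string>();
        foreach (string code in query.SelectMany(x => x))
        {
            // A real implementation would insert into the database here.
            if (seen.Add(code))
                inserted.Add(code);
        }
        return inserted;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", Run(new[] { 1, 2, 3 })));
    }
}
```

Each shared lesson is handled exactly once because the duplicate check and the insert happen on the same thread, one item at a time.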

Looking for an elegant Rx.NET way to implement certain data processing

Given:
Database as the source of the data
The data has to be grouped and aggregated, where the aggregation process must be done in code and is asynchronous.
I am using the following simple code to simulate the real life:
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Reactive.Linq;
using System.Reactive.Threading.Tasks;
using System.Threading.Tasks;
namespace ObservableTest
{
class Program
{
public class Result
{
public int Key;
private int m_previous = -1;
public async Task<Result> AggregateAsync(int x)
{
return await Task.Run(async () =>
{
await Task.Delay(10);
Debug.Assert(m_previous < 0 ? x == Key : m_previous == x - 10);
m_previous = x;
return this;
});
}
public int Complete()
{
Debug.Assert(m_previous / 10 == 9);
return Key;
}
}
static void Main()
{
var obs = GetSource()
.GroupBy(x => x % 10)
.SelectMany(g => g.Aggregate(Observable.Return(new Result { Key = g.Key }), (resultObs, x) => resultObs.SelectMany(result => result.AggregateAsync(x).ToObservable()))
.Merge()
.Select(x => x.Complete()));
obs.Subscribe(Console.WriteLine, () => Console.WriteLine("Press enter to exit ..."));
Console.ReadLine();
}
static IObservable<int> GetSource()
{
return Enumerable.Range(0, 10).SelectMany(remainder => Enumerable.Range(0, 10).Select(i => 10 * i + remainder)).ToObservable();
}
}
}
GetSource returns the numbers from 0 to 99 in a certain order. The order already matches the one needed for the grouping. View this method as if it were querying a database using a SQL statement with an ORDER BY matching the anticipated grouping.
So, having an observable of database content I need to group it, aggregate asynchronously and replace each group with the aggregation result.
Here is my solution (from the code above):
var obs = GetSource()
.GroupBy(x => x % 10)
.SelectMany(g => g.Aggregate(Observable.Return(new Result { Key = g.Key }), (resultObs, x) => resultObs.SelectMany(result => result.AggregateAsync(x).ToObservable()))
.Merge()
.Select(x => x.Complete()));
I see multiple problems with it:
GroupBy is wrong here, because the data is already in the right order. It should be a sort of Window or Buffer, but driven by a predicate rather than sample count or time interval.
The asynchronous aggregation looks cumbersome and hence I assume I botched it too.
What is the proper Rx.NET way of achieving what I want?
I am not entirely sure whether there is a proper Rx way to solve this problem, but things start getting messy in Rx when dealing with collections, especially when items need to be added, updated or removed.
I wrote DynamicData, an open source project that specifically deals with manipulating collections. So my disclaimer for this answer is that I am very biased toward this solution.
Back to the problem: I would instantiate an observable cache like this:
var myCache = new SourceCache<MyObject, MyId>(myobject => myobject.Id);
You can now observe the cache and apply operators. To group and apply some transforms, do the following:
var mystream = myCache.Connect()
.Group(myobject => /* group value */) // creates an observable cache for each group
.Transform((myGroup, key) => myGroup.Cache.Connect().QueryWhenChanged(query => /* aggregate result */));
// now do something with the result
where Transform is an overload of the Rx Select operator. I previously blogged a detailed solution which may be appropriate to your problem; see Aggregation Example.
This cache is thread safe, and you can use the AddOrUpdate and Remove methods to load and change it asynchronously.
Remember that by default Rx avoids concurrency. However, if you need to, you can introduce schedulers to assign work when you want it.
Per your comments:
I don't believe using GroupBy is bad here at all if you want a predicate to drive the partitioning.
My approach is below (you can paste it into LINQPad with the reactive library included). I'm still struggling to wrap my mind around observables, but I believe this follows a good idiom, as it's also shown by Microsoft at https://msdn.microsoft.com/en-us/library/hh242963%28v=vs.103%29.aspx (last example).
void Main()
{
Console.WriteLine("starting on thread {0}",Thread.CurrentThread.ManagedThreadId);
//GetSource()
//.GroupBy(x => x % 10)
var sharedSource = GetSource().Publish();
var closingSignal = sharedSource.Where(MyPredicateFunc);
sharedSource.Window(()=>closingSignal)
.Select(x => x.ObserveOn(TaskPoolScheduler.Default))
.SelectMany(g=>g.Aggregate(0, (s,i) =>ExpensiveAggregateFunctionNoTask(s,i)).SingleAsync())
.Subscribe(i=>Console.WriteLine("Got {0} on thread {1}",i,Thread.CurrentThread.ManagedThreadId))
;
sharedSource.Connect();
}// Define other methods and classes here
bool MyPredicateFunc(int i){
return (i %10 == 0);
}
static IObservable<int> GetSource()
{
return Enumerable.Range(0, 10)
.SelectMany(remainder => Enumerable.Range(0, 10).Select(i => 10 * i + remainder)).ToObservable();
}
int ExpensiveAggregateFunctionNoTask(int lastResult, int currentElement){
var r = lastResult+currentElement;
Console.WriteLine("Adding {0} and {1} on thread {2}", lastResult, currentElement, Thread.CurrentThread.ManagedThreadId);
Task.Delay(250).Wait(); //simulate expensive operation
return r;
}
Doing this, you will see that we have created a new thread for each grouping made, and then we wait asynchronously in the SelectMany.

"Merging" a stream of streams to produce a stream of the latest values of each

I have an IObservable<IObservable<T>> where each inner IObservable<T> is a stream of values followed by an eventual OnCompleted event.
I would like to transform this into an IObservable<IEnumerable<T>>, a stream consisting of the latest value from any inner stream that is not completed. It should produce a new IEnumerable<T> whenever a new value is produced from one of the inner streams (or an inner stream expires)
It is most easily shown with a marble diagram (which I hope is comprehensive enough):
input ---.----.---.----------------
| | '-f-----g-|
| 'd------e---------|
'a--b----c-----|
result ---a--b-b--c-c-c-e-e-e---[]-
d d d e f g
f f
([] is an empty IEnumerable<T> and -| represents the OnCompleted)
You can see that it slightly resembles a CombineLatest operation.
I have been playing around with Join and GroupJoin to no avail but I feel that that is almost certainly the right direction to be heading in.
I would like to use as little state as possible in this operator.
Update
I have updated this question to include not just single-valued sequences - the resultant IObservable<IEnumerable<T>> should include only the latest value from each sequence - if a sequence has not produced a value, it should not be included.
Here's a version based on your solution yesterday, tweaked for the new requirements. The basic idea is to just put a reference into your perishable collection, and then update the value of the reference as the inner sequence produces new values.
I also modified to properly track the inner subscriptions and unsubscribe if the outer observable is unsubscribed.
Also modified to tear it all down if any of the streams produce an error.
Finally, I fixed some race conditions that could violate the Rx Guidelines. If your inner observables are firing concurrently from different threads, you could wind up calling obs.OnNext concurrently, which is a big no-no. So I've gated each inner observable using the same lock to prevent that (see the Synchronize call). Note that because of this, you could probably get away with using a regular doubly linked list instead of the PerishableCollection, because all of the code using the collection is now within a lock, so you don't need the threading guarantees of the PerishableCollection.
// Acts as a reference to the current value stored in the list
private class BoxedValue<T>
{
public T Value;
public BoxedValue(T initialValue) { Value = initialValue; }
}
public static IObservable<IEnumerable<T>> MergeLatest<T>(this IObservable<IObservable<T>> source)
{
return Observable.Create<IEnumerable<T>>(obs =>
{
var collection = new PerishableCollection<BoxedValue<T>>();
var outerSubscription = new SingleAssignmentDisposable();
var subscriptions = new CompositeDisposable(outerSubscription);
var innerLock = new object();
outerSubscription.Disposable = source.Subscribe(duration =>
{
BoxedValue<T> value = null;
var lifetime = new DisposableLifetime(); // essentially a CancellationToken
var subscription = new SingleAssignmentDisposable();
subscriptions.Add(subscription);
subscription.Disposable = duration.Synchronize(innerLock)
.Subscribe(
x =>
{
if (value == null)
{
value = new BoxedValue<T>(x);
collection.Add(value, lifetime.Lifetime);
}
else
{
value.Value = x;
}
obs.OnNext(collection.CurrentItems().Select(p => p.Value.Value));
},
obs.OnError, // handle an error in the stream.
() => // on complete
{
if (value != null)
{
lifetime.Dispose(); // removes the item
obs.OnNext(collection.CurrentItems().Select(p => p.Value.Value));
subscriptions.Remove(subscription); // remove this subscription
}
}
);
});
return subscriptions;
});
}
This solution will work for one-item streams but unfortunately accumulates every item in an inner stream until it finishes.
public static IObservable<IEnumerable<T>> MergeLatest<T>(this IObservable<IObservable<T>> source)
{
return Observable.Create<IEnumerable<T>>(obs =>
{
var collection = new PerishableCollection<T>();
return source.Subscribe(duration =>
{
var lifetime = new DisposableLifetime(); // essentially a CancellationToken
duration
.Subscribe(
x => // on initial item
{
collection.Add(x, lifetime.Lifetime);
obs.OnNext(collection.CurrentItems().Select(p => p.Value));
},
() => // on complete
{
lifetime.Dispose(); // removes the item
obs.OnNext(collection.CurrentItems().Select(p => p.Value));
}
);
});
});
}
Another solution given by Dave Sexton, creator of Rxx - it uses Rxx.CombineLatest which appears to be quite similar in its implementation to Brandon's solution:
public static IObservable<IEnumerable<T>> CombineLatestEagerly<T>(this IObservable<IObservable<T>> source)
{
return source
// Reify completion to force an additional combination:
.Select(o => o.Select(v => new { Value = v, HasValue = true })
.Concat(Observable.Return(new { Value = default(T), HasValue = false })))
// Merge a completed observable to force combination with the first real inner observable:
.Merge(Observable.Return(Observable.Return(new { Value = default(T), HasValue = false })))
.CombineLatest()
// Filter out completion notifications:
.Select(l => l.Where(v => v.HasValue).Select(v => v.Value));
}

How to cancel asynchronous action in LINQ query?

This question extends my previous one, Asynchronuos binding and LINQ query hangs. Assume I have a LINQ query such as:
var query = from item in items where item.X == 1 select item;
I can iterate through the query asynchronously and dispatch each item to the UI (or I may use IProgress):
foreach(var item in query)
{
Application.Current.Dispatcher.BeginInvoke(
new Action(() => source.Add(item)));
}
Now I would like to cancel the query... I can simply declare a CancellationTokenSource cts, put a token into a task and then:
foreach(var item in query)
{
cts.Token.ThrowIfCancellationRequested();
Application.Current.Dispatcher.BeginInvoke(
new Action(() => source.Add(item)));
}
The trouble is that I'm able to cancel only when a new result appears. So if there is a long run of items that don't meet my query condition, my cancel request is ignored.
How can I build cancellation into LINQ (to Objects) so that the cancel token is checked for each item?
I'm not sure, as I didn't test it, but I think you could put the check in as a side effect inside your LINQ query, for example by creating a method to call inside your where clause:
Change this:
var query = from item in items where item.X == 1 select item;
To:
var query = from item in items where CancelIfRequestedPredicate(item, i => i.X == 1) select item;
And create a method:
private bool CancelIfRequestedPredicate<T>(T item,Predicate<T> predicate)
{
cts.Token.ThrowIfCancellationRequested();
return predicate(item);
}
Since LINQ uses deferred execution, I think it will run your method at each iteration.
Just as an observation, I don't know what the behavior will be if you're not using LINQ to Objects (you didn't mention whether you're using LINQ to SQL or something similar), but it probably won't work there.
Hope it helps.
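The suggestion above can be checked with a small self-contained sketch for LINQ to Objects. WhereCancellable is a hypothetical helper name, not a framework method. It wraps the predicate so the token is checked for every source item, including the ones that do not match.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;

public static class CancellableLinq
{
    // Hypothetical helper: checks the token before evaluating the real predicate,
    // so cancellation is observed for every source item, not only for matches.
    public static IEnumerable<T> WhereCancellable<T>(
        this IEnumerable<T> source, Func<T, bool> predicate, CancellationToken token)
    {
        return source.Where(item =>
        {
            token.ThrowIfCancellationRequested();
            return predicate(item);
        });
    }
}

public static class CancellableLinqDemo
{
    public static void Main()
    {
        var cts = new CancellationTokenSource();

        // Only the last item matches, so a per-result check would almost never run.
        var query = Enumerable.Range(0, 1000000)
            .WhereCancellable(x => x == 999999, cts.Token);

        cts.Cancel(); // simulate a cancel request arriving before enumeration

        try
        {
            foreach (var item in query)
                Console.WriteLine(item);
        }
        catch (OperationCanceledException)
        {
            Console.WriteLine("Query cancelled.");
        }
    }
}
```

Because the query is deferred, the token is consulted only while the query is being enumerated, and the loop that enumerates it has to be prepared to catch OperationCanceledException.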
For my Entity Framework-specific version of this problem:
I dug around for a while today trying to find an answer, and finally found something (one of the 30-ish pages I visited) that clearly answered my issue, which is specifically a LINQ query running against Entity Framework.
Later versions of Entity Framework (as of now) have a ToListAsync extension method with an overload that takes a cancellation token.
Task.Run accepts one as well (in my case the query ran inside a task), but it was the data query I was most concerned about.
var sourceToken = new System.Threading.CancellationTokenSource();
var t2 = System.Threading.Tasks.Task.Run(() =>
{
var token = sourceToken.Token;
return context.myTable.Where(s => s.Price == "Right").Select(i => i.ItemName).ToListAsync(token);
}
, sourceToken.Token
);

Detecting "near duplicates" using a LINQ/C# query

I'm using the following queries to detect duplicates in a database.
Using a LINQ join doesn't work very well because Company X may also be listed as CompanyX, therefore I'd like to amend this to detect "near duplicates".
var results = result
.GroupBy(c => new {c.CompanyName})
.Select(g => new CompanyGridViewModel
{
LeadId = g.First().LeadId,
Qty = g.Count(),
CompanyName = g.Key.CompanyName,
}).ToList();
Could anybody suggest a way in which I have better control over the comparison? Perhaps via an IEqualityComparer (although I'm not exactly sure how that would work in this situation)
My main goals are:
To list the first record with a subset of all duplicates (or "near duplicates")
To have some flexibility over the fields and text comparisons I use for my duplicates.
For your explicit "ignoring spaces" case, you can simply call
var results = result.GroupBy(c => c.Name.Replace(" ", ""))...
However, in the general case where you want flexibility, I'd build up a library of IEqualityComparer<Company> classes to use in your groupings. For example, this should do the same in your "ignore space" case:
public class CompanyNameIgnoringSpaces : IEqualityComparer<Company>
{
public bool Equals(Company x, Company y)
{
return x.Name.Replace(" ", "") == y.Name.Replace(" ", "");
}
public int GetHashCode(Company obj)
{
return obj.Name.Replace(" ", "").GetHashCode();
}
}
which you could use as
var results = result.GroupBy(c => c, new CompanyNameIgnoringSpaces())...
It's pretty straightforward to do similar things containing multiple fields, or other definitions of similarity, etc.
Just note that your definition of "similar" must be transitive, e.g. if you're looking at integers you can't define "similar" as "within 5", because then you'd have "0 is similar to 5" and "5 is similar to 10" but not "0 is similar to 10". (It must also be reflexive and symmetric, but that's more straightforward.)
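One way to get transitivity for free is to compare a normalized key instead of comparing pairs of values directly. The sketch below assumes a hypothetical Company class and an assumed normalization rule (strip whitespace, ignore case). Both are illustrations, not the question's actual model.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the question's entity.
public sealed class Company
{
    public int LeadId { get; set; }
    public string Name { get; set; }
}

public sealed class NormalizedNameComparer : IEqualityComparer<Company>
{
    // Assumed normalization rule: drop whitespace and ignore case.
    public static string Normalize(string name)
    {
        return new string(name.Where(ch => !char.IsWhiteSpace(ch)).ToArray())
            .ToUpperInvariant();
    }

    public bool Equals(Company x, Company y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x == null || y == null) return false;
        return Normalize(x.Name) == Normalize(y.Name);
    }

    // Must agree with Equals: equal companies produce equal hash codes.
    public int GetHashCode(Company obj)
    {
        return Normalize(obj.Name).GetHashCode();
    }
}

public static class NearDuplicateDemo
{
    public static void Main()
    {
        var companies = new List<Company>
        {
            new Company { LeadId = 1, Name = "Company X" },
            new Company { LeadId = 2, Name = "CompanyX" },
            new Company { LeadId = 3, Name = "company x" },
            new Company { LeadId = 4, Name = "Other Ltd" },
        };

        var groups = companies.GroupBy(c => c, new NormalizedNameComparer());
        foreach (var g in groups)
            Console.WriteLine(g.First().LeadId + ": " + g.First().Name + " x" + g.Count());
        // prints "1: Company X x3" then "4: Other Ltd x1"
    }
}
```

Two companies are "similar" exactly when their normalized names are equal, and string equality is already reflexive, symmetric and transitive, so the comparer inherits all three properties.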
Okay, so since you're looking for different permutations you could do something like this:
Bear in mind this was written in the answer so it may not fully compile, but you get the idea.
var results = result
.Where(g => CompanyNamePermutations(g.Key.CompanyName).Contains(g.Key.CompanyName))
.GroupBy(c => new {c.CompanyName})
.Select(g => new CompanyGridViewModel
{
LeadId = g.First().LeadId,
Qty = g.Count(),
CompanyName = g.Key.CompanyName,
}).ToList();
private static List<string> CompanyNamePermutations(string companyName)
{
// build your permutations here
// so to build the one in your example
return new List<string>
{
companyName,
string.Join("", companyName.Split(" ".ToCharArray(), StringSplitOptions.RemoveEmptyEntries))
};
}
In this case you need to define where the work is going to take place i.e. fully on the server, in local memory or a mixture of both.
In local memory:
In this case we have two routes: pull back all the data and do the logic in local memory, or stream the data and apply the logic piecewise. To pull all the data, just call ToList() or ToArray() on the base table. To stream the data, I would suggest using ToLookup() with a custom IEqualityComparer, e.g.
public class CustomEqualityComparer: IEqualityComparer<String>
{
public bool Equals(String str1, String str2)
{
//custom logic
}
public int GetHashCode(String str)
{
// custom logic
}
}
//result
var results = result.ToLookup(r => r.Name,
new CustomEqualityComparer())
.Select(r => ....)
Fully on the server:
Depends on your provider and what it can successfully map. E.g. if we define a near duplicate as one with an alternative delimiter one could do something like this:
private char[] delimiters = new char[]{' ','-','*'}
var results = result.GroupBy(r => delimiters.Aggregate( d => r.Replace(d,'')...
Mixture:
In this case we are splitting the work between the two. Unless you come up with a nice scheme this route is most likely to be inefficient. E.g. if we keep the logic on the local side, build groupings as a mapping from a name into a key and just query the resulting groupings we can do something like this:
var groupings = result.Select(r => r.Name)
//pull into local memory
.ToArray()
//do local grouping logic...
//Query results
var results = result.GroupBy(r => groupings[r]).....
Personally I usually go with the first option, pulling all the data for small data sets and streaming large data sets. (Empirically, I found that streaming with logic between each pull takes a lot longer than pulling all the data and then doing all the logic.)
Note: depending on the provider, ToLookup() usually executes immediately and applies its logic piecewise as it is constructed.
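For LINQ to Objects at least, the difference between immediate and deferred execution is easy to observe. This sketch uses a made-up in-memory list of names and shows that ToLookup() captures the data when it is called, while GroupBy() re-reads the source each time it is enumerated. Other providers may behave differently.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LookupVsGroupByDemo
{
    // Returns { group count seen by GroupBy, group count seen by ToLookup }.
    public static int[] Counts()
    {
        var names = new List<string> { "Company X", "CompanyX" };
        Func<string, string> key = n => n.Replace(" ", "");

        IEnumerable<IGrouping<string, string>> grouped = names.GroupBy(key); // deferred
        ILookup<string, string> lookup = names.ToLookup(key);               // built now

        names.Add("Other Ltd"); // added after both calls

        return new[] { grouped.Count(), lookup.Count };
    }

    public static void Main()
    {
        int[] counts = Counts();
        Console.WriteLine("GroupBy sees " + counts[0] + " groups"); // 2
        Console.WriteLine("ToLookup sees " + counts[1] + " group"); // 1
    }
}
```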
