One To Many LINQ Joins Across Nested Collections - c#

I have a development scenario where I am joining two collections with Linq; a single list of column header objects which contain presentation metadata, and an enumeration of kv dictionaries which result from a web service call. I can currently iterate (for) through the dictionary enumeration, and join the single header list to the current kv dictionary without issue. After joining, I emit a curated array of dictionary values for each iteration.
What I would like to do is eliminate the for loop, and join the single header list directly to the entire enumeration. I understand the 1-to-1 collection join pretty well, but the 1-to-N syntax is eluding me.
Details
I have the following working method:
public void GetQueryResults(DataTable outputTable)
{
    var odClient = new ODataClient(UrlBase);
    var odResponse = odClient.FindEntries(CommandText);
    foreach (var row in odResponse)
    {
        var rowValues = OutputFields
            .Join(row, h => h.Key, r => r.Key,
                (h, r) => new { Header = h, ResultRow = r })
            .Select(r => r.ResultRow.Value);
        outputTable.Rows.Add(rowValues.ToArray());
    }
}
odResponse is an IEnumerable<IDictionary<string, object>>; OutputFields is an IList<QueryField>; the .Join produces an enumeration of anonymous objects containing matched field metadata (.Header) and response key-value pairs (.ResultRow); finally, the .Select emits the matched response values for row consumption. The elements of the OutputFields collection look like this:
class QueryField
{
public string Key { get; set; }
public string Label { get; set; }
public int Order { get; set; }
}
Which is declared as:
public IList<QueryField> OutputFields { get; private set; }
By joining the collection of field headers to the response rows, I can pluck just the columns I need from the response. If the header keys contain { "size", "shape", "color" } and the response keys contain { "size", "mass", "color", "longitude", "latitude" }, I will get an array of values for { "size", "shape", "color" }, where shape is null, and the mass, longitude, and latitude values are ignored. For the purposes of this scenario, I am not concerned with ordering. This all works a treat.
Problem
What I'd like to do is refactor this method to return an enumeration of value array rows, and let the caller manage the consumption of the data:
public IEnumerable<string[]> GetQueryResults()
{
    var odClient = new ODataClient(UrlBase);
    var odResponse = odClient.FindEntries(CommandText);
    var responseRows = // join OutputFields to each row in odResponse by .Key
    return responseRows;
}
Followup Question
Would a Linq-implemented solution for this refactor require an immediate scan of the enumeration, or can it pass back a lazy result? The purpose of the refactor is to improve encapsulation without causing redundant collection scans. I can always build imperative loops to reformat the response data the hard way, but what I'd like from Linq is something like a closure.
Thanks heaps for spending the time to read this; any suggestions are appreciated!

I'm not completely sure what you mean, but perhaps you are after something like this?
public IEnumerable<object[]> GetQueryResults()
{
    var odClient = new ODataClient(UrlBase);
    var odResponse = odClient.FindEntries(CommandText);
    // I'd rather use LINQ here. Note the ToArray() on the inner query:
    // without it the outer select would yield a single-element array
    // holding the subquery itself rather than the matched values.
    var responseRows = from row in odResponse
                       select (from field in row
                               join outputfield in OutputFields
                                 on field.Key equals outputfield.Key
                               select field.Value).ToArray();
    return responseRows;
}
Instead of filling a DataTable, this builds an object[] per row, filled with field.Value for every field.Key that also exists in OutputFields. The whole thing stays wrapped in an IEnumerable (from row in odResponse).
Usage:
var responseRows = GetQueryResults();
foreach (var rowValues in responseRows)
    outputTable.Rows.Add(rowValues);
The trick here is that within one query you iterate the rows and, for each row, run a subquery over its fields, storing the subquery result directly in an object[]. The object[] is only created when responseRows is actually iterated, which I think answers your second question: the result is lazy.
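If you prefer method syntax, here is an equivalent sketch (reusing UrlBase, CommandText, and OutputFields from the question); it is equally lazy because the Select is deferred until the caller enumerates the result:
public IEnumerable<object[]> GetQueryResults()
{
    var odClient = new ODataClient(UrlBase);
    var odResponse = odClient.FindEntries(CommandText);

    // Deferred: each object[] is built only when the caller enumerates the sequence.
    return odResponse.Select(row =>
        OutputFields
            .Join(row, h => h.Key, r => r.Key, (h, r) => r.Value)
            .ToArray());
}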

Related

Building a collection of objects retrievable via tags

I need a collection that can store and retrieve objects with multiple, potentially shared tags. I need to be able to store one object with multiple tags, and retrieve all objects that have one or more tags.
My first idea was for the collection to store an array of objects, and a Dictionary<string, Hashset<int>> where the key is the tag and the value is the indexes that tag applies to.
For multiple tags, get the intersection of the index collections.
To remove a tag from an object, remove that index from the collection.
However, if an object is removed from the collection, all the indexes after that point are now incorrect.
Am I heading in the right direction? Is there an existing implementation of this that I'm unaware of, or a standard approach to collections that would help here?
Given
public class Something
{
public HashSet<string> Tags { get; set; }
}
Usage
var list = new List<Something>
{
    new Something()
    {
        Tags = new HashSet<string>() { "tag1", "tag2" }
    },
    new Something()
    {
        Tags = new HashSet<string>() { "tag3", "tag4" }
    }
};
var searchList = new List<string> { "tag1", "tag4" };
var result = list.Where(x => x.Tags.Any(y => searchList.Contains(y)));
This is a fairly standard in-memory approach.
If you want it more strongly typed, use enums (if the tags don't need to be dynamic).
You're headed in the right direction. I would say you should cache common intersections in other HashSet<T> instances to speed things up and simplify them even more, as sketched below.
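As an illustration, a small sketch of caching the intersection for a frequently used tag pair; tagIndex is the Dictionary<string, HashSet<int>> from the question, while intersectionCache (a Dictionary<Tuple<string, string>, HashSet<int>>) is a hypothetical addition:
// Look the cached intersection up first; compute and store it only on a miss.
var key = Tuple.Create("tag1", "tag2");
HashSet<int> cached;
if (!intersectionCache.TryGetValue(key, out cached))
{
    cached = new HashSet<int>(tagIndex["tag1"]);
    cached.IntersectWith(tagIndex["tag2"]);   // keep only indexes carrying both tags
    intersectionCache[key] = cached;
}
// cached now holds the indexes of objects tagged with both "tag1" and "tag2"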
however, if an object is removed from the collection, all the indexes
after that point are now incorrect.
To address that, you can build an inverse dictionary, Dictionary<int, HashSet<string>>, so that removing a given object from the tag index does not require iterating the entire index:
var tags = objectTagMap[394];
foreach (var tag in tags)
    tagObjectMap[tag].Remove(394);
Anyway, if you're thinking about an in-memory index, why not use Redis? Redis provides hashes (dictionaries), sets, and sorted sets (among other data structures).
This is a very simplified sample of how you would build the same strategy in Redis:
# Store objects as key-value pairs
set object:1 { "id": 1 }
set object:2 { "id": 2 }
set object:3 { "id": 3 }

# sadd (set add) to build the tag index
sadd tagA 1 2
sadd tagB 3

# sunion to get object ids from two or more tags
sunion tagA tagB

# mget (multiple get) to get object data from the result of sunion,
# concatenating "object:" with each object id.
# This is a simple example. In a real-world system you would use
# SCAN to avoid bottlenecks and to be able to leverage paging.
mget object:1 object:2 object:3
Why not use:
Dictionary<List<string>, HashSet<int>> taggedDict = new Dictionary<List<string>, HashSet<int>>();
var searchList = new List<string> { "tag1", "tag4" };
var keys = taggedDict.Keys.Where(x => x.Any(y => searchList.Contains(y)));
This isn't the best way, but the approach I'm using for now (until it becomes a problem) is simply a collection consisting of two dictionaries: a Dictionary<string, HashSet<T>> to get the objects carrying a tag, and a Dictionary<T, HashSet<string>> to get the tags on an object. It's simple and functional, and should suffice for smaller collections.
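For completeness, a minimal sketch of that two-dictionary collection (the TagCollection<T> name and its members are illustrative only; assumes using System.Collections.Generic and System.Linq):
public class TagCollection<T>
{
    private readonly Dictionary<string, HashSet<T>> _byTag = new Dictionary<string, HashSet<T>>();
    private readonly Dictionary<T, HashSet<string>> _byObject = new Dictionary<T, HashSet<string>>();

    public void Add(T item, params string[] tags)
    {
        foreach (var tag in tags)
        {
            HashSet<T> items;
            if (!_byTag.TryGetValue(tag, out items))
                _byTag[tag] = items = new HashSet<T>();
            items.Add(item);

            HashSet<string> itemTags;
            if (!_byObject.TryGetValue(item, out itemTags))
                _byObject[item] = itemTags = new HashSet<string>();
            itemTags.Add(tag);
        }
    }

    // All objects carrying at least one of the given tags.
    public IEnumerable<T> WithAnyTag(params string[] tags)
    {
        return tags.Where(_byTag.ContainsKey).SelectMany(t => _byTag[t]).Distinct();
    }

    // The inverse dictionary means removal only touches the buckets
    // for tags the object actually has.
    public void Remove(T item)
    {
        HashSet<string> tags;
        if (!_byObject.TryGetValue(item, out tags)) return;
        foreach (var tag in tags)
            _byTag[tag].Remove(item);
        _byObject.Remove(item);
    }
}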

c# Appropriate data structure for storing values from csv file. Specific Case

I'm writing a program that will simply read 2 different .csv files containing following information:
file 1 file2
AA,2.34 BA,6.45
AB,1.46 BB,5.45
AC,9.69 BC,6.21
AD,3.6 AC,7.56
where the first column is a string and the second is a double.
So far I have no difficulty reading those files and placing the values into lists:
firstFile = new List<KeyValuePair<string, double>>();
secondFile = new List<KeyValuePair<string, double>>();
I'm trying to instruct my program:
to take the first value from the first column of the first row of the first file (in this case AA),
and look whether there is a match anywhere in the first column of the second file.
If a string match is found, compare the corresponding second values (doubles in this case), and if those also match, add the entire row to a separate List.
Something similar to the below pseudo-code:
for (var i = 0; i < firstFile.Count; i++)
{
    firstFile.Column[0].value[i].SearchMatchesInAnotherFile(secondFile.Column[0].values.All);
    if (MatchFound)
    {
        CompareCorrespondingDoubles();
        if (true)
        {
            AddFirstValueToList();
        }
    }
}
Instead of a List I tried to use a Dictionary, but that structure is not sorted and there is no way to access a key by index.
I'm not asking for exact code; rather, the question is:
What would you suggest to use as an appropriate data structure for this program so that I can investigate myself further?
KeyValuePair is really intended for dictionaries. I suggest creating your own custom type:
public class MyRow
{
    public string StringValue { get; set; }
    public double DoubleValue { get; set; }

    public override bool Equals(object o)
    {
        MyRow r = o as MyRow;
        if (ReferenceEquals(r, null)) return false;
        return r.StringValue == this.StringValue && r.DoubleValue == this.DoubleValue;
    }

    public override int GetHashCode()
    {
        unchecked { return StringValue.GetHashCode() ^ DoubleValue.GetHashCode(); }
    }
}
And store the files in lists of this type:
List<MyRow> firstFile = ...
List<MyRow> secondFile = ...
Then you can determine the intersection (all elements that occur in both lists) via LINQ's Intersect method:
var result = firstFile.Intersect(secondFile).ToList();
It's necessary to override Equals and GetHashCode because otherwise Intersect would only do a reference comparison. Alternatively, you could implement your own IEqualityComparer<MyRow> that does the comparison and pass it to the appropriate Intersect overload, as shown below.
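Here is a sketch of that comparer variant (MyRowComparer is an illustrative name, not part of the original answer):
public class MyRowComparer : IEqualityComparer<MyRow>
{
    public bool Equals(MyRow x, MyRow y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x == null || y == null) return false;
        return x.StringValue == y.StringValue && x.DoubleValue == y.DoubleValue;
    }

    public int GetHashCode(MyRow r)
    {
        unchecked { return r.StringValue.GetHashCode() ^ r.DoubleValue.GetHashCode(); }
    }
}

// Usage: pass the comparer to the Intersect overload that accepts one.
var result = firstFile.Intersect(secondFile, new MyRowComparer()).ToList();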
But if you can ensure that the keys (the string values) are unique, you can also use a
Dictionary<string, double> firstFile = ...
Dictionary<string, double> secondFile = ...
And in this case use this LINQ statement:
// Keep the entries whose key also exists in the second file with the same value.
var result = firstFile
    .Where(x => secondFile.ContainsKey(x.Key) && secondFile[x.Key] == x.Value)
    .ToDictionary(x => x.Key, x => x.Value);
which has a time complexity of O(m+n), whereas a naive nested comparison of the two lists would be O(m*n) (for m and n being the row counts of the two files).

Speeding up iterating through two foreach loops

I'm trying to speed up iterating through two foreach loops; at the moment it takes about 15 seconds.
foreach (var prodCost in Settings.ProdCostsAndQtys)
{
    foreach (var simplified in Settings.SimplifiedPricing
        .Where(simplified => prodCost.improd.Equals(simplified.PPPROD) &&
                             prodCost.pplist.Equals(simplified.PPLIST)))
    {
        prodCost.pricecur = simplified.PPP01;
        prodCost.priceeur = simplified.PPP01;
    }
}
Basically, ProdCostsAndQtys is a list of objects with 5 properties; it has 798,677 elements.
SimplifiedPricing is a list of objects with 44 properties; it currently has 347 elements but is more than likely going to get a lot bigger (hence wanting to get the best performance now).
The outer loop iterates through every object in the first list; the inner loop finds the entries of the second list whose two key fields match, and when they do, two properties of the first list's object are overwritten with values from the matching object.
It seems that SimplifiedPricing is the smaller lookup list and the outer loop iterates over the larger one. The main source of delay looks like the Equals checks run against every item of the smaller list for each item of the larger list. Also, when there are several matches you overwrite the same values repeatedly, which is redundant.
Considering this, I would suggest building up a Dictionary for the items in the smaller list, increasing memory consumption but drastically speeding up lookup times. First we need something to hold the key of this dictionary. I will assume that the improd and pplist are integers, but it does not matter for this case:
public struct MyKey
{
    public readonly int Improd;
    public readonly int Pplist;

    public MyKey(int improd, int pplist)
    {
        Improd = improd;
        Pplist = pplist;
    }

    public override int GetHashCode()
    {
        return Improd.GetHashCode() ^ Pplist.GetHashCode();
    }

    public override bool Equals(object obj)
    {
        if (!(obj is MyKey)) return false;
        var other = (MyKey)obj;
        return other.Improd.Equals(this.Improd) && other.Pplist.Equals(this.Pplist);
    }
}
Now that we have something that compares Pplist and Improd in one go, we can use it as a key for a dictionary containing the SimplifiedPricing.
IReadOnlyDictionary<MyKey, SimplifiedPricing> simplifiedPricingLookup =
(from sp in Settings.SimplifiedPricing
group sp by new MyKey(sp.PPPROD, sp.PPLIST) into g
select new {key = g.Key, value = g.Last()}).ToDictionary(o => o.key, o => o.value);
Notice the IReadOnlyDictionary. This is to show our intent of not modifying this dictionary after its creation, allowing us to safely parallelize the main loop:
Parallel.ForEach(Settings.ProdCostsAndQtys, c =>
{
SimplifiedPricing value;
if (simplifiedPricingLookup.TryGetValue(new MyKey(c.improd, c.pplist), out value))
{
c.pricecur = value.PPP01;
c.priceeur = value.PPP01;
}
});
This should change your single-threaded O(n²) loop to a parallelized O(n) loop, with a slight overhead for creating the simplifiedPricingLookup dictionary.
A join should be more efficient:
var toUpdate = from pc in Settings.ProdCostsAndQtys
join s in Settings.SimplifiedPricing
on new { prod=pc.improd, list=pc.pplist } equals new { prod=s.PPPROD, list=s.PPLIST }
select new { prodCost = pc, simplified = s };
foreach (var pcs in toUpdate)
{
pcs.prodCost.pricecur = pcs.simplified.PPP01;
pcs.prodCost.priceeur = pcs.simplified.PPP01;
}
You could make use of multiple threads with parallel.Foreach:
Parallel.ForEach(Settings.ProdCostsAndQtys, prodCost =>
{
foreach (var simplified in Settings.SimplifiedPricing
.Where(simplified =>
prodCost.improd.Equals(simplified.PPPROD) &&
prodCost.pplist.Equals(simplified.PPLIST))
{
prodCost.pricecur = simplified.PPP01;
prodCost.priceeur = simplified.PPP01;
}
}
However, this only applies if you have the lists in memory; there are far more efficient mechanisms for updating the data directly in the database. Also, using a LINQ join might make the code more readable at negligible performance cost.

Picking a collection for my data

I have to load and use my data from a db. The data is represented like this:
group_id term
1 hello
1 world
1 bye
2 foo
2 bar
etc.
What is a good C# collection to load and use this data?
Looks like you need a Dictionary<int, List<string>>:
var dict = new Dictionary<int, List<string>>();
dict.Add(1, new List<string> { "hello", "world", "bye" });
dict.Add(2, new List<string> { "foo", "bar" });
It all depends on what you have to do with the collection, but a Lookup seems like a good candidate if you need to group by group_id.
If your data is in a DataTable:
var lookup = table.AsEnumerable().ToLookup(row => row.Field<int>("group_id"));
and then access the groups the following way:
foreach (var group in lookup)
{
int groupID = group.Key;
IEnumerable<DataRow> groupRows = group;
}
It depends very strongly on what you need to do with your data.
If you just need to list your data, create a class which holds the data and use a List<Vocable>.
public class Vocable
{
public int Group { get; set; }
public string Term { get; set; }
}
List<Vocable> vocables;
If you need to look up all terms belonging to a group, use a Dictionary<int, List<string>> using the group id as key and a list of terms as value.
If you need to look up the group a term belongs to, use a Dictionary<string, int> using the term as key and the group id as value.
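As a sketch, both lookups can be built from the vocables list with LINQ (this assumes vocables has been populated and that each term belongs to exactly one group, otherwise ToDictionary would throw on a duplicate key):
// group id -> all terms in that group
var termsByGroup = vocables
    .GroupBy(v => v.Group)
    .ToDictionary(g => g.Key, g => g.Select(v => v.Term).ToList());

// term -> group id
var groupByTerm = vocables.ToDictionary(v => v.Term, v => v.Group);

// Usage:
var termsInGroup1 = termsByGroup[1];   // "hello", "world", "bye"
var groupOfFoo = groupByTerm["foo"];   // 2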
To load and save data from and to a DB you can use the DataSet/DataTable/DataRow classes; look at DataAdapter etc., depending on your database server (MySQL, MS SQL, ...).
If you want to work with objects, I suggest an ORM such as Entity Framework (Microsoft) or SubSonic.
http://subsonicproject.com/docs/The_5_Minute_Demo
With an ORM you can use LINQ queries like this:
// Define a query that returns all Department
// objects and course objects, ordered by name.
var departmentQuery = from d in schoolContext.Departments.Include("Courses")
orderby d.Name
select d;

Update object in IEnumerable<> not updating?

I have an IEnumerable of a POCO type containing around 80,000 rows
and a db table (L2E/EF4) containing a subset of rows where there was an "error/difference" (about 5000 rows, but often repeated, giving about 150 distinct entries).
The following code gets the distinct VSACodes "in error" and then attempts to update the complete result set, updating JUST the rows that match... but it doesn't work!
var vsaCodes = (from g in db.GLDIFFLs
                select g.VSACode)
               .Distinct();

foreach (var code in vsaCodes)
{
    var hasDifference = results.Where(r => r.VSACode == code);
    foreach (var diff in hasDifference)
        diff.Difference = true;
}
var i = results.Count(r => r.Difference == true);
After this code, i = 0
I've also tried:
foreach (var code in vsaCodes)
{
results.Where(r => r.VSACode == code).Select(r => { r.Difference = true; return r; }).ToList();
}
How can I update the "results" to set only the matching Difference property?
Assuming results is just a query (you haven't shown it), it will be evaluated every time you iterate over it. If that query creates new objects each time, you won't see the updates. If it returns references to the same objects, you would.
If you change results to be a materialized query result - e.g. by adding ToList() to the end - then iterating over results won't issue a new query, and you'll see your changes.
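For instance, a minimal sketch of that fix applied to the code above (queryOverPocos is a placeholder for whatever deferred query currently produces results, which the question doesn't show):
// Materialize once so every later iteration sees the same object instances.
var results = queryOverPocos.ToList();

foreach (var code in vsaCodes)
{
    foreach (var diff in results.Where(r => r.VSACode == code))
        diff.Difference = true;
}

var i = results.Count(r => r.Difference == true);   // now reflects the updates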
I had the same kind of error some time ago. The problem is that LINQ queries are often deferred and not executed at the point where it looks like you are calling them.
Quotation from "Pro LINQ Language Integrated Query in C# 2010":
"Notice that even though we called the query only once, the results of
the enumeration are different for each of the enumerations. This is
further evidence that the query is deferred. If it were not, the
results of both enumerations would be the same. This could be a
benefit or detriment. If you do not want this to happen, use one of
the conversion operators that do not return an IEnumerable so that
the query is not deferred, such as ToArray, ToList, ToDictionary, or
ToLookup, to create a different data structure with cached results
that will not change if the data source changes."
Here you have a good explanation with examples of it:
http://blogs.msdn.com/b/charlie/archive/2007/12/09/deferred-execution.aspx
Regards
Parsing words pretty closely on @Jon Skeet's answer...
If your query is simply a filter and the underlying source objects are updated, the query will be reevaluated and may then exclude those objects based on the filter condition; your query results will change on subsequent enumerations, but the underlying objects will still have been updated.
The key point is the lack of a projection to a new type as far as updating and persisting the changed objects is concerned.
ToList() is the usual solution to this issue, and it solves the problem when there is a projection to a new type, but things get cloudy when your query filters without projecting: updates made through the query still affect the original source objects, since everything references the same instances.
Again, parsing words, but these edge cases can trip you up.
public class Widget
{
    public string Name { get; set; }
}

var widgets1 = new[]
{
    new Widget { Name = "Red", },
    new Widget { Name = "Green", },
    new Widget { Name = "Blue", },
    new Widget { Name = "Black", },
};

// adding ToList() will result in a 'static' query result, but
// updates to the objects will still affect the source objects
var query1 = widgets1
    .Where(i => i.Name.StartsWith("B"))
    //.ToList()
    ;

foreach (var widget in query1)
{
    widget.Name = "Yellow";
}

// produces no output unless you uncomment the ToList() above:
// query1 is reevaluated and filters out "Yellow", which does not start with "B"
foreach (var name in query1)
    Console.WriteLine(name.Name);

// produces Red, Green, Yellow, Yellow;
// the underlying widgets were updated
foreach (var name in widgets1)
    Console.WriteLine(name.Name);
