I have a CSV file with 30,000 lines. I have to select many values based on many conditions, so instead of many loops and ifs I decided to use LINQ. I have written a class to read the CSV. It implements IEnumerable so that it can be used with LINQ. This is my enumerator:
class CSVEnumerator : IEnumerator
{
private CSVReader _csv;
private int _index;
public CSVEnumerator(CSVReader csv)
{
_csv = csv;
_index = -1;
}
public void Reset(){_index = -1;}
public object Current
{
get
{
return new CSVRow(_index,_csv);
}
}
public bool MoveNext()
{
return ++_index < _csv.TotalRows;
}
}
It works, but it's slow. Let's say I want to select the max value in column A for the rows between 100 and 150:
max = (from CSVRow r in csv where r.ID > 100 && r.ID < 150 select r).Max(y=>y["A"]);
This works, but LINQ searches for the max value in all 30,000 rows instead of just the 49 that match.
As I said, I could use a loop, but only in this example case; the real conditions are "brutal" :)
Is there any way to override how LINQ searches the collection? Something like: inspect the query used on my enumerator, check whether any condition in the "where" clause contains a row ID filter, and return different data based on that.
I don't want to copy part of the data to another array or collection, and the problem is not in my CSV reader. Accessing any single row by ID is fast; the only problem is accessing all 30,000 of them.
Any help appreciated :-)
If you wanted to be able to use LINQ for this efficiently, you would need to use expression trees, in a way similar to (but much simpler than) what the various LINQ providers for SQL databases do. While doable, I think it would be quite a lot of code for such a simple task.
Because of that, I think a better solution would be to use a separate method to select the rows you want (and then possibly use LINQ to work with the result).
Also, many operations that return collections (including your original code and my modification) can be simplified by using iterator methods.
So, your code could look something like this:
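// Extension methods must be declared inside a static class.
public static IEnumerable<CSVRow> GetRows(
    this CSVReader reader, int idGreaterThan, int idLessThan)
{
    // Both bounds are exclusive, matching the original query.
    for (int i = idGreaterThan + 1; i < idLessThan; i++)
    {
        // Same argument order as in the question's enumerator: (index, reader)
        yield return new CSVRow(i, reader);
    }
}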
Here, it's an extension method for CSVReader, but another solution (e.g. actual method on that class) might be more appropriate for you.
Your example would then look something like:
max = csvReader.GetRows(100, 150).Max(y => y["A"]);
(Also, I find it weird that when you have limits 100 and 150, you actually want rows between 101 and 149. But I'm assuming you have a reason for that, so I did the same.)
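As far as LINQ is concerned, r.ID is simply a value that is being filtered, so all 30k lines are considered for the Max operation. If the ID is the row index, which seems to be the case here, you can use Skip and Take so that enumeration stops after row 149 instead of running through all 30k rows. (Cast is needed because the reader implements the non-generic IEnumerable; the numbers below assume IDs start at 0 and keep the original exclusive bounds.)
max = csv.Cast<CSVRow>().Skip(101).Take(49).Max(y => y["A"]);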
@DougM is right about the order of evaluation, but in this case what I would do is take a one-time hit on initialization and generate lookups for any "index" fields: basically, precalculate a map (dictionary) of row index to row. That said, this would only be useful if you have many repeated queries for a given index field.
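The precalculated map could look roughly like the sketch below. The Row class, its Id and A members, and the RowIndex helper are hypothetical stand-ins, since the real CSVRow and CSVReader classes are not shown in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a parsed CSV row; not the asker's CSVRow.
public sealed class Row
{
    public int Id { get; set; }
    public double A { get; set; }
}

public static class RowIndex
{
    // One-time cost: map each row ID to its row.
    public static Dictionary<int, Row> Build(IEnumerable<Row> rows)
    {
        return rows.ToDictionary(r => r.Id);
    }

    // Max of column A over IDs strictly between the two bounds.
    // Only the requested IDs are looked up, not the whole file.
    // Returns double.MinValue when no row falls inside the range.
    public static double MaxA(Dictionary<int, Row> index, int idGreaterThan, int idLessThan)
    {
        double max = double.MinValue;
        for (int id = idGreaterThan + 1; id < idLessThan; id++)
        {
            Row row;
            if (index.TryGetValue(id, out row) && row.A > max)
                max = row.A;
        }
        return max;
    }
}
```

The dictionary only pays off when the same index field is queried repeatedly; for a single query the Skip/Take or GetRows approaches avoid the up-front cost.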
I need to count the rows of a column except the duplicate ones
House Number
123
124
11
12
11
11
Total House Number: 4
I have searched but can't find the right syntax for my code.
I tried a dictionary, but it doesn't seem to be the right fit for my code.
I am a complete beginner in C#.
//Total House
int House = 0;
for (int row = 0; row < dataGridView1.Rows.Count; ++row)
{
if ((string)dataGridView1.Rows.[row].Cells("House_Number").Distinct())
{
House++;
}
}
TotalHouse.Text = "Total Houses " + $"{House}";
I tried the code above, but it gives the error "Identifier expected".
Your code has a few potential problems, but let's start with the ones that will prevent it from compiling.
if ((string)dataGridView1.Rows.[row].Cells("House_Number").Distinct())
One problematic bit here is Rows.[row]. There shouldn't be a period there. If you have a period like that, C# will expect an identifier to follow it, not another operator. In this case, you have the [] operator following it, which is invalid. It should probably look like this:
if ((string)dataGridView1.Rows[row].Cells("House_Number").Distinct())
We're getting closer. However, the test inside an if statement must evaluate to a bool--that's true or false. Yours evaluates to a string because you're casting the whole thing to a string. That's because this part runs first:
dataGridView1.Rows[row].Cells("House_Number").Distinct()
Then this part runs:
(string)
So the whole thing becomes a string. We'll have to remove that (string) bit.
Let's take a closer look at dataGridView1.Rows[row].Cells("House_Number").Distinct(). Cells isn't a method--it's a property. That means you can't use the syntax Cells("House_Number"). However, the result of Cells is a DataGridViewCellCollection, which allows [] syntax, so you can do something like Cells["House_Number"].
Distinct() isn't going to give you a bool value--it will give you a collection of unique cells in the form of something called an IEnumerable.
dataGridView1.Rows[row].Cells.Distinct() isn't going to give you distinct cells in a column--it's going to give you distinct cells in a row. That's probably not what you want.
You're probably going to want something that looks like this:
int houses = dataGridView1.Rows
.Cast<DataGridViewRow>()
.Select(r => (int)r.Cells["House_Number"].Value)
.Distinct()
.Count();
Walking through this:
Start with dataGridView1.
Get a DataGridViewRowCollection of rows: .Rows
DataGridViewRowCollection is pretty old, so it implements IEnumerable instead of IEnumerable<DataGridViewRow>. We need to turn it into an IEnumerable<DataGridViewRow>, so we call LINQ's Cast<DataGridViewRow>().
Use LINQ to turn that into an IEnumerable<int>: .Select(r => (int)r.Cells["House_Number"].Value)
a. The argument to Select is a lambda expression. It takes one argument, r, which is a DataGridViewRow. It will return an int.
b. Get the cells for the row: .Cells
c. Get the specific cell we want: ["House_Number"]
d. Get the value of that cell
e. The value is returned as an object; we need to cast it to an int: (int)
Use LINQ to turn that IEnumerable<int> into another one that only has distinct values: .Distinct()
Count our results: .Count()
You'll need a reference to System.Linq for this to work. Put this at the top of your file if it isn't already there:
using System.Linq;
If the grid is bound to a DataTable, you can add a reference to System.Data.DataSetExtensions (available as a NuGet package)
and then use LINQ on the table to select the distinct house numbers:
var table = (DataTable)dataGridView1.DataSource;
var count = table.AsEnumerable().Select(dr => dr["House_Number"]).Distinct().Count();
Otherwise, you could do this with a HashSet, reading the values from the grid rows:
var hs = new HashSet<string>();
foreach (DataGridViewRow row in dataGridView1.Rows)
{
    if (row.IsNewRow) continue; // skip the empty placeholder row at the bottom
    hs.Add(Convert.ToString(row.Cells["House_Number"].Value));
}
TotalHouse.Text = "Total Houses " + hs.Count;
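The HashSet idea can be separated from the grid so it is easier to test. The sketch below is one possible shape, not the only one: DistinctCounter and CountDistinct are made-up names, and the choice to ignore empty cells and to compare values by their trimmed string form is an assumption.

```csharp
using System;
using System.Collections.Generic;

public static class DistinctCounter
{
    // Counts distinct non-empty values, comparing each value by its trimmed string form.
    public static int CountDistinct(IEnumerable<object> values)
    {
        var seen = new HashSet<string>();
        foreach (object value in values)
        {
            // Convert.ToString handles null cell values without throwing.
            string text = (Convert.ToString(value) ?? string.Empty).Trim();
            if (text.Length > 0)
                seen.Add(text);
        }
        return seen.Count;
    }
}
```

With the grid from the question, the cell values could be passed in as dataGridView1.Rows.Cast<DataGridViewRow>().Select(r => r.Cells["House_Number"].Value).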
I'm reading a file and turning each line within it into a class, let's call it Record, and returning each Record as it is read using IEnumerable<Record> and yield return.
Because of this I only start actually performing these reads whenever I do an operation on the enumeration, such as performing a sum on it or iterating through it with a foreach.
I do need to go through each record and then translate that into a database, but due to database design before my time I need the totals on each record in the database, so I need these totals before I start translating them into my database.
At the moment I have five separate .Count() or .Sum() operations on my enumeration before I start iterating the enumeration (example int i = records.Sum(r => r.SomeField) or int j = records.Count(r => r.IsSomethingTrue)). Each one of those counts or sums will loop through the entire file to calculate each one separately. I'm not really happy with this behaviour and would like to find a more efficient way of doing this.
I am using .NET 3.5 if that makes any difference.
You could use your own struct to calculate several values in a single pass through an enumerable object.
public struct ComplexAccumulator
{
public int TotalSumField { get; set; }
public int CountSomethingTrue { get; set; }
}
Now you can use the Aggregate extension method to accumulate the values:
var totals = records.Aggregate(default(ComplexAccumulator), (a, r) => new ComplexAccumulator
{
    TotalSumField = a.TotalSumField + r.SomeField,
    // Parentheses are required: + binds more tightly than ?:
    CountSomethingTrue = a.CountSomethingTrue + (r.IsSomethingTrue ? 1 : 0),
});
Instead of the struct you could use a suitable Tuple instance, e.g. something like Tuple<int, int, int>, but note that Tuple was only introduced in .NET 4, so it is not an option on .NET 3.5.
Efficiency is not a strength of LINQ... You need to replace some LINQ things with manual loops here.
You seem to need two passes over the data. One for aggregation:
var sum = 0; //etc.
foreach (var item in items) {
//compute all 5 aggregates here
}
And then one to translate the data:
items.Select(item => Translate(item, aggregates))
Whether you should buffer items (for example using ToList) or not depends on whether available memory can hold those items or not.
You can use Aggregate to perform all 5 aggregations in one pass but that's not better than a loop in any way. It's slower, far more code and the code arguably is illegible.
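The loop-based first pass could be sketched as follows. Record, Totals and Aggregator are hypothetical names; only SomeField and IsSomethingTrue come from the question, and only three of the five aggregates are shown.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical record shape; member names follow the question's examples.
public sealed class Record
{
    public int SomeField { get; set; }
    public bool IsSomethingTrue { get; set; }
}

// Holds every aggregate that has to be known before translation starts.
public sealed class Totals
{
    public int SumOfSomeField;
    public int CountSomethingTrue;
    public int RecordCount;
}

public static class Aggregator
{
    // Computes all aggregates in one loop, so the file is read once for this pass.
    public static Totals Compute(IEnumerable<Record> records)
    {
        var totals = new Totals();
        foreach (Record r in records)
        {
            totals.SumOfSomeField += r.SomeField;
            if (r.IsSomethingTrue) totals.CountSomethingTrue++;
            totals.RecordCount++;
        }
        return totals;
    }
}
```

The second pass (translation) still enumerates the records again, which re-reads the file unless the records were buffered, for example with ToList.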
I'm trying to figure out the best way to represent some data. It basically follows the form Manufacturer.Product.Attribute = Value. Something like:
Acme.*.MinimumPrice = 100
Acme.ProductA.MinimumPrice = 50
Acme.ProductB.MinimumPrice = 60
Acme.ProductC.DefaultColor = Blue
So the minimum price across all Acme products is 100 except in the case of products A and B. I want to store this data in C# and have some function where GetValue("Acme.ProductC.MinimumPrice") returns 100 but GetValue("Acme.ProductA.MinimumPrice") returns 50.
I'm not sure how to best represent the data. Is there a clean way to code this in C#?
Edit: I may not have been clear. This is configuration data that needs to be stored in a text file then parsed and stored in memory in some way so that it can be retrieved like the examples I gave.
Write the text file exactly like this:
Acme.*.MinimumPrice = 100
Acme.ProductA.MinimumPrice = 50
Acme.ProductB.MinimumPrice = 60
Acme.ProductC.DefaultColor = Blue
Parse it into a path/value pair sequence:
foreach (var pair in File.ReadAllLines(configFileName)
    .Select(l => l.Split('='))
    // Trim removes the spaces around the '=' sign
    .Select(a => new { Path = a[0].Trim(), Value = a[1].Trim() }))
{
    // do something with each pair.Path and pair.Value
}
Now, there are two possible interpretations of what you want to do. The string Acme.*.MinimumPrice could mean that for any lookup where there is no specific override, such as Acme.Toadstool.MinimumPrice, we return 100 - even though there is nothing referring to Toadstool anywhere in the file. Or it could mean that it should only return 100 if there are other specific mentions of Toadstool in the file.
If it's the former, you could store the whole lot in a flat dictionary, and at look up time keep trying different variants of the key until you find something that matches.
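For the first interpretation, a flat dictionary with a fallback key might look like this. It is a minimal sketch: WildcardConfig is a made-up name, and it assumes every path has exactly three dot-separated parts with the wildcard only in the product position.

```csharp
using System;
using System.Collections.Generic;

public sealed class WildcardConfig
{
    private readonly Dictionary<string, string> _values =
        new Dictionary<string, string>(StringComparer.Ordinal);

    public void Set(string path, string value)
    {
        _values[path] = value;
    }

    // Tries the exact key first, then the same key with the product replaced by "*".
    public bool TryGetValue(string path, out string value)
    {
        if (_values.TryGetValue(path, out value))
            return true;

        string[] parts = path.Split('.');
        if (parts.Length == 3)
        {
            string wildcard = parts[0] + ".*." + parts[2];
            return _values.TryGetValue(wildcard, out value);
        }
        return false;
    }
}
```

Under this interpretation a lookup for Acme.Toadstool.MinimumPrice succeeds even though Toadstool never appears in the file.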
If it's the latter, you need to build a data structure of all the names that actually occur in the path structure, to avoid returning values for ones that don't actually exist. This seems more reliable to me.
So going with the latter option, Acme.*.MinimumPrice is really saying "add this MinimumPrice value to any product that doesn't have its own specifically defined value". This means that you can basically process the pairs at parse time to eliminate all the asterisks, expanding it out into the equivalent of a completed version of the config file:
Acme.ProductA.MinimumPrice = 50
Acme.ProductB.MinimumPrice = 60
Acme.ProductC.DefaultColor = Blue
Acme.ProductC.MinimumPrice = 100
The nice thing about this is that you only need a flat dictionary as the final representation and you can just use TryGetValue or [] to look things up. The result may be a lot bigger, but it all depends how big your config file is.
You could store the information more minimally, but I'd go with something simple that works to start with, and give it a very simple API so that you can re-implement it later if it really turns out to be necessary. You may find (depending on the application) that making the look-up process more complicated is worse over all.
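The parse-time expansion described above could be sketched like this. ConfigExpander is a hypothetical name, and the code assumes three-part paths with the asterisk only in the product position; malformed lines are silently skipped, which may or may not be what you want.

```csharp
using System;
using System.Collections.Generic;

public static class ConfigExpander
{
    // Expands "Manufacturer.*.Attribute" lines into one entry per product
    // that is mentioned explicitly somewhere in the input.
    public static Dictionary<string, string> Expand(IEnumerable<string> lines)
    {
        var result = new Dictionary<string, string>();
        var wildcards = new List<KeyValuePair<string[], string>>();
        var products = new Dictionary<string, HashSet<string>>();

        foreach (string line in lines)
        {
            string[] pair = line.Split(new[] { '=' }, 2);
            if (pair.Length != 2) continue;               // skip malformed lines
            string[] parts = pair[0].Trim().Split('.');
            if (parts.Length != 3) continue;
            string value = pair[1].Trim();

            if (parts[1] == "*")
            {
                // Remember wildcard lines; they are applied after all products are known.
                wildcards.Add(new KeyValuePair<string[], string>(parts, value));
                continue;
            }

            HashSet<string> names;
            if (!products.TryGetValue(parts[0], out names))
                products[parts[0]] = names = new HashSet<string>();
            names.Add(parts[1]);
            result[string.Join(".", parts)] = value;
        }

        foreach (KeyValuePair<string[], string> w in wildcards)
        {
            HashSet<string> names;
            if (!products.TryGetValue(w.Key[0], out names)) continue;
            foreach (string product in names)
            {
                string key = w.Key[0] + "." + product + "." + w.Key[2];
                if (!result.ContainsKey(key))             // specific values win
                    result[key] = w.Value;
            }
        }
        return result;
    }
}
```

After expansion, lookups are plain TryGetValue calls on the returned dictionary.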
I'm not entirely sure what you're asking, but it sounds like you're saying one of two things.
I need a function that will return a fixed value, 100, for every product ID except for two cases: ProductA and ProductB
In that case you don't even need a data structure. A simple comparison function will do
int GetValue(string key) {
if ( key == "Acme.ProductA.MinimumPrice" ) { return 50; }
else if (key == "Acme.ProductB.MinimumPrice") { return 60; }
else { return 100; }
}
Or you could have been asking
I need a function that will return a value if already defined or 100 if it's not
In that case I would use a Dictionary<string,int>. For example
class DataBucket {
private Dictionary<string,int> _priceMap = new Dictionary<string,int>();
public DataBucket() {
_priceMap["Acme.ProductA.MinimumPrice"] = 50;
_priceMap["Acme.ProductB.MinimumPrice"] = 60;
}
public int GetValue(string key) {
int price = 0;
if ( !_priceMap.TryGetValue(key, out price)) {
price = 100;
}
return price;
}
}
One way is to create a nested dictionary: Dictionary<string, Dictionary<string, Dictionary<string, object>>>. In your code you would split "Acme.ProductA.MinimumPrice" on the dots and get or set the value in the dictionary that corresponds to the resulting parts.
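A minimal sketch of the nested dictionary, assuming exactly three path segments. NestedConfig, Set and TryGet are hypothetical names, and wildcard handling is deliberately left out.

```csharp
using System;
using System.Collections.Generic;

public sealed class NestedConfig
{
    // manufacturer -> product -> attribute -> value
    private readonly Dictionary<string, Dictionary<string, Dictionary<string, object>>> _root =
        new Dictionary<string, Dictionary<string, Dictionary<string, object>>>();

    public void Set(string path, object value)
    {
        string[] p = Split(path);
        Dictionary<string, Dictionary<string, object>> products;
        if (!_root.TryGetValue(p[0], out products))
            _root[p[0]] = products = new Dictionary<string, Dictionary<string, object>>();
        Dictionary<string, object> attributes;
        if (!products.TryGetValue(p[1], out attributes))
            products[p[1]] = attributes = new Dictionary<string, object>();
        attributes[p[2]] = value;
    }

    public bool TryGet(string path, out object value)
    {
        string[] p = Split(path);
        value = null;
        Dictionary<string, Dictionary<string, object>> products;
        Dictionary<string, object> attributes;
        // Each level is only consulted if the previous one was found.
        return _root.TryGetValue(p[0], out products)
            && products.TryGetValue(p[1], out attributes)
            && attributes.TryGetValue(p[2], out value);
    }

    private static string[] Split(string path)
    {
        string[] parts = path.Split('.');
        if (parts.Length != 3)
            throw new ArgumentException("Expected Manufacturer.Product.Attribute", "path");
        return parts;
    }
}
```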
Another way is to use LINQ to XML: you can create an XDocument with Acme as the root node and the products as its children, and store the attributes either as XML attributes on the products or as child nodes. I prefer the second solution, but it would be slower if you have thousands of products.
I would take an OOP approach to this. The way that you explain it is all your Products are represented by objects, which is good. This seems like a good use of polymorphism.
I would have all products derive from a ProductBase that has a virtual property providing the default:
public virtual int MinimumPrice { get { return 100; } }
And then your specific products, such as ProductA, will override it:
public override int MinimumPrice { get { return 50; } }
According to the requirement, we have to return a collection either in reverse order or as it is. We are beginner-level programmers and designed the collection as follows (a sample is given):
namespace Linqfying
{
class linqy
{
static void Main()
{
InvestigationReport rpt=new InvestigationReport();
// rpt.GetDocuments(true) refers
// to return the collection in reverse order
foreach( EnquiryDocument doc in rpt.GetDocuments(true) )
{
// printing document title and author name
}
}
}
class EnquiryDocument
{
string _docTitle;
string _docAuthor;
// properties to get and set doc title and author name goes below
public EnquiryDocument(string title,string author)
{
_docAuthor = author;
_docTitle = title;
}
public EnquiryDocument(){}
}
class InvestigationReport
{
EnquiryDocument[] docs=new EnquiryDocument[3];
public IEnumerable<EnquiryDocument> GetDocuments(bool IsReverseOrder)
{
/* some business logic to retrieve the document
docs[0]=new EnquiryDocument("FundAbuse","Margon");
docs[1]=new EnquiryDocument("Sexual Harassment","Philliphe");
docs[2]=new EnquiryDocument("Missing Resource","Goel");
*/
//if reverse order is preferred
if(IsReverseOrder)
{
for (int i = docs.Length; i != 0; i--)
yield return docs[i-1];
}
else
{
foreach (EnquiryDocument doc in docs)
{
yield return doc;
}
}
}
}
}
Questions:
Can we use another collection type to improve efficiency?
Would mixing the collection with LINQ reduce the code? (We are not familiar with LINQ.)
Looks fine to me. Yes, you could use the Reverse extension method... but that won't be as efficient as what you've got.
How much do you care about the efficiency though? I'd go with the most readable solution (namely Reverse) until you know that efficiency is a problem. Unless the collection is large, it's unlikely to be an issue.
If you've got the "raw data" as an array, then your use of an iterator block will be more efficient than calling Reverse. The Reverse method will buffer up all the data before yielding it one item at a time - just like your own code does, really. However, simply calling Reverse would be a lot simpler...
Aside from anything else, I'd say it's well worth you learning LINQ - at least LINQ to Objects. It can make processing data much, much cleaner than before.
Two questions:
Does the code you currently have work?
Have you identified this piece of code as being your performance bottleneck?
If the answer to either of those questions is no, don't worry about it. Just make it work and move on. There's nothing grossly wrong about the code, so no need to fret! Spend your time building new functionality instead. Save LINQ for a new problem you haven't already solved.
Actually this task seems pretty straightforward. I'd actually just use the Reverse method on a Generic List.
This should already be well-optimized.
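Your GetDocuments method has a return type of IEnumerable<EnquiryDocument>, so there is no need to loop over your array when IsReverseOrder is false; you can return it as is, because an array implements IEnumerable<T>. (An iterator block cannot return a collection directly, so this means dropping the yield statements from that method.)
As for when IsReverseOrder is true, you can use either Array.Reverse (which reverses the array in place) or the LINQ Reverse() extension method to reduce the amount of code.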
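A minimal sketch of that idea, written generically so it does not depend on EnquiryDocument. DocumentOrdering and InOrder are made-up names; Enumerable.Reverse is called explicitly so that the LINQ method is the one being used.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DocumentOrdering
{
    // Returns the array as-is, or a reversed view of it when requested.
    // No yield is used, so the array itself can be returned directly.
    public static IEnumerable<T> InOrder<T>(T[] items, bool reverse)
    {
        return reverse ? Enumerable.Reverse(items) : items;
    }
}
```

GetDocuments could then end with a call such as DocumentOrdering.InOrder(docs, IsReverseOrder) once the documents have been loaded.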
I have a large collection of strings (up to 1M) alphabetically sorted. I have experimented with LINQ queries against this collection using HashSet, SortedDictionary, and Dictionary. I am static caching the collection, it's up to 50MB in size, and I'm always calling the LINQ query against the cached collection. My problem is as follows:
Regardless of collection type, performance is much poorer than SQL (up to 200ms). When doing a similar query against the underlying SQL tables, performance is much quicker ( 5-10ms). I have implemented my LINQ queries as follows:
public static string ReturnSomething(string query, int limit)
{
StringBuilder sb = new StringBuilder();
foreach (var stringitem in MyCollection.Where(
x => x.StartsWith(query) && x.Length > query.Length).Take(limit))
{
sb.Append(stringitem);
}
return sb.ToString();
}
It is my understanding that the HashSet, Dictionary, etc. implement lookups using binary tree search instead of the standard enumeration. What are my options for high performance LINQ queries into the advanced collection types?
In your current code you don't make use of any of the special features of the Dictionary / SortedDictionary / HashSet collections, you are using them the same way that you would use a List. That is why you don't see any difference in performance.
If you use a dictionary as index where the first few characters of the string is the key and a list of strings is the value, you can from the search string pick out a small part of the entire collection of strings that has possible matches.
I wrote the class below to test this. If I populate it with a million strings and search with an eight character string it rips through all possible matches in about 3 ms. Searching with a one character string is the worst case, but it finds the first 1000 matches in about 4 ms. Finding all matches for a one character strings takes about 25 ms.
The class creates indexes for 1, 2, 4 and 8 character keys. If you look at your specific data and what you search for, you should be able to select what indexes to create to optimise it for your conditions.
public class IndexedList {
private class Index : Dictionary<string, List<string>> {
private int _indexLength;
public Index(int indexLength) {
_indexLength = indexLength;
}
public void Add(string value) {
if (value.Length >= _indexLength) {
string key = value.Substring(0, _indexLength);
List<string> list;
if (!this.TryGetValue(key, out list)) {
Add(key, list = new List<string>());
}
list.Add(value);
}
}
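public IEnumerable<string> Find(string query, int limit) {
List<string> list;
// If no string shares this key there are no matches (the indexer would throw)
if (!this.TryGetValue(query.Substring(0, _indexLength), out list)) {
return Enumerable.Empty<string>();
}
return list
.Where(s => s.Length > query.Length && s.StartsWith(query))
.Take(limit);
}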
}
private Index _index1;
private Index _index2;
private Index _index4;
private Index _index8;
public IndexedList(IEnumerable<string> values) {
_index1 = new Index(1);
_index2 = new Index(2);
_index4 = new Index(4);
_index8 = new Index(8);
foreach (string value in values) {
_index1.Add(value);
_index2.Add(value);
_index4.Add(value);
_index8.Add(value);
}
}
public IEnumerable<string> Find(string query, int limit) {
if (query.Length >= 8) return _index8.Find(query, limit);
if (query.Length >= 4) return _index4.Find(query,limit);
if (query.Length >= 2) return _index2.Find(query,limit);
return _index1.Find(query, limit);
}
}
I bet you have an index on the column so SQL Server can do the comparison in O(log(n)) operations rather than O(n). To imitate the SQL Server behavior, use a sorted collection, find the first string s such that s >= query, and then look at the following values until you find one that does not start with query, applying any additional filter to the values in between. This is what is called a range scan (Oracle) or an index seek (SQL Server).
This is some example code which is very likely to go into infinite loops or have off-by-one errors because I didn't test it, but you should get the idea.
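// Note: the list must be sorted in ordinal order before being passed to this function
IEnumerable<string> FindStringsThatStartWith(List<string> list, string query) {
    // Binary search for the first element that is >= query
    int low = 0, high = list.Count;
    while (low < high) {
        int mid = low + (high - low) / 2;
        // Strings cannot be compared with <, so use an ordinal comparison
        if (string.CompareOrdinal(list[mid], query) < 0)
            low = mid + 1;
        else
            high = mid;
    }
    // Scan forward while the elements still start with the query
    for (; low < list.Count && list[low].StartsWith(query, StringComparison.Ordinal); low++) {
        // Skip (rather than stop at) an exact match of the query itself
        if (list[low].Length > query.Length)
            yield return list[low];
    }
}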
If you're doing a "starts with", you only care about ordinal comparisons, and you can have the collection sorted (again in ordinal order) then I would suggest you have the values in a list. You can then binary search to find the first value which starts with the right prefix, then go down the list linearly yielding results until the first value which doesn't start with the right prefix.
In fact, you could probably do another binary search for the first value which doesn't start with the prefix, so you'd have a start and an end point. Then you just need to apply the length criterion to that matching portion. (I'd hope that if it's sensible data, the prefix matching is going to get rid of most candidate values.) The way to find the first value which doesn't start with the prefix is to search for the lexicographically-first value which doesn't - e.g. with a prefix of "ABC", search for "ABD".
None of this uses LINQ, and it's all very specific to your particular case, but it should work. Let me know if any of this doesn't make sense.
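The two-binary-search idea could be sketched as follows. PrefixRange, LowerBound and Find are hypothetical names; the list is assumed to be sorted in ordinal order, the prefix is assumed to be non-empty, and its last character is assumed not to be char.MaxValue (otherwise it cannot be incremented).

```csharp
using System;
using System.Collections.Generic;

public static class PrefixRange
{
    // Index of the first element that is >= key in ordinal order.
    private static int LowerBound(List<string> sorted, string key)
    {
        int low = 0, high = sorted.Count;
        while (low < high)
        {
            int mid = low + (high - low) / 2;
            if (string.CompareOrdinal(sorted[mid], key) < 0) low = mid + 1;
            else high = mid;
        }
        return low;
    }

    public static List<string> Find(List<string> sorted, string prefix, int limit)
    {
        if (string.IsNullOrEmpty(prefix))
            throw new ArgumentException("Prefix must be non-empty", "prefix");
        char last = prefix[prefix.Length - 1];
        if (last == char.MaxValue)
            throw new ArgumentException("Last character cannot be incremented", "prefix");

        // Upper key: the prefix with its last character incremented ("ABC" -> "ABD").
        string upper = prefix.Substring(0, prefix.Length - 1) + (char)(last + 1);
        int start = LowerBound(sorted, prefix);
        int end = LowerBound(sorted, upper);

        // Everything in [start, end) starts with the prefix; apply the length filter.
        var results = new List<string>();
        for (int i = start; i < end && results.Count < limit; i++)
        {
            if (sorted[i].Length > prefix.Length)
                results.Add(sorted[i]);
        }
        return results;
    }
}
```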
If you are trying to optimize looking up a list of strings with a given prefix you might want to take a look at implementing a Trie (not to be mistaken with a regular tree) data structure in C#.
Tries offer very fast prefix lookups and have a very small memory overhead compared to other data structures for this sort of operation.
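A small trie might look like the sketch below. The Trie class and its members are made-up names, and using a SortedDictionary per node is a simplification chosen for readability rather than for minimal memory use; a production trie would likely use a more compact node layout.

```csharp
using System;
using System.Collections.Generic;

public sealed class Trie
{
    private sealed class Node
    {
        // Sorted so that results come back in character order.
        public readonly SortedDictionary<char, Node> Children = new SortedDictionary<char, Node>();
        public bool IsWord;
    }

    private readonly Node _root = new Node();

    public void Add(string word)
    {
        Node node = _root;
        foreach (char c in word)
        {
            Node next;
            if (!node.Children.TryGetValue(c, out next))
                node.Children[c] = next = new Node();
            node = next;
        }
        node.IsWord = true;
    }

    // Returns up to limit words that start with prefix and are longer than it.
    public List<string> FindLonger(string prefix, int limit)
    {
        var results = new List<string>();
        Node node = _root;
        foreach (char c in prefix)
        {
            Node next;
            if (!node.Children.TryGetValue(c, out next))
                return results;               // no word has this prefix
            node = next;
        }
        // Start below the prefix node so the prefix itself is never returned.
        foreach (KeyValuePair<char, Node> child in node.Children)
            Collect(child.Value, prefix + child.Key, limit, results);
        return results;
    }

    private static void Collect(Node node, string word, int limit, List<string> results)
    {
        if (results.Count >= limit) return;
        if (node.IsWord) results.Add(word);
        foreach (KeyValuePair<char, Node> child in node.Children)
            Collect(child.Value, word + child.Key, limit, results);
    }
}
```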
About LINQ to Objects in general. It's not unusual to have a speed reduction compared to SQL. The net is littered with articles analyzing its performance.
Just looking at your code, I would say that you should reorder the comparison to take advantage of short-circuiting when using boolean operators:
foreach (var stringitem in MyCollection.Where(
x => x.Length > query.Length && x.StartsWith(query)).Take(limit))
The comparison of length is always going to be an O(1) operation (as the length is being stored as part of the string, it doesn't count each character every time), whereas the call to StartsWith is going to be an O(N) operation, where N is the length of query (or the length of the string, whichever is smaller).
By placing the comparison of length before the call to StartsWith, if that comparison fails, you save yourself some extra cycles which could add up when processing large numbers of items.
I don't think that a lookup table is going to help you here, as lookup tables are good when you are comparing the entire key, not parts of the key, like you are doing with the call to StartsWith.
Rather, you might be better off using a tree structure which is split based on the letters in the words in the list.
However, at that point, you are really just recreating what SQL Server is doing (in the case of indexes) and that would just be a duplication of effort on your part.
I think the problem is that LINQ has no way to use the fact that your sequence is already sorted. In particular, it cannot know that applying the StartsWith function retains the order.
I would suggest using the List.BinarySearch method together with an IComparer<string> that compares only the first query.Length characters (this might be tricky, since it's not clear whether the query string will always be the first or the second parameter to Compare()).
You could even use the standard string comparison: when the item is not found, BinarySearch returns a negative number which you can complement (using ~) to get the index of the first element that is larger than your query.
You then have to start from the returned index (in both directions!) to find all elements matching your query string.
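With the standard comparison, the BinarySearch approach could be sketched like this. BinarySearchPrefix is a hypothetical name, and the list is assumed to be sorted with StringComparer.Ordinal. Under that assumption every longer string with the query as its prefix sorts after the query, so this sketch only scans forward; with a custom prefix-only comparer the hit could land anywhere inside the matching block, and a backward scan would be needed as well.

```csharp
using System;
using System.Collections.Generic;

public static class BinarySearchPrefix
{
    // sorted must be ordered with StringComparer.Ordinal.
    public static List<string> Find(List<string> sorted, string query, int limit)
    {
        int index = sorted.BinarySearch(query, StringComparer.Ordinal);
        if (index < 0)
            index = ~index;   // not found: index of the first element larger than query

        var results = new List<string>();
        for (int i = index; i < sorted.Count && results.Count < limit; i++)
        {
            if (!sorted[i].StartsWith(query, StringComparison.Ordinal))
                break;                          // left the block of matches
            if (sorted[i].Length > query.Length)
                results.Add(sorted[i]);         // skip exact matches of the query
        }
        return results;
    }
}
```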