I have a two-dimensional array, object[,] mData
I need to retrieve the index of a specific string (key) in that object array without using a loop (e.g. foreach, for).
The reason I don't want to use a loop is that I am trying to optimize an existing block of code that uses one; the loop makes the process take a long time because it is handling too much data.
Is there a way to do this?
Code:
object[,] mData = null;
string sKey = String.Empty;
for (int iIndex = 0; iIndex < mData.GetLength(1); iIndex++)
{
    // The keys live in row 0; compare each column's key against sKey.
    if ((string)mData[0, iIndex] == sKey)
    {
        return iIndex;
    }
}
You will need a loop to do a linear search for an element. Even if there were a method you could call to get the index (and I don't think there is), there would still be a loop inside that method.
If you're worried about performance, try a binary search if the data is sorted.
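For example, here is a minimal sketch of such a binary search over the key row of mData, assuming the strings in mData[0, i] are sorted in ordinal order (FindKeyIndex is a hypothetical helper name, not part of the original code):
static int FindKeyIndex(object[,] mData, string sKey)
{
    int lo = 0;
    int hi = mData.GetLength(1) - 1;
    while (lo <= hi)
    {
        int mid = lo + (hi - lo) / 2;
        int cmp = string.CompareOrdinal((string)mData[0, mid], sKey);
        if (cmp == 0) return mid;   // key found at column mid
        if (cmp < 0) lo = mid + 1;  // key, if present, lies to the right
        else hi = mid - 1;          // key, if present, lies to the left
    }
    return -1; // not found
}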
I am trying to optimize a current block of code that uses a loop which causes the process to take a long time since it is managing too much data.
Loops don't necessarily make your code run significantly slower. The core of your problem is that you have too much data, and with that much data some slowness is expected. What you can do is run the time-consuming operation asynchronously so that the UI doesn't freeze.
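For instance, a minimal sketch assuming a WinForms UI; btnSearch_Click, lblResult and FindKeyIndex are hypothetical names used only for illustration:
// Run the slow search on a thread-pool thread so the UI thread is not blocked.
private async void btnSearch_Click(object sender, EventArgs e)
{
    int index = await Task.Run(() => FindKeyIndex(mData, sKey));
    lblResult.Text = index >= 0 ? "Found at column " + index : "Key not found";
}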
Related
My question is based on this question; I had posted an answer on that question here.
This is the code.
var lines = System.IO.File.ReadLines(@"C:\test.txt");
var Minimum = lines.First(); // default shortest line
var Maximum = "";
foreach (string line in lines)
{
    if (Maximum.Length < line.Length)
    {
        Maximum = line;
    }
    if (Minimum.Length > line.Length)
    {
        Minimum = line;
    }
}
And here is an alternative for this code using LINQ (my approach):
var lines = System.IO.File.ReadLines(@"C:\test.txt");
var Maximum = lines.OrderByDescending(a => a.Length).First();
var Minimum = lines.OrderBy(a => a.Length).First();
LINQ is easy to read and implement.
I want to know which one is better for performance.
And how does LINQ work internally for OrderByDescending and OrderBy when ordering by length?
You can read the source code for OrderBy.
Stop micro-optimizing or prematurely optimizing your code. Try to write code that performs correctly; then, if you face a performance problem later, profile your application and see where the problem is. If you have a piece of code that has a performance problem due to finding the shortest and longest string, then start optimizing that part.
We should forget about small efficiencies, say about 97% of the time:
premature optimization is the root of all evil. Yet we should not pass
up our opportunities in that critical 3% - Donald Knuth
File.ReadLines returns an IEnumerable<string>, which means that if you foreach over it, it will hand you the lines one by one. I think the best performance improvement you can make here is to improve how the file is read from disk. If it is small enough to load entirely into memory, use File.ReadAllLines; if not, try reading the file in big chunks that fit in memory. Reading a file line by line causes performance degradation due to the I/O operations against the disk. So the problem here is not how LINQ or the loop performs; the problem is the number of disk reads.
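As an illustration of that single-pass, big-read idea, here is a rough sketch; the 1 MB buffer size is just an assumption to tune for your disk:
// Single pass over the file with a large StreamReader buffer so the disk is hit
// in big sequential reads rather than many tiny ones.
string shortest = null, longest = null;
using (var reader = new System.IO.StreamReader(@"C:\test.txt",
                                               System.Text.Encoding.UTF8,
                                               true,      // detect BOM
                                               1 << 20))  // 1 MB buffer (assumption)
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        if (longest == null || line.Length > longest.Length) longest = line;
        if (shortest == null || line.Length < shortest.Length) shortest = line;
    }
}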
With the second method you are not only sorting the lines twice, you are reading the file twice. That is because File.ReadLines returns an IEnumerable<string>, which clearly shows why you should never enumerate an IEnumerable<> twice unless you know how it was built. If you really want to, add a .ToList() or a .ToArray() to materialize the IEnumerable<> into a collection. And while the first method has a memory footprint of a single line of text (because it reads the file one line at a time), the second method loads the whole file into memory in order to sort it, so it has a much bigger memory footprint; if the file is a few hundred MB, the difference is big. (Note that technically you could have a file with a single 1 GB line of text, so this rule isn't absolute; it holds for reasonable files whose lines are at most a few hundred characters.)
Now... someone will tell you that premature optimization is evil, but I'll tell you that ignorance is twice as evil.
If you know the difference between the two blocks of code, you can make an informed choice between them; otherwise you are just randomly throwing rocks until it seems to work. "Seems to work" is the key phrase here.
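A minimal sketch of that fix, assuming the file fits in memory; the file is now read exactly once (it is still sorted twice, but the disk is no longer hit twice):
var lines = System.IO.File.ReadLines(@"C:\test.txt").ToList(); // materialize once
var Maximum = lines.OrderByDescending(a => a.Length).First();
var Minimum = lines.OrderBy(a => a.Length).First();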
In my opinion, you need to understand a few points before deciding which way is best.
First, suppose we want to solve the problem with LINQ. To write the most optimized code, you must understand deferred execution. Most LINQ methods, such as Select, Where, OrderBy, Skip and Take, use deferred execution: they are not executed until their results are actually needed. They merely build an iterator, and that iterator runs only when we consume it. So how do we make it run? With a foreach (which calls GetEnumerator), or with one of the LINQ methods that materialize results, such as ToList(), First(), FirstOrDefault() or Max().
Understanding this helps us gain some performance.
Now, back to your problem. File.ReadLines returns an IEnumerable<string>, which means it will not read the lines until we need them. In your example you call a sorting method on this object twice, which means the collection is read and sorted twice. Instead, you can sort the collection once, call ToList() to execute the OrderedEnumerable iterator, and then take the first and last elements of the collection that is now physically in memory.
var orderedList = lines
    .OrderBy(a => a.Length) // deferred execution: nothing is executed yet
    .ToList();              // ToList() forces the query to execute
var Maximum = orderedList.Last();
var Minimum = orderedList.First();
By the way, you can find the OrderBy source code here.
It returns an OrderedEnumerable instance, and the sorting happens in its GetEnumerator:
public IEnumerator<TElement> GetEnumerator()
{
    Buffer<TElement> buffer = new Buffer<TElement>(source);
    if (buffer.count > 0)
    {
        EnumerableSorter<TElement> sorter = GetEnumerableSorter(null);
        int[] map = sorter.Sort(buffer.items, buffer.count);
        sorter = null;
        for (int i = 0; i < buffer.count; i++) yield return buffer.items[map[i]];
    }
}
And now, let's come back to another aspect that affects performance. As you can see, LINQ uses another buffer to store the sorted collection. That, of course, takes extra memory, so it is not the most efficient approach.
I have just tried to explain how LINQ works here, but I very much agree with @Dotctor's overall answer. Just don't forget that you can use File.ReadAllLines, which returns not an IEnumerable<string> but a string[].
What does that mean? As I explained at the beginning, with an IEnumerable .NET reads the lines one by one as the enumerator walks the iterator; with a string[], all the lines are already in your application's memory.
The most efficient approach is to avoid LINQ here; the foreach approach needs only one enumeration.
If you want to put the whole file into a collection anyway you could use this:
List<string> orderedLines = System.IO.File.ReadLines(@"C:\test.txt")
    .OrderBy(l => l.Length)
    .ToList();
string shortest = orderedLines.First();
string longest = orderedLines.Last();
Apart from that you should read about LINQ's deferred execution.
Also note that your LINQ approach not only orders all lines twice to get the longest and the shortest, it also needs to read the whole file twice, since File.ReadLines uses a StreamReader (as opposed to ReadAllLines, which reads all lines into an array first).
MSDN:
When you use ReadLines, you can start enumerating the collection of
strings before the whole collection is returned; when you use
ReadAllLines, you must wait for the whole array of strings be returned
before you can access the array
In general that can help to make your LINQ queries more efficient, e.g. if you filter out lines with Where, but in this case it makes things worse.
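For example, this is the kind of query where the streaming behaviour of ReadLines does pay off; the "ERROR" prefix is just a made-up filter for illustration:
// Only the matching lines are kept in memory; the rest stream by and are discarded.
var errorLines = System.IO.File.ReadLines(@"C:\test.txt")
    .Where(l => l.StartsWith("ERROR"))
    .ToList();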
As Jeppe Stig Nielsen has mentioned in a comment, since OrderBy needs to create another buffer collection internally (and ToList a second one), there is another approach that might be more efficient:
string[] allLines = System.IO.File.ReadAllLines(#"C:\test.txt");
Array.Sort(allLines, (x, y) => x.Length.CompareTo(y.Length));
string shortest = allLines.First();
string longest = allLines.Last();
The only drawback of Array.Sort is that it performs an unstable sort, as opposed to OrderBy. So if two lines have the same length, their relative order might not be maintained.
Does anyone know of a way to get the Parallel.ForEach loop to use chunk partitioning instead of what I believe is range partitioning by default? It seems simple when working with arrays, because you can just create a custom partitioner and set load balancing to true.
Since the number of elements in an IEnumerable isn't known until runtime, I can't figure out a good way to get chunk partitioning to work.
Any help would be appreciated. Thanks!
The tasks I'm trying to perform on each object take significantly different amounts of time, and at the end I'm usually waiting hours for the last thread to finish its work. What I'm trying to achieve is to have the parallel loop request chunks along the way instead of pre-allocating items to each thread.
If your IEnumerable were really something that had an indexer (i.e. you could do obj[1] to get an item out), you could do the following:
var rangePartitioner = Partitioner.Create(0, source.Length);
Parallel.ForEach(rangePartitioner, (range, loopState) =>
{
    // Loop over each range element without a delegate invocation.
    for (int i = range.Item1; i < range.Item2; i++)
    {
        var item = source[i];
        // Do work on item
    }
});
However, if it can't do that, you must write a custom partitioner by creating a new class derived from System.Collections.Concurrent.Partitioner<TSource>. That subject is too broad to cover in an SO answer, but you can take a look at this guide on MSDN to get started.
UPDATE: As of .NET 4.5 there is a Partitioner.Create overload that does not buffer data; it has the same effect as a custom partitioner with a maximum range size of 1. With it you won't end up with a single thread sitting on a bunch of queued-up work because it got unlucky with several slow items in a row.
var partitioner = Partitioner.Create(source, EnumerablePartitionerOptions.NoBuffering);
Parallel.ForEach(partitioner, item =>
{
    // Do work
});
This question is a follow up to my previous question regarding binary search (Fast, in-memory range lookup against +5M record table).
I have a sequential text file with over 5M records/lines, in the format below. I need to load it into a Range<int>[] array. How would one do that in a timely fashion?
File format:
start int64,end int64,result int
start int64,end int64,result int
start int64,end int64,result int
start int64,end int64,result int
...
I'm going to assume you have a good disk. Scan through the file once and count the number of entries. If you can guarantee your file has no blank lines, then you can just count the number of newlines in it -- don't actually parse each line.
Now you can allocate your array once with exactly that many entries. This avoids excessive re-allocations of the array:
var numEntries = File.ReadLines(filepath).Count();
var result = new Range<int>[numEntries];
Now read the file again and create your range objects with code something like:
var i = 0;
foreach (var line in File.ReadLines(filepath))
{
    var parts = line.Split(',');
    result[i++] = new Range<int>(long.Parse(parts[0]), long.Parse(parts[1]), int.Parse(parts[2]));
}
return result;
Sprinkle in some error handling as desired. This code is easy to understand. Try it out in your target environment. If it is too slow, then you can start optimizing it. I wouldn't optimize prematurely though because that will lead to much more complex code that might not be needed.
This is a typical (?) producer-consumer problem which can be solved using multiple threads. In your case the producer is reading data from disk and the consumer is parsing the lines and populating the array. I can see two different cases:
Producer is (much) faster than the consumer: in this case you should try using more consumer threads;
Consumer is (much) faster than the producer: you can't do much to speed things up other than changing your hardware configuration, such as buying a faster hard disk or using RAID 0. In this case I wouldn't even use a multithreaded solution, because it's not worth the added complexity.
This question might help you implement that in C#.
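If you do go the multithreaded route, a rough sketch of the first case using BlockingCollection<string> might look like this; the queue capacity and consumer count are assumptions, Range<int> comes from the question, and the results land in an unordered bag, so sort them afterwards if order matters:
var queue = new BlockingCollection<string>(boundedCapacity: 10000);
var results = new ConcurrentBag<Range<int>>();

// Producer: one thread streams lines off the disk.
var producer = Task.Run(() =>
{
    foreach (var line in File.ReadLines(filepath))
        queue.Add(line);
    queue.CompleteAdding();
});

// Consumers: several threads parse lines into Range<int> objects.
var consumers = Enumerable.Range(0, Environment.ProcessorCount)
    .Select(_ => Task.Run(() =>
    {
        foreach (var line in queue.GetConsumingEnumerable())
        {
            var parts = line.Split(',');
            results.Add(new Range<int>(long.Parse(parts[0]),
                                       long.Parse(parts[1]),
                                       int.Parse(parts[2])));
        }
    }))
    .ToArray();

producer.Wait();
Task.WaitAll(consumers);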
Is there any advantage to using this
private static IEnumerable<string> GetColumnNames(this IDataReader reader)
{
    for (int i = 0; i < reader.FieldCount; i++)
        yield return reader.GetName(i);
}
instead of this
private static string[] GetColumnNames(this IDataReader reader)
{
    var columnNames = new string[reader.FieldCount];
    for (int i = 0; i < reader.FieldCount; i++)
        columnNames[i] = reader.GetName(i);
    return columnNames;
}
Here is how I use this method
int orderId = _noOrdinal;
IEnumerable<string> columnNames = reader.GetColumnNames();
if (columnNames.Contains("OrderId"))
    orderId = reader.GetOrdinal("OrderId");
while (reader.Read())
    yield return new BEContractB2CService
    {
        //..................
        Order = new BEOrder
        {
            Id = orderId == _noOrdinal ? Guid.Empty : reader.GetGuid(orderId)
        },
        //............................
The two approaches are quite different so it depends on what you are subsequently going to do with the result I would say.
For example:
The first case requires the data reader to remain open until the result is read; the second doesn't. So how long are you going to hold on to the result, and do you want to leave the data reader open that long?
The first case is less performant if you are definitely going to read the data, but probably more performant if you often don't, particularly if there is a lot of data.
The result from your first case should only be read/iterated/searched once. The second case can be stored and searched multiple times.
If you have a large amount of data then the first case could be used in such a way that you don't need to bring all that data in to memory in one go. But again that really depends on what you do with the IEnumerable in the calling method.
Edit:
Given your use-case, the methods are probably pretty much equivalent for any given measure of 'goodness'. Tables don't tend to have many columns, and your use of .Contains ensures the data will be read every time. Personally I would stick with the array method here, if only because it's a more straightforward approach.
What's the next line of the code... is it looking for a different column name? If so the second case is the only way to go.
One reason off the top of my head: the array version means you have to spend time building the array first. Most of the code's clients may not actually need an array; mostly, I've found, calling code just iterates over the result, in which case why waste time building an array (or a list, as an alternative) you never actually need?
The first one is lazy. That is, your code is not evaluated until you iterate the enumerable; because it is an iterator block, it runs only until it yields a value and then hands control back to the calling code until you ask for the next value via MoveNext. Additionally, with LINQ you can get the second behaviour from the first simply by calling ToArray on it. The reason you might want to do that is to capture the data as it is at the moment of the call, rather than at the moment you iterate, in case the values change in between.
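A tiny illustration of the difference, reusing the GetColumnNames extension from the question:
IEnumerable<string> lazyNames = reader.GetColumnNames(); // nothing is read from the reader yet
string[] snapshot = reader.GetColumnNames().ToArray();   // all names are read right now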
One advantage has to do with memory consumption. If FieldCount were, say, 1 million, the latter would need to allocate an array with 1 million entries, while the former would not.
This benefit depends on how the method is consumed, though. For example, if you are processing a list of files one by one, there is no need to know all the files up front.
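The same trade-off shows up with the files example: Directory.EnumerateFiles streams names one at a time, while Directory.GetFiles builds the whole array up front. The path and ProcessFile call below are hypothetical:
foreach (var path in System.IO.Directory.EnumerateFiles(@"C:\data"))
{
    ProcessFile(path); // each name is produced on demand; no big array is allocated
}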
I was hoping to get some advice on how to speed up the following function. Specifically, I'm hoping to find a faster way to convert numbers (mostly doubles; IIRC there's one int in there) to strings to store as ListView subitems. As it stands, this function takes 9 seconds to process 16 orders! Absolutely insane, especially considering that, with the exception of the call that processes the DateTimes, it's all just string conversion.
I thought it was the actual displaying of the ListView items that was slow, so I did some research and found that adding all subitems to an array and using AddRange was far faster than adding the items one at a time. I implemented the change but got no better speed.
I then added some stopwatches around each line to narrow down exactly what's causing the slowdown. Unsurprisingly, the call to the DateTime function is the biggest slowdown, but I was surprised to see that the string.Format calls were extremely slow as well, and given the number of them, they make up the majority of my time.
private void ProcessOrders(List<MyOrder> myOrders)
{
    lvItems.Items.Clear();
    marketInfo = new MarketInfo();
    ListViewItem[] myItems = new ListViewItem[myOrders.Count];
    string[] mySubItems = new string[8];
    int counter = 0;
    MarketInfo.GetTime();
    CurrentTime = MarketInfo.CurrentTime;
    DateTime OrderIssueDate = new DateTime();
    foreach (MyOrder myOrder in myOrders)
    {
        string orderIsBuySell = "Buy";
        if (!myOrder.IsBuyOrder)
            orderIsBuySell = "Sell";
        var listItem = new ListViewItem(orderIsBuySell);
        mySubItems[0] = (myOrder.Name);
        mySubItems[1] = (string.Format("{0:g}", myOrder.QuantityRemaining) + "/" + string.Format("{0:g}", myOrder.InitialQuantity));
        mySubItems[2] = (string.Format("{0:f}", myOrder.Price));
        mySubItems[3] = (myOrder.Local);
        if (myOrder.IsBuyOrder)
        {
            if (myOrder.Range == -1)
                mySubItems[4] = ("Local");
            else
                mySubItems[4] = (string.Format("{0:g}", myOrder.Range));
        }
        else
            mySubItems[4] = ("N/A");
        mySubItems[5] = (string.Format("{0:g}", myOrder.MinQuantityToBuy));
        string IssueDateString = (myOrder.DateWhenIssued + " " + myOrder.TimeWhenIssued);
        if (DateTime.TryParse(IssueDateString, out OrderIssueDate))
            mySubItems[6] = (string.Format(MarketInfo.ParseTimeData(CurrentTime, OrderIssueDate, myOrder.Duration)));
        else
            mySubItems[6] = "Error getting date";
        mySubItems[7] = (string.Format("{0:g}", myOrder.ID));
        listItem.SubItems.AddRange(mySubItems);
        myItems[counter] = listItem;
        counter++;
    }
    lvItems.BeginUpdate();
    lvItems.Items.AddRange(myItems.ToArray());
    lvItems.EndUpdate();
}
Here's the time data from a sample run:
0: 166686
1: 264779
2: 273716
3: 136698
4: 587902
5: 368816
6: 955478
7: 128981
Where the numbers are equal to the indexes of the array. All other lines were so low in ticks as to be negligible compared to these.
Although I'd like to keep the number formatting of string.Format for pretty output, I'd also like to be able to load a list of orders within my lifetime, so if there's an alternative to string.Format that's considerably faster but without the bells and whistles, I'm all for it.
Edit: Thanks to all of the people who suggested that the MyOrder class might be using getter methods rather than actually storing the values, as I had originally assumed. I checked, and sure enough that was the cause of my slowdown. Although I don't have access to the class to change it, I was able to piggyback onto the method call that populates myOrders, copy each of the values into a list within the same call, and then use that list when populating my ListView. It populates pretty much instantly now. Thanks again.
I find it hard to believe that simple string.Format calls are causing your slowness problems - it's generally a very fast call, especially for nice simple ones like most of yours.
But one thing that might give you a few microseconds...
Replace
string.Format("{0:g}", myOrder.MinQuantityToBuy)
with
myOrder.MinQuantityToBuy.ToString("g")
This will work when you're doing a straight format of a single value, but isn't any good for more complex calls.
I put all the string.Format calls into a loop and was able to run them all a million times in under a second, so your problem isn't string.Format; it's somewhere else in your code.
Perhaps some of these properties have logic in their getter methods? What sort of times do you get if you comment out all the code for the ListView?
It is definitely not string.Format that is slowing you down. I suspect the property accesses on myOrder.
In one of the format calls, try declaring a few local variables, setting them to the properties you are formatting, and passing those locals to your string.Format call instead, then re-time it. You may find that string.Format now runs at lightning speed, as it should.
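A small sketch of that experiment; whether these particular properties are the slow ones is an assumption, so apply the same pattern to whichever line you are timing:
// Copy the properties into locals first, so any slow getter is paid for outside
// the string.Format call being timed.
var remaining = myOrder.QuantityRemaining;
var initial = myOrder.InitialQuantity;
string quantity = string.Format("{0:g}", remaining) + "/" + string.Format("{0:g}", initial);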
Now, property accesses usually don't take much time to run. However, I've seen classes where every property access is logged (for an audit trail). Check whether that is the case here and whether some operation is keeping your property accesses from returning immediately.
If some operation is holding up a property access, try queuing those operations (e.g. queue up the logging calls) and have a background thread execute them, so the property access can return immediately.
Also, never put slow-running code (e.g. elaborate calculations) or code with side effects into a property accessor/getter. People using the class will not expect it to be slow (since most property accesses are fast) or to have side effects (since most property accesses don't). If the access is slow, rename it to a GetXXX() method; if it has side effects, give the method a name that conveys that fact.
Wow. I feel a little stupid now. I've spent hours beating my head against the wall trying to figure out why a simple string operation would take so long. MarketOrders is (I thought) an array of MyOrder objects, populated by an explicit call to a method that is severely restricted in how many times per second it can run. I don't have access to that code to check, but I had been assuming that each myOrder was a simple struct whose member variables were assigned when MarketOrders is populated, so the string.Format calls would simply be acting on existing data. On reading all of the replies that point to the access of the myOrder data as the culprit, I started thinking about it and realized that MarketOrders is likely just an index rather than an array, and the myOrder info is being read on demand. So every time I call an operation on one of its variables, I'm calling the slow lookup method, waiting for it to become eligible to run again, returning to my method, calling the next lookup, and so on. No wonder it's taking forever.
Thanks for all of the replies. I can't believe that didn't occur to me.
I am glad that you got your issue solved. However, I did a small refactoring of your method and came up with this:
private void ProcessOrders(List<MyOrder> myOrders)
{
    lvItems.Items.Clear();
    marketInfo = new MarketInfo();
    ListViewItem[] myItems = new ListViewItem[myOrders.Count];
    string[] mySubItems = new string[8];
    int counter = 0;
    MarketInfo.GetTime();
    CurrentTime = MarketInfo.CurrentTime;
    // ReSharper disable TooWideLocalVariableScope
    DateTime orderIssueDate;
    // ReSharper restore TooWideLocalVariableScope
    foreach (MyOrder myOrder in myOrders)
    {
        string orderIsBuySell = myOrder.IsBuyOrder ? "Buy" : "Sell";
        var listItem = new ListViewItem(orderIsBuySell);
        mySubItems[0] = myOrder.Name;
        mySubItems[1] = string.Format("{0:g}/{1:g}", myOrder.QuantityRemaining, myOrder.InitialQuantity);
        mySubItems[2] = myOrder.Price.ToString("f");
        mySubItems[3] = myOrder.Local;
        if (myOrder.IsBuyOrder)
            mySubItems[4] = myOrder.Range == -1 ? "Local" : myOrder.Range.ToString("g");
        else
            mySubItems[4] = "N/A";
        mySubItems[5] = myOrder.MinQuantityToBuy.ToString("g");
        // This code smells:
        string issueDateString = string.Format("{0} {1}", myOrder.DateWhenIssued, myOrder.TimeWhenIssued);
        if (DateTime.TryParse(issueDateString, out orderIssueDate))
            mySubItems[6] = MarketInfo.ParseTimeData(CurrentTime, orderIssueDate, myOrder.Duration);
        else
            mySubItems[6] = "Error getting date";
        mySubItems[7] = myOrder.ID.ToString("g");
        listItem.SubItems.AddRange(mySubItems);
        myItems[counter] = listItem;
        counter++;
    }
    lvItems.BeginUpdate();
    lvItems.Items.AddRange(myItems.ToArray());
    lvItems.EndUpdate();
}
This method should be further refactored:
Remove outer dependencies, with inversion of control (IoC) in mind and by using dependency injection (DI);
Create a new property "DateTimeWhenIssued" on MyOrder that returns a DateTime. Use it instead of joining two strings (DateWhenIssued and TimeWhenIssued) and then parsing them into a DateTime;
Rename ListViewItem, as this is a built-in class;
ListViewItem should get a constructor that takes the boolean "IsBuyOrder", as in var listItem = new ListViewItem(myOrder.IsBuyOrder), instead of the string "Buy" or "Sell";
The "mySubItems" string array should be replaced with a class for better readability and extensibility (see the sketch below);
Lastly, the foreach (MyOrder myOrder in myOrders) could be replaced with a "for" loop, as you are using a counter anyway; besides, "for" loops are faster too.
Hopefully you do not mind my suggestions and that they are doable in your situation.
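As a rough sketch of the "replace the string array with a class" suggestion above (the class and property names here are my own, not from the question):
public class OrderRow
{
    public string Name { get; set; }
    public string Quantity { get; set; }
    public string Price { get; set; }
    public string Local { get; set; }
    public string Range { get; set; }
    public string MinQuantityToBuy { get; set; }
    public string IssueInfo { get; set; }
    public string Id { get; set; }

    // Convenience method for feeding ListViewItem.SubItems.AddRange(...)
    public string[] ToSubItems()
    {
        return new[] { Name, Quantity, Price, Local, Range, MinQuantityToBuy, IssueInfo, Id };
    }
}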
PS. Are you using generic collections? The ListViewItem.SubItems property could be:
public List<string> SubItems { get; set; }