I'm scraping a webpage that contains about 250 table cells (td elements), using WatiN and WatinCSSSelectors.
First I select all td tags with attribute 'width=90%':
var allMainTDs = browser.CssSelectAll("td[width=\"90%\"]");
Then I loop over the result with foreach and add each element to a List. The int counter is there to show which td tag the loop is currently at.
List<Element> eletd = new List<Element>();
int i = 0;
foreach (Element td in allMainTDs)
{
eletd.Add(td);
i++;
Console.WriteLine(i);
}
It reaches the 250th tag fairly quickly, but then it takes about 6 minutes (timed with a Stopwatch object) to move on to the next statement. What is happening here?
You could try this:
var eletd = new List<Element>(allMainTDs);
A foreach loop is roughly equivalent to the following code (not exactly, but close enough):
IEnumerator<T> enumerator = enumerable.GetEnumerator();
try
{
while (enumerator.MoveNext())
{
T element = enumerator.Current;
// here goes the body of the loop
}
}
finally
{
IDisposable disposable = enumerator as System.IDisposable;
if (disposable != null) disposable.Dispose();
}
The behavior you describe points to the cleanup portion of this code. It's possible that the enumerator for the result of the CssSelectAll call has a heavy Dispose method. You could confirm this by replacing your loop with something like the code above and omitting the finally block, or by setting breakpoints to confirm that Dispose takes a long time to run.
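To check the Dispose theory without a debugger, the loop and the cleanup can be timed separately. This is only a diagnostic sketch: the EnumerationTimer class and its Measure method are hypothetical names, not part of WatiN, and Dispose is deliberately called outside a finally block so that it can be timed on its own.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

public static class EnumerationTimer
{
    // Copies the sequence into a list and reports how long the loop itself
    // and the enumerator's Dispose call took, as two separate timings.
    public static (List<T> Items, TimeSpan Loop, TimeSpan Dispose) Measure<T>(IEnumerable<T> source)
    {
        var items = new List<T>();
        IEnumerator<T> enumerator = source.GetEnumerator();

        // Time only the MoveNext/Current part of the iteration.
        var loopWatch = Stopwatch.StartNew();
        while (enumerator.MoveNext())
        {
            items.Add(enumerator.Current);
        }
        loopWatch.Stop();

        // Time the cleanup that foreach would normally run in its finally block.
        var disposeWatch = Stopwatch.StartNew();
        enumerator.Dispose();
        disposeWatch.Stop();

        return (items, loopWatch.Elapsed, disposeWatch.Elapsed);
    }
}
```

Calling EnumerationTimer.Measure(allMainTDs) and printing the two TimeSpan values should show whether the six minutes are spent in the loop or in Dispose.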
If you are on .NET 4.0 and your execution environment allows for parallelism, you could also try
Parallel.ForEach(..);
Related
I have two methods that can do the work for me: one is serial and the other is parallel.
The reason for parallelizing is that there are lots of iterations (about 100,000 or so).
For some reason, the parallel one skips some iterations or runs them twice, and I don't have any clue how to debug it.
The serial method
for(int i = somenum; i >= 0; i-- ){
foreach (var nue in nuelist)
{
foreach (var path in nue.pathlist)
{
foreach (var conn in nue.connlist)
{
Func(conn,path);
}
}
}
}
The parallel method
for(int i = somenum; i >= 0; i-- ){
Parallel.ForEach(nuelist,nue =>
{
Parallel.ForEach(nue.pathlist,path=>
{
Parallel.ForEach(nue.connlist, conn=>
{
Func(conn,path);
});
});
});
}
Inside Path class
Nue firstnue;
public void Func(Conn conn,Path path)
{
List<Conn> list = new(){conn};
list.AddRange(path.list);
_ = new Path(list);
}
public Path(List<Conn> list)
{
//other things
firstnue.pathlist.Add(this);
/*
firstnue is another nue that will be
in the next iteration of for loop
*/
}
Both methods are the same except, of course, for the foreach and Parallel.ForEach loops.
This code is part of the project here (GitHub page).
List<T>, which I assume you use for firstnue.pathlist, isn't thread-safe. That means that when you add or remove items on the same List<T> from multiple threads at the same time, your data can get corrupted. To avoid that problem, the simplest solution is to use a lock, so that multiple threads don't try to modify the list at once.
However, a lock essentially serializes the list operations, and if the only thing you do in Func is change a list, you may not gain much by parallelizing the code. But if you still want to give it a try, you just need to change this:
firstnue.pathlist.Add(this);
to this:
lock (firstnue.pathlist)
{
firstnue.pathlist.Add(this);
}
Thanks to sedat-kapanoglu, I found the problem is really about thread safety. The solution was to change every List<T> to ConcurrentBag<T>.
For everyone who, like me, finds that parallel loops do not work with their collections: the solution is to switch from System.Collections.Generic to System.Collections.Concurrent.
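The two fixes discussed above can be compared side by side. This is a minimal sketch, not the original project's code: ParallelAddDemo and its method names are made up for illustration, and plain integers stand in for the Path objects.

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class ParallelAddDemo
{
    // Option 1: keep List<T> and guard every Add with a lock.
    public static int AddWithLock(int n)
    {
        var list = new List<int>();
        var gate = new object();
        Parallel.ForEach(Enumerable.Range(0, n), i =>
        {
            lock (gate)
            {
                list.Add(i);
            }
        });
        return list.Count;
    }

    // Option 2: use a collection that is designed for concurrent writers.
    public static int AddWithConcurrentBag(int n)
    {
        var bag = new ConcurrentBag<int>();
        Parallel.ForEach(Enumerable.Range(0, n), i => bag.Add(i));
        return bag.Count;
    }
}
```

Both variants end up with exactly n items. Keep in mind that ConcurrentBag<T> is unordered, so it only fits if the order of the stored items does not matter; ConcurrentQueue<T> is an alternative when it does.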
I have been playing around with various implementations of a PriorityQueue class lately, and I have come across some behavior I do not fully understand.
Here is a snippet from the unit test I am running:
PriorityQueue<Int32> priorityQueue = new PriorityQueue<Int32>();
Randomizer r = new Randomizer();
priorityQueue.AddRange(r.GetInts(Int32.MinValue, Int32.MaxValue, r.Next(300, 10000)));
priorityQueue.PopFront(); // Gets called, and works correctly
Int32 numberToPop = priorityQueue.Count / 3;
priorityQueue.PopFront(numberToPop); // Does not get called, an empty IEnumerable<T> (T is an Int32 here) is returned
As I noted in the comments, the PopFront() gets called and operates correctly, but when I try to call the PopFront(numberToPop), the method does not get called at all, as in, it does not even enter the method.
Here are the methods:
public T PopFront()
{
if (items.Count == 0)
{
throw new InvalidOperationException("No elements exist in the queue");
}
T item = items[0];
items.RemoveAt(0);
return item;
}
public IEnumerable<T> PopFront(Int32 numberToPop)
{
Debug.WriteLine("PriorityQueue<T>.PopFront({0})", numberToPop);
if (numberToPop > items.Count)
{
throw new ArgumentException(@"The numberToPop exceeds the number
of elements in the queue", "numberToPop");
}
while (numberToPop-- > 0)
{
yield return PopFront();
}
}
Now, previously, I had implemented the overloaded PopFront function like this:
public IEnumerable<T> PopFront(Int32 numberToPop)
{
Console.WriteLine("PriorityQueue<T>.PopFront({0})", numberToPop);
if (numberToPop > items.Count)
{
throw new ArgumentException(@"The numberToPop exceeds the number
of elements in the queue", "numberToPop");
}
var poppedItems = items.Take(numberToPop);
Clear(0, numberToPop);
return poppedItems;
}
The previous implementation (above) worked as expected. With all that being said, I am obviously aware that my use of the yield statement is incorrect (most likely because I am removing then returning elements in the PopFront() function), but what I am really interested in knowing is why the PopFront(Int32 numberToPop) is never even called and, if it is not called, why then is it returning an empty IEnumerable?
Any help/explanation to why this is occurring is greatly appreciated.
When you use yield return, the compiler creates a state machine for you. Your code won't start executing until you start to enumerate (foreach or ToList) the IEnumerable<T> returned by your method.
From the yield documentation
On an iteration of the foreach loop, the MoveNext method is called for elements. This call executes the body of MyIteratorMethod until the next yield return statement is reached. The expression returned by the yield return statement determines not only the value of the element variable for consumption by the loop body but also the Current property of elements, which is an IEnumerable.
On each subsequent iteration of the foreach loop, the execution of the iterator body continues from where it left off, again stopping when it reaches a yield return statement. The foreach loop completes when the end of the iterator method or a yield break statement is reached.
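The deferred execution described above can be seen in a small stand-alone example. This is a sketch under assumptions: PopQueue<T> is a simplified stand-in for the PriorityQueue<T> in the question (no ordering logic), and PopFrontLazy and PopFrontEager are hypothetical names used only to contrast the two behaviors.

```csharp
using System;
using System.Collections.Generic;

public class PopQueue<T>
{
    private readonly List<T> items;

    public PopQueue(IEnumerable<T> source)
    {
        items = new List<T>(source);
    }

    public int Count => items.Count;

    public T PopFront()
    {
        if (items.Count == 0)
        {
            throw new InvalidOperationException("No elements exist in the queue");
        }
        T item = items[0];
        items.RemoveAt(0);
        return item;
    }

    // Iterator method: none of this body runs (not even the argument check)
    // until the returned sequence is enumerated.
    public IEnumerable<T> PopFrontLazy(int numberToPop)
    {
        if (numberToPop > items.Count)
        {
            throw new ArgumentException("numberToPop exceeds the number of elements", "numberToPop");
        }
        while (numberToPop-- > 0)
        {
            yield return PopFront();
        }
    }

    // Ordinary method: validates and removes the items before returning.
    public List<T> PopFrontEager(int numberToPop)
    {
        if (numberToPop > items.Count)
        {
            throw new ArgumentException("numberToPop exceeds the number of elements", "numberToPop");
        }
        List<T> popped = items.GetRange(0, numberToPop);
        items.RemoveRange(0, numberToPop);
        return popped;
    }
}
```

Calling PopFrontLazy(2) on a four-item queue leaves Count at 4; the items are removed only when the result is consumed by foreach or ToList. Enumerating the same lazy result a second time would pop further items, which is one reason the eager version is usually the safer design for a method with side effects.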
There are a number of different ways to accomplish the same simple loop through the items of an object in C#.
This has made me wonder whether there is any reason, be it performance or ease of use, to use one over the other, or whether it is just down to personal preference.
Take a simple object
var myList = new List<MyObject>();
Let's assume the object is filled and we want to iterate over the items.
Method 1.
foreach(var item in myList)
{
//Do stuff
}
Method 2
myList.ForEach(ml =>
{
//Do stuff
});
Method 3
while (myList.MoveNext())
{
//Do stuff
}
Method 4
for (int i = 0; i < myList.Count; i++)
{
//Do stuff
}
What I was wondering is: does each of these compile down to the same thing? Is there a clear performance advantage to using one over the others?
Or is this just down to personal preference when coding?
Have I missed any?
The answer the majority of the time is it does not matter. The number of items in the loop (even what one might consider a "large" number of items, say in the thousands) isn't going to have an impact on the code.
Of course, if you identify this as a bottleneck in your situation, by all means, address it, but you have to identify the bottleneck first.
That said, there are a number of things to take into consideration with each approach, which I'll outline here.
Let's define a few things first:
All of the tests were run on .NET 4.0 on a 32-bit processor.
TimeSpan.TicksPerSecond on my machine = 10,000,000
All tests were performed in separate unit test sessions, not in the same one (so as not to possibly interfere with garbage collections, etc.)
Here are some helpers that are needed for each test:
The MyObject class:
public class MyObject
{
public int IntValue { get; set; }
public double DoubleValue { get; set; }
}
A method to create a List<T> of any length of MyObject instances:
public static List<MyObject> CreateList(int items)
{
// Validate parameters.
if (items < 0)
throw new ArgumentOutOfRangeException("items", items,
"The items parameter must be a non-negative value.");
// Return the items in a list.
return Enumerable.Range(0, items).
Select(i => new MyObject { IntValue = i, DoubleValue = i }).
ToList();
}
An action to perform for each item in the list (needed because Method 2 uses a delegate, and a call needs to be made to something to measure impact):
public static void MyObjectAction(MyObject obj, TextWriter writer)
{
// Validate parameters.
Debug.Assert(obj != null);
Debug.Assert(writer != null);
// Write.
writer.WriteLine("MyObject.IntValue: {0}, MyObject.DoubleValue: {1}",
obj.IntValue, obj.DoubleValue);
}
A method to create a TextWriter which writes to a null Stream (basically a data sink):
public static TextWriter CreateNullTextWriter()
{
// Create a stream writer off a null stream.
return new StreamWriter(Stream.Null);
}
And let's fix the number of items at one million (1,000,000), which should be large enough to show that, in general, these all have about the same performance impact:
// The number of items to test.
public const int ItemsToTest = 1000000;
Let's get into the methods:
Method 1: foreach
The following code:
foreach(var item in myList)
{
//Do stuff
}
Compiles down into the following:
using (var enumerator = myList.GetEnumerator())
while (enumerator.MoveNext())
{
var item = enumerator.Current;
// Do stuff.
}
There's quite a bit going on there. You have the method calls (and it may or may not be against the IEnumerator<T> or IEnumerator interfaces, as the compiler respects duck-typing in this case) and your // Do stuff is hoisted into that while structure.
Here's the test to measure the performance:
[TestMethod]
public void TestForEachKeyword()
{
// Create the list.
List<MyObject> list = CreateList(ItemsToTest);
// Create the writer.
using (TextWriter writer = CreateNullTextWriter())
{
// Create the stopwatch.
Stopwatch s = Stopwatch.StartNew();
// Cycle through the items.
foreach (var item in list)
{
// Write the values.
MyObjectAction(item, writer);
}
// Write out the number of ticks.
Debug.WriteLine("Foreach loop ticks: {0}", s.ElapsedTicks);
}
}
The output:
Foreach loop ticks: 3210872841
Method 2: .ForEach method on List<T>
The code for the .ForEach method on List<T> looks something like this:
public void ForEach(Action<T> action)
{
// Error handling omitted
// Cycle through the items, perform action.
for (int index = 0; index < Count; ++index)
{
// Perform action.
action(this[index]);
}
}
Note that this is functionally equivalent to Method 4, with one exception, the code that is hoisted into the for loop is passed as a delegate. This requires a dereference to get to the code that needs to be executed. While the performance of delegates has improved from .NET 3.0 on, that overhead is there.
However, it's negligible. The test to measure the performance:
[TestMethod]
public void TestForEachMethod()
{
// Create the list.
List<MyObject> list = CreateList(ItemsToTest);
// Create the writer.
using (TextWriter writer = CreateNullTextWriter())
{
// Create the stopwatch.
Stopwatch s = Stopwatch.StartNew();
// Cycle through the items.
list.ForEach(i => MyObjectAction(i, writer));
// Write out the number of ticks.
Debug.WriteLine("ForEach method ticks: {0}", s.ElapsedTicks);
}
}
The output:
ForEach method ticks: 3135132204
That's actually ~7.5 seconds faster than using the foreach loop. Not completely surprising, given that it uses direct array access instead of using IEnumerable<T>.
Remember though, this translates to 0.0000075740637 seconds per item being saved. That's not worth it for small lists of items.
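One caveat on the arithmetic above: it assumes 10,000,000 ticks per second. Stopwatch.ElapsedTicks is expressed in units of Stopwatch.Frequency, which depends on the machine's timer and is not guaranteed to equal TimeSpan.TicksPerSecond. A small helper makes the conversion explicit; StopwatchTicks and ToSeconds are hypothetical names used for this sketch only.

```csharp
using System;
using System.Diagnostics;

public static class StopwatchTicks
{
    // Converts a Stopwatch tick count to seconds for a given timer frequency
    // (ticks per second).
    public static double ToSeconds(long stopwatchTicks, long frequency)
    {
        return (double)stopwatchTicks / frequency;
    }

    // Same conversion, using the frequency of the timer on the current machine.
    public static double ToSeconds(long stopwatchTicks)
    {
        return ToSeconds(stopwatchTicks, Stopwatch.Frequency);
    }
}
```

With the frequency assumed in the answer, StopwatchTicks.ToSeconds(3210872841 - 3135132204, 10000000) gives roughly 7.57 seconds; on a machine with a different Stopwatch.Frequency the same tick counts would translate to a different duration. Using Stopwatch.Elapsed (a TimeSpan) avoids the conversion altogether.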
Method 3: while (myList.MoveNext())
As shown in Method 1, this is exactly what the compiler does (with the addition of the using statement, which is good practice). You're not gaining anything here by unwinding the code yourself that the compiler would otherwise generate.
For kicks, let's do it anyways:
[TestMethod]
public void TestEnumerator()
{
// Create the list.
List<MyObject> list = CreateList(ItemsToTest);
// Create the writer.
using (TextWriter writer = CreateNullTextWriter())
// Get the enumerator.
using (IEnumerator<MyObject> enumerator = list.GetEnumerator())
{
// Create the stopwatch.
Stopwatch s = Stopwatch.StartNew();
// Cycle through the items.
while (enumerator.MoveNext())
{
// Write.
MyObjectAction(enumerator.Current, writer);
}
// Write out the number of ticks.
Debug.WriteLine("Enumerator loop ticks: {0}", s.ElapsedTicks);
}
}
The output:
Enumerator loop ticks: 3241289895
Method 4: for
In this particular case, you're going to gain some speed, as the list indexer is going directly to the underlying array to perform the lookup (that's an implementation detail, BTW, there's nothing to say that it can't be a tree structure backing the List<T> up).
[TestMethod]
public void TestListIndexer()
{
// Create the list.
List<MyObject> list = CreateList(ItemsToTest);
// Create the writer.
using (TextWriter writer = CreateNullTextWriter())
{
// Create the stopwatch.
Stopwatch s = Stopwatch.StartNew();
// Cycle by index.
for (int i = 0; i < list.Count; ++i)
{
// Get the item.
MyObject item = list[i];
// Perform the action.
MyObjectAction(item, writer);
}
// Write out the number of ticks.
Debug.WriteLine("List indexer loop ticks: {0}", s.ElapsedTicks);
}
}
The output:
List indexer loop ticks: 3039649305
However the place where this can make a difference is arrays. Arrays can be unwound by the compiler to process multiple items at a time.
Instead of doing ten iterations of one item in a ten item loop, the compiler can unwind this into five iterations of two items in a ten item loop.
However, I'm not positive here that this is actually happening (I have to look at the IL and the output of the compiled IL).
Here's the test:
[TestMethod]
public void TestArray()
{
// Create the array.
MyObject[] array = CreateList(ItemsToTest).ToArray();
// Create the writer.
using (TextWriter writer = CreateNullTextWriter())
{
// Create the stopwatch.
Stopwatch s = Stopwatch.StartNew();
// Cycle by index.
for (int i = 0; i < array.Length; ++i)
{
// Get the item.
MyObject item = array[i];
// Perform the action.
MyObjectAction(item, writer);
}
// Write out the number of ticks.
Debug.WriteLine("Array loop ticks: {0}", s.ElapsedTicks);
}
}
The output:
Array loop ticks: 3102911316
It should be noted that, out of the box, ReSharper suggests a refactoring to change the above for statements to foreach statements. That's not to say this is right, but the basis is to reduce the amount of technical debt in code.
TL;DR
You really shouldn't be concerned with the performance of these things, unless testing in your situation shows that you have a real bottleneck (and you'll have to have massive numbers of items to have an impact).
Generally, you should go for what's most maintainable, in which case, Method 1 (foreach) is the way to go.
In regards to the final bit of the question, "Did I miss any?" Yes, and I feel I would be remiss to not mention this even though the question is quite old. While those four ways of doing it will execute in relatively the same amount of time, there is a way not shown above that runs faster than all of them. Quite significantly, in fact, as the number of items in the iterated list increases. It would be the exact same way as the last method but instead of getting .Count in the condition check of the loop, you assign this value to a variable before setting up the loop and use that instead. Which leaves you with something like this:
var countVar = list.Count;
for(int i = 0; i < countVar; i++)
{
//loop logic
}
By doing it this way, you're only looking up a variable value at each iteration, rather than resolving the Count or Length properties, which is considerably less efficient.
I would suggest an even better, not well-known approach for faster iteration over a list. I would recommend first reading about Span<T>. Note that CollectionsMarshal.AsSpan requires .NET 5 or later; it is not available on .NET Framework.
List<MyObject> list = new();
foreach (MyObject item in CollectionsMarshal.AsSpan(list))
{
// Do something
}
Be aware of the caveats:
The CollectionsMarshal.AsSpan method is unsafe and should be used only if you know what you're doing. CollectionsMarshal.AsSpan returns a Span<T> on the private array of List<T>. Iterating over a Span<T> is fast as the JIT uses the same tricks as for optimizing arrays. Using this method, it won't check the list is not modified during the enumeration.
This is a more detailed explanation of what it does behind the scenes and more, super interesting!
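As a complete, compilable illustration of the span-based loop, here is a minimal sketch that assumes .NET 5 or later. SpanIteration and Sum are hypothetical names, and the list must not be resized while the span is in use.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;

public static class SpanIteration
{
    // Sums the list by iterating over a span of its backing array.
    // The list must not be added to or removed from while the span is alive.
    public static long Sum(List<int> list)
    {
        long total = 0;
        Span<int> span = CollectionsMarshal.AsSpan(list);
        foreach (int value in span)
        {
            total += value;
        }
        return total;
    }
}
```

The span covers exactly list.Count elements, so unused capacity in the backing array is never visited.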
At the moment I do it like this:
lock (LockObj)
{
foreach (var o in Oo)
{
var val = o.DateActive;
if (val.AddSeconds(30) < DateTime.Now) Oo.Remove(o);
}
}
and I get this error:
Collection was modified; enumeration operation may not execute
How should this be done?
You have to use a regular for loop.
for (int i = 0; i < Oo.Count; ++i)
{
var val = Oo[i].DateActive;
if (val.AddSeconds(30) < DateTime.Now)
{
Oo.RemoveAt(i);
i--; // since we just removed an element
}
}
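The reason you cannot modify a collection inside a foreach loop is that foreach walks the collection through its enumerator, and for List<T> and most other standard collections that enumerator becomes invalid (and throws on the next MoveNext) as soon as the collection is modified.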
You can't modify a collection while you are enumerating it.
To change it, enumerate a copy and modify the original:
foreach (var k in OO.ToList())
.....
or
use the count and iterate over the collection by index:
for (int i = 0; i < OO.Count(); i++)
.....
You simply cannot modify the collection if you are iterating over it with foreach. You have two options: loop with for instead of foreach, or create another collection and modify that.
This problem is completely unrelated to locking.
If you add/remove elements from a List all iterators pointing to that list become invalid.
One alternative to using an iterator is manually working with indices. Then you can iterate backwards and remove elements with RemoveAt.
for(int i=Oo.Count-1;i>=0;i--)
{
var o=Oo[i];
if (o.DateActive.AddSeconds(30)<DateTime.Now)
Oo.RemoveAt(i);
}
Unfortunately this naive implementation is O(n^2). If you write it in a more complex way, where you first move the surviving elements to their new positions and then truncate the list, it becomes O(n).
But if Oo is a List<T> there is a much better solution: you can use Oo.RemoveAll(o=>o.DateActive.AddSeconds(30)<DateTime.Now). Unfortunately there is no such extension method on IList<T> by default.
I'd write the code like this:
lock (LockObj)
{
DateTime deleteTime=DateTime.Now.AddSeconds(-30);
Oo.RemoveAll(o=>o.DateActive<deleteTime);
}
As a sidenote I'd personally use UTC times instead of local times for such code.
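For an IList<T> that has no RemoveAll, the O(n) approach described above (move the surviving elements forward, then truncate) could be sketched as follows. ListExtensions and RemoveWhere are hypothetical names, not part of the framework, and the O(n) bound assumes an array-backed list where removing the last element is cheap.

```csharp
using System;
using System.Collections.Generic;

public static class ListExtensions
{
    // Removes every item for which shouldRemove returns true and
    // returns the number of removed items.
    public static int RemoveWhere<T>(this IList<T> list, Func<T, bool> shouldRemove)
    {
        // Pass 1: compact the survivors towards the front of the list.
        int write = 0;
        for (int read = 0; read < list.Count; read++)
        {
            T item = list[read];
            if (!shouldRemove(item))
            {
                list[write] = item;
                write++;
            }
        }

        int removed = list.Count - write;

        // Pass 2: truncate from the end, so no element has to be shifted.
        for (int i = list.Count - 1; i >= write; i--)
        {
            list.RemoveAt(i);
        }
        return removed;
    }
}
```

With this helper the cleanup becomes Oo.RemoveWhere(o => o.DateActive < deleteTime) inside the lock. It will throw on fixed-size IList<T> implementations such as arrays, because they do not support RemoveAt.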
class Program
{
private static readonly object LockObj = new object();
static void Main(string[] args)
{
List<OOItem> oo = new List<OOItem>();
oo.Add( new OOItem() { DateActive = DateTime.Now.AddSeconds(-31) });
lock(LockObj)
{
foreach( var item in oo.Where( ooItem => ooItem.DateActive.AddSeconds(30) < DateTime.Now ).ToArray())
{
oo.Remove(item);
}
}
Debug.Assert( oo.Count == 0);
}
}
public class OOItem
{
public DateTime DateActive { get; set; }
}
I'm going to suggest an approach that avoids messing around with decrementing loop indexes and other stuff that makes code difficult to understand.
I think the best bet is to write a nice query and then do a foreach over the result of turning the query into an array:
var inactives = from o in Oo
where o.DateActive.AddSeconds(30) < DateTime.Now
select o;
foreach (var o in inactives.ToArray())
{
Oo.Remove(o);
}
This avoids the issue of the collection changing and makes the code quite a bit more readable.
If you're a little more "functionally" oriented then here's another choice:
(from o in Oo
where o.DateActive.AddSeconds(30) < DateTime.Now
select o)
.ToList()
.ForEach(o => Oo.Remove(o));
Enjoy!
The problem is not related to the lock.
Use a for() loop instead of foreach().
I can't 100% replace your code because your code provides no hint of what collection type "Oo" is. Neither does the name "Oo". Perhaps one of the evils of var keyword overuse? Or maybe I just can't see enough of your code ;)
int size = Oo.Count;
for(int i = 0; i < size; i++){
if (Oo[i].DateActive.AddSeconds(30) < DateTime.Now){
Oo.RemoveAt(i);
i--; // Re-check the element that shifted into this slot.
size--; // Compensate for new size after removal.
}
}
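You can also use Parallel.ForEach, but only over a snapshot and with a lock, because List<T> is not thread-safe: Parallel.ForEach(oO.ToList(), val => { lock (LockObj) { oO.Remove(val); } });
Iterating over the copy avoids the enumerator problem, although the lock serializes the removals, so it is unlikely to be faster than a plain loop.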
In the following C# code snippet
I have a 'while' loop inside a 'foreach' loop and I wish to jump to the next item in 'foreach' when a certain condition occurs.
foreach (string objectName in this.ObjectNames)
{
// Line to jump to when this.MoveToNextObject is true.
this.ExecuteSomeCode();
while (this.boolValue)
{
// 'continue' would jump to here.
this.ExecuteSomeMoreCode();
if (this.MoveToNextObject())
{
// What should go here to jump to next object.
}
this.ExecuteEvenMoreCode();
this.boolValue = this.ResumeWhileLoop();
}
this.ExecuteSomeOtherCode();
}
'continue' would jump to the beginning of the 'while' loop not the 'foreach' loop.
Is there a keyword to use here, or should I just use goto, which I don't really like?
Use the break keyword. That will exit the while loop and continue execution outside it. Since you don't have anything after the while, it would loop around to the next item in the foreach loop.
Actually, looking at your example more closely, you want to be able to advance the foreach loop without exiting the while. You can't do this with a foreach loop, but you can break a foreach loop down into what it actually automates. In .NET, a foreach loop is rendered as a .GetEnumerator() call on the IEnumerable object (which your this.ObjectNames object is).
The foreach loop is basically this:
IEnumerator enumerator = this.ObjectNames.GetEnumerator();
while (enumerator.MoveNext())
{
string objectName = (string)enumerator.Current;
// your code inside the foreach loop would be here
}
Once you have this structure, you can call enumerator.MoveNext() within your while loop to advance to the next element. So your code would become:
IEnumerator enumerator = this.ObjectNames.GetEnumerator();
while (enumerator.MoveNext())
{
while (this.ResumeWhileLoop())
{
if (this.MoveToNextObject())
{
// advance the loop
if (!enumerator.MoveNext())
// if false, there are no more items, so exit
return;
}
// do your stuff
}
}
The following should do the trick
foreach (string objectName in this.ObjectNames)
{
// Line to jump to when this.MoveToNextObject is true.
this.ExecuteSomeCode();
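bool skipToNext = false;
while (this.boolValue)
{
this.ExecuteSomeMoreCode();
if (this.MoveToNextObject())
{
// Leave the while loop and remember why we left it.
skipToNext = true;
break;
}
this.ExecuteEvenMoreCode();
this.boolValue = this.ResumeWhileLoop();
}
if (skipToNext) continue; // continue foreach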
this.ExecuteSomeOtherCode();
}
The break; keyword will exit a loop:
foreach (string objectName in this.ObjectNames)
{
// Line to jump to when this.MoveToNextObject is true.
while (this.boolValue)
{
// 'continue' would jump to here.
if (this.MoveToNextObject())
{
break;
}
this.boolValue = this.ResumeWhileLoop();
}
}
Use goto.
(I guess people will be mad with this response, but I definitely think it's more readable than all other options.)
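For reference, the goto version could look roughly like the sketch below. It is not the asker's class: GotoDemo, Run and skipAtStep are hypothetical stand-ins, with a counter playing the role of this.boolValue and a log replacing the Execute... calls.

```csharp
using System;
using System.Collections.Generic;

public static class GotoDemo
{
    // For each name, runs up to three inner steps. When the inner loop reaches
    // skipAtStep, it jumps past the rest of the outer body to the next name.
    public static List<string> Run(IEnumerable<string> names, int skipAtStep)
    {
        var log = new List<string>();
        foreach (string name in names)
        {
            log.Add(name + ":start");
            int step = 0;
            while (step < 3)
            {
                step++;
                if (step == skipAtStep)
                {
                    // Jump out of the while loop and skip the code after it.
                    goto NextObject;
                }
                log.Add(name + ":step" + step);
            }
            log.Add(name + ":end");

        NextObject:
            ; // A label needs a statement; the foreach then moves to the next item.
        }
        return log;
    }
}
```

The label sits at the very end of the foreach body, so the jump behaves like a continue aimed at the outer loop.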
You can use "break;" to exit the innermost while or foreach.