I've been fighting with this for days; I hope you can point me in the right direction.
This is a recursive, multithreaded algorithm: each thread parses a resource looking for links to other resources and stores them in a ConcurrentBag to be taken out later. To preserve resources, the number of threads is capped by an array of configurable size.
I have a private static ConcurrentBag<string> that is filled by many threads. The threads are Tasks stored in a private static Task[] whose size is configurable (app preferences).
A loop in Main calls TryTake() into a local string variable, url. When that succeeds, it walks the Task[] looking for an empty slot, creates a new Task with url as its state object, and stores it in the array like this:
TaskArray[x] = new Task(FindLinks, url, TaskCreationOptions.LongRunning | TaskCreationOptions.PreferFairness);
The FindLinks is declared as
private static readonly Action<object> FindLinks = input => { ... }
In the main Task[] loop I set url to null before the next TryTake(out url).
My problem is that the state object input, which is passed from url in the main loop, turns out to be null inside the Task's lambda. I've read almost all the MSDN articles about the TPL and can't figure this one out :(
How can I pass a variable (a string) to the Task safely, without a closure (or whatever is happening here)?
Any other ideas for improving this algorithm are welcome too.
Thanks.
Edit:
I have solved the problem by reordering statements and slightly rewriting the code in the main loop; url is no longer assigned null. I suspect the cause was the compiler reordering statements, or preemption. This is what it looks like now, and it no longer causes trouble:
string url;
if (CollectedLinks.TryTake(out url))
{
var queued = false;
while (!queued)
{
// Loops thru the array looking for empty slot (null)
for (byte i = 0; i < TaskArray.Length; i++)
{
if (TaskArray[i] == null)
{
TaskArray[i] = new Task(FindLinks, url, TaskCreationOptions.LongRunning | TaskCreationOptions.PreferFairness);
TaskArray[i].Start(TaskScheduler.Current);
queued = true; break;
}
}
if (!queued)
{
// Loop and clean the array
for (var i = 0; i < TaskArray.Length; i++)
{
if (TaskArray[i] == null)
continue;
if (TaskArray[i].Status == TaskStatus.RanToCompletion || TaskArray[i].Status == TaskStatus.Canceled || TaskArray[i].Status == TaskStatus.Faulted)
{
TaskArray[i].Wait(0);
TaskArray[i] = null;
}
}
}
}
}
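For reference, here is a minimal sketch, separate from the crawler code above, of how the two ways of handing a string to a Task differ. StateDemo, Run and the example URL are made up for illustration. The state argument is stored in the Task when the Task is constructed, whereas a lambda that mentions url captures the variable itself and sees whatever it holds when the task finally runs.

```csharp
using System;
using System.Threading.Tasks;

public static class StateDemo
{
    // Returns what each task observed: [0] via the state argument, [1] via a closure.
    public static string[] Run()
    {
        string url = "http://example.com/a";   // hypothetical URL

        // The reference is copied into the Task when it is constructed.
        var viaState = new Task<string>(state => (string)state, url);

        // The lambda captures the variable itself, not its current value.
        var viaClosure = new Task<string>(() => url);

        url = null;   // reassigned before either task has started

        viaState.Start();
        viaClosure.Start();
        Task.WaitAll(viaState, viaClosure);

        return new[] { viaState.Result, viaClosure.Result };
    }

    public static void Main()
    {
        string[] seen = Run();
        Console.WriteLine("state:   " + (seen[0] ?? "(null)"));
        Console.WriteLine("closure: " + (seen[1] ?? "(null)"));
    }
}
```

In this sketch the first task still sees the original string and the second sees null, so if a null shows up inside the real task it may be worth checking whether any lambda reads url directly instead of going through the state object.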
I have this piece of code in C#:
Thread.BeginCriticalRegion();
if(visitedUrls.Contains(url) || visitedUrls.Where( x => x.Contains(root) ).Count() > 150) {
return;
}
else{
visitedUrls.Add(url);
}
Thread.EndCriticalRegion();
It sits inside a function that is called by several threads at once, which is why I (tried to) make it thread-safe.
The exception "Collection was modified; enumeration operation may not execute" is raised on the if line, but if I reduce it to
if(visitedUrls.Contains(url))
it works fine. Why?
EDIT
This is the actual code:
public void scrapAzienda(String url, String root_url, int depth)
{
if (depth <= 0) return;
var web = new HtmlWeb();
HtmlNode[] nodes = null;
HtmlDocument doc = null;
HtmlNode bodyNode = null;
Thread.BeginCriticalRegion();
if (urlVisitati.Contains(url) || urlVisitati.Where(x => x.Contains(root_url)).Count() > 150)
return;
else
urlVisitati.Add(url);
Thread.EndCriticalRegion();
try
{
doc = web.Load(url, Proxy.getUrl(), Proxy.getPort(), Proxy.getUsername(), Proxy.getPassword());
nodes = doc.DocumentNode.SelectNodes("//a[@href]").ToArray() ?? null;
foreach (HtmlNode item in nodes)
{
Task.Factory.StartNew(() => scrapAzienda(item.Attributes["href"].Value, root_url, depth - 1), TaskCreationOptions.AttachedToParent);
}
GC.Collect();
if (doc != null)
{
bodyNode = doc.DocumentNode.SelectSingleNode("//body");
cercaNumeri(bodyNode.InnerText, url);
cercaEmail(bodyNode.InnerText, url);
}
}
catch (Exception) { }
}
Basically it's just a webscraper.
I think threading is the entirety of your issue. Thread.BeginCriticalRegion() doesn't do what you think it does. From the docs:
Notifies a host that execution is about to enter a region of code in which the effects of a thread abort or unhandled exception might jeopardize other tasks in the application domain.
In other words, it doesn't enforce thread-safety, it just says "if this breaks, it's gonna take everything down with it!"
What you need instead is a basic lock:
lock(someObj)
{
if (urlVisitati.Contains(url) || urlVisitati.Where(x => x.Contains(root_url)).Count() > 150)
{
return;
}
else
{
urlVisitati.Add(url);
}
}
someObj needs to be shared: every thread has to lock on the same object, so a static field is the simplest choice. I usually create a basic object for this purpose at the class level:
private static readonly object SyncLock = new object();
You then use lock with that object: lock(SyncLock). You could also lock on the list itself, but only one thread can hold the lock on a given object at a time, so to help prevent deadlocks your code should ideally be the only thing that ever locks on the object you're syncing on. Can you guarantee that nothing inside the list class itself takes a lock on itself? Rather than worry about it, make your own sync object. No big deal.
This is a crash course in threading with lock. There are other ways of doing this, but this one should work for you.
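To make the idea concrete, here is a minimal sketch of the check-and-add wrapped in a lock. It assumes the visited list is a static List<string>; the VisitedUrls class, TryMarkVisited, RunDemo and the example URLs are names invented for this illustration, and the limit of 150 is taken from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class VisitedUrls
{
    private static readonly object SyncLock = new object();
    private static readonly List<string> urlVisitati = new List<string>();
    private const int MaxPerRoot = 150;   // limit taken from the question

    // Returns true if the caller should go on and scrape the URL.
    public static bool TryMarkVisited(string url, string rootUrl)
    {
        lock (SyncLock)
        {
            // The check and the Add happen under the same lock, so no other
            // thread can modify the list while Contains/Count enumerate it.
            if (urlVisitati.Contains(url) ||
                urlVisitati.Count(x => x.Contains(rootUrl)) > MaxPerRoot)
            {
                return false;   // leaving the lock block releases the lock
            }
            urlVisitati.Add(url);
            return true;
        }
    }

    public static int Count
    {
        get { lock (SyncLock) { return urlVisitati.Count; } }
    }

    // Eight parallel callers each offer the same 50 hypothetical URLs.
    public static int RunDemo()
    {
        Parallel.For(0, 8, _ =>
        {
            for (int i = 0; i < 50; i++)
                TryMarkVisited("http://example.com/page" + i, "example.com");
        });
        return Count;
    }

    public static void Main()
    {
        Console.WriteLine(RunDemo());   // 50: each URL is recorded once
    }
}
```

Unlike a BeginCriticalRegion/EndCriticalRegion pair, an early return inside the lock block is harmless, because the lock is released however the block is left.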
My code is a bit complex, but the core is starting threads like this:
Thread task = new Thread(new ParameterizedThreadStart(x => { ThreadReturn = BuildChildNodes(x); }));
task.Start((NodeParameters)tasks[0]);
It should work, but when I check my CPU usage I get barely 10%, so I assume it's using just one core, and barely that.
ThreadReturn, by the way, is a property whose setter I use as a kind of event for when a thread is finished:
public object ThreadReturn
{
set
{
lock (thisLock)
{
NodeReturn result = (NodeReturn)value;
if (result.states.Count == 0) return;
Content[result.level + 1].AddRange(result.states);
if (result.level + 1 >= MaxDepth) return;
for (int i = 0; i < result.states.Count; i++)
{
Thread newTask = new Thread(new ParameterizedThreadStart(x => ThreadReturn = BuildChildNodes(x)));
NodeParameters param = new NodeParameters()
{
level = result.level + 1,
node = Content[result.level + 1].Count - (i + 1),
turn = SkipOpponent ? StartTurn : !result.turn
};
if (tasks.Count > 100)
unstarted.Add(param);
else
{
newTask.Start(param);
tasks.Add(newTask);
}
}
}
}
}
I got some crazy error about a mark stack overflow, so I limited the maximum number of parallel tasks by putting the extra ones into a second list...
I'm not very experienced with multithreading, so this code is a little messy... Maybe you can show me a better way that actually uses my cores.
By the way, it's not the lock's fault. I tried without it before and got the same result.
Edit: this is my code from before I switched to the Thread class. I find it more suitable:
Content.Clear();
Content.Add(new List<T> { Root });
for (var i = 0; i < maxDepth; i++)
Content.Add(new List<T>());
Task<object> firstTask = new Task<object>(x => BuildChildNodes(x), (new NodeParameters() { level = 0, node = 0, turn = Turn }));
firstTask.Start();
tasks.Add(firstTask);
while (tasks.Count > 0 && Content.Last().Count == 0)
{
Task.WaitAny(tasks.ToArray());
for (int task = tasks.Count - 1; task >= 0; task--)
{
if (tasks[task].IsCompleted)
{
NodeReturn result = (NodeReturn)tasks[task].Result;
tasks.RemoveAt(task);
Content[result.level + 1].AddRange(result.states);
if (result.level + 1 >= maxDepth) continue;
for (int i = 0; i < result.states.Count; i++)
{
Task<object> newTask = new Task<object>(x => BuildChildNodes(x), (object)new NodeParameters() { level = result.level + 1, node = Content[result.level + 1].Count - (i + 1), turn = SkipOpponent ? Turn : !result.turn });
newTask.Start();
}
}
}
}
In every state I calculate the children, and in my main thread I put them into my state tree while waiting for the tasks to finish. Please assume I actually use the return value of WaitAny; I did a git reset and now... well... it's gone ^^
Edit:
Okay, I don't know what exactly I did wrong, but in general everything was a total mess. I have now implemented the deep construction method and, maybe because there's much less "traffic" now, my whole code runs in 200 ms. So thanks for this!
I don't know whether I should delete this question for its stupidity or whether you want to post answers so I can upvote them. You really helped me a lot :)
Disregarding all the other issues you have here, essentially your lock ruins the show.
What you are saying is: hey, random thread, go and do some stuff, just make sure you don't do it at the same time as anyone else (the lock). You could have 1000 threads, but only one of them is going to be active at a time, on one core, hence your results.
Here are some other thoughts:
Get the gunk out of the setter; this would fail any sane code review.
Use Tasks instead of Threads.
Think about what actually needs thread safety and lock only that. Take a look at the Interlocked class for atomic manipulation of numeric values.
Take a look at the concurrent collections; you may get more mileage out of them.
Simplify your code.
I can't give any more advice, as it's just about impossible to know what you are trying to do.
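As a rough illustration of the last few suggestions, here is a minimal sketch that does the work in parallel with no lock at all. It assumes the per-node work is independent of the other nodes; LockFreeDemo, Run and Expand are hypothetical names, and Expand is only a stand-in for something like BuildChildNodes.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

public static class LockFreeDemo
{
    // Hypothetical stand-in for the expensive per-node work.
    private static int Expand(int node) => node * 2;

    // Returns (number of nodes processed, number of results collected).
    public static (int processed, int collected) Run(int nodeCount)
    {
        int processed = 0;
        var results = new ConcurrentBag<int>();

        Parallel.For(0, nodeCount, node =>
        {
            int child = Expand(node);               // runs in parallel, no lock held
            results.Add(child);                     // thread-safe collection
            Interlocked.Increment(ref processed);   // atomic counter update
        });

        return (processed, results.Count);
    }

    public static void Main()
    {
        var (processed, collected) = Run(1000);
        Console.WriteLine(processed + " " + collected);   // 1000 1000
    }
}
```

The expensive part runs outside any critical section; only the tiny bookkeeping steps are synchronized, and those are handled by the collection and by Interlocked.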
I am building a web-scraping project.
I have two lists:
private ConcurrentQueue<string> links = new ConcurrentQueue<string>();
private ConcurrentQueue<string> Visitedlinks = new ConcurrentQueue<string>();
One for all the links that I find on a page, and one that will hold all the links I have scraped.
The method that handles the business:
public async Task GetUrlContent(string url)
{
var page = string.Empty;
try
{
page = await service.Get(url);
if (page != string.Empty)
{
Regex regex = new Regex(@"<a[^>]*?href\s*=\s*[""']?([^'"" >]+?)[ '""][^>]*?>",
RegexOptions.Singleline | RegexOptions.CultureInvariant);
if (regex.IsMatch(page))
{
Console.WriteLine("Downloading url: " + url);
for (int i = 0; i < regex.Matches(page).Count; i++)
{
if (regex.Matches(page)[i].Groups[1].Value.StartsWith("/"))
{
if (!links.Contains(BaseUrl + regex.Matches(page)[i].Groups[1].Value.ToLower().Replace(".html", "")) &&
!Visitedlinks.Contains(BaseUrl + regex.Matches(page)[i].Groups[1].Value.ToLower()))
{
Uri ValidUri = GetUrl(regex.Matches(page)[i].Groups[1].Value);
if (ValidUri != null && HostUrls.Contains(ValidUri.Host))
links.Enqueue(regex.Matches(page)[i].Groups[1].Value.ToLower().Replace(".html", ""));
else
links.Enqueue(BaseUrl + regex.Matches(page)[i].Groups[1].Value.ToLower().Replace(".html", ""));
}
}
}
}
var results = links.Where(m => !Visitedlinks.Contains(m)); // problem here: I get the same values multiple times
if (!results.Any())
{
// do nothing
}
else
{
Parallel.ForEach(results, new ParallelOptions { MaxDegreeOfParallelism = 4 },
webpage =>
{
if (ValidUrl(webpage))
{
if (!Visitedlinks.Contains(webpage))
{
Visitedlinks.Enqueue(webpage);
GetUrlContent(webpage).Wait();
}
}
});
}
}
}
catch (Exception e)
{
throw;
}
}
The problem is here:
var results = links.Where(m => !Visitedlinks.Contains(m));
On the first iteration I might get:
link1, link2, link3, link4
On the second iteration:
link2, link3, link4, link5, link6, link7
On the third:
link3, link4, link5, link6, etc.
This means that I get the same links multiple times, since this is a Parallel.ForEach that runs several operations at once. I can't figure out how to make sure I don't get duplicates.
Can anyone lend a helping hand?
If I understand correctly, the first queue contains the links you want to scrape, and the second queue contains the ones you have scraped.
The problem is that you're trying to iterate over the contents of your ConcurrentQueue:
var results = links.Where(m => !Visitedlinks.Contains(m));
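This won't work predictably if you're accessing these queues from multiple threads. Enumerating a ConcurrentQueue only gives you a snapshot of its contents, and nothing here ever removes a link once it has been seen, so several threads can pick up the same links.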
What you should do is take items out of the queue and process them. What stands out is that TryDequeue doesn't appear anywhere in your code. Items are going into the queue but never coming out. The whole purpose of a queue is that we put things in and take them out. ConcurrentQueue makes it safe for multiple threads to put items in and take them out without stepping all over each other.
If you dequeue a link that you want to process:
string linkToProcess = null;
if(links.TryDequeue(out linkToProcess)) // if this returns false, the queue was empty
{
// process it
}
Then as soon as you've taken an item out of the queue to process it, it won't be in the queue anymore. Other threads don't have to check to see if an item has been processed. They just take the next item out of the queue, if there is one. Two threads won't ever take the same item out of the queue. Only one thread can take a given item out of the queue, because as soon as it does, the item isn't in the queue anymore.
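A minimal sketch of that behaviour, assuming a fixed set of links and a handful of worker tasks (QueueDrainDemo, Run and the link names are invented for the example): every worker pulls from the same ConcurrentQueue, and the dictionary records how often each link was handled.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

public static class QueueDrainDemo
{
    // Returns, for every link, how many times it was processed.
    public static ConcurrentDictionary<string, int> Run(int linkCount, int workerCount)
    {
        var links = new ConcurrentQueue<string>();
        for (int i = 0; i < linkCount; i++)
            links.Enqueue("link" + i);   // hypothetical links

        var timesProcessed = new ConcurrentDictionary<string, int>();

        Task[] workers = Enumerable.Range(0, workerCount).Select(_ => Task.Run(() =>
        {
            string linkToProcess;
            // TryDequeue hands each item to exactly one caller.
            while (links.TryDequeue(out linkToProcess))
            {
                timesProcessed.AddOrUpdate(linkToProcess, 1, (key, n) => n + 1);
            }
        })).ToArray();

        Task.WaitAll(workers);
        return timesProcessed;
    }

    public static void Main()
    {
        var counts = Run(100, 4);
        // Prints "100 links, max 1": no link was handled twice.
        Console.WriteLine(counts.Count + " links, max " + counts.Values.Max());
    }
}
```

In a real crawler the workers also enqueue new links as they go, so an empty queue does not by itself mean the crawl is finished; the sketch leaves that termination question aside.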
Thanks to @Scott Hannen
The final solution is as follows:
Parallel.ForEach(links, new ParallelOptions { MaxDegreeOfParallelism = 25 },
webpage =>
{
try
{
if (WebPageValidator.ValidUrl(webpage))
{
string linkToProcess = webpage;
if (links.TryDequeue(out linkToProcess) && !Visitedlinks.Contains(linkToProcess))
{
Task obj = Scrape(linkToProcess);
Visitedlinks.Enqueue(linkToProcess);
}
}
}
catch (Exception e)
{
log.Error("Error occured: " + e.Message);
Console.WriteLine("Error occured, check log for further details.");
}
});
I've been writing an AI that plays chess recently. The AI is designed to run two separate instances, each connected to a client server. The server calls a run function for each AI in turn. What I'm trying to do is write code that ponders while the other AI is making its move. However, I've come across an issue. I'll show the code so as to make explaining said issue easier:
public override bool run()
{
PonderSearch ponderSearch = new PonderSearch();
Thread ponderThread;
AIMove bestMove = null;
int Depth = 0;
// Print the board
if (moves.Length < 2)
{
theBoard.Print();
}
if (!FirstTurn)
{
AIMove lastMove = new AIMove(AI.moves[0]);
Depth = ponderSearch.Depth;
foreach (MoveTable result in ponderSearch.TheTable)
{
if (result.TheMove == lastMove)
{
bestMove = result.TheResult.Move;
}
}
// End thread
ponderThread.Abort();
ponderThread.Join();
}
// Looks through information about the players
for (int p = 0; p < players.Length; p++)
{
Console.Write(players[p].getPlayerName());
// if playerID is 0, you're white, if its 1, you're black
if (players[p].getId() == playerID())
{
Console.Write(" (ME)");
}
Console.WriteLine(" time remaining: " + players[p].getTime());
}
AIMove otherPMove = new AIMove();
AIPiece pieceMoved = new AIPiece();
// if there has been a move, print the most recent move
if (moves.Length > 0)
{
// Update the board state with previous move
theBoard = theBoard.Update();
pieceMoved = theBoard.GetPiece((short)moves[0].getToRank(),
(short)moves[0].getToFile());
otherPMove = new AIMove(moves[0], pieceMoved);
if (lastMoves.Count >= 8)
{
lastMoves.RemoveAt(7);
}
lastMoves.Insert(0, otherPMove);
}
// Generate move
Search theSearch = new Search(lastMoves);
if (!FirstTurn)
{
theSearch.Update(Depth, bestMove);
}
AIMove theMove = theSearch.Minimax(theBoard, (short)playerID());
// Update last 8 moves
if (lastMoves.Count >= 8)
{
lastMoves.RemoveAt(7);
}
lastMoves.Insert(0, theMove);
if (theMove != null)
{
Console.WriteLine("Move Chosen:");
theMove.Print();
theBoard = theBoard.Move(theMove, (short)playerID());
}
else
{
Console.WriteLine("No move available");
}
theBoard.Print();
// Begin pondering
ponderSearch = new PonderSearch(lastMoves, (short)playerID(), theBoard, theMove);
ponderThread = new Thread(new ThreadStart(ponderSearch.Ponder));
ponderThread.Start();
FirstTurn = false;
return true;
}
Anyway, as written, the compiler reports multiple errors saying my Thread hasn't been initialized. The point, though, is that the function runs multiple times, and each call is supposed to end, at its start, the thread that was started in the previous call.
Is there any way I can do this?
Thanks,
EDIT: The error I get is:
Error 4 Use of unassigned local variable 'ponderThread' C:\Users\...\AI.CS 52 13 csclient
This has nothing to do with threading. It's a simple scoping issue. A local variable (one declared inside a method) only lives for the duration of a single call: it is typically kept on the stack and is gone when the method exits.
Hence the ponderThread variable no longer exists once run() has returned. The next time the method is entered, the field FirstTurn still holds the value set by the previous call, but ponderThread is a brand-new, unassigned local, and the compiler refuses to let you read it.
A quick fix is to change ponderThread from a local variable to a field (a member variable of the class).
That will, however, give you thread synchronization problems, as you are going to share state between two threads.
I suggest that you read up a bit more on threads before going further.
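A minimal sketch of the field-based version, under some assumptions: PonderingAI, Ponder, ponderSteps, LastPonderSteps and Stop are hypothetical stand-ins for the real search classes, and the thread is stopped cooperatively with a CancellationToken rather than with Thread.Abort, which is a substitution on my part (Abort is not supported on newer .NET runtimes).

```csharp
using System;
using System.Threading;

public class PonderingAI
{
    private Thread ponderThread;                   // field: survives between Run() calls
    private CancellationTokenSource ponderCancel;
    private bool firstTurn = true;
    private int ponderSteps;                       // written by the ponder thread, read after Join

    public int LastPonderSteps { get; private set; }

    public bool Run()
    {
        if (!firstTurn)
        {
            Stop();   // end the thread started by the previous call
        }

        // ... choose and play a move here ...

        // Begin pondering until the next call.
        ponderSteps = 0;
        ponderCancel = new CancellationTokenSource();
        CancellationToken token = ponderCancel.Token;
        ponderThread = new Thread(() => Ponder(token));
        ponderThread.IsBackground = true;          // do not keep the process alive
        ponderThread.Start();

        firstTurn = false;
        return true;
    }

    // Asks the ponder thread to finish and waits for it.
    public void Stop()
    {
        if (ponderThread == null) return;
        ponderCancel.Cancel();
        ponderThread.Join();                       // after Join its writes are visible here
        LastPonderSteps = ponderSteps;
        ponderThread = null;
    }

    // Hypothetical stand-in for the real ponder search.
    private void Ponder(CancellationToken token)
    {
        while (!token.IsCancellationRequested)
        {
            ponderSteps++;
            Thread.Sleep(1);
        }
    }

    public static void Main()
    {
        var ai = new PonderingAI();
        ai.Run();
        Thread.Sleep(50);                          // the opponent is "thinking"
        ai.Run();                                  // stops the first ponder thread, starts another
        Console.WriteLine("steps pondered: " + ai.LastPonderSteps);
        ai.Stop();
    }
}
```

The only state shared between the two threads is ponderSteps, and the main thread reads it only after Join has returned, which keeps the synchronization question small in this sketch; a real search that shares a move table would need more care.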
I have a regular Queue object in C# (4.0) and I'm using BackgroundWorkers that access this Queue.
The code I was using is as follows:
do
{
while (dataQueue.Peek() == null // nothing waiting yet
&& isBeingLoaded == true // and worker 1 still actively adding stuff
)
System.Threading.Thread.Sleep(100);
// otherwise ready to do something:
if (dataQueue.Peek() != null) // because maybe the queue is complete and also empty
{
string companyId = dataQueue.Dequeue();
processLists(companyId);
// use up the stuff here //
} // otherwise nothing was there yet, it will resolve on the next loop.
} while (isBeingLoaded == true // still have stuff coming at us
|| dataQueue.Peek() != null); // still have stuff we haven’t done
However, I guess that when dealing with threads I should be using a ConcurrentQueue.
Are there examples of how to use a ConcurrentQueue in a do-while loop like the one above?
Nothing I tried with TryPeek worked.
Any ideas?
You can use a BlockingCollection<T> as a producer-consumer queue.
My answer makes some assumptions about your architecture, but you can probably mold it as you see fit:
public void Producer(BlockingCollection<string> ids)
{
// assuming this.CompanyRepository exists
foreach (var id in this.CompanyRepository.GetIds())
{
ids.Add(id);
}
ids.CompleteAdding(); // nothing left for our workers
}
public void Consumer(BlockingCollection<string> ids)
{
while (true)
{
string id = null;
try
{
id = ids.Take();
} catch (InvalidOperationException) {
// Take() throws once CompleteAdding() has been called and the collection is empty
}
if (id == null) break;
processLists(id);
}
}
You could spin up as many consumers as you need:
var companyIds = new BlockingCollection<string>();
Producer(companyIds);
Action process = () => Consumer(companyIds);
// 2 workers
Parallel.Invoke(process, process);
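A variation on the same idea, sketched under a few assumptions: the producer runs as its own task so that consumers work while items are still being added, the collection is bounded to 10 items, and each consumer loops with GetConsumingEnumerable instead of catching the exception from Take. PipelineDemo, Run and the "company" ids are invented for the example, and the counter stands in for processLists(id).

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class PipelineDemo
{
    // Returns the number of ids that were consumed.
    public static int Run(int idCount, int consumerCount)
    {
        // Bounded: Add blocks while 10 items are already waiting.
        var ids = new BlockingCollection<string>(boundedCapacity: 10);
        int processed = 0;

        Task producer = Task.Run(() =>
        {
            for (int i = 0; i < idCount; i++)
                ids.Add("company" + i);            // hypothetical ids
            ids.CompleteAdding();                  // nothing left for the workers
        });

        Task[] consumers = Enumerable.Range(0, consumerCount).Select(_ => Task.Run(() =>
        {
            // Blocks while the collection is empty and ends once adding is
            // complete and the last item has been taken.
            foreach (string id in ids.GetConsumingEnumerable())
            {
                Interlocked.Increment(ref processed);   // stand-in for processLists(id)
            }
        })).ToArray();

        producer.Wait();
        Task.WaitAll(consumers);
        return processed;
    }

    public static void Main()
    {
        Console.WriteLine(Run(100, 2));   // 100
    }
}
```

This replaces both the Sleep polling and the isBeingLoaded flag from the question: the collection itself tells the consumers when to wait and when to stop.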