I need to create a program that scrapes a website, and I used threads to do it.
For example: I have 100 pages and I need to divide them among a configurable number of threads, instead of fetching every page on a single thread:
2 threads - 50 pages/thread
4 threads - 25 pages/thread
I tried the code below, but when each thread gets to its last pages it becomes very slow.
Before asking I tried to find a solution myself but couldn't, so I need help.
int so_thread = 10; // number of threads
int page_du = 0;    // remainder pages for the last thread
List<NameValueCollection> List_item = new List<NameValueCollection>();
Thread[] threads = new Thread[so_thread];
int dem = 0;
await Task.Run(() =>
{
    for (int i = 1; i <= so_thread; i++)
    {
        if ((Int32.Parse(o_sopage.Text) % so_thread) != 0 && i == so_thread)
        {
            page_du = Int32.Parse(o_sopage.Text) % so_thread; // o_sopage.Text == total number of pages to fetch
        }
        threads[i - 1] = new Thread((object data) =>
        {
            Array New_Data = (Array)data;
            int _i = (int)New_Data.GetValue(0);
            int _pagedu = (int)New_Data.GetValue(1);
            int page_per_thread = Int32.Parse(o_sopage.Text) / so_thread;
            for (int j = ((page_per_thread * _i) - page_per_thread) + 1; j <= ((page_per_thread * _i) + _pagedu); j++)
            {
                //MessageBox.Show(j.ToString());
                var TG = ebay.GetPage(j);
                lock (List_item)
                {
                    List_item.AddRange(TG);
                    dem++;
                    progressBar1.Invoke((MethodInvoker)delegate
                    {
                        progressBar1.Value = dem;
                    });
                }
            }
        });
        object DATA = new object[2] { i, page_du };
        threads[i - 1].Start(DATA);
    }
});
Use Parallel.ForEach instead of creating the threads on your own:
Parallel.ForEach(yourCollection, item => { /* your code here */ });
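A minimal sketch of how that could look for the paging scenario above (assuming the question's ebay.GetPage(int) call and o_sopage.Text page count; neither is a real library API, and ConcurrentBag stands in for the lock-protected list):

using System.Collections.Concurrent;
using System.Collections.Specialized;
using System.Linq;
using System.Threading.Tasks;

int totalPages = Int32.Parse(o_sopage.Text); // total pages, as in the question
var items = new ConcurrentBag<NameValueCollection>();

Parallel.ForEach(Enumerable.Range(1, totalPages), page =>
{
    // GetPage is the question's scraping method, not a real library call.
    foreach (NameValueCollection item in ebay.GetPage(page))
        items.Add(item);
});

The runtime partitions the page range across worker threads for you, so there is no remainder bookkeeping, and a straggling last page only delays its own worker rather than a whole fixed block of pages.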
I've written the following producer/consumer code, which should generate a big file filled with random data:
class Program
{
    static void Main(string[] args)
    {
        Random random = new Random();
        String filename = @"d:\test_out";
        long numlines = 1000000;

        var buffer = new BlockingCollection<string[]>(10); // limit the queue so we don't get OOM
        int arrSize = 100; // size of each string chunk in buffer
        String[] block = new string[arrSize];

        Task producer = Task.Factory.StartNew(() =>
        {
            long blockNum = 0;
            long lineStopped = 0;
            for (long i = 0; i < numlines; i++)
            {
                if (blockNum == arrSize)
                {
                    buffer.Add(block);
                    blockNum = 0;
                    lineStopped = i;
                }
                block[blockNum] = random.Next().ToString();
                // null is the sign to stop if the last block is not fully filled
                if (blockNum < arrSize - 1)
                {
                    block[blockNum + 1] = null;
                }
                blockNum++;
            }
            if (lineStopped < numlines)
            {
                buffer.Add(block);
            }
            buffer.CompleteAdding();
        }, TaskCreationOptions.LongRunning);

        Task consumer = Task.Factory.StartNew(() =>
        {
            using (var outputFile = new StreamWriter(filename))
            {
                foreach (string[] chunk in buffer.GetConsumingEnumerable())
                {
                    foreach (string value in chunk)
                    {
                        if (value == null) break;
                        outputFile.WriteLine(value);
                    }
                }
            }
        }, TaskCreationOptions.LongRunning);

        Task.WaitAll(producer, consumer);
    }
}
It mostly does what it is intended to do, but for some reason it produces only ~550,000 strings, not 1,000,000, and I cannot understand why this is happening.
Can someone point out my mistake? I really don't see what's wrong with this code.
The buffer
String[] block = new string[arrSize];
is declared outside the lambda. That means it is captured and re-used.
That would normally go unnoticed (you would just write out the wrong random data), but because your if (blockNum < arrSize - 1) is placed inside the for loop, you regularly write a null into the shared buffer.
As an exercise, instead of:
block[blockNum] = random.Next().ToString();
use
block[blockNum] = i.ToString();
and predict and verify the results.
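A minimal sketch of one possible fix (an assumption about the intended repair, not the only one): give each chunk its own array, so a block handed to the BlockingCollection is never mutated by the producer again.

// Producer body, reworked: allocate a fresh array per chunk.
var block = new string[arrSize];
int blockNum = 0;
for (long i = 0; i < numlines; i++)
{
    block[blockNum++] = random.Next().ToString();
    if (blockNum == arrSize)
    {
        buffer.Add(block);
        block = new string[arrSize]; // fresh array; the old one now belongs to the consumer
        blockNum = 0;
    }
}
if (blockNum > 0)
{
    // Slots past blockNum are still null in the fresh array, which the
    // consumer already treats as end-of-block.
    buffer.Add(block);
}
buffer.CompleteAdding();

With one array per chunk there is no shared mutable state between producer and consumer, so the null end-marker logic only matters for the final, partially filled block.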
I have a nested for loop which takes 30 seconds to run and I'm looking to parallelize it based on the number of cores on my machine.
Original loop:
var currentCap = model.LoanCap;
var currentRlRate = model.RlRate;
var maxRateObj = new Dictionary<string, double>();
var maxRateOuterLoopCount = 0;
var maxRateInnerLoopCount = 0;
for (var i = currentRlRate + rlRateStep; i <= maxRlRate; i += rlRateStep)
{
    maxRateOuterLoopCount++;
    var tempFyy = currentFyy;
    var tempIrr = currentIrr;
    var lowestCapSoFar = currentCap;
    var startingCap = maxRateObj.ContainsKey(capKey) ? maxRateObj[capKey] : currentCap;
    for (var j = startingCap - capStep; j >= minCap; j -= capStep)
    {
        maxRateInnerLoopCount++;
        tempModel = new ApplicationModel(model);
        var tempIrrAndFyy = GetIrrAndFyyTuple(tempModel, i, j, precision);
        var updatedIrr = tempIrrAndFyy.Item1;
        var updatedFyy = tempIrrAndFyy.Item2;
        // stop decrementing cap because we got a good-enough IRR to save this pair
        if (Math.Abs(currentIrr - updatedIrr) >= irrDiffPrecision || updatedFyy < minFyy)
        {
            var endingCap = j + capStep; // go back one step since we just stepped out of bounds
            maxRateObj = new Dictionary<string, double>
            {
                { rlRateKey, i },
                { capKey, endingCap }
            };
            // set vars so the outer loop can check if we are still operating within constraints
            lowestCapSoFar = endingCap;
            tempIrr = updatedIrr;
            tempFyy = updatedFyy;
            break;
        }
    }
    // Break out of the outer loop if the cap gets too low
    if (lowestCapSoFar <= minCap) { break; }
    // ... or if Fyy gets too low (when credit policy is enforced)
    if (enforceFyyPolicy && tempFyy < minFyy) { break; }
    // ... or if Irr gets too low (when credit policy is enforced)
    if (enforceIrrPolicy && Math.Abs(tempIrr - targetIrr) > irrDiffPrecision) { break; }
}
Now when I move this loop into the body of Parallel.For(), I lose the context which I previously had for the variable i... How can I get that functionality back since I need it for my maxRateObj?
var degreeOfParallelism = Environment.ProcessorCount;
var result = Parallel.For(0, degreeOfParallelism, x =>
{
    var tempFyy = currentFyy;
    var tempIrr = currentIrr;
    var lowestCapSoFar = currentCap;
    var startingCap = maxRateObj.ContainsKey(capKey) ? maxRateObj[capKey] : currentCap;
    for (var j = startingCap - capStep; j >= minCap; j -= capStep)
    {
        tempModel = new ApplicationModel(model);
        var tempIrrAndFyy = GetIrrAndFyyTuple(tempModel, i, j, precision); // i IS NOT DEFINED HERE!
        var updatedIrr = tempIrrAndFyy.Item1;
        var updatedFyy = tempIrrAndFyy.Item2;
        // stop decrementing cap because we got a good-enough IRR to save this pair
        if (Math.Abs(currentIrr - updatedIrr) >= irrDiffPrecision || updatedFyy < minFyy)
        {
            var endingCap = j + capStep; // go back one step since we just stepped out of bounds
            maxRateObj = new Dictionary<string, double>
            {
                { rlRateKey, i }, // i IS NOT DEFINED HERE!
                { capKey, endingCap }
            };
            // set vars so the outer loop can check if we are still operating within constraints
            lowestCapSoFar = endingCap;
            tempIrr = updatedIrr;
            tempFyy = updatedFyy;
            break;
        }
    }
    // Break out of the outer loop if the cap gets too low
    if (lowestCapSoFar <= minCap) { return; }
    // ... or if Fyy gets too low (when credit policy is enforced)
    if (enforceFyyPolicy && tempFyy < minFyy) { return; }
    // ... or if Irr gets too low (when credit policy is enforced)
    if (enforceIrrPolicy && Math.Abs(tempIrr - targetIrr) > irrDiffPrecision) { return; }
});
Don't run just degreeOfParallelism parallel iterations. Perform the same number of iterations in your parallel loop as you were doing previously, but spread them over your processors by using ParallelOptions.MaxDegreeOfParallelism.
It looks to me like it's a matter of performing a parallel loop from 0 to numSteps (calculated below), setting the MaxDegreeOfParallelism of your loop, and reconstituting i from the value of x in the loop body. Something like...
var start = currentRlRate + rlRateStep;
var end = maxRlRate;
var numSteps = (int)((end - start) / rlRateStep) + 1; // +1 so the last step reaches maxRlRate
Parallel.For(0,
    numSteps,
    new ParallelOptions
    {
        MaxDegreeOfParallelism = degreeOfParallelism
    },
    x =>
    {
        var i = (x * rlRateStep) + start;
        // lean on i
    });
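One caveat worth adding: maxRateObj is read and reassigned inside the loop body, and with Parallel.For those reads and writes now race across iterations, so they need synchronization. A minimal, self-contained sketch of publishing a shared "best result" under a lock (generic names, not the question's actual model):

using System;
using System.Collections.Generic;
using System.Threading.Tasks;

class BestResultDemo
{
    static void Main()
    {
        var sync = new object();
        Dictionary<string, double> best = null;

        Parallel.For(0, 100, x =>
        {
            double candidate = Math.Sin(x); // stand-in for the per-iteration result
            lock (sync)
            {
                // Publish atomically so no iteration sees a half-updated result.
                if (best == null || candidate > best["value"])
                    best = new Dictionary<string, double> { { "value", candidate } };
            }
        });

        Console.WriteLine(best["value"]);
    }
}

Also note that in the sequential version startingCap depended on the maxRateObj produced by the previous outer iteration; once the iterations run in parallel that ordering guarantee is gone, so it's worth checking whether the algorithm tolerates that.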
I'm French, so first of all, sorry for my English.
I get an error in Visual Studio (IndexOutOfRange); I have this problem only with Parallel.For, not with a classic for.
I think one thread wants to access my array[i] while another thread wants to as well.
It's code that computes K-means clustering to build links between documents (using cosine similarity).
More information:
The IndexOutOfRange is thrown on similarityMeasure[i] = ...
I have a computer with 2 processors (12 logical cores).
With a classic for, CPU usage is 9-14% and one iteration takes ~9 min.
With Parallel.For, CPU usage is 70-90% =p and one iteration takes ~1 min 30 s.
Sometimes it runs longer before generating the error.
My function is:
private static int FindClosestClusterCenter(List<Centroid> clustercenter, DocumentVector obj)
{
    float[] similarityMeasure = new float[clustercenter.Count()];
    float[] copy = similarityMeasure;
    object sync = new Object();
    Parallel.For(0, clustercenter.Count(), i => // was: for (int i = 0; i < clustercenter.Count(); i++)
    {
        similarityMeasure[i] = SimilarityMatrics.FindCosineSimilarity(clustercenter[i].GroupedDocument[0].VectorSpace, obj.VectorSpace);
    });
    int index = 0;
    float maxValue = similarityMeasure[0];
    for (int i = 0; i < similarityMeasure.Count(); i++)
    {
        if (similarityMeasure[i] > maxValue)
        {
            maxValue = similarityMeasure[i];
            index = i;
        }
    }
    return index;
}
My function is called here:
do
{
    prevClusterCenter = centroidCollection;
    DateTime starttime = DateTime.Now;
    foreach (DocumentVector obj in documentCollection) // was: Parallel.ForEach(documentCollection, parallelOptions, obj => ...)
    {
        int ind = FindClosestClusterCenter(centroidCollection, obj);
        resultSet[ind].GroupedDocument.Add(obj);
    }
    TimeSpan tempsecoule = DateTime.Now.Subtract(starttime);
    Console.WriteLine(tempsecoule);
    //Console.ReadKey();
    InitializeClusterCentroid(out centroidCollection, centroidCollection.Count());
    centroidCollection = CalculMeanPoints(resultSet);
    stoppingCriteria = CheckStoppingCriteria(prevClusterCenter, centroidCollection);
    if (!stoppingCriteria)
    {
        // initialize the result for the next iteration
        InitializeClusterCentroid(out resultSet, centroidCollection.Count);
    }
} while (stoppingCriteria == false);
_counter = counter;
return resultSet;
FindCosineSimilarity:
public static float FindCosineSimilarity(float[] vecA, float[] vecB)
{
    var dotProduct = DotProduct(vecA, vecB);
    var magnitudeOfA = Magnitude(vecA);
    var magnitudeOfB = Magnitude(vecB);
    float result = dotProduct / (float)Math.Pow((magnitudeOfA * magnitudeOfB), 2);
    // when 0 is divided by 0 the result is NaN, so return 0 in that case
    if (float.IsNaN(result))
        return 0;
    else
        return (float)result;
}
CalculMeanPoints:
private static List<Centroid> CalculMeanPoints(List<Centroid> _clust)
{
    for (int i = 0; i < _clust.Count(); i++)
    {
        if (_clust[i].GroupedDocument.Count() > 0)
        {
            for (int j = 0; j < _clust[i].GroupedDocument[0].VectorSpace.Count(); j++)
            {
                float total = 0;
                foreach (DocumentVector vspace in _clust[i].GroupedDocument)
                {
                    total += vspace.VectorSpace[j];
                }
                _clust[i].GroupedDocument[0].VectorSpace[j] = total / _clust[i].GroupedDocument.Count();
            }
        }
    }
    return _clust;
}
You may have some side effects in FindCosineSimilarity; make sure it does not modify any field or input parameter. Example: resultSet[ind].GroupedDocument.Add(obj);. If resultSet is not a reference to a locally instantiated array, then that is a side effect.
That may fix it. But FYI, you could use AsParallel for this rather than Parallel.For:
similarityMeasure = clustercenter
    .AsParallel().AsOrdered()
    .Select(c => SimilarityMatrics.FindCosineSimilarity(c.GroupedDocument[0].VectorSpace, obj.VectorSpace))
    .ToArray();
You realize that if you synchronize the whole body of the Parallel.For, it's just the same as having a normal synchronous for loop, right? Meaning the code as it stands wouldn't do anything in parallel, so I don't think you'll have any problems with concurrency. My guess, from what I can tell, is that clustercenter[i].GroupedDocument is probably an empty array.
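If that last guess is right, a quick way to confirm it (a diagnostic sketch using the question's variables, not a fix) is to validate the centroids before entering the parallel loop:

// Diagnostic sketch: fail fast with a clear message instead of an
// exception deep inside the Parallel.For body.
for (int i = 0; i < clustercenter.Count(); i++)
{
    if (clustercenter[i].GroupedDocument == null || clustercenter[i].GroupedDocument.Count() == 0)
        throw new InvalidOperationException("Centroid " + i + " has no grouped documents.");
}

If this throws, the bug is in how the centroids are (re)initialized between iterations, not in the parallel loop itself.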
I have the following code, which I've used up until now to compare a list of file entries to itself by hash code:
for (int i = 0; i < fileLists.SourceFileListBefore.Count; i++) // compare SourceFileList files to themselves
{
    for (int n = i + 1; n < fileLists.SourceFileListBefore.Count; n++) // don't need to do the same comparison twice!
    {
        if (fileLists.SourceFileListBefore[i].targetNode.IsFile && fileLists.SourceFileListBefore[n].targetNode.IsFile)
            if (fileLists.SourceFileListBefore[i].hash == fileLists.SourceFileListBefore[n].hash)
            {
                // do something
            }
    }
}
where SourceFileListBefore is a List.
I want to change this code so it executes in parallel on multiple cores. I thought about doing this with PLINQ, but I'm completely new to LINQ.
I tried
var duplicate = from entry in fileLists.SourceFileListBefore.AsParallel()
                where fileLists.SourceFileListBefore.Any(x => (x.hash == entry.hash) && (x.targetNode.IsFile) && (entry.targetNode.IsFile))
                select entry;
but it won't work like this, because I have to execute code for each pair of entries whose hash codes match. So I would at least need to get a collection of results containing both x and entry from LINQ, not just one entry. Is that possible with PLINQ?
Why don't you look at optimising your code first?
Looking at this statement:
if (fileLists.SourceFileListBefore[i].targetNode.IsFile && fileLists.SourceFileListBefore[n].targetNode.IsFile)
means you can straight away build a single list of files where IsFile == true (making the loop smaller already).
Secondly,
if (fileLists.SourceFileListBefore[i].hash == fileLists.SourceFileListBefore[n].hash)
why don't you build a hash lookup of the hashes first?
Then iterate over your filtered list, looking each entry up in the lookup you created; if a group contains more than one file, there is a match (the current node's hash plus some other node's hash), so you only do work on the matched hashes other than the node itself.
I wrote a blog post about it, which you can read at CodePERF[dot]NET - .NET Nested Loops vs Hash Lookups.
PLINQ will only slightly improve a bad solution to your problem.
Added some comparisons:
Total File Count: 16900
TargetNode.IsFile == true: 11900
Files with Duplicate Hashes = 10000 (5000 unique hashes)
Files with triplicate Hashes = 900 (300 unique hashes)
Files with Unique hashes = 1000
And the actual setup method:
[SetUp]
public void TestSetup()
{
    _sw = new Stopwatch();
    _files = new List<File>();
    int duplicateHashes = 10000;
    int triplicateHashesCount = 900;
    int randomCount = 1000;
    int nonFileCount = 5000;

    for (int i = 0; i < duplicateHashes; i++)
    {
        var hash = i % (duplicateHashes / 2);
        _files.Add(new File { Id = i, Hash = hash.ToString(), TargetNode = new Node { IsFile = true } });
    }
    for (int i = 0; i < triplicateHashesCount; i++)
    {
        var hash = int.MaxValue - 100000 - i % (triplicateHashesCount / 3);
        _files.Add(new File { Id = i, Hash = hash.ToString(), TargetNode = new Node { IsFile = true } });
    }
    for (int i = 0; i < randomCount; i++)
    {
        var hash = int.MaxValue - i;
        _files.Add(new File { Id = i, Hash = hash.ToString(), TargetNode = new Node { IsFile = true } });
    }
    for (int i = 0; i < nonFileCount; i++)
    {
        var hash = i % (nonFileCount / 2);
        _files.Add(new File { Id = i, Hash = hash.ToString(), TargetNode = new Node { IsFile = false } });
    }
    _matched = 0;
}
Then your current method:
[Test]
public void FindDuplicates()
{
    _sw.Start();
    for (int i = 0; i < _files.Count; i++) // compare the files to themselves
    {
        for (int n = i + 1; n < _files.Count; n++) // don't need to do the same comparison twice!
        {
            if (_files[i].TargetNode.IsFile && _files[n].TargetNode.IsFile)
                if (_files[i].Hash == _files[n].Hash)
                {
                    // Do work
                    _matched++;
                }
        }
    }
    _sw.Stop();
}
It takes around 7.1 seconds on my machine.
Using a lookup to find hashes which appear multiple times takes 21 ms:
[Test]
public void FindDuplicatesHash()
{
    _sw.Start();
    var lookup = _files.Where(f => f.TargetNode.IsFile).ToLookup(f => f.Hash);
    foreach (var duplicateFiles in lookup.Where(files => files.Count() > 1))
    {
        // Do work for each unique hash which appears multiple times in _files.
        // If you need to do work on each pair, you will need to create pairs
        // from duplicateFiles; see the sketch after this method ;-)
        _matched++;
    }
    _sw.Stop();
}
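For completeness, a hedged sketch of the pair step mentioned in the comment above: expanding each duplicate group into the same unordered (i, n) pairs the original nested loop visited.

foreach (var duplicateFiles in lookup.Where(files => files.Count() > 1))
{
    var group = duplicateFiles.ToList();
    for (int i = 0; i < group.Count; i++)
    {
        for (int n = i + 1; n < group.Count; n++)
        {
            // Do work on the pair (group[i], group[n]); each unordered pair
            // of files sharing a hash is visited exactly once.
            _matched++;
        }
    }
}

Because the pairs only come from within each group, the quadratic cost is paid per duplicate group instead of over the whole 16,900-item list.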
In my test, using PLINQ to count the lookups is actually slower (as there is a large cost to dividing the list between threads and aggregating the results back):
[Test]
public void FindDuplicatesHashParallel()
{
    _sw.Start();
    var lookup = _files.Where(f => f.TargetNode.IsFile).ToLookup(f => f.Hash);
    _matched = lookup.AsParallel().Where(g => g.Count() > 1).Sum(g => 1);
    _sw.Stop();
}
This took 120ms, so almost 6 times as long with my current source list.
I have this code:
int LRLength = LR.Count;
for (int i = 0; i < LR.Count; i++)
{
    LRLength = LR.Count;
    LR = merge(LR);
    if (LR.Count < LRLength)
    {
        LR = merge(LR);
        if (LR.Count == LRLength)
        {
            break;
        }
    }
}
And this is the function merge:
private List<Lightnings_Extractor.Lightnings_Region> merge(List<Lightnings_Extractor.Lightnings_Region> Merged)
{
    List<Lightnings_Extractor.Lightnings_Region> NewMerged = new List<Lightnings_Extractor.Lightnings_Region>();
    Lightnings_Extractor.Lightnings_Region reg;
    int dealtWith = -1;
    for (int i = 0; i < Merged.Count; i++)
    {
        if (i != dealtWith)
        {
            reg = new Lightnings_Extractor.Lightnings_Region();
            if (i < Merged.Count - 1)
            {
                if (Merged[i].end + 1 >= Merged[i + 1].start)
                {
                    reg.start = Merged[i].start;
                    reg.end = Merged[i + 1].end;
                    NewMerged.Add(reg);
                    dealtWith = i + 1;
                }
                else
                {
                    reg.start = Merged[i].start;
                    reg.end = Merged[i].end;
                    NewMerged.Add(reg);
                }
            }
            else
            {
                reg.start = Merged[i].start;
                reg.end = Merged[i].end;
                NewMerged.Add(reg);
            }
        }
    }
    return NewMerged;
}
The class Lightnings_Extractor.Lightnings_Region holds only two int variables.
The idea of this function is to take a List and merge regions that are contiguous.
For example, if I call the function with a list LR whose length is 8, I may get back a shorter one: if two entries need to be merged into one, the returned list has length 7; if another pair needs merging, the length becomes 6, and so on.
What I want to work out in the first code block above is when I should stop calling the merge function:
If the length was 8 and the next time it's still 8, do nothing and stop the loop.
If the length is 8 and the next time it's 7, call the function again.
If the length is then still 7, stop the loop; but if it's 6, keep calling.
Until the last length is the same as the length before!
The loop at the top is my attempt at this, but it doesn't work correctly.
Trying to make some assumptions as to what you're trying to accomplish: the following captures the length of the list before each pass. It will run at least once, and keep running until a pass no longer changes the count, i.e. until LR.Count == LRLength:
int LRLength;
do
{
    LRLength = LR.Count;
    LR = merge(LR);
} while (LR.Count != LRLength);
Written with an explicitly named previous-count variable, the same "run until you get the same count twice in a row" loop reads:
int prevCount;
do
{
    prevCount = LR.Count;
    LR = merge(LR);
} while (prevCount != LR.Count);
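As an alternative worth sketching (assuming the list is sorted by start and that start/end are writable public fields, as the code above implies): a single-pass merge that collapses whole chains of adjacent regions at once, so the outer convergence loop is no longer needed.

private static List<Lightnings_Extractor.Lightnings_Region> MergeAll(
    List<Lightnings_Extractor.Lightnings_Region> regions)
{
    var merged = new List<Lightnings_Extractor.Lightnings_Region>();
    foreach (var r in regions)
    {
        var last = merged.Count > 0 ? merged[merged.Count - 1] : null;
        if (last != null && last.end + 1 >= r.start)
        {
            // Touching or overlapping: extend the previous region in place.
            last.end = Math.Max(last.end, r.end);
        }
        else
        {
            merged.Add(new Lightnings_Extractor.Lightnings_Region { start = r.start, end = r.end });
        }
    }
    return merged;
}

Because each incoming region is folded into the previous one whenever they touch, chains like (1-3)(4-6)(7-9) collapse in a single call, which is exactly the case that forces the original merge to be called repeatedly.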