I have a list of domains that are crawled using Abot. The aim is that when it finds an Amazon link on one of the sites, it quits and moves on to the next site. But I can't see how to exit the page crawl, e.g.
https://github.com/sjdirect/abot
static Main(string[] args)
{
var domains= new List<string> { "http://domain1", "http://domain2" };
foreach (string domain in domains)
{
var config = new CrawlConfiguration
{
MaxPagesToCrawl = 100,
MinCrawlDelayPerDomainMilliSeconds = 3000
};
var crawler = new PoliteWebCrawler(config);
crawler.PageCrawlCompleted += PageCrawlCompleted;
var uri = new Uri(domain);
var crawlResult = crawler.Crawl(uri);
}
}
//Fires after each page is crawled; scans the page's anchor hrefs for Amazon links.
//The crawl-exit logic the asker wants goes inside the if-block at the bottom.
private static void PageCrawlCompleted(object sender, PageCrawlCompletedArgs e)
{
    var crawledPage = e.CrawledPage;
    var crawlContext = e.CrawlContext;
    var document = crawledPage.AngleSharpHtmlDocument;
    var anchors = document.QuerySelectorAll("a").OfType<IHtmlAnchorElement>();
    var hrefs = anchors.Select(x => x.Href).ToList();
    //Fix: verbatim string literals use the "@" prefix; the original "#" prefix
    //is not valid C#. Pattern matches www/smile amazon.co.uk and amazon.com URLs.
    var regEx = new Regex(@"https?:\/\/(www|smile)\.amazon(\.co\.uk|\.com).*");
    var resultList = hrefs.Where(f => regEx.IsMatch(f)).ToList();
    if (resultList.Any())
    {
        //NEED TO EXIT THE SITE CRAWL HERE
    }
}
I would suggest the following...
//Create the token source up front so the PageCrawlCompleted handler can cancel
//the crawl (the variable must be reachable from the handler, e.g. a static field)
var myCancellationToken = new CancellationTokenSource();
//CrawlAsync overload that ties the crawl's lifetime to the token source
crawler.CrawlAsync(someUri, myCancellationToken);
//Stops the entire site crawl as soon as any crawled page contains an Amazon link.
private static void PageCrawlCompleted(object sender, PageCrawlCompletedArgs e)
{
    //More performant (since the parsing has already been done by Abot)
    var amazonLinkFound = false;
    foreach (var parsedLink in e.CrawledPage.ParsedLinks)
    {
        if (parsedLink.HrefValue.AbsoluteUri.ToLower().Contains("amazon.com"))
        {
            amazonLinkFound = true;
            break;
        }
    }
    if (!amazonLinkFound)
        return;
    //LOG SOMETHING BEFORE YOU STOP THE CRAWL!!!!!
    //Option A: Preferred method, Will clear all scheduled pages and cancel any threads that are currently crawling
    myCancellationToken.Cancel();
    //Option B: Same result as option A but no need to do anything with tokens. Not the preferred method.
    e.CrawlContext.IsCrawlHardStopRequested = true;
    //Option C: Will clear all scheduled pages but will allow any threads that are currently crawling to complete. No cancellation tokens needed. Consider it a soft stop to the crawl.
    e.CrawlContext.IsCrawlStopRequested = true;
}
PoliteWebCrawler is designed to start crawling and dig deeper into the website URLs. If you just want to get the content of a URL (for example first page of a website) you can use PageRequester which is designed for such jobs.
//Fetch a single page with PageRequester instead of running a full crawl.
var requestConfig = new CrawlConfiguration();
var contentExtractor = new WebContentExtractor();
var pageRequester = new PageRequester(requestConfig, contentExtractor);
var crawledPage = await pageRequester.MakeRequestAsync(new Uri("http://google.com"));
//Log the fetched URL together with the numeric HTTP status of the response
var status = Convert.ToInt32(crawledPage.HttpResponseMessage.StatusCode);
Log.Logger.Information("{result}", new
{
    url = crawledPage.Uri,
    status
});
By the way, if you want to stop a crawler during the process, you can use one of these two methods:
//1. hard crawl stop
//Cancels the crawl's own token source: clears all scheduled pages and aborts
//threads that are currently mid-crawl
crawlContext.CancellationTokenSource.Cancel();
//2. soft stop
//Clears the scheduled pages but lets pages currently being crawled finish
crawlContext.IsCrawlStopRequested = true;
Related
I'm building a candle recorder (Binance Crypto), interesting in 1 minute candles, including intra candle data for market study purpose (But eventually I could use this same code to actually be my eyes on what's happening in the market)
To avoid eventual lag / EF / SQL performance issues etc., I decided to accomplish this using two threads.
One receives the subscribed (Async) tokens from Binance and put them in a ConcurrentQueue, while another keeps trying to dequeue and save the data in MSSQL
My question is about the second thread, a while(true) loop. What's the best approach to save 200+ records/sec to SQL when these records come in individually (sometimes 300 records within 300 ms, sometimes fewer) using EF:
Should I open the SQL con each time I want to save? (Performance).
What's the best approach to accomplish this?
-- EDITED --
At one point I got 600k+ in the Queue so I'm facing problems inserting to SQL
Changed from Linq to SQL to EF
Here's my actual code:
//Initialize: starts the SQL-transfer background thread, loads the symbol map
//from the database, then subscribes to one-minute kline updates for every symbol.
public void getCoinsMoves()
{
    var transferThread = new Thread(TransferDatatoSQL)
    {
        Name = "THTransferDatatoSQL",
        IsBackground = true
    };
    //Apartment state must be set before the thread starts
    transferThread.SetApartmentState(ApartmentState.STA);
    transferThread.Start();
    List<string> SymbolsMap;
    using (DBBINANCEEntities db = new DBBINANCEEntities())
    {
        SymbolsMap = db.TB_SYMBOLS_MAP.Select(h => h.SYMBOL).ToList();
    }
    //Every received candle update is pushed onto the queue by RecordCandles
    socketClient.Spot.SubscribeToKlineUpdatesAsync(SymbolsMap, Binance.Net.Enums.KlineInterval.OneMinute, h => RecordCandles(h));
}
//Enqueue Data: maps one incoming Binance kline update onto a row object and
//queues it for the SQL-transfer thread (producer side of the ConcurrentQueue).
public void RecordCandles(Binance.Net.Interfaces.IBinanceStreamKlineData Candle)
{
    var row = new TB_FRACTIONED_CANDLES_DATA();
    row.SYMBOL = Candle.Symbol;
    row.POPEN = Candle.Data.Open;
    row.PHIGH = Candle.Data.High;
    row.PLOW = Candle.Data.Low;
    row.PCLOSE = Candle.Data.Close;
    //Close time is shifted by -3 hours (presumably to local time — confirm against consumers)
    row.CLOSE_TIME = Candle.Data.CloseTime.AddHours(-3);
    row.BASE_VOLUME = Candle.Data.BaseVolume;
    row.MONEY_VOLUME = Candle.Data.QuoteVolume;
    row.TAKER_BUY_BASE_VOLUME = Candle.Data.TakerBuyBaseVolume;
    row.TAKER_BUY_MONEY_VOLUME = Candle.Data.TakerBuyQuoteVolume;
    row.TRADES = Candle.Data.TradeCount;
    row.IS_LAST_CANDLE = Candle.Data.Final;
    FRACTIONED_CANDLES.Enqueue(row);
}
//Transfer Data to SQL: drains the queue on a dedicated thread, writing each
//fractioned candle (and, when it is the candle's final update, the finished
//candle as well) to SQL in its own context.
public void TransferDatatoSQL()
{
    while (true)
    {
        TB_FRACTIONED_CANDLES_DATA NewData;
        if (FRACTIONED_CANDLES.TryDequeue(out NewData))
        {
            using (DBBINANCEEntities LSQLBINANCE = new DBBINANCEEntities())
            {
                LSQLBINANCE.TB_FRACTIONED_CANDLES_DATA.Add(NewData);
                if (NewData.IS_LAST_CANDLE)
                    LSQLBINANCE.TB_CANDLES_DATA.Add(new TB_CANDLES_DATA()
                    {
                        BASE_VOLUME = NewData.BASE_VOLUME,
                        CLOSE_TIME = NewData.CLOSE_TIME,
                        IS_LAST_CANDLE = NewData.IS_LAST_CANDLE,
                        MONEY_VOLUME = NewData.MONEY_VOLUME,
                        PCLOSE = NewData.PCLOSE,
                        PHIGH = NewData.PHIGH,
                        PLOW = NewData.PLOW,
                        POPEN = NewData.POPEN,
                        SYMBOL = NewData.SYMBOL,
                        TAKER_BUY_BASE_VOLUME = NewData.TAKER_BUY_BASE_VOLUME,
                        TAKER_BUY_MONEY_VOLUME = NewData.TAKER_BUY_MONEY_VOLUME,
                        TRADES = NewData.TRADES
                    });
                LSQLBINANCE.SaveChanges();
            }
        }
        else
        {
            //Fix: only yield when the queue is empty. The original slept 1 ms after
            //every insert, capping throughput at ~1000 rows/sec and letting the
            //queue grow unboundedly (the 600k backlog described in the question).
            Thread.Sleep(1);
        }
    }
}
Thanks in advance,
Rafael
I see one error in your code: you're sleeping the background thread after every insert. Don't sleep if there's more data. Instead of:
//(Quoted from the question — note the unconditional Thread.Sleep(1) on the last
//line, executed on every loop pass even when the queue still holds data.)
if (FRACTIONED_CANDLES.TryDequeue(out NewData))
{
using (DBBINANCEEntities LSQLBINANCE = new DBBINANCEEntities())
{
LSQLBINANCE.TB_FRACTIONED_CANDLES_DATA.Add(NewData);
if (NewData.IS_LAST_CANDLE)
LSQLBINANCE.TB_CANDLES_DATA.Add(new TB_CANDLES_DATA()
{
BASE_VOLUME = NewData.BASE_VOLUME,
CLOSE_TIME = NewData.CLOSE_TIME,
IS_LAST_CANDLE = NewData.IS_LAST_CANDLE,
MONEY_VOLUME = NewData.MONEY_VOLUME,
PCLOSE = NewData.PCLOSE,
PHIGH = NewData.PHIGH,
PLOW = NewData.PLOW,
POPEN = NewData.POPEN,
SYMBOL = NewData.SYMBOL,
TAKER_BUY_BASE_VOLUME = NewData.TAKER_BUY_BASE_VOLUME,
TAKER_BUY_MONEY_VOLUME = NewData.TAKER_BUY_MONEY_VOLUME,
TRADES = NewData.TRADES
});
LSQLBINANCE.SaveChanges();
}
}
Thread.Sleep(1);
Change the last line to:
else
Thread.Sleep(1); //sleep only when TryDequeue found nothing, so a full queue is drained at top speed
This may resolve your problem.
I have JSON documents in the document DB (~30k documents), where each document has a unique ID, something like AA123, AA124. The tool we use to pull those documents from the document DB has a restriction of 500 documents per GET request, so it takes 60 GET requests to fetch the full result, which takes some time. I am looking to optimize this to run quickly (running threads in parallel) so that I can get the data faster. Below is sample code showing how I am pulling the data from the DB at the moment.
//Page size enforced by the document store's GET API. Declared const (was an
//instance field) so the static SearchRequest initializer below can legally use it.
private const int maxItemsPerCall = 500;
//Pulls every "documentRule" document by paging on Id, maxItemsPerCall per request.
public override async Task<IEnumerable<docClass>> Getdocuments()
{
    string accessToken = "token";
    SearchResponse<docClass> docs = await db.SearchDocuments<docClass>(initialload, accessToken); //Gets top 500
    List<docClass> routeRules = new List<docClass>();
    routeRules.AddRange(docs.Documents);
    //Requests still needed after the initial load: ceiling(Total / pageSize) - 1.
    //The original plain integer division issued one extra, empty request whenever
    //the total was an exact multiple of the page size, and the loop condition
    //referenced an undefined "maxItemsPerSearch".
    var remainingCalls = ((docs.TotalDocuments + maxItemsPerCall - 1) / maxItemsPerCall) - 1;
    while (remainingCalls > 0)
    {
        //Follow-up pages filter on Id > last returned Id; pass the same access
        //token as the initial call (the original passed an undefined "requestOptions")
        docs = await db.SearchDocuments<docClass>(GetFollowUp(docs.Documents.LastOrDefault().Id.Id), accessToken);
        routeRules.AddRange(docs.Documents);
        remainingCalls--;
    }
    return routeRules;
}
//First page request: all documents of type "documentRule", ordered by Id
//ascending, limited to the first maxItemsPerCall results.
//NOTE(review): this static initializer can only reference maxItemsPerCall if that
//field is static or const — as the instance field declared above it does not compile.
private static SearchRequest initialload = new SearchRequest()
{
Filter = new SearchFilterGroup(
new[]
{
new SearchFilter(Field.Type, FilterOperation.Equal, "documentRule")
},
GroupOperator.And),
OrderBy = Field.Id,
Top = maxItemsPerCall,
Descending = false
};
//Builds the request for the next page: same type filter, restricted to Ids
//greater than the last Id already received, ordered ascending, capped at
//maxItemsPerCall results.
private static SearchRequest GetFollowUp(string lastId)
{
    var filters = new[]
    {
        new SearchFilter(Field.Type, FilterOperation.Equal, "documentRule"),
        new SearchFilter(Field.Id, FilterOperation.GreaterThan, lastId)
    };
    return new SearchRequest()
    {
        Filter = new SearchFilterGroup(filters, GroupOperator.And),
        OrderBy = Field.Id,
        Top = maxItemsPerCall,
    };
}
Help needed: since each GET request fetches 500 documents based on the last ID of the previous run, how can I run this in parallel (at least 5 parallel threads at a time), fetching 500 records per thread (i.e. 2,500 in total across the 5 threads)? I am not familiar with threading, so it would be helpful if someone could suggest how to do this.
I'm trying to make a webscraper where I get all the download links for the css/js/images from a html file.
Problem
The first breakpoint does hit, but the second one not after hitting "Continue".
Picture in Visual Studio
Code I'm talking about:
//Downloads the page at url, collects asset URLs from it, and downloads the
//assets under downloadDir.
//Fix: returns Task instead of void. An "async void" method cannot be awaited, so
//the caller's process can exit before anything after the first incomplete await
//runs — which is exactly the "second breakpoint never hit" symptom. Call sites
//that ignored the void return still compile.
private static async Task GetHtml(string url, string downloadDir)
{
    //Get html data, create and load htmldocument
    HttpClient httpClient = new HttpClient();
    var html = await httpClient.GetStringAsync(url);
    Console.ReadLine();
    var htmlDocument = new HtmlDocument();
    htmlDocument.LoadHtml(html);
    //Get all css download urls
    var linkUrl = htmlDocument.DocumentNode.Descendants("link")
        .Where(node => node.GetAttributeValue("type", "")
            .Equals("text/css"))
        .Select(node => node.GetAttributeValue("href", ""))
        .ToList();
    //Downloading css, js, images and source code
    //NOTE(review): scriptUrl is not defined anywhere in this snippet (only linkUrl
    //is built above) — presumably a js-collection step was elided; confirm.
    using (var client = new WebClient())
    {
        for (var i = 0; i < scriptUrl.Count; i++)
        {
            Uri uri = new Uri(scriptUrl[i]);
            //"@" verbatim string: the original "#" prefix is not valid C#
            client.DownloadFile(uri,
                downloadDir + @"\js\" + uri.Segments.Last());
        }
    }
}
Edit
I'm calling the GetHtml method from here:
//Prepares the download directories, saves the index page, then kicks off the
//asset download. Fix: the two path literals used the invalid "#" prefix instead
//of the "@" verbatim-string prefix.
private static void Start()
{
    //Create a list that will hold the names of all the subpages
    List<string> subpagesList = new List<string>();
    //Ask user for url and asign that to var url, also add the url to the url list
    Console.WriteLine("Geef url van de website:");
    string url = "https://www.hethwc.nl";
    //Ask user for download directory and assign that to var downloadDir
    Console.WriteLine("Geef locatie voor download:");
    var downloadDir = @"C:\Users\Daniel\Google Drive\Almere\C# II\Download tests\hethwc\";
    //Download and save the index file
    var htmlSource = new System.Net.WebClient().DownloadString(url);
    System.IO.File.WriteAllText(@"C:\Users\Daniel\Google Drive\Almere\C# II\Download tests\hethwc\index.html", htmlSource);
    // Creating directories
    string jsDirectory = System.IO.Path.Combine(downloadDir, "js");
    string cssDirectory = System.IO.Path.Combine(downloadDir, "css");
    string imagesDirectory = System.IO.Path.Combine(downloadDir, "images");
    System.IO.Directory.CreateDirectory(jsDirectory);
    System.IO.Directory.CreateDirectory(cssDirectory);
    System.IO.Directory.CreateDirectory(imagesDirectory);
    //NOTE(review): fire-and-forget — GetHtml is async, so Start returns before the
    //downloads finish and nothing keeps the process alive (the bug discussed in the
    //answer below). Also note the hard-coded ".nu" here vs the ".nl" url variable above.
    GetHtml("https://www.hethwc.nu", downloadDir);
}
How are you calling GetHtml? Presumably that is from a sync Main method, and you don't have any other non-worker thread in play (because your main thread exited): the process will terminate. Something like:
//With a synchronous Main, the process terminates as soon as GetHtml hits its
//first incomplete await — nothing waits for the async work to complete.
static void Main() {
GetHtml();
}
The above will terminate the process immediately after GetHtml returns and the Main method ends, which will be at the first incomplete await point.
In current C# versions (C# 7.1 onwards) you can create an async Task Main() method, which will allow you to await your GetHtml method properly, as long as you change GetHtml to return Task:
//C# 7.1+: an async Task Main lets the runtime wait on the returned task, so the
//process stays alive until GetHtml (changed to return Task) has finished.
async static Task Main() {
await GetHtml();
}
Long story short, I'm using GeneticSharp for an iterative/conditional reinforcement learning algorithm. This means that I'm making a bunch of different GeneticAlgorithm instances, each using a shared SmartThreadPool. Only one GA is running at a time though.
After a few iterations of my algorithm, I run into this error, which happens when attempting to start the SmartThreadPool.
Is there any obvious reason this should be happening? I've tried using a different STPE and disposing of it each time, but that didn't seem to help either. Is there some manual cleanup I need to be doing in between each GA run? Should I be using one shared GA instance?
Edit: Quick code sample
//Single executor shared by every GA run (2–8 pooled threads).
static readonly SmartThreadPoolTaskExecutor Executor = new SmartThreadPoolTaskExecutor() { MinThreads = 2, MaxThreads = 8 };
//Alternates forever between optimizing the A chromosome against the current best
//B chromosome and vice versa.
public static void Main(string[] args)
{
    var bestA = new AChromosome();
    var bestB = new BChromosome();
    while (true)
    {
        bestA = FindBestAChromosome(bestB);
        bestB = FindBestBChromosome(bestA);
        // Log results;
    }
}
//Runs one genetic-algorithm search over AChromosome, scoring candidates against
//the given B chromosome, and returns the best individual after 100 generations.
public static AChromosome FindBestAChromosome(BChromosome chromosome)
{
    var selection = new EliteSelection();
    var crossover = new UniformCrossover();
    var mutation = new UniformMutation(true);
    AChromosome result;
    using (var fitness = new AChromosomeFitness(chromosome))
    {
        var pop = new Population(50, 70, chromosome);
        var ga = new GeneticAlgorithm(pop, fitness, selection, crossover, mutation)
        {
            Termination = new GenerationNumberTermination(100),
            TaskExecutor = Executor
        };
        ga.GenerationRan += LogGeneration;
        ga.Start();
        LogResults();
        result = ga.BestChromosome as AChromosome;
        //Unsubscribe so this GA instance holds no reference back to the logger
        ga.GenerationRan -= LogGeneration;
    }
    return result;
}
//Mirror of FindBestAChromosome: one GA search over BChromosome, scored against
//the given A chromosome; returns the best individual after 100 generations.
public static BChromosome FindBestBChromosome(AChromosome chromosome)
{
    var selection = new EliteSelection();
    var crossover = new UniformCrossover();
    var mutation = new UniformMutation(true);
    BChromosome result;
    using (var fitness = new BChromosomeFitness(chromosome))
    {
        var pop = new Population(50, 70, chromosome);
        var ga = new GeneticAlgorithm(pop, fitness, selection, crossover, mutation)
        {
            Termination = new GenerationNumberTermination(100),
            TaskExecutor = Executor
        };
        ga.GenerationRan += LogGeneration;
        ga.Start();
        LogResults();
        result = ga.BestChromosome as BChromosome;
        //Unsubscribe so this GA instance holds no reference back to the logger
        ga.GenerationRan -= LogGeneration;
    }
    return result;
}
AChromosome and BChromosome are each fairly simple, a couple doubles and ints and maybe a function pointer (to a static function).
Edit2: Full call stack with replaced bottom two entries
Unhandled Exception: System.IO.IOException: Insufficient system resources exist to complete the requested service.
at System.IO.__Error.WinIOError(Int32 errorCode, String maybeFullPath)
at System.Threading.EventWaitHandle..ctor(Boolean initialState, eventResetMode mode, string name)
at Amib.Threading.SmartThreadPool..ctor()
at GeneticSharp.Infrastructure.Threading.SmartThreadPoolTaskExecutor.Start()
at GeneticSharp.Domain.GeneticAlgorithm.EvaluateFitness()
at GeneticSharp.Domain.GeneticAlgorithm.EndCurrentGeneration()
at GeneticSharp.Domain.GeneticAlgorithm.EvolveOneGeneration()
at GeneticSharp.Domain.GeneticAlgorithm.Resume()
at GeneticSharp.Domain.GeneticAlgorithm.Start()
at MyProject.Program.FindBestAChromosome(BChromosome chromosome)
at MyProject.Program.Main(String[] args)
Edit3: One last thing to note is that my fitness functions are pretty processing-intensive and one run can take almost 2g of ram (running on a machine with 16g, so no worries there). I've seen no problems with garbage collection though.
So far, this only happens after about 5 iterations (which takes multiple hours).
It turns out it was my antivirus preventing the threads from finalizing. Now I'm running it on a machine with a different antivirus and it's running just fine. If I come up with a better answer for how to handle this in the future, I'll update here.
We are currently creating a Windows Store Application which gains information from an RSS feed and inputs this information into an ObservableCollection. The issue we are having is when the information is being gained, the Applications UI becomes unresponsive.
In order to get around this, I thought about creating a new thread and calling the method within this. Though, after some research we realised that this was no longer possible in Windows Store Apps. How can we get around this?
The method that collects the information is below.
//Downloads and parses every feed URL and appends the items to feedItems.
//NOTE(review): runs entirely on the calling (UI) thread — XDocument.Load performs
//blocking network I/O per feed, which is what freezes the UI while the feeds load.
//The proper fix is to make this work awaitable (see the answer below).
public void getFeed()
{
setupImages();
//Category feeds to aggregate, fetched strictly in this order
string[] feedUrls = new string[] {
"http://www.igadgetos.co.uk/blog/category/gadget-news/feed/",
"http://www.igadgetos.co.uk/blog/category/gadget-reviews/feed/",
"http://www.igadgetos.co.uk/blog/category/videos/feed/",
"http://www.igadgetos.co.uk/blog/category/gaming/feed/",
"http://www.igadgetos.co.uk/blog/category/jailbreak-2/feed/",
"http://www.igadgetos.co.uk/blog/category/kickstarter/feed/",
"http://www.igadgetos.co.uk/blog/category/cars-2/feed/",
"http://www.igadgetos.co.uk/blog/category/software/feed/",
"http://www.igadgetos.co.uk/blog/category/updates/feed/"
};
//NOTE(review): stray standalone block — presumably left over from an edit
{
try
{
XNamespace dc = "http://purl.org/dc/elements/1.1/";
XNamespace content = "http://purl.org/rss/1.0/modules/content/";
foreach (var feedUrl in feedUrls)
{
//Blocking fetch + parse of one feed
var doc = XDocument.Load(feedUrl);
var feed = doc.Descendants("item").Select(c => new ArticleItem() //Creates a copy of the ArticleItem Class.
{
Title = c.Element("title").Value,
//There are another 4 of these.
Post = stripTags(c.Element(content + "encoded").Value) }
).OrderByDescending(c => c.PubDate);
this.moveItems = feed.ToList();
//IDs are assigned from the running count so they stay unique across feeds
foreach (var item in moveItems)
{
item.ID = feedItems.Count;
feedItems.Add(item);
}
}
lastUpdated = DateTime.Now;
}
//NOTE(review): bare catch — any failure (network, parse, missing element) in ANY
//feed aborts the whole refresh and shows one generic dialog
catch
{
MessageDialog popup = new MessageDialog("An error has occured downloading the feed, please try again later.");
popup.Commands.Add(new UICommand("Okay"));
popup.Title = "ERROR";
//NOTE(review): fire-and-forget — the dialog's result is never awaited
popup.ShowAsync();
}
}
}
How can we stop the application from freezing while we fetch this information, given that manual threading is not possible within Windows Store applications?
E.g - We planned to use;
//Spawn the feed download on its own thread.
//Fix: Start is a method — "newThread.Start" without parentheses is a method-group
//reference and does not compile as a statement.
Thread newThread = new Thread(getFeed);
newThread.Start();
You need to use the well documented async pattern for your operations that happen on the UI thread. The link given by Paul-Jan in the comments is where you need to start. http://msdn.microsoft.com/en-us/library/windows/apps/hh994635.aspx