We are scraping an Web based API using Microsoft Azure. The issue is that there is SO much data to retrieve (there are combinations/permutations involved).
If we use a standard Web Job approach, we calculated it would take about 200 years to process all the data we want to get - and we would like our data to be refreshed every week.
Each request/response from the API takes about a 0.5-1.0 seconds to process. Request size is on average 20000 bytes and the average response is 35000 bytes. I believe the total number of requests is in the millions.
Another way to think about this question would be: how would you use Azure to Web scrape - and make sure you don't overload (in terms of memory + network) the VM it's running on? (I don't think you need too much CPU processing in this case).
What we have tried so far:
Used Service Bus Queues/Worker Roles scaled to 8 small VMs - but this caused a lot of network errors to occur (there must be some network limit to how much EACH worker role VM can handle).
Used Service Bus Queues/Continuous Web Job scaled to 8 small VMs - but this seems to work slower - and even scaled, doesn't give us too much control on what's happening behind the scenes. (We don't REALLY know how many VMs are up).
It seems that these things are built for CPU calculation - not for Web/API scraping.
Just to clarify: I throw my requests into a queue - which then get picked up by my multiple VMs for processing to get the responses. That's how I was using the queues. Each VM was using the ServiceBusTrigger class as prescribed by microsoft.
Is it better to have a lot small VMs or few massive VMs?
What C# classes should we be looking at?
What are the technical best practices when trying to do something like this on Azure?
Actually a web scraper is something that I have up and running, in Azure, for quite some time now :-)
AFAIK there is no 'magic bullet'. Scraping a lot of sources with deadlines is quite hard.
How it works (the most important things):
I use worker roles and C# code for the code itself.
For scheduling, I use the queue storage. I put crawling tasks on the queue with a timeout (e.g. 'when to crawl then') and have the scraper pull them off. You can put triggers on the queue size to ensure you meet deadlines in terms of speed -- personally I don't need them.
SQL Azure is slow, so I don't use that. Instead, I only use table storage for storing the scraped items. Note that updating data might be quite complex.
Don't use too much threading; instead, use async IO for all network traffic.
Also you might have to consider that extra threads require extra memory (parse trees can become quite big) - so there's a trade-off there... I do recall using some threads, but it's really just a few.
Note that probably this does require you to re-design and re-implement your complete web scraper if you're now using a threaded approach.. then again, there are some benefits:
Table storage and queue storage are cheap.
I currently use a single Extra Small VM to scrape well over a thousand web sources.
Inbound network traffic is for free.
As such, the result is quite cheap as well; I'm sure it's much less than the alternatives.
As for classes that I use... well, that's a bit of a long list. I'm using HttpWebRequest for the async HTTP requests and the Azure SDK -- but all the rest is hand crafted (and not open source).
P.S.: This doesn't just hold for Azure; most of this also holds for on-premise scrapers.
I have some experience with scraping so I will share my thoughts.
It seems that these things are built for CPU calculation - not for Web/API scraping.
They are built for dynamic scaling which given your task is not something you really need.
How to make sure you don't overload the VM?
Measure the response times and error rates and tune you code to lower them.
I don't think you need too much CPU processing in this case.
Depends on how much data is coming in each second and what you are doing with it. More complex parsing on quickly incoming data (if you decide to do it on the same machine) will eat up CPU pretty quickly.
8 small VMs caused a lot of network errors to occur (there must be some network limit)
The smaller the VMs the less shared resources they get. There are throughput limits and then there is an issue with your neighbors sharing the actual hardware with you. Often, the smaller your instance size the more trouble you run into.
Is it better to have a lot small VMs or few massive VMs?
In my experience, smaller VMs are too crippled. However, your mileage may vary and it all depends on the particular task and its solution implementation. Really, you have to measure yourself in your environment.
What C# classes should we be looking at?
What are the technical best practices when trying to do something like this on Azure?
With high throughput scraping you should be looking at infrastructure. You will have different latency in different Azure datacenters, and different experience with network latency/sustained throughput at different VM sizes, and depending on who in particular is sharing the hardware with you. The best practice is to try and find what works best for you - change datacenters, VM sizes and otherwise experiment.
Azure may not be the best solution to this problem (unless you are on a spending spree). 8 small VMs is $450 a month. It is enough to pay for an unmanaged dedicated server with 256Gb of RAM, 40 hardware threads and 500Mbps - 1Gbps (or even up to several Gbps bursts) of quality network bandwidth without latency issues.
For you budget, you will have a dedicated server that you cannot overload. You will have more than enough RAM to deal with async pinning (if you decide to go async), or enough hardware threads for multi-threaded synchronous IO which gives the best throughput (if you choose to go synchronous with a fixed-size threadpool).
On a sidenote, depending on the API specifics, it might turn out that your main issue will be the API owner simply throttling you down to a crawl when you start to put too much pressure on the API endpoints.
Related
I am fairly new to asynchronous programming so I need help.
What I need to do is, create a windows service that constantly checks the database for menu updates (insert/updates), tables updates (insert/updates), menu category updates (insert/updates) and so on and if any change is detected the service will then need to POST those said changes to separate APIs one by one. Keeping in mind that the service will be used for just this purpose and the database that I need to check for updates is SQL Server.
So, how do I approach this scenario efficiently ? Do I create new Tasks (System.Threading.Tasks) or create new Threads (System.Threading.Thread) for each pieces like UpdateMenu that checks the menu updates and upload to api, UpdateTable, UpdateDishes and so on and how do I go about the Posting to the API part I mean do I create a new Task for each and every API call? I want the application to be as efficient as possible and pick the changes and post them to API as soon as possible.
Thanks in advance.
It seems that you are worried about the overhead of the mechanism that you are going to use, in order to fetch data from the database and post these data to APIs. You are thinking that maybe Threads are fast and Tasks are slower, or vice versa. In fact choosing between these two mechanisms is likely to have no measurable impact to your service's demand for CPU, memory or other system resources.
What is likely to be impactful, is the pattern of communication of your service with the database and the APIs. For example if your threads/tasks are not coordinated with each other, and query the database all at the same time, the database might be slow to respond, and might consume larger amounts of memory while preparing the response. That's not because your threads/tasks are slow. It's because your service is querying the database with a pattern that makes it harder for the database to respond. The same might be true for the pattern of communication with the APIs. If your workers are not coordinated, the network connectivity might become a bottleneck, or the remote machines that host the APIs might suffer.
So my advice is to focus on the usability factor of the mechanisms, and not on their supposed difference in performance. If you are comfortable and familiar with threads, and know nothing about tasks, use threads. If you are familiar with both threads and tasks, use tasks because they are generally easier to use. You'd better invest your time to optimize the communication pattern between your service and its dependencies, than for doing benchmarks trying to find the best between mechanisms that for all intents and purposes are equally efficient.
I have a C# MVC application with a WCF service running on Azure. First of it was of course hosted on the free version, but as I had that one running smoothly I wanted to try and see how it ran on either Basic or Standard, which as far as I know should be dedicated servers.
To my surprise the code ran significantly slower once it was changed from Free to either Standard or Basic. I chose the smallest instance, but still expected them to perform better than the Free option?
From my performance logging I can see that the code that runs especially slow is something that is started as async from Task.Run. Initially it was old school Thread.Start() but considered whether this might spawn it in some lower priority thread and therefore changed it to Task.Run - without this changing anything - so perhaps it has nothing to do with it - but it might, so now you know.
The code that runs really slow basically works on some XML document, through XDocument, XElement etc. It loops through, has some LINQ etc. but nothing too fancy. But still it is 5-10 times slower on Basic and Standard as on the Free version? For the exact same request the Free version uses around 1000ms where as Basic and Standard uses 8000-10000ms?
In each test I have tried 5-10 times but without any decrease in response-times. I thought about whether I need to wait some hours before the Basic/Standard is fully functional or something like that, but each time I switch back, the Free version just outperforms it from the get-go.
Any suggestions? Is the Free version for some strange reason more powerful than Basic or Standard or do I need to configure something differently once I get up and running on Basic or Standard?
The notable difference between the Free and Basic/Standard tiers is that Free uses an undisclosed number of shared cores, whereas Basic/Standard has a defined number of CPU cores (1-4 based on how much you pay). Related to this is the fact that Free is a shared instance while Basic/Standard is a private instance.
My best guess based on this that since the Free servers you would be on house multiple different users and applications, they probably have pretty beef specs. Their CPUs are probably 8-core Xeons and there might even be multiple CPUs. Most likely, Azure isn't enforcing any caps but rather relying on quotas (60 CPU minutes / day for the Free tier) and overall demand on the server to restrict CPU use. In other words, if your site is the only one that happens to be doing anything at the moment (unlikely of course, but for the sake of example), you could be potentially utilizing all 8+ cores on the box, whereas when you move over to Basic/Standard you are hard-limited to 1-4. Processing XML is actually very CPU heavy, so this seems to line up with my assumptions.
More than likely, this is a fluke. Perhaps your residency is currently on a relatively newly provisioned server that hasn't been fill up with tenants yet. Maybe you just happen to be sharing with tenants that aren't doing much. Who knows? But, if the server is ever actually under real load, I'd imagine you'd see a much worse response time on the Free tier than even Basic/Standard.
I have a .NET service which need to feed live financial data to its clients. The output rate for this feed might get intense and I am looking for the best architecture to implement this type of service with low latency and high performance.
I was thinking of using some kind of a stream data provider, one that is used for audio or video, but send feed updates instead.
Would appreciate any thought on this subject, or any real world examples
Update:
I don't have to use WCF, that was only my first approach since it is the current technology. Any other implementation in C# is welcome.
Full Disclosure: I work for Informatica (formerly 29West) and am on the engineering team responsible for their messaging products. I am biased. I do, however, have a pretty good grasp of low-latency messaging in the financial market.
If you message rates are about 60 messages/sec. (as stated in a comment on Will Dean's answer), and they're being delivered to a GUI with a human sitting in front of it and reacting to the market at human-speed, it honestly doesn't matter a whole lot what software you use from a latency perspective. You might even be able to get away with using WCF (though I'd still recommend against it; we considered supporting it once and prototyped an adapter for it and it bloated latencies up by an order of magnitude - we decided not to bother with it at the time).
Now, Informatica's messaging software can pass messages between processes on the same machine in well under a microsecond, and if you want to buy some nice 10 gig-E NICs with kernel bypass or InfiniBand gear, you can pass millions of messages per second between machines with single-digit microseconds of latency. We'll also soon be releasing a new data serialization library that's supported in C/C++, Java, and .NET as part of the messaging product that in some cases is actually faster than Protocol Buffers (although Protocol Buffers are widely used and also a very good choice). Our .NET and Java APIs both have a feature called "ZOD" for "Zero Object Delivery", which is a kinda funny way of saying they generate no new objects during message delivery, meaning no garbage collection pauses & associated latency spikes/outliers. We've got another product called UMDS that's specifically designed to fan out high-speed backbone traffic to slower desktop apps without slowing down the backbone or other clients.
I could go on and on about how great Informatica's messaging software is and I do think it's worth checking out, but this already looks like a straight-up ad, and I'm an engineer, not a sales person. So here's a few pieces of more general advice:
If you have a lot of clients receiving the same data, you'll want some flavor of UDP multicast. You'll often want a reliable multicast transport of some kind - the well-known (and free) reliable multicast protocol is PGM. Windows includes an implementation of PGM that's usable in C#; I'll refer you to Mike Rettig's excellent blog post on how to use it if you want to try it out. (I happen to know Mike - he's a smart guy.) Protocol choice is an area in which you get what you pay for; Informatica's messaging includes a reliable multicast protocol loosely based off of PGM (our architect who designed it co-wrote the PGM RFC a long while back), but with a lot of major improvements. Plain PGM might be fine for what you need, though.
You want to go with a brokerless/serverless architecture. Have the apps communicate peer-to-peer with nothing in the middle. Avoid extra hops in the message path (which usually means avoid most JMS implementations, avoid almost anything with "queue" in the name somewhere, etc.).
Be mindful of how your system behaves when one individual client misbehaves. Can one slow consumer slow down everyone else?
There are a lot of OS tuning and BIOS tuning options that can benefit any sort of low-latency messaging, homegrown or bought - things like interrupt coalescing, tying NIC interrupts to a particular CPU core, receive-side scaling (which has historically been terrible when used with UDP on Windows, but should be getting much better in the future), disabling certain CPU power states, etc.
Resist the temptation to use built-in object serialization in .NET to send whole objects over the wire - it is orders of magnitude slower than using a simple binary format (like Protocol Buffers, or Informatica's serialization library, or your own binary format, etc.).
If you have more specific questions or need more detail on any of my advice, just let me know!
How low is 'low latency' and how busy is 'intense'? You need to have some idea of what you're aiming for to choose the right approach.
I could supply you some hardware which would respond to 100% of all requests within, say, 20us upto the full capacity of your network hardware, but it would not use WCF much at all.
To a very broad approximation, I would say that things like WCF are very high-level and trade-off ease-of-use and abstraction-for-the-benefit-of-the-programmer against performance (latency/throughput). Whether they trade it off too much for your application needs real numbers.
The lowest-latency, lowest-overhead IP-based protocol in widespread use is UDP - that's why it's used for things like DNS and NTP. It's very scalable at the server, because the server doesn't need to keep any state, and it's very simple to implement on almost any platform. But you do need to be thinking in terms of network packets rather than .NET objects. Do you get to supply the client-end software too?
Live financial data? Never rely on WCF on that. Instead, go with what other industries use. i.e. NASDAQ uses Real-Time Innovations - Data Distribution Service to deliver live stock ticks to users. They provide C/C++/C# api for their communications libraries, which is extremely easy to setup and use (compared to WCF).
In general, this sort of real-time data feeds use publish/subscribe paradigm which helps to make sure that the communication happens with minimal overhead. This sort of an approach is the main idea in message-oriented middle ware and it is exactly what financial services use for real-time stuff.
On a side node, you can deliver real-time audio-video packets using RTI-DDS library, as far as I know, unmanned aerial vehicles like MQ-9 uses again this library to deliver live video & geo-location information to the ground control stations.
There are also free data distribution service libraries but I've no experience in them. You just need to google for it.
Edit: I'm currently prototyping some HMI (human machine interface) software which uses aforementioned RTI-DDS libraries along with two other libraries which have such message oriented architectures, which did work a thread up to now for all my real-time communication needs. Here is a demo: http://epics.codeplex.com/ (It will be used in remotely controlling the equipment in our brand new nuclear research facility)
The more assumptions you make and features you cut out the faster you can make your system. The more robust and flexible you attempt to make things, the more your performance will suffer. I would suggest a few basic must haves:
A binary data serialization format.
Don't use XML or any other human
readable method of passing your
data.
A robust enough data
serialization format that it can
support cross-architecture,
cross-language endpoints. BER comes
to mind - C# seems to have support
A transport protocol that has
guaranteed delivery and data
integrity. If any type of
financial algorithm will be using
this data, even missing one tick
could mean the difference between
and order being triggered or missing
out on a price. Even if you are
going to aggregate ticks in your
server you still want control over
how the information is presented to
your clients. TCP works for distributed systems. However there are much faster alternatives if your clients are on the same machine as your server. UDP won't even garauntee order, which can be problematic (though not insurmountable).
With regard to internal processing:
Avoid strings and other classes that
add significant overhead to simple
tasks. Use basic character arrays
instead. I'm not sure what options
you have in C# or if you even have
lightweight alternatives. If so, use
them. This applies to data-structures as well.
Be aware of double/float comparison errors. Use comparisons that only check for the necessary level of precision. If possible convert everything to integers internally and provide enough metadata to convert back on the other end.
Use something similar to pooled allocators in C++. My lack of knowledge of C# prevents me from being more specific. Again C# probably isn't your best choice here. Bottom line is that you are going to be creating and destroying a lot of tick objects and there is no reason to ask the OS for the memory every time.
Only send out deltas, don't send information that your clients already have. This assumes you are using a transport with guaranteed delivery. If not you could end up displaying stale data for a long time.
This might be of interest although its specific to gaming ... Lowest Latency small size data Internet transfer protocol? c#
Here is a tutorial on UDP connection http://www.winsocketdotnetworkprogramming.com/clientserversocketnetworkcommunication8r.html
Another Article on UDP
http://msdn.microsoft.com/en-us/magazine/cc163648.aspx
You ask specifically about a "low latency User Feed". What do you really want with low latency, for 'Feed Only' (and especially if it does not generate revenue), could the Users wait a second; that is not low latency.
If you want to trade FAST then you need to physically move across the street from the Exchange (or nearby with an Optical Link). Next you need to 'Trade on the Card'; the Ethernet Card is 'smart' and is fed 'Trade Formulas' that program the Network Card to make a preprogrammed trade based on Data received (without pestering your Computer).
See: http://intelligenttradingtechnology.com/article/groundbreaking-results-high-performance-trading-fpga-and-x86-technologies
Learning to manipulate that Environment will buy you more than reinventing the wheel.
Ultra low latency is costly, but billions are at stake; your stakes (and pursuit of lower latency) with be throttled by $.
In the past i've used Tibco rv or raw sockets for streaming prices/ rates, where high frequency updates are expected. In this situation, it is often the client (or in in fact the user) who is the limitation (as there is only so many updates a user can process), and this is therefore an example of where you can 'lose' data. In this situation a client side service broker can be used to throttle updates.
If the system is used for automated trading or HFT then products like 29West LatencyBuster has been proven to work well and offers guaranteed messaging.
We're developing a .NET app that must make up to tens of thousands of small webservice calls to a 3rd party webservice. We would prefer a more 'chunky' call, but the 3rd party does not support it. We've designed the client to use a configurable number of worker threads, and through testing have code that is fairly well optimized for one multicore machine. However, we still want to improve the speed, and are looking at spreading the work accross multiple machines. We're well versed in typical client/server/database apps, but new to designing for multiple machines. So, a few questions related to that:
Is there any other client-side optimization, besides multithreading, that we should look at that could improve speed of a http request/response? (I should note this is a non-standard webservice, so is implemented using WebClient, not a WCF or SOAP client)
Our current thinking is to use WCF to publish chunks of work to MSMQ, and run clients on one or more machines to pull work off of the queue. We have experience with WCF + MSMQ, but want to be sure we're not missing better options. Are there other, better ways to do this today?
I've seen some 3rd party tools like DigiPede and Microsoft's HPC offerings, but these seem like overkill. Any experience with those products or reasons we should consider them over roll-our-own?
Sounds like your goal is to execute all these web service calls as quickly as you can, and get the results tabulated. Given that, your greatest efficiency control is going to be through scaling the number of concurrent requests you can make.
Be sure to look at your client-side connection limits. By default, I think the system default is 2 connections. I haven't tried this myself, but by upping the number of connections with this property, you should theoretically see a multiplier effect in terms of generating more requests by generating more connections from a single machine. There's more info on MS forums.
The MSMQ option works well. I'm running that configuration myself. ActiveMQ is also a fine solution, but MSMQ is already on the server.
You have a good starting point. Get that in operation, then move on to performance and throughput.
At CodeMash this year, Wesley Faler did an interesting presentation on this sort of problem. His solution was to store "jobs" in a DB, then use clients to pull down work and mark status when complete.
He then pushed the whole infrastructure up to Amazon's EC2.
Here's his slides from the presentation - they should give you the basic idea:
I've done something similar w/ multiple PC's locally - the basics of managing the workload were similar to Faler's approach.
If you have optimized the code, you could look into optimizing the network side to minimize the number of packets sent:
reuse HTTP sessions (i.e.: multiple transactions into one session by keeping the connection open, reduces TCP overhead)
reduce the number of HTTP headers to the minimum in the request to save bandwidth
if supported by server, use gzip to compress the body of the request (need to balance CPU usage to do the compression, and the bandwidth you save)
You might want to consider Rhino Service Bus instead of MSMQ. The source is available here.
What is the best way to determine available bandwidth in .NET?
We have users that access business applications from various remote access points, wired and wireless and at times the bandwidth can be very low based on where the user is. When the applications appear to be running slow, the issue could be due to low bandwidth and not some other issue.
I would like to be able to run some kind of service that would warn users whenever the available bandwidth dips below a specific threshold.
Any thoughts?
Not beyond the obvious of downloading a file of a known size and timing how long it takes. the disadvantage of that is that you'd need to waste a lot of bandwidth to do it. Also, if you wanted to alert when throughput drops below a threshold, you'll have to run the test more-or-less continuously.
IMHO, I'd live with poor performance in some locations, given that you can't do anything about it if it does occur.
Sorry.
There's no easy way to measure bandwidth without actually using it - which of course will starve the applications. A couple of points to bear in mind though:
1) Is it actually bandwidth that's the problem, or latency? You can measure latency in a less intrusive manner than bandwidth.
2) Are the applications all run from the same server (or at least the same network)? You may find that users will have a good connection to some areas of the net but not others. (It's likely that the last mile will be the limiting factor, but it's not always the case.)
If you're transferring data, simply measure it. You could also download a reference object from somewhere if you want to make it independent of the speed of your server.
Without knowing the exact nature of your connection, or how its used, there are two options that I am aware of.
MultinetGetConnectionPerformance (http://msdn.microsoft.com/en-us/library/aa385342(VS.85).aspx)
System Event Notification Service (http://msdn.microsoft.com/en-us/library/aa377538(VS.85).aspx)
Neither are direct .NET classes, but can be implemented in .NET very easily.
Take a look at both of them and see if they will work for you.
Roy