Trying to build a distributed crawler with ZeroMQ - C#

I just started to learn ZeroMQ and want to build a distributed web crawler as an example while learning.
My idea is to have a "server", written in PHP, which accepts a URL where the crawling should start.
Workers (C# CLI) will have to crawl that URL, extract links, and push them back onto a stack on the server. The server keeps sending URLs from the stack to workers.
Perhaps Redis will keep track of all crawled URLs, so we don't crawl sites multiple times and can extract statistics about the current process.
I would like the server to distribute tasks evenly, be aware of new/missing workers, and redistribute URLs when a worker doesn't respond.
Why PHP for the server: I'm just very comfortable with PHP, that is all. I don't want to make the example/testing project more complicated.
Why C# for the minions: because it runs on most Windows machines. I can give the executable to various friends who can just run it and help me test my project.
The crawling process and redis functionality are not part of my question.
My first approach was the PUSH/PULL pattern, which generally works for my scenario, but isn't aware of its workers. I think I need a DEALER/ROUTER broker in the middle and have to handle the worker awareness myself.
I found this question but I'm not really sure I understand the answer...
I'm asking for some hints on how to implement the ZMQ stuff. Is the dealer approach correct? Is there any way to get automatic worker awareness? I think I need some resources/examples, or do you think I just need to dig deeper into the ZMQ guide?
However, some hints in the right direction would be great :)
Cheers

I'm building a job/task distributor that works the same as your crawler, in principle, at least. Here are a few things I've learned:
Define All Events
Communication between server and crawlers will be based on different things happening in your system, such as dispatching work from server to crawler, or a crawler sending a heartbeat message to the server. Define the system's event types; they are the use cases:
DISPATCH_WORK_TO_CRAWLER_EVENT
CRAWLER_NODE_STATUS_EVENT
...
Define a Message Standard
All communication between server and crawlers should be done using ZMsgs, so define a standard that organizes your frames, something like this:
Frame1: "Crawler v1.0" //this is a static header
Frame2: <event type> //ex: "CRAWLER_NODE_STATUS_EVENT"
Frame3: <content xml/json/binary> //content that applies to this event (if any)
Now you can create message validators to validate ZMsgs received between peers since you have a standard convention all messages must follow.
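For instance, here is a minimal C# sketch of a helper that builds and validates messages following this convention. It assumes NetMQ as the C# binding (the question doesn't name one), so adjust the types to whatever wrapper you end up using:

// Minimal sketch, assuming NetMQ as the C# ZeroMQ binding.
using NetMQ;

public static class CrawlerProtocol
{
    public const string Header = "Crawler v1.0";

    // Build a message that follows the 3-frame convention.
    public static NetMQMessage Build(string eventType, string content)
    {
        var msg = new NetMQMessage();
        msg.Append(Header);          // Frame 1: static header
        msg.Append(eventType);       // Frame 2: event type
        msg.Append(content ?? "");   // Frame 3: payload (may be empty)
        return msg;
    }

    // Validate an incoming message against the convention.
    public static bool IsValid(NetMQMessage msg)
    {
        return msg.FrameCount >= 2
            && msg[0].ConvertToString() == Header
            && msg[1].ConvertToString().Length > 0;
    }
}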
Server
Use a single ROUTER socket on the server for asynchronous and bidirectional communication with the crawlers. Also use a PUB socket for broadcasting heartbeat messages.
Don't block on the ROUTER socket; use a poller that loops every 5s or so. This allows the server to do other things periodically, like broadcast heartbeat events to the crawlers. Something like this:
Socket rtr = ..; //ZMQ.ROUTER
Socket pub = ..; //ZMQ.PUB
ZMQ.Poller poller = new ZMQ.Poller(1);
poller.register(rtr, ZMQ.Poller.POLLIN);

while (true) {
    //wait up to 5s for messages from crawlers
    poller.poll(5000);

    if (poller.pollin(0)) {
        //messages from crawlers
        ZMsg msg = ZMsg.recvMsg(rtr);
        //handle the message here
    }

    //send heartbeat messages
    ZMsg heartbeatMsg = ..;
    //create message content here,
    //publish to all crawlers
    heartbeatMsg.send(pub);
}
To address your question about worker awareness, a simple and effective method uses a FIFO stack along with the heartbeat messages; something like this:
server maintains a simple FIFO stack in memory
server sends out heartbeats; crawlers respond with their node name; the ROUTER automatically puts the address of the node in the message as well (read up on message enveloping)
push 1 object onto the stack containing the node name and node address
when the server wants to dispatch work to a crawler, just pop the next object off the stack, create the message and address it properly (using the node address), and off it goes to that worker
dispatch more work to other crawlers the same way; when a crawler responds back to the server, just push another object with node name/address back on the stack; the other workers won't be available until they respond, so we don't bother them.
This is a simple but effective method of distributing work based on worker availability instead of blindly sending it out; a sketch follows below. Check the lbbroker.php example; the concept is the same.
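A minimal C# sketch of that ready-list idea, again assuming NetMQ as the binding; the class and method names are illustrative, not part of any API:

// Keep the identities of crawlers that have responded and are ready for work.
using System.Collections.Generic;
using NetMQ;
using NetMQ.Sockets;

public class WorkDispatcher
{
    private readonly Queue<NetMQFrame> _readyWorkers = new Queue<NetMQFrame>();

    // Called whenever the ROUTER receives a message from a crawler.
    public void OnCrawlerMessage(NetMQMessage msg)
    {
        NetMQFrame workerAddress = msg[0];    // ROUTER prepends the peer's identity frame
        // ... validate and handle the rest of the message here ...
        _readyWorkers.Enqueue(workerAddress); // this worker is available again
    }

    // Pop the next available crawler and send it a URL to crawl.
    public bool TryDispatch(RouterSocket router, string url)
    {
        if (_readyWorkers.Count == 0)
            return false;                                // nobody is available right now

        var msg = new NetMQMessage();
        msg.Append(_readyWorkers.Dequeue());             // addressing frame for the ROUTER
        msg.Append("Crawler v1.0");                      // Frame 1: static header
        msg.Append("DISPATCH_WORK_TO_CRAWLER_EVENT");    // Frame 2: event type
        msg.Append(url);                                 // Frame 3: content
        router.SendMultipartMessage(msg);
        return true;
    }
}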
Crawler (Worker)
The worker should use a single DEALER socket along with a SUB. The DEALER is the main socket for async communication, and the SUB subscribes to heartbeat messages from the server. When the worker receives a heartbeat message, it responds to the server on the DEALER socket.
Socket dlr = ..; //ZMQ.DEALER
Socket sub = ..; //ZMQ.SUB
ZMQ.Poller poller = new ZMQ.Poller(2);
poller.register(dlr, ZMQ.Poller.POLLIN);
poller.register(sub, ZMQ.Poller.POLLIN);

while (true) {
    poller.poll(5000);

    if (poller.pollin(0)) {
        //message from server
        ZMsg msg = ZMsg.recvMsg(dlr);
        //handle the dispatched work here
    }

    if (poller.pollin(1)) {
        //heartbeat message from server
        ZMsg msg = ZMsg.recvMsg(sub);
        //reply back with status
        ZMsg statusMsg = ..;
        statusMsg.send(dlr);
    }
}
The rest you can figure out on your own. Work through the PHP examples, build stuff, break it, build more; it's the only way you'll learn!
Have fun, hope it helps!

Related

Order of messages in SignalR - how to ensure it

I'm using SignalR in an application that sends a lot of messages in a short period of time.
Let's say I have client A and client B.
Client A just sends messages and client B just listens for messages.
Client A sends the following messages in the following order: A->B->C->D
What I'm seeing is that client B sometimes receives the messages in a different order, for example: B->A->C->D
It is important to maintain the same order in which I sent the messages.
I've looked online and found people saying I should use async-await in the hub method that handles those messages.
public async Task hubMethod(msgObject msg)
{
    await Clients.All.message(msg);
}
I'm not sure how that helps, since each time I make a call from client A, SignalR should create a new instance of the hub.
The only thing it does is wait until SignalR has finished everything it can do on the server to send the message to the other client, and then notify client A.
So my question is this - is there a SignalR or ASP.NET mechanism that makes sure I receive the messages in the correct order on the other client, or do I need to write my own mechanism (server or client) that reorders the messages if they are out of order - and if so, is there a library that already does this?
You need to write your own mechanism. SignalR on client B has no way to know in which order the messages were sent by client A, because there are many things that could delay a specific message's arrival, like network delay; the only thing SignalR can guarantee is the order in which the messages arrived.
If you really need to know the original order of the messages, you could put a counter inside each message and let client B sort them out. However, I suggest you try another approach, because guaranteeing the order of delivery is not an easy task.
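If you do go the counter route, a minimal client-side sketch might look like this. All of the type and member names here are illustrative, not SignalR APIs: client A stamps each message with an increasing sequence number, and client B buffers anything that arrives early until the gap is filled.

// Hedged sketch: sequence-number the messages and reorder them on client B.
using System.Collections.Generic;

public class SequencedMessage
{
    public long Seq { get; set; }        // set by client A: 0, 1, 2, ...
    public string Payload { get; set; }
}

public class OrderedReceiver
{
    private long _nextExpected = 0;
    private readonly SortedDictionary<long, SequencedMessage> _pending =
        new SortedDictionary<long, SequencedMessage>();

    // Call this from the SignalR "message" handler on client B.
    public IEnumerable<SequencedMessage> Accept(SequencedMessage msg)
    {
        _pending[msg.Seq] = msg;
        // Release every message that is now contiguous with what has already been delivered.
        while (_pending.TryGetValue(_nextExpected, out var next))
        {
            _pending.Remove(_nextExpected);
            _nextExpected++;
            yield return next;
        }
    }
}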

How to send updates from server to clients?

I am building a c#/wpf project.
Its architecture is this:
A console application which will be on a virtual machine (or my home computer) that will be the server side.
A wpf application that will be the client app.
Now my problem is this - I want the server to be able to send changes to the clients. If, for example, I have a change for client ABC, I want the server to know how to call a service on that client's computer.
The problem is, that I don't know how the server will call the clients.
A small example in case I didn't explain it well:
The server is on computer 1, and there are two clients, on computers 2 and 3.
Client 2 has a Toyota car and client 3 has a BMW car.
The server on computer 1 wants to tell client 2 that it has a new car, an Avenger.
How do I keep track and call services on the clients?
I thought of saving their IP addresses (from calling ipconfig from the cmd) in the DB - but isn't that based on the Wi-Fi/network they are connected to?
Thanks for any help!
You could try implementing SignalR. It is a great library that uses web sockets to push data to clients.
Edit:
SignalR can help you solve your problem by allowing you to set up Hubs on your console app (server) that WPF application (clients) can connect to. When the clients start up you will register them with a specified Hub. When something changes on the server, you can push from the server Hub to the client. The client will receive the information from the server and allow you to handle it as you see fit.
Rough mockup of some code:
namespace Server {
    public class YourHub : Hub {
        public void SomeHubMethod(string userName) {
            //clientMethodToCall is a method in the WPF application that
            //will be called. Client needs to be registered to hub first.
            Clients.User(userName).clientMethodToCall("This is a test.");
            //One issue you may face is mapping client connections.
            //There are a couple different ways/methodologies to do this.
            //Just figure what will work best for you.
        }
    }
}
namespace Client {
    public class HubService {
        public IHubProxy CreateHubProxy() {
            var hubConnection = new HubConnection("http://serverAddress:serverPort/");
            IHubProxy yourHubProxy = hubConnection.CreateHubProxy("YourHub");
            return yourHubProxy;
        }
    }
}
Then in your WPF window:
var hubService = new HubService();
var yourHubProxy = hubService.CreateHubProxy();
yourHubProxy.Start().Wait();
yourHubProxy.On("clientMethodToCall", () => DoSometingWithServerData());
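Regarding the "mapping client connections" comment in the hub mockup above, one possible approach (purely illustrative, and certainly not the only one) is to keep an in-memory map from user name to connection id and target clients through it:

// Sketch of one way to map users to SignalR connections; the Register method
// and the static dictionary are assumptions, not SignalR requirements.
using System.Collections.Concurrent;
using System.Threading.Tasks;
using Microsoft.AspNet.SignalR;

public class YourHub : Hub
{
    // userName -> connectionId
    private static readonly ConcurrentDictionary<string, string> Connections =
        new ConcurrentDictionary<string, string>();

    public void Register(string userName)
    {
        // The client calls this right after connecting.
        Connections[userName] = Context.ConnectionId;
    }

    public override Task OnDisconnected(bool stopCalled)
    {
        foreach (var pair in Connections)
            if (pair.Value == Context.ConnectionId)
                Connections.TryRemove(pair.Key, out _);
        return base.OnDisconnected(stopCalled);
    }

    public void SomeHubMethod(string userName)
    {
        if (Connections.TryGetValue(userName, out var connectionId))
            Clients.Client(connectionId).clientMethodToCall("This is a test.");
    }
}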
You need to create some kind of subscription model for the clients to the server to handle a Publish-Subscribe channel (see http://www.enterpriseintegrationpatterns.com/patterns/messaging/PublishSubscribeChannel.html). The basic architecture is this:
Client sends a request to the messaging channel to register itself as a subscriber to a certain kind of message/event/etc.
Server sends messages to the channel to be delivered to subscribers to that message.
There are many ways to handle this. You could use some of the Azure services (like Event hub, or Topic) if you don't want to reinvent the wheel here. You could also have your server application track all of these things (updates to IP addresses, updates to subscription interest, making sure that messages don't get sent more than once; taking care of message durability [making sure messages get delivered even if the client is offline when the message gets created]).
In general, whatever solution you choose is plagued with a common problem - clients hide behind firewalls and have dynamic IP addresses. This makes it difficult (I've heard of technologies claiming to overcome this but haven't seen any in action) for a server to push to a client.
In reality, the client talks and the server listens and responds. However, you can use this approach to simulate a push by:
1. polling (the client periodically asks for information)
2. long polling (the client asks for information and the server holds onto the request until information arrives or a timeout occurs)
3. sockets (the client requests server connection that is used for bi-directional communication for a period of time).
Knowing those terms, your next choice is to write your own or use a third-party service (Azure, Amazon, other) to deliver messages for you. I personally like long polling because it is easy to implement. In my application, I have the following setup:
A Web API server on Azure with an endpoint that listens for message requests
A simple loop inside the server code that checks the database for new messages every 100ms.
A client that calls the API, handling the response.
As mentioned, there are many ways to do this. In your particular case, one way would be as follows.
Client A calls server API to listen for message
Server holds onto call, waiting for new message entry in database
Client B calls server API to post new message
Server saves message to database
Server instance from step 2 sees new message
Server returns message to Client A.
Also, the message doesn't have to be stored in a database - it just depends on your needs.
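For illustration, a rough sketch of the long-polling endpoint described above, using ASP.NET Web API 2. MessageStore is a placeholder for your own data access; the timeout and route are arbitrary assumptions:

// Hedged sketch: hold the request until a message arrives or the poll times out.
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Web.Http;

public static class MessageStore   // placeholder for your own database access
{
    public static IList<string> GetNewerThan(long since) { /* query the database */ return new List<string>(); }
}

public class MessagesController : ApiController
{
    // GET api/messages?since=123
    public async Task<IHttpActionResult> GetMessages(long since)
    {
        var deadline = DateTime.UtcNow.AddSeconds(30);        // long-poll timeout
        while (DateTime.UtcNow < deadline)
        {
            var messages = MessageStore.GetNewerThan(since);  // check for new messages
            if (messages.Count > 0)
                return Ok(messages);                          // something arrived: respond now
            await Task.Delay(100);                            // nothing yet: wait and re-check
        }
        return Ok(new List<string>());                        // timeout: the client re-issues the call
    }
}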
Sounds like you want to track users à la https://www.simple-talk.com/dotnet/asp.net/tracking-online-users-with-signalr/ , but in a desktop app in the sense of http://www.codeproject.com/Articles/804770/Implementing-SignalR-in-Desktop-Applications or damienbod.wordpress.com/2013/11/20/signalr-a-complete-wpf-client-using-mvvm/ .

ZeroMQ Non-Blocking Non-Queueing Push

I am using the C# wrapper for ZeroMQ but this seems more like an underlying issue with ZeroMQ.
Is there any way to push a message without blocking and without queueing? If the server is not up I would like the messages to be permanently disposed without blocking.
Here are the settings I've tried so far:
1)
Send (Blocking send)
High water mark = 0
This (strangely) does not block, but it seems to queue in memory until the socket is connected (memory keeps rising for the process).
2)
Send (Non-Blocking send)
High water mark = 1
This is a race condition. If I send two messages in rapid succession one message is sometimes thrown out for exceeding the high water mark.
3)
Poll the socket to figure out if it's going to block. This doesn't really help because I still have to put one (old) message in the queue before it starts blocking (if I set HWM = 1).
Non-blocking send with any high water mark is undesirable because as soon as the server comes back online it gets a bunch of old messages from clients.
Blocking send doesn't work because I don't want to block.
What you seem to be looking for is simply a PUB socket. This socket type never blocks on send, and discards any message it cannot send to a subscriber. See this page : http://api.zeromq.org/3-2:zmq-socket .
On a side note, you do not need to use this socket for "real" pub/sub; you can use it for non-blocking communication between two nodes by having only one PUB and one SUB socket per endpoint.
Your server will not get "old" messages after a reconnect because the PUB socket will have dropped the messages it could not send while the server was disconnected. Nevertheless, I believe that while you cannot avoid some internal ZMQ queuing, it should have little bearing on your use case.
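A minimal sketch of that fire-and-forget behaviour, assuming NetMQ as the C# binding (the endpoint and names are made up): sends never block, and anything sent while no subscriber is connected is simply discarded.

using NetMQ;
using NetMQ.Sockets;

class FireAndForgetSender
{
    static void Main()
    {
        using (var pub = new PublisherSocket())
        {
            pub.Connect("tcp://server-address:5556");   // endpoint is illustrative

            // If no subscriber is connected, these messages are simply dropped.
            pub.SendFrame("status update 1");
            pub.SendFrame("status update 2");
        }
    }
}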

TCPListener / TCPClient Server-Client Writing / Reading data

Here I am troubleshooting a theoretical problem about HOW servers and clients work on their machines. I know all the .NET processes, but I am missing something when it comes to code. I was unable to find anything related to this.
I code in Visual C# 2008, and I use the regular TCPClient / TCPListener in 2 different projects:
Project1 (Client)
Project2 (Server)
My issues are maybe quite simple:
1 -> About how the server receives data: are event handlers possible?
In my first server code I used this loop:
while (true)
{
    if (networkStream.DataAvailable)
    {
        //stuff
    }
    Thread.Sleep(200);
}
I consider this a crap way to control the incoming data from a server, BUT the server is always ready to receive data.
My question: is there anything like...? ->
AcceptTcpClient();
I want a handler that waits until something happens, in this case specific socket data arriving.
2 -> General networking I/O methods.
The problem (besides my being a noob) is how to handle multiple data writes.
If I send a lot of data in a byte array, the sending can break if I send more data. All the data gets joined together and errors occur when receiving. I want to handle multiple writes to send and receive.
Is this possible?
About how the server receives data: are event handlers possible?
If you want to write call-back oriented server code, you may find MSDN's Asynchronous Server Socket Example exactly what you're looking for.
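If you can target .NET 4.5 or later (the question mentions Visual C# 2008, so this is an assumption), async/await gives you the "wait until something happens" behaviour without a polling loop; the MSDN Begin/End callback example is the older equivalent. A rough sketch with illustrative names:

using System.Net;
using System.Net.Sockets;
using System.Threading.Tasks;

class Server
{
    static async Task RunAsync()
    {
        var listener = new TcpListener(IPAddress.Any, 9000);
        listener.Start();
        while (true)
        {
            TcpClient client = await listener.AcceptTcpClientAsync(); // waits here, no busy loop
            var _ = HandleClientAsync(client);                        // handle each client concurrently
        }
    }

    static async Task HandleClientAsync(TcpClient client)
    {
        using (client)
        using (NetworkStream stream = client.GetStream())
        {
            var buffer = new byte[4096];
            int read;
            while ((read = await stream.ReadAsync(buffer, 0, buffer.Length)) > 0)
            {
                // "read" bytes are now available in buffer[0..read); process them here.
            }
        }
    }
}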
... the sending can break if I send more data. All the data gets joined together and errors occur when receiving.
That is the nature of TCP. The standardized Internet protocols fall into a few categories:
             block oriented   stream oriented
reliable     SCTP             TCP
unreliable   UDP              ---
If you really want to send blocks of data, you can use SCTP, but be aware that many firewalls DROP SCTP packets because they aren't "usual". I don't know if you can reliably route SCTP packets across the open Internet.
You can wrap your own content into blocks of data with your own headers or add other "synchronization" mechanisms to your system. Consider an HTTP server: it must wait until it reads an entire request like:
GET /index.html HTTP/1.1␍␊
Host: www.example.com␍␊
␍␊
Until the server sees the CRLFCRLF sequence, it must keep the partially-read data in a buffer. The bytes might come in one at a time in a dozen or more packets. Or, if the client is sending multiple requests in a single stream, a dozen requests might come in a single packet.
You just have to handle this.
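One common way to wrap your own content into blocks is a length prefix: write a fixed-size header containing the payload length, then the payload, and have the receiver read exactly that many bytes. A hedged sketch, not tied to any particular library:

using System;
using System.IO;
using System.Net.Sockets;

static class Framing
{
    public static void WriteMessage(NetworkStream stream, byte[] payload)
    {
        byte[] length = BitConverter.GetBytes(payload.Length);   // 4-byte length header
        stream.Write(length, 0, 4);
        stream.Write(payload, 0, payload.Length);
    }

    public static byte[] ReadMessage(NetworkStream stream)
    {
        byte[] header = ReadExactly(stream, 4);
        int length = BitConverter.ToInt32(header, 0);
        return ReadExactly(stream, length);
    }

    // TCP is a byte stream: a single Read may return only part of what was asked for.
    private static byte[] ReadExactly(NetworkStream stream, int count)
    {
        var buffer = new byte[count];
        int offset = 0;
        while (offset < count)
        {
            int read = stream.Read(buffer, offset, count - offset);
            if (read == 0) throw new EndOfStreamException("Connection closed mid-message.");
            offset += read;
        }
        return buffer;
    }
}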

Suggestions for developing a TCP/IP based message client

I've got a server-side protocol that controls a telephony system. I've already implemented a client library that communicates with it, which is in production now; however, there are some problems with the system I have at the moment, so I am considering rewriting it.
My client library is currently written in Java, but I am thinking of rewriting it in both C# and Java to allow different clients to have access to the same back end.
The messages start with a keyword, have a number of bytes of metadata, and then some data. The messages are always terminated by an end-of-message character.
Communication is duplex between the client and the server, usually taking the form of a request from the client which provokes several responses from the server, but some messages can be notifications.
The messages are marked as being one of:
C: Command
P: Pending (server is still handling the request)
D: Data (data sent as a response to a command)
R: Response
B: Busy (Server is too busy to handle response at the moment)
N: Notification
My current architecture has each message being parsed and a thread spawned to handle it; however, I'm finding that some of the Notifications are processed out of order, which is causing me trouble since they have to be handled in the same order they arrive.
The duplex messages tend to take the following message format:
Client -> Server: Command
Server -> Client: Pending (Optional)
Server -> Client: Data (optional)
Server -> Client: Response (2nd entry in message data denotes whether this is an error or not)
I've been using the protocol for over a year and I've never seen a Busy message, but that doesn't mean they don't happen.
The server can also send notifications to the client, and there are a few Response messages that are auto triggered by events on the server so they are sent without a corresponding Command being issued.
Some Notification messages will arrive as part of a sequence of related messages, for example:
NotificationName M00001
NotificationName M00001
NotificationName M00000
The string M0000X means that either there is more data to come or that this is the end of the messages.
At present the TCP client is fairly dumb: it just spawns a thread that notifies an event on a subscriber that the message has been received. The event is specific to the message keyword and the type of message (so Data, Response, and Notification messages are handled separately). This works fairly effectively for Data and Response messages, but falls over with the Notification messages, as they seem to arrive in rapid sequence and a race condition sometimes causes the message end to be processed before the messages that carry the data, leading to lost message data.
Given this really badly written description of how the system works, how would you go about writing the client-side transport code?
The metadata does not include a message number, and I have no control over the underlying protocol, as it's provided by a vendor.
The requirement that messages must be processed in the order in which they're received almost forces a producer/consumer design, where the listener gets requests from the client, parses them, and then places the parsed request into a queue. A separate thread (the consumer) takes each message from the queue in order, processes it, and sends a response to the client.
Alternately, the consumer could put the result into a queue so that another thread (perhaps the listener thread?) can send the result to the client. In that case you'd have two producer/consumer relationships:
Listener -> event queue -> processing thread -> output queue -> output thread
In .NET, this kind of thing is pretty easy to implement using BlockingCollection to handle the queues. I don't know if there is something similar in Java.
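A rough sketch of that listener -> queue -> consumer arrangement with BlockingCollection; ParsedRequest and the processing method are placeholders for your own types and logic:

using System.Collections.Concurrent;
using System.Threading.Tasks;

class ParsedRequest { /* keyword, metadata, payload, connection, ... */ }

class Pipeline
{
    private readonly BlockingCollection<ParsedRequest> _eventQueue =
        new BlockingCollection<ParsedRequest>();

    // Listener thread: parse each incoming message and enqueue it.
    public void OnMessageReceived(ParsedRequest request)
    {
        _eventQueue.Add(request);
    }

    // Consumer thread: processes messages strictly in the order they were enqueued.
    public void StartConsumer()
    {
        Task.Run(() =>
        {
            foreach (var request in _eventQueue.GetConsumingEnumerable())
            {
                ProcessAndRespond(request);   // your processing + response logic
            }
        });
    }

    private void ProcessAndRespond(ParsedRequest request) { /* ... */ }
}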
The possibility of a multi-message request complicates things a little bit, as it seems like the listener will have to buffer messages until the last part of the request comes in before placing the entire thing into the queue.
To me, the beauty of the producer/consumer design is that it forces a hard separation between different parts of the program, making each much easier to debug and minimizing the possibility of shared state causing problems. The only slightly complicated part here is that you'll have to include the connection (socket or whatever) as part of the message that gets shared in the queues so that the output thread knows where to send the response.
It's not clear to me if you have to process all messages in the order they're received or if you just need to process messages for any particular client in the proper order. For example, if you have:
Client 1 message A
Client 1 message B
Client 2 message A
Is it okay to process the first message from Client 2 before you process the second message from Client 1? If so, then you can increase throughput by using what is logically multiple queues--one per client. Your "consumer" then becomes multiple threads. You just have to make sure that only one message per client is being processed at any time.
I would have one thread per client doing the parsing and processing. That way the processing happens in the order the messages are sent/arrive.
As you have stated, the tasks cannot be performed in parallel safely. Performing the parsing and processing in different threads is likely to add as much overhead as you might save.
If your processing is relatively simple and doesn't depend on external systems, a single thread should be able to handle 1K to 20K messages per second.
Are there any other issues you would want to fix?
I can only recommend a Java-based solution.
I would use some already mature transport framework. By "some" I mean the only one I have worked with until now -- Apache MINA. However, it works and it's very flexible.
Regarding processing messages out of order - for messages which must be processed in the order they were received, you could build queues and put such messages into queues.
To limit the number of queues, you could instantiate, say, 4 queues and route each incoming message to a particular queue depending on the last 2 bits (indices 0-3) of the hash of the ordering part of the message (for example, the client_id contained in the message); see the sketch below.
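Purely for illustration (and in C# to match the rest of this thread, rather than Java/MINA), the routing could look something like this; the queue type and names are assumptions:

// Hash the ordering key (e.g. client_id) and use its low 2 bits to pick one of
// 4 queues, so all messages for the same client land in the same queue and
// therefore stay in order. One consumer thread drains each queue.
using System.Collections.Concurrent;

class QueueRouter
{
    private readonly BlockingCollection<string>[] _queues;

    public QueueRouter()
    {
        _queues = new BlockingCollection<string>[4];
        for (int i = 0; i < 4; i++)
            _queues[i] = new BlockingCollection<string>();
    }

    public void Route(string clientId, string message)
    {
        int index = (clientId.GetHashCode() & int.MaxValue) & 0x3;  // last 2 bits -> 0..3
        _queues[index].Add(message);
    }
}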
If you have more concrete questions, I can update my answer appropriately.
