Run my own application on hadoop without coding mapreduce? - c#

Maybe I do not fully understand how complex Hadoop really is; if anything below is incorrect, please correct me. Here is what I have understood so far:
Hadoop is great for handling large amounts of data, mostly for data analysis and mining. I can write my own MapReduce functions or use Pig or Hive. I can even use existing functions such as wordcount, so I don't even have to write code.
OK, but what if I would like to use the power of Hadoop for things other than analysis/mining? For example, I have a .NET application written in C# that reads files and generates PDFs with some barcodes. This application runs on one server, but because that one server cannot handle the large number of files, I need more power. Why not add some Hadoop nodes/clusters to handle this job?
Question: can I take my .NET application and tell Hadoop "do this on every one of your nodes/clusters"? Is it possible to run these jobs without coding?
If not, do I have to throw away the .NET application and rewrite everything in Pig/Hive/Java MapReduce? Or how do people solve this in my situation?
PS: The important thing here is not the PDF generator, and maybe not .NET/C# either. The question is: given an application in whatever language, can I give it to Hadoop just like that? Or do we have to rewrite everything as MapReduce functions?

#Mongo : I'm not sure if I understood correctly, but I'll try sharing what I know. First of all, Hadoop is a framework, not an extension or a plugin.
If you want to process files or perform a task in Hadoop, you need to express your requirements in a form that Hadoop understands, so that it knows what to do with your data. To put it simply, consider the same word count example. You can do a word count on a file in any language. Let's say we have done it in Java and want to scale it to larger files: dumping the same code into a Hadoop cluster would not help. Though the Java logic remains the same, you will have to write MapReduce code in Java that the Hadoop framework understands.
Here's an example of a C# map reduce program for Hadoop processing
Here's another example of MapReduce Without Hadoop Using the ASP.NET Pipeline
Hope this is helpful. I'm assuming that my post adds some value to your question. I'm sure you would be getting better thoughts/suggestions/answers from the wonderful people here...
P.S.: You can do almost anything related to file processing or data analysis in Hadoop. It all depends on how you do it :)
Cheers !

Any application that can run on Linux can be run in Hadoop using Hadoop Streaming. And a C# application can run on Linux using Mono.
So you can run your C# application by combining Hadoop Streaming and Mono. But you still need to adapt your logic to the MapReduce paradigm.
However, it should not be a big deal in your case. For instance, you could:
create a Hadoop-streaming job with mappers only (no reducers)
process exactly 1 file per mapper
each mapper would run "mono yourApp.exe", reading the input file from stdin and writing the output to stdout
Also, Mono must be available on the Hadoop cluster. If not, some admin privileges will be required to install and deploy Mono yourself.
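To make the mapper-only idea a bit more concrete, here is a rough sketch of a small wrapper script that could serve as the streaming mapper. It is written in Python purely for illustration and rests on assumptions that are not in the answer above: the job input is a text file listing one file path per line, those paths are readable from the node running the mapper, and "mono yourApp.exe" stands in for the real command line. The helper names run_app_on_files and main are made up for this example.

```python
import subprocess
import sys


def run_app_on_files(paths, command):
    """Run `command` once per path, feeding the file's bytes on stdin.

    Returns a list of (path, exit_code, stdout_bytes) tuples.
    """
    results = []
    for path in paths:
        path = path.strip()
        if not path:
            continue  # skip blank input lines
        with open(path, "rb") as infile:
            proc = subprocess.run(command, stdin=infile, stdout=subprocess.PIPE)
        results.append((path, proc.returncode, proc.stdout))
    return results


def main(lines=sys.stdin, command=("mono", "yourApp.exe")):
    # Assumption: the placeholder command reads one input file on stdin
    # and writes its result to stdout, as described in the answer.
    for path, code, out in run_app_on_files(lines, list(command)):
        # Emit one tab-separated status record per processed file.
        print(f"{path}\t{code}\t{len(out)}")
```

Adding a call to main() at the bottom would turn this into a script that can be passed as the mapper of a streaming job; running with zero reducers (the streaming option -numReduceTasks 0) keeps it map-only. What to do with the generated output (for example, writing each PDF to shared storage) is left open here, since that depends on the application.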

Related

Is it possible to sandbox and run C++ or C# code that's entered in a textfield in a browser?

I'm diving into web development after ten years of desktop development and I'm experimenting with some testing concepts. I was wondering if it's possible to sandbox and run C++ code that's entered in a textfield in a browser? By that, I mean run the C++ or C# code on the backend webserver and return an analysis of the code. Just to be clear, I don't mean run C++ or C# code that's intended to generate any kind of markup, but simply to blackbox test the C++ or C# block of code that's entered.
How would you invoke the compiler, depending on the web server you're using?
How could you sandbox the code to prevent malicious behavior? If we're considering only one of the C variants, what about blacklisting/whitelisting specific functions and libraries to prevent malicious behavior? Or would that blacklist be too long and too limiting to allow any fair amount of code to run?
These are some fairly high-level questions that I'm asking just because I'm having a hard time finding some direction, but I'm going to continue researching them right now. Thanks so much in advance for your help!
You might find the codepad about page interesting.
# 1 is easy with C#. The Reflection capabilities of .NET allow you to compile and run code "on the fly." And here's a link to another good looking tutorial.
# 2 is a little more difficult, but I suppose a basic sandboxing technique might involve executing a dynamic process under a limited, and therefore sandboxed, account. Programmatically, you could analyze the dynamically built assembly's dependencies and not allow it to run if it used APIs in certain namespaces such as System.IO. This is non-trivial, to say the least.
C++ doesn't have reflection capabilities and so 3rd party libraries would be your best bet.
The Dinkumware site has something like this.
A simple Perl (or Python, ...) CGI could be used to invoke the compiler, parse its results, run the resulting executable, if any, and display its results.
I would take a look at SELinux (maybe AppArmor?) for access controls, perhaps not allowing it to read from or write to the disk, and limiting its running time. I don't know if the latter can be done with SELinux, too.
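For the running-time part specifically, a backend script can enforce the limit itself, independently of SELinux. The following is a minimal sketch in Python, assuming the submitted code has already been compiled into an executable; the function name run_with_time_limit and the five-second default are made up for illustration. It only limits wall-clock time and captures output. It does not restrict disk or network access, which still needs an OS-level mechanism such as the access controls mentioned above.

```python
import subprocess
import sys


def run_with_time_limit(command, input_text="", seconds=5):
    """Run `command` and return (status, stdout, stderr).

    status is "ok", "error" (non-zero exit code) or "timeout".
    """
    try:
        proc = subprocess.run(
            command,
            input=input_text,
            capture_output=True,
            text=True,
            timeout=seconds,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising.
        return ("timeout", "", "")
    status = "ok" if proc.returncode == 0 else "error"
    return (status, proc.stdout, proc.stderr)
```

A CGI or other handler could call this with the path of the freshly built executable and return the captured stdout and stderr to the browser.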
If the server runs Linux, you may consider using chroot
We actually did just that with our product called iKnode. We are using this idea to create a Backend in the cloud.
We did this by creating a SandBox that takes a specific piece of code, executes it, captures the result and returns it to the user. This is all done in the cloud.
How would you invoke the compiler, depending on the web server you're using?
We did this by using the CodeDom utilities from the .NET Framework. We are also exploring the upcoming 'compiler as a service' project from Microsoft, code-named Roslyn.
This is a good starting point on using CodeDom to compile programmatically.
How could you sandbox the code to prevent malicious behavior? If we're considering only one of the C variants, what about blacklisting/whitelisting specific functions and libraries to prevent malicious behavior? Or would that blacklist be too long and too limiting to allow any fair amount of code to run?
We did this by wrapping the code execution in a separate and limited AppDomain. You can see some examples here.
Additionally, you might want to look into the MonoSandBox, which was created for Moonlight, but it is a more robust SandBox. We are experimenting with it right now, to move away from AppDomains. We believe the MonoSandBox is way better than simple AppDomains.

VT100 Emulation Library in C# with SharpSSH

I'm messing around with Tamir.SharpSsh and wanted to see if it was possible to use it to implement a console SSH client fully in C#. I do not mean something like PuTTY, which runs in its own GUI, but something you could run directly from the Windows cmd console.
The library is pretty great, except that it doesn't handle terminal emulation in any way. So when using SshShell, you can do some basic interaction, but the output is often very ugly and full of random characters and you cannot actually interact with things like shell scripts, etc.
As far as I can tell SharpSSH simply redirects the IO to the console IO.
How hard would it be to redirect this elsewhere and handle the terminal emulation? Also, is there an emulation library (C# and open source, preferably) already that I could use?
Edit: Gave up on SharpSSH, see answer below for the final solution I came up with.
I have actually since abandoned trying to use SharpSSH. It is a good library, but was just too lacking in overall functionality. I am now using a library called Granados which is a much more fleshed out SSH implementation. It has a built in event model (unlike SharpSSH which mostly involves wrangling with Streams) that makes usage very easy.
As for the terminal emulation part... Granados is actually the core of another open source project called Poderosa.
Poderosa is a complete terminal emulator application that can connect to ssh, telnet and even your local cygwin install.
I haven't really dived into its terminal emulation code at all, but it definitely does the job quite well, so I'm sure you could easily pull out whatever code you need.
I'm looking for the same thing. There is a library here that costs $700. Found another one on codeproject that looks shoddy but might be a good start. And there is an incomplete implementation right here on stackoverflow. Still searching..

Would like to make a PHP get_included_files() enhancement

I am interested in making an application that can automatically determine what files are included in php.
What I'm getting at is that I would like to make either a C/C++ or a C# application that runs in the background and as you're developing on your local machine, it can display included files by php as you launch pages running on your local apache.
What I thought about was modifying the function in the PHP source code, but then I thought that would be a bad idea, because with each new version of PHP I'd have to go back and make the same modifications, and I doubt everyone would do that.
So my question is, is it remotely possible to get all the included files that your php application used and then somehow display them to the user without using get_included_files() in your php program?
You could go outside of PHP completely and rely on the underlying operating system to report these details. It would be difficult to match the request to the includes though so it would only work in a development situation.
If the OS is Linux/UNIX, you can run strace on the executable (assuming using Apache with mod_php, other situations more difficult).
If the OS is Windows, I'm not sure what to use, but possibly one of the SysInternals utilities (most are GUI, but there is likely a console equivalent of strace or a version of strace for Windows).
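As a rough illustration of the strace route on Linux, the background tool would mostly be filtering strace output for files that were opened successfully. The sketch below is in Python and assumes strace was attached with something like "strace -f -e trace=open,openat -p PID", so that lines look like open("/path/file.php", O_RDONLY) = 7. The function name included_php_files and the rule of keeping only paths ending in .php are assumptions made for this example.

```python
import re

# Matches successful open()/openat() calls on paths ending in .php, e.g.
#   open("/var/www/a.php", O_RDONLY) = 7
#   [pid 123] openat(AT_FDCWD, "/var/www/b.php", O_RDONLY) = 8
# Failed calls (return value -1) are not matched.
_OPEN_RE = re.compile(
    r'\bopen(?:at)?\((?:[^",]+,\s*)?"([^"]+\.php)"[^)]*\)\s*=\s*(\d+)'
)


def included_php_files(strace_lines):
    """Return the .php paths opened successfully, in first-seen order."""
    seen = []
    for line in strace_lines:
        match = _OPEN_RE.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen
```

As noted above, this shows which files were opened, not which request opened them, so it is only really usable on a development machine serving one request at a time.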
Another option would be to use xdebug. It would show you much more information, including profiling details, memory usage, etc. It is used as a PHP extension, and it makes it easy to profile a whole request in one snapshot. Once you have a trace file, you can use WinCacheGrind (Windows), KCacheGrind (UNIX, maybe OS X too) and something else for OS X. I'd suggest trying this, as it is the simplest approach and is quite powerful if you are looking to get this done rather than do exploratory programming.
http://xdebug.org/
If you are interested in doing exploratory programming, my suggested route would be to look at how xdebug works and see if you can write a hook to the functions you want to trace.

What's a good way to write batch scripts in C#?

I would like to write simple scripts in C#. Stuff I would normally use .bat or 4NT .btm files for. Copying files, parsing text, asking user input, and so on. Fairly simple but doing this stuff right in a batch file is really hard (no exceptions for example).
I'm familiar with command line "scripting" wrappers like AxScript so that gets me part of the way there. What I'm missing is the easy file-manipulation framework. I want to be able to do cd(".."), copy(srcFile, destFile) type functionality.
Tools I have tried:
NANT, which we use in our build process. Not a good scripting tool. Insanely verbose XML syntax and to add a simple function you must write an extension assembly. Can't do it inline.
PowerShell. Looks great, but I just haven't been able to switch over to this as my primary shell. Too many differences from 4NT. Whatever I do needs to run from an ordinary command prompt and not require a special shell to run it through. Can PowerShell be used as a script executor?
Perl/Python/Ruby. Really hate learning an entirely new language and framework just to do batch file operations. Haven't been able to dedicate the time I need to do this. Plus, we're a 99% .NET shop for our toolchain and I really want to leverage our existing experience and codebase.
Are there frameworks out there that are trying to solve this problem of "make a batch file in C#" that you have used?
I want the power of C#/.NET with the immediate-mode type functionality of a typical cmd.exe shell language. Am I alone in wanting something like this?
I would try to get over the PowerShell anxiety because it is the shell of the future. All of the software coming out of Microsoft is using it as their management interface and especially version 2.0 is ridiculously useful.
I'm a C# developer most of the time but PowerShell has solved that whole "WindowsApplication42" problem of temp projects just piling up. PowerShell gives you full access to the .NET framework in a command line shell so even if you don't know how to do something in PowerShell, you most likely know how to do it in .NET.
IronPython and IronRuby do let you leverage all of your .NET "experience and codebase" (they don't answer your objection to learning new languages, however).
If you have any bash nerds, you can always try cygwin.
Also remember that Python was originally intended as a "glue" language. If you use the aforementioned IronPython, it's pretty easy to tie together pre-written C# classes.
If you are bound to MS, PowerShell is surely the way to go. But I don't like it much.
I personally use MSBuild scripts more, and would like to see the Mono C# Shell come to Windows one day.
I think CS-Script might be the ideal solution for you.

In a .NET C# program, is it easy to transition from FTP to SFTP?

In a .NET C# program, is it easy to transition from FTP to SFTP? I'm trying to get a sense of how much time it would take the contractor to make the transition. My personal experience is mostly with PHP, so I have no idea.
Basically, what I'm asking is: what steps would have to be taken? Obviously, different commands, but would anything else in the code itself have to change? For example, do the commands require different formats?
Also, if anyone has a list of all the .NET/C# FTP and SFTP commands, that would be really helpful.
Clarification, as requested: The program is uploading extremely small files (20 KB) to a server. By format, I mean visually, because I was wondering about a find/replace job.
This is a pretty vague question. You haven't told us what the C# program is doing with FTP. Is it a server, is it a client, is it doing directory listings, is it uploading 100 GB files? What library is it using?
According to this forum post, there is no built-in support for SFTP in .NET, so you would have to use a third-party library such as SharpSSH or Granados SSH.
I don't really know what you mean, "do the commands require different formats". Obviously, the code will use different:
Libraries
Types
Wire protocol.
It will obviously appear somewhat similar, thanks to the abstraction of the libraries. I suggest you provide more information, and a clearer question.
One thing that you'd need to consider is how well your current code is written. If your existing FTP implementation is horribly designed spaghetti code then converting it to SFTP may be next to impossible and take way longer than you'd like. Without knowing the current state of the code, it would be difficult for anyone to make a good estimation. And even if you do get an estimation from people on this site, I wouldn't recommend trusting it (even though the people on this site are great) since without all the information in front of them it will be next to impossible for anyone to come up with a reliable estimate.
Perhaps you should consider hiring a good consultant or business analyst to do a thorough estimate for you.
It really depends on what C# library your developer has used to implement FTP.
If, for example, they used edtFTPnet, a widely available open source library, then the upgrade path is trivial if you replace it with edtFTPnet/PRO. The PRO version has the identical API and just a few extra lines of code would be needed.
I've been down this road.
It depends, but keep in mind that SFTP, FTP-SSL and FTP are different.
If he's writing the SFTP libraries himself, a month or two, since it's a lot of work to make it perfect and compatible. But he should NOT do that.
In short, get him to use an external library to add SFTP functionality. This will make it pretty short: maybe a week or two of full-time work, but it depends on how involved it is. There are open-source options, but for $50-150 you can get a license to well-maintained code that's really easy to use. It will save him days of work.
There are links above, but I'd look at:
Free:
http://www.enterprisedt.com/products/edtftpnet/overview.html
Commercial:
http://www.weonlydo.com/
