Best approach to an extendible statistics system - C#

Ok so - I need to implement a statistics/data-points/data-sources system.
I basically want to pass data periodically to the 'root' and have it process and update the relevant properties for access throughout the application - as data sources for graphing, labels, status checks etc.
I was wondering if there are some real-world examples from people who have handled something like this in the past. I've googled the hell out of this and keep getting a mixed bag of results as to what I should do, and I hate just programming and 'letting the pieces fall into place'. I need a direction.
Edit for clarity:
The data sources will be:
Local files (XML most likely),
Local SQL,
Remotely acquired JSON data,
Remotely acquired SQL.
Types of subsystems (limited list, just for illustrative purposes):
Connection statuses - both bool and text,
Graphing/Gridview data sources,
Processing/Predictive methods (e.g. probability distributions, etc.),
General statistical profiles based on client/dept,
General statistical profiles based on date/time/spans,
more...
As I said, a lot of these sources can be used in collaboration to update segments of data, should the need arise (which it likely will). One piece of information can be used across multiple systems, but there will be times when a fetch will be very specific for one point.
I hope that made it a little bit clearer... maybe. I would like to handle all the data processing in one area if possible. It'll be easier to work with as the flow increases over time.
I wrote down some thoughts on it as I brainstormed the idea.
The observer pattern
This pattern seems good; however, it has a drawback in that all subscribers will be notified rather than selective ones, which forces me to either check the data before processing, or create a separate observable object for each type of data and have it cascade to the subscribers. I definitely like how extendable this is, and it also allows me to subscribe to multiple types of data sources should the need arise. On the other hand, it seems like a lot of work to get any sort of results. Paying it forward, as it were.
Strategy
This pattern also seems relevant, but in a different way: store the processing of the raw data separately and just have a parent class that holds all the statistical information (so to speak). I like this because all information is stored centrally and the 'nodes' process it and return it. It allows for easy access and storage; however, the number of properties (unless I split it up, which is likely) will be huge.
Custom events.
Now - I guess this COULD be seen as a reinvention of the first one. But I do like the control it offers.
A combination of observer and strategy.
This could be weird, but hear me out. You have your observable object receive the data, which cascades down to the appropriate subscribers that need that information for different reasons; each of those subscribers then uses its own strategy to process the information accordingly and stores the result for access.
An example of this would be periodically withdrawing data from some kind of source; this information can be used to update multiple areas of the system (observer), but each area needs it processed a different way (strategy).
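A minimal sketch of what that combination might look like, with all the type names (StatFeed, IStatProcessor, RawDataPoint) invented purely for illustration:

using System.Collections.Generic;

// Strategy: each subscriber decides how to process the raw data it receives.
public interface IStatProcessor
{
    void Process(RawDataPoint data);
}

public class RawDataPoint
{
    public string SourceKey { get; set; }   // e.g. "connection-status", "dept-profile"
    public object Payload { get; set; }
}

// Observer: data is pushed to this root periodically and cascades to subscribers.
public class StatFeed
{
    private readonly Dictionary<string, List<IStatProcessor>> _subscribers =
        new Dictionary<string, List<IStatProcessor>>();

    public void Subscribe(string sourceKey, IStatProcessor processor)
    {
        if (!_subscribers.ContainsKey(sourceKey))
            _subscribers[sourceKey] = new List<IStatProcessor>();
        _subscribers[sourceKey].Add(processor);
    }

    public void Publish(RawDataPoint data)
    {
        // Only subscribers registered for this source key are notified,
        // which avoids the "every subscriber gets everything" drawback.
        List<IStatProcessor> list;
        if (_subscribers.TryGetValue(data.SourceKey, out list))
            foreach (var processor in list)
                processor.Process(data);   // each one applies its own strategy
    }
}

Each concrete IStatProcessor (a graph data source, a connection-status label, a predictive model) would keep its own processed state, so the feed itself stays small.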
Is this logic sound, or should I be looking at it a different way? I do need this to be extendable and scalable, as the system could potentially be handling 'large' volumes.
Thoughts? I tried to be specific but remain on topic.

I ended up going with a combination of observer and strategy, with a few custom events thrown in. Funny how that works. It actually works very well: lightweight, extendable and scalable when tested with 'large' (5-7 GB) inputs, and the desired results every time. Although I didn't end up getting assistance here, I thought I would share that the observer/strategy combination actually worked well.

Related

Multiple input-output and steps WinForms application

I am developing a WinForms application where a set of classes and their methods calculate a geometry from 3D points.
As some input from the user is needed from step to step in the algorithm, we have designed buttons which represent the steps. The intermediate data is stored in a class (maybe we will use a structure in future versions), as is the user input. As a result of pushing the buttons, the intermediate data is calculated, saved and shown to the user, so that he can edit it, affecting the calculation in the next steps.
The application began as one which calculated everything in 4 steps, but now we have more than 10 steps, so we have divided it into 3 parts (horizontal geometry, vertical geometry and other...). Now I am splitting things up further because it is getting too complex to manage all the GUI interaction in one Form, so I will create user controls for the smaller parts of the form.
Do you have general recommendations for me?
Should I have these data structures (input and intermediate data) in the controls I make or in the general form?
You should avoid mixing UI with the business logic. That way, when the program grows larger, it will be a lot easier to maintain. It will also make it much simpler to write automated unit tests.
If there is no particular reason you are using WinForms, I would recommend using Windows Presentation Foundation (WPF). Read a tutorial about the Model View ViewModel (MVVM) design pattern; it is a very nice way of separating the UI logic from the business logic.
It takes some effort to switch from WinForms to WPF, but it is definitely worth it.
EDIT (answer to comment)
Well, it depends on the problem you're solving. But generally:
In the MVVM pattern:
The Model contains all your data and algorithms in classes/methods. The ViewModel connects everything in your Model and controls the flow of the program; it exposes properties (commands, strings, numbers, collections, etc.) that the View can bind to. The View is simply a "skin" that makes it possible for the user to communicate with the ViewModel. This is a very simple explanation of the MVVM pattern, and I would recommend reading a tutorial about it.
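As a rough illustration only (the class and property names below are invented, and the geometry maths is stubbed out), a ViewModel for one of your calculation steps might look like this:

using System.ComponentModel;

// Model: algorithms only, no UI awareness.
public class GeometryModel
{
    public double CalculateLength(double[] xs, double[] ys, double[] zs)
    {
        // ... the actual 3D geometry calculation ...
        return 0;
    }
}

// ViewModel: exposes bindable state and drives the flow; the View is pure XAML.
public class GeometryViewModel : INotifyPropertyChanged
{
    private readonly GeometryModel _model = new GeometryModel();
    private double _length;

    public double Length                       // a TextBlock or grid in the View binds to this
    {
        get { return _length; }
        set { _length = value; OnPropertyChanged("Length"); }
    }

    public void RunLengthStep(double[] xs, double[] ys, double[] zs)   // called from an ICommand bound to a button
    {
        Length = _model.CalculateLength(xs, ys, zs);
    }

    public event PropertyChangedEventHandler PropertyChanged;
    private void OnPropertyChanged(string name)
    {
        if (PropertyChanged != null)
            PropertyChanged(this, new PropertyChangedEventArgs(name));
    }
}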
The first time I came across this pattern it was called Model View Controller (MVC). I like what AngularJS is doing by just calling it Model View Whatever (MVW), because there are a lot of MV* patterns out there. But WPF is specifically built for MVVM.
The most important thing, if you're creating a program that is going to be used and maintained for many years, is to keep the code as simple as possible. Instead of writing all the functionality into the Button_Click event handler (I have seen some programs that do this), try to write a class or method for every single task (use long descriptive names); this is also called Separation of Concerns. In other words, one method/class should not do more than one task. The nice thing is that you end up with a "program flow controller" (ViewModel/Whatever) that just passes data from one method to the next, and just by looking at the code you go: aha, I know exactly what is going on here! In the same way, when you look at your algorithms (Model), they should each do a single job, and all variables should have descriptive names. This makes it very easy for other developers to understand the code.
I also have very good experience with dividing my namespaces according to type. So every time you have more than one object of some sort (DataProviders, FileReaders, etc.), create a namespace/sub-namespace for them and put them in there. Then, when you're creating a new object/interface/enumeration, you always know where to put it, and you always know where to find it again: oh, it's a DataProvider, so it must be located in ProjectName.Objects.DataProviders :)
So my recommendation is: have some fun and read about MVVM and SoC.
A state machine diagram will probably help you divide the actions into parts. It's good practice to create some diagrams before coding.

What is the professional / preferred way to handle input validation

I am new to C# and SQL, but over the last few years while learning both in college, a question has really begun to burn inside me. Here it is:
It seems to me that there are really two very generic ways to handle input validation (i.e. checking for required fields, data in the correct ranges, etc.).
The first, and the way traditionally shown, is: once you develop your UI and have connected it to a database back end in some manner, you check for correct input in the user interface, such as blank text boxes, number ranges, or ensuring a radio button or check box is selected, etc.
The second, and the way shown in database development, is: set check constraints on fields, such as no nulls allowed, unique values, and even ranges and required fields.
My dilemma is this. In modern languages like C# you can do general exception handling, and major-league fault tolerance is built into most databases like SQL Server with regard to handling data changes (committing all or none). Details like this, and to this level, would be hard to program in anything but the simplest of programs.
So my question is: why not build all the requirements directly into the table at the database back end? Take advantage of the aforementioned fault tolerance, forget about programming if statements to ensure correct data is input, and instead just use a generic catch-all exception handler if the data is not committed.
Perhaps that is how it is done; if so, I would really like to know for sure. If not, why? My preference is to avoid writing code whenever possible: less code, less debugging, and fewer problems when it comes to updating. So I would tend to go with the approach of letting the DB back end do the work. Is this generally the correct thing to do?
I know that general exception handling is considered "expensive" in terms of resources. But surely, once you get past 5 or 10 if statements to handle different fields and their constraints, it must be more efficient code-wise to just use a general exception handler. It certainly seems easier to understand overall (at least the way I do it).
Thanks for your help with this.
OK, here is why you need it in both places.
First, the integrity of the data should be paramount, and data can be changed directly in database tables (deliberately, through a script to, say, update a million prices; by accident; or even by disgruntled or criminal employees trying to disrupt the database or steal from the company). Therefore it is reckless to avoid using constraints directly in the database, and doing so leads to bad data.
Now, at the user interface level, you want to prevent the user from wasting his time submitting bad data, and you want to prevent the servers and networks from wasting their time trying to process it, so you write checks at that level. Plus, you don't want the data in an inconsistent state if you need to insert into several tables and aren't using a transaction (which you should be using, but I suspect it happens less often than it should). Plus, users hate it when you try the insert and it fails and tells them that X is wrong, then they fix X and now Y is wrong, but Y was wrong before too; the process just didn't get as far as Y the first time.
You do both.
Create constraints at the DB level, and check for those constraints at the client level as well.
The validation on the DB makes sure that no invalid data gets in your DB, no matter how the data is inputted.
The validation at the client side improves the user-experience.
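As a hedged sketch of what "both" can look like in C# (the table, column and range here are invented for illustration), the client-side check mirrors the rule to give a friendly message, while the database constraint remains the backstop and surfaces as a SqlException if anything slips through:

using System.Data.SqlClient;

public static class PersonRepository
{
    // The application-side rule mirrors a CHECK constraint assumed to exist on the table.
    public static bool TrySaveAge(string connectionString, int age, out string error)
    {
        error = null;
        if (age < 0 || age > 120)                    // immediate, user-friendly feedback
        {
            error = "Age must be between 0 and 120.";
            return false;
        }
        try
        {
            using (var conn = new SqlConnection(connectionString))
            using (var cmd = new SqlCommand("INSERT INTO Person (Age) VALUES (@age)", conn))
            {
                cmd.Parameters.AddWithValue("@age", age);
                conn.Open();
                cmd.ExecuteNonQuery();
            }
            return true;
        }
        catch (SqlException ex)                      // the database constraint is the last line of defence
        {
            error = "The database rejected the data: " + ex.Message;
            return false;
        }
    }
}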
You generally can't build all the logic for checking into the database. Also, not validating user input sufficiently is a good way to open yourself up to attack.
One way to write less guard code in every method is 'Code Contracts', a product of Microsoft Research.
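For instance, a contract-based version of a guard might look like this (only a sketch; enforcing Contract.Requires at runtime needs the Code Contracts rewriter installed and enabled, and the method is invented for illustration):

using System.Diagnostics.Contracts;

public static class Pricing
{
    public static decimal ApplyDiscount(decimal amount, decimal discountPercent)
    {
        Contract.Requires(amount >= 0);
        Contract.Requires(discountPercent >= 0 && discountPercent <= 100);
        Contract.Ensures(Contract.Result<decimal>() <= amount);

        return amount - (amount * discountPercent / 100m);
    }
}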
All input should be validated both client and server side. Always.
Also, with a giant catch it would be hard to tell which field was in error, so you would end up writing a lot of "which field exploded" code at the other end.
While I generally advocate putting as much in the database as possible (which means you can have as high a degree of confidence in the "raw" data as possible), that isn't always possible, even with the powerful constraints and triggers available in SQL.
In addition, there are high-level "integrity" things which may change over time, and it is not realistic to always have temporally dynamic conditions in constraints, e.g. all HR records since 2007 must have a non-NULL birthdate, prior ones are allowed to remain NULL, but no row may ever be set back to NULL.
My point is you can almost never put it all in the database.
Put the things in that you can, and put others at higher levels in the system. The database is a very important part of any system, but it isn't the only part. As long as its design helps it protect its perimeter and be able to provide reliable service and guarantee what it says it will guarantee so that other parts of the system can rely on their assumptions, then that's about the most you can ask for.
In addition to all the answers made here (that UI validation dramatically improves the UX and can completely change the "image" of your app), bear in mind that validation in the DB is there to ensure correct data gets into the DB, while validation on the client is there to ensure the client's data is entered correctly in the first place.
Consider the example of a standalone enterprise app. A client works from home and fills in 20 invoices late at night on his notebook in Mongolia. The day after, he comes back and syncs them with his office SAP server. If the errors are only discovered during that sync, you can imagine how awful the situation is.
Just an example. There could be plenty of others, I'm sure.
Good luck.
It's 2 years later and I have a decent amount of experience now. I am not going to accept my own answer as the right one, as many here have done a great job and I am very happy with their answers. But I want to add another important consideration that, looking back over my experience, has not been highlighted here. I also use Stack Overflow for reference as I progress, and I always find myself looking back over my questions and answers, which is another reason I wanted to add this. Like a note to my future self.
While working at that company, I was asked to build an app that would do job ABC, and with it I also had to build part of the database. As I was finishing with the company, I learned that they were writing another app which would use my database. Effectively my point is that, as many have pointed out, data is paramount, and you don't know how it is going to be accessed when you're gone.
I have also learned that there are 3 places where data needs to be verified:
in the actual database, as explained
in the server-side code-behind, which is not the same as the DB or client-side validation
on the client side
There is another worry: with the advent of new tech like tablets and smartphones, there are yet more places where validation has to be implemented. The same rules for a 4th time (unless it's a web app).
I later learned that prior to MVC we had CGI forms, which had something to do with handling data over the network (I humbly admit ignorance on the hardware side), but from what was explained to me, it seems there may even be a 5th place to do validation (although I am open to being totally wrong about that).
I think the next guru in computer science will make a name for himself if he can find a way to abstract all that verification and validation to one place so that such rules don't have to be altered in a bunch of places.
worst case:
DB
Server side code
Client side code for web apps
What about if:
There may be a native client app (i.e. Windows, Linux or Mac, so at least 6 now)
There may be various phone apps (Android, iPhone, and Windows Phone to name 3, so at least 9 now)
There may be some CGI or whatever
This totals 10+ places without much exaggeration and there are other operating systems.
Even for a simple age range this is getting messy, but what if they bring out some new email format, or other complicated validation, or you have to change a bunch of validation rules? Now you have to modify them across at least 3 or 4 places, which in itself is bad.
The major problem with that is that you are modifying a lot of code and infrastructure that has been invested in, tested, and usually proven to work and delivered to the market...
As the number of client sides grows, modifying well-tested code can't be a good thing. I think this is going to be a major headache in the future. I wonder if there will be a design pattern or best practice to resolve it. If anyone knows of one, please tell me.
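One partial answer within the .NET portion of such a stack (a hedged sketch with invented class names; it does not remove the need for database constraints or for checks in non-.NET clients) is to declare the rules once on a shared DTO with DataAnnotations and validate that same object wherever .NET runs:

using System.Collections.Generic;
using System.ComponentModel.DataAnnotations;

// Lives in a class library referenced by the web app, the services and any .NET clients.
public class PersonDto
{
    [Required]
    [StringLength(100)]
    public string Name { get; set; }

    [Range(0, 120)]
    public int Age { get; set; }
}

public static class SharedValidator
{
    public static IList<ValidationResult> Validate(object dto)
    {
        var results = new List<ValidationResult>();
        Validator.TryValidateObject(dto, new ValidationContext(dto, null, null), results, true);
        return results;
    }
}

The rules then live in exactly one assembly, even if each front end still has to call the validator itself.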

ASP.Net: Best Way To Handle Multi-Function Form Based on User

In my scenario, let's say there is an ASP.NET 4.0 C# page containing a form with several inputs on it. Based upon which state the user is in, the form needs to act in entirely different ways: some fields might be required, some not visible at all, some might have different requirements (state A might only allow numbers 1-5, state B numbers 5-10), etc.
So, to simplify things, let's just say for any given input on the form, I need to determine whether or not it's required for the user, again based on their state. For those of you who run into this scenario quite a bit, what's the best way of implementing a system to handle this? I can see the following options:
Hardcoded - Difficult to maintain, obviously
Custom Database Rule Framework - This seems like it would work; however, it would be somewhat of a pain to maintain depending on how complicated the logic is
Windows Workflow Foundation - This would be able to handle just about any kind of logic and be decent to maintain, but I'm not sure how it would do performance-wise (the workflow could be stored externally in a database)
Dynamic Code - Store the logic in a database and run it directly based upon the user. I've never done this... is it possible?
That's all I've come up with at this point, but I'm hoping someone out there has found an elegant solution to handle scenarios with complicated forms like this.
Thanks!
I have never worked with WWF, but I have encountered a scenario like this and implemented an entry form for it that works well and is easy to maintain once you understand the system.
I will discourage you from using hardcoded logic because any degree of complexity will quickly become impossible to maintain. I tried a hybrid approach that included some hardcoding initially and it did not turn out well.
I ended up creating, as you call it, a custom database rule framework. It is a little extra work to set up config forms to associate user groups with certain codes and pieces of functionality, but in the end it is well worth it for everything to automatically configure itself. Also in my case I was able to farm out user & code setup work to a supervisor in the department that uses the application, so that is a big plus.
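To make the idea concrete, a stripped-down sketch of such a framework might look like the following (the FieldRule shape, the validator naming convention, and the page class are all invented for illustration; the real thing would read the rules from a table keyed by the user's state):

using System.Collections.Generic;
using System.Web.UI.WebControls;

public class FieldRule
{
    public string FieldName { get; set; }
    public bool IsVisible { get; set; }
    public bool IsRequired { get; set; }
    public int? MinValue { get; set; }
    public int? MaxValue { get; set; }
}

public partial class EntryForm : System.Web.UI.Page
{
    // Called at page load with the rules fetched for the current user's state.
    private void ApplyRules(IEnumerable<FieldRule> rules)
    {
        foreach (var rule in rules)
        {
            var input = FindControl(rule.FieldName) as WebControl;
            if (input != null)
                input.Visible = rule.IsVisible;

            // Assumes a validator named "<FieldName>Required" sits next to each input.
            var required = FindControl(rule.FieldName + "Required") as RequiredFieldValidator;
            if (required != null)
                required.Enabled = rule.IsRequired;

            var range = FindControl(rule.FieldName + "Range") as RangeValidator;
            if (range != null && rule.MinValue.HasValue && rule.MaxValue.HasValue)
            {
                range.MinimumValue = rule.MinValue.Value.ToString();
                range.MaximumValue = rule.MaxValue.Value.ToString();
            }
        }
    }
}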
Hardcoding -- not so hard to maintain; it just depends on how fluid the rules are. I.e., if your "states" are relatively fixed and you're not adding new ones or changing the way those states interact with the page, then hardcoding might be fine. My only recommendation in this case would be to keep it in a separate class so you can re-use it and modify & re-publish it more easily, etc.
If you want the flexibility to change the rules a lot, create new states (I'm thinking of these as "roles"), then storing the info in a database would make more sense.
Personally, I use the database approach. It saves me some re-publishing of the app, and it has allowed me to build additional interfaces that give my end-users limited capability to manage their own app in terms of role assignments ("states" as you put it), etc. For example, my end-users can grant one of their clients (based on the client's login) access to a certain report. Or, in your situation, they could change the minimum number for some range validator your .aspx is using.
Since this approach lets me delegate some admin functions to my end-users, it allows them to do on-the-fly changes (to a limited extent), and also saves me a lot of rush-work / do it yesterday work as far as my own to-do list is concerned.

How to allow users to define financial formulas in a C# app

I need to allow my users to be able to define formulas which will calculate values based on data. For example:
//Example 1
return GetMonetaryAmountFromDatabase("Amount due") * 1.2;
//Example 2
return GetMonetaryAmountFromDatabase("Amount due") * GetFactorFromDatabase("Discount");
I will need to allow / * + - operations, and also to assign local variables and execute IF statements, like so:
var amountDue = GetMonetaryAmountFromDatabase("Amount due");
if (amountDue > 100000) return amountDue * 0.75;
if (amountDue > 50000) return amountDue * 0.9;
return amountDue;
The scenario is complicated because I have the following structure:
Customer (a few hundred)
Configuration (about 10 per customer)
Item (about 10,000 per customer configuration)
So I will perform a 3-level loop. At each "Configuration" level I will start a DB transaction and compile the formulas; each "Item" will use the same transaction + compiled formulas (there are about 20 formulas per configuration, and each item will use all of them).
This further complicates things because I can't just use the compiler services, as that would result in continued memory usage growth. I can't use a new AppDomain for each "Configuration" loop level because some of the references I need to pass cannot be marshalled.
Any suggestions?
--Update--
This is what I went with, thanks!
http://www.codeproject.com/Articles/53611/Embedding-IronPython-in-a-C-Application
IronPython allows you to embed a scripting engine into your application. There are many other solutions. In fact, you can google something like "C# embedded scripting" and find a whole bunch of options. Some are easier than others to integrate, and with some it is easier to code up the scripts.
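A minimal sketch of the embedding (the formula text and the variable name amount_due are invented; the point is that the formula is compiled once per configuration and the CompiledCode is reused for every item):

using IronPython.Hosting;
using Microsoft.Scripting;
using Microsoft.Scripting.Hosting;

ScriptEngine engine = Python.CreateEngine();

// Compile once per configuration...
ScriptSource source = engine.CreateScriptSourceFromString(
    "amount_due * 0.75 if amount_due > 100000 else amount_due",
    SourceCodeKind.Expression);
CompiledCode formula = source.Compile();

// ...then re-use the compiled formula for every item.
ScriptScope scope = engine.CreateScope();
scope.SetVariable("amount_due", 125000.0);
object result = formula.Execute(scope);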
Of course, there is always VBA. But that's just downright ugly.
You could create a simple class at runtime, just by writing your logic into a string or the like, compiling it, running it and making it return the calculations you need. This article shows you how to access the compiler at runtime: http://www.codeproject.com/KB/cs/codecompilation.aspx
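Roughly along those lines (the generated class name and formula are invented; note the caveat at the end, which is exactly the memory concern raised in the question):

using System;
using System.CodeDom.Compiler;
using Microsoft.CSharp;

string source = @"
    public static class UserFormula
    {
        public static double Evaluate(double amountDue)
        {
            if (amountDue > 100000) return amountDue * 0.75;
            if (amountDue > 50000)  return amountDue * 0.9;
            return amountDue;
        }
    }";

var provider = new CSharpCodeProvider();
var parameters = new CompilerParameters { GenerateInMemory = true };
CompilerResults results = provider.CompileAssemblyFromSource(parameters, source);
var evaluate = results.CompiledAssembly.GetType("UserFormula").GetMethod("Evaluate");
double value = (double)evaluate.Invoke(null, new object[] { 125000.0 });

// Caveat: assemblies compiled this way cannot be unloaded without a separate AppDomain,
// so compiling per configuration in a long-running process will grow memory over time.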
I faced a similar problem a few years ago. I had a web app with moderate traffic that needed to allow equations, and it needed similar features to yours, and it had to be fast. I went through several ideas.
The first solution involved adding calculated columns to our database. Our tables for the app store the properties in columns (e.g., there's a column for Amount Due, another Discount, etc.). If the user typed in a formula like PropertyA * 2, the code would alter the underlying table to have a new calculated column. It's messy as far as adding and removing columns. It does have a few advantages though: the database (SQL Server) was really fast at doing the calculations; the database handled a lot of error detection for us; and I could pretend that the calculated values were the same as the non-calculated values, which meant that I didn't have to modify any existing code that worked with the non-calculated values.
That worked for a while until we needed the ability for a formula to reference another formula, and SQL Server doesn't allow that. So I switched to a scripting engine. IronPython wasn't very mature back then, so I chose another engine... I can't remember which one right now. Anyway, it was easy to write, but it was a little slow. Not a lot, maybe a few milliseconds per query, but for a web app the time really added up over all the requests.
That was when I decided to write my own parser for the formulas. That is, I have a PlusToken class to add two values, an ItemToken class that corresponds to GetValue("Discount"), etc. When the user enters a new formula, a validator parses the formula, makes sure it's valid (things like, did they reference a column that doesn't exist?), and stores it in a semi-compiled form that's easy to parse later. When the user requests a calculated value, a parser reads the formula, parses it, figures out what data is needed from the database, and computes the final answer. It took a fair amount of work up front, but it works well and it's really fast. Here's what I learned:
If the user enters a formula that leads to a cycle in the formulas, and you try to compute the value of that formula, you'll run out of stack space. If you're running this in a web app, the entire web server will stop working until you reset it. So it's important to detect cycles at the validation stage (a minimal version of such a check is sketched after this list).
If you have more than a couple formulas, aggregate all the database calls in one place, then request all the data at once. Much faster.
Users will enter wacky stuff into formulas. A parser that provides useful error messages will save a lot of headaches later on.
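For the cycle point above, a bare-bones check (names invented) treats each formula as a node whose edges are the other formulas it references, and refuses to save if the current path revisits a node:

using System.Collections.Generic;

public static class FormulaValidator
{
    // refs maps each formula name to the names of the formulas it references.
    public static bool HasCycle(string formula,
                                IDictionary<string, List<string>> refs,
                                HashSet<string> path)
    {
        if (!path.Add(formula))
            return true;                          // formula is already on the current path: cycle
        List<string> children;
        if (refs.TryGetValue(formula, out children))
            foreach (var child in children)
                if (HasCycle(child, refs, path))
                    return true;
        path.Remove(formula);
        return false;
    }
}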
If the custom scripts don't get more complex than the ones you show above, I would agree with Sylvestre: create your own parser, make a tree and do the logic yourself. You can generate a .NET expression tree, or just go through the syntax tree yourself and perform the operations within your own code (ANTLR, mentioned below, will help you generate such code).
Then you are in complete control of your references, and you are always within C#, so you don't need to worry about memory management (any more than you normally would), etc. IMO, ANTLR is the best tool for doing this in C#. The site has examples of little languages like your scenario.
But... if this is really just a beginning, and in the end you need almost the full power of a proper scripting language, you would need to embed a scripting language in your system. With your numbers, you will have problems with performance, memory management and probably references, as you noted. There are several approaches, but I cannot really give one recommendation for your scenario: I've never done it at such a scale.
You could build two base classes UnaryOperator (if, square, root...) and BinaryOperator (+ - / *) and build a tree from the expression. Then evaluate the tree for each item.
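A bare-bones version of that idea, with all class names invented for illustration (the parser builds the tree once per formula, and Evaluate is then called once per item):

using System;
using System.Collections.Generic;

public abstract class Node
{
    public abstract double Evaluate(IDictionary<string, double> item);
}

// Leaf: a GetValue("Discount")-style lookup against the current item.
public class ItemValue : Node
{
    public string Key { get; set; }
    public override double Evaluate(IDictionary<string, double> item) { return item[Key]; }
}

public class Constant : Node
{
    public double Value { get; set; }
    public override double Evaluate(IDictionary<string, double> item) { return Value; }
}

// +, -, *, /
public class BinaryOperator : Node
{
    public char Op { get; set; }
    public Node Left { get; set; }
    public Node Right { get; set; }

    public override double Evaluate(IDictionary<string, double> item)
    {
        double l = Left.Evaluate(item), r = Right.Evaluate(item);
        switch (Op)
        {
            case '+': return l + r;
            case '-': return l - r;
            case '*': return l * r;
            case '/': return l / r;
            default: throw new InvalidOperationException("Unknown operator " + Op);
        }
    }
}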

Is it better to have one big workflow or several smaller specific ones?

I need to build an app that gets files from a server and moves them to another server. It was suggested that I look into using Windows Workflow Foundation (WF).
I started to build the workflow, but it is getting messy and I'm not sure I'm doing it the best way possible.
Here are the basic workflow activities:
Get a list of sources
Determine if the source is FTP or a disk drive
Get a list of files from the server
If the source is FTP then get the file with FTP get, else if the source is a drive then read the file from the drive
If the target is FTP then FTP the file to the server, else if the target is a drive then write to the drive, else if the target is a web service then post to the web service
If the source is FTP then delete the file with FTP commands, else if the source is a drive then delete the file
With one workflow it gets a little busy. I need 2 while loops, one around the integrations and one after I get a file list.
The other thing I thought of was to build multiple workflows: one for FTPtoFTP, FTPtoDrive, FTPtoWebService, DriveToFTP, DriveToDrive, DriveToWebService.
Any suggestions?
First, you should consider creating custom Activities for each of the major sections. The custom activities will be Composite Activities that can be composed of many steps. This will help de-clutter things a bit and allow you to continue working with the workflows at a relatively high-level.
The Workflow Designer, while handy, is not really designed to scale very large. As of VS 2008, the best way to work with XAML-based technologies is to use the text editor and read/write the XML directly.
Breaking it down into several workflows might not be the best approach unless you can break it down into a few high-level activities and are working at the XAML level. Keep in mind that if the logic and flow are nearly identical for all of these, you will now have to maintain 6 different workflows. This is a bigger nightmare if your workflows are complex and you need to fix a common logic error across all of them.
You should also consider the use of Services. This may allow you to have ONE workflow and ONE set of activities, while the implementation of each step is isolated into a service. In this case, you would need to instantiate one workflow per combination, load the same workflow into each, and inject different activities. Not necessarily the best approach, but something to consider.
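For the custom-activity suggestion above, a very small WF 3.x-style sketch (the activity name and behaviour are invented; a real version would expose dependency properties for the source details and be composed into larger composite activities):

using System.Workflow.ComponentModel;

public class GetFileListActivity : Activity
{
    protected override ActivityExecutionStatus Execute(ActivityExecutionContext executionContext)
    {
        // Decide here whether the source is FTP or a disk drive and fetch the file list,
        // storing the result in a property that later activities can bind to.
        return ActivityExecutionStatus.Closed;
    }
}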
First of all, this sounds to me like using WF is adding extra complications to what should be a fairly straightforward process. Although WF can be used to model execution flow, its purpose is to model business flow, and include business rules and logic without putting those into your implementations.
In your example, the business rules seem largely like things which should be dealt with by an app.config file.
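In other words, something as small as this might carry the routing rules (the key names are invented; the sources could just as well be a custom XML section listing several of them):

using System.Configuration;

// Reads the transfer rules straight from app.config instead of modelling them in a workflow.
string sourceType = ConfigurationManager.AppSettings["SourceType"];   // "ftp" or "drive"
string targetType = ConfigurationManager.AppSettings["TargetType"];   // "ftp", "drive" or "webservice"
string[] sources = ConfigurationManager.AppSettings["Sources"].Split(';');

// The transfer loop then just switches on these strings.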
However, on the broader question of using one workflow or many: you want each of your workflow tasks to have approximately the same 'broad scope'.
For instance
WF for building a table
purchase wood
cut wood
cut wood for legs
bevel edges
round cornices
sand twice with different coarseness
assemble table
The steps in the middle are all much more detailed than the steps around them.
So you would consider splitting it up into two separate workflows: a high-level workflow that contains the broad steps, and lower-level workflows that contain the particulars.
So the 'GetDatasource' workflow step would not care (externally) what type of data source it is gathering from; it just returns a set of data to the next step in the workflow.
The same goes for the target: it doesn't care what type of data source the data came from, only what it has to do with the data. So that should be encapsulated as well.
So your Workflow could be three workflows
Highest WF
GetDataSourceWF
DoThingsWithDataWF
Then your DoThingsWithDataWF and GetDataSourceWF Workflows can each be concerned with only the execution context that they need.
EDIT
As pointed out by the commenter James Schek:
You can use the higher-level workflow to actually kick off your lower-level workflows and manage their execution into each other.
Well, personally I have not used WWF yet, though I have done quite a bit with workflows before. To me, breaking them up into smaller workflows seems the best way. When you're working with workflows, you should try to limit each workflow to a specific task, so that you have a definitive start action, at least one successful route, and at least one failure route. Workflows in general can be very tricky things, and it's best to keep each as simple as possible.
As a general rule, anytime things get "messy", you should break them down into smaller parts. I'd definitely recommend breaking it down into several workflows.
