Categorization / classification pattern

Categorization / classification pattern - c#

This question may be a bit too generic and abstract because I do not know what I'm looking for yet. I do not have too much experience with patterns. I need to know what pattern/ technique I can use to categorize patients in a medical app.
Let's say the hospital has a documentation app with 10 data fields. Dates, numbers, selects, multi-selects.
Every patient that visits the hospital will have it's own specific information.
After input and analysis, each patient must be placed in a category.
Each category is determined by a set of rules. Those rules are created based on some or all fields defined above and their individual value.
In reality I'm speaking about hundreds of patients and hundreds of input fields. So I'm trying to find out whether there is some traditional way of doing this (something more generic) or if I'm stuck with writing tens of "IF and Switch" statements.
PS: this is not a machine learning task

This sounds like a task for a type of algorithm called a rules engine. At their simplest the coding style of rules engines does appear to be a collection of IF ... THEN ... (ELSE...) but rules engines also usually have features such as the elimination of redundant branches, cycle and contradiction detection and so on.
Examples of software packages that provide this are Drools and the BizTalk Business Rules Engine

In addition to Tom W's answer, on a lower level you may find the Specification pattern useful.

Related

How can personal and place names be extracted from text using C#?

Is there any C# algorithm by which personal and place names can be extracted from text?
e.g., given the following text:
St. Mark died at Alexandria, in Egypt. He was martyred, I think.
However, that has nothing to do with my legend. About the founding of
the city of Venice--
(taken from "The Innocents Abroad" by Mark Twain)
...is there any way to extract:
St. Mark
Alexandria (or better yet, "Alexandria, Egypt")
Venice
?
I realize that there is no way to get 100% accuracy (where all place names and personal names are captured, and no "false positives" are added), but 80% accuracy could be very valuable.
I understand that each word could be compared with an encyclopedia or some such, but there must be a better way. Also, how could the algorithm know to combine "St." and "Mark" and to see "Alexandria, in Egypt" as "Alexandria, Egypt"?

I noticed that the links provided here are a bit dated. One project that is still active (and free [correction: GPL, so free for non-commercial]) is the Stanford Natural Language Processing (NLP) libraries (https://nlp.stanford.edu/software/). You can demo their Named Entity Recognition (NER) here. It even has a .NET wrapper (http://sergey-tihon.github.io/Stanford.NLP.NET/StanfordNER.html).
Microsoft also offers many similar algorithms through Azure Cognitive Services. You would be most interested in Entity Linking (https://azure.microsoft.com/en-us/services/cognitive-services/entity-linking-intelligence-service/)
I hope helps future viewers.

You are best off using some kind of API that will be able to perform this kind of entity matching, as what you are asking is potentially very complex and requires some degree of semantic textual analysis backed up by a large database. I'd recommend at looking at APIs such as:
OpenCalais - English Semantic Metadata: Entity/Fact/Event Definitions and Descriptions web-service
Calais supports a rich set of semantic metadata, including entities, events and facts.
Alchemy API - Entity Extraction API
AlchemyAPI is capable of identifying people, companies, organizations, cities, geographic features, and other typed entities within your HTML, text, or web-based content. We employ sophisticated statistical algorithms and natural language processing technology to analyze your information, extracting the semantic richness embedded within.

If it is possible to auto-format code before and after a source control commit, checkout, diff, etc. does a company really need a standard code style?

If it is possible to auto-format code before and after a source control commit, checkout, diff, etc. does a company really need a standard code style?
It feels like standard coding style debates that have been raging since programming began like "put the bracket on the following line" or "properly indent your (" are no longer essential.
I realize in languages where white space matters the diff will have to consider it but for languages where the style is a personal preference is there really a need to worry about it anymore?

Auto-format can really only address whitespace.
It wont address developers giving variables bizarre nonsensical names.
It won't address some developers having functions return null on an error vs throwing an exception.
I'm sure others can think of more examples.

This is what we do at my work:
We all use Eclipse. We don't have a policy for using Eclipse but somehow none of us is an IDEA/IntelliJ guy. We also think our code should be written with legacy in mind. This means our code has to be readable in a certain way even years after (#1) no matter who wrote it and if that person even is in the company anymore.
Eclipse has couple handy features, automatic format on save and a specific Formatter tool. As you can see from the linked screenshot, it can be configured with XML. Thus there's a bunch of premade XML:s available for every worker in our company so that when a new guy comes in, we walk him through of the whole process and configure their Eclipse for them (yes, it's slightly evil thing to do) so that it actually uses those formatting XML:s we have provided. We do not enforce automatic format on save, we don't want to be completely intrusive, we just want to push all our developers into the right directions. For even increased compatability, we mostly use rules defined in JCC.
Next comes the important part, the actual builds. We are those who embrace automatic builds and for that we use Hudson Continuous Integration Server. There's two important parts in our configurations beyond this:
We use CVS loginfo to trigger builds whenever something is committed.
We utilize several plugins available for Hudson, including Continuous Integration Game in conjuction with the most important one, Checkstyle.
The Checkstyle Plugin is the magician in our code style enforcement guide line:
After commiting code to CVS, Hudson build is triggered
After build has been completed succesfully (all unit tests pass etc.), Checkstyle inspects the actual source files
Checkstyle ranks the code based rules we have defined for it
Continuous Integration Game sees the result of Checkstyle and awards/takes away points for the person who has the ownership for the relevant part of the code
Leaderboard shows total points for every commiter in system
Basically this means that when anyone commits ugly code into our CVS, our build server automatically reduces that person's points.
This means that eventually any one of us can be ranked on the Leaderboard based on the general code quality in both look and OO principles such as Law of Demeter, Cyclomatic complexity etc. etc. Naturally this isn't a completely serious statistic, but it's a good indication you're doing something wrong when causing a build to be initiated in our CI won't reduce your points - most of our commits are worth between 1 and 5 points.
And is it working? Sort of, I don't think anyone of us at my work writes ugly or unmaintainable code and personally I love to hunt all kinds of scores so it's definitely motivating me to make code that looks nice and follows all the OO paradigms I know of.
And do we as a company really need it? I think we do as you should see from reading this entire answer, it can be considered a good practice for the advancements it brings.
#1: in a related note, I refactored legacy code from 2002 today which used those standards, didn't look "bad" at all even in its original form and certainly not worse in its new form

No, not really.
If you can actually get it to work consistently and not make it flag code has changed due to a different style of laying the code out.
However, this is just a small part of coding standards. It won't cover multiple return statements, the use or not of ternary operators, etc.

It is always nice if the coding style that the shop uses is the same one that is also followed by the development tools.
Otherwise, if there is a large body of code that already follows a shop standard which is NOT the same as that of the tools you have two choices:
Modify all of the code to follow the tool standard, or
Maintain the existing shop standard.
Many shops do the latter. Regardless, there does need to be some kind of standard, and it does need to be followed.
Some development tools allow you to tweak their standard. In some cases you may be able to bring the tools in alignment with the shop standard.

It probably doesn't matter that much anymore if you can ensure that everybody in the team sees the source code "correctly" formatted, whatever they think it is. However I've not seen a system that can do that - you can do parts of it (say, reformat before and after checkin/checkout) but these days you also have to consider web interfaces into the version control, external code review systems that interact directly with the version control system etc.
The main purpose of a standard code style is (IMHO) to ensure that you can read other team members' code easily without having to start reverse engineering it because all the code is written using the same sort of guiding principles. Indenting and parentheses placement seem to be a major hangup on this but they are only a very small and in my opinion, somewhat overblown and not very important part of the need to make code consistent.
Unfortunately I'm not aware of any tools that can automatically apply consistent coding principles to source code...

Yes, coding styles are needed if there is a desire to have a homogeneous code base. Such a code base can be useful in preventing individual ownership of parts of the code base, which can cause problems when people leave the team. If you can't imagine having wildly different styles and problems understanding all of it, just look at all the different ways English text can be organized in various communications, all written but quite different such as tweets, e-mail, text messages, IM, message board posts, etc. and changes in fonts, capitalization, decorations, etc.

Handling states and countries (or provinces): best way to implement

In my career I have seen literally dozens of ways that people choose to handle implementing state and country data (as in NY or USA). I've seen enumerations, multi-dimensional arrays, data driven classes from XML docs or databases, and good old-fashioned strings.
So, I'm wondering, what do people consider the "best way" to handle this very common implementation? I know that most of the answers will be based primarily on opinion and preference; but, I'm curious to hear arguments as a recent discussion for an app I'm working on has resulted in a small debate.

If you consider the number of states in the united states to be unchanging then what you have is a finite, small, fixed set of values which can easily be associated with a number. For that type of data structure an enumeration works great. My choice then would be an enumeration as they have good compile time validation and typing something like the following reads nicely.
var state = GetStateInfo(States.Nebraska);
However if you consider the number of states to be a changing value then an enumeration may not be the best choice. The framework design guidelines (link) reccomend you do not use an enumeration for a set of values which is considered to be open (changing). The reason why is that if the value does change you risk breaking already shipped code. Instead I would go with a static class and string constants for the name.

I would pick a standard for enumerating the geogrpahical areas and then associate keys in a table. In the past I have used FIPS 5-2 place codes. Its a federal standard and the numbers and abbreviations are reasonablly well known. for international country codes we use FIPS 10-4 for encoding places. There are some iso standards such as ISO 3166 that may also be appealing, but this is more preference and taste.

I think enums or XML is fine if all you want is a simple list of states. If you have other dependent data like say state tax rates or postcode data then I think I'd store the states in the database.
Same with countries. You may want to control languages, currencies etc. based on country. I think that's just easier to deal with if it's in a database.

Ndepend and other automatic code analyser revelence?

Since yesterday, I am analyzing one of our project with Ndepend (free for most of its features) and more I am using it, and more I have doubt about the real value of this type of software (code-analysis software).
Let me explain, The system build a report about the health of the system and class by Rank every metric. I thought it would be a good starting point to do modifications but most of the top result are here because they have over 100 lines inside the class (we have big headers and we do use VS comments styles) so it's not a big deal... than the number of Afferent Coupling level (CA) is always too high and this is almost very true for Interface that we used a lot... so at this moment I do not see something wrong but NDepend seem to do not like it (if you have suggestion to improve that tell me because I do not see the need for). It's the samething for the metric called "NOC" for Number of children that most of my Interface are too high...
For the moment, the only very useful metric is the Cyclomatic Complexity...
My question is : Do you find is worth it to analyse code with Automatic Code Analyser like NDepend? If yes, how do you filter all information that I have mentionned that doesn't really show the real health of the system?

Actually metrics are just one feature of NDepend, did you try to use VisualNDepend that lets you analyze your project much more in depth than the report? By reading your comment, I am almost sure you didn't play with NDepend UI (standalone or integrated in Visual Studio) which is the best way to filter data about your code base.
I am one of the developers of NDepend and we use it a lot to analyze our own code. Basically we write our own quality rules with Code Rules over LINQ Queries (CQLinq). These rules automatically make sure that we don't have regression on our design. Here you'll find the list of around 200 default code rules.
Here are some unique features of NDepend and not related to code metrics:
Write CQLinq rules to make sure we don't have architectural flaws, such as dependency cycles between our components, UI using directly the DB or DB entangled with the business objects.
Make sure we don't have problem with code coverage by tests (like we make sure with a CQLinq rule that if a class is supposed to be 100% covered, it will remain 100% covered in the future)
Enforce side-effects-free code (immutable class/pure methods)
Use the ability to compare 2 analysis to code review changes since the last release, before doing a new release. More specifically, I enjoy using NDepend to know which method has been added and refactored since the last release, and is not 100% covered by tests.
Having an optimal encapsulation for all our members and types (like knowing which internal methods can be declared as private). This is also related to dead-code detection that NDepend also supports.
For a complete list of features if NDepend, see here.

I don't necessarily see NDepend results as "good" or "bad" in software engineering, there's always a good reason why an application is designed the way it is. I see it as a report that can probably help me point out issues with my design, but I have the final word when it comes to deciding if a method needs to be refactored or if it's good the way I designed it. In general, don't get too caught up trying to answer if it's worth it or not. It definitely is, instead I would suggest you carefully review the results. This will help you view your design from another perspective and there may be occasions where you decide the way you designed it is the best to achieve your applications goals.

Recommended data format for describing the rules of chess

I'm going to be writing a chess server and one or more clients for chess and I want to describe the rules of chess (e.g. allowable moves based on game state, rules for when a game is complete) in a programming language independant way. This is a bit tricky since some of the chess rules (e.g. King Castling, en passent, draws based on 3 or more repeated moves) are based not only on the board layout but also on the history of moves.
I would prefer the format to be:
textual
human readable
based on a standard (e.g. YAML, XML)
easily parsable in a variety of languages
But I am willing to sacrifice any of these for a suitable solution.
My main question is: How can I build algorithms of such a complexity that operate on such complex state from a data format?
A followup queston is: Can you provide an example of a similar problem solved in a similar manner that can act as a starting point?
Edit: In response to a request for clarity -- consider that I will have a server written in Python, one client written in C# and another client written in Java. I would like to avoid specifying the rules (e.g. for allowable piece movement, circumstances for check, etc.) in each place. I would prefer to specify these rules once in a language independant manner.

Let's think. We're describing objects (locations and pieces) with states and behaviors. We need to note a current state and an ever-changing set of allowed state changes from a current state.
This is programming. You don't want some "meta-language" that you can then parse in a regular programming language. Just use a programming language.
Start with ordinary class definitions in an ordinary language. Get it all to work. Then, those class definitions are the definition of chess.
With only miniscule exceptions, all programming languages are
Textual
Human readable
Reasonably standardized
Easily parsed by their respective compilers or interpreters.
Just pick a language, and you're done. Since it will take a while to work out the nuances, you'll probably be happier with a dynamic language like Python or Ruby than with a static language like Java or C#.
If you want portability. Pick a portable language. If you want the language embedded in a "larger" application, then, pick the language for your "larger" application.
Since the original requirements were incomplete, a secondary minor issue is how to have code that runs in conjunction with multiple clients.
Don't have clients in multiple languages. Pick one. Java, for example, and stick with it.
If you must have clients in multiple languages, then you need a language you can embed in all three language run-time environments. You have two choices.
Embed an interpreter. For example Python, Tcl and JavaScript are lightweight interpreters that you can call from C or C# programs. This approach works for browsers, it can work for you. Java, via JNI can make use of this, also. There are BPEL rules engines that you can try this with.
Spawn an interpreter as a separate subprocess. Open a named pipe or socket or something between your app and your spawned interpreter. Your Java and C# clients can talk with a Python subprocess. Your Python server can simply use this code.

There's already a widely used format specific to chess called Portable Game Notation. There's also Smart Game Format, which is adaptable to many different games.

I would suggest Prolog for describing the rules.

This is answering the followup question :-)
I can point out that one of the most popular chess servers around documents its protocol here (Warning, FTP link, and does not support passive FTP), but only to write interfaces to it, not for any other purpose. You could start writing a client for this server as a learning experience.
One thing that's relevant is that good chess servers offer many more features than just a move relay.
That said, there is a more basic protocol used to interface to chess engines, documented here.
Oh, and by the way: Board Representation at Wikipedia
Anything beyond board representation belongs to the program itself, as many have already pointed out.

Edit: Overly wordy answer deleted.
The short answer is, write the rules in Python. Use Iron Python to interface that to the C# client, and Jython for the Java client.

What I've gathered from the responses so far:
For chess board data representations:
See the Wikipedia article on [chess board representations](http://en.wikipedia.org/wiki/Board_representation_(chess)).
For chess move data representations:
See the Wikipedia articles on Portable Game Notation and Algebraic Chess Notation
For chess rules representations:
This must be done using a programming language. If one wants to reduce the amount of code written in the case where the rules will be implemented in more than one language then there are a few options
Use a language where an embedable interpreter exists for the target languages (e.g. Lua, Python).
Use a Virtual Machine that the common languages can compile to (e.g. IronPython for C#, JPython for Java).
Use a background daemon or sub-process for the rules with which the target languages can communicate.
Reimplement the rules algorithms in each target language.
Although I would have liked a declarative syntax that could have been interpreted by mutliple languages to enforce the rules of chess my research has lead me to no likely candidate. I have a suspicion that Constraint Based Programming may be a possible route given that solvers exist for many languages but I am not sure they would truly fulfill this requirement. Thanks for all the attention and perhaps in the future an answer will appear.

Drools has a modern human readable rules implementation -- https://www.jboss.org/drools/.
They have a way users can enter their rules in Excel. A lot more users can understand what is in Excel than in other tools.

To represent the current state of a board (including castling possibilities etc) you can use
Forsyth-Edwards Notation, which will give you a short ascii representation. e.g.:
rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1
Would be the opening board position.
Then to represent a particular move from a position you could use numeric move notation (as used in correspondence chess), which give you a short (4-5 digits) representation of a move on the board.
As to represent the rules - I'd love to know myself. Currently the rules for my chess engine are just written in Python and probably aren't as declarative as I'd like.

I would agree with the comment left by ΤΖΩΤΖΙΟΥ, viz. just let the server do the validation and let the clients submit a potential move. If that's not the way you want to take the design, then just write the rules in Python as suggested by S. Lott and others.
It really shouldn't be that hard. You can break the rules down into three major categories:
- Rules that rely on the state of the board (castling, en passant, draws, check, checkmate, passing through check, is it even this player's turn, etc.)
- Rules that apply to all pieces (can't occupy the same square as another piece of your own colour, moving to a square w/ opponent's piece == capture, can't move off the board)
- Rules that apply to each individual piece. (pawns can't move backwards, castles can't move diagonally, etc)
Each rule can be implemented as a function, and then for each half-move, validity is determined by seeing if it passes all of the validations.
For each potential move submitted, you would just need to check the rules in the following order:
is the proposed move potentially valid? (the right "shape" for the piece)
does it fit the restraints of the board? (is the piece blocked, would it move off the edge)
does the move violate state requirements? (am I in check after this move? do I move through check? is this en passant capture legal?)
If all of those are ok, then the server should accept the move as legal…

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.