Forgiving Format design pattern - jQuery or C# library? - c#

I have just been looking at the 'Forgiving Format' design pattern (e.g. http://ui-patterns.com/patterns/ForgivingFormat); however, I am surprised that I can't find any libraries implementing it (specifically for simple dates/times). Does anybody know of any (preferably open source) libraries for this?
Thanks

I don't think this is a design pattern, rather a UI pattern... (edit: I just noticed the name of the website you linked to :) )
As a matter of fact, this functionality exists in some libraries. The first one to spring to mind is Datejs, a JavaScript date-parsing library that allows for fuzzy date input. However, the last I heard, there hadn't been much activity in the project.
Apart from dates, countries, etc., I think that any project of this kind is very business-specific; first you've got to learn how users express themselves and how to translate this into business terms. Working on a generic translator doesn't look feasible, at least not without a lot of configuration.

The Forgiving Format design pattern is heavily dependent upon your interface. If you are using HTML 4 and have only a text box, how is it supposed to know that only numbers are acceptable? How is it supposed to know that 2.30 is supposed to mean 2:30 as in hour of the day? Et cetera.
There are jQuery plugins which steer user input in the right direction using general rules, however you're the one to determine what is acceptable and what isn't in the end. And if you wanted to have a field which accepted either telephone numbers or e-mail addresses, you'd be hard-pressed to find a library which validates it as such without a little tweaking.
Ultimately it comes down to you to be able to determine what is tolerated input and what isn't. Libraries merely help you do the more common validation.
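To make the "2.30 means 2:30" point concrete, here is a minimal C# sketch of a forgiving time-of-day field. The class name ForgivingTime and the list of accepted formats are my own assumptions for illustration, not something taken from an existing library; you would extend or shrink the list to match whatever your users actually type.

```csharp
using System;
using System.Globalization;

// Hypothetical helper: accepts several spellings of a time of day.
public static class ForgivingTime
{
    // Illustrative, not exhaustive. "%H" is needed because a lone "H"
    // would be read as a (nonexistent) standard format specifier.
    private static readonly string[] Formats =
    {
        "H:mm", "H.mm", "HHmm", "h:mm tt", "h tt", "%H"
    };

    public static bool TryParse(string input, out TimeSpan time)
    {
        time = default(TimeSpan);
        if (string.IsNullOrWhiteSpace(input))
        {
            return false;
        }

        DateTime parsed;
        // Try each accepted format in turn; the first match wins.
        if (DateTime.TryParseExact(input.Trim(), Formats,
                CultureInfo.InvariantCulture,
                DateTimeStyles.AllowWhiteSpaces, out parsed))
        {
            time = parsed.TimeOfDay;
            return true;
        }
        return false;
    }
}
```

With this list, "2.30", "2:30" and "0230" all end up as the same TimeSpan, while anything outside the list is rejected so you can show the user a hint. Deciding which spellings belong in the list is exactly the part no library can do for you.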

Related

Determining if text is english

I currently have a lot of comments and text in my database, mainly in English. However, if a text isn't in English I want to translate it to English.
I know I can call a translation API to determine the language, but I don't want to make millions of translation API calls for text that most likely won't need translating.
I am looking for a way to determine if the text is English or not. I don't need to know what language it is, just that it isn't English, then if it isn't English I will send it to a translation service API.
The Chromium project (including its most popular implementation, Google Chrome) solves this problem with https://github.com/google/cld3.
If your only need is to detect whether or not something is English, then in theory you can use something even more compact.
Most good language detectors use trigram frequency (a gram being a single character) or trigram frequency overlaid with word frequency. For your application it seems that you could use a hybrid approach: a first pass that is local but of low accuracy, tuned to be a bit aggressive so as not to miss any potential English, and a second pass that does hit an API like Google Translate.
The popularity of English and the amount of English data are usually helpful for applying NLP solutions to it, but in this case you will unfortunately often find false positives for English, because sources of data that are listed as English contain other languages or non-language content such as garbage characters or URLs.
Note also that for many queries there is no single correct answer. Good systems will return a weighted list of possibilities, but for a query like [dan], [a], [example#example.com] or [hi! como estas? i'm in class ahorita] the most correct answer will depend on your application and may not exist.
You can use NTextCat to determine input language.
Research (by a certain Zipf) determined that, for the most part, there are some words which are used very frequently, and a lot of words which are rarely used.
If I was given this problem, I'd probably put down a list of the top X used words. Then for each comment I would see if there's a match.
It's not perfect (and if the text is very particular, or misspelt, you've got an issue) - but I think it's an acceptable heuristic.
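A rough C# sketch of that top-X-words heuristic might look like the following. The word list, the 0.15 threshold and the names EnglishGuess, CommonWordRatio and LooksEnglish are all assumptions made for illustration; in practice you would use a longer frequency list and tune the threshold on your own comments.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: flags text as "probably English" when enough of
// its words appear in a small list of very frequent English words.
public static class EnglishGuess
{
    // Illustrative list only; a real list would be much longer.
    private static readonly HashSet<string> CommonWords = new HashSet<string>(
        new[]
        {
            "the", "be", "to", "of", "and", "a", "in", "that", "have", "i",
            "it", "for", "not", "on", "with", "he", "as", "you", "do", "at",
            "this", "but", "is", "are", "was"
        },
        StringComparer.OrdinalIgnoreCase);

    private static readonly char[] Separators =
    {
        ' ', '\t', '\r', '\n', '.', ',', ';', ':', '!', '?', '"', '(', ')'
    };

    // Fraction of words in the text that are on the common-word list.
    public static double CommonWordRatio(string text)
    {
        var words = (text ?? string.Empty)
            .Split(Separators, StringSplitOptions.RemoveEmptyEntries);
        if (words.Length == 0)
        {
            return 0.0;
        }
        return (double)words.Count(w => CommonWords.Contains(w)) / words.Length;
    }

    // The threshold is a guess; tune it against real data.
    public static bool LooksEnglish(string text, double threshold = 0.15)
    {
        return CommonWordRatio(text) >= threshold;
    }
}
```

Anything for which LooksEnglish returns false would go to the translation API. Very short or mixed-language comments will still be misjudged, so treat this only as a cheap first pass.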
See this post
More specifically, take a look at trigrams.

Template pattern - not useful for small projects

I'm sorry to ask such a localized question but until I get confirmation I don't feel confident moving on with my project.
I have read lots about the template pattern, Wikipedia has a good example.
It shows that you create the basic virtual methods and then inherit the base class and override where you want. The example on the site is for Monopoly and Chess which both inherit the base class.
So, my question is, if you had an application which was only going to be Chess and never anything else, would there be any benefit in using the template pattern (other than as an education exercise)?
No, I think that falls under the category of "You Ain't Gonna Need It."
To be more specific, design patterns exist to solve a particular problem, and if your code doesn't need to solve that problem, all they do is add lines of code without having any benefit.
No. Expressed in a very simplified and superficial way, the template pattern only becomes worthwhile beyond a certain ratio between total code size and templated code size. In your example, the chess game is going to be the entire program, so there'll be no need to use the template pattern here.
The template pattern is used in specific situations. It is used when you want to sketch out an algorithm but let the specific steps differ.
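As a small illustration of "sketch out an algorithm but let the specific steps differ", here is a C# sketch. The class names Game and ThreeMoveChess and the logging are invented for this example and are deliberately toy-sized.

```csharp
using System;
using System.Collections.Generic;

public abstract class Game
{
    protected readonly List<string> Log = new List<string>();

    // The template method: the order of the steps is fixed here.
    public IReadOnlyList<string> Play()
    {
        Initialize();
        while (!IsOver())
        {
            TakeTurn();
        }
        AnnounceWinner();
        return Log;
    }

    // The steps subclasses must supply.
    protected abstract void Initialize();
    protected abstract bool IsOver();
    protected abstract void TakeTurn();

    // An optional step with a default that subclasses may override.
    protected virtual void AnnounceWinner()
    {
        Log.Add("game over");
    }
}

// Toy subclass: a "chess" game that simply ends after three moves.
public sealed class ThreeMoveChess : Game
{
    private int _movesLeft;

    protected override void Initialize()
    {
        _movesLeft = 3;
        Log.Add("board set up");
    }

    protected override bool IsOver()
    {
        return _movesLeft == 0;
    }

    protected override void TakeTurn()
    {
        _movesLeft--;
        Log.Add("move played");
    }
}
```

If ThreeMoveChess is the only subclass you will ever write, the abstract base buys you nothing over putting the loop directly in the chess class, which is the point the answers above are making.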
This could be useful in a Chess application. However, you should not start developing an application with the idea 'I'm going to use this pattern and that one and..'. Instead, you develop the code and you discover that you need certain patterns.
This is where a Test Driven Development approach is really handy. It allows you to refactor your code each step of the way.
A nice book that explains this is Refactoring To Patterns.
I would suggest writing your chess game first and then, if the need arises in the future, coming back and changing things to fit Monopoly too. But it's something totally different if you want to use the pattern to learn the pattern; in that case it's good to start simple so the complex is easier to understand.
It really depends on the parts of the program. The whole idea of Template is to have an algorithm that never changes and to be able to add or edit certain steps of that algorithm.
It may well be that you never change it. However, this is the issue with design principles: it IS good practice, and you may later wish you'd implemented them. I would say, though, that if you are 100% sure, you can leave it out, as that usually saves time and lines of code. It depends on whether you want to learn Template usage or not.
Also the GOF principles website is quite good:

If it is possible to auto-format code before and after a source control commit, checkout, diff, etc. does a company really need a standard code style?

It feels like standard coding style debates that have been raging since programming began like "put the bracket on the following line" or "properly indent your (" are no longer essential.
I realize that in languages where whitespace matters the diff will have to take it into account, but for languages where the style is a personal preference, is there really a need to worry about it anymore?
Auto-format can really only address whitespace.
It won't address developers giving variables bizarre, nonsensical names.
It won't address some developers having functions return null on an error vs throwing an exception.
I'm sure others can think of more examples.
This is what we do at my work:
We all use Eclipse. We don't have a policy for using Eclipse but somehow none of us is an IDEA/IntelliJ guy. We also think our code should be written with legacy in mind. This means our code has to be readable in a certain way even years after (#1) no matter who wrote it and if that person even is in the company anymore.
Eclipse has a couple of handy features: automatic format on save and a specific Formatter tool. As you can see from the linked screenshot, it can be configured with XML. So there's a bunch of premade XML files available to every worker in our company, and when a new guy comes in, we walk him through the whole process and configure his Eclipse for him (yes, it's a slightly evil thing to do) so that it actually uses the formatting XML files we have provided. We do not enforce automatic format on save; we don't want to be completely intrusive, we just want to push all our developers in the right direction. For even better compatibility, we mostly use rules defined in JCC.
Next comes the important part, the actual builds. We embrace automatic builds, and for that we use the Hudson Continuous Integration Server. There are two important parts in our configuration beyond this:
We use CVS loginfo to trigger builds whenever something is committed.
We utilize several plugins available for Hudson, including Continuous Integration Game in conjunction with the most important one, Checkstyle.
The Checkstyle Plugin is the magician in our code style enforcement guideline:
After committing code to CVS, a Hudson build is triggered
After the build has completed successfully (all unit tests pass etc.), Checkstyle inspects the actual source files
Checkstyle ranks the code based on the rules we have defined for it
Continuous Integration Game sees the result of Checkstyle and awards/takes away points for the person who has the ownership for the relevant part of the code
Leaderboard shows total points for every committer in the system
Basically this means that when anyone commits ugly code into our CVS, our build server automatically reduces that person's points.
This means that eventually any one of us can be ranked on the Leaderboard based on the general code quality in both look and OO principles such as Law of Demeter, Cyclomatic complexity etc. etc. Naturally this isn't a completely serious statistic, but it's a good indication you're doing something wrong when causing a build to be initiated in our CI won't reduce your points - most of our commits are worth between 1 and 5 points.
And is it working? Sort of. I don't think any one of us at my work writes ugly or unmaintainable code, and personally I love to hunt all kinds of scores, so it's definitely motivating me to write code that looks nice and follows all the OO paradigms I know of.
And do we as a company really need it? I think we do as you should see from reading this entire answer, it can be considered a good practice for the advancements it brings.
#1: in a related note, I refactored legacy code from 2002 today which used those standards, didn't look "bad" at all even in its original form and certainly not worse in its new form
No, not really.
That is, if you can actually get it to work consistently without flagging code as changed merely because it is laid out in a different style.
However, this is just a small part of coding standards. It won't cover multiple return statements, the use or not of ternary operators, etc.
It is always nice if the coding style that the shop uses is the same one that is also followed by the development tools.
Otherwise, if there is a large body of code that already follows a shop standard which is NOT the same as that of the tools, you have two choices:
Modify all of the code to follow the tool standard, or
Maintain the existing shop standard.
Many shops do the latter. Regardless, there does need to be some kind of standard, and it does need to be followed.
Some development tools allow you to tweak their standard. In some cases you may be able to bring the tools in alignment with the shop standard.
It probably doesn't matter that much anymore if you can ensure that everybody in the team sees the source code "correctly" formatted, whatever they think it is. However I've not seen a system that can do that - you can do parts of it (say, reformat before and after checkin/checkout) but these days you also have to consider web interfaces into the version control, external code review systems that interact directly with the version control system etc.
The main purpose of a standard code style is (IMHO) to ensure that you can read other team members' code easily, without having to start reverse engineering it, because all the code is written using the same sort of guiding principles. Indenting and parenthesis placement seem to be a major hang-up here, but in my opinion they are only a very small, somewhat overblown and not very important part of the need to make code consistent.
Unfortunately I'm not aware of any tools that can automatically apply consistent coding principles to source code...
Yes, coding styles are needed if there is a desire to have a homogeneous code base. Such a code base can be useful in preventing individual ownership of parts of the code base, which can cause problems when people leave the team. If you can't imagine having wildly different styles and problems understanding all of it, just look at all the different ways English text can be organized in various communications, all written but quite different such as tweets, e-mail, text messages, IM, message board posts, etc. and changes in fonts, capitalization, decorations, etc.

artificial intelligence - Creative Writing

I am trying to find information (and hopefully C# source code) about creating a basic AI tool that can understand English words, grammar and context.
The idea is to train the AI using as many written documents as possible and then, based on these documents, have the AI produce its own creative writing in proper English that makes sense to a human.
While the idea is simple, I do realise that the hurdles are huge. Any starting points or good resources will be appreciated.
A basic AI tool that you can use to do something like this is a Markov Chain. It's actually not too tricky to write!
See: http://pscode.com/vb/scripts/ShowCode.asp?txtCodeId=2031&lngWId=10
If that's not enough, you might be able to store WordNet synsets in your Markov chain instead of just words. This gives you some sense of the meaning of the words.
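For what it's worth, a word-level Markov chain can be sketched in a few dozen lines of C#. This is only a first-order chain (each word depends on the one before it), and the class name MarkovChain plus the Train and Generate methods are my own naming for illustration, not taken from the linked code.

```csharp
using System;
using System.Collections.Generic;

// Minimal first-order, word-level Markov chain (illustrative sketch).
public sealed class MarkovChain
{
    // For each word, every word that followed it in the training text.
    // Duplicates are kept on purpose so frequent followers are picked more often.
    private readonly Dictionary<string, List<string>> _next =
        new Dictionary<string, List<string>>();
    private readonly Random _random;

    public MarkovChain(int seed)
    {
        _random = new Random(seed);
    }

    public void Train(string text)
    {
        var words = text.Split(new[] { ' ', '\t', '\r', '\n' },
            StringSplitOptions.RemoveEmptyEntries);
        for (int i = 0; i + 1 < words.Length; i++)
        {
            List<string> followers;
            if (!_next.TryGetValue(words[i], out followers))
            {
                followers = new List<string>();
                _next[words[i]] = followers;
            }
            followers.Add(words[i + 1]);
        }
    }

    // Walks the chain from a start word until maxWords is reached
    // or the current word has no known follower.
    public string Generate(string start, int maxWords)
    {
        var output = new List<string> { start };
        var current = start;
        List<string> followers;
        while (output.Count < maxWords && _next.TryGetValue(current, out followers))
        {
            current = followers[_random.Next(followers.Count)];
            output.Add(current);
        }
        return string.Join(" ", output);
    }
}
```

Trained on a large corpus, this produces text that is locally plausible but wanders globally. Keying the dictionary on pairs of words instead of single words usually makes the output read better, at the cost of needing more training text.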
To be able to recompose a document, you are going to need a way to filter out the bad results.
Which means:
You are going to have to write a program that can evaluate whether the output is valid (grammatically and syntactically valid is the best you can reliably do; this would require NLP)
You would need lots of training data and test data
You would need to watch out for overtraining (take a look at ROC curves)
Instead of writing a tool you could:
Manually score the output (it will take a long time to properly train the algorithm)
With this using the Amazon Mechanical Turk might be a good idea
The irony of this: The computer would have a difficult time "Creatively" composing something new. All of its worth will be based on its previous experiences [training data]
Some good references and reading at this Natural Language article.
As others said, a Markov chain seems to be most suitable for such a task. A nice description of implementing a Markov chain can be found in Kernighan & Pike, The Practice of Programming, section 3.1. A nice description of text generation is also given in Programming Pearls.
One thing, though not quite what you need, would be a Markov chain of words. Here's a link I found by a quick search: http://blog.figmentengine.com/2008/10/markov-chain-code.html, but you can find much more information by searching for it.
Take a look at http://www.nltk.org/ (Natural Language Toolkit), lots of powerful tools there. They use Python (not C#) but Python is easy enough to pick up. Much easier to pick up than the breadth and depth of natural language processing, at least.
I agree that you will have trouble creating something creative. You could possibly also use a keyword spinner on certain words. You might also want to implement a stop word filter to remove anything colloquial.

C# libraries for internationalization?

What libraries are there to write C# internationalized applications?
Typical functionalities that should be contained in the library:
Validation of country specific data (e.g. VAT numbers, phone numbers, addresses,...)
Validation of bank and financial coordinates (e.g. Credit Card numbers, IBAN,...)
Language-specific functionalities (e.g. numbers to words to numbers, summarize,...)
Language specific content filtering (e.g. swearword filtering...)
An example of such libraries in Perl would be the Internationalization/Locale section of CPAN.
What C# solutions are available?
Note: I am not looking for an introduction to the System.Globalization namespace :)
Note 2: Should I assume that there are no options available? Is anyone interested in joining forces to create one?
Note 3: Edit to make the question appear on front page in hope of more answers. This isn't such a hard question, how is it possible that Stackers don't ever do i18n?
One project that is working towards a database of globalization, internationalization and localization knowledge is the Unicode Common Locale Data Repository, based on the old ICU project at IBM.
As it is a database of XML data it doesn't contain any .NET-specific code, but as a body of knowledge it is very good.
Only a smallish subset is in the .NET framework. Microsoft hasn't gone near any of the supplemental stuff, like postcode formats, number spelling (for check/cheque amounts), etc. Standard time zone names (from the Olson/tz distribution), etc. are also included, with mappings to the Windows-specific names. Some of the hierarchical locale-specific behaviours also have better support.
I wouldn't say that no one does i18n, but I don't know of any generic tools that can be used for every project. Maintaining a database with all of the information you are looking for would be an epic project. It sounds like what you're looking for isn't a specific C# library, but more a collection of information online that you can draw from. If you were able to find a repository of swear words in various languages (for example), it would be trivial for you to use this in C#. I think that finding a solution that wraps up all of your requirements into an easy-to-use assembly is going to be impossible to find.
Have a look at
http://www.microsoft.com/globaldev/getwr/dotneti18n.mspx
and
http://www.dotneti18n.com/
String to number and vice versa can be done as follows:
// requires: using System; using System.Globalization;
var culture = new CultureInfo(locale); // locale is a culture name such as "de-DE"
int number = Convert.ToInt32(myString, culture.NumberFormat);
string str = Convert.ToString(myNumber, culture.NumberFormat);
As to checking VAT numbers and addresses, I'm interested in that too, but haven't found anything useful so far.
Not exactly a "library", per se, but I've actually run into a great service (for pay), by a company called E4X (former client of mine).
What they provide is complete localization of your ecommerce site, including language translations, currency exchanges, local billing and handling of financial transactions including region-specific taxes etc., and more. They even deal with the logistics of physical shipping...
Worth looking into, for an ecommerce business. Let 'em know I sent you... ;-)
That's a huge endeavor. Let's start with one simple problem: phone numbers. Google's libphonenumber library at http://code.google.com/p/libphonenumber/ has a C# port at https://bitbucket.org/pmezard/libphonenumber-csharp with notes at http://blog.thekieners.com/2011/06/06/using-googles-libphonenumber-in-microsoft-net-with-c/. It appears to be a good library for handling both US and international numbers.
