Is it possible to guide a Markov chain toward certain keywords? - c#

I'm writing a chat bot for a software engineering course in C#.
I'm using Markov chains to generate text, using Wikipedia articles as the corpus. I want it to respond to user input in an (at least slightly) intelligent way, based on their input, but I'm not sure how to do it.
My current thinking is that I'd extract keywords from the user's input, then use those to guide the sentence generation. But because of the Markov property, the keywords would have to be the first words in the sentence, which might look silly. Also, for an order-n chain, I'd have to extract exactly n keywords from the user every time.
The data for the generator is a dictionary whose keys are lists of words and whose values are lists of words, each paired with a weight based on how often that word appears after the words in the key. Like so:
{[word1, word2, ..., wordn]: [(word, weight), (word, weight), ...]}
It works in a command-line test program, but I'm just providing an n-word seed for each bit of text it generates.
I'm hoping there's some way I can make the chain prefer words that are close to the words the user used, rather than seeding it with the first/last n words of the input, or n keywords, or whatever. Is there a way to do that?
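For illustration, here is a Python sketch of generation from the weighted dictionary described above. The multiplicative `boost` on keyword weights is an assumption about how such biasing could work, not an established technique, and the toy chain stands in for real Wikipedia-derived data:

```python
import random

def generate(chain, seed, keywords, length=8, boost=5.0):
    """Walk a weighted Markov chain, multiplying the weight of any
    candidate word found in `keywords` by `boost` before sampling.
    `chain` maps a state tuple to a list of (word, weight) pairs."""
    state = tuple(seed)
    out = list(seed)
    for _ in range(length):
        candidates = chain.get(state)
        if not candidates:
            break
        weights = [w * (boost if word in keywords else 1.0)
                   for word, w in candidates]
        next_word = random.choices([word for word, _ in candidates],
                                   weights=weights)[0]
        out.append(next_word)
        state = state[1:] + (next_word,)
    return " ".join(out)

# Toy order-1 chain; real data would come from the Wikipedia corpus.
chain = {
    ("the",):         [("cat", 2.0), ("theory", 1.0)],
    ("cat",):         [("sat", 1.0)],
    ("sat",):         [("the", 0.5)],
    ("theory",):      [("of", 1.0)],
    ("of",):          [("probability", 1.0)],
    ("probability",): [("the", 0.5)],
}
print(generate(chain, ("the",), keywords={"theory"}, boost=50.0))
```

With a large boost, transitions into keyword states dominate the sampling without the keywords having to start the sentence.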

One way to make your chatbot smarter is to identify the topic of the user's input. Assume you have your Markov brain conditioned on different topics as well. Then, to construct your answer, you refer to a dictionary like the one below:
{([word1, word2, ..., wordn], topic): [(word, weight), (word, weight), ...]}
To find the topics, you can start with WikipediaMiner. For instance, below are the topics and their corresponding weights found by the Wikify API for the sentence:
Statistics is so hard. Do you have some good tutorial of probability theory for a beginner?
[{'id': 23542, 'title': 'Probability theory', 'weight': 0.9257584778725553},
{'id': 30746, 'title': 'Theory', 'weight': 0.7408577501980528},
{'id': 22934, 'title': 'Probability', 'weight': 0.7089442931022307},
{'id': 26685, 'title': 'Statistics', 'weight': 0.7024251356953044}]
Those identified keywords would probably also make good seeds.
However, question answering is not so simple. Markov-based sentence generation has no ability to understand the question at all; the best it can do is provide related content. Just my 2 cents.

Related

Facebook Sentiment Analysis API

I want to try to create an application which rates the user's Facebook posts based on their content (sentiment analysis).
I tried creating an algorithm myself initially, but I felt it wasn't that reliable.
I created a dictionary list of words, scanned the posts against the dictionary, and rated each post as positive or negative.
However, I feel this is minimal. I would like to rate the mood, feelings, or personality traits of the person based on the posts. Is this possible to do?
I would hope to make use of some online APIs; please assist. Thanks ;)
As @Jared pointed out, using a dictionary-based approach can work quite well in some situations, depending on the quality of your training corpus. This is actually how the implementations in CLiPS Pattern and TextBlob work.
Here's an example using TextBlob:
from textblob import TextBlob
b = TextBlob("StackOverflow is very useful")
b.sentiment # returns (polarity, subjectivity)
# (0.39, 0.0)
By default, TextBlob uses pattern's dictionary-based algorithm. However, you can easily swap out algorithms. You can, for example, use a Naive Bayes classifier trained on a movie reviews corpus.
from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer
b = TextBlob("Today is a good day", analyzer=NaiveBayesAnalyzer())
b.sentiment # returns (label, prob_pos, prob_neg)
# ('pos', 0.7265237431528468, 0.2734762568471531)
The algorithm you describe should actually work well, but the quality of the result depends greatly on the word list used. For Sentimental, we take comments on Facebook posts and score them based on sentiment. Using the AFINN-111 word list to score the comments word by word, this approach is (perhaps surprisingly) effective. By normalizing and stemming the words first, you should be able to do even better.
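As a sketch of that word-by-word scoring (in Python, with a tiny made-up score table standing in for the real AFINN-111 list):

```python
# A sketch of word-by-word dictionary scoring; the scores below are a tiny
# illustrative stand-in for a real word list such as AFINN-111.
AFINN_SAMPLE = {"good": 3, "useful": 2, "effective": 2,
                "bad": -3, "terrible": -3}

def score(text, word_scores):
    # Normalize by stripping basic punctuation and lowercasing; real use
    # would also stem the words first, as suggested above.
    words = [w.strip(".,!?;:").lower() for w in text.split()]
    return sum(word_scores.get(w, 0) for w in words)

print(score("StackOverflow is very useful", AFINN_SAMPLE))        # 2
print(score("Today was a terrible, terrible day", AFINN_SAMPLE))  # -6
```

Words not in the list contribute zero, so the overall sign of the sum gives the positive/negative rating.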
There are lots of sentiment analysis APIs that you can easily incorporate into your app, and many have a free usage allowance (usually around 500 requests a day). I started a small project that compares how each API (currently ten: AIApplied, Alchemy, Bitext, Chatterbox, Datumbox, Lymbix, Repustate, Semantria, Skyttle, and Viralheat) classifies a given set of texts as positive, negative, or neutral: https://github.com/skyttle/sentiment-evaluation
Each specific API can offer lots of other features, like classifying emotions (delight, anger, sadness, etc.) or linking sentiment to the entities it is attributed to. You just need to go through the available features and pick the one that suits your needs.
TextBlob is another possibility, though it will only classify texts into pos/neg/neu.
If you are looking for an open-source sentiment analysis engine based on a Naive Bayes classifier in C#, take a peek at https://github.com/amrishdeep/Dragon. It works best on a large corpus of words, like blog posts or multi-paragraph product reviews. However, I am not sure whether it would work for Facebook posts that contain only a handful of words.

To find out the number of occurrences of words in a file

I came across this question in an interview:
We have to find the number of occurrences of two given words in a text file with at most n words between them.
Example1:
text:`this is first string this is second string`
Keywords:`this, string`
n= 4
output= 2
"this is first string" is the first occurrence; the number of words between this and string is 2 (is, first), which is less than 4.
"this is second string" is the remaining string; the number of words between this and string is 2 (is, second), which is less than 4.
Therefore the answer is 2.
I have thought that I will use
Dictionary<string, List<int>>.
My idea was to use the dictionary to get the list of positions where each particular word occurs, then iterate through both lists, incrementing a count whenever the condition is met, and finally display the count.
Is my thinking process correct? Please provide any suggestions to improve my solution.
Thanks,
Not an answer per se (as, quite honestly, I don't understand the question :P), but to add some general interview advice to the other answers:
In interviews, the interviewer is always looking for your thought process and evidence that you are a critical, logical thinker, not necessarily that you have excellent coding recall and can compile code in your brain.
In addition, interviews are a stressful process. By slowing down and talking out loud as you work things out, you not only come across as a better communicator and logical thinker (even if you get the question wrong), you also give yourself time to think.
Use a pen and paper, speak as you think, start from the top, and work through it. I've got jobs even when I didn't know the answers to tech questions by demonstrating that I could at least try to work things out ;-)
In short, it's not just down to technical prowess.
I think it depends on whether the call is made only once or multiple times per string. If it's something like
int getOccurences(String str, String reference, int min_size) { ... }
then you don't really need the dictionary, not even a list. You can just iterate through the string to find occurrences of the words and then check the number of separators between them.
If on the other hand the problem is for arbitrary search/indexing, IMHO you do need a dictionary. I'd go for a dictionary where the key is the word and the value is a list of indexes where it occurs.
HTH
If you need to do that repeatedly for different pairs of words in the same text, then a word dictionary with a list of indexes is a good solution. However, if you were only looking for one pair, then two lists of indexes for those two words would be sufficient.
The lists allow you to separate the word detection operation from the counting logic.
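A Python sketch of that index-list approach, pairing each occurrence of the first word with the next unconsumed occurrence of the second word (the non-overlapping pairing matches the worked example in the question; that rule is my reading of it):

```python
def count_pairs(text, w1, w2, n):
    """Count non-overlapping w1 ... w2 occurrences with at most n words
    between them, using per-word position lists (the equivalent of the
    Dictionary<string, List<int>> the question proposes)."""
    positions = {}
    for i, w in enumerate(text.split()):
        positions.setdefault(w, []).append(i)
    first = positions.get(w1, [])
    second = positions.get(w2, [])
    count, j = 0, 0
    for i in first:
        # advance to the first w2 occurrence after this w1
        while j < len(second) and second[j] <= i:
            j += 1
        if j < len(second) and second[j] - i - 1 <= n:
            count += 1
            j += 1  # consume this w2 so occurrences do not overlap
    return count

print(count_pairs("this is first string this is second string",
                  "this", "string", 4))  # 2
```

Building the position lists is one pass over the file; the counting itself is then linear in the lengths of the two lists.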

Finding string segments in a string

I have a list of segments (15,000+), and I want to find the occurrences of these segments in a given string. A segment can be a single word or multiple words, and I cannot assume that space is a delimiter in the string.
e.g.
String "How can I download codec from internet for facebook, Professional programmer support"
[the string above may not make any sense but I am using it for illustration purpose]
segment list
Microsoft word
Microsoft excel
Professional Programmer.
Google
Facebook
Download codec from internet.
Output:
Download codec from internet
facebook
Professional programmer
Basically, I am trying to do query reduction.
I want to achieve this in O(list length + string length) time or better.
As my list has more than 15,000 segments, it would be time-consuming to search for every entry in the string.
The segments are prepared manually and placed in a txt file.
Regards
~Paul
You basically want a string search algorithm like Aho-Corasick string matching. It constructs a state machine for processing bodies of text to detect matches, effectively searching for all patterns at the same time. Its runtime is on the order of the length of the text plus the total length of the patterns.
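For reference, here is a compact Python sketch of Aho-Corasick (a trie over the patterns plus failure links). In production you would use a library implementation, but this shows the one-pass matching the answer describes:

```python
from collections import deque

def build_automaton(patterns):
    """Build a minimal Aho-Corasick automaton: a trie over the patterns
    plus failure links, so one pass over the text finds every pattern."""
    goto, fail, out = [{}], [0], [[]]
    for pat in patterns:
        s = 0
        for ch in pat:
            if ch not in goto[s]:
                goto.append({}); fail.append(0); out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(pat)
    # BFS to fill in failure links (depth-1 states fail to the root)
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] += out[fail[t]]  # inherit matches ending at the fail state
    return goto, fail, out

def find_segments(text, segments):
    """Return the set of segments occurring in text, case-insensitively."""
    goto, fail, out = build_automaton([seg.lower() for seg in segments])
    found, s = set(), 0
    for ch in text.lower():
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        found.update(out[s])
    return found
```

Note this matches raw substrings (spaces included in the patterns), which is why no word delimiter needs to be assumed.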
In order to do efficient searches, you will need an auxiliary data structure in the form of some sort of index. Here, a great place to start would be to look at a KWIC index:
http://en.wikipedia.org/wiki/Key_Word_in_Context
http://www.cs.duke.edu/~ola/ipc/kwic.html
What you're basically asking is how to write a custom lexer/parser.
Some good background on the subject would be the Dragon Book or something on lex and yacc (flex and bison).
Take a look at this question:
Poor man's lexer for C#
Now, of course, a lot of people are going to say "just use regular expressions". Perhaps. The problem with using regexes in this situation is that your execution time will grow linearly with the number of tokens you are matching against. So if you end up needing to "segment" more phrases, your execution time will get longer and longer.
What you need to do is make a single pass, pushing words onto a stack and checking whether they form a valid token after adding each one. If they don't, then you need to continue (disregarding the token like a compiler disregards comments).
Hope this helps.

Approximate string matching

I know this question has been asked many times.
I want a suggestion on which algorithm is suitable for approximate string matching.
The application is specifically for company name matching only and nothing else.
The biggest challenge is probably the variation in company-name endings and shortened forms.
Example:
1. companyA pty ltd vs companyA pty. ltd. vs companyA
2. WES Engineering vs W.E.S. Engineering (extremely rare occurrence)
Do you think Levenshtein Edit Distance is adequate?
I'm using C#
Regards,
Max
There are various string distance metrics you could use.
I would recommend Jaro-Winkler. Unlike edit distance, where the result of a comparison is in discrete units of edits, JW gives you a score between 0 and 1. It is especially suited to proper names. Also look at this nice tutorial and this SO question.
I haven't worked with C# but here are some implementations of JW I found online:
Impl 1 (they have a .NET version too if you look at the file list)
Impl 2
If you want to do a bit more sophisticated matching, you can try some custom normalization of word forms commonly occurring in company names, such as ltd/limited, inc/incorporated, corp/corporation, to account for case insensitivity, abbreviations, etc. This way, if you compute
distance(normalize("foo corp."), normalize("FOO CORPORATION"))
you should get the result to be 0 rather than 14 (which is what you would get if you computed the Levenshtein edit distance directly).
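A Python sketch of this normalize-then-compare idea; the suffix table and regex cleanup are small illustrative assumptions, not a complete normalizer:

```python
import re

# A small illustrative suffix table; a real normalizer would cover many
# more forms (gmbh, llc, plc, ...).
SUFFIXES = {"ltd": "limited", "inc": "incorporated",
            "corp": "corporation", "pty": "proprietary"}

def normalize(name):
    """Lowercase, strip punctuation, and expand common company suffixes."""
    words = re.sub(r"[^\w\s]", " ", name.lower()).split()
    return " ".join(SUFFIXES.get(w, w) for w in words)

def levenshtein(a, b):
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

print(levenshtein(normalize("foo corp."), normalize("FOO CORPORATION")))  # 0
```

Because normalization maps both names to the same canonical string, the distance collapses to 0 before any fuzzy matching is even needed.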
Yes, Levenshtein distance is suitable for this. It will work for all those you have listed at least.
You could also possibly use Soundex, but I don't think you'll need it.
In these simple examples, just removing all non-alphanumeric characters gives you a match, and it is the easiest approach: you can pre-compute the normalized form on each side, then do a straight equality check, which will be a lot faster than cross-multiplying and calculating edit distances.
I have provided my answer already in another question.
https://stackoverflow.com/a/30120166/2282794
I have worked on really large scale system with similar name matching requirements that you have talked about.
Name matching is not very straightforward and the order of first and last names might be different.
Simple fuzzy name matching algorithms fail miserably in such scenarios.
If we just want to talk about approximate string matching algorithms, there are many: Jaro-Winkler, edit distance (Levenshtein), Jaccard similarity, Soundex/phonetics-based algorithms, etc. A simple googling would give us all the details.
You can implement all of them in C#.
The irony is that they work when you try to match two given input strings; they are fine in theory and for demonstrating how fuzzy or approximate string matching works.
However, the grossly understated point is how to use them in a production setting. Not everybody I know who was scouting for an approximate string matching algorithm knew how to solve the problem in a production environment.
I might have just talked about Lucene, which is specific to Java, but there is Lucene for .NET as well.
https://lucenenet.apache.org/

How can I generate pseudo-random "readable" strings in Java?

Generating a truly random string of a given length is a fairly straightforward (and already-well-covered) task.
However; I'd like to generate a "pseudo" random string with the additional constraint that it be relatively easily readable (to a native-English reader.)
I think another way to say this is that the generated string should consist of "recognizable syllables." For example, "akdjfwv" is a random string, but it's not recognizable at all. "flamyom," however, is very "recognizable" (even though it's nonsense).
Obviously, one could make a long list of "recognizable syllables," and then randomly select them.
But, is there a better way to do something like programmatically generate a "recognizable syllable," or generate a "syllable" and then test it to see if it's "recognizable"?
I can think of several ways to go about this implementation, but if someone has already implemented it (preferably in Java or C#), I'd rather re-use their work.
Any ideas?
You could try implementing a Markov chain and give it a suitable passage to process. There is a Java implementation that may work for you.
This is a sample from interpolating between Genesis in English and Genesis in Spanish (N = 1):
In bersaran thelely and avin inder tht teathe m lovig weay waw thod mofin he t thte h fupiteg s o t llissed od ma. lllar t land fingujod maid af de wand tetodamoiz fosu Andesp. ersunen thenas lowhejod whipanirede tifinas Gofuavithila d gió Y Diche fua Dios co l, liens ly Y crerdíquen ticuesereregos hielase agúnd veumarbas iarasens laragún co eruerá laciéluelamagúneren Dien a He.
I think this should do what you want:
Java Password Generator
It has the source code and a permissive license so you can adapt the source code to what you are looking for.
You need to generate random syllables. The simplest way to do it is to use syllables that are consonant-vowel, or consonant-vowel-consonant. From a list of consonants and vowels, pick randomly to build syllables, then join the syllables together to make a string.
Keep in mind your list of consonants shouldn't be letters that are consonants, but phonemes, so "th", "st", "sl", etc, could be entries in the consonant list.
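A Python sketch of that idea (the phoneme lists here are illustrative, not a complete inventory):

```python
import random

# Illustrative phoneme lists, as suggested above: the onsets include
# multi-letter consonant clusters like "th" and "st", not just letters.
ONSETS = ["b", "d", "f", "fl", "g", "k", "l", "m", "n",
          "p", "r", "s", "sl", "st", "t", "th", "v", "w"]
VOWELS = ["a", "e", "i", "o", "u", "oo", "ea"]
CODAS = ["m", "n", "t", "k", "l", "r"]

def syllable(rng):
    # consonant-vowel, with an optional closing consonant (CV or CVC)
    s = rng.choice(ONSETS) + rng.choice(VOWELS)
    if rng.random() < 0.5:
        s += rng.choice(CODAS)
    return s

def readable_string(n_syllables, rng=random):
    """Join randomly built syllables into a pronounceable nonsense word."""
    return "".join(syllable(rng) for _ in range(n_syllables))

print(readable_string(3))  # "flamyom"-style pronounceable nonsense
```

Because every syllable alternates a consonant cluster with a vowel, the output stays pronounceable without any dictionary of pre-listed syllables.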
You really should check out SCIgen. It generates entire semi-nonsense scientific papers: http://pdos.csail.mit.edu/scigen/
And the source is available: it's released under GPL, and is currently available via anonymous CVS.
I'm not sure exactly what you need this for, but graphic-layout folks in the print industry have used Lorem Ipsum generators to create text that looks enough like real text that your brain processes it as such, without actually being readable words. More info here.
I don't know if there's a web service to which you could subscribe, but there are several sites which will just generate Lorem Ipsum strings for you, so you may be able to use those.
There is a good section on this in Programming Pearls. It's online, but I'd highly recommend buying the book; it's one of the best programming books around, in my opinion.
Lots of Lorem Ipsum generators out there.
It all gets back to why you want this. If you just want "pronounceable gibberish," I'd think the easiest thing to do would be to generate alternating consonants and vowels. That would be a tiny subset of all pronounceable gibberish, but what's the goal? To give a little broader range, you could create a table of consonant phonemes and vowel phonemes, with the consonant list including not just individual letters like "b" and "d" but also "th", "br", and so on, and the vowel list including "oo" and "ea", etc. One more step would be to generate syllables instead of letters, with each syllable containing a vowel, consonant-vowel, or consonant-vowel-consonant. That is, loop through creating syllables, then within each syllable pick one of the three patterns. You probably want to forbid two vowel-only syllables in a row. (I'm trying to think of an example of that in English. It probably happens, but the only examples I can think of are borrowed from other languages, like "stoa.")
I created a Java package, Pronounceable String Generator, for generating pronounceable random strings quickly.
Just create a PronounceableStringGenerator object and invoke the generate method:
PronounceableStringGenerator mg = new PronounceableStringGenerator();
System.out.println(mg.generate(8));//8 is the length of the generated string
System.out.println(mg.generate(10));
System.out.println(mg.generate(6));
