Finding string segments in a string - C#

I have a list of segments (15,000+), and I want to find the occurrences of those segments in a given string. A segment can be a single word or multiple words, and I cannot assume space is a delimiter in the string.
e.g.
String "How can I download codec from internet for facebook, Professional programmer support"
[the string above may not make any sense, but I am using it for illustration purposes]
Segment list:
Microsoft word
Microsoft excel
Professional Programmer.
Google
Facebook
Download codec from internet.
Output:
Download codec from internet
facebook
Professional programmer
Basically I am trying to do query reduction.
I want to achieve this in less than O(list length + string length) time.
As my list contains more than 15,000 segments, it would be too time-consuming to search the entire list against the string.
The segments are prepared manually and placed in a text file.
Regards
~Paul

You basically want a string-search algorithm like Aho-Corasick string matching. It constructs a state machine for processing bodies of text to detect matches, effectively searching for all patterns at the same time. Its runtime is on the order of the length of the text plus the total length of the patterns.
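Sketched in C#, a minimal (illustrative, not production-grade) Aho-Corasick might look like this; the lower-casing is my own assumption, so that "Facebook" matches "facebook":

using System;
using System.Collections.Generic;

class AhoCorasick
{
    class Node
    {
        public Dictionary<char, Node> Next = new Dictionary<char, Node>();
        public Node Fail;
        public List<string> Output = new List<string>();
    }

    readonly Node root = new Node();

    public void AddPattern(string pattern)
    {
        var node = root;
        foreach (char c in pattern.ToLowerInvariant())
        {
            if (!node.Next.TryGetValue(c, out var child))
                node.Next[c] = child = new Node();
            node = child;
        }
        node.Output.Add(pattern);
    }

    // Breadth-first pass over the trie to wire up the failure links.
    public void Build()
    {
        var queue = new Queue<Node>();
        foreach (var child in root.Next.Values)
        {
            child.Fail = root;
            queue.Enqueue(child);
        }
        while (queue.Count > 0)
        {
            var node = queue.Dequeue();
            foreach (var pair in node.Next)
            {
                var child = pair.Value;
                var fail = node.Fail;
                while (fail != null && !fail.Next.ContainsKey(pair.Key))
                    fail = fail.Fail;
                child.Fail = fail == null ? root : fail.Next[pair.Key];
                child.Output.AddRange(child.Fail.Output); // inherit shorter matches
                queue.Enqueue(child);
            }
        }
    }

    // One pass over the text: O(text length + number of matches).
    public IEnumerable<string> Search(string text)
    {
        var node = root;
        foreach (char c in text.ToLowerInvariant())
        {
            while (node != root && !node.Next.ContainsKey(c))
                node = node.Fail;
            if (node.Next.TryGetValue(c, out var next))
                node = next;
            foreach (var match in node.Output)
                yield return match;
        }
    }
}

// Usage, with segments/input standing in for whatever you load from the file:
// var ac = new AhoCorasick();
// foreach (var s in segments) ac.AddPattern(s);
// ac.Build();
// var found = new HashSet<string>(ac.Search(input));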

In order to do efficient searches, you will need an auxiliary data structure in the form of some sort of index. Here, a great place to start would be to look at a KWIC index:
http://en.wikipedia.org/wiki/Key_Word_in_Context
http://www.cs.duke.edu/~ola/ipc/kwic.html
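As a toy illustration of the idea (the names here are made up), a KWIC-style index over the segment list maps each word to the segments that contain it, so each word of the input only has to be checked against a handful of candidates instead of all 15,000 segments:

using System;
using System.Collections.Generic;

// Hypothetical segment list loaded from the text file.
var segments = new[] { "Microsoft word", "Facebook", "Download codec from internet" };

// Build: word -> segments containing that word.
var kwic = new Dictionary<string, List<string>>(StringComparer.OrdinalIgnoreCase);
foreach (var segment in segments)
{
    foreach (var word in segment.Split(' '))
    {
        if (!kwic.TryGetValue(word, out var list))
            kwic[word] = list = new List<string>();
        list.Add(segment);
    }
}

// Lookup: only the indexed candidates need to be verified against the input.
if (kwic.TryGetValue("facebook", out var candidates))
    Console.WriteLine(string.Join(", ", candidates));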

What you're basically asking is how to write a custom lexer/parser.
Some good background on the subject would be the Dragon Book or something on lex and yacc (flex and bison).
Take a look at this question:
Poor man's lexer for C#
Now, of course, a lot of people are going to say "just use regular expressions". Perhaps. The problem with using regexes in this situation is that your execution time will grow linearly with the number of tokens you are matching against. So if you end up needing to "segment" more phrases, your execution time will get longer and longer.
What you need to do is make a single pass, pushing words onto a stack and checking whether they form a valid token after each one is added. If they don't, you continue (disregarding the token the way a compiler disregards comments).
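Here's a hedged sketch of that single pass; the segment set is hypothetical, and this greedy version misses overlapping matches (one reason to prefer Aho-Corasick, mentioned above):

using System;
using System.Collections.Generic;
using System.Linq;

var segments = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
{
    "facebook", "professional programmer", "download codec from internet"
};
string text = "How can I download codec from internet for facebook, Professional programmer support";
var words = text.Split(new[] { ' ', ',', '.' }, StringSplitOptions.RemoveEmptyEntries);

var buffer = new List<string>();
foreach (var word in words)
{
    buffer.Add(word);
    // Drop words from the front until the buffer is a prefix of some segment.
    while (buffer.Count > 0 &&
           !segments.Any(s => s.StartsWith(string.Join(" ", buffer), StringComparison.OrdinalIgnoreCase)))
    {
        buffer.RemoveAt(0);
    }
    // Emit a token once the buffer matches a whole segment.
    string candidate = string.Join(" ", buffer);
    if (segments.Contains(candidate))
    {
        Console.WriteLine(candidate);
        buffer.Clear();
    }
}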
Hope this helps.

Facebook Sentiment Analysis API

I want to try to create an application which rates a user's Facebook posts based on their content (sentiment analysis).
I tried creating an algorithm myself initially, but I felt it wasn't that reliable:
I created a dictionary list of words, scanned the posts against the dictionary, and rated each one as positive or negative.
However, I feel this is minimal. I would like to rate the mood or feelings/personality traits of the person based on the posts. Is this possible to do?
I would hope to make use of some online APIs. Please assist. Thanks ;)
As @Jared pointed out, using a dictionary-based approach can work quite well in some situations, depending on the quality of your training corpus. This is actually how the CLiPS pattern and TextBlob implementations work.
Here's an example using TextBlob:
from textblob import TextBlob
b = TextBlob("StackOverflow is very useful")
b.sentiment # returns (polarity, subjectivity)
# (0.39, 0.0)
By default, TextBlob uses pattern's dictionary-based algorithm. However, you can easily swap out algorithms. You can, for example, use a Naive Bayes classifier trained on a movie reviews corpus.
from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer
b = TextBlob("Today is a good day", analyzer=NaiveBayesAnalyzer())
b.sentiment # returns (label, prob_pos, prob_neg)
# ('pos', 0.7265237431528468, 0.2734762568471531)
The algorithm you describe should actually work well, but the quality of the result depends greatly on the word list used. For Sentimental, we take comments on Facebook posts and score them based on sentiment. Using the AFINN 111 word list to score the comments word by word, this approach is (perhaps surprisingly) effective. By normalizing and stemming the words first, you should be able to do even better.
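For a C#-flavored sketch of that word-by-word scoring (the four-word dictionary below is a stand-in for the real AFINN-111 file, which maps roughly 2,500 words to scores between -5 and +5):

using System;
using System.Collections.Generic;
using System.Linq;

// Stand-in for the AFINN-111 word list.
var afinn = new Dictionary<string, int>
{
    { "good", 3 }, { "useful", 2 }, { "bad", -3 }, { "terrible", -3 }
};

// Sum the scores of the known words; unknown words contribute nothing.
int Score(string post) =>
    post.ToLowerInvariant()
        .Split(new[] { ' ', ',', '.', '!', '?' }, StringSplitOptions.RemoveEmptyEntries)
        .Sum(w => afinn.TryGetValue(w, out var s) ? s : 0);

Console.WriteLine(Score("Today is a good day"));    // 3  -> positive
Console.WriteLine(Score("What a terrible movie"));  // -3 -> negative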
There are lots of sentiment analysis APIs that you can easily incorporate into your app, and many have a free usage allowance (usually 500 requests a day). I started a small project that compares how each API (currently supporting 10 different APIs: AIApplied, Alchemy, Bitext, Chatterbox, Datumbox, Lymbix, Repustate, Semantria, Skyttle, and Viralheat) classifies a given set of texts into positive, negative, or neutral: https://github.com/skyttle/sentiment-evaluation
Each specific API can offer lots of other features, like classifying emotions (delight, anger, sadness, etc.) or linking sentiment to the entities it is attributed to. You just need to go through the available features and pick the one that suits your needs.
TextBlob is another possibility, though it will only classify texts into pos/neg/neu.
If you are looking for an open-source implementation of a sentiment analysis engine based on a Naive Bayes classifier in C#, take a peek at https://github.com/amrishdeep/Dragon. It works best on large corpora such as blog posts or multi-paragraph product reviews. However, I am not sure whether it would work for Facebook posts that contain only a handful of words.

Sort a text file where each line is a name followed by a score

I'm going to make a scoreboard for an XNA game, but I'm having some trouble figuring out how. The player, after finishing a level, will enter his name when prompted. His name and time will be recorded in a text file, something like this:
sam 90
james 64
matthew 100
I'm trying to figure out a way to sort this data by the time taken only, without taking the name into account.
I haven't started coding this yet but if anybody can give me any ideas it would be greatly appreciated.
First, read the text file using File.ReadAllLines(...) so you get a string array. Then order the array by splitting each string on the space (assuming users can't enter spaces in their names) and taking the second element, which should be the score. You have to parse it into an int with int.Parse(...) so that it orders numerically rather than alphabetically.
using System;
using System.IO;
using System.Linq;

string[] scores = File.ReadAllLines("scorefile.txt");
var orderedScores = scores.OrderByDescending(x => int.Parse(x.Split(' ')[1]));
foreach (var score in orderedScores)
{
    Console.WriteLine(score);
}
// outputs:
// matthew 100
// sam 90
// james 64
I would recommend using something like a semicolon to separate the name and the score instead of a space, as that makes it much easier to handle the case where users are allowed to enter spaces in their names.
Why not make this into a database file (an .sdf), which you can easily create on the fly if needed? That would be best for keeping track of the data, and it allows sorting and future reuse.
SQLite is designed for this exact purpose and makes basic CRUD operations a doddle. You can also encrypt database files if your game grows and you want to start using the database as a way to download/upload high scores and share them with friends/the world.
Don't get me wrong, this is definitely a little more work than simply parsing a text file, but it is future-proof, and you get a lot of functionality out of the box without having to write parsers and complex search routines.
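For illustration, a minimal sketch using the Microsoft.Data.Sqlite package (my choice of package here is an assumption; System.Data.SQLite or SQL Server Compact would look similar):

using System;
using Microsoft.Data.Sqlite;

using var conn = new SqliteConnection("Data Source=scores.db");
conn.Open();

var create = conn.CreateCommand();
create.CommandText = "CREATE TABLE IF NOT EXISTS scores (name TEXT, time INTEGER)";
create.ExecuteNonQuery();

var insert = conn.CreateCommand();
insert.CommandText = "INSERT INTO scores (name, time) VALUES ($name, $time)";
insert.Parameters.AddWithValue("$name", "sam");
insert.Parameters.AddWithValue("$time", 90);
insert.ExecuteNonQuery();

// Let the database do the sorting, ignoring the name entirely.
var query = conn.CreateCommand();
query.CommandText = "SELECT name, time FROM scores ORDER BY time";
using var reader = query.ExecuteReader();
while (reader.Read())
    Console.WriteLine($"{reader.GetString(0)} {reader.GetInt32(1)}");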
XML is a definite other choice, or JSON; all three are good alternatives. A plain text file probably isn't the way to go, though, as in the end it will probably cause you more work.
Create the score table as
name::score
then read every line and split it on the separator:
string line = "Sam::255";
string name = line.Split(new[] { "::" }, StringSplitOptions.None)[0];
int score = int.Parse(line.Split(new[] { "::" }, StringSplitOptions.None)[1]);

Is it possible to guide a Markov chain toward certain keywords?

I'm writing a chat bot for a software engineering course in C#.
I'm using Markov chains to generate text, using Wikipedia articles as the corpus. I want it to respond to user input in an (at least slightly) intelligent way, based on their input, but I'm not sure how to do it.
My current thinking is that I'd try to extract keywords from the user's input, then use those to guide the sentence generation. But because of the Markov property, the keywords would have to be the first words in the sentence, which might look silly. Also, for an nth-order chain, I'd have to extract exactly n keywords from the user every time.
The data for the generator is a dictionary where the keys are lists of words and the values are lists of words, each paired with a weight based on how often that word appears after the words in the key. Like so:
{[word1, word2, ..., wordn]: [(word, weight), (word, weight), ...]}
It works in a command-line test program, but I'm just providing an n-word seed for each bit of text it generates.
I'm hoping there's some way to make the chain prefer words that are near the words the user used, rather than seeding it with the first/last n words of the input, or n keywords, or whatever. Is there a way to do that?
One way to make your chatbot smarter is to identify the topic of the user's input. Assume you have your Markov brain conditioned on different topics as well. Then, to construct your answer, you refer to a dictionary like the one below:
{([word1, word2, ..., wordn], topic): [(word, weight), (word, weight), ...]}
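For illustration (the context, topic, and weights below are made up), one generation step against such a dictionary is just a weighted random draw over the candidate next words:

using System;
using System.Collections.Generic;
using System.Linq;

var rng = new Random();

// Weighted draw over the (word, weight) candidates for the current context.
string NextWord(List<(string Word, double Weight)> candidates)
{
    double roll = rng.NextDouble() * candidates.Sum(c => c.Weight);
    foreach (var (word, weight) in candidates)
    {
        roll -= weight;
        if (roll <= 0) return word;
    }
    return candidates[candidates.Count - 1].Word; // guard against rounding drift
}

// Hypothetical entry for the context ["probability", "theory"] under topic "Statistics".
var candidates = new List<(string, double)> { ("is", 0.5), ("studies", 0.3), ("explains", 0.2) };
Console.WriteLine(NextWord(candidates));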
To find the topics, you can start with WikipediaMiner. For instance, below are the topics and their corresponding weights found by the wikify API for the sentence:
Statistics is so hard. Do you have some good tutorial of probability theory for a beginner?
[{'id': 23542, 'title': 'Probability theory', 'weight': 0.9257584778725553},
{'id': 30746, 'title': 'Theory', 'weight': 0.7408577501980528},
{'id': 22934, 'title': 'Probability', 'weight': 0.7089442931022307},
{'id': 26685, 'title': 'Statistics', 'weight': 0.7024251356953044}]
Those identified keywords are probably also good candidates for seeds.
However, question answering is not so simple. Markov-based sentence generation does not have the ability to understand the question at all; the best it can do is provide related content. Just my 2 cents.

Advanced C# pattern search in long string (100-25000 char)

Let me start with this: I can't zip it or anything similar.
What I'm trying to do is search through fairly large strings. I use data blocks that look like 0g12h. (The 0 is the color from my palette. The g is a separator between the numbers. The 12 means 12 pixels in a row use that color. The h separates the numbers again.)
The problem I'm having is that the blocks aren't all the same length; they range from 0g1h to 2546g115h. Basically, I want to create a palette of common patterns to hopefully save space. Say 12g345h19g12h190g11h occurs at least three times; then I could save space if I had something like a=12g345h19g12h190g11h in the palette array and just put 'a' in the string. I might not even need to respect the data-block boundaries; as you can see in the linked file, you get g640h a ton of times.
I could be wrong, but I'm pretty sure this could work. If you have a better idea of how I could save space without losing data, I'm more than open to ideas.
Here is a great example since you can visually see the pattern: http://pastebin.com/5dbhxZQK. I chose this file because I knew it would have massive redundancy; most aren't this simple.
You could use a dictionary (probably a Dictionary<string, int>) and just count how many times each pattern occurs, then go back and rewrite the string with the appropriate replacements.
However, I would recommend that you read up a little on compression algorithms; what you are implementing appears to be a run-length encoding (RLE) scheme. You are then trying to compress again on top of that, so consider looking at how sliding-window compression works (which is what GZIP does) as an alternative to your RLE. Or look at Huffman coding as a mechanism to reduce the space needed for the codewords you are creating (in simple terms, Huffman coding uses shorter symbols for more frequent patterns and longer symbols for less frequent ones, in an 'optimal' way).
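As a quick sketch of the dictionary-counting idea (the run length of three blocks and the threshold of three occurrences are arbitrary choices for illustration):

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

string data = "12g345h19g12h190g11h12g345h19g12h190g11h12g345h19g12h190g11h";

// Each block is digits + 'g' + digits + 'h'.
var blocks = Regex.Matches(data, @"\d+g\d+h").Cast<Match>().Select(m => m.Value).ToList();

// Count every run of three consecutive blocks.
var counts = new Dictionary<string, int>();
const int window = 3;
for (int i = 0; i + window <= blocks.Count; i++)
{
    string run = string.Concat(blocks.GetRange(i, window));
    counts[run] = counts.TryGetValue(run, out var n) ? n + 1 : 1;
}

// Runs seen three or more times are candidates for a palette entry.
foreach (var pair in counts.Where(p => p.Value >= 3))
    Console.WriteLine($"{pair.Key} x{pair.Value}");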
This is a fun problem space to play in! Good Luck!

Writing an Inverted Index in C# for an information retrieval application

I am writing an in-house application that holds several pieces of text information, as well as a number of pieces of data about those pieces of text. This data will be held within a database (SQL Server, although this could change) in order of entry.
I'd like to be able to search these pieces of information and have the most relevant at the top. I originally looked into using SQL Server Full-Text Search, but it's not as flexible as I had hoped for my other needs, so it seems that I'll need to develop my own solution.
From what I understand, what is needed is an inverted index, whose contents can then be re-sorted and modified based on the additional information held (although for now this can be left for a later date, as I just want the inverted index to index the main text from the database table/strings provided).
I've had a crack at writing this code in Java, using a Hashtable with the words as keys and a list of each word's occurrences as the value, but in all honesty I'm still rather new to C# and have only really used things like DataSets and DataTables when handling information. If requested, I'll upload the Java code soon, once I've cleared this laptop of viruses.
Given a set of entries from a table or from a List of strings, how could one create an inverted index in C# that will preferably save into a DataSet/DataTable?
EDIT: I forgot to mention that I have already tried Lucene and Nutch, but I require my own solution, as modifying Lucene to meet my needs would take far longer than writing an inverted index. I'll be handling a lot of metadata that will also need handling once the basic inverted index is completed, so all I require for now is a basic full-text search on one area using the inverted index. Finally, working on an inverted index isn't something I get to do every day, so it'd be great to have a crack at it.
Here's a rough overview of an approach I've used successfully in C# in the past:
struct WordInfo
{
    public int Position;
    public int FieldID;
}

Dictionary<string, List<WordInfo>> invertedIndex = new Dictionary<string, List<WordInfo>>();

public void BuildIndex()
{
    foreach (int fieldID in GetDatabaseFieldIDS())
    {
        string textField = GetDatabaseTextFieldForID(fieldID);
        string word;
        int position = 0;
        while (GetNextWord(textField, out word, ref position))
        {
            // Create the posting list the first time this word is seen.
            List<WordInfo> postings;
            if (!invertedIndex.TryGetValue(word, out postings))
            {
                postings = new List<WordInfo>();
                invertedIndex.Add(word, postings);
            }
            postings.Add(new WordInfo { Position = position, FieldID = fieldID });
        }
    }
}
Notes:
GetNextWord() iterates through the field and returns the next word and its position. To implement it, look at string.IndexOf() and the char type-checking methods (char.IsLetter, etc.).
GetDatabaseTextFieldForID() and GetDatabaseFieldIDS() are self-explanatory; implement as required.
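For example, a hedged sketch of GetNextWord() along those lines (treating anything char.IsLetterOrDigit accepts as part of a word, which is one possible choice):

static bool GetNextWord(string text, out string word, ref int position)
{
    // Skip separators.
    while (position < text.Length && !char.IsLetterOrDigit(text[position]))
        position++;

    int start = position;

    // Consume the word itself.
    while (position < text.Length && char.IsLetterOrDigit(text[position]))
        position++;

    word = text.Substring(start, position - start);
    return word.Length > 0;
}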
Lucene.net might be your best bet. It's a mature full-text search engine that uses inverted indexes.
http://codeclimber.net.nz/archive/2009/09/02/lucene.net-your-first-application.aspx
UPDATE:
I wrote a little library for indexing against in-memory collections using Lucene.net - it might be useful for this. https://github.com/mcintyre321/Linqdex
If you're looking to spin your own, the Dictionary<TKey, TValue> class is most likely going to be your base, like your Java Hashtable. As for what to store as the values in the dictionary, it's hard to tell based on the information you've provided, but typically search algorithms use some type of Set structure so you can run unions and intersections. LINQ gives you much of that functionality on any IEnumerable, although a specialized Set class may boost performance.
One such implementation of a Set is in the Wintellect PowerCollections. I'm not sure whether it would give you any performance benefit over LINQ, though.
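For instance, intersecting two posting sets with the built-in HashSet<T> (the field IDs below are made up) answers "which fields contain both words":

using System;
using System.Collections.Generic;

var fieldsWithApple = new HashSet<int> { 1, 2, 5 };
var fieldsWithPie = new HashSet<int> { 2, 5, 9 };

// In-place intersection: fieldsWithApple becomes { 2, 5 }.
fieldsWithApple.IntersectWith(fieldsWithPie);
Console.WriteLine(string.Join(", ", fieldsWithApple));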
As far as saving to a DataSet goes, I'm not sure what you're envisioning. I'm not aware of anything that "automagically" writes to a DataSet; I suspect you will have to write that part yourself, especially since you mentioned several times that other third-party options aren't flexible enough.
