How can I generate pseudo-random "readable" strings in Java?

Generating a truly random string of a given length is a fairly straightforward (and already-well-covered) task.
However, I'd like to generate a "pseudo"-random string with the additional constraint that it be relatively easy to read (for a native English reader).
Another way to say this is that the generated string should consist of "recognizable syllables." For example, "akdjfwv" is a random string, but it's not recognizable at all. "Flamyom," however, is very "recognizable" (even though it's nonsense).
Obviously, one could make a long list of "recognizable syllables," and then randomly select them.
But, is there a better way to do something like programmatically generate a "recognizable syllable," or generate a "syllable" and then test it to see if it's "recognizable"?
I can think of several ways to go about this implementation, but if someone has already implemented it (preferably in Java or C#), I'd rather re-use their work.
Any ideas?

You could try implementing a Markov chain and give it a suitable passage to process. There is a Java implementation that may work for you.
This is a sample from interpolating between Genesis in English and Genesis in Spanish (N = 1):
In bersaran thelely and avin inder tht teathe m lovig weay waw thod mofin he t thte h fupiteg s o t llissed od ma. lllar t land fingujod maid af de wand tetodamoiz fosu Andesp. ersunen thenas lowhejod whipanirede tifinas Gofuavithila d gió Y Diche fua Dios co l, liens ly Y crerdíquen ticuesereregos hielase agúnd veumarbas iarasens laragún co eruerá laciéluelamagúneren Dien a He.
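For illustration, a character-level order-1 chain like the one that produced the sample above can be sketched as follows. This is a minimal sketch, not the linked implementation; the corpus and seed are arbitrary choices:

```java
import java.util.*;

// Minimal character-level Markov chain of order 1 (N = 1, as in the sample).
public class MarkovSketch {
    public static String generate(String corpus, int length, long seed) {
        // Map each character to the list of characters that follow it in the corpus.
        Map<Character, List<Character>> followers = new HashMap<>();
        for (int i = 0; i + 1 < corpus.length(); i++) {
            followers.computeIfAbsent(corpus.charAt(i), k -> new ArrayList<>())
                     .add(corpus.charAt(i + 1));
        }
        Random rnd = new Random(seed);
        StringBuilder out = new StringBuilder();
        char current = corpus.charAt(rnd.nextInt(corpus.length() - 1));
        for (int i = 0; i < length; i++) {
            out.append(current);
            List<Character> next = followers.get(current);
            if (next == null) break; // dead end: character only seen at corpus end
            current = next.get(rnd.nextInt(next.size()));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String corpus = "in the beginning god created the heaven and the earth";
        System.out.println(generate(corpus, 40, 42));
    }
}
```

A larger corpus and a higher order (keys of N characters instead of one) produce output that looks progressively more like the source language.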

I think this should do what you want:
Java Password Generator
It has the source code and a permissive license so you can adapt the source code to what you are looking for.

You need to generate random syllables. The simplest way to do it is to use syllables that are consonant-vowel, or consonant-vowel-consonant. From a list of consonants and vowels, pick randomly to build syllables, then join the syllables together to make a string.
Keep in mind your list of consonants shouldn't be letters that are consonants, but phonemes, so "th", "st", "sl", etc, could be entries in the consonant list.
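A minimal sketch of that recipe, with phoneme entries rather than single letters (the particular lists below are just illustrative choices):

```java
import java.util.Random;

// Sketch: build consonant-vowel (CV) or consonant-vowel-consonant (CVC)
// syllables from phoneme lists, then join them into a word.
public class SyllableSketch {
    private static final String[] CONSONANTS =
        {"b", "d", "f", "g", "k", "l", "m", "n", "p", "r", "s", "t", "th", "st", "sl", "fl"};
    private static final String[] VOWELS = {"a", "e", "i", "o", "u", "oo", "ea"};
    private static final Random RND = new Random();

    static String syllable() {
        String s = CONSONANTS[RND.nextInt(CONSONANTS.length)]
                 + VOWELS[RND.nextInt(VOWELS.length)];
        if (RND.nextBoolean()) { // half the time, close the syllable with a consonant
            s += CONSONANTS[RND.nextInt(CONSONANTS.length)];
        }
        return s;
    }

    public static String word(int syllables) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < syllables; i++) sb.append(syllable());
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(word(2)); // nonsense in the style of "flamyom"
    }
}
```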

You really should check out SCIgen. It generates entire semi-nonsense scientific papers: http://pdos.csail.mit.edu/scigen/
And the source is available: it's released under GPL, and is currently available via anonymous CVS.

I'm not sure exactly what you need this for, but graphic-layout folks in the print industry have long used Lorem Ipsum generators to create text that looks enough like real text that your brain processes it as such, without actually being readable words. More info here
I don't know if there's a web service to which you could subscribe, but there are several sites which will just generate Lorem Ipsum strings for you, so you may be able to use those.

There is a good section on this in Programming Pearls. It's online, but I'd highly recommend buying the book; it's one of the best programming books around, in my opinion.

Lots of Lorem Ipsum generators out there.

It all gets back to why you want this. If you just want "pronounceable gibberish," I'd think the easiest thing to do would be to generate alternating consonants and vowels. That would be a tiny subset of all pronounceable gibberish, but what's the goal? To give a little broader range, you could create a table of consonant phonemes and vowel phonemes, with the consonant list including not just individual letters like "b" and "d" but also "th", "br", and so on; the vowel list could include "oo", "ea", etc. One more step would be to generate syllables instead of letters, with each syllable being either vowel, consonant-vowel, or consonant-vowel-consonant. That is, loop through creating syllables, and within each syllable pick one of the three patterns. You probably want to forbid two vowel-only syllables in a row. (I'm trying to think of an example of that in English. It probably happens, but the only examples I can think of are borrowed from other languages, like "stoa".)

I created a Java package Pronounceable String Generator for generating pronounceable random strings quickly.
Just create a PronounceableStringGenerator object and invoke the generate method:
PronounceableStringGenerator mg = new PronounceableStringGenerator();
System.out.println(mg.generate(8)); // 8 is the length of the generated string
System.out.println(mg.generate(10));
System.out.println(mg.generate(6));

Related

How do I parse street addresses so they all look uniformly the same?

Okay, so I am writing a C# script which is close to finished. I just need to check to make sure the street names are all consistent with regard to abbreviating the end of an address.
Example:
1234 Apple Street
4902 Kennington Road
4234 house drew Boulevard
etc.
The street-type suffixes (Street, Road, Boulevard) are the values I want to abbreviate to:
1234 Apple ST
4902 Kennington RD
4234 house drew BLVD
Is there some method or function in C# that can do this automatically, or some parsing function that can do it? Please let me know! It would also be helpful to know if SQL has something like this, if it exists.
Well, you know, that is like asking for "a function for making a 3D game, please, and not too complex, but with nice assets." If you get out of your little street and travel around a little, you will find that normalizing addresses, EVEN IN ONE COUNTRY, is extremely complex.
What you can easily do is not do it yourself: hand it over. Submit your address to an API (Google Geocoding, Bing, etc.) and then take the address parts they return.
Otherwise: the last time I did this, it took half a year with 3 people to get all the freaking special cases out.
As TomTom was saying, this is a much more complex problem than it appears to be, and there are a ridiculous number of edge cases.
I also recommend submitting your addresses to an API for standardization, but just so you know, most free ones (like Google's) only allow incidental, non-commercial lookups. This means if you have a whole bunch of addresses you need standardized, you would violate their ToS.
SmartyStreets provides a street address validation and standardization service that you can try out for free. If you are happy with it and need more lookups, you can always buy more. Disclosure: I am a software developer at SmartyStreets.
var newString = "1234 Apple Street".Replace("Street", "ST");
or you can use a regex pattern to modify your strings (this requires using System.Text.RegularExpressions;):
var newString = Regex.Replace("1234 Apple Street", "(?i)street", "ST");
"A function that abbreviates street names"
That sounds a little too specific to be packed into the .NET Framework, doesn't it?
You can create it; it's simple:
Grab the string.
Check for any of the options you want (Boulevard, Road, Street) with String.Contains, or use the LINQ method Enumerable.Any.
Treat it accordingly.
But as TomTom pointed out, you will face culture problems.
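The table-driven steps above can be sketched as follows. This is in Java, but it translates almost line-for-line to C#; the three suffix mappings are just the ones from the question, not a complete table:

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: abbreviate a known street-type suffix only when it is the
// final word of the address, using a lookup table.
public class SuffixAbbreviator {
    private static final Map<String, String> SUFFIXES = Map.of(
        "street", "ST",
        "road", "RD",
        "boulevard", "BLVD"
    );

    public static String abbreviate(String address) {
        for (Map.Entry<String, String> e : SUFFIXES.entrySet()) {
            // (?i) = case-insensitive; \b...$ = whole word at the very end only
            Pattern p = Pattern.compile("(?i)\\b" + e.getKey() + "$");
            Matcher m = p.matcher(address.trim());
            if (m.find()) return m.replaceFirst(e.getValue());
        }
        return address; // unknown suffix: leave the address untouched
    }

    public static void main(String[] args) {
        System.out.println(abbreviate("1234 Apple Street"));         // 1234 Apple ST
        System.out.println(abbreviate("4234 house drew Boulevard")); // 4234 house drew BLVD
    }
}
```

Anchoring the match to the end of the string avoids mangling street names that happen to contain a suffix word, like "Street Road".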

String likeness algorithms

I have two strings (they're going to be descriptions in a simple database eventually), let's say they're
String A: "Apple orange coconut lime jimmy buffet"
String B: "Car
bicycle skateboard"
What I'm looking for is this: I want a function that will take the input "cocnut" and have the output be "String A".
We could have differences in capitalization, and the spelling won't always be spot-on. The goal is a 'quick and dirty' search, if you will.
Are there any .NET (or third-party) 'likeness algorithms' for strings you could recommend, so I could check that the input is a 'pretty close fragment' of an entry and return it? My database is going to have like 50 entries, tops.
What you’re searching for is known as the edit distance between two strings. There exist plenty of implementations – here’s one from Stack Overflow itself.
Since you’re searching for only part of a string what you want is a locally optimal match rather than a global match as computed by this method.
This is known as the local alignment problem and once again it’s easily solvable by an almost identical algorithm – the only thing that changes is the initialisation (we don’t penalise whatever comes before the search string) and the selection of the optimum value (we don’t penalise whatever comes after the search string).
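A sketch of that local variant: it is the standard Levenshtein dynamic program, except that row 0 is all zeros (a match may start anywhere in the text) and the answer is the minimum of the last row (it may end anywhere). Names and the lowercase normalization are my own choices:

```java
// Sketch: "substring edit distance" -- how few edits turn the pattern
// into SOME substring of the text.
public class FuzzyContains {
    public static int substringDistance(String pattern, String text) {
        String p = pattern.toLowerCase(), t = text.toLowerCase();
        int[] prev = new int[t.length() + 1]; // row 0: free start, all zeros
        for (int i = 1; i <= p.length(); i++) {
            int[] cur = new int[t.length() + 1];
            cur[0] = i; // matching at text position 0 costs i deletions
            for (int j = 1; j <= t.length(); j++) {
                int cost = p.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1),
                                  prev[j - 1] + cost);
            }
            prev = cur;
        }
        int best = Integer.MAX_VALUE; // free end: best match anywhere in the text
        for (int d : prev) best = Math.min(best, d);
        return best;
    }

    public static void main(String[] args) {
        String a = "Apple orange coconut lime jimmy buffet";
        // "cocnut" is one edit away from "coconut" inside string A
        System.out.println(substringDistance("cocnut", a)); // 1
    }
}
```

For the 50-entry use case, computing this distance against every entry and returning the entry with the lowest score is entirely fast enough.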

Is it possible to guide a Markov chain toward certain keywords?

I'm writing a chat bot for a software engineering course in C#.
I'm using Markov chains to generate text, using Wikipedia articles as the corpus. I want it to respond to user input in an (at least slightly) intelligent way, based on their input, but I'm not sure how to do it.
My current thinking is that I'd try to extract keywords from the user's input, then use those to guide the sentence generation. But because of the Markov property, the keywords would have to be the first words in the sentence, which might look silly. As well, for an n-th order chain, I'd have to extract exactly n keywords from the user every time.
The data for the generator is a dictionary, where the keys are lists of words, and the values are lists of words combined with a weight depending on how often the word appears after the words in the key. So like:
{[word1, word2, ..., wordn]: [(word, weight), (word, weight), ...]}
It works in a command-line test program, but I'm just providing an n word seed for each bit of text it generates.
I'm hoping there's some way I can make the chain prefer words which are nearby words that the user used, rather than seeding it with the first/last n words in the input, or n keywords, or whatever. Is there a way to do that?
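One way to realize "prefer words the user used" is to bias the weighted selection itself: when sampling the next word, multiply the weight of any candidate that appears in the keyword set. This is a sketch of that idea only, not the asker's code; the boost factor of 5 and the flat map shapes are assumptions:

```java
import java.util.*;

// Sketch: keyword-biased next-word sampling for a weighted Markov chain.
public class BiasedMarkov {
    public static String pickNext(Map<String, Integer> candidates,
                                  Set<String> keywords, Random rnd) {
        double total = 0;
        Map<String, Double> boosted = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : candidates.entrySet()) {
            // Boost candidates the user mentioned (factor 5 is arbitrary).
            double w = e.getValue() * (keywords.contains(e.getKey()) ? 5.0 : 1.0);
            boosted.put(e.getKey(), w);
            total += w;
        }
        double r = rnd.nextDouble() * total; // weighted roulette-wheel selection
        for (Map.Entry<String, Double> e : boosted.entrySet()) {
            r -= e.getValue();
            if (r <= 0) return e.getKey();
        }
        return null; // empty candidate map
    }

    public static void main(String[] args) {
        Map<String, Integer> next = Map.of("cat", 1, "probability", 1, "the", 3);
        // With "probability" as a keyword, its effective weight rises from 1 to 5.
        System.out.println(pickNext(next, Set.of("probability"), new Random(7)));
    }
}
```

Because the bias acts at every step rather than only on the seed, keywords can surface anywhere in the sentence instead of being forced to the start.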
One way to make your chat smarter is to identify the topic from the user's input. Assume you have your Markov brain conditioned on different topics as well. Then to construct your answer, you refer to the dictionary below:
{([word1, word2, ..., wordn], topic): [(word, weight), (word, weight), ...]}
To find the topics, you can start with WikipediaMiner. For instance, below are the topics and their corresponding weights found by the wikify API for the sentence:
Statistics is so hard. Do you have some good tutorial of probability theory for a beginner?
[{'id': 23542, 'title': 'Probability theory', 'weight': 0.9257584778725553},
{'id': 30746, 'title': 'Theory', 'weight': 0.7408577501980528},
{'id': 22934, 'title': 'Probability', 'weight': 0.7089442931022307},
{'id': 26685, 'title': 'Statistics', 'weight': 0.7024251356953044}]
Those identified keywords are probably also good candidates to be treated as seeds.
However, question answering is not so simple. This Markov-based sentence generation has no ability to understand the question at all. The best it can do is provide related content. Just my 2 cents.

To find out the number of occurrences of words in a file

I came across this question in an interview:
We have to find out the number of occurrences of two given words in a text file with <= n words between them.
Example1:
text:`this is first string this is second string`
Keywords:`this, string`
n= 4
output= 2
"this is first string" is the first occurrence and number of words between this and string is 2(is, first) which is less than 4.
this is second string is the remaining string. number of words between *this and string * is 2 (is, second) which is less than 4.
Therefore the answer is 2.
I have thought that I will use
Dictionary<string, List<int>>.
My idea was to use the dictionary to get the list of positions where each particular word occurs, then iterate through both lists, increment the count when the condition is met, and display the count.
Is my thinking process correct? Please provide any suggestions to improve my solution.
Thanks,
Not an answer per se (as, quite honestly, I don't understand the question :P), but to add some general interview advice to the other answers:
In interviews, the interviewer is looking for your thought process and evidence that you are a critical, logical thinker, not necessarily that you have excellent coding recall and can compile code in your brain.
In addition, interviews are a stressful process. By slowing down and talking out loud as you work things out, you not only come across as a better communicator and logical thinker (even if you get the question wrong), you also give yourself time to think.
Use a pen and paper, speak as you think, start from the top, and work through it. I've gotten jobs even when I didn't know the answers to tech questions, by demonstrating that I can at least try to work things out ;-)
In short, it's not just down to technical prowess.
I think it depends on whether the call is done only once or multiple times per string. If it's something like
int getOccurences(String str, String reference, int min_size) { ... }
then you don't really need the dictionary, not even a list. You can just iterate through the string to find occurrences of the words and then check the number of separators between them.
If, on the other hand, the problem is about arbitrary search/indexing, IMHO you do need a dictionary. I'd go for a dictionary where the key is the word and the value is a list of the indexes where it occurs.
HTH
If you need to do that repeatedly for different pairs of words in the same text, then a word dictionary with a list of indexes is a good solution. However, if you were only looking for one pair, then two lists of indexes for those two words would be sufficient.
The lists allow you to separate the word detection operation from the counting logic.
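The index-list idea above can be sketched as follows. Note this assumes one particular reading of the problem, matching the worked example: each occurrence of the first word pairs with the next unconsumed occurrence of the second word after it:

```java
import java.util.*;

// Sketch: build word -> sorted position lists, then walk both lists,
// counting pairs whose gap (words strictly between them) is <= n.
public class PairCounter {
    public static int count(String text, String w1, String w2, int n) {
        Map<String, List<Integer>> positions = new HashMap<>();
        String[] words = text.split("\\s+");
        for (int i = 0; i < words.length; i++) {
            positions.computeIfAbsent(words[i], k -> new ArrayList<>()).add(i);
        }
        List<Integer> a = positions.getOrDefault(w1, List.of());
        List<Integer> b = positions.getOrDefault(w2, List.of());
        int count = 0, j = 0;
        for (int i : a) {                                   // positions are sorted
            while (j < b.size() && b.get(j) <= i) j++;      // next w2 after this w1
            if (j < b.size() && b.get(j) - i - 1 <= n) {
                count++;
                j++;                                        // consume this w2
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(count("this is first string this is second string",
                                 "this", "string", 4)); // 2
    }
}
```

Building the position lists once lets you answer repeated queries for different word pairs over the same text without rescanning it.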

Telling the difference between two large pieces of text

What would be the best way to compare big paragraphs of text in order to tell the differences apart? For example, if string A and string B are the same except for a few missing words, how would I highlight these?
Originally I thought of breaking it down into word arrays, and comparing the elements. However this breaks down when a word is deleted or inserted.
Use a diff algorithm.
I saw this a few months back when I was working on a small project, but it might set you on the right track.
http://www.codeproject.com/KB/recipes/DiffAlgorithmCS.aspx
You want to look into Longest Common Subsequence algorithms. Most languages have a library which will do the dirty work for you, and here is one for C#. Searching for "C# diff" or "VB.Net diff" will help you find additional libraries that suit your needs.
Usually text difference is measured in terms of edit distance, which is essentially the number of character additions, deletions or changes necessary to transform one text into the other.
A common implementation of this algorithm uses dynamic programming.
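That dynamic-programming formulation, sketched in Java (the classic Levenshtein recurrence, not any particular library's implementation):

```java
// Sketch: Levenshtein edit distance via dynamic programming.
// dp[i][j] = edits needed to turn the first i chars of a into the first j of b.
public class EditDistance {
    public static int levenshtein(String a, String b) {
        int[][] dp = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) dp[i][0] = i; // delete all of a
        for (int j = 0; j <= b.length(); j++) dp[0][j] = j; // insert all of b
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                dp[i][j] = Math.min(Math.min(dp[i - 1][j] + 1,   // deletion
                                             dp[i][j - 1] + 1),  // insertion
                                    dp[i - 1][j - 1] + cost);    // substitution
            }
        }
        return dp[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("kitten", "sitting")); // 3
    }
}
```

For highlighting word-level differences in paragraphs, run the same recurrence over arrays of words instead of characters, then trace back through the dp table to recover which words were inserted or deleted.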
Here is an implementation of a Merge Engine that compares two HTML files and shows the highlighted differences: http://www.codeproject.com/KB/string/htmltextcompare.aspx
If it's a one-shot deal, save them both in MS Word and use the document compare function.
