Find which phrases have been used multiple times in a string - C#

It's easy to count occurrences of words in a file by using a Dictionary to identify which words are used the most frequently, but given a text file, how can I find commonly used phrases where a "phrase" is a set of two or more consecutive words?
For example, here is some sample text:
Except oral wills, every will shall be in writing, but may be
handwritten or typewritten. The will shall contain the testator's signature
or by some other person in the testator's conscious presence
and at the testator's express direction . The will shall be attested
and subscribed in the conscious presence of the testator, by two or
more competent witnesses, who saw the testator subscribe, or heard the
testator acknowledge the testator's signature.
For purposes of this section, conscious presence means within the
range of any of the testator's senses, excluding the sense of sight or
sound that is sensed by telephonic, electronic, or other distant
communication.
How can I identify the phrases "conscious presence" (3 occurrences) and "testator's signature" (2 occurrences) as having appeared more than once, apart from brute-force searching for every set of two or three words?
I'll be writing this in C#, so C# code would be great, but I can't even identify a good algorithm, so I'll settle for any code at all, or even pseudocode for how to solve this.

Thought I'd have a quick go at this - not sure if this isn't the brute force method you were trying to avoid - but:
static void Main(string[] args)
{
    string txt = @"Except oral wills, every will shall be in writing,
but may be handwritten or typewritten. The will shall contain the testator's
signature or by some other person in the testator's conscious presence and at the
testator's express direction . The will shall be attested and subscribed in the
conscious presence of the testator, by two or more competent witnesses, who saw the
testator subscribe, or heard the testator acknowledge the testator's signature.
For purposes of this section, conscious presence means within the range of any of the
testator's senses, excluding the sense of sight or sound that is sensed by telephonic,
electronic, or other distant communication.";

    // split the string using common separators - could add more or use a regex
    string[] words = txt.Split(',', '.', ';', ' ', '\n', '\r');
    // trim each string and get rid of any empty ones
    words = words.Select(t => t.Trim()).Where(t => t.Trim() != string.Empty).ToArray();

    const int MaxPhraseLength = 20;
    Dictionary<string, int> Counts = new Dictionary<string, int>();
    for (int phraseLen = MaxPhraseLength; phraseLen >= 2; phraseLen--)
    {
        // stop early enough that a full phrase of phraseLen words still fits
        for (int i = 0; i <= words.Length - phraseLen; i++)
        {
            // get the phrase to match based on phraseLen
            string[] phrase = GetPhrase(words, i, phraseLen);
            string sphrase = string.Join(" ", phrase);
            Console.WriteLine("Phrase : {0}", sphrase);
            int index = FindPhraseIndex(words, i + phrase.Length, phrase);
            if (index > -1)
            {
                Console.WriteLine("Phrase : {0} found at {1}", sphrase, index);
                if (!Counts.ContainsKey(sphrase))
                    Counts.Add(sphrase, 1); // count the occurrence we are standing on
                Counts[sphrase]++;          // plus the one we just found
            }
        }
    }
    foreach (var foo in Counts)
    {
        Console.WriteLine("[{0}] - {1}", foo.Key, foo.Value);
    }
    Console.ReadKey();
}
static string[] GetPhrase(string[] words, int startpos, int len)
{
    return words.Skip(startpos).Take(len).ToArray();
}

static int FindPhraseIndex(string[] words, int startIndex, string[] matchWords)
{
    for (int i = startIndex; i < words.Length; i++)
    {
        int j;
        for (j = 0; j < matchWords.Length && (i + j) < words.Length; j++)
            if (matchWords[j] != words[i + j])
                break;
        if (j == matchWords.Length)
            return i; // return where the match was found, not where the search started
    }
    return -1;
}

Try this out. It's in no way fool-proof, but should get the job done for now.
Yes, this only matches 2-word combos, does not strip punctuation, and is brute-force. No, the ToList is not necessary.
string text = "that big long text block";
var splitBySpace = text.Split(' ');
var doubleWords = splitBySpace
    .Select((x, i) => new { Value = x, Index = i })
    .Where(x => x.Index != splitBySpace.Length - 1)
    .Select(x => x.Value + " " + splitBySpace.ElementAt(x.Index + 1))
    .ToList();
var duplicates = doubleWords
    .GroupBy(x => x)
    .Where(x => x.Count() > 1)
    .Select(x => new { x.Key, Count = x.Count() })
    .ToList();
Here is my attempt at getting more than 2 word combos. Again, same warning as previous.
List<string> multiWords = new List<string>();
// i is the number of words to combine; in this case, 2-6 words
for (int i = 2; i <= 6; i++)
{
    multiWords.AddRange(splitBySpace
        .Select((x, index) => new { Value = x, Index = index })
        .Where(x => x.Index <= splitBySpace.Length - i) // only start where a full i-word phrase fits
        .Select(x => CombineItems(splitBySpace, x.Index, x.Index + i - 1)));
}
var duplicates = multiWords
    .GroupBy(x => x)
    .Where(x => x.Count() > 1)
    .Select(x => new { x.Key, Count = x.Count() })
    .ToList();

private string CombineItems(IEnumerable<string> source, int startIndex, int endIndex)
{
    return string.Join(" ", source.Where((x, i) => i >= startIndex && i <= endIndex).ToArray());
}
Now I just want to say there is a high chance of an off-by-one error with my code. I did not fully test it, so make sure you test it before you use it.

If I were doing it, I would probably start with the brute force approach, but it sounds like you want to avoid that. A two-phase approach could work: do a count of each word, take the top few results (only the words that appear the most times), then search for and count only the phrases that include those popular words. That way you won't spend your time searching over all phrases.
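Here is a minimal sketch of that two-phase idea (the word splitting, the topWords cutoff, and all names are illustrative assumptions rather than a definitive implementation):

static Dictionary<string, int> CountPopularPhrases(string[] words, int topWords = 10, int phraseLen = 2)
{
    // Phase 1: count single words and keep the most frequent ones.
    var popular = new HashSet<string>(words
        .GroupBy(w => w)
        .OrderByDescending(g => g.Count())
        .Take(topWords) // hypothetical cutoff; tune as needed
        .Select(g => g.Key));

    // Phase 2: count only phrases that contain at least one popular word.
    var counts = new Dictionary<string, int>();
    for (int i = 0; i + phraseLen <= words.Length; i++)
    {
        var phrase = words.Skip(i).Take(phraseLen).ToArray();
        if (!phrase.Any(popular.Contains))
            continue; // phrase contains no popular word, so skip it
        string key = string.Join(" ", phrase);
        counts[key] = counts.TryGetValue(key, out var n) ? n + 1 : 1;
    }
    return counts;
}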
I have this feeling that CS folks will correct me saying that this would actually take more time than a straight up brute force. And maybe some linguists will pitch in with some methods for detecting phrases or something.
Good luck!

Related

Find first difference in strings case insensitive given culture

There are many ways to compare two strings to find the first index where they differ, but if I require case-insensitivity in any given culture, then the options go away.
This is the only way I can think to do such a comparison:
static int FirstDiff(string str1, string str2)
{
    for (int i = 1; i <= str1.Length && i <= str2.Length; i++)
        if (!string.Equals(str1.Substring(0, i), str2.Substring(0, i), StringComparison.CurrentCultureIgnoreCase))
            return i - 1;
    return -1; // strings are identical
}
Can anyone think of a better way that doesn't involve so much string allocation?
For testing purposes:
// Turkish word 'open' contains the letter 'ı' which is the lowercase of 'I' in Turkish, but not English
string lowerCase = "açık";
string upperCase = "AÇIK";
Thread.CurrentThread.CurrentCulture = new CultureInfo("en-US");
FirstDiff(lowerCase, upperCase); // Should return 2
Thread.CurrentThread.CurrentCulture = new CultureInfo("tr-TR");
FirstDiff(lowerCase, upperCase); // Should return -1
Edit: Checking both ToUpper and ToLower for each character appears to work for any example that I can come up with. But will it work for all cultures? Perhaps this is a question better directed at linguists.
One way to reduce the number of string allocations is to reduce the number of times you do a comparison. We can borrow from the binary search algorithm for searching arrays in this case, and start out by comparing a substring that is half the length of the string. Then we continue to add or remove half of the remaining indexes (depending on whether or not the strings were equal), until we get to the first instance of inequality.
In general this should speed up the search time:
public static int FirstDiffBinarySearch(string str1, string str2)
{
    // "Fail fast" checks
    if (string.Equals(str1, str2, StringComparison.CurrentCultureIgnoreCase))
        return -1;
    if (str1 == null || str2 == null) return 0;
    int min = 0;
    int max = Math.Min(str1.Length, str2.Length);
    int mid = (min + max) / 2;
    while (min <= max)
    {
        if (string.Equals(str1.Substring(0, mid), str2.Substring(0, mid),
            StringComparison.CurrentCultureIgnoreCase))
        {
            min = mid + 1;
        }
        else
        {
            max = mid - 1;
        }
        mid = (min + max) / 2;
    }
    return mid;
}
You could compare characters instead of strings. This is far from optimized, and rather quick and dirty, but something like this appears to work:
for (int i = 0; i < str1.Length && i < str2.Length; i++)
    if (char.ToLower(str1[i]) != char.ToLower(str2[i]))
        return i;
This should work with culture as well, according to the docs: https://learn.microsoft.com/en-us/dotnet/api/system.char.tolower?view=netframework-4.8
Casing rules are obtained from the current culture.
To convert a character to lowercase by using the casing conventions of the current culture, call the ToLower(Char, CultureInfo) method overload with a value of CurrentCulture for its culture parameter.
You need to check both ToLower and ToUpper.
private static int FirstDiff(string str1, string str2)
{
    int length = Math.Min(str1.Length, str2.Length);
    TextInfo textInfo = CultureInfo.CurrentCulture.TextInfo;
    for (int i = 0; i < length; ++i)
    {
        if (textInfo.ToUpper(str1[i]) != textInfo.ToUpper(str2[i]) ||
            textInfo.ToLower(str1[i]) != textInfo.ToLower(str2[i]))
        {
            return i;
        }
    }
    return str1.Length == str2.Length ? -1 : length;
}
I am reminded of one additional oddity of characters (or rather, Unicode code points): some act as surrogate pairs and have no relevance to any culture unless the pair appears next to one another. For more information about Unicode interpretation standards, see the document that @orhtej2 linked in his comment.
While trying out different solutions I stumbled upon this particular class, and I think it will best suit my needs: System.Globalization.StringInfo (The MS Doc Example shows it in action with surrogate pairs)
The class breaks the string down into sections by pieces that need each other to make sense (rather than by strictly character). I can then compare each piece by culture using string.Equals and return the index of the first piece that differs:
static int FirstDiff(string str1, string str2)
{
    var si1 = StringInfo.GetTextElementEnumerator(str1);
    var si2 = StringInfo.GetTextElementEnumerator(str2);
    bool more1, more2;
    while ((more1 = si1.MoveNext()) & (more2 = si2.MoveNext())) // single & to avoid short-circuiting the right counterpart
        if (!string.Equals(si1.Current as string, si2.Current as string, StringComparison.CurrentCultureIgnoreCase))
            return si1.ElementIndex;
    if (more1 || more2)
        return si1.ElementIndex;
    else
        return -1; // strings are equivalent
}
Here's a little different approach. Strings are technically arrays of char, so I'm using that along with LINQ.
var list1 = "Hellow".ToLower().ToList();
var list2 = "HeLio".ToLower().ToList();
var diffIndex = list1.Zip(list2, (item1, item2) => item1 == item2)
.Select((match, index) => new { Match = match, Index = index })
.Where(a => !a.Match)
.Select(a => a.Index).FirstOrDefault();
If they match, diffIndex will be zero. Otherwise it will be the index of the first mismatching character.
Edit:
A slightly improved version, converting to lower case on the fly. The initial ToList() was redundant anyway.
var diffIndex = list1.Zip(list2, (item1, item2) => char.ToLower(item1) == char.ToLower(item2))
    .Select((match, index) => new { Match = match, Index = index })
    .Where(a => !a.Match)
    .Select(a => a.Index).FirstOrDefault();
Edit 2:
Here's a working version, shortened a bit further. This is the best of the three, since with the previous two you'd get 0 when the strings match. Here, if the strings match you get null, and the index of the first mismatch otherwise.
var list1 = "Hellow";
var list2 = "HeLio";
var diffIndex = list1.Zip(list2, (item1, item2) => char.ToLower(item1) == char.ToLower(item2))
.Select((match, index) => new { Match = match, Index = index })
.FirstOrDefault(x => !x.Match)?.Index;

How to treat integers from a string as multi-digit numbers and not individual digits?

My input is a string of integers, which I have to check for evenness and display on the console if they are even. The problem is that what I wrote checks the individual digits rather than the whole numbers.
string even = "";
while (true)
{
string inputData = Console.ReadLine();
if (inputData.Equals("x", StringComparison.OrdinalIgnoreCase))
{
break;
}
for (int i = 0; i < inputData.Length; i++)
{
if (inputData[i] % 2 == 0)
{
even +=inputData[i];
}
}
}
foreach (var e in even)
Console.WriteLine(e);
bool something = string.IsNullOrEmpty(even);
if( something == true)
{
Console.WriteLine("N/A");
}
For example, if the input is:
12
34
56
my output is going to be
2
4
6 (every number needs to be displayed on a new line).
What am I doing wrong? Any help is appreciated.
Use string.Split to get the independent sections and then int.TryParse to check if it is a number (check Parse v. TryParse). Then take only even numbers:
var evenNumbers = new List<int>();
foreach (var s in inputData.Split(" "))
{
    if (int.TryParse(s, out var num) && num % 2 == 0)
        evenNumbers.Add(num); // if you can't use collections: Console.WriteLine(num);
}
(notice the use of out vars introduced in C# 7.0)
If you can use linq then similar to this answer:
var evenNumbers = inputData.Split(" ")
    .Select(s => (int.TryParse(s, out var value), value))
    .Where(pair => pair.Item1)
    .Select(pair => pair.value);
I think you are doing too many things at once here. Instead of immediately checking whether the number is even, it is better to solve one problem at a time.
First we make substrings by splitting the string into "words". Next we convert every substring to an int, and finally we filter on even numbers, like:
var words = inputData.Split(' ');                // split the words by a space
var intwords = words.Select(int.Parse);          // convert these to ints
var evenwords = intwords.Where(x => x % 2 == 0); // check if these are even
foreach (var even in evenwords)                  // print the even numbers
{
    Console.WriteLine(even);
}
Here it can still happen that some "words" are not integers, for example "12 foo 34", so you will need to implement some extra filtering between splitting and converting.
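A hedged sketch of that extra filtering step, using int.TryParse to drop non-numeric tokens before converting (parsing twice keeps the example short; the TryParse tuple trick from the answer above avoids the double parse):

var evenNumbers = inputData.Split(' ')
    .Where(w => int.TryParse(w, out _)) // keep only tokens that parse as ints
    .Select(int.Parse)                  // safe now: everything left is numeric
    .Where(n => n % 2 == 0);

foreach (var n in evenNumbers)
    Console.WriteLine(n);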

Searching for words in richTextBox (Windows Forms, C#)

I need help with counting identical words in a richTextBox, and I also want to assign numbers/words to them. There are 500,000+ words in the box, and each of them appears only 3 or 6 times.
One solution is to write down the words I'm looking for in advance, search the box, and assign a number/word to each. But there are too many to do that; if there is a quicker way, it will help me a lot.
Thanks in advance!
Here is the code of what I'm doing right now; it doesn't represent exactly what I'm asking.
for (int i = 0; i < Url.Length; i++)
{
    doc = web.Load(Url[i]);
    int x = 0;
    int y = 0;
    // here was a long list of whiles for the *fors* down there,
    // removed to be cleaner for you to see
    // one example:
    //while (i == 0)
    //{
    //    x = 18;
    //    y = 19;
    //    break;
    //}
    for (int w = 0; w < x; w++)
    {
        string metascore = doc.DocumentNode.SelectNodes("//*[@class=\"trow8\"]")[w].InnerText;
        richTextBox1.AppendText("-----------");
        richTextBox1.AppendText(metascore);
        richTextBox1.AppendText("\r\n");
    }
    for (int z = 0; z < y; z++)
    {
        string metascore1 = doc.DocumentNode.SelectNodes("//*[@class=\"trow2\"]")[z].InnerText;
        richTextBox1.AppendText("-----------");
        richTextBox1.AppendText(metascore1);
        richTextBox1.AppendText("\r\n");
    }
}
Let's assume you have your text, read from a file, downloaded from some webpage, or RTB... Using LINQ will give you all you need.
string textToCount = "...";
// second step: split text by delimiters of your own
List<string> words = textToCount.Split(' ', ',', '!', '?', '.', ';', '\r', '\n') // split by whatever characters
    .Where(s => !string.IsNullOrWhiteSpace(s)) // eliminate all whitespace entries
    .Select(w => w.ToLower()) // transform to lower case for easier comparison/counting
    .ToList();
// use the LINQ GroupBy helper to group words and project them into a dictionary
// with the word as the key and the total number of appearances as the value.
// This is the actual magic. With this in place, it's easy to get statistics.
var groupedWords = words.GroupBy(w => w)
    .ToDictionary(k => k.Key, v => v.Count());
// to get various statistics:
// total different words - count your dictionary entries
groupedWords.Count();
// the most frequent word - the word whose count is the max among the values
groupedWords.First(kvp => kvp.Value == groupedWords.Values.Max()).Key
// mentioned this many times
groupedWords.Values.Max()
// similar for min, just replace the call to Max() with a call to Min()
// print out your dictionary to see how often every word is mentioned
foreach (var wordStats in groupedWords)
{
    Console.WriteLine($"{wordStats.Key} - {wordStats.Value}");
}
This solution is similar to the previous post. The main difference is that it groups by word and puts the results into a dictionary, which keeps things simple. Once that's in place, a lot can be found easily.
Assuming you have a RichTextBox running around somewhere, we will call him Bob:
//Grab all the text from the RTF
TextRange BobRange = new TextRange(
    // TextPointer to the start of content in the RichTextBox.
    Bob.Document.ContentStart,
    // TextPointer to the end of content in the RichTextBox.
    Bob.Document.ContentEnd
);
//Assume words are separated by spaces, commas or periods; split on those and throw the words in a List
List<string> BobsWords = BobRange.Text.Split(new char[] { ' ', ',', '.' }, StringSplitOptions.RemoveEmptyEntries).ToList<string>();
//Now use LINQ to grab what you want
int CountOfTheWordThe = BobsWords.Where(w => w == "The").Count();
//Now that we have the list of words we can create a MetaDictionary
Dictionary<string, int> BobsWordsAndCounts = new Dictionary<string, int>();
foreach (string word in BobsWords)
{
    if (!BobsWordsAndCounts.Keys.Contains(word))
        BobsWordsAndCounts.Add(word, BobsWords.Where(w => w == word).Count());
}
Is that what you meant by assign a number to a word? You want to know the count of each word in the RichTextBox?
Perhaps add a bit of context?

Shuffling a string so that no two adjacent letters are the same

I've been trying to solve this interview problem which asks to shuffle a string so that no two adjacent letters are identical
For example,
ABCC -> ACBC
The approach I'm thinking of is to
1) Iterate over the input string and store the (letter, frequency) pairs in some collection
2) Now build a result string by pulling the highest frequency (that is > 0) letter that we didn't just pull
3) Update (decrement) the frequency whenever we pull a letter
4) Return the result string if all letters have zero frequency
5) Return an error if we're left with only one letter with frequency greater than 1
With this approach we can save the more precious (less frequent) letters for last. But for this to work, we need a collection that lets us efficiently query a key and at the same time efficiently sort it by values. Something like this would work except we need to keep the collection sorted after every letter retrieval.
I'm assuming Unicode characters.
Any ideas on what collection to use? Or an alternative approach?
You can sort the letters by frequency, split the sorted list in half, and construct the output by taking letters from the two halves in turn. This takes a single sort.
Example:
Initial string: ACABBACAB
Sort: AAAABBBCC
Split: AAAA+BBBCC
Combine: ABABABCAC
If the number of letters of highest frequency exceeds half the length of the string, the problem has no solution.
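A minimal C# sketch of this idea (note the sort must be by frequency, most frequent letter first; placing the first half on even indexes and the second half on odd ones is one way to "take letters from the two halves in turn"; it does not detect the unsolvable case, and all names are illustrative):

static string ShuffleBySortAndSplit(string input)
{
    // Sort so that all copies of the most frequent letter come first.
    char[] sorted = input
        .GroupBy(c => c)
        .OrderByDescending(g => g.Count())
        .SelectMany(g => g)
        .ToArray();

    var result = new char[sorted.Length];
    int half = (sorted.Length + 1) / 2; // the first half gets the extra letter for odd lengths

    // First half fills indexes 0, 2, 4, ...; second half fills 1, 3, 5, ...
    for (int i = 0, j = 0; i < half; i++, j += 2)
        result[j] = sorted[i];
    for (int i = half, j = 1; i < sorted.Length; i++, j += 2)
        result[j] = sorted[i];

    return new string(result);
}

For "ACABBACAB" this produces "ABABACACB", which has no two identical adjacent letters.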
Why not use two data structures: one for sorting (like a heap) and one for key retrieval, like a Dictionary?
The accepted answer may produce a correct result, but it is likely not the 'correct' answer to this interview brain teaser, nor the most efficient algorithm.
The simple answer is to take the premise of a basic sorting algorithm and alter the looping predicate to check for adjacency rather than magnitude. This ensures that the 'sorting' operation is the only step required, and (like all good sorting algorithms) it does the least amount of work possible.
Below is a C# example akin to insertion sort for simplicity (though many sorting algorithms could be similarly adjusted):
string NonAdjacencySort(string stringInput)
{
    var input = stringInput.ToCharArray();
    for (var i = 0; i < input.Length; i++)
    {
        var j = i;
        while (j > 0 && j < input.Length - 1 &&
               (input[j + 1] == input[j] || input[j - 1] == input[j]))
        {
            var tmp = input[j];
            input[j] = input[j - 1];
            input[j - 1] = tmp;
            j--;
        }
        if (input.Length > 1 && input[1] == input[0]) // length guard added so single-char input doesn't throw
        {
            var tmp = input[0];
            input[0] = input[input.Length - 1];
            input[input.Length - 1] = tmp;
        }
    }
    return new string(input);
}
The major change to standard insertion sort is that the function has to both look ahead and behind, and therefore needs to wrap around to the last index.
A final point is that this type of algorithm fails gracefully, providing a result with the fewest consecutive characters (grouped at the front).
Since I somehow got convinced to expand an off-hand comment into a full algorithm, I'll write it out as an answer, which must be more readable than a series of uneditable comments.
The algorithm is pretty simple, actually. It's based on the observation that if we sort the string and then divide it into two equal-length halves, plus the middle character if the string has odd length, then corresponding positions in the two halves must differ from each other, unless there is no solution. That's easy to see: if the two characters are the same, then so are all the characters between them, which totals ⌈n/2⌉+1 characters. But a solution is only possible if there are no more than ⌈n/2⌉ instances of any single character.
So we can proceed as follows:
Sort the string.
If the string's length is odd, output the middle character.
Divide the string (minus its middle character if the length is odd) into two equal-length halves, and interleave the two halves.
At each point in the interleaving, since the pair of characters differ from each other (see above), at least one of them must differ from the last character output. So we first output that character and then the corresponding one from the other half.
The sample code below is in C++, since I don't have a C# environment handy to test with. It's also simplified in two ways, both of which would be easy enough to fix at the cost of obscuring the algorithm:
If at some point in the interleaving, the algorithm encounters a pair of identical characters, it should stop and report failure. But in the sample implementation below, which has an overly simple interface, there's no way to report failure. If there is no solution, the function below returns an incorrect solution.
The OP suggests that the algorithm should work with Unicode characters, but the complexity of correctly handling multibyte encodings didn't seem to add anything useful to explain the algorithm. So I just used single-byte characters. (In C# and certain implementations of C++, there is no character type wide enough to hold a Unicode code point, so astral plane characters must be represented with a surrogate pair.)
#include <algorithm>
#include <iostream>
#include <string>

// If possible, rearranges 'in' so that there are no two consecutive
// instances of the same character.
std::string rearrange(std::string in) {
    // Sort the input. The function is call-by-value,
    // so the argument itself isn't changed.
    std::string out;
    size_t len = in.size();
    if (in.size()) {
        out.reserve(len);
        std::sort(in.begin(), in.end());
        size_t mid = len / 2;
        size_t tail = len - mid;
        char prev = in[mid];
        // For odd-length strings, start with the middle character.
        if (len & 1) out.push_back(prev);
        for (size_t head = 0; head < mid; ++head, ++tail)
            // See explanatory text
            if (in[tail] != prev) {
                out.push_back(in[tail]);
                out.push_back(prev = in[head]);
            }
            else {
                out.push_back(in[head]);
                out.push_back(prev = in[tail]);
            }
    }
    return out;
}
You can do this by using a priority queue.
Please find an explanation below:
https://iq.opengenus.org/rearrange-string-no-same-adjacent-characters/
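As a rough illustration of the priority-queue idea, here is a hedged C# sketch of the greedy approach (not the linked article's exact code; it assumes .NET 6's PriorityQueue, which is a min-heap, so counts are negated to pop the most frequent letter first):

static string RearrangeWithHeap(string input)
{
    var counts = input.GroupBy(c => c).ToDictionary(g => g.Key, g => g.Count());
    var heap = new PriorityQueue<char, int>();
    foreach (var kv in counts)
        heap.Enqueue(kv.Key, -kv.Value); // negate so the most frequent pops first

    var sb = new StringBuilder();
    while (heap.Count > 0)
    {
        char c = heap.Dequeue();
        if (sb.Length > 0 && sb[sb.Length - 1] == c)
        {
            // The most frequent letter would repeat; take the next best instead.
            if (heap.Count == 0)
                throw new InvalidOperationException("No valid arrangement exists.");
            char alt = heap.Dequeue();
            sb.Append(alt);
            if (--counts[alt] > 0) heap.Enqueue(alt, -counts[alt]);
            heap.Enqueue(c, -counts[c]); // put the blocked letter back
        }
        else
        {
            sb.Append(c);
            if (--counts[c] > 0) heap.Enqueue(c, -counts[c]);
        }
    }
    return sb.ToString();
}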
Here is a probabilistic approach. The algorithm is:
10) Select a random char from the input string.
20) Try to insert the selected char in a random position in the output string.
30) If it can't be inserted because of proximity with the same char, go to 10.
40) Remove the selected char from the input string and go to 10.
50) Continue until there are no more chars in the input string, or the failed attempts are too many.
public static string ShuffleNoSameAdjacent(string input, Random random = null)
{
    if (input == null) return null;
    if (random == null) random = new Random();
    string output = "";
    int maxAttempts = input.Length * input.Length * 2;
    int attempts = 0;
    while (input.Length > 0)
    {
        while (attempts < maxAttempts)
        {
            int inputPos = random.Next(0, input.Length);
            var outputPos = random.Next(0, output.Length + 1);
            var c = input[inputPos];
            if (outputPos > 0 && output[outputPos - 1] == c)
            {
                attempts++; continue;
            }
            if (outputPos < output.Length && output[outputPos] == c)
            {
                attempts++; continue;
            }
            input = input.Remove(inputPos, 1);
            output = output.Insert(outputPos, c.ToString());
            break;
        }
        if (attempts >= maxAttempts) throw new InvalidOperationException(
            $"Shuffle failed to complete after {attempts} attempts.");
    }
    return output;
}
Not suitable for strings longer than 1,000 chars!
Update: And here is a more complicated deterministic approach. The algorithm is:
Group the elements and sort the groups by length.
Create three empty piles of elements.
Insert each group to a separate pile, inserting always the largest group to the smallest pile, so that the piles differ in length as little as possible.
Check that there is no pile with more than half the total elements, in which case satisfying the condition of not having same adjacent elements is impossible.
Shuffle the piles.
Start yielding elements from the piles, selecting a different pile each time.
When the piles that are eligible for selection are more than one, select randomly, weighting by the size of each pile. Piles containing near half of the remaining elements should be much preferred. For example if the remaining elements are 100 and the two eligible piles have 49 and 40 elements respectively, then the first pile should be 10 times more preferable than the second (because 50 - 49 = 1 and 50 - 40 = 10).
public static IEnumerable<T> ShuffleNoSameAdjacent<T>(IEnumerable<T> source,
    Random random = null, IEqualityComparer<T> comparer = null)
{
    if (source == null) yield break;
    if (random == null) random = new Random();
    if (comparer == null) comparer = EqualityComparer<T>.Default;
    var grouped = source
        .GroupBy(i => i, comparer)
        .OrderByDescending(g => g.Count());
    var piles = Enumerable.Range(0, 3).Select(i => new Pile<T>()).ToArray();
    foreach (var group in grouped)
    {
        GetSmallestPile().AddRange(group);
    }
    int totalCount = piles.Select(e => e.Count).Sum();
    if (piles.Any(pile => pile.Count > (totalCount + 1) / 2))
    {
        throw new InvalidOperationException("Shuffle is impossible.");
    }
    foreach (var pile in piles) Shuffle(pile); // List<T>.ForEach is not available on arrays
    Pile<T> previouslySelectedPile = null;
    while (totalCount > 0)
    {
        var selectedPile = GetRandomPile_WeightedByLength();
        yield return selectedPile[selectedPile.Count - 1];
        selectedPile.RemoveAt(selectedPile.Count - 1);
        totalCount--;
        previouslySelectedPile = selectedPile;
    }

    List<T> GetSmallestPile()
    {
        List<T> smallestPile = null;
        int smallestCount = Int32.MaxValue;
        foreach (var pile in piles)
        {
            if (pile.Count < smallestCount)
            {
                smallestPile = pile;
                smallestCount = pile.Count;
            }
        }
        return smallestPile;
    }

    void Shuffle(List<T> pile)
    {
        for (int i = 0; i < pile.Count; i++)
        {
            int j = random.Next(i, pile.Count);
            if (i == j) continue;
            var temp = pile[i];
            pile[i] = pile[j];
            pile[j] = temp;
        }
    }

    Pile<T> GetRandomPile_WeightedByLength()
    {
        var eligiblePiles = piles
            .Where(pile => pile.Count > 0 && pile != previouslySelectedPile)
            .ToArray();
        Debug.Assert(eligiblePiles.Length > 0, "No eligible pile.");
        foreach (var pile in eligiblePiles)
        {
            pile.Proximity = ((totalCount + 1) / 2) - pile.Count;
            pile.Score = 1;
        }
        Debug.Assert(eligiblePiles.All(pile => pile.Proximity >= 0),
            "A pile has negative proximity.");
        foreach (var pile in eligiblePiles)
        {
            foreach (var otherPile in eligiblePiles)
            {
                if (otherPile == pile) continue;
                pile.Score *= otherPile.Proximity;
            }
        }
        var sumScore = eligiblePiles.Select(p => p.Score).Sum();
        while (sumScore > Int32.MaxValue)
        {
            foreach (var pile in eligiblePiles) pile.Score /= 100;
            sumScore = eligiblePiles.Select(p => p.Score).Sum();
        }
        if (sumScore == 0)
        {
            return eligiblePiles[random.Next(0, eligiblePiles.Length)];
        }
        var randomScore = random.Next(0, (int)sumScore);
        int accumulatedScore = 0;
        foreach (var pile in eligiblePiles)
        {
            accumulatedScore += (int)pile.Score;
            if (randomScore < accumulatedScore) return pile;
        }
        Debug.Fail("Could not select a pile randomly by weight.");
        return null;
    }
}

private class Pile<T> : List<T>
{
    public int Proximity { get; set; }
    public long Score { get; set; }
}
This implementation can shuffle millions of elements. I am not completely convinced that the quality of the shuffling is as perfect as the previous probabilistic implementation, but it should be close.
func shuffle(str: String) -> String {
    var shuffleArray = [Character](str)
    // Sorting
    shuffleArray.sort()
    var shuffle1 = [Character]()
    var shuffle2 = [Character]()
    var adjacentStr = ""
    // Split
    for i in 0..<shuffleArray.count {
        if i > shuffleArray.count / 2 {
            shuffle2.append(shuffleArray[i])
        } else {
            shuffle1.append(shuffleArray[i])
        }
    }
    let count = shuffle1.count > shuffle2.count ? shuffle1.count : shuffle2.count
    // Merge, alternating between the two halves
    for i in 0..<count {
        if i < shuffle1.count {
            adjacentStr.append(shuffle1[i])
        }
        if i < shuffle2.count {
            adjacentStr.append(shuffle2[i])
        }
    }
    return adjacentStr
}
let s = shuffle(str: "AABC")
print(s)

What's the best way to split a list of strings to match first and last letters?

I have a long list of words in C#, and I want to find all the words within that list that have the same first and last letters and that have a length of between, say, 5 and 7 characters. For example, the list might have:
"wasted was washed washing was washes watched watches wilts with wastes wits washings"
It would return
Length: 5-7, First letter: w, Last letter: d, "wasted, washed, watched"
Length: 5-7, First letter: w, Last letter: s, "washes, watches, wilts, wastes"
Then I might change the specification for a length of 3-4 characters which would return
Length: 3-4, First letter: w, Last letter: s, "was, wits"
I found this method of splitting, which is really fast, makes each item unique, uses the length, and gave me an excellent start:
Spliting string into words length-based lists c#
Is there a way to modify/use that to take account of first and last letters?
EDIT
I originally asked about the 'fastest' way because I usually solve problems like this with lots of string arrays (which are slow and involve a lot of code). LINQ and lookups are new to me, but I can see that the ILookup used in the solution I linked to is amazing in its simplicity and is very fast. I don't actually need the minimum processor time. Any approach that avoids me creating separate arrays for this information would be fantastic.
This one-liner will give you groups with the same first/last letters in your range:
int min = 5;
int max = 7;
var results = str.Split()
    .Where(s => s.Length >= min && s.Length <= max)
    .GroupBy(s => new { First = s.First(), Last = s.Last() });
var minLength = 5;
var maxLength = 7;
var firstPart = "w";
var lastPart = "d";
var words = new List<string> { "washed", "wash" }; // and so on
var matches = words.Where(w => w.Length >= minLength && w.Length <= maxLength &&
                               w.StartsWith(firstPart) && w.EndsWith(lastPart))
                   .ToList();
For the most part, this should be fast enough, unless you're dealing with tens of thousands of words and worrying about milliseconds. Then we can look further.
Just in LINQPad I created this:
void Main()
{
    var words = new[] { "wasted", "was", "washed", "washing", "was", "washes", "watched", "watches", "wilts", "with", "wastes", "wits", "washings" };
    var firstLetter = "w";
    var lastLetter = "d";
    var minimumLength = 5;
    var maximumLength = 7;
    var sortedWords = words.Where(w => w.StartsWith(firstLetter) && w.EndsWith(lastLetter) && w.Length >= minimumLength && w.Length <= maximumLength);
    sortedWords.Dump();
}
If that isn't fast enough, I would create a lookup table:
Dictionary<char, Dictionary<char, List<string>>> lookupTable;
and do:
lookupTable[firstLetter][lastLetter].Where(<check length>)
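A hypothetical sketch of how that lookup table could be built with two levels of GroupBy (reusing the words array from the snippet above; single-character keys assumed):

Dictionary<char, Dictionary<char, List<string>>> lookupTable = words
    .Where(w => w.Length > 0)
    .GroupBy(w => w[0]) // outer key: first letter
    .ToDictionary(
        g => g.Key,
        g => g.GroupBy(w => w[w.Length - 1]) // inner key: last letter
              .ToDictionary(gg => gg.Key, gg => gg.ToList()));

// A query is then two dictionary lookups plus the length filter:
var matches = lookupTable['w']['d'].Where(w => w.Length >= 5 && w.Length <= 7);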
Here's a method that does exactly what you want. You are only given a list of strings and the min/max length, correct? You aren't given the first and last letters to filter on. This method processes all the first/last letters in the strings.
private static void ProcessInput(string[] words, int minLength, int maxLength)
{
    var groups = from word in words
                 where word.Length > 0 && word.Length >= minLength && word.Length <= maxLength
                 let key = new Tuple<char, char>(word.First(), word.Last())
                 group word by key into @group
                 orderby Char.ToLowerInvariant(@group.Key.Item1), @group.Key.Item1,
                         Char.ToLowerInvariant(@group.Key.Item2), @group.Key.Item2
                 select @group;
    Console.WriteLine("Length: {0}-{1}", minLength, maxLength);
    foreach (var group in groups)
    {
        Console.WriteLine("First letter: {0}, Last letter: {1}", group.Key.Item1, group.Key.Item2);
        foreach (var word in group)
            Console.WriteLine("\t{0}", word);
    }
}
Just as a quick thought, I have no clue if this would be faster or more efficient than the LINQ solutions posted, but this could also be done fairly easily with regular expressions.
For example, if you wanted to get 5-7 letter length words that begin with "w" and end with "s", you could use a pattern along the lines of:
\bw[A-Za-z]{3,5}s\b
(and this could fairly easily be made more variable-driven - for example, have variables for the first letter, min length, max length, and last letter, and plug them into the pattern to replace w, 3, 5 and s)
Then, using the Regex class, you could just take your captured matches to be your list.
Again, I don't know how this compares efficiency-wise to LINQ, but I thought it might deserve mention.
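To make the variable-driven version concrete, here is a hedged sketch (text, first, last, and the length bounds are illustrative placeholders; Cast<Match> keeps it working on older framework versions):

using System.Text.RegularExpressions;

char first = 'w', last = 's';
int min = 5, max = 7;
// The middle part matches min-2 to max-2 letters, because the first
// and last letters account for two characters of the total length.
string pattern = $@"\b{first}[A-Za-z]{{{min - 2},{max - 2}}}{last}\b";

var matches = Regex.Matches(text, pattern)
    .Cast<Match>()
    .Select(m => m.Value)
    .ToList();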
Hope this helps!!
