Find the best interval match result

I have two sets of data in this form:
x   | y | z        x1  | y1  | z1
ab1 | 1 | 2        ab1 | 1   | 2
ab1 | 2 | 3        ab1 | 1.8 | 2
ab2 | 2 | 3        ab1 | 1.8 | 2
The number of columns can vary from 1 to 30. The two sets are likely to have different numbers of rows.
The number of rows per set can range from a few hundred to a few million.
For each column a different matching rule will be applied, for example:
x: perfect match
y: +/- 0.1
z: +/- 0.5
Two rows are equivalent when all the criteria are satisfied.
My final goal is to find the rows in the first set with no match in the second set.
The naive algorithm could be:
foreach a in SetA
{
    foreach b in SetB
    {
        if (a == b)
        {
            remove b from SetB
            process the next element in SetA
        }
    }
    log a is not in SetB
}
At this stage I am not very interested in the efficiency of the algorithm. I am sure I could do better and reduce the complexity.
I am more concerned about the correctness of the result. Let's try a very simple example.
Two sets of numbers:
A B
1.6 1.55
1.5 1.45
4 3.2
And two elements are equal if:
b + 0.1 >= a >= b - 0.1
Now, if I run the naive algorithm I will find 2 matches.
However the result of the algorithm depends on the order of the two sets. For example:
A B
1.5 1.55
1.6 1.45
4 3.2
The algorithm will find only one match.
I would like to find the maximum number of matching rows.
I reckon that in real-world data one of the columns will store an id, so the number of possible multiple matches will be a much smaller subset of the original set.
I know I can try to handle this problem with post-processing after the first scan.
However, I don't want to reinvent the wheel, and I am wondering if my problem is equivalent to some famous, well-known and already solved problem.
PS: I have tagged the question also as C++, C# and Java because I am going to use one of these languages to implement it.

It can be cast as a graph theory problem. Let X be a set that contains one node for each row in your first set. Let Y be another set which contains one node for each row in your second set.
The edges in the graph are defined by: for a given x in X and a given y in Y, there is an edge (x,y) if the row corresponding to x matches the row corresponding to y.
Once you have built this graph you can run the "maximum-bipartite-matching" algorithm on it and you will be done.
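As a rough illustration of that idea, here is a minimal Python sketch using the simple augmenting-path (Kuhn) algorithm on the two-number example from the question. The function names and the `close` rule are made up for the illustration; for millions of rows you would want to group by an exact-match column first, or use Hopcroft-Karp instead of this quadratic version.

```python
def max_matching(set_a, set_b, matches):
    """Kuhn's augmenting-path algorithm.

    Returns match_of_b, where match_of_b[j] is the index in set_a
    matched to set_b[j], or None if set_b[j] is unmatched.
    """
    match_of_b = [None] * len(set_b)

    def try_assign(i, visited):
        # Try to give row i of A a partner, re-routing earlier matches if needed.
        for j, b in enumerate(set_b):
            if j in visited or not matches(set_a[i], b):
                continue
            visited.add(j)
            if match_of_b[j] is None or try_assign(match_of_b[j], visited):
                match_of_b[j] = i
                return True
        return False

    for i in range(len(set_a)):
        try_assign(i, set())
    return match_of_b


def close(a, b, tol=0.1):
    # Hypothetical single-column rule: b - tol <= a <= b + tol
    return abs(a - b) <= tol


def unmatched_rows(set_a, set_b, matches=close):
    matched = {i for i in max_matching(set_a, set_b, matches) if i is not None}
    return [a for i, a in enumerate(set_a) if i not in matched]


# Both orderings from the question now give the same answer.
print(unmatched_rows([1.6, 1.5, 4], [1.55, 1.45, 3.2]))  # [4]
print(unmatched_rows([1.5, 1.6, 4], [1.55, 1.45, 3.2]))  # [4]
```

With a multi-column row type, only `matches` needs to change: it would apply the per-column rule to every column and return true when all of them hold.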

As I understand it, you want the rows in the first set which don't match any row in the second set (within the error range). This clearly can be done with an O(n^2) algorithm by walking the elements of the first set and comparing each one with the elements of the second set.
An optimization could be this:
sort both sets - O(n*log(n))
eliminate from the first set the elements that are too small or too big (beyond the error) to match anything in the second set - O(n)
look up each remaining element of the first set in the second set using a binary search (within the error) - O(n*log(n)) - and eliminate those not suitable
overall complexity: O(n*log(n))
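A minimal Python sketch of the sort-plus-binary-search idea, assuming a single numeric column and a made-up function name. Note that it only answers "does this row have any partner in the second set"; it does not enforce one-to-one matching.

```python
from bisect import bisect_left


def rows_without_match(set_a, set_b, tol):
    """Return the values of set_a that have no value of set_b within +/- tol."""
    sorted_b = sorted(set_b)                    # O(n log n)
    missing = []
    for a in set_a:
        # Index of the first b with b >= a - tol (binary search, O(log n)).
        i = bisect_left(sorted_b, a - tol)
        # No candidate at all, or the closest candidate is already above a + tol.
        if i == len(sorted_b) or sorted_b[i] > a + tol:
            missing.append(a)
    return missing


print(rows_without_match([1.6, 1.5, 4], [1.55, 1.45, 3.2], 0.1))  # [4]
```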

range tree? http://en.wikipedia.org/wiki/Range_tree
I don't really know, just throwing ideas out there.

From the statement "My final goal is to find the rows in the first set with no match in second set." I understand that there can be multiple rows in first set that match the same row in the second set. In this case the solution is to remove the line "remove b from SetB" from your naive algorithm.
If, however, you really need one-to-one matches between elements of the two sets, then the answer with "maximum-bipartite-matching" provided by Corey Kosak applies.

Given your constraints, I don't see a way to do it in less than O(n^2). I'd probably modify your naive algorithm to include either a bool or a count field for each row in table A and then mark it if it matches a row in table B.
Then post process it with std::partition based on the indicator to group all the unique and non unique rows together. If you go with a count, you could get the rows that were "least unique". The bool would be somewhat more efficient since you could break out of the loop over B at the first match.

Two rows are equivalent when all the criteria are satisfied. My final goal is to find the rows in the first set with no match in second set.
foreach a in SetA
{
    foreach b in SetB
    {
        if (a == b) //why would you alter SetB at all
            go to next A
    }
    remove a from SetA //log a is not in SetB
}
However, you are right that this is equivalent to a famous, well-known and already solved problem. It's called "Set Difference". It's... kind of a major part of set theory. And since all those languages have sets, they also have that algorithm. C++ even has a dedicated function for it. std::set_difference performs at most 2(A+B)-1 comparisons on sorted input; the hash-based versions below take expected time proportional to A+B.
C++ standard algorithm function: http://www.cplusplus.com/reference/algorithm/set_difference/
// A and B must both be sorted with the same ordering.
std::vector<row> Result(A.size());
auto end = std::set_difference(A.begin(), A.end(),
                               B.begin(), B.end(),
                               Result.begin());
Result.resize(end - Result.begin());
or std::unordered_set can be made to do this: http://msdn.microsoft.com/en-us/library/bb982739.aspx
std::unordered_set<row> Result(A.begin(), A.end());
for (auto i = B.begin(); i != B.end(); ++i) {
    auto f = Result.find(*i);
    if (f != Result.end())
        Result.erase(f);
}
Java does as well: http://download.oracle.com/javase/tutorial/collections/interfaces/set.html
Set<row> Result = new HashSet<row>(A);
Result.removeAll(B);
And C#: http://msdn.microsoft.com/en-us/library/bb299875.aspx
HashSet<row> Result = new HashSet<row>(A);
Result.ExceptWith(B);

Related

Algorithm to find all possible binary combinations with a condition

Here is one for you math brains out there. I have a matrix; actually it's half a matrix, cut diagonally. Each element of the matrix can be a 1 or a 0. I need to find all the possible combinations of 1s and 0s for any matrix of width N.
This is easy enough: you can get the number of elements in this matrix, given width N, with N(N+1)/2; for this example where N=7 this would give us 28 elements. Then you can get the number of combinations with 2^(number of elements).
So the formula to get all the possible combinations would be 2^(N(N+1)/2).
Now here is where it gets tricky. There is one condition that must hold true for each result. The sum of each set of elements on the matrix (shown below with each row represented) must be less than 4 for the first set (the one on the first row), and less than 3 for all the other sets (these are constants regardless of the N value).
Here are what the sets for this example (N=7) look like. If you notice, each row is represented. So for the first set, if the combination is 0 1 0 1 0 1 0 this would be valid as its sum is < 4 (since it's the first row). For the second set, if the combination is 1 0 0 0 0 1 0 it is valid as it needs to be < 3.
I need to do this for huge matrices, so brute forcing all possible permutations to find the ones that fall under this condition would be infeasible. I need to find some sort of algorithm I can use to generate the valid matrices bottom up rather than top down. Maybe doing separate operations that can be composed later to yield a total set of results.
Any and all ideas are welcome.

A simple algorithm that generates each solution recursively:
global File // a file where you will store your data
global N    // your matrix size
// matrix contains the matrix we build (int[][])
// set contains the number of 1s we can still use in each set (int[])
// c is the column number (int)
// r is the row number (int)
function f ( matrix, set, c, r ) :
    if ( c == N ):
        r = r + 1
        c = r
    if ( r == N ):
        write ( matrix in File )
        // Implement your own way of storing the matrix
        return
    if ( set[r] > 0 AND (c+2 < N AND set[c+2] > 0) ):
        matrix[c][r] = 1
        set[c]--
        set[r]--
        f ( matrix, set, c+1, r )
        // undo the changes before trying the 0 branch
        matrix[c][r] = 0
        set[c]++
        set[r]++
    f ( matrix, set, c+1, r )
end
// Calling our function with N = 5
N = 5
f([[0,0,0,0,0],[0,0,0,0,0],...], [3,2,2,2,2], 0, 0)
You can store each matrix in something other than a file, but keep an eye on your memory consumption.

Here's a basic idea to get started; it's too large for a comment, but it's not a complete answer.
The idea is to start with a maximally 'filled' matrix rather than starting with an empty one and filling it.
Basic stripping away procedure
Start with a matrix in which every row is filled to its maximum number of 1s; going by the question's limits (< 4 for the first set, < 3 for the others), that is three 1s in row 0 and two 1s in each of the other rows. Then, start checking the conditions. Condition 0 (row 0) is automatically satisfied. For each of the remaining conditions, either it is satisfied, or there are too many 1s in its condition set: take away 1s until the condition is satisfied. Do this for all conditions.
Generating all 'simpler' ones
Doing this, you end up with a matrix that satisfies all conditions. Now, you can change any 1 to a 0 and the matrix will still satisfy all the conditions. So, once you have a 'maximal' solution, you can generate all sub-solutions of it trivially.

Ideas about Generating Untraceable Invoice IDs

I want to print invoices for customers in my app. Each invoice has an Invoice ID. I want IDs to be:
Sequential (IDs issued later are larger)
32 bit integers
Not easily traceable like 1 2 3 so that people can't tell how many items we sell.
An idea of my own:
Number of seconds since a specific date & time (e.g. 1/1/2010 00 AM).
Any other ideas on how to generate these numbers?

I don't like the idea of using time. You can run into all sorts of issues - time differences, several events happening in a single second and so on.
If you want something sequential and not easily traceable, how about generating a random number between 1 and whatever you wish (for example 100) for each new Id. Each new Id will be the previous Id + the random number.
You can also add a constant to your IDs to make them look more impressive. For example you can add 44323 to all your IDs and turn IDs 15, 23 and 27 into 44338, 44346 and 44350.
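A small Python sketch of this suggestion. The function names, the step range and the use of `secrets` are my own choices for the illustration; in a real application the previous ID would have to be read and updated atomically (for example inside a database transaction).

```python
import secrets


def next_invoice_id(previous_id, max_step=100):
    """Previous ID plus a random step between 1 and max_step (inclusive)."""
    # secrets.randbelow(max_step) is in [0, max_step), so the step is in [1, max_step].
    return previous_id + 1 + secrets.randbelow(max_step)


def display_id(internal_id, offset=44323):
    """Add a constant so the first IDs do not look suspiciously small."""
    return internal_id + offset


print([display_id(i) for i in (15, 23, 27)])  # [44338, 44346, 44350]
```

With steps averaging about 50, a signed 32-bit range is used up after roughly 42 million invoices instead of about 2 billion, so the step size trades obscurity against capacity.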

There are two problems in your question. One is solvable, one isn't (with the constraints you give).
Solvable: Unguessable numbers
The first one is quite simple: It should be hard for a customer to guess a valid invoice number (or the next valid invoice number), when the customer has access to a set of valid invoice numbers.
You can solve this within your constraints:
Split your invoice number in two parts:
A 20 bit prefix, taken from a sequence of increasing numbers (e.g. the natural numbers 0,1,2,...)
A 10 bit suffix that is randomly generated
With this scheme, there are about 1 million valid invoice numbers. You can precalculate them and store them in the database. When presented with an invoice number, check if it is in your database. If it isn't, it's not valid.
Use a SQL sequence for handing out numbers. When issuing a new (i.e. unused) invoice number, increment the sequence and issue the n-th number from the precalculated list (order by value).
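The prefix/suffix split can be sketched in Python as follows. This is a variation, not the exact procedure above: it draws the random suffix at issue time rather than precalculating the whole list, and a plain in-memory set (`issued`, a hypothetical name) stands in for the database table.

```python
import secrets

PREFIX_BITS = 20   # taken from the increasing sequence
SUFFIX_BITS = 10   # randomly generated

issued = set()     # stand-in for the table of valid invoice numbers


def issue_invoice_number(sequence_value):
    """Combine a 20-bit sequence value with a 10-bit random suffix."""
    if not 0 <= sequence_value < (1 << PREFIX_BITS):
        raise ValueError("sequence value does not fit in the 20-bit prefix")
    suffix = secrets.randbelow(1 << SUFFIX_BITS)
    number = (sequence_value << SUFFIX_BITS) | suffix
    issued.add(number)
    return number


def is_valid(number):
    """An invoice number is valid only if it was actually issued."""
    return number in issued


first = issue_invoice_number(0)
second = issue_invoice_number(1)
print(first < second, is_valid(second))  # True True
```

Because the sequence value sits in the high bits, later numbers are always larger than earlier ones, while only about one in 1024 values in the used range is valid.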
Not solvable: Guessing the number of customers
If you want to prevent a customer who holds a number of valid invoice numbers from guessing how many invoice numbers you have issued so far (and therefore how many customers you have): this is not possible.
You have here a variant of the so-called "German tank problem". In the Second World War, the Allies used serial numbers printed on the gearboxes of German tanks to estimate how many tanks Germany had produced. This worked because the serial numbers were increasing without gaps.
But even when you increase the numbers with gaps, the solution for the German tank problem still works. It is quite easy:
You use the method described here to guess the highest issued invoice number
You guess the mean difference between two successive invoice numbers and divide the highest number by this value
You can use linear regression to get a stable delta value (if it exists).
Now you have a good guess about the order of magnitude of the number of invoices (200, 15000, half a million, etc.).
This works as long as there (theoretically) exists a mean value for the gap between two successive invoice numbers. This is usually the case, even when using a random number generator, because most random number generators are designed to have such a mean value.
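To make the attack concrete, here is a hedged Python sketch. The linked method is not reproduced in this text, so the sketch assumes the textbook German tank estimator, m + m/k - 1, where m is the largest observed number and k the number of observations; the function names are illustrative and the mean gap is taken as a given input.

```python
def estimate_highest_number(observed):
    """Textbook German tank estimate of the highest issued number."""
    k = len(observed)
    m = max(observed)
    return m + m / k - 1


def estimate_invoice_count(observed, mean_gap):
    """Divide the estimated highest number by the mean gap between successive numbers."""
    return estimate_highest_number(observed) / mean_gap


# Four observed serial numbers, numbering assumed to start near 1.
print(estimate_highest_number([19, 40, 42, 60]))    # 74.0
print(estimate_invoice_count([19, 40, 42, 60], 2))  # 37.0
```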
There is a countermeasure: you have to make sure that no mean value exists for the gap between two successive numbers. A random number generator with this property can be constructed very easily.
Example:
Start with the last invoice number plus one as current number
Multiply the current number with a random number >=2. This is your new current number.
Get a random bit: If the bit is 0, the result is your current number. Otherwise go back to step 2.
While this will work in theory, you will very soon run out of 32 bit integer numbers.
I don't think there is a practical solution for this problem. Either the gap between two successive numbers has a mean value (with little variance) and you can guess the number of issued invoices easily, or you will run out of 32 bit numbers very quickly.
Snake oil (non-working solutions)
Don't use any time-based solution. The timestamp is usually easily guessable (probably an approximately correct timestamp will be printed somewhere on the invoice). Using timestamps usually makes it easier for the attacker, not harder.
Don't use insecure random numbers. Most random number generators are not cryptographically safe. They usually have mathematical properties that are good for statistics but bad for your security (e.g. a predictable distribution, a stable mean value, etc.)

One solution may involve Exclusive OR (XOR) binary bitmaps. The result function is reversible, may generate non-sequential numbers (if the first bit of the least significant byte is set to 1), and is extremely easy to implement. And, as long as you use a reliable sequence generator (your database, for example,) there is no need for thread safety concerns.
According to MSDN, 'the result [of a exclusive-OR operation] is true if and only if exactly one of its operands is true.' Conversely, equal operands always yield false.
As an example, I just generated a 32-bit sequence on Random.org. This is it:
11010101111000100101101100111101
This binary number translates to 3588381501 in decimal, 0xD5E25B3D in hex. Let's call it your base key.
Now, let's generate some values using the ([base key] XOR [ID]) formula. In C#, this is what your encryption function would look like:
public static long FlipMask(long baseKey, long ID)
{
    return baseKey ^ ID;
}
The following list contains some generated content. Its columns are as follows:
ID
Binary representation of ID
Binary value after XOR operation
Final, 'encrypted' decimal value
0 | 000 | 11010101111000100101101100111101 | 3588381501
1 | 001 | 11010101111000100101101100111100 | 3588381500
2 | 010 | 11010101111000100101101100111111 | 3588381503
3 | 011 | 11010101111000100101101100111110 | 3588381502
4 | 100 | 11010101111000100101101100111001 | 3588381497
In order to reverse the generated key and determine the original value, you only need to do the same XOR operation using the same base key. Let's say we want to obtain the original value of the second row:
11010101111000100101101100111101 XOR
11010101111000100101101100111100 =
00000000000000000000000000000001
Which was indeed your original value.
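For anyone who wants to check the table and the round trip quickly, here is a short Python sketch of the same XOR masking (the name `flip_mask` just mirrors the C# function above):

```python
BASE_KEY = 0xD5E25B3D  # 3588381501, the base key from the example


def flip_mask(value, key=BASE_KEY):
    """XOR with the key; applying it twice returns the original value."""
    return value ^ key


encoded = [flip_mask(i) for i in range(5)]
print(encoded)
# [3588381501, 3588381500, 3588381503, 3588381502, 3588381497]
print([flip_mask(e) for e in encoded])
# [0, 1, 2, 3, 4]
```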
Now, Stefan made very good points, and the first topic is crucial.
In order to cover his concerns, you may reserve the highest, say, 8 bits to be purely random garbage (which I believe is called a nonce), which you generate when encrypting the original ID and ignore when reversing it. That would heavily increase your security at the expense of a generous slice of all the possible positive integer numbers with 32 bits (16,777,216 instead of 4,294,967,296, or 1/256 of it.)
A class to do that would look like this:
public static class int32crypto
{
    // C# follows ECMA 334v4, so Integer Literals have only two possible forms -
    // decimal and hexadecimal.
    // Original key: 0b11010101111000100101101100111101
    public static long baseKey = 0xD5E25B3D;
    public static long encrypt(long value)
    {
        // First we will extract from our baseKey the bits we'll actually use.
        // We do this with an AND mask, indicating the bits to extract.
        // Remember, we'll ignore the top 8 bits. So the mask must look like this:
        // Significance mask: 0b00000000111111111111111111111111
        long _sigMask = 0x00FFFFFF;
        // sigKey is our baseKey with only the indicated bits still true.
        long _sigKey = _sigMask & baseKey;
        // nonce generation. First security issue, since Random()
        // is time-based on its first iteration. But that's OK for the sake
        // of explanation, and safe for most circumstances.
        // Next(256) returns 0..255; the upper bound is exclusive.
        // The bits it will occupy are the lowest eight, like this:
        // OriginalNonce: 0b000000000000000000000000NNNNNNNN
        long _tempNonce = new Random().Next(256);
        // We now shift them to the highest byte, like this:
        // finalNonce: 0bNNNNNNNN000000000000000000000000
        _tempNonce = _tempNonce << 0x18;
        // And now we mix both Nonce and sigKey, 'poisoning' the original
        // key, like this:
        long _finalKey = _tempNonce | _sigKey;
        // Phew! Now we apply the final key to the value, and return
        // the encrypted value.
        return _finalKey ^ value;
    }
    public static long decrypt(long value)
    {
        // This is easier than encrypting. We will just ignore the bits
        // we know are used by our nonce.
        long _sigMask = 0x00FFFFFF;
        long _sigKey = _sigMask & baseKey;
        // We will do the same to the informed value:
        long _trueValue = _sigMask & value;
        // Now we decode and return the value:
        return _sigKey ^ _trueValue;
    }
}

Perhaps an idea may come from the military? Group invoices in blocks like these:
28th Infantry Division
--1st Brigade
---1st BN
----A Co
----B Co
---2nd BN
----A Co
----B Co
--2nd Brigade
---1st BN
----A Co
----B Co
---2nd BN
----A Co
----B Co
--3rd Brigade
---1st BN
----A Co
----B Co
---2nd BN
----A Co
----B Co
http://boards.straightdope.com/sdmb/showthread.php?t=432978
Groups don't have to be sequential, but numbers within groups do.
UPDATE
Think about the above as groups differentiated by place, time, person, etc. For example: create a group using a temporary seller ID, changing it every 10 days, or by office/shop.
There is another idea, you may say a bit weird but... the more I think of it the more I like it. Why not count these invoices down? Choose a big number and count down. It's easy to trace the number of items when counting up, but counting down? How would anyone guess where the starting point is? It's easy to implement, too.

If the orders sit in an inbox until a single person processes them each morning, seeing that it took that person till 16:00 before he got round to creating my invoice will give me the impression that he's been busy. Getting the 9:01 invoice makes me feel like I'm the only customer today.
But if you generate the ID at the time when I place my order, the timestamp tells me nothing.
I think I therefore actually like the timestamps, assuming that collisions where two customers simultaneously need an ID created are rare.

You can see from the code below that I use newsequentialid() to generate a sequential number then convert that to a [bigint]. As that generates a consistent increment of 4294967296 I simply divide that number by the [id] on the table (it could be rand() seeded with nanoseconds or something similar). The result is a number that is always less than 4294967296 so I can safely add it and be sure I'm not overlapping the range of the next number.
Peace
Katherine
declare @generator as table (
    [id] [bigint],
    [guid] [uniqueidentifier] default (newsequentialid()) not null,
    [converted] as (convert([bigint], convert([varbinary](8), [guid], 1))) + 10000000000000000000,
    [converted_with_randomizer] as (convert([bigint], convert([varbinary](8), [guid], 1))) + 10000000000000000000 + cast((4294967296 / [id]) as [bigint])
);
insert into @generator ([id])
values (1), (2), (3), (4), (5), (6), (7), (8), (9), (10);
select [id],
       [guid],
       [converted],
       [converted] - lag([converted], 1) over (order by [id]) as [orderly_increment],
       [converted_with_randomizer],
       [converted_with_randomizer] - lag([converted_with_randomizer], 1) over (order by [id]) as [disorderly_increment]
from @generator
order by [converted];

I do not know the reasons for the rules you set on the Invoice ID, but you could consider having an internal Invoice ID, which could be the sequential 32-bit integer, and an external Invoice ID that you can share with your customers.
This way your internal ID can start at 1 and you can add one to it every time, and the customer invoice ID can be whatever you want.

I think Na Na has the correct idea with choosing a big number and counting down. Start off with a large value seed and either count up or down, but don't start with the last placeholder. If you use one of the other placeholders it will give an illusion of a higher invoice count....if they are actually looking at that anyway.
The only caveat here would be to modify the last X digits of the number periodically to maintain the appearance of a change.

Why not use an easily readable number constructed like this:
the first 12 digits are the datetime in yyyymmddhhmm format (that ensures the order of your invoice IDs)
the last x digits are the order number (in this example 8 digits)
The number you get then is something like 20130814140300000008
Then do some simple calculations with it, for example on the first 12 digits:
(201308141403) * 3 = 603924424209
The second part (original: 00000008) can be obfuscated like this:
(10001234 - 00000008 * 256) * (minutes + 2) = 49995930
It is easy to translate it back into an easily readable number, but unless the customer knows how it was built, they have no clue at all.
Altogether this number would look like 603924424209-49995930
for an invoice issued on the 14th of August 2013 at 14:03 with the internal invoice number 00000008.
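The arithmetic above, and its inverse, can be sketched in Python like this (function names are made up for the illustration; the constants 3, 10001234 and 256 are the ones used in the example):

```python
def obfuscate(timestamp, order_number):
    """timestamp is a 12-digit yyyymmddhhmm string, order_number an int."""
    minutes = int(timestamp[-2:])
    first = int(timestamp) * 3
    second = (10001234 - order_number * 256) * (minutes + 2)
    return f"{first}-{second}"


def deobfuscate(public_id):
    """Recover (timestamp, order_number) from the obfuscated form."""
    first, second = (int(part) for part in public_id.split("-"))
    timestamp = first // 3
    minutes = timestamp % 100
    order_number = (10001234 - second // (minutes + 2)) // 256
    return f"{timestamp:012d}", order_number


print(obfuscate("201308141403", 8))          # 603924424209-49995930
print(deobfuscate("603924424209-49995930"))  # ('201308141403', 8)
```

One thing the sketch makes visible: with these constants the second part stays positive only while the order number is below roughly 39,000, so the constants would need adjusting for larger volumes.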

You can write your own function that, when applied to the previous number, generates the next number in the sequence: greater than the previous one, but by an irregular amount. The numbers that can be generated come from a finite set (for example, integers between 1 and 2^31), so the range is eventually exhausted, though that is a long way off. To add more complexity to the generated numbers you can append some alphanumeric characters at the end. You can read about this here Sequential Random Numbers.
An example generator can be
private static string GetNextnumber(int currentNumber)
{
    Int32 nextnumber = currentNumber + (currentNumber % 3) + 5;
    Random _random = new Random();
    //you can skip the below 2 lines if you don't want alpha numeric
    int num = _random.Next(0, 26); // Zero to 25
    char let = (char)('a' + num);
    return nextnumber + let.ToString();
}
and you can call it like
string nextnumber = GetNextnumber(yourpreviouslyGeneratedNumber);

Dealing With Combinations

In C# I created a list array containing a list of varied indexes. I'd like to display 1 combination of 2 combinations of different indexes. The 2 combinations inside the one must not be repeated.
I am trying to code a tennis tournament with 14 players that pair. Each player must never be paired with another player twice.

Your problem falls under the domain of the binomial coefficient. The binomial coefficient handles problems of choosing unique combinations in groups of K with a total of N items.
I have written a class in C# to handle common functions for working with the binomial coefficient. It performs the following tasks:
Outputs all the K-indexes in a nice format for any N choose K to a file. The K-indexes can be substituted with more descriptive strings or letters.
Converts the K-indexes to the proper index of an entry in the sorted binomial coefficient table. This technique is much faster than older published techniques that rely on iteration. It does this by using a mathematical property inherent in Pascal's Triangle and is very efficient compared to iterating over the set.
Converts the index in a sorted binomial coefficient table to the corresponding K-indexes. I believe it is also faster than older iterative solutions.
Uses Mark Dominus's method to calculate the binomial coefficient, which is much less likely to overflow and works with larger numbers.
The class is written in .NET C# and provides a way to manage the objects related to the problem (if any) by using a generic list. The constructor of this class takes a bool value called InitTable that when true will create a generic list to hold the objects to be managed. If this value is false, then it will not create the table. The table does not need to be created in order to use the 4 above methods. Accessor methods are provided to access the table.
There is an associated test class which shows how to use the class and its methods. It has been extensively tested with 2 cases and there are no known bugs.
To read about this class and download the code, see Tablizing The Binomial Coefficient.
There are 2 different ways to interpret your problem. In tennis, tournaments are usually arranged to use single elimination, where the winning player from each match advances. However, some local clubs also use round robins where each player plays each other player just once, which appears to be the problem that you are looking at.
So, the question is - how to calculate the total number of unique matches that can be played with 14 players (N = 14), where each player plays just one other player (and thus K = 2). The binomial coefficient calculation is as follows:
Total number of unique combinations = N! / (K! * (N - K)! ). The ! character is called a factorial, and means N * (N-1) * (N-2) ... * 1. When K is 2, the binomial coefficient is reduced to: N * (N - 1) / 2. So, plugging in 14 for N and 2 for K, we find that the total number of combinations is 91.
The following code will iterate through each unique combination:
int N = 14; // Total number of elements in the set.
int K = 2;  // Total number of elements in each group.
// Create the bin coeff object required to get all
// the combos for this N choose K combination.
BinCoeff<int> BC = new BinCoeff<int>(N, K, false);
int NumCombos = BinCoeff<int>.GetBinCoeff(N, K);
// The KIndexes array specifies the 2 players, starting with index 0.
int[] KIndexes = new int[K];
// Loop thru all the combinations for this N choose K case.
for (int Combo = 0; Combo < NumCombos; Combo++)
{
    // Get the k-indexes for this combination.
    BC.GetKIndexes(Combo, KIndexes);
    // KIndexes[0] is the first player & KIndexes[1] is the 2nd player.
    // Print out the indexes for both players.
    String S = "Player1 = " + KIndexes[0].ToString() + ", " +
               "Player2 = " + KIndexes[1].ToString();
    Console.WriteLine(S);
}
You should be able to port this class over fairly easily to the language of your choice. You probably will not have to port over the generic part of the class to accomplish your goals. Depending on the number of combinations you are working with, you might need to use a bigger word size than 4 byte ints.
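If you only need the list of pairings and not the ranking/unranking features of the class, the same enumeration can be sketched with a standard library call. This Python version is just an illustration of the 14-choose-2 count, with players numbered from 0:

```python
from itertools import combinations

N = 14  # number of players
pairs = list(combinations(range(N), 2))  # every unordered pair exactly once

print(len(pairs))  # 91
for p1, p2 in pairs[:3]:
    print(f"Player1 = {p1}, Player2 = {p2}")
```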
I should also mention, that since this is a class project, your teacher might not accept the above answer since he might be looking for more original work. In that case, you might want to consider using loops. You should check with him before submitting a solution.

Which C# data structure allows searching a pair of strings most efficiently for substrings?

I have a data structure which consists of pairs of values, the first of which is an integer and the second of which is an alphanumeric string (which may begin with digits):
+--------+-----------------+
| Number | Name |
+--------+-----------------+
| 15 | APPLES |
| 16 | APPLE COMPUTER |
| 17 | ORANGE |
| 21 | TWENTY-1 |
| 291 | 156TH ELEMENT |
+--------+-----------------+
A table of these would comprise up to 100,000 rows.
I'd like to provide a lookup function in which the user can look up either the number (as if it were a string), or pieces of the string. Ideally the lookup will be "live" as the user types; after each keystroke (or maybe after a brief delay ~250-500 ms) a new search will be done to find the most likely candidates. So, for example searching on
1 will return 15 APPLES, 16 APPLE COMPUTER, 17 ORANGE, and 291 156TH ELEMENT
15 will narrow the search to 15 APPLES, 291 156TH ELEMENT
AP will return 15 APPLES and 16 APPLE COMPUTER
(ideally, but not required) ELEM will return 291 156TH ELEMENT.
I was thinking about using two Dictionary<string, string>s since ultimately the ints are being compared as strings -- one will index by the integer part and the other by the string part.
But really searching by substring shouldn't use a hash function, and it seems wasteful to use twice the memory that I feel like I should need.
Ultimately the question is, is there any well-performing way to text search two large lists simultaneously for substrings?
Failing that, how about a SortedDictionary? Might increase performance but still wouldn't solve the hash problem.
Thought about creating a regex on the fly, but I would think that would perform terribly.
I'm new to C# (having come from the Java world) so I haven't looked into LINQ yet; is that the answer?
EDIT 18:21 EST: None of the strings in the "Name" field will be longer than 12-15 characters, if that affects your potential solution.

If possible, I would avoid loading all 100,000 entries into memory. I would use either a database or Lucene.Net to index the values. Then use the appropriate query syntax to efficiently search for the results.

I'd consider using a trie data structure.
How to achieve that? Leaves would represent your "row", but you would have "two paths" to each memory instance of a "row" (one for number and the other one for name).
You can then sacrifice your condition:
(ideally, but not required) ELEM will return 291 156TH ELEMENT.
Or provide even more paths to your row instances.
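A rough Python sketch of the trie idea, using the sample table from the question. It is a variation on the description above: instead of keeping rows only at the leaves, every node stores the numbers of all rows reachable through it, which makes prefix lookups a plain walk down the tree at the cost of extra memory. Indexing each word of the name separately is what lets ELEM find its row; class and function names are made up.

```python
class PrefixTrie:
    def __init__(self):
        self.children = {}
        self.rows = set()   # numbers of every row whose key passes through this node

    def insert(self, key, number):
        node = self
        for ch in key:
            node = node.children.setdefault(ch, PrefixTrie())
            node.rows.add(number)

    def search(self, prefix):
        node = self
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return set()
        return node.rows


def build_trie(table):
    trie = PrefixTrie()
    for number, name in table.items():
        trie.insert(str(number), number)   # path 1: the number, as a string
        for word in name.split():          # path 2..n: each word of the name
            trie.insert(word, number)
    return trie


table = {15: "APPLES", 16: "APPLE COMPUTER", 17: "ORANGE",
         21: "TWENTY-1", 291: "156TH ELEMENT"}
trie = build_trie(table)
print(sorted(trie.search("1")))     # [15, 16, 17, 291]
print(sorted(trie.search("15")))    # [15, 291]
print(sorted(trie.search("AP")))    # [15, 16]
print(sorted(trie.search("ELEM")))  # [291]
```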

Since you are searching for the beginning of words, key based collections will not work, unless you store all possible pieces of the words, like "a", "ap", "app", "appl", "apple".
My suggestion is to use a System.Collections.Generic.List<T> in conjunction with a binary search. You would have to provide your own IComparer<T>, which also finds the beginning of words. You would use two data structures.
One List<KeyValuePair<string,int>> holding single words or the number as key and the number as value.
One Dictionary<int,string> holding the whole name.
You would proceed like this:
Split your sentence (the whole name) into single words.
Add them to the list with the word as key and the number as value of the KeyValuePair.
Add the number to the list as key and as value of the KeyValuePair.
When the list is full, sort the list in order to allow a binary search.
Search for a beginning of a word:
Search in the list by using BinarySearch in conjunction with your IComparer<T>.
The index you get from the search might not be the first that applies, so go back in the list until you find the first entry that matches.
Using the number stored as value in the list, look up the whole name in the dictionary using this number as key.
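The steps above translate fairly directly; here is a hedged Python sketch of the same sorted-list approach (in C# it would be a List<T> with BinarySearch and a custom comparer, as described). Python's `bisect_left` already lands on the first candidate entry, so the "go back in the list" step is not needed here. Function names are illustrative.

```python
from bisect import bisect_left


def build_sorted_index(table):
    """Sorted list of (key, number): one entry per word, plus the number itself."""
    entries = []
    for number, name in table.items():
        entries.append((str(number), number))
        for word in name.split():
            entries.append((word, number))
    entries.sort()
    keys = [key for key, _ in entries]
    return keys, entries


def prefix_search(index, table, prefix):
    keys, entries = index
    i = bisect_left(keys, prefix)           # first key >= prefix
    found = set()
    while i < len(keys) and keys[i].startswith(prefix):
        found.add(entries[i][1])
        i += 1
    # Look the whole name up by number, as in the last step above.
    return [(n, table[n]) for n in sorted(found)]


table = {15: "APPLES", 16: "APPLE COMPUTER", 17: "ORANGE",
         21: "TWENTY-1", 291: "156TH ELEMENT"}
index = build_sorted_index(table)
print(prefix_search(index, table, "15"))
# [(15, 'APPLES'), (291, '156TH ELEMENT')]
```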

Struggling to make algorithm to generate board for a puzzle game

I'm looking to make a number puzzle game. For the sake of the question, let's say the board is a grid consisting of 4 x 4 squares. (In the actual puzzle game, this number will be 1..15)
A number may only occur once in each column and once in each row, a little like Sudoku, but without "squares".
Valid:
[1, 2, 3, 4
2, 3, 4, 1
3, 4, 1, 2
4, 1, 2, 3]
I can't seem to come up with an algorithm that will consistently generate valid, random n x n boards.
I'm writing this in C#.

Start by reading my series on graph colouring algorithms:
http://blogs.msdn.com/b/ericlippert/archive/tags/graph+colouring/
It is going to seem like this has nothing to do with your problem, but by the time you're done, you'll see that it has everything to do with your problem.
OK, now that you've read that, you know that you can use a graph colouring algorithm to describe a Sudoku-like puzzle and then solve a specific instance of the puzzle. But clearly you can use the same algorithm to generate puzzles.
Start by defining your graph regions that are fully connected.
Then modify the algorithm so that it tries to find two solutions.
Now create a blank graph and set one of the regions at random to a random colour. Try to solve the graph. Were there two solutions? Then add another random colour. Try it again. Were there no solutions? Then back up a step and add a different random colour.
Keep doing that -- adding random colours, backtracking when you get no solutions, and continuing until you get a puzzle that has a unique solution. And you're done; you've got a random puzzle generator.
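
A rough sketch of that loop for the row/column board in the question (this is not the code from the blog series). Each cell is a node, each row and each column is a fully connected region, and the values 1..n are the colours. `UniquePuzzle`, `CountSolutions`, `Fits` and `Generate` are made-up names, and plain backtracking like this is only practical for small boards.

```csharp
using System;

// Hypothetical names throughout; a sketch for small boards only.
public static class UniquePuzzle
{
    // True if value v does not already appear in row r or column c.
    public static bool Fits(int[,] board, int r, int c, int v)
    {
        int n = board.GetLength(0);
        for (int i = 0; i < n; i++)
            if (board[r, i] == v || board[i, c] == v) return false;
        return true;
    }

    // Counts completions of a partial board (0 = empty cell), stopping at 'limit'.
    // Assumes the values already on the board do not clash with each other.
    public static int CountSolutions(int[,] board, int limit)
    {
        int n = board.GetLength(0);
        for (int r = 0; r < n; r++)
            for (int c = 0; c < n; c++)
            {
                if (board[r, c] != 0) continue;
                int found = 0;
                for (int v = 1; v <= n && found < limit; v++)
                {
                    if (!Fits(board, r, c, v)) continue;
                    board[r, c] = v;
                    found += CountSolutions(board, limit - found);
                    board[r, c] = 0; // undo before trying the next value
                }
                return found;
            }
        return 1; // no empty cell left: the board is one complete solution
    }

    // Adds random clues until exactly one solution remains.
    public static int[,] Generate(int n, Random rng)
    {
        var board = new int[n, n];
        while (CountSolutions(board, 2) != 1)
        {
            // Two or more solutions: pick a random empty cell.
            int r, c;
            do { r = rng.Next(n); c = rng.Next(n); } while (board[r, c] != 0);

            // Try the colours starting from a random one.
            int start = rng.Next(n);
            for (int k = 0; k < n; k++)
            {
                int v = 1 + (start + k) % n;
                if (!Fits(board, r, c, v)) continue;
                board[r, c] = v;
                if (CountSolutions(board, 1) > 0) break; // still solvable: keep the clue
                board[r, c] = 0; // no solutions: back up and try a different colour
            }
        }
        return board; // partially filled board with a unique solution
    }
}
```

The returned array is the puzzle itself (zeros are the blanks); running the same solver on it gives the one solution.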

It seems you could use this valid example as input to an algorithm that randomly swapped two rows a random number of times, then swapped two random columns a random number of times.

There aren't too many combinations you need to try. You can always rearrange a valid board so the top row is 1,2,3,4 (by remapping the symbols), and the left column is 1,2,3,4 (by rearranging rows 2 through 4). In each of rows 2 through 4 there are only 6 permutations of the remaining 3 symbols, so you can loop over the 6 x 6 x 6 = 216 candidate boards to find which of them are valid. You may as well store the valid ones.
Then pick a valid board randomly, randomly rearrange the rows, and randomly reassign the symbols.
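
The enumeration for the 4 x 4 case can be sketched like this (`ReducedBoards`, `Enumerate`, `FillRow` and `ColumnsValid` are made-up names). Every row is a permutation by construction, so only the columns need checking.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical names; enumerates the 4 x 4 boards whose top row and
// left column are both 1,2,3,4.
public static class ReducedBoards
{
    // All six orderings of three items.
    private static readonly int[][] Orders =
    {
        new[] { 0, 1, 2 }, new[] { 0, 2, 1 }, new[] { 1, 0, 2 },
        new[] { 1, 2, 0 }, new[] { 2, 0, 1 }, new[] { 2, 1, 0 }
    };

    public static List<int[,]> Enumerate()
    {
        var valid = new List<int[,]>();
        foreach (int[] p2 in Orders)
            foreach (int[] p3 in Orders)
                foreach (int[] p4 in Orders)
                {
                    var board = new int[4, 4];
                    for (int c = 0; c < 4; c++) board[0, c] = c + 1; // top row 1,2,3,4
                    FillRow(board, 1, p2);
                    FillRow(board, 2, p3);
                    FillRow(board, 3, p4);
                    if (ColumnsValid(board)) valid.Add(board);
                }
        return valid;
    }

    // Row r starts with r + 1; the other three symbols follow in the given order.
    private static void FillRow(int[,] board, int r, int[] order)
    {
        var rest = new List<int>();
        for (int v = 1; v <= 4; v++)
            if (v != r + 1) rest.Add(v);
        board[r, 0] = r + 1;
        for (int i = 0; i < 3; i++) board[r, i + 1] = rest[order[i]];
    }

    // True if no column contains a repeated symbol.
    private static bool ColumnsValid(int[,] board)
    {
        for (int c = 0; c < 4; c++)
        {
            var seen = new HashSet<int>();
            for (int r = 0; r < 4; r++)
                if (!seen.Add(board[r, c])) return false;
        }
        return true;
    }
}
```

Of the 216 candidates, 4 pass the column check.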

I don't speak C#, but the following algorithm ought to be easily translated.
Associate a set consisting of the numbers 1..N with each row and column:
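for i = 1 to N
    row_set[i] = Set(1 .. N)
    column_set[i] = Set(1 .. N)
Then make a single pass through the matrix, choosing an entry for each position randomly from the set elements valid at that row and column. Remove the number chosen from the respective row and column sets.
for r = 1 to N
    for c = 1 to N
        k = RandomChoice( Intersection( column_set[c], row_set[r] ))
        puzzle_board[r, c] = k
        column_set[c] = column_set[c] - k
        row_set[r] = row_set[r] - k
    next c
next r
Note that the intersection can be empty before the board is complete (for N = 3, a first row of 1 2 3 followed by 2 1 leaves no value for the last cell of the second row), so you have to back up or start over when that happens.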
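
A C# sketch of the single pass above, with one change that is not in the pseudocode: a random pick can leave a later cell in the same row with no valid value, so this version redoes the row when that happens. A partly filled board of complete rows can always take another valid row, so the retry loop does end. `GreedyBoard`, `Generate` and `TryRow` are made-up names, and the result is random but not claimed to be uniformly distributed.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical names; fills the board row by row from per-row and per-column sets.
public static class GreedyBoard
{
    public static int[,] Generate(int n, Random rng)
    {
        // colSet[c] holds the values still unused in column c.
        var colSet = new HashSet<int>[n];
        for (int c = 0; c < n; c++) colSet[c] = new HashSet<int>(Enumerable.Range(1, n));

        var board = new int[n, n];
        for (int r = 0; r < n; r++)
        {
            int[] row;
            do { row = TryRow(n, colSet, rng); } while (row == null); // retry on dead end

            for (int c = 0; c < n; c++)
            {
                board[r, c] = row[c];
                colSet[c].Remove(row[c]);
            }
        }
        return board;
    }

    // Picks a random valid value for each cell of one row.
    // Returns null if some cell has no valid value left.
    private static int[] TryRow(int n, HashSet<int>[] colSet, Random rng)
    {
        var rowSet = new HashSet<int>(Enumerable.Range(1, n));
        var row = new int[n];
        for (int c = 0; c < n; c++)
        {
            List<int> candidates = rowSet.Intersect(colSet[c]).ToList();
            if (candidates.Count == 0) return null; // dead end
            int k = candidates[rng.Next(candidates.Count)];
            row[c] = k;
            rowSet.Remove(k);
        }
        return row;
    }
}
```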

Looks like you want to generate uniformly distributed Latin Squares.
This pdf has a description of a method by Jacobson and Matthews (which was published elsewhere, a reference of which can be found here: http://designtheory.org/library/encyc/latinsq/z/)
Or you could potentially pre-generate a "lot" of them (before you ship :-)), store that in a file and randomly pick one.
Hope that helps.

The easiest way I can think of would be to create a partial game and solve it. If it's not solvable, or if it's wrong, make another. ;-)

Sudoku without squares sounds a bit like Sudoku. :)
http://www.codeproject.com/KB/game/sudoku.aspx
There is an explanation of the board generator code they use there.

Check out http://www.chiark.greenend.org.uk/~sgtatham/puzzles/ - he's got several puzzles that have precisely this constraint (among others).

A further solution would be this. Suppose you have a number of solutions. For each of them, you can generate a new solution by simply permuting the identifiers (1..15). These new solutions are of course logically the same, but to a player they will appear different.
The permutation might be done by treating each identifier in the initial solution as an index into an array, and then shuffling that array.

Use your first valid example:
1 2 3 4
2 3 4 1
3 4 1 2
4 1 2 3
Then, create randomly 2 permutations of {1, 2, 3, 4}.
Use the first to permute rows and then the second to permute columns.
You can find several ways to create permutations in Knuth's The Art of Computer Programming (TAOCP), Volume 4 Fascicle 2, Generating All Tuples and Permutations (2005), v+128pp. ISBN 0-201-85393-0.
If you can't find a copy in a library, a preprint (of the part that discusses permutations) is available at his site: fasc2b.ps.gz
EDIT - CORRECTION
The above solution is similar to 500 - Internal Server Error's. But I think neither will find all valid arrangements.
For example they'll find:
1 3 2 4
3 1 4 2
2 4 3 1
4 2 1 3
but not this one:
1 2 3 4
2 1 4 3
3 4 1 2
4 3 2 1
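One more step is needed: after rearranging rows and columns (using either my way or 500's), create one more permutation (let's call it s3) and use it to relabel all the numbers in the array. Be aware, though, that this still only yields rearrangements of the starting board: a board with a different structure, such as the second one above, cannot be reached from the cyclic starting board by any combination of row, column and number permutations.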
s3 = randomPermutation(1 ... n)
for i=1 to n
    for j=1 to n
        array[i,j] = s3( array[i,j] )
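
Put together in C#, the three permutations could look like this. It is a sketch under a few assumptions: the starting board is the cyclic one from the example, where the cell at 0-based (r, c) holds (r + c) mod n + 1; `BoardShuffler`, `RandomPermutation` and `Generate` are made-up names; and a Fisher-Yates shuffle is used to build each permutation.

```csharp
using System;

// Hypothetical names; shuffles rows, columns and symbols of the cyclic board.
public static class BoardShuffler
{
    // Fisher-Yates shuffle of the values 0..n-1.
    private static int[] RandomPermutation(int n, Random rng)
    {
        var p = new int[n];
        for (int i = 0; i < n; i++) p[i] = i;
        for (int i = n - 1; i > 0; i--)
        {
            int j = rng.Next(i + 1);
            int t = p[i]; p[i] = p[j]; p[j] = t;
        }
        return p;
    }

    public static int[,] Generate(int n, Random rng)
    {
        int[] rows = RandomPermutation(n, rng);    // which starting row ends up in row i
        int[] cols = RandomPermutation(n, rng);    // which starting column ends up in column j
        int[] symbols = RandomPermutation(n, rng); // s3: relabels the numbers

        var board = new int[n, n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
            {
                // 0-based value of the cyclic starting board at (rows[i], cols[j]).
                int cyclic = (rows[i] + cols[j]) % n;
                board[i, j] = symbols[cyclic] + 1;
            }
        return board;
    }
}
```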
