Random string collision after using Fisher-Yates algorithm (C#) - c#

I am doing an exercise from exercism.io, in which I have to generate random names for robots. I am able to get through the bulk of the tests until I hit this test:
[Fact]
public void Robot_names_are_unique()
{
var names = new HashSet<string>();
for (int i = 0; i < 10_000; i++) {
var robot = new Robot();
Assert.True(names.Add(robot.Name));
}
}
After some googling around, I stumbled upon a couple of solutions and found out about the Fisher-Yates algorithm. I tried to implement it into my own solution but unfortunately, I haven't been able to pass the final test, and I'm stumped. If anyone could point me in the right direction with this, I'd greatly appreciate it. My code is below:
EDIT: I forgot to mention that the format of the string has to follow this: @"^[A-Z]{2}\d{3}$"
public class Robot
{
string _name;
Random r = new Random();
string alpha = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
string nums = "0123456789";
public Robot()
{
_name = letter() + num();
}
public string Name
{
get { return _name; }
}
private string letter() => GetString(2 ,alpha.ToCharArray(), r);
private string num() => GetString(3, nums.ToCharArray(), r);
public void Reset() => _name = letter() + num();
public string GetString(int length,char[] chars, Random rnd)
{
Shuffle(chars, rnd);
return new string(chars, 0, length);
}
public void Shuffle(char[] _alpha, Random r)
{
for(int i = _alpha.Length - 1; i > 1; i--)
{
int j = r.Next(i);
char temp = _alpha[i];
_alpha[i] = _alpha[j];
_alpha[j] = temp;
}
}
}

The first rule of any ID is:
It does not matter how big it is or how many possible values it has - if you just create enough of them, you will get a collision eventually.
To quote Trillian from The Hitchhiker's Guide: "[A collision] is not impossible. Just really, really unlikely."
However, in this case I think the problem is that you are creating Random instances in a loop. This is a classic beginner's mistake when working with Random. You should not create a new Random instance for each Robot instance; you should have one for the application that you re-use. Like all pseudorandom number generators, Random is deterministic. Same inputs - same outputs.
As you did not specify a seed value, it will use the time in milliseconds (at least on .NET Framework, where the default seed comes from the system clock). That value is going to be the same for at least the first 20+ loop iterations. So it is going to have the same seed and the same inputs, and therefore the same outputs.
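The "same inputs - same outputs" point can be checked directly. The following is a minimal sketch; the class and member names (SeedDemo, SharedRandomExample) are made up for illustration and are not part of the exercise. Note that sharing one generator only removes duplicates caused by identical seeds; two names can still collide by chance.

```csharp
using System;

public static class SeedDemo
{
    // Two generators built from the same seed yield identical sequences.
    public static bool SameSeedSameOutput(int seed)
    {
        var a = new Random(seed);
        var b = new Random(seed);
        for (int i = 0; i < 100; i++)
        {
            if (a.Next() != b.Next()) return false;
        }
        return true;
    }
}

public static class SharedRandomExample
{
    // One generator for the whole application, created once and re-used.
    private static readonly Random Shared = new Random();

    // Returns a random upper-case letter drawn from the shared generator.
    public static char NextLetter() => (char)('A' + Shared.Next(26));
}
```

A Robot class would call something like SharedRandomExample.NextLetter() instead of holding its own Random field.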

The easiest solution for unique names is to use GUIDs. In theory, it is possible to generate non-unique GUIDs, but the probability is pretty close to zero.
Here is the sample code:
var newUniqueName = Guid.NewGuid().ToString();
Sure, GUIDs do not look pretty, but they are really easy to use.
EDIT: Since I missed the additional requirement for the format, I see that the GUID format is not acceptable.
Here is an easy way to do that too. Since the format is two letters (26^2 possible values) and 3 digits (10^3 possible values), the final number of possible values is 26^2 * 10^3 = 676 * 1000 = 676000. This number is quite small, so Random can be used to generate a random integer in the range 0-675999, and that number can then be converted to the name. Here is the sample code:
var random = new System.Random();
var value = random.Next(676000);
var name = ((char)('A' + (value % 26))).ToString();
value /= 26;
name += (char)('A' + (value % 26));
value /= 26;
name += (char)('0' + (value % 10));
value /= 10;
name += (char)('0' + (value % 10));
value /= 10;
name += (char)('0' + (value % 10));
The usual disclaimer about possible identical names applies here too since we have 676000 possible variants and 10000 required names.
EDIT2: Tried the code above, and generating 10000 names using random numbers produced between 9915 and 9950 unique names. That is no good. I would use a simple static class member as a counter instead of a random number generator.
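A counter-based version could look roughly like the sketch below. SequentialNames and its members are hypothetical names; the index-to-name mapping is the same modulo arithmetic as in the sample above, and Interlocked.Increment is used so the counter also behaves when robots are created from several threads.

```csharp
using System;
using System.Threading;

public static class SequentialNames
{
    private const int NameCount = 26 * 26 * 10 * 10 * 10; // 676,000 possible names
    private static int _next = -1;

    // Maps an index in 0..675999 to a distinct name such as "AB123".
    public static string FromIndex(int index)
    {
        if (index < 0 || index >= NameCount)
            throw new ArgumentOutOfRangeException(nameof(index));
        var chars = new char[5];
        chars[0] = (char)('A' + index % 26); index /= 26;
        chars[1] = (char)('A' + index % 26); index /= 26;
        chars[2] = (char)('0' + index % 10); index /= 10;
        chars[3] = (char)('0' + index % 10); index /= 10;
        chars[4] = (char)('0' + index % 10);
        return new string(chars);
    }

    // Hands out the next unused name; throws once all names are exhausted.
    public static string Next()
    {
        int index = Interlocked.Increment(ref _next);
        return FromIndex(index);
    }
}
```

The names are unique by construction, but they are predictable (AA000, BA000, CA000, ...), which is fine for the test in the question.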

First, let's review the test your code is failing against:
10,000 instances created
Must all have distinct names
So somehow, when creating 10000 "random" names, your code produces at least two names that are the same.
Now, let's have a look at the naming scheme you're using:
AB123
The maximum number of unique names we could possibly create is 468000 (26 * 25 * 10 * 9 * 8).
This seems like it should not be a problem, because 10000 < 468000 - but this is where the birthday paradox comes in!
From wikipedia:
In probability theory, the birthday problem or birthday paradox concerns the probability that, in a set of n randomly chosen people, some pair of them will have the same birthday.
Rewritten for the purposes of your problem, we end up asking:
What's the probability that, in a set of 10000 randomly chosen people, some pair of them will have the same name.
The Wikipedia article also lists a function for approximating the number of people required to reach a 50% probability that two people will have the same name:
n ≈ 1/2 + sqrt(1/4 + 2 * m * ln(2))
where m is the total number of possible distinct values. Applying this with m=468000 gives us ~806 - meaning that after creating only 806 randomly named Robots, there's already a 50% chance of two of them having the same name.
By the time you reach Robot #10000, the chances of not having generated two names that are the same is basically 0.
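To put numbers on this, here is a small sketch that assumes the usual birthday approximation, P(collision) ≈ 1 - exp(-n(n-1)/(2m)), for n uniform draws from m values. The class and method names are invented for the example.

```csharp
using System;

public static class Birthday
{
    // Approximate probability of at least one collision among n uniform draws from m values.
    public static double CollisionProbability(double n, double m)
        => 1.0 - Math.Exp(-n * (n - 1) / (2.0 * m));

    // Approximate number of draws needed for a 50% chance of a collision.
    public static double DrawsForEvenOdds(double m)
        => 0.5 + Math.Sqrt(0.25 + 2.0 * m * Math.Log(2.0));
}
```

Birthday.DrawsForEvenOdds(468000) comes out just under 806, and Birthday.CollisionProbability(10000, 468000) is indistinguishable from 1 in double precision.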
As others have noted, you can solve this by using a Guid as the robot name instead.
If you want to retain the naming convention you might also get around this by implementing an LCG with an appropriate period and use that as a less collision-prone "naming generator".
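As a rough sketch of the LCG idea: with modulus 676000 (the full AB123 name space, 26 * 26 * 10 * 10 * 10 = 2^5 * 5^3 * 13^2), the Hull-Dobell conditions for a full period are met when the increment shares no factor with the modulus and the multiplier minus one is divisible by 4, 5 and 13. The constants 261 and 7919 below are one such choice picked for illustration, not tuned for statistical quality, and NameSequence is a hypothetical name.

```csharp
using System;

public sealed class NameSequence
{
    private const int Modulus = 676000;  // 26 * 26 * 10 * 10 * 10
    private const int Multiplier = 261;  // 261 - 1 = 260 = 4 * 5 * 13
    private const int Increment = 7919;  // prime, shares no factor with Modulus
    private int _state;

    public NameSequence(int seed)
    {
        // Normalise the seed into 0..Modulus-1, including negative seeds.
        _state = ((seed % Modulus) + Modulus) % Modulus;
    }

    // Returns the next index in 0..675999; no index repeats within 676000 calls.
    public int NextIndex()
    {
        _state = (int)(((long)Multiplier * _state + Increment) % Modulus);
        return _state;
    }
}
```

Each index can then be turned into a name with the same modulo arithmetic used in the other answers. The sequence only looks random; anyone who knows the constants can predict it.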

Here's one way you can do it:
Generate the list of all possible names
For each robot, select a name from the list at random
Remove the selected name from the list so it can't be selected again
With this, you don't even need to shuffle. Something like this (note, I stole Optional Option's method of generating names because it's quite clever and I couldn't be bothered thinking of my own):
public class Robot
{
private static List<string> names;
private static Random rnd = new Random();
public string Name { get; private set; }
static Robot()
{
Console.WriteLine("Initializing");
// Generate all 676,000 candidates (the second argument of Range is a count, not an upper bound)
names = Enumerable.Range(0, 676000).Select(i =>
{
var sb = new StringBuilder(5);
sb.Append((char)('A' + i % 26));
i /= 26;
sb.Append((char)('A' + i % 26));
i /= 26;
sb.Append(i % 10);
i /= 10;
sb.Append(i % 10);
i /= 10;
sb.Append(i % 10);
return sb.ToString();
}).ToList();
}
public Robot()
{
// Note: if this needs to be multithreaded, then you'd need to do some work here
// to avoid two threads trying to take a name at the same time
// Also note: you should probably check that names.Count > 0
// and throw an error if not
// The upper bound of Next is exclusive, so this can pick any remaining index
var i = rnd.Next(names.Count);
Name = names[i];
names.RemoveAt(i);
}
}
Here's a fiddle that generates 20 random names. They can only be unique because they are removed once they are used.
The point about multithreading is very important, however. If you needed to be able to generate robots in parallel, then you'd need to add some code (e.g. locking the critical section of code) to ensure that only one name is being picked and removed from the list of candidates at a time, or else things will get really bad, really quickly. This is why, when people need a random id with a reasonable expectation that it'll be unique, without worrying that some other thread(s) are trying the same thing at the same time, they use GUIDs. The sheer number of possible GUIDs makes collisions very unlikely. But you don't have that luxury with only 676,000 possible values.
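A locked version of the pick-and-remove step might look like the sketch below. NamePool, Gate and Take are names invented for this example, and the pool stores indices rather than strings to keep it short. It also swaps the picked slot with the last one before removing, so removal does not have to shift the rest of the list.

```csharp
using System;
using System.Collections.Generic;

public static class NamePool
{
    private static readonly object Gate = new object();
    private static readonly Random Rnd = new Random();
    private static readonly List<int> Remaining = CreateIndices();

    private static List<int> CreateIndices()
    {
        var list = new List<int>(676000);
        for (int i = 0; i < 676000; i++) list.Add(i);
        return list;
    }

    // Removes and returns one random unused index; safe to call from several threads.
    public static int Take()
    {
        lock (Gate)
        {
            if (Remaining.Count == 0)
                throw new InvalidOperationException("No names left.");
            int pick = Rnd.Next(Remaining.Count);
            int value = Remaining[pick];
            // Swap-remove: move the last element into the hole, then drop the tail.
            int last = Remaining.Count - 1;
            Remaining[pick] = Remaining[last];
            Remaining.RemoveAt(last);
            return value;
        }
    }
}
```

The lock also protects the Random instance, which is not safe for concurrent use on its own.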

Related

Reducing a BigInteger value in C#

I'm somewhat new to working with BigIntegers and have tried some stuff to get this system working, but feel a little stuck at the moment and would really appreciate a nudge in the right direction or a solution.
I'm currently working on a system which reduces BigInteger values down to a more readable form, and this is working fine with my current implementation, but I would like to further expand on it to get decimals implemented.
To better give a picture of what I'm attempting, I'll break it down.
In this context, we have a method which is taking a BigInteger, and returning it as a string:
public static string ShortenBigInt (BigInteger moneyValue)
With this in mind, when a number such as 10,000 is passed to this method, 10k will be returned. Same for 1,000,000 which will return 1M.
This is done by doing:
for(int i = 0; i < prefixes.Length; i++)
{
if(!(moneyValue >= BigInteger.Pow(10, 3*i)))
{
moneyValue = moneyValue / BigInteger.Pow(10, 3*(i-1));
return moneyValue + prefixes[i-1];
}
}
This system is working by grabbing a string from an array of prefixes and reducing numbers down to their simplest forms and combining the two and returning it when inside that prefix range.
So with that context, the question I have is:
How might I go about returning this in the same way, where passing 100,000 would return 100k, but also doing something like 1,111,111 would return 1.11M?
Currently, passing 1,111,111 returns 1M, but I would like that additional .11 tagged on. No more than 2 decimals.
My original thought was to convert the big integer into a string, then chunk out the first few characters into a new string and parse a decimal in there, but since prefixes don't change until values reach their 1000th mark, it's harder to tell when to place the decimal place.
My next thought was using BigInteger.Log to reduce the value down into a decimal friendly number and do a simple division to get the value in its decimal form, but doing this didn't seem to work with my implementation.
This system should work for the following prefixes, dynamically:
k, M, B, T, qd, Qn, sx, Sp,
O, N, de, Ud, DD, tdD, qdD, QnD,
sxD, SpD, OcD, NvD, Vgn, UVg, DVg,
TVg, qtV, QnV, SeV, SPG, OVG, NVG,
TGN, UTG, DTG, tsTG, qtTG, QnTG, ssTG,
SpTG, OcTG, NoTG, QdDR, uQDR, dQDR, tQDR,
qdQDR, QnQDR, sxQDR, SpQDR, OQDDr, NQDDr,
qQGNT, uQGNT, dQGNT, tQGNT, qdQGNT, QnQGNT,
sxQGNT, SpQGNT, OQQGNT, NQQGNT, SXGNTL
Would anyone happen to know how to do something like this? Any language is fine, C# is preferable, but I'm all good with translating. Thank you in advance!
Formatting it manually could work a bit like this:
(prefixes as a string, which is a char[])
public static string ShortenBigInt(BigInteger moneyValue)
{
string prefixes = " kMGTP";
double m2 = (double)moneyValue;
for (int i = 1; i < prefixes.Length; i++)
{
var step = Math.Pow(10, 3 * i);
if (m2 / step < 1000)
{
return String.Format("{0:F2}", (m2/step)) + prefixes[i];
}
}
return "err";
}
Although Falco's answer does work, it doesn't do what was requested. This is the solution I was looking for; I received some help from a friend on it. It keeps going until there are no more prefixes left in your string array of prefixes. If you do run out of bounds, the exception is thrown and handled by returning "Infinity".
This solution is better because there is no crunching down to doubles/decimals in the process. It has no number cap; the only limit is the number of prefixes you provide.
public static string ShortenBigInt(BigInteger moneyValue)
{
if (moneyValue < 1000)
return "" + moneyValue;
try
{
string moneyAsString = moneyValue.ToString();
string prefix = prefixes[(moneyAsString.Length - 1) / 3];
BigInteger chopAmmount = (moneyAsString.Length - 1) % 3 + 1;
int insertPoint = (int)chopAmmount;
chopAmmount += 2;
moneyAsString = moneyAsString.Remove(Math.Min(moneyAsString.Length - 1, (int)chopAmmount));
moneyAsString = moneyAsString.Insert(insertPoint, ".");
return moneyAsString + " " + prefix;
}
catch (Exception exceptionToBeThrown)
{
return "Infinity";
}
}
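For comparison, the same result can be reached with BigInteger arithmetic alone, without string slicing. This is only a sketch: the short Prefixes table is a stand-in for the full list in the question (index 0 is the empty suffix), decimals are truncated rather than rounded, and trailing zeros are dropped so that 100,000 gives 100k.

```csharp
using System;
using System.Numerics;

public static class BigNumberFormat
{
    // Stand-in prefix table; extend it with the full list as needed.
    private static readonly string[] Prefixes = { "", "k", "M", "B", "T" };

    public static string Shorten(BigInteger value)
    {
        if (value < 1000) return value.ToString();
        int digits = value.ToString().Length;
        int group = (digits - 1) / 3;                  // 1 = thousands, 2 = millions, ...
        if (group >= Prefixes.Length) return "Infinity";

        // Keep two extra digits so the last two digits of 'scaled' are the decimals.
        BigInteger unit = BigInteger.Pow(10, 3 * group);
        BigInteger scaled = value * 100 / unit;
        BigInteger whole = BigInteger.DivRem(scaled, 100, out BigInteger fraction);

        if (fraction.IsZero) return whole.ToString() + Prefixes[group];
        string decimals = ((int)fraction).ToString("D2").TrimEnd('0');
        return whole.ToString() + "." + decimals + Prefixes[group];
    }
}
```

With this sketch, 1,111,111 becomes 1.11M and 10,000 becomes 10k.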

Random number distribution is uneven / non uniform

I have noticed a strange issue with the random number generation in C#: it looks like sets (patterns) are repeated a lot more often than you would expect.
I'm writing a mechanism that generates activation codes, a series of 7 numbers (range 0-29).
Doing the math, there should be 30^7 (22 billion) possible combinations of activation codes. Based on this, it should be extremely unlikely to get a duplicate activation code before the 1 billionth code is generated. However, running my test, I start getting duplicate codes after about 60,000 iterations, which is very surprising. I have also tried using RNGCryptoServiceProvider with similar results: I get duplicates at about 100,000 iterations.
I would really like to know if this is a bug/limitation of the random number generation in .Net or if I'm doing something wrong.
The following code is a test to validate the uniqueness of the generated codes:
static void Main(string[] args)
{
Random rand = new Random();
RandomActivationCode(rand);
Console.Out.WriteLine("Press enter");
Console.ReadLine();
}
static void RandomActivationCode(Random randomGenerator)
{
var maxItems = 11000000;
var list = new List<string>(maxItems);
var activationCodes = new HashSet<string>(list);
activationCodes.Clear();
DateTime start = DateTime.Now;
for (int i = 0; i < maxItems; ++i)
{
string activationCode = "";
for (int j = 0; j < 7; ++j)
{
activationCode += randomGenerator.Next(0,30) + "-";
}
if (activationCodes.Contains(activationCode))
{
Console.Out.WriteLine("Code: " + activationCode);
Console.Out.WriteLine("Duplicate at iteration: " + i.ToString("##,#"));
Console.Out.WriteLine("Press enter");
Console.ReadLine();
Console.Out.WriteLine();
Console.Out.WriteLine();
}
else
{
activationCodes.Add(activationCode);
}
if (i % 100000 == 0)
{
Console.Out.WriteLine("Iteration: " + i.ToString("##,#"));
Console.Out.WriteLine("Time elapsed: " + (DateTime.Now - start));
}
}
}
My workaround is to use 10 number activation codes, which means that the test runs without any duplicate values being generated. The test runs up to 11 million iterations (after which point it runs out of memory).
This is not at all surprising; this is exactly what you should expect. Your belief that it should take a long time to generate duplicates when the space of possibilities is large is simply false, so stop believing that. Start believing the truth: that if there are n possible codes then you should start getting duplicates at about the square root of n codes generated, which is about 150 thousand if n is 22 billion.
Think about it this way: by the time you have generated root-n codes, most of them have had roughly a root-n-in-n chance to have a collision. Multiply root-n by roughly root-n-in-n, and you get... roughly 100% chance of collision.
That is of course not a rigorous argument, but it should give you the right intuition to replace your faulty belief. If that argument is unconvincing then you might want to read my article on the subject:
http://blogs.msdn.com/b/ericlippert/archive/2010/03/22/socks-birthdays-and-hash-collisions.aspx
If you want to generate a unique code then generate a GUID; that's what they're for. Note that a GUID is not guaranteed to be random, it is only guaranteed to be unique.
Another choice for generating random seeming codes that are not actually random at all, but are unique, is to generate the numbers 1, 2, 3, 4, ... as many as you want, and then use the multiplicative inverse technique to make a random-looking unique encoding of those numbers. See http://ericlippert.com/2013/11/14/a-practical-use-of-multiplicative-inverses/ for details.
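A minimal sketch of that multiplicative inverse idea, assuming nine-digit codes: multiply each sequential id by a constant that is coprime to 10^9 and reduce modulo 10^9. Because the multiplier has an inverse, the mapping is a bijection, so distinct ids always give distinct codes. The multiplier 387420489 (3^18) and the class name Scrambler are choices made for this example; the inverse is computed with the extended Euclidean algorithm rather than hard-coded.

```csharp
using System;

public static class Scrambler
{
    private const long Modulus = 1_000_000_000;   // codes are in 0..999,999,999
    private const long Multiplier = 387_420_489;  // 3^18: odd and not a multiple of 5, so coprime to Modulus

    // Turns a sequential id (0 <= id < Modulus) into a random-looking but unique code.
    public static long Encode(long id) => id * Multiplier % Modulus;

    // Recovers the original id from a code.
    public static long Decode(long code) => code * Inverse(Multiplier, Modulus) % Modulus;

    // Extended Euclidean algorithm: returns x with (a * x) % m == 1, assuming gcd(a, m) == 1.
    public static long Inverse(long a, long m)
    {
        long r0 = m, r1 = a, t0 = 0, t1 = 1;
        while (r1 != 0)
        {
            long q = r0 / r1;
            (r0, r1) = (r1, r0 - q * r1);
            (t0, t1) = (t1, t0 - q * t1);
        }
        return ((t0 % m) + m) % m;
    }
}
```

The codes are unique but not secret: anyone who learns the multiplier can invert them.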

make limitation for random class in c#

I want to put a limitation on the Random class in C#: generate random values from 2 ranges without repeating any of them.
Example:
Xpoints[i] = random.Next(0, 25);
Ypoints[i] = random.Next(0, 12);
where 25 and 12 are the image dimensions, so I need every pixel in this image, but in random order. Any suggestions? If I use this code, some pixels are never produced and some are repeated.
Update: Simplified by not requiring any specific hashing [1]
Update: Generalized into a generic SimpleShuffle extension method
public static IEnumerable<T> SimpleShuffle<T>(this IEnumerable<T> sequence)
{
var rand = new Random();
return sequence.Select(i => new {i, k=rand.Next()})
.OrderBy(p => p.k)
.Select(p => p.i);
}
I thought that, in addition to downvoting (shouting? sorry :)) Anx's answer, it'd be nicer to also show what my code would look like:
using System;
using System.Linq;
using System.Collections.Generic;
namespace NS
{
static class Program
{
public static IEnumerable<T> SimpleShuffle<T>(this IEnumerable<T> sequence)
{
var rand = new Random();
return sequence.Select(i => new {i, k=rand.Next()}).OrderBy(p => p.k).Select(p => p.i);
}
public static void Main(string[] args)
{
var pts = from x in Enumerable.Range(0, 25) // Range takes a count: x runs over 0..24
from y in Enumerable.Range(0, 12) // y runs over 0..11
select new { x, y };
foreach (var pt in pts.SimpleShuffle())
Console.WriteLine("{0},{1}", pt.x, pt.y);
}
}
}
I totally fixed my earlier problem of how to generate a good hash by realizing that we don't need a hash unless:
a. the source contains (logical) duplicates
b. and we need those to have equivalent sort order
c. and we want to have the same 'random' sort order (deterministic hashing) each time round
a. and b. are false in this case and c. was even going to be a problem (depending on what the OP was requiring). So now, without any strings attached, no more worries about performance (even the irrational worries),
Good luck!
[1] Incidentally, this makes the whole thing more flexible because I no longer require the coords to be expressed as a byte[]; you can now shuffle any structure you want.
Have a look at the Fisher-Yates Algorithm:
http://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle
It's easy to implement, and works really well.
It shuffles an array of digits, then you can access them sequentially if you like to ensure no repeats.
You might want to use a shuffle algorithm on a list of the indexes (e.g. 25 elements with the values 0..24 for the X axis) instead of random.
By design random doesn't guarantee that no value is repeated; repetitions are very likely.
See also: Optimal LINQ query to get a random sub collection - Shuffle (look for the Fisher-Yates-Durstenfeld solution)
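Applied to the image from the question, a Fisher-Yates shuffle could be sketched as follows: list every (x, y) coordinate once, shuffle the list in place, then walk it sequentially. PixelOrder and Shuffled are names made up for this example.

```csharp
using System;
using System.Collections.Generic;

public static class PixelOrder
{
    // Returns every (x, y) of a width-by-height image exactly once, in random order.
    public static List<(int X, int Y)> Shuffled(int width, int height, Random rng)
    {
        var points = new List<(int X, int Y)>(width * height);
        for (int x = 0; x < width; x++)
            for (int y = 0; y < height; y++)
                points.Add((x, y));

        // Fisher-Yates (Durstenfeld variant): swap slot i with a random slot in 0..i.
        for (int i = points.Count - 1; i > 0; i--)
        {
            int j = rng.Next(i + 1); // upper bound is exclusive, so j can equal i
            (points[i], points[j]) = (points[j], points[i]);
        }
        return points;
    }
}
```

For the 25 by 12 image, PixelOrder.Shuffled(25, 12, new Random()) yields all 300 pixels with no repeats and none missing.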
I also believe Random should not be predictable, and we shouldn't even be able to predict that a value will not repeat.
But I think it is sometimes necessary to get non-repeating random ints, and for that we need to maintain state: for a particular instance of the Random class, which values have already been returned.
Here is a small quick-and-dirty implementation of an algorithm I thought of just now; I am not sure if it is the same as the Fisher-Yates solution. I wrote this class so that you can use it instead of the System.Random class.
It may help with your requirement; use the NonRepeatingRandom class below as you need...
class NonRepeatingRandom : Random
{
private HashSet<int> _usedValues = new HashSet<int>();
public NonRepeatingRandom()
{
}
public NonRepeatingRandom(int seed):base(seed)
{
}
public override int Next(int minValue, int maxValue)
{
int rndVal = base.Next(minValue, maxValue);
if (_usedValues.Contains(rndVal))
{
int oldRndVal = rndVal;
do
{
rndVal++;
} while (rndVal < maxValue && _usedValues.Contains(rndVal));
if (rndVal == maxValue) // maxValue is exclusive, so the upward scan found nothing free
{
rndVal = oldRndVal;
do
{
rndVal--;
} while (_usedValues.Contains(rndVal) && rndVal >= minValue);
if (rndVal == minValue - 1)
{
throw new ApplicationException("Cannot get non repeating random for provided range.");
}
}
}
_usedValues.Add(rndVal);
return rndVal;
}
}
Please note that only the "Next" method is overridden and not the others; if you want, you can override the other methods of the "Random" class too.
Ps. Just before clicking "Post Your Answer" I saw sehe's answer, I liked his overall idea, but to hash 2 bytes, creating a 16 byte hash? or am I missing something? In my code I am using HashSet which uses int's implementation of GetHashCode method, which is nothing but that value of int itself so no overhead of hashing. But I could be missing some point as it is 3:59 AM here in India :)
Hope it helps salamonti...
The whole point of random numbers is that you do get repeats.
However, if you want to make sure you don't then remove the last chosen value from your array before picking the next number. So if you have a list of numbers:
index = random.Next(0, originalList.Count);
randomisedList.Add(originalList[index]);
originalList.RemoveAt(index);
The list will be randomised yet contain no repeats.
Instead of creating image through two one-dimensional arrays you should create an image through one two-dimensional matrix. Each time you get new random coordinate check if that pixel is already set. If it is then repeat the procedure for that pixel.

how to generate a voucher code in c#?

I need to generate a voucher code [5 to 10 digits] for one-time use only. What is the best way to generate it and check whether it has been used?
edited: I would prefer alpha-numeric characters - amazon like gift voucher codes that must be unique.
When generating voucher codes - you should consider whether having a sequence which is predictable is really what you want.
For example, Voucher Codes: ABC101, ABC102, ABC103 etc are fairly predictable. A user could quite easily guess voucher codes.
To protect against this - you need some way of preventing random guesses from working.
Two approaches:
Embed a checksum in your voucher codes.
The last number on a credit card is a checksum (check digit): adding up the other digits in a certain way lets you verify that someone has entered the number correctly. See: http://www.beachnet.com/~hstiles/cardtype.html (first link out of google) for how this is done for credit cards.
Have a large key-space, that is only sparsely populated.
For example, if you want to generate 1,000 vouchers - then a key-space of 1,000,000 means you should be able to use random-generation (with duplicate and sequential checking) to ensure it's difficult to guess another voucher code.
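For the checksum approach, the check digit on credit cards is computed with the Luhn algorithm. The sketch below shows how a numeric voucher could carry such a digit; the class name Luhn and the method names are chosen for this example. A check digit catches typing mistakes and most blind guesses, but it is not a secret, so it complements a sparse key-space rather than replacing it.

```csharp
using System;

public static class Luhn
{
    // Returns true when the digit string passes the Luhn check.
    public static bool IsValid(string digits)
    {
        if (string.IsNullOrEmpty(digits)) return false;
        int sum = 0;
        bool doubleIt = false; // every second digit, counted from the right, is doubled
        for (int i = digits.Length - 1; i >= 0; i--)
        {
            if (digits[i] < '0' || digits[i] > '9') return false;
            int d = digits[i] - '0';
            if (doubleIt)
            {
                d *= 2;
                if (d > 9) d -= 9; // same as adding the two digits of the product
            }
            sum += d;
            doubleIt = !doubleIt;
        }
        return sum % 10 == 0;
    }

    // Finds the check digit that makes payload + digit pass the Luhn check.
    public static int CheckDigit(string payload)
    {
        for (int c = 0; c <= 9; c++)
        {
            if (IsValid(payload + c)) return c;
        }
        throw new ArgumentException("Payload must contain only digits.", nameof(payload));
    }
}
```

A voucher would be issued as the payload followed by Luhn.CheckDigit(payload), and validated with Luhn.IsValid before the database lookup.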
Here's a sample app using the large key-space approach:
static Random random = new Random();
static void Main(string[] args)
{
int vouchersToGenerate = 10;
int lengthOfVoucher = 10;
List<string> generatedVouchers = new List<string>();
char[] keys = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789".ToCharArray(); // each character appears exactly once
Console.WriteLine("Vouchers: ");
while(generatedVouchers.Count < vouchersToGenerate)
{
var voucher = GenerateVoucher(keys, lengthOfVoucher);
if (!generatedVouchers.Contains(voucher))
{
generatedVouchers.Add(voucher);
Console.WriteLine("\t[#{0}] {1}", generatedVouchers.Count, voucher);
}
}
Console.WriteLine("done");
Console.ReadLine();
}
private static string GenerateVoucher(char[] keys, int lengthOfVoucher)
{
return Enumerable
.Range(1, lengthOfVoucher) // for(i.. )
.Select(k => keys[random.Next(keys.Length)]) // generate a new random char (upper bound is exclusive)
.Aggregate("", (e, c) => e + c); // join into a string
}
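Since vouchers are worth money, it may be worth drawing the characters from a cryptographic generator instead of System.Random, whose output can be predicted from its seed. The sketch below assumes .NET Core 3.0 or later, where the static RandomNumberGenerator.GetInt32 method is available; SecureVouchers and its methods are illustrative names, and a HashSet replaces the List so the duplicate check stays cheap.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;

public static class SecureVouchers
{
    private const string Alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";

    // Builds one voucher of the given length from cryptographically strong random picks.
    public static string Generate(int length)
    {
        var chars = new char[length];
        for (int i = 0; i < length; i++)
        {
            chars[i] = Alphabet[RandomNumberGenerator.GetInt32(Alphabet.Length)];
        }
        return new string(chars);
    }

    // Generates 'count' distinct vouchers; duplicates are simply drawn again.
    public static HashSet<string> GenerateUnique(int count, int length)
    {
        var issued = new HashSet<string>();
        while (issued.Count < count)
        {
            issued.Add(Generate(length)); // Add returns false and changes nothing on a duplicate
        }
        return issued;
    }
}
```

The issued set still has to be persisted somewhere so that redemption can be checked and recorded.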
Building on the answers from Will Hughes & Shekhar_Pro (and just because I found this question interesting) here's another implementation but I've been a bit liberal with your requirement for the length of the voucher code.
Using a base 32 encoder I found you can use the Tick value to generate alpha-numeric strings. The encoding of a tick count to base 32 produces a 13 character string which can be formatted to make it more readable.
public void GenerateCodes()
{
Random random = new Random();
DateTime timeValue = DateTime.MinValue;
// Create 10 codes just to see the random generation.
for(int i=0; i<10; ++i)
{
int rand = random.Next(3600)+1; // add one to avoid 0 result.
timeValue = timeValue.AddMinutes(rand);
byte[] b = System.BitConverter.GetBytes(timeValue.Ticks);
string voucherCode = Transcoder.Base32Encode(b);
Console.WriteLine(string.Format("{0}-{1}-{2}",
voucherCode.Substring(0,4),
voucherCode.Substring(4,4),
voucherCode.Substring(8,5)));
}
}
Here's the output
AARI-3RCP-AAAAA
ACOM-AAZF-AIAAA
ABIH-LV7W-AIAAA
ADPL-26FL-AMAAA
ABBL-W6LV-AQAAA
ADTP-HFIR-AYAAA
ACDG-JH5K-A4AAA
ADDE-GTST-BEAAA
AAWL-3ZNN-BIAAA
AAGK-4G3Y-BQAAA
If you use a known seed for the Random object and remember how many codes you have already created you can continue to generate codes; e.g. if you need more codes and want to be certain you won't generate duplicates.
Here's one way: Generate a bunch of unique numbers between 10000 and 9999999999 put it in a database. Every time you give one to a user, mark it as used (or delete it if you're trying to save space).
EDIT: Generate the unique alpha-numeric values in the beginning. You'll probably have to keep them around for validation (as others have pointed out).
If your app is limited to numerical digits only, then I think timestamps (DateTime.Now.Ticks) can be a good way to get a unique code every time. You can use random numbers, but that has the overhead of checking whether each number has already been issued. If you can use letters as well, then surely go with a GUID.
For checking whether it has been used, you need to maintain a database and query it to check for validity.
If you prefer alphanumerical, you could use Guid.NewGuid() method:
Guid g = Guid.NewGuid();
Random rn = new Random();
string gs = g.ToString();
int randomInt = rn.Next(5,10+1);
Console.WriteLine(gs.Substring(gs.Length - randomInt - 1, randomInt));
To check that it was not used, store previously generated codes somewhere and compare.
private void AutoPurchaseVouNo1()
{
try
{
int Num = 0;
con.Close();
con.Open();
string incre = "SELECT MAX(VoucherNoint+1) FROM tbl_PurchaseAllCompany";
SqlCommand command = new SqlCommand(incre, con);
if (Convert.IsDBNull(command.ExecuteScalar()))
{
Num = 100;
txtVoucherNoInt1.Text = Convert.ToString(Num);
txtVoucherNo1.Text = Convert.ToString("ABC" + Num);
}
else
{
Num = (int)(command.ExecuteScalar());
txtVoucherNoInt1.Text = Convert.ToString(Num);
txtVoucherNo1.Text = Convert.ToString("ABC" + Num);
}
con.Close();
}
catch (Exception ex)
{
MessageBox.Show("Error: " + ex, "Error !!", MessageBoxButtons.OK, MessageBoxIcon.Error);
}
}
Try this method for creating Voucher Number like ABC100, ABC101, ABC102, etc.
Try this code
var random = new Random();
var chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
var stringChars = new char[15];
for (int i = 0; i < stringChars.Length; i++)
{
stringChars[i] = chars[random.Next(chars.Length)];
}
var voucherCode = new string(stringChars);

Compare each element in an array to each other

I need to compare a 1-dimensional array, in that I need to compare each element of the array with each other element. The array contains a list of strings sorted from longest to the shortest. No 2 items in the array are equal however there will be items with the same length. Currently I am making N*(N+1)/2 comparisons (127.8 Billion) and I'm trying to reduce the number of over all comparisons.
I have implemented a feature that basically says: If the strings differ in length by more than x percent, then don't bother: they are not equal, AND the strings below it aren't equal either, so just break the loop and move on to the next element.
I am currently trying to further reduce this by saying that: If element A matches element C and D then it stands to reason that elements C and D would also match so don't bother checking them (i.e. skip that operation). This is as far as I've factored since I don't currently know of a data structure that will allow me to do that.
The question here is: Does anyone know of such a data structure? or Does anyone know how I can further reduce my comparisons?
My current implementation is estimated to take 3.5 days to complete in a time window of 10 hours (i.e. it's too long), and my only options left are either to reduce the execution time, which may or may not be possible, or distribute the workload across dozens of systems, which may not be practical.
Update: My bad. Replace the word equal with closely matches with. I'm calculating the Levenshtein distance.
The idea is to find out if there are other strings in the array which closely match each element in the array. The output is a database mapping of the strings that were closely related.
Here is the partial code from the method. Prior to executing this code block there is code that loads items into the database.
public static void RelatedAddressCompute() {
TableWipe("RelatedAddress");
decimal _requiredDistance = Properties.Settings.Default.LevenshteinDistance;
SqlConnection _connection = new SqlConnection(Properties.Settings.Default.AML_STORE);
_connection.Open();
string _cacheFilter = "LevenshteinCache NOT IN ('','SAMEASABOVE','SAME')";
SqlCommand _dataCommand = new SqlCommand(@"
SELECT
COUNT(DISTINCT LevenshteinCache)
FROM
Address
WHERE
" + _cacheFilter + @"
AND
LEN(LevenshteinCache) > 12", _connection);
_dataCommand.CommandTimeout = 0;
int _addressCount = (int)_dataCommand.ExecuteScalar();
_dataCommand = new SqlCommand(@"
SELECT
Data.LevenshteinCache,
Data.CacheCount
FROM
(SELECT
DISTINCT LevenshteinCache,
COUNT(LevenshteinCache) AS CacheCount
FROM
Address
WHERE
" + _cacheFilter + @"
GROUP BY
LevenshteinCache) Data
WHERE
LEN(LevenshteinCache) > 12
ORDER BY
LEN(LevenshteinCache) DESC", _connection);
_dataCommand.CommandTimeout = 0;
SqlDataReader _addressReader = _dataCommand.ExecuteReader();
string[] _addresses = new string[_addressCount + 1];
int[] _addressInstance = new int[_addressCount + 1];
int _itemIndex = 1;
while (_addressReader.Read()) {
string _address = (string)_addressReader[0];
int _count = (int)_addressReader[1];
_addresses[_itemIndex] = _address;
_addressInstance[_itemIndex] = _count;
_itemIndex++;
}
_addressReader.Close();
decimal _comparasionsMade = 0;
decimal _comparisionsAttempted = 0;
decimal _comparisionsExpected = (decimal)_addressCount * ((decimal)_addressCount + 1) / 2;
decimal _percentCompleted = 0;
DateTime _startTime = DateTime.Now;
Parallel.For(1, _addressCount, delegate(int i) {
for (int _index = i + 1; _index <= _addressCount; _index++) {
_comparisionsAttempted++;
decimal _percent = _addresses[i].Length < _addresses[_index].Length ? (decimal)_addresses[i].Length / (decimal)_addresses[_index].Length : (decimal)_addresses[_index].Length / (decimal)_addresses[i].Length;
if (_percent < _requiredDistance) {
decimal _difference = new Levenshtein().threasholdiLD(_addresses[i], _addresses[_index], 50);
_comparasionsMade++;
if (_difference <= _requiredDistance) {
InsertRelatedAddress(ref _connection, _addresses[i], _addresses[_index], _difference);
}
}
else {
_comparisionsAttempted += _addressCount - _index;
break;
}
}
if (_addressInstance[i] > 1 && _addressInstance[i] < 31) {
InsertRelatedAddress(ref _connection, _addresses[i], _addresses[i], 0);
}
_percentCompleted = (_comparisionsAttempted / _comparisionsExpected) * 100M;
TimeSpan _estimatedDuration = new TimeSpan((long)((((decimal)(DateTime.Now - _startTime).Ticks) / _percentCompleted) * 100));
TimeSpan _timeRemaining = _estimatedDuration - (DateTime.Now - _startTime);
string _timeRemains = _timeRemaining.ToString();
});
}
InsertRelatedAddress is a function that updates the database, and there are 500,000 items in the array.
OK. With the updated question, I think it makes more sense. You want to find pairs of strings with a Levenshtein distance less than a preset distance. I think the key is that you don't compare every pair of strings and instead rely on the properties of Levenshtein distance to search for strings within your preset limit. The answer involves computing the tree of possible changes. That is, compute the possible changes to a given string with distance < n and see if any of those strings are in your set. I suppose this is only faster if n is small.
It looks like the question posted here: Finding closest neighbour using optimized Levenshtein Algorithm.
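One property worth exploiting in any of these approaches is that the Levenshtein distance between two strings is never smaller than the difference of their lengths, and that the smallest value in a row of the dynamic-programming table never decreases from one row to the next. The sketch below uses both facts to stop early once a limit is exceeded. It is an illustration only: EditDistance.Bounded is a made-up name, not the threasholdiLD method from the question, and it takes an absolute distance limit rather than a percentage.

```csharp
using System;

public static class EditDistance
{
    // Returns the Levenshtein distance, or -1 as soon as it is known to exceed 'limit'.
    public static int Bounded(string a, string b, int limit)
    {
        // The distance is at least the length difference, so this pair can be skipped outright.
        if (Math.Abs(a.Length - b.Length) > limit) return -1;

        var prev = new int[b.Length + 1];
        var curr = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) prev[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            curr[0] = i;
            int rowMin = curr[0];
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                curr[j] = Math.Min(Math.Min(prev[j] + 1, curr[j - 1] + 1), prev[j - 1] + cost);
                rowMin = Math.Min(rowMin, curr[j]);
            }
            // Later rows cannot contain anything smaller than this row's minimum.
            if (rowMin > limit) return -1;
            (prev, curr) = (curr, prev);
        }
        return prev[b.Length] <= limit ? prev[b.Length] : -1;
    }
}
```

For example, EditDistance.Bounded("kitten", "sitting", 5) returns 3, while the same call with a limit of 2 returns -1.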
More info required. What is your desired outcome? Are you trying to get a count of all unique strings? You state that you want to see if 2 strings are equal and that if 'they are different in length by x percent then don't bother they not equal'. Why are you checking with a constraint on length by x percent? If you're checking for them to be equal they must be the same length.
I suspect you are trying to do something slightly different from determining an exact match, in which case I need more info.
Thanks
Neil
