Cut and take new line String with maxchar - c#

I tried to create a function for print all the articles of a billing with some max length vars, but when this max length is exceeded Index out of range errors appears or in another cases total and quantity Doesn't appear in line:
Max Length variables:
private Int32 MaxCharPage = 36;
private Int32 MaxCharArticleName = 15;
private Int32 MaxCharArticleQuantity = 4;
private Int32 MaxCharArticleSellPrice = 6;
private Int32 MaxCharArticleTotal = 8;
private Int32 MaxCharArticleLineSpace = 1;
This is my current function:
private IEnumerable<String> ArticleLine(Double Quantity, String Name, Double SellPrice, Double Total)
{
String QuantityChunk = (String)(Quantity).ToString("#,##0.00");
String PriceChunk = (String)(SellPrice).ToString("#,##0.00");
String TotalChunk = (String)(Total).ToString("#,##0.00");
// full chunks with "size" length
for (int i = 0; i < Name.Length / this.MaxCharArticleName; ++i)
{
String Chunk = String.Empty;
if (i == 0)
{
Chunk = QuantityChunk + new String((Char)32, (MaxCharArticleQuantity + MaxCharArticleLineSpace - QuantityChunk.Length)) +
Name.Substring(i * this.MaxCharArticleName, this.MaxCharArticleName) +
new String((Char)32, (MaxCharArticleLineSpace)) +
new String((Char)32, (MaxCharArticleSellPrice - PriceChunk.Length)) +
PriceChunk +
new String((Char)32, (MaxCharArticleLineSpace)) +
new String((Char)32, (MaxCharArticleTotal - TotalChunk.Length)) +
TotalChunk;
}
else
{
Chunk = new String((Char)32, (MaxCharArticleQuantity + MaxCharArticleLineSpace)) +
Name.Substring(i * this.MaxCharArticleName, this.MaxCharArticleName);
}
yield return Chunk;
}
if (Name.Length % this.MaxCharArticleName > 0)
{
String chunk = Name.Substring(Name.Length / this.MaxCharArticleName * this.MaxCharArticleName);
yield return new String((Char)32, (MaxCharArticleQuantity + MaxCharArticleLineSpace)) + chunk;
}
}
private void AddArticle(Double Quantity, String Name, Double SellPrice, Double Total)
{
Lines.AddRange(ArticleLine(Quantity, Name, SellPrice, Total).ToList());
}
For example:
private List<String> Lines = new List<String>();
private void Form1_Load(object sender, EventArgs e)
{
AddArticle(2.50, "EGGS", 0.50, 1.25); // Problem: Dont show the numbers like (Quantity, Sellprice, Total)
//AddArticle(100.52, "HAND SOAP /w EP", 5.00, 502.60); //OutOfRangeException
AddArticle(105.6, "LONG NAME ARTICLE DESCRIPTION", 500.03, 100.00);
//AddArticle(100, "LONG NAME ARTICLE DESCRIPTION2", 1500.03, 150003.00); // OutOfRangeException
foreach (String line in Lines)
{
Console.WriteLine(line);
}
}
Console Output:
LINE1: EGGS
LINE2:105.6LONG NAME ARTIC 500.03 100.00
LINE3: LE DESCRIPTION
Desired output:
LINE:2.50 EGGS 0.50 1.25
LINE:100. HAND SOAP /w EP 5.00 502.60
LINE: 52
LINE:105. LONG NAME ARTIC 500.03 100.00
LINE: 60 LE DESCRIPTION
LINE:100. LONG NAME ARTIC 1,500. 150,003.
LINE: 00 LE DESCRIPTION2 03 00

You are attempting to take values (string values, ultimately) and extract specified lengths from them (chunks), where the each individual chunk of a given group goes on one line, and the next chunk(s) go on the next line(s). There are several challenges in your posted code. One of the biggest is that you may not understand what the / operator does. It is, of course, used for division, but it is integer division, meaning you get a whole number as the result. E.g. 3 / 15 gives you 0, whereas 17 / 15 would give you 1. This means your loop never runs unless the length of the value is greater than the specified chunk limit.
Another possible issue is that you only perform this check against Name, not the other items (though you may have omitted the code for them for brevity reasons).
Your current code is also creating a lot of unnecessary strings, which will lead to performance degradation. Remember that in C# strings are immutable - meaning they cannot be changed. When you "change" the value of a string, you are actually creating another copy of the string.
The crux of your requirement is how to "chunk" the values in such a way that the correct output is achieved. One way to do this is to create an extension method that will take a string value and "chunkify" it for you. One such example is found here: Split String Into Array of Chunks. I've modified it slightly to use a List<T>, but an array would work just as well.
public static List<string> SplitIntoChunks(this string toSplit, int chunkSize)
{
int stringLength = toSplit.Length;
int chunksRequired = (int)Math.Ceiling(decimal)stringLength / (decimal)chunkSize);
List<string> chunks = new List<string>();
int lengthRemaining = stringLength;
for (int i = 0; i < chunksRequired; i++)
{
int lengthToUse = Math.Min(lengthRemaining, chunkSize);
int startIndex = chunkSize * i;
chunks.Add(toSplit.Substring(startIndex, lengthToUse));
lengthRemaining = lenghtRemaining - lengthToUse;
}
return chunks;
}
Say you have a string named myString. You would use the above method like this: string[] chunks = myString.SplitIntoChunks(15);, and you would receive an array of 15 character strings (depending on the size of the string).
A quick walk-through of the code (as there is not much explanation on the page). The size of the chunk is passed into the extension method. The length of the string is recorded, and the number of chunks for that string is determined using the Math.Ceiling function.
Then a for loop is constructed with the number of chunks required as the limit. Inside the loop, the length of the chunk is determined (using the lower of either the chunk size or the remaining length of the string), the starting index is calculated based on the chunk size and the loop index, and then the chunk is extracted via Substring. Finally remaining length is calculated, and once the loop ends the chunks are returned.
One way to use this extension method in your code would look like this. The extension method will need to be in a separate, static class (I suggest building a library that has a class dedicated solely to extension methods, as they come in quite handy). Note that I haven't had time to test this, and it's a bit kludgy for my tastes, but it should at least get you going in the right direction.
private IEnumerable<string> ArticleLine(double quantity, string name, double sellPrice, double total)
{
List<string> quantityChunks = quantity.ToString("#,##0.00").SplitIntoChunks(maxCharArticleQuantity);
List<string> nameChunks = name.SplitIntoChunks(maxCharArticleName);
List<string> sellPriceChunks = sellPrice.ToString("#,##0.00").SplitIntoChunks(maxCharArticleSellPrice);
List<string> totalChunks = total.ToString("#,##0.00").SplitIntoChunks(maxCharArticleTotal);
int maxLines = (new List<int>() { quantityChunks.Count,
nameChunks.Count,
sellPriceChunks.Count,
totalChunks.Count }).Max();
for (int i = 0; i < maxLines; i++)
{
lines.Add(String.Format("{0}{1}{2}{3}",
quantityChunks.Count > i ?
quantityChunks[i].PadLeft(maxCharArticleQuantity) :
String.Empty.PadLeft(maxCharArticleQuantity),
nameChunks.Count > i ?
nameChunks[i].PadLeft(maxCharArticleName) :
String.Empty.PadLeft(maxCharArticleName, ' '),
sellPriceChunks.Count > i ?
sellPriceChunks[i].PadLeft(maxCharArticleSellPrice) :
String.Empty.PadeLeft(maxCharArticleSellPrice),
totalChunks.Count > i ?
totalChunks[i].PadLeft(maxCharArticleTotal) :
String.Empty.PadLeft(maxCharArticleTotal));
}
return lines;
}
The above code does a couple of things. First, it calls SplitIntoChunks on each of the four variables.
Next, it uses the Max() extension method (you'll need to add a reference to System.Linq if you don't already have one, plus a using directive). to determine the maximum number of lines needed (based on the highest count of the four lists).
Finally, it uses a 4for loop to build the lines. Note that I use the ternary operator (?:) to build each line. The reason I went this route is so that the lines would be properly formatted, even if one or more of the 4 values didn't have something for that line. In other words:
quantityChunks.Count > i
If the count of items in quantityChunks is greater than i, then there is an entry for that item for that line.
quanityChunks[i].PadLeft(maxCharArticleQuantity, ' ')
If the condition evaluates to true, we use the corresponding entry (and pad it with spaces to keep alignment).
String.Empty.PadLeft(maxCharArticleQuantity, ' ')
If the condition is false, we simply put in spaces for the maximum number of characters for that position in the line.
Then you can print the output like this:
foreach(string line in lines)
{
Console.WriteLine(line);
}
The portions of the code where I check for the maximum number of lines and the use of a ternary operator in the String.Format feel very kludgy to me, but I don't have time to finesse it into something more respectable. I hope this at least points you in the right direction.
EDIT
Fixed the PadLeft syntax and removed the character specified, as space is used by default.

Related

Dice Sorensen Distance error calculating Bigrams without using Intersect method

I have been programming an object to calculate the DiceSorensen Distance between two strings. The logic of the operation is not so difficult. You calculate how many two letter pairs exist in a string, compare it with a second string and then perform this equation
2(x intersect y)/ (|x| . |y|)
where |x| and |y| is the number of bigram elements in x & y. Reference can be found here for further clarity https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient
So I have tried looking up how to do the code online in various spots but every method I have come across uses the 'Intersect' method between two lists and as far as I am aware this won't work because if you have a string where the bigram already exists it won't add another one. For example if I had a string
'aaaa'
I would like there to be 3 'aa' bigrams but the Intersect method will only produce one, if i am incorrect on this assumption please tell me cause i wondered why so many people used the intersect method. My assumption is based on the MSDN website https://msdn.microsoft.com/en-us/library/bb460136(v=vs.90).aspx
So here is the code I have made
public static double SorensenDiceDistance(this string source, string target)
{
// formula 2|X intersection Y|
// --------------------
// |X| + |Y|
//create variables needed
List<string> bigrams_source = new List<string>();
List<string> bigrams_target = new List<string>();
int source_length;
int target_length;
double intersect_count = 0;
double result = 0;
Console.WriteLine("DEBUG: string length source is " + source.Length);
//base case
if (source.Length == 0 || target.Length == 0)
{
return 0;
}
//extract bigrams from string 1
bigrams_source = source.ListBiGrams();
//extract bigrams from string 2
bigrams_target = target.ListBiGrams();
source_length = bigrams_source.Count();
target_length = bigrams_target.Count();
Console.WriteLine("DEBUG: bigram counts are source: " + source_length + " . target length : " + target_length);
//now we have two sets of bigrams compare them in a non distinct loop
for (int i = 0; i < bigrams_source.Count(); i++)
{
for (int y = 0; y < bigrams_target.Count(); y++)
{
if (bigrams_source.ElementAt(i) == bigrams_target.ElementAt(y))
{
intersect_count++;
//Console.WriteLine("intersect count is :" + intersect_count);
}
}
}
Console.WriteLine("intersect line value : " + intersect_count);
result = (2 * intersect_count) / (source_length + target_length);
if (result < 0)
{
result = Math.Abs(result);
}
return result;
}
In the code somewhere you can see I call a method called listBiGrams and this is how it looks
public static List<string> ListBiGrams(this string source)
{
return ListNGrams(source, 2);
}
public static List<string> ListTriGrams(this string source)
{
return ListNGrams(source, 3);
}
public static List<string> ListNGrams(this string source, int n)
{
List<string> nGrams = new List<string>();
if (n > source.Length)
{
return null;
}
else if (n == source.Length)
{
nGrams.Add(source);
return nGrams;
}
else
{
for (int i = 0; i < source.Length - n; i++)
{
nGrams.Add(source.Substring(i, n));
}
return nGrams;
}
}
So my understanding of the code step by step is
1) pass in strings
2) 0 length check
3) create list and pass up bigrams into them
4) get the lengths of each bigram list
5) nested loop to check in source position[i] against every bigram in target string and then increment i until no more source list to check against
6) perform equation mentioned above taken from wikipedia
7) if result is negative Math.Abs it to return a positive result (however i know the result should be between 0 and 1 already this is what keyed me into knowing i was doing something wrong)
the source string i used is source = "this is not a correct string" and the target string was, target = "this is a correct string"
the result I got was -0.090909090908
I'm SURE (99%) that what I'm missing is something small like a mis-calculated length somewhere or a count mis-count. If anyone could point out what i'm doing wrong I'd be really grateful. Thank you for your time!
This looks like homework, yet this similarity metric on strings is new to me so I took a look.
Algorith implementation in various languages
As you may notice the C# version uses HashSet and takes advantage of the IntersectWith method.
A set is a collection that contains no duplicate elements, and whose
elements are in no particular order.
This solves your string 'aaaa' puzzle. Only one bigram there.
My naive implementation on Rextester
If you prefer Linq then I'd suggest Enumerable.Distinct, Enumerable.Union and Enumerable.Intersect. These should mimic very well the duplicate removal capabilities of the HashSet.
Also found this nice StringMetric framework written in Scala.

What is the Fastest way to split ':' seperated string into given number of chunks where result/record length is variable

I have a large string accepted from TCP listner which is in following format
"1,7620257787,0123456789,99,0922337203,9223372036,32.5455,87,12.7857,1/1/2012,9223372036:1,7620257787,0123456789,99,0922337203,9223372036,32.5455,87,12.7857:2/1/2012,234234234:3,7620257787,01234343456789,99,0922337203,9223372036,32.5455,87,12.7857,1/1/2012,9223372036:34,76202343457787,012434343456789,93339,34340922337203,9223372036,32.5455,87,12.7857,1/1/2012,9223372036"
You can see that this is a : seperated string which contains Records which are comma seperated fields.
I am looking for the best (fastest) way that split the string in given number of chunks and take care that one chunk should contain full record (string upto ':')
or other way of saying , there should not be any chunck which is not ending with :
e.g. 20 MB string to 4 chunks of 5 MB each with proper records (thus size of each chunk may not be exactly 5 MB but very near to it and total of all 4 chunks will be 20 MB)
I hope you can understand my question (sorry for the bad english)
I like the following link , but it does not take care of full record while spliting also don't know if that is the best and fastest way.
Split String into smaller Strings by length variable
I don't know how large a 'large string' is, but initially I would just try it with the String.Split method.
The idea is to divide the lenght of your data for the num of blocks required, then look backwards to search the last sep in the current block.
private string[] splitToBlocks(string data, int numBlocks, char sep)
{
// We return an array of the request length
if (numBlocks <= 1 || data.Length == 0)
{
return new string [] { data };
}
string[] result = new string[numBlocks];
// The optimal size of each block
int blockLen = (data.Length / numBlocks);
int idx = 0; int pos = 0; int lastSepPos = blockLen;
while (idx < numBlocks)
{
// Search backwards for the first sep starting from the lastSepPos
char c = data[lastSepPos];
while (c != sep) { lastSepPos--; c = data[lastSepPos]; }
// Get the block data in the result array
result[idx] = data.Substring(pos, (lastSepPos + 1) - pos);
// Reposition for then next block
idx++;
pos = lastSepPos + 1;
if(idx == numBlocks-1)
lastSepPos = data.Length - 1;
else
lastSepPos = blockLen * (idx + 1);
}
return result;
}
Please test it. I have not fully tested for fringe cases.
OK, I suggest you way with two steps:
Split string into chunks (see below)
Check chunks for completeness
Splitting string into chunks with help of linq (linq extension method taked from Split a collection into `n` parts with LINQ? ):
string tcpstring = "chunk1 : chunck2 : chunk3: chunk4 : chunck5 : chunk6";
int numOfChunks = 4;
var chunks = (from string z in (tcpstring.Split(':').AsEnumerable()) select z).Split(numOfChunks);
List<string> result = new List<string>();
foreach (IEnumerable<string> chunk in chunks)
{
result.Add(string.Join(":",chunk));
}
.......
static class LinqExtensions
{
public static IEnumerable<IEnumerable<T>> Split<T>(this IEnumerable<T> list, int parts)
{
int i = 0;
var splits = from item in list
group item by i++ % parts into part
select part.AsEnumerable();
return splits;
}
}
Am I understand your aims clearly?
[EDIT]
In my opinion, In case of performance consideration, better way to use String.Split method for chunking
It seems you want to split on ":" (you can use the Split method).
Then you have to add ":" after splitting to each chunk that has been split.
(you can then split on "," for all the strings that have been split by ":".
int index = yourstring.IndexOf(":");
string[] whatever = string.Substring(0,index);
yourstring = yourstring.Substring(index); //make a new string without the part you just cut out.
this is a general view example, all you need to do is establish an iteration that will run while the ":" character is encountered; cheers...

Reading an int from a file returns ASCII

I'm currently making a game but I seem to have problems reading values from a text file. For some reason, when I read the value, it gives me the ASCII code of the value rather than the actual value itself when I wrote it to the file. I've tried about every ASCII conversion function and string conversion function, but I just can't seem to figure it out.
I use a 2D array of integers. I use a nested for loop to write each element into the file. I've looked at the file and the values are correct, but I don't understand why it's returning the ASCII code. Here's the code I'm using to write and read to file:
Writing to file:
for (int i = 0; i < level.MaxRows(); i++)
{
for (int j = 0; j < level.MaxCols(); j++)
{
fileWrite.Write(level.GetValueAtIndex(i, j) + " ");
//Console.WriteLine(level.GetValueAtIndex(i, j));
}
//add new line
fileWrite.WriteLine();
}
And here's the code where I read the values from the file:
string str = "";
int iter = 0; //used to iterate in each column of array
for (int i = 0; i < level.MaxRows(); i++)
{
iter = 0;
//TODO: For some reason, the file is returning ASCII code, convert to int
//keep reading characters until a space is reached.
str = fileRead.ReadLine();
//take the above string and extract the values from it.
//Place each value in the level.
foreach (char id in str)
{
if (id != ' ')
{
//convert id to an int
num = (int)id;
level.ChangeTile(i, iter, num);
iter++;
}
}
This is the latest version of the loop that I use to read the values. Reading other values is fine; it's just when I get to the array, things go wrong. I guess my question is, why did the conversion to ASCII happen? If I can figure that out, then I might be able to solve the issue. I'm using XNA 4 to make my game.
This is where the convertion to ascii is happening:
fileWrite.Write(level.GetValueAtIndex(i, j) + " ");
The + operator implicitly converts the integer returned by GetValueAtIndex into a string, because you are adding it to a string (really, what did you expect to happen?)
Furthermore, the ReadLine method returns a String, so I am not sure why you'd expect a numeric value to magically come back here. If you want to write binary data, look into BinaryWriter
This is where you are converting the characters to character codes:
num = (int)id;
The id variable is a char, and casting that to int gives you the character code, not the numeric value.
Also, this converts a single character, not a whole number. If you for example have "12 34 56 " in your text file, it will get the codes for 1, 2, 3, 4, 5 and 6, not 12, 34 and 56.
You would want to split the line on spaces, and parse each substring:
foreach (string id in str.Split(' ')) {
if (id.Length > 0) {
num = Int32.Parse(id);
level.ChangeTile(i, iter, num);
iter++;
}
}
Update: I've kept the old code (below) with the assumption that one record was on each line, but I've also added a different way of doing it that should work with multiple integers on a line, separated by a space.
Multiple records on one line
str = fileRead.ReadLine();
string[] values = str.Split(new Char[] {' '});
foreach (string value in values)
{
int testNum;
if (Int32.TryParse(str, out testnum))
{
// again, not sure how you're using iter here
level.ChangeTile(i, iter, num);
}
}
One record per line
str = fileRead.ReadLine();
int testNum;
if (Int32.TryParse(str, out testnum))
{
// however, I'm not sure how you're using iter here; if it's related to
// parsing the string, you'll probably need to do something else
level.ChangeTile(i, iter, num);
}
Please note that the above should work if you write out each integer line-by-line (i.e. how you were doing it via the WriteLine which you remarked out in your code above). If you switch back to using a WriteLine, this should work.
You have:
foreach (char id in str)
{
//convert id to an int
num = (int)id;
A char is an ASCII code (or can be considered as such; technically it is a unicode code-point, but that is broadly comparable assuming you are writing ANSI or low-value UTF-8).
What you want is:
num = (int)(id - '0');
This:
fileWrite.Write(level.GetValueAtIndex(i, j) + " ");
converts the int returned from level.GetValueAtIndex(i, j) into a string. Assuming the function returns the value 5 for a particular i and j then you write "5 " into the file.
When you then read it is being read as a string which consists of chars and you get the ASCII code of 5 when you cast it simply to an int. What you need is:
foreach (char id in str)
{
if (id != ' ')
{
//convert id to an int
num = (int)(id - '0'); // subtract the ASCII value for 0 from your current id
level.ChangeTile(i, iter, num);
iter++;
}
}
However this only works if you only ever are going to have single digit integers (only 0 - 9). This might be better:
foreach (var cell in fileRead.ReadLine().Split(' '))
{
num = Int.Parse(cell);
level.ChangeTile(i, iter, num);
iter++;
}

Compare each element in an array to each other

I need to compare a 1-dimensional array, in that I need to compare each element of the array with each other element. The array contains a list of strings sorted from longest to the shortest. No 2 items in the array are equal however there will be items with the same length. Currently I am making N*(N+1)/2 comparisons (127.8 Billion) and I'm trying to reduce the number of over all comparisons.
I have implemented a feature that basically says: If the strings are different in length by more than x percent then don't bother they not equal, AND the other guys below him aren't equal either so just break the loop and move on to the next element.
I am currently trying to further reduce this by saying that: If element A matches element C and D then it stands to reason that elements C and D would also match so don't bother checking them (i.e. skip that operation). This is as far as I've factored since I don't currently know of a data structure that will allow me to do that.
The question here is: Does anyone know of such a data structure? or Does anyone know how I can further reduce my comparisons?
My current implementation is estimated to take 3.5 days to complete in a time window of 10 hours (i.e. it's too long) and my only options left are either to reduce the execution time, which may or may not be possible, or distrubute the workload accross dozens of systems, which may not be practical.
Update: My bad. Replace the word equal with closely matches with. I'm calculating the Levenstein distance
The idea is to find out if there are other strings in the array which closely matches with each element in the array. The output is a database mapping of the strings that were closely related.
Here is the partial code from the method. Prior to executing this code block there is code that loads items into the datbase.
public static void RelatedAddressCompute() {
TableWipe("RelatedAddress");
decimal _requiredDistance = Properties.Settings.Default.LevenshteinDistance;
SqlConnection _connection = new SqlConnection(Properties.Settings.Default.AML_STORE);
_connection.Open();
string _cacheFilter = "LevenshteinCache NOT IN ('','SAMEASABOVE','SAME')";
SqlCommand _dataCommand = new SqlCommand(#"
SELECT
COUNT(DISTINCT LevenshteinCache)
FROM
Address
WHERE
" + _cacheFilter + #"
AND
LEN(LevenshteinCache) > 12", _connection);
_dataCommand.CommandTimeout = 0;
int _addressCount = (int)_dataCommand.ExecuteScalar();
_dataCommand = new SqlCommand(#"
SELECT
Data.LevenshteinCache,
Data.CacheCount
FROM
(SELECT
DISTINCT LevenshteinCache,
COUNT(LevenshteinCache) AS CacheCount
FROM
Address
WHERE
" + _cacheFilter + #"
GROUP BY
LevenshteinCache) Data
WHERE
LEN(LevenshteinCache) > 12
ORDER BY
LEN(LevenshteinCache) DESC", _connection);
_dataCommand.CommandTimeout = 0;
SqlDataReader _addressReader = _dataCommand.ExecuteReader();
string[] _addresses = new string[_addressCount + 1];
int[] _addressInstance = new int[_addressCount + 1];
int _itemIndex = 1;
while (_addressReader.Read()) {
string _address = (string)_addressReader[0];
int _count = (int)_addressReader[1];
_addresses[_itemIndex] = _address;
_addressInstance[_itemIndex] = _count;
_itemIndex++;
}
_addressReader.Close();
decimal _comparasionsMade = 0;
decimal _comparisionsAttempted = 0;
decimal _comparisionsExpected = (decimal)_addressCount * ((decimal)_addressCount + 1) / 2;
decimal _percentCompleted = 0;
DateTime _startTime = DateTime.Now;
Parallel.For(1, _addressCount, delegate(int i) {
for (int _index = i + 1; _index <= _addressCount; _index++) {
_comparisionsAttempted++;
decimal _percent = _addresses[i].Length < _addresses[_index].Length ? (decimal)_addresses[i].Length / (decimal)_addresses[_index].Length : (decimal)_addresses[_index].Length / (decimal)_addresses[i].Length;
if (_percent < _requiredDistance) {
decimal _difference = new Levenshtein().threasholdiLD(_addresses[i], _addresses[_index], 50);
_comparasionsMade++;
if (_difference <= _requiredDistance) {
InsertRelatedAddress(ref _connection, _addresses[i], _addresses[_index], _difference);
}
}
else {
_comparisionsAttempted += _addressCount - _index;
break;
}
}
if (_addressInstance[i] > 1 && _addressInstance[i] < 31) {
InsertRelatedAddress(ref _connection, _addresses[i], _addresses[i], 0);
}
_percentCompleted = (_comparisionsAttempted / _comparisionsExpected) * 100M;
TimeSpan _estimatedDuration = new TimeSpan((long)((((decimal)(DateTime.Now - _startTime).Ticks) / _percentCompleted) * 100));
TimeSpan _timeRemaining = _estimatedDuration - (DateTime.Now - _startTime);
string _timeRemains = _timeRemaining.ToString();
});
}
InsertRelatedAddress is a function that updates the database, and there are 500,000 items in the array.
OK. With the updated question, I think it makes more sense. You want to find pairs of strings with a Levenshtein Distance less than a preset distance. I think the key is that you don't compare every set of strings and rely on the properties of Levenshtein distance to search for strings within your preset limit. The answer involves computing the tree of possible changes. That is, compute possible changes to a given string with distance < n and see if any of those strings are in your set. I supposed this is only faster if n is small.
It looks like the question posted here: Finding closest neighbour using optimized Levenshtein Algorithm.
More info required. What is your desired outcome? Are you trying to get a count of all unique strings? You state that you want to see if 2 strings are equal and that if 'they are different in length by x percent then don't bother they not equal'. Why are you checking with a constraint on length by x percent? If you're checking for them to be equal they must be the same length.
I suspect you are trying to something slightly different to determining an exact match in which case I need more info.
Thanks
Neil

How to parse a numbered sequence from a List of filenames?

I would like to automatically parse a range of numbered sequences from an already sorted List<FileData> of filenames by checking which part of the filename changes.
Here is an example (file extension has already been removed):
First filename: IMG_0000
Last filename: IMG_1000
Numbered Range I need: 0000 and 1000
Except I need to deal with every possible type of file naming convention such as:
0000 ... 9999
20080312_0000 ... 20080312_9999
IMG_0000 - Copy ... IMG_9999 - Copy
8er_green3_00001 .. 8er_green3_09999
etc.
I would like the entire 0-padded range e.g. 0001 not just 1
The sequence number is 0-padded e.g. 0001
The sequence number can be located anywhere e.g. IMG_0000 - Copy
The range can start and end with anything i.e. doesn't have to start with 1 and end with 9999
Numbers may appear multiple times in the filename of the sequence e.g. 20080312_0000
Whenever I get something working for 8 random test cases, the 9th test breaks everything and I end up re-starting from scratch.
I've currently been comparing only the first and last filenames (as opposed to iterating through all filenames):
void FindRange(List<FileData> files, out string startRange, out string endRange)
{
string firstFile = files.First().ShortName;
string lastFile = files.Last().ShortName;
...
}
Does anyone have any clever ideas? Perhaps something with Regex?
If you're guaranteed to know the files end with the number (eg. _\d+), and are sorted, just grab the first and last elements and that's your range. If the filenames are all the same, you can sort the list to get them in order numerically. Unless I'm missing something obvious here -- where's the problem?
Use a regex to parse out the numbers from the filenames:
^.+\w(\d+)[^\d]*$
From these parsed strings, find the maximum length, and left-pad any that are less than the maximum length with zeros.
Sort these padded strings alphabetically. Take the first and last from this sorted list to give you your min and max numbers.
Firstly, I will assume that the numbers are always zero-padded so that they are the same length. If not then bigger headaches lie ahead.
Secondly, assume that the file names are exactly the same apart from the increment number component.
If these assumptions are true then the algorithm should be to look at each character in the first and last filenames to determine which same-positioned characters do not match.
var start = String.Empty;
var end = String.Empty;
for (var index = 0; index < firstFile.Length; index++)
{
char c = firstFile[index];
if (filenames.Any(filename => filename[index] != c))
{
start += firstFile[index];
end += lastFile[index];
}
}
// convert to int if required
edit: Changed to check every filename until a difference is found. Not as efficient as it could be but very simple and straightforward.
Here is my solution. It works with all of the examples that you have provided and it assumes the input array to be sorted.
Note that it doesn't look exclusively for numbers; it looks for a consistent sequence of characters that might differ across all of the strings. So if you provide it with {"0000", "0001", "0002"} it will hand back "0" and "2" as the start and end strings, since that's the only part of the strings that differ. If you give it {"0000", "0010", "0100"}, it will give you back "00" and "10".
But if you give it {"0000", "0101"}, it will whine since the differing parts of the string are not contiguous. If you would like this behavior modified so it will return everything from the first differing character to the last, that's fine; I can make that change. But if you are feeding it a ton of filenames that will have sequential changes to the number region, this should not be a problem.
public static class RangeFinder
{
public static void FindRange(IEnumerable<string> strings,
out string startRange, out string endRange)
{
using (var e = strings.GetEnumerator()) {
if (!e.MoveNext())
throw new ArgumentException("strings", "No elements.");
if (e.Current == null)
throw new ArgumentException("strings",
"Null element encountered at index 0.");
var template = e.Current;
// If an element in here is true, it means that index differs.
var matchMatrix = new bool[template.Length];
int index = 1;
string last = null;
while (e.MoveNext()) {
if (e.Current == null)
throw new ArgumentException("strings",
"Null element encountered at index " + index + ".");
last = e.Current;
if (last.Length != template.Length)
throw new ArgumentException("strings",
"Element at index " + index + " has incorrect length.");
for (int i = 0; i < template.Length; i++)
if (last[i] != template[i])
matchMatrix[i] = true;
}
// Verify the matrix:
// * There must be at least one true value.
// * All true values must be consecutive.
int start = -1;
int end = -1;
for (int i = 0; i < matchMatrix.Length; i++) {
if (matchMatrix[i]) {
if (end != -1)
throw new ArgumentException("strings",
"Inconsistent match matrix; no usable pattern discovered.");
if (start == -1)
start = i;
} else {
if (start != -1 && end == -1)
end = i;
}
}
if (start == -1)
throw new ArgumentException("strings",
"Strings did not vary; no usable pattern discovered.");
if (end == -1)
end = matchMatrix.Length;
startRange = template.Substring(start, end - start);
endRange = last.Substring(start, end - start);
}
}
}

Categories

Resources