I have a huge dataset that I want to write into Excel, and I need to apply conditional formatting to rows based on business logic. For the data insertion part I am using a data array to populate the worksheet, and that works pretty fast. However, I see severe performance degradation when it comes to formatting the rows: it takes more than double the time just to do the formatting.
As of now, I am applying formatting to individual rows and looping through a series of rows. I am wondering if I can select multiple rows at a time and apply bulk formatting options to them:
Here is what I have right now:
foreach (int row in rowsToBeFormatted)
{
Excel.Range range = (Excel.Range)xlsWorksheet.Range[xlsWorksheet.Cells[row + introFormat, 1], xlsWorksheet.Cells[row + introFormat, 27]];
range.Font.Size = 11;
range.Interior.ColorIndex = 15;
range.Font.Bold = true;
}
And here is a demo of how I am trying to select multiple rows into one range and apply the formatting in bulk:
string excelrange = "A3:AA3,A83:AA83,A88:AA88,A94:AA94,A102:AA102,A106:AA106,A110:AA110,...." (string with more than 3000 characters)
xlsWorksheet.get_Range(excelrange).Interior.Color = Color.SteelBlue;
However, I get the following error when I execute the code:
Exception from HRESULT: 0x800A03EC
and there is nothing in the inner exception. Any ideas how I can achieve the desired result?
As per the comments under the question, there's a hard-coded limit of 255 characters for a range string; however, I wasn't able to find any documentation about it. Another commenter suggested using a semicolon as the separator, but the documentation clearly states that a comma should be used as the union operator in a range string:
The name of the range in A1-style notation in the language of the application. It can include the range operator (a colon), the intersection operator (a space), or the union operator (a comma). It can also include dollar signs, but they are ignored. You can use a local defined name in any part of the range. If you use a name, the name is assumed to be in the language of the application.
So where do we go from here? Formatting each range individually is indeed inefficient. The Application interface provides a Union method, but calling it in a loop is just as inefficient as formatting each range individually. So the natural choice is to fill the range string up to its limit and thus minimize the number of calls to the COM interface.
You can split the full range to format into chunks, each not exceeding the 255-character limit. I would implement it with an iterator:
static IEnumerable<string> GetChunks(IEnumerable<string> ranges)
{
    const int MaxChunkLength = 255;
    var sb = new StringBuilder(MaxChunkLength);
    foreach (var range in ranges)
    {
        if (sb.Length > 0)
        {
            // If the next range (plus a comma) would overflow the chunk,
            // emit the current chunk and start a new one.
            if (sb.Length + range.Length + 1 > MaxChunkLength)
            {
                yield return sb.ToString();
                sb.Clear();
            }
            else
            {
                sb.Append(",");
            }
        }
        sb.Append(range);
    }
    // Emit the final, partially filled chunk.
    if (sb.Length > 0)
    {
        yield return sb.ToString();
    }
}
var rowsToFormat = new[] { 3, 83, 88, 94, 102, 106, 110/*, ...*/ };
var rowRanges = rowsToFormat.Select(row => "A" + row + ":" + "AA" + row);
foreach (var chunk in GetChunks(rowRanges))
{
var range = xlsWorksheet.Range[chunk];
// do formatting stuff here
}
The above is 10-15 times faster than individual formatting:
foreach (var rangeStr in rowRanges)
{
var range = xlsWorksheet.Range[rangeStr];
// do formatting stuff here
}
I can also see further room for optimization, such as grouping contiguous rows (see the sketch below), but if you are formatting discrete rows with subtotals, it won't help.
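For completeness, a minimal sketch of that grouping idea, assuming the same A:AA column span as in the question (the method name and LINQ usage are mine): collapse runs of adjacent row numbers into single A1-style blocks, then feed the result straight into GetChunks.
// Requires System.Linq; collapses e.g. rows 3, 4, 5 into "A3:AA5".
static IEnumerable<string> GetRowRanges(IEnumerable<int> rows)
{
    int start = 0, prev = 0;
    bool open = false;
    foreach (var row in rows.OrderBy(r => r))
    {
        if (open && row == prev + 1) { prev = row; continue; } // extend the current run
        if (open) yield return $"A{start}:AA{prev}"; // close the previous run
        start = prev = row;
        open = true;
    }
    if (open) yield return $"A{start}:AA{prev}"; // close the final run
}
Used as GetChunks(GetRowRanges(rowsToFormat)), it shortens the range strings and thus reduces the number of chunks whenever the rows cluster together.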
Related
I am developing a VSTO add-in for Excel in C# that needs to compare potentially large datasets (100 columns x ~10000 or more rows). It is being done in Excel so an end user can view some pictorial representation of the provided data on a row-by-row basis. This application must be done in Excel despite the potential pitfalls of using these large datasets.
Regardless, my question pertains to an efficient way to compare contiguous, sequential rows. My goal is to compare one row to the row directly after it; if any of the elements differ between row1 and row2, this counts as an "event" and row2 is output to a separate sheet. As you can imagine, row-wise comparison with a count of around 10000 takes a long time (in practice, about 150 ms-200 ms per row with the current code).
Currently, I have used the SequenceEqual() method to compare two lists of strings as follows:
private void FilterRawDataForEventReader(Excel.Application xlApp)
{
List<string> row1 = new List<string>();
List<string> row2 = new List<string>();
xlWsRaw = xlApp.Worksheets["Full Raw Data"];
xlWsEventRaw = xlApp.Worksheets["Event Data"];
Excel.Range xlRawRange = xlWsRaw.Range["A3"].Resize[xlWsRaw.UsedRange.Rows.Count-2, xlWsRaw.UsedRange.Columns.Count];
var array = xlRawRange.Value;
Excel.Range xlRange = (Excel.Range)xlWsEventRaw.Cells[xlWsEventRaw.UsedRange.Rows.Count, 1];
int lastRow = xlRange.get_End(Excel.XlDirection.xlUp).Row;
int newRow = lastRow + 2;
for (int i = 1; i < xlWsRaw.UsedRange.Rows.Count - 2; i++)
{
row1.Clear();
row2.Clear();
for (int j = 1; j <= xlWsRaw.UsedRange.Columns.Count-1; j++)
{
row1.Add(array[i, j].ToString());
row2.Add(array[i + 1, j].ToString());
}
if (!row1.SequenceEqual(row2))
{
row2.Add(array[i + 1, xlWsRaw.UsedRange.Columns.Count].ToString()); // Add timestamp to row2.
for (int j = 0; j < row2.Count; j++)
{
xlWsEventRaw.Cells[newRow, j + 1] = row2[j];
}
newRow++;
}
}
}
During testing, I placed timers at various parts of this method to see how long certain operations take. For 100 columns, the first loop, which builds the string lists for row1 and row2, takes around 100 ms per iteration, and the whole operation takes between 150 ms and 200 ms when an "event" has been found.
My intuition is that building the two List<string> is the problem, but I do not know how else to approach this kind of problem. I should emphasize that the actual values of the data in the two List<string> don't matter; what matters is whether the data are different at all. In that sense, I feel I am approaching this problem incorrectly but don't know how to re-approach it, so to speak.
I am wondering if, instead of building arrays of strings through iteration and comparing them with the SequenceEqual() method, anyone can suggest a faster way to compare contiguous and sequential rows?
In case this solution may be useful for someone else trying to use Excel in C# and do some comparisons:
This problem was largely an optimization exercise. I eliminated the multiple loops and used Excel ranges instead to generate the comparison lists:
for (int i = 3; i < xlWsRaw.UsedRange.Rows.Count - 2; i++)
{
rng1 = (Excel.Range)xlWsRaw.Range[xlWsRaw.Cells[i, 1], xlWsRaw.Cells[i, xlWsRaw.UsedRange.Columns.Count - 1]];
rng2 = (Excel.Range)xlWsRaw.Range[xlWsRaw.Cells[i+1, 1], xlWsRaw.Cells[i+1, xlWsRaw.UsedRange.Columns.Count - 1]];
rng3 = (Excel.Range)xlWsEventRaw.Range[xlWsEventRaw.Cells[newRow, 1], xlWsEventRaw.Cells[newRow, xlWsRaw.UsedRange.Columns.Count - 1]];
object[,] cellValues1 = (object[,])rng1.Value2;
object[,] cellValues2 = (object[,])rng2.Value2;
List<string> test1 = cellValues1.Cast<object>().ToList().ConvertAll(x => Convert.ToString(x));
List<string> test2 = cellValues2.Cast<object>().ToList().ConvertAll(x => Convert.ToString(x));
if (!test1.SequenceEqual(test2))
{
rng2.Copy(rng3);
xlWsEventRaw.Cells[newRow, xlWsRaw.UsedRange.Columns.Count].Value = xlWsRaw.Cells[i + 1, xlWsRaw.UsedRange.Columns.Count].Value; // Outputs the timestamp of the event to the events worksheet.
newRow++;
}
}
I believe this can be optimized further, but in my case the ranges contain multiple types, including strings, so I convert everything to List<string> for the purpose of comparison. The SequenceEqual() method, however it works behind the scenes, is nearly instantaneous and reduces the time to compare 120 columns to around 3 ms.
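If you want to squeeze more out of it, one option is to cross the COM boundary once: pull the entire used range into an object[,] and compare adjacent rows in memory. A sketch under that assumption (variable names mirror the question; writing the events out is elided):
object[,] all = (object[,])xlWsRaw.UsedRange.Value2; // 1-based indices
int rows = all.GetLength(0), cols = all.GetLength(1);
for (int i = 3; i < rows; i++)
{
    // Compare row i to row i+1 field by field, bailing out at the first difference.
    bool equal = true;
    for (int j = 1; j <= cols - 1 && equal; j++)
        equal = Convert.ToString(all[i, j]) == Convert.ToString(all[i + 1, j]);
    if (!equal)
    {
        // copy row i+1 (plus its timestamp) to the events sheet as before
    }
}
This removes the two Range.Value2 calls per row, which are the remaining per-iteration COM round trips.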
I have two lists of strings (this is not mandatory; I can convert them to any collection: list, dictionary, etc.).
First is "text":
Birds sings
Dogs barks
Frogs jumps
Second is "words":
sing
dog
cat
I need to iterate through "text"; if a line contains any of "words", do one thing, and if not, do another.
Important: yes, in my case I need to find a partial match, ignoring case, so the text "Dogs" is a match for the word "dog". This is why I use .Contains and .ToLower().
My naive try looks like this:
List<string> text = new List<string>();
List<string> words = new List<string>();
foreach (string line in text)
{
bool found = false;
foreach (string word in words)
{
if (line.ToLower().Contains(word.ToLower()))
{
;// one thing
found = true;
break;
}
}
if (!found)
;// another
}
The problem is the size: 8,000 entries in the first list and ~50,000 in the second. This takes too much time.
How can I make it faster?
I'm assuming that you only want to match on the specific words in your text list: that is, if text contains "dogs", and words contains "dog", then that shouldn't be a match.
Note that this is different to what your code currently does.
Given this, we can construct a HashSet<string> of all of the words in your text list. We can then query this very cheaply.
We'll also use StringComparer.OrdinalIgnoreCase to do our comparisons. This is a better way of doing a case-insensitive match than ToLower(), and ordinal comparisons are relatively cheap. If you're dealing with languages other than English, you'll need to consider whether you actually need StringComparer.CurrentCultureIgnoreCase or StringComparer.InvariantCultureIgnoreCase.
var textWords = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
foreach (var line in text)
{
var lineWords = line.Split(' ');
textWords.UnionWith(lineWords);
}
if (textWords.Overlaps(words))
{
// One thing
}
else
{
// Another
}
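If you need the decision per line rather than for the whole text (as in the original loop), the same idea works with the sets swapped: put words into the HashSet and test each line's tokens. A sketch using the question's variable names (requires System.Linq):
var wordSet = new HashSet<string>(words, StringComparer.OrdinalIgnoreCase);
foreach (var line in text)
{
    // Whole-word match: true if any token of the line is in the word set.
    if (line.Split(' ').Any(wordSet.Contains))
    {
        // One thing
    }
    else
    {
        // Another
    }
}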
If this is not the case, and you do want to do a .Contains on each, then you can speed it up a bit by avoiding the calls to .ToLower(). Each call to .ToLower() creates a new string in memory, so you're creating two new, useless objects per comparison.
Instead, use:
if (line.IndexOf(word, StringComparison.OrdinalIgnoreCase) >= 0)
As above, you might have to use StringComparison.CurrentCultureIgnoreCase or StringComparison.InvariantCultureIgnoreCase depending on the language of your strings. However, you should see a significant speedup if your strings are entirely ASCII and you use OrdinalIgnoreCase as this makes the string search a lot quicker.
If you're using .NET Framework, another thing to try is moving to .NET Core. .NET Core introduced a lot of optimizations in this area, and you might find that it's quicker.
Another thing you can do is see whether you have duplicates in either text or words. If you have a lot, you might be able to save a lot of time. Consider using a HashSet<string> for this, or LINQ's .Distinct() (you'll need to see which is quicker).
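Putting the last two suggestions together, a sketch of the substring-matching loop with up-front deduplication and ordinal comparisons (still O(lines x words), just with smaller constants):
// Deduplicate the ~50,000 words once, case-insensitively, before the loop.
var distinctWords = words.Distinct(StringComparer.OrdinalIgnoreCase).ToList();
foreach (var line in text)
{
    bool found = distinctWords.Any(
        w => line.IndexOf(w, StringComparison.OrdinalIgnoreCase) >= 0);
    if (found) { /* one thing */ } else { /* another */ }
}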
You can try using LINQ for the second looping construct.
List<string> text = new List<string>();
List<string> words = new List<string>();
foreach (string line in text)
{
bool found = words.FirstOrDefault(w => line.ToLower().Contains(w.ToLower())) != null;
if (found)
{
//Do something
}
else
{
//Another
}
}
Might not be as fast as you want but it will be faster than before.
You can improve the search algorithm.
public static int Search(string word, List<string> stringList)
{
string wordCopy = word.ToLower();
List<string> stringListCopy = new List<string>();
stringList.ForEach(s => stringListCopy.Add(s.ToLower()));
stringListCopy.Sort();
int position = -1;
int count = stringListCopy.Count;
if (count > 0)
{
int min = 0;
int max = count - 1;
int middle = (max - min) / 2;
int comparisonStatus = 0;
do
{
comparisonStatus = string.Compare(wordCopy, stringListCopy[middle]);
if (comparisonStatus == 0)
{
position = middle;
break;
}
else if (comparisonStatus < 0)
{
max = middle - 1;
}
else
{
min = middle + 1;
}
middle = min + (max - min) / 2;
} while (min <= max);
}
return position;
}
Inside this method we create a copy of the string list with all elements lower-cased.
After that we sort the copied list in ascending order. This is crucial, because the entire algorithm relies on an ascending sort.
If the word exists in the list, the Search method returns its position in the list; otherwise it returns -1.
How does the algorithm work?
Instead of checking every element in the list, we halve the search space in every iteration.
In every iteration we take the element in the middle and compare the two strings (the element and our word). If our string is the same as the one in the middle, the search is finished. If our string sorts lexicographically before the middle string, our string must be in the first half of the list, because the list is sorted in ascending order. If it sorts lexicographically after the middle string, it must be in the second half, again because the list is sorted in ascending order. We then take the appropriate half and repeat the process.
In the first iteration we take the entire list.
I've tested the Search method using these data:
List<string> stringList = new List<string>();
stringList.Add("Serbia");
stringList.Add("Greece");
stringList.Add("Egypt");
stringList.Add("Peru");
stringList.Add("Palau");
stringList.Add("Slovakia");
stringList.Add("Kyrgyzstan");
stringList.Add("Mongolia");
stringList.Add("Chad");
Search("Serbia", stringList);
This way you will search the entire list of ~50,000 elements in 16 iterations at most.
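Worth noting: the framework ships this algorithm as List<T>.BinarySearch, so once the list is sorted with the same comparer you can skip the hand-rolled version. A quick sketch:
stringList.Sort(StringComparer.OrdinalIgnoreCase);
int position = stringList.BinarySearch("Serbia", StringComparer.OrdinalIgnoreCase);
bool found = position >= 0; // a negative result encodes the insertion point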
I have a string like this:
-82.9494547,36.2913021,0
-83.0784938,36.2347521,0
-82.9537782,36.079235,0
I need to have output like this:
-82.9494547 36.2913021, -83.0784938 36.2347521, -82.9537782 36.079235
I have tried the following code to achieve the desired output:
string[] coordinatesVal = coordinateTxt.Trim().Split(new string[] { ",0" }, StringSplitOptions.None);
for (int i = 0; i < coordinatesVal.Length - 1; i++)
{
coordinatesVal[i] = coordinatesVal[i].Trim();
coordinatesVal[i] = coordinatesVal[i].Replace(',', ' ');
numbers.Append(coordinatesVal[i]);
if (i != coordinatesVal.Length - 1)
{
numbers.Append(", ");
}
}
But this does not seem like a professional solution to me. Can anyone suggest a more efficient way of doing this?
Your code is okay. You could skip the temporary results and chain the method calls:
var numbers = new StringBuilder();
string[] coordinatesVal = coordinateTxt
.Trim()
.Split(new string[] { ",0" }, StringSplitOptions.None);
for (int i = 0; i < coordinatesVal.Length - 1; i++) {
numbers
.Append(coordinatesVal[i].Trim().Replace(',', ' '))
.Append(", ");
}
numbers.Length -= 2;
Note that the last statement assumes that there is at least one coordinate pair available. If the coordinates can be empty, you would have to enclose the loop and this last statement in if (coordinatesVal.Length > 0) { ... }. This is still more efficient than having an if inside the loop.
You ask about efficiency, but you don't specify whether you mean code efficiency (execution speed) or programmer efficiency (how much time you have to spend on it).
One key part of professional programming is to judge which one of these is more important in any given situation.
The other answers do a good job of covering programmer efficiency, so I'm taking a stab at code efficiency. I'm doing this at home for fun, but for professional work I would need a good reason before putting in the effort to even spend time comparing the speeds of the methods given in the other answers, let alone try to improve on them.
Having said that, waiting around for the program to finish doing the conversion of millions of coordinate pairs would give me such a reason.
One of the speed pitfalls of C# string handling is the way String.Replace() and String.Trim() return a whole new copy of the string. This involves allocating memory, copying the characters, and eventually cleaning up the garbage generated. Do that a few million times, and it starts to add up. With that in mind, I attempted to avoid as many allocations and copies as possible.
enum CurrentField
{
FirstNum,
SecondNum,
UnwantedZero
};
static string ConvertStateMachine(string input)
{
// Pre-allocate enough space in the string builder.
var numbers = new StringBuilder(input.Length);
var state = CurrentField.FirstNum;
int i = 0;
while (i < input.Length)
{
char c = input[i++];
switch (state)
{
// Copying the first number to the output, next will be another number
case CurrentField.FirstNum:
if (c == ',')
{
// Separate the two numbers by space instead of comma, then move on
numbers.Append(' ');
state = CurrentField.SecondNum;
}
else if (!(c == ' ' || c == '\n'))
{
// Ignore whitespace, output anything else
numbers.Append(c);
}
break;
// Copying the second number to the output, next will be the ,0\n that we don't need
case CurrentField.SecondNum:
if (c == ',')
{
numbers.Append(", ");
state = CurrentField.UnwantedZero;
}
else if (!(c == ' ' || c == '\n'))
{
// Ignore whitespace, output anything else
numbers.Append(c);
}
break;
case CurrentField.UnwantedZero:
// Output nothing, just track when the line is finished and we start all over again.
if (c == '\n')
{
state = CurrentField.FirstNum;
}
break;
}
}
return numbers.ToString();
}
This uses a state machine to treat incoming characters differently depending on whether they are part of the first number, second number, or the rest of the line, and output characters accordingly. Each character is only copied once into the output, then I believe once more when the output is converted to a string at the end. This second conversion could probably be avoided by using a char[] for the output.
The bottleneck in this code seems to be the number of calls to StringBuilder.Append(). If more speed were required, I would first attempt to keep track of how many characters were to be copied directly into the output, then use .Append(string value, int startIndex, int count) to send an entire number across in one call.
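For illustration, a rough sketch of that batching idea (runStart is my name; whitespace handling is omitted): scan to the end of the current number, then copy the whole run across in one call.
// Instead of appending character by character, remember where the number starts...
int runStart = i;
while (i < input.Length && input[i] != ',' && input[i] != '\n')
    i++;
// ...then copy the whole run in a single call.
numbers.Append(input, runStart, i - runStart);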
I put a few example solutions into a test harness, and ran them on a string containing 300,000 coordinate-pair lines, averaged over 50 runs. The results on my PC were:
String Split, Replace each line (see Olivier's answer, though I pre-allocated the space in the StringBuilder):
6542 ms / 13493147 ticks, 130.84ms / 269862.9 ticks per conversion
Replace & Trim entire string (see Heriberto's second version):
3352 ms / 6914604 ticks, 67.04 ms / 138292.1 ticks per conversion
- Note: The original test was done with 900,000 coordinate pairs, but this entire-string version suffered an out-of-memory exception, so I had to rein it in a bit.
Split and Join (see Łukasz's answer):
8780 ms / 18110672 ticks, 175.6 ms / 362213.4 ticks per conversion
Character state machine (see above):
1685 ms / 3475506 ticks, 33.7 ms / 69510.12 ticks per conversion
So, the question of which version is most efficient comes down to: what are your requirements?
Your solution is fine. Maybe you could write it a bit more elegantly, like this:
string[] coordinatesVal = coordinateTxt.Trim().Split(new string[] { ",0" },
StringSplitOptions.RemoveEmptyEntries);
string result = string.Empty;
foreach (string line in coordinatesVal)
{
string[] numbers = line.Trim().Split(',');
result += numbers[0] + " " + numbers[1] + ", ";
}
result = result.Remove(result.Length - 2, 2);
Note the StringSplitOptions.RemoveEmptyEntries parameter of the Split method, so you don't have to deal with empty lines in the foreach block.
Or you can use an extremely short one-liner. It's harder to debug, but in simple cases it does the job.
string result =
string.Join(", ",
coordinateTxt.Trim().Split(new string[] { ",0" }, StringSplitOptions.RemoveEmptyEntries).
Select(i => i.Replace(",", " ")));
Here's another way, without defining your own loops and replace methods, or using LINQ.
string coordinateTxt = @" -82.9494547,36.2913021,0
-83.0784938,36.2347521,0
-82.9537782,36.079235,0";
string[] coordinatesVal = coordinateTxt.Replace(",", "*").Trim().Split(new string[] { "*0", Environment.NewLine }, StringSplitOptions.RemoveEmptyEntries);
string result = string.Join(",", coordinatesVal).Replace("*", " ");
Console.WriteLine(result);
or even
string coordinateTxt = @" -82.9494540,36.2913021,0
-83.0784938,36.2347521,0
-82.9537782,36.079235,0";
string result = coordinateTxt.Replace(Environment.NewLine, "").Replace(",", " ").Replace(" 0", ", ").Trim(new char[] { ',', ' ' });
Console.WriteLine(result);
In short - I want to convert the first answer to the question here from Python into C#. My current solution to splitting conjoined words is exponential, and I would like a linear solution. I am assuming no spacing and consistent casing in my input text.
Background
I wish to convert conjoined strings such as "wickedweather" into separate words, for example "wicked weather", using C#. I have created a working solution, a recursive function running in exponential time, which is simply not efficient enough for my purposes (processing at least 100 joined words). Here are the questions I have read so far, which I believe may be helpful, but whose answers I cannot translate from Python to C#:
How can I split multiple joined words?
Need help understanding this Python Viterbi algorithm
How to extract literal words from a consecutive string efficiently?
My Current Recursive Solution
This is for people who only want to split a few words (< 50) in C# and don't really care about efficiency.
My current solution works out all possible combinations of words, finds the most probable output, and displays it. I am currently defining the most probable output as the one that uses the longest individual words; I would prefer to use a different method. Here is my current solution, using a recursive algorithm.
static public string find_words(string instring)
{
if (words.Contains(instring)) //where words is my dictionary of words
{
return instring;
}
if (solutions.ContainsKey(instring))
{
return solutions[instring];
}
string bestSolution = "";
string solution = "";
for (int i = 1; i < instring.Length; i++)
{
string partOne = find_words(instring.Substring(0, i));
string partTwo = find_words(instring.Substring(i, instring.Length - i));
if (partOne == "" || partTwo == "")
{
continue;
}
solution = partOne + " " + partTwo;
//if my current solution is smaller than my best solution so far (smaller solution means I have used the space to separate words fewer times, meaning the words are larger)
if (bestSolution == "" || solution.Length < bestSolution.Length)
{
bestSolution = solution;
}
}
solutions[instring] = bestSolution;
return bestSolution;
}
This algorithm relies on having no spacing or other symbols in the entry text (not really a problem here, I'm not fussed about splitting up punctuation). Random additional letters added within the string can cause an error, unless I store each letter of the alphabet as a "word" within my dictionary. This means that "wickedweatherdykjs" would return "wicked weather d y k j s" using the above algorithm, when I would prefer an output of "wicked weather dykjs".
My updated exponential solution:
static List<string> words = File.ReadLines("E:\\words.txt").ToList();
static Dictionary<char, HashSet<string>> compiledWords = buildDictionary(words);
private void btnAutoSpacing_Click(object sender, EventArgs e)
{
string text = txtText.Text;
text = RemoveSpacingandNewLines(text); //get rid of anything that breaks the algorithm
if (text.Length > 150)
{
//possibly split the text up into more manageable chunks?
//considering using textSplit() for this.
}
else
{
txtText.Text = find_words(text);
}
}
static IEnumerable<string> textSplit(string str, int chunkSize)
{
return Enumerable.Range(0, str.Length / chunkSize)
.Select(i => str.Substring(i * chunkSize, chunkSize));
}
private static Dictionary<char, HashSet<string>> buildDictionary(IEnumerable<string> words)
{
var dictionary = new Dictionary<char, HashSet<string>>();
foreach (var word in words)
{
var key = word[0];
if (!dictionary.ContainsKey(key))
{
dictionary[key] = new HashSet<string>();
}
dictionary[key].Add(word);
}
return dictionary;
}
static public string find_words(string instring)
{
string bestSolution = "";
string solution = "";
if (compiledWords[instring[0]].Contains(instring))
{
return instring;
}
if (solutions.ContainsKey(instring))
{
return solutions[instring];
}
for (int i = 1; i < instring.Length; i++)
{
string partOne = find_words(instring.Substring(0, i));
string partTwo = find_words(instring.Substring(i, instring.Length - i));
if (partOne == "" || partTwo == "")
{
continue;
}
solution = partOne + " " + partTwo;
if (bestSolution == "" || solution.Length < bestSolution.Length)
{
bestSolution = solution;
}
}
solutions[instring] = bestSolution;
return bestSolution;
}
How I would like to use the Viterbi Algorithm
I would like to create an algorithm which works out the most probable solution to a conjoined string, where the probability is calculated according to the position of the word in a text file that I provide the algorithm with. Let's say the file starts with the most common word in the English language first, and on the next line the second most common, and so on until the least common word in my dictionary. It looks roughly like this
the
be
and
...
attorney
Here is a link to a small example of such a text file I would like to use.
Here is a much larger text file which I would like to use
The logic behind this file positioning is as follows...
It is reasonable to assume that they follow Zipf's law, that is the
word with rank n in the list of words has probability roughly 1/(n log
N) where N is the number of words in the dictionary.
Generic Human, in his excellent Python solution, explains this much better than I can. I would like to convert his solution to the problem from Python into C#, but after many hours spent attempting this I haven't been able to produce a working solution.
I also remain open to the idea that relative frequencies with the Viterbi algorithm may not be the best way to split the words; any other suggestions for creating a solution in C# are welcome.
Written text is highly contextual, and you may wish to use a Markov chain to model sentence structure in order to estimate joint probability. Unfortunately, sentence structure breaks the Viterbi assumption. There is still hope, though: the Viterbi algorithm is a special case of branch-and-bound optimization, aka "pruned dynamic programming" (something I showed in my thesis), so even when the cost-splicing assumption isn't met, you can still develop cost bounds and prune your population of candidate solutions. But let's set Markov chains aside for now. Assuming that the probabilities are independent and each follows Zipf's law, what you need to know is that the Viterbi algorithm works by accumulating additive costs.
For independent events, joint probability is the product of the individual probabilities, making negative log-probability a good choice for the cost.
So your single-step cost would be -log(P), or log(1/P), which is log(index * log(N)), which is log(index) + log(log(N)), where the latter term is a constant.
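For what it's worth, here is one way that dynamic-programming segmentation can look in C#. It is a sketch under stated assumptions, not a definitive port: the word file name is hypothetical (one word per line, most frequent first, as the question describes), and the UnknownCost fallback for characters that start no known word is my own choice.
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class Segmenter
{
    // Hypothetical file: one word per line, most frequent first.
    static readonly string[] WordsByRank = File.ReadAllLines("words-by-frequency.txt");

    // Zipf cost: the word of rank n gets cost log(n * log N); lower means more probable.
    static readonly Dictionary<string, double> WordCost = WordsByRank
        .Select((w, i) => (Word: w, Cost: Math.Log((i + 1) * Math.Log(WordsByRank.Length))))
        .ToDictionary(t => t.Word, t => t.Cost);

    static readonly int MaxWordLength = WordsByRank.Max(w => w.Length);

    public static string InferSpaces(string s)
    {
        const double UnknownCost = 1e9; // penalty for a character that starts no known word

        // best[i] = minimal cost of segmenting the first i characters;
        // lastLen[i] = length of the final word in that best segmentation.
        var best = new double[s.Length + 1];
        var lastLen = new int[s.Length + 1];
        for (int i = 1; i <= s.Length; i++)
        {
            best[i] = best[i - 1] + UnknownCost; // fallback: a single unknown character
            lastLen[i] = 1;
            for (int k = 1; k <= Math.Min(i, MaxWordLength); k++)
            {
                if (WordCost.TryGetValue(s.Substring(i - k, k), out double c)
                    && best[i - k] + c < best[i])
                {
                    best[i] = best[i - k] + c;
                    lastLen[i] = k;
                }
            }
        }

        // Walk backwards through the table to recover the words.
        var parts = new Stack<string>();
        for (int i = s.Length; i > 0; i -= lastLen[i])
            parts.Push(s.Substring(i - lastLen[i], lastLen[i]));
        return string.Join(" ", parts);
    }
}
The cost table is built once; each call is then O(n * MaxWordLength) rather than exponential, which is the linear behavior you're after. Segmenter.InferSpaces("wickedweather") should yield "wicked weather", provided both words appear in the file.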
I can't help you with the Viterbi algorithm, but I'll give my two cents concerning your current approach. From your code it's not exactly clear what words is. This can be a real bottleneck if you don't choose a good data structure. As a gut feeling, I'd initially go with a Dictionary<char, HashSet<string>> where the key is the first letter of each word:
private static Dictionary<char, HashSet<string>> buildDictionary(IEnumerable<string> words)
{
var dictionary = new Dictionary<char, HashSet<string>>();
foreach (var word in words)
{
var key = word[0];
if (!dictionary.ContainsKey(key))
{
dictionary[key] = new HashSet<string>();
}
dictionary[key].Add(word);
}
return dictionary;
}
And I'd also consider serializing it to disk to avoid building it up every time.
I'm not sure how much improvement you can get this way (I don't have full information on your current implementation), but benchmark it and see.
NOTE: I'm assuming all words are cased consistently.
I am comparing two lists of data that were generated from a binary file. I have a good idea of why it's running slowly: when there's a significant number of records, it does unneeded, redundant work.
For example, if 1a == 1a, the condition is true. Since 2a != 1a, why even bother checking it? I need to eliminate 1a from being checked again; if I don't, it will still check the first record when it goes to check the 400,000th record. I thought about making the second for loop a foreach, but I can't remove 1a while iterating through the nested loop.
The number of items in either for loop can vary. I don't think a single for loop using i will work, since the match can be anywhere. I'm reading from a binary file.
This is my current code. The program has been running for over an hour and it's still going. I removed a lot of my iterating code for readability.
for (int i = 0; i < origItemList.Count; i++)
{
int modFoundIndex = 0;
Boolean foundIt = false;
for (int g = 0; g < modItemList.Count; g++)
{
if ((origItemList[i].X == modItemList[g].X)
&& (origItemList[i].Y == modItemList[g].Y)
&& (origItemList[i].Z == modItemList[g].Z)
&& (origItemList[i].M == modItemList[g].M))
{
foundIt = true;
modFoundIndex = g;
break;
}
else
{
foundIt = false;
}
}
if (foundIt)
{
/*
 * This runs assuming it found an x,y,z,m
 * coordinate. It then checks the database file.
 */
//grab the rows where the coordinates match
DataRow origRow = origDbfFile.dataset.Tables[0].Rows[i];
DataRow modRow = modDbfFile.dataset.Tables[0].Rows[modFoundIndex];
//number matched indicates how many columns were matched
int numberMatched = 0;
//get the number of columns to match in order to detect all changes
int numOfColumnsToMatch = origDbfFile.datatable.Columns.Count;
List<String> mismatchedColumns = new List<String>();
//check each column name for a change
foreach (String columnName in columnNames)
{
//this grabs whatever value is in that field
String origRowValue = "" + origRow.Field<Object>(columnName);
String modRowValue = "" + modRow.Field<Object>(columnName);
//check if they are the same
if (origRowValue.Equals(modRowValue))
{
//if they are the same, increase the number matched by one
numberMatched++;
}
else
{
//otherwise add the column to the list of columns that don't match
mismatchedColumns.Add(columnName);
}
}
/* In the event it matches 15/16 columns, show the change */
if (numberMatched != numOfColumnsToMatch)
{
//Grab the shapeFile in question
Item differentAttrShpFile = origItemList[i];
//start blue highlighting
result += "<div class='turnBlue'>";
//show where the change was made at
result += "Change Detected at<br/> point X: " +
differentAttrShpFile.X + ",<br/> point Y: " +
differentAttrShpFile.Y + ",<br/>";
result += "</div>"; //end turnblue div
foreach (String mismatchedColumn in mismatchedColumns)
{
//iterate changes here
}
}
}
}
You're coming at this in totally the wrong way. The loop you have is O(n^2); breaking when you find the match will, on average, cut the time in half for a hit, and that's not enough. If you have a quarter of a million items in each list, the loop body executes 62 billion times, and even if the compiler optimizes out the extra array lookups, you're still looking at a trillion instructions or more. You don't do O(n^2) for large n if you can possibly help it!
What you need to do is get rid of the O(n^2) aspect of this. My suggestion:
1) Define a hashing function that looks at the x, y, z & m and comes up with an integer value, my inclination would be to use one that's the wordsize of your target platform.
2) Iterate over both lists, compute hashes for everything.
3) Build an index over one of the lists: hash and object. I suspect a dictionary is the best data structure here, but a simple sorted array would also do.
4) Iterate over the list you didn't build the index over and compare the hashes to the entries in the index. With a hash table that's an O(n) task; with a sorted array it's O(n log n).
5) When the hashes match, do a full comparison to confirm the hit is real: you will get the occasional collision even with a good 64-bit hash, and a decent number of them if your hashes are 32-bit.
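A minimal sketch of those five steps, assuming the question's Item type with X, Y, Z, M fields (HashCode.Combine needs a recent framework; on older ones, substitute your own mixing function):
// Steps 1-3: hash every item in one list and index it in a dictionary.
var index = new Dictionary<int, List<Item>>();
foreach (var mod in modItemList)
{
    int h = HashCode.Combine(mod.X, mod.Y, mod.Z, mod.M);
    if (!index.TryGetValue(h, out var bucket))
        index[h] = bucket = new List<Item>();
    bucket.Add(mod);
}
// Step 4: probe the index with the other list -- O(n) instead of O(n^2).
foreach (var orig in origItemList)
{
    int h = HashCode.Combine(orig.X, orig.Y, orig.Z, orig.M);
    if (index.TryGetValue(h, out var candidates))
    {
        // Step 5: hashes collide occasionally, so confirm with a full comparison.
        var match = candidates.FirstOrDefault(m =>
            m.X == orig.X && m.Y == orig.Y && m.Z == orig.Z && m.M == orig.M);
        if (match != null)
        {
            // proceed with the column-by-column comparison as before
        }
    }
}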
This is similar to what Loren said, but below it is in the language of .NET :)
1. Override the GetHashCode method to return a sum built from X, Y, Z and M. Override the Equals method to check this sum.
2. Before the loop, iterate over modItemList (a List) and create a HashSet from it.
3. In the inner loop, first check whether origItemList[i] exists in the HashSet using the YourModHashSet.Contains(MyObject) method.
4. If .Contains returns false, carry on; there is no match.
5. If .Contains returns true, iterate through the entire modItemList and apply your current logic of checking X, Y, Z and M against the whole list. Note that you should use the List here, because the hash set may hold many objects whose hash codes are the same.
Also, I would use foreach instead of for, because I've seen foreach give slightly better results (5 to 30% faster) in such cases.
Update:
I created MyObject class like below:
public class MyObject
{
public int X, Y, Z, M;
public override int GetHashCode()
{
return X*10000 + Y*100 + Z*10 + M;
}
public override bool Equals(object obj)
{
return (obj.GetHashCode() == this.GetHashCode());
}
}
The GetHashCode method is important here: we don't want many false positives. A false positive occurs when the hash matches for some other combination of X, Y, Z and M. The best way to prevent false positives is to multiply each member so that each one affects a different decimal place in the hash code. Note that you should take care not to exceed int.MaxValue; if the expected values of X, Y, Z and M are small, you should be good.
set2.Clear();
s1 = DateTime.Now;
MyObject matchingElement;
totalmatch = 0;
foreach (MyObject elem in list2)
set2.Add(elem);
foreach (MyObject t1 in list1)
{
if (set2.Contains(t1))
{
matchingElement = null;
foreach (MyObject t2 in list2)
{
if (t1.X == t2.X && t1.Y == t2.Y && t1.Z == t2.Z && t1.M == t2.M)
{
totalmatch++;
matchingElement = t2;
break;
}
}
//Do Something on matchingElement if not null
}
}
Console.WriteLine("set foreach with contains: " + (DateTime.Now - s1).TotalSeconds + "\t Total Match: " + totalmatch);
Above is sample code that I was trying to describe in my answer. This code should work super fast if matches are expected to be less.