Takes too long to loop through array looking for duplicates - C#

Hope you can help me out.
I've got a 135,000-line txt file containing lines like this: 111706469;1972WE;26;Wel.
The program is supposed to compare every line to every line that came before it, check whether it is more than 80% similar, and if so print the line number of the original line.
I've managed to do that on my own like this:
if (rows.Length > 1)
{
    for (int rowIndex = 1; rowIndex < rows.Length; rowIndex++)
    {
        string cols = rows[rowIndex];
        bool Dubbel = false;
        for (int DupIndex = 0; DupIndex < rowIndex; DupIndex++)
        {
            string SearchDup = rows[DupIndex];
            decimal ComparisonResult = Compare(cols, SearchDup);
            if (ComparisonResult > 80)
            {
                cols += ";" + DupIndex;
                Dubbel = true;
                break;
            }
        }
        Console.WriteLine(rowIndex + ";" + cols);
    }
}
This means the program has to go through the array again and again for every array item. My question is, is there a faster/better way of doing this?
Any help you can give me would be much appreciated.

The problem is with your fuzzy matching, which returns a floating-point number: without any details on the fuzzy function itself there's no way to do better than O(N*N) (if I'm wrong, please somebody correct me).
If you have exact matches you can remove them first; that reduces the N^2 complexity to (N-K)^2, where K is the number of exact duplicates, which is worth doing if you have at least some exact matches.
Use a HashSet<>, which doesn't need a second (value) object the way a Dictionary does:
List<string> rows = new List<string>(new[] { "AAA", "BBB", "AAA", "CCC" });
HashSet<string> foundLines = new HashSet<string>();
foreach (string row in rows)
{
    if (!foundLines.Contains(row))
        foundLines.Add(row);
}
rows = foundLines.ToList();
Then proceed with your algorithm.
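As a side note, the Contains check is redundant, because HashSet<string>.Add already ignores duplicates, so the set can be built straight from the list. A shorter equivalent sketch:

// HashSet<T> de-duplicates on construction, so no explicit Contains/Add loop is needed.
rows = new HashSet<string>(rows).ToList();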

You're not going to be able to get much optimization without a significant overhaul. It'd be trivial for exact matches, or for searching for anything closely matching a single target, but for a fuzzy difference between items you must compare each item to each previous item.
Basically, given a set of N strings, you have to compare string N to strings N-1, N-2, N-3, and so on, and then do the same all over again for string N+1, because how N compares tells you nothing about N+1. That is roughly N²/2 comparisons; with 135,000 lines it works out to about 9.1 billion calls to the fuzzy comparison.

After some further effort I've come to the answer to my own question, and thought I should post it in case someone else has the same problem.
I converted the txt file to a MySQL database, then SELECTed all the records once into a DataTable. The code then loops through the records and selects from the original DataTable only those records with the same postal code and house number into a second DataTable, against which the original record is compared.
This reduced a process that took 9 hours to one that takes 2 to 3 minutes. After the fact it was quite obvious, but such is hindsight...
Hope it helps someone out.
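For anyone who would rather stay in memory than go through MySQL, the same "blocking" idea can be sketched with LINQ: group the rows by a cheap key and only run the fuzzy Compare within each group. The field positions below (postal code and house number as the second and third semicolon-separated fields) are an assumption based on the sample line in the question, and Compare is the fuzzy method from the original code.

// Group by postal code + house number, then only compare within each group.
var groups = rows
    .Select((line, index) => new { line, index })
    .GroupBy(x =>
    {
        var parts = x.line.Split(';');
        return parts.Length > 2 ? parts[1] + ";" + parts[2] : x.line;
    });

foreach (var group in groups)
{
    var members = group.ToList();
    for (int i = 1; i < members.Count; i++)
    {
        for (int j = 0; j < i; j++)
        {
            if (Compare(members[i].line, members[j].line) > 80)
            {
                // Print the line, followed by the index of the earlier near-duplicate.
                Console.WriteLine(members[i].index + ";" + members[i].line + ";" + members[j].index);
                break;
            }
        }
    }
}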


Improve performance of TryGetValue

I am creating an Excel file using the Open XML SDK. In this process, I have a scenario like the one below.
I need to add data to a Dictionary<uint, string> if the key does not already exist. For that I am using the code below.
var dataLines = sheetData.Elements<Row>().ToList();
for (int i = 0; i < dataLines.Count; i++)
{
    var x = dataLines[i];
    if (!dataDictionary.TryGetValue(x.RowIndex.Value, out var res)) // 700 seconds, 1,279,999,998 hit counts
    {
        dataDictionary.Add(x.RowIndex.Value, x.OuterXml);
    }
}
When I try to create an Excel sheet with around 90,000 - 92,000 rows, the line with the if condition in the above code takes 700 seconds to complete (checked with a performance profiler; this line also has 1,279,999,998 hit counts).
How could I reduce the time that line consumes?
Is there any better way to achieve this in less time?
If the if statement is slow, one option you have is to eliminate it entirely and use the indexer of the dictionary to set the value. This means that the "last match will win". If you want the "first match to win", all you have to do is reverse the order you are iterating the list.
var dataLines = sheetData.Elements<Row>().ToList();
for (int i = dataLines.Count - 1; i >= 0; i--)
{
    var x = dataLines[i];
    dataDictionary[x.RowIndex.Value] = x.OuterXml;
}
If x.RowIndex.Value is unique, it doesn't matter which direction you iterate.
If it is important that the key is sorted in ascending order, you can use a SortedDictionary<TKey, TValue>.
But as others have pointed out, it seems odd that you have so many hit counts. There is probably recursion going on in your application that you need to track down.

How can I get calculated values after Worksheet.Calculate()?

I tried the trial version of GemBox.Spreadsheet.
Getting Cells[,].Value in a for() or foreach() loop is slow, so I thought I would call Calculate() first and then read Cell[].Value, but that way takes just as much time.
It seems to re-calculate each time I get Cell[].Value.
workSheet.Calculate(); // <- after this, values are calculated, am I right?
for (int i = 0; i < workSheet.GetUsedCellRange(true).LastRowIndex + 1; ++i)
{
    // ~~~~ for iteration ~~~~
    var value = workSheet.Cells[i, j].Value; // <- re-calculates the value(?)
}
So here is the question:
Can I get the already-calculated values? Or do you know of a pre-calculate function, or some way to get more speed?
Unfortunately, I'm not sure what exactly you're asking; could you please try reformulating your question a bit so that it's easier to understand?
Nevertheless, here is some information which I hope you'll find useful.
To iterate through all cells, you should use one of the following:
1.
foreach (ExcelRow row in workSheet.Rows)
{
    foreach (ExcelCell cell in row.AllocatedCells)
    {
        var value = cell.Value;
        // ...
    }
}
2.
for (CellRangeEnumerator enumerator = workSheet.Cells.GetReadEnumerator(); enumerator.MoveNext(); )
{
    ExcelCell cell = enumerator.Current;
    var value = cell.Value;
    // ...
}
3.
for (int r = 0, rCount = workSheet.Rows.Count; r < rCount; ++r)
{
    for (int c = 0, cCount = workSheet.CalculateMaxUsedColumns(); c < cCount; ++c)
    {
        var value = workSheet.Cells[r, c].Value;
        // ...
    }
}
I believe all of them will have pretty much the same performance.
However, depending on the spreadsheet's content this last one could end up a bit slower, because it does not iterate exclusively through allocated cells.
So for instance, let's say you have a spreadsheet with 2 rows. The first row is empty, it has no data, and the second row has 3 cells. If you use approach 1. or 2. you will iterate only through those 3 cells in the second row, but if you use approach 3. you will iterate through 3 cells in the first row (which previously were not allocated, and now are because we accessed them) and then through the 3 cells in the second row.
Now regarding the calculation, note that when you save the file with an Excel application it saves the last calculated formula values in it. In that case you don't have to call the Calculate method because you already have the required values in the cells.
You should call the Calculate method when you need to update, that is re-calculate, the formulas in your spreadsheet, for instance after you have added or modified some cell values.
Last, regarding your question, it is again hard to understand, but nevertheless:
Can I get the already-calculated values?
Yes, the line var value = workSheet.Cells[i,j].Value; should give you the calculated value because you called the Calculate method before it. However, if you have formulas that are currently not supported by GemBox.Spreadsheet's calculation engine then it will not be able to calculate the value. You can find a list of currently supported Excel formula functions here.
Or do you know of a pre-calculate function, or some way to get more speed?
I don't know what a "pre-calculate function" would mean, and for speed please refer to the first part of this answer.

How to improve the performance of my custom function for getting fast results?

I'm using Lucene.NET to implement a numerical search engine.
I want to filter numbers from within a large range, depending on which numbers exist in a string array.
I used the following code:
int startValue = 1;
int endValue = 100000;

// Assume that the following string array contains 12000 strings
String[] ArrayOfTerms = new String[] { "1", "10", ................. , "99995" };

public String[] GetFilteredStrings(String[] ArrayOfTerms)
{
    List<String> filteredStrings = new List<String>();
    for (int i = startValue; i <= endValue; i++)
    {
        int index = Array.IndexOf(ArrayOfTerms, i.ToString());
        if (index != -1)
        {
            filteredStrings.Add((String)ArrayOfTerms.GetValue(index));
        }
    }
    return filteredStrings.ToArray();
}
Now, my problem is that it searches every value from 1 to 100000 and takes too much time; sometimes my application hangs.
Can anyone help me improve this performance issue? I don't know much about caching, but I know that Lucene supports cache filters. Should I use a cache filter? Thanks in advance.
In fact you're trying to determine whether the array contains an item or not.
I think you should use something like a HashSet or Dictionary so that you can check for the presence of a value in O(1) time instead of the O(n) time you have now.
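For example, something along these lines (a sketch of the same method from the question, just with the array lookup replaced by a HashSet built once up front):

public String[] GetFilteredStrings(String[] ArrayOfTerms)
{
    var terms = new HashSet<String>(ArrayOfTerms);       // one O(n) pass to build the set
    List<String> filteredStrings = new List<String>();
    for (int i = startValue; i <= endValue; i++)
    {
        string candidate = i.ToString();
        if (terms.Contains(candidate))                    // O(1) lookup instead of Array.IndexOf
        {
            filteredStrings.Add(candidate);
        }
    }
    return filteredStrings.ToArray();
}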
If I got what you want to do right, this code works pretty much faster:
var results = ArrayOfTerms.Where(s => int.Parse(s) <= endValue);

Is there a quicker way to process a sequence of elements than using a loop?

Here I am storing the elements of a datagrid in a string builder using a for loop, but it takes too much time when there is a large number of rows. Is there another way to copy the data into a string builder in less time?
for (int a = 0; a < grdMass.RowCount; a++)
{
    if (a == 0)
    {
        _MSISDN.AppendLine("'" + grdMass.Rows[a].Cells[0].Value.ToString() + "'");
    }
    else
    {
        _MSISDN.AppendLine(",'" + grdMass.Rows[a].Cells[0].Value.ToString() + "'");
    }
}
There is no way to improve this code given the information you have provided. This is simply a for loop that appends strings to a StringBuilder - there isn't a whole lot going on here that can be optimized.
This may be one of those cases where something takes a long time simply because you are processing a lot of data. Perhaps there is a way to cache this data so you don't have to generate it as often. Is there anything else you can tell us that would help us find a better way to do this?
Side note: It is very important that you validate your suspicions as to the particular section of code that is causing the slowness. Do this by profiling your code so that you don't spend time trying to fix a problem that exists elsewhere.
As others have said, the StringBuilder is about as fast as you're going to get, so assuming this is the only bit of code that could be causing your slow down, there's probably not much you can do... but you could slightly optimise it by removing the small amount of string concatenation you are doing. I.e:
for (int a = 0; a < grdMass.RowCount; a++)
{
    if (a == 0)
    {
        _MSISDN.Append("'");
    }
    else
    {
        _MSISDN.Append(",'");
    }
    _MSISDN.Append(grdMass.Rows[a].Cells[0].Value);
    _MSISDN.AppendLine("'");
}
Edit: You could also clean up the if statement (although I highly doubt it's having a noticeable effect) like so:
// First row
if (grdMass.RowCount > 0)
{
    _MSISDN.Append("'");
    _MSISDN.Append(grdMass.Rows[0].Cells[0].Value);
    _MSISDN.AppendLine("'");
}
// Second row onwards
for (int a = 1; a < grdMass.RowCount; a++)
{
    _MSISDN.Append(",'");
    _MSISDN.Append(grdMass.Rows[a].Cells[0].Value);
    _MSISDN.AppendLine("'");
}
I'm suspecting that it's not the string building that takes a long time, perhaps it's accessing the grid elements that is slow.
You could rewrite your code like this:
var cellValues = grdMass.Rows
    .Cast<DataGridViewRow>() // Rows is a non-generic collection, so cast before using LINQ
    .Select(r => "'" + r.Cells[0].Value.ToString() + "'")
    .ToArray();
return String.Join(",", cellValues);
Now you can verify which part takes the most time. Is it building the cellValues array, or is it the String.Join call?
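One rough way to check, sketched with System.Diagnostics.Stopwatch (this assumes grdMass is a WinForms DataGridView, hence the Cast):

var sw = System.Diagnostics.Stopwatch.StartNew();
var cellValues = grdMass.Rows
    .Cast<DataGridViewRow>()
    .Select(r => "'" + r.Cells[0].Value.ToString() + "'")
    .ToArray();
Console.WriteLine("Reading the grid took " + sw.ElapsedMilliseconds + " ms");

sw.Restart();
var joined = String.Join(",", cellValues);
Console.WriteLine("String.Join took " + sw.ElapsedMilliseconds + " ms");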
StringBuilder is pretty much as fast as it gets for building up strings -- and that is pretty goshdarned fast. If StringBuilder is too slow, you are probably trying to process too much data in one go. Are you sure it is really the string building which is slow and not some other part of the processing?
One tip that will speed up StringBuilder for very large strings: set the capacity up front. That is, call the StringBuilder(int) constructor instead of the default constructor, passing an estimate of the number of characters you plan to write. It will still expand if you underestimate -- this just saves the initial "well, 1K wasn't enough, time to allocate another 2K... 4K... etc." But this will make only a small difference, and only if your strings are very long.
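For example, a quick sketch (the per-row estimate here is just a guess; anything in the right ballpark will do):

// Reserve space up front: roughly rows × (average value length + quotes, comma and newline).
var _MSISDN = new StringBuilder(grdMass.RowCount * 16);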
This would be better....
if (grdMass.RowCount > 0)
{
    _MSISDN.AppendLine("'" + grdMass.Rows[0].Cells[0].Value.ToString() + "'");
    for (int a = 1; a < grdMass.RowCount; a++)
    {
        _MSISDN.AppendLine(",'" + grdMass.Rows[a].Cells[0].Value.ToString() + "'");
    }
}

Counting occurrences of a string in an array and then removing duplicates

I am fairly new to C# programming and I am stuck on my little ASP.NET project.
My website currently examines Twitter statuses for URLs and then adds those URLs to an array, all via a regular expression pattern matching procedure. Clearly more than one person will post a status with a specific URL, so I do not want to list duplicates, and I want to count the number of times a particular URL is mentioned in, say, 100 tweets.
Now I have a List<String> which I can sort so that all duplicate URLs are next to each other. I was under the impression that I could compare list[i] with list[i+1] and if they match, for a counter to be added to (count++), and if they don't match, then for the URL and the count value to be added to a new array, assuming that this is the end of the duplicates.
This would remove duplicates and give me a count of the number of occurrences for each URL. At the moment, what I have is not working, and I do not know why (like I say, I am not very experienced with it all).
With the code below, assume that a JSON feed has been searched using a keyword and the results are in srchResponse.results. The results with URLs in them get added to sList, a string List type, which contains only the URLs, not the message as a whole.
I want to put one of each URL (no duplicates), a count integer (as a string) for the number of occurrences of that URL, and the username, message, and user image URL all into my jagged array called 'urls[100][]'. I have made the array 100 rows long to make sure everything can fit, but generally this is too big. Each 'row' will have 5 elements in it.
The debugger gets stuck on the line: if (sList[i] == sList[i + 1]) which is the crux of my idea, so clearly the logic is not working. Any suggestions or anything will be seriously appreciated!
Here is sample code:
var sList = new ArrayList();
string[][] urls = new string[100][];
int ctr = 0;
int j = 1;
foreach (Result res in srchResponse.results)
{
    string content = res.text;
    string pattern = @"((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+[\w\d:##%/;$()~_?\+-=\\\.&]*)";
    MatchCollection matches = Regex.Matches(content, pattern);
    foreach (Match match in matches)
    {
        GroupCollection groups = match.Groups;
        sList.Add(groups[0].Value.ToString());
    }
}
sList.Sort();
foreach (Result res in srchResponse.results)
{
    for (int i = 0; i < 100; i++)
    {
        if (sList[i] == sList[i + 1])
        {
            j++;
        }
        else
        {
            urls[ctr][0] = sList[i].ToString();
            urls[ctr][1] = j.ToString();
            urls[ctr][2] = res.text;
            urls[ctr][3] = res.from_user;
            urls[ctr][4] = res.profile_image_url;
            ctr++;
            j = 1;
        }
    }
}
The code then goes on to add each result into a StringBuilder method with the HTML.
The description of your algorithm seems fine. I don't know what's wrong with the implementation; I haven't read it that carefully. (The fact that you are using an ArrayList is an immediate red flag; why aren't you using a more strongly typed generic collection?)
However, I have a suggestion. This is exactly the sort of problem that LINQ was intended to solve. Instead of writing all that error-prone code yourself, just describe the transformation you're interested in, and let the compiler work it out for you.
Suppose you have a list of strings and you wish to determine the number of occurrences of each:
var notes = new[] { "Do", "Fa", "La", "So", "Mi", "Do", "Re" };
var counts = from note in notes
             group note by note into g
             select new { Note = g.Key, Count = g.Count() };
foreach (var count in counts)
    Console.WriteLine("Note {0} occurs {1} times.", count.Note, count.Count);
Which I hope you agree is much easier to read than all that array logic you wrote. And of course, now you have your sequence of unique items; you have a sequence of counts, and each count contains a unique Note.
I'd recommend using a more sophisticated data structure than an array. A Set will guarantee that you have no duplicates.
It looks like the built-in C# collections don't include a Set, but there are 3rd-party implementations available, like this one.
Your loop fails because when i == 99, (i + 1) == 100 which is outside the bounds of your array.
But as others have pointed out, .NET 3.5 has ways of doing what you want more elegantly.
If you don't need to know how many duplicates a specific entry has, you could use the LINQ extension methods: .Count() gives you the total number of entries, and .Distinct().Count() gives you the number of unique ones.
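For example (assuming sList is a List<string>, as suggested above, rather than an ArrayList):

int totalUrls  = sList.Count;              // all extracted URLs, duplicates included
int uniqueUrls = sList.Distinct().Count(); // duplicates collapsed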
