How to optimize two foreach loops with large data sets - c#

At present each complete pass of the loop takes about 3.5 seconds, so the whole job would take roughly 63 hours at its current pace. Is there any better way to optimize this loop? The data sets are rather large: one has about 75,677 rows and the other around 700,000.
foreach (string i in dataText.Skip(1).OrderBy(x => x.Split(',')[0]))
{
    string[] textparts = i.Split(',');
    foreach (string j in newExcelList.Skip(3))
    {
        string[] excelparts = j.Split(',');
        if (textparts[0] == excelparts[0])
        {
            sw.WriteLine($"{excelparts[0]}\tN25\tPRIM\t{excelparts[1]}\t{excelparts[2]}\t{excelparts[3]}\t{excelparts[4]}\t");
        }
    }
}

None of this data appears to change from iteration to iteration. You can therefore split both dataText and newExcelList into jagged arrays once, before either loop. Otherwise you're redundantly splitting the same strings an enormous number of times, and usually not even using all of the values. So try this instead:
string[][] dataTextRows = dataText.Skip(1)
    .Select(dt => dt.Split(','))
    .OrderBy(dtrow => dtrow[0])
    .ToArray();

string[][] excelListRows = newExcelList.Skip(3)
    .Select(el => el.Split(','))
    .ToArray();

foreach (string[] textparts in dataTextRows)
{
    foreach (string[] excelparts in excelListRows)
    {
        if (textparts[0] == excelparts[0])
        {
            sw.WriteLine($"{excelparts[0]}\tN25\tPRIM\t{excelparts[1]}\t{excelparts[2]}\t{excelparts[3]}\t{excelparts[4]}\t");
        }
    }
}
But if you want to make a really pro-level move, you can avoid the second loop altogether by using a Dictionary keyed on the value of the first column:
var excelListDict = excelListRows.ToDictionary(row => row[0], row => row);

foreach (string[] textparts in dataTextRows)
{
    if (!excelListDict.TryGetValue(textparts[0], out var excelparts))
        continue;

    sw.WriteLine($"{excelparts[0]}\tN25\tPRIM\t{excelparts[1]}\t{excelparts[2]}\t{excelparts[3]}\t{excelparts[4]}\t");
}
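One caveat, assuming the first column of the Excel list can contain duplicates: ToDictionary throws on a repeated key, and a plain Dictionary keeps only one row per key, so the output could differ from the nested loops above. A Lookup preserves every matching row; a minimal sketch:

// Sketch assuming duplicate first-column values are possible:
// ToLookup groups all rows that share the same key, and indexing a
// missing key simply yields an empty sequence.
var excelListLookup = excelListRows.ToLookup(row => row[0]);

foreach (string[] textparts in dataTextRows)
{
    foreach (string[] excelparts in excelListLookup[textparts[0]])
    {
        sw.WriteLine($"{excelparts[0]}\tN25\tPRIM\t{excelparts[1]}\t{excelparts[2]}\t{excelparts[3]}\t{excelparts[4]}\t");
    }
}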

Remove names that contain another in a list

I have a file with "Name|Number" in each line and I wish to remove the lines with names that contain another name in the list.
For example, if there is "PEDRO|3", "PEDROFILHO|5", "PEDROPHELIS|1" in the file, I wish to remove the lines "PEDROFILHO|5" and "PEDROPHELIS|1".
The list has 1.8 million lines. I made it like this, but it's too slow:
List<string> names = File.ReadAllLines("firstNames.txt").ToList();
List<string> result = File.ReadAllLines("firstNames.txt").ToList();

foreach (string name in names)
{
    string tempName = name.Split('|')[0];
    List<string> temp = names.Where(t => t.Contains(tempName)).ToList();

    foreach (string str in temp)
    {
        if (str.Equals(name))
        {
            continue;
        }
        result.Remove(str);
    }
}

File.WriteAllLines("result.txt", result);
Does anyone know a faster way? Or how to improve the speed?
Since you are looking for matches anywhere in the word, you will end up with an O(n²) algorithm. You can improve the implementation a bit by avoiding string deletion from a list, which is an O(n) operation in itself:
var toDelete = new HashSet<string>();
var names = File.ReadAllLines("firstNames.txt");

foreach (string name in names)
{
    var tempName = name.Split('|')[0];
    toDelete.UnionWith(
        // Length constraint removes self-matches
        names.Where(t => t.Length > name.Length && t.Contains(tempName))
    );
}

File.WriteAllLines("result.txt", names.Where(name => !toDelete.Contains(name)));
This works, but I don't know if it's quicker; I haven't tested it on millions of lines. Remove the ToLower calls if the names are all in the same case.
List<string> names = File.ReadAllLines(@"C:\Users\Rob\Desktop\File.txt").ToList();

var result = names.Where(w => !names.Any(a => w.Split('|')[0].Length > a.Split('|')[0].Length
                                           && w.Split('|')[0].ToLower().Contains(a.Split('|')[0].ToLower())));

File.WriteAllLines(@"C:\Users\Rob\Desktop\result.txt", result);
The test file had:
Rob|1
Robbie|2
Bert|3
Robert|4
Jan|5
John|6
Janice|7
Carol|8
Carolyne|9
Geoff|10
Geoffrey|11
The result had:
Rob|1
Bert|3
Jan|5
John|6
Carol|8
Geoff|10

c# linq list with varying where conditions

private void getOrders()
{
    try
    {
        // headerFileReader is assigned with a CSV file (not shown here).
        while (!headerFileReader.EndOfStream)
        {
            headerRow = headerFileReader.ReadLine();
            getOrderItems(headerRow.Substring(0, 8));
        }
    }
    catch
    {
        // exception handling omitted in the original post
    }
}
private void getOrderItems(string ordNum)
{
    // lines is an array assigned with a CSV file...not shown here.
    var sorted = lines.Skip(1)
        .Select(line => new
        {
            SortKey = line.Split(delimiter)[1],
            Line = line
        })
        .OrderBy(x => x.SortKey)
        .Where(x => x.SortKey == ordNum);
    // Note: ordNum is different every time it is passed.

    foreach (var orderItems in sorted)
    {
        // Process each line here.
    }
}
Above is my code. For every order number from the header file, I process the detail lines. I would like to search only for the lines specific to that order number. The logic above works fine, but it rebuilds the query with the Where clause for every order number, which simply is not required and delays the process.
I basically want getOrderItems to look something like below, but I can't get there because sorted can't be passed in, although I think it should be possible:
private void getOrderItems(string ordNum)
{
    // I would like to have sorted loaded with data elsewhere, pass it to this
    // function and reference it by other means, but I am not able to get it.
    var newSorted = sorted.Where(x => x.SortKey == docNum);

    foreach (var orderItems in newSorted)
    {
        // Process each line here.
    }
}
Please suggest.
UPDATE: Thanks for the responses and improvements, but my main question is that I don't want to create the list every time (as shown in my code). I want to create the list only the first time and then just search within it for a particular value (here docNum, as shown). Please suggest.
It might be a good idea to preprocess your input lines and build a dictionary, where each distinct sort key maps to a list of lines. Building the dictionary is O(n), and after that you get constant time O(1) lookups:
// these are your unprocessed file lines
private string[] lines;

// this dictionary will map each `string` key to a `List<string>`
private Dictionary<string, List<string>> groupedLines;

// this is the method where you are loading your files (you didn't include it)
void PreprocessInputData()
{
    // you already have this part somewhere
    lines = LoadLinesFromCsv();

    // after loading, group the lines by `line.Split(delimiter)[1]`
    groupedLines = lines
        .Skip(1)
        .GroupBy(line => line.Split(delimiter)[1])
        .ToDictionary(x => x.Key, x => x.ToList());
}

private void ProcessOrders()
{
    while (!headerFileReader.EndOfStream)
    {
        var headerRow = headerFileReader.ReadLine();
        var ordNum = headerRow.Substring(0, 8); // the same key you were passing to getOrderItems

        List<string> itemsToProcess = null;
        if (groupedLines.TryGetValue(ordNum, out itemsToProcess))
        {
            // if you are here, then itemsToProcess contains all lines where
            // (line.Split(delimiter)[1]) == ordNum
        }
        else
        {
            // no such key in the dictionary
        }
    }
}
The following will get you what you want and also be more efficient.
var sorted = lines.Skip(1)
    .Where(line => line.Split(delimiter)[1] == ordNum)
    .Select(line => new
    {
        SortKey = line.Split(delimiter)[1],
        Line = line
    })
    .OrderBy(x => x.SortKey);

Appropriate datastructure for key.contains(x) Map/Dictionary

I am somewhat struggling with the terminology and complexity of my explanations here, feel free to edit it.
I have 1,000 - 20,000 objects. Each one can contain several name words (first, second, middle, last, title...), normalized numbers (home, business...), email addresses, or even physical addresses and spouse names.
I want to implement a search that lets users freely combine word parts and number parts. When I search for "LL 676", I want to find all objects that contain any string with "LL" AND any string with "676".
Currently I am iterating over every object and every object's properties, splitting the searchString on " " and doing a stringInstance.Contains(searchword).
This is too slow, so I am looking for a better solution.
What is the appropriate language agnostic data structure for this?
In my case I need it for C#.
Is the following data structure a good solution?
It's based on a HashMap/Dictionary.
First I create a string that contains all the name parts and phone numbers I want to look through; one example would be "William Bill Henry Gates III 3. +436760000 billgatesstreet 12".
Then I split on " " and, for every word x, I create all possible substrings y that fulfill x.Contains(y). I put each of those substrings into the hashmap/dictionary.
On lookup/search I just need to run the search for every search word and then join the results. Naturally, the lookup speed is blazingly fast (native hashmap/dictionary speed).
EDIT: Inserts are very fast as well (insignificant time) now that I use a smarter algorithm to get the substrings.
It's possible I've misunderstood your algorithm or requirement, but this seems like it could be a potential performance improvement:
foreach (string arg in searchWords)
{
    if (String.IsNullOrEmpty(arg))
        continue;

    tempList = new List<T>();

    if (dictionary.ContainsKey(arg))
        foreach (T obj in dictionary[arg])
            if (list.Contains(obj))
                tempList.Add(obj);

    list = new List<T>(tempList);
}
The idea is that you do the first search word separately before this, and only put all the subsequent words into the searchWords list.
That should allow you to remove your final foreach loop entirely. Results only stay in your list as long as they keep matching every searchWord, rather than initially piling in everything that matches any single word and then filtering it back out at the end.
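For example (a rough sketch rather than the poster's code: it assumes the same dictionary field and a non-empty, already-uppercased searchWords array), the working list could be seeded from the first word before running the loop above:

// Hypothetical seeding step: start with the matches for the first word,
// then let the intersection loop above whittle the list down word by word.
List<T> list = dictionary.ContainsKey(searchWords[0])
    ? new List<T>(dictionary[searchWords[0]])
    : new List<T>();

// Only the remaining words need to go through the intersection loop.
var remainingWords = searchWords.Skip(1);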
In case anyone cares for my solution:
Disclaimer:
This is only a rough draft.
I have only done some synthetic testing, and I have written a lot of it without testing it again. I have revised my code: inserts are now ((n^2)/2)+(n/2) operations instead of 2^n-1, which is vastly faster, and word length is now irrelevant. (For a 10-character word, that is 55 inserts instead of 1,023.)
namespace MegaHash
{
    using System;
    using System.Collections.Concurrent;
    using System.Collections.Generic;
    using System.Threading.Tasks;

    public class GenericConcurrentMegaHash<T>
    {
        // After doing a bulk add, call AwaitAll() to ensure all data was added!
        private ConcurrentBag<Task> bag = new ConcurrentBag<Task>();

        private ConcurrentDictionary<string, List<T>> dictionary = new ConcurrentDictionary<string, List<T>>();

        // consider changing this to include, for example, '-'
        public char[] splitChars;

        public GenericConcurrentMegaHash()
            : this(new char[] { ' ' })
        {
        }

        public GenericConcurrentMegaHash(char[] splitChars)
        {
            this.splitChars = splitChars;
        }

        public void Add(string keyWords, T o)
        {
            keyWords = keyWords.ToUpper();

            foreach (string keyWord in keyWords.Split(splitChars))
            {
                // skip empty tokens but keep processing the remaining key words
                if (keyWord == null || keyWord.Length < 1)
                    continue;

                this.bag.Add(Task.Factory.StartNew(() => { AddInternal(keyWord, o); }));
            }
        }

        public void AwaitAll()
        {
            lock (this.bag)
            {
                foreach (Task t in bag)
                    t.Wait();

                this.bag = new ConcurrentBag<Task>();
            }
        }

        private void AddInternal(string key, T y)
        {
            // index every substring of the key
            for (int i = 0; i < key.Length; i++)
            {
                for (int i2 = 0; i2 < i + 1; i2++)
                {
                    string desire = key.Substring(i2, key.Length - i);

                    // GetOrAdd avoids the race where two threads create separate lists for the same key
                    List<T> l = dictionary.GetOrAdd(desire, _ => new List<T>());
                    lock (l)
                    {
                        if (!l.Contains(y))
                            l.Add(y);
                    }
                }
            }
        }

        public IList<T> FulltextSearch(string searchString)
        {
            searchString = searchString.ToUpper();

            List<T> list = new List<T>();
            string[] searchWords = searchString.Split(splitChars);

            // first pass: collect every object that matches at least one search word
            foreach (string arg in searchWords)
            {
                if (arg == null || arg.Length < 1)
                    continue;

                if (dictionary.ContainsKey(arg))
                    foreach (T obj in dictionary[arg])
                        if (!list.Contains(obj))
                            list.Add(obj);
            }

            // second pass: keep only the objects that match every search word
            List<T> returnList = new List<T>();
            foreach (T o in list)
            {
                bool matchesAll = true;
                foreach (string arg in searchWords)
                {
                    if (arg == null || arg.Length < 1)
                        continue;

                    List<T> matches;
                    if (!dictionary.TryGetValue(arg, out matches) || !matches.Contains(o))
                    {
                        matchesAll = false;
                        break;
                    }
                }

                if (matchesAll)
                    returnList.Add(o);
            }

            return returnList;
        }
    }
}
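As a quick illustration of how the class above is meant to be used (the data here is made up, and the int payload just stands in for whatever object type you index):

// Hypothetical usage of GenericConcurrentMegaHash<T> from above.
var index = new GenericConcurrentMegaHash<int>();

index.Add("William Bill Henry Gates III 3. +436760000 billgatesstreet 12", 1);
index.Add("Steve Ballmer +431234567 somestreet 7", 2);

index.AwaitAll(); // wait for the background insert tasks before searching

IList<int> hits = index.FulltextSearch("LL 676");
// hits contains only 1: object 1 has a word containing "LL" and a word containing "676",
// while object 2 matches "LL" (in BALLMER) but not "676".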

C# loop optimization

I have this loop, and it runs for a large count, at least 30,000 times. I am looking for some way to improve its performance.
DbRecordDictionary is derived from the DictionaryBase class.
Here is the loop:
ArrayList noEnter = new ArrayList();
DbRecordDictionary oldArray = new DbRecordDictionary();
DbRecordDictionary userArray = new DbRecordDictionary();
DbRecordDictionary result = null;

foreach (string key in keys)
{
    if (noEnter.Contains(key))
    {   // may need cast!
        if (count < 1)
            result.Add(key, userArray[key]);
        else if (oldArray.Count == 0)
            break;
        else if (oldArray.Contains(key))
            result.Add(key, userArray[key]);
    }
}
You may want to use a Dictionary or HashSet for oldArray, and for noEnter too if that is an array; beyond that there is not much you can do.
From what I can see, the variables count and oldArray never change during the loop, so you can move those conditions outside the loop and make two different loops:
if (count < 1)
{
    foreach (string key in keys)
    {
        if (noEnter.Contains(key))
        {
            result.Add(key, userArray[key]);
        }
    }
}
else if (oldArray.Count == 0)
{
    // no data
}
else
{
    foreach (string key in keys)
    {
        if (noEnter.Contains(key))
        {
            if (oldArray.Contains(key))
            {
                result.Add(key, userArray[key]);
            }
        }
    }
}
The collections noEnter and oldArray should be dictionaries, otherwise you will be spending a lot of execution time in the Contains calls.
If noEnter has more than about 10 items in it, use a Dictionary (or HashSet) rather than a List/Array for it. A Dictionary can look up an item without having to examine every item, whereas a List/Array has to loop over all of them.
Otherwise, consider sorting "keys" and "oldArray" and then performing a "merge" on them. Look at the code for a merge sort to see how to do the merge. (This would need careful profiling.)
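A minimal sketch of the hash-based lookup idea (assuming noEnter holds strings; the other names are taken from the question as-is):

// Copy noEnter into a HashSet once, so each membership test is O(1)
// instead of a linear scan of the ArrayList.
var noEnterSet = new HashSet<string>(noEnter.Cast<string>());

foreach (string key in keys)
{
    if (!noEnterSet.Contains(key))
        continue;

    // the rest of the logic is unchanged from the question
    if (count < 1)
        result.Add(key, userArray[key]);
    else if (oldArray.Count == 0)
        break;
    else if (oldArray.Contains(key))
        result.Add(key, userArray[key]);
}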
Try this:
(from k in keys
 where noEnter.Contains(k) &&
       oldArray.Count > 0 &&
       count < 1 &&
       oldArray.Contains(k)
 select k)
 .ToList()
 .ForEach(k => result.Add(k, userArray[k]));
A small optimization could be to use generics: ArrayList noEnter = new ArrayList(); would become List<string> noEnter = new List<string>();, and DbRecordDictionary would inherit from the generic Dictionary instead of DictionaryBase.
I'm not 100% sure you would gain performance, but you would be using a more modern C# style.
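A sketch of what that could look like (DbRecord is a hypothetical stand-in for whatever record type the dictionary actually stores):

// Hypothetical generic replacements for the non-generic collections.
List<string> noEnter = new List<string>();

public class DbRecordDictionary : Dictionary<string, DbRecord>
{
    // existing members can stay; Add, Contains and the indexer are now strongly typed
}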

Compare adjacent list items

I'm writing a duplicate file detector. To determine if two files are duplicates I calculate a CRC32 checksum. Since this can be an expensive operation, I only want to calculate checksums for files that have another file with matching size. I have sorted my list of files by size, and am looping through to compare each element to the ones above and below it. Unfortunately, there is an issue at the beginning and end since there will be no previous or next file, respectively. I can fix this using if statements, but it feels clunky. Here is my code:
public void GetCRCs(List<DupInfo> dupInfos)
{
    var crc = new Crc32();
    for (int i = 0; i < dupInfos.Count(); i++)
    {
        if (dupInfos[i].Size == dupInfos[i - 1].Size || dupInfos[i].Size == dupInfos[i + 1].Size)
        {
            dupInfos[i].CheckSum = crc.ComputeChecksum(File.ReadAllBytes(dupInfos[i].FullName));
        }
    }
}
My question is:
How can I compare each entry to its neighbors without the out of bounds error?
Should I be using a loop for this, or is there a better LINQ or other function?
Note: I did not include the rest of my code to avoid clutter. If you want to see it, I can include it.
Compute the CRCs first:
// It is assumed that DupInfo.CheckSum is nullable
public void GetCRCs(List<DupInfo> dupInfos)
{
    var crc = new Crc32();
    dupInfos[0].CheckSum = null;

    for (int i = 1; i < dupInfos.Count(); i++)
    {
        dupInfos[i].CheckSum = null;
        if (dupInfos[i].Size == dupInfos[i - 1].Size)
        {
            if (dupInfos[i - 1].CheckSum == null)
                dupInfos[i - 1].CheckSum = crc.ComputeChecksum(File.ReadAllBytes(dupInfos[i - 1].FullName));

            dupInfos[i].CheckSum = crc.ComputeChecksum(File.ReadAllBytes(dupInfos[i].FullName));
        }
    }
}
After sorting your files by size and CRC, identify the duplicates:
public void GetDuplicates(List<DupInfo> dupInfos)
{
    // loop is inverted to allow list item deletion
    for (int i = dupInfos.Count() - 1; i > 0; i--)
    {
        if (dupInfos[i].Size == dupInfos[i - 1].Size &&
            dupInfos[i].CheckSum != null &&
            dupInfos[i].CheckSum == dupInfos[i - 1].CheckSum)
        {
            // i is a duplicate of i - 1
            ... // your code here
            ... // eventually, dupInfos.RemoveAt(i);
        }
    }
}
"I have sorted my list of files by size, and am looping through to compare each element to the ones above and below it."
The next logical step is to actually group your files by size. Comparing consecutive files will not always be sufficient if you have more than two files of the same size. Instead, you will need to compare every file to every other same-sized file.
I suggest taking this approach:
Use LINQ's .GroupBy to create a collection of file-size groups, then .Where to keep only the groups with more than one file.
Within those groups, calculate the CRC32 checksum and add it to a collection of known checksums, comparing against the previously calculated ones. If you need to know which files specifically are duplicates, you could use a dictionary keyed by this checksum (you can achieve this with another GroupBy); otherwise a simple list will suffice to detect any duplicates.
The code might look something like this:
var filesSetsWithPossibleDupes = files.GroupBy(f => f.Length)
                                      .Where(group => group.Count() > 1);

foreach (var grp in filesSetsWithPossibleDupes)
{
    var checksums = new List<CRC32CheckSum>(); // or whatever type

    foreach (var file in grp)
    {
        var currentCheckSum = crc.ComputeChecksum(file);
        if (checksums.Contains(currentCheckSum))
        {
            // Found a duplicate
        }
        else
        {
            checksums.Add(currentCheckSum);
        }
    }
}
Or, if you need the specific objects that could be duplicates, the code might look like this:
var filesSetsWithPossibleDupes = files.GroupBy(f => f.FileSize)
                                      .Where(grp => grp.Count() > 1);

// A dictionary keyed by the basic duplicate stats,
// whose value is a collection of the possible duplicates
var masterDuplicateDict = new Dictionary<DupStats, IEnumerable<DupInfo>>();

foreach (var grp in filesSetsWithPossibleDupes)
{
    // Same GroupBy logic, but applied to the checksum (instead of file size)
    var likelyDuplicates = grp.GroupBy(dup => dup.Checksum)
                              .Where(g => g.Count() > 1);

    foreach (var dupGrp in likelyDuplicates)
    {
        // Create the key for the dictionary (your code is likely different)
        var sample = dupGrp.First();
        var key = new DupStats() { FileSize = sample.FileSize, Checksum = sample.Checksum };
        masterDuplicateDict.Add(key, dupGrp);
    }
}
A demo of this idea.
I think the for loop should be: for (int i = 1; i < dupInfos.Count() - 1; i++)
var grps = dupInfos.GroupBy(d => d.Size);
grps.Where(g => g.Count() > 1).ToList().ForEach(g =>
{
    ...
});
Can you do a union between your two lists? If you have a list of filenames and do a union it should result in only a list of the overlapping files. I can write out an example if you want but this link should give you the general idea.
https://stackoverflow.com/a/13505715/1856992
Edit: Sorry, for some reason I thought you were comparing file names, not sizes.
So here is an actual answer for you.
using System;
using System.Collections.Generic;
using System.Linq;

public class ObjectWithSize
{
    public int Size { get; set; }

    public ObjectWithSize(int size)
    {
        Size = size;
    }
}

public class Program
{
    public static void Main()
    {
        Console.WriteLine("start");
        var list = new List<ObjectWithSize>();
        list.Add(new ObjectWithSize(12));
        list.Add(new ObjectWithSize(13));
        list.Add(new ObjectWithSize(14));
        list.Add(new ObjectWithSize(14));
        list.Add(new ObjectWithSize(18));
        list.Add(new ObjectWithSize(15));
        list.Add(new ObjectWithSize(15));

        var duplicates = list.GroupBy(x => x.Size)
                             .Where(g => g.Count() > 1);

        foreach (var dup in duplicates)
            foreach (var objWithSize in dup)
                Console.WriteLine(objWithSize.Size);
    }
}
This will print out
14
14
15
15
Here is a .NET Fiddle for that:
https://dotnetfiddle.net/0ub6Bs
Final note: I actually think your answer looks better and will run faster. This was just an implementation in LINQ.
