comparing the contents of two huge text files quickly - c#

What I'm basically trying to do is compare two HUGE text files and, if they match, write out a string. I have this written, but it's extremely slow. I was hoping you guys might have a better idea. In the example below I'm comparing splitcollect[3] with spltifound[0].
string[] collectionlist = File.ReadAllLines(@"C:\found.txt");
string[] foundlist = File.ReadAllLines(@"C:\collection_export.txt");
foreach (string found in foundlist)
{
    string[] spltifound = found.Split('|');
    string matchfound = spltifound[0].Replace(".txt", "");
    foreach (string collect in collectionlist)
    {
        string[] splitcollect = collect.Split('\\');
        string matchcollect = splitcollect[3].Replace(".txt", "");
        if (matchcollect == matchfound)
        {
            end++;
            long finaldest = (start - end);
            Console.WriteLine(finaldest);
            File.AppendAllText(@"C:\copy.txt", "copy \"" + collect + "\" \"C:\\OUT\\" + spltifound[1] + "\\" + spltifound[0] + ".txt\"\n");
            break;
        }
    }
}
Sorry for the vagueness, guys.
What I'm trying to do is simply this: if content from one file exists in another, write out a string (the string isn't important; the time it takes to find the two matching entries is). collectionlist looks like this:
Apple|Farm
foundlist looks like this:
C:\cow\horse\turtle.txt
C:\cow\pig\apple.txt
What I'm doing is taking apple from collectionlist and finding the line that contains apple in foundlist, then writing out a basic Windows copy batch file. Sorry for the confusion.
Answer (all credit to Slaks):
string[] foundlist = File.ReadAllLines(@"C:\found.txt");
var collection = File.ReadLines(@"C:\collection_export.txt")
    .ToDictionary(s => s.Split('|')[0].Replace(".txt",""));
using (var writer = new StreamWriter(@"C:\Copy.txt"))
{
    foreach (string found in foundlist)
    {
        string[] splitFound = found.Split('\\');
        string matchFound = Path.GetFileNameWithoutExtension(found);
        string collectedLine;
        if (collection.TryGetValue(matchFound,out collectedLine))
        {
            string[] collectlinesplit = collectedLine.Split('|');
            end++;
            long finaldest = (start - end);
            Console.WriteLine(finaldest);
            writer.WriteLine("copy \"" + found + "\" \"C:\\O\\" + collectlinesplit[1] + "\\" + collectlinesplit[0] + ".txt\"");
        }
    }
}

Call File.ReadLines() (.NET 4) instead of ReadAllLines() (.NET 2.0).
ReadAllLines needs to build an array holding every line before it returns, which can be extremely slow for large files; ReadLines streams the lines as you enumerate them.
If you're not using .NET 4.0, replace it with a StreamReader.
Build a Dictionary<string, string> keyed by the matchCollect values (once), then loop through foundlist and check whether the dictionary contains matchFound.
This allows you to replace the O(n) inner loop with an O(1) hash lookup.
Use a StreamWriter instead of calling File.AppendAllText for every line.
EDIT: Call Path.GetFileNameWithoutExtension and the other Path methods instead of manually manipulating strings.
For example:
// foundlist is the string[] read from C:\collection_export.txt in the question.
var collection = File.ReadLines(@"C:\found.txt")
    .ToDictionary(s => s.Split('\\')[3].Replace(".txt", ""));
using (var writer = new StreamWriter(@"C:\Copy.txt")) {
    foreach (string found in foundlist) {
        string[] splitFound = found.Split('|');
        string matchFound = Path.GetFileNameWithoutExtension(splitFound[0]);
        string collectedLine;
        if (collection.TryGetValue(matchFound, out collectedLine)) {
            end++;
            long finaldest = (start - end);
            Console.WriteLine(finaldest);
            writer.WriteLine("copy \"" + collectedLine + "\" \"C:\\OUT\\"
                + splitFound[1] + "\\" + splitFound[0] + ".txt\"");
        }
    }
}

First I'd suggest normalizing both files and putting one of them in a set. This allows you to quickly test whether a specific line is present and reduces the complexity from O(n*n) to O(n).
Also you shouldn't open and close the file every time you write a line:
File.AppendAllText(...); // This causes the file to be opened and closed.
Open the output file once at the start of the operation, write lines to it, then close it when all lines have been written.
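A minimal sketch of both suggestions, written as a top-level program (C# 9 or later). It uses small in-memory arrays in place of the two files and a StringWriter in place of the output file so it runs anywhere. The sample lines, the Normalize helper, and the "match:" output format are assumptions for illustration, not the asker's real data.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical sample data standing in for the two input files.
string[] pathLines = { @"C:\cow\horse\turtle.txt", @"C:\cow\pig\apple.txt" };
string[] nameLines = { "Apple|Farm", "Zebra|Zoo" };

// Hypothetical normalization: reduce a path line to its bare, lower-case file name.
// Splitting on '\\' by hand keeps the sketch independent of the host OS path separator.
string Normalize(string pathLine)
{
    string fileName = pathLine.Substring(pathLine.LastIndexOf('\\') + 1);
    int dot = fileName.LastIndexOf('.');
    return (dot >= 0 ? fileName.Substring(0, dot) : fileName).ToLowerInvariant();
}

// Build the set once; after that each Contains call is O(1) on average.
var knownNames = new HashSet<string>();
foreach (string pathLine in pathLines)
    knownNames.Add(Normalize(pathLine));

// Open the output once, write every match, and let 'using' close it at the end.
// With a real file this would be a StreamWriter instead of a StringWriter.
var buffer = new StringBuilder();
using (TextWriter writer = new StringWriter(buffer))
{
    foreach (string nameLine in nameLines)
    {
        string name = nameLine.Split('|')[0].ToLowerInvariant();
        if (knownNames.Contains(name))
            writer.WriteLine("match: " + nameLine);
    }
}

Console.Write(buffer.ToString());
```

The set only answers "is this name present?". If the matching line itself is needed, as in the question, a Dictionary keyed the same way is the natural replacement.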

You have a Cartesian product, so it makes sense to index one side instead of doing an exhaustive linear search.
Extract the keys from one file and use either a Set or SortedList data structure to hold them. This will make the lookups much, much faster. (With a sorted structure your overall algorithm will be O(N lg N) instead of O(N**2).)
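The sorted variant can be sketched with a plain sorted array and Array.BinarySearch, which is where the O(N lg N) figure comes from: one sort, then a logarithmic lookup per line. The key values below are made up for illustration, and the snippet is written as a top-level program (C# 9 or later).

```csharp
using System;

// Hypothetical keys extracted from one file (for example, bare file names).
string[] keys = { "turtle", "apple", "horse" };

// Sort once: O(N lg N). Ordinal comparison keeps sorting and searching consistent.
Array.Sort(keys, StringComparer.Ordinal);

// Hypothetical keys extracted from the other file.
string[] candidates = { "apple", "zebra", "turtle" };

int matches = 0;
foreach (string candidate in candidates)
{
    // BinarySearch returns a non-negative index when the key is present: O(lg N) per lookup.
    if (Array.BinarySearch(keys, candidate, StringComparer.Ordinal) >= 0)
    {
        matches++;
        Console.WriteLine("found: " + candidate);
    }
}

Console.WriteLine(matches + " of " + candidates.Length + " candidates matched");
```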

Related

How can i fill a list or array when checking in condition?

Beginner here :). I want to fill an array or list with results from a foreach loop.
Noob example:
foreach (var drive in mounted_drives)
{
    //assigning new path where to look for the UNC path
    string getUNC = path + "\\" + drive;
    reg2 = Registry.CurrentUser.OpenSubKey(getUNC);
    string UNCPath = reg2.GetValue("RemotePath").ToString(); //getting UNC PATH
    Console.WriteLine(UNCPath);
}
So here I want each UNCPath to be saved to an outside array or list that I can use later to write it to a file.
I don't want to spill my ideas here since I'm not that deep into C# and .NET yet.
It may be simple, but I'm stuck -.-
Thanks in advance

You can try this:
List<string> mylist = new List<string>();
foreach (var drive in mounted_drives)
{
string getUNC = path + "\\" + drive;
reg2 = Registry.CurrentUser.OpenSubKey(getUNC);
string UNCPath = reg2.GetValue("RemotePath").ToString(); //getting UNC PATH
mylist.Add(UNCPath);
}
In case you should need to have an array instead of a list, you can use the method ToArray();
string[] myarray = mylist.ToArray();

So i want to fill an array or list with results from a foreach loop. I want each UNCPath to be saved to outside array or list that I can use later.
There are other answers that address this already, but I have a different approach and a few suggestions to improve your current code.
The first suggestion is: don't concatenate strings like you are:
string getUNC = path + "\\" + drive;
Look into the Path.Combine method to do this for you.
Secondly, you should always release resources when you can. You are opening registry keys, which means you should also close and dispose of them.
Below is a static class with an extension method. The method returns an IEnumerable<string>, so it can defer execution until you actually need the results. This is helpful considering you mentioned you want to use it later as a List<string> and/or an array.
using System.Collections.Generic;
using System.IO;
using System.Linq;
using Microsoft.Win32;
public static class RegistryHelper
{
    public static IEnumerable<string> GetRemotePaths(this IEnumerable<string> drives, string path)
    {
        if (drives == null || !drives.Any() || string.IsNullOrEmpty(path))
            yield break;
        foreach (string drive in drives)
        {
            // The using block closes and disposes the key, so no explicit Close() is needed.
            using (RegistryKey key = Registry.CurrentUser.OpenSubKey(Path.Combine(path, drive)))
            {
                object remotePath = key == null ? null : key.GetValue("RemotePath");
                if (remotePath != null)
                    yield return remotePath.ToString();
            }
        }
    }
}
Here's an example of usage for this extension routine:
var ienum = mounted_drives.GetRemotePaths(YOURPATHHERE); // Make sure to put your path in - delayed execution until you actually need it
var lstPaths = mounted_drives.GetRemotePaths(YOURPATHHERE).ToList(); // Make sure to put your path in - converts the return to a `List<string>`
var arrPaths = mounted_drives.GetRemotePaths(YOURPATHHERE).ToArray(); // Make sure to put your path in - converts the return to an array of strings

You can just create a List outside the foreach and then append what you want inside it.
var list = new List<string>();
list.Add("my string");
You can also use the Select method:
var list = mounted_drives.Select(drive => {
    string getUNC = path + "\\" + drive;
    reg2 = Registry.CurrentUser.OpenSubKey(getUNC);
    string UNCPath = reg2.GetValue("RemotePath").ToString(); //getting UNC PATH
    return UNCPath;
}).ToList();

Checking if two strings are equal

I want to check whether a string from the first line of a file is equal to another string.
The awkward part is that the strings are the same, but my program doesn't return true.
The string is teach and the first line of the file is teach too.
string date = System.IO.File.ReadAllText(folder + "/NPC/" + score_npc + "/" + score_npc + ".txt" );
if (condition)
{
string[] parametrii = date.Split('\n');
if (parametrii[0].Equals("teach"))
//instructions
I tried all the compare methods, and I wrote my own function too. My function told me that (parametrii[0])[0] == b.
Here is what the file looks like:
teach
poza1
poza2
end

That's probably because the newline sequence in the file is not \n. It may be \r\n instead.
Try File.ReadAllLines instead:
string[] lines = System.IO.File.ReadAllLines(folder + "/NPC/" + score_npc + "/" + score_npc + ".txt" );
if (condition)
{
if (lines[0].Equals("teach"))
// instructions
}
Edit
As Grant Winney suggests, if you only need to manipulate first line (or not all of the) file, you may use File.ReadLines:
string firstLine = File.ReadLines(path).First();
instead.

Have you tried changing
string[] parametrii = date.Split('\n');
into
string[] parametrii = date.Split(new[] { Environment.NewLine }, StringSplitOptions.None);?
I suspect it's because your strings contain a '\r' character.
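A small sketch of what the answers above describe, written as a top-level program (C# 9 or later). The content string is a made-up stand-in for the file, assuming it was saved with Windows line endings.

```csharp
using System;

// Hypothetical file content with Windows line endings (\r\n).
string content = "teach\r\npoza1\r\npoza2\r\nend";

// Splitting on '\n' alone leaves a trailing '\r' on every line but the last.
string[] naive = content.Split('\n');
Console.WriteLine(naive[0] == "teach");               // False: the value is "teach\r"
Console.WriteLine(naive[0].Length);                   // 6

// Splitting on both separators handles either style of file.
string[] robust = content.Split(new[] { "\r\n", "\n" }, StringSplitOptions.None);
Console.WriteLine(robust[0] == "teach");              // True

// Trimming the stray '\r' from the naive split is another option.
Console.WriteLine(naive[0].TrimEnd('\r') == "teach"); // True
```

Splitting on Environment.NewLine only helps when the file uses the same line ending as the machine running the code, which is why the sketch lists both separators.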

Searching for line of one text file in another text file, faster

Is there a faster way to search each line of one text file for occurrence in another text file, than by going line by line in both files?
I have two text files - one has ~2500 lines (let's call it TxtA), the other has ~86000 lines(TxtB). I want to search TxtB for each line in TxtA, and return the line in TxtB for each match found.
I currently have this setup as: for each line in TxtA, search TxtB line by line for a match. However this is taking a really long time to process. It seems like it would take 1-3 hours to find all the matches.
Here is my code...
private static void getGUIDAndType()
{
try
{
Console.WriteLine("Begin.");
System.Threading.Thread.Sleep(4000);
String dbFilePath = @"C:\WindowsApps\CRM\crm_interface\data\";
StreamReader dbsr = new StreamReader(dbFilePath + "newdbcontents.txt");
List<string> dblines = new List<string>();
String newDataPath = @"C:\WindowsApps\CRM\crm_interface\data\";
StreamReader nsr = new StreamReader(newDataPath + "HolidayList1.txt");
List<string> new1 = new List<string>();
string dbline;
string newline;
List<string> results = new List<string>();
while ((newline = nsr.ReadLine()) != null)
{
//Reset
dbsr.BaseStream.Position = 0;
dbsr.DiscardBufferedData();
while ((dbline = dbsr.ReadLine()) != null)
{
newline = newline.Trim();
if (dbline.IndexOf(newline) != -1)
{//if found... get all info for now
Console.WriteLine("FOUND: " + newline);
System.Threading.Thread.Sleep(1000);
new1.Add(newline);
break;
}
else
{//the first line of db does not contain this line...
//go to next dbline.
Console.WriteLine("Lines do not match - continuing");
continue;
}
}
Console.WriteLine("Going to next new Line");
System.Threading.Thread.Sleep(1000);
//continue;
}
nsr.Close();
Console.WriteLine("Writing to dbc3.txt");
System.IO.File.WriteAllLines(@"C:\WindowsApps\CRM\crm_interface\data\dbc3.txt", results.ToArray());
Console.WriteLine("Finished. Press ENTER to continue.");
Console.WriteLine("End.");
Console.ReadLine();
}
catch (Exception ex)
{
Console.WriteLine("Error: " + ex);
Console.ReadLine();
}
}
Please let me know if there is a faster way. Preferably something that would take 5-10 minutes... I've heard of indexing but didn't find much on this for txt files. I've tested regex and it's no faster than indexof. Contains won't work because the lines will never be exactly the same.
Thanks.

There might be a faster way, but this LINQ approach should be faster than 3 hours and is a sight easier to read and maintain:
var f1Lines = File.ReadAllLines(f1Path);
var f2LineInf1 = File.ReadLines(f2Path)
.Where( line => f1Lines.Contains(line))
.Select(line => line).ToList();
Edit: tested, and it required less than 1 second for 400000 lines in file2 and 17000 lines in file1. I can use File.ReadLines for the big file, which does not load everything into memory at once. For the smaller file I need to use File.ReadAllLines, since Contains needs the complete list of lines of file 1.
If you want to log the result in a third file:
File.WriteAllLines(logPath, f2LineInf1);
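For exact matches only, a possible variation is to put the smaller file's lines into a HashSet<string>, so that each Contains call is a hash lookup instead of a scan over the whole array. The sketch below is a top-level program (C# 9 or later) and uses made-up in-memory lines in place of both files; it does not cover the substring matching the question ultimately needs.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins for the lines of the small file and the big file.
string[] smallFileLines = { "alpha", "gamma" };
IEnumerable<string> bigFileLines = new[] { "alpha", "beta", "gamma", "delta" };

// HashSet<string>.Contains is a hash lookup; Contains on an array scans every element.
var wanted = new HashSet<string>(smallFileLines);

// Stream the big sequence once and keep only the lines present in the set.
List<string> common = bigFileLines.Where(line => wanted.Contains(line)).ToList();

Console.WriteLine(string.Join(", ", common)); // alpha, gamma
```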

EDIT: Note that I'm assuming it's reasonable to read at least one file into memory. You may want to swap the queries below around to avoid loading the "big" file into memory, but even 86,000 lines at (say) 1K per line is going to be less than 2G of memory - which is relatively little to do something significant.
You're reading the "inner" file each time. There's no need for that. Load both files into memory and go from there. Heck, for exact matches you can do the whole thing in LINQ easily:
var query = from line1 in File.ReadLines(newDataPath + "HolidayList1.txt")
join line2 in File.ReadLines(dbFilePath + "newdbcontents.txt")
on line1 equals line2
select line1;
var commonLines = query.ToList();
But for non-joins it's still simple; just read one file completely first (explicitly) and then stream the other:
// Eagerly read the "inner" file
var lines2 = File.ReadAllLines(dbFilePath + "newdbcontents.txt");
var query = from line1 in File.ReadLines(newDataPath + "HolidayList1.txt")
from line2 in lines2
where line2.Contains(line1)
select line1;
var commonLines = query.ToList();
There's nothing clever here - it's just a really simple way of writing code to read all the lines in one file, then iterate over the lines in the other file and for each line check against all the lines in the first file. But even without anything clever, I strongly suspect it would perform well enough for you. Concentrate on simplicity, eliminate unnecessary IO, and see whether that's good enough before trying to do anything fancier.
Note that in your original code, you should be using using statements for your StreamReader variables, to ensure they get disposed properly. Using the above code makes it simple to not even need that though...

Quick and dirty because I've got to go... If you can do it in memory, try working with this snippet:
//string[] searchIn = File.ReadAllLines("File1.txt");
//string[] searchFor = File.ReadAllLines("File2.txt");
string[] searchIn = new string[] {"A","AB","ABC","ABCD", null, "", " "};
string[] searchFor = new string[] {"A","BC","BCD", null, "", " "};
foreach (string item in searchFor)
{
    // Entries of searchIn that equal item, contain it, or are contained in it.
    string[] matchingItems = Array.FindAll(searchIn, x => (x == item) || (!string.IsNullOrEmpty(x) && !string.IsNullOrEmpty(item) ? (x.Contains(item) || item.Contains(x)) : false));
}

How does the C# compiler work with a split?

I have a List<string> that I am iterating through, splitting each item and then adding part of it to a StringBuilder.
foreach(string part in List)
{
StringBuilder.Append(part.Split(':')[1] + " ");
}
So my question is how many strings are created by doing this split? All of the splits are going to produce two items. So... I was thinking that it will create a string[2] and then an empty string. But, does it then create the concatenation of the string[1] + " " and then add it to the StringBuilder or is this optimized?

The code is actually equivalent to this:
foreach(string part in myList)
{
sb.Append(string.Concat(part.Split(':')[1], " "));
}
So yes, an additional string, representing the concatenation of the second part of the split and the single space, will be created.
Including the original string, you also have the two created by the call to Split(), and a reference to the literal string " ", which will be loaded from the assembly metadata.
You can save yourself the call to Concat() by appending the split result and the space sequentially:
sb.Append(part.Split(':')[1]).Append(" ");
Note that if you are only using string literals, then the compiler will make one optimization for you:
sb.Append("This is " + "one string");
is actually compiled to
sb.Append("This is one string");

3 extra strings for every item
part[0];
part[1];
part[1] + " "
The fewest allocations would come from avoiding the temporary strings completely, but the usual micro-optimization caveats apply.
var start = part.IndexOf(':') + 1;
stringbuilder.Append(part, start, part.Length-start).Append(' ');
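A runnable comparison of the two approaches, as a top-level program (C# 9 or later). The input list is invented for illustration and assumes every item contains exactly one ':'.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical input: every item has exactly one ':' separator.
var parts = new List<string> { "a:one", "b:two", "c:three" };

// Version from the question: Split allocates an array plus two strings,
// and the "+" allocates a third string per item.
var viaSplit = new StringBuilder();
foreach (string part in parts)
    viaSplit.Append(part.Split(':')[1] + " ");

// No temporary strings per item: copy the characters after ':' straight into the builder.
var viaIndexOf = new StringBuilder();
foreach (string part in parts)
{
    int start = part.IndexOf(':') + 1;
    viaIndexOf.Append(part, start, part.Length - start).Append(' ');
}

Console.WriteLine(viaSplit.ToString() == viaIndexOf.ToString()); // True
Console.WriteLine("[" + viaIndexOf + "]");                       // [one two three ]
```

Note that the IndexOf version behaves differently from Split when an item has no ':' at all: IndexOf returns -1, so start becomes 0 and the whole item is appended, whereas Split(':')[1] would throw.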

You have the original string 'split' - 1 string
You have the 'split' split into two - 2 strings
You have the two parts of split joined - 1 string
The string builder does not create a new string.
The current code uses 4 strings, including the original.
If you want to save one string do:
StringBuilder.Append(part.Split(':')[1]);
StringBuilder.Append(" ");

This code:
foreach(string part in List)
{
StringBuilder.Append(part.Split(':')[1] + " ");
}
Is equivalent to:
foreach(string part in List)
{
string tmp = string.Concat(part.Split(':')[1], " ");
StringBuilder.Append(tmp);
}
So yes, it's creating a string needlessly. This would be better, at least in terms of the number of strings generated:
foreach(string part in List)
{
StringBuilder.Append(part.Split(':')[1])
.Append(" ");
}

So for each value in the list (n, known as part in your code) you are allocating:
x (I assume 2) strings for the split.
n strings for the concatenation.
Roughly n + 1 string for the StringBuilder; probably much less though.
So you have nx + n + n + 1 at the end, and assuming the split always results in two values 4n + 1.
One way to improve this would be:
foreach(string part in List)
{
var val = part.Split(':')[1];
StringBuilder.EnsureCapacity(StringBuilder.Length + val.Length + 1);
StringBuilder.Append(val);
StringBuilder.Append(' ');
}
This makes it 3n + 1. It is a rough estimate as StringBuilder allocates strings as it runs out of space - but if you EnsureCapacity you will prevent it from getting it wrong.

Probably the only way to be sure how this is compiled is to build it and decompile it again with Reflector to see how it's handled internally. Anyway, keep in mind that it probably has no noticeable impact on the performance of the whole app.

C# string.Contains using variable

string[] pullspec = File.ReadAllLines(@"C:\fixedlist.txt");
foreach (string ps in pullspec)
{
    string pslower = ps.ToLower();
    string[] pslowersplit = pslower.Split('|');
    var keywords = File.ReadAllLines(@"C:\crawl\keywords.txt");
    if (pslower.Contains("|"))
    {
        if (pslower.Contains(keywords))
        {
            File.AppendAllText(@"C:\" + keyword + ".txt", pslowersplit[1] + "|" + pslowersplit[0] + "\n");
        }
    }
}
This doesn't compile because of pslower.Contains(keywords) but I'm not trying to do 100 foreach loops.
Does anybody have any suggestions?

Using LINQ:
if (keywords.Any(k => pslower.Contains(k)))
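A slightly fuller sketch, as a top-level program (C# 9 or later). Any only says whether some keyword matched; since the question's code also needs the matching keyword itself (for the output file name), Where is used to enumerate the ones that occur. The keyword list and input lines are invented, and results are collected in a list instead of being written to files.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical keyword list and input lines; in the question both come from text files.
// The keywords are loaded once, before the loop, not once per line.
string[] keywords = { "apple", "pear" };
string[] lines = { "Green APPLE|Farm", "Carrot|Garden", "no separator pear" };

var hits = new List<string>();
foreach (string line in lines)
{
    string lower = line.ToLower();
    if (!lower.Contains("|"))
        continue; // the question only handles lines that contain '|'

    // Any: does at least one keyword occur in this line?
    if (!keywords.Any(k => lower.Contains(k)))
        continue;

    string[] split = lower.Split('|');

    // Where: which keywords occur, so each one can be used (for example, as a file name).
    foreach (string keyword in keywords.Where(k => lower.Contains(k)))
        hits.Add(keyword + ": " + split[1] + "|" + split[0]);
}

foreach (string hit in hits)
    Console.WriteLine(hit); // apple: farm|green apple
```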

You have a collection of keywords, and you want to see if any of them (or all of them?) are contained in a given string. I don't see how you would solve this without using a loop somewhere, either explicit or hidden in some function or LINQ expression.

Another solution: create a string[] of the keywords and then call string[] parts = pslower.Split(yourStringArray, StringSplitOptions.None); if any of your keywords appear, then parts.Length > 1. You won't easily get your hands on the matching keywords this way, though.
