Remove duplicate files in different directories

I'm using Directory.EnumerateFiles to list files in two separate directories. Some of the files exist in both folders. How can I remove any duplicate files from the combined list?
try
{
    corporateFiles = Directory.EnumerateFiles(@"\\" + corporateServer, "*.pdf", SearchOption.AllDirectories).ToList();
}
catch
{
    corporateFiles = new List<string>();
}
try
{
    functionalFiles = Directory.EnumerateFiles(@"\\" + functionalServer, "*.pdf", SearchOption.AllDirectories).ToList();
}
catch
{
    functionalFiles = new List<string>();
}
var combinedFiles = corporateFiles.Concat(functionalFiles);

It seems I cannot satisfy my lust for LINQ.
Here's a one-liner:
var combinedFiles = corporateFiles.Concat(functionalFiles.Where(x => !(corporateFiles.Select(y => y.Split('\\').Last()).ToList().Intersect(functionalFiles.Select(y => y.Split('\\').Last()))).Contains(x.Split('\\').Last())));
This keeps the filepaths from corporateFiles. You can swap them if you prefer otherwise.
I'll attempt to format this to be more readable.
EDIT: Here's the code abstracted out a bit, hopefully more readable:
// Get common file names:
var duplicateFileNames = corporateFiles.Select(y => y.Split('\\').Last()).ToList().Intersect(functionalFiles.Select(y => y.Split('\\').Last()));
// Remove entries in 'functionalFiles' that are duplicates:
var functionalFilesWithoutDuplicates = functionalFiles.Where(x => !duplicateFileNames.Contains(x.Split('\\').Last()));
// Combine the un-touched 'corporateFiles' with the filtered 'functionalFiles':
var combinedFiles = corporateFiles.Concat(functionalFilesWithoutDuplicates);

Use Union instead of Concat:
var combinedFiles = corporateFiles.Union(functionalFiles);
Union on the raw strings only removes identical full paths, and paths from two different servers never match. To treat files with the same name as duplicates, use the overload that takes an IEqualityComparer<string> and compare only the file-name part:
var combined = corporateFiles.Union(functionalFiles, new FileNameComparer());
class FileNameComparer : EqualityComparer<string>
{
    public override bool Equals(string x, string y)
    {
        var name1 = Path.GetFileName(x);
        var name2 = Path.GetFileName(y);
        return name1 == name2;
    }
    public override int GetHashCode(string obj)
    {
        var name = Path.GetFileName(obj);
        return name.GetHashCode();
    }
}
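One caveat with the comparer above: it compares names case-sensitively, whereas Windows file systems usually treat Report.pdf and report.PDF as the same name. Below is a minimal sketch of a case-insensitive variant, assuming that behaviour is what you want. The names CaseInsensitiveFileNameComparer and DuplicateFileDemo are made up for illustration and are not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical variant of FileNameComparer that ignores case in the file name.
public sealed class CaseInsensitiveFileNameComparer : IEqualityComparer<string>
{
    public bool Equals(string x, string y)
    {
        return StringComparer.OrdinalIgnoreCase.Equals(Path.GetFileName(x), Path.GetFileName(y));
    }

    public int GetHashCode(string obj)
    {
        // Hash only the file-name part so that it stays consistent with Equals.
        return StringComparer.OrdinalIgnoreCase.GetHashCode(Path.GetFileName(obj) ?? string.Empty);
    }
}

public static class DuplicateFileDemo
{
    // Union keeps the first occurrence, so paths from 'first' win over 'second'.
    public static List<string> Combine(IEnumerable<string> first, IEnumerable<string> second)
    {
        return first.Union(second, new CaseInsensitiveFileNameComparer()).ToList();
    }
}
```

Calling DuplicateFileDemo.Combine(corporateFiles, functionalFiles) would then keep the corporate path whenever both lists contain a file with the same name.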

Related

How to pass a list of strings through webapi and get the results without those strings?

My code already returns the rows whose keyword differs from a single string. How can I exclude a whole list of strings instead? I want the equivalent of SELECT * FROM table WHERE column NOT IN ('x', 'y');
public IEnumerable<keyart1> Get(string keyword)
{
    List<keyart1> keylist;
    using (dbEntities5 entities = new dbEntities5())
    {
        keylist = entities.keyart1.Where(e => e.keyword != keyword).ToList();
        var result = keylist.Distinct(new ItemEqualityComparer());
        return result;
    }
}

I think I found the answer, in case anybody is interested:
public IEnumerable<keyart1> Get([FromUri] string[] keyword1)
{
    List<keyart1> keylist;
    List<IEnumerable<keyart1>> ll;
    using (dbEntities5 entities = new dbEntities5())
    {
        ll = new List<IEnumerable<keyart1>>();
        foreach (var item in keyword1)
        {
            keylist = entities.keyart1.Where(e => e.keyword != item).ToList();
            var result = keylist.Distinct(new ItemEqualityComparer());
            ll.Add(result);
        }
        var intersection = ll.Aggregate((p, n) => p.Intersect(n).ToList());
        return intersection;
    }
}
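The loop above runs one query per keyword and then intersects the results. The SQL NOT IN from the question can also be expressed as a single negated Contains, which Entity Framework can generally translate to NOT IN when the collection is a local array or list. Here is a sketch on an in-memory list so that it runs without a database. The KeyArt class and its properties are stand-ins for the keyart1 entity, not the real model, and KeywordFilter is a name invented for this example.

```csharp
using System.Collections.Generic;
using System.Linq;

// Stand-in for the keyart1 entity (assumed shape, for illustration only).
public class KeyArt
{
    public int Id { get; set; }
    public string Keyword { get; set; }
}

public static class KeywordFilter
{
    // Keeps every row whose keyword is NOT in the excluded list.
    public static List<KeyArt> Exclude(IEnumerable<KeyArt> rows, IEnumerable<string> excluded)
    {
        var excludedSet = new HashSet<string>(excluded);
        return rows.Where(r => !excludedSet.Contains(r.Keyword)).ToList();
    }
}
```

Against the real context, the query would presumably read entities.keyart1.Where(e => !keyword1.Contains(e.keyword)), with keyword1 being the string[] bound from the query string.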

C# Sort Script/Picking the last Int from a foreach

I need some help with my sort script. I want to sort some files.
The file names are built like this: Name#Page#Version
I can extract the name/category and the page, but I don't know how to pick the last version.
Here is an example:
foreach (string files in Directory.GetFiles(path).OrderBy(fi => fi.Length))
{
    try
    {
        filename = Path.GetFileNameWithoutExtension(files);
        index = filename.LastIndexOf("#");
        index2 = filename.LastIndexOf("#", index - 1);
        strversion = filename.Substring(index + 1);
        strpage = filename.Substring(index2 + 1);
        strpage = strpage.Substring(0, strpage.LastIndexOf("#"));
        page = Int32.Parse(strpage);
        version = Int32.Parse(strversion);
        Console.WriteLine("Page: " + page);
        Console.WriteLine("Version: " + version);
        if (filename.Contains("SMA"))
        {
            if (page == 1)
            {
                Console.WriteLine(filename);
            }
        }
    }
    catch (ArgumentOutOfRangeException e)
    {
        Console.WriteLine(e.Message);
    }
}

You're overcomplicating things. You can split the string on # and take what you need from the resulting array:
var fileName = "SMA#1#2";
var parts = fileName.Split('#');
var name = parts[0];
var page = parts[1];
var version = parts[2];
EDIT
As for getting the last version for each page, you're probably better off creating some sort of class for your file and then grouping by page, and then sorting by version, and then selecting the first one:
public class Program
{
    public static void Main()
    {
        var fileNames = new[] { "SMA#1#1", "SMA#1#2", "SMA#1#3", "SMA#2#1", "SMA#2#3" };
        var files = (from fileName in fileNames
                     select fileName.Split('#') into parts
                     let name = parts[0]
                     let page = Int32.Parse(parts[1])
                     let version = Int32.Parse(parts[2])
                     select new MyFile(name, page, version)).ToList();
        var grouped = files.GroupBy(x => x.Page).ToList();
        foreach (var group in grouped)
        {
            var ordered = group.OrderByDescending(x => x.Version);
            Console.WriteLine($"Page {group.Key} highest version: {ordered.First().Version}");
        }
    }
}
public class MyFile
{
    public string Name { get; set; }
    public int Page { get; set; }
    public int Version { get; set; }
    public MyFile(string name, int page, int version)
    {
        Name = name;
        Page = page;
        Version = version;
    }
}

If I understand your requirement correctly, you want to:
- filter out every file not containing "SMA"
- then order by page
- then by version
You can achieve this quite declaratively using LINQ:
var orderedFileNames =
    fileNames
        .Where(fn => fn.Contains("SMA"))
        // parse name
        .Select(fn => fn.Split('#'))
        // pull parts into anonymous type
        .Select(fn => new {
            Name = fn[0], Page = int.Parse(fn[1]), Version = int.Parse(fn[2])
        })
        .OrderBy(fn => fn.Name)
        .ThenBy(fn => fn.Page)
        .ThenBy(fn => fn.Version);

int lastIndex = fileName.LastIndexOf('#');
string version = fileName.Substring(lastIndex + 1);
Is that what you are looking for?

Trying to query many text files in the same folder with linq

I need to search a folder containing CSV files. The records I'm interested in have three fields: Rec, Country and Year. My job is to search the files and see whether any of them has records for more than a single year. Below is the code I have so far:
// Get each individual file from the folder.
string startFolder = @"C:\MyFileFolder\";
System.IO.DirectoryInfo dir = new System.IO.DirectoryInfo(startFolder);
IEnumerable<System.IO.FileInfo> fileList = dir.GetFiles("*.*",
    System.IO.SearchOption.AllDirectories);
var queryMatchingFiles =
    from file in fileList
    where (file.Extension == ".dat" || file.Extension == ".csv")
    select file;
Then I came up with this code to read the year field from each file and find the files where the number of distinct years is greater than 1 (I did not manage to implement the count part):
public void GetFileData(string filesname, char sep)
{
    using (StreamReader reader = new StreamReader(filesname))
    {
        // note: StreamReader has no built-in Lines method; this assumes a custom extension
        var recs = (from line in reader.Lines(sep.ToString())
                    let parts = line.Split(sep)
                    select parts[2]);
    }
}
Below is a sample file:
REC,IE,2014
REC,DE,2014
REC,FR,2015
Now I'm struggling to combine these two ideas to solve my problem in a single query. The query should list those files that have records for more than one year.
Thanks in advance.

Something along these lines:
string startFolder = @"C:\MyFileFolder\";
System.IO.DirectoryInfo dir = new System.IO.DirectoryInfo(startFolder);
IEnumerable<System.IO.FileInfo> fileList = dir.GetFiles("*.*",
    System.IO.SearchOption.AllDirectories);
// Names of the files that contain more than one distinct year
var fileData =
    from file in fileList
    where (file.Extension == ".dat" || file.Extension == ".csv")
    let name = GetFileData(file.FullName, ',')
    where name != null
    select name;
// Returns the file name if the file holds more than one distinct year, otherwise null.
public string GetFileData(string filename, char sep)
{
    // File.ReadLines reads the file line by line
    var recs = (from line in File.ReadLines(filename)
                let parts = line.Split(sep)
                select parts[2]);
    var multipleyears = recs.Distinct().Count();
    return multipleyears > 1 ? filename : null;
}

I'm not on my development machine, so this might not compile as is, but here's a direction:
var lines = File.ReadAllLines(path); // path: the file being checked
var years = from line in lines
            let parts = line.Split(new[] { ',' })
            select parts[2];
var distinct_years = years.Distinct();
if (distinct_years.Count() > 1)
{
    // this file has several years
}

"My job is to search the files and see if any of the files has records for more than a single year."
Taken literally, this asks for a Boolean result: whether any of the files has such records.
For fun I'll extend it a little:
My job is to get the collection of files that contain records for more than a single year.
You were almost there. Let's first declare a class for the records in your file:
public class MyRecord
{
    public string Rec { get; set; }
    public string CountryCode { get; set; }
    public int Year { get; set; }
}
I'll write an extension method for the class FileInfo that reads the file and returns the sequence of MyRecords in it.
For extension methods, see MSDN: Extension Methods (C# Programming Guide).
public static class FileInfoExtension
{
    public static IEnumerable<MyRecord> ReadMyRecords(this FileInfo file, char separator)
    {
        var records = new List<MyRecord>();
        using (var reader = new StreamReader(file.FullName))
        {
            var lineToProcess = reader.ReadLine();
            while (lineToProcess != null)
            {
                var splitLines = lineToProcess.Split(new char[] { separator }, 3);
                if (splitLines.Length < 3) throw new InvalidDataException();
                var record = new MyRecord()
                {
                    Rec = splitLines[0],
                    CountryCode = splitLines[1],
                    Year = Int32.Parse(splitLines[2]),
                };
                records.Add(record);
                lineToProcess = reader.ReadLine();
            }
        }
        return records;
    }
}
I could have used a string instead of FileInfo, but IMHO an arbitrary string is something quite different from a file name.
After the above you can write the following:
string startFolder = @"C:\MyFileFolder\";
var directoryInfo = new DirectoryInfo(startFolder);
var allFiles = directoryInfo.EnumerateFiles("*.*", SearchOption.AllDirectories);
var sequenceOfFileRecordCollections = allFiles.Select(file => file.ReadMyRecords(','));
So now you have, per file, a sequence of the MyRecords in that file. You want to know which files cover more than one year. Let's add another extension method to the class FileInfoExtension:
public static bool IsMultiYear(this FileInfo file, char separator)
{
    // read the file, only return true if there are any records,
    // and if any record has a different year than the first record
    var myRecords = file.ReadMyRecords(separator);
    if (myRecords.Any())
    {
        int firstYear = myRecords.First().Year;
        return myRecords.Any(record => record.Year != firstYear);
    }
    else
    {
        return false;
    }
}
The sequence of files that contain more than one year is:
allFiles.Where(file => file.IsMultiYear(','));
Put everything in one statement:
var allFilesWithMultiYear = new DirectoryInfo(@"C:\MyFileFolder\")
    .EnumerateFiles("*.*", SearchOption.AllDirectories)
    .Where(file => file.IsMultiYear(','));
By creating two fairly simple extension methods, your problem becomes one highly readable statement.
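Note that ReadMyRecords loads every record of a file into a list before IsMultiYear looks at it. If the files are large, a lazier sketch can stop reading as soon as a second distinct year shows up. It assumes the same separated layout with the year in the third field. Unlike the code above, it silently skips blank or malformed lines instead of throwing, and it compares the year as text rather than parsing it. The names MultiYearCheck, HasMultipleYears and IsMultiYearFile are illustrative only.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class MultiYearCheck
{
    // True when the lines contain at least two different values in the third field.
    public static bool HasMultipleYears(IEnumerable<string> lines, char separator)
    {
        return lines
            .Where(line => !string.IsNullOrWhiteSpace(line))
            .Select(line => line.Split(separator))
            .Where(parts => parts.Length >= 3)   // skip malformed lines (an assumption)
            .Select(parts => parts[2].Trim())
            .Distinct()
            .Skip(1)   // skip the first distinct year...
            .Any();    // ...and succeed as soon as a second one appears
    }

    // File.ReadLines streams the file lazily instead of loading it whole.
    public static bool IsMultiYearFile(string path, char separator)
    {
        return HasMultipleYears(File.ReadLines(path), separator);
    }
}
```

With this helper, the final query could be written as allFiles.Where(file => MultiYearCheck.IsMultiYearFile(file.FullName, ',')).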

multiple foreach loops inside while loop

Is it possible to include multiple "foreach" statements inside one of the looping constructs such as while or for? I want to open the .wav files from two different directories simultaneously so that I can compare files from both.
Here is what I am trying to do, but it is certainly wrong. Any help is appreciated.
string[] fileEntries1 = Directory.GetFiles(folder1, "*.wav");
string[] fileEntries2 = Directory.GetFiles(folder11, "*.wav");
while ( foreach(string fileName1 in fileEntries1) && foreach(string fileName2 in fileEntries2))

Grammatically speaking, no. A foreach construct is a statement, whereas the test in a while statement must be an expression.
Your best bet is to nest the foreach blocks:
foreach (string fileName1 in fileEntries1)
{
    foreach (string fileName2 in fileEntries2)
    {
        // compare fileName1 and fileName2 here
    }
}

I like to write this kind of thing as a single statement, so even though most of the answers here are correct, here is another option. Directory.GetFiles returns full paths, so the comparison below is on the file name only:
string[] fileEntries1 = Directory.GetFiles(folder1, "*.wav");
string[] fileEntries2 = Directory.GetFiles(folder11, "*.wav");
foreach (var fileExistsInBoth in fileEntries1.Where(fe1 => fileEntries2.Any(fe2 => Path.GetFileName(fe2) == Path.GetFileName(fe1))))
{
    // here you have the files from the first folder that also exist in the second
}

Something like this, since you only need to match on file names:
List<string> fileEntries1 = Directory.GetFiles(folder1, "*.wav").Select(x => Path.GetFileName(x)).ToList();
List<string> fileEntries2 = Directory.GetFiles(folder2, "*.wav").Select(x => Path.GetFileName(x)).ToList();
// Iterate the bigger list and search the other one, so the two never coincide
bool firstIsBigger = fileEntries1.Count > fileEntries2.Count;
List<string> filesToIterate = firstIsBigger ? fileEntries1 : fileEntries2;
List<string> filesToValidate = firstIsBigger ? fileEntries2 : fileEntries1;
// Iterate the bigger collection
foreach (string fileName in filesToIterate)
{
    // Find the files in smaller collection
    if (filesToValidate.Contains(fileName))
    {
        // Get actual file and compare
    }
    else
    {
        // File does not exist in another list. Handle appropriately
    }
}
.NET 2.0 based solution:
List<string> fileEntries1 = new List<string>(Directory.GetFiles(folder1, "*.wav"));
List<string> fileEntries2 = new List<string>(Directory.GetFiles(folder2, "*.wav"));
// Iterate the bigger list and search the other one, so the two never coincide
bool firstIsBigger = fileEntries1.Count > fileEntries2.Count;
List<string> filesToIterate = firstIsBigger ? fileEntries1 : fileEntries2;
filesToValidate = firstIsBigger ? fileEntries2 : fileEntries1;
string iteratorFileName;
string validatorFilePath;
// Iterate the bigger collection
foreach (string fileName in filesToIterate)
{
    iteratorFileName = Path.GetFileName(fileName);
    // Find the files in smaller collection
    if ((validatorFilePath = FindFile(iteratorFileName)) != null)
    {
        // Compare fileName and validatorFilePath files here
    }
    else
    {
        // File does not exist in another list. Handle appropriately
    }
}
FindFile method:
static List<string> filesToValidate;
private static string FindFile(string fileToFind)
{
    string returnValue = null;
    foreach (string filePath in filesToValidate)
    {
        if (string.Compare(Path.GetFileName(filePath), fileToFind, true) == 0)
        {
            // Found the file
            returnValue = filePath;
            break;
        }
    }
    if (returnValue != null)
    {
        // File was found in smaller list. Remove this file from the list since we do not need to look for it again
        filesToValidate.Remove(returnValue);
    }
    return returnValue;
}
You may or may not choose to make fields and methods static based on your needs.

If you want to iterate over all pairs of files from the two paths, you can do it as follows:
string[] fileEntries1 = Directory.GetFiles(folder1, "*.wav");
string[] fileEntries2 = Directory.GetFiles(folder11, "*.wav");
foreach (string fileName1 in fileEntries1)
{
    foreach (string fileName2 in fileEntries2)
    {
        // do the actual comparison
    }
}

This is what I would suggest, using LINQ:
using System.IO;
using System.Linq;
var fileEntries1 = Directory.GetFiles(folder1, "*.wav");
var fileEntries2 = Directory.GetFiles(folder11, "*.wav");
foreach (var entry1 in fileEntries1)
{
    // GetFiles returns full paths, so match on the file name only
    var entries = fileEntries2.Where(x => Path.GetFileName(x) == Path.GetFileName(entry1));
    if (entries.Any())
    {
        // We have matches:
        // entries contains the matches in fileEntries2 for entry1
    }
}

If you want to enumerate both collections "in parallel", use their enumerators like this:
var fileEntriesIterator1 = Directory.EnumerateFiles(folder1, "*.wav").GetEnumerator();
var fileEntriesIterator2 = Directory.EnumerateFiles(folder11, "*.wav").GetEnumerator();
while (fileEntriesIterator1.MoveNext() && fileEntriesIterator2.MoveNext())
{
    var file1 = fileEntriesIterator1.Current;
    var file2 = fileEntriesIterator2.Current;
}
If one collection is shorter than the other, this loop will end when the shorter collection has no more elements.
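The same lock-step pairing can be written with LINQ's Zip, which also stops at the end of the shorter sequence. Here is a small sketch; the FilePairing class and its PairUp helper are names invented for this example. Directory.EnumerateFiles does not guarantee any particular order, so the sketch sorts both sequences first, on the assumption that the pairing is meant to line up by name.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FilePairing
{
    // Sorts both sequences, then pairs the i-th entry of one with the i-th entry of the other.
    // Entries beyond the length of the shorter sequence are dropped by Zip.
    public static List<Tuple<string, string>> PairUp(IEnumerable<string> first, IEnumerable<string> second)
    {
        return first.OrderBy(f => f, StringComparer.OrdinalIgnoreCase)
            .Zip(second.OrderBy(s => s, StringComparer.OrdinalIgnoreCase),
                 (a, b) => Tuple.Create(a, b))
            .ToList();
    }
}
```

Each resulting tuple holds one path from each folder in Item1 and Item2, ready for the actual comparison.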

How to remove duplicates from List<string> without LINQ? [duplicate]

This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
Remove duplicates from a List<T> in C#
I have a list like the one below (a very big email list):
Source list:
item 0 : jumper@yahoo.com|32432
item 1 : goodzila@yahoo.com|32432|test23
item 2 : alibaba@yahoo.com|32432|test65
item 3 : blabla@yahoo.com|32432|test32
The important part of each item is the email address. The other parts (separated by pipes) are not important, but I want to keep them in the final list.
As I said, my list is very big, and I think it's not advisable to use another list.
How can I remove items with duplicate emails (the entire item) from that list without using LINQ?
My code is below:
private void WorkOnFile(UploadedFile file, string filePath)
{
    File.SetAttributes(filePath, FileAttributes.Archive);
    FileSecurity fSecurity = File.GetAccessControl(filePath);
    fSecurity.AddAccessRule(new FileSystemAccessRule(@"Everyone",
        FileSystemRights.FullControl,
        AccessControlType.Allow));
    File.SetAccessControl(filePath, fSecurity);
    string[] lines = File.ReadAllLines(filePath);
    List<string> list_lines = new List<string>(lines);
    var new_lines = list_lines.Select(line => string.Join("|", line.Split(new string[] { " " }, StringSplitOptions.RemoveEmptyEntries)));
    List<string> new_list_lines = new List<string>(new_lines);
    int Duplicate_Count = 0;
    RemoveDuplicates(ref new_list_lines, ref Duplicate_Count);
    File.WriteAllLines(filePath, new_list_lines.ToArray());
}
private void RemoveDuplicates(ref List<string> list_lines, ref int Duplicate_Count)
{
    char[] splitter = { '|' };
    list_lines.ForEach(delegate(string line)
    {
        // ??
    });
}
EDIT:
Some duplicate email addresses in that list have different remaining parts. What can I do about them? For example:
goodzila@yahoo.com|32432|test23
and
goodzila@yahoo.com|asdsa|324234
Thanks in advance.

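Say you have a list of possible duplicates:
List<string> emailList ....
Then the unique list is the set built from that list:
HashSet<string> unique = new HashSet<string>(emailList);
Note that this compares whole strings, so two lines with the same email but different trailing fields would both be kept.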

private void RemoveDuplicates(ref List<string> list_lines, ref int Duplicate_Count)
{
    Duplicate_Count = 0;
    List<string> list_lines2 = new List<string>();
    HashSet<string> hash = new HashSet<string>();
    foreach (string line in list_lines)
    {
        string[] split = line.Split('|');
        string firstPart = split.Length > 0 ? split[0] : string.Empty;
        if (hash.Add(firstPart))
        {
            list_lines2.Add(line);
        }
        else
        {
            Duplicate_Count++;
        }
    }
    list_lines = list_lines2;
}

The easiest thing to do is to iterate through the lines in the file and add them to a HashSet. HashSets won't insert the duplicate entries and it won't generate an exception either. At the end you'll have a unique list of items and no exceptions will be generated for any duplicates.
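Here is a minimal sketch of that idea which also respects the wish not to build a second list: a HashSet remembers the email addresses already seen, and List<T>.RemoveAll drops the later lines in place. It assumes the email is the first pipe-separated field and compares addresses case-insensitively. Both are assumptions, not something the question states. The names EmailDeduper and RemoveDuplicateEmails are made up for this example.

```csharp
using System;
using System.Collections.Generic;

public static class EmailDeduper
{
    // Removes every line whose email (the text before the first '|') was already seen.
    // The first occurrence of each email is kept; the return value is the number of removed lines.
    public static int RemoveDuplicateEmails(List<string> lines)
    {
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        return lines.RemoveAll(line =>
        {
            int pipe = line.IndexOf('|');
            string email = pipe >= 0 ? line.Substring(0, pipe) : line;
            return !seen.Add(email); // Add returns false when the email is already present
        });
    }
}
```

The returned count can stand in for the Duplicate_Count value from the question, and no LINQ is involved.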

1 - Get rid of your pipe-separated string (create a DTO class corresponding to the data it represents).
2 - Which rule do you want to apply to choose between two objects with the same id?

Or maybe this code can be useful for you :)
It uses the same approach as the one in @xanatos' answer:
string[] lines = File.ReadAllLines(filePath);
var items = new Dictionary<string, string>();
foreach (var line in lines)
{
    // the email is the first pipe-separated field
    var key = line.Split('|')[0];
    if (!items.ContainsKey(key))
        items.Add(key, line);
}
List<string> list_lines = new List<string>(items.Values);

First, I suggest you read the file through a stream.
Then create a type that represents your rows and load them into a HashSet (for performance reasons).
Look (I've removed some of your code to keep it simple):
public struct LineType
{
    public string Email { get; set; }
    public string Others { get; set; }
    public override bool Equals(object obj)
    {
        return obj is LineType && Email == ((LineType)obj).Email;
    }
    // HashSet needs a hash code that is consistent with Equals
    public override int GetHashCode()
    {
        return Email == null ? 0 : Email.GetHashCode();
    }
}
private static void WorkOnFile(string filePath)
{
    HashSet<LineType> hashSet = new HashSet<LineType>();
    using (StreamReader stream = File.OpenText(filePath))
    {
        string line;
        while ((line = stream.ReadLine()) != null)
        {
            string new_line = string.Join("|", line.Split(new string[] { " " }, StringSplitOptions.RemoveEmptyEntries));
            LineType lineType = new LineType()
            {
                // the email is the first pipe-separated field
                Email = new_line.Split('|')[0],
                Others = new_line
            };
            // Add ignores items that are already in the set
            hashSet.Add(lineType);
        }
    }
}
