get distinct count of substring of file name - c#

I have a directory with a list of file names.
VAH007157100-pic1.jpg
VAH007157100-pic2.jpg
VAH007157100-pic3.jpg
WAZ009999200-pic1.jpg
WAZ009999200-pic2.jpg
WAZ009999200-pic3.jpg
I want to know the distinct count of Substring(0, 12), i.e. how many different 12-character prefixes the file names have.
This isn't working for some reason:
string[] originalFiles = Directory.GetFiles(SelectedDirectory);
private int GetDistinctPolicyCountInDirectory()
{
var prefixes = originalFiles
.GroupBy(x => x.Substring(0, 12))
.Select(y => new { Policy = y.Key, Count = y.Count() });
return prefixes.Count();
}
I keep getting 0. Am I missing anything here?
Please note that I do not want to do a split to get the numbers separated. I want to do it by substringing.
UPDATE -
private int GetDistinctPolicyCountInDirectory(string[] originalFiles)
{
var count = originalFiles.Distinct(x => Path.GetFileName(x).Substring(0, 12)).Count();
return Convert.ToInt32(count);
}
I'm running into an error here: Error 1 Cannot convert lambda expression to type 'System.Collections.Generic.IEqualityComparer<string>' because it is not a delegate type

I'd just consider using .Distinct().
Also you need to strip it down to just the filename instead of the full file path.
originalFiles.Select(x => Path.GetFileName(x).Substring(0, 12))
.Distinct().Count();

GetFiles returns an array of file names with full paths, including the directory. You want to compare only the file name, so you should consider using Path.GetFileName.
GroupBy(x => Path.GetFileName(x).Substring(0, 12));
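To put both points together, here is a minimal sketch of the counting method. The class name PolicyCounter, the method name CountDistinctPrefixes and the prefixLength parameter are made up for illustration and are not from the question. It also skips names shorter than the prefix, since Substring(0, 12) would throw on those. As for the error in the update: Distinct has no overload that takes a selector lambda, only one that takes an IEqualityComparer<T>, which is why you project with Select first and call Distinct() afterwards.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class PolicyCounter
{
    // Hypothetical helper: counts the distinct prefixes of the given length,
    // taken from the file name only (not the full path).
    public static int CountDistinctPrefixes(IEnumerable<string> paths, int prefixLength = 12)
    {
        return paths
            .Select(p => Path.GetFileName(p))              // strip the directory part
            .Where(name => name.Length >= prefixLength)    // avoid Substring throwing on short names
            .Select(name => name.Substring(0, prefixLength))
            .Distinct()
            .Count();
    }
}
```

You would call it as PolicyCounter.CountDistinctPrefixes(Directory.GetFiles(SelectedDirectory)). On .NET 6 or later, DistinctBy(x => ...) does accept a key selector, if you would rather keep the full paths around.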

Related

C# How do I Access this generic list, when i didn't create the type?

I've stepped through the first part and that works correctly. My list ends up As fileName[288], and in my locals window I have a "value". This is a list. I didn't create the type, so I don't know how to access it. I know it is a generic list of strings, so I imported System.Collections.Generic.List, but I cant seem to figure it out.
var fileName = new DirectoryInfo(text)
.GetFiles("*.*", SearchOption.AllDirectories)
.Select(x => x.Name)
.ToList();
for (var i = 0; i < fileName.Count; i++)
{
Console.WriteLine("Filename: {0}", fileName[i].?)
}

Since the last Select returns string (x.Name is of type string)
...
.Select(x => x.Name) // select string(s)
.ToList(); // materialize them into a list
then fileName is of type List<string> and you don't need any additional method call:
Console.WriteLine("Filename: {0}", fileName[i]);
I suggest getting rid of the for loop and letting .NET do the work for you with foreach:
// we have (potentially) many files' names - let use "fileNames" - plural
var fileNames = new DirectoryInfo(text)
.GetFiles("*.*", SearchOption.AllDirectories)
.Select(x => x.Name);
foreach (var name in fileNames)
Console.WriteLine($"Filename: {name}"); // string interpolation for readability
Edit: Please notice that we don't need .ToList() with foreach; all we want is to enumerate the names without saving them into a collection (say, List<string>).

Sort List<string> based on character count

Example:
List<string> folders = new List<string>();
folders.Add("folder1/folder2/folder3/");
folders.Add("folder1/");
folders.Add("folder1/folder2/");
I want to sort this list by the number of '/' characters in each entry,
so my output will be
folder1/
folder1/folder2/
folder1/folder2/folder3/

LINQ:
folders = folders.OrderBy(f => f.Length).ToList(); // consider null strings
or List.Sort
folders.Sort((s1, s2) => s1.Length.CompareTo(s2.Length));
A safe approach if the list could contain nulls:
folders = folders.OrderBy(f => f?.Length ?? int.MinValue).ToList();
If you actually want to sort by the folder-depth not string length:
folders = folders.OrderBy(f => f.Split(Path.DirectorySeparatorChar, Path.AltDirectorySeparatorChar).Length).ToList();
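If depth is what matters, a variant that counts the '/' characters directly and breaks ties by name might look like the sketch below. FolderSorter and SortByDepthThenName are hypothetical names, and it assumes the entries are non-null and always use '/' as the separator, as in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FolderSorter
{
    // Hypothetical helper: orders by the number of '/' characters (the depth),
    // then by ordinal name so entries at the same depth have a stable order.
    public static List<string> SortByDepthThenName(IEnumerable<string> folders)
    {
        return folders
            .OrderBy(f => f.Count(c => c == '/'))
            .ThenBy(f => f, StringComparer.Ordinal)
            .ToList();
    }
}
```

Unlike sorting by Length, this does not put a long single-level name after a short two-level one.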

It's likely you actually want to sort by name:
folders = folders.OrderBy(f => f).ToList();
Or simply:
folders.Sort();
This will work correctly for cases like this:
folder1/
folder1/subfolder1
folder1/subfolder1/subsubfolder
folder2
folder2/subfolder2
Sorting by length alone will consider "folder1" and "folder2" equal.

Find the index position of duplicate entries in a comma separated string

My problem just got more complicated than I thought and I've just wiped out my original question... So I'll probably post multiple questions depending on how I get on with this.
Anyway, back to the problem. I need to find the index position of duplicate entries in string that contains csv data. For example,
FirstName,LastName,Address,Address,Address,City,PostCode,PostCode, Country
As you can see, Address is duplicated, and I need to find the index of each duplicate, assuming the first index position is 0.
If you have a better suggestion on how to do this, let me know, but assuming it can be done, could we maybe have a Dictionary<string, List<int>>?
So if I had to code this, you would have:
duplicateIndexList.Add(2);
duplicateIndexList.Add(3);
duplicateIndexList.Add(4);
myDuplicateList.Add("Address", duplicateIndexList);
duplicateIndexList.Add(6);
duplicateIndexList.Add(7);
myDuplicateList.Add("PostCode", duplicateIndexList);
Obviously I don't want to do this but is it possible to achieve the above using Linq to do this? I could probably write a function that does this, but I love seeing how things can be done with Linq.
In case you're curious as to why I want to do this? Well, in short, I have an xml definition which is used to map csv fields to a database field and I want to first find out if there are any duplicate columns, I then want to append the relevant values from the actual csv row i.e. Address = Address(2) + Address(3) + Address(4), PostCode = PostCode(6) + PostCode(7)
The next part will be how to remove all the relevant values from the csv string defined above based on the indexes found once I have appended their actual values, but that will be the next part.
Thanks.
T.
UPDATE:
Here is the function that does what I want but as I said, linq would be nice. Note that in this function I'm using a list instead of the comma separated string as I haven't converted that list yet to a csv string.
Dictionary<string, List<int>> duplicateEntries = new Dictionary<string, List<int>>();
int indexPosition = 0;
foreach (string fieldName in Mapping.Select(m=>m.FieldName))
{
string key = fieldName.ToUpper();
if (duplicateEntries.ContainsKey(key))
{
List<int> indexes = duplicateEntries[key]; // look up by the upper-cased key, not the raw field name
indexes.Add(indexPosition);
duplicateEntries[key] = indexes;
indexes = null;
}
else
{
duplicateEntries.Add(key, new List<int>() { indexPosition });
}
indexPosition += 1;
}
Maybe this will help clarify what I'm trying to achieve.

You need to do the following:
Use .Select on the resulting array to project a new IEnumerable of objects that contains the index of the item in the array along with the value.
Use either ToLookup or GroupBy and ToDictionary to group the results by column value.
Seems like an ILookup<string, int> would be appropriate here:
var lookup = columnArray
.Select((c, i) => new { Value = c, Index = i })
.ToLookup(o => o.Value, o => o.Index);
List<int> addressIndexes = lookup["Address"].ToList(); // 2, 3, 4
Or if you wanted to create a Dictionary<string, List<int>>:
Dictionary<string, List<int>> dictionary = columnArray
.Select((c, i) => new { Value = c, Index = i })
.GroupBy(o => o.Value, o => o.Index)
.ToDictionary(grp => grp.Key, grp => grp.ToList());
List<int> addressIndexes = dictionary["Address"]; // 2, 3, 4
Edit
(in response to updated question)
This should work:
Dictionary<string, List<int>> duplicateEntries = Mapping
.Select((m, i) => new { Value = m.FieldName, Index = i })
.GroupBy(o => o.Value, o => o.Index)
.ToDictionary(grp => grp.Key, grp => grp.ToList());
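Note that the queries above keep every column, including the ones that only occur once. If you only want the real duplicates, and want the case-insensitive matching your loop gets from ToUpper, a sketch along these lines could work. DuplicateFinder and FindDuplicateIndexes are names I made up, and trimming the column names is an assumption based on the " Country" entry in your sample.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DuplicateFinder
{
    // Hypothetical helper: maps each column name that occurs more than once
    // to the zero-based positions where it occurs.
    public static Dictionary<string, List<int>> FindDuplicateIndexes(IEnumerable<string> columns)
    {
        return columns
            .Select((name, index) => new { Name = name.Trim(), Index = index })
            .GroupBy(x => x.Name, x => x.Index, StringComparer.OrdinalIgnoreCase)
            .Where(g => g.Count() > 1)   // keep only names that repeat
            .ToDictionary(g => g.Key, g => g.ToList(), StringComparer.OrdinalIgnoreCase);
    }
}
```

For the header line in the question, split on ',', this should give Address with 2, 3, 4 and PostCode with 6, 7.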

You could do something like :
int count = 0;
var numbered_collection =
from line in File.ReadAllLines("your_csv_name.csv").Skip(1)
let parts = line.Split(',')
select new CarClass()
{
Id = count++,
First_Field = parts[0],
Second_Field = parts[1], // rinse and repeat
};
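This gives you an Id per item and also skips the first line, which holds the header. You could put it in a method if you want to automatically map the names from the first CSV line to the fields. Because the query is evaluated lazily, add .ToList() if you enumerate it more than once, otherwise count keeps increasing on every pass.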
From there, you can do:
var duplicates = (from items in numbered_collection
group items by items.First_Field into g
select g)
.Where(g => g.Count() > 1);
Now you have all the groups where you actually have duplicates, and you can just get the 'Id' from the object to know which one is the duplicated.

Parsing delimited data for specific instance of repeated line

I have an array of strings in the following format, where each string begins with a series of three characters indicating what type of data it contains. For example:
ABC|.....
DEF|...
RHG|1........
RHG|2........
RHG|3........
XDF|......
I want to find any repeating lines (RHG in this example) and mark the last line with a special character:
>RHG|3.........
What's the best way to do this? My current solution has a method to count the line headers and create a dictionary with the header counts.
protected Dictionary<string, int> CountHeaders(string[] lines)
{
Dictionary<string, int> headerCounts = new Dictionary<string, int>();
for (int i = 0; i < lines.Length; i++)
{
string s = lines[i].Substring(0, 3);
int value;
if (headerCounts.TryGetValue(s, out value))
headerCounts[s]++;
else
headerCounts.Add(s, 1);
}
return headerCounts;
}
In the main parsing method, I select the lines that are repeated.
var repeats = CountHeaders(lines).Where(x => x.Value > 1).Select(x => x.Key);
foreach (string s in repeats)
{
// Get last instance of line in lines and mark it
}
This is as far as I've gotten. I think I can do what I want with another LINQ query but I'm not too sure. Also, I can't help but feel that there's a more optimal solution.

You can use LINQ to achieve that.
Input string:
var input = @"ABC|.....
DEF|...
RHG|1........
RHG|2........
RHG|3........
XDF|......";
LINQ query:
var results = input.Split(new[] { Environment.NewLine }, StringSplitOptions.None)
.GroupBy(x => x.Substring(0, 3))
.Select(g => g.ToList())
.SelectMany(g => g.Count > 1 ? g.Take(g.Count - 1).Concat(new[] { string.Format(">{0}", g[g.Count - 1]) }) : g)
.ToArray();
I used the Select(g => g.ToList()) projection to make g.Count an O(1) operation in the later query steps.
You can join the result array into one string using the String.Join method:
var output = String.Join(Environment.NewLine, results);
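One thing to be aware of: GroupBy returns groups in the order their keys first appear, so if the repeated headers are not adjacent, the query above moves lines around. If the original order must be preserved, a sketch like the following marks the lines in place. LineMarker, MarkLastRepeated and the headerLength parameter are hypothetical names, and it assumes every line is at least headerLength characters long.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LineMarker
{
    // Hypothetical helper: prefixes the last line of every repeated header with '>'
    // and leaves all other lines, and the overall order, untouched.
    public static string[] MarkLastRepeated(string[] lines, int headerLength = 3)
    {
        // Positions of the last occurrence of each header that appears more than once.
        var lastIndexes = new HashSet<int>(
            lines
                .Select((line, index) => new { Header = line.Substring(0, headerLength), Index = index })
                .GroupBy(x => x.Header)
                .Where(g => g.Count() > 1)
                .Select(g => g.Last().Index));

        return lines
            .Select((line, index) => lastIndexes.Contains(index) ? ">" + line : line)
            .ToArray();
    }
}
```

It takes the string[] you already have, so there is no need to split and re-join the text.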

Alternatively, you could find repeating lines with a backreferencing regex. I wrote this hacky regex using your sample data; it matches consecutive lines that start with the same 'tag' followed by pipe-separated values.
^(?<Tag>.+)[|].+[\n\r](\k<Tag>[|].+[\n\r])+
The match range starts at the beginning of the first RHG line and selects up to the last RHG line.

Here's an example that includes the parsing and the counting in one Linq statement - feel free to break it up if you want to:
string[] data = new string[]
{
"ABC|.....",
"DEF|...",
"RHG|1........",
"RHG|2........",
"RHG|3........",
"XDF|......"
};
data.Select(d=> d.Split('|')) // split the strings
.Select(d=> new { Key = d[0], Value = d[1] }) // select the key and value
.GroupBy (d => d.Key) // group by the key
.Where(g=>g.Count() > 1 ) // find duplicates
.Select(d => d.Skip(1)) // select the repeating elements
.SelectMany(g=>g) // flatten into a single list
;
This will give you a list of key/value pairs that are duplicates. so with the sample data it will return
Key Value
RHG 2........
RHG 3........
I'm not sure what you mean by "marking" the line, however...

Creating a list of docs that contains same name

I'm creating a tool that is supposed to concatenate docs that contain the same name.
example: C_BA_20000_1.pdf and C_BA_20000_2.pdf
These files should be grouped in one list.
That tool runs on a directory, let's say:
//directory of pdf files
DirectoryInfo dirInfo = new DirectoryInfo(@"C:\Users\derp\Desktop");
FileInfo[] fileInfos = dirInfo.GetFiles("*.pdf");
foreach (FileInfo info in fileInfos)
I want to create an ArrayList that contains filenames of the same name
ArrayList list = new ArrayList();
list.Add(info.FullName);
and then have a list that contains all the ArrayLists of similar docs.
List<ArrayList> bigList = new List<ArrayList>();
So my question, how can I group files that contains same name and put them in the same list.
EDIT:
Files have the same pattern in their names AB_CDEFG_i
where i is a number and can be from 1-n. Files with the same name should have only different number at the end.
AB_CDEFG_1
AB_CDEFG_2
HI_JKLM_1
Output should be:
List 1: AB_CDEFG_1 and AB_CDEFG_2
List 2: HI_JKLM_1

Create method which extracts 'same' part of file name. E.g.
public string GetRawName(string fileName)
{
int index = fileName.LastIndexOf("_");
return fileName.Substring(0, index);
}
And use this method for grouping:
var bigList = Directory.EnumerateFiles(@"C:\Users\derp\Desktop", "*.pdf")
.GroupBy(file => GetRawName(file))
.Select(g => g.ToList())
.ToList();
This will return List<List<string>> (without ArrayList).
UPDATE: Here is a regular expression that works with all kinds of file names, whether or not they have a number at the end:
public string GetRawName(string file)
{
string name = Path.GetFileNameWithoutExtension(file);
return Regex.Replace(name, @"(_\d+)?$", "");
}
Grouping:
var bigList = Directory.EnumerateFiles(@"C:\Users\derp\Desktop", "*.pdf")
.GroupBy(GetRawName)
.Select(g => g.ToList())
.ToList();

It sounds like the difficulty is in deciding which files are the same.
static string KeyFromFileName(string file)
{
// Convert from "C_BA_20000_2" to "C_BA_20000"
return file.Substring(0, file.LastIndexOf("_"));
// Note: This assumes there is an _ in the filename.
}
Then you can use this LINQ to build a list of fileSets.
using System.Linq; // Near top of file
var files = Directory.GetFiles(@"C:\Users\derp\Desktop", "*.pdf");
var fileSets = files
// Directory.GetFiles already returns full paths as strings, so no projection is needed
.GroupBy(KeyFromFileName)
.Select(g => new { g.Key, Files = g.ToList() })
.ToList();

Aside from the fact that your question doesn't identify what "same name" means, this is a typical solution:
fileInfos.GroupBy ( f => f.FullName )
.Select( grp => grp.ToList() ).ToList();

This will get you a list of lists... also won't throw an exception if a file doesn't contain the underscore, etc.
private string GetKey(FileInfo fi)
{
var index = fi.Name.LastIndexOf('_');
return index == -1 ? Path.GetFileNameWithoutExtension(fi.Name)
: fi.Name.Substring(0, index);
}
var bigList = fileInfos.GroupBy(GetKey)
.Select(x => x.ToList())
.ToList();
