Find the index position of duplicate entries in a comma separated string - c#

My problem just got more complicated than I thought and I've just wiped out my original question... So I'll probably post multiple questions depending on how I get on with this.
Anyway, back to the problem. I need to find the index positions of duplicate entries in a string that contains CSV data. For example:
FirstName,LastName,Address,Address,Address,City,PostCode,PostCode, Country
As you can see, Address is duplicated (as is PostCode), and I need to find the index of each duplicate, assuming the first index position is 0.
If you have a better suggestion on how to do this, let me know, but assuming it can be done, could we maybe do it with a Dictionary<string, List<int>>?
So if I had to code this, you would have:
duplicateIndexList.Add(2);
duplicateIndexList.Add(3);
duplicateIndexList.Add(4);
myDuplicateList.Add("Address", duplicateIndexList);
duplicateIndexList = new List<int>(); // start a fresh list for the next column
duplicateIndexList.Add(6);
duplicateIndexList.Add(7);
myDuplicateList.Add("PostCode", duplicateIndexList);
Obviously I don't want to hard-code this, but is it possible to achieve the above using LINQ? I could probably write a function that does this, but I love seeing how things can be done with LINQ.
In case you're curious as to why I want to do this: in short, I have an XML definition which is used to map CSV fields to database fields. I first want to find out whether there are any duplicate columns, and then append the relevant values from the actual CSV row, i.e. Address = Address(2) + Address(3) + Address(4), PostCode = PostCode(6) + PostCode(7).
Once I have appended the actual values, the next step will be to remove the relevant values from the CSV string defined above, based on the indexes found, but that will be a separate question.
Thanks.
T.
UPDATE:
Here is the function that does what I want but as I said, linq would be nice. Note that in this function I'm using a list instead of the comma separated string as I haven't converted that list yet to a csv string.
Dictionary<string, List<int>> duplicateEntries = new Dictionary<string, List<int>>();
int indexPosition = 0;
foreach (string fieldName in Mapping.Select(m=>m.FieldName))
{
string key = fieldName.ToUpper();
if (duplicateEntries.ContainsKey(key))
{
// Look up by the upper-cased key, not the raw field name.
duplicateEntries[key].Add(indexPosition);
}
else
{
duplicateEntries.Add(key, new List<int>() { indexPosition });
}
indexPosition += 1;
}
Maybe this will help clarify what I'm trying to achieve.

You need to do the following:
Use .Select on the resulting array to project a new IEnumerable of objects that contains the index of the item in the array along with the value.
Use either ToLookup or GroupBy and ToDictionary to group the results by column value.
Seems like an ILookup<string, int> would be appropriate here:
var lookup = columnArray
.Select((c, i) => new { Value = c, Index = i })
.ToLookup(o => o.Value, o => o.Index);
List<int> addressIndexes = lookup["Address"].ToList(); // 2, 3, 4
Or if you wanted to create a Dictionary<string, List<int>>:
Dictionary<string, List<int>> dictionary = columnArray
.Select((c, i) => new { Value = c, Index = i })
.GroupBy(o => o.Value, o => o.Index)
.ToDictionary(grp => grp.Key, grp => grp.ToList());
List<int> addressIndexes = dictionary["Address"]; // 2, 3, 4
Edit
(in response to updated question)
This should work:
Dictionary<string, List<int>> duplicateEntries = Mapping
.Select((m, i) => new { Value = m.FieldName, Index = i })
.GroupBy(o => o.Value, o => o.Index)
.ToDictionary(grp => grp.Key, grp => grp.ToList());
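To tie this back to the header string in the question, here is a minimal sketch that splits the string and keeps only the columns that occur more than once. The class and method names (DuplicateColumns, Find) are made up for illustration, and it assumes that names should be trimmed and compared case-insensitively, which the question's ToUpper call and the " Country" entry suggest but do not confirm.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of the original question or answer.
public static class DuplicateColumns
{
    // Returns only the column names that occur more than once,
    // mapped to the zero-based positions at which they occur.
    public static Dictionary<string, List<int>> Find(string header)
    {
        return header
            .Split(',')
            // Pair each (trimmed) column name with its position.
            .Select((name, index) => new { Name = name.Trim(), Index = index })
            // Group positions by name, ignoring case (an assumption).
            .GroupBy(x => x.Name, x => x.Index, StringComparer.OrdinalIgnoreCase)
            // Drop names that appear only once.
            .Where(g => g.Count() > 1)
            .ToDictionary(g => g.Key, g => g.ToList(), StringComparer.OrdinalIgnoreCase);
    }
}
```

For the header in the question, Find should return two entries: Address with 2, 3, 4 and PostCode with 6, 7. Drop the Where call if you want every column, duplicated or not.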

You could do something like :
int count = 0;
var numbered_collection =
from line in File.ReadAllLines("your_csv_name.csv").Skip(1)
let parts = line.Split(',')
select new CarClass()
{
Id = count++,
First_Field = parts[0],
Second_Field = parts[1], // rinse and repeat
};
This gives you an Id per item (and also skips the first line, which has the header). You could put it in a method if you want to automatically map the names from the first CSV line to the fields.
From there, you can do:
var duplicates = (from items in numbered_collection
group items by items.First_Field into g
select g)
.Where(g => g.Count() > 1);
Now you have all the groups where you actually have duplicates, and you can just get the 'Id' from the object to know which one is the duplicated.

Related

Convert ordered comma separated list into tuples with ordered element number (a la SQL STRING_SPLIT) using C# 6.0/.Net Framework 4.8

I can't seem to find a ready answer to this, or even if the question has ever been asked before, but I want functionality similar to the SQL STRING_SPLIT functions floating around, where each item in a comma separated list is identified by its ordinal in the string.
Given the string "abc,xyz,def,tuv", I want to get a list of tuples like:
<1, "abc">
<2, "xyz">
<3, "def">
<4, "tuv">
Order is important, and I need to preserve the order, and be able to take the list and further join it with another list using linq, and be able to preserve the order. For example, if a second list is <"tuv", "abc">, I want the final output of the join to be:
<1, "abc">
<4, "tuv">
Basically, I want the comma separated string to determine the ORDER of the end result, where the comma separated string contains ALL possible strings, and it is joined with an unordered list of a subset of strings, and the output is a list of ordered tuples that consists only of the elements in the second list, but in the order determined by the comma separated string at the beginning.
I could likely figure out all of this on my own if I could just get a C# equivalent to all the various SQL STRING_SPLIT functions out there, which do the split but also include the ordinal element number in the output. But I've searched, and I find nothing for C# but splitting a string into individual elements, or splitting them into tuples where both elements of the tuple are in the string itself, not generated integers to preserve order.
The order is the important thing to me here. So if an element number isn't readily possible, a way to inner join two lists and guarantee preserving the order of the first list while returning only those elements in the second list would be welcome. The tricky part for me is this last part: the result of a join needs a specific (not easy to sort by) order. The ordinal number would give me something to sort by, but if I can inner join with some guarantee the output is in the same order as the first input, that'd work too.
That should work on .NET framework.
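using System;
using System.Collections.Generic;
using System.Linq;
string str = "abc,xyz,def,tuv";
string str2 = "abc,tuv";
// Wrap each name from the second list in an object, standing in for real file objects.
IEnumerable<PretendFileObject> secondList = str2.Split(',').Select(x => new PretendFileObject() { FileName = x });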
var tups = str.Split(',')
.Select((x, i) => { return (i + 1, x); })
.Join(secondList, //Join Second list ON
item => item.Item2 //This is the filename in the tuples
,item2 => item2.FileName, // This is the filename property for a given object in the second list to join on
(item,item2) => new {Index = item.Item1,FileName = item.Item2, Obj = item2})
.OrderBy(JoinedObject=> JoinedObject.Index)
.ToList();
foreach (var tup in tups)
{
Console.WriteLine(tup.Obj.FileName);
}
public class PretendFileObject
{
public string FileName { get; set; }
public string Foo { get; set; }
}
Original Response Below
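If you want to stick to something SQL-like, here is how to do it with LINQ operators. The Select method has an overload that passes each element's index, which you can make use of. And you can use IntersectBy to perform an easy inner join (note that IntersectBy was added in .NET 6, so it is not available on .NET Framework).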
using System.Linq;
string str = "abc,xyz,def,tuv";
string str2 = "abc,tuv";
var secondList = str2.Split(',');
var tups = str.Split(',')
.Select((x, i) => { return (i + 1, x); })
.IntersectBy(secondList, s=>s.Item2) //Filter down to only the strings found in both.
.ToList();
foreach(var tup in tups)
{
Console.WriteLine(tup);
}
This will get you a list of tuples:
var input = "abc,xyz,def,tuv";
string[] items = input.Split(',');
var tuples = new List<(int, string)>();
for (int i = 0; i < items.Length; i++)
{
tuples.Add(((i + 1), items[i]));
}
If you then want to bring in the list of "tuv" and "abc" and keep the original numbers, you probably want a "left join". I am not sure how you would do that in a single LINQ step, because you first need to iterate the original list to assign each item its int, and then join. Alternatively, you can join first and then assign the int, but then the order is technically not guaranteed. If you assign the int first, you can sort by it at the end.
I am slightly confused by "and be able to take the list and further join it with another list using linq". A join usually means a combined result, but in your case it seems you are asking for a subset, not joined data.
--
"I want to remove any items from the second list that are not in the first list, and then I need to iterate over the second list IN THE ORDER of the first list"
var input2 = "xxx,xyz,yyy,tuv,";
string[] items2 = input2.Split(',');
IEnumerable<(int, string)> finalTupleOutput =
tuples.Join(items2, t => t.Item2, i2 => i2, (t, i2) => (t.Item1, i2)).OrderBy(tpl => tpl.Item1);
This will give you what you want: the matching items from the second list, in the order of the first list.
With LINQ:
string inputString = "abc,xyz,def,tuv";
var output = inputString.Split(',')
.Select((item, index) => { return (index + 1, item); });
Now you can use the output list however you want.
Not 100% sure what you're after, but here's an attempt:
string[] vals = new[] { "abc", "xyz", "dev", "tuv"};
string[] results = new string[vals.Length];
int index = 0;
for (int i = 0; i < vals.Length; i++)
{
results[i] = $"<{++index},\"{vals[i]}\">";
}
foreach (var item in results)
{
Console.WriteLine(item);
}
This produces:
<1,"abc">
<2,"xyz">
<3,"dev">
<4,"tuv">
Given the example
For example, if a second list is <"tuv", "abc">, I want the final
output of the join to be:
<1, "abc"> <4, "tuv">
I think this might be close?
List<string> temp = new List<string>() { "abc", "def", "xyz", "tuv" };
List<string> temp2 = new List<string>() { "dbc", "ace", "zyw", "tke", "abc", "xyz" };
var intersect = temp.Intersect(temp2).Select((list, idx) => (idx+1, list));
This produces an intersect result that has the elements from list 1 that are also in list 2, which in this case would be:
<1, "abc">
<2, "xyz">
If you want all the elements from both lists, switch the Intersect to Union.
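For the .NET Framework case in the question, here is a rough sketch of the "keep the master order, return only the subset" idea using nothing newer than Select, Where and a HashSet. The names OrderedFilter and Apply are invented for this example, and KeyValuePair stands in for a tuple on the assumption that value tuples may not be available to a C# 6 compiler.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not taken from any of the answers above.
public static class OrderedFilter
{
    // Pairs each item of the master list with its 1-based ordinal, then keeps
    // only the pairs whose value also appears in the (unordered) subset.
    public static List<KeyValuePair<int, string>> Apply(string masterCsv, IEnumerable<string> subset)
    {
        // A set gives fast membership checks; its own order is irrelevant.
        var wanted = new HashSet<string>(subset);

        return masterCsv
            .Split(',')
            .Select((value, index) => new KeyValuePair<int, string>(index + 1, value))
            // Where streams items in source order, so the master order is kept.
            .Where(pair => wanted.Contains(pair.Value))
            .ToList();
    }
}
```

Calling Apply("abc,xyz,def,tuv", new[] { "tuv", "abc" }) should yield the pairs (1, "abc") and (4, "tuv"), in that order. Because the ordinal is assigned before filtering, it still reflects the position in the original string and can be used for a later OrderBy if the data passes through other joins.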

How can I find the first items in a list with a different value for a property using Linq?

I have an ordered list of objects, and I would like to find the index of each item where a property changes, and get a dictionary/list of pairs matching index to property. For example, finding the index of each new first letter in a list of words ordered alphabetically.
I can do this with a foreach loop:
Initials = new Dictionary<char, int>();
int i = 0;
foreach (var word in alphabeticallyOrderedList)
{
if (!Initials.ContainsKey(word.First()))
{
Initials[word.First()] = i;
}
i++;
}
But I feel like there should be an elegant way of doing this with Linq.
You could have the same functionality with LINQ by using the overload of Select that exposes the index and by using GroupBy + ToDictionary:
Initials = alphabeticallyOrderedList
.Select((word, index) => new { Word = word, WordIndex = index })
.GroupBy(x => x.Word[0])
.ToDictionary(charGroup => charGroup.Key, charGroup => charGroup.First().WordIndex);
But to quote myself:
LINQ is not always more readable, especially when indexes are important. You also lose some debugging, exception handling and logging capabilities if you use a large LINQ query

Simplifying linq when filtering data

I wanted to ask for suggestions how I can simplify the foreach block below. I tried to make it all in one linq statement, but I couldn't figure out how to manipulate "count" values inside the query.
More details about what I'm trying to achieve:
- I have a huge list with potential duplicates, where Ids are repeated but the "Count" property holds different numbers
- I want to get rid of the duplicates, but without losing those "Count" values
- so for items with the same Id I sum up the "Count" properties
Still, the current code doesn't look pretty:
var grouped = bigList.GroupBy(c => c.Id).ToList();
foreach (var items in grouped)
{
var count = 0;
items.Each(c=> count += c.Count);
items.First().Count = count;
}
var filtered = grouped.Select(y => y.First());
I don't expect the whole solution, pieces of ideas will be also highly appreciated :)
Given that you're mutating the collection, I would personally just make a new "item" with the count:
var results = bigList.GroupBy(c => c.Id)
.Select(g => new Item(g.Key, g.Sum(i => i.Count)))
.ToList();
This performs a simple mapping from the original to a new collection of Item instances, with the proper Id and Count values.
var filtered = bigList.GroupBy(c=>c.Id)
.Select(g=> {
var f = g.First();
f.Count = g.Sum(c=>c.Count);
return f;
});
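As a self-contained illustration of the non-mutating approach, here is a small sketch. The Item class below is an assumption (the question never shows its type), as is the ItemMerger name; only the GroupBy and Sum calls come from the answers above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Assumed shape of the question's item type: an Id plus a Count.
public sealed class Item
{
    public Item(int id, int count) { Id = id; Count = count; }
    public int Id { get; }
    public int Count { get; }
}

// Hypothetical helper for illustration.
public static class ItemMerger
{
    // Collapses items that share an Id into one new item whose Count
    // is the sum of the group's counts. The input items are not modified.
    public static List<Item> Merge(IEnumerable<Item> items)
    {
        return items
            .GroupBy(i => i.Id)
            .Select(g => new Item(g.Key, g.Sum(i => i.Count)))
            .ToList();
    }
}
```

Given items (1, 2), (2, 5) and (1, 3), Merge should return (1, 5) and (2, 5). Building new instances avoids the side effect of the second variant, which overwrites Count on the first element of each group in the original list.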

Parsing delimited data for specific instance of repeated line

I have an array of strings in the following format, where each string begins with a series of three characters indicating what type of data it contains. For example:
ABC|.....
DEF|...
RHG|1........
RHG|2........
RHG|3........
XDF|......
I want to find any repeating lines (RHG in this example) and mark the last line with a special character:
>RHG|3.........
What's the best way to do this? My current solution has a method to count the line headers and create a dictionary with the header counts.
protected Dictionary<string, int> CountHeaders(string[] lines)
{
Dictionary<string, int> headerCounts = new Dictionary<string, int>();
for (int i = 0; i < lines.Length; i++)
{
string s = lines[i].Substring(0, 3);
int value;
if (headerCounts.TryGetValue(s, out value))
headerCounts[s]++;
else
headerCounts.Add(s, 1);
}
return headerCounts;
}
In the main parsing method, I select the lines that are repeated.
var repeats = CountHeaders(lines).Where(x => x.Value > 1).Select(x => x.Key);
foreach (string s in repeats)
{
// Get last instance of line in lines and mark it
}
This is as far as I've gotten. I think I can do what I want with another LINQ query but I'm not too sure. Also, I can't help but feel that there's a more optimal solution.
You can use LINQ to achieve that.
Input string:
var input = @"ABC|.....
DEF|...
RHG|1........
RHG|2........
RHG|3........
XDF|......";
LINQ query:
var results = input.Split(new[] { Environment.NewLine }, StringSplitOptions.None)
.GroupBy(x => x.Substring(0, 3))
.Select(g => g.ToList())
.SelectMany(g => g.Count > 1 ? g.Take(g.Count - 1).Concat(new[] { string.Format(">{0}", g[g.Count - 1]) }) : g)
.ToArray();
I used the Select(g => g.ToList()) projection to make g.Count an O(1) operation in the later query steps.
You can join the result array into one string using the String.Join method:
var output = String.Join(Environment.NewLine, results);
Alternatively, you could find repeating lines with a backreferencing regex. I wrote this hacky regex using your sample data; it matches consecutive lines of pipe-separated values that start with the same 'tag' as the preceding line.
^(?<Tag>.+)[|].+[\n\r](\k<Tag>[|].+[\n\r])+
The match range starts at the beginning of the first RHG line and selects up to the last RHG line.
Here's an example that includes the parsing and the counting in one Linq statement - feel free to break it up if you want to:
string[] data = new string[]
{
"ABC|.....",
"DEF|...",
"RHG|1........",
"RHG|2........",
"RHG|3........",
"XDF|......"
};
data.Select(d=> d.Split('|')) // split the strings
.Select(d=> new { Key = d[0], Value = d[1] }) // select the key and value
.GroupBy (d => d.Key) // group by the key
.Where(g=>g.Count() > 1 ) // find duplicates
.Select(d => d.Skip(1)) // select the repeating elements
.SelectMany(g=>g) // flatten into a single list
;
This will give you a list of the key/value pairs that are duplicates. So with the sample data it will return:
Key Value
RHG 2........
RHG 3........
I'm not sure what you mean by "marking" the line, however...
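If "marking" means prefixing the last line of each repeated header with ">", as in the question's example, one possible sketch is below. RepeatMarker and MarkLastRepeats are invented names, and the code assumes every line is at least three characters long, as in the sample data.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration.
public static class RepeatMarker
{
    // Returns a copy of the lines in which the last occurrence of every
    // repeated three-character header is prefixed with '>'.
    public static string[] MarkLastRepeats(string[] lines)
    {
        // For each header that appears more than once, find the index
        // of its final occurrence.
        var lastIndexes = lines
            .Select((line, index) => new { Header = line.Substring(0, 3), Index = index })
            .GroupBy(x => x.Header, x => x.Index)
            .Where(g => g.Count() > 1)
            .Select(g => g.Last());

        var toMark = new HashSet<int>(lastIndexes);

        // Rebuild the array in its original order, marking the chosen lines.
        return lines
            .Select((line, index) => toMark.Contains(index) ? ">" + line : line)
            .ToArray();
    }
}
```

With the sample data, only the "RHG|3........" line should come back with the ">" prefix. Working from indexes rather than regrouped lines keeps the original line order even if the repeated headers are not adjacent.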

C# Sorting list by another list

I now have 2 lists:
List<string> names;
List<int> numbers;
and I need to sort my names based on the values in numbers.
I've been searching, and most answers use something like x.ID, but I don't really know what that value is, so that didn't work.
Does anyone know what to do, or can anyone help me out with the ID part?
So I assume that the elements in both lists are related through the index.
names.Select((n, index) => new { Name = n, Index = index })
.OrderBy(x => numbers.ElementAtOrDefault(x.Index))
.Select(x => x.Name)
.ToList();
But I would use another collection type, like Dictionary<int,string>, if both lists are related in this way.
Maybe this is a task for the Zip method. Something like
names.Zip(numbers, (name, number) => new { name, number, })
will "zip" the two sequences into one. From there you can either order the sequence immediately, like
.OrderBy(a => a.number)
or you can instead create a Dictionary<,>, like
.ToDictionary(a => a.number, a => a.name)
But it sounds like what you really want is a SortedDictionary<,>, not a Dictionary<,> which is organized by hash codes. There's no LINQ method for creating a sorted dictionary, but just say
var sorted = new SortedDictionary<int, string>();
foreach (var a in zipResultSequence)
sorted.Add(a.number, a.name);
Or alternatively, with a SortedDictionary<,>, skip LINQ entirely and do it like this:
var sorted = new SortedDictionary<int, string>();
for (int idx = 0; idx < numbers.Count; ++idx) // supposing the two list have same Count
sorted.Add(numbers[idx], names[idx]);
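Putting the Zip and OrderBy pieces together, a complete version might look like the sketch below. The ParallelSort class and its method name are made up for this example. Unlike the dictionary-based variants, it does not require the numbers to be unique, since Dictionary.Add and SortedDictionary.Add throw on a duplicate key.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration.
public static class ParallelSort
{
    // Returns the names ordered by the number at the same position.
    // Zip stops at the end of the shorter list if the lengths differ.
    public static List<string> SortNamesByNumbers(IList<string> names, IList<int> numbers)
    {
        return names
            .Zip(numbers, (name, number) => new { name, number })
            // OrderBy is a stable sort, so equal numbers keep their original order.
            .OrderBy(pair => pair.number)
            .Select(pair => pair.name)
            .ToList();
    }
}
```

For names "c", "a", "b" with numbers 3, 1, 2, the result should be "a", "b", "c".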
To complement Tim's answer, you can also use a custom data structure to associate one name with a number.
public class Person
{
public int Number { get; set; } // in this case you could also name it ID
public string Name { get; set; }
}
Then you would have a List<Person> persons; and you can sort this List by whatever Attribute you like:
List<Person> persons = new List<Person>();
persons.Add(new Person(){Number = 10, Name = "John Doe"});
persons.Add(new Person(){Number = 3, Name = "Max Muster"});
// sort by number
persons = persons.OrderBy(p=>p.Number).ToList();
// alternative sorting method
persons.Sort((a, b) => a.Number.CompareTo(b.Number)); // CompareTo avoids overflow from subtraction
I fixed it by using a dictionary. This was the result:
dictionary.OrderBy(kv => kv.Value).Reverse().Select(kv => kv.Key).ToList();
