C# dedupe List based on split

I'm having a hard time deduping a list based on a specific delimiter.
For example I have 4 strings like below:
apple|pear|fruit|basket
orange|mango|fruit|turtle
purple|red|black|green
hero|thor|ironman|hulk
In this example I want my list to keep only lines with unique values in column 3, so it would result in a List that looks like this:
apple|pear|fruit|basket
purple|red|black|green
hero|thor|ironman|hulk
In the above example I got rid of line 2 because line 1 had the same value in column 3. Any help would be awesome; deduping is tough in C#.
How I'm testing this:
static void Main(string[] args)
{
    BeginListSet = new List<string>();
    startHashSet();
}
public static List<string> BeginListSet { get; set; }
public static void startHashSet()
{
    string[] BeginFileLine = File.ReadAllLines(@"C:\testit.txt");
    foreach (string begLine in BeginFileLine)
    {
        BeginListSet.Add(begLine);
    }
}
public static IEnumerable<string> Dedupe(IEnumerable<string> list, char seperator, int keyIndex)
{
    var hashset = new HashSet<string>();
    foreach (string item in list)
    {
        var array = item.Split(seperator);
        if (hashset.Add(array[keyIndex]))
            yield return item;
    }
}

Something like this should work for you:
// Extension methods must be declared inside a static class.
static IEnumerable<string> Dedupe(this IEnumerable<string> input, char seperator, int keyIndex)
{
    var hashset = new HashSet<string>();
    foreach (string item in input)
    {
        var array = item.Split(seperator);
        // HashSet<T>.Add returns false when the key has already been seen.
        if (hashset.Add(array[keyIndex]))
            yield return item;
    }
}
...
var list = new string[]
{
    "apple|pear|fruit|basket",
    "orange|mango|fruit|turtle",
    "purple|red|black|green",
    "hero|thor|ironman|hulk"
};
foreach (string item in list.Dedupe('|', 2))
    Console.WriteLine(item);
Edit: In the linked question Distinct() with Lambda, Jon Skeet presents the idea in a much better fashion, in the form of a DistinctBy custom method. While similar, his is far more reusable than the idea presented here.
Using his method, you could write
var deduped = list.DistinctBy(item => item.Split('|')[2]);
And you could later reuse the same method to "dedupe" another list of objects of a different type by a key of possibly yet another type.
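The linked DistinctBy method is not reproduced here, so the following is only a minimal sketch of the idea, not Jon Skeet's actual code. The class name DedupeExtensions and the method name DistinctByKey are hypothetical; the different name is deliberate, because .NET 6 and later ship a built-in Enumerable.DistinctBy in System.Linq and an identically named custom extension could be ambiguous with it.

```csharp
using System;
using System.Collections.Generic;

public static class DedupeExtensions
{
    // Hypothetical helper: yields the first element seen for each distinct key,
    // preserving the order of the source sequence.
    public static IEnumerable<TSource> DistinctByKey<TSource, TKey>(
        this IEnumerable<TSource> source, Func<TSource, TKey> keySelector)
    {
        var seenKeys = new HashSet<TKey>();
        foreach (TSource element in source)
        {
            // HashSet<T>.Add returns false when the key is already present.
            if (seenKeys.Add(keySelector(element)))
                yield return element;
        }
    }
}
```

With the four lines from the question, list.DistinctByKey(item => item.Split('|')[2]) would keep the apple, purple, and hero lines and drop the orange one.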

Try this:
var list = new string[]
{
    "apple|pear|fruit|basket",
    "orange|mango|fruit|turtle",
    "purple|red|black|green",
    "hero|thor|ironman|hulk"
};
var dedup = new List<string>();
var filtered = new List<string>();
foreach (var s in list)
{
    var filter = s.Split('|')[2];
    if (dedup.Contains(filter)) continue;
    filtered.Add(s);
    dedup.Add(filter);
}
// foreach (var s in filtered) Console.WriteLine(s);

Can you use a HashSet instead? That will eliminate dupes automatically for you as they are added.
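A plain HashSet<string> would compare whole lines, so on its own it would not dedupe by column 3. One way to make the suggestion work, sketched here under that assumption, is to give the set a custom IEqualityComparer<string> that compares only the chosen column. The ColumnComparer class is hypothetical and not part of the original answer.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical comparer: two lines count as equal when the chosen column matches.
public sealed class ColumnComparer : IEqualityComparer<string>
{
    private readonly char _separator;
    private readonly int _keyIndex;

    public ColumnComparer(char separator, int keyIndex)
    {
        _separator = separator;
        _keyIndex = keyIndex;
    }

    // Extracts the column used as the dedupe key.
    private string Key(string line) => line.Split(_separator)[_keyIndex];

    public bool Equals(string x, string y)
    {
        if (x == null || y == null) return x == y;
        return Key(x) == Key(y);
    }

    public int GetHashCode(string line) => Key(line).GetHashCode();
}
```

A set created with new HashSet<string>(new ColumnComparer('|', 2)) would then reject the orange line once the apple line has been added. HashSet<T> does not document an enumeration order, so if the original order matters, the yield-based Dedupe method is probably the safer choice.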

Maybe you can sort the |-delimited words in alphabetical order and store them in a grid (columns). Then, when you try to insert a line, just check whether a column already has a word starting with the same character.

If LINQ is an option, you can do something like this:
// assume strings is a collection of strings
List<string> list = strings.Select(a => a.Split('|')) // split each line by '|'
    .GroupBy(a => a[2]) // group by third column
    .Select(a => a.First()) // select first line from each group
    .Select(a => string.Join("|", a)) // rebuild the line
    .ToList(); // convert to list of strings
Edit (per Jeff Mercado's comment): this can be simplified further:
List<string> list =
    strings.GroupBy(a => a.Split('|')[2]) // group by third column
        .Select(a => a.First()) // select first line from each group
        .ToList(); // convert to list of strings

Related

How do I remove duplicates from excel range? c#

I've converted the cells in my Excel range from strings into a string list, splitting each cell's text at the commas. I am starting to think I have not actually separated each item and that they are still one whole string. I'm trying to figure out how to do this properly so that each item (e.g. the_red_bucket_01) is its own string.
example of original string in a cell 1 and 2:
Cell1 :
the_red_bucket_01, the_blue_duck_01,_the green_banana_02, the orange_bear_01
Cell2 :
the_purple_chair_01, the_blue_coyote_01,_the green_banana_02, the orange_bear_01
The new list looks like this, though I'm not sure they are separate items:
the_red_bucket_01
the_blue_duck_01
the green_banana_02
the orange_bear_01
the_purple_chair_01
the_blue_coyote_01
the green_banana_02
the orange_bear_01
Now I want to remove duplicates so that the console shows only one of each item, no matter how many of them there are. I can't seem to get my foreach/if statements to work. It is printing out multiple copies of the items, I'm assuming because it is iterating for each item in the list, so it is returning the data that many times.
foreach (Excel.Range item in xlRng)
{
    string itemString = (string)item.Text;
    List<String> fn = new List<String>(itemString.Split(','));
    List<string> newList = new List<string>();
    foreach (string s in fn)
    {
        if (!newList.Contains(s))
        {
            newList.Add(s);
        }
    }
    foreach (string combo in newList)
    {
        Console.Write(combo);
    }
}

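You probably need to trim the strings, because they have leading whitespace, so "string1" is different from " string1". Add the trimmed value as well, otherwise the untrimmed copies will never match:
foreach (string s in fn)
{
    string trimmed = s.Trim();
    if (!newList.Contains(trimmed))
    {
        newList.Add(trimmed);
    }
}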

You can do this much more simply with LINQ by using Distinct. From the documentation:
"Returns distinct elements from a sequence by using the default equality comparer to compare values."
foreach (Excel.Range item in xlRng)
{
    string itemString = (string)item.Text;
    List<String> fn = new List<String>(itemString.Split(','));
    foreach (string combo in fn.Distinct())
    {
        Console.Write(combo);
    }
}
As mentioned in another answer, you may also need to Trim any whitespace, in which case you would do:
fn.Select(x => x.Trim()).Distinct()

Where you need to store keys and values, it's better to use a Dictionary type. Try changing the code from List<T> to Dictionary<TKey, TValue>, i.e.
From:
List<string> newList = new List<string>();
foreach (string s in fn)
    if (!newList.Contains(s))
    {
        newList.Add(s);
    }
to
Dictionary<string, string> newList = new Dictionary<string, string>();
foreach (string s in fn)
    if (!newList.ContainsKey(s))
    {
        newList.Add(s, s);
    }

If you are concerned about the distinct items while you are reading, then just use the Distinct operator like fn.Distinct()
For processing the whole data, I can suggest two methods:
Read in the whole data then use LINQ's Distinct operator
Or use a Set data structure and store each element in that while reading the excel
I suggest that you take a look at the LINQ documentation if you are processing data. It has really great extensions. For even more methods, you can check out the MoreLINQ package.

I think your code would probably work as you expect if you moved newList out of the loop - you create a new variable named newList each loop so it's not going to find duplicates from earlier loops.
You can do all of this more concisely with LINQ:
//set up some similar data
string list1 = "a,b,c,d,a,f";
string list2 = "a,b,c,d,a,f";
List<string> lists = new List<string> {list1,list2};
// find unique items
var result = lists.SelectMany(i=>i.Split(',')).Distinct().ToList();
SelectMany() "flattens" the list of lists into a list.
Distinct() removes duplicates.

var uniqueItems = new HashSet<string>();
foreach (Excel.Range cell in xlRng)
{
    var cellText = (string)cell.Text;
    foreach (var item in cellText.Split(',').Select(s => s.Trim()))
    {
        uniqueItems.Add(item);
    }
}
foreach (var item in uniqueItems)
{
    Console.WriteLine(item);
}

C#: LINQ query with split and parsing

I have an object with a String field containing a comma separated list of integers in it. I'm trying to use LINQ to retrieve the ones that have a specific number in the list.
Here's my approach
from p in source
where (p.Keywords.Split(',').something.Contains(val))
select p;
Where p.Keywords is the field to split.
I've seen the following on the net, but it just doesn't compile:
from p in source
where (p.Keywords.Split(',').Select(x=>x.Trim()).Contains(val))
select p;
I'm a LINQ newbie, but had success with simpler queries.
Update:
Looks like I was missing some details:
source is a List containing the object with the field Keywords with strings like 1,2,4,7
Error I get is about x not being defined.

Here's an example of selecting numbers that are greater than 3:
string str = "1,2,3,4,5,6,7,8";
var numbers = str.Split(',').Select(int.Parse).Where(num => num > 3); // 4,5,6,7,8
If you have a list then change the Where clause:
string str = "1,2,3,4,5,6,7,8";
List<int> relevantNums = new List<int>{5,6,7};
var numbers = str.Split(',').Select(int.Parse).Where(num => relevantNums.Contains(num)); // 5,6,7
If you are not looking for number but for strings then:
string str = "1,2,3,4,5,6,7,8";
List<string> relevantNumsStr = new List<string>{"5","6","7"};
var numbers = str.Split(',').Where(numStr => relevantNumsStr.Contains(numStr)); // 5,6,7

Here is an example of how you can achieve this. For simplicity I called ToString() on the number to check for, but you get the point.
// class to mimic your structure
public class MyObj
{
    public string MyStr { get; set; }
}
// method
void Method()
{
    var myObj = new List<MyObj>
    {
        new MyObj { MyStr = "1,2,3,4,5" },
        new MyObj { MyStr = "9,2,3,4,5" }
    };
    var num = 9;
    var searchResults = from obj in myObj
                        where !string.IsNullOrEmpty(obj.MyStr) &&
                              obj.MyStr.Split(new[] { ',' })
                                  .Contains(num.ToString())
                        select obj;
    foreach (var item in searchResults)
        Console.WriteLine(item.MyStr);
}

Thanks for all the answers, although not in the right language they led me to the answer:
from p in source where (p.Keywords.Split(',').Contains(val.ToString())) select p;
Where val is the number I'm looking for.
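Comparing strings, as above, depends on the stored list having no stray spaces (for example "1, 2, 4"). If that cannot be guaranteed, one option is to parse each token and compare numbers instead. The sketch below assumes a hypothetical Item class standing in for the asker's object (only the Keywords field comes from the question) and a hypothetical helper named WithKeyword.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the asker's object; only Keywords is from the question.
public class Item
{
    public string Keywords { get; set; } = "";
}

public static class KeywordSearch
{
    // Returns the items whose comma-separated Keywords list contains val.
    public static List<Item> WithKeyword(IEnumerable<Item> source, int val)
    {
        return source
            .Where(p => !string.IsNullOrEmpty(p.Keywords))
            .Where(p => p.Keywords
                .Split(',')
                // TryParse skips tokens that are not valid integers instead of throwing.
                .Any(token => int.TryParse(token.Trim(), out int n) && n == val))
            .ToList();
    }
}
```

Because the comparison is numeric, a search for 7 would match "1,2,4,7" and "1, 7" but not "17".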

Find out if string list items startswith another item from another list

I'd like to loop over a string list, and find out if the items from this list start with one of the item from another list.
So I have something like:
List<string> firstList = new List<string>();
firstList.Add("txt random");
firstList.Add("text ok");
List<string> keyWords = new List<string>();
keyWords.Add("txt");
keyWords.Add("Text");

You can do that using a couple of simple foreach loops.
foreach (var t in firstList) {
    foreach (var u in keyWords) {
        if (t.StartsWith(u)) {
            // Do something here.
        }
    }
}

If you just want a list and you'd rather not use query expressions (I don't like them myself; they just don't look like real code to me)
var matches = firstList.Where(fl => keyWords.Any(kw => fl.StartsWith(kw)));

from item in firstList
from word in keyWords
where item.StartsWith(word)
select item

Try this one; it works fine.
var result = firstList.Where(x => keyWords.Any(y => x.StartsWith(y)));
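One detail in the question's sample data: the keyword "Text" is capitalized while the list item "text ok" is not, and StartsWith(string) is case-sensitive, so that item would not match. If a case-insensitive match is what is wanted (an assumption, since the question does not say), the StartsWith overload that takes a StringComparison can be used. The PrefixFilter class and StartingWithAny method are hypothetical names for this sketch.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PrefixFilter
{
    // Returns the items that start with at least one keyword, ignoring case.
    public static List<string> StartingWithAny(
        IEnumerable<string> items, IEnumerable<string> keyWords)
    {
        // Materialize once so the keyword sequence is not re-enumerated per item.
        var prefixes = keyWords.ToList();
        return items
            .Where(item => prefixes.Any(
                kw => item.StartsWith(kw, StringComparison.OrdinalIgnoreCase)))
            .ToList();
    }
}
```

With firstList and keyWords from the question, this would return both "txt random" and "text ok".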

In C#, What is the fastest way to search for elements in a list but do a "StartsWith()" search?

I have a list of strings:
var list = new List<string>();
list.Add("CAT");
list.Add("DOG");
var listofItems = new List<string>();
listofItems.Add("CATS ARE GOOD");
listofItems.Add("DOGS ARE NICE");
listofItems.Add("BIRD");
listofItems.Add("CATAPULT");
listofItems.Add("DOGGY");
and now I want a function like this:
listofItems.Where(r=> list.Contains(r));
but instead of Contains, I want it to do a starts-with check, so 4 out of the 5 items would be returned (BIRD would NOT).
What is the fastest way to do that?

You can use StartsWith inside of an Any
listofItems.Where(item=>list.Any(startsWithWord=>item.StartsWith(startsWithWord)))
You can visualize this as a double for loop, with the inner loop breaking out as soon as it finds a match:
var filteredList = new List<String>();
foreach (var item in listofItems)
{
    foreach (var startsWithWord in list)
    {
        if (item.StartsWith(startsWithWord))
        {
            filteredList.Add(item);
            break;
        }
    }
}
return filteredList;

The fastest way would be usage of another data structure, for example Trie. Basic C# implementation can be found here: https://github.com/kpol/trie
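To illustrate why a trie helps here, below is a minimal sketch written for this question. It is not the API of the linked library, and the PrefixTrie class and MatchesStartOf method are hypothetical names. The prefixes ("CAT", "DOG") are stored in the trie, and each item is then walked character by character, so a lookup costs at most the length of the item regardless of how many prefixes there are.

```csharp
using System;
using System.Collections.Generic;

// Minimal illustrative trie; not the API of any particular library.
public sealed class PrefixTrie
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
        public bool IsEndOfPrefix;
    }

    private readonly Node _root = new Node();

    // Builds the trie from the prefixes to search for.
    public PrefixTrie(IEnumerable<string> prefixes)
    {
        foreach (string prefix in prefixes)
        {
            Node node = _root;
            foreach (char c in prefix)
            {
                if (!node.Children.TryGetValue(c, out Node next))
                {
                    next = new Node();
                    node.Children[c] = next;
                }
                node = next;
            }
            node.IsEndOfPrefix = true;
        }
    }

    // True if any stored prefix is a prefix of text (ordinal, case-sensitive).
    public bool MatchesStartOf(string text)
    {
        Node node = _root;
        if (node.IsEndOfPrefix) return true; // an empty prefix matches everything
        foreach (char c in text)
        {
            if (!node.Children.TryGetValue(c, out node)) return false;
            if (node.IsEndOfPrefix) return true;
        }
        return false;
    }
}
```

With the question's data, new PrefixTrie(list) followed by listofItems.Where(trie.MatchesStartOf) would keep every item except BIRD. For only two prefixes, the Any/StartsWith version is likely just as fast; the trie is worth considering when the prefix list is large.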

This should get you what you need in a more simplified format:
var result = listofItems.Select(n =>
{
    bool res = list.Any(v => n.StartsWith(v));
    return res
        ? n
        : string.Empty;
}).Where(b => !b.Equals(string.Empty));

The Trie data structure is what you need. Take a look at this more mature library: TrieNet
using Gma.DataStructures.StringSearch;
...
var trie = new SuffixTrie<int>(3);
trie.Add("hello", 1);
trie.Add("world", 2);
trie.Add("hell", 3);
var result = trie.Retrieve("hel");

Comparing two strings with different orders

I have a dictionary with a list of strings that each look something like:
"beginning|middle|middle2|end"
Now what I wanted was to do this:
List<string> stringsWithPipes = new List<string>();
stringsWithPipes.Add("beginning|middle|middle2|end");
...
if (stringsWithPipes.Contains("beginning|middle|middle2|end"))
{
    return true;
}
The problem is that the string I'm comparing it against is built slightly differently, so it ends up being more like:
if (stringsWithPipes.Contains("beginning|middle2|middle||end"))
{
    return true;
}
and obviously this ends up being false. However, I want to consider it true, since it's only the order that is different.
What can I do?

You can split each stored string on |, split the string to be compared the same way, and then use Enumerable.Except along with Enumerable.Any, like this:
List<string> stringsWithPipes = new List<string>();
stringsWithPipes.Add("beginning|middle|middle2|end");
stringsWithPipes.Add("beginning|middle|middle3|end");
stringsWithPipes.Add("beginning|middle2|middle|end");
var array = stringsWithPipes.Select(r => r.Split('|')).ToArray();
string str = "beginning|middle2|middle|end";
var compareArray = str.Split('|');
foreach (var subArray in array)
{
    // No element of subArray is missing from compareArray
    if (!subArray.Except(compareArray).Any())
    {
        //Exists
        Console.WriteLine("Item exists");
        break;
    }
}
This can surely be optimized, but the above is one way to do it.

Try this instead:
if (stringsWithPipes.Any(p => p.Split('|')
    .All(k => "beginning|middle2|middle|end".Split('|')
    .Contains(k))))
Hope this helps!

You need to split on a delimiter:
var searchString = "beginning|middle|middle2|end";
var searchList = searchString.Split('|');
var stringsWithPipes = new List<string>();
stringsWithPipes.Add("beginning|middle|middle2|end");
...
return stringsWithPipes.Select(x => x.Split('|')).Any(x => Match(searchList,x));
Then you can implement Match in multiple ways.
In the first version, the candidate must contain all the search phrases but could include others:
bool Match(string[] search, string[] match) {
    return search.All(x => match.Contains(x));
}
In the second, it must contain all the search phrases and cannot include others:
bool Match(string[] search, string[] match) {
    return search.All(x => match.Contains(x)) && search.Length == match.Length;
}

That should work.
List<string> stringsWithPipes = new List<string>();
stringsWithPipes.Add("beginning|middle|middle2|end");
string[] stringToVerifyWith = "beginning|middle2|middle||end".Split(new[] { '|' },
    StringSplitOptions.RemoveEmptyEntries);
if (stringsWithPipes.Any(s => !s.Split('|').Except(stringToVerifyWith).Any()))
{
    return true;
}
Splitting with RemoveEmptyEntries drops the empty entry created by the double |. Except then removes from each list entry every element that also appears in the string being verified. If nothing is left (the !...Any() check; .Count() == 0 would work too), every element of that entry also appears in the other string.
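Note that the Except-based check only runs in one direction: an entry such as "beginning|end" would also pass, because all of its elements appear in the other string. If both sides must contain exactly the same segments, one possible alternative (not from the answer above) is HashSet<string>.SetEquals. The PipeCompare class and SameSegments method are hypothetical names for this sketch, and like any set comparison it ignores repeated segments.

```csharp
using System;
using System.Collections.Generic;

public static class PipeCompare
{
    // True if both strings contain the same set of non-empty segments, in any order.
    public static bool SameSegments(string a, string b)
    {
        var options = StringSplitOptions.RemoveEmptyEntries;
        var left = new HashSet<string>(a.Split(new[] { '|' }, options));
        // SetEquals ignores order and duplicates on both sides.
        return left.SetEquals(b.Split(new[] { '|' }, options));
    }
}
```

The original condition could then be written as stringsWithPipes.Any(s => PipeCompare.SameSegments(s, "beginning|middle2|middle||end")).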
