Compare strings from 2 different collections and extract the difference - C#

I need to compare 2 sets of strings that share some names, and I need to extract the names they have in common. How can I do that? Both are collections; let's say one of them is "Sanjay, Race" and the other is "Let, Sanjay". I need to extract Sanjay.

It depends on what data structure you have, but I suggest you work with an Array or a List if your collection is big enough to care about optimisation.
You want to go through the first of the two lists and, for each element of list1, compare it to every element of list2. Be careful, this might take a while (if your collection is big enough).
It might look like this:
using System.Collections.Generic;
LinkedList<string> set1 = new LinkedList<string>();
LinkedList<string> set2 = new LinkedList<string>();
LinkedList<string> extracted = new LinkedList<string>();
// fill in your sets with loops if needed,
// see https://learn.microsoft.com/fr-fr/dotnet/api/system.collections.generic.linkedlist-1?view=net-7.0
foreach (string name in set1){
    foreach (string name2 in set2){
        if(string.Compare(name,name2)==0){
            // AddLast appends the match; AddAfter would need an existing node as its first argument
            extracted.AddLast(name);
        }
    }
}
Please, do correct me (nicely) :)
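For larger collections, the nested loop can be avoided with a set-based intersection. The following is a minimal sketch, assuming both collections can be treated as IEnumerable<string> and that names must match exactly (ordinal, case-sensitive). The names NameMatcher and CommonNames are illustrative and do not come from the answer above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class NameMatcher
{
    // Returns the distinct names that appear in both collections.
    // Enumerable.Intersect uses hashing internally, so each
    // collection is enumerated only once.
    public static List<string> CommonNames(IEnumerable<string> first, IEnumerable<string> second)
    {
        return first.Intersect(second, StringComparer.Ordinal).ToList();
    }
}
```

Called with { "Sanjay", "Race" } and { "Let", "Sanjay" }, CommonNames returns a list containing only "Sanjay".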

Related

Code is modifying the wrong variable... why?

I have a strange problem where some code I am writing modifies both the copy and the original List. I have boiled the problem down as much as I can to show the error in a single file. My real-world example is a lot more complex, but at the root of it all this is the problem.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace TestingRandomShit
{
    class Program
    {
        private static string rawInput;
        private static List<string> rawList;
        private static List<string> modifiedList;
        static void Main(string[] args)
        {
            rawInput = "this is a listing of cats";
            rawList = new List<string>();
            rawList.Add("this");
            rawList.Add("is");
            rawList.Add("a");
            rawList.Add("listing");
            rawList.Add("of");
            rawList.Add("cats");
            PrintAll();
            modifiedList = ModIt(rawList);
            Console.WriteLine("\n\n**** Mod List Code has been run **** \n\n");
            PrintAll();
        }
        public static List<string> ModIt(List<string> wordlist)
        {
            List<string> huh = new List<string>();
            huh = wordlist;
            for (int i = 0; i < huh.Count; i++)
            {
                huh[i] = "wtf?";
            }
            return huh;
        }
        //****************************************************************************************************************
        //Below is just a print function.. all the action is above this line
        public static void PrintAll()
        {
            Console.WriteLine(": Raw Input :");
            Console.WriteLine(rawInput);
            if (rawList != null)
            {
                Console.WriteLine("\n: Original List :");
                foreach (string line in rawList)
                {
                    Console.WriteLine(line);
                }
            }
            if (modifiedList != null)
            {
                Console.WriteLine("\n: Modified List :");
                foreach (string wtf in modifiedList)
                {
                    Console.WriteLine(wtf);
                }
                Console.ReadKey();
            }
        }
    }
}
Basically, I have three variables: a string and two Lists. The original code does some tokenisation on the string, but for this demo I simply use List.Add() to fake it and keep it simple to read.
So I now have a string and a List with a single word in each element.
This is the confusing part that I do not understand. I know it has something to do with references, but I cannot work out how to fix it.
There is a method I have called ModIt(). It simply takes in a List, then makes a completely new List called huh, copies the original list over the new list and then changes every line in huh to "wtf?".
Now as I understand it.. I should end up with 3 variables...
1) a string
2) a List with a different word in each element
3) a List of the same length as the other with each element being "wtf?"
But what happens is that if I try to print out both Lists, they BOTH have every element set to "wtf?".... so yeah.. wtf man? I am super confused. I mean, in ModIt I even build an entirely new List rather than modifying the one being passed in, but it doesn't seem to affect anything.
This is the output...
: Raw Input : this is a listing of cats
: Original List : this is a listing of cats
**** Mod List Code has been run ****
: Raw Input : this is a listing of cats
: Original List : wtf? wtf? wtf? wtf? wtf? wtf?
: Modified List : wtf? wtf? wtf? wtf? wtf? wtf?

huh = wordlist; doesn't copy the items of wordlist into a new list, it copies the reference to the same object occupied by wordlist (i.e. huh and wordlist then point at the same object in memory).
If you want a copy, the simplest way to produce one is using LINQ:
List<string> huh = wordlist.ToList();
Note that this will be a "shallow copy". If your list stores reference-type objects, both the old and new lists will hold references to the same objects.
See here for more reading on value vs reference types, and then here if you need a deep copy.
Since all you're doing is replacing the value at an index of the list, I imagine a shallow copy is fine.
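The difference between the two assignments can be sketched side by side. This is a minimal illustration only; the class CopyDemo and its method names are hypothetical and are not part of the question's code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CopyDemo
{
    // Assigning a list variable copies the reference, so writes made
    // through the alias are visible through the caller's variable too.
    public static List<string> ModifyAlias(List<string> wordlist)
    {
        List<string> alias = wordlist;
        for (int i = 0; i < alias.Count; i++) alias[i] = "changed";
        return alias;
    }

    // ToList() allocates a new list, so the caller's list is left untouched.
    public static List<string> ModifyCopy(List<string> wordlist)
    {
        List<string> copy = wordlist.ToList();
        for (int i = 0; i < copy.Count; i++) copy[i] = "changed";
        return copy;
    }
}
```

ModifyAlias returns the very same list object it was given, which is what happens in ModIt; ModifyCopy returns a different list and leaves the input as it was.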

John's already commented on the faulty code:
List<string> huh = new List<string>();
huh = wordlist;
Here you make a new list, then throw it away and attach your reference huh to your old list, so both huh and wordlist refer to the same thing.
I just wanted to point out the non-LINQ way of copying a list:
List<string> huh = new List<string>(wordlist);
Pass the old list into the new list's constructor; List<T> has a constructor that takes a collection of objects to store in the new list.
You now have two lists, and initially they both refer to the same strings. Because strings cannot be altered, assigning a different string to an element of one list (rather than just shuffling or removing elements) puts a new string in that list and leaves the other list unchanged.
It's a worthy point though: you'll have 2 lists pointing to the same objects, so if in the future you have the same scenario with objects that can be changed, and you change an object through one list, the change will also show up in the other list:
//imagine the list stores people, the age of the first
//person in the list is 27, and we increment it
list1[0].PersonAge++;
//list2 is a different list but refers to the same people objects
//this will print 28
Console.Out.WriteLine(list2[0].PersonAge);
That's what we mean by a shallow copy
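To make the shallow-copy behaviour concrete, here is a small sketch. The Person class is an assumption modelled on the PersonAge property used above, and ShallowCopyDemo with its two methods is purely illustrative.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical mutable element type, modelled on the PersonAge example.
public class Person
{
    public int PersonAge { get; set; }
}

public static class ShallowCopyDemo
{
    // Copies the list, but not the Person objects inside it:
    // both lists end up referring to the same Person instances.
    public static List<Person> ShallowCopy(List<Person> source)
    {
        return new List<Person>(source);
    }

    // Copies the list and creates a new Person for every element,
    // so later changes to one list's people do not affect the other.
    public static List<Person> DeepCopy(List<Person> source)
    {
        var result = new List<Person>(source.Count);
        foreach (Person p in source)
        {
            result.Add(new Person { PersonAge = p.PersonAge });
        }
        return result;
    }
}
```

After a ShallowCopy, incrementing list1[0].PersonAge is visible through the copy as well; after a DeepCopy it is not.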

Your problem comes from the fact that in C# we have reference types and value types.
For value types, the assignment operator (=) copies the value itself, but for reference types it is different. A reference-type variable does not store the actual data; it stores a location in memory where the data is held, so assignment copies only that location. Like pointers, if you come from the C world.
Have a look at ICloneable. Also read Parameter passing by Jon Skeet; it gives a good description of value and reference types.

C# - Compare 2 lists with custom elements

I have 2 lists. One contains the search elements, the other contains the data.
I need to find each element in list2 that contains any string from list1 ("cat" or "dog"). For example:
List<string> list1 = new List<string>();
list1.Add("Cat");
list1.Add("Dog");
// ... ~1000 items
List<string> list2 = new List<string>();
list2.Add("Gray Cat");
list2.Add("Black Cat");
list2.Add("Green Duck");
list2.Add("White Horse");
list2.Add("Yellow Dog Tasmania");
list2.Add("White Horse");
// ... ~million items
My expected listResult is {"Gray Cat", "Black Cat", "Yellow Dog Tasmania"} (because these contain "Cat" or "Dog" from list1). Instead of nested looping, do you have any idea how to make this run faster?
My current solution is below, but it seems too slow:
foreach (string str1 in list1)
{
    foreach (string str2 in list2)
    {
        if (str2.Contains(str1))
        {
            listResult.Add(str2);
        }
    }
}

An excellent use case for parallelization!
LINQ approach without parallelization (internally equivalent to your approach, except that the inner loop stops as soon as one match is found, whereas your approach keeps searching for further matches):
List<string> listResult = list2.Where(x => list1.Any(x.Contains)).ToList();
Parallelize the loop with AsParallel(). On a multicore system this gives a large performance improvement.
List<string> listResult = list2.AsParallel().Where(x => list1.Any(x.Contains)).ToList();
Runtime comparison:
(4 core system, list1 1000 items, list2 1.000.000 items)
Without AsParallel(): 91 seconds
With AsParallel(): 23 seconds
The other way, with Parallel.ForEach and a thread-safe result list:
System.Collections.Concurrent.ConcurrentBag<string> listResult = new System.Collections.Concurrent.ConcurrentBag<string>();
System.Threading.Tasks.Parallel.ForEach<string>(list2, str2 =>
{
    foreach (string str1 in list1)
    {
        if (str2.Contains(str1))
        {
            listResult.Add(str2);
            //break the loop if one match was found to avoid duplicates and improve performance
            break;
        }
    }
});
Side note: You have to iterate over list2 first and break; after a match, otherwise you add items twice: https://dotnetfiddle.net/VxoRUW

Contains will use a 'naive approach' to string searching. You can improve on that by looking into string search algorithms.
One way to do this could be to create a generalized Suffix tree for all your search words. Then iterate through all the items in your list2 to see if they match.
Still, this might be overkill. You can first try with some simple optimizations as proposed by fubo to see if that's fast enough for you.

A List<string> is not a suitable data structure for solving this problem efficiently.
What you are looking for is a Trie or a DAWG to store every word from your original dictionary, list1.
The aim is that, for every letter of a word from list2, you only have 0-26 checks.
With this data structure, instead of reading through a big list of words until you find one, you look a word up as you would in a paper dictionary, and that should be faster. Applications that look for all the words of a language in a text use this principle.
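A rough sketch of the trie idea follows. It assumes plain case-sensitive substring matching, like string.Contains in the question; the class WordTrie and its members are hypothetical names introduced here. It is a simple trie that restarts the walk at every character position; an Aho-Corasick automaton would avoid those restarts but is longer to write.

```csharp
using System;
using System.Collections.Generic;

public class WordTrie
{
    private class Node
    {
        public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
        public bool IsWordEnd;
    }

    private readonly Node root = new Node();

    // Adds one search word to the trie, one character per level.
    public void Add(string word)
    {
        if (string.IsNullOrEmpty(word)) return;
        Node current = root;
        foreach (char c in word)
        {
            Node next;
            if (!current.Children.TryGetValue(c, out next))
            {
                next = new Node();
                current.Children[c] = next;
            }
            current = next;
        }
        current.IsWordEnd = true;
    }

    // Returns true if any stored word occurs as a substring of text.
    // For every start position the trie is walked until a stored word
    // ends or no child matches the next character.
    public bool ContainsAnyWordIn(string text)
    {
        for (int start = 0; start < text.Length; start++)
        {
            Node current = root;
            for (int i = start; i < text.Length; i++)
            {
                if (!current.Children.TryGetValue(text[i], out current)) break;
                if (current.IsWordEnd) return true;
            }
        }
        return false;
    }
}
```

With every entry of list1 added to the trie, the filter could then be written as list2.Where(trie.ContainsAnyWordIn).ToList() (with System.Linq imported), so each item in list2 is checked against all search words in one pass over the trie instead of 1000 separate Contains calls.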

Since it seems you want to match entire words, you can use a HashSet to do a more efficient search and prevent iterating list1 and list2 more than once.
HashSet<string> species =
new HashSet<string>(list1);
List<string> result = new List<string>();
foreach (string animal in list2)
{
if (animal.Split(' ').Any(species.Contains))
result.Add(animal);
}
If I run this (with list1 containing 1000 items and list2 containing 100,000 items) on a 4 core laptop:
The algorithm in the question: 37 seconds
The algorithm using AsParallel: 7 seconds
This algorithm: 0.17 seconds
With 1 million items in list2 this algorithm takes about a second.
While this approach does work, it might produce incorrect results. If list1 contains Lion, then a Sea lion in list2 will be added to the results even if list1 contains no Sea lion. (This happens if you use a case-insensitive StringComparer in the HashSet as suggested below.)
To solve that problem, you would need some way to parse the strings in list2 into a more complex object Animal. If you can control your input, that may be a trivial task, but in general it is hard. If you have some way of doing that, you can use a solution like the following:
public class Animal
{
public string Color { get; set; }
public string Species { get; set; }
public string Breed { get; set; }
}
And then search the species in a HashSet.
HashSet<string> species = new HashSet<string>
{
"Cat",
"Dog",
// etc.
};
List<Animal> animals = new List<Animal>
{
new Animal {Color = "Gray", Species = "Cat"},
new Animal {Color = "Green", Species = "Duck"},
new Animal {Color = "White", Species = "Horse"},
new Animal {Color = "Yellow", Species = "Dog", Breed = "Tasmania"}
// etc.
};
var result = animals.Where(a => species.Contains(a.Species));
Note that the string search in the HashSet is case sensitive. If you do not want that, you can supply a StringComparer as a constructor argument:
new HashSet<string>(StringComparer.CurrentCultureIgnoreCase)

Check if Characters in ArrayList C# exist - C# (2.0)

I was wondering if there is a way to search an ArrayList to see if a record contains certain characters and, if so, grab the whole entry and put it into a string. For example:
list[0] = @"C:\Test3\One_Title_Here.pdf";
list[1] = @"D:\Two_Here.pdf";
list[2] = @"C:\Test\Hmmm_Joke.pdf";
list[3] = @"C:\Test2\Testing.pdf";
Looking for: "Hmmm_Joke.pdf"
Want to get: "C:\Test\Hmmm_Joke.pdf" and put it in the Remove()
protected void RemoveOther(ArrayList list, string Field)
{
    string removeStr;
    // Put code in here to search for part of a string which is Field
    // Grab that string here and put it into a new variable
    list.Contains();
    list.Remove(removeStr);
}
Hope this makes sense. Thanks.

Loop through each string in the array list and if the string does not contain the search term then add it to new list, like this:
string searchString = "Hmmm_Joke.pdf";
ArrayList newList = new ArrayList();
foreach(string item in list)
{
if(!item.ToLower().Contains(searchString.ToLower()))
{
newList.Add(item);
}
}
Now you can work with the new list, which excludes any entries that match the search string.
Note: Both strings are converted to lowercase for the comparison to avoid casing issues.

In order to remove a value from your ArrayList you'll need to loop through the values and check each one to see if it contains the desired value. Keep track of that index, or indexes if there are many.
Then after you have found all of the values you wish to remove, you can call ArrayList.RemoveAt to remove the values you want. If you are removing multiple values, start with the largest index and then process the smaller indexes, otherwise, the indexes will be off if you remove the smallest first.
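The largest-index-first advice can be folded into a single backwards loop. This is a sketch under the assumption that the ArrayList holds strings and that matching should ignore case; ArrayListCleaner and RemoveContaining are illustrative names.

```csharp
using System;
using System.Collections;

public static class ArrayListCleaner
{
    // Removes every entry that contains searchString (case-insensitive)
    // and returns how many entries were removed.
    public static int RemoveContaining(ArrayList list, string searchString)
    {
        int removed = 0;
        // Walk backwards so that RemoveAt never shifts an index
        // that still has to be visited.
        for (int i = list.Count - 1; i >= 0; i--)
        {
            string item = list[i] as string;
            if (item != null &&
                item.IndexOf(searchString, StringComparison.OrdinalIgnoreCase) >= 0)
            {
                list.RemoveAt(i);
                removed++;
            }
        }
        return removed;
    }
}
```

Because the loop runs from the end of the list towards the start, there is no need to collect the indexes first.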

This will do the job without raising an InvalidOperationException:
string searchString = "Hmmm_Joke.pdf";
foreach (string item in list.ToArray())
{
if (item.IndexOf(searchString, StringComparison.OrdinalIgnoreCase) >= 0)
{
list.Remove(item);
}
}
I also made it case insensitive.
Good luck with your task.

I would rather use LINQ to solve this. Since a collection cannot be modified while it is being enumerated, we should first collect what we want removed and then remove it.
var toDelete = Array.FindAll(list.ToArray(), s =>
    s.ToString().IndexOf("Hmmm_Joke.pdf", StringComparison.OrdinalIgnoreCase) >= 0
).ToList();
toDelete.ForEach(item => list.Remove(item));
Of course, use a variable in place of the hardcoded string.
I would also recommend reading this question: Case insensitive 'Contains(string)'
It discusses the proper way to compare strings, since converting to upper or lower case costs performance and may result in unexpected behaviour when dealing with file names like: 文書.pdf

Comparing strings multiple times

I'm generating random scripts, but I have to guarantee that each new one is unique (hasn't been repeated before). So basically each script that has already been generated gets compared against every new script.
Instead of just using normal string compare, I'm thinking there must be a way to hash each new script so that comparison will be faster.
Any ideas on how to hash strings to make multiple comparisons faster?

One way is to use a HashSet<String>
The HashSet<T> class provides high performance set operations. A set is
a collection that contains no duplicate elements, and whose elements
are in no particular order.
HashSet<string> scripts = new HashSet<string>();
string generated_script = "some_text";
if (!scripts.Contains(generated_script)) // if the HashSet<String> doesn't contain your string already, you can add it
{
    scripts.Add(generated_script);
}
Also, you can check for the existence of duplicate items in an array.
But this may not be very efficient compared to HashSet<String>
string[] array = new[] {"demo", "demo", "demo"};
string compareWith = "demo";
int duplicates_count = array.GroupBy(x => x).Count(g => g.Count() > 1);

Use HashSet like below
string uniqueCode= "ABC";
string uniqueCode1 = "XYZ";
string uniqueCode2 = "ABC";
HashSet<string> uniqueList = new HashSet<string>();
uniqueList.Add(uniqueCode);
uniqueList.Add(uniqueCode1);
uniqueList.Add(uniqueCode2);
If you check the Count of uniqueList you will see 2, so ABC will not be there two times.

You could use a HashSet. A hash set is guaranteed to never contain duplicates.

Store the script along with its hash:
class ScriptData
{
public ScriptData(string script)
{
this.ScriptHash=script.GetHashCode();
this.Script=script;
}
public int ScriptHash{get;private set;}
public string Script{get;private set;}
}
Then, whenever you need to check if your new random script is unique, just take the hash code of the new script and search all your ScriptData instances for any with the same hash code. If you don't find any, you know your new random script is unique. If you do find some, they may be the same, and you'll have to compare the actual text of the scripts to see if they're identical.

You can store each generated string in a HashSet.
For each new string you call the Contains method, which runs in O(1) time on average. This is an easy way to decide whether the newly generated string was generated before.
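The HashSet suggestions above can be sketched as a small helper. It relies on HashSet<T>.Add returning false when the value is already present, which makes a separate Contains call unnecessary. The class UniqueScriptGenerator, the generate delegate and the maxAttempts limit are assumptions made for this illustration.

```csharp
using System;
using System.Collections.Generic;

public class UniqueScriptGenerator
{
    private readonly HashSet<string> seen = new HashSet<string>(StringComparer.Ordinal);

    // Calls the generator until it produces a script that has not been
    // returned before, or gives up after maxAttempts tries.
    public string NextUnique(Func<string> generate, int maxAttempts)
    {
        for (int attempt = 0; attempt < maxAttempts; attempt++)
        {
            string candidate = generate();
            // Add returns false if the set already contains the candidate.
            if (seen.Add(candidate)) return candidate;
        }
        throw new InvalidOperationException("No unique script found within maxAttempts.");
    }
}
```

The set hashes each script once on insertion, so every new script is checked against all earlier ones without comparing it to each of them in turn.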

Naming a variable from a text file

I'm making a program in C# that uses mathematical sets of numbers. I've defined the class Conjunto (which means "set" in Spanish). Conjunto has an ArrayList that contains all the numbers of the set. It also has a string called "ID", which is pretty much what it sounds like: the name of an instance of Conjunto.
The program has methods that apply the operations of union, intersection, etc., between the sets.
Everything was fine, but now I have a text file with sentences like:
A={1,2,3}
B={2,4,5}
A intersection B
B union A
And so on. The thing is, I don't know how many sets the text file contains, and I don't know how to name the variables after those sentences. For example, name one instance of Conjunto A, and name another instance B.
Sorry for the grammar, English is not my native language.
Thanks!

It's pretty complicated to create variables dynamically, and pretty useless unless you have some already existing code that expects certain variables.
Use a Dictionary<string, Conjunto> to hold your instances of the class. That way you can access them by name.
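A minimal sketch of the dictionary approach follows. To stay self-contained it stores HashSet<int> values instead of the asker's Conjunto class, and it assumes the file contains only lines shaped like "A={1,2,3}", "A intersection B" and "B union A". The class SetInterpreter and its Execute method are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public class SetInterpreter
{
    // Maps a set name from the file (for example "A") to its numbers.
    private readonly Dictionary<string, HashSet<int>> sets = new Dictionary<string, HashSet<int>>();

    // Handles one line: either a definition such as "A={1,2,3}"
    // or an operation such as "A intersection B". Returns the resulting set.
    public HashSet<int> Execute(string line)
    {
        int eq = line.IndexOf('=');
        if (eq >= 0)
        {
            string name = line.Substring(0, eq).Trim();
            string body = line.Substring(eq + 1).Trim().Trim('{', '}');
            var numbers = new HashSet<int>();
            foreach (string part in body.Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries))
            {
                numbers.Add(int.Parse(part.Trim()));
            }
            sets[name] = numbers;
            return numbers;
        }

        string[] tokens = line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        if (tokens.Length != 3) throw new FormatException("Unrecognized line: " + line);

        // Copy the left operand so the stored set is not modified.
        var result = new HashSet<int>(sets[tokens[0]]);
        HashSet<int> right = sets[tokens[2]];
        switch (tokens[1])
        {
            case "union": result.UnionWith(right); break;
            case "intersection": result.IntersectWith(right); break;
            default: throw new FormatException("Unknown operation: " + tokens[1]);
        }
        return result;
    }
}
```

Feeding the four example lines from the question to Execute in order would give {2} for "A intersection B" and {1,2,3,4,5} for "B union A", without ever declaring a variable named A or B.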

First off, if you don't target a version lower than .NET 2.0, use List instead of ArrayList. If I were you, I wouldn't reinvent the wheel. Use HashSet or SortedSet to store the numbers, and then you can use their built-in union and intersection.
Secondly, what is your goal? Do you want to have just the output set after all operations? Do you want to read and store all actions and then process them on some event?

First of all, your program is approached from the wrong side. I would advise starting a new one. One way to name "variables" dynamically is to create class objects and edit their properties.
This is what I made as a starting platform:
First of all I created a class called set
class set
{
    public string ID { get; set; }
    public List<int> numbers { get; set; }
}
Then I wrote the code to sort the whole text file into a list of those classes:
List<set> Sets = new List<set>();
string textfile = "your text file";
char[] spliter = new char[] { ',' }; //switch that , to whatever you want but this will split whole textfile into fragments of sets
List<string> files = textfile.Split(spliter).ToList<string>();
int i = 1;
foreach (string file in files)
{
    set set = new set();
    set.ID = i.ToString();
    set.numbers = new List<int>(); //the list must be created before numbers can be added
    char[] secondspliter = new char[] { ',' }; //switch that , to whatever you want but this will split one set into lone numbers
    List<string> data = file.Split(secondspliter).ToList<string>(); //split the current fragment, not the whole text file
    foreach (string number in data)
    {
        bool success = Int32.TryParse(number, out int outcome);
        if (success)
        {
            set.numbers.Add(outcome);
        }
    }
    i++;
    Sets.Add(set);
}
Hope it helps someone.
