Checking for duplicates in an array - c#

I have checked for answers on the website, but I am curious about the way I am writing my C# code to check for duplicates in an array. My function sort of works, but when I print out my 5 sets of arrays at the end, duplicates are detected and yet the output still contains duplicates. I have also commented out the part that, when a duplicate is detected, generates a random number to replace the duplicate found in that element of the array. My logic seems sound: a nested for loop that starts with the first element and then loops through the same array 5 times to see whether the initial element matches. So start at element 0, loop 5 times to see if elements 0 through 4 match element 0, then move on to element 1, and so on. On top of that, generating a random number when a duplicate is found and replacing that element is not working out so well. I did see a solution using a dictionary object and its keys, but I don't want to do that; I want to use raw code to solve this algorithm, no special objects.
My function:
void checkForDuplicates()
{
    int[] test = { 3, 6, 8, 10, 2, 3 };
    int count = 0;
    Random ranDuplicateChange;
    for (int i = 0; i < test.Length; i++)
    {
        count = 0;
        Console.WriteLine(" {0} :: The current number is: {1} ", i, test[i]);
        for (int j = 0; j < test.Length; j++)
        {
            if (test[i] == test[j])
            {
                count++;
                if (count >= 2)
                {
                    Console.WriteLine("Duplicate found: {0}", test[j]);
                    //ranDuplicateChange = new Random();
                    //test[j] = ranDuplicateChange.Next(1, 72);
                }
            }
        }
    }
}

You can get them using lambda expressions:
var duplicates = test.GroupBy(a => a)
.Where(g => g.Count() > 1)
.Select(i => new { Number = i.Key, Count = i.Count()});
This returns an IEnumerable of an anonymous type with two properties, Number and Count.

I want to just use raw code to solve this algorithm, no special objects.
I believe what you mean by this is not using LINQ or any other library methods that are out there that can achieve your end-goal easily, but simply to manipulate the array you have and find a way to find duplicates.
Let's leave code aside for a moment and see what we need to do to find the duplicates. Your approach is to start from the beginning, and compare each element to other elements in the array, and see if they are duplicates. Not a bad idea, so let's see what we need to do to implement that.
Your array:
test = 3, 6, 8, 10, 2, 3
What we must do is, take 3, see if it's equal to the next element, and then the next, and then the next, until the end of the array. If duplicates are found replace them.
Second round, take 6, and since we already compared first element 3, we start with 8 and go on till the end of the array.
Third round, start from 8, and go on.
You get the drift.
Now let's take a look at your code.
Now we start at zero-element (I'm using zero based index for convenience), which is 3, and then in the inner-loop of j we see if the next element, 6, is a duplicate. It's not, so we move on. And so on. We do find a duplicate at the last position, then count it. So far so good.
Next loop, now here is your first mistake. Your second loop, j, starts at 0, so when i=1, the first iteration of your j starts at 0, so you're comparing test[1] vs test[0], which you already compared in the first round (your outer loop). What you should instead do is, compare test[1] vs test[2].
So think what you need to change in your code to achieve this, in terms of i and j in your loops. What you want to do is, start your j loop one more than your current i value.
Next, you increment count whenever you find a duplicate, which is fine. But printing the number only when count >= 2 doesn't make sense. Because, you started it at 0, and increment only if you found a duplicate, so even if your counter is 1, that means you've found a duplicate. You should instead simply generate a random number and replace test[j] with it.
I'm intentionally not giving you code samples as you say that you're eager to learn yourself how to solve this problem, which is always a good thing. Hope the above information is useful.
Disclaimer:
All of the above simply shows you how to fix your current code, which still has flaws of its own. To begin with, your 'replace with a random number' idea is not watertight. For instance, if it generates the same number you're trying to replace (the odds are low, but it can happen, and a program shouldn't rely on chance to stay correct), you'd still end up with duplicates. The same goes for a generated number that already appears earlier in the list. For example, say your list is 2, 3, 5, 3. The first iteration of i correctly determines that 2 is not duplicated. In the next iteration, you find that 3 is duplicated and replace the second 3. However, if the new randomly generated number turns out to be 2, then, since we've already ruled 2 out as a duplicate, the newly generated 2 will never be overwritten and you'll end up with a list that contains duplicates. To combat that, you can revert to your original idea of starting the j loop at 0 every time and replacing whenever a duplicate is encountered. To do that, you'll need an extra condition that checks whether i == j and, if so, skips that iteration of the inner loop. But even then, the newly generated random number could equal one of the numbers already in the list, again ruining your logic.
So really, it's fine to attempt this problem this way, but you should also compare your random number to your list every time you generate a number and if it's equal, then generate another random number, and so on until you're good.
But at the end of the day, to remove duplicates from a list and replace them with unique numbers, there are far easier and less error-prone methods using LINQ etc.
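For readers who do want to see the loop-only approach spelled out, here is a minimal sketch of the idea described above: start j at i + 1, and re-check every random candidate against the whole array before accepting it. The names DuplicateReplacer, OccursElsewhere and ReplaceDuplicates, and the min/max parameters, are made up for illustration; the sketch assumes the range is large enough to hold a unique value for every element.

```csharp
using System;

static class DuplicateReplacer
{
    // Returns true if value occurs in the array at any index other than skipIndex.
    static bool OccursElsewhere(int[] array, int value, int skipIndex)
    {
        for (int k = 0; k < array.Length; k++)
        {
            if (k != skipIndex && array[k] == value)
                return true;
        }
        return false;
    }

    // Replaces each later duplicate with a random number in [min, max)
    // that does not occur anywhere else in the array.
    public static void ReplaceDuplicates(int[] array, Random random, int min, int max)
    {
        // Assumption: the range must hold at least array.Length distinct values,
        // otherwise the do/while below might never find a free number.
        if (max - min < array.Length)
            throw new ArgumentException("Range too small to make every element unique.");

        for (int i = 0; i < array.Length; i++)
        {
            // Start one past i: earlier pairs were compared in earlier rounds.
            for (int j = i + 1; j < array.Length; j++)
            {
                if (array[i] != array[j])
                    continue;

                int candidate;
                do
                {
                    candidate = random.Next(min, max);
                } while (OccursElsewhere(array, candidate, j));

                array[j] = candidate;
            }
        }
    }
}
```

Called as DuplicateReplacer.ReplaceDuplicates(test, new Random(), 1, 72) on the array from the question, this leaves the first five values alone and replaces only the trailing 3.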

Related

How do I count modified lines of code?

I have a program which counts lines of code (excluding comments, braces, whitespace, etc.) of two programs then compares them. It puts all the lines from one program in one List and the lines from the other program in another List. It then removes all lines that are identical between the two. One List is then all the lines added to program 1 to get program 2 and the other List is all the lines removed from program 1 to get program 2.
Now I need a way to detect how many lines of code from program 1 have been MODIFIED to get program 2. I found an algorithm for the Levenshtein Distance, and it seems like that will work. I just need to compare the distance with the length of the strings to get a percentage changed, and I'll need to come up with a good value for the threshold.
However my problem is this: how do I know which two strings to compare for the Levenshtein Distance? My best guess is to have a nested for loop and loop through one program once for every line in the other program to compare every line with every other line looking for a Distance that meets my difference threshold. However, that seems very inefficient. Are there any other ways of doing this?
I should add this is for a software engineering class. It's technically homework, but we're allowed to use any resource we need. While I'm just looking for an algorithm, I'll let you know I'm using C#.
If you allow lines to be shuffled, how do you count the changes? Not all shuffled lines might result in identical functionality, even if you compare all lines and find exact matches.
If you compare
var random = new Random();
for (int i = 0; i < 9; i++) {
int randomNumber = random.Next(1, 50);
}
to
for (int i = 0; i < 9; i++) {
var random = new Random();
int randomNumber = random.Next(1, 50);
}
you have four unchanged lines of code, but the second version is likely to produce different results. There is definitely a change in the code, and yet line-by-line comparison will not detect it if you allow shuffling.
This is a good reason to disallow shuffling and actually mark line 1 in the first code as deleted, and line 2 in the second code as added, even though the deleted line and the added line are exactly the same.
Once you decide that lines cannot be shuffled, I think you can figure out quite easily how to match your lines for comparison.
To step through both sources and compare the lines, you might want to look up the balance line algorithm (e.g. http://www.isqa.unomaha.edu/haworth/isqa3300/fs006.htm).
If you assume that lines of code can be shuffled (their order can change), then you need to compare every line from the first program with every line from the second program, excluding unchanged lines.
You can simplify your task by assuming that lines cannot be shuffled: they can only be inserted, removed, or left unchanged. In my experience, most programs that compare text files work this way.
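Once a removed line and an added line have been paired up (for example, because they sit at the same position in a changed block), the Levenshtein check mentioned in the question could look roughly like this. This is only a sketch: LineSimilarity, IsModification and the threshold parameter are hypothetical names, and a good threshold value has to be found by experiment.

```csharp
using System;

static class LineSimilarity
{
    // Classic dynamic-programming Levenshtein distance, keeping two rows only.
    public static int LevenshteinDistance(string a, string b)
    {
        int[] previous = new int[b.Length + 1];
        int[] current = new int[b.Length + 1];

        // Turning an empty prefix of a into the first j characters of b takes j insertions.
        for (int j = 0; j <= b.Length; j++)
            previous[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            current[0] = i; // i deletions turn a[0..i) into the empty string
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                current[j] = Math.Min(
                    Math.Min(current[j - 1] + 1, previous[j] + 1),
                    previous[j - 1] + cost);
            }

            int[] swap = previous;
            previous = current;
            current = swap;
        }

        return previous[b.Length];
    }

    // Treats the pair as "modified" when the changed fraction is non-zero
    // but no larger than the (assumed) threshold, e.g. 0.4.
    public static bool IsModification(string removedLine, string addedLine, double threshold)
    {
        int longer = Math.Max(removedLine.Length, addedLine.Length);
        if (longer == 0)
            return false;

        double changed = (double)LevenshteinDistance(removedLine, addedLine) / longer;
        return changed > 0 && changed <= threshold;
    }
}
```

Pairs that fail the check would simply stay counted as one removed line plus one added line.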

What is an elegant way to find min value in a subset of an array?

I have an array a of 100 integers. What is a recommended way to find the min value in a[3] through a[70] AND the index of this min value? Assuming no duplication of values.
I know the clumsy way of looping through the relevant range of indices:
for (i = 3; i < 70; i++)
{
...
}
I am looking for a more elegant way of doing this in C# instead of looping. Thanks.
To Find out min
List<int> templist = a.Skip(3).Take(67).ToList();
int minimum = templist.Min();
For Index
int index = templist.FindIndex(i => i == minimum) + 3;
I added 3 because index in list will be 3 less than index in original sequence a.
What it is doing:
Skip - Skips the first 3 values (indices 0, 1, 2) and returns the rest of the sequence.
Take - Takes 67 values from the sequence returned by Skip. (Your for loop starts at 3 and stops before 70, so you are looping over 70 - 3 = 67 items.)
ToList - Converts the resulting sequence to a List so that the index can be searched.
Min - Gets the minimum value from it.
You have to loop, since it is a sequence. Since you asked for something elegant, I used LINQ instead of a for loop (even though LINQ loops internally as well).
If your data structure is not sorted, there is no way to do it without looping through all the elements in the sublist, even if the looping happens implicitly inside the provided API.
You cannot use a sorted collection either, since you are working on a subpart of it (you'd need to create a sorted collection just for that part of the list), so in any case you'll have to loop over it.
LINQ's Aggregate is not the easiest, but it is arguably the least inefficient of the "elegant" solutions (though they still take more lines of code than the straightforward loop). Additionally, iterating yourself is still the best option because you do not allocate any additional memory.
But anyway, should you feel the need to make your successor hang you in effigy, you can do this instead of a straightforward loop:
var minValueAndItsIndex = a
.Skip(3)
.Take(70 - 3)
.Select((value, index) => new { Value = value, Index = index + 3})
.Aggregate((tuple1, tuple2) => (tuple1.Value < tuple2.Value) ? tuple1 : tuple2);
If you create a 2-item ValueType-based tuple and use that instead of the anonymous type, it will be comparable to the more-efficient direct iteration because it won't allocate any additional memory.
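As a rough sketch of both options, here is the straightforward loop next to an Aggregate version that uses a value tuple instead of the anonymous type. It assumes C# 7 or later for the tuple syntax; RangeMin, MinInRange and MinInRangeLinq are illustrative names, and end is exclusive, matching the Take(70 - 3) call above.

```csharp
using System;
using System.Linq;

static class RangeMin
{
    // Plain loop over a[start] .. a[end - 1]; allocates nothing.
    public static (int Value, int Index) MinInRange(int[] a, int start, int end)
    {
        if (start < 0 || end > a.Length || start >= end)
            throw new ArgumentOutOfRangeException();

        int minIndex = start;
        for (int i = start + 1; i < end; i++)
        {
            if (a[i] < a[minIndex])
                minIndex = i;
        }
        return (a[minIndex], minIndex);
    }

    // Same result via LINQ, carrying (Value, Index) value tuples through Aggregate.
    public static (int Value, int Index) MinInRangeLinq(int[] a, int start, int end)
    {
        return a
            .Skip(start)
            .Take(end - start)
            .Select((value, index) => (Value: value, Index: index + start))
            .Aggregate((t1, t2) => t1.Value < t2.Value ? t1 : t2);
    }
}
```

For the question's range, RangeMin.MinInRange(a, 3, 70) covers a[3] through a[69]; pass 71 as end if a[70] should be included.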

Interview - Write a program to remove even elements

I was asked this today, and I know the answer is dead simple, but he kept the twist until the end.
Question
Write a program to remove even numbers stored in ArrayList containing 1 - 100.
I just said wow
Here is how I implemented it.
ArrayList source = new ArrayList(100);
for (int i = 1; i < 100; i++)
{
source.Add(i);
}
for (int i = 0; i < source.Count; i++)
{
if (Convert.ToInt32(source[i]) % 2 ==0)
{
source.RemoveAt(i);
}
}
//source contains only Odd elements
The twist
He asked me what the computational complexity of this is and told me to give him an equation. I did, and said it is linear, directly proportional to N (the input size).
He said: hmmm... so that means I need to wait longer to get results when the input size increases, am I right? Yes sir, you are.
Tune it for me, make it Log(N), try as much as you can, he said. I failed miserably at this part.
Hence I came here for the right logic, answer or algorithm to do this.
Note: he wanted no LINQ, no extra bells and whistles. Just plain loops or other logic to do it.
I dare say that the complexity is in fact O(N^2), since removal in arrays is O(N) and it can potentially be called for each item.
So you have O(N) for the traversal of the array(list) and O(N) for each removal => O(N) * O(N).
Since it does not seem clear, I'll explain the reasoning. At each step a removal of an item may take place (assuming the worst case in which every item must be removed). In an array the removal is done by shifting. Hence, to remove the first item, I need to shift all the following N-1 items by one position to the left:
1 2 3 4 5 6...
<---
2 3 4 5 6...
Now, at each iteration I need to shift, so I'm doing N-1 + N-2 + ... + 1 + 0 shifts, which gives a result of (N) * (N-1) / 2 (arithmetic series) giving a final complexity of O(N^2).
Let's think of it this way:
The number of delete operations you perform is necessarily half of the array length (if the elements are stored in an array). So the complexity is at least O(N).
The question you received makes me suppose that your professor wanted you to reason about different ways of storing the numbers.
Usually when you have logarithmic complexity you are working with different structures, like graphs or trees.
The only way I can think of to get logarithmic complexity is to have the numbers stored in a tree (ordered tree, B-tree... we could elaborate on this), but that is actually outside the constraints of your exam (storing the numbers in an array).
Does it make sense to you?
You can get noticeably better performance if you keep two indexes, one to the current read position and one to the current write position.
int read = 0;
int write = 0;
The idea is that read looks at each member of the array in turn; write keeps track of the current end of the list. When we find a member we want to delete, we move read forwards, but not write.
for (read = 0; read < source.Count; read++) {
    // ArrayList stores object, so convert before testing parity.
    if (Convert.ToInt32(source[read]) % 2 != 0) {
        source[write] = source[read];
        write += 1;
    }
}
Then at the end, tell the ArrayList that its new length is the current value of write.
This takes you from your original O(n^2) down to O(n).
(note: I haven't tested this)
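Put together, the read/write idea might look like the sketch below. ArrayList has no settable length, so this version assumes that "telling the list its new length" can be done with RemoveRange on the leftover tail, which is a single call rather than one shift per removed element. EvenRemover and RemoveEvens are illustrative names.

```csharp
using System;
using System.Collections;

static class EvenRemover
{
    public static void RemoveEvens(ArrayList source)
    {
        int write = 0; // next free slot for an element we keep

        for (int read = 0; read < source.Count; read++)
        {
            if (Convert.ToInt32(source[read]) % 2 != 0)
            {
                source[write] = source[read]; // compact odd values to the front
                write++;
            }
        }

        // Everything from write onwards is stale; drop it in one call.
        source.RemoveRange(write, source.Count - write);
    }
}
```

Each element is read once and written at most once, so the loop does O(n) work for n elements.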
Without changing the data structure or making some assumption about the way items are stored inside the ArrayList, I can't see how you'll avoid checking the parity of each and every member (hence at least O(n) complexity). Perhaps the interviewer simply wanted you to tell him it's impossible.
If you really have to use an ArrayList and actively remove the entries (instead of not adding them in the first place):
Stepping by 2 instead of by 1 removes the need to check whether each element is even.
for (int i = source.Count - 1; i > 0; i = i - 2)
{
    source.RemoveAt(i);
}
Edit: I know this will only work if source contains the entries from 1-100 in sequential order.
The problem with the given solution is that it starts from the beginning, so the entire list must be shifted each time an item is removed:
Initial List: 1, 2, 3, 4, 5, ..., 98, 99
/ / / /// /
After 1st removal: 1, 3, 4, 5, ..., 98, 99, <empty>
/ /// / /
After 2nd removal: 1, 3, 5, ..., 98, 99, <empty>, <empty>
I've used the slashes to try to show how the list shifts after each removal.
You can reduce the complexity (and eliminate the bug I mentioned in the comments) simply by reversing the order of removal:
for (int i = source.Count - 1; i >= 0; --i) {
    if (Convert.ToInt32(source[i]) % 2 == 0) {
        // Only elements after i shift, and those were already checked,
        // so nothing is skipped or re-checked on the next iteration.
        source.RemoveAt(i);
    }
}
It is possible IF you have unlimited parallel threads available to you.
Suppose that we have an array with n elements. Assign one thread per element. Assume all threads act in perfect sync.
Each thread decides whether its element is even or odd. (Time O(1).)
Determine how many elements below it in the array are odd. (Time O(log(n)).)
Mark a 0 or 1 in a second array, at the same index, depending on whether your element is even or odd. So each entry is a count of odds at that spot.
If your index is odd, add the previous number. Now each entry is a count of odds in the current block of 2 up to yourself.
If your index mod 4 is 2, add the value at the index below, if it is 3, add the answer 2 indexes below. Now each entry is a count of odds in the current block of 4 up to yourself.
Continue this pattern with blocks of 2**i (if you're in the top half add the count for the bottom half) log2(n) times - now each entry in this array is the count of odds below.
Each CPU inserts its value into the correct slot.
Truncate the array to the right size.
I am willing to bet that something like this is the answer your friend has in mind.
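The counting step can be simulated on a single thread to see why it needs only about log2(n) rounds. The sketch below swaps in a slightly different but equivalent scheme (a Hillis-Steele style prefix sum, where in each round every entry adds the entry step positions below it) because it is shorter to write than the block-of-2**i variant described above. The inner for loops stand in for what the one-thread-per-element model would do simultaneously; ParallelCompactionSketch is a made-up name.

```csharp
using System;

static class ParallelCompactionSketch
{
    public static int[] RemoveEvens(int[] input)
    {
        int n = input.Length;

        // Round 0: each "thread" marks 1 if its element is odd, else 0.
        int[] counts = new int[n];
        for (int i = 0; i < n; i++)
            counts[i] = (input[i] % 2 != 0) ? 1 : 0;

        // About log2(n) rounds: afterwards counts[i] is the number of odd
        // elements at indices 0..i inclusive.
        for (int step = 1; step < n; step *= 2)
        {
            int[] next = (int[])counts.Clone(); // all threads read the old values
            for (int i = step; i < n; i++)
                next[i] = counts[i] + counts[i - step];
            counts = next;
        }

        // Each odd element knows its final slot: counts[i] - 1.
        int total = n == 0 ? 0 : counts[n - 1];
        int[] result = new int[total];
        for (int i = 0; i < n; i++)
        {
            if (input[i] % 2 != 0)
                result[counts[i] - 1] = input[i];
        }
        return result;
    }
}
```

Run sequentially like this it does more total work than a single pass; the logarithmic time only appears when every inner loop really executes in parallel.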

Sorting numbers array issue

Yesterday at work I set out to figure out how to sort numbers without using the library method Array.Sort. I worked on and off when time permitted and finally was able to come up with a basic working algorithm at the end of today. It might be rather stupid and the slowest way, but I am content that I have a working code.
But there is something wrong or missing in the logic, that is causing the output to hang before printing the line: Numbers Sorted. (12/17/2011 2:11:42 AM)
This delay is directly proportionate to the number of elements in the array. To be specific, the output just hangs at the position where I put the tilde in the results section below. The content after tilde is getting printed after that noticeable delay.
Here is the code that does the sort:
while(pass != unsortedNumLen)
{
for(int i=0,j=1; i < unsortedNumLen-1 && j < unsortedNumLen; i++,j++)
{
if (unsorted[i] > unsorted[j])
{
pass = 0;
swaps++;
Console.Write("Swapping {0} and {1}:\t", unsorted[i], unsorted[j]);
tmp = unsorted[i];
unsorted[i] = unsorted[j];
unsorted[j] = tmp;
printArray(unsorted);
}
else pass++;
}
}
The results:
Numbers unsorted. (12/17/2011 2:11:19 AM)
4 3 2 1
Swapping 4 and 3: 3 4 2 1
Swapping 4 and 2: 3 2 4 1
Swapping 4 and 1: 3 2 1 4
Swapping 3 and 2: 2 3 1 4
Swapping 3 and 1: 2 1 3 4
Swapping 2 and 1: 1 2 3 4
~
Numbers sorted. (12/17/2011 2:11:42 AM)
1 2 3 4
Number of swaps: 6
Can you help identify the issue with my attempt?
Link to full code
This is not homework, just me working out.
Change the condition in your while to this:
while (pass < unsortedNumLen)
Logically pass never equals unsortedNumLen so your while won't terminate.
pass does eventually equal unsortedNumLen when it goes over the max value of an int and loops around to it.
In order to see what's happening yourself while it's in the hung state, just hit the pause button in Visual Studio and hover your mouse over pass to see that it contains a huge value.
You could also set a breakpoint on the while line and add a watch for pass. That would show you that the first time the list is sorted, pass equals 5.
It sounds like you want a hint to help you work through it and learn, so I am not posting a complete solution.
Change your else block to the below and see if it puts you on the right track.
else {
Console.WriteLine("Nothing to do for {0} and {1}", unsorted[i], unsorted[j]);
pass++;
}
Here is the fix:
while(pass < unsortedNumLen)
And here is why the delay occurred.
After the end of the for loop in which the array was eventually sorted, pass contains at most unsortedNumLen - 2 (if the last change was between first and second members). But it does not equal the unsorted array length, so another iteration of while and inner for starts. Since the array is sorted unsorted[i] > unsorted[j] is always false, so pass always gets incremented - exactly the number of times j got incremented, and that is the unsortedNumLen - 1. Which is not equal to unsortedNumLen, and so another iteration of while begins. Nothing essentially changed, and after this iteration pass contains 2 * (unsortedNumLen - 1), which is still not equal to unsortedNumLen. And so on.
When pass reaches int.MaxValue, overflow happens, and the next value pass gets is int.MinValue. The process goes on until pass finally holds the value unsortedNumLen at the moment the while condition is checked. If you are particularly unlucky, this might never happen at all.
P.S. You might want to check out this link.
This is just a characteristic of the algorithm you're using to sort. Once it's completed sorting the elements it has no way of knowing the sort is complete, so it does one final pass checking every element again. You can fix this by adding --unsortedNumLen; at the end of your for loop as follows:
for(int i=0,j=1; i < unsortedNumLen-1 && j < unsortedNumLen; i++,j++)
{
/// existing sorting code
}
--unsortedNumLen;
Reason? Because your algorithm bubbles the biggest value to the end of the array, there is no need to check this element again, since it has already been determined to be larger than all other elements.
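One common way to combine both ideas, stopping as soon as a pass makes no swaps and shrinking the range after each pass, is sketched below. It replaces the pass counter with a boolean flag, which is a different termination test from the one in the question; BubbleSortSketch and the variable names are illustrative.

```csharp
static class BubbleSortSketch
{
    // Sorts in place and returns the number of swaps performed.
    public static int Sort(int[] numbers)
    {
        int swaps = 0;
        int unsortedLength = numbers.Length;
        bool swapped = true;

        while (swapped)
        {
            swapped = false;
            for (int i = 0; i + 1 < unsortedLength; i++)
            {
                if (numbers[i] > numbers[i + 1])
                {
                    int tmp = numbers[i];
                    numbers[i] = numbers[i + 1];
                    numbers[i + 1] = tmp;
                    swaps++;
                    swapped = true;
                }
            }
            // The largest remaining value is now in its final position.
            unsortedLength--;
        }

        return swaps;
    }
}
```

The loop ends after the first pass with no swaps, so there is no counter that has to land on an exact value.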

Algorithm for generating a nearly sorted list on predefined data

Note: This is part 1 of a 2 part question.
Part 2 here
I want to learn more about sorting algorithms, and what better way to do that than to code! So I figure I need some data to work with.
My approach to creating some "standard" data will be as follows: create a set number of items, not sure how large to make it but I want to have fun and make my computer groan a little bit :D
Once I have that list, I'll push it into a text file and just read off that to run my algorithms against. I should have a total of 4 text files filled with the same data but just sorted differently to run my algorithms against (see below).
Correct me if I'm wrong but I believe I need 4 different types of scenarios to profile my algorithms.
Randomly sorted data (for this I'm going to use the knuth shuffle)
Reversed data (easy enough)
Nearly sorted (not sure how to implement this)
Few unique (once again not sure how to approach this)
This question is for generating a nearly sorted list.
Which approach is best to generate a nearly sorted list on predefined data?
To "shuffle" a sorted list to make it "almost sorted":
Create a list of functions you can think of which you can apply to parts of the array, like:
Negate(array, startIndex, endIndex);
Reverse(array, startIndex, endIndex);
Swap(array, startIndex, endIndex);
For i from zero to some function of the array's length (e.g. Log(array.Length)):
Randomly choose 2 integers*
Randomly choose a function from the functions you thought of
Apply that function to those indices of the array
*Note: The integers should not be constricted to the array size. Rather, choose random integers and "wrap" around the array -- that way the elements near the ends will have the same chance of being modified as the elements in the middle.
Answering my own question here. All this does is taking a sorted list and shuffling up small sections of it.
public static T[] ShuffleBagSort<T>(T[] array, int shuffleSize)
{
Random r = _random;
for (int i = 0; i < array.Length; i += shuffleSize)
{
//Prevents us from getting index out of bounds, while still getting a shuffle of the
//last set of un shuffled array, but breaks for loop if the number of unshuffled array is 1
if (i + shuffleSize > array.Length)
{
shuffleSize = array.Length - i;
if (shuffleSize <= 1) // should never be less than 1, don't think that's possible lol
continue;
}
if (i % shuffleSize == 0)
{
for (int j = i; j < i + shuffleSize; j++)
{
// Pick random element to swap from our small section of the array.
int k = r.Next(i, i + shuffleSize);
// Swap.
T tmp = array[k];
array[k] = array[j];
array[j] = tmp;
}
}
}
return array;
}
Sort the array.
Start sorting it in descending order with bubble sort
Stop after a few iterations (depending on how 'dis-sorted' you want the array to be)
Add some randomness (each time bubble sort wants to swap two elements, toss a coin and perform the swap or not depending on the result, or use a probability other than 50/50)
This will give you an array which is modified roughly equally across its whole range while preserving most of the order (the beginning will hold the smallest elements, the end the greatest). That's because the changes performed by bubble sort with a random test will be rather local. It won't mix the whole array so much at once that it no longer resembles the original.
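The steps above can be sketched as follows. NearlySortedGenerator, Disturb, passes and swapProbability are hypothetical names; the input is assumed to be sorted ascending already, and suitable values for the number of passes and the probability depend on how disturbed the data should be.

```csharp
using System;

static class NearlySortedGenerator
{
    // Disturbs an ascending-sorted array in place by running a few passes of
    // a descending bubble sort in which every swap is subject to a coin toss.
    public static void Disturb(int[] sorted, int passes, double swapProbability, Random random)
    {
        for (int pass = 0; pass < passes; pass++)
        {
            for (int i = 0; i + 1 < sorted.Length; i++)
            {
                // A descending sort wants to swap when the left element is smaller.
                bool wantsSwap = sorted[i] < sorted[i + 1];
                if (wantsSwap && random.NextDouble() < swapProbability)
                {
                    int tmp = sorted[i];
                    sorted[i] = sorted[i + 1];
                    sorted[i + 1] = tmp;
                }
            }
        }
    }
}
```

For example, NearlySortedGenerator.Disturb(data, 2, 0.5, new Random()) runs two passes with a 50/50 coin; passing a seeded Random makes the generated test file reproducible.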
If you want, you can also completely randomly shuffle whole parts of the array (but keep the parts from getting too big, or you'll completely lose the ordering).
Or you may also randomly swap whole sorted parts of the array. That would be an interesting test case, for example:
[1,2,3,4,5,6,7,8] -> [1,2,6,7,8,3,4,5]
Almost sorted lists are the reason why Timsort (Python) is so efficient in the real world: real data is typically "almost sorted". There is an article explaining the math behind the entropy of data.
