I have a program which counts lines of code (excluding comments, braces, whitespace, etc.) of two programs then compares them. It puts all the lines from one program in one List and the lines from the other program in another List. It then removes all lines that are identical between the two. One List is then all the lines added to program 1 to get program 2 and the other List is all the lines removed from program 1 to get program 2.
Now I need a way to detect how many lines of code from program 1 have been MODIFIED to get program 2. I found an algorithm for the Levenshtein Distance, and it seems like that will work. I just need to compare the distance with the length of the strings to get a percentage changed, and I'll need to come up with a good value for the threshold.
However my problem is this: how do I know which two strings to compare for the Levenshtein Distance? My best guess is to have a nested for loop and loop through one program once for every line in the other program to compare every line with every other line looking for a Distance that meets my difference threshold. However, that seems very inefficient. Are there any other ways of doing this?
I should add this is for a software engineering class. It's technically homework, but we're allowed to use any resource we need. While I'm just looking for an algorithm, I'll let you know I'm using C#.
If you allow lines to be shuffled, how do you count the changes? Not all shuffled lines might result in identical functionality, even if you compare all lines and find exact matches.
If you compare
var random = new Random();
for (int i = 0; i < 9; i++) {
int randomNumber = random.Next(1, 50);
}
to
for (int i = 0; i < 9; i++) {
var random = new Random();
int randomNumber = random.Next(1, 50);
}
you have four unchanged lines of code, but the second version is likely to produce different results. There is definitely a change in the code, and yet line-by-line comparison will not detect it if you allow shuffling.
This is a good reason to disallow shuffling and actually mark line 1 in the first code as deleted, and line 2 in the second code as added, even though the deleted line and the added line are exactly the same.
Once you dicide that lines cannot be shuffled, i think you can figure out quite easily how to match your lines for comparison.
To step through both sources and compare the line you might want to look up the balance line algorithm (e.g http://www.isqa.unomaha.edu/haworth/isqa3300/fs006.htm )
If you suggest that lines of codes are shuffled (their order can be changed) then you need to compare all lines from 1st program to all lines from 2nd program excluding not changed lines.
You can simplify you task suggesting that lines cannot be shuffled. They can be only inserted, removed or unchanged. From my experience most of the programs comparing text files work this way
Related
I have checked for answers on the website, but I am curious about the way I am writing my code via C# to check for duplicates in the array. My function sort of works. But in the end when I print out my 5 sets of arrays, duplicates are detected but, the output still contains duplicates. I have also commented out the part where if a duplicate is detected generate a random number to replace the duplicate found in that element of the array. My logic seems to be sound, a nested for loop that starts with the first element then loop through the same array 5 times to see if the initial element matches. So start at element 0, then loop 5 times to see if 0 through 4 matches element 0, then 1 and so on. Plus generating a random number when a duplicate is found and replacing that element is not working out so good. I did see a solution with using a dictionary object and its key, but I don't want to do that, I want to just use raw code to solve this algorithm, no special objects.
My function:
void checkForDuplicates()
{
int[] test = { 3,6,8,10,2,3 };
int count = 0;
Random ranDuplicateChange;
for(int i = 0; i < test.Length; i++)
{
count = 0;
Console.WriteLine(" {0} :: The current number is: {1} ",i, test[i]);
for(int j = 0; j < test.Length; j++)
{
if (test[i] == test[j])
{
count++;
if (count >= 2)
{
Console.WriteLine("Duplicate found: {0}", test[j]);
//ranDuplicateChange = new Random();
//test[j] = ranDuplicateChange.Next(1, 72);
}
}
}
}
You can get them using lambda expressions:
var duplicates = test.GroupBy(a => a)
.Where(g => g.Count() > 1)
.Select(i => new { Number = i.Key, Count = i.Count()});
This returns an IEnumerable of an anonymous type with 2 properties, Number and Count
I want to just use raw code to solve this algorithm, no special objects.
I believe what you mean by this is not using LINQ or any other library methods that are out there that can achieve your end-goal easily, but simply to manipulate the array you have and find a way to find duplicates.
Let's leave code aside for a moment and see what we need to do to find the duplicates. Your approach is to start from the beginning, and compare each element to other elements in the array, and see if they are duplicates. Not a bad idea, so let's see what we need to do to implement that.
Your array:
test = 3, 6, 8, 10, 2, 3
What we must do is, take 3, see if it's equal to the next element, and then the next, and then the next, until the end of the array. If duplicates are found replace them.
Second round, take 6, and since we already compared first element 3, we start with 8 and go on till the end of the array.
Third round, start from 8, and go on.
You get the drift.
Now let's take a look at your code.
Now we start at zero-element (I'm using zero based index for convenience), which is 3, and then in the inner-loop of j we see if the next element, 6, is a duplicate. It's not, so we move on. And so on. We do find a duplicate at the last position, then count it. So far so good.
Next loop, now here is your first mistake. Your second loop, j, starts at 0, so when i=1, the first iteration of your j starts at 0, so you're comparing test[1] vs test[0], which you already compared in the first round (your outer loop). What you should instead do is, compare test[1] vs test[2].
So think what you need to change in your code to achieve this, in terms of i and j in your loops. What you want to do is, start your j loop one more than your current i value.
Next, you increment count whenever you find a duplicate, which is fine. But printing the number only when count >= 2 doesn't make sense. Because, you started it at 0, and increment only if you found a duplicate, so even if your counter is 1, that means you've found a duplicate. You should instead simply generate a random number and replace test[j] with it.
I'm intentionally not giving you code samples as you say that you're eager to learn yourself how to solve this problem, which is always a good thing. Hope the above information is useful.
Disclaimer:
All the above is simply to show you how to fix your current code, but in itself, it still has flaws. To being with, your 'replacing with random number' idea is not watertight. For instance, if it generated the same number you're trying to replace (although the odds are low it can happen, and when you write a program you shouldn't rely on chance for your program to not go wrong), you'd still end up with duplicates. Same with if it generated a number that's found at the beginning of the list later on. For example say your list is 2, 3, 5, 3. The first iteration of i would correctly determine 2 is not a duplicate. Then in next iteration, you find that 3 is a duplicate, and replace it. However, there, if the new randomly generated number turned out to be 2, and since we've already ruled out that 2 is not a duplicate, the newly generated 2 will not be overwritten again and you'll end up with a list with duplicates. To combat that you can revert to your original idea of starting j loop with 0 every time, and replace if a duplicate is encountered. To do that, you'll need an extra condition to see if i == j and if so skip the inner loop. But even then, the now newly generated random number could be equal to one of the numbers in the list to again ruining your logic.
So really, it's fine to attempt this problem this way, but you should also compare your random number to your list every time you generate a number and if it's equal, then generate another random number, and so on until you're good.
But at the end of the day to remove duplicates for a list and replace them with unique numbers there are way more easier and non-error-prone methods using LINQ etc.
I am dealing with an application that needs to randomly read an entire line of text from a series of potentially large text files (~3+ GB).
The lines can be of a different length.
In order to reduce GC and create unnecessary strings, I am using the solution provided at: Is there a better way to determine the number of lines in a large txt file(1-2 GB)? to detect each new line and store that in a map in one pass therefore producing an index of lineNo => positioni.e:
// maps each line to it's corresponding fileStream.position in the file
List<int> _lineNumberToFileStreamPositionMapping = new List<int>();
go through the entire file
when detect a new line increment lineCount and add the fileStream.Position to the _lineNumberToFileStreamPositionMapping
We then use an API similar to:
public void ReadLine(int lineNumber)
{
var getStreamPosition = _lineNumberToFileStreamPositionMapping[lineNumber];
//... set the stream position, read the byte array, convert to string etc.
}
This solution is currently providing a good performance however there are two things I do not like:
Since I do not know the total number of lines in the file, I cannot preallocate an array therefore I have to use a List<int> which has the potential inefficiency of resizing to double of what I actually need;
Memory usage, so as an example for a text file of ~1GB with ~5 million lines of text the index occupies ~150MB I would really like to decrease this as much as possible.
Any ideas are very much appreciated.
Use List.Capacity to manually increase the capacity, perhaps every 1000 lines or so.
If you want to trade performance for memory, you can do this: instead of storing the positions of every line, store only the positions of every 100th (or something) line. Then when, say, line 253 is required, go to the position of line 200 and count forward 53 lines.
OK so I need to know if anyone can see a way to reduce the number of iterations of these loops because I can't. The first while loop is for going through a file, reading one line at a time. The first foreach loop is then comparing each of the compareSet with what was read in the first while loop. Then the next while loop is to do with bit counting.
As requested, an explaination of my algorithm:
There is a file that is too large to fit in memory. It contains a word followed by the pages in a very large document that this word is on. EG:
sky 1 7 9 32....... (it is not in this format, but you get the idea).
so parseLine reads in the line and converts it into a list of ints that are like a bit array where 1 means the word is on the page, and 0 means it isn't.
CompareSet is a bunch of other words. I can't fit my entire list of words into memory so I can only fit a subset of them. This is a bunch of words just like the "sky" example. I then compare each word in compareSet with Sky by seeing if they are on the same page.
So if sky and some other word both have 1 set at a certain index in the bit array (simulated as an int array for performance), they are on the same page. The algorithm therefore counts the occurances of any two words on a particular page. So in the end I will have a list like:
(for all words in list) is on the same page as (for all words in list) x number of times.
eg sky and land is on the same page x number of times.
while ((line = parseLine(s)) != null) {
getPageList(line.Item2, compareWord);
foreach (Tuple<int, uint[], List<Tuple<int, int>>> word in compareSet) {
unchecked {
for (int i = 0; i < 327395; i++) {
if (word.Item2[i] == 0 || compareWord[i] == 0)
continue;
uint combinedNumber = word.Item2[i] & compareWord[i];
while (combinedNumber != 0) {
actual++;
combinedNumber = combinedNumber & (combinedNumber - 1);
}
}
}
As my old professor Bud used to say: "When you see nested loops like this, your spidey senses should be goin' CRAZY!"
You have a while with a nested for with another while. This nesting of loops is an exponential increase on the order of operations. Your one for loop has 327395 iterations. Assuming they have the same or similar number of iterations, that means you have an order of operations of
327,395 * 327,395 * 327,395 = 35,092,646,987,154,875 (insane)
It's no wonder that things would be slowing down. You need to redefine your algorithm to remove these nested loops or combine work somewhere. Even if the numbers are smaller than my assumptions, the nesting of the loops is creating a LOT of operations that are probably unnecessary.
As Joal already mentioned nobody is able to optimize this looping algorithm. But what you can do is trying to better explain what you are trying to accomplish and what your hard requirements are. Maybe you can take a different approach by using some like HashSet<T>.IntersectWith() or BloomFilter or something like this.
So if you really want help from here you should not only post the code that doesn't work, but also what the overall task is you like to accomplish. Maybe someone has a completely other idea to solve your problem, making your whole algorithm obsolete.
I want to provide the user with a selection of questions but I want them to be random, having a quiz game with the same questions isn't exactly fun.
My idea was to store a large collection of questions and there appropiate answers in a text file:
What colour is an Strawberry|Red
How many corners are there on a Triangle|Three
This means that I could simply select a line at random, read the question and answer from the line and store them in a collection to be used in the game.
I've come up with some pseudo code with an approach I think would be useful and am looking for some input as to how it could be improved:
Random rand = new Random();
int line;
string question,answer;
for(int i = 0; i < 20; i++)
{
line = rand.Next();
//Read question at given line number to string
//Read answer at given line number to string
//Copy question and answer to collection
}
In terms of implementing the idea I'm unsure as to how I can specify a line number to read from as well as how to split the entire line and read both parts separately. Unless there's a better way my thoughts are manually entering a line number in the text file followed by a "|" so each line looks like this:
1|What colour is an Strawberry|Red
2|How many corners are there on a Triangle|Three
Thanks for any help!
Why not read the entire file into an array or a list using ReadLine and then refer to a random index within the bounds of the array to pull the question/answer string from, rather than reading from the text file when you want a question.
As for parsing it, just use Split to split it at the | delineator (and make sure that no questions have a | in the question for some reason). This would also let you store some wrong answers with the question (just say that the first one is always right, then when you output it you can randomize the order).
You don't want to display any questions twice, right?
Random random = new Random();
var q = File.ReadAllLines("questions.txt")
.OrderBy(x=>random.Next())
.Take(20)
.Select(x=>x.Split('|'))
.Select(x=>new QuestionAndAnswer(){Question=x[0],Answer=x[1]});
I need help on an algorithm. I have randomly generated numbers with 6 digits. Like;
123654
109431
There are approximately 1 million of them saved in a file line by line. I have to filter them according to the rule I try to describe below.
Take a number, compare it to all others digit by digit. If a number comes up with a digit with a value of bigger by one to the compared number, then delete it. Let me show it by using numbers.
Our number is: 123456
Increase the first digit with 1, so the number becomes: 223456. Delete all the 223456s from the file.
Increase the second digit by 1, the number becomes: 133456. Delete all 133456s from the file, and so on...
I can do it just as I describe but I need it to be "FAST".
So can anyone help me on this?
Thanks.
First of all, since it is around 1Million you had better perform the algorithm in RAM, not on Disk, that is, first load the contents into an array, then modify the array, then paste the results back into the file.
I would suggest the following algorithm - a straightforward one. Precalculate all the target numbers, in this case 223456, 133456, 124456, 123556, 123466, 123457. Now pass the array and if the number is NOT any of these, write it to another array. Alternatively if it is one of these numbers delete it(recommended if your data structure has O(1) remove)
This algorithm will keep a lot of numbers around in memory, but it will process the file one number at a time so you don't actually need to read it all in at once. You only need to supply an IEnumerable<int> for it to operate on.
public static IEnumerable<int> FilterInts(IEnumerable<int> ints)
{
var removed = new HashSet<int>();
foreach (var i in ints)
{
var iStr = i.ToString("000000").ToCharArray();
for (int j = 0; j < iStr.Length; j++)
{
var c = iStr[j];
if (c == '9')
iStr[j] = '0';
else
iStr[j] = (char)(c + 1);
removed.Add(int.Parse(new string(iStr)));
iStr[j] = c;
}
if (!removed.Contains(i))
yield return i;
}
}
You can use this method to create an IEnumerable<int> from the file:
public static IEnumerable<int> ReadIntsFrom(string path)
{
using (var reader = File.OpenText(path))
{
string line;
while ((line = reader.ReadLine()) != null)
yield return int.Parse(line);
}
}
Take all the numbers from the file to an arrayList, then:
take the number of threads as the number of digits
increment the first digit on the number in first thread, second in the second thread and then compare it with the rest of the numbers,
It would be fast as it will undergo by parallel processing...
All the suggestions (so far) require six comparisons per input line, which is not necessary. The numbers are coming in as strings, so use string comparisons.
Start with #Armen Tsirunyan's idea:
Precalculate all the target numbers,
in this case 223456, 133456, 124456,
123556, 123466, 123457.
But instead of single comparisons, make that into a string:
string arg = "223456 133456 124456 123556 123466 123457";
Then read through the input (either from file or in memory). Pseudocode:
foreach (string s in theBigListOfNumbers)
if (arg.indexOf(s) == -1)
print s;
This is just one comparison per input line, no dictionaries, maps, iterators, etc.
Edited to add:
In x86 instruction set processors (not just the Intel brand), substring searches like this are very fast. To search for a character within a string, for example, is just one machine instruction.
I'll have to ask others to weigh in on alternate architectures.
For starters, I would just read all the numbers into an array.
When you are finally done, rewrite the file.
It seems like the rule you're describing is for the target number abdcef you want to find all numbers that contain a+1, b+1, c+1, d+1, e+1, or f+1 in the appropriate place. You can do this in O(n) by looping over the lines in the file and comparing each of the six digits to the digit in the target number if no digits match, write the number to an output file.
This sounds like a potential case for a multidimensional array, and possibly also unsafe c# code so that you can use pointer math to iterate through such a large quantity of numbers.
I would have to dig into it further, but I would also probably use a Dictionary for non-linear lookups, if you are comparing numbers that aren't in sequence.
How about this. You process numbers one by one. Numbers will be stored in hash tables NumbersOK and NumbersNotOK.
Take one number
If it's not in NumbersNotOK place it in a Hash of NumbersOK
Get it's variances of single number increments in hash - NumbersNotOK.
Remove all of the NumbersOK members if they match any of the variances.
Repeat from 1, untill end of file
Save the NumbersOK to the file.
This way you'll pass the list just once. The hash table is made just for this kind of purposes and it'll be very fast (no expensive comparison methods).
This algorithm is not in full, as it doesn't handle when there are some numbers repeating, but it can be handled with some tweaking...
Read all your numbers from the file and store them in a map where the number is the key and a boolean is the value signifying that the value hasn't been deleted. (True means exists, false means deleted).
Then iterate through your keys. For each key, set the map to false for the values you would be deleting from the list.
Iterate through your list one more time and get all the keys where the value is true. This is the list of remaining numbers.
public List<int> FilterNumbers(string fileName)
{
StreamReader sr = File.OpenTest(fileName);
string s = "";
Dictionary<int, bool> numbers = new Dictionary<int, bool>();
while((s = sr.ReadLine()) != null)
{
int number = Int32.Parse(s);
numbers.Add(number,true);
}
foreach(int number in numbers.Keys)
{
if(numbers[number])
{
if(numbers.ContainsKey(100000+number))
numbers[100000+number]=false;
if(numbers.ContainsKey(10000+number))
numbers[10000+number]=false;
if(numbers.ContainsKey(1000+number))
numbers[1000+number]=false;
if(numbers.ContainsKey(100+number))
numbers[100+number]=false;
if(numbers.ContainsKey(10+number))
numbers[10+number]=false;
if(numbers.ContainsKey(1+number))
numbers[1+number]=false;
}
}
List<int> validNumbers = new List<int>();
foreach(int number in numbers.Keys)
{
validNumbers.Add(number);
}
return validNumbers;
}
This may need to be tested as I don't have a C# compiler on this computer and I'm a bit rusty. The algorithm will take a bit of memory bit it runs in linear time.
** EDIT **
This runs into problems whenever one of the numbers is 9. I'll update the code later.
Still sounds like a homework question... the fastest sort on a million numbers will be n log(n) that is 1000000log(1000000) that is 6*1000000 which is the same as comparing 6 numbers to each of the million numbers. So a direct comparison will be faster than sort and remove, because after sorting you still have to compare to remove. Unless, ofcourse, my calculations have entirely missed the target.
Something else comes to mind. When you pick up the number, read it as hex and not base 10. then maybe some bitwise operators may help somehow.
Still thinking on what can be done using this. Will update if it works
EDIT: currently thinking on the lines of gray code. 123456 (our original number) and 223456 or 133456 will be off only by one digit and a gray code convertor will catch it fast. It's late night here, so if someone else finds this useful and can give a solution...