Can you speed up this algorithm? C# / C++ [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 7 years ago.
Hey, I've been working on something from time to time and it has become relatively large (and slow) by now. However, I managed to pinpoint the bottleneck after closely measuring performance as a function of time.
Say I want to "permute" the string "ABC". What I mean by "permute" is not quite a permutation, but rather the set of contiguous substrings following this pattern:
A
AB
ABC
B
BC
C
I have to check for every such substring whether it is contained within another string S2, so I've done a quick'n'dirty literal implementation as follows:
for (int i = 0; i < strlen1; i++)
{
    // j is the substring length, bounded by what is left of str1
    for (int j = 1; j <= strlen1 - i; j++)
    {
        sub = str1.Substring(i, j);
        if (str2.Contains(sub)) { /* do stuff */ }
        else break;
    }
}
This was very slow initially, but then I realised that if the first part doesn't exist, there is no need to check the subsequent ones: if sub isn't contained within str2, I can break out of the inner loop.
OK, this gave blazingly fast results, but when calculating my algorithm's complexity I realised that the worst case is N^4. I had forgotten that str.Contains() and str.Substring() each have their own complexity (O(N) or O(N^2), I forget which).
The fact that I make a huge number of those calls inside the second for loop makes it perform rather... well, N^4 says enough.
However, I also calculated the average run time mathematically, using probability theory to evaluate the probability of growth of the substring in a pool of randomly generated strings (this was my baseline), measuring when the probability became > 0.5 (50%).
This showed a (roughly) exponential relationship between the number of distinct characters and the string length, which means that in the scenarios where I use my algorithm the length of string1 will most probably never exceed 7.
Thus the average complexity would be ~O(N * M), where N is the length of string1 and M the length of string2. Because I tested N as a function of constant M, I got linear growth, ~O(N) (not bad compared to the N^4, eh?).
I did timing tests and plotted a graph which showed nearly perfect linear growth, so my actual results matched my mathematical predictions (yay!).
However, this was NOT taking into account the cost of string.Contains() and string.Substring(), which made me wonder whether this could be optimized even further.
I've also been thinking of rewriting this in C++, because I need rather low-level stuff. What do you guys think? I have put a great deal of time into analysing this; I hope I've elaborated everything clearly enough :)!

Your question is tagged both C++ and C#.
In C++ the optimal solution will be to use iterators and std::search. The original strings remain unmodified, and no intermediate objects get created. There won't be an equivalent of your Substring() taking place at all, so this eliminates that part of the overhead.
This should achieve the theoretically-best performance: brute force search, testing all permutations, with no intermediate object construction or destruction, other than the iterators themselves, which simply replace your two int index variables. I can't think of any faster way of implementing this basic algorithm.
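For the C# side, a rough analogue of the same idea — avoiding intermediate string allocations — is to slice the first string with ReadOnlySpan&lt;char&gt; instead of Substring. This is only a sketch; the method name and the fact that it merely counts matches are illustrative, not from the question:

```csharp
using System;

// Count how many of str1's contiguous substrings occur in str2,
// keeping the early break: if a prefix is absent, longer extensions
// of it cannot be present either.
int CountContainedSubstrings(string str1, string str2)
{
    int found = 0;
    ReadOnlySpan<char> haystack = str2;
    for (int i = 0; i < str1.Length; i++)
    {
        for (int len = 1; len <= str1.Length - i; len++)
        {
            // AsSpan slices without copying, unlike Substring
            ReadOnlySpan<char> sub = str1.AsSpan(i, len);
            if (haystack.IndexOf(sub) >= 0)
                found++;        // "do stuff" would go here
            else
                break;
        }
    }
    return found;
}

Console.WriteLine(CountContainedSubstrings("ABC", "xABCy")); // 6
```

The loop structure is the same as the question's; only the per-substring allocation disappears.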

Are you testing one string against one string? If you test a bunch of strings against another bunch of strings, it is a whole different story. Even if you have the best algorithm for comparing one string against another, O(X), it does not mean that repeating it M*N times gives you the best algorithm for processing M strings against N.
When I made something similar, I built a dictionary of all substrings of all N strings:
Dictionary<string, List<int>>
The string key is a substring, and each int is the index of a string that contains that substring. Then I tested all substrings of all M strings against it. The complexity was suddenly not O(M*N*X) but O(max(M,N)*S), where S is the number of substrings of one string. Depending on M, N, X and S, that may be faster. I am not saying the dictionary of substrings is the best approach; I just want to point out that you should always try to see the whole picture.
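A minimal sketch of that dictionary-of-substrings idea (the helper name and the exact shape of the value list are illustrative assumptions, not the answerer's actual code):

```csharp
using System;
using System.Collections.Generic;

// Map every substring of every "target" string to the indices of the
// targets containing it; queries then become dictionary lookups.
Dictionary<string, List<int>> BuildSubstringIndex(string[] targets)
{
    var index = new Dictionary<string, List<int>>();
    for (int t = 0; t < targets.Length; t++)
    {
        string s = targets[t];
        for (int i = 0; i < s.Length; i++)
        {
            for (int len = 1; len <= s.Length - i; len++)
            {
                string sub = s.Substring(i, len);
                if (!index.TryGetValue(sub, out var owners))
                    index[sub] = owners = new List<int>();
                if (owners.Count == 0 || owners[owners.Count - 1] != t)
                    owners.Add(t); // each target listed at most once, in order
            }
        }
    }
    return index;
}

var idx = BuildSubstringIndex(new[] { "AB", "BC" });
Console.WriteLine(string.Join(",", idx["B"])); // 0,1
```

Building the index is itself O(N*S), so whether it pays off depends on how many queries amortize that cost.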

Related

Limited performance with use of permutation and recursion

I'm writing a program that calculates the longest streak weighted by its probability, and it uses recursion to obtain all the different possible scenarios. This is the coding challenge that I'm doing: https://open.kattis.com/problems/winningstreak
I noticed that the permutation function I have is not the most effective for larger inputs, due to recursion. An example input would be 3, and it would add the following to the matches array:
000, 010, 001, 011, 100, 110, 101, 111
public static void Permutations(string text, int numberOfGames, List<string> matches)
{
    if (numberOfGames > 0)
    {
        for (int j = 0; j < 2; j++)
        {
            Permutations(text + j.ToString(), numberOfGames - 1, matches);
        }
    }
    else
    {
        matches.Add(text);
    }
}
My problem lies with larger inputs (example 500), since that causes crashes on my program and throws the error: Garbage collector could not allocate 16384 bytes of memory for major heap section.
Is there any other way to improve this recursion so it runs better on larger inputs?
Thank you guys!
My problem lies with larger inputs (example 500)
Your program attempts to produce a list with 2^500 strings.
There are roughly 2^267 atoms in the universe.
I find it unsurprising that you're running out of memory.
Find a more clever solution to your problem.
Remember, the problem is not "enumerate all possible combinations of games". The problem is to deduce the expected value of the length of the longest streak. Generating all possible combinations and summing up the length of the longest streak in each is not going to work when the number of combinations becomes large.
Also remember that the statement of the problem is that the result must be within some fraction of the exact result. It does not have to be the exact result. Use meta-reasoning when dealing with puzzles like this: the person who posed the puzzle likely would not have made that relaxation of the problem unless it was something you could take advantage of.
Does this give you some insight into how to solve the problem?
If you want some more hints and insight, start by reading this:
http://gato-docs.its.txstate.edu/mathworks/DistributionOfLongestRun.pdf

Algorithm Comparison C# [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 9 years ago.
I am currently looking at typical interview questions, to get me in the right frame of mind.
I am trying to come up with my own solutions to the problems instead of trying to remember the given solutions.
The problem is that I'm not sure whether my solutions are optimal or have a major design flaw that I am not seeing.
So here is one of the solutions I came up with for the basic "is this string unique" problem, i.e. checking whether all characters in a string are unique.
public static bool IsUnique(string str)
{
    bool isUnique = true;
    for (int i = 0; i < str.Length; i++)
    {
        if (str.LastIndexOf(str.ElementAt(i)) != i)
        {
            isUnique = false;
            break;
        }
    }
    return isUnique;
}
Does anyone have advice on whether this code is optimal and has acceptable time and space complexity?
For the purposes of this answer I will use Big-O notation to indicate the complexity of an algorithm. The trick to efficiency is to work out the minimum Big-O measurement at which the problem can be solved, and then attempt to achieve that efficiency.
You can derive some efficiency facts by thinking about the algorithm logically: to check that all characters are unique, you need to evaluate all characters, so an O(n) traversal of the string is guaranteed, and I doubt you'd easily get more efficient than that. Now, can you solve it yourself in O(n) time, even if that means a pass or two over the input? If so, that's pretty decent, because your algorithm runs in linear time and will scale linearly (steadily getting slower for larger string inputs).
Your current algorithm loops over the string and then, for each character, iterates over the string again to find an equal character. This makes it an n traversal where each visit does an n traversal itself: an O(n^2) algorithm. That is polynomial time, which is not very good because it does not scale linearly. Your algorithm will get much slower with larger inputs, and that's a bad thing.
A quick change to make it slightly more efficient would be to start the search for an equal character at the current index + 1. You know that all previously checked characters are unique, so you only care about the characters that follow. This becomes an n traversal where each visit traverses the remainder of the string (less work as you go), but it is still an O(n^2) algorithm because it grows with the square of the input. It is slightly more efficient, but will still scale badly with larger inputs.
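That "start at the current index + 1" variant might look like this (a sketch of the described improvement, not the asker's code):

```csharp
using System;

// Still O(n^2) worst case, but each character is only compared
// against the characters after it.
bool IsUnique(string str)
{
    for (int i = 0; i < str.Length; i++)
        for (int j = i + 1; j < str.Length; j++)
            if (str[i] == str[j])
                return false;
    return true;
}

Console.WriteLine(IsUnique("abcdef")); // True
Console.WriteLine(IsUnique("abcdea")); // False
```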
Think of alternative ways to avoid repeated iterations. These often come at the cost of memory, but are practical. I know how I would try and solve it, but telling you my answer doesn't help you learn. ;)
EDIT: As you requested, I'll share my answer
I'd do it by having a HashSet that I load each visited character into. HashSet lookups and adds are approximately an O(1) operation. The beauty of the HashSet.Add method is that it returns true if it added the value and false if the value already existed (which is the condition that determines your algorithm result). So mine would be:
var hashSet = new HashSet<char>();
foreach (char c in myString)
{
    if (!hashSet.Add(c))
    {
        return false;
    }
}
return true;
Pros: O(n) linear algorithm.
Cons: Extra memory used for HashSet.
EDIT2: Everyone loves cheap LINQ tricks, so here's another way
var hashSet = new HashSet<char>();
return myString.All(c => hashSet.Add(c));
Using a HashSet is more efficient, as it has constant lookup time, O(1), compared to looking up a character in a string, which has linear lookup time, O(n):
public static bool AreCharsUnique(string str)
{
    var charset = new HashSet<char>();
    foreach (char c in str)
    {
        if (charset.Contains(c))
        {
            return false;
        }
        else
        {
            charset.Add(c);
        }
    }
    return true;
}

performance problem

OK, so I need to know if anyone can see a way to reduce the number of iterations of these loops, because I can't. The first while loop goes through a file, reading one line at a time. The foreach loop then compares each element of compareSet with what was read by the while loop. The innermost while loop does bit counting.
As requested, an explanation of my algorithm:
There is a file that is too large to fit in memory. It contains a word followed by the pages in a very large document that this word is on. EG:
sky 1 7 9 32....... (it is not in this format, but you get the idea).
so parseLine reads in the line and converts it into a list of ints that are like a bit array where 1 means the word is on the page, and 0 means it isn't.
CompareSet is a bunch of other words. I can't fit my entire list of words into memory so I can only fit a subset of them. This is a bunch of words just like the "sky" example. I then compare each word in compareSet with Sky by seeing if they are on the same page.
So if sky and some other word both have a 1 set at a certain index in the bit array (simulated as an int array for performance), they are on the same page. The algorithm therefore counts the occurrences of any two words on a particular page. So in the end I will have a list like:
(for all words in list) is on the same page as (for all words in list) x number of times.
eg sky and land is on the same page x number of times.
while ((line = parseLine(s)) != null) {
    getPageList(line.Item2, compareWord);
    foreach (Tuple<int, uint[], List<Tuple<int, int>>> word in compareSet) {
        unchecked {
            for (int i = 0; i < 327395; i++) {
                if (word.Item2[i] == 0 || compareWord[i] == 0)
                    continue;
                uint combinedNumber = word.Item2[i] & compareWord[i];
                // count set bits by clearing the lowest one each pass
                while (combinedNumber != 0) {
                    actual++;
                    combinedNumber = combinedNumber & (combinedNumber - 1);
                }
            }
        }
    }
}
As my old professor Bud used to say: "When you see nested loops like this, your spidey senses should be goin' CRAZY!"
You have a while with a nested for with another while. This nesting of loops multiplies the order of operations. Your one for loop has 327,395 iterations. Assuming the other loops have the same or a similar number of iterations, that means you have an order of operations of
327,395 * 327,395 * 327,395 = 35,092,646,987,154,875 (insane)
It's no wonder that things would be slowing down. You need to redefine your algorithm to remove these nested loops or combine work somewhere. Even if the numbers are smaller than my assumptions, the nesting of the loops is creating a LOT of operations that are probably unnecessary.
As Joal already mentioned, nobody can optimize this looping algorithm as it stands. But what you can do is explain better what you are trying to accomplish and what your hard requirements are. Maybe you can take a different approach, using something like HashSet<T>.IntersectWith() or a Bloom filter.
So if you really want help here, you should post not only the code that performs badly, but also the overall task you would like to accomplish. Maybe someone has a completely different idea that solves your problem and makes your whole algorithm obsolete.
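For the innermost bit-counting loop specifically, one common alternative is a precomputed 16-bit popcount table, so counting the set bits of a uint becomes two table lookups instead of a loop over each set bit. An illustrative sketch, not code from the question:

```csharp
using System;

// Build the table once: popcount(i) = popcount(i >> 1) + lowest bit of i.
var popCountTable = new byte[1 << 16];
for (int i = 1; i < popCountTable.Length; i++)
    popCountTable[i] = (byte)(popCountTable[i >> 1] + (i & 1));

// Two lookups cover the low and high halves of a 32-bit word.
int PopCount(uint v) => popCountTable[v & 0xFFFF] + popCountTable[v >> 16];

Console.WriteLine(PopCount(0b1011u)); // 3
```

The table costs 64 KB once; each AND-ed pair of uints is then counted in constant time regardless of how many bits are set.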

Optimizing a Recursive Function for Very Large Lists .Net

I have built an application that is used to simulate the number of products that a company can produce in different "modes" per month. This simulation is used to aid in finding the optimal series of modes to run in for a month to best meet the projected sales forecast for the month. This application has been working well, until recently, when the plant was modified to run in additional modes. It is now possible to run in 16 modes. For a month with 22 work days this yields 9,364,199,760 possible combinations, up from the 8 modes in the past that would have yielded a mere 1,560,780 possible combinations.

The PC that runs this application is on the old side and cannot handle the number of calculations before an out-of-memory exception is thrown. In fact, the entire application cannot support more than 15 modes, because it uses integers to track the number of modes and exceeds the upper limit for an integer.

Barring that issue, I need to do what I can to reduce the memory utilization of the application and optimize it to run as efficiently as possible, even if it cannot achieve the stated goal of 16 modes. I was considering writing the data to disk rather than storing the list in memory, but before I take on that overhead, I would like to get people's opinion on the method to see if there is any room for optimization.
EDIT
Based on a suggestion by a few to consider something more academic than merely calculating every possible answer, below is a brief explanation of how the optimal run (combination of modes) is chosen.
Currently the computer determines every possible way that the plant can run for the number of work days that month. For example 3 Modes for a max of 2 work days would result in the combinations (where the number represents the mode chosen) of (1,1), (1,2), (1,3), (2,2), (2,3), (3,3) For each mode a product produces at a different rate of production, for example in mode 1, product x may produce at 50 units per hour where product y produces at 30 units per hour and product z produces at 0 units per hour. Each combination is then multiplied by work hours and production rates. The run that produces numbers that most closely match the forecasted value for each product for the month is chosen. However, because some months the plant does not meet the forecasted value for a product, the algorithm increases the priority of a product for the next month to ensure that at the end of the year the product has met the forecasted value. Since warehouse space is tight, it is important that products not overproduce too much either.
Thank you
private List<List<int>> _modeIterations = new List<List<int>>();

private void CalculateCombinations(int modes, int workDays, string combinationValues)
{
    List<int> _tempList = new List<int>();
    if (modes == 1)
    {
        combinationValues += Convert.ToString(workDays);
        string[] _combinations = combinationValues.Split(',');
        foreach (string _number in _combinations)
        {
            _tempList.Add(Convert.ToInt32(_number));
        }
        _modeIterations.Add(_tempList);
    }
    else
    {
        for (int i = workDays + 1; --i >= 0; )
        {
            CalculateCombinations(modes - 1, workDays - i, combinationValues + i + ",");
        }
    }
}
This kind of optimization problem is difficult but extremely well-studied. You should probably read up in the literature on it rather than trying to re-invent the wheel. The keywords you want to look for are "operations research" and "combinatorial optimization problem".
It is well known in the study of optimization problems that finding the optimal solution is almost always computationally infeasible as the problem grows large, as you have discovered for yourself. However, it is frequently the case that finding a solution guaranteed to be within a certain percentage of the optimal one is feasible. You should probably concentrate on finding approximate solutions. After all, your sales targets are already just educated guesses, so finding the optimal solution was always going to be impossible anyway; you haven't got complete information.
What I would do is start by reading the wikipedia page on the Knapsack Problem:
http://en.wikipedia.org/wiki/Knapsack_problem
This is the problem of "I've got a whole bunch of items of different values and different weights, I can carry 50 pounds in my knapsack, what is the largest possible value I can carry while meeting my weight goal?"
This isn't exactly your problem, but clearly it is related -- you've got a certain amount of "value" to maximize, and a limited number of slots to pack that value into. If you can start to understand how people find near-optimal solutions to the knapsack problem, you can apply that to your specific problem.
You could process the permutation as soon as you have generated it, instead of collecting them all in a list first:
public delegate void Processor(List<int> args);

private void CalculateCombinations(int modes, int workDays, string combinationValues, Processor processor)
{
    if (modes == 1)
    {
        List<int> _tempList = new List<int>();
        combinationValues += Convert.ToString(workDays);
        string[] _combinations = combinationValues.Split(',');
        foreach (string _number in _combinations)
        {
            _tempList.Add(Convert.ToInt32(_number));
        }
        processor.Invoke(_tempList);
    }
    else
    {
        for (int i = workDays + 1; --i >= 0; )
        {
            CalculateCombinations(modes - 1, workDays - i, combinationValues + i + ",", processor);
        }
    }
}
I am assuming here that your current pattern of work is something along these lines:
CalculateCombinations(initial_value_1, initial_value_2, initial_value_3);
foreach (List<int> list in _modeIterations)
{
    ... process the list ...
}
With the direct-process-approach, this would be
private void ProcessPermutation(List<int> args)
{
... process ...
}
... somewhere else ...
CalculateCombinations(initial_value_1, initial_value_2, initial_value_3, ProcessPermutation);
I would also suggest that you try to prune the search tree as early as possible: if you can already tell that certain combinations of the arguments will never yield anything that can be processed, you should catch those during generation and avoid the recursion altogether, where possible.
In newer versions of C#, generating the combinations with an iterator function (yield return) might let you retain the original structure of your code. I haven't really used this feature yet, so I cannot comment on it.
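A rough sketch of that yield-based variant, assuming the same meaning of modes/workDays as in the question (this is an illustration, not a drop-in replacement for the original method):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Streams each combination to the caller instead of materializing the
// full list, so only one combination is alive at a time.
IEnumerable<List<int>> Combinations(int modes, int workDays)
{
    if (modes == 1)
    {
        yield return new List<int> { workDays };
    }
    else
    {
        for (int i = workDays; i >= 0; i--)
        {
            foreach (var rest in Combinations(modes - 1, workDays - i))
            {
                rest.Insert(0, i);
                yield return rest;
            }
        }
    }
}

Console.WriteLine(Combinations(3, 2).Count()); // 6
```

Because enumeration is lazy, a consumer can also stop early (e.g. once a good-enough run is found) without ever generating the rest.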
The problem lies more in the brute-force approach than in the code itself. It's possible that brute force is the only way to approach the problem, but I doubt it. Chess, for example, cannot be solved by brute force, yet computers play it quite well by using heuristics to discard the less promising lines and focus on the good ones. Maybe you should take a similar approach.
On the other hand, we need to know how each "mode" is evaluated in order to suggest any heuristics. In your code you are only computing all possible combinations, which will not scale if the modes go up to 32... even if you store them on disk.
if (modes == 1)
{
    List<int> _tempList = new List<int>();
    combinationValues += Convert.ToString(workDays);
    string[] _combinations = combinationValues.Split(',');
    foreach (string _number in _combinations)
    {
        _tempList.Add(Convert.ToInt32(_number));
    }
    processor.Invoke(_tempList);
}
Everything in this block of code is executed over and over again, so no line in it should use memory without freeing it. The most obvious place to reduce memory pressure is to write combinationValues out to disk as it is processed (i.e. use a FileStream, not a string). I think that, in general, doing string concatenation the way you are doing here is bad, since every concatenation allocates a new string. At least use a StringBuilder (see "Back to Basics", which discusses the same issue in terms of C). There may be other places with issues, though. The simplest way to figure out why you are getting an out-of-memory error may be to use a memory profiler.
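To illustrate the StringBuilder point (the helper name is hypothetical, not from the question): concatenating in a loop allocates a new string on every pass, while a StringBuilder appends into a reusable buffer.

```csharp
using System;
using System.Text;

// Builds "v1,v2,..." with a single backing buffer instead of one new
// string allocation per append.
string JoinWithStringBuilder(int[] values)
{
    var sb = new StringBuilder();
    foreach (int v in values)
    {
        if (sb.Length > 0) sb.Append(',');
        sb.Append(v);
    }
    return sb.ToString();
}

Console.WriteLine(JoinWithStringBuilder(new[] { 3, 1, 2 })); // 3,1,2
```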
By the way, my tendency with code like this is to have a global List object that is Clear()ed rather than having a temporary one that is created over and over again.
I would replace the List objects with my own class that uses preallocated arrays to hold the ints. I'm not really sure about this right now, but I believe that each integer in a List is boxed, which means much more memory is used than with a simple array of ints.
Edit: On the other hand it seems I am mistaken: Which one is more efficient : List<int> or int[]

How can I sort an array of strings?

I have a list of input words separated by commas. I want to sort these words alphabetically and by length. How can I do this without using the built-in sorting functions?
Good question!! Sorting is probably the most important concept to learn as an up-and-coming computer scientist.
There are actually lots of different algorithms for sorting a list.
When you break all of those algorithms down, the most fundamental operation is the comparison of two items in the list, defining their "natural order".
For example, in order to sort a list of integers, I'd need a function that tells me, given any two integers X and Y whether X is less than, equal to, or greater than Y.
For your strings, you'll need the same thing: a function that tells you which of the strings has the "lesser" or "greater" value, or whether they're equal.
Traditionally, these "comparator" functions look something like this:
int CompareStrings(String a, String b) {
    if (a < b)
        return -1;
    else if (a > b)
        return 1;
    else
        return 0;
}
I've left out some of the details (like, how do you compute whether a is less than or greater than b? clue: iterate through the characters), but that's the basic skeleton of any comparison function. It returns a value less than zero if the first element is smaller and a value greater than zero if the first element is greater, returning zero if the elements have equal value.
But what does that have to do with sorting?
A sort routing will call that function for pairs of elements in your list, using the result of the function to figure out how to rearrange the items into a sorted list. The comparison function defines the "natural order", and the "sorting algorithm" defines the logic for calling and responding to the results of the comparison function.
Each algorithm is like a big-picture strategy for guaranteeing that ANY input will be correctly sorted. Here are a few of the algorithms that you'll probably want to know about:
Bubble Sort:
Iterate through the list, calling the comparison function for all adjacent pairs of elements. Whenever you get a result greater than zero (meaning that the first element is larger than the second one), swap the two values. Then move on to the next pair. When you get to the end of the list, if you didn't have to swap ANY pairs, then congratulations, the list is sorted! If you DID have to perform any swaps, go back to the beginning and start over. Repeat this process until there are no more swaps.
NOTE: this is usually not a very efficient way to sort a list, because in the worst cases, it might require you to scan the whole list as many as N times, for a list with N elements.
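The bubble sort strategy above, sketched over strings with an ordinal comparison in the role of the comparator (illustrative code, not from the answer):

```csharp
using System;

// Repeatedly sweep the array, swapping out-of-order neighbours,
// until a full sweep makes no swaps.
void BubbleSort(string[] items)
{
    bool swapped;
    do
    {
        swapped = false;
        for (int i = 0; i < items.Length - 1; i++)
        {
            if (string.CompareOrdinal(items[i], items[i + 1]) > 0)
            {
                (items[i], items[i + 1]) = (items[i + 1], items[i]);
                swapped = true;
            }
        }
    } while (swapped);
}

var words = new[] { "pear", "apple", "mango" };
BubbleSort(words);
Console.WriteLine(string.Join(" ", words)); // apple mango pear
```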
Merge Sort:
This is one of the most popular divide-and-conquer algorithms for sorting a list. The basic idea is that, if you have two already-sorted lists, it's easy to merge them. Just start from the beginning of each list and remove the first element of whichever list has the smallest starting value. Repeat this process until you've consumed all the items from both lists, and then you're done!
1 4 8 10
2 5 7 9
------------ becomes ------------>
1 2 4 5 7 8 9 10
But what if you don't have two sorted lists? What if you have just one list, and its elements are in random order?
That's the clever thing about merge sort. You can break any single list into smaller pieces, each of which is either an unsorted list, a sorted list, or a single element (which, if you think about it, is actually a sorted list of length 1).
So the first step in a merge sort algorithm is to divide your overall list into smaller and smaller sublists. At the tiniest levels (where each list only has one or two elements), they're very easy to sort. And once sorted, it's easy to merge any two adjacent sorted lists into a larger sorted list containing all the elements of the two sublists.
NOTE: This algorithm is much better than the bubble sort method, described above, in terms of its worst-case-scenario efficiency. I won't go into a detailed explanation (which involves some fairly trivial math, but would take some time to explain), but the quick reason for the increased efficiency is that this algorithm breaks its problem into ideal-sized chunks and then merges the results of those chunks. The bubble sort algorithm tackles the whole thing at once, so it doesn't get the benefit of "divide-and-conquer".
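A compact sketch of the split-then-merge idea, using integers for brevity (illustrative, not from the answer):

```csharp
using System;

// Split in half, sort each half recursively, then merge.
int[] MergeSort(int[] a)
{
    if (a.Length <= 1) return a;
    int mid = a.Length / 2;
    return Merge(MergeSort(a[..mid]), MergeSort(a[mid..]));
}

// The merge step from the diagram: repeatedly take the smaller head.
int[] Merge(int[] left, int[] right)
{
    var result = new int[left.Length + right.Length];
    int i = 0, j = 0, k = 0;
    while (i < left.Length && j < right.Length)
        result[k++] = left[i] <= right[j] ? left[i++] : right[j++];
    while (i < left.Length) result[k++] = left[i++];
    while (j < right.Length) result[k++] = right[j++];
    return result;
}

Console.WriteLine(string.Join(" ", Merge(
    new[] { 1, 4, 8, 10 }, new[] { 2, 5, 7, 9 }))); // 1 2 4 5 7 8 9 10
```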
Those are just two algorithms for sorting a list, but there are a lot of other interesting techniques, each with its own advantages and disadvantages: Quick Sort, Radix Sort, Selection Sort, Heap Sort, Shell Sort, and Bucket Sort.
The internet is overflowing with interesting information about sorting. Here's a good place to start:
http://en.wikipedia.org/wiki/Sorting_algorithms
Create a console application and paste this into the Program.cs as the body of the class.
// Requires: using System; using System.Linq;
// Note: for the Sort() extension method to compile, the containing
// class must be static.
public static void Main(string[] args)
{
    string[] strList = "a,b,c,d,e,f,a,a,b".Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries);
    foreach (string s in strList.Sort())
        Console.WriteLine(s);
}

public static string[] Sort(this string[] strList)
{
    return strList.OrderBy(i => i).ToArray();
}
Notice that I do use a built in method, OrderBy. As other answers point out there are many different sort algorithms you could implement there and I think my code snippet does everything for you except the actual sort algorithm.
Some C# specific sorting tutorials
There is an entire area of study built around sorting algorithms. You may want to choose a simple one and implement it.
Though it won't be the most performant, it shouldn't take you too long to implement a bubble sort.
If you don't want to use the built-in functions, you have to create one yourself. I would recommend bubble sort or some similar algorithm. Bubble sort is not an efficient algorithm, but it gets the work done and is easy to understand.
You will find much good reading on Wikipedia.
I would recommend reading the wiki article on quicksort.
Still not sure why you don't want to use the built-in sort?
Bubble sort damages the brain.
Insertion sort is at least as simple to understand and code, and is actually useful in practice (for very small data sets, and nearly-sorted data). It works like this:
Suppose that the first n items are already in order (you can start with n = 1, since obviously one thing on its own is "in the correct order").
Take the (n+1)th item in your array. Call this the "pivot". Starting with the nth item and working down:
- if it is bigger than the pivot, move it one space to the right (to create a "gap" to the left of it).
- otherwise, leave it in place, put the "pivot" one space to the right of it (that is, in the "gap" if you moved anything, or where it started if you moved nothing), and stop.
Now the first n+1 items in the array are in order, because the pivot is to the right of everything smaller than it, and to the left of everything bigger than it. Since you started with n items in order, that's progress.
Repeat, with n increasing by 1 at each step, until you've processed the whole list.
This corresponds to one way that you might physically put a series of folders into a filing cabinet in order: put one in; then put another one into its correct position by pushing everything that belongs after it over by one space to make room; repeat until finished. Nobody ever sorts physical objects by bubble sort, so it's a mystery to me why it's considered "simple".
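The steps above can be sketched in C# like this (the comparator parameter is an assumption to keep the sort independent of any particular string ordering):

```csharp
using System;

// Grow a sorted prefix; shift larger items one space right to open a
// gap, then drop the pivot into the gap.
void InsertionSort(string[] a, Comparison<string> cmp)
{
    for (int n = 1; n < a.Length; n++)
    {
        string pivot = a[n];
        int i = n - 1;
        while (i >= 0 && cmp(a[i], pivot) > 0)
        {
            a[i + 1] = a[i];
            i--;
        }
        a[i + 1] = pivot;
    }
}

var items = new[] { "cherry", "apple", "banana" };
InsertionSort(items, string.CompareOrdinal);
Console.WriteLine(string.Join(" ", items)); // apple banana cherry
```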
All that's left now is that you need to be able to work out, given two strings, whether the first is greater than the second. I'm not quite sure what you mean by "alphabetical and length": alphabetical order is done by comparing one character at a time from each string. If they're not the same, that's your order. If they are the same, look at the next one, unless you're out of characters in one of the strings, in which case the shorter one is the one that's "smaller".
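One possible reading of that comparison — character by character, falling back to length when one string is a prefix of the other — as a sketch (the exact ordering the asker wants is an assumption):

```csharp
using System;

// Ordinal, character-by-character comparison; a shorter string that is
// a prefix of a longer one sorts first.
int CompareStrings(string a, string b)
{
    int min = Math.Min(a.Length, b.Length);
    for (int i = 0; i < min; i++)
        if (a[i] != b[i])
            return a[i] - b[i];
    return a.Length - b.Length;
}

Console.WriteLine(CompareStrings("ab", "abc")); // negative: "ab" sorts first
```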
Use NSort
I ran across the NSort library a couple of years ago in the book Windows Developer Power Tools. The NSort library implements a number of sorting algorithms. The main advantage of using something like NSort over writing your own sort is that it is already tested and optimized.
Posting link to fast string sort code in C#:
http://www.codeproject.com/KB/cs/fast_string_sort.aspx
Another point:
The comparator suggested above is not recommended for non-English languages:
int CompareStrings(String a, String b) {
    if (a < b)
        return -1;
    else if (a > b)
        return 1;
    else
        return 0;
}
Checkout this link for non-English language sort:
http://msdn.microsoft.com/en-us/goglobal/bb688122
And as mentioned, use NSort for really gigantic arrays that don't fit in memory.
