Improve the code for performance? - c#

I wrote the follow c# codes to generate a set of numbers and then compare with another set of numbers to remove the unwanted numbers.
But its taking too long at run time to complete the process. Following is the code behind file.
The numbers it has to generate is like 7 figures large and the numbers list which I use it as to remove is around 700 numbers.
Is there a way to improve the run time performance?
string[] strAry = txtNumbersToBeExc.Text.Split(new string[] { Environment.NewLine }, StringSplitOptions.RemoveEmptyEntries);
int[] intAry = new int[strAry.Length];
List<int> intList = new List<int>();
for (int i = 0; i < strAry.Length; i++)
{
intList.Add(int.Parse(strAry[i]));
}
List<int> genList = new List<int>();
for (int i = int.Parse(txtStartSeed.Text); i <= int.Parse(txtEndSeed.Text); i++)
{
genList.Add(i);
}
lblStatus.Text += "Generated: " + genList.Capacity;
var finalvar = from s in genList where !intList.Contains(s) select s;
List<int> finalList = finalvar.ToList();
foreach (var item in finalList)
{
txtGeneratedNum.Text += "959" + item + "\n";
}

First thing to do is grab a profiler and see which area of your code is taking too long to run, try http://www.jetbrains.com/profiler/ or http://www.red-gate.com/products/dotnet-development/ants-performance-profiler/.
You should never start performance tuning until you know for sure where the problem is.
If the problem is in the linq query than you could try sorting the intlist and doing a binary search for each item to remove, though you can probably get a similar behavour with the right linq query.

string numbersStr = txtNumbersToBeExc.Text;
string startSeedStr = txtStartSeed.Text;
string endSeedStr = txtEndSeed.Text;
//next, the input type actually is of type int, we should test if the strings are ok ( they do represent ints)
var intAry = numbersStr.Split(new string[] { Environment.NewLine }, StringSplitOptions.RemoveEmptyEntries).Select(s=>Int32.Parse(s));
int startSeed = Int32.Parse(startSeedStr);
int endSeed = Int32.Parse(endSeedStr);
/*FROM HERE*/
// using Enumerable.Range
var genList = Enumerable.Range(startSeed, endSeed - startSeed + 1);
// we can use linq except
var finalList = genList.Except(intAry);
// do you need a string, for 700 concatenations I would suggest StringBuilder
var sb = new StringBuilder();
foreach ( var item in finalList)
{
sb.AppendLine(string.Concat("959",item.ToString()));
}
var finalString = sb.ToString();
/*TO HERE, refactor it into a method or class*/
txtGeneratedNum.Text = finalString;
They key point here is that String is a immutable class, so the "+" operation between two strings will create another string. StringBuilder it doesn't do this. On your situation it really doesn't matter if you're using for loops, foreach loops, linq fancy functions to accomplish the exclusion. The performance hurt was because of the string concatenations. I'm trusting more the System.Linq functions because they are already tested for performance.

Change intList from a List to a HashSet - gives much better performance when determining if an entry is present.
Consider using Linq's Enumerable.Intersect, especially combined with #1.

Change the block of code that create genList with this:
List<int> genList = new List<int>();
for (int i = int.Parse(txtStartSeed.Text); i <= int.Parse(txtEndSeed.Text); i++)
{
if (!intList.Contains(i)) genList.Add(i);
}
and after create txtGeneratedNum looping on genList. This will reduce the number of loop of your implementation.

Why not do the inclusion check when you are parsing the int and just build the result list directley.
There is not much point in iterating over the list twice. In fact, why build the intermediate list at all !?! just write straight to a StringBuilder since a newline delimited string seems to be your goal.
string[] strAry = txtNumbersToBeExc.Text.Split(new string[] { Environment.NewLine }, StringSplitOptions.RemoveEmptyEntries);
var exclusions = new HashSet<T>();
foreach (string s in txtNumbersToBeExc.Text.Split(new string[] { Environment.NewLine })
{
int value;
if (int.TryParse(s, value)
{
exclusions.Add(value);
}
}
var output = new StringBuilder();
for (int i = int.Parse(txtStartSeed.Text); i <= int.Parse(txtEndSeed.Text); i++)
{
if (!exclusions.Contains(i))
{
output.AppendFormat("959{0}\n", i);
}
}
txtGeneratedNum.Text = output.ToString();

Related

How to split a string into an array of two letter substrings with C#

Problem
Given a sample string abcdef, i am trying to split that into an array of two character string elements that should results in ['ab','cd','ef'];
What i tried
I tried to iterate through the string while storing the substring in the current index in an array i declared inside the method, but am getting this output
['ab','bc','cd','de','ef']
Code I used
static string[] mymethod(string str)
{
string[] r= new string[str.Length];
for(int i=0; i<str.Length-1; i++)
{
r[i]=str.Substring(i,2);
}
return r;
}
Any solution to correct that with the code to return the correct output is really welcome, Thanks
your problem was that you incremented your index by 1 instead of 2 every time
var res = new List<string>();
for (int i = 0; i < x.Length - 1; i += 2)
{
res.Add(x.Substring(i, 2));
}
should work
EDIT:
because you ask for a default _ suffix in case of odd characters amount,
this should be the change:
var testString = "odd";
string workOn = testString.Length % 2 != 0
? testString + "_"
: testString;
var res = new List<string>();
for (int i = 0; i < workOn.Length - 1; i += 2)
{
res.Add(workOn.Substring(i, 2));
}
two notes to notice:
in .NET 6 Chunk() is available so you can use this as suggested in other answers
this solution might not be the best in case of a very long input
so it really depends on what are your inputs and expectations
.net 6 has an IEnumerable.Chunk() method that you can use to do this, as follows:
public static void Main()
{
string[] result =
"abcdef"
.Chunk(2)
.Select(chunk => new string(chunk)).ToArray();
Console.WriteLine(string.Join(", ", result)); // Prints "ab, cd, ef"
}
Before .net 6, you can use MoreLinq.Batch() to do the same thing.
[EDIT] In response the the request below:
MoreLinq is a set of Linq utilities originally written by Jon Skeet. You can find an implementation by going to Project | Manage NuGet Packages and then browsing for MoreLinq and installing it.
After installing it, add using MoreLinq.Extensions; and then you'll be able to use the MoreLinq.Batch extension like so:
public static void Main()
{
string[] result = "abcdef"
.Batch(2)
.Select(chunk => new string(chunk.ToArray())).ToArray();
Console.WriteLine(string.Join(", ", result)); // Prints "ab, cd, ef"
}
Note that there is no string constructor that accepts an IEnumerable<char>, hence the need for the chunk.ToArray() above.
I would say, though, that including the whole of MoreLinq just for one extension method is perhaps overkill. You could just write your own extension method for Enumerable.Chunk():
public static class MyBatch
{
public static IEnumerable<T[]> Chunk<T>(this IEnumerable<T> self, int size)
{
T[] bucket = null;
int count = 0;
foreach (var item in self)
{
if (bucket == null)
bucket = new T[size];
bucket[count++] = item;
if (count != size)
continue;
yield return bucket;
bucket = null;
count = 0;
}
if (bucket != null && count > 0)
yield return bucket.Take(count).ToArray();
}
}
If you are using latest .NET version i.e (.NET 6.0 RC 1), then you can try Chunk() method,
var strChunks = "abcdef".Chunk(2); //[['a', 'b'], ['c', 'd'], ['e', 'f']]
var result = strChunks.Select(x => string.Join('', x)).ToArray(); //["ab", "cd", "ef"]
Note: I am unable to test this on fiddle or my local machine due to latest version of .NET
With linq you can achieve it with the following way:
char[] word = "abcdefg".ToCharArray();
var evenCharacters = word.Where((_, idx) => idx % 2 == 0);
var oddCharacters = word.Where((_, idx) => idx % 2 == 1);
var twoCharacterLongSplits = evenCharacters
.Zip(oddCharacters)
.Select((pair) => new char[] { pair.First, pair.Second });
The trick is the following, we create two collections:
one where we have only those characters where the original index was even (% 2 == 0)
one where we have only those characters where the original index was odd (% 2 == 1)
Then we zip them. So, we create a tuple by taking one item from the even and one item from the odd collection. Then we create a new tuple by taking one item from the even and ...
And last we convert the tuples to arrays to have the desired output format.
You are on the right track but you need to increment by 2 not by one. You also need to check if the array has not ended before taking the second character else you risk running into an index out of bounds exception. Try this code I've written below. I've tried it and it works. Best!
public static List<string> splitstring(string str)
{
List<string> result = new List<string>();
int strlen = str.Length;
for(int i = 0; i<strlen; i+=2)
{
string currentstr = str[i].ToString();
if (i + 1 <= strlen-1)
{ currentstr += str[i + 1].ToString(); }
result.Add(currentstr);
}
return result;
}

Quickest algorithm for identifying pairs in collection of string

I am looking for the quickest algorithm:
GOAL: output the total number of pair occurrences found on a line. The individual elements may be in any order on any given line.
INPUT:
a;b;c;d
a;e;f;g
a;b;f;h
OUTPUT
a;b = 2
a;c = 1
a;d = 1
a;e = 1
a;f = 2
a;g = 1
b;c = 1
b;d = 1
I am programming in C#, I've got a nested for loop adding do a common dictionary of type where string is like a;b and when an occurrence is found it adds to the existing int tally or adds a new one at tally = 0.
Note this:
a;b = 1
b;a = 1
Should be reduced to this:
a;b = 1
I am open to using other languages, the output is in a plain text file which I feed into Gephi visualization tool.
Bonus: Very interested to know the name of this particular algorithm if it's out there. Pretty sure it is.
String[] data = File.ReadAllLines(#"C:\input.txt");
Dictionary<string, int> ress = new Dictionary<string, int>();
foreach (var line in data)
{
string[] outStrings = line.Split(';');
for (int i = 0; i < outStrings.Count(); i++)
{
for (int y = 0; y < outStrings.Count(); y++)
{
if (outStrings[i] != outStrings[y])
{
try
{
if (ress.Any(x => x.Key == outStrings[i] + ";" + outStrings[y]))
{
ress[outStrings[i] + ";" + outStrings[y]] += 1;
}
else
{
ress.Add(outStrings[i] + ";" + outStrings[y], 0);
}
}
catch (Exception)
{
}
}
}
}
}
foreach (var val in ress)
{
Console.WriteLine(val.Key + "----" + val.Value);
}
I think your inner loop should start with i + 1 instead of starting back at 0 again, and the outer loop should only run until Length - 1, since the last item will be compared on the inner loop. Also, when you add a new item, you should add the value 1, not 0 (since the whole reason we're adding it is because we found one).
You can also just store the key into a string once instead of doing multiple concatenations during your comparison and assignment, and you can use the ContainsKey method to determine if a key exists already.
Also, you might want to consider avoiding empty catch blocks unless you're really certain that you don't care if or what went wrong. If I'm expecting an exception and know how to handle it, then I catch that exception, otherwise I'll just let it bubble up the stack.
Here's one way you could modify your code to find all pairs and their counts:
Update
I added a check to ensure that the "pair" key is always sorted, so that "b;a" becomes "a;b". This wasn't an issue in your sample data, but I extended the data to include lines like b;a;a;b;a;b;a;. Also I added StringSplitOptions.RemoveEmptyEntries to the Split method to handle cases where a line begins or ends with a ; (otherwise the null value resulted in a pair like ";a").
private static void Main()
{
var data = File.ReadAllLines(#"f:\public\temp\temp.txt");
var pairCount = new Dictionary<string, int>();
foreach (var line in data)
{
var lineItems = line.Split(new[] {';'}, StringSplitOptions.RemoveEmptyEntries);
for (var outer = 0; outer < lineItems.Length - 1; outer++)
{
for (var inner = outer + 1; inner < lineItems.Length; inner++)
{
var outerComparedToInner = string.Compare(lineItems[outer],
lineItems[inner], StringComparison.Ordinal);
// If both items are the same character, ignore them and keep looping
if (outerComparedToInner == 0) continue;
// Create the pair such that the lower of the two
// values is first, so that "b;a" becomes "a;b"
var thisPair = outerComparedToInner < 0
? $"{lineItems[outer]};{lineItems[inner]}"
: $"{lineItems[inner]};{lineItems[outer]}";
if (pairCount.ContainsKey(thisPair))
{
pairCount[thisPair]++;
}
else
{
pairCount.Add(thisPair, 1);
}
}
}
}
Console.WriteLine("Pair\tCount\n----\t-----");
foreach (var val in pairCount.OrderBy(i => i.Key))
{
Console.WriteLine($"{val.Key}\t{val.Value}");
}
Console.Write("\nDone!\nPress any key to exit...");
Console.ReadKey();
}
Output
Given a file containing your sample data, the output is:
#mrmcgreg, finally after changing the implementation to the ECLAT algorythm everything runs in seconds instead of hours.
Basically for each unique tag, keep track of the LINE NUMBERS where those tags are found, and simply intersect the pair of list of numbers by combination pairs to get the count.
Dictionary<string, List<int>> uniqueTagList = new Dictionary<string, List<int>>();
foreach (var uniqueTag in uniquetags)
{
List<int> lineNumbers = new List<int>();
foreach (var item in data.Select((value, i) => new { i, value }))
{
var value = item.value;
var index = item.i;
//split data into tags
var tags = item.ToString().Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries);
foreach (var tag in tags)
{
if (uniqueTag == tag)
{
lineNumbers.Add(index);
}
}
}
//remove all having support threshold.
if (lineNumbers.Count > 5)
{
uniqueTagList.Add(uniqueTag, lineNumbers);
}
}

Fastest way to split a large list of integers into List of strings in C#

My goal is to split a list of 24043 integers into strings like:
"?ids=" + "1,2,3...198,199,200"
Can you think of a better solution than mine in terms of performance?
public List<string> ZwrocListeZapytan(List<int> listaId)
{
var listaZapytan = new List<string>();
if (listaId.Count == 0) return listaZapytan;
var zapytanie = "?ids=";
var licznik = 1;
for (var i = 0; i < listaId.Count; i++)
{
if (licznik == 200 || i == listaId.Count - 1)
{
listaZapytan.Add(zapytanie + listaId[i]);
zapytanie = "?ids=";
licznik = 1;
}
else
{
zapytanie += listaId[i] + ",";
licznik++;
}
}
return listaZapytan;
}
Using Linq:
Assuming listaId is the list of integers that has to be converted:
var result = listaId.GroupBy(x => x / 200)
.Select(y => "?ids=" + string.Join(",", y)).ToList();
.GroupBy() helps take 200 at a time
.Select() is used to combine them together in the format like the OP suggested i.e ?ids=1,2,... using string.Join()
Can you think of a better solution than mine in terms of performance?
It terms of performance the only thing that comes to my mind as an enhancement for your code is to use a StringBuilder when you concatenate the string:
public List<string> ZwrocListeZapytan(List<int> listaId)
{
var listaZapytan = new List<string>();
if (listaId.Count == 0) return listaZapytan;
StringBuilder sb = new StringBuilder();
sb.Append("?ids=");
var licznik = 1;
for (var i = 0; i < listaId.Count; i++)
{
if (licznik == 200 || i == listaId.Count - 1)
{
listaZapytan.Add(sb.ToString() +listaId[i]);
sb.Clear();
sb.Append("?ids=");
licznik = 1;
}
else
{
sb.Append(listaId[i] + ",");
licznik++;
} return listaZapytan;
}
Otherwise you could make the for-loop run in steps of the 200. At each step take the numbers from the given range and use String.Join to create the string:
// TEST DATA
List<int> listaId = Enumerable.Range(1, 420).ToList();
List<string> listaZapytan = new List<string>();
int stepsize = 200;
for (int i = 0; i < listaId.Count; i +=stepsize)
{
listaZapytan.Add("?ids=" + String.Join(",", listaId.Skip(i).Take(stepsize)));
}
Could you please make a try with this and let me know whether this approach helps to solve your issue?
List<int> listaId = Enumerable.Range(0, 24043).ToList();
var items = String.Join("", Enumerable.Range(0, 24043)
.Select((x,i)=>i%200==0?
"\n?ids=" + x.ToString():
"," + x.ToString()));
Running Example
Here we are using Enumerable.Range to generate 24043 continuous numbers starting from 0. Then we can use the Select method to split them into a list of 200 and form the required string. If you want to get the output as a List, Remove the String.Join and add .ToList() at the end of the query. Current query produces output with 0-199 in the first list if you want 200 in that list means change the condition to i%201.

Read and process 100 text files in c# in parallel

I have project that reads 100 text file with 5000 words in it.
I insert the words into a list. I have a second list that contains english stop words. I compare the two lists and delete the stop words from first list.
It takes 1 hour to run the application. I want to be parallelize it. How can I do that?
Heres my code:
private void button1_Click(object sender, EventArgs e)
{
List<string> listt1 = new List<string>();
string line;
for (int ii = 1; ii <= 49; ii++)
{
string d = ii.ToString();
using (StreamReader reader = new StreamReader(#"D" + d.ToString() + ".txt"))
while ((line = reader.ReadLine()) != null)
{
string[] words = line.Split(' ');
for (int i = 0; i < words.Length; i++)
{
listt1.Add(words[i].ToString());
}
}
listt1 = listt1.ConvertAll(d1 => d1.ToLower());
StreamReader reader2 = new StreamReader("stopword.txt");
List<string> listt2 = new List<string>();
string line2;
while ((line2 = reader2.ReadLine()) != null)
{
string[] words2 = line2.Split('\n');
for (int i = 0; i < words2.Length; i++)
{
listt2.Add(words2[i]);
}
listt2 = listt2.ConvertAll(d1 => d1.ToLower());
}
for (int i = 0; i < listt1.Count(); i++)
{
for (int j = 0; j < listt2.Count(); j++)
{
listt1.RemoveAll(d1 => d1.Equals(listt2[j]));
}
}
listt1=listt1.Distinct().ToList();
textBox1.Text = listt1.Count().ToString();
}
}
}
}
I fixed many things up with your code. I don't think you need multi-threading:
private void RemoveStopWords()
{
HashSet<string> stopWords = new HashSet<string>();
using (var stopWordReader = new StreamReader("stopword.txt"))
{
string line2;
while ((line2 = stopWordReader.ReadLine()) != null)
{
string[] words2 = line2.Split('\n');
for (int i = 0; i < words2.Length; i++)
{
stopWords.Add(words2[i].ToLower());
}
}
}
var fileWords = new HashSet<string>();
for (int fileNumber = 1; fileNumber <= 49; fileNumber++)
{
using (var reader = new StreamReader("D" + fileNumber.ToString() + ".txt"))
{
string line;
while ((line = reader.ReadLine()) != null)
{
foreach(var word in line.Split(' '))
{
fileWords.Add(word.ToLower());
}
}
}
}
fileWords.ExceptWith(stopWords);
textBox1.Text = fileWords.Count().ToString();
}
You are reading through the list of stopwords many times as well as continually adding to the list and re-attempting to remove the same stopwords over and again due to the way your code is structured. Your needs are also better matched to a HashSet than to a List, as it has set based operations and uniqueness already handled.
If you still wanted to make this parallel, you could do it by reading the stopword list once and passing it to an async method that will read the input file, remove the stopwords and return the resulting list, then you would need to merge the resulting lists after the asynchronous calls came back, but you had better test before deciding you need that, because that is quite a bit more work and complexity than this code already has.
If I understand you correctly, you want to:
Read all words from a file into a List
Remove all "stop words" from the List
Repeat for 99 more files, saving only the unique words
If this is correct, the code is pretty simple:
// The list of words to delete ("stop words")
var stopWords = new List<string> { "remove", "these", "words" };
// The list of files to check - you can get this list in other ways
var filesToCheck = new List<string>
{
#"f:\public\temp\temp1.txt",
#"f:\public\temp\temp2.txt",
#"f:\public\temp\temp3.txt"
};
// This list will contain all the unique words from all
// the files, except the ones in the "stopWords" list
var uniqueFilteredWords = new List<string>();
// Loop through all our files
foreach (var fileToCheck in filesToCheck)
{
// Read all the file text into a varaible
var fileText = File.ReadAllText(fileToCheck);
// Split the text into distinct words (splitting on null
// splits on all whitespace) and ignore empty lines
var fileWords = fileText.Split(null)
.Where(line => !string.IsNullOrWhiteSpace(line))
.Distinct();
// Add all the words from the file, except the ones in
// your "stop list" and those that are already in the list
uniqueFilteredWords.AddRange(fileWords.Except(stopWords)
.Where(word => !uniqueFilteredWords.Contains(word)));
}
This can be condensed into a single line with no explicit loop:
// This list will contain all the unique words from all
// the files, except the ones in the "stopWords" list
var uniqueFilteredWords = filesToCheck.SelectMany(fileToCheck =>
File.ReadAllText(fileToCheck)
.Split(null)
.Where(word => !string.IsNullOrWhiteSpace(word) &&
!stopWords.Any(stopWord => stopWord.Equals(word,
StringComparison.OrdinalIgnoreCase)))
.Distinct());
This code processed over 100 files with more than 12000 words each in less than a second (WAY less than a second... 0.0001782 seconds)
One issue I see here that can help improve performance is listt1.ConvertAll() will run in O(n) on the list. You are already looping to add the items to the list, why not convert them to lower case there. Also why not store the words in a hash set, so you can do look up and insertion in O(1). You could store the list of stop words in a hash set and when you are reading your text input see if the word is a stop word and if its not add it to the hash set to output the user.

How do I make the foreach instruction iterate in 2 places?

how do I make the foreach instruction iterate both in the "files" variable and in the "names" array?
var files = Directory.GetFiles(#".\GalleryImages");
string[] names = new string[8] { "Matt", "Joanne", "Robert","Andrei","Mihai","Radu","Ionica","Vasile"};
I've tried 2 options.. the first one gives me lots of errors and the second one displays 8 images of each kind
foreach(var file in files,var i in names)
{
//Do stuff
}
and
foreach(var file in files)
{
foreach (var i in names)
{
//Do stuff
}
}
You can try using the Zip Extension method of LINQ:
int[] numbers = { 1, 2, 3, 4 };
string[] words = { "one", "two", "three" };
var numbersAndWords = numbers.Zip(words, (first, second) => first + " " + second);
foreach (var item in numbersAndWords)
Console.WriteLine(item);
Would look something like this:
var files = Directory.GetFiles(#".\GalleryImages");
string[] names = new string[] { "Matt", "Joanne", "Robert", "Andrei", "Mihai","Radu","Ionica","Vasile"};
var zipped = files.Zip(names, (f, n) => new { File = f, Name = n });
foreach(var fn in zipped)
Console.WriteLine(fn.File + " " + fn.Name);
But I haven't tested this one.
It's not clear what you're asking. But, you can't iterate two iterators with foreach; but you can increment another variable in the foreach body:
int i = 0;
foreach(var file in files)
{
var name = names[i++];
// TODO: do something with name and file
}
This, of course, assumes that files and names are of the same length.
You can't. Use a for loop instead.
for(int i = 0; i < files.Length; i++)
{
var file = files[i];
var name = names[i];
}
If the both array have the same length this should work.
You have two options here; the first works if you are iterating over something that has an indexer, like an array or List, in which case use a simple for loop and access things by index:
for (int i = 0; i < files.Length && i < names.Length; i++)
{
var file = files[i];
var name = names[i];
// Do stuff with names.
}
If you have a collection that doesn't have an indexer, e.g. you just have an IEnumerable and you don't know what it is, you can use the IEnumerable interface directly. Behind the scenes, that's all foreach is doing, it just hides the slightly messier syntax. That would look like:
var filesEnum = files.GetEnumerator();
var namesEnum = names.GetEnumerator();
while (filesEnum.MoveNext() && namesEnum.MoveNext())
{
var file = filesEnum.Current;
var name = namesEnum.Current;
// Do stuff with files and names.
}
Both of these assume that both collections have the same number of items. The for loop will only iterate as many times as the smaller one, and the smaller enumerator will return false from MoveNext when it runs out of items. If one collection is bigger than the other, the 'extra' items won't get processed, and you'll need to figure out what to do with them.
I guess the files array and the names array have the same indices.
When this is the case AND you always want the same index at one time you do this:
for (int key = 0; key < files.Length; ++key)
{
// access names[key] and files[key] here
}
You can try something like this:
var pairs = files.Zip(names, (f,n) => new {File=f, Name=n});
foreach (var item in pairs)
{
Console.Write(item.File);
Console.Write(item.Name);
}

Categories

Resources