I have two lists of strings (this is not mandatory; I can convert them to any collection: list, dictionary, etc.).
First is "text":
Birds sings
Dogs barks
Frogs jumps
Second is "words":
sing
dog
cat
I need to iterate through "text" and, if a line contains any of "words", do one thing; if not, do another.
Important: yes, in my case I need to find partial matches, ignoring case, e.g. the text "Dogs" is a match for the word "dog". This is why I use .Contains and .ToLower().
My naive attempt looks like this:
List<string> text = new List<string>();
List<string> words = new List<string>();

foreach (string line in text)
{
    bool found = false;
    foreach (string word in words)
    {
        if (line.ToLower().Contains(word.ToLower()))
        {
            // one thing
            found = true;
            break;
        }
    }
    if (!found)
    {
        // another
    }
}
The problem is size: 8,000 entries in the first list and ~50,000 in the second. This takes too much time.
How can I make it faster?
I'm assuming that you only want to match on the specific words in your text list: that is, if text contains "dogs", and words contains "dog", then that shouldn't be a match.
Note that this is different to what your code currently does.
Given this, we can put all of the entries in your words list into a HashSet<string>, which we can then query very cheaply for each line.
We'll also use StringComparer.OrdinalIgnoreCase to do our comparisons. This is a better way of doing a case-insensitive match than ToLower(), and ordinal comparisons are relatively cheap. If you're dealing with languages other than English, you'll need to consider whether you actually need StringComparer.CurrentCultureIgnoreCase or StringComparer.InvariantCultureIgnoreCase.
var wordSet = new HashSet<string>(words, StringComparer.OrdinalIgnoreCase);
foreach (var line in text)
{
    if (wordSet.Overlaps(line.Split(' ')))
    {
        // One thing
    }
    else
    {
        // Another
    }
}
If this is not the case, and you do want to do a .Contains on each, then you can speed it up a bit by avoiding the calls to .ToLower(). Each call to .ToLower() creates a new string in memory, so you're creating two new, useless objects per comparison.
Instead, use:
if (line.IndexOf(word, StringComparison.OrdinalIgnoreCase) >= 0)
As above, you might have to use StringComparison.CurrentCultureIgnoreCase or StringComparison.InvariantCultureIgnoreCase depending on the language of your strings. However, you should see a significant speedup if your strings are entirely ASCII and you use OrdinalIgnoreCase as this makes the string search a lot quicker.
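Applied to your original loop, a minimal sketch (assuming the same text and words lists as in the question):

```csharp
foreach (string line in text)
{
    bool found = false;
    foreach (string word in words)
    {
        // Case-insensitive ordinal substring search, with no temporary strings
        if (line.IndexOf(word, StringComparison.OrdinalIgnoreCase) >= 0)
        {
            found = true;
            break;
        }
    }
    if (found)
    {
        // one thing
    }
    else
    {
        // another
    }
}
```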
If you're using .NET Framework, another thing to try is moving to .NET Core. .NET Core introduced a lot of optimizations in this area, and you might find that it's quicker.
Another thing you can do is see if you have duplicates in either text or words. If you have a lot, you might be able to save a lot of time. Consider using a HashSet<string> for this, or LINQ's .Distinct() (you'll need to benchmark to see which is quicker).
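For example, a sketch of de-duplicating words up front (whether this pays off depends on how many duplicates your real data contains):

```csharp
// Case-insensitive de-duplication via HashSet<string>
var distinctWords = new HashSet<string>(words, StringComparer.OrdinalIgnoreCase);

// or via LINQ; benchmark both against your data
var distinctList = words.Distinct(StringComparer.OrdinalIgnoreCase).ToList();
```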
You can try using LINQ for the second looping construct.
List<string> text = new List<string>();
List<string> words = new List<string>();
foreach (string line in text)
{
bool found = words.Any(w => line.ToLower().Contains(w.ToLower()));
if (found)
{
//Do something
}
else
{
//Another
}
}
It might not be as fast as you want; this is mainly more concise than the original loop rather than fundamentally faster.
You can improve the search algorithm.
public static int Search(string word, List<string> stringList)
{
    // NOTE: for repeated searches, lower-case and sort the list once,
    // outside this method, instead of on every call.
    string wordCopy = word.ToLower();
    List<string> stringListCopy = new List<string>();
    stringList.ForEach(s => stringListCopy.Add(s.ToLower()));
    stringListCopy.Sort(StringComparer.Ordinal);

    int position = -1;
    int count = stringListCopy.Count;
    if (count > 0)
    {
        int min = 0;
        int max = count - 1;
        int middle = (max - min) / 2;
        int comparisonStatus = 0;
        do
        {
            comparisonStatus = string.Compare(wordCopy, stringListCopy[middle], StringComparison.Ordinal);
            if (comparisonStatus == 0)
            {
                position = middle;
                break;
            }
            else if (comparisonStatus < 0)
            {
                max = middle - 1;
            }
            else
            {
                min = middle + 1;
            }
            middle = min + (max - min) / 2;
        } while (min <= max); // min <= max so the last remaining candidate is still compared
    }
    return position;
}
Inside this method we create a copy of the string list with all elements lower-cased.
After that we sort the copied list in ascending order. This is crucial, because the entire algorithm is based upon the ascending sort.
If the word exists in the list, the Search method returns its position in the list; otherwise it returns -1.
How does the algorithm work?
Instead of checking every element in the list, we halve the search range in every iteration.
In each iteration we take the element in the middle and compare the two strings (the middle element and our word). If our string is the same as the middle one, the search is finished. If our string sorts before the middle one, it must be in the first half of the list, because the list is sorted in ascending order. If it sorts after the middle one, it must be in the second half, again because the list is sorted. We then take the appropriate half and repeat the process.
In the first iteration we take the entire list.
I've tested the Search method using these data:
List<string> stringList = new List<string>
{
    "Serbia", "Greece", "Egypt", "Peru", "Palau",
    "Slovakia", "Kyrgyzstan", "Mongolia", "Chad"
};
Search("Serbia", stringList);
This way you will search the entire list of ~50,000 elements in at most 16 iterations (log2 50,000 ≈ 15.6).
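If the list is searched many times, a sketch of hoisting the preparation out of the search (here words is assumed to be the ~50,000-entry list and word the query):

```csharp
// Prepare once: lower-case and sort a single time
List<string> prepared = words.Select(w => w.ToLower()).ToList();
prepared.Sort(StringComparer.Ordinal);

// Each subsequent lookup is then O(log n), with no copying or sorting
int index = prepared.BinarySearch(word.ToLower(), StringComparer.Ordinal);
bool found = index >= 0;
```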
Related
I am trying to extract strings which are items of a List. There are 200 lines, and from each I need to extract the strings that are items of the list; the list has 54,474 items. I pass the line (from which list items are to be extracted as substrings, if they exist) and the list to the function as arguments, like this:
private static string FindMatchingSkill(string line, List<string> skillsfromMongoDB)
{
    StringBuilder builtString = new StringBuilder();
    foreach (var item in skillsfromMongoDB)
    {
        string temp = " " + item;
        builtString.Append(line.Substring(line.IndexOf(temp), temp.Length).Trim() + ", ");
    }
    return builtString.ToString();
}
The first thing you want to do is not to substring the original string; instead, print out the item from the list.
Instead of:
Console.WriteLine(line.Substring(line.IndexOf(item), item.Length).Trim() + ", ");
use
Console.Write(item +", ");
But to do that, you need to get only the items that are actually in the string, so your loop should be something like this:
foreach (var item in data.Where(i => line.IndexOf(i) > -1))
That might leave you with some false positives: if the line contains javascript but not java, you will get both, since "javascript" contains "java".
So the next step is to identify what is a full word and what is not. That might be a problem, since dot net is two words but just one item. Also, items in the original string might be followed by characters other than whitespace, like a comma, a dot, a semicolon, etc.
So instead of just using IndexOf, you need to also make sure the item you found is not part of a larger item; and since your list items are not restricted to a single word, that poses a real difficulty.
I would probably suggest something like this:
foreach (var item in data.Where(i =>
{
    int pos = line.IndexOf(i);
    // Bounds checks prevent IndexOutOfRangeException at the start or end of the line
    return pos > -1
        && (pos + i.Length >= line.Length || !Char.IsLetter(line[pos + i.Length]))
        && (pos == 0 || !Char.IsLetter(line[pos - 1]));
}))
{
    Console.Write(item + ", ");
}
Testing the characters just before and just after the item to make sure they're not letters. If one of them is, then it's a false positive. Please note that since your items might contain non-letter characters you might still get false positives: if you have both dot net core and dot net in the list, but the line only has dot net core, you will get a false positive for dot net. However, this is an edge case that I think is probably safe to ignore.
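An alternative sketch using regex word boundaries (note that \b treats only letters, digits and underscore as word characters, so multi-word items such as dot net still need care):

```csharp
// requires: using System.Text.RegularExpressions;
foreach (var item in data)
{
    // Regex.Escape so metacharacters inside the item are matched literally
    var pattern = "\\b" + Regex.Escape(item) + "\\b";
    if (Regex.IsMatch(line, pattern, RegexOptions.IgnoreCase))
    {
        Console.Write(item + ", ");
    }
}
```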
Here is an example:
var searchInLines = new string[200]; // filled with resumes
var dictionary = new string[50000];  // search dictionary

// ToDictionary collects the parallel results safely; calling
// Dictionary.Add from inside a parallel Select is not thread-safe
var result = searchInLines.AsParallel()
    .WithDegreeOfParallelism(Environment.ProcessorCount * 2)
    .ToDictionary(
        searchInLine => searchInLine,
        searchInLine => dictionary.Where(s => searchInLine.Contains(s)).ToArray());
This produces a dictionary mapping each "resume" line to the dictionary items found in it.
If you use the inexact string.Contains, it works quickly, in about 0.2 seconds.
If you use a regex such as Regex.IsMatch(searchInLine, $"\\b{s}\\b") (to match whole words), it works slowly, taking about 30 seconds.
The choice is yours.
Looping through the bigger list is not a good idea. I would suggest looping through the line instead, as it is smaller, assuming there is always a space between words.
List<string> data = new List<string>() { "Delphi", "dot net", "java", "Oracle" };
String line = "Dheeraj has experience in dot net java programming Oracle javascript and Delphi";

foreach (var item in line.Split(new char[] { ' ' }))
{
    // List<T>.Contains is a linear scan; for a large list, use a HashSet<string> instead.
    // NOTE: multi-word entries such as "dot net" can never match a single split token.
    if (data.Contains(item))
    {
        Console.WriteLine(item);
    }
}
In short - I want to convert the first answer to the question here from Python into C#. My current solution to splitting conjoined words is exponential, and I would like a linear solution. I am assuming no spacing and consistent casing in my input text.
Background
I wish to convert conjoined strings such as "wickedweather" into separate words, for example "wicked weather", using C#. I have created a working solution, a recursive function running in exponential time, which is simply not efficient enough for my purposes (processing at least 100 joined words). Here are the questions I have read so far, which I believe may be helpful, but whose responses I cannot translate from Python to C#.
How can I split multiple joined words?
Need help understanding this Python Viterbi algorithm
How to extract literal words from a consecutive string efficiently?
My Current Recursive Solution
This is for people who only want to split a few words (< 50) in C# and don't really care about efficiency.
My current solution works out all possible combinations of words, finds the most probable output and displays. I am currently defining the most probable output as the one which uses the longest individual words - I would prefer to use a different method. Here is my current solution, using a recursive algorithm.
public static string find_words(string instring)
{
    if (words.Contains(instring)) // where words is my dictionary of words
    {
        return instring;
    }
    if (solutions.ContainsKey(instring))
    {
        return solutions[instring];
    }

    string bestSolution = "";
    string solution = "";
    for (int i = 1; i < instring.Length; i++)
    {
        string partOne = find_words(instring.Substring(0, i));
        string partTwo = find_words(instring.Substring(i, instring.Length - i));
        if (partOne == "" || partTwo == "")
        {
            continue;
        }
        solution = partOne + " " + partTwo;
        // If my current solution is shorter than my best solution so far, it used
        // the separating space fewer times, meaning the words are larger
        if (bestSolution == "" || solution.Length < bestSolution.Length)
        {
            bestSolution = solution;
        }
    }
    solutions[instring] = bestSolution;
    return bestSolution;
}
This algorithm relies on having no spacing or other symbols in the entry text (not really a problem here, I'm not fussed about splitting up punctuation). Random additional letters added within the string can cause an error, unless I store each letter of the alphabet as a "word" within my dictionary. This means that "wickedweatherdykjs" would return "wicked weather d y k j s" using the above algorithm, when I would prefer an output of "wicked weather dykjs".
My updated exponential solution:
static List<string> words = File.ReadLines("E:\\words.txt").ToList();
static Dictionary<char, HashSet<string>> compiledWords = buildDictionary(words);
static Dictionary<string, string> solutions = new Dictionary<string, string>(); // memoization cache

private void btnAutoSpacing_Click(object sender, EventArgs e)
{
    string text = txtText.Text;
    text = RemoveSpacingandNewLines(text); // get rid of anything that breaks the algorithm
    if (text.Length > 150)
    {
        // possibly split the text up into more manageable chunks?
        // considering using textSplit() for this.
    }
    else
    {
        txtText.Text = find_words(text);
    }
}

static IEnumerable<string> textSplit(string str, int chunkSize)
{
    return Enumerable.Range(0, str.Length / chunkSize)
        .Select(i => str.Substring(i * chunkSize, chunkSize));
}
private static Dictionary<char, HashSet<string>> buildDictionary(IEnumerable<string> words)
{
    var dictionary = new Dictionary<char, HashSet<string>>();
    foreach (var word in words)
    {
        var key = word[0];
        if (!dictionary.ContainsKey(key))
        {
            dictionary[key] = new HashSet<string>();
        }
        dictionary[key].Add(word);
    }
    return dictionary;
}
public static string find_words(string instring)
{
    // Guard with TryGetValue: a first character that is not a key
    // (e.g. a stray symbol) would otherwise throw KeyNotFoundException
    if (compiledWords.TryGetValue(instring[0], out var bucket) && bucket.Contains(instring))
    {
        return instring;
    }
    if (solutions.ContainsKey(instring))
    {
        return solutions[instring];
    }

    string bestSolution = "";
    string solution = "";
    for (int i = 1; i < instring.Length; i++)
    {
        string partOne = find_words(instring.Substring(0, i));
        string partTwo = find_words(instring.Substring(i, instring.Length - i));
        if (partOne == "" || partTwo == "")
        {
            continue;
        }
        solution = partOne + " " + partTwo;
        if (bestSolution == "" || solution.Length < bestSolution.Length)
        {
            bestSolution = solution;
        }
    }
    solutions[instring] = bestSolution;
    return bestSolution;
}
How I would like to use the Viterbi Algorithm
I would like to create an algorithm which works out the most probable solution to a conjoined string, where the probability is calculated according to the position of the word in a text file that I provide the algorithm with. Let's say the file starts with the most common word in the English language first, and on the next line the second most common, and so on until the least common word in my dictionary. It looks roughly like this
the
be
and
...
attorney
Here is a link to a small example of such a text file I would like to use.
Here is a much larger text file which I would like to use
The logic behind this file positioning is as follows...
It is reasonable to assume that they follow Zipf's law, that is, the word with rank n in the list of words has probability roughly 1/(n log N), where N is the number of words in the dictionary.
Generic Human, in his excellent Python solution, explains this much better than I can. I would like to convert his solution to the problem from Python into C#, but after many hours spent attempting this I haven't been able to produce a working solution.
I also remain open to the idea that perhaps relative frequencies with the Viterbi algorithm isn't the best way to split words, any other suggestions for creating a solution using C#?
Written text is highly contextual, and you may wish to use a Markov chain to model sentence structure in order to estimate joint probability. Unfortunately, sentence structure breaks the Viterbi assumption. There is still hope, though: the Viterbi algorithm is a special case of branch-and-bound optimization, aka "pruned dynamic programming" (something I showed in my thesis), so even when the cost-splicing assumption isn't met, you can still develop cost bounds and prune your population of candidate solutions. But let's set Markov chains aside for now. Assuming that the probabilities are independent and each follows Zipf's law, what you need to know is that the Viterbi algorithm works by accumulating additive costs.
For independent events, joint probability is the product of the individual probabilities, making negative log-probability a good choice for the cost.
So your single-step cost would be -log(P) or log(1/P), which is log(index * log(N)), which is log(index) + log(log(N)), and the latter term is a constant.
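Under these assumptions, a minimal sketch of turning ranks into additive costs (wordList, ordered from most to least frequent, is a hypothetical name):

```csharp
// cost(word) = -log P(word) ≈ log(rank) + log(log N), with 1-based ranks;
// minimizing the sum of costs maximizes the product of probabilities
int n = wordList.Count;
var cost = new Dictionary<string, double>(StringComparer.OrdinalIgnoreCase);
for (int rank = 1; rank <= n; rank++)
{
    cost[wordList[rank - 1]] = Math.Log(rank) + Math.Log(Math.Log(n));
}
```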
Can't help you with the Viterbi algorithm, but I'll give my two cents concerning your current approach. From your code it's not exactly clear what words is. This can be a real bottleneck if you don't choose a good data structure. As a gut feeling I'd initially go with a Dictionary<char, HashSet<string>> where the key is the first letter of each word:
private static Dictionary<char, HashSet<string>> buildDictionary(IEnumerable<string> words)
{
    var dictionary = new Dictionary<char, HashSet<string>>();
    foreach (var word in words)
    {
        var key = word[0];
        if (!dictionary.ContainsKey(key))
        {
            dictionary[key] = new HashSet<string>();
        }
        dictionary[key].Add(word);
    }
    return dictionary;
}
And I'd also consider serializing it to disk to avoid building it up every time.
Not sure how much improvement you can make like this (I don't have full information on your current implementation), but benchmark it and see if you get any improvement.
NOTE: I'm assuming all words are cased consistently.
I have an array of boolean values and need to randomly select a specific quantity of indices for values which are true.
What is the most efficient way to generate the array of indices?
For instance,
BitArray mask = GenerateSomeMask(length: 100000);
int[] randomIndices = RandomIndicesForTrue(mask, quantity: 10);
In this case the length of randomIndices would be 10.
There's a faster way to do this that requires only a single scan of the list.
Consider picking a line at random from a text file when you don't know how many lines are in the file, and the file is too large to fit in memory. The obvious solution is to read the file once to count the lines, pick a random number in the range of 0 to Count-1, and then read the file again up to the chosen line number. That works, but requires you to read the file twice.
A faster solution is to read the first line and save it as the selected line. You replace the selected line with the next line with probability 1/2. When you read the third line, you replace with probability 1/3, etc. When you've read the entire file, you have selected a line at random, and every line had equal probability of being selected. The code looks something like this:
string selectedLine = null;
int numLines = 0;
Random rnd = new Random();
foreach (var line in File.ReadLines(filename))
{
    ++numLines;
    double prob = 1.0 / numLines;
    // Replace the current selection with probability 1/numLines
    if (rnd.NextDouble() < prob)
        selectedLine = line;
}
Now, what if you want to select 2 lines? You select the first two. Then, as each line is read the probability that it will replace one of the two lines is 2/n, where n is the number of lines already read. If you determine that you need to replace a line, you randomly select the line to be replaced. You can follow that same basic idea to select any number of lines at random. For example:
string[] selectedLines = new string[M];
int numLines = 0;
Random rnd = new Random();
foreach (var line in File.ReadLines(filename))
{
    ++numLines;
    if (numLines <= M)
    {
        selectedLines[numLines - 1] = line;
    }
    else
    {
        // Replace one of the M selections with probability M/numLines
        double prob = (double)M / numLines;
        if (rnd.NextDouble() < prob)
        {
            int ix = rnd.Next(M);
            selectedLines[ix] = line;
        }
    }
}
You can apply that to your BitArray quite easily:
int[] selected = new int[quantity];
int num = 0; // number of True items seen
Random rnd = new Random();
for (int i = 0; i < items.Length; ++i)
{
    if (items[i])
    {
        ++num;
        if (num <= quantity)
        {
            selected[num - 1] = i;
        }
        else
        {
            double prob = (double)quantity / num;
            if (rnd.NextDouble() < prob)
            {
                int ix = rnd.Next(quantity);
                selected[ix] = i;
            }
        }
    }
}
You'll need some special case code at the end to handle the case where there aren't quantity set bits in the array, but you'll need that with any solution.
This makes a single pass over the BitArray, and the only extra memory it uses is for the list of selected indexes. I'd be surprised if it wasn't significantly faster than the LINQ version.
Note that I used the probability calculation to illustrate the math. You can change the inner loop code in the first example to:
if (rnd.Next(numLines + 1) == numLines)
{
    selectedLine = line;
}
++numLines;
You can make a similar change to the other examples. That does the same thing as the probability calculation, and should execute a little faster because it eliminates a floating point divide for each item.
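The same integer-only trick, sketched against the BitArray loop above (num and quantity as in that example):

```csharp
if (num <= quantity)
{
    selected[num - 1] = i;
}
// rnd.Next(num) < quantity is true with probability quantity/num,
// avoiding the floating-point divide entirely
else if (rnd.Next(num) < quantity)
{
    selected[rnd.Next(quantity)] = i;
}
```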
There are two families of approaches you can use: deterministic and non-deterministic. The first one involves finding all the eligible elements in the collection and then picking N at random; the second involves randomly reaching into the collection until you have found N eligible items.
Since the size of your collection is not negligible at 100K and you only want to pick a few out of those, at first sight non-deterministic sounds like it should be considered because it can give very good results in practice. However, since there is no guarantee that N true values even exist in the collection, going non-deterministic could put your program into an infinite loop (less catastrophically, it could just take a very long time to produce results).
Therefore I am going to suggest the deterministic approach, even though you are going to pay through the nose in resource usage for the guarantees you need. In particular, the operation will involve building and shuffling an auxiliary collection in place; this will practically undo the nice space savings you got by using BitArray.
Theory aside, let's get to work. The standard way to handle this is:
Filter all eligible indices into an auxiliary collection.
Randomly shuffle the collection with Fisher-Yates (there's a convenient implementation on StackOverflow).
Pick the first N items of the shuffled collection. If there are fewer than N, then your input cannot satisfy your requirements.
Translated into LINQ:
var results = mask.Cast<bool>()           // BitArray is only IEnumerable, so cast first
    .Select((f, i) => Tuple.Create(i, f)) // project into index/bool pairs
    .Where(t => t.Item2)                  // keep only those where bool == true
    .Select(t => t.Item1)                 // extract indices
    .ToList()                             // prerequisite for next step
    .Shuffle()                            // Fisher-Yates (extension method)
    .Take(quantity)                       // pick N
    .ToArray();                           // into an int[]
if (results.Length < quantity)
{
// not enough true values in input
}
If you have 10 indices to choose from, you could generate a random number from 0 to 2^10 - 1 and use that as your mask.
The program helps users to parse a text file by grouping certain parts of the text file into a "sections" array.
So the question is: are there any methods to find out the line numbers/positions within the array? The program uses a foreach loop to read the "sections" array.
May someone please advise on the code? Thanks!
namespace Testing
{
    class Program
    {
        static void Main(string[] args)
        {
            TextReader tr = new StreamReader(@"C:\Test\new.txt");
            String SplitBy = "----------------------------------------";

            // Skip 5 lines of the original text file
            for (var i = 0; i < 5; i++)
            {
                tr.ReadLine();
            }

            // Read the rest of the file
            String fullLog = tr.ReadToEnd();
            String[] sections = fullLog.Split(new string[] { SplitBy }, StringSplitOptions.None);
            //String[] lines = sections.Skip(5).ToArray();

            int t = 0;
            // Tried using foreach (String r in sections.Skip(4)) but that skips sections
            // instead of the text lines found within each section
            foreach (String r in sections)
            {
                Console.WriteLine("The times are : " + t);
                // Is there a way to know or get the "r" line number?
                Console.WriteLine(r);
                Console.WriteLine("============================================================");
                t++;
            }
        }
    }
}
A foreach loop doesn't have a loop counter of any kind. You can keep your own counter:
int number = 1;
foreach (var element in collection) {
    // Do something with element and number
    number++;
}
or, perhaps easier, make use of LINQ's Enumerable.Select that gives you the current index:
var numberedElements = collection.Select((element, index) => new { element, index });
with numberedElements being a collection of anonymous type instances with properties element and index. In the case a file you can do this:
var numberedLines = File.ReadLines(filename)
    .Select((Line, Number) => new { Line, Number });
with the advantage that the whole thing is processed lazily, so it will only read the parts of the file into memory that you actually use.
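A sketch of consuming that lazily numbered sequence (numberedLines as defined above; filename is assumed):

```csharp
foreach (var x in numberedLines)
{
    // Number is zero-based; add 1 for a human-friendly line number
    Console.WriteLine((x.Number + 1) + ": " + x.Line);
}
```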
As far as I know, there is not a way to know which line number you are at within the file. You'd either have to keep track of the lines yourself, or read the file again until you get to that line and count along the way.
Edit:
So you're trying to get the line number of a string inside the array after the master string's been split by the SplitBy?
If there's a specific delimiter in that sub string, you could split it again - although, this might not give you what you're looking for, except...
You're essentially back at square one.
What you could do is try splitting the section string by newline characters. This should spit it out into an array that corresponds with line numbers inside the string.
Yes, you can use a for loop instead of foreach. Also, if you know the file isn't going to be too large, you can read all of the lines into an array with:
string[] lines = File.ReadAllLines(@"C:\Test\new.txt");
Well, don't use a foreach, use a for loop
for (int i = 0; i < sections.Length; ++i)
{
    string section = sections[i];
    int lineNum = i + 1;
}
You can of course maintain a counter when using a foreach loop as well, but there is no reason to since you have the standard for loop at your disposal which is made for this sort of thing.
Of course, this won't necessarily give you the line number of the string in the text file unless you split on Environment.NewLine. You are splitting on a large number of '-' characters and I have no idea how your file is structured. You'll likely end up underestimating the line number because all of the '---' bits will be discarded.
Not as your code is written. You must track the line number for yourself. Problematic areas of your code:
You skip 5 lines at the beginning of your code; you must track this.
Using the Split method, you are potentially "removing" lines from the original collection of lines. You must find a way to know how many splits you have made, because they are an original part of the line count.
Rather than taking the approach you have, I suggest doing the parsing and searching within a classic indexed for-loop that visits each line of the file. This probably means giving up conveniences like Split, and rather looking for markers in the file manually with e.g. IndexOf.
I've got a much simpler solution to the questions after reading through all the answers yesterday.
As the string has a newline after each line, it is possible to split it into a new array, in which the line number can be found from the array position.
The Codes:
foreach (String r in sections)
{
    Console.WriteLine("The times are : " + t);
    IList<String> names = r.Split('\n').ToList<String>();
}
I am fairly new to C# programming and I am stuck on my little ASP.NET project.
My website currently examines Twitter statuses for URLs and then adds those URLs to an array, all via a regular-expression pattern-matching procedure. Clearly more than one person will post a status with a specific URL, so I do not want to list duplicates, and I want to count the number of times a particular URL is mentioned in, say, 100 tweets.
Now I have a List<String> which I can sort so that all duplicate URLs are next to each other. I was under the impression that I could compare list[i] with list[i+1] and if they match, for a counter to be added to (count++), and if they don't match, then for the URL and the count value to be added to a new array, assuming that this is the end of the duplicates.
This would remove duplicates and give me a count of the number of occurrences for each URL. At the moment, what I have is not working, and I do not know why (like I say, I am not very experienced with it all).
With the code below, assume that a JSON feed has been searched for using a keyword into srchResponse.results. The results with URLs in them get added to sList, a string List type, which contains only the URLs, not the message as a whole.
I want to put one of each URL (no duplicates), a count integer (to string) for the number of occurrences of a URL, and the username, message, and user image URL all into my jagged array called 'urls[100][]'. I have made the array 100 rows long to make sure everything can fit but generally, this is too big. Each 'row' will have 5 elements in them.
The debugger gets stuck on the line: if (sList[i] == sList[i + 1]) which is the crux of my idea, so clearly the logic is not working. Any suggestions or anything will be seriously appreciated!
Here is sample code:
var sList = new ArrayList();
string[][] urls = new string[100][];
int ctr = 0;
int j = 1;

foreach (Result res in srchResponse.results)
{
    string content = res.text;
    string pattern = @"((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+[\w\d:##%/;$()~_?\+-=\\\.&]*)";
    MatchCollection matches = Regex.Matches(content, pattern);
    foreach (Match match in matches)
    {
        GroupCollection groups = match.Groups;
        sList.Add(groups[0].Value.ToString());
    }
}
sList.Sort();

foreach (Result res in srchResponse.results)
{
    for (int i = 0; i < 100; i++)
    {
        if (sList[i] == sList[i + 1])
        {
            j++;
        }
        else
        {
            urls[ctr][0] = sList[i].ToString();
            urls[ctr][1] = j.ToString();
            urls[ctr][2] = res.text;
            urls[ctr][3] = res.from_user;
            urls[ctr][4] = res.profile_image_url;
            ctr++;
            j = 1;
        }
    }
}
The code then goes on to add each result into a StringBuilder method with the HTML.
The description of your algorithm seems fine. I don't know what's wrong with the implementation; I haven't read it that carefully. (The fact that you are using an ArrayList is an immediate red flag; why aren't you using a more strongly typed generic collection?)
However, I have a suggestion. This is exactly the sort of problem that LINQ was intended to solve. Instead of writing all that error-prone code yourself, just describe the transformation you're interested in, and let the compiler work it out for you.
Suppose you have a list of strings and you wish to determine the number of occurrences of each:
var notes = new[] { "Do", "Fa", "La", "So", "Mi", "Do", "Re" };
var counts = from note in notes
             group note by note into g
             select new { Note = g.Key, Count = g.Count() };
foreach (var count in counts)
    Console.WriteLine("Note {0} occurs {1} times.", count.Note, count.Count);
Which I hope you agree is much easier to read than all that array logic you wrote. And of course, now you have your sequence of unique items; you have a sequence of counts, and each count contains a unique Note.
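Applied to the original problem, a hedged sketch (assuming the URLs are in a List<string> rather than the ArrayList in the question):

```csharp
var urlCounts = sList
    .GroupBy(url => url)
    .Select(g => new { Url = g.Key, Count = g.Count() })
    .OrderByDescending(x => x.Count);

foreach (var u in urlCounts)
    Console.WriteLine("{0} appeared {1} times.", u.Url, u.Count);
```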
I'd recommend using a more sophisticated data structure than an array. A set will guarantee that you have no duplicates. .NET 3.5 added HashSet<T> for exactly this; on older frameworks there are 3rd-party set implementations available.
Your loop fails because when i == 99, (i + 1) == 100 which is outside the bounds of your array.
But as others have pointed out, .NET 3.5 has ways of doing what you want more elegantly.
If you don't need to know how many duplicates a specific entry has, you could use the LINQ extension methods: .Distinct().Count() gives the number of unique entries, which you can compare with .Count() for the total.