Erroneous character fixing of strings in C#


I have five strings like the ones below:
ABBCCD
ABBDCD
ABBDCD
ABBECD
ABBDCD
All the strings are basically the same except for the fourth character. Only the character that appears most often at a position should be kept. For example, D appears three times in the fourth position here, so the final string will be ABBDCD. I wrote the following code, but it seems inefficient in terms of time, and this function can be called millions of times. What should I do to improve the performance?
Here changedString is the string to be matched against the other five strings. If any position of changedString does not match the others, the character that occurs most often at that position is placed into changedString.
len is the length of the strings, which is the same for all strings.
for (int i = 0; i < len;i++ )
{
String findDuplicate = string.Empty + changedString[i] + overlapStr[0][i] + overlapStr[1][i] + overlapStr[2][i] +
overlapStr[3][i] + overlapStr[4][i];
char c = findDuplicate.GroupBy(x => x).OrderByDescending(x => x.Count()).First().Key;
if(c!=changedString[i])
{
if (i > 0)
{
changedString = changedString.Substring(0, i) + c +
changedString.Substring(i + 1, changedString.Length - i - 1);
}
else
{
changedString = c + changedString.Substring(i + 1, changedString.Length - 1);
}
}
//string cleanString = new string(findDuplicate.ToCharArray().Distinct().ToArray());
}

I'm not quite sure what you are trying to do, but if it is about sorting strings by some n-th character, then the best way is to use counting sort: http://en.wikipedia.org/wiki/Counting_sort. It is used for sorting arrays of small integers and works well for chars. It runs in linear O(n) time. The main idea is that if you know all your possible elements (it looks like they can only be A-Z here), you can create an additional array and count them. For the fourth position in your example it will be {0, 0, 1, 3, 1, 0, ...} if we use index 0 for 'A', 1 for 'B' and so on.
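The counting idea can be applied directly to the question's problem without any sorting. The following is a minimal sketch, not the poster's code: the class and method names are made up for illustration, and it assumes all inputs have the same length and contain only the letters 'A' to 'Z'.

```csharp
using System;

static class PositionalMajority
{
    // Hypothetical helper: returns a string whose i-th character is the most
    // frequent character at position i across all inputs.
    // Assumes equal-length strings made of 'A'..'Z' only.
    public static string Fix(string[] inputs)
    {
        int len = inputs[0].Length;
        char[] result = new char[len];
        int[] counts = new int[26];            // one slot per letter, reused for every position

        for (int i = 0; i < len; i++)
        {
            Array.Clear(counts, 0, counts.Length);
            int best = 0;
            foreach (string s in inputs)
            {
                int slot = s[i] - 'A';         // 0 for 'A', 1 for 'B', ...
                if (++counts[slot] > best)
                {
                    best = counts[slot];
                    result[i] = s[i];          // on a tie, the first character to reach the count wins
                }
            }
        }
        return new string(result);
    }
}
```

To reproduce the question's setup, changedString would be passed in the same array as the other strings, so that it takes part in the vote.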

Here is a function that might help performance-wise, as it runs five times faster. The idea is to count occurrences yourself: a dictionary maps each character to a slot in a counting array, the value in that slot is incremented, and the result is compared with the highest number of occurrences seen so far. If it is greater, the current character becomes the top character and is stored as the result. This repeats for each string in overlapStr and for each position within the strings. Please read the comments inside the code for details.
string HighestOccurrenceByPosition(string[] overlapStr)
{
int len = overlapStr[0].Length;
// Dictionary transforms character to offset into counting array
Dictionary<char, int> char2offset = new Dictionary<char, int>();
// Counting array. Each character has an entry here
int[] counters = new int[overlapStr.Length];
// Highest occurrence characters found so far
char[] topChars = new char[len];
for (int i = 0; i < len; ++i)
{
char2offset.Clear();
// faster! char2offset = new Dictionary<char, int>();
// Highest number of occurrences at the moment
int highestCount = 0;
// Allocation of counters - as previously unseen character arrives
// it is given a slot at this offset
int lastOffset = 0;
// Current offset into "counters"
int offset = 0;
// Small optimization. As your data seems very similar, this helps
// to reduce number of expensive calls to TryGetValue
// You might need to remove this optimization if you don't have
// unused value of char in your dataset
char lastChar = (char)0;
for (int j = 0; j < overlapStr.Length; ++ j)
{
char thisChar = overlapStr[j][i];
// If this is the same character as last one
// Offset already points to correct cell in "counters"
if (lastChar != thisChar)
{
// Get offset
if (!char2offset.TryGetValue(thisChar, out offset))
{
// First time seen - allocate & initialize cell
offset = lastOffset;
counters[offset] = 0;
// Map character to this cell
char2offset[thisChar] = lastOffset++;
}
// This is now last character
lastChar = thisChar;
}
// increment and get count for character
int charCount = ++counters[offset];
// This is now highestCount.
// TopChars receives current character
if (charCount > highestCount)
{
highestCount = charCount;
topChars[i] = thisChar;
}
}
}
return new string(topChars);
}
P.S. This is certainly not the best solution, but as it is significantly faster than the original I thought I should help out.

Related

Need help finding n amount of Excel Ranges

So I have this situation:
At work I need to make an Excel AddIn which can collect some data from user surveys and show them in a neat little Excel Report. I have the format down however I have trouble figuring out how I find the Excel Ranges needed to showcase the questions that were asked in the survey.
Every question needs to take up three cells each since there are three stats associated with each and that's fine until you reach Z and have to start over with AA, AB, AC, etc. I can't quite wrap my head around it and I feel my current solution is being needlessly complicated. I know that right now there are 13 questions. That's 39 cells I need for the questions total but that could change in the future, or I might have to find smaller reports than all of the 13 questions. I need to make sure my algorithm can take care of both scenarios.
Currently I have this:
const String ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
int alphabetCounter = 0;
int alphabetIndex = 1;
for (int i = 0; i < dict["questions"].Length; i++)
{
String start = "";
String end = "";
if ((alphabetIndex + 1) > ALPHABET.Length)
{
alphabetCounter++;
alphabetIndex = 0;
start += ALPHABET[alphabetCounter - 1] + ALPHABET[alphabetIndex];
}
else
{
start += ALPHABET[alphabetIndex];
alphabetIndex++;
}
if ((alphabetIndex + 1) > ALPHABET.Length)
{
alphabetCounter++;
alphabetIndex = 0;
end += ALPHABET[alphabetIndex];
}
else
{
alphabetIndex++;
end += ALPHABET[alphabetIndex];
}
Excel.Range range = sheet.get_Range(start + "7", end + "7");
questionRanges.Add(range);
}
It's not finished because I ran into a wall here. So just to explain:
ALPHABET is just that. The alphabet. I use that to get the cell letters.
AlphabetCounter is how many times I have gone through the alphabet so in the event that I need to add an extra letter in front of my cells letter (Like the A in AB) I can get that from the ALPHABET string
AlphabetIndex is where in the alphabet I currently am.
I hope you can help me.
How would I go about getting all the ranges I need to accompany the n amount of questions I can get details about?

The trivial solution would be to change
const string ALPHABET = "ABC..."
to
static readonly string[] ColumnNames = { "A", "B", "C", ..., "Z", "AA".. }
(An array cannot be declared const in C#, hence static readonly.) But this doesn't scale well. Think about what happens when you need to add a column: you'd have to add another item to the array, and eventually you'd have 26^2 array entries. Certainly not ideal.
A better solution would be to treat the column index as a base 26 number and convert it using a function like the following:
string GetColumnName(int index)
{
List<char> chars = new List<char>();
while (index >= 0)
{
int current = index % 26;
chars.Add((char)('A' + current));
index = (int)((index - current) / 26) - 1;
}
chars.Reverse();
return new string(chars.ToArray());
}
The function here converts the base by repeatedly calculating the remainder (also known as modulus or %).
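To connect this back to the question, the conversion can be combined with a simple loop that reserves three columns per question. This is only a sketch under assumptions: GetRanges, its parameters, and the choice of a zero-based first column are hypothetical, and the Excel call itself is left out so the example stays self-contained.

```csharp
using System;
using System.Collections.Generic;

static class QuestionColumns
{
    // Zero-based column index to Excel-style letters: 0 -> "A", 25 -> "Z", 26 -> "AA".
    public static string GetColumnName(int index)
    {
        var chars = new List<char>();
        while (index >= 0)
        {
            chars.Add((char)('A' + index % 26));
            index = index / 26 - 1;
        }
        chars.Reverse();
        return new string(chars.ToArray());
    }

    // Hypothetical helper: one (start, end) cell pair per question, three columns each.
    // firstColumn is zero-based (1 means column B); row is the worksheet row number.
    public static List<(string Start, string End)> GetRanges(int questionCount, int firstColumn, int row)
    {
        var ranges = new List<(string Start, string End)>();
        for (int q = 0; q < questionCount; q++)
        {
            int startColumn = firstColumn + q * 3;
            ranges.Add((GetColumnName(startColumn) + row, GetColumnName(startColumn + 2) + row));
        }
        return ranges;
    }
}
```

Each pair could then be passed to sheet.get_Range(pair.Start, pair.End) as in the question. Because the column name is derived from a plain integer, the loop does not need any special handling when it crosses from Z to AA.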

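Just another implementation idea; maybe it is useful:
...
List<char> start = new List<char>();
List<char> end = new List<char>();
// Increment returns the same list instance, so copy it to keep start separate from end
start = new List<char>(Increment(end));
Increment(end);
Increment(end);
// Letters are stored least significant first, so reverse them when building the name
// (Enumerable.Reverse needs using System.Linq)
Excel.Range range = sheet.get_Range(new String(Enumerable.Reverse(start).ToArray()) + "7",
new String(Enumerable.Reverse(end).ToArray()) + "7");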
}
private List<char> Increment(List<char> listColumn, int position=0)
{
if (listColumn.Count > position)
{
listColumn[position]++;
if (listColumn[position] == '[')
{
listColumn[position] = 'A';
Increment(listColumn, ++position);
}
}
else
{
listColumn.Add('A');
}
return listColumn;
}

Unable to extract a substring from a string

I have a long string and I want to pass it to another function in chunks of 250 characters at a time. I have written this code:
var cStart = 0;
var phase = 250;
var cEnd = cStart + phase;
var count = 0;
while (count < 10000)
{
string fileInStringTemp = "";
fileInStringTemp = fileInString.Substring(cStart, cEnd);
var lngth = fileInStringTemp.Length;
//Do Some Work
cStart += phase;
cEnd += phase;
count++;
}
In the first iteration of the loop the value of lngth is 250, which is fine. In the next iteration I also expect it to be 250, because I am extracting the substring from character 250 to 500, but surprisingly the value of lngth in the second iteration is 500.
Why is that? I also tried re-initializing the string variable on every iteration so that it starts empty, but it made no difference.

Substring's second parameter is the length you want, not the stop index.
public string Substring(
int startIndex,
int length
)
So, all you need to do is change your code to have the start index and length (phase)
fileInString.Substring(cStart, phase)
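Note that Substring throws an ArgumentOutOfRangeException when the start index plus the length runs past the end of the string, which can happen on the last chunk. A minimal sketch of a chunking loop that guards against this follows; the Chunker class and Chunks method are illustrative names, not part of the original code.

```csharp
using System;
using System.Collections.Generic;

static class Chunker
{
    // Yields consecutive pieces of at most `size` characters. Assumes size > 0.
    public static IEnumerable<string> Chunks(string text, int size)
    {
        for (int start = 0; start < text.Length; start += size)
        {
            // The last chunk may be shorter than `size`
            int length = Math.Min(size, text.Length - start);
            yield return text.Substring(start, length);
        }
    }
}
```

With this, the loop in the question could become foreach (string chunk in Chunker.Chunks(fileInString, 250)) { ... }, which also removes the fixed limit of 10000 iterations.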

Here is the MSDN link about how to work with Substring:
https://msdn.microsoft.com/en-us/library/aka44szs(v=vs.110).aspx
According to MSDN, the first parameter of the Substring method is startIndex, defined as "the zero-based starting character position of a substring", and the second parameter is length, defined as "the number of characters in the substring".
So you should try this:
var cStart = 0;
var phase = 250;
var count = 0;
while (count < 10000)
{
string fileInStringTemp = "";
fileInStringTemp = fileInString.Substring(cStart, phase);
var lngth = fileInStringTemp.Length;
//Do Some Work
count++;
cStart = phase * count; // no + 1, which would skip one character between chunks
}

Try changing
fileInStringTemp = fileInString.Substring(cStart, cEnd);
to
fileInStringTemp = fileInString.Substring(cStart, phase);

The 2nd parameter to your SubString() method is the length of the substring to return. (You should be able to always use 250 and just keep shifting your starting point - the 1st param - until you are done.)

Substring has the parameters (startIndex, length), so you should not increment the end. Better change it to Substring(cStart, phase).

Help me to understand this c# code

This code is from Beginning C# 3.0: An Introduction to Object Oriented Programming.
It is a program that has the user enter a couple of sentences in a multi-line textbox and then counts how many times each letter occurs in that text.
private const int MAXLETTERS = 26; // Symbolic constants
private const int MAXCHARS = MAXLETTERS - 1;
private const int LETTERA = 65;
.........
private void btnCalc_Click(object sender, EventArgs e)
{
char oneLetter;
int index;
int i;
int length;
int[] count = new int[MAXLETTERS];
string input;
string buff;
length = txtInput.Text.Length;
if (length == 0) // Anything to count??
{
MessageBox.Show("You need to enter some text.", "Missing Input");
txtInput.Focus();
return;
}
input = txtInput.Text;
input = input.ToUpper();
for (i = 0; i < input.Length; i++) // Examine all letters.
{
oneLetter = input[i]; // Get a character
index = oneLetter - LETTERA; // Make into an index
if (index < 0 || index > MAXCHARS) // A letter??
continue; // Nope.
count[index]++; // Yep.
}
for (i = 0; i < MAXLETTERS; i++)
{
buff = string.Format("{0, 4} {1,20}[{2}]", (char)(i + LETTERA)," ",count[i]);
lstOutput.Items.Add(buff);
}
}
I do not understand this line
count[index]++;
and this line of code
buff = string.Format("{0, 4} {1,20}[{2}]", (char)(i + LETTERA)," ",count[i]);

count[index]++; means "add 1 to the value in count at index index". The ++ is specifically known as incrementing. What the code is doing is tallying the number of occurrences of a letter.
buff = string.Format("{0, 4} {1,20}[{2}]", (char)(i + LETTERA)," ",count[i]); is formatting a line of output. With string.Format, you first pass in a format specifier that works like a template or form letter. The parts between { and } specify how the additional arguments passed into string.Format are used. Let me break down the format specification:
{0, 4} The first (index 0) argument (which is the letter, in this case).
The ,4 part means that when it is output, it should occupy 4 columns
of text.
{1,20} The second (index 1) argument (which is a space in this case).
The ,20 is used to force the output to be 20 spaces instead of 1.
{2} The third (index 2) argument (which is the count, in this case).
So when string.Format runs, (char)(i + LETTERA) is used as the first argument and is plugged into the {0} portion of the format. " " is plugged into {1}, and count[i] is plugged into {2}.
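To see the alignment component in isolation, here is a small sketch that builds one output line the same way the book's loop does. FormatDemo and its methods are illustrative names only.

```csharp
using System;

static class FormatDemo
{
    // Same format string as in the question: the letter right-aligned in 4 columns,
    // a single space padded out to 20 columns, then the count in square brackets.
    public static string Line(char letter, int count)
    {
        return string.Format("{0, 4} {1,20}[{2}]", letter, " ", count);
    }

    // A negative alignment pads on the right instead (left-aligned text).
    public static string LeftAligned(char letter)
    {
        return string.Format("{0,-4}|", letter);
    }
}
```

Line('A', 3) produces three spaces, the letter A, one literal space, twenty more spaces, and then [3]. The {1,20} item exists only to push the count to the right.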

count[index]++;
That's a post-increment. It increments the value and returns the value from before the increment, so if you were to save the result of the expression you would get count[index] as it was prior to the increment. As for the variable inside the square brackets, it refers to the element at that index of an array. In other words, if you wanted to talk about the fifth car on the street, you might write something like StreetCars(5). In C# we use square brackets and zero-based indexing, so it would be StreetCars[4]. If you had a Car array called StreetCars, you could reference the fifth Car through that indexed value.
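The difference between the post-increment and pre-increment forms on an array element can be checked with a few lines. This is a throwaway sketch; IncrementDemo is a made-up name.

```csharp
static class IncrementDemo
{
    public static int[] Run()
    {
        int[] count = new int[3];      // all elements start at 0
        int before = count[1]++;       // post-increment: before is 0, count[1] becomes 1
        int after = ++count[1];        // pre-increment: count[1] becomes 2, after is 2
        return new[] { before, after, count[1] };
    }
}
```

In the book's loop the returned value is simply discarded, so count[index]++ acts as "add one to this letter's tally".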
As for the string.Format() method, check out this article.

Breaking Down & Rearranging String into all possible combinations

I want to break down and rearrange a string into all possible combinations
Say I have a String: ABCDEF
I want to break it down and output all possible combinations
Combination(6,6) = 1
ABCDEF
Combination(6,5) = 6
BCDEF
ACDEF
ABDEF
ABCEF
ABCDF
ABCDE
Combination(6,4) = 15
BCDE
ACDE
ABDE
ABCE
....
....
....
etc.
Combination(6,3) = 20
BCD
ACD
...
etc.
Combination(6,2) = 15
BC
AB
etc.
However, the output must also be arranged in alphabetical order.
How will I do this?
Thanks! Any help will be appreciated!

You can get the algorithm (actually a few of them) from Knuth Volume 4, Fascicle 3 but you'll have to convert it from his math notation to C#.
Update: As I think about this more, Fascicle 2 (Generating Permutations) is actually more helpful. You can download it free from http://www-cs-faculty.stanford.edu/~knuth/fasc2b.ps.gz though you'll need gunzip and a PostScript previewer to read it. Generating the subsets of string "ABCDE" is the easy part. Convert it to an array {'A', 'B', 'C', 'D', 'E'}, run a for loop from 0 to 2^N-1 where N is the array length, and treat each value as a bitmask of the elements you're keeping. Thus 00001, 00010, 00011,... gives you "A", "B", "AB",...
The hard part is generating all the permutations of each subset, so you get "ABC", "BAC", "CAB", etc. A brute force algorithm (like in one of the other answers) will work but will get very slow if the string is long. Knuth has some fast algorithms, some of which will generate the permutations in alphabetical order if the original string was sorted in the first place.
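The bitmask loop for the subsets could look roughly like this. It is a sketch, not Knuth's algorithm: the Subsets class is a hypothetical name, the input is assumed to be sorted and shorter than 31 characters, and the final sort (longest first, then alphabetical) is just one way to mirror the grouping shown in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class Subsets
{
    // All non-empty subsets of the letters, keeping the original order inside each subset.
    public static List<string> All(string letters)
    {
        var result = new List<string>();
        int n = letters.Length;                    // assumes n < 31 so the mask fits in an int
        for (int mask = 1; mask < (1 << n); mask++)
        {
            var sb = new StringBuilder();
            for (int bit = 0; bit < n; bit++)
            {
                if ((mask & (1 << bit)) != 0)      // bit set means this letter is kept
                    sb.Append(letters[bit]);
            }
            result.Add(sb.ToString());
        }
        // Longest first, then alphabetical within the same length
        result.Sort((a, b) => a.Length != b.Length
            ? b.Length.CompareTo(a.Length)
            : string.CompareOrdinal(a, b));
        return result;
    }
}
```

For "ABCDEF" this yields 63 strings, which matches 1 + 6 + 15 + 20 + 15 + 6 from the question's Combination counts.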

Well, to expand on my comment, I got past this problem by transforming the string into a hash that doesn't care about the order of the letters. The hash is built by taking each unique letter, then a :, then the number of times that letter occurs.
So test = e:1,s:1,t:2
Then if somebody looks for the word tset, it generates the same hash (e:1,s:1,t:2), and bam, you have a match.
I ran a word list (of about 20 million words), generated a hash for each word, and put it in a MySQL table. I can find all permutations of a word that are still words themselves (e.g. ered will return deer and reed) in seconds.
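A possible way to build such an order-independent key in C# is sketched below. The AnagramKey class and Signature method are hypothetical names; the answer does not say how its hash was actually produced.

```csharp
using System;
using System.Linq;

static class AnagramKey
{
    // Builds "letter:count" pairs in alphabetical order, e.g. "test" -> "e:1,s:1,t:2".
    public static string Signature(string word)
    {
        var parts = word
            .GroupBy(c => c)                        // one group per distinct letter
            .OrderBy(g => g.Key)                    // alphabetical, so letter order in the word is irrelevant
            .Select(g => g.Key + ":" + g.Count());
        return string.Join(",", parts);
    }
}
```

Two words are anagrams of each other exactly when their signatures are equal, so the signature can serve as a lookup key in a dictionary or a database column.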

You can generate each permutation by incrementing a counter and converting the counter value to base n where n is the number of letters in your input. Discard any values containing repeating letters and what you have left are the possible scrabble words in alphabetic order assuming your array was sorted.
You will have to count up to (n^(n-1))*(n+1) to get the e*n! possible scrabble words.
char[] Letters = new char[] { 'A', 'B', 'C', 'D', 'E', 'F' };
// number of words to print: floor(e * n!) - 1, computed iteratively (.NET has no Math.Factorial)
int x = 0;
for (int i = 1; i <= Letters.Length; i++)
x = (x + 1) * i;
for (int i = 1; x > 0; i++)
{
string Word = BaseX(i, Letters.Length, Letters);
if (NoRepeat(Word))
{
Console.WriteLine(Word);
x--;
}
}
BaseX returns the string representation of Value for the given Base and specified Symbols:
string BaseX(int Value, int Base, char[] Symbols)
{
StringBuilder s = new StringBuilder();
while (Value > Base)
{
s.Insert(0, Symbols[Value % Base]);
Value /= Base;
}
s.Insert(0, Symbols[Value - 1]);
return s.ToString();
}
NoRepeat returns false if any letter occurs more than once:
bool NoRepeat(string s)
{
bool[] Test = new bool[256];
foreach (char c in s)
if (Test[(byte)c])
return false;
else
Test[(byte)c] = true;
return true;
}

1. Sort the string in alphabetical order, say ABCDEF (your example).
2. Prepare a map between index and character:
map[0] = 'A'; map[1] = 'B'; ... map[5] = 'F'
3. Now your job is a lot simpler: find all combinations of numbers in which each later number is larger than the former.
Combination(6,3):
for (int i = 0; i < 6 - 2; i++)
for (int j = i + 1; j < 6 - 1; j++)
for (int k = j + 1; k < 6; k++)
{
string strComb = "" + map[i] + map[j] + map[k]; // start from a string so the chars concatenate instead of adding up as numbers
}
This is mainly the idea, you could improve in your own way.
Contact me if you want more detail!
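The nested loops work for a fixed size of three. For an arbitrary size k, the same "each index larger than the previous one" rule can be written recursively. This is a sketch with made-up names (Combinations, Choose, Build), assuming the input string is already sorted.

```csharp
using System;
using System.Collections.Generic;

static class Combinations
{
    // All k-letter combinations of a sorted string, in alphabetical order.
    public static List<string> Choose(string sorted, int k)
    {
        var result = new List<string>();
        Build(sorted, k, 0, new char[k], 0, result);
        return result;
    }

    private static void Build(string sorted, int k, int start, char[] current, int depth, List<string> result)
    {
        if (depth == k)
        {
            result.Add(new string(current));
            return;
        }
        // Stop early enough to leave letters for the remaining slots
        for (int i = start; i <= sorted.Length - (k - depth); i++)
        {
            current[depth] = sorted[i];
            Build(sorted, k, i + 1, current, depth + 1, result);
        }
    }
}
```

Calling Choose("ABCDEF", k) for k from 6 down to 1 would produce the groups listed in the question, each already in alphabetical order.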

You can use this:
static List<string> list = new List<string>();
static string letters = "bcdehijkmnopqrstuvwxyz";
static void Combine(string combinatory)
{
if(combinatory.Length < letters.Length)
{
Parallel.ForEach(letters, l =>
{
if (!combinatory.Contains(l)) Combine(combinatory + l);
});
} else
{
lock (list) { list.Add(combinatory); } // List<T> is not thread-safe inside Parallel.ForEach
Console.WriteLine(combinatory);
}
}
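It adds every arrangement that uses all of the letters to the list. Note that these are permutations rather than combinations, and their number grows factorially with the length of letters.
Then you can use the Sort() method to sort the list.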

What is the most efficient way to detect if a string contains a number of consecutive duplicate characters in C#?

For example, a user entered "I love this post!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!"
the consecutive duplicate exclamation mark "!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!" should be detected.

The following regular expression detects repeated characters. You could raise the threshold or limit the pattern to specific characters to make it more robust.
int threshold = 3;
string stringToMatch = "thisstringrepeatsss";
// Any character followed by at least (threshold - 1) copies of itself
string pattern = "(.)\\1{" + (threshold - 1) + ",}";
Regex r = new Regex(pattern);
Match m = r.Match(stringToMatch);
while(m.Success)
{
Console.WriteLine("character passes threshold " + m.ToString());
m = m.NextMatch();
}

Here's an example of a function that searches for a sequence of consecutive chars of a specified length and ignores white space characters:
public static bool HasConsecutiveChars(string source, int sequenceLength)
{
if (string.IsNullOrEmpty(source))
return false;
if (source.Length == 1)
return false;
int charCount = 1;
for (int i = 0; i < source.Length - 1; i++)
{
char c = source[i];
if (Char.IsWhiteSpace(c))
continue;
if (c == source[i+1])
{
charCount++;
if (charCount >= sequenceLength)
return true;
}
else
charCount = 1;
}
return false;
}
Edit fixed range bug :/

Can be done in O(n) easily: for each character, if the previous character is the same as the current, increment a temporary count. If it's different, reset your temporary count. At each step, update your global if needed.
For abbccc you get:
a => temp = 1, global = 1
b => temp = 1, global = 1
b => temp = 2, global = 2
c => temp = 1, global = 2
c => temp = 2, global = 2
c => temp = 3, global = 3
=> c appears three times in a row. You can extend this to track the starting position fairly easily, after which you can print the "ccc" substring; I'll leave that to you.
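For reference, one way to write this single pass including the start position is sketched below. The Runs class and Longest method are illustrative names, and returning the longest run as a substring is a design choice made for this example.

```csharp
static class Runs
{
    // Returns the longest run of one repeated character, or "" for null or empty input.
    // On ties the earliest run is returned.
    public static string Longest(string s)
    {
        if (string.IsNullOrEmpty(s)) return "";
        int bestStart = 0, bestLen = 1;
        int runStart = 0;                              // where the current run began
        for (int i = 1; i < s.Length; i++)
        {
            if (s[i] != s[i - 1]) runStart = i;        // character changed: a new run starts here
            int runLen = i - runStart + 1;             // the "temp" count from the walkthrough
            if (runLen > bestLen)                      // the "global" count from the walkthrough
            {
                bestLen = runLen;
                bestStart = runStart;
            }
        }
        return s.Substring(bestStart, bestLen);
    }
}
```

Comparing the length of the returned string against a threshold then answers the original question of whether the input contains too many consecutive duplicates.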

Here is a quick solution I crafted with some extra duplicates thrown in for good measure. As others pointed out in the comments, some duplicates are going to be completely legitimate, so you may want to narrow your criteria to punctuation instead of mere characters.
string input = "I loove this post!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!aa";
int index = -1;
int count =1;
List<string> dupes = new List<string>();
for (int i = 0; i < input.Length-1; i++)
{
if (input[i] == input[i + 1])
{
if (index == -1)
index = i;
count++;
}
else if (index > -1)
{
dupes.Add(input.Substring(index, count));
index = -1;
count = 1;
}
}
if (index > -1)
{
dupes.Add(input.Substring(index, count));
}

The better way, in my opinion, is to create an array in which each element is responsible for one pair of identical adjacent characters in the string, e.g. aa, bb, cc, dd. Every element of the array starts at 0.
The solution is then a single for loop over the string that updates the array values.
Afterwards you can analyze the array for whatever you need.
Example: for the string bbaaaccccdab, the result array would be { 2, 1, 3 }, because 'aa' is found two times, 'bb' one time (at the start of the string), and 'cc' three times.
Why three times for 'cc'? Because the matches overlap: 'cc'cc, c'cc'c and cc'cc'.
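A sketch of this pair-counting idea follows. It uses a dictionary keyed by character instead of a fixed array, which is an assumption made here so that no alphabet has to be fixed in advance; PairCounter is a made-up name.

```csharp
using System.Collections.Generic;

static class PairCounter
{
    // Counts overlapping pairs of identical adjacent characters, per character.
    public static Dictionary<char, int> Count(string s)
    {
        var counts = new Dictionary<char, int>();
        for (int i = 1; i < s.Length; i++)
        {
            if (s[i] == s[i - 1])
            {
                counts.TryGetValue(s[i], out int n);   // n is 0 when the key is missing
                counts[s[i]] = n + 1;
            }
        }
        return counts;
    }
}
```

A run of k identical characters contributes k - 1 pairs, so a count of 3 for 'c' corresponds to the run "cccc" in the example.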

Use LINQ! (For everything, not just this)
string test = "aabb";
return test.Where((item, index) => index > 0 && item.Equals(test.ElementAt(index - 1)));
// yields 'a' and 'b': each character that is preceded by the same character
OR
string test = "aabb";
return test.Where((item, index) => index > 0 && item.Equals(test.ElementAt(index - 1))).Any();
// returns true

