I'm trying to set up some sample data for a Naive Bayesian Classifier for Twitter.
One of the post-processing of the tweets I'd like to do is to remove unnecessary repeat characters.
For example, one of the tweets reads: Twizzlers. mmmmm goooooooooooood!
I'd like to reduce the number of w's down to just two. Why two? That's what the article I'm following did. Any individual word that is less than 2 characters is discarded (see mmmmm above). And as far as gooooooood, I would imagine double letters are the most common to be uber repeated.
So, that said, what's the fastest way (in terms of execution time) to reduce words such as gooooooooood to simply good?
[Edit]
I'll be processing 800,000 tweets in this app, hence the requirement for fastest execution
[/Edit]
[Edit2]
I just ran some simple benchmarking based on elapsed time to iterate through 1000 records & save to a text file. I repeated this iteration 100 times on each method. The average results are here:
Method 1: 386 ms [LINQ - answer was deleted]
Method 2: 407 ms [Regex]
Method 3: 303 ms [StringBuilder]
Method 4: 301 ms [StringBuilder part 2]
Method 1: LINQ (answer was apparently deleted)
static string doIt(string a)
{
var l = a.Select((p, i) => new { ch = p, index = i }).
Where(p => (p.index < a.Length - 2) && (a[p.index + 1] == p.ch) && (a[p.index + 2] == p.ch))
.Select(p => p.index).ToList();
l.Sort();
l.Reverse();
l.ForEach(i => a = a.Remove(i, 1));
return a;
}
METHOD 2:
Regex.Replace(tweet,#"(\S)\1{2,}","$1$1");
Method 3:
static string StringB(string s)
{
string input = s;
StringBuilder sb = new StringBuilder();
for (int i = 0; i < input.Length; i++)
{
if (i < 2 || input[i] != input[i - 1] || input[i] != input[i - 2])
sb.Append(input[i]);
}
string output = sb.ToString();
return output;
}
Method 4:
static string sb2(string s)
{
string input = s;
var sb = new StringBuilder(input);
char p2 = '\0';
char p1 = '\0';
int pos = 0, len = sb.Length;
while (pos < len)
{
if (p2 == p1) for (; pos < len && (sb[pos] == p2); len--)
sb.Remove(pos, 1);
if (pos < len)
{
p2 = p1;
p1 = sb[pos];
pos++;
}
}
return sb.ToString();
}
Regexen look to be the simplest. Simple proof of concept in the REPL:
using System.Text.RegularExpressions;
var regex = new Regex(#"(\S)\1{2,}"); // or #"([aeiouy])\1{2,}" etc?
regex.Replace("mmmmm gooood griieeeeefff", "$1$1");
-->
"mm good griieeff"
For raw performance, use something more like this: see it live on https://ideone.com/uWG68
using System;
using System.Text;
class Program
{
public static void Main(string[] args)
{
string input = "mmmm gooood griiiiiiiiiieeeeeeefffff";
var sb = new StringBuilder(input);
char p2 = '\0';
char p1 = '\0';
int pos = 0, len=sb.Length;
while (pos < len)
{
if (p2==p1) for (; pos<len && (sb[pos]==p2); len--)
sb.Remove(pos, 1);
if (pos<len)
{
p2=p1;
p1=sb[pos];
pos++;
}
}
Console.WriteLine(sb);
}
}
This is also (easily) doable via a regular expression:
var re = #"((.)\2)\2*";
Regex.Replace("god", re, "$1") // god
Regex.Replace("good", re, "$1") // good
Regex.Replace("gooood", re, "$1") // good
Is it faster than the other approaches? Well, that's for the benchmarks ;-) Regular expressions can be quite efficient in non-degenerate backtracking situations. The above may need to be altered (this will also match spaces for instance), but it's a small example.
Happy coding.
I would recommend looking into NLP solutions rather than C#/regex. In that world, python is preferred. See NLTK. I would recommend Nodebox Linguistics which gives you spelling corrections. You can even stem words and even go down to the infinitive.
I agree with the comments that this will not work in the general case, especially in "Twitter speak". Having said that the rules you mentioned are simple - eliminate every character that is the same as the previous two characters:
string input = "goooooooooooood";
StringBuilder sb = new StringBuilder(input.Length);
sb.Append(input.Substring(0, 2));
for (int i = 2; i < input.Length; i++)
{
if (input[i] != input[i - 1] || input[i] != input[i - 2])
sb.Append(input[i]);
}
string output = sb.ToString();
Related
I know how to do a string split if there's a letter, number, that I want to replace.
But how could I do a string.Split() by 2 char counts without replacing any existing letters, number, etc...?
Example:
string MAC = "00122345"
I want that string to output: 00:12:23:45
You could create a LINQ extension method to give you an IEnumerable<string> of parts:
public static class Extensions
{
public static IEnumerable<string> SplitNthParts(this string source, int partSize)
{
if (string.IsNullOrEmpty(source))
{
throw new ArgumentException("String cannot be null or empty.", nameof(source));
}
if (partSize < 1)
{
throw new ArgumentException("Part size has to be greater than zero.", nameof(partSize));
}
return Enumerable
.Range(0, (source.Length + partSize - 1) / partSize)
.Select(pos => source
.Substring(pos * partSize,
Math.Min(partSize, source.Length - pos * partSize)));
}
}
Usage:
var strings = new string[] {
"00122345",
"001223453"
};
foreach (var str in strings)
{
Console.WriteLine(string.Join(":", str.SplitNthParts(2)));
}
// 00:12:23:45
// 00:12:23:45:3
Explanation:
Use Enumerable.Range to get number of positions to slice string. In this case its the length of the string + chunk size - 1, since we need to get a big enough range to also fit leftover chunk sizes.
Enumerable.Select each position of slicing and get the startIndex using String.Substring using the position multiplied by 2 to move down the string every 2 characters. You will have to use Math.Min to calculate the smallest size leftover size if the string doesn't have enough characters to fit another chunk. You can calculate this by the length of the string - current position * chunk size.
String.Join the final result with ":".
You could also replace the LINQ query with yield here to increase performance for larger strings since all the substrings won't be stored in memory at once:
for (var pos = 0; pos < source.Length; pos += partSize)
{
yield return source.Substring(pos, Math.Min(partSize, source.Length - pos));
}
You can use something like this:
string newStr= System.Text.RegularExpressions.Regex.Replace(MAC, ".{2}", "$0:");
To trim the last colon, you can use something like this.
newStr.TrimEnd(':');
Microsoft Document
Try this way.
string MAC = "00122345";
MAC = System.Text.RegularExpressions.Regex.Replace(MAC,".{2}", "$0:");
MAC = MAC.Substring(0,MAC.Length-1);
Console.WriteLine(MAC);
A quite fast solution, 8-10x faster than the current accepted answer (regex solution) and 3-4x faster than the LINQ solution
public static string Format(this string s, string separator, int length)
{
StringBuilder sb = new StringBuilder();
for (int i = 0; i < s.Length; i += length)
{
sb.Append(s.Substring(i, Math.Min(s.Length - i, length)));
if (i < s.Length - length)
{
sb.Append(separator);
}
}
return sb.ToString();
}
Usage:
string result = "12345678".Format(":", 2);
Here is a one (1) line alternative using LINQ Enumerable.Aggregate.
string result = MAC.Aggregate("", (acc, c) => acc.Length % 3 == 0 ? acc += c : acc += c + ":").TrimEnd(':');
An easy to understand and simple solution.
This is a simple fast modified answer in which you can easily change the split char.
This answer also checks if the number is even or odd , to make the suitable string.Split().
input : 00122345
output : 00:12:23:45
input : 0012234
output : 00:12:23:4
//The List that keeps the pairs
List<string> MACList = new List<string>();
//Split the even number into pairs
for (int i = 1; i <= MAC.Length; i++)
{
if (i % 2 == 0)
{
MACList.Add(MAC.Substring(i - 2, 2));
}
}
//Make the preferable output
string output = "";
for (int j = 0; j < MACList.Count; j++)
{
output = output + MACList[j] + ":";
}
//Checks if the input string is even number or odd number
if (MAC.Length % 2 == 0)
{
output = output.Trim(output.Last());
}
else
{
output += MAC.Last();
}
//input : 00122345
//output : 00:12:23:45
//input : 0012234
//output : 00:12:23:4
I need to remove all additional spaces in a string.
I use regex for matching strings and matched strings i replace with some others.
For better understanding please see examples below:
3 input strings:
Hello, how are you?
Hello , how are you?
Hello , how are you ?
This are 3 strings that should match by one pattern-regex.
It looks something like this:
Hello\s*,\s+how\s+are\s+you\s*?
It works fine but there is a perfomance problem.
If I have a lot of patterns (~20k) and try to execute each pattern it runs very slow (3-5 minutes).
Maybe there is better way for doing this?
for example use some 3d-party libs?
UPD: Folks, this question is not about how to do this. It's about how to do this with best perfomance. :)
Let me explain more detailed. The main goal is tokenize text. (replace some token with special symbols)
For example I have a token "nice try".
Then I input text "this is nice try".
result: "this is #tokenizedtext#" where #tokenizedtext# some special symbols. It doesen't matter in this case.
Next I have string "Mike said it was a nice try".
result should be "Mike said it was a #tokenizedtext#".
I think the main idea is clear.
So I can have a lot of tokens. When I process it I convert my token from "nice try" to pattern "nice\s+try". and try to replace with this pattern input text.
It works fine. But if in tokens there is more spaces and there is also punctuation then my regexes became bigger and works very slow.
Do you have some suggestions (technical or logic) for solving this problem?
I can suggest a few solutions.
First of all, avoid the static Regex method. Create an instance of it (and store it, don't call the constructor for each replacement!) and, if possible, use RegexOptions.Compiled. It should improve your performance.
Second, you can try to review your pattern. I'll do some profiling, but I'm currently undecisive between:
#"(?<=\s)\s+"
With replacement being an empty string or:
#"\s+"
With a space as a replacement. You can try this code, in the meanwhile:
var s = "Hello , how are you?";
var pattern = #"\s+";
var regex = new Regex(pattern, RegexOptions.Compiled);
var replaced = regex.Replace(s, " ");
EDIT: After having done some measurement, the second pattern seems to be faster. I'm editing my sample to adapt it.
EDIT 2: I've written an unsafe method. It's much faster than the other ones presented here, including the Regex ones, but, as the word itself says, it's unsafe. I don't think that there's any problem with the code I've written but I may be wrong -- So please, check it again and again in case there's a bug in the method.
static unsafe string TrimInternal(string input)
{
var length = input.Length;
var array = stackalloc char[length];
fixed (char* fix = input)
{
var ptr = fix;
var counter = 0;
var lastWasSpace = false;
while (*ptr != '\x0')
{
//Current char is a space?
var isSpace = *ptr == ' ';
//If it's a space but the last one wasn't
//Or if it's not a space
if (isSpace && !lastWasSpace || !isSpace)
//Write into the result array
array[counter++] = *ptr;
//The last character (before the next loop) was a space
lastWasSpace = isSpace;
//Increase the pointer
ptr++;
}
return new string(array, 0, counter);
}
}
Usage (compile with /unsafe):
var s = TrimInternal("Hello , how are you?");
Profiling made in Release build, optimizations on, 1000000 iterations:
My above solution with Regex: 00:00:03.2130121
The unsafe solution: 00:00:00.2063467
This might work for you. It should be pretty fast. Note that it also removes spaces at the end of the string; that might not be what you want...
using System;
namespace Demo
{
class Program
{
static void Main(string[] args)
{
Console.WriteLine(">{0}<", RemoveExtraSpaces("Hello, how are you?"));
Console.WriteLine(">{0}<", RemoveExtraSpaces("Hello , how are you?"));
Console.WriteLine(">{0}<", RemoveExtraSpaces("Hello , how are you ?"));
}
public static string RemoveExtraSpaces(string text)
{
var buffer = new char[text.Length];
bool isSpaced = false;
int n = 0;
foreach (char c in text)
{
if (c == ' ')
{
isSpaced = true;
}
else
{
if (isSpaced)
{
if ((c != ',') && (c != '?'))
{
buffer[n++] = ' ';
}
isSpaced = false;
}
buffer[n++] = c;
}
}
return new string(buffer, 0, n);
}
}
}
Something of my own :
find all the position of WhiteSpacechar in string;
private static IEnumerable<int> GetWhiteSpacePos(string input)
{
int iPos = -1;
while ((iPos = input.IndexOf(" ", iPos + 1, StringComparison.Ordinal)) > -1)
{
yield return iPos;
}
}
Remove all whitespace that are in in sequence Returned from GetWhiteSpacePos
string original_string = "Hello , how are you ?";
var poss = GetWhiteSpacePos(original_string).ToList();
int startPos;
int endPos;
StringBuilder builder = new StringBuilder(original_string);
for (int i = poss.Count -1; i > 1; i--)
{
endPos = poss[i];
while ((poss[i] == poss[i - 1] + 1) && i > 1)
{
i--;
}
startPos = poss[i];
if (endPos - startPos > 1)
{
builder.Remove(startPos, endPos - startPos);
}
}
string new_string = builder.ToString();
You are using a very complex regex..simplify the regex and that would definitely increasre the performance
Use \s+ and replace it with a single space
Well, these kind of problems really trouble us. Use this code, and I'm sure you're getting the result for what you've asked. This command removes any extra white space between any string.
cleanString= Regex.Replace(originalString, #"\s", " ");
Hope thar works for you. Thanks.
And since this is a single Instruction. It will utilize less CPU resource and hence less CPU time, which ultimately increases your performance. Therefore A/C to me this method works the best when compared in terms of performance.
if its just a matter of SPACE;
try this
Source : http://www.codeproject.com/Articles/10890/Fastest-C-Case-Insenstive-String-Replace
private static string ReplaceEx(string original,
string pattern, string replacement)
{
int count, position0, position1;
count = position0 = position1 = 0;
string upperString = original.ToUpper();
string upperPattern = pattern.ToUpper();
int inc = (original.Length / pattern.Length) *
(replacement.Length - pattern.Length);
char[] chars = new char[original.Length + Math.Max(0, inc)];
while ((position1 = upperString.IndexOf(upperPattern,
position0)) != -1)
{
for (int i = position0; i < position1; ++i)
chars[count++] = original[i];
for (int i = 0; i < replacement.Length; ++i)
chars[count++] = replacement[i];
position0 = position1 + pattern.Length;
}
if (position0 == 0) return original;
for (int i = position0; i < original.Length; ++i)
chars[count++] = original[i];
return new string(chars, 0, count);
}
Usage:
string original_string = "Hello , how are you ?";
while (original_string.Contains(" "))
{
original_string = ReplaceEx(original_string, " ", " ");
}
Replacing the regex way:
string resultString = null;
try {
resultString = Regex.Replace(subjectString, #"\s+", " ", RegexOption.Compiled);
} catch (ArgumentException ex) {
// Syntax error in the regular expression
}
I need to remove characters from a string that aren't in the Ascii range from 32 to 175, anything else have to be removed.
I doesn't known well if RegExp can be the best solution instead of using something like .replace() or .remove() pasing each invalid character or something else.
Any help will be appreciated.
You can use
Regex.Replace(myString, #"[^\x20-\xaf]+", "");
The regex here consists of a character class ([...]) consisting of all characters not (^ at the start of the class) in the range of U+0020 to U+00AF (32–175, expressed in hexadecimal notation). As far as regular expressions go this one is fairly basic, but may puzzle someone not very familiar with it.
But you can go another route as well:
new string(myString.Where(c => (c >= 32) && (c <= 175)).ToArray());
This probably depends mostly on what you're more comfortable with reading. Without much regex experience I'd say the second one would be clearer.
A few performance measurements, 10000 rounds each, in seconds:
2000 characters, the first 143 of which are between 32 and 175
Regex without + 4.1171
Regex with + 0.4091
LINQ, where, new string 0.2176
LINQ, where, string.Join 0.2448
StringBuilder (xanatos) 0.0355
LINQ, horrible (HatSoft) 0.4917
2000 characters, all of which are between 32 and 175
Regex without + 0.4076
Regex with + 0.4099
LINQ, where, new string 0.3419
LINQ, where, string.Join 0.7412
StringBuilder (xanatos) 0.0740
LINQ, horrible (HatSoft) 0.4801
So yes, my approaches are the slowest :-). You should probably go with xanatos' answer and wrap that in a method with a nice, clear name. For inline usage or quick-and-dirty things or where performance does not matter, I'd probably use the regex.
Because I think that if you don't know how to write a Regex you shouldn't use it, especially for something so simple:
var sb = new StringBuilder();
foreach (var c in str)
{
if (c >= 32 && c <= 175)
{
sb.Append(c);
}
}
var str2 = str.ToString();
Use regex [^\x20-\xAF]+ and replace it by empty string ""
Regex.Replace(str, #"[^\x20-\xAF]+", "");
How about using linq this way
string text = (from c in "AAA hello aaaa #### Y world"
let i = (int) c where i < 32 && i > 175 select c)
.Aggregate("", (current, c) => current + c);
static unsafe string TrimRange(string str, char from, char to)
{
int count = 0;
for (int i = 0; i < str.Length; i++)
{
char ch = str[i];
if ((ch >= from) && (ch <= to))
{
count++;
}
}
if (count == 0)
return String.Empty;
if (count == str.Length)
return str;
char * result = stackalloc char[count];
count = 0;
for (int i = 0; i < str.Length; i++)
{
char ch = str[i];
if ((ch >= from) && (ch <= to))
{
result[count ++] = ch;
}
}
return new String(result, 0, count);
}
I have a UTF-8 formatted data file that contains thousands of floating point numbers. At the time it was designed the developers decided to omit the 'e' in the exponential notation to save space. Therefore the data looks like:
1.85783+16 0.000000+0 1.900000+6-3.855418-4 1.958263+6 7.836995-4
-2.000000+6 9.903130-4 2.100000+6 1.417469-3 2.159110+6 1.655700-3
2.200000+6 1.813662-3-2.250000+6-1.998687-3 2.300000+6 2.174219-3
2.309746+6 2.207278-3 2.400000+6 2.494469-3 2.400127+6 2.494848-3
-2.500000+6 2.769739-3 2.503362+6 2.778185-3 2.600000+6 3.020353-3
2.700000+6 3.268572-3 2.750000+6 3.391230-3 2.800000+6 3.512625-3
2.900000+6 3.750746-3 2.952457+6 3.872690-3 3.000000+6 3.981166-3
3.202512+6 4.437824-3 3.250000+6 4.542310-3 3.402356+6 4.861319-3
The problem is float.Parse() will not work with this format. The intermediate solution I had was,
protected static float ParseFloatingPoint(string data)
{
int signPos;
char replaceChar = '+';
// Skip over first character so that a leading + is not caught
signPos = data.IndexOf(replaceChar, 1);
// Didn't find a '+', so lets see if there's a '-'
if (signPos == -1)
{
replaceChar = '-';
signPos = data.IndexOf('-', 1);
}
// Found either a '+' or '-'
if (signPos != -1)
{
// Create a new char array with an extra space to accomodate the 'e'
char[] newData = new char[EntryWidth + 1];
// Copy from string up to the sign
for (int i = 0; i < signPos; i++)
{
newData[i] = data[i];
}
// Replace the sign with an 'e + sign'
newData[signPos] = 'e';
newData[signPos + 1] = replaceChar;
// Copy the rest of the string
for (int i = signPos + 2; i < EntryWidth + 1; i++)
{
newData[i] = data[i - 1];
}
return float.Parse(new string(newData), NumberStyles.Float, CultureInfo.InvariantCulture);
}
else
{
return float.Parse(data, NumberStyles.Float, CultureInfo.InvariantCulture);
}
}
I can't call a simple String.Replace() because it will replace any leading negative signs. I could use substrings but then I'm making LOTS of extra strings and I'm concerned about the performance.
Does anyone have a more elegant solution to this?
string test = "1.85783-16";
char[] signs = { '+', '-' };
int decimalPos = test.IndexOf('.');
int signPos = test.LastIndexOfAny(signs);
string result = (signPos > decimalPos) ?
string.Concat(
test.Substring(0, signPos),
"E",
test.Substring(signPos)) : test;
float.Parse(result).Dump(); //1.85783E-16
The ideas I'm using here ensure the decimal comes before the sign (thus avoiding any problems if the exponent is missing) as well as using LastIndexOf() to work from the back (ensuring we have the exponent if one existed). If there is a possibility of a prefix "+" the first if would need to include || signPos < decimalPos.
Other results:
"1.85783" => "1.85783"; //Missing exponent is returned clean
"-1.85783" => "-1.85783"; //Sign prefix returned clean
"-1.85783-3" => "-1.85783e-3" //Sign prefix and exponent coexist peacefully.
According to the comments a test of this method shows only a 5% performance hit (after avoiding the String.Format(), which I should have remembered was awful). I think the code is much clearer: only one decision to make.
In terms of speed, your original solution is the fastest I've tried so far (#Godeke's is a very close second). #Godeke's has a lot of readability, for only a minor amount of performance degradation. Add in some robustness checks, and his may be the long term way to go. In terms of robustness, you can add that in to yours like so:
static char[] signChars = new char[] { '+', '-' };
static float ParseFloatingPoint(string data)
{
if (data.Length != EntryWidth)
{
throw new ArgumentException("data is not the correct size", "data");
}
else if (data[0] != ' ' && data[0] != '+' && data[0] != '-')
{
throw new ArgumentException("unexpected leading character", "data");
}
int signPos = data.LastIndexOfAny(signChars);
// Found either a '+' or '-'
if (signPos > 0)
{
// Create a new char array with an extra space to accomodate the 'e'
char[] newData = new char[EntryWidth + 1];
// Copy from string up to the sign
for (int ii = 0; ii < signPos; ++ii)
{
newData[ii] = data[ii];
}
// Replace the sign with an 'e + sign'
newData[signPos] = 'e';
newData[signPos + 1] = data[signPos];
// Copy the rest of the string
for (int ii = signPos + 2; ii < EntryWidth + 1; ++ii)
{
newData[ii] = data[ii - 1];
}
return Single.Parse(
new string(newData),
NumberStyles.Float,
CultureInfo.InvariantCulture);
}
else
{
Debug.Assert(false, "data does not have an exponential? This is odd.");
return Single.Parse(data, NumberStyles.Float, CultureInfo.InvariantCulture);
}
}
Benchmarks on my X5260 (including the times to just grok out the individual data points):
Code Average Runtime Values Parsed
--------------------------------------------------
Nothing (Overhead) 13 ms 0
Original 50 ms 150000
Godeke 60 ms 150000
Original Robust 56 ms 150000
Thanks Godeke for your contiually improving edits.
I ended up changing the parameters of the parsing function to take a char[] rather than a string and used your basic premise to come up with the following.
protected static float ParseFloatingPoint(char[] data)
{
int decimalPos = Array.IndexOf<char>(data, '.');
int posSignPos = Array.LastIndexOf<char>(data, '+');
int negSignPos = Array.LastIndexOf<char>(data, '-');
int signPos = (posSignPos > negSignPos) ? posSignPos : negSignPos;
string result;
if (signPos > decimalPos)
{
char[] newData = new char[data.Length + 1];
Array.Copy(data, newData, signPos);
newData[signPos] = 'E';
Array.Copy(data, signPos, newData, signPos + 1, data.Length - signPos);
result = new string(newData);
}
else
{
result = new string(data);
}
return float.Parse(result, NumberStyles.Float, CultureInfo.InvariantCulture);
}
I changed the input to the function from string to char[] because I wanted to move away from ReadLine(). I'm assuming this would perform better then creating lots of strings. Instead I get a fixed number of bytes from the data file (since it will ALWAYS be 11 char width data), converting the byte[] to char[], and then performing the above processing to convert to a float.
Could you possibly use a regular expression to pick out each occurrence?
Some information here on suitable expresions:
http://www.regular-expressions.info/floatingpoint.html
Why not just write a simple script to reformat the data file once and then use float.Parse()?
You said "thousands" of floating point numbers, so even a terribly naive approach will finish pretty quickly (if you said "trillions" I would be more hesitant), and code that you only need to run once will (almost) never be performance critical. Certainly it would take less time to run then posting the question to SO takes, and there's much less opportunity for error.
The question is complicated but I will explain it in details.
The goal is to make a function which will return next "step" of the given string.
For example
String.Step("a"); // = "b"
String.Step("b"); // = "c"
String.Step("g"); // = "h"
String.Step("z"); // = "A"
String.Step("A"); // = "B"
String.Step("B"); // = "C"
String.Step("G"); // = "H"
Until here its quite easy, But taking in mind that input IS string it can contain more than 1 characters and the function must behave like this.
String.Step("Z"); // = "aa";
String.Step("aa"); // = "ab";
String.Step("ag"); // = "ah";
String.Step("az"); // = "aA";
String.Step("aA"); // = "aB";
String.Step("aZ"); // = "ba";
String.Step("ZZ"); // = "aaa";
and so on...
This doesn't exactly need to extend the base String class.
I tried to work it out by each characters ASCII values but got stuck with strings containing 2 characters.
I would really appreciate if someone can provide full code of the function.
Thanks in advance.
EDIT
*I'm sorry I forgot to mention earlier that the function "reparse" the self generated string when its length reaches n.
continuation of this function will be smth like this. for example n = 3
String.Step("aaa"); // = "aab";
String.Step("aaZ"); // = "aba";
String.Step("aba"); // = "abb";
String.Step("abb"); // = "abc";
String.Step("abZ"); // = "aca";
.....
String.Step("zzZ"); // = "zAa";
String.Step("zAa"); // = "zAb";
........
I'm sorry I didn't mention it earlier, after reading some answers I realised that the problem was in question.
Without this the function will always produce character "a" n times after the end of the step.
NOTE: This answer is incorrect, as "aa" should follow after "Z"... (see comments below)
Here is an algorithm that might work:
each "string" represents a number to a given base (here: twice the count of letters in the alphabet).
The next step can thus be computed by parsing the "number"-string back into a int, adding 1 and then formatting it back to the base.
Example:
"a" == 1 -> step("a") == step(1) == 1 + 1 == 2 == "b"
Now your problem is reduced to parsing the string as a number to a given base and reformatting it. A quick googling suggests this page: http://everything2.com/title/convert+any+number+to+decimal
How to implement this?
a lookup table for letters to their corresponding number: a=1, b=2, c=3, ... Y = ?, Z = 0
to parse a string to number, read the characters in reverse order, looking up the numbers and adding them up:
"ab" -> 2*BASE^0 + 1*BASE^1
with BASE being the number of "digits" (2 count of letters in alphabet, is that 48?)
EDIT: This link looks even more promising: http://www.citidel.org/bitstream/10117/20/12/convexp.html
Quite collection of approaches, here is mine:-
The Function:
private static string IncrementString(string s)
{
byte[] vals = System.Text.Encoding.ASCII.GetBytes(s);
for (var i = vals.Length - 1; i >= 0; i--)
{
if (vals[i] < 90)
{
vals[i] += 1;
break;
}
if (vals[i] == 90)
{
if (i != 0)
{
vals[i] = 97;
continue;
}
else
{
return new String('a', vals.Length + 1);
}
}
if (vals[i] < 122)
{
vals[i] += 1;
break;
}
vals[i] = 65;
break;
}
return System.Text.Encoding.ASCII.GetString(vals);
}
The Tests
Console.WriteLine(IncrementString("a") == "b");
Console.WriteLine(IncrementString("z") == "A");
Console.WriteLine(IncrementString("Z") == "aa");
Console.WriteLine(IncrementString("aa") == "ab");
Console.WriteLine(IncrementString("az") == "aA");
Console.WriteLine(IncrementString("aZ") == "ba");
Console.WriteLine(IncrementString("zZ") == "Aa");
Console.WriteLine(IncrementString("Za") == "Zb");
Console.WriteLine(IncrementString("ZZ") == "aaa");
public static class StringStep
{
public static string Next(string str)
{
string result = String.Empty;
int index = str.Length - 1;
bool carry;
do
{
result = Increment(str[index--], out carry) + result;
}
while (carry && index >= 0);
if (index >= 0) result = str.Substring(0, index+1) + result;
if (carry) result = "a" + result;
return result;
}
private static char Increment(char value, out bool carry)
{
carry = false;
if (value >= 'a' && value < 'z' || value >= 'A' && value < 'Z')
{
return (char)((int)value + 1);
}
if (value == 'z') return 'A';
if (value == 'Z')
{
carry = true;
return 'a';
}
throw new Exception(String.Format("Invalid character value: {0}", value));
}
}
Split the input string into columns and process each, right-to-left, like you would if it was basic arithmetic. Apply whatever code you've got that works with a single column to each column. When you get a Z, you 'increment' the next-left column using the same algorithm. If there's no next-left column, stick in an 'a'.
I'm sorry the question is stated partly.
I edited the question so that it meets the requirements, without the edit the function would end up with a n times by step by step increasing each word from lowercase a to uppercase z without "re-parsing" it.
Please consider re-reading the question, including the edited part
This is what I came up with. I'm not relying on ASCII int conversion, and am rather using an array of characters. This should do precisely what you're looking for.
public static string Step(this string s)
{
char[] stepChars = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ".ToCharArray();
char[] str = s.ToCharArray();
int idx = s.Length - 1;
char lastChar = str[idx];
for (int i=0; i<stepChars.Length; i++)
{
if (stepChars[i] == lastChar)
{
if (i == stepChars.Length - 1)
{
str[idx] = stepChars[0];
if (str.Length > 1)
{
string tmp = Step(new string(str.Take(str.Length - 1).ToArray()));
str = (tmp + str[idx]).ToCharArray();
}
else
str = new char[] { stepChars[0], str[idx] };
}
else
str[idx] = stepChars[i + 1];
break;
}
}
return new string(str);
}
This is a special case of a numeral system. It has the base of 52. If you write some parser and output logic you can do any kind of arithmetics an obviously the +1 (++) here.
The digits are "a"-"z" and "A" to "Z" where "a" is zero and "Z" is 51
So you have to write a parser who takes the string and builds an int or long from it. This function is called StringToInt() and is implemented straight forward (transform char to number (0..51) multiply with 52 and take the next char)
And you need the reverse function IntToString which is also implementet straight forward (modulo the int with 52 and transform result to digit, divide the int by 52 and repeat this until int is null)
With this functions you can do stuff like this:
IntToString( StringToInt("ZZ") +1 ) // Will be "aaa"
You need to account for A) the fact that capital letters have a lower decimal value in the Ascii table than lower case ones. B) The table is not continuous A-Z-a-z - there are characters inbetween Z and a.
public static string stepChar(string str)
{
return stepChar(str, str.Length - 1);
}
public static string stepChar(string str, int charPos)
{
return stepChar(Encoding.ASCII.GetBytes(str), charPos);
}
public static string stepChar(byte[] strBytes, int charPos)
{
//Escape case
if (charPos < 0)
{
//just prepend with a and return
return "a" + Encoding.ASCII.GetString(strBytes);
}
else
{
strBytes[charPos]++;
if (strBytes[charPos] == 91)
{
//Z -> a plus increment previous char
strBytes[charPos] = 97;
return stepChar(strBytes, charPos - 1); }
else
{
if (strBytes[charPos] == 123)
{
//z -> A
strBytes[charPos] = 65;
}
return Encoding.ASCII.GetString(strBytes);
}
}
}
You'll probably want some checking in place to ensure that the input string only contains chars A-Za-z
Edit Tidied up code and added new overload to remove redundant byte[] -> string -> byte[] conversion
Proof http://geekcubed.org/random/strIncr.png
This is a lot like how Excel columns would work if they were unbounded. You could change 52 to reference chars.Length for easier modification.
static class AlphaInt {
private static string chars =
"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
public static string StepNext(string input) {
return IntToAlpha(AlphaToInt(input) + 1);
}
public static string IntToAlpha(int num) {
if(num-- <= 0) return "a";
if(num % 52 == num) return chars.Substring(num, 1);
return IntToAlpha(num / 52) + IntToAlpha(num % 52 + 1);
}
public static int AlphaToInt(string str) {
int num = 0;
for(int i = 0; i < str.Length; i++) {
num += (chars.IndexOf(str.Substring(i, 1)) + 1)
* (int)Math.Pow(52, str.Length - i - 1);
}
return num;
}
}
LetterToNum should be be a Function that maps "a" to 0 and "Z" to 51.
NumToLetter the inverse.
long x = "aazeiZa".Aggregate((x,y) => (x*52) + LetterToNum(y)) + 1;
string s = "";
do { // assertion: x > 0
var c = x % 52;
s = NumToLetter() + s;
x = (x - c) / 52;
} while (x > 0)
// s now should contain the result