How to find text between two tabs - c#

I have a file that looks similar to the following:
Tomas | Nordstrom | Sweden | Europe | World
(the character "|" in the above line represents a tab, new column)
Now I want a string containing only the text in the 4th column.
I have managed to find characters at a fixed position in the line, but that position changes according to the number of characters in each column.
I could really use some input on this.
Thanks in advance.
/Tomas

This can be done using the Split method like this:
string s = "Tomas|Nordstrom|Sweden|Europe|World";
string[] stringArray = s.Split( new string[] { "|" }, StringSplitOptions.None );
Console.WriteLine( stringArray[3] );
This will print out "Europe", because that is located at index 3 in stringArray.
Edit:
The same can be achieved using Regex like this:
string[] stringRegex = Regex.Split( s, @"\|+" );
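The sample above uses "|" as a stand-in; in the actual file the separator is a tab. A minimal sketch of the same Split idea on tab-separated input, assuming a zero-based column index and that a missing column should yield null (`TabColumns` and `GetColumnBySplit` are names made up for this illustration):

```csharp
using System;

public static class TabColumns
{
    // Returns the column at the given zero-based index of a tab-separated line,
    // or null if the line has fewer columns than requested (an assumption of this sketch).
    public static string GetColumnBySplit(string line, int columnIndex)
    {
        string[] columns = line.Split('\t');
        return columnIndex >= 0 && columnIndex < columns.Length
            ? columns[columnIndex]
            : null;
    }
}
```

For the line in the question, `TabColumns.GetColumnBySplit(line, 3)` would give the 4th column.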

The basic algorithm is to iterate over the characters until n-1 tabs have been found, then take the characters up to the next tab or the end of the string.
Depending on requirements, if performance is critical, you might need to implement a scanning algorithm manually.
You might be surprised by how slow string splitting is. Well, it's not slow by itself, but the overall approach requires:
Scanning to the end of the string
Creation of all of the split parts on heap
Collecting garbage
Consider the following benchmark of the two approaches:
void Main()
{
string source = "Tomas\tNordstrom\tSweden\tEurope\tWorld";
var sw = Stopwatch.StartNew();
string result = null;
var n = 100000000;
for (var i = 0; i < n; i++)
{
result = FindBySplitting(source);
}
sw.Stop();
var splittingNsop = (double)sw.ElapsedMilliseconds / n * 1000000.0;
Console.WriteLine("Splitting. {0} ns/op",splittingNsop);
Console.WriteLine(result);
sw.Restart();
for (var i = 0; i < n; i++)
{
result = FindByScanning(source);
}
sw.Stop();
var scanningNsop = (double)sw.ElapsedMilliseconds / n * 1000000.0;
Console.WriteLine("Scanning. {0} ns/op",
scanningNsop);
Console.WriteLine(result);
Console.WriteLine("Scanning over splitting: {0}", splittingNsop / scanningNsop);
}
string FindBySplitting(string s)
{
return s.Split('\t')[3];
}
string FindByScanning(string s)
{
int l = s.Length, p = 0, q = 0, c = 0;
while (c++ < 4 - 1)
while (p < l && s[p++] != '\t')
;
for (q = p; q < l && s[q] != '\t'; q++)
;
return s.Substring(p, q - p);
}
The scanning algorithm implemented in pure C# outperforms the splitting one, which is implemented at a lower level, by a factor of about 4.6 on my laptop:
Splitting. 174.81 ns/op
Europe
Scanning. 37.58 ns/op
Europe
Scanning over splitting: 4.65167642362959
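The same scanning idea can also be written with String.IndexOf, which is arguably easier to read than the hand-rolled loops. This is only a sketch and has not been benchmarked against the versions above; `ColumnScanner`, `GetColumn` and the choice to return null for a missing column are assumptions of this example:

```csharp
using System;

public static class ColumnScanner
{
    // Skips columnIndex separators, then returns the text up to the next
    // separator or the end of the line. Returns null if the line is too short.
    public static string GetColumn(string line, int columnIndex, char separator = '\t')
    {
        int start = 0;
        for (int i = 0; i < columnIndex; i++)
        {
            int sep = line.IndexOf(separator, start);
            if (sep < 0)
            {
                return null; // fewer columns than requested
            }
            start = sep + 1;
        }

        int end = line.IndexOf(separator, start);
        if (end < 0)
        {
            end = line.Length; // last column runs to the end of the line
        }
        return line.Substring(start, end - start);
    }
}
```

Only one substring is allocated per call, which is the property the scanning approach relies on.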

Related

How do you do a string split with 2 chars counts in C#?

I know how to do a string split if there's a letter, number, that I want to replace.
But how could I do a string.Split() by 2 char counts without replacing any existing letters, number, etc...?
Example:
string MAC = "00122345";
I want that string to output: 00:12:23:45
You could create a LINQ extension method to give you an IEnumerable<string> of parts:
public static class Extensions
{
public static IEnumerable<string> SplitNthParts(this string source, int partSize)
{
if (string.IsNullOrEmpty(source))
{
throw new ArgumentException("String cannot be null or empty.", nameof(source));
}
if (partSize < 1)
{
throw new ArgumentException("Part size has to be greater than zero.", nameof(partSize));
}
return Enumerable
.Range(0, (source.Length + partSize - 1) / partSize)
.Select(pos => source
.Substring(pos * partSize,
Math.Min(partSize, source.Length - pos * partSize)));
}
}
Usage:
var strings = new string[] {
"00122345",
"001223453"
};
foreach (var str in strings)
{
Console.WriteLine(string.Join(":", str.SplitNthParts(2)));
}
// 00:12:23:45
// 00:12:23:45:3
Explanation:
Use Enumerable.Range to get the number of positions at which to slice the string. In this case it is (length of the string + chunk size - 1) / chunk size, which rounds up so that the range also covers a leftover chunk.
Enumerable.Select maps each position to a slice: String.Substring starts at the position multiplied by the chunk size (2 here), which moves down the string 2 characters at a time. Math.Min picks the smaller leftover size when the string doesn't have enough characters to fill another chunk; that leftover is the length of the string - current position * chunk size.
String.Join the final result with ":".
You could also replace the LINQ query with a yield-based iterator, which avoids the overhead of the Range/Select pipeline for larger strings; like the LINQ query, it produces the substrings lazily rather than storing them all in memory at once:
for (var pos = 0; pos < source.Length; pos += partSize)
{
yield return source.Substring(pos, Math.Min(partSize, source.Length - pos));
}
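The loop above only compiles inside an iterator method. A minimal sketch of what the complete method might look like, assuming a hypothetical name `SplitNthPartsLazy` and a separate private iterator so that the argument checks still run when the method is called rather than on first enumeration:

```csharp
using System;
using System.Collections.Generic;

public static class ChunkExtensions
{
    // Validates eagerly, then hands off to the lazy iterator below.
    public static IEnumerable<string> SplitNthPartsLazy(this string source, int partSize)
    {
        if (string.IsNullOrEmpty(source))
        {
            throw new ArgumentException("String cannot be null or empty.", nameof(source));
        }
        if (partSize < 1)
        {
            throw new ArgumentException("Part size has to be greater than zero.", nameof(partSize));
        }
        return Iterate(source, partSize);
    }

    // Yields one chunk at a time; the last chunk may be shorter than partSize.
    private static IEnumerable<string> Iterate(string source, int partSize)
    {
        for (var pos = 0; pos < source.Length; pos += partSize)
        {
            yield return source.Substring(pos, Math.Min(partSize, source.Length - pos));
        }
    }
}
```

It is used the same way as the LINQ version, e.g. `string.Join(":", "001223453".SplitNthPartsLazy(2))`.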
You can use something like this:
string newStr= System.Text.RegularExpressions.Regex.Replace(MAC, ".{2}", "$0:");
To trim the last colon, assign the result of TrimEnd back, since strings are immutable:
newStr = newStr.TrimEnd(':');
See the Microsoft documentation for String.TrimEnd.
Try this way.
string MAC = "00122345";
MAC = System.Text.RegularExpressions.Regex.Replace(MAC,".{2}", "$0:");
MAC = MAC.TrimEnd(':'); // only strips a trailing colon, so odd-length input keeps its last character
Console.WriteLine(MAC);
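A variation on the regex approach avoids producing the trailing colon in the first place by adding a negative lookahead, so a pair at the very end of the string is left alone. This is a sketch under the assumption that the input contains no line breaks; `MacFormatter` and `InsertColons` are names invented for the example:

```csharp
using System;
using System.Text.RegularExpressions;

public static class MacFormatter
{
    // Appends ':' after every two characters, except after a pair that
    // ends the string: (?!$) makes the match fail at the end of the input.
    public static string InsertColons(string mac)
    {
        return Regex.Replace(mac, ".{2}(?!$)", "$0:");
    }
}
```

With this pattern no trimming step is needed, and an odd-length input keeps its final single character.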
A quite fast solution, 8-10x faster than the current accepted answer (regex solution) and 3-4x faster than the LINQ solution
public static string Format(this string s, string separator, int length)
{
StringBuilder sb = new StringBuilder();
for (int i = 0; i < s.Length; i += length)
{
sb.Append(s.Substring(i, Math.Min(s.Length - i, length)));
if (i < s.Length - length)
{
sb.Append(separator);
}
}
return sb.ToString();
}
Usage:
string result = "12345678".Format(":", 2);
Here is a one (1) line alternative using LINQ Enumerable.Aggregate.
string result = MAC.Aggregate("", (acc, c) => acc.Length % 3 == 0 ? acc += c : acc += c + ":").TrimEnd(':');
An easy to understand and simple solution.
This is a simple fast modified answer in which you can easily change the split char.
This answer also checks whether the length of the input is even or odd, to build the appropriate output.
input : 00122345
output : 00:12:23:45
input : 0012234
output : 00:12:23:4
//The List that keeps the pairs
List<string> MACList = new List<string>();
//Split the even number into pairs
for (int i = 1; i <= MAC.Length; i++)
{
if (i % 2 == 0)
{
MACList.Add(MAC.Substring(i - 2, 2));
}
}
//Make the preferable output
string output = "";
for (int j = 0; j < MACList.Count; j++)
{
output = output + MACList[j] + ":";
}
//Checks if the input string is even number or odd number
if (MAC.Length % 2 == 0)
{
output = output.Trim(output.Last());
}
else
{
output += MAC.Last();
}
//input : 00122345
//output : 00:12:23:45
//input : 0012234
//output : 00:12:23:4

C# how to find and replace specific text in string?

I have a string which represents byte array, inside of it I have several groups of numbers (usually 5): which are encoded as 0x30..0x39 (codes for 0..9 digits). Before and after each number I have a space (0x20 code).
Examples:
"E5-20-32-36-20-E0" // "32-36" encodes number "26", notice spaces: "20"
"E5-20-37-20-E9" // "37" encodes number "7"
"E5-20-38-20-E7-E4-20-37-35-20-E9" // two numbers: "8" (from "38") and "75" (from "37-35")
I want to find out all these groups and reverse digits in the encoded numbers:
8 -> 8
75 -> 57
123 -> 321
Desired outcome:
"E5-20-32-36-20-E0" -> "E5-20-36-32-20-E0"
"E5-20-37-20-E9" -> "E5-20-37-20-E9"
"E5-20-37-38-39-20-E9" -> "E5-20-39-38-37-20-E9"
"E5-20-38-39-20-E7-E4-20-37-35-20-E9" -> "E5-20-39-38-20-E7-E4-20-35-37-20-E9"
I have the data inside a List \ String \ Byte[] - so maybe there is a way to do it ?
Thanks,
It's unclear (from the original question) what you want to do with the digits; let's extract a custom method for you to implement it. As an example, I've implemented reverse:
32 -> 32
32-36 -> 36-32
36-32-37 -> 37-32-36
36-37-38-39 -> 39-38-37-36
Code:
// items: array of digits codes, e.g. {"36", "32", "37"}
//TODO: put desired transformation here
private static IEnumerable<string> Transform(string[] items) {
// Either terse Linq:
// return items.Reverse();
// Or good old for loop:
string[] result = new string[items.Length];
for (int i = 0; i < items.Length; ++i)
result[i] = items[items.Length - i - 1];
return result;
}
Now we can use regular expressions (Regex) to extract all the digit sequences and replace them with transformed ones:
using System.Text.RegularExpressions;
...
string input = "E5-20-36-32-37-20-E0";
string result = Regex
.Replace(input,
@"(?<=20\-)3[0-9](\-3[0-9])*(?=\-20)",
match => string.Join("-", Transform(match.Value.Split('-'))));
Console.Write($"Before: {input}{Environment.NewLine}After: {result}");
Outcome:
Before: E5-20-36-32-37-20-E0
After: E5-20-37-32-36-20-E0
Edit: In case reverse is the only desired transformation, the code can be simplified by dropping Transform and adding Linq:
using System.Linq;
using System.Text.RegularExpressions;
...
string input = "E5-20-36-32-37-20-E0";
string result = Regex
.Replace(input,
@"(?<=20\-)3[0-9](\-3[0-9])*(?=\-20)",
match => string.Join("-", match.Value.Split('-').Reverse()));
More tests:
private static string MySolution(string input) {
return Regex
.Replace(input,
@"(?<=20\-)3[0-9](\-3[0-9])*(?=\-20)",
match => string.Join("-", Transform(match.Value.Split('-'))));
}
...
string[] tests = new string[] {
"E5-20-32-36-20-E0",
"E5-20-37-20-E9",
"E5-20-37-38-39-20-E9",
"E5-20-38-39-20-E7-E4-20-37-35-20-E9",
};
string report = string.Join(Environment.NewLine, tests
.Select(test => $"{test,-37} -> {MySolution(test)}"));
Console.Write(report);
Outcome:
E5-20-32-36-20-E0 -> E5-20-36-32-20-E0
E5-20-37-20-E9 -> E5-20-37-20-E9
E5-20-37-38-39-20-E9 -> E5-20-39-38-37-20-E9
E5-20-38-39-20-E7-E4-20-37-35-20-E9 -> E5-20-39-38-20-E7-E4-20-35-37-20-E9
Edit 2: Regex explanation (see https://www.regular-expressions.info/lookaround.html for details):
(?<=20\-) - must appear before the match: "20-" ("-" escaped with "\")
3[0-9](\-3[0-9])* - match itself (what we are replacing in Regex.Replace)
(?=\-20) - must appear after the match "-20" ("-" escaped with "\")
Let's have a look at match part 3[0-9](\-3[0-9])*:
3 - just "3"
[0-9] - character (digit) within 0-9 range
(\-3[0-9])* - followed by zero or more - "*" - groups of "-3[0-9]"
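The question mentions that the data is also available as a Byte[]. In that case the same rule (a run of digit codes 0x30..0x39 with a 0x20 on both sides) could be applied directly to the bytes, without going through the string form. A rough sketch, assuming in-place modification is acceptable; `DigitRunReverser` and `ReverseDigitRuns` are hypothetical names:

```csharp
using System;

public static class DigitRunReverser
{
    // Reverses, in place, every run of ASCII digit bytes (0x30..0x39)
    // that has a space byte (0x20) immediately before and after it.
    public static void ReverseDigitRuns(byte[] data)
    {
        int i = 0;
        while (i < data.Length)
        {
            if (data[i] < 0x30 || data[i] > 0x39)
            {
                i++;
                continue;
            }

            int start = i;
            while (i < data.Length && data[i] >= 0x30 && data[i] <= 0x39)
            {
                i++;
            }
            int end = i; // exclusive end of the digit run

            bool spaceBefore = start > 0 && data[start - 1] == 0x20;
            bool spaceAfter = end < data.Length && data[end] == 0x20;
            if (spaceBefore && spaceAfter)
            {
                Array.Reverse(data, start, end - start);
            }
        }
    }
}
```

`BitConverter.ToString(data)` can be used afterwards to get the dash-separated form shown in the question.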
I'm not sure, but I guess the length can change and you just want to put the numbers in reverse order. So a possible way is:
Put the string in 2 arrays (so they are the same)
Iterate through one of them to locate the beginning and end of the number area
Go from end-area to begin-area in first array and write to the second from begin-area to end-area
Edit: not really tested, I just wrote that quickly:
string input = "E5-20-36-32-37-20-E0";
string[] array1 = input.Split('-');
string[] array2 = input.Split('-');
int startIndex = -1;
int endIndex = -1;
for (int i= 0; i < array1.Length; ++i)
{
if (array1[i] == "20")
{
if (startIndex < 0)
{
startIndex = i + 1;
}
else
{
endIndex = i - 1;
}
}
}
int pos1 = startIndex;
int pos2 = endIndex;
for (int j=0; j < (endIndex- startIndex + 1); ++j)
{
array1[pos1] = array2[pos2];
pos1++;
pos2--;
}
If you would be clear about how you want to process the numbers, it would be easier to provide a solution.
Do you want to swap them randomly?
Do you want to reverse order?
Do you want to swap every second number with the number before?
Do you want to swap ...
You can try the following (for reversing the numbers):
string hex = "E5-20-36-32-20-E0"; // this is your input string
// split the numbers by '-' and generate list out of it
List<string> hexNumbers = new List<string>();
hexNumbers.AddRange(hex.Split('-'));
// find start and end of the numbers that should be swapped
int startIndex = hexNumbers.IndexOf("20");
int endIndex = hexNumbers.LastIndexOf("20");
string newHex = "";
// add the part in front of the numbers that should be reversed
for (int i = 0; i <= startIndex; i++) newHex += hexNumbers[i] + "-";
// reverse the numbers
for (int i = endIndex-1; i > startIndex; i--) newHex += hexNumbers[i] + "-";
// add the part behind the numbers that should be reversed
for (int i = endIndex; i < hexNumbers.Count-1; i++) newHex += hexNumbers[i] + "-";
newHex += hexNumbers.Last();
If the start and the end is always the same, this can be fairly simplified into 4 lines of code:
string[] hexNumbers = hex.Split('-');
string newHex = "E5-20-";
for (int i = hexNumbers.Count() - 3; i > 1; i--) newHex += hexNumbers[i] + "-";
newHex += "20-E0";
Results:
"E5-20-36-32-20-E0" -> "E5-20-32-36-20-E0"
"E5-20-36-32-37-20-E0" -> "E5-20-37-32-36-20-E0"
"E5-20-36-12-18-32-20-E0" -> "E5-20-32-18-12-36-20-E0"

Replace Only Multiples Of three in c#

How can I replace only multiples of 3 in C#? Say for example I had the string "000100000", and I wanted "000" to be replaced with "+" but only every group of three characters. Additional condition: the groups should be changed starting from the end, e.g. for "000100000" it should output "+100+".
You can just use a regular expression for this.
(0{3}(?!0+))
This uses a negative lookahead to make sure there aren't any other zeros after a group of three 0s - in other words, for a sequence of an arbitrary number of 0s, it'll only match the last 3.
You can modify this if you want to do something subtly different, using lookaheads and lookbehinds.
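For illustration, this is roughly how the pattern above could be applied with Regex.Replace. The wrapper names (`ZeroGroups`, `ReplaceTrailingTriple`) are invented for this sketch. Note that the pattern only replaces the last three zeros of each run, so a run of six zeros yields one "+" rather than two:

```csharp
using System;
using System.Text.RegularExpressions;

public static class ZeroGroups
{
    // Replaces the final "000" of every run of zeros with "+".
    public static string ReplaceTrailingTriple(string source)
    {
        return Regex.Replace(source, "(0{3}(?!0+))", "+");
    }
}

// ZeroGroups.ReplaceTrailingTriple("000100000")  gives "+100+"
// ZeroGroups.ReplaceTrailingTriple("0001000000") gives "+1000+" (not "+1++")
```

If every complete group of three within a run should be replaced, the pattern needs a different shape.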
I suggest using regular expressions, e.g.:
string source = "000100000";
// "+100+"
string result = Regex.Replace(
source,
"0{3,}",
match => new string('0', match.Length % 3) + new string('+', match.Length / 3));
Tests:
001 -> 001
0001 -> +1
000100 -> +100
0001000 -> +1+
00010000 -> +10+
000100000 -> +100+
0001000000 -> +1++
You can do this with Substring:
string strReplace = "000100000";
//Store your string on StringBuilder to edit the string
StringBuilder sb = new StringBuilder();
sb.Append("+");
sb.Append(strReplace.Substring(0, 3)); //Use substring, 0 is the start of index and 3 is the length as your requirement
sb.Append("+");
sb.Append(strReplace.Substring(3, 3));
sb.Append("+");
sb.Append(strReplace.Substring(6, 3));
sb.Append("+");
strReplace = sb.ToString(); //Finally replace your string instance with your result
Or by for loop but this time instead of using substring, we use Char array to get every char in your string:
string strReplace = "000100000";
char[] chReplace = strReplace.ToCharArray();
StringBuilder sb = new StringBuilder();
for (int x = 0; x <= 8; x++)
{
if (x == 0 || x == 3 || x == 6 || x == 9)
{
sb.Append("+");
sb.Append(chReplace[x]);
}
else
{
sb.Append(chReplace[x]);
}
}
sb.Append("+");
strReplace = sb.ToString();
Okay, a bunch of these answers are addressing detecting groups of 3 '0's. Here's an answer that deals with groups of 3 anythings (reading the string in groups of three characters):
string GroupsOfThree(string str)
{
StringBuilder sb = new StringBuilder();
for (int i = 0; i + 2 < str.Length; i += 3)
{
string sub = str.Substring(i, 3);
if (sub.All(c => c == sub[0]))
sb.Append("+");
else
sb.Append(sub);
}
return sb.ToString();
}
You can use a replace regular expression.
"[0]{3}|[1]{3}"
The above Regular Expression can be use like below in C#:
string k = "000100000";
Regex pattern = new Regex("[0]{3}|[1]{3}");
string result = pattern.Replace(k, "+"); // Replace returns a new string; k itself is unchanged
Reversing the string before the replacement and reversing it back afterwards solves your problem.
Something like:
string ReplaceThreeZeros(string text)
{
var reversed = new string(text.Reverse().ToArray());
var replaced = reversed.Replace("000","+");
return new string(replaced.Reverse().ToArray());
}

Add separator to string at every N characters?

I have a string which contains binary digits. How to separate string after each 8 digit?
Suppose the string is:
string x = "111111110000000011111111000000001111111100000000";
I want to add a separator like ,(comma) after each 8 character.
output should be :
"11111111,00000000,11111111,00000000,11111111,00000000,"
Then I want to send it to a list<> last 8 char 1st then the previous 8 chars(excepting ,) and so on.
How can I do this?
Regex.Replace(myString, ".{8}", "$0,");
If you want an array of eight-character strings, then the following is probably easier:
Regex.Split(myString, "(?<=^(?:.{8})+)");
which will split the string only at points where a multiple of eight characters precede it.
Try this:
var s = "111111110000000011111111000000001111111100000000";
var list = Enumerable
.Range(0, s.Length/8)
.Select(i => s.Substring(i*8, 8));
var res = string.Join(",", list);
There's another Regex approach:
var str = "111111110000000011111111000000001111111100000000";
// for .NET 4
var res = String.Join(",", Regex.Matches(str, @"\d{8}").Cast<Match>());
// for .NET 3.5
var res = String.Join(",", Regex.Matches(str, @"\d{8}")
.OfType<Match>()
.Select(m => m.Value).ToArray());
...or old school:
public static List<string> splitter(string input, out string csv)
{
if (input.Length % 8 != 0) throw new ArgumentException("Length must be a multiple of 8.", nameof(input));
var lst = new List<string>(input.Length / 8);
for (int i = 0; i < input.Length / 8; i++) lst.Add(input.Substring(i * 8, 8));
csv = string.Join(",", lst); //This we want in input order (I believe)
lst.Reverse(); //As we want list in reverse order (I believe)
return lst;
}
Ugly but less garbage:
private string InsertStrings(string s, int insertEvery, char insert)
{
char[] ins = s.ToCharArray();
int length = s.Length + (s.Length / insertEvery);
if (ins.Length % insertEvery == 0)
{
length--;
}
var outs = new char[length];
long di = 0;
long si = 0;
while (si < s.Length - insertEvery)
{
Array.Copy(ins, si, outs, di, insertEvery);
si += insertEvery;
di += insertEvery;
outs[di] = insert;
di ++;
}
Array.Copy(ins, si, outs, di, ins.Length - si);
return new string(outs);
}
String overload:
private string InsertStrings(string s, int insertEvery, string insert)
{
char[] ins = s.ToCharArray();
char[] inserts = insert.ToCharArray();
int insertLength = inserts.Length;
int length = s.Length + (s.Length / insertEvery) * insert.Length;
if (ins.Length % insertEvery == 0)
{
length -= insert.Length;
}
var outs = new char[length];
long di = 0;
long si = 0;
while (si < s.Length - insertEvery)
{
Array.Copy(ins, si, outs, di, insertEvery);
si += insertEvery;
di += insertEvery;
Array.Copy(inserts, 0, outs, di, insertLength);
di += insertLength;
}
Array.Copy(ins, si, outs, di, ins.Length - si);
return new string(outs);
}
If I understand your last requirement correctly (it's not clear to me if you need the intermediate comma-delimited string or not), you could do this:
var enumerable = "111111110000000011111111000000001111111100000000".Batch(8).Reverse();
By utilizing morelinq.
Here are my two cents too. An implementation using StringBuilder:
public static string AddChunkSeparator (string str, int chunk_len, char separator)
{
if (str == null || str.Length < chunk_len) {
return str;
}
StringBuilder builder = new StringBuilder();
for (var index = 0; index < str.Length; index += chunk_len) {
builder.Append(str, index, Math.Min(chunk_len, str.Length - index)); // the last chunk may be shorter
builder.Append(separator);
}
return builder.ToString();
}
You can call it like this:
string data = "111111110000000011111111000000001111111100000000";
string output = AddChunkSeparator(data, 8, ',');
One way using LINQ:
string data = "111111110000000011111111000000001111111100000000";
const int separateOnLength = 8;
string separated = new string(
data.Select((x,i) => i > 0 && i % separateOnLength == 0 ? new [] { ',', x } : new [] { x })
.SelectMany(x => x)
.ToArray()
);
I did it using Pattern & Matcher in the following way:
fun addAnyCharacter(input: String, insertion: String, interval: Int): String {
val pattern = Pattern.compile("(.{$interval})", Pattern.DOTALL)
val matcher = pattern.matcher(input)
return matcher.replaceAll("$1$insertion")
}
Where:
input is the input string.
insertion is the string to insert between the characters, for example a comma (,), star (*), or hash (#).
interval is the interval at which the insertion string is added.
See the results section; here I've added a space as the insertion at every 4th character.
Results:
I/P: 1234XXXXXXXX5678 O/P: 1234 XXXX XXXX 5678
I/P: 1234567812345678 O/P: 1234 5678 1234 5678
I/P: ABCDEFGHIJKLMNOP O/P: ABCD EFGH IJKL MNOP
Hope this helps.
As of .NET 6, you can simply use the Enumerable.Chunk method (which splits the elements of a sequence into chunks) and then rejoin the chunks using String.Join.
var text = "...";
string.Join(',', text.Chunk(size: 6).Select(x => new string(x)));
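Applied to the binary string from the question, the Chunk approach might look like the sketch below. It assumes .NET 6 or later; `ChunkDemo` and its two methods are names chosen for illustration. The second method covers the question's follow-up of collecting the 8-character groups last-first:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ChunkDemo
{
    // Joins fixed-size chunks of the input with the given separator.
    public static string WithSeparators(string bits, int size, char separator)
    {
        return string.Join(separator, bits.Chunk(size).Select(c => new string(c)));
    }

    // Returns the chunks as a list, last chunk first.
    public static List<string> ChunksLastFirst(string bits, int size)
    {
        return bits.Chunk(size).Select(c => new string(c)).Reverse().ToList();
    }
}
```

Unlike the output requested in the question, the joined string has no trailing separator; append one explicitly if it is needed.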
This is much faster without copying array (this version inserts space every 3 digits but you can adjust it to your needs)
public string GetString(double valueField)
{
char[] ins = valueField.ToString().ToCharArray();
int length = ins.Length + (ins.Length / 3);
if (ins.Length % 3 == 0)
{
length--;
}
char[] outs = new char[length];
int i = length - 1;
int j = ins.Length - 1;
int k = 0;
do
{
if (k == 3)
{
outs[i--] = ' ';
k = 0;
}
else
{
outs[i--] = ins[j--];
k++;
}
}
while (i >= 0);
return new string(outs);
}
For every 1 character, you could do this one-liner:
string.Join(".", "1234".ToArray()) //result: 1.2.3.4
If you intend to create your own function to achieve this without using regex or pattern matching methods, you can create a simple function like this:
String formatString(String key, String seperator, int afterEvery){
String formattedKey = "";
for(int i=0; i<key.length(); i++){
formattedKey += key.substring(i,i+1);
if((i+1)%afterEvery==0)
formattedKey += seperator;
}
if(formattedKey.endsWith(seperator))
formattedKey = formattedKey.substring(0,formattedKey.length()-seperator.length());
return formattedKey;
}
Calling the method like this
formatString("ABCDEFGHIJKLMNOPQRST", "-", 4)
Would result in the return string as this
ABCD-EFGH-IJKL-MNOP-QRST
A little late to the party, but here's a simplified LINQ expression to break an input string x into groups of n separated by another string sep:
string sep = ",";
int n = 8;
string result = String.Join(sep, x.InSetsOf(n).Select(g => new String(g.ToArray())));
A quick rundown of what's happening here:
x is being treated as an IEnumerable<char>, which is where the InSetsOf extension method comes in.
InSetsOf(n) groups characters into an IEnumerable of IEnumerable -- each entry in the outer grouping contains an inner group of n characters.
Inside the Select method, each group of n characters is turned back into a string by using the String() constructor that takes an array of chars.
The result of Select is now an IEnumerable<string>, which is passed into String.Join to interleave the sep string, just like any other example.
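InSetsOf is not part of the standard LINQ operators, and the answer does not show where it comes from, so the following is only a guess at what such an extension method could look like. The signature and the choice of List<T> for the inner groups are assumptions of this sketch:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class EnumerableExtensions
{
    // Hypothetical implementation: groups the source into consecutive
    // sets of at most 'max' items; the final set may be smaller.
    public static IEnumerable<List<T>> InSetsOf<T>(this IEnumerable<T> source, int max)
    {
        if (source == null) throw new ArgumentNullException(nameof(source));
        if (max < 1) throw new ArgumentOutOfRangeException(nameof(max));
        return Iterate(source, max);
    }

    private static IEnumerable<List<T>> Iterate<T>(IEnumerable<T> source, int max)
    {
        var set = new List<T>(max);
        foreach (var item in source)
        {
            set.Add(item);
            if (set.Count == max)
            {
                yield return set;
                set = new List<T>(max);
            }
        }
        if (set.Count > 0)
        {
            yield return set; // leftover items that did not fill a whole set
        }
    }
}
```

With a definition along these lines, the expression from the answer compiles as written, since `g.ToArray()` on a List<char> yields the char[] that the String constructor expects.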
I am more than late with my answer but you can use this one:
static string PutLineBreak(string str, int split)
{
// Insert from the end so that earlier insert positions are not shifted.
for (int a = str.Length - (str.Length % split); a > 0; a -= split)
{
str = str.Insert(a, "\n");
}
return str;
}

Eliminate redundant letters in string? (e.g. gooooooooood -> good)

I'm trying to set up some sample data for a Naive Bayesian Classifier for Twitter.
One of the post-processing of the tweets I'd like to do is to remove unnecessary repeat characters.
For example, one of the tweets reads: Twizzlers. mmmmm goooooooooooood!
I'd like to reduce the number of w's down to just two. Why two? That's what the article I'm following did. Any individual word that is less than 2 characters is discarded (see mmmmm above). And as far as gooooooood, I would imagine double letters are the most common to be uber repeated.
So, that said, what's the fastest way (in terms of execution time) to reduce words such as gooooooooood to simply good?
[Edit]
I'll be processing 800,000 tweets in this app, hence the requirement for fastest execution
[/Edit]
[Edit2]
I just ran some simple benchmarking based on elapsed time to iterate through 1000 records & save to a text file. I repeated this iteration 100 times on each method. The average results are here:
Method 1: 386 ms [LINQ - answer was deleted]
Method 2: 407 ms [Regex]
Method 3: 303 ms [StringBuilder]
Method 4: 301 ms [StringBuilder part 2]
Method 1: LINQ (answer was apparently deleted)
static string doIt(string a)
{
var l = a.Select((p, i) => new { ch = p, index = i }).
Where(p => (p.index < a.Length - 2) && (a[p.index + 1] == p.ch) && (a[p.index + 2] == p.ch))
.Select(p => p.index).ToList();
l.Sort();
l.Reverse();
l.ForEach(i => a = a.Remove(i, 1));
return a;
}
METHOD 2:
Regex.Replace(tweet,@"(\S)\1{2,}","$1$1");
Method 3:
static string StringB(string s)
{
string input = s;
StringBuilder sb = new StringBuilder();
for (int i = 0; i < input.Length; i++)
{
if (i < 2 || input[i] != input[i - 1] || input[i] != input[i - 2])
sb.Append(input[i]);
}
string output = sb.ToString();
return output;
}
Method 4:
static string sb2(string s)
{
string input = s;
var sb = new StringBuilder(input);
char p2 = '\0';
char p1 = '\0';
int pos = 0, len = sb.Length;
while (pos < len)
{
if (p2 == p1) for (; pos < len && (sb[pos] == p2); len--)
sb.Remove(pos, 1);
if (pos < len)
{
p2 = p1;
p1 = sb[pos];
pos++;
}
}
return sb.ToString();
}
Regexen look to be the simplest. Simple proof of concept in the REPL:
using System.Text.RegularExpressions;
var regex = new Regex(@"(\S)\1{2,}"); // or @"([aeiouy])\1{2,}" etc?
regex.Replace("mmmmm gooood griieeeeefff", "$1$1");
-->
"mm good griieeff"
For raw performance, use something more like this: see it live on https://ideone.com/uWG68
using System;
using System.Text;
class Program
{
public static void Main(string[] args)
{
string input = "mmmm gooood griiiiiiiiiieeeeeeefffff";
var sb = new StringBuilder(input);
char p2 = '\0';
char p1 = '\0';
int pos = 0, len=sb.Length;
while (pos < len)
{
if (p2==p1) for (; pos<len && (sb[pos]==p2); len--)
sb.Remove(pos, 1);
if (pos<len)
{
p2=p1;
p1=sb[pos];
pos++;
}
}
Console.WriteLine(sb);
}
}
This is also (easily) doable via a regular expression:
var re = @"((.)\2)\2*";
Regex.Replace("god", re, "$1") // god
Regex.Replace("good", re, "$1") // good
Regex.Replace("gooood", re, "$1") // good
Is it faster than the other approaches? Well, that's for the benchmarks ;-) Regular expressions can be quite efficient in non-degenerate backtracking situations. The above may need to be altered (this will also match spaces for instance), but it's a small example.
Happy coding.
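Since the question mentions processing 800,000 tweets, a regex-based approach would normally build the Regex once and reuse it for every tweet instead of parsing the pattern on each call. A small sketch using the `(\S)\1{2,}` pattern from the question; whether RegexOptions.Compiled actually pays off for this workload is an assumption that has not been measured here, and `TweetCleaner` is a made-up name:

```csharp
using System;
using System.Text.RegularExpressions;

public static class TweetCleaner
{
    // Built once and shared; matches a non-whitespace character
    // repeated three or more times in a row.
    private static readonly Regex Repeats =
        new Regex(@"(\S)\1{2,}", RegexOptions.Compiled);

    // Collapses every run of three or more identical characters to two.
    public static string Collapse(string tweet)
    {
        return Repeats.Replace(tweet, "$1$1");
    }
}
```

A benchmark like the one in the question would be needed to see how this compares with the StringBuilder methods.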
I would recommend looking into NLP solutions rather than C#/regex. In that world, Python is preferred. See NLTK. I would recommend Nodebox Linguistics which gives you spelling corrections. You can even stem words and even go down to the infinitive.
I agree with the comments that this will not work in the general case, especially in "Twitter speak". Having said that the rules you mentioned are simple - eliminate every character that is the same as the previous two characters:
string input = "goooooooooooood";
StringBuilder sb = new StringBuilder(input.Length);
sb.Append(input.Substring(0, 2));
for (int i = 2; i < input.Length; i++)
{
if (input[i] != input[i - 1] || input[i] != input[i - 2])
sb.Append(input[i]);
}
string output = sb.ToString();
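The rule above can be wrapped in a reusable method that also copes with inputs shorter than two characters and lets the maximum run length vary. This is a sketch rather than a tuned implementation; `RepeatCollapser`, `CollapseRepeats` and the `maxRun` parameter are assumptions introduced for the example:

```csharp
using System;
using System.Text;

public static class RepeatCollapser
{
    // Keeps at most maxRun consecutive occurrences of any character.
    public static string CollapseRepeats(string input, int maxRun = 2)
    {
        if (string.IsNullOrEmpty(input))
        {
            return input;
        }

        var sb = new StringBuilder(input.Length);
        int run = 0;       // length of the current run of identical characters
        char prev = '\0';

        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            run = (i > 0 && c == prev) ? run + 1 : 1;
            if (run <= maxRun)
            {
                sb.Append(c);
            }
            prev = c;
        }
        return sb.ToString();
    }
}
```

Each character is visited once and the builder is sized up front, which is the same single-pass idea as the loop above.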
