regex/linq to replace consecutive characters with count


I have the following method (written in C#/.NET). The input text consists only of letters (no digits). The returned value is another text in which every group of more than two consecutive identical characters is replaced with a single copy of the character preceded by the repetition count.
Ex.: aAAbbbcccc -> aAA3b4c
public static string Pack(string text)
{
    if (string.IsNullOrEmpty(text)) return text;
    StringBuilder sb = new StringBuilder(text.Length);
    char prevChar = text[0];
    int prevCharCount = 1;
    for (int i = 1; i < text.Length; i++)
    {
        char c = text[i];
        if (c == prevChar) prevCharCount++;
        else
        {
            if (prevCharCount > 2) sb.Append(prevCharCount);
            else if (prevCharCount == 2) sb.Append(prevChar);
            sb.Append(prevChar);
            prevChar = c;
            prevCharCount = 1;
        }
    }
    if (prevCharCount > 2) sb.Append(prevCharCount);
    else if (prevCharCount == 2) sb.Append(prevChar);
    sb.Append(prevChar);
    return sb.ToString();
}
The method is not too long, but does anyone have an idea how to do this more concisely using regex or LINQ?

How about:
static readonly Regex re = new Regex(@"(\w)(\1){2,}", RegexOptions.Compiled);
static void Main() {
    string result = re.Replace("aAAbbbcccc",
        match => match.Length.ToString() + match.Value[0]);
}
The regex is a word char, followed by the same (back-ref) at least twice; the lambda takes the length of the match (match.Length) and appends the first character (match.Value[0]).
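For reference, the regex approach can be wrapped as a drop-in replacement for Pack. This is only a sketch: the class name RunPacker is made up for illustration, and it assumes the input contains only letters, as the question states (\w would also match digits and underscores).

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper class; only Regex.Replace comes from the answer above.
public static class RunPacker
{
    // (\w)\1{2,}: one word character followed by at least two more copies of it,
    // i.e. a run of three or more identical characters.
    private static readonly Regex Runs = new Regex(@"(\w)\1{2,}", RegexOptions.Compiled);

    public static string Pack(string text)
    {
        if (string.IsNullOrEmpty(text)) return text;
        // Replace each run with "<run length><character>".
        return Runs.Replace(text, m => m.Length.ToString() + m.Value[0]);
    }
}
```

The back-reference is case-sensitive, so in aAAbbbcccc the run AA stays as it is and only bbb and cccc are packed.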

Related

How to exchange numbers to alphabet and alphabet to numbers in a string?

How do I convert numbers to their equivalent alphabet characters, and alphabet characters to their numeric values, in a string (except 0, which should stay 0 for obvious reasons)?
So basically, if there is a string
string content="D93AK0F5I";
how can I convert it to this?
string new_content="4IC11106E9";

I'm assuming you're aware this is not reversible, and that you're only using upper case and digits. Here you go...
private string Transpose(string input)
{
StringBuilder result = new StringBuilder();
foreach (var character in input)
{
if (character == '0')
{
result.Append(character);
}
else if (character >= '1' && character <= '9')
{
int offset = character - '1';
char replacement = (char)('A' + offset);
result.Append(replacement);
}
else if (character >= 'A' && character <= 'Z') // I'm assuming upper case only; feel free to duplicate for lower case
{
int offset = character - 'A' + 1;
result.Append(offset);
}
else
{
throw new ApplicationException($"Unexpected character: {character}");
}
}
return result.ToString();
}
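To see why the mapping is not reversible, here is a minimal sketch. It restates the method above in compact form inside a made-up class named TransposeDemo (and throws ArgumentException instead of ApplicationException, which is my choice, not the answer's). Two different inputs produce the same output.

```csharp
using System;
using System.Text;

// Hypothetical demo class; the mapping itself follows the answer above.
public static class TransposeDemo
{
    public static string Transpose(string input)
    {
        var result = new StringBuilder();
        foreach (char ch in input)
        {
            if (ch == '0') result.Append(ch);                      // 0 stays 0
            else if (ch >= '1' && ch <= '9')
                result.Append((char)('A' + (ch - '1')));           // 1..9 -> A..I
            else if (ch >= 'A' && ch <= 'Z')
                result.Append(ch - 'A' + 1);                       // A..Z -> 1..26 (int overload)
            else
                throw new ArgumentException($"Unexpected character: {ch}");
        }
        return result.ToString();
    }
}
```

Both Transpose("K") and Transpose("AA") return 11, so the original text cannot be recovered from the output alone.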

Well, if you only need a one-way translation, here is quite a simple way to do it using LINQ:
string convert(string input)
{
var chars = "0abcdefghijklmnopqrstuvwxyz";
return string.Join("",
input.Select(
c => char.IsDigit(c) ?
chars[int.Parse(c.ToString())].ToString() :
(chars.IndexOf(char.ToLowerInvariant(c))).ToString())
);
}
You can see a live demo on rextester.

You can use an ArrayList of alphabet letters. For example
ArrayList alphabets = new ArrayList();
alphabets.Add("A");
alphabets.Add("B");
and so on.
And now parse your string character by character.
string s = "1BC34D";
char[] characters = s.ToCharArray();
for (int i = 0; i < characters.Length; i++)
{
    if (Char.IsDigit(characters[i]))
    {
        // '1' maps to index 0 ("A"), '2' to index 1 ("B"), and so on.
        // '0' has no letter and must be handled separately.
        int index = characters[i] - '1';
        var stringAlphabet = alphabets[index];
    }
    else
    {
        // The list holds strings, so look up the one-character string.
        var digitCharacter = alphabets.IndexOf(characters[i].ToString()) + 1;
    }
}
This way you can get the "alphabet" representation of a number and the numeric representation of an "alphabet" letter.

How to remove non-INTERNATIONAL-alphanumeric characters from a string?

How can I remove (or recognize) non-alphanumeric characters such as '-', '*', '‡', '€', '⁋', '™' from a string without removing non-Latin alphanumeric characters such as 'Ж', 'ק', 'ओ', 'を'?
The removing part is easy, my issue is differentiating the non-Latin alphabets from the non-Latin symbols.
* All existing Q&A I found filtered out non-Latin alphabets.

A simple solution (that works only for basic BMP characters) is:
string str = "-*‡€⁋™Жקओを";
var sb = new StringBuilder();
foreach (char ch in str)
{
bool isLetter = char.IsLetterOrDigit(ch);
if (isLetter)
{
sb.Append(ch);
}
}
string str2 = sb.ToString();
The Char.IsLetterOrDigit is described as:
Indicates whether the specified Unicode character is categorized as a letter or a decimal digit.
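The same BMP-only filter can also be sketched with Unicode category classes in a regex. The class and method names below are made up for illustration. \p{L} covers the letter categories and \p{Nd} the decimal digits, which is what Char.IsLetterOrDigit tests. As far as I know, .NET regexes match one UTF-16 code unit at a time, so letters outside the BMP are dropped here as well.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper; an alternative to the foreach loop, not the answer's own code.
public static class LetterFilter
{
    // Anything that is neither a letter (\p{L}) nor a decimal digit (\p{Nd}).
    private static readonly Regex NonAlphanumeric = new Regex(@"[^\p{L}\p{Nd}]+");

    public static string KeepLettersAndDigits(string input)
    {
        return NonAlphanumeric.Replace(input, "");
    }
}
```

For the sample string "-*‡€⁋™Жקओを" this keeps only "Жקओを".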
If you want to support surrogate pairs, it becomes more complex:
// \U0001F000 is MAHJONG TILE EAST WIND
// \U0001D49E is MATHEMATICAL SCRIPT CAPITAL C
string str = "-*‡€⁋™Жקओを\U0001F000\U0001D49E";
var sb = new StringBuilder();
for (int i = 0; i < str.Length; i++)
{
bool isLetter = char.IsLetterOrDigit(str, i);
bool isHighSurrogate = char.IsHighSurrogate(str[i]);
if (isLetter)
{
sb.Append(str, i, isHighSurrogate ? 2 : 1);
}
if (isHighSurrogate && i + 1 < str.Length && char.IsLowSurrogate(str[i + 1]))
{
i++;
}
}
string str2 = sb.ToString();

Convert Dash-Separated String to camelCase via C#

I have a large XML file that contains tag names that follow the dash-separated naming convention. How can I use C# to convert the tag names to the camel case naming convention?
The rules are:
1. Convert all characters to lower case
2. Capitalize the first character after each dash
3. Remove all dashes
Example
Before Conversion
<foo-bar>
<a-b-c></a-b-c>
</foo-bar>
After Conversion
<fooBar>
<aBC></aBC>
</fooBar>
Here's a code example that works, but it's slow to process - I'm thinking that there is a better way to accomplish my goal.
string ConvertDashToCamelCase(string input)
{
input = input.ToLower();
char[] ca = input.ToCharArray();
StringBuilder sb = new StringBuilder();
for(int i = 0; i < ca.Length; i++)
{
if(ca[i] == '-')
{
string t = ca[i + 1].ToString().ToUpper();
sb.Append(t);
i++;
}
else
{
sb.Append(ca[i].ToString());
}
}
return sb.ToString();
}

Your original code was slow because you're calling ToString all over the place unnecessarily. There's no need for that. There's also no need for the intermediate array of char. The following should be much faster, and faster than the version that uses String.Split, too.
string ConvertDashToCamelCase(string input)
{
StringBuilder sb = new StringBuilder();
bool caseFlag = false;
for (int i = 0; i < input.Length; ++i)
{
char c = input[i];
if (c == '-')
{
caseFlag = true;
}
else if (caseFlag)
{
sb.Append(char.ToUpper(c));
caseFlag = false;
}
else
{
sb.Append(char.ToLower(c));
}
}
return sb.ToString();
}
I'm not going to claim that the above is the fastest possible. In fact, there are several obvious optimizations that could save some time. But the above is clean and clear: easy to understand.
The key is the caseFlag, which you use to indicate that the next character copied should be set to upper case. Also note that I don't automatically convert the entire string to lower case. There's no reason to, since you'll be looking at every character anyway and can do the appropriate conversion at that time.
The idea here is that the code doesn't do any more work than it absolutely has to.

For completeness, here's also a regular expression one-liner (inspired by this JavaScript answer):
string ConvertDashToCamelCase(string input) =>
Regex.Replace(input, "-.", m => m.Value.ToUpper().Substring(1));
It replaces all occurrences of -x with x converted to upper case.
Special cases:
If you want to lower-case all other characters, replace input with input.ToLower() inside the expression:
string ConvertDashToCamelCase(string input) =>
Regex.Replace(input.ToLower(), "-.", m => m.Value.ToUpper().Substring(1));
If you want to support multiple dashes between words (dash--case) and have all of the dashes removed (dashCase), replace - with -+ in the regular expression (to greedily match all sequences of dashes) and keep only the final character:
string ConvertDashToCamelCase(string input) =>
Regex.Replace(input, "-+.", m => m.Value.ToUpper().Substring(m.Value.Length - 1));
If you want to support multiple dashes between words (dash--case) and remove only the final one (dash-Case), change the regular expression to match only a dash followed by a non-dash (rather than a dash followed by any character):
string ConvertDashToCamelCase(string input) =>
Regex.Replace(input, "-[^-]", m => m.Value.ToUpper().Substring(1));
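To compare the variants side by side, here is a small sketch that puts them into one class. The class name DashCase and the method names are made up for illustration; the patterns and lambdas are the ones from the answer.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical container for the regex variants discussed above.
public static class DashCase
{
    // "-x" becomes "X"; every single dash is consumed together with the next character.
    public static string Basic(string input) =>
        Regex.Replace(input, "-.", m => m.Value.ToUpper().Substring(1));

    // "--x" becomes "X"; a whole run of dashes is removed.
    public static string RemoveAllDashes(string input) =>
        Regex.Replace(input, "-+.", m => m.Value.ToUpper().Substring(m.Value.Length - 1));

    // Only a dash directly followed by a non-dash is removed.
    public static string RemoveFinalDash(string input) =>
        Regex.Replace(input, "-[^-]", m => m.Value.ToUpper().Substring(1));
}
```

With these, DashCase.Basic("foo-bar") gives fooBar, DashCase.RemoveAllDashes("dash--case") gives dashCase, and DashCase.RemoveFinalDash("dash--case") gives dash-Case.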

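string ConvertDashToCamelCase(string input)
{
    string[] words = input.Split('-');
    // Keep the first word lower-case; capitalize the first letter of every later word.
    return words[0].ToLower() +
        string.Concat(words.Skip(1).Select(element => wordToCamelCase(element)));
}
string wordToCamelCase(string input)
{
    // Consecutive dashes produce empty pieces; pass them through unchanged.
    if (input.Length == 0) return input;
    return char.ToUpper(input[0]) + input.Substring(1).ToLower();
}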

Here is an updated version of @Jim Mischel's answer that will ignore the content, i.e. it will only camelCase tag names.
string ConvertDashToCamelCase(string input)
{
StringBuilder sb = new StringBuilder();
bool caseFlag = false;
bool tagFlag = false;
for(int i = 0; i < input.Length; i++)
{
char c = input[i];
if(tagFlag)
{
if (c == '-')
{
caseFlag = true;
}
else if (caseFlag)
{
sb.Append(char.ToUpper(c));
caseFlag = false;
}
else
{
sb.Append(char.ToLower(c));
}
}
else
{
sb.Append(c);
}
// Reset tag flag if necessary
if(c == '>' || c == '<')
{
tagFlag = (c == '<');
}
}
return sb.ToString();
}
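If the input really is well-formed XML, another option is to let an XML parser find the tag names instead of tracking '<' and '>' by hand. The sketch below uses LINQ to XML; the class name XmlTagRenamer and the method RenameTags are made up for illustration, and it assumes the whole document fits in memory. Text content and attributes are left untouched.

```csharp
using System;
using System.Linq;
using System.Text;
using System.Xml.Linq;

// Hypothetical helper built on System.Xml.Linq; not part of any answer above.
public static class XmlTagRenamer
{
    // Same caseFlag technique as in the answers above.
    public static string ConvertDashToCamelCase(string input)
    {
        var sb = new StringBuilder(input.Length);
        bool caseFlag = false;
        foreach (char c in input)
        {
            if (c == '-') caseFlag = true;
            else if (caseFlag) { sb.Append(char.ToUpper(c)); caseFlag = false; }
            else sb.Append(char.ToLower(c));
        }
        return sb.ToString();
    }

    public static XElement RenameTags(string xml)
    {
        XElement root = XElement.Parse(xml);
        // Snapshot the elements first, then rename each one in place,
        // keeping whatever namespace the element already has.
        foreach (XElement el in root.DescendantsAndSelf().ToList())
        {
            el.Name = el.Name.Namespace + ConvertDashToCamelCase(el.Name.LocalName);
        }
        return root;
    }
}
```

Calling XmlTagRenamer.RenameTags on the example from the question renames foo-bar to fooBar and a-b-c to aBC; the result can be written back out with ToString() or Save().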

using System;
using System.Text;
public class MyString
{
public static string ToCamelCase(string str)
{
char[] s = str.ToCharArray();
StringBuilder sb = new StringBuilder();
for(int i = 0; i < s.Length; i++)
{
if (s[i] == '-' || s[i] == '_')
{
// Skip the separator and upper-case the next character, if there is one.
if (++i < s.Length) sb.Append(Char.ToUpper(s[i]));
}
else
sb.Append(s[i]);
}
return sb.ToString();
}
}

Regular expression help - ignoring parenthesis, ands, ors and whitespace again

Consider the following English phrase:
FRIEND AND COLLEAGUE AND (FRIEND OR COLLEAGUE AND (COLLEAGUE AND FRIEND AND FRIEND))
I want to be able to programmatically change arbitrary phrases, such as above, to something like:
SELECT * FROM RelationTable R1 JOIN RelationTable R2 ON R2.RelationName etc etc WHERE
R2.RelationName = FRIEND AND R2.RelationName = Colleague AND (R3.RelationName = FRIENd,
etc. etc.
My question is: how do I take the initial string, strip it of the following words and symbols: AND, OR, (, ), then change each remaining word and create a new string?
I can do most of it, but my main problem is that if I do a string.split and only get the words I care for, I can't really replace them in the original string because I lack their original index. Let me explain in a smaller example:
string input = "A AND (B AND C)"
Splitting the string on spaces, parentheses, etc. gives: A,B,C
input.Replace("A", "MyRandomPhrase")
But there is an A in AND.
So I moved into trying to create a regular expression that matches exact words, post split, and replaces. It started to look like this:
"(\(|\s|\))*" + itemOfInterest + "(\(|\s|\))+"
Am I on the right track, or am I overcomplicating things? Thanks!

You can try using Regex.Replace with the \b word-boundary anchor:
string input = "A AND B AND (A OR B AND (B AND A AND A))";
string pattern = "\\bA\\b";
string replacement = "MyRandomPhrase";
Regex rgx = new Regex(pattern);
string result = rgx.Replace(input, replacement);
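The word-boundary idea can be extended to rewrite every operand in one pass while leaving AND, OR, and the parentheses in place. This is a sketch under assumptions: the class name OperandRewriter and the transform parameter are made up for illustration, and operands are assumed to be single words made of word characters.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper; it builds on \b and Regex.Replace from the answer above.
public static class OperandRewriter
{
    // A whole word that is not exactly AND or OR.
    private static readonly Regex Operand = new Regex(@"\b(?!(?:AND|OR)\b)\w+\b");

    public static string Rewrite(string input, Func<string, string> transform)
    {
        // Each operand is replaced where it stands, so no index bookkeeping is needed.
        return Operand.Replace(input, m => transform(m.Value));
    }
}
```

For example, OperandRewriter.Rewrite("A AND (B OR C)", w => "[" + w + "]") returns "[A] AND ([B] OR [C])". The A inside AND is untouched because the pattern only matches whole words.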

class Program
{
static void Main(string[] args)
{
string text = "A AND (B AND C)";
List<object> result = ParseBlock(text);
Console.ReadLine();
}
private static List<object> ParseBlock(string text)
{
List<object> result = new List<object>();
int bracketsCount = 0;
int lastIndex = 0;
for (int i = 0; i < text.Length; i++)
{
char c = text[i];
if (c == '(')
bracketsCount++;
else if (c == ')')
bracketsCount--;
if (bracketsCount == 0)
if (c == ' ' || i == text.Length - 1)
{
string substring = text.Substring(lastIndex, i + 1 - lastIndex).Trim();
object itm = substring;
if (substring[0] == '(')
itm = ParseBlock(substring.Substring(1, substring.Length - 2));
result.Add(itm);
lastIndex = i;
}
}
return result;
}
}

How to separate 1 string into multiple strings [duplicate]

How do I convert "ThisIsMyTestString" into "This Is My Test String" using C#?
Is there a fast way to do it?
I've been thinking of some pseudocode, but it's complicated and ugly:
String s = "ThisIsMyTestString";
List<String> strList = new List<String>();
int i = 0;
while (i < s.Length)
{
    // Start a new word with the current character (normally the upper-case letter).
    String tmp = s[i].ToString();
    i++;
    // Then take the lower-case letters that follow it.
    while (i < s.Length && Char.IsLower(s[i]))
    {
        tmp += s[i];
        i++;
    }
    strList.Add(tmp);
}
String tmp2 = "";
for (int j = 0; j < strList.Count; j++)
{
    tmp2 += strList[j] + " ";
}

You can use a regex, as outlined here:
Regular expression, split string by capital letter but ignore TLA
Your regex: "((?<=[a-z])[A-Z]|[A-Z](?=[a-z]))"
Find and replace with " $1":
string splitString = Regex.Replace("ThisIsMyTestString", "((?<=[a-z])[A-Z]|[A-Z](?=[a-z]))", " $1").Trim();
Here (?<=...) is a "positive lookbehind", a regex that must precede the match. In this case the lookbehind is "characters 'a' through 'z'".
(?=...) is the similar construct for lookahead, where the match has to be followed by the regex-described string. In this case the lookahead is "characters 'a' through 'z'".
In both cases the match itself is a single capital letter 'A' through 'Z' that is either preceded by a lowercase letter or followed by one. Replacing each match with " $1" puts a space in front of that capital letter; Trim() removes the leading space produced by the capital at the start of the string.
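As a minimal sketch of the lookaround pattern in use (the class name PascalSplitter is made up for illustration), including an input with an acronym to show how runs of capitals are handled:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper around the lookbehind/lookahead pattern described above.
public static class PascalSplitter
{
    // A capital that follows a lowercase letter, or a capital followed by a lowercase letter.
    private static readonly Regex WordStart =
        new Regex("((?<=[a-z])[A-Z]|[A-Z](?=[a-z]))");

    public static string Split(string input)
    {
        // Put a space before each matched capital, then drop the leading space.
        return WordStart.Replace(input, " $1").Trim();
    }
}
```

PascalSplitter.Split("ThisIsMyTestString") returns "This Is My Test String", and PascalSplitter.Split("MyXMLParser") returns "My XML Parser": the M and L inside XML match neither alternative, so the acronym stays together.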

Not the best code, but it works:
String.Join("", s.Select(c => Char.IsUpper(c) ? " " + c : c.ToString())).Trim()

lazyberezovsky beat me with a much simpler solution... but this creates less garbage so I won't delete it.
static void Main(string[] args)
{
Console.WriteLine(SplitByCase("ThisIsMyString"));
Console.ReadLine();
}
static string SplitByCase(string str, bool upper = true)
{
return String.Join(" ", SplitIntoWords(str, c => Char.IsUpper(c)));
}
static IEnumerable<String> SplitIntoWords(string str, Func<char, bool> splitter)
{
StringBuilder sb = new StringBuilder();
for (int i = 0; i < str.Length; i++)
{
sb.Append(str[i]);
if (i + 1 == str.Length || splitter(str[i + 1]))
{
yield return sb.ToString();
sb.Clear();
}
}
}

This will do it for that string:
String s = "ThisIsMyTestString";
StringBuilder result = new StringBuilder();
result.Append(s[0]);
for (int i = 1; i < s.Length; i++)
{
if (char.IsUpper(s[i]) )
result.Append(' ');
result.Append(s[i]);
}
s = result.ToString();
