Regular Expression - Is this possible?

Regular Expression - Is this possible? - c#

Rather than describing what I want (it's difficult to explain), Let me provide an example of what I need to accomplish in C# using a regular expression:
"HelloWorld" should be transformed to "Hello World"
"HelloWORld" should be transformed to "Hello WO Rld" //Two consecutive letters in capital should be treatead as one word
"helloworld" should be transformed to "helloworld"
EDIT:
"HellOWORLd" should be transformed to "Hell OW OR Ld"
Every 2-consecutive capital letters should be considered one word.
Is this possible?

This is fully working C# code, not just the regex:
Console.WriteLine(
Regex.Replace(
"HelloWORld",
"(?<!^)(?<wordstart>[A-Z]{1,2})",
" ${wordstart}", RegexOptions.Compiled));
And it prints:
Hello WO Rld
Update
To make this more UNICODE/international aware, consider replacing [A-Z] by \p{Lt} (meaning a UNICODE code point that represents a Letter in uppercase). The result for the current input would the same. So here is a slightly more compelling example:
Console.WriteLine(Regex.Replace(
#"ÉclaireürfØÑJßå",
#"(?<!^)(?<wordstart>\p{Lu}{1,2})",
#" ${wordstart}",
RegexOptions.Compiled));

The regular expression engine is not a transformative thing by nature, but rather a pattern matching (and replacing) engine. People often mistake the replace part of Regex, thinking that it can do more than it's designed to.
Back to your question, though... Regex cannot do what you want, instead, you should write your own parser to do this. With C#, if you're familiar with the language, this task is somewhat trivial.
It's a case of "You're using the wrong tool for the job".

Here are regular expressions that detect what you are looking for:
([A-Z]\w*?)[A-Z]
this matches any uppercase letter from A to Z once followed by aphanumerics up to the next uppercase.
([A-Z]{2}\w*?)[A-Z]
this matches any uppercase letter from A to Z exactly 2 times.
Regex is a matching engine, you can parse the input string and use regex.isMatch to find candidate matches to then insert spaces into the output string

string f(string input)
{
//'lowerUPPER' -> 'lower UPPER'
var x = Regex.Replace(input, "([a-z])([A-Z])","$1 $2");
//'UPPER' -> 'UP PE R'
return Regex.Replace(x, "([A-Z]{2})","$1 ");
}

class Program
{
static void Main(string[] args)
{
Print(Parse("HelloWorld"));
Print(Parse("HelloWORld"));
Print(Parse("helloworld"));
Print(Parse("HellOWORLd"));
Console.ReadLine();
}
static void Print(IEnumerable<string> input)
{
foreach (var s in input)
{
Console.Write(s);
Console.Write(' ');
}
Console.WriteLine();
}
static IEnumerable<string> Parse(string input)
{
var sb = new StringBuilder();
for (int i = 0; i < input.Length; i++)
{
if (!char.IsUpper(input[i]))
{
sb.Append(input[i]);
continue;
}
if (sb.Length > 0)
{
yield return sb.ToString();
sb.Clear();
}
sb.Append(input[i]);
if (char.IsUpper(input[i + 1]))
{
sb.Append(input[++i]);
yield return sb.ToString();
sb.Clear();
}
}
if (sb.Length > 0)
{
yield return sb.ToString();
}
}
}

I think does not need regular expression in this case.
Try this:
static void Main(string[] args)
{
var input = "HellOWORLd";
var i = 0;
var x = 4;
var len = input.Length;
var output = new List<string>();
while (x <= len)
{
output.Add(SubStr(input, i, x));
i = x;
x += 2;
}
var ret = output.ToArray(); //["Hell","OW", "OR", "Ld"]
Console.ReadLine();
}
static string SubStr(string str, int start, int end)
{
var len = str.Length;
if (start >= 0 && end <= len)
{
var ret = new StringBuilder();
for (int i = 0; i < len; i++)
{
if (i == start)
{
do
{
ret.Append(str[i]);
i++;
} while (i != end);
}
}
return ret.ToString();
}
return null;
}

Related

Convert Dash-Separated String to camelCase via C#

I have a large XML file that contain tag names that implement the dash-separated naming convention. How can I use C# to convert the tag names to the camel case naming convention?
The rules are:
1. Convert all characters to lower case
2. Capitalize the first character after each dash
3. Remove all dashes
Example
Before Conversion
<foo-bar>
<a-b-c></a-b-c>
</foo-bar>
After Conversion
<fooBar>
<aBC></aBC>
</fooBar>
Here's a code example that works, but it's slow to process - I'm thinking that there is a better way to accomplish my goal.
string ConvertDashToCamelCase(string input)
{
input = input.ToLower();
char[] ca = input.ToCharArray();
StringBuilder sb = new StringBuilder();
for(int i = 0; i < ca.Length; i++)
{
if(ca[i] == '-')
{
string t = ca[i + 1].ToString().toUpper();
sb.Append(t);
i++;
}
else
{
sb.Append(ca[i].ToString());
}
}
return sb.ToString();
}

The reason your original code was slow is because you're calling ToString all over the place unnecessarily. There's no need for that. There's also no need for the intermediate array of char. The following should be much faster, and faster than the version that uses String.Split, too.
string ConvertDashToCamelCase(string input)
{
StringBuilder sb = new StringBuilder();
bool caseFlag = false;
for (int i = 0; i < input.Length; ++i)
{
char c = input[i];
if (c == '-')
{
caseFlag = true;
}
else if (caseFlag)
{
sb.Append(char.ToUpper(c));
caseFlag = false;
}
else
{
sb.Append(char.ToLower(c));
}
}
return sb.ToString();
}
I'm not going to claim that the above is the fastest possible. In fact, there are several obvious optimizations that could save some time. But the above is clean and clear: easy to understand.
The key is the caseFlag, which you use to indicate that the next character copied should be set to upper case. Also note that I don't automatically convert the entire string to lower case. There's no reason to, since you'll be looking at every character anyway and can do the appropriate conversion at that time.
The idea here is that the code doesn't do any more work than it absolutely has to.

For completeness, here's also a regular expression one-liner (inspred by this JavaScript answer):
string ConvertDashToCamelCase(string input) =>
Regex.Replace(input, "-.", m => m.Value.ToUpper().Substring(1));
It replaces all occurrences of -x with x converted to upper case.
Special cases:
If you want lower-case all other characters, replace input with input.ToLower() inside the expression:
string ConvertDashToCamelCase(string input) =>
Regex.Replace(input.ToLower(), "-.", m => m.Value.ToUpper().Substring(1));
If you want to support multiple dashes between words (dash--case) and have all of the dashes removed (dashCase), replace - with -+ in the regular expression (to greedily match all sequences of dashes) and keep only the final character:
string ConvertDashToCamelCase(string input) =>
Regex.Replace(input, "-+.", m => m.Value.ToUpper().Substring(m.Value.Length - 1));
If you want to support multiple dashes between words (dash--case) and remove only the final one (dash-Case), change the regular expression to match only a dash followed by a non-dash (rather than a dash followed by any character):
string ConvertDashToCamelCase(string input) =>
Regex.Replace(input, "-[^-]", m => m.Value.ToUpper().Substring(1));

string ConvertDashToCamelCase(string input)
{
string[] words = input.Split('-');
words = words.Select(element => wordToCamelCase(element));
return string.Join("", words);
}
string wordToCamelCase(string input)
{
return input.First().ToString().ToUpper() + input.Substring(1).ToLower();
}

Here is an updated version of #Jim Mischel's answer that will ignore the content - i.e. it will only camelCase tag names.
string ConvertDashToCamelCase(string input)
{
StringBuilder sb = new StringBuilder();
bool caseFlag = false;
bool tagFlag = false;
for(int i = 0; i < input.Length; i++)
{
char c = input[i];
if(tagFlag)
{
if (c == '-')
{
caseFlag = true;
}
else if (caseFlag)
{
sb.Append(char.ToUpper(c));
caseFlag = false;
}
else
{
sb.Append(char.ToLower(c));
}
}
else
{
sb.Append(c);
}
// Reset tag flag if necessary
if(c == '>' || c == '<')
{
tagFlag = (c == '<');
}
}
return sb.ToString();
}

using System;
using System.Text;
public class MyString
{
public static string ToCamelCase(string str)
{
char[] s = str.ToCharArray();
StringBuilder sb = new StringBuilder();
for(int i = 0; i < s.Length; i++)
{
if (s[i] == '-' || s[i] == '_')
sb.Append(Char.ToUpper(s[++i]));
else
sb.Append(s[i]);
}
return sb.ToString();
}
}

Matching each char in strings

I am trying to match user input with the pattern "ran,om", where "ran om will match exact characters with order, and "," can match to any characters. The program will find words in the arrayList for example in ArrayList dictionary{rammm, random, ranom}, for example, random will match, but ranom will not.
I have written the following code, but it only finds any words contains any of the characters in the user input:
for (int i = 0; i < userinput.Length; i++)
{
foreach (string line in dictionary)
if (line[i] == userinput[i])
{
Matching.Add(line);
}
foreach (string line in FirstCom)
Console.WriteLine(line);
}
Can anyone help me to figure out what do I do next? (p.s no regex will be using in this program)

How about this:
public static bool IsMatch(string pattern, string line)
{
var patternSplit = pattern.Split(',');
if (!line.StartsWith(patternSplit[0])) return false;
if(patternSplit.Count() > 2){
for (var i = 1; i < patternSplit.Count() - 1; i++)
{
if (!line.Contains(patternSplit[i])) return false;
}
}
if (!line.EndsWith(patternSplit[patternSplit.Count() - 1])) return false;
return true;
}
static void Main(string[] args)
{
var matchingData = "quick brown fox jumped over a lazy dog";
var failingData = "I am batman";
var pattern = "qu,pe,ov,og";
if(IsMatch(pattern, matchingData))Console.WriteLine("{0} matches pattern {1}", pattern, matchingData);
if(!IsMatch(pattern, failingData)) Console.WriteLine("{0} does not match {1}", pattern, failingData);
Console.ReadKey();
}

How can i remove none alphabet chars from a string[]? [duplicate]

This question already has answers here:
How do I remove all non alphanumeric characters from a string except dash?
(13 answers)
Closed 9 years ago.
This is the code:
StringBuilder sb = new StringBuilder();
Regex rgx = new Regex("[^a-zA-Z0-9 -]");
var words = Regex.Split(textBox1.Text, #"(?=(?<=[^\s])\s+\w)");
for (int i = 0; i < words.Length; i++)
{
words[i] = rgx.Replace(words[i], "");
}
When im doing the Regex.Split() the words contain also strings with chars inside for exmaple:
Daniel>
or
Hello:
or
\r\nNew
or
hello---------------------------
And i need to get only the words without all the signs
So i tried to use this loop but i end that in words there are many places with ""
And some places with only ------------------------
And i cant use this as strings later in my code.

You don't need a regex to clear non-letters. This will remove all non-unicode letters.
public string RemoveNonUnicodeLetters(string input)
{
StringBuilder sb = new StringBuilder();
foreach(char c in input)
{
if(Char.IsLetter(c))
sb.Append(c);
}
return sb.ToString();
}
Alternatively, if you only want to allow Latin letters, you can use this
public string RemoveNonLatinLetters(string input)
{
StringBuilder sb = new StringBuilder();
foreach(char c in input)
{
if(c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
sb.Append(c);
}
return sb.ToString();
}
Benchmark vs Regex
public static string RemoveNonUnicodeLetters(string input)
{
StringBuilder sb = new StringBuilder();
foreach (char c in input)
{
if (Char.IsLetter(c))
sb.Append(c);
}
return sb.ToString();
}
static readonly Regex nonUnicodeRx = new Regex("\\P{L}");
public static string RemoveNonUnicodeLetters2(string input)
{
return nonUnicodeRx.Replace(input, "");
}
static void Main(string[] args)
{
Stopwatch sw = new Stopwatch();
StringBuilder sb = new StringBuilder();
//generate guids as input
for (int j = 0; j < 1000; j++)
{
sb.Append(Guid.NewGuid().ToString());
}
string input = sb.ToString();
sw.Start();
for (int i = 0; i < 1000; i++)
{
RemoveNonUnicodeLetters(input);
}
sw.Stop();
Console.WriteLine("SM: " + sw.ElapsedMilliseconds);
sw.Restart();
for (int i = 0; i < 1000; i++)
{
RemoveNonUnicodeLetters2(input);
}
sw.Stop();
Console.WriteLine("RX: " + sw.ElapsedMilliseconds);
}
Output (SM = String Manipulation, RX = Regex)
SM: 581
RX: 9882
SM: 545
RX: 9557
SM: 664
RX: 10196

keyboardP’s solution is decent – do consider it. But as I’ve argued in the comments, regular expressions are actually the correct tool for the job, you’re just making it unnecessarily complicated. The actual solution is a one-liner:
var result = Regex.Replace(input, "\\P{L}", "");
\P{…} specifies a Unicode character class we do not want to match (the opposite of \p{…}). L is the Unicode character class for letters.
Of course it makes sense to encapsulate this into a method, as keyboardP did. To avoid recompiling the regular expression over again, you should also consider pulling the regex creation out of the actual code (although this probably won’t give a big impact on performance):
static readonly Regex nonUnicodeRx = new Regex("\\P{L}");
public static string RemoveNonUnicodeLetters(string input) {
return nonUnicodeRx.Replace(input, "");
}

To help Konrad and keyboardP resolve their differences, I ran a benchmark test, using their code. It turns out that keyboardP's code is 10x faster than Konrad's code
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
namespace ConsoleApplication1
{
class Program
{
static void Main(string[] args)
{
string input = "asdf234!##*advfk234098awfdasdfq9823fna943";
DateTime start = DateTime.Now;
for (int i = 0; i < 100000; i++)
{
RemoveNonUnicodeLetters(input);
}
Console.WriteLine(DateTime.Now.Subtract(start).TotalSeconds);
start = DateTime.Now;
for (int i = 0; i < 100000; i++)
{
RemoveNonUnicodeLetters2(input);
}
Console.WriteLine(DateTime.Now.Subtract(start).TotalSeconds);
}
public static string RemoveNonUnicodeLetters(string input)
{
StringBuilder sb = new StringBuilder();
foreach (char c in input)
{
if (Char.IsLetter(c))
sb.Append(c);
}
return sb.ToString();
}
public static string RemoveNonUnicodeLetters2(string input)
{
var result = Regex.Replace(input, "\\P{L}", "");
return result;
}
}
}
I got
0.12
1.2
as output
UPDATE:
To see if it is the Regex compilation that is slowing down the Regex method, I put the regex in a static variable that is only constructed once.
static Regex rex = new Regex("\\P{L}");
public static string RemoveNonUnicodeLetters2(string input)
{
var result = rex.Replace(input,m => "");
return result;
}
But this had no effect on the runtime.

Trim to first number

Is there a way to trim a string to the first numeric digit from left AND right using standard .NET tools? Or I need to write my own function (not difficult, but I'd rather use standard methods). I need the following outputs for the provided inputs:
Input Output
-----------------------
abc123def 123
;'-2s;35(r 2s;35
abc12de3f4g 12de3f4

You'll need to use regular expressions
string TrimToDigits(string text)
{
var pattern = #"\d.*\d";
var regex = new Regex(pattern);
Match m = regex.Match(text); // m is the first match
if (m.Success)
{
return m.Value;
}
return String.Empty;
}
If you want to call this like you normally would the String.Trim() method, you can create it as an extension method.
static class StringExtensions
{
static string TrimToDigits(this string text)
{
// ...
}
}
And then you can call it like this:
var trimmedString = otherString.TrimToDigits();

No, there is no built in way. You will have to write your own method to do this.

No, I don't think there is. Method though:
for (int i = 0; i < str.Length; i++)
{
if (char.IsDigit(str[i]))
{
break;
}
str = string.Substring(1);
}
for (int i = str.Length - 1; i > 0; i--)
{
if (char.IsDigit(str[i]))
{
break;
}
str = string.Substring(0, str.Length - 1);
}
I think this'll work.

What is the best algorithm for arbitrary delimiter/escape character processing?

I'm a little surprised that there isn't some information on this on the web, and I keep finding that the problem is a little stickier than I thought.
Here's the rules:
You are starting with delimited/escaped data to split into an array.
The delimiter is one arbitrary character
The escape character is one arbitrary character
Both the delimiter and the escape character could occur in data
Regex is fine, but a good-performance solution is best
Edit: Empty elements (including leading or ending delimiters) can be ignored
The code signature (in C# would be, basically)
public static string[] smartSplit(
string delimitedData,
char delimiter,
char escape) {}
The stickiest part of the problem is the escaped consecutive escape character case, of course, since (calling / the escape character and , the delimiter): ////////, = ////,
Am I missing somewhere this is handled on the web or in another SO question? If not, put your big brains to work... I think this problem is something that would be nice to have on SO for the public good. I'm working on it myself, but don't have a good solution yet.

A simple state machine is usually the easiest and fastest way. Example in Python:
def extract(input, delim, escape):
# states
parsing = 0
escaped = 1
state = parsing
found = []
parsed = ""
for c in input:
if state == parsing:
if c == delim:
found.append(parsed)
parsed = ""
elif c == escape:
state = escaped
else:
parsed += c
else: # state == escaped
parsed += c
state = parsing
if parsed:
found.append(parsed)
return found

void smartSplit(string const& text, char delim, char esc, vector<string>& tokens)
{
enum State { NORMAL, IN_ESC };
State state = NORMAL;
string frag;
for (size_t i = 0; i<text.length(); ++i)
{
char c = text[i];
switch (state)
{
case NORMAL:
if (c == delim)
{
if (!frag.empty())
tokens.push_back(frag);
frag.clear();
}
else if (c == esc)
state = IN_ESC;
else
frag.append(1, c);
break;
case IN_ESC:
frag.append(1, c);
state = NORMAL;
break;
}
}
if (!frag.empty())
tokens.push_back(frag);
}

private static string[] Split(string input, char delimiter, char escapeChar, bool removeEmpty)
{
if (input == null)
{
return new string[0];
}
char[] specialChars = new char[]{delimiter, escapeChar};
var tokens = new List<string>();
var token = new StringBuilder();
for (int i = 0; i < input.Length; i++)
{
var c = input[i];
if (c.Equals(escapeChar))
{
if (i >= input.Length - 1)
{
throw new ArgumentException("Uncompleted escape sequence has been encountered at the end of the input");
}
var nextChar = input[i + 1];
if (nextChar != escapeChar && nextChar != delimiter)
{
throw new ArgumentException("Unknown escape sequence has been encountered: " + c + nextChar);
}
token.Append(nextChar);
i++;
}
else if (c.Equals(delimiter))
{
if (!removeEmpty || token.Length > 0)
{
tokens.Add(token.ToString());
token.Length = 0;
}
}
else
{
var index = input.IndexOfAny(specialChars, i);
if (index < 0)
{
token.Append(c);
}
else
{
token.Append(input.Substring(i, index - i));
i = index - 1;
}
}
}
if (!removeEmpty || token.Length > 0)
{
tokens.Add(token.ToString());
}
return tokens.ToArray();
}

The implementation of this kind of tokenizer in terms of a FSM is fairly straight forward.
You do have a few decisions to make (like, what do I do with leading delimiters? strip or emit NULL tokens).
Here is an abstract version which ignores leading and multiple delimiters, and doesn't allow escaping the newline:
state(input) action
========================
BEGIN(*): token.clear(); state=START;
END(*): return;
*(\n\0): token.emit(); state=END;
START(DELIMITER): ; // NB: the input is *not* added to the token!
START(ESCAPE): state=ESC; // NB: the input is *not* added to the token!
START(*): token.append(input); state=NORM;
NORM(DELIMITER): token.emit(); token.clear(); state=START;
NORM(ESCAPE): state=ESC; // NB: the input is *not* added to the token!
NORM(*): token.append(input);
ESC(*): token.append(input); state=NORM;
This kind of implementation has the advantage of dealing with consecutive excapes naturally, and can be easily extended to give special meaning to more escape sequences (i.e. add a rule like ESC(t) token.appeand(TAB)).

Here's my ported function in C#
public static void smartSplit(string text, char delim, char esc, ref List<string> listToBuild)
{
bool currentlyEscaped = false;
StringBuilder fragment = new StringBuilder();
for (int i = 0; i < text.Length; i++)
{
char c = text[i];
if (currentlyEscaped)
{
fragment.Append(c);
currentlyEscaped = false;
}
else
{
if (c == delim)
{
if (fragment.Length > 0)
{
listToBuild.Add(fragment.ToString());
fragment.Remove(0, fragment.Length);
}
}
else if (c == esc)
currentlyEscaped = true;
else
fragment.Append(c);
}
}
if (fragment.Length > 0)
{
listToBuild.Add(fragment.ToString());
}
}
Hope this helps someone in the future. Thanks to KenE for pointing me in the right direction.

Here's a more idiomatic and readable way to do it:
public IEnumerable<string> SplitAndUnescape(
string encodedString,
char separator,
char escape)
{
var inEscapeSequence = false;
var currentToken = new StringBuilder();
foreach (var currentCharacter in encodedString)
if (inEscapeSequence)
{
currentToken.Append(currentCharacter);
inEscapeSequence = false;
}
else
if (currentCharacter == escape)
inEscapeSequence = true;
else
if (currentCharacter == separator)
{
yield return currentToken.ToString();
currentToken.Clear();
}
else
currentToken.Append(currentCharacter);
yield return currentToken.ToString();
}
Note that this doesn't remove empty elements. I don't think that should be the responsibility of the parser. If you want to remove them, just call Where(item => item.Any()) on the result.
I think this is too much logic for a single method; it gets hard to follow. If someone has time, I think it would be better to break it up into multiple methods and maybe its own class.

You'ew looking for something like a "string tokenizer". There's a version I found quickly that's similar. Or look at getopt.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Regular Expression - Is this possible? - c#

string f(string input) { //'lowerUPPER' -> 'lower UPPER' var x = Regex.Replace(input, "([a-z])([A-Z])","$1 $2"); //'UPPER' -> 'UP PE R' return Regex.Replace(x, "([A-Z]{2})","$1 "); }

Related

Convert Dash-Separated String to camelCase via C#

Matching each char in strings

How can i remove none alphabet chars from a string[]? [duplicate]

Trim to first number

What is the best algorithm for arbitrary delimiter/escape character processing?

Categories

Resources