How to get parentheses inside parentheses - c#

I'm trying to keep a parenthesis within a string that's surrounded by parentheses.
The string in question is: test (blue,(hmmm) derp)
The desired output into an array is: test and (blue,(hmmm) derp).
The current output is: (blue,, (hmm) and derp).
My current code is this:
var input = Regex
.Split(line, @"(\([^()]*\))")
.Where(s => !string.IsNullOrEmpty(s))
.ToList();
How can I extract the text inside the outer parentheses (keeping them) and keep the inner parentheses as one string in an array?
EDIT:
To clarify my question, I want to ignore the inner parentheses and only split on the outer parentheses.
herpdediderp (orange,(hmm)) some other crap (red,hmm)
Should become:
herpdediderp, orange,(hmm), some other crap and red,hmm.
The code works for everything except the double parentheses: (orange,(hmm)) to orange,(hmm).

You can use the method
public string Trim(params char[] trimChars)
Like this
string trimmedLine = line.Trim('(', ')'); // Specify undesired leading and trailing chars.
// Specify separator characters for the split (here comma and space):
string[] input = trimmedLine.Split(new[]{',', ' '}, StringSplitOptions.RemoveEmptyEntries);
If the line can start or end with 2 consecutive parentheses, use simply good old if-statements:
if (line.StartsWith("(")) {
line = line.Substring(1);
}
if (line.EndsWith(")")) {
line = line.Substring(0, line.Length - 1);
}
string[] input = line.Split(new[]{',', ' '}, StringSplitOptions.RemoveEmptyEntries);

Lot's o' guessing going on here - from me and the others. You could try
[^(]+|\([^(]*(?:\([^(]*\)[^(]*)*\)
It handles one level of parentheses recursion (could be extended though).
Here at regexstorm.
Visual illustration at regex101.
If this piques your interest, I'll add an explanation ;)
Edit:
If you need to use split, put the selection into a group, like
([^(]+|\([^(]*(?:\([^(]*\)[^(]*)*\))
and filter out empty strings. See example here at ideone.
Edit 2:
Not quite sure what behaviour you want with multiple levels of parentheses, but I assume this could do it for you:
([^(]+|\([^(]*(?:\([^(]*(?:\([^(]*\)[^(]*)*\)[^(]*)*\))
^^^^^^^^^^^^^^^^^^^ added
For each level of recursion you want, you "just" add another inner level. So this is for two levels of recursion ;)
See it here at ideone.
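If the nesting depth is not known in advance, .NET's balancing groups are an alternative to adding one inner level per depth. The following is a minimal sketch and not part of the answer above; the names `OuterParens`, `SplitTopLevel` and the group name `open` are made up for illustration, and it assumes the parentheses in the input are balanced.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class OuterParens
{
    // Either a run of text without parentheses, or one balanced (...) group.
    // (?<open>\() pushes on every inner "(", (?<-open>\)) pops on every inner ")",
    // and (?(open)(?!)) rejects the match if anything is still left on the stack.
    private static readonly Regex Token = new Regex(
        @"[^()]+|\((?>[^()]+|(?<open>\()|(?<-open>\)))*(?(open)(?!))\)");

    // Hypothetical helper: returns the top-level pieces, keeping the outer parentheses.
    public static List<string> SplitTopLevel(string line)
    {
        return Token.Matches(line)
            .Cast<Match>()
            .Select(m => m.Value.Trim())
            .Where(s => s.Length > 0)
            .ToList();
    }

    public static void Main()
    {
        foreach (var part in SplitTopLevel("test (blue,(hmmm) derp)"))
            Console.WriteLine(part);
        // prints:
        // test
        // (blue,(hmmm) derp)
    }
}
```

The pattern stays the same length however deep the nesting goes, at the cost of being specific to the .NET regex engine.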

Hopefully someone will come up with a regex. Here's my code answer.
static class ExtensionMethods
{
static public IEnumerable<string> GetStuffInsideParentheses(this IEnumerable<char> input)
{
int levels = 0;
var current = new Queue<char>();
foreach (char c in input)
{
if (levels == 0)
{
if (c == '(') levels++;
continue;
}
if (c == ')')
{
levels--;
if (levels == 0)
{
yield return new string(current.ToArray());
current.Clear();
continue;
}
}
if (c == '(')
{
levels++;
}
current.Enqueue(c);
}
}
}
Test program:
public class Program
{
public static void Main()
{
var input = new []
{
"(blue,(hmmm) derp)",
"herpdediderp (orange,(hmm)) some other crap (red,hmm)"
};
foreach ( var s in input )
{
var output = s.GetStuffInsideParentheses();
foreach ( var o in output )
{
Console.WriteLine(o);
}
Console.WriteLine();
}
}
}
Output:
blue,(hmmm) derp
orange,(hmm)
red,hmm
Code on DotNetFiddle

I think if you think about the problem backwards, it becomes a bit easier - don't split on what you don't want, extract what you do want.
The only slightly tricky part is matching nested parentheses; I assume you will only go one level deep.
The first example:
var s1 = "(blue, (hmmm) derp)";
var input = Regex.Matches(s1, @"\((?:\(.+?\)|[^()]+)+\)").Cast<Match>().Select(m => Regex.Matches(m.Value, @"\(\w+\)|\w+").Cast<Match>().Select(m2 => m2.Value).ToArray()).ToArray();
// input is string[][] { string[] { "blue", "(hmmm)", "derp" } }
The second example uses an extension method:
public static string TrimOutside(this string src, string openDelims, string closeDelims) {
if (!String.IsNullOrEmpty(src)) {
var openIndex = openDelims.IndexOf(src[0]);
if (openIndex >= 0 && src.EndsWith(closeDelims.Substring(openIndex, 1)))
src = src.Substring(1, src.Length - 2);
}
return src;
}
The code/patterns are different because the two examples are being handled differently:
var s2 = "herpdediderp (orange,(hmm)) some other crap (red,hmm)";
var input3 = Regex.Matches(s2, @"\w(?:\w| )+\w|\((?:[^(]+|\([^)]+\))+\)").Cast<Match>().Select(m => m.Value.TrimOutside("(",")")).ToArray();
// input3 is string[] { "herpdediderp", "orange,(hmm)", "some other crap", "red,hmm" }

Related

Efficiently split a string in format "{ {}, {}, ...}"

I have a string in the following format.
string instance = "{112,This is the first day 23/12/2009},{132,This is the second day 24/12/2009}";
private void parsestring(string input)
{
string[] tokens = input.Split(','); // I thought this would split on the , separating the {}
foreach (string item in tokens) // but that doesn't seem to be what it is doing
{
Console.WriteLine(item);
}
}
My desired output should be something like this below:
112,This is the first day 23/12/2009
132,This is the second day 24/12/2009
But currently, I get the one below:
{112
This is the first day 23/12/2009
{132
This is the second day 24/12/2009
I am very new to C# and any help would be appreciated.
Don't fixate on Split() being the solution! This is a simple thing to parse without it. Regex answers are probably also OK, but I imagine in terms of raw efficiency making "a parser" would do the trick.
IEnumerable<string> Parse(string input)
{
var results = new List<string>();
int startIndex = 0;
int currentIndex = 0;
while (currentIndex < input.Length)
{
var currentChar = input[currentIndex];
if (currentChar == '{')
{
startIndex = currentIndex + 1;
}
else if (currentChar == '}')
{
int endIndex = currentIndex - 1;
int length = endIndex - startIndex + 1;
results.Add(input.Substring(startIndex, length));
}
currentIndex++;
}
return results;
}
So it's not short on lines. It iterates once, and only performs one allocation per "result". With a little tweaking I could probably make a C#8 version with Index types that cuts on allocations? This is probably good enough.
You could spend a whole day figuring out how to understand the regex, but this is as simple as it comes:
Scan every character.
If you find {, note the next character is the start of a result.
If you find }, consider everything from the last noted "start" until the index before this character as "a result".
This won't catch mismatched brackets and could throw exceptions for strings like "}}{". You didn't ask for handling those cases, but it's not too hard to improve this logic to catch it and scream about it or recover.
For example, you could reset startIndex to something like -1 when } is found. From there, you can deduce that if you find { when startIndex >= 0, you've found "{{". And you can deduce that if you find } when startIndex == -1, you've found "}}". And if you exit the loop with startIndex >= 0, that's an opening { with no closing }. That leaves the string "}whoops" as an uncovered case, but it could be handled by initializing startIndex to, say, -2 and checking for that specifically. Do that with a regex, and you'll have a headache.
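The checks described above could be sketched roughly as follows. This is one possible reading of that description, not the answerer's code: the class name `BraceParser`, the two sentinel constants, and the choice of `FormatException` are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class BraceParser
{
    private const int NothingSeenYet = -2; // no brace encountered so far
    private const int NoOpenBrace = -1;    // the last brace seen was a '}'

    public static List<string> Parse(string input)
    {
        var results = new List<string>();
        int startIndex = NothingSeenYet;

        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            if (c == '{')
            {
                // A non-negative startIndex means a '{' is already open: "{{".
                if (startIndex >= 0)
                    throw new FormatException($"Unexpected '{{' at index {i}.");
                startIndex = i + 1;
            }
            else if (c == '}')
            {
                if (startIndex == NothingSeenYet)
                    throw new FormatException($"'}}' at index {i} before any '{{'.");
                if (startIndex == NoOpenBrace)
                    throw new FormatException($"'}}' at index {i} has no matching '{{'.");
                results.Add(input.Substring(startIndex, i - startIndex));
                startIndex = NoOpenBrace;
            }
        }

        // Still non-negative here: the last '{' was never closed.
        if (startIndex >= 0)
            throw new FormatException("Input ended inside an unclosed '{'.");
        return results;
    }

    public static void Main()
    {
        var instance = "{112,This is the first day 23/12/2009},{132,This is the second day 24/12/2009}";
        foreach (var item in Parse(instance))
            Console.WriteLine(item);
    }
}
```

Throwing is only one option; the same checks could just as well skip the bad segment and carry on.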
The main reason I suggest this is you said "efficiently". icepickle's solution is nice, but Split() makes one allocation per token, then you perform allocations for each TrimX() call. That's not "efficient". That's "n + 2 allocations".
Use Regex for this:
string[] tokens = Regex.Split(input, @"}\s*,\s*{")
.Select(i => i.Replace("{", "").Replace("}", ""))
.ToArray();
Pattern explanation:
\s* - match zero or more white space characters
Well, if you have a method that is called ParseString, it's a good thing it returns something (and it might not be that bad to say that it is ParseTokens instead). So if you do that, you can come to the following code
private static IEnumerable<string> ParseTokens(string input)
{
return input
// removes the leading {
.TrimStart('{')
// removes the trailing }
.TrimEnd('}')
// splits on the different token in the middle
.Split( new string[] { "},{" }, StringSplitOptions.None );
}
The reason it didn't work for you before is that your understanding of how the Split method works was wrong: it splits on every , in your example.
Now if you put this all together, you get something like in this dotnetfiddle
using System;
using System.Collections.Generic;
public class Program
{
private static IEnumerable<string> ParseTokens(string input)
{
return input
// removes the leading {
.TrimStart('{')
// removes the trailing }
.TrimEnd('}')
// splits on the different token in the middle
.Split( new string[] { "},{" }, StringSplitOptions.None );
}
public static void Main()
{
var instance = "{112,This is the first day 23/12/2009},{132,This is the second day 24/12/2009}";
foreach (var item in ParseTokens( instance ) ) {
Console.WriteLine( item );
}
}
}
Add using System.Text.RegularExpressions; to top of the class
and use the regex split method
string[] tokens = Regex.Split(input, "(?<=}),");
Here, we use a positive lookbehind to split on a , which is immediately after a }
(Note: (?<= your string ) only matches at a position that comes right after your string, without consuming it. You can read more about it here.)
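To make the difference between the two lookarounds concrete, here is a small sketch. The class and method names are hypothetical, and the `Trim` call is an addition: a lookaround does not consume the braces, so they stay in the tokens unless removed afterwards.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class LookaroundSplit
{
    public static string[] SplitAfterBrace(string input)
    {
        // (?<=}) is a lookbehind: the comma only counts when a '}' comes right before it.
        return Regex.Split(input, "(?<=}),")
            .Select(token => token.Trim('{', '}'))
            .ToArray();
    }

    public static string[] SplitBeforeBrace(string input)
    {
        // (?=\{) is a lookahead: the comma only counts when a '{' comes right after it.
        return Regex.Split(input, @",(?=\{)")
            .Select(token => token.Trim('{', '}'))
            .ToArray();
    }

    public static void Main()
    {
        var instance = "{112,This is the first day 23/12/2009},{132,This is the second day 24/12/2009}";
        foreach (var item in SplitAfterBrace(instance))
            Console.WriteLine(item);
        // prints:
        // 112,This is the first day 23/12/2009
        // 132,This is the second day 24/12/2009
    }
}
```

For this input both methods return the same two tokens; they only differ in which neighbour of the comma they inspect.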
If you don't want to use regular expressions, the following code will produce your required output.
string instance = "{112,This is the first day 23/12/2009},{132,This is the second day 24/12/2009}";
string[] tokens = instance.Replace("},{", "}{").Split('}', '{');
foreach (string item in tokens)
{
if (string.IsNullOrWhiteSpace(item)) continue;
Console.WriteLine(item);
}
Console.ReadLine();

Equals() method not recognizing similar/same characters when comparing

Why comparing characters with .Equals always returns false?
char letter = 'a';
Console.WriteLine(letter.Equals("a")); // false
Overall I'm trying to write an English - Morse Code translator. I ran into a problem comparing char values, as shown above. I began with a foreach to analyze all the characters from a ReadLine() input. Using the WriteLine() method, all the characters were printed fine, but when I tried to compare them using the .Equals() method, no matter what I did, it always output false.
I have used the .Equals() method with other strings successfully, but it seems to not work with my chars.
using System;
public class MorseCode {
public static void Main (string[] args) {
Console.WriteLine ("Hello, write anything to convert it to morse code!");
var input = Console.ReadLine();
foreach (char letter in input) {
if(letter.Equals("a")) {
Console.WriteLine("Its A - live");
}
Console.WriteLine(letter);
}
var morseTranslation = "";
foreach (char letter in input) {
if(letter.Equals("a")) {
morseTranslation += ". _ - ";
}
if(letter.Equals("b")) {
morseTranslation += "_ . . . - ";
}
if(letter.Equals("c")) {
morseTranslation += "_ . _ . - ";
}
...
}
Console.WriteLine("In morse code, " + input + " is '" + morseTranslation + "'");
}
}
At the beginning, I wrote the foreach to test if it recognized and ran the correct output, but in the end, when I wrote "sample" into the ReadLine(), it gave me :
Hello, write anything to convert it to morse code!
sample
s
a
m
p
l
e
When you do this:
var c = 'x';
var isEqual = c.Equals("x");
the result (isEqual) will always be false because it's comparing a string to a char. This would return true:
var isEqual = c.Equals('x');
The difference is that "x" is a string literal and 'x' is a char literal.
Part of what makes this confusing is that when you use an object's Equals method, it allows you to compare any type to any other type. So you could do this:
var x = 0;
var y = "y";
var isEqual = x.Equals(y);
...and the compiler will allow it, even though the comparison between int and string won't work. At most, you get a warning.
When comparing value types like int or char with other values of the same type, we usually use ==, like
if (someChar == someOtherChar)
Then if you tried to do this:
if(someChar == "a")
It wouldn't compile. It would tell you that you're comparing a char to a string, and then it's easier because instead of running the program and looking for the error it just won't compile at all and it will tell you exactly where the problem is.
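The different comparisons discussed above can be put side by side in a short sketch; the class name `CharEqualsDemo` is made up for illustration.

```csharp
using System;

public static class CharEqualsDemo
{
    public static void Main()
    {
        char letter = 'a';

        // Resolves to Equals(object): a string passed as object is never equal to a char.
        Console.WriteLine(letter.Equals("a"));        // False

        // Resolves to Equals(char): both sides are chars.
        Console.WriteLine(letter.Equals('a'));        // True

        // The usual way to compare value types.
        Console.WriteLine(letter == 'a');             // True

        // Comparing against a one-character string needs an explicit conversion.
        Console.WriteLine(letter.ToString() == "a");  // True
    }
}
```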
Just for the fun of it, here's another implementation.
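using System.Collections.Generic;
using System.Text;
public static class MorseCodeConverter
{
private static readonly Dictionary<char, string> Codes
= CreateMorseCodeDictionary();
public static string Convert(string input)
{
// Lower-case the input so it matches the lower-case keys in the dictionary.
var lowerCase = input.ToLower();
var result = new StringBuilder();
foreach (var character in lowerCase)
{
// Characters without an entry in the dictionary are skipped.
if (Codes.ContainsKey(character))
result.Append(Codes[character]);
}
return result.ToString();
}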
static Dictionary<char, string> CreateMorseCodeDictionary()
{
var result = new Dictionary<char, string>();
result.Add('a', ". _ - ");
result.Add('b', "_ . . . - ");
// add all the rest
return result;
}
}
One difference is that it's a class by itself without the console app. Then you can use it in a console app. Read the input from the keyboard and then call
MorseCodeConverter.Convert(input);
to get the result, and then you can print it to the console.
Putting all of the characters in a dictionary means that instead of repeating the if/then you can just check to see if each character is in the dictionary.
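As a variation on the same idea, `TryGetValue` does the lookup and the retrieval in one step. The sketch below is self-contained but deliberately incomplete: the class name `MorseSketch` is hypothetical, only three letters are filled in, and separating the codes with a single space is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class MorseSketch
{
    // Only three letters are listed; the rest of the table is left out on purpose.
    private static readonly Dictionary<char, string> Codes = new Dictionary<char, string>
    {
        ['a'] = ".-",
        ['b'] = "-...",
        ['c'] = "-.-.",
    };

    public static string Convert(string input)
    {
        var result = new StringBuilder();
        foreach (char character in input.ToLowerInvariant())
        {
            // One dictionary lookup per character; unknown characters are skipped.
            if (Codes.TryGetValue(character, out string code))
            {
                if (result.Length > 0) result.Append(' ');
                result.Append(code);
            }
        }
        return result.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(Convert("Cab")); // prints "-.-. .- -..."
    }
}
```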
It's important to remember that whilst the char and string types look reminiscent of each other when looking at printed values, they are not handled in exactly the same way.
When you check a string you can use:
string s = "A";
if(s.Equals("A"))
{
//Do Something
}
However, the above will not work with a char. The difference between chars (value types) and strings (reference types) on a surface level is the delimiter used - single quote (apostrophe) vs double quote.
To compare a char you can do this:
char s = 'A';
if(s.Equals('A'))
{
//Do Something
}
On a point relevant to your specific case however, morse code will only require you to use a single case alphabet and as such when you try to compare against 'A' and 'a' you can call input.ToLower() to reduce your var (string) to all lower case so you don't need to cater for both upper and lower case alphabets.
It's good that you're aware of string comparisons and are not using direct value comparison as this:
if (letter == 'a')
{
Console.WriteLine("Its A - live");
}
Would've allowed you to compare the char but it's bad practice as it may lead to lazy comparison of strings in the same way and this:
if (letter == "a")
{
Console.WriteLine("Its A - live");
}
Is not a suitable comparison here: with letter declared as a char it does not even compile, and == only falls back to comparing references rather than values when a string operand is typed as object, see here
For char comparison you have to use the single quote ' character, not the double quote ".
By the way, it writes sample one letter per line because in your first foreach loop you write every letter on a new line. So the code below will work for you:
using System;
public class MorseCode {
public static void Main (string[] args) {
Console.WriteLine ("Hello, write anything to convert it to morse code!");
var input = Console.ReadLine();
/*foreach (char letter in input) {
if(letter.Equals("a")) {
Console.WriteLine("Its A - live");
}
Console.WriteLine(letter);
}*/
var morseTranslation = "";
foreach (char letter in input) {
if(letter.Equals('a')) {
morseTranslation += ". _ - ";
}
if(letter.Equals('b')) {
morseTranslation += "_ . . . - ";
}
if(letter.Equals('c')) {
morseTranslation += "_ . _ . - ";
}
...
}
Console.WriteLine("In morse code, " + input + " is '" + morseTranslation + "'");
}
}
In C#, you can compare strings like integers, that is with == operator. Equals is a method inherited from the object class, and normally implementations would make some type checks. char letter is (obviously) a character, while "a" is a single lettered string.
That's why it returns false.
You could use if (letter.Equals('a')) { ... }, or simpler if (letter == 'a') { ... }
Even simpler than that would be switch (letter) { case 'a': ...; break; ... }.
Or something that is more elegant but maybe too advanced yet for a beginner, using LINQ:
var validCharacters = "ABCDE...";
var codes = new string[] {
".-", "-...", "-.-.", "-..", ".", ...
};
var matchedCodes = input.ToUpper() // make uppercase
.ToCharArray() // explode string into single characters
.Select(c => validCharacters.IndexOf(c)) // for each character, get its index in "validCharacters",
// which equals the index of the morse code in the array "codes"
.Where(i => i > -1) // only take the indexes of characters that were found in "validCharacters"
.Select(i => codes[i]); // retrieve the matching entry from "codes" by index
// "matchedCodes" is now an IEnumerable<string>, a structure saying
// "I am a list of strings over which you can iterate,
// and I know how to generate the elements as you request them."
// Now concatenate all single codes to one long result string
var result = string.Join(" ", matchedCodes);

Converting Arabic Words to Unicode format in C#

I am designing an API where the API user needs Arabic text to be returned in Unicode format, to do so I tried the following:
public static class StringExtensions
{
public static string ToUnicodeString(this string str)
{
StringBuilder sb = new StringBuilder();
foreach (var c in str)
{
sb.Append("\\u" + ((int)c).ToString("X4"));
}
return sb.ToString();
}
}
The issue with the above code is that it returns the Unicode of each letter regardless of its position in the word.
Example: let us assume we have the following word:
"سمير" which consists of:
'س' which is written like 'سـ' because it is the first letter in word.
'م' which is written like 'ـمـ' because it is in the middle of word.
'ي' which is written like 'ـيـ' because it is in the middle of word.
'ر' which is written like 'ـر' because it is last letter of word.
The above code returns unicode of { 'س', 'م' , 'ي' , 'ر'} which is:
\u0633\u0645\u064A\u0631
instead of { 'سـ' , 'ـمـ' , 'ـيـ' , 'ـر'} which is
\uFEB3\uFEE4\uFEF4\uFEAE
Any ideas on how to update code to get correct Unicode?
Helpful link
The string is just a sequence of Unicode code points; it does not know the rules of Arabic. You're getting out exactly the data you put in; if you want different data out, then put different data in!
Try this:
Console.WriteLine("\u0633\u0645\u064A\u0631");
Console.WriteLine("\u0633\u0645\u064A\u0631".ToUnicodeString());
Console.WriteLine("\uFEB3\uFEE4\uFEF4\uFEAE");
Console.WriteLine("\uFEB3\uFEE4\uFEF4\uFEAE".ToUnicodeString());
As expected the output is
سمير
\u0633\u0645\u064A\u0631
ﺳﻤﻴﺮ
\uFEB3\uFEE4\uFEF4\uFEAE
Those two sequences of Unicode code points render the same in the browser, but they're different sequences. If you want to write out the second sequence, then don't pass in the first sequence.
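A quick way to see that the two sequences really are different data is to compare them directly. The sketch below also goes in the opposite direction from what the question asks: it assumes the runtime has full Unicode normalization data available and uses compatibility normalization (NFKC), which should fold the positional presentation forms back to the base letters. It does not produce the shaped forms; the class name is made up for illustration.

```csharp
using System;
using System.Text;

public static class ArabicFormsDemo
{
    public static void Main()
    {
        string baseLetters = "\u0633\u0645\u064A\u0631";        // letters as normally typed
        string presentationForms = "\uFEB3\uFEE4\uFEF4\uFEAE";  // positional presentation forms

        // Different code points, so the strings are not equal even though they render alike.
        Console.WriteLine(baseLetters == presentationForms);    // False

        // NFKC replaces each presentation form with its compatibility equivalent.
        string folded = presentationForms.Normalize(NormalizationForm.FormKC);
        Console.WriteLine(folded == baseLetters);
    }
}
```

Going from base letters to presentation forms has no such built-in shortcut, which is why a lookup table per letter and position is needed.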
Based on Eric's answer I worked out how to solve my problem, and I have created a solution on GitHub.
You will find a simple tool to run on Windows, and if you want to use the code in your projects then just copy paste UnicodesTable.cs and Unshaper.cs.
Basically you need a table of Unicodes for each Arabic letter then you can use something like the following extension method.
public static string GetUnShapedUnicode(this string original)
{
original = Regex.Unescape(original.Trim());
var words = original.Split(' ');
StringBuilder builder = new StringBuilder();
var unicodesTable = UnicodesTable.GetArabicGliphes();
foreach (var word in words)
{
string previous = null;
for (int i = 0; i < word.Length; i++)
{
string shapedUnicode = @"\u" + ((int)word[i]).ToString("X4");
if (!unicodesTable.ContainsKey(shapedUnicode))
{
builder.Append(shapedUnicode);
previous = null;
continue;
}
else
{
if (i == 0 || previous == null)
{
builder.Append(unicodesTable[shapedUnicode][1]);
}
else
{
if (i == word.Length - 1)
{
if (!string.IsNullOrEmpty(previous) && unicodesTable[previous][4] == "2")
{
builder.Append(unicodesTable[shapedUnicode][0]);
}
else
builder.Append(unicodesTable[shapedUnicode][3]);
}
else
{
bool previouChar = unicodesTable[previous][4] == "2";
if (previouChar)
builder.Append(unicodesTable[shapedUnicode][1]);
else
builder.Append(unicodesTable[shapedUnicode][2]);
}
}
}
previous = shapedUnicode;
}
if (words.ToList().IndexOf(word) != words.Length - 1)
builder.Append(@"\u" + ((int)' ').ToString("X4"));
}
return builder.ToString();
}

Get the different substrings from one main string

I have the following main string which contains link names and link URLs. Each name and URL are combined with #;. I want to get the string of each link (name and URL, i.e. My web#?http://www.google.com), see example below
string teststring = "My web#;http://www.google.com My Web2#;http://www.bing.se Handbooks#;http://www.books.se/";
and I want to get three different strings using any string function:
My web#?http://www.google.com
My Web2#?http://www.bing.se
Handbooks#?http://www.books.de
So this looks like you want to split on the space after a #;, instead of splitting at #; itself. C# provides arbitrary length lookbehinds, which makes that quite easy. In fact, you should probably do the replacement of #; with #? first:
string teststring = "My web#;http://www.google.com My Web2#;http://www.bing.se Handbooks#;http://www.books.se/";
teststring = Regex.Replace(teststring, @"#;", "#?");
string[] substrings = Regex.Split(teststring, @"(?<=#\?\S*)\s+");
That's it:
foreach(var s in substrings)
Console.WriteLine(s);
Output:
My web#?http://www.google.com
My Web2#?http://www.bing.se
Handbooks#?http://www.books.se/
If you are worried that your input might already contain other #? that you don't want to split on, you can of course do the splitting first (using #; in the pattern) and then loop over substrings and do the replacement call inside the loop.
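The split-first variant mentioned above might look like this; the class and method names are hypothetical.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class LinkSplitter
{
    public static string[] SplitLinks(string input)
    {
        // Split on whitespace that follows "#;" plus the rest of the URL,
        // then swap the separator inside each piece. Any "#?" already present
        // in the input never takes part in the split.
        return Regex.Split(input, @"(?<=#;\S*)\s+")
            .Select(link => link.Replace("#;", "#?"))
            .ToArray();
    }

    public static void Main()
    {
        var teststring = "My web#;http://www.google.com My Web2#;http://www.bing.se Handbooks#;http://www.books.se/";
        foreach (var s in SplitLinks(teststring))
            Console.WriteLine(s);
        // prints:
        // My web#?http://www.google.com
        // My Web2#?http://www.bing.se
        // Handbooks#?http://www.books.se/
    }
}
```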
If these are constant strings, you can just use String.Substring. This will require you to count letters, which is a nuisance, in order to provide the right parameters, but it will work.
string string1 = teststring.Substring(0, 29).Replace(";","?"); // "My web#;http://www.google.com" is 29 characters long
If they aren't, things get complicated. You could almost do a split with " " as the delimiter, except that your site name has a space. Do any of the substrings in your data have constant features, such as domain endings (i.e. first .com, then .de, etc.) or something like that?
If you have any control on the input format, you may want to change it to be easy to parse, for example by using another separator between items, other than space.
If this format can't be changed, why not just implement the split in code? It's not as short as using a RegEx, but it might be actually easier for a reader to understand since the logic is straight forward.
This will almost certainly be faster and cheaper in terms of memory usage.
An example for code that solves this would be:
static void Main(string[] args)
{
var testString = "My web#;http://www.google.com My Web2#;http://www.bing.se Handbooks#;http://www.books.se/";
foreach(var x in SplitAndFormatUrls(testString))
{
Console.WriteLine(x);
}
}
private static IEnumerable<string> SplitAndFormatUrls(string input)
{
var length = input.Length;
var last = 0;
var seenSeparator = false;
var previousChar = ' ';
for (var index = 0; index < length; index++)
{
var currentChar = input[index];
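if ((currentChar == ' ' || index == length - 1) && seenSeparator)
{
// At the end of the input there is no trailing space, so include the final character.
var end = currentChar == ' ' ? index : index + 1;
var currentUrl = input.Substring(last, end - last);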
yield return currentUrl.Replace("#;", "#?");
last = index + 1;
seenSeparator = false;
previousChar = ' ';
continue;
}
if (currentChar == ';' && previousChar == '#')
{
seenSeparator = true;
}
previousChar = currentChar;
}
}

C# Best way to retrieve strings that's in quotation mark?

Suppose I am given the following text (in a string array)
engine.STEPCONTROL("00000000","02000001","02000043","02000002","02000007","02000003","02000008","02000004","02000009","02000005","02000010","02000006","02000011");
if("02000001" == 1){
dimlevel = 1;
}
if("02000001" == 2){
dimlevel = 3;
}
I'd like to extract the strings that are between the quotation marks and put them in a separate string array. For instance, string[] extracted would contain 00000000, 02000001, 02000043....
What is the best approach for this? Should I use regular expression to somehow parse those lines and split it?
Personally I don't think a regular expression is necessary. If you can be sure that the input string is always as described and will not have any escape sequences in it or vary in any other way, you could use something like this:
public static string[] ExtractNumbers(string[] originalCodeLines)
{
List<string> extractedNumbers = new List<string>();
string[] codeLineElements = originalCodeLines[0].Split('"');
foreach (string element in codeLineElements)
{
int result = 0;
if (int.TryParse(element, out result))
{
extractedNumbers.Add(element);
}
}
return extractedNumbers.ToArray();
}
It's not necessarily the most efficient implementation but it's quite short and it's easy to see what it does.
That could be
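For comparison, the regular-expression route the question asks about could be sketched as below. It is an illustration only: the names `QuotedStrings` and `Extract` are made up, and it assumes the quoted values never contain escaped quotation marks. Unlike the method above, it looks at every line and does not require the values to be numeric.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class QuotedStrings
{
    public static string[] Extract(IEnumerable<string> lines)
    {
        // "([^"]*)" captures whatever sits between one pair of double quotes.
        return lines
            .SelectMany(line => Regex.Matches(line, "\"([^\"]*)\"").Cast<Match>())
            .Select(m => m.Groups[1].Value)
            .ToArray();
    }

    public static void Main()
    {
        var lines = new[]
        {
            "engine.STEPCONTROL(\"00000000\",\"02000001\",\"02000043\");",
            "if(\"02000001\" == 1){"
        };
        Console.WriteLine(string.Join(", ", Extract(lines)));
        // prints "00000000, 02000001, 02000043, 02000001"
    }
}
```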
string data = "\"00000000\",\"02000001\",\"02000043\"".Replace("\"", string.Empty);
string[] myArray = data.Split(',');
or in 1 line
string[] data = "\"00000000\",\"02000001\",\"02000043\"".Replace("\"", string.Empty).Split(',');
