Parsing Nested Text in C Sharp


If I have a series of strings that have this base format:
"[id value]" // id and value are space-delimited; id will never contain spaces
They can then be nested like this:
[a]
[a [b value]]
[a [b [c [value]]]
So every item can have 0 or 1 value entries.
What is the best approach to go about parsing this format? Do I just use stuff like string.Split() or string.IndexOf() or are there better methods?

There is nothing wrong with the Split and IndexOf methods; they exist for string parsing.
Here is a sample for your case:
string str = "[a [b [c [d value]]]]";
while (str.Trim().Length > 0)
{
    // The last '[' and the first ']' delimit the innermost remaining item.
    int start = str.LastIndexOf('[');
    int end = str.IndexOf(']');
    string s = str.Substring(start + 1, end - (start + 1)).Trim();
    string[] pair = s.Split(' '); // this is what you are looking for; its length is 2 if the item has a value
    str = str.Remove(start, (end + 1) - start);
}

A little recursion and Split would work. The main point is to use recursion; it will make this much easier. Your input syntax looks kind of like LISP :)
Parsing a: split, no second part, done.
Parsing a [b value]: split, there is a second part, so go back to the beginning with [b value].
...
You get the idea.
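The recursive idea could be sketched roughly as follows. This is only an illustration, assuming balanced brackets and at most one value per item; the Node type and the NestedParser class are hypothetical names, not part of the question.

```csharp
using System;

// Hypothetical node type: an id plus either a nested item, a plain value, or neither.
public sealed class Node
{
    public string Id = "";
    public string Text;   // plain value, or null when absent
    public Node Child;    // nested item, or null when absent
}

public static class NestedParser
{
    public static Node Parse(string input)
    {
        string s = input.Trim();
        if (s.Length < 2 || s[0] != '[' || s[s.Length - 1] != ']')
            throw new FormatException("Expected [id] or [id value]: " + input);

        // Strip the outer brackets, then split on the first space only.
        string inner = s.Substring(1, s.Length - 2).Trim();
        string[] parts = inner.Split(new[] { ' ' }, 2);

        var node = new Node { Id = parts[0] };
        if (parts.Length == 2)
        {
            string rest = parts[1].Trim();
            if (rest.StartsWith("[", StringComparison.Ordinal))
                node.Child = Parse(rest);   // second part is itself an item: recurse
            else
                node.Text = rest;           // second part is a plain value
        }
        return node;
    }

    public static void Main()
    {
        // Walk the chain of nested items and print each id (and value, if any).
        for (Node cur = Parse("[a [b [c value]]]"); cur != null; cur = cur.Child)
            Console.WriteLine(cur.Id + (cur.Text != null ? " = " + cur.Text : ""));
    }
}
```

Splitting with a count of 2 keeps everything after the id together, so a nested item is handed to the recursive call intact.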

Regex is always a nice solution.
string test = "[a [b [c [value]]]";
Regex r = new Regex("\\[(?<id>[A-Za-z]*) (?<value>.*)\\]");
var res = r.Match(test);
Then you can get the value (which is [b [c [value]] after the first iteration) and apply the same again until the match fails.
string id = res.Groups[1].Value;
string value = res.Groups[2].Value;
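A rough sketch of the "apply the same again until the match fails" loop, using the same pattern as above. The RegexPeeler class and its Ids method are hypothetical names, and the sketch assumes balanced input; because the pattern requires a space after the id, an item without a value such as [a] does not match and is simply returned as the remainder.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexPeeler
{
    // Same pattern as in the answer; the greedy .* runs up to the last ']' in the text.
    static readonly Regex Item = new Regex(@"\[(?<id>[A-Za-z]*) (?<value>.*)\]");

    // Collects the ids from the outside in; 'innermost' receives whatever no longer matches.
    public static List<string> Ids(string text, out string innermost)
    {
        var ids = new List<string>();
        Match m = Item.Match(text);
        while (m.Success)
        {
            ids.Add(m.Groups["id"].Value);
            text = m.Groups["value"].Value;   // peel off one level
            m = Item.Match(text);
        }
        innermost = text;
        return ids;
    }

    public static void Main()
    {
        string rest;
        List<string> ids = Ids("[a [b [c value]]]", out rest);
        Console.WriteLine(string.Join(",", ids) + " -> " + rest);
    }
}
```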

A simple Split should work.
For every id, there is one opening bracket [.
So when you split the string on the brackets you get n pieces: the first n-1 are ids, and the last one contains the value.
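One way to read this suggestion, as a sketch: split on both bracket characters and discard the empty pieces. The BracketSplit class is a hypothetical name, and this only works when each item holds at most one nested item.

```csharp
using System;
using System.Linq;

public static class BracketSplit
{
    // Splits on '[' and ']' and keeps only the non-blank pieces, trimmed.
    public static string[] Pieces(string input)
    {
        return input.Split(new[] { '[', ']' }, StringSplitOptions.RemoveEmptyEntries)
                    .Select(p => p.Trim())
                    .Where(p => p.Length > 0)
                    .ToArray();
    }

    public static void Main()
    {
        // All pieces except the last are ids; the last one holds "id value".
        foreach (string piece in Pieces("[a [b [c value]]]"))
            Console.WriteLine(piece);
    }
}
```

The last piece can then be split once more on a space to separate the innermost id from its value.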

Related

C# How to extract words from a string and put them into class members

I have a problem with c# string manipulation and I'd appreciate your help.
I have a file that contains many lines. It looks like this:
firstWord number(secondWord) thirdWord(Phrase) Date1 Date2
firstWord number(secondWord) thirdWord(Phrase) Date1 Time1
...
I need to separate these words and put them into class properties. As you can see, the problem is that the spacing between words is not uniform: sometimes there is one space, sometimes eight. The second problem is that in the third position comes a phrase containing 2 to 5 words (again divided by spaces, or sometimes connected with _ or -), and it needs to be treated as one string; it has to be one class member. The class should look like this:
class A
string a = firstWord;
int b = number;
string c = phrase;
Date d = Date1;
Time e = Time1;
I'd appreciate if you had any ideas how to solve this. Thank you.

Use the following steps:
Use File.ReadAllLines() to get a string[], where each element represents one line of the file.
For each line, use string.Split() to chop the line into individual words. Use both space and parentheses as your delimiters, and pass StringSplitOptions.RemoveEmptyEntries so that runs of spaces do not produce empty elements. This will give you an array of words. Call it arr.
Now create an object of your class and assign like this:
string a = arr[0];
int b = int.Parse(arr[1]);
string c = string.Join(" ", arr.Skip(4).Take(arr.Length - 6));
DateTime d = DateTime.Parse(arr[arr.Length - 2]);
DateTime e = DateTime.Parse(arr[arr.Length - 1]);
The only tricky part is string c above. The logic is that everything from element no. 4 up to the 3rd-last element forms your phrase, so we use LINQ to extract those elements and join them back together. This requires that the phrase itself doesn't contain any parentheses, but I assume that shouldn't normally be the case.
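Put together, the steps above might look like the sketch below. The ParsedLine type, the LineSplitter class, and the sample line are all made up for illustration, and the dates are assumed to be in a format that DateTime.Parse accepts under the invariant culture.

```csharp
using System;
using System.Globalization;
using System.Linq;

// Hypothetical type mirroring class A from the question.
public sealed class ParsedLine
{
    public string A = "";
    public int B;
    public string C = "";
    public DateTime D;
    public DateTime E;
}

public static class LineSplitter
{
    static readonly char[] Delimiters = { ' ', '(', ')' };

    public static ParsedLine ParseLine(string line)
    {
        // RemoveEmptyEntries collapses runs of spaces and the gaps left by the parentheses.
        string[] arr = line.Split(Delimiters, StringSplitOptions.RemoveEmptyEntries);

        // firstWord, number, secondWord, thirdWord and two date fields: at least 6 tokens.
        if (arr.Length < 6)
            throw new FormatException("Too few fields: " + line);

        return new ParsedLine
        {
            A = arr[0],
            B = int.Parse(arr[1], CultureInfo.InvariantCulture),
            C = string.Join(" ", arr.Skip(4).Take(arr.Length - 6)),
            D = DateTime.Parse(arr[arr.Length - 2], CultureInfo.InvariantCulture),
            E = DateTime.Parse(arr[arr.Length - 1], CultureInfo.InvariantCulture),
        };
    }

    public static void Main()
    {
        // Made-up sample line with uneven spacing.
        ParsedLine r = ParseLine("alpha   12(beta)  gamma(one two three) 2017-01-01   2017-01-02");
        Console.WriteLine(r.A + " | " + r.B + " | " + r.C);
    }
}
```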

You need a loop plus the string and TryParse methods:
var list = new List<ClassName>();
foreach (string line in File.ReadLines(path).Where(l => !string.IsNullOrEmpty(l)))
{
    // An empty separator array splits on any whitespace.
    string[] fields = line.Trim().Split(new char[] { }, StringSplitOptions.RemoveEmptyEntries);
    if (fields.Length < 5) continue;
    var obj = new ClassName();
    list.Add(obj);
    obj.FirstWord = fields[0];
    int number;
    int index = fields[1].IndexOf('(');
    if (index > 0 && int.TryParse(fields[1].Remove(index), out number))
        obj.Number = number;
    int phraseStartIndex = fields[2].IndexOf('(');
    int phraseEndIndex = fields[2].LastIndexOf(')');
    if (phraseStartIndex != phraseEndIndex)
    {
        obj.Phrase = fields[2].Substring(++phraseStartIndex, phraseEndIndex - phraseStartIndex);
    }
    DateTime dt1;
    if (DateTime.TryParse(fields[3], out dt1))
        obj.Date1 = dt1;
    DateTime dt2;
    if (DateTime.TryParse(fields[4], out dt2)) // the second date is the fifth field
        obj.Date2 = dt2;
}

The following regular expression seems to cover what I imagine you would need - at least a good start.
^(?<firstWord>[\w\s]*)\s+(?<secondWord>\d+)\s+(?<thirdWord>[\w\s_-]+)\s+(?<date>\d{4}-\d{2}-\d{2})\s+(?<time>\d{2}:\d{2}:\d{2})$
This captures 5 named groups
firstWord is any alphanumeric or whitespace
secondWord is any numeric entry
thirdWord any alphanumeric, space underscore or hyphen
date is any iso formatted date (date not validated)
time any time (time not validated)
Any amount of whitespace is used as the delimiter - but you will have to Trim() any group captures. It makes a hell of a lot of assumptions about your format (dates are ISO formatted, times are hh:mm:ss).
You could use it like this:
Regex regex = new Regex(@"(?<firstWord>[\w\s]*)\s+(?<secondWord>\d+)\s+(?<thirdWord>[\w\s_-]+)\s+(?<date>\d{4}-\d{2}-\d{2})\s+(?<time>\d{2}:\d{2}:\d{2})$", RegexOptions.IgnoreCase);
var match = regex.Match("this is the first word 123 hello_world 2017-01-01 10:00:00");
if (match.Success)
{
    Console.WriteLine("{0}\r\n{1}\r\n{2}\r\n{3}\r\n{4}", match.Groups["firstWord"].Value.Trim(), match.Groups["secondWord"].Value, match.Groups["thirdWord"].Value, match.Groups["date"].Value, match.Groups["time"].Value);
}
http://rextester.com/LGM52187

You have to use Regex. As a starting point, for example, to get the first word you could use this:
string data = "Example 2323 Second This is a Phrase 2017-01-01 2019-01-03";
string firstword = new Regex(@"\b[A-Za-z]+\b").Matches(data)[0].Value;

How to remove only certain substrings from a string?

Using C#, I have a string that is a SQL script containing multiple queries. I want to remove sections of the string that are enclosed in single quotes. I can do this using Regex.Replace, in this manner:
string test = "Only 'together' can we turn him to the 'dark side' of the Force";
test = Regex.Replace(test, "'[^']*'", string.Empty);
Results in: "Only can we turn him to the of the Force"
What I want to do is remove the substrings between quotes EXCEPT for substrings containing a specific substring. For example, using the string above, I want to remove the quoted substrings except for those that contain "dark," such that the resulting string is:
Results in: "Only can we turn him to the 'dark side' of the Force"
How can this be accomplished using Regex.Replace, or perhaps by some other technique? I'm currently trying a solution that involves using Substring(), IndexOf(), and Contains().
Note: I don't care if the single quotes around "dark side" are removed or not, so the result could also be: "Only can we turn him to the dark side of the Force." I say this because a solution using Split() would remove all the single quotes.
Edit: I don't have a solution yet using Substring(), IndexOf(), etc. By "working on," I mean I'm thinking in my head how this can be done. I have no code, which is why I haven't posted any yet. Thanks.
Edit: VKS's solution below works. I wasn't escaping the \b the first attempt which is why it failed. Also, it didn't work unless I included the single quotes around the whole string as well.
test = Regex.Replace(test, "'(?![^']*\\bdark\\b)[^']*'", string.Empty);

'(?![^']*\bdark\b)[^']*'
Try this and replace the matches with an empty string; see the demo. The negative lookahead checks that the quoted text does not contain the word dark.
https://www.regex101.com/r/rG7gX4/12

While vks's solution works, I'd like to demonstrate a different approach:
string test = "Only 'together' can we turn him to the 'dark side' of the Force";
test = Regex.Replace(test, @"'[^']*'", match => {
    if (match.Value.Contains("dark"))
        return match.Value;
    // You can add more cases here
    return string.Empty;
});
Or, if your condition is simple enough:
test = Regex.Replace(test, @"'[^']*'", match => match.Value.Contains("dark")
    ? match.Value
    : string.Empty
);
That is, use a lambda to provide a callback for the replacement. This way, you can run arbitrary logic to replace the string.

Something like this would work. You can add all the strings you want to keep to the excludedString array.
string test = "Only 'together' can we turn him to the 'dark side' of the Force";
var excludedString = new string[] { "dark side" };
int startIndex = 0;
while ((startIndex = test.IndexOf('\'', startIndex)) >= 0)
{
    var endIndex = test.IndexOf('\'', startIndex + 1);
    if (endIndex < 0) break; // unmatched quote: nothing more to remove
    var subString = test.Substring(startIndex, (endIndex - startIndex) + 1);
    if (!excludedString.Contains(subString.Replace("'", "")))
    {
        test = test.Remove(startIndex, (endIndex - startIndex) + 1);
    }
    else
    {
        startIndex = endIndex + 1;
    }
}

Another method uses the regex alternation operator |.
@"('[^']*\bdark\b[^']*')|'[^']*'"
Then replace each match with $1.
string str = "Only 'together' can we turn him to the 'dark side' of the Force";
string result = Regex.Replace(str, @"('[^']*\bdark\b[^']*')|'[^']*'", "$1");
Console.WriteLine(result);
Explanation:
(...) is a capturing group.
'[^']*\bdark\b[^']*' matches every single-quoted string that contains the word dark. [^']* matches any character except ', zero or more times.
('[^']*\bdark\b[^']*'): because this pattern is inside a capturing group, the matched characters are stored in group 1.
| is the regex alternation operator.
'[^']*' matches all the remaining single-quoted strings. It won't match a quoted string that contains dark, because those were already matched by the pattern before the | operator.
Finally, replacing each match with the contents of group 1 gives the desired output: group 1 is empty for strings matched by the second alternative, so those are removed.

I made this attempt, which I think is what you had in mind (a solution using Split, Contains, ... without regex):
string test = "Only 'together' can we turn him to the 'dark side' of the Force";
string[] separated = test.Split('\'');
string result = "";
for (int i = 0; i < separated.Length; i++)
{
    string str = separated[i];
    str = str.Trim(); // trim the leading and trailing spaces
    if (i % 2 == 0 || str.Contains("dark")) // you can expand your condition
    {
        result += str + " "; // add a space after each added string
    }
}
result = result.Trim(); // trim the trailing space again

Remove the last three characters from a string

I want to remove last three characters from a string:
string myString = "abcdxxx";
Note that the string is dynamic data.

read last 3 characters from string [Initially asked question]
You can use string.Substring and give it the starting index and it will get the substring starting from given index till end.
myString.Substring(myString.Length-3)
Retrieves a substring from this instance. The substring starts at a
specified character position. MSDN
Edit, for updated post
Remove last 3 characters from string [Updated question]
To remove the last three characters from the string you can use string.Substring(Int32, Int32) and give it the starting index 0 and a length three less than the string length. It will get the substring before the last three characters.
myString = myString.Substring(0, myString.Length-3);
String.Substring Method (Int32, Int32)
Retrieves a substring from this instance. The substring starts at a
specified character position and has a specified length.
You can also use the String.Remove(Int32) method to remove the last three characters by passing length - 3 as the start index; it removes everything from that point to the end of the string.
myString = myString.Remove(myString.Length-3);
String.Remove Method (Int32)
Returns a new string in which all the characters in the current
instance, beginning at a specified position and continuing through the
last position, have been deleted

myString = myString.Remove(myString.Length - 3, 3);

I read through all these, but wanted something a bit more elegant. Just to remove a certain number of characters from the end of a string:
string.Concat("hello".Reverse().Skip(3).Reverse());
output:
"he"

The new C# 8.0 range operator can be a great shortcut to achieve this.
Example #1 (to answer the question):
string myString = "abcdxxx";
var shortenedString = myString[0..^3];
System.Console.WriteLine(shortenedString);
// Results: abcd
Example #2 (to show you how awesome range operators are):
string s = "FooBar99";
// If the last 2 characters of the string are 99 then change to 98
s = s[^2..] == "99" ? s[0..^2] + "98" : s;
System.Console.WriteLine(s);
// Results: FooBar98

myString.Remove(myString.Length-3);

string test = "abcdxxx";
test = test.Remove(test.Length - 3);
//output : abcd

You can use String.Remove to delete from a specified position to the end of the string.
myString = myString.Remove(myString.Length - 3);

Probably not exactly what you're looking for, since you say it's "dynamic data", but given your example string this also works:
? "abcdxxx".TrimEnd('x');
"abcd"

If you're working in C# 8 or later, you can use "ranges":
string myString = "abcdxxx";
string trimmed = myString[..^3]; // "abcd"
More examples:
string test = "0123456789", s;
char c;
c = test[^3]; // '7'
s = test[0..^3]; // "0123456"
s = test[..^3]; // "0123456"
s = test[2..^3]; // "23456"
s = test[2..7]; // "23456"
//c = test[^12]; // IndexOutOfRangeException
//s = test[8..^3]; // ArgumentOutOfRangeException
s = test[7..^3]; // string.Empty

str= str.Remove(str.Length - 3);

myString.Substring(myString.Length - 3, 3)
Here are some examples of Substring:
http://www.dotnetperls.com/substring

string myString = "abcdxxx";
if (myString.Length<3)
return;
string newString=myString.Remove(myString.Length - 3, 3);

Easy: text = text.Remove(text.Length - 3). I subtract 3 because Remove deletes every character from that index to the end of the string, which is at text.Length. So subtracting 3 gives the string with its last 3 characters removed.
You can generalize this to removing a characters from the end of the string, like this:
text = text.Remove(text.Length - a)
The logic is the same: Remove deletes everything from the given index to the end of the string, so subtracting a from the length gives the string with a characters removed.
So it doesn't just work for 3; it works for any positive integer a up to the length of the string. If a equals the length, the result is an empty string; if a is greater, the start index is negative and Remove throws an ArgumentOutOfRangeException.
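For dynamic data, where the string may be shorter than the number of characters to drop, a small guard avoids the exception. This is a minimal sketch; RemoveLast and the StringTail class are hypothetical names, and returning an empty string for short input is just one possible choice.

```csharp
using System;

public static class StringTail
{
    // Hypothetical helper: drops the last 'count' characters,
    // or returns an empty string when the text is not longer than 'count'.
    public static string RemoveLast(string text, int count)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));
        if (count < 0) throw new ArgumentOutOfRangeException(nameof(count));
        return count >= text.Length
            ? string.Empty
            : text.Substring(0, text.Length - count);
    }

    public static void Main()
    {
        Console.WriteLine(RemoveLast("abcdxxx", 3)); // abcd
        Console.WriteLine(RemoveLast("ab", 3).Length); // 0
    }
}
```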

Remove the last characters from a string
TXTB_DateofReiumbursement.Text = (gvFinance.SelectedRow.FindControl("lblDate_of_Reimbursement") as Label).Text.Remove(10)
.Text.Remove(10) // removes the text from index 10 to the end

items.Remove(items.Length - 3)
string.Remove() removes all characters from that index to the end. items.Length - 3 gives the index 3 characters from the end.

You can call the Remove method and pass the index at which the last 3 characters start. Those characters are
str.Substring(str.Length-3)
so the complete code can be
str = str.Remove(str.Length-3);

C# how to split a string on the basis of <> character

I want to split my text by <> characters.
Example suppose I have a string
string Name="this <link> is my <name>";
Now I want to split this so that I have a array of string like
ar[0]="this "
ar[1]="<link>"
ar[2]=" is my "
ar[3]="<name>"
I was trying with split function like
string[] ar=Name.Split('<');
I have also tried
string[] nameArray = Regex.Split(name, "<[^<]+>");
But this does not give me "<link>" and "<name>", so it is not a good approach.
Can I use a regular expression here?

This
Regex r = new Regex(@"(?<=.)(?=<)|(?<=>)(?=.)");
foreach (var s in r.Split("this_<link>_is_my_<name>"))
{
    Console.WriteLine(s);
}
gives
this_
<link>
_is_my_
<name>
(underscores used for clarity)
The regex splits on a zero-width point (so it doesn't remove anything) which is either:
preceded by something and followed by <
preceded by > and followed by something
The "something" checks are necessary to avoid empty strings at the start or end if your string starts or ends with something in brackets.
Note that something like "<link<link>>" will give you { "<link", "<link>", ">" }, so try to make your angle brackets balance.
If you want empty strings when the string starts with < or ends with >, you can use (?=<)|(?<=>). If you want empty strings in the middle when you encounter ><, I think you need to first split on (?=<) and then split all the results on (?<=>); I don't think you can do it in one go.
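The two-pass variant described last could be sketched like this. AngleSplitter and SplitKeepingEmpties are hypothetical names; the sketch relies on Regex.Split returning an empty string when a match sits at the very start or end of its input.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class AngleSplitter
{
    // First split before every '<', then split each piece after every '>'.
    public static string[] SplitKeepingEmpties(string input)
    {
        return Regex.Split(input, "(?=<)")
                    .SelectMany(part => Regex.Split(part, "(?<=>)"))
                    .ToArray();
    }

    public static void Main()
    {
        // Print each piece in quotes so that empty strings are visible.
        foreach (string piece in SplitKeepingEmpties("<link><name>"))
            Console.WriteLine("\"" + piece + "\"");
    }
}
```

Since both patterns are zero-width, nothing is consumed, and concatenating the pieces gives back the original string.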

Extract substring from a string until finds a comma

I'm building a page and would like to know how to extract substring from a string until finds a comma in ASP.Net C#. Can someone help please?

substring = str.Split(',')[0];
If str doesn't contain any commas, substring will be the same as str.
EDIT: as with most things, performance of this will vary for edge cases. If there are lots and lots of commas, this will create lots of String instances on the heap that won't be used. If it is a 5000 character string with a comma near the start, the IndexOf+Substring method will perform much better. However, for reasonably small strings this method will work fine.

var firstPart = str.Split(new [] { ',' }, 2)[0];
The second parameter specifies the maximum number of parts. Specifying 2 ensures performance is fine even if there are lots and lots of commas.

You can use IndexOf() to find out where the comma is, and then extract the substring. If you are sure the string will always contain a comma, you can skip the check.
string a = "asdkjafjksdlfm,dsklfmdkslfmdkslmfksd";
int comma = a.IndexOf(',');
string b = a;
if (comma != -1)
{
    b = a.Substring(0, comma);
}
Console.WriteLine(b);

// Throws ArgumentOutOfRangeException if myString contains no comma (IndexOf returns -1).
myString = myString.Substring(0,myString.IndexOf(','));

Alina, based on what you wrote above, then Split will work for you.
string[] a = comment.Split(',');
Given your example string, then a[0] = "aaa", a[1] = "bbbbb", a[2] = "cccc", and a[3] = "dddd"

string NoComma = "";
string example = "text before first comma, more stuff and another comma, there";
string result = example.IndexOf(',') == -1 ? NoComma : example.Split(',')[0];
