Regex to match last character in string - C# - c#

How can I remove the last ';' in a string?
In case of comment at the end of the string I need to return the ';' before the comment.
Example:
"line 1 //comment
line2;
extra text; //comment may also contain ;."

You didn't wrote what you wanna do with the character, so I give you a solution here that replaces the character:
string pattern = "(?<!//.*);(?=[^;]*(//|$))";
Console.WriteLine(Regex.Replace("line 1 //comment", pattern, "#"));
Console.WriteLine(Regex.Replace("line2;", pattern, "#"));
Console.WriteLine(Regex.Replace("extra; text; //comment may also contain ;.", pattern, "#"));
Output:
line 1 //comment
line2#
extra; text# //comment may also contain ;.

This is slightly ugly with Regex, but here it is:
var str = #"line 1 //comment
line2; test;
extra text; //comment may also contain ;.";
var matches = Regex.Matches(str, #"^(?:(?<!//).)+(;)", RegexOptions.Multiline);
if (matches.Count > 0)
{
Console.WriteLine(matches[matches.Count - 1].Groups[1].Index);
}
We get a match for the last semicolon in each line (that isn't preceded by a comment), then we look at the last of these matches.
We have to do this on a line-by-line basis, as comments apply for the whole line.
If you want to process each line individually (your question doesn't say this, but it implies it), then loop over matches instead of just looking at the last one.
If you want to replace each semicolon with another character, then you can do something like this:
const string replacement = "#";
var result = Regex.Replace(str, #"^((?:(?<!//).)+);", "$1" + replacement, RegexOptions.Multiline);
If you want to remove it entirely, then simply:
var result = Regex.Replace(str, #"^((?:(?<!//).)+);", "$1", RegexOptions.Multiline);
If you just want to remove the final semicolon in the entire string, then you can just use string.Remove:
var matches = Regex.Matches(str, #"^(?:(?<!//).)+(;)", RegexOptions.Multiline);
if (matches.Count > 0)
{
str = str.Remove(matches[matches.Count - 1].Groups[1].Index, 1);
}

Related

How to remove only certain substrings from a string?

Using C#, I have a string that is a SQL script containing multiple queries. I want to remove sections of the string that are enclosed in single quotes. I can do this using Regex.Replace, in this manner:
string test = "Only 'together' can we turn him to the 'dark side' of the Force";
test = Regex.Replace(test, "'[^']*'", string.Empty);
Results in: "Only can we turn him to the of the Force"
What I want to do is remove the substrings between quotes EXCEPT for substrings containing a specific substring. For example, using the string above, I want to remove the quoted substrings except for those that contain "dark," such that the resulting string is:
Results in: "Only can we turn him to the 'dark side' of the Force"
How can this be accomplished using Regex.Replace, or perhaps by some other technique? I'm currently trying a solution that involves using Substring(), IndexOf(), and Contains().
Note: I don't care if the single quotes around "dark side" are removed or not, so the result could also be: "Only can we turn him to the dark side of the Force." I say this because a solution using Split() would remove all the single quotes.
Edit: I don't have a solution yet using Substring(), IndexOf(), etc. By "working on," I mean I'm thinking in my head how this can be done. I have no code, which is why I haven't posted any yet. Thanks.
Edit: VKS's solution below works. I wasn't escaping the \b the first attempt which is why it failed. Also, it didn't work unless I included the single quotes around the whole string as well.
test = Regex.Replace(test, "'(?![^']*\\bdark\\b)[^']*'", string.Empty);
'(?![^']*\bdark\b)[^']*'
Try this.See demo.Replace by empty string.You can use lookahead here to check if '' contains a word dark.
https://www.regex101.com/r/rG7gX4/12
While vks's solution works, I'd like to demonstrate a different approach:
string test = "Only 'together' can we turn him to the 'dark side' of the Force";
test = Regex.Replace(test, #"'[^']*'", match => {
if (match.Value.Contains("dark"))
return match.Value;
// You can add more cases here
return string.Empty;
});
Or, if your condition is simple enough:
test = Regex.Replace(test, #"'[^']*'", match => match.Value.Contains("dark")
? match.Value
: string.Empty
);
That is, use a lambda to provide a callback for the replacement. This way, you can run arbitrary logic to replace the string.
some thing like this would work. you can add all strings you want to keep into the excludedStrings array
string test = "Only 'together' can we turn him to the 'dark side' of the Force";
var excludedString = new string[] { "dark side" };
int startIndex = 0;
while ((startIndex = test.IndexOf('\'', startIndex)) >= 0)
{
var endIndex = test.IndexOf('\'', startIndex + 1);
var subString = test.Substring(startIndex, (endIndex - startIndex) + 1);
if (!excludedString.Contains(subString.Replace("'", "")))
{
test = test.Remove(startIndex, (endIndex - startIndex) + 1);
}
else
{
startIndex = endIndex + 1;
}
}
Another method through regex alternation operator |.
#"('[^']*\bdark\b[^']*')|'[^']*'"
Then replace the matched character with $1
DEMO
string str = "Only 'together' can we turn him to the 'dark side' of the Force";
string result = Regex.Replace(str, #"('[^']*\bdark\b[^']*')|'[^']*'", "$1");
Console.WriteLine(result);
IDEONE
Explanation:
(...) called capturing group.
'[^']*\bdark\b[^']*' would match all the single quoted strings which contains the substring dark . [^']* matches any character but not of ', zero or more times.
('[^']*\bdark\b[^']*'), because the regex is within a capturing group, all the matched characters are stored inside the group index 1.
| Next comes the regex alternation operator.
'[^']*' Now this matches all the remaining (except the one contains dark) single quoted strings. Note that this won't match the single quoted string which contains the substring dark because we already matched those strings with the pattern exists before to the | alternation operator.
Finally replacing all the matched characters with the chars inside group index 1 will give you the desired output.
I made this attempt that I think you were thinking about (some solution using split, Contain, ... without regex)
string test = "Only 'together' can we turn him to the 'dark side' of the Force";
string[] separated = test.Split('\'');
string result = "";
for (int i = 0; i < separated.Length; i++)
{
string str = separated[i];
str = str.Trim(); //trim the tailing spaces
if (i % 2 == 0 || str.Contains("dark")) // you can expand your condition
{
result += str+" "; // add space after each added string
}
}
result = result.Trim(); //trim the tailing space again

REGEX: What the meaning of . followed by +?

Sorry to ask this question, but I'm really stuck. This code belongs to someone who already left the company. And it causing problem.
protected override string CleanDataLine(string line)
{
//the regular expression for GlobalSight log
Regex regex = new Regex("\".+\"");
Match match = regex.Match(line);
if (match.Success)
{
string matchPart = match.Value;
matchPart =
matchPart.Replace(string.Format("\"{0}\"",
Delimiter), string.Format("\"{0}\"", "*+*+"));
matchPart = matchPart.Replace(Delimiter, '_');
matchPart =
matchPart.Replace(string.Format("\"{0}\"", "*+*+"),
string.Format("\"{0}\"", Delimiter));
line = line.Replace(match.Value, matchPart);
}
return line;
}
I've spent to much time researching. What was he trying to accomplish?
Thanks for helping.
That regex matches
a quote ("),
followed by one or more (+) characters (any character except newlines (.), as many as possible),
followed by a quote ".
It's not a very good regex. For example, in the string foo "bar" baz "bam" boom, it will match "bar" baz "bam".
If the intention is to match a quoted string, a more appropriate regex would be "[^"]*".
. is any character except \n, + means 1 or more.
So: .+ is "1 or more characters"
The dot matches any character except line breaks.
+ is "one or more" (equal to {1,})
protected override string CleanDataLine(string line)
{
//the regular expression for GlobalSight log
Regex regex = new Regex("\".+\"");
Match match = regex.Match(line);
if (match.Success)
{
string matchPart = match.Value;
matchPart =
matchPart.Replace(string.Format("\"{0}\"",
Delimiter), string.Format("\"{0}\"", "*+*+"));
matchPart = matchPart.Replace(Delimiter, '_');
matchPart =
matchPart.Replace(string.Format("\"{0}\"", "*+*+"),
string.Format("\"{0}\"", Delimiter));
line = line.Replace(match.Value, matchPart);
}
return line;
}
line is just some text, could be Hello World, or anything really.
new Regex("\".+\"") the \" is an escaped quote, this means it's actually looking for a string to start with a double quote. .+ means to find any character not including the new-line character one or more times.
If it does match, then he tries to figure out the part that matched by grabbing the value.
It then just becomes a normal search and replace for whatever string was matched.

How to replace the text between two characters in c#

I am bit confused writing the regex for finding the Text between the two delimiters { } and replace the text with another text in c#,how to replace?
I tried this.
StreamReader sr = new StreamReader(#"C:abc.txt");
string line;
line = sr.ReadLine();
while (line != null)
{
if (line.StartsWith("<"))
{
if (line.IndexOf('{') == 29)
{
string s = line;
int start = s.IndexOf("{");
int end = s.IndexOf("}");
string result = s.Substring(start+1, end - start - 1);
}
}
//write the lie to console window
Console.Write Line(line);
//Read the next line
line = sr.ReadLine();
}
//close the file
sr.Close();
Console.ReadLine();
I want replace the found text(result) with another text.
Use Regex with pattern: \{([^\}]+)\}
Regex yourRegex = new Regex(#"\{([^\}]+)\}");
string result = yourRegex.Replace(yourString, "anyReplacement");
string s = "data{value here} data";
int start = s.IndexOf("{");
int end = s.IndexOf("}", start);
string result = s.Substring(start+1, end - start - 1);
s = s.Replace(result, "your replacement value");
To get the string between the parentheses to be replaced, use the Regex pattern
string errString = "This {match here} uses 3 other {match here} to {match here} the {match here}ation";
string toReplace = Regex.Match(errString, #"\{([^\}]+)\}").Groups[1].Value;
Console.WriteLine(toReplace); // prints 'match here'
To then replace the text found you can simply use the Replace method as follows:
string correctString = errString.Replace(toReplace, "document");
Explanation of the Regex pattern:
\{ # Escaped curly parentheses, means "starts with a '{' character"
( # Parentheses in a regex mean "put (capture) the stuff
# in between into the Groups array"
[^}] # Any character that is not a '}' character
* # Zero or more occurrences of the aforementioned "non '}' char"
) # Close the capturing group
\} # "Ends with a '}' character"
The following regular expression will match the criteria you specified:
string pattern = #"^(\<.{27})(\{[^}]*\})(.*)";
The following would perform a replace:
string result = Regex.Replace(input, pattern, "$1 REPLACE $3");
For the input: "<012345678901234567890123456{sdfsdfsdf}sadfsdf" this gives the output "<012345678901234567890123456 REPLACE sadfsdf"
You need two calls to Substring(), rather than one: One to get textBefore, the other to get textAfter, and then you concatenate those with your replacement.
int start = s.IndexOf("{");
int end = s.IndexOf("}");
//I skip the check that end is valid too avoid clutter
string textBefore = s.Substring(0, start);
string textAfter = s.Substring(end+1);
string replacedText = textBefore + newText + textAfter;
If you want to keep the braces, you need a small adjustment:
int start = s.IndexOf("{");
int end = s.IndexOf("}");
string textBefore = s.Substring(0, start-1);
string textAfter = s.Substring(end);
string replacedText = textBefore + newText + textAfter;
the simplest way is to use split method if you want to avoid any regex .. this is an aproach :
string s = "sometext {getthis}";
string result= s.Split(new char[] { '{', '}' })[1];
You can use the Regex expression that some others have already posted, or you can use a more advanced Regex that uses balancing groups to make sure the opening { is balanced by a closing }.
That expression is then (?<BRACE>\{)([^\}]*)(?<-BRACE>\})
You can test this expression online at RegexHero.
You simply match your input string with this Regex pattern, then use the replace methods of Regex, for instance:
var result = Regex.Replace(input, "(?<BRACE>\{)([^\}]*)(?<-BRACE>\})", textToReplaceWith);
For more C# Regex Replace examples, see http://www.dotnetperls.com/regex-replace.

Removing all whitespace lines from a multi-line string efficiently

In C# what's the best way to remove blank lines i.e., lines that contain only whitespace from a string? I'm happy to use a Regex if that's the best solution.
EDIT: I should add I'm using .NET 2.0.
Bounty update: I'll roll this back after the bounty is awarded, but I wanted to clarify a few things.
First, any Perl 5 compat regex will work. This is not limited to .NET developers. The title and tags have been edited to reflect this.
Second, while I gave a quick example in the bounty details, it isn't the only test you must satisfy. Your solution must remove all lines which consist of nothing but whitespace, as well as the last newline. If there is a string which, after running through your regex, ends with "/r/n" or any whitespace characters, it fails.
If you want to remove lines containing any whitespace (tabs, spaces), try:
string fix = Regex.Replace(original, #"^\s*$\n", string.Empty, RegexOptions.Multiline);
Edit (for #Will): The simplest solution to trim trailing newlines would be to use TrimEnd on the resulting string, e.g.:
string fix =
Regex.Replace(original, #"^\s*$\n", string.Empty, RegexOptions.Multiline)
.TrimEnd();
string outputString;
using (StringReader reader = new StringReader(originalString)
using (StringWriter writer = new StringWriter())
{
string line;
while((line = reader.ReadLine()) != null)
{
if (line.Trim().Length > 0)
writer.WriteLine(line);
}
outputString = writer.ToString();
}
off the top of my head...
string fixed = Regex.Replace(input, "\s*(\n)","$1");
turns this:
fdasdf
asdf
[tabs]
[spaces]
asdf
into this:
fdasdf
asdf
asdf
Using LINQ:
var result = string.Join("\r\n",
multilineString.Split(new string[] { "\r\n" }, ...None)
.Where(s => !string.IsNullOrWhitespace(s)));
If you're dealing with large inputs and/or inconsistent line endings you should use a StringReader and do the above old-school with a foreach loop instead.
Alright this answer is in accordance to the clarified requirements specified in the bounty:
I also need to remove any trailing newlines, and my Regex-fu is
failing. My bounty goes to anyone who can give me a regex which passes
this test: StripWhitespace("test\r\n \r\nthis\r\n\r\n") ==
"test\r\nthis"
So Here's the answer:
(?<=\r?\n)(\s*$\r?\n)+|(?<=\r?\n)(\r?\n)+|(\r?\n)+\z
Or in the C# code provided by #Chris Schmich:
string fix = Regex.Replace("test\r\n \r\nthis\r\n\r\n", #"(?<=\r?\n)(\s*$\r?\n)+|(?<=\r?\n)(\r?\n)+|(\r?\n)+\z", string.Empty, RegexOptions.Multiline);
Now let's try to understand it. There are three optional patterns in here which I am willing to replace with string.empty.
(?<=\r?\n)(\s*$\r?\n)+ - matches one to unlimited lines containing only white space and preceeded by a line break (but does not match the first preceeding line breaks).
(?<=\r?\n)(\r?\n)+ - matches one to unlimited empty lines with no content that are preceeded by a line break (but does not match the first preceeding line breaks).
(\r?\n)+\z - matches one to unlimited line breaks at the end of the tested string (trailing line breaks as you called them)
That satisfies your test perfectly! But also satisfies both \r\n and \n line break styles! Test it out! I believe this will be the most correct answer, although simpler expression would pass your specified bounty test, this regex passes more complex conditions.
EDIT: #Will pointed out a potential flaw in the last pattern match of the above regex in that it won't match multiple line breaks containing white space at the end of the test string. So let's change that last pattern to this:
\b\s+\z The \b is a word boundry (beginning or END of a word), the \s+ is one or more white space characters, the \z is the end of the test string (end of "file"). So now it will match any assortment of whitespace at the end of the file including tabs and spaces in addition to carriage returns and line breaks. I tested both of #Will's provided test cases.
So all together now, it should be:
(?<=\r?\n)(\s*$\r?\n)+|(?<=\r?\n)(\r?\n)+|\b\s+\z
EDIT #2: Alright there is one more possible case #Wil found that the last regex doesn't cover. That case is inputs that have line breaks at the beginning of the file before any content. So lets add one more pattern to match the beginning of the file.
\A\s+ - The \A match the beginning of the file, the \s+ match one or more white space characters.
So now we've got:
\A\s+|(?<=\r?\n)(\s*$\r?\n)+|(?<=\r?\n)(\r?\n)+|\b\s+\z
So now we have four patterns for matching:
whitespace at the beginning of the file,
redundant line breaks containing white space, (ex: \r\n \r\n\t\r\n)
redundant line breaks with no content, (ex: \r\n\r\n)
whitespace at the end of the file
not good. I would use this one using JSON.net:
var o = JsonConvert.DeserializeObject(prettyJson);
new minifiedJson = JsonConvert.SerializeObject(o, Formatting.None);
In response to Will's bounty, which expects a solution that takes "test\r\n \r\nthis\r\n\r\n" and outputs "test\r\nthis", I've come up with a solution that makes use of atomic grouping (aka Nonbacktracking Subexpressions on MSDN). I recommend reading those articles for a better understanding of what's happening. Ultimately the atomic group helped match the trailing newline characters that were otherwise left behind.
Use RegexOptions.Multiline with this pattern:
^\s+(?!\B)|\s*(?>[\r\n]+)$
Here is an example with some test cases, including some I gathered from Will's comments on other posts, as well as my own.
string[] inputs =
{
"one\r\n \r\ntwo\r\n\t\r\n \r\n",
"test\r\n \r\nthis\r\n\r\n",
"\r\n\r\ntest!",
"\r\ntest\r\n ! test",
"\r\ntest \r\n ! "
};
string[] outputs =
{
"one\r\ntwo",
"test\r\nthis",
"test!",
"test\r\n ! test",
"test \r\n ! "
};
string pattern = #"^\s+(?!\B)|\s*(?>[\r\n]+)$";
for (int i = 0; i < inputs.Length; i++)
{
string result = Regex.Replace(inputs[i], pattern, "",
RegexOptions.Multiline);
Console.WriteLine(result == outputs[i]);
}
EDIT: To address the issue of the pattern failing to clean up text with a mix of whitespace and newlines, I added \s* to the last alternation portion of the regex. My previous pattern was redundant and I realized \s* would handle both cases.
string corrected =
System.Text.RegularExpressions.Regex.Replace(input, #"\n+", "\n");
I'll go with:
public static string RemoveEmptyLines(string value) {
using (StringReader reader = new StringReader(yourstring)) {
StringBuilder builder = new StringBuilder();
string line;
while ((line = reader.ReadLine()) != null) {
if (line.Trim().Length > 0)
builder.AppendLine(line);
}
return builder.ToString();
}
}
Here's another option: use the StringReader class. Advantages: one pass over the string, creates no intermediate arrays.
public static string RemoveEmptyLines(this string text) {
var builder = new StringBuilder();
using (var reader = new StringReader(text)) {
while (reader.Peek() != -1) {
string line = reader.ReadLine();
if (!string.IsNullOrWhiteSpace(line))
builder.AppendLine(line);
}
}
return builder.ToString();
}
Note: the IsNullOrWhiteSpace method is new in .NET 4.0. If you don't have that, it's trivial to write on your own:
public static bool IsNullOrWhiteSpace(string text) {
return string.IsNullOrEmpty(text) || text.Trim().Length < 1;
}
In response to Will's bounty here is a Perl sub that gives correct response to the test case:
sub StripWhitespace {
my $str = shift;
print "'",$str,"'\n";
$str =~ s/(?:\R+\s+(\R)+)|(?:()\R+)$/$1/g;
print "'",$str,"'\n";
return $str;
}
StripWhitespace("test\r\n \r\nthis\r\n\r\n");
output:
'test
this
'
'test
this'
In order to not use \R, replace it with [\r\n] and inverse the alternative. This one produces the same result:
$str =~ s/(?:(\S)[\r\n]+)|(?:[\r\n]+\s+([\r\n])+)/$1/g;
There're no needs for special configuration neither multi line support. Nevertheless you can add s flag if it's mandatory.
$str =~ s/(?:(\S)[\r\n]+)|(?:[\r\n]+\s+([\r\n])+)/$1/sg;
if its only White spaces why don't you use the C# string method
string yourstring = "A O P V 1.5";
yourstring.Replace(" ", string.empty);
result will be "AOPV1.5"
char[] delimiters = new char[] { '\r', '\n' };
string[] lines = value.Split(delimiters, StringSplitOptions.RemoveEmptyEntries);
string result = string.Join(Environment.NewLine, lines)
Here is something simple if working against each individual line...
(^\s+|\s+|^)$
Eh. Well, after all that, I couldn't find one that would hit all the corner cases I could figure out. The following is my latest incantation of a regex that strips
All empty lines from the start of a string
Not including any spaces at the beginning of the first non-whitespace line
All empty lines after the first non-whitespace line and before the last non-whitespace line
Again, preserving all whitespace at the beginning of any non-whitespace line
All empty lines after the last non-whitespace line, including the last newline
(?<=(\r\n)|^)\s*\r\n|\r\n\s*$
which essentially says:
Immediately after
The beginning of the string OR
The end of the last line
Match as much contiguous whitespace as possible that ends in a newline*
OR
Match a newline and as much contiguous whitespace as possible that ends at the end of the string
The first half catches all whitespace at the start of the string until the first non-whitespace line, or all whitespace between non-whitespace lines. The second half snags the remaining whitespace in the string, including the last non-whitespace line's newline.
Thanks to all who tried to help out; your answers helped me think through everything I needed to consider when matching.
*(This regex considers a newline to be \r\n, and so will have to be adjusted depending on the source of the string. No options need to be set in order to run the match.)
String Extension
public static string UnPrettyJson(this string s)
{
try
{
// var jsonObj = Json.Decode(s);
// var sObject = Json.Encode(value); dont work well with array of strings c:['a','b','c']
object jsonObj = JsonConvert.DeserializeObject(s);
return JsonConvert.SerializeObject(jsonObj, Formatting.None);
}
catch (Exception e)
{
throw new Exception(
s + " Is Not a valid JSON ! (please validate it in http://www.jsoneditoronline.org )", e);
}
}
Im not sure is it efficient but =)
List<string> strList = myString.Split(new string[] { "\n" }, StringSplitOptions.None).ToList<string>();
myString = string.Join("\n", strList.Where(s => !string.IsNullOrWhiteSpace(s)).Distinct().ToList());
Try this.
string s = "Test1" + Environment.NewLine + Environment.NewLine + "Test 2";
Console.WriteLine(s);
string result = s.Replace(Environment.NewLine, String.Empty);
Console.WriteLine(result);
s = Regex.Replace(s, #"^[^\n\S]*\n", "");
[^\n\S] matches any character that's not a linefeed or a non-whitespace character--so, any whitespace character except \n. But most likely the only characters you have to worry about are space, tab and carriage return, so this should work too:
s = Regex.Replace(s, #"^[ \t\r]*\n", "");
And if you want it to catch the last line, without a final linefeed:
s = Regex.Replace(s, #"^[ \t\r]*\n?", "");

Regex removing double/triple comma in string

I need to parse a string so the result should output like that:
"abc,def,ghi,klm,nop"
But the string I am receiving could looks more like that:
",,,abc,,def,ghi,,,,,,,,,klm,,,nop"
The point is, I don't know in advance how many commas separates the words.
Is there a regex I could use in C# that could help me resolve this problem?
You can use the ,{2,} expression to match any occurrences of 2 or more commas, and then replace them with a single comma.
You'll probably need a Trim call in there too, to remove any leading or trailing commas left over from the Regex.Replace call. (It's possible that there's some way to do this with just a regex replace, but nothing springs immediately to mind.)
string goodString = Regex.Replace(badString, ",{2,}", ",").Trim(',');
Search for ,,+ and replace all with ,.
So in C# that could look like
resultString = Regex.Replace(subjectString, ",,+", ",");
,,+ means "match all occurrences of two commas or more", so single commas won't be touched. This can also be written as ,{2,}.
a simple solution without regular expressions :
string items = inputString.Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries);
string result = String.Join(",", items);
Actually, you can do it without any Trim calls.
text = Regex.Replace(text, "^,+|,+$|(?<=,),+", "");
should do the trick.
The idea behind the regex is to only match that, which we want to remove. The first part matches any string of consecutive commas at the start of the input string, the second matches any consecutive string of commas at the end, while the last matches any consecutive string of commas that follows a comma.
Here is my effort:
//Below is the test string
string test = "YK 002 10 23 30 5 TDP_XYZ "
private static string return_with_comma(string line)
{
line = line.TrimEnd();
line = line.Replace(" ", ",");
line = Regex.Replace(line, ",,+", ",");
string[] array;
array = line.Split(',');
for (int x = 0; x < array.Length; x++)
{
line += array[x].Trim();
}
line += "\r\n";
return line;
}
string result = return_with_comma(test);
//Output is
//YK,002,10,23,30,5,TDP_XYZ

Categories

Resources