C# Replace with regex - c#

I'm new to VB, C#, and am struggling with regex. I think I've got the following code format to replace the regex match with blank space in my file.
EDIT: Per comments this code block has been changed.
var fileContents = System.IO.File.ReadAllText(@"C:\path\to\file.csv");
fileContents = fileContents.Replace(fileContents, @"regex", "");
regex = new Regex(pattern);
regex.Replace(filecontents, "");
System.IO.File.WriteAllText(@"C:\path\to\file.csv", fileContents);
My files are formatted like this:
"1111111","22222222222","Text that may, have a comma, or two","2014-09-01",,,,,,
So far, I have a regex that finds any string between ," and ", that contains a comma (there are never commas in the first or last cell, so I'm not worried about excluding those two). I'm testing the regex in Expresso:
(?<=,")([^"]+,[^"]+)(?=",)
I'm just not sure how to isolate that comma as what needs to be replaced. What would be the best way to do this?
SOLVED:
Combined [^"]+ with look behind/ahead:
(?<=,"[^"]+)(,)(?=[^"]+",)
FINAL EDIT:
Here's my final complete solution:
//read file contents
var fileContents = System.IO.File.ReadAllText(@"C:\path\to\file.csv");
//find all quoted cells that contain a comma
var regex = new Regex("(?<=,\")([^\"]+,[^\"]+)(?=\",)");
//remove the commas inside each matched cell
fileContents = regex.Replace(fileContents, m => m.ToString().Replace(",", ""));
//write result back to file
System.IO.File.WriteAllText(@"C:\path\to\file.csv", fileContents);

Figured it out by combining the [^"]+ with the look ahead ?= and look behind ?<= so that it finds strings beginning with ,"[anything that's not double quotes, one or more times] then has a comma, then ends with [anything that's not double quotes, one or more times]",
(?<=,"[^"]+)(,)(?=[^"]+",)
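As a quick check of the pattern above, here is a minimal sketch that runs it against the sample line from the question. It is written as a top-level program (C# 9 or later), and the variable names are only illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

// Sample line from the question (quotes escaped for a regular C# string).
string line = "\"1111111\",\"22222222222\",\"Text that may, have a comma, or two\",\"2014-09-01\",,,,,,";

// A comma preceded by ,"<non-quote chars> and followed by <non-quote chars>",
var commaInsideQuotes = new Regex("(?<=,\"[^\"]+)(,)(?=[^\"]+\",)");

// Delete only those commas; the separators between cells are not matched.
string cleaned = commaInsideQuotes.Replace(line, "");
Console.WriteLine(cleaned);
```

Because the lookbehind and lookahead both require at least one non-quote character, a comma that sits directly next to a quotation mark is never matched, which is why the cell separators survive.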

Try to parse out all your columns with this:
Regex regex = new Regex("(?<=\").*?(?=\")");
Then you can just do:
foreach (Match match in regex.Matches(fileContents))
{
    fileContents = fileContents.Replace(match.ToString(), match.ToString().Replace(",", string.Empty));
}
Might not be as fast but should work.

I would probably use the overload of Regex.Replace that takes a delegate to return the replaced text.
This is useful when you have a simple regex to identify the pattern but you need to do something less straightforward (complex logic) for the replace.
I find keeping your regexes simple will pay benefits when you're trying to maintain them later.
Note: this is similar to the answer by @Florian, but this replace restricts itself to replacement only in the matched text.
string exp = "(?<=,\")([^\"]+,[^\"]+)(?=\",)";
var regex = new Regex(exp);
string replacedText = regex.Replace(fileContents, m => m.ToString().Replace(",", ""));

What you have there is an irregular language. This is because a comma can mean different things depending upon where it is in the text stream. Strangely Regular Expressions are designed to parse regular languages where a comma would mean the same thing regardless of where it is in the text stream. What you need for an irregular language is a parser. In fact Regular expressions are mostly used for tokenizing strings before they are entered into a parser.
While what you are trying to do can be done using regular expressions it is likely to be very slow. For example you can use the following (which will work even if the comma is the first or last character in the field). However every time it finds a comma it will have to scan backwards and forwards to check if it is between two quotation characters.
(?<=,"[^"]*),(?=[^"]*",)
Note also that there may be a flaw in this approach that you have not yet spotted. I don't know if you have this issue, but CSV files often have quotation characters in the middle of fields where there may also be a comma. In these cases applications like MS Excel will typically double the quote up to show that it is not the end of the field. Like this:
"1111111","22222222222","Text that may, have a comma, Quote"" or two","2014-09-01",,,,,,
In this case you are going to be out of luck with a regular expression.
Thankfully the code to deal with CSV files is very simple:
// Requires: using System.Collections.Generic; using System.Text;
public static IList<string> ParseCSVLine(string csvLine)
{
    List<string> result = new List<string>();
    StringBuilder buffer = new StringBuilder();
    bool inQuotes = false;
    char lastChar = '\0';
    foreach (char c in csvLine)
    {
        switch (c)
        {
            case '"':
                if (inQuotes)
                {
                    inQuotes = false;
                }
                else
                {
                    // A quote right after a closing quote is an escaped (doubled) quote
                    if (lastChar == '"')
                    {
                        buffer.Append('"');
                    }
                    inQuotes = true;
                }
                break;
            case ',':
                if (inQuotes)
                {
                    // Comma inside a quoted field is part of the data
                    buffer.Append(',');
                }
                else
                {
                    // Comma outside quotes ends the current field
                    result.Add(buffer.ToString());
                    buffer.Clear();
                }
                break;
            default:
                buffer.Append(c);
                break;
        }
        lastChar = c;
    }
    // Add the final field
    result.Add(buffer.ToString());
    buffer.Clear();
    return result;
}
PS. There are another couple of issues often run into with CSV files which the code I have given doesn't solve. First is what happens if a field has an end of line character in the middle of it? Second is how do you know what character encoding a CSV file is in? The former of these two issues is easy to solve by modifying my code slightly. The second however is near impossible to do without coming to some agreement with the person supplying the file to you.
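To tie the state-tracking idea back to the original question (removing commas inside quoted fields), here is a simplified single-pass sketch. It is not the parser above: it only tracks whether the current character is inside quotes and drops commas there. The name RemoveCommasInsideQuotes is hypothetical, and the sketch assumes each record fits on one line.

```csharp
using System;
using System.Text;

// Hypothetical helper: copies the line, skipping commas that occur inside quotes.
static string RemoveCommasInsideQuotes(string csvLine)
{
    var output = new StringBuilder(csvLine.Length);
    bool inQuotes = false;
    foreach (char c in csvLine)
    {
        // Every quote toggles the state; a doubled quote toggles twice,
        // so the state is back to "inside" after the pair.
        if (c == '"') inQuotes = !inQuotes;
        if (c == ',' && inQuotes) continue;
        output.Append(c);
    }
    return output.ToString();
}

string line = "\"1111111\",\"Text that may, have a comma, Quote\"\" or two\",\"2014-09-01\",,";
string cleaned = RemoveCommasInsideQuotes(line);
Console.WriteLine(cleaned);
```

Unlike the lookaround regex, this visits each character once and copes with the doubled-quote case shown earlier.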

Related

Regular expression issues in .net 6 value converter

I am trying to learn some .NET 6 and C#, and I am struggling with regular expressions a lot. More specifically, I am using Avalonia on Windows, if that is relevant.
I am trying to do a small app with 2 textboxes. I write text on one and get the text "filtered" in the other one using a value converter.
I would like to filter math expressions to try to solve them later on. Something simple, kind of a way of writing text math and getting results real time.
I have been trying for several weeks to figure this regular expression on my own with no success whatsoever.
I would like to replace in my string "_Expression{BLABLA}" for "BLABLA". For testing my expressions I have been checking in http://regexstorm.net/ and https://regex101.com/ and according to them my matches should be correct (unless I misunderstood the results). But the results in my little app are extremely odd to me and I finally decided to ask for help.
Here is my code:
private static string? FilterStr(object value)
{
    if (value is string str)
    {
        string pattern = @"\b_Expression{(.+?)\w*}";
        Regex rgx = new(pattern);
        foreach (Match match in rgx.Matches(str))
        {
            string aux = "";
            aux = match.Value;
            aux = Regex.Replace(aux, @"_Expression{", "");
            aux = Regex.Replace(aux, @"[\}]", "");
            str = Regex.Replace(str, match.Value, aux);
        }
        return new string(str);
    }
    return null;
}
Then the results for some sample inputs are:
Input:
Some text
_Expression{x}
_Expression{1}
_Expression{4}
_Expression{4.5} _Expression{4+4}
_Expression{4-4} _Expression{4*x}
_Expression{x/x}
_Expression{x^4}
_Expression{sin(x)}
Output:
Some text
x
1{1}
1{4}
1{4.5} 1{4+4}
1{4-4} 1{4*x}
1{x/x}
1{x^4}
1{sin(x)}
or
Input:
Some text
_Expression{x}
_Expression{4}
_Expression{4.5} _Expression{4+4}
_Expression{4-4} _Expression{4*x}
_Expression{x/x}
_Expression{x^4}
_Expression{sin(x)}
Output:
Some text
x
_Expression{4}
4.5 _Expression{4+4}
4-4 _Expression{4*x}
x/x
_Expression{x^4}
_Expression{sin(x)}
This behaviour is very confusing to me. I can't see why "(.+?)" works with some of them and not with others... Or maybe I haven't defined something properly, or my Replace is wrong? I can't see it...
Thanks a lot for the time! :)
There are some missing parts in your regular expression, for example it doesn't have the curly braces { and } escaped, since curly braces have a special meaning in a regular expression; they are used as quantifiers.
Use the one below.
For extracting the math expression between the curly braces, it uses a named capturing group with name mathExpression.
_Expression\{(?<mathExpression>.+?)\}
_Expression\{ : start with the fixed text _Expression{
(?<mathExpression> : start a named capturing group with name mathExpression
.+? : take the next characters in a non greedy way
) : end the named capturing group
\} : end with the fixed character }
The below example will output 2 matches
Regex regex = new(@"_Expression\{(?<mathExpression>.+?)\}");
var matches = regex.Matches(@"_Expression{4.5} _Expression{4+4}");
foreach (Match match in matches.Where(o => o.Success))
{
    var mathExpression = match.Groups["mathExpression"];
    Console.WriteLine(mathExpression);
}
Output
4.5
4+4
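If the goal is to replace each _Expression{...} with its contents, a single Regex.Replace with a group substitution may be enough. The sketch below assumes the named group from the pattern above; the sample input is made up for illustration. It also sidesteps one pitfall in the question's loop: there, match.Value is passed to Regex.Replace as a pattern, so text such as {1} or {4} is read as a quantifier rather than as literal braces.

```csharp
using System;
using System.Text.RegularExpressions;

// Illustrative input, modelled on the question's samples.
string input = "Some text\n_Expression{1}\n_Expression{4.5} _Expression{4+4}\n_Expression{sin(x)}";

// ${mathExpression} in the replacement string inserts the text captured by the named group.
string output = Regex.Replace(
    input,
    @"_Expression\{(?<mathExpression>.+?)\}",
    "${mathExpression}");

Console.WriteLine(output);
```

If a loop over matches is still needed, wrapping the search text in Regex.Escape (or using String.Replace) keeps it from being interpreted as a pattern.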

Check array for string that starts with given one (ignoring case)

I am trying to see if my string starts with a string in an array of strings I've created. Here is my code:
string x = "Table a";
string y = "a table";
string[] arr = { "table", "chair", "plate" };
if (arr.Contains(x.ToLower())) {
    // this should be true
}
if (arr.Contains(y.ToLower())) {
    // this should be false
}
How can I make it so my if statement comes up true? I'd like to just match the beginning of string x to the contents of the array while ignoring the case and the following characters. I thought I needed regex to do this, but I could be mistaken. I'm a bit of a newbie with regex.
It seems you want to check if your string contains an element from your list, so this should be what you are looking for:
if (arr.Any(c => x.ToLower().Contains(c)))
Or simpler:
if (arr.Any(x.ToLower().Contains))
Or based on your comments you may use this:
if (arr.Any(x.ToLower().Split(' ')[0].Contains))
Because you said you want regex...
you can set up a regex with var regex = new Regex("(table|plate|fork)");
and check it with if (regex.IsMatch(myString)) { ... }
But for the issue at hand you don't have to use Regex, as you are searching for an exact substring. You can use
(as @S.Akbari mentioned): if (arr.Any(c => x.ToLower().Contains(c))) { ... }
Enumerable.Contains matches exact values (and there is no built-in comparison that checks for "starts with"), so you need Any, which takes a predicate that receives each array element as a parameter and performs the check. So the first step is to turn "contains" the other way around, so that the given string is checked for an element from the array, like:
var myString = "some string";
if (arr.Any(arrayItem => myString.Contains(arrayItem)))...
Now, you are actually asking for "string starts with given word" and not just contains, so you need StartsWith (which, unlike Contains, conveniently lets you specify case sensitivity; see Case insensitive 'Contains(string)'):
if (arr.Any(arrayItem => myString.StartsWith(
    arrayItem, StringComparison.CurrentCultureIgnoreCase))) ...
Note that this code will accept "tableAAA bob" - if you really need to break on word boundary regular expression may be better choice. Building regular expressions dynamically is trivial as long as you properly escape all the values.
Regex should be
beginning of string - ^
properly escaped word you are searching for - Escape Special Character in Regex
word break - \b
if (arr.Any(arrayItem => Regex.IsMatch(myString,
    String.Format(@"^{0}\b", Regex.Escape(arrayItem)),
    RegexOptions.IgnoreCase))) ...
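To make the difference between the two checks concrete, here is a small sketch that puts them side by side. The helper names StartsWithPrefix and StartsWithWord are hypothetical, and the StartsWith variant uses ordinal comparison instead of the culture-sensitive one shown above.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Plain prefix check: also accepts "tableAAA bob".
static bool StartsWithPrefix(string text, string[] words) =>
    words.Any(w => text.StartsWith(w, StringComparison.OrdinalIgnoreCase));

// Prefix check that must end on a word boundary: rejects "tableAAA bob".
static bool StartsWithWord(string text, string[] words) =>
    words.Any(w => Regex.IsMatch(text, "^" + Regex.Escape(w) + @"\b", RegexOptions.IgnoreCase));

string[] arr = { "table", "chair", "plate" };

foreach (string sample in new[] { "Table a", "a table", "tableAAA bob" })
{
    Console.WriteLine($"{sample}: prefix={StartsWithPrefix(sample, arr)}, word={StartsWithWord(sample, arr)}");
}
```

Only the regex version distinguishes "Table a" from "tableAAA bob"; both reject "a table".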
you can do something like below using TypeScript. Instead of Starts with you can also use contains or equals etc..
public namesList: Array<string> = ['name1','name2','name3','name4','name5'];
// SomeString = 'name1, Hello there';
private isNamePresent(SomeString : string):boolean{
if (this.namesList.find(name => SomeString.startsWith(name)))
return true;
return false;
}
I think I understand what you are trying to say here, although there is still some ambiguity. Are you trying to see if one word in your string (which is a sentence) exists in your array?
@Amy is correct, this might not have to do with Regex at all.
I think this segment of code will do what you want in Java (which can easily be translated to C#):
Java:
x = x.ToLower();
string[] words = x.Split("\\s+");
foreach(string word in words){
foreach(string element in arr){
if(element.Equals(word)){
return true;
}
}
}
return false;
You can also use a Set to store the elements in your array, which can make look up more efficient.
Java:
x = x.ToLower();
string[] words = x.Split("\\s+");
HashSet<string> set = new HashSet<string>(arr);
for(string word : words){
if(set.contains(word)){
return true;
}
}
return false;
Edit: (12/22, 11:05am)
I rewrote my solution in C#, thanks to reminders by @Amy and @JohnyL. Since the author only wants to match the first word of the string, this edited code should work :)
C#:
static bool Contains(string x, string[] arr)
{
    // Compare only the first word of x, lower-cased, against the array
    string[] words = x.ToLower().Split(' ');
    var set = new HashSet<string>(arr);
    return set.Contains(words[0]);
}
Sorry my question was so vague but here is the solution thanks to some help from a few people that answered.
var regex = new Regex("^(table|chair|plate) *.*");
if (regex.IsMatch(x.ToLower())){}

Regex Replacing only whole matches

I am trying to replace a bunch of strings in files. The strings are stored in a datatable along with the new string value.
string contents = File.ReadAllText(file);
foreach (DataRow dr in FolderRenames.Rows)
{
contents = Regex.Replace(contents, dr["find"].ToString(), dr["replace"].ToString());
File.SetAttributes(file, FileAttributes.Normal);
File.WriteAllText(file, contents);
}
The strings look like this _-uUa, -_uU, _-Ha etc.
The problem that I am having is when for example this string "_uU" will also overwrite "_-uUa" so the replacement would look like "newvaluea"
Is there a way to tell regex to look at the next character after the found string and make sure it is not an alphanumeric character?
I hope it is clear what I am trying to do here.
Here is some sample data:
private function _-0iX(arg1:flash.events.Event):void
{
if (arg1.type == flash.events.Event.RESIZE)
{
if (this._-2GU)
{
this._-yu(this._-2GU);
}
}
return;
}
The next characters could be ;, (, ), dot, comma, space, :, etc.
First of all, you should use Regex.Escape on the search string.
You can then use
contents = Regex.Replace(
    contents,
    Regex.Escape(dr["find"].ToString()) + @"(?![a-zA-Z])",
    dr["replace"].ToString().Replace("$", "$$")); // only '$' is special in a replacement string
or even better
contents = Regex.Replace(
    contents,
    @"\b" + Regex.Escape(dr["find"].ToString()) + @"\b",
    dr["replace"].ToString().Replace("$", "$$"));
I think this is what you're looking for:
contents = Regex.Replace(
contents,
string.Format(#"(?<!\w){0}(?!\w)", Regex.Escape(dr["find"].ToString())),
dr["replace"].ToString().Replace("$", "$$")
);
You can't use \b because your search strings don't always start and end with word characters. Instead, I used (?<!\w) and (?!\w) to make sure the matched substring is not immediately preceded or followed by a word character (i.e., a letter, a digit, or an underscore). I don't know the complete specs for your search strings, so this pattern might need some tweaking.
None of the sample patterns you provided contain regex metacharacters, but like the other responders, I used Regex.Escape() to render it safe anyway. In the replacement string the only character you have to watch out for is the dollar sign (ref), and the way to escape that is with another dollar sign. Notice that I used String.Replace() for that instead of Regex.Replace().
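As a rough illustration of the lookaround approach, the sketch below wraps it in a helper and runs it on a made-up line in which one identifier is a prefix of another. ReplaceWhole and the sample values are hypothetical, not taken from the question's data table.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: replaces 'find' only when no word character touches it on either side.
static string ReplaceWhole(string contents, string find, string replacement)
{
    string pattern = @"(?<!\w)" + Regex.Escape(find) + @"(?!\w)";
    // Double any '$' so it is treated literally in the replacement string.
    return Regex.Replace(contents, pattern, replacement.Replace("$", "$$"));
}

string sample = "this._-uU(this._-uUa);";
string replaced = ReplaceWhole(sample, "_-uU", "newvalue");
Console.WriteLine(replaced);
```

The longer identifier _-uUa is left alone because the negative lookahead sees the trailing "a".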
There are two tricks that can help you here:
Order all the search strings by length, and replace the longest ones first; that way you won't accidentally replace the shorter ones.
Use a MatchEvaluator and, instead of looping through all your rows, search for all replacement patterns in the string and look them up in your dataset.
Option one is simple, option two would look like this:
contents = Regex.Replace(contents, @"_-\w+", ReplaceIdentifier);
public string ReplaceIdentifier(Match m)
{
    // Look up the matched identifier; requires a primary key on "find"
    DataRow row = FolderRenames.Rows.Find(m.Value);
    if (row != null) return row["replace"].ToString();
    else return m.Value;
}
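The same MatchEvaluator idea can be tried without a DataTable. The sketch below assumes the find/replace pairs have been loaded into a Dictionary (the entries shown are invented) and that every identifier matches the _-\w+ shape used above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical rename table; in the question this data lives in a DataTable.
var renames = new Dictionary<string, string>
{
    ["_-uU"] = "shortName",
    ["_-uUa"] = "longName",
};

string contents = "this._-uUa(this._-uU);";

// One pass over the text: each identifier is matched whole (greedy \w+)
// and replaced only if it has an entry in the table.
string result = Regex.Replace(contents, @"_-\w+", m =>
    renames.TryGetValue(m.Value, out var replacement) ? replacement : m.Value);

Console.WriteLine(result);
```

Because \w+ is greedy, _-uUa is looked up as a whole and never confused with _-uU, and the file content is scanned once instead of once per row.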

match first digits before # symbol

How to match all first digits before # in this line
26909578#Sbrntrl_7x06-lilla.avi#356028416#2012-10-24 09:06#0#http://bitshare.com/files/dvk9o1oz/Sbrntrl_7x06-lilla.avi.html#[URL=http://bitshare.com/files/dvk9o1oz/Sbrntrl_7x06-lilla.avi.html]http://bitshare.com/files/dvk9o1oz/Sbrntrl_7x06-lilla.avi.html[/URL]#http://bitshare.com/files/dvk9o1oz/Sbrntrl_7x06-lilla.avi.html#http://bitshare.com/?f=dvk9o1oz#http://bitshare.com/delete/dvk9o1oz/4511e6f3612961f961a761adcb7e40a0/Sbrntrl_7x06-lilla.avi.html
I'm trying to get this number: 26909578
My try:
string text = @"26909578#Sbrntrl_7x06-lilla.avi#356028416#2012-10-24 09:06#0#http://bitshare.com/files/dvk9o1oz/Sbrntrl_7x06-lilla.avi.html#[URL=http://bitshare.com/files/dvk9o1oz/Sbrntrl_7x06-lilla.avi.html]http://bitshare.com/files/dvk9o1oz/Sbrntrl_7x06-lilla.avi.html[/URL]#http://bitshare.com/files/dvk9o1oz/Sbrntrl_7x06-lilla.avi.html#http://bitshare.com/?f=dvk9o1oz#http://bitshare.com/delete/dvk9o1oz/4511e6f3612961f961a761adcb7e40a0/Sbrntrl_7x06-lilla.avi.html";
MatchCollection m1 = Regex.Matches(text, @"(.+?)#", RegexOptions.Singleline);
but then it outputs all of the text.
Make it explicit that it has to start at the beginning of the string:
@"^(.+?)#"
Alternatively, if you know that this will always be a number, restrict the possible characters to digits:
@"^\d+"
Alternatively use the function Match instead of Matches. Matches explicitly says, "give me all the matches", while Match will only return the first one.
Or, in a trivial case like this, you might also consider a non-RegEx approach. The IndexOf() method will locate the '#' and you could easily strip off what came before.
I even wrote a sscanf() replacement for C#, which you can see in my article A sscanf() Replacement for .NET.
If you don't want to use regex, use a StringBuilder and just loop until you hit the #.
Like this:
StringBuilder sb = new StringBuilder();
string yourdata = "yourdata";
int i = 0;
// stop at the first '#', or at the end of the string if there is none
while (i < yourdata.Length && yourdata[i] != '#')
{
    sb.Append(yourdata[i]);
    i++;
}
// when you get to that #, the StringBuilder holds the number you want, so return it with .ToString()
string answer = sb.ToString();
The entire string (except the final url) is composed of segments that can be matched by (.+?)#, so you will get several matches. Retrieve only the first match from the collection returned by matching .+?(?=#)
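The suggestions above can be compared side by side. This is a minimal sketch on a shortened copy of the input line; the variable names are illustrative only.

```csharp
using System;
using System.Text.RegularExpressions;

// Shortened version of the line from the question.
string text = "26909578#Sbrntrl_7x06-lilla.avi#356028416#2012-10-24 09:06#0";

// Regex.Match returns only the first match; ^\d+ anchors it to the leading digits.
Match first = Regex.Match(text, @"^\d+");
string viaRegex = first.Success ? first.Value : "";

// Non-regex alternative: take everything before the first '#'.
int hash = text.IndexOf('#');
string viaIndexOf = hash >= 0 ? text.Substring(0, hash) : text;

Console.WriteLine(viaRegex);
Console.WriteLine(viaIndexOf);
```

The two differ on edge cases: the regex yields an empty result when the line does not start with a digit, while the IndexOf version returns whatever precedes the first '#'.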

how can i optimize the performance of this regular expression?

I'm using a regular expression to replace commas that are not contained by text qualifying quotes into tab spaces.
I'm running the regex on file content through a script task in SSIS. The file content is over 6000 lines long.
I saw an example of using a regex on file content that looked like this
String FileContent = ReadFile(FilePath, ErrInfo);
Regex r = new Regex(@"(,)(?=(?:[^""]|""[^""]*"")*$)");
FileContent = r.Replace(FileContent, "\t");
That replace can understandably take its sweet time on a decent sized file.
Is there a more efficient way to run this regex?
Would it be faster to read the file line by line and run the regex per line?
It seems you're trying to convert comma separated values (CSV) into tab separated values (TSV).
In this case, you should try to find a CSV library instead and read the fields with that library (and convert them to TSV if necessary).
Alternatively, you can check whether each line has quotes and use a simpler method accordingly.
The problem is the lookahead, which looks all the way to the end of the input on each comma, resulting in O(n^2) complexity, which is noticeable on long inputs. You can get it done in a single pass by skipping over quoted strings while replacing:
Regex csvRegex = new Regex(@"
    (?<Quoted>
        ""                  # Open quotes
        (?:[^""]|"""")*     # not quotes, or two quotes (escaped)
        ""                  # Closing quotes
    )
    |                       # OR
    (?<Comma>,)             # A comma
    ",
    RegexOptions.IgnorePatternWhitespace);
content = csvRegex.Replace(content,
    match => match.Groups["Comma"].Success ? "\t" : match.Value);
Here we match free commas and quoted strings. The Replace method takes a callback that checks whether we found a comma or a quoted string, and replaces accordingly.
The simplest optimization would be
Regex r = new Regex(@"(,)(?=(?:[^""]|""[^""]*"")*$)", RegexOptions.Compiled);
foreach (var line in System.IO.File.ReadAllLines("input.txt"))
    Console.WriteLine(r.Replace(line, "\t"));
I haven't profiled it, but I wouldn't be surprised if the speedup was huge.
If that's not enough I suggest some manual labour:
var input = new StreamReader(File.OpenRead("input.txt"));
char[] toMatch = ",\"".ToCharArray();
string line;
while (null != (line = input.ReadLine()))
{
    var result = new StringBuilder(line);
    bool inquotes = false;
    // Jump from one comma or quote to the next
    for (int index = 0; -1 != (index = line.IndexOfAny(toMatch, index)); index++)
    {
        bool isquote = (line[index] == '\"');
        inquotes = inquotes != isquote; // toggle on every quote
        if (!(isquote || inquotes))
            result[index] = '\t';       // comma outside quotes
    }
    Console.WriteLine(result);
}
PS: I assumed @"\t" was a typo for "\t", but perhaps it isn't :)
