I need to apply a regex in C#.
The string looks like the following:
MSH|^~\&|OAZIS||C2M||20110310222404||ADT^A08|00226682|P|2.3||||||ASCII
EVN|A08
PD1
PV1|1|test
And what I want to do is delete all the lines that only contain 3 characters (with no delimiters '|'). So in this case, the 'PD1' line (3rd line) has to be deleted.
Is this possible with a regex?
Thx
The following will do what you want without regular expressions.
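string inputString = ""; // assign the whole file contents here
string resultingString = "";
foreach (var line in inputString.Split('\n')) {
    // keep lines longer than 3 characters or containing the '|' delimiter
    if (line.Trim().Length > 3 || line.Contains("|"))
        resultingString += line + "\n";
}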
This assumes that you have your file as one large string. And it gives you another string with the necessary lines removed.
(Or you could do it with the file directly:
string[] goodLines =
// read all of the lines of the file
File.ReadLines("fileLocation").
// filter out the ones you want
Where(line => line.Trim().Length > 3 || line.Contains("|")).ToArray();
You end up with a String[] with all of the correct lines in your file.)
This:
(?<![|])[^\n]{4}\n
Regex matched what you wanted in the online regex tester I used, however I believe that the {4} should actually be a {3}, so try switching them if it doesn't work for you.
EDIT:
This also works: \n[^|\n]{3}\n and is probably closer to what you are looking for.
EDIT 2:
The number in braces is definitely {3}; I tested it at home.
Why not just open the file, create a temporary output file, and run through the lines one by one? If a line has 3 characters, just skip it. If the file can be held in memory entirely, then maybe use File.ReadAllLines() to get an array of strings that represents the file line by line.
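A minimal sketch of that line-by-line idea, in case it helps. The class and method names (LineFilter, CopySkippingThreeCharLines) are made up for illustration, and it takes a TextReader/TextWriter pair so the same code works on files or on in-memory strings. It assumes "3 characters" means exactly three after trimming, with no '|' in the line.

```csharp
using System.IO;

public static class LineFilter
{
    // Copies every line from reader to writer, skipping lines that are
    // exactly three characters long (after trimming) and contain no '|'.
    public static void CopySkippingThreeCharLines(TextReader reader, TextWriter writer)
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            if (line.Trim().Length == 3 && !line.Contains("|"))
                continue; // e.g. the bare "PD1" segment

            writer.WriteLine(line);
        }
    }
}
```

For real files you would pass a StreamReader on the input file and a StreamWriter on the temporary output file, then swap the files once the copy has finished.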
Are the three characters always going to be by themselves on a line? If so, you can use beginning of string/end of string markers.
Here's a Regex that matches three characters that are by themselves on a string:
\A.{3}\z
\A is the start of the string.
\z is the end of the string.
. is any character, {3} with 3 occurrences
^ - start of line.
\w - word character
{3} - repeated exactly 3 times
$ - end of line
^\w{3}$
Just a general observation from the solutions I've seen posted so far. The original question included the comment "delete all the lines that only contain 3 characters" [my emphasis]. I'm not sure if you meant literally "only 3 characters", but in case you did, you may want to change the logic of the proposed solutions from things like
if (line.Trim().Length > 3 ...)
to
if (line.Trim().Length != 3 ...)
...just in case lines with 2 characters are indeed valid, for example. (Same idea for the proposed regex solutions.)
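To make the difference concrete, here is a rough sketch of the "exactly 3" variant applied to a string held in memory. Hl7Cleaner and RemoveBareThreeCharLines are hypothetical names, and joining the kept lines with "\n" is an assumption about the line endings you want in the output.

```csharp
using System;
using System.Linq;

public static class Hl7Cleaner
{
    // Drops lines that are exactly three characters long (after trimming)
    // and contain no '|' delimiter; shorter and longer lines are kept.
    public static string RemoveBareThreeCharLines(string input)
    {
        var kept = input
            .Split(new[] { "\r\n", "\n" }, StringSplitOptions.None)
            .Where(line => line.Trim().Length != 3 || line.Contains("|"));

        return string.Join("\n", kept);
    }
}
```

With the "> 3" test a two-character line such as "PV" would be thrown away as well; with "!= 3" it survives.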
This regex will identify the lines that meet your exclusion criteria: ^[^|]{3}$. Then it's just a matter of iterating over all lines (with data) and checking which ones meet the exclusion criteria. Like this, for instance:
// RegexOptions.Multiline makes ^ and $ match at the start and end of each line
foreach (Match match in Regex.Matches(data, @"^.+$", RegexOptions.Multiline))
{
    if (!Regex.IsMatch(match.Value, @"^[^|]{3}$"))
    {
        // Do something with the legitimate match.Value, like writing the line to the target file.
    }
}
The question is a little vague.
As stated, the answer is something like this
(?:^|(?<=\n))[^\n|]{3}(?:\n|$) which allows whitespace in the match.
So "#\t)" will also be deleted.
To limit the characters to visual (non-whitespace), you could use
(?:^|(?<=\n))[^\s|]{3}(?:\n|$)
which doesn't allow whitespace.
For both the context is a single string, replacement is '' and global.
Example context in perl: s/(?:^|(?<=\n))[^\n|]{3}(?:\n|$)//g
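For C#, the same substitution could look roughly like the sketch below. Regex.Replace replaces every match, so no "global" flag is needed. The class name ThreeCharLineRegex is invented for the example, and the pattern assumes "\n" line endings (with "\r\n" the "\r" counts as a character on the line).

```csharp
using System.Text.RegularExpressions;

public static class ThreeCharLineRegex
{
    // Same pattern as the Perl example: three non-'|' characters that fill
    // a whole line, together with the newline that ends the line.
    private static readonly Regex BareThreeChars =
        new Regex(@"(?:^|(?<=\n))[^\n|]{3}(?:\n|$)");

    public static string Strip(string input)
    {
        // Replace every match with the empty string
        return BareThreeChars.Replace(input, "");
    }
}
```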
try this:
text = System.Text.RegularExpressions.Regex.Replace(
text,
@"^[^|]{3}(?:\r\n|[\r\n]|$)",
"",
System.Text.RegularExpressions.RegexOptions.Multiline);
You can do it using Regex:
string output = Regex.Replace(input, "^[a-zA-Z0-9]{3}$", "", RegexOptions.Multiline);
[a-zA-Z0-9] will match any letter or digit
{3} will match exactly 3 of them
RegexOptions.Multiline makes ^ and $ match at the start and end of each line
Related
I have the following string:
"121 fd412 4151 3213, 421, 423 41241 fdsfsd"
And I need to get 3213 and 421, because they both have a space in front of them and a comma behind.
The result will be set inside the string array...How can I do that?
"\\d+" catches every integer.
"\s\\d+(,)" throws some memory errors.
EDIT.
space to the left (<-) of the number, comma to the right (->)
EDIT 2.
string mainString = "Tests run: 5816, 8346, 28364 iansufbiausbfbabsbo3 4";
MatchCollection c = Regex.Matches(mainString, @"\d+(?=\,)");
var myList = new List<String>();
foreach(Match match in c)
{
myList.Add(match.Value);
}
Console.Write(myList[1]);
Console.ReadKey();
Your regex syntax is incorrect for matching both numbers. If you want them as separate results, you could do:
@"\s(\d+),\s(\d+)\s"
Edit
@"\s(\d+),"
\s\\d+(,):
\s is not properly escaped, should be \\s, same as for \\d
\\d matches single digit, you need \\d+ - one or more consecutive digits
(,) captures comma, do you really need this? seems like you need to capture a number, so \\s(\\d+),
you said "because they both have space behind them, and a coma in front", so probably ,\\s(\\d+)
How about this expression :
" \d+," // expression without the quotes
it should find what you need.
You can check MSDN for how to work with regular expressions.
Hope it helps
Another solution
\s(\d+), // or maybe you'll need a double slash \\
Output:
3213
421
I think you mean you're looking for something like ,<space><digit> not ,<digit><space>
If so, try this:
, (\d+) //you might need to add another backslash as the others have noted
Well, based on your new edit
\s(\d+),
It's all you need, only the numbers
\d+(?=\,)
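Putting the pieces from the answers together, a small sketch that returns only the numbers with whitespace on the left and a comma on the right might look like this. The lookbehind (?<=\s) and lookahead (?=,) keep the space and comma out of the match, so no capture group is needed. NumberExtractor and its method name are made up for the example.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NumberExtractor
{
    // Returns every run of digits that is preceded by whitespace
    // and immediately followed by a comma.
    public static List<string> NumbersBetweenSpaceAndComma(string input)
    {
        var results = new List<string>();
        foreach (Match m in Regex.Matches(input, @"(?<=\s)\d+(?=,)"))
        {
            results.Add(m.Value);
        }
        return results;
    }
}
```

Called on "121 fd412 4151 3213, 421, 423 41241 fdsfsd", this should give a list containing "3213" and "421".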
Okay, I give up - time to call upon the regex gurus for some help.
I'm trying to validate CSV file contents, just to see if it looks like the expected valid CSV data. I'm not trying to validate all possible CSV forms, just that it "looks like" CSV data and isn't binary data, a code file or whatever.
Each line of data comprises comma-separated words, each word comprising a-z, 0-9, and a small number of punctuation chars, namely - and _. There may be several lines in the file. That's it.
Here's my simple code:
const string dataWord = @"[a-z0-9_\-]+";
const string dataLine = "("+dataWord+@"\s*,\s*)*"+dataWord;
const string csvDataFormat = "("+dataLine+") | (("+dataLine+@"\r\n)*"+dataLine +")";
Regex validCSVDataPattern = new Regex(csvDataFormat, RegexOptions.IgnoreCase);
protected override bool IsCorrectDataFormat(string fileContents)
{
return validCSVDataPattern.IsMatch(fileContents);
}
This gives me a regex pattern of
(([a-z0-9_\-]+\s*,\s*)*[a-z0-9_\-]+) | ((([a-z0-9_\-]+\s*,\s*)*[a-z0-9_\-]+\r\n)*([a-z0-9_\-]+\s*,\s*)*[a-z0-9_\-]+)
However if I present this with a block of, say, C# code, the regex parser says it is a match. How is that? the C# code doesn't look anything like my CSV pattern (it has punctuation other than _ and -, for a start).
Can anyone point out my obvious error? Let me repeat - I am not trying to validate all possible CSV forms, just my simple subset.
Your regular expression is missing the ^ (beginning of line) and $ (end of line) anchors. This means that it would match any text that contains what is described by the expression, even if the text contains other completely unrelated parts.
For example, this text matches the expression:
foo, bar
and therefore this text also matches:
var result = calculate(foo, bar);
You can see where this is going.
Add ^ at the beginning and $ at the end of csvDataFormat to get the behavior you expect.
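A quick way to see the difference, as a sketch: the same single-line pattern used once without anchors and once with them. AnchorDemo and its members are names invented for this example, and the pattern is a simplified single-line version of the one in the question.

```csharp
using System.Text.RegularExpressions;

public static class AnchorDemo
{
    // One comma-separated line of words, as described in the question
    private const string Line = @"[a-z0-9_\-]+(\s*,\s*[a-z0-9_\-]+)*";

    // Unanchored: true if the pattern occurs ANYWHERE in the text
    public static bool ContainsCsvLikeText(string s)
    {
        return Regex.IsMatch(s, Line, RegexOptions.IgnoreCase);
    }

    // Anchored: true only if the WHOLE text fits the pattern
    public static bool IsCsvLikeLine(string s)
    {
        return Regex.IsMatch(s, "^" + Line + "$", RegexOptions.IgnoreCase);
    }
}
```

The unanchored check accepts "var result = calculate(foo, bar);" because "var" alone already satisfies the pattern; the anchored one rejects it and still accepts "foo, bar".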
Here is a better pattern which looks for CSV groups such as XXX, or yyy for one to many in each line:
^([\w\s_\-]*,?)+$
^ - Start of each line
( - a CSV match group start
[\w\s_\-]* - Valid characters \w (a-zA-Z0-9) and _ and - in each CSV
,? - maybe a comma
)+ - End of the csv match group, 1 to many of these expected.
That will validate a whole file, line by line for a basic CSV structure and allow for empty ,, situations.
I came up with this regex:
^([a-z0-9_\-]+)(\s*)(,\s*[a-z0-9_\-]+)*$
Tests
asbc_- , khkhkjh, lkjlkjlkj_-, j : PASS
asbc, : FAIL
asbc_-,khkhkjh,lkjlkjlk909j_-,j : PASS
If you want to match empty lines like ,,, or when some values are blank like ,abcd,, use
^([a-z0-9_\-]*)(\s*)(,\s*[a-z0-9_\-]*)*$
Loop through all the lines to see if the file is ok:
const string dataLine = @"^([a-z0-9_\-]+)(\s*)(,\s*[a-z0-9_\-]+)*$";
Regex validCSVDataPattern = new Regex(dataLine, RegexOptions.IgnoreCase);
protected override bool IsCorrectDataFormat(string fileContents)
{
string[] lines = fileContents.Split(new string[] { "\r\n", "\n" }, StringSplitOptions.None);
foreach (var line in lines)
{
if (!validCSVDataPattern.IsMatch(line))
return false;
}
return true;
}
I think this is what you're looking for:
@"(?in)^[a-z0-9_-]+( *, *[a-z0-9_-]+)*([\r\n]+[a-z0-9_-]+( *, *[a-z0-9_-]+)*)*$"
The noteworthy changes are:
Added anchors (^ and $), because the regex is totally pointless without them
Removed spaces (which have to match literal spaces, and I don't think that's what you intended)
Replaced the \s in every occurrence of \s* with a literal space (because \s can match any whitespace character, and you only want to match actual spaces in those spots)
The basic structure of your regex looked pretty good until that | came along and bollixed things up. ;)
p.s., In case you're wondering, (?in) is an inline modifier that sets IgnoreCase and ExplicitCapture modes.
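For what it's worth, here is how that pattern might be dropped into a validator. This is only a sketch: CsvShapeValidator and LooksLikeCsv are hypothetical names. Because of the inline (?in), no RegexOptions argument is passed.

```csharp
using System.Text.RegularExpressions;

public static class CsvShapeValidator
{
    // (?in) switches on IgnoreCase and ExplicitCapture inside the pattern itself.
    // Without RegexOptions.Multiline, ^ and $ anchor the whole input, so every
    // line of the file has to fit the comma-separated-words shape.
    private static readonly Regex Pattern = new Regex(
        @"(?in)^[a-z0-9_-]+( *, *[a-z0-9_-]+)*([\r\n]+[a-z0-9_-]+( *, *[a-z0-9_-]+)*)*$");

    public static bool LooksLikeCsv(string contents)
    {
        return Pattern.IsMatch(contents);
    }
}
```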
I have to find occurrences of a certain string (needle) within another string (haystack) that don't occur between specific "braces".
For example consider this haystack:
"BEGIN something END some other thing BEGIN something else END yet some more things."
And this needle:
"some"
With the braces "BEGIN" and "END"
I want to find all needles that are not between braces.
(there are two matches: the "some" followed by "other" and the "some" followed by "more")
I figured I could solve this with a Regex with negative lookahead/lookbehind, but how?
I have tried
(?<!(BEGIN))some(?!(END))
which gives me 4 matches (obviously because no "some" is directly enclosed between "BEGIN" and "END")
I also tried
(?<!(BEGIN.*))some(?!(.*END))
but this gives me no matches at all (obviously because each needle is somehow preceded by a "BEGIN")
Now I'm stuck.
Here's the latest C# code I used:
string input = "BEGIN something END some other thing BEGIN something else END yet some more things.";
global::System.Text.RegularExpressions.Regex re = new Regex(@"(?<!(BEGIN.*))some(?!(.*END))");
global::System.Text.RegularExpressions.MatchCollection matches = re.Matches(input);
global::NUnit.Framework.Assert.AreEqual(2, matches.Count);
Would something like this work for you:
(?:^|END)((?!BEGIN).*?)(some)(.*?)(?:BEGIN|$)
This appears to match your text, as I tested using RegExDesigner.NET.
One simple option is to skip the parts you don't want to match, and capture only the needles you need:
MatchCollection matches = Regex.Matches(input, "BEGIN.*?END|(?<Needle>some)");
You'll get the two "some"s you're after by taking the successful "Needle" groups out of all matches:
IEnumerable<Group> needles = matches.Cast<Match>()
.Select(m => m.Groups["Needle"])
.Where(g => g.Success);
You might try splitting the string on occurrences of BEGIN or END so that you can ensure that there is only one BEGIN and one END in the string that you apply your regex to. Also, if you are looking for occurrences of SOME that are outside your BEGIN/END braces, then I think you'd want to look behind for END and look ahead for BEGIN (positive lookahead/lookbehind), the opposite of what you have.
Hope this helps.
What if you just process the entire haystack and ignore the hay that is in between the braces (am I pushing the metaphor too far?)
For example, look through all the tokens (or characters, if you need to go to that level) and look for your braces. When the opening one is found, you loop through until you find the closing brace. At that point, you start looking for your needles until you find another opening brace. It's a bit more code than a Regex, but might be more readable and easier to troubleshoot.
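A rough sketch of that scan, working at the character level. NeedleScanner and FindOutside are invented names, and treating an unclosed opening brace as "everything after it is inside" is an assumption the question does not settle.

```csharp
using System;
using System.Collections.Generic;

public static class NeedleScanner
{
    // Returns the start index of every needle that is not between an
    // opening and a closing brace.
    public static List<int> FindOutside(string haystack, string needle, string open, string close)
    {
        if (string.IsNullOrEmpty(needle))
            throw new ArgumentException("needle must not be empty", "needle");

        var positions = new List<int>();
        int i = 0;
        while (i < haystack.Length)
        {
            if (string.CompareOrdinal(haystack, i, open, 0, open.Length) == 0)
            {
                // Opening brace found: jump to just past the matching close
                int end = haystack.IndexOf(close, i + open.Length, StringComparison.Ordinal);
                if (end < 0)
                    break; // assumption: an unclosed brace hides the rest
                i = end + close.Length;
            }
            else if (string.CompareOrdinal(haystack, i, needle, 0, needle.Length) == 0)
            {
                positions.Add(i);
                i += needle.Length;
            }
            else
            {
                i++;
            }
        }
        return positions;
    }
}
```

For the haystack in the question, FindOutside(input, "some", "BEGIN", "END") should return two positions, the "some" before "other" and the one before "more".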
I'm after a regex for C# which will turn this:
"*one*" *two** two and a bit "three four"
into this:
"*one*" "*two**" two and a bit "three four"
IE a quoted string should be unchanged whether it contains one or many words.
Any words with asterisks to be wrapped in double quotes.
Any unquoted words with no asterisks to be unchanged.
Nice to haves:
If multiple asterisks could be merged into one in the same step that would be better.
Noise words - eg and, a, the - which are not part of a quoted string should be dumped.
Thanks for any help / advice.
Julio
The following regex will do what you're looking for:
\*+ # Match 1 or more *
(
\w+ # Capture character string
)
\*+ # Match 1 or more *
If you use this in conjunction with this replace statement, all your words matched by (\w+) will be wrapped in double quotes, with a single asterisk on each side:
string s = "\"one\" *two** two and a bit \"three four\"";
Regex r = new Regex(@"\*+(\w+)\*+");
var output = r.Replace(s, @"""*$1*""");
Note: This will leave the below string unquoted:
*two two*
If you wish to match those strings as well, use this regex:
\*+([^*]+)\*+
EDIT: updated code.
This solution works for your request, as well as the nice to have items:
string text = @"test the ""one"" and a *two** two and a the bit ""three four"" a";
string result = Regex.Replace(text, @"\*+(.*?)\*+", @"""*$1*""");
string noiseWordsPattern = @"(?<!"") # match if double quote prefix is absent
\b # word boundary to prevent partial word matches
(and|a|the) # noise words
\b # word boundary
(?!"") # match if double quote suffix is absent
";
// to use the commented pattern use RegexOptions.IgnorePatternWhitespace
result = Regex.Replace(result, noiseWordsPattern, "", RegexOptions.IgnorePatternWhitespace);
// or use this one line version instead
// result = Regex.Replace(result, @"(?<!"")\b(and|a|the)\b(?!"")", "");
// remove extra spaces resulting from noise words replacement
result = Regex.Replace(result, @"\s+", " ");
Console.WriteLine("Original: {0}", text);
Console.WriteLine("Result: {0}", result);
Output:
Original: test the "one" and a *two** two and a the bit "three four" a
Result: test "one" "*two*" two bit "three four"
The 2nd regex replacement for noise words causes potential duplicate of blank spaces. To remedy this side effect I added the 3rd regex replacement to clean it up.
Something like this. ArgumentReplacer is a callback that is called for each match. The return value is substituted into the returned string.
static void Main() {
    string text = "\"one\" *two** and a bit \"three *** four\"";
    string finderRegex = @"
(""[^""]*"") # quoted
| ([^\s""*]*\*[^\s""]*) # with asterisks
| ([^\s""]+) # without asterisks
";
    // ArgumentReplacer is invoked once per match
    Console.WriteLine(Regex.Replace(text, finderRegex, ArgumentReplacer,
        RegexOptions.IgnorePatternWhitespace));
}
public static String ArgumentReplacer(Match theMatch) {
    // Don't touch quoted arguments, and arguments with no asterisks
    if (theMatch.Groups[2].Value.Length == 0)
        return theMatch.Value;
    // Quote arguments with asterisks, and replace sequences of such
    // by a single one.
    return String.Format("\"{0}\"",
        Regex.Replace(theMatch.Value, @"\*\*+", "*"));
}
Alternatives to the left in the pattern have priority over those to the right. This is why I just needed to write "[^\s""]+" in the last alternative.
The quotes, on the other hand, are only matched if they occur at the beginning of the argument. They will not be detected if they occur in the middle of the argument, and we must stop before those if they occur.
Given that you wish to match pairs of quotes, I don’t think your language is regular, therefore I don’t think RegEx is a good solution. E.g
Some people, when confronted with a problem, think “I know, I'll use
regular expressions.”
Now they have two problems.
See "When not to use Regex in C# (or Java, C++ etc)"
I've decided to follow the advice of a couple of responses and go with a parser solution. I've tried the regexes contributed so far and they seem to fail in some cases. That's probably an indication that regexes aren't the appropriate solution to this problem. Thanks for all responses.
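For anyone curious what such a parser might look like, here is one possible sketch, not necessarily what was actually used. It walks the input once, copies quoted strings untouched, and wraps any unquoted word containing an asterisk in double quotes while merging runs of asterisks. WildcardQuoter is a made-up name; noise-word removal is left out, and runs of whitespace between tokens are normalized to single spaces.

```csharp
using System.Collections.Generic;
using System.Text;

public static class WildcardQuoter
{
    public static string Quote(string input)
    {
        var tokens = new List<string>();
        int i = 0;
        while (i < input.Length)
        {
            if (char.IsWhiteSpace(input[i])) { i++; continue; }

            int start = i;
            if (input[i] == '"')
            {
                // Quoted string: copy through the closing quote unchanged
                int close = input.IndexOf('"', i + 1);
                i = close < 0 ? input.Length : close + 1;
                tokens.Add(input.Substring(start, i - start));
            }
            else
            {
                // Bare word: runs until whitespace or the next quote
                while (i < input.Length && !char.IsWhiteSpace(input[i]) && input[i] != '"')
                    i++;
                string word = input.Substring(start, i - start);
                tokens.Add(word.Contains("*")
                    ? "\"" + CollapseAsterisks(word) + "\""
                    : word);
            }
        }
        return string.Join(" ", tokens);
    }

    // Turns each run of consecutive '*' into a single '*'
    private static string CollapseAsterisks(string word)
    {
        var sb = new StringBuilder();
        foreach (char c in word)
        {
            if (c != '*' || sb.Length == 0 || sb[sb.Length - 1] != '*')
                sb.Append(c);
        }
        return sb.ToString();
    }
}
```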
I thought I'd do a regex replace
Regex r = new Regex("[0-9]");
return r.Replace(sz, "#");
on a file named aa514a3a.4s5. It works exactly as I expect: it replaces all the numbers, including the numbers in the extension. How do I make it NOT replace the numbers in the extension? I tried numerous regex strings, but I am beginning to think that it's an all-or-nothing pattern, so I can't do this. Do I need to separate the extension from the string, or can I use a regex?
This one does it for me:
(?<!\.[0-9a-z]*)[0-9]
This does a negative lookbehind (the string must not occur before the matched string) on a period, followed by zero or more alphanumeric characters. This ensures only numbers are matched that are not in your extension.
Obviously, the [0-9a-z] must be replaced by which characters you expect in your extension.
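Wired into the replace from the question, that could look like the sketch below. DigitMasker is a hypothetical name, and RegexOptions.IgnoreCase is added on the assumption that extensions may contain upper-case letters. Note that .NET allows the variable-length lookbehind used here; many other regex engines do not.

```csharp
using System.Text.RegularExpressions;

public static class DigitMasker
{
    // Replaces each digit with '#', except digits that come after a period
    // with only letters or digits in between (i.e. inside the extension).
    public static string MaskDigitsBeforeExtension(string fileName)
    {
        return Regex.Replace(
            fileName,
            @"(?<!\.[0-9a-z]*)[0-9]",
            "#",
            RegexOptions.IgnoreCase);
    }
}
```

For "aa514a3a.4s5" this gives "aa###a#a.4s5". With several periods in the name, everything after the first period is left alone.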
I don't think you can do that with a single regular expression.
Probably best to split the original string into base and extension; do the replace on the base; then join them back up.
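A sketch of the split/replace/join approach using System.IO.Path, which avoids assuming a fixed extension length. FileNameMasker is a made-up name, and the input is assumed to be a bare file name, since GetFileNameWithoutExtension would drop any directory part.

```csharp
using System.IO;
using System.Text.RegularExpressions;

public static class FileNameMasker
{
    public static string MaskDigitsInBaseName(string fileName)
    {
        // "aa514a3a.4s5" -> base "aa514a3a", extension ".4s5"
        string baseName = Path.GetFileNameWithoutExtension(fileName);
        string extension = Path.GetExtension(fileName); // includes the '.', or "" if there is none

        // Only the base name goes through the regex
        return Regex.Replace(baseName, "[0-9]", "#") + extension;
    }
}
```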
Yes, I think you'd be better off separating the extension.
If you are sure there is always a 3-character extension at the end of your string, the easiest, most readable/maintainable solution would be to only perform the replace on
yourString.Substring(0, yourString.Length-4)
..and then append
yourString.Substring(yourString.Length-4, 4)
Why not run the regex on the substring?
String filename = "aa514a3a.4s5";
String nameonly = filename.Substring(0,filename.Length-4);