C#: how to write a regular expression

My file contains data like this:
/Pages 2 0 R/Type /Catalog/AcroForm
/Count 1 /Kids [3 0 R]/Type /Pages
/Filter /FlateDecode/Length 84
What regular expression would produce this output?
Pages Type Catalog AcroForm Count Kids Type Pages Filter FlateDecode Length
I want to fetch each string that comes after a '/' and ends before the next '/' or whitespace.
Thanks in advance.

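using System;
using System.Text.RegularExpressions;
class Program
{
    static void Main()
    {
        string s = @"/Pages 2 0 R/Type /Catalog/AcroForm
/Count 1 /Kids [3 0 R]/Type /Pages
/Filter /FlateDecode/Length 84";
        // Capture everything after a '/' up to the next '/' or whitespace.
        // Requiring trailing whitespace would skip names such as /Catalog
        // that are followed directly by another '/'.
        var regex = new Regex(@"/([^\s/]+)");
        foreach (Match item in regex.Matches(s))
        {
            Console.WriteLine(item.Groups[1].Value);
        }
    }
}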
Remark: Don't use regular expressions to parse PDF files.

\/[^\/\s]+
\/ -- A slash (escaped)
[^ ] -- A character class not (^) containing...
\/ -- ... slashes ...
\s -- ... or whitespace
+ -- One or more of these

Here it is for C#:
@"/([^\s/]+)"
You can test it here; paste in only the part between the quotes:
http://derekslager.com/blog/posts/2007/09/a-better-dotnet-regular-expression-tester.ashx
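To check the pattern against the sample from the question, a minimal sketch might look like the following. The class and method names (`PdfNames`, `Extract`) are made up for illustration; only `Regex.Matches` and the pattern itself come from the answers above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PdfNames
{
    // Hypothetical helper: returns every token that follows a '/'
    // and runs up to the next '/' or whitespace character.
    public static List<string> Extract(string text)
    {
        var names = new List<string>();
        foreach (Match m in Regex.Matches(text, @"/([^\s/]+)"))
        {
            names.Add(m.Groups[1].Value);
        }
        return names;
    }

    public static void Main()
    {
        string sample = "/Pages 2 0 R/Type /Catalog/AcroForm\n"
                      + "/Count 1 /Kids [3 0 R]/Type /Pages\n"
                      + "/Filter /FlateDecode/Length 84";
        // Joins the captured names with spaces, in the order they occur.
        Console.WriteLine(string.Join(" ", Extract(sample)));
    }
}
```

Because the capture group excludes both '/' and whitespace, a name such as /Catalog that is immediately followed by another '/' is still picked up.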

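I wouldn't use a regex for this; I find string operations more readable:
string[] parts = input.Split(new[] { '/' }, StringSplitOptions.RemoveEmptyEntries);
foreach (string part in parts)
{
    // Position of the first whitespace character, or -1 if there is none
    int end = part.IndexOfAny(new[] { ' ', '\t', '\r', '\n' });
    if (end >= 0)
    {
        // Get everything before the whitespace
        Console.WriteLine(part.Substring(0, end));
    }
    else
    {
        // Get the whole string
        Console.WriteLine(part);
    }
}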

Related

C# Regex matching two numbers divided by specific string and surrounded by brackets

I'm looking for a .NET Regex pattern that matches the following:
string starts with the [ character
followed by an integer or decimal number
followed by .. (space character, dot, dot, space character)
followed by an integer or decimal number
followed by the last character of the string which is )
*- the decimal numbers have a decimal separator, the . character
*- the integer numbers or the integer value of the decimal numbers should have a maximum of 4 digits
*- the decimal numbers should have a maximum of 4 fractional digits
*- the numbers can be negative
*- if a number is positive then the + sign is missing
*- doesn't matter which one of the two numbers is smaller (first number can be bigger than the second one, "[56 .. 55)" for instance)
The pattern should match the following:
"[10 .. 15)"
"[100 .. 15.2)"
"[10.431 .. 15)"
"[-10.3 .. -5)"
"[-10.4 .. 5.12)"
"[10.4312 .. -5.1232)"
I'd also like to obtain the 2 numbers as strings from the string in case the pattern matches:
obtain "10" and "15" from "[10 .. 15)"
obtain "-10.4" and "5.12" from "[-10.4 .. 5.12)"
The following regex should be fine.
^\[-?\d+(?:\.\d+)? \.\. -?\d+(?:\.\d+)?\)$
var pattern = @"^\[-?\d+(?:\.\d+)? \.\. -?\d+(?:\.\d+)?\)$";
var inputs = new[]{"[10 .. 15)", "[100 .. 15.2)", "[10.431 .. 15)", "[-10.3 .. -5)", "[-10.4 .. 5.12)", "[10.4312 .. -5.1232)", };
foreach (var input in inputs)
{
Console.WriteLine(input + " = " + Regex.IsMatch(input, pattern));
}
// [10 .. 15) = True
// [100 .. 15.2) = True
// [10.431 .. 15) = True
// [-10.3 .. -5) = True
// [-10.4 .. 5.12) = True
// [10.4312 .. -5.1232) = True
https://dotnetfiddle.net/LpswtI
You can use
^\[(-?\d{1,4}(?:\.\d{1,4})?) \.\. (-?\d{1,4}(?:\.\d{1,4})?)\)$
See the regex demo. Details:
^ - start of string
\[ - a [ char
(-?\d{1,4}(?:\.\d{1,4})?) - Group 1: an optional -, one to four digits and then an optional sequence of a . and one to four digits
\.\. with a space on each side - the literal " .. " separator
(-?\d{1,4}(?:\.\d{1,4})?) - Group 2: an optional -, one to four digits and then an optional sequence of a . and one to four digits
\) - a ) char
$ - end of string (use \z if you need to check for the very end of string).
See the C# demo:
var texts = new List<string> { "[10 .. 15)", "[100 .. 15.2)", "[10.431 .. 15)", "[-10.3 .. -5)", "[-10.4 .. 5.12)", "[10.4312 .. -5.1232)", "[12345.1234 .. 0)", "[1.23456 .. 0" };
var pattern = new Regex(@"^\[(-?\d{1,4}(?:\.\d{1,4})?) \.\. (-?\d{1,4}(?:\.\d{1,4})?)\)$");
foreach (var s in texts)
{
Console.WriteLine($"---- {s} ----");
var match = pattern.Match(s);
if (match.Success)
{
Console.WriteLine($"Group 1: {match.Groups[1].Value}, Group 2: {match.Groups[2].Value}");
}
else
{
Console.WriteLine($"No match found in '{s}'.");
}
}
Output:
---- [10 .. 15) ----
Group 1: 10, Group 2: 15
---- [100 .. 15.2) ----
Group 1: 100, Group 2: 15.2
---- [10.431 .. 15) ----
Group 1: 10.431, Group 2: 15
---- [-10.3 .. -5) ----
Group 1: -10.3, Group 2: -5
---- [-10.4 .. 5.12) ----
Group 1: -10.4, Group 2: 5.12
---- [10.4312 .. -5.1232) ----
Group 1: 10.4312, Group 2: -5.1232
---- [12345.1234 .. 0) ----
No match found in '[12345.1234 .. 0)'.
---- [1.23456 .. 0 ----
No match found in '[1.23456 .. 0'.
This works (see this .NET Fiddle):
using System;
using System.Text.RegularExpressions;
public class Program
{
    public static void Main()
    {
        Match m = rx.Match("[123 .. -9876.5432)");
        if (!m.Success)
        {
            Console.WriteLine("No Match");
        }
        else
        {
            Console.WriteLine("left: {0}", m.Groups["left"]);
            Console.WriteLine("right: {0}", m.Groups["right"]);
        }
    }
    private static readonly Regex rx = new Regex(@"
        ^                    # anchor match at start-of-text
        [[]                  # a left square bracket followed by
        (?<left>             # a named capturing group, containing a number, consisting of
          -?[0-9]{1,4}       # - a mandatory integer portion followed by
          ([.][0-9]{1,4})?   # - an optional fractional portion
        )                    # the whole of which is followed by
        [ ][.][.][ ]         # a separator (' .. '), followed by
        (?<right>            # another named capturing group containing a number, consisting of
          -?[0-9]{1,4}       # - a mandatory integer portion followed by
          ([.][0-9]{1,4})?   # - an optional fractional portion
        )                    # the whole of which is followed by
        [)]                  # a right parenthesis, as the question requires, followed by
        $                    # end-of-text
        ",
        RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture
    );
}

Removal of colon and carriage returns and replace with colon

I'm working on a project where I have an HTML fragment that needs to be cleaned up. The HTML tags have been removed, and because a table was stripped out, there are some stray line endings where they shouldn't be :-)
The characters, as they appear, are:
a space at the beginning of a line
a colon, carriage return and line feed at the end of the line, which should be replaced with just the colon
I am presently using regex as follows:
s = Regex.Replace(s, @"(:[\r\n])", ":", RegexOptions.Multiline | RegexOptions.IgnoreCase);
// gets rid of the leading space
s = Regex.Replace(s, @"(^[( )])", "", RegexOptions.Multiline | RegexOptions.IgnoreCase);
Example of what I am dealing with:
Tomas Adams
Solicitor
APLawyers
p:
1800 995 718
f:
07 3102 9135
a:
22 Fultam Street
PO Box 132, Booboobawah QLD 4113
which should look like:
Tomas Adams
Solicitor
APLawyers
p:1800 995 718
f:07 3102 9135
a:22 Fultam Street
PO Box 132, Booboobawah QLD 4113
The regex calls above are my attempt to clean the string, but the result is far from perfect. Can someone help me correct the error and achieve my goal?
[EDIT]
the offending characters
f:\r\n07 3102 9135\r\na:\r\n22
the combination of :\r\n should be replaced by a single colon.
MTIA
Darrin
You may use
var result = Regex.Replace(s, @"(?m)^\s+|(?<=:)(?:\r?\n)+|(\r?\n){2,}", "$1");
See the .NET regex demo.
Details
(?m) - equal to RegexOptions.Multiline - makes ^ match the start of any line here
^ - start of a line
\s+ - 1+ whitespaces
| - or
(?<=:)(?:\r?\n)+ - a position that is immediately preceded with : (matched with (?<=:) positive lookbehind) followed with 1+ occurrences of an optional CR and LF (those are removed)
| - or
(\r?\n){2,} - two or more consecutive occurrences of an optional CR followed with an LF symbol. Only the last occurrence is saved in Group 1 memory buffer, thus the $1 replacement pattern inserts that last, single, occurrence.
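To see the three alternatives at work on data shaped like the question's example, here is a small sketch. The wrapper name `ContactCleaner.Clean` and the shortened sample string are assumptions made for illustration; the pattern and the `$1` replacement are the ones explained above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ContactCleaner
{
    // Hypothetical wrapper around the Regex.Replace call from the answer.
    public static string Clean(string s)
    {
        return Regex.Replace(s, @"(?m)^\s+|(?<=:)(?:\r?\n)+|(\r?\n){2,}", "$1");
    }

    public static void Main()
    {
        // Shortened sample: leading spaces plus ":\r\n" line endings.
        string raw = " Tomas Adams\r\n Solicitor\r\n p:\r\n 1800 995 718\r\n f:\r\n 07 3102 9135";
        // Leading whitespace is dropped and each ":\r\n" collapses to ":".
        Console.WriteLine(Clean(raw));
    }
}
```

When the first or second alternative matches, Group 1 does not participate, so `$1` expands to an empty string and the matched text is simply removed.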
A basic solution without Regex:
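// Split on both Windows (\r\n) and Unix (\n) line endings
var lines = input.Split(new[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries);
var output = new StringBuilder();
for (var i = 0; i < lines.Length; i++)
{
    // A line ending in ':' is merged into the line that follows it, if there is one
    if (lines[i].EndsWith(":") && i + 1 < lines.Length) // feel free to also check for the size
    {
        lines[i + 1] = lines[i].Trim() + lines[i + 1].Trim();
        continue;
    }
    output.AppendLine(lines[i].Trim()); // remove space before or after a line
}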
I tried your regular expression. With the following pattern I was able to replace "\n" and ":"; note that it removes both the ":" and the "\n" at the end of the line.
@"([:\r\n])"
A Linq solution without Regex:
var tmp = string.Empty;
var output = input.Split(new []{"\r\n", "\n"}, StringSplitOptions.RemoveEmptyEntries).Aggregate(new StringBuilder(), (a,b) => {
if (b.EndsWith(":")) { // feel free to also check for the size
tmp = b;
}
else {
a.AppendLine((tmp + b).Trim()); // remove space before or after a line
tmp = string.Empty;
}
return a;
});

How to write a regular expression that captures tags in a comma-separated list?

Here is my input:
#
tag1, tag with space, !##%^, 🦄
I would like to match it with a regex and yield the following elements easily:
tag1
tag with space
!##%^
🦄
I know I could do it this way:
var match = Regex.Match(input, @"^#[\n](?<tags>[\S ]+)$");
// if match is a success
var tags = match.Groups["tags"].Value.Split(',').Select(x => x.Trim());
But that's cheating, as it involves messing around with C#. There must be a neat way to do this with a regex. Just must be... right? ;D
The question is: how to write a regular expression that would allow me to iterate through captures and extract tags, without the need of splitting and trimming?
This works (?ms)^\#\s+(?:\s*((?:(?!,|^\#\s+).)*?)\s*(?:,|$))+
It uses C#'s Capture Collection to find a variable amount of field data
in a single record.
You could extend the regex further to get all records at once.
Where each record contains its own variable amount of field data.
The regex has built-in trimming as well.
Expanded:
(?ms) # Inline modifiers: multi-line, dot-all
^ \# \s+ # Beginning of record
(?: # Quantified group, 1 or more times, get all fields of record at once
\s* # Trim leading wsp
( # (1 start), # Capture collector for variable fields
(?: # One char at a time, but not comma or begin of record
(?!
,
| ^ \# \s+
)
.
)*?
) # (1 end)
\s*
(?: , | $ ) # End of this field, comma or EOL
)+
C# code:
string sOL = @"
#
tag1, tag with space, !##%^, 🦄";
Regex RxOL = new Regex(@"(?ms)^\#\s+(?:\s*((?:(?!,|^\#\s+).)*?)\s*(?:,|$))+");
Match _mOL = RxOL.Match(sOL);
while (_mOL.Success)
{
CaptureCollection ccOL1 = _mOL.Groups[1].Captures;
Console.WriteLine("-------------------------");
for (int i = 0; i < ccOL1.Count; i++)
Console.WriteLine(" '{0}'", ccOL1[i].Value );
_mOL = _mOL.NextMatch();
}
Output:
-------------------------
'tag1'
'tag with space'
'!##%^'
'??'
''
Press any key to continue . . .
Nothing wrong with cheating ;]
string input = @"#
tag1, tag with space, !##%^, 🦄";
string[] tags = Array.ConvertAll(input.Split('\n').Last().Split(','), s => s.Trim());
You can pretty much make it without regex. Just split it like this:
var result = input.Split(new []{'\n','\r'}, StringSplitOptions.RemoveEmptyEntries).Skip(1).SelectMany(x=> x.Split(new []{','},StringSplitOptions.RemoveEmptyEntries).Select(y=> y.Trim()));

Need help finding a pattern in a string using a regular expression in C#

I have a string in the following format:
"ABC 12.23-22-22-11|-ABC 33.20-ABC 44.00-ABC 11.00|ABC 12.23-22-22-11|-ABC 33.20-ABC 44.00-ABC11.00|ABC 12.23-22-22-11|-ABC 33.20-ABC 44.00-ABC 11.00";
What I am trying to do is find the next group which starts with pipe but is not followed by a -
So the above string will point to 3 sections such as
ABC 12.23-22-22-11|-ABC 33.20-ABC 44.00-ABC 11.00
ABC 12.23-22-22-11|-ABC 33.20-ABC 44.00-ABC 11.00
ABC 12.23-22-22-11|-ABC 33.20-ABC 44.00-ABC 11.00
I played around with the following code, but it doesn't seem to do anything: it does not give me the position of the next block, i.e. the position of a pipe character that is not followed by a dash (-).
String pattern = @"^+|[A-Z][A-Z][A-Z]$";
In the above my logic is
1:Start from the beginning
2:Find a pipe character which is not followed by a dash char
3:Return its position
4:Which I will eventually use to substring the blocks
5:And do this till the end of the string
Please be kind, as I have no idea how regex works; I am just making an attempt to use it. Thanks. The language is C#.
You can use the Regex.Split method with a pattern of \|(?!-).
Notice that you need to escape the | character, since it is a regex metacharacter used for alternation. The (?!-) is a negative lookahead: it makes the match fail whenever the | is immediately followed by a dash, so only pipes not followed by - act as delimiters.
var pattern = @"\|(?!-)";
var results = Regex.Split(input, pattern);
foreach (var match in results) {
Console.WriteLine(match);
}
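Since the question also asks for the position of each separator, a small sketch using `Regex.Matches` with the same pattern could look like this. The names `PipeFinder` and `SeparatorPositions` and the shortened input string are illustrative assumptions, not part of the original posts.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PipeFinder
{
    // Hypothetical helper: zero-based indexes of every '|' that is
    // not immediately followed by '-'.
    public static List<int> SeparatorPositions(string input)
    {
        var positions = new List<int>();
        foreach (Match m in Regex.Matches(input, @"\|(?!-)"))
        {
            positions.Add(m.Index);
        }
        return positions;
    }

    public static void Main()
    {
        // Shortened stand-in for the question's string.
        string input = "ABC 1|-ABC 2|ABC 3|-ABC 4|ABC 5";
        // The pipes at indexes 5 and 18 are followed by '-', so only 12 and 25 are reported.
        Console.WriteLine(string.Join(", ", SeparatorPositions(input)));
    }
}
```

The returned indexes can be fed to `Substring` if manual slicing is preferred over `Regex.Split`.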
My Regex logic for this was:
the delimiter is pipe "[|]"
we will gather a series of characters that are not our delimiter
"(" not our delimiter ")" but at least one character "+"
"[^|]" is not our delimiter
"[|][-]" is also not our delimiter
Variable "pattern" could use a "*" instead of "+" if empty segments are acceptable. The pattern ends with a "?" since our final string segment (in your example) does not have a pipe character.
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;
namespace ConsoleTest1
{
class Program
{
static void Main(string[] args)
{
var input = "ABC 12.23-22-22-11|-ABC 33.20-ABC 44.00-ABC 11.00|ABC 12.23-22-22-11|-ABC 33.20-ABC 44.00-ABC11.00|ABC 12.23-22-22-11|-ABC 33.20-ABC 44.00-ABC 11.00";
var pattern = "([^|]|[|][-])+[|]?";
Match m;
m = Regex.Match(input, pattern);
while (m.Success) {
Debug.WriteLine(String.Format("Match from {0} for {1} characters", m.Index, m.Length));
Debug.WriteLine(input.Substring(m.Index, m.Length));
m = m.NextMatch();
}
}
}
}
Output is:
Match from 0 for 50 characters
ABC 12.23-22-22-11|-ABC 33.20-ABC 44.00-ABC 11.00|
Match from 50 for 49 characters
ABC 12.23-22-22-11|-ABC 33.20-ABC 44.00-ABC11.00|
Match from 99 for 49 characters
ABC 12.23-22-22-11|-ABC 33.20-ABC 44.00-ABC 11.00

Regex help - match any number of characters

I have following kind of string-sets in a text file:
<< /ImageType 1
/Width 986 /Height 1
/BitsPerComponent 8
/Decode [0 1 0 1 0 1]
/ImageMatrix [986 0 0 -1 0 1]
/DataSource <
803fe0503824160d0784426150b864361d0f8844625138a4562d178c466351b8e4763d1f904864523924964d27944a6552b964b65d2f984c665339a4d66d379c4e6753b9e4f67d3fa05068543a25168d47a4526954ba648202
> /LZWDecode filter >> image } def
There are 100s of Images defined like above.
I need to find all such images defined in the document.
Here is my code -
string txtFile = @"text file path";
string fileContents = File.ReadAllText(txtFile);
string pattern = @"<< /ImageType 1.*(\n|\r|\r\n)*image } def"; //match any number of characters between `<< /ImageType 1` and `image } def`
MatchCollection matchCollection = Regex.Matches(fileContents, pattern, RegexOptions.Singleline);
int count = matchCollection.Count; // returns 1
However, I am getting just one match, whereas there are around 600 images defined.
It seems they are all matched as one because of the 'newline' character used in the pattern.
Can anyone please advise what I need to change so that the regex returns the correct result of 600 matches?
The reason is that regex quantifiers such as * are greedy by default, i.e. each match is as long as possible. Thus, with RegexOptions.Singleline, the .* runs across every intermediate image } def and the match only ends at the last one in the file. I think the best approach here would be to perform two separate regex queries, one for << /ImageType 1 and one for image } def. Every match of the first pattern would correspond to exactly one match of the second one, and as these matches carry their indices in the original string, you can reconstruct each image by accessing the appropriate substring.
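The two-query idea can be sketched roughly as follows. This is only an illustration under stated assumptions: the helper name `ImageBlocks.Extract` is invented, and the code assumes every start marker is followed by its own end marker before the next start marker appears.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ImageBlocks
{
    // Hypothetical helper: pairs the i-th start marker with the i-th end marker
    // and returns the text spanning from one to the other.
    public static List<string> Extract(string text)
    {
        // Regex.Escape turns the literal markers into safe patterns.
        MatchCollection starts = Regex.Matches(text, Regex.Escape("<< /ImageType 1"));
        MatchCollection ends = Regex.Matches(text, Regex.Escape("image } def"));

        var blocks = new List<string>();
        int count = Math.Min(starts.Count, ends.Count);
        for (int i = 0; i < count; i++)
        {
            int from = starts[i].Index;
            int to = ends[i].Index + ends[i].Length;
            if (to > from)
            {
                blocks.Add(text.Substring(from, to - from));
            }
        }
        return blocks;
    }

    public static void Main()
    {
        string text = "<< /ImageType 1\n/Width 1\nimage } def\n"
                    + "<< /ImageType 1\n/Width 2\nimage } def\n";
        Console.WriteLine(Extract(text).Count);
    }
}
```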
Instead of .* you should use the non-greedy quantifier .*?:
string pattern = @"<< /ImageType 1.*?image } def";
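The difference between the greedy and the non-greedy form can be checked with a short sketch like the one below. The helper name `GreedyVsLazy.Count` and the two-image sample text are assumptions for illustration; `RegexOptions.Singleline` is kept from the question so that `.` also matches line breaks.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyVsLazy
{
    // Hypothetical helper: counts matches with '.' allowed to cross newlines.
    public static int Count(string text, string pattern)
    {
        return Regex.Matches(text, pattern, RegexOptions.Singleline).Count;
    }

    public static void Main()
    {
        string text = "<< /ImageType 1\n/Width 1\nimage } def\n"
                    + "<< /ImageType 1\n/Width 2\nimage } def\n";

        // Greedy: one match spanning both image definitions.
        Console.WriteLine(Count(text, @"<< /ImageType 1.*image } def"));
        // Lazy: one match per image definition.
        Console.WriteLine(Count(text, @"<< /ImageType 1.*?image } def"));
    }
}
```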
Here is a site that can help you out with REGEX that I use. http://webcheatsheet.com/php/regular_expressions.php.
if(preg_match('/^/[a-z]/i', $string, $matches)){
echo "Match was found <br />";
echo $matches[0];
}
