I have a text file as below:
1.1 - Hello
1.2 - world!
2.1 - Some
data
here and it contains some 32 digits so i cannot use \D+
2.2 - Etc..
so i want a regex to get 4 matches in this case for each point. My regex doesn't work as I wish. Please, advice:
private readonly Regex _reactionRegex = new Regex(#"(\d+)\.(\d+)\s*-\s*(.+)", RegexOptions.Compiled | RegexOptions.Singleline);
even this regex isn't very helpful:
(\d+)\.(\d+)\s*-\s*(.+)(?<!\d+\.\d+)
Alex, this regex will do it:
(?sm)^\d+\.\d+\s*-\s*((?:.(?!^\d+\.\d+))*)
This is assuming that you want to capture the point, without the numbers, for instance: just Hello
If you want to also capture the digits, for instance 1.1 - Hello, you can use the same regex and display the entire match, not just Group 1. The online demo below will show you both.
How does it work?
The idea is to capture the text you want to Group 1 using (parentheses).
We match in multi-line mode m to allow the anchor ^ to work on each line.
We match in dotall mode s to allow the dot to eat up strings on multiple lines
We use a negative lookahead (?! to stop eating characters when what follows is the beginning of the line with your digit marker
Here is full working code and an online demo.
using System;
using System.Text.RegularExpressions;
using System.Collections.Specialized;
class Program {
static void Main() {
string yourstring = #"1.1 - Hello
1.2 - world!
2.1 - Some
data
here and it contains some 32 digits so i cannot use \D+
2.2 - Etc..";
var resultList = new StringCollection();
try {
var yourRegex = new Regex(#"(?sm)^\d+\.\d+\s*-\s*((?:.(?!^\d+\.\d+))*)");
Match matchResult = yourRegex.Match(yourstring);
while (matchResult.Success) {
resultList.Add(matchResult.Groups[1].Value);
Console.WriteLine("Whole Match: " + matchResult.Value);
Console.WriteLine("Group 1: " + matchResult.Groups[1].Value + "\n");
matchResult = matchResult.NextMatch();
}
} catch (ArgumentException ex) {
// Syntax error in the regular expression
}
Console.WriteLine("\nPress Any Key to Exit.");
Console.ReadKey();
} // END Main
} // END Program
This may do for what you're looking for, though there is some ambiguity of the expected result.
(\d+)\.(\d+)\s*-\s*(.+?)(\n)(?>\d|$)
The ambiguity is for example what would you expect to match if data looked like:
1.1 - Hello
1.2 - world!
2.1 - Some
data here and it contains some
32 digits so i cannot use \D+
2.2 - Etc..
Not clear if 32 here starts a new record or not.
Related
I try to filter some strings I streamed for some useful information in C#.
I got two possible string structures:
string examplestring1 = "from - to (mm) no. 1\r\n\r\nna 570 - 590\r\n60 18.12.20\r\nna 5390 - 5410\r\n60 18.12.20\r\nna 11380 - 11390 60 18.12.20\r\nPage 1/1";
string examplestring2 = "e ne 570 - 590 ne 5390 - 5410 ne 11380 - 11390 e";
I'd like to get an array or a List of strings in the format of "xxx - xxx". Like:
string[] example = new string[]{"570 - 590","5390 - 5410","11380 - 11390"};
I tried to use Regex:
List<string> numbers = new List<string>();
numbers.AddRange(Regex.Split(examplestring2, #"\D+"));
At least I get a list only containg the numbers. But that doesn't work out for examplestring1 since there is date within.
Also I tried to play around with Regex pattern. But things like following does not work.
Regex.Split(examplestring1, #"\D+" + " - " + #"\D+");
I'd be grateful for a solution or at least some hint how to solve that matter.
You can use
var results = Regex.Matches(text, #"\d+\s*-\s*\d+").Cast<Match>().Select(x => x.Value);
See the regex demo. If there must be a single regular space on both ends of the -, you can use \d+ - \d+ regex.
If you want to match any -, you can use [\p{Pd}\xAD] instead of -.
Note that \d in .NET matches any Unicode digits, to only match ASCII digits, use RegexOptions.ECMAScript option: Regex.Matches(text, #"\d+\s*-\s*\d+", RegexOptions.ECMAScript).
So I want the formats xxxxxx-xxxx AND xxxxxxxx-xxxx to be possible. I've managed to fix the first section before the dash, but the last four digits are troublesome. It does require to match at least 4 characters, but I also want the regex to return false if there's more than 4 characters. How do I do it?
This is how it looks so far:
var regex = new Regex(#"^\d{6,8}[-|(\s)]{0,1}\d{4}");
And this is the results:
var regex = new Regex(#"^\d{6,8}[-|(\s)]{0,1}\d{4}");
Match m = regex.Match("840204-2344");
Console.WriteLine(m.Success); // Outputs True
Match m = regex.Match("19840204-2344");
Console.WriteLine(m.Success); // Outputs True
Match m = regex.Match("19840204-23");
Console.WriteLine(m.Success); // Outputs false
Match m = regex.Match("19840204-2323423423");
Console.WriteLine(m.Success); // Outputs true, and this is what I don't want
The \d{6,8} pattern matches 6, 7 or 8 digits, so that will already invalidate your regex pattern. Besdies, [-|(\s)]{0,1} matches 1 or 0 -, (, ), | or whitespace chars, and will also match strings like 19840204|2323, 19840204(2323 and 19840204)2323.
You may use
^\d{6}(?:\d{2})?[-\s]?\d{4}$
See the regex demo.
Details
^ - start of string
\d{6} - 6 digits
(?:\d{2})? - optional 2 digits
[-\s]? - 1 or 0 - or whitespaces
\d{4} - 4 digits
$ - end of string.
To make \d only match ASCII digits, pass RegexOptions.ECMAScriptoption. Example:
var res = Regex.IsMatch(s, #"^\d{6}(?:\d{2})?[-\s]?\d{4}$", RegexOptions.ECMAScript);
You are forgetting the $ at the end:
var regex = new Regex(#"^(\d{6}|\d{8})-\d{4}$");
If you want to match the social security number anywhere in a string, you van also use \b to test for boundaries:
var regex = new Regex(#"\b(\d{6}|\d{8})-\d{4}\b");
Edit: I corrected the RegEx to fix the problems mentioned in the comments. The commentors are right, of course. In my earlier post I just wanted to explain why the RegEx matched the longer string.
I want a single Regex expression to match 2 groups of lowercase, uppercase, numbers or special characters. Length needs to also be grater than 7.
I currently have this expression
^(?=.*[^a-zA-Z])(?=.*[a-z])(?=.*[A-Z]).{8,}$
It, however, forces the string to have lowercase and uppercase and digit or special character.
I currently have this implemented using 4 different regex expressions that I interrogate with some C# code.
I plan to reuse the same expression in JavaScript.
This is sample console app that shows the difference between 2 approaches.
class Program
{
private static readonly Regex[] Regexs = new[] {
new Regex("[a-z]", RegexOptions.Compiled), //Lowercase Letter
new Regex("[A-Z]", RegexOptions.Compiled), // Uppercase Letter
new Regex(#"\d", RegexOptions.Compiled), // Numeric
new Regex(#"[^a-zA-Z\d\s:]", RegexOptions.Compiled) // Non AlphaNumeric
};
static void Main(string[] args)
{
Regex expression = new Regex(#"^(?=.*[^a-zA-Z])(?=.*[a-z])(?=.*[A-Z]).{8,}$", RegexOptions.ECMAScript & RegexOptions.Compiled);
string[] testCases = new[] { "P#ssword", "Password", "P2ssword", "xpo123", "xpo123!", "xpo123!123##", "Myxpo123!123##", "Something_Really_Complex123!#43#2*333" };
Console.WriteLine("{0}\t{1}\t", "Single", "C# Hack");
Console.WriteLine("");
foreach (var testCase in testCases)
{
Console.WriteLine("{0}\t{2}\t : {1}", expression.IsMatch(testCase), testCase,
(testCase.Length >= 8 && Regexs.Count(x => x.IsMatch(testCase)) >= 2));
}
Console.ReadKey();
}
}
Result Proper Test String
------- ------- ------------
True True : P#ssword
False True : Password
True True : P2ssword
False False : xpo123
False False : xpo123!
False True : xpo123!123##
True True : Myxpo123!123##
True True : Something_Really_Complex123!#43#2*333
For javascript you can use this pattern that looks for boundaries between different character classes:
^(?=.*(?:.\b.|(?i)(?:[a-z]\d|\d[a-z])|[a-z][A-Z]|[A-Z][a-z]))[^:\s]{8,}$
if a boundary is found, you are sure to have two different classes.
pattern details:
\b # is a zero width assertion, it's a boundary between a member of
# the \w class and an other character that is not from this class.
.\b. # represents the two characters with the word boundary.
boundary between a letter and a number:
(?i) # make the subpattern case insensitive
(?:
[a-z]\d # a letter and a digit
| # OR
\d[a-z] # a digit and a letter
)
boundary between an uppercase and a lowercase letter:
[a-z][A-Z] | [A-Z][a-z]
since all alternations contains at least two characters from two different character classes, you are sure to obtain the result you hope.
You could use possessive quantifiers (emulated using atomic groups), something like this:
((?>[a-z]+)|(?>[A-Z]+)|(?>[^a-zA-Z]+)){2,}
Since using possessive matching will prevent backtracking, you won't run into the two groups being two consecutive groups of lowercase letters, for instance. So the full regex would be something like:
^(?=.*((?>[a-z]+)|(?>[A-Z]+)|(?>[^a-zA-Z]+)){2,}).{8,}$
Though, were it me, I'd cut the lookahead, just use the expression ((?>[a-z]+)|(?>[A-Z]+)|(?>[^a-zA-Z]+)){2,}, and check the length separately.
I have made application where i run to get html of a page,when i get it i have to mark the url as useable or not useable depending on different patterns. The patterns are provided in txt file :
Example:
+apple+banana+”baby cart” –blog
+”apple skin” +banana +”baby cart” –blog
+”apple skin” +”buy now” +jpg
The " is to tell for phrases than words.
html must contain apple AND banana AND baby cart AND CANNOT contain blog
html must contain apple skin AND banana AND baby cart AND CANNOT contain blog
html must contain apple skin AND buy now AND jpg
Problem
Can i uses regex in this case? If yes what would be the regex equivalent for the above patterns, so we can use them in the txt file except these and just use it as a pattern to match in HTML....
(The patterns are not Case sensitive).
A sample regex to at least dissect your search strings (although assuming - and " instead of – and ”):
(?<operator>[+-])?(?<word>["][^"]+["]|[^\s+-]+)
This matches a either a + or a - and the word or phrase that comes after it.
Quick PowerShell test:
PS> [regex]::matches($s, '(?<operator>[+-])?(?<word>["][^"]+["]|[^\s+-]+)')|ft -auto
Groups Success Captures Index Length Value
------ ------- -------- ----- ------ -----
{+apple, +, apple} True {+apple} 0 6 +apple
{+banana, +, banana} True {+banana} 6 7 +banana
{+"baby cart", +, "baby cart"} True {+"baby cart"} 13 12 +"baby cart"
{-blog, -, blog} True {-blog} 26 5 -blog
You can then process that to build a regex for your content, e.g.:
var re = #"(?<operator>[+-])?(?<word>[""][^""]+[""]|[^\s+-]+)";
var matches = Regex.Matches(s, re);
StringBuilder sb = new StringBuilder();
sb.Append("(?i)");
foreach (Match m in matches) {
sb.Append(string.Format("(?{1}.*{0})",
Regex.Escape(m.Groups["word"]).Trim('"'),
m.Groups["operator"] == "+" ? "=" : "!"));
}
var finalRe = sb.ToString();
But bear in mind that the resulting regex is very slow, especially for longer lists of words.
So I have this list of flight data and I need to be able to parse through it using regular expressions (this isn't the entire list).
1 AA2401 F7 A4 Y7 B7 M7 H7 K7 /DFW A LAX 4 0715 0836 E0.M80 9 3:21
2 AA2421 F7 A1 Y7 B7 M7 H7 K7 DFWLAX 4 1106 1215 E0.777 7 3:09
3UA:US6352 B9 M9 H9 K0 /DFW 1 LAX 1200 1448 E0.733 1:48
For example, I might need from the first line 1, AA, 2401, and so on and so on. Now, I'm not asking for someone to come up with a regular expression for me because for the most part I'm getting to where I can pretty much handle that myself. My issue has more to do with being able to store the data some where and access it.
So I'm just trying to initially just "match" the first piece of data I need, which is the line number '1'. My "pattern" for just getting the first number is: ".?(\d{1,2}).*" . The reason it's {1,2} is because obviously once you get past 10 it needs to be able to take 2 numbers. The rest of the line is set up so that it will definitely be a space or a letter.
Here's the code:
var assembly = Assembly.GetExecutingAssembly();
var textStreamReader = new StreamReader(
assembly.GetManifestResourceStream("FlightParser.flightdata.txt"));
List<string> lines = new List<string>();
do
{
lines.Add(textStreamReader.ReadLine());
} while (!textStreamReader.EndOfStream);
Regex sPattern = new Regex(#".?(\d{1,2}).*");//whatever the pattern is
foreach (string line in lines)
{
System.Console.Write("{0,24}", line);
MatchCollection mc = sPattern.Matches(line);
if ( sPattern.IsMatch(line))
{
System.Console.WriteLine(" (match for '{0}' found)", sPattern);
}
else
{
System.Console.WriteLine();
}
System.Console.WriteLine(mc[0].Groups[0].Captures);
System.Console.WriteLine(line);
}//end foreach
System.Console.ReadLine();
With the code I'm writing, I'm basically just trying to get '1' into the match collection and somehow access it and write it to the console (for the sake of testing, that's not the ultimate goal).
Your regex pattern includes an asterisk which matches any number of characters - ie. the whole line. Remove the "*" and it will only match the "1". You may find an online RegEx tester such as this useful.
Assuming your file is not actually formatted as you posted and has each of the fields separated by something, you can match the first two-digit number of the line with this regex (ignoring 0 and leading zeros):
^\s*([1-9]\d?)
Since it is grouped, you can access the matched part through the Groups property of the Match object.
var line = "12 foobar blah 123 etc";
var re = new Regex(#"^\s*([1-9]\d?)");
var match = re.Match(line);
if (match.Success)
{
Console.WriteLine(match.Groups[1].Value); // "12"
}
else
{
Console.WriteLine("No match");
}
The following expression matches the first digit, that you wanted to capture, in the group "First".
^\s*(?<First>\d{1})
I find this regular expression tool highly useful when dealing with regex. Give it a try.
Also set RegexOption to Multiline when you are making the match.