I have written an application that fetches the HTML of a page. Once I have it, I have to mark the URL as usable or not usable depending on different patterns. The patterns are provided in a text file:
Example:
+apple+banana+”baby cart” –blog
+”apple skin” +banana +”baby cart” –blog
+”apple skin” +”buy now” +jpg
The " marks phrases rather than single words.
html must contain apple AND banana AND baby cart AND CANNOT contain blog
html must contain apple skin AND banana AND baby cart AND CANNOT contain blog
html must contain apple skin AND buy now AND jpg
Problem
Can I use regex in this case? If so, what would the regex equivalents of the above patterns be, so that we can put them in the text file instead and match them directly against the HTML?
(The patterns are not case-sensitive.)
A sample regex to at least dissect your search strings (although assuming - and " instead of – and ”):
(?<operator>[+-])?(?<word>["][^"]+["]|[^\s+-]+)
This matches an optional + or - and the word or phrase that comes after it.
Quick PowerShell test:
PS> [regex]::matches($s, '(?<operator>[+-])?(?<word>["][^"]+["]|[^\s+-]+)')|ft -auto
Groups Success Captures Index Length Value
------ ------- -------- ----- ------ -----
{+apple, +, apple} True {+apple} 0 6 +apple
{+banana, +, banana} True {+banana} 6 7 +banana
{+"baby cart", +, "baby cart"} True {+"baby cart"} 13 12 +"baby cart"
{-blog, -, blog} True {-blog} 26 5 -blog
You can then process that to build a regex for your content, e.g.:
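var re = @"(?<operator>[+-])?(?<word>[""][^""]+[""]|[^\s+-]+)";
var matches = Regex.Matches(s, re);
StringBuilder sb = new StringBuilder();
// i: ignore case; s: let . match newlines in the HTML; \A: evaluate the lookaheads once, at the start of the input
sb.Append(@"(?is)\A");
foreach (Match m in matches) {
// Trim the quotes off phrases first, then escape regex metacharacters
sb.Append(string.Format("(?{1}.*{0})",
Regex.Escape(m.Groups["word"].Value.Trim('"')),
m.Groups["operator"].Value == "+" ? "=" : "!"));
}
var finalRe = sb.ToString();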
But bear in mind that the resulting regex is very slow, especially for longer lists of words, because every lookahead scans the whole input again.
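If the lookahead regex turns out to be too slow, the same check can be done without building a content regex at all. The following is only a sketch, not part of the answer above: IsUsable is a hypothetical helper, it assumes plain - and " characters as above, and it assumes that a term without an operator counts as required. It is written with top-level statements (C# 9 or later).

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: true when the HTML contains every "+" term
// and none of the "-" terms (case-insensitive substring search).
bool IsUsable(string html, string patternLine)
{
    var tokenRe = new Regex(@"(?<operator>[+-])?(?<word>""[^""]+""|[^\s+-]+)");
    foreach (Match m in tokenRe.Matches(patternLine))
    {
        string term = m.Groups["word"].Value.Trim('"');
        bool found = html.IndexOf(term, StringComparison.OrdinalIgnoreCase) >= 0;
        bool forbidden = m.Groups["operator"].Value == "-";
        // Assumption: a term without an operator is treated as required.
        if (found == forbidden) return false;
    }
    return true;
}

string pattern = "+apple+banana+\"baby cart\" -blog";
Console.WriteLine(IsUsable("<p>Apple, banana and a Baby Cart</p>", pattern)); // True
Console.WriteLine(IsUsable("<p>apple banana baby cart blog</p>", pattern));   // False
```

Each term costs one pass over the HTML, so the total work grows linearly with the number of terms.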
Related
Take this data as an example:
ID: JK546|Guitar: 0|Piano: 1|Violin: 0|Expiry: Aug14,2021
I was wondering if it's possible to create a regex that will return this set of matches
ID: JK546|Guitar: 0|Expiry: Aug14,2021
ID: JK546|Piano: 1|Expiry: Aug14,2021
ID: JK546|Violin: 0|Expiry: Aug14,2021
I did try creating one below:
ID: (?<id>\w+).*\|(?<instrument>\w+):\s(?<count>\d).*Expiry:\s(?<expiry>[\w\d]+)
but it only returned the one with the violin instrument. I would highly appreciate your insights on this.
I would not use a regular expression. Especially since the string ID: JK546|Guitar: 0|Expiry: Aug14,2021 does not appear in the string ID: JK546|Guitar: 0|Piano: 1|Violin: 0|Expiry: Aug14,2021, so it's not strictly a match, but more of a replacement. But there's no good way to get all replacements from all matches.
So, I'd just split the input string on |.
Then you want to compose a result string that is comprised of the first field, one of the middle fields, and the last field. You'll get one result for each middle field that exists. If it splits into N fields, you'll get N-2 results. e.g.: if it splits into 5 fields, then you'll get 3 results, one for each of the "middle" fields.
string input = "ID: JK546|Guitar: 0|Piano: 1|Violin: 0|Expiry: Aug14,2021";
string[] fields = input.Split('|');
for( int i = 1; i < fields.Length - 1; ++i) {
string result = string.Join("|", fields.First(), fields[i], fields.Last());
Console.WriteLine(result);
}
output:
ID: JK546|Guitar: 0|Expiry: Aug14,2021
ID: JK546|Piano: 1|Expiry: Aug14,2021
ID: JK546|Violin: 0|Expiry: Aug14,2021
A single regular expression to return multiple matches on multiple calls?
I wonder whether that is possible.
I’m not familiar with how to do regex processing in C#,
but this sed command will do what you want.
Perhaps you can understand how it works and adapt it to your needs:
sed -n ':loop; h; s/^\([^|]*|[^|]*\).*\(|.*\)$/\1\2/p; g; s/^\([^|]*\)|[^|]*\(|.*\)$/\1\2/; t loop'
For simplicity, let’s pretend that the input string is “A|B|C|D|E”.
What it does:
-n is the option to tell sed not to print anything automatically
(but only print when told to, with a p command).
:loop is a label for, effectively, a “goto”.
Think of it as the top of a while loop.
h saves the pattern space into the hold space.
In other words, make a copy of your string.
s/^\([^|]*|[^|]*\).*\(|.*\)$/\1\2/p captures the first two segments
and the last one, and prints the result.
So “A|B|C|D|E” becomes “A|B|E” (i.e., your first desired output).
g restores the saved string from the hold space into the pattern space.
In other words, retrieve the copy of the string that you saved.
s/^\([^|]*\)|[^|]*\(|.*\)$/\1\2/ captures the first segment,
skips the second, and then captures the rest.
So “A|B|C|D|E” becomes “A|C|D|E”.
t loop is the “goto” command.
It says to go back to the beginning of the loop
if the most recent substitution succeeded.
In other words, this is the end of the loop,
and the specification of the loop condition.
The second iteration of the loop will change “A|C|D|E” to “A|C|E”
and print it.
And then change “A|C|D|E” to “A|D|E” and iterate.
The third iteration of the loop will change “A|D|E” to “A|D|E” and print it.
(Obviously there is no change, because the .* in the middle of the regex
matches the zero-length string between “A|D” and “|E”.)
The final substitution changes “A|D|E” to “A|E”,
and then there is nothing left to find.
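For readers who want to adapt the sed loop to C#, here is a rough sketch of the same two substitutions with Regex.Replace. The helper name Expand is made up for illustration, and the code is written with top-level statements (C# 9 or later); treat it as one possible translation rather than a definitive one.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper mirroring the sed loop.
List<string> Expand(string input)
{
    // First substitution: keep the first two segments and the last one.
    var keepFirstTwoAndLast = new Regex(@"^([^|]*\|[^|]*).*(\|.*)$");
    // Second substitution: drop the second segment.
    var dropSecond = new Regex(@"^([^|]*)\|[^|]*(\|.*)$");

    var results = new List<string>();
    string current = input;
    while (true)
    {
        // Equivalent of "s/.../\1\2/p": record the result only on a match.
        if (keepFirstTwoAndLast.IsMatch(current))
            results.Add(keepFirstTwoAndLast.Replace(current, "$1$2"));

        // Equivalent of "t loop": continue only while the second substitution succeeds.
        if (!dropSecond.IsMatch(current)) break;
        current = dropSecond.Replace(current, "$1$2");
    }
    return results;
}

foreach (string line in Expand("ID: JK546|Guitar: 0|Piano: 1|Violin: 0|Expiry: Aug14,2021"))
    Console.WriteLine(line);
```

The local variable current plays the role of sed's hold space, so no explicit save and restore step is needed.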
You can make use of the .NET Group.Captures property to get the values of Guitar, Piano and Violin.
(ID: \w+\|)(\w+: \d+\|)+(Expiry: \w+,\d+)
The pattern matches:
(ID: \w+\|) Capture group 1 match ID: 1+ word chars and |
(\w+: \d+\|)+ Capture group 2 Repeat 1+ times matching 1+ word chars : 1+ digits |
(Expiry: \w+,\d+) Capture group 3 match Expiry: 1+ word chars , and 1+ digits
See a .NET regex demo | C# demo
For example
var str = "ID: JK546|Guitar: 0|Piano: 1|Violin: 0|Expiry: Aug14,2021";
string pattern = @"(ID: \w+\|)(\w+: \d+\|)+(Expiry: \w+,\d+)";
Match m = Regex.Match(str, pattern);
foreach(Capture c in m.Groups[2].Captures) {
Console.WriteLine(m.Groups[1].Value + c.Value + m.Groups[3].Value);
}
Output
ID: JK546|Guitar: 0|Expiry: Aug14,2021
ID: JK546|Piano: 1|Expiry: Aug14,2021
ID: JK546|Violin: 0|Expiry: Aug14,2021
It should be possible with look behind and look ahead:
string foo = @"ID: JK546 | Guitar: 0 | Piano: 1 | Violin: 0 | Expiry: Aug14,2021";
// First look at "Guitar: 0", "Piano: 1" and "Violin: 0". Then look behind "(?<= )" and search for the ID. Then look ahead "(?= )" and search for Expiry.
string pattern = @"(\w+: \d)(?<=(ID: [A-Z0-9]+).*?)(?=.*?(Expiry: \S+))";
foreach (var match in Regex.Matches(foo, pattern))
{
....
}
Fortunately, C# is one of the few languages whose regex engine can handle variable-length lookbehinds.
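To make the elided loop body concrete, here is a sketch of how the three groups could be recombined. It uses the input from the question (without spaces around the pipes) and top-level statements (C# 9 or later). The group numbering follows the pattern above: group 1 is the instrument, group 2 the ID from the lookbehind, group 3 the expiry from the lookahead.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

string input = "ID: JK546|Guitar: 0|Piano: 1|Violin: 0|Expiry: Aug14,2021";
string pattern = @"(\w+: \d)(?<=(ID: [A-Z0-9]+).*?)(?=.*?(Expiry: \S+))";

var lines = new List<string>();
foreach (Match m in Regex.Matches(input, pattern))
{
    // Reassemble ID | instrument | expiry for every instrument match.
    lines.Add(string.Join("|", m.Groups[2].Value, m.Groups[1].Value, m.Groups[3].Value));
}

foreach (string line in lines)
    Console.WriteLine(line);
```

Note that \d matches a single digit only, so this sketch assumes one-digit counts, as in the sample data.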
I have a text file as below:
1.1 - Hello
1.2 - world!
2.1 - Some
data
here and it contains some 32 digits so i cannot use \D+
2.2 - Etc..
So I want a regex that gets 4 matches in this case, one for each point. My regex doesn't work as I wish. Please advise:
private readonly Regex _reactionRegex = new Regex(@"(\d+)\.(\d+)\s*-\s*(.+)", RegexOptions.Compiled | RegexOptions.Singleline);
even this regex isn't very helpful:
(\d+)\.(\d+)\s*-\s*(.+)(?<!\d+\.\d+)
Alex, this regex will do it:
(?sm)^\d+\.\d+\s*-\s*((?:.(?!^\d+\.\d+))*)
This is assuming that you want to capture the point, without the numbers, for instance: just Hello
If you want to also capture the digits, for instance 1.1 - Hello, you can use the same regex and display the entire match, not just Group 1. The online demo below will show you both.
How does it work?
The idea is to capture the text you want to Group 1 using (parentheses).
We match in multi-line mode m to allow the anchor ^ to work on each line.
We match in dotall mode s to allow the dot to eat up strings on multiple lines
We use a negative lookahead (?! to stop eating characters when what follows is the beginning of the line with your digit marker
Here is full working code and an online demo.
using System;
using System.Text.RegularExpressions;
using System.Collections.Specialized;
class Program {
static void Main() {
string yourstring = @"1.1 - Hello
1.2 - world!
2.1 - Some
data
here and it contains some 32 digits so i cannot use \D+
2.2 - Etc..";
var resultList = new StringCollection();
try {
var yourRegex = new Regex(@"(?sm)^\d+\.\d+\s*-\s*((?:.(?!^\d+\.\d+))*)");
Match matchResult = yourRegex.Match(yourstring);
while (matchResult.Success) {
resultList.Add(matchResult.Groups[1].Value);
Console.WriteLine("Whole Match: " + matchResult.Value);
Console.WriteLine("Group 1: " + matchResult.Groups[1].Value + "\n");
matchResult = matchResult.NextMatch();
}
} catch (ArgumentException ex) {
// Syntax error in the regular expression
}
Console.WriteLine("\nPress Any Key to Exit.");
Console.ReadKey();
} // END Main
} // END Program
This may do what you're looking for, though there is some ambiguity about the expected result.
(\d+)\.(\d+)\s*-\s*(.+?)(\n)(?=\d|$)
The ambiguity is, for example, what you would expect to match if the data looked like this:
1.1 - Hello
1.2 - world!
2.1 - Some
data here and it contains some
32 digits so i cannot use \D+
2.2 - Etc..
Not clear if 32 here starts a new record or not.
I have the following kind of string sets in a text file:
<< /ImageType 1
/Width 986 /Height 1
/BitsPerComponent 8
/Decode [0 1 0 1 0 1]
/ImageMatrix [986 0 0 -1 0 1]
/DataSource <
803fe0503824160d0784426150b864361d0f8844625138a4562d178c466351b8e4763d1f904864523924964d27944a6552b964b65d2f984c665339a4d66d379c4e6753b9e4f67d3fa05068543a25168d47a4526954ba648202
> /LZWDecode filter >> image } def
There are 100s of Images defined like above.
I need to find all such images defined in the document.
Here is my code -
string txtFile = @"text file path";
string fileContents = File.ReadAllText(txtFile);
string pattern = @"<< /ImageType 1.*(\n|\r|\r\n)*image } def"; //match any number of characters between `<< /ImageType 1` and `image } def`
MatchCollection matchCollection = Regex.Matches(fileContents, pattern, RegexOptions.Singleline);
int count = matchCollection.Count; // returns 1
However, I am getting just one match, whereas there are around 600 images defined.
It seems they are all matched as one because of the newline handling in the pattern.
Can anyone please tell me what I need to change so that the regex returns all 600 matches?
The reason is that regex quantifiers are greedy by default, i.e. each match is as long as possible. Thus the .* runs all the way to the last image } def and swallows every earlier one. I think the best approach here would be to perform two separate regex queries, one for << /ImageType 1 and one for image } def. Every match of the first pattern would correspond to exactly one match of the second one, and as these matches carry their indices in the original string, you can reconstruct each image by taking the appropriate substring.
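The two-query idea could be sketched like this. ExtractImages is a hypothetical helper, the sample text is a shortened stand-in for the real file, and the pairing assumes the markers are well-formed, so that the i-th start belongs to the i-th end. The code uses top-level statements (C# 9 or later).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: pairs each start marker with the end marker of the same rank.
List<string> ExtractImages(string text)
{
    MatchCollection starts = Regex.Matches(text, @"<< /ImageType 1");
    MatchCollection ends = Regex.Matches(text, @"image } def");

    var images = new List<string>();
    int n = Math.Min(starts.Count, ends.Count);
    for (int i = 0; i < n; i++)
    {
        int begin = starts[i].Index;
        int end = ends[i].Index + ends[i].Length;
        images.Add(text.Substring(begin, end - begin));
    }
    return images;
}

string sample =
    "<< /ImageType 1\n/Width 986 /Height 1\n> /LZWDecode filter >> image } def\n" +
    "<< /ImageType 1\n/Width 10 /Height 1\n> /LZWDecode filter >> image } def\n";
Console.WriteLine(ExtractImages(sample).Count); // 2
```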
Instead of .* you should use the non-greedy quantifier .*?:
string pattern = @"<< /ImageType 1.*?image } def";
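A quick way to see the difference is to count the matches of both variants on a small sample. The sample string below is a shortened stand-in for the real file, and the snippet uses top-level statements (C# 9 or later).

```csharp
using System;
using System.Text.RegularExpressions;

string sample =
    "<< /ImageType 1\n/Width 986 /Height 1\n/DataSource <\n803fe0\n> /LZWDecode filter >> image } def\n" +
    "<< /ImageType 1\n/Width 10 /Height 1\n/DataSource <\n0a0b0c\n> /LZWDecode filter >> image } def\n";

// Singleline lets . match newlines in both cases; only the quantifier differs.
int greedy = Regex.Matches(sample, @"<< /ImageType 1.*image } def", RegexOptions.Singleline).Count;
int lazy = Regex.Matches(sample, @"<< /ImageType 1.*?image } def", RegexOptions.Singleline).Count;

Console.WriteLine($"greedy: {greedy}, lazy: {lazy}"); // greedy: 1, lazy: 2
```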
Here is a site that can help you out with REGEX that I use. http://webcheatsheet.com/php/regular_expressions.php.
if(preg_match('/^[a-z]/i', $string, $matches)){
echo "Match was found <br />";
echo $matches[0];
}
So I have this list of flight data and I need to be able to parse through it using regular expressions (this isn't the entire list).
1 AA2401 F7 A4 Y7 B7 M7 H7 K7 /DFW A LAX 4 0715 0836 E0.M80 9 3:21
2 AA2421 F7 A1 Y7 B7 M7 H7 K7 DFWLAX 4 1106 1215 E0.777 7 3:09
3UA:US6352 B9 M9 H9 K0 /DFW 1 LAX 1200 1448 E0.733 1:48
For example, from the first line I might need 1, AA, 2401, and so on. Now, I'm not asking for someone to come up with a regular expression for me, because for the most part I can handle that myself. My issue has more to do with being able to store the data somewhere and access it.
So I'm just trying to initially just "match" the first piece of data I need, which is the line number '1'. My "pattern" for just getting the first number is: ".?(\d{1,2}).*" . The reason it's {1,2} is because obviously once you get past 10 it needs to be able to take 2 numbers. The rest of the line is set up so that it will definitely be a space or a letter.
Here's the code:
var assembly = Assembly.GetExecutingAssembly();
var textStreamReader = new StreamReader(
assembly.GetManifestResourceStream("FlightParser.flightdata.txt"));
List<string> lines = new List<string>();
do
{
lines.Add(textStreamReader.ReadLine());
} while (!textStreamReader.EndOfStream);
Regex sPattern = new Regex(@".?(\d{1,2}).*");//whatever the pattern is
foreach (string line in lines)
{
System.Console.Write("{0,24}", line);
MatchCollection mc = sPattern.Matches(line);
if ( sPattern.IsMatch(line))
{
System.Console.WriteLine(" (match for '{0}' found)", sPattern);
}
else
{
System.Console.WriteLine();
}
System.Console.WriteLine(mc[0].Groups[0].Captures);
System.Console.WriteLine(line);
}//end foreach
System.Console.ReadLine();
With the code I'm writing, I'm basically just trying to get '1' into the match collection and somehow access it and write it to the console (for the sake of testing, that's not the ultimate goal).
Your regex pattern ends with ".*", which matches any number of characters, i.e. the rest of the line, so the overall match is the whole line. Remove the trailing ".*" (not just the "*") and the overall match shrinks to the leading number. You may find an online RegEx tester such as this useful.
Assuming your file is not actually formatted as you posted and has each of the fields separated by something, you can match the one- or two-digit number at the start of the line with this regex (it ignores 0 and leading zeros):
^\s*([1-9]\d?)
Since it is grouped, you can access the matched part through the Groups property of the Match object.
var line = "12 foobar blah 123 etc";
var re = new Regex(@"^\s*([1-9]\d?)");
var match = re.Match(line);
if (match.Success)
{
Console.WriteLine(match.Groups[1].Value); // "12"
}
else
{
Console.WriteLine("No match");
}
The following expression matches the first digit, that you wanted to capture, in the group "First".
^\s*(?<First>\d{1})
I find this regular expression tool highly useful when dealing with regex. Give it a try.
Also set RegexOptions.Multiline when you are making the match, if you run the regex against the whole file rather than line by line.
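As an illustration of the named group together with RegexOptions.Multiline, the sketch below runs one regex over all three sample lines at once. Two details are my own assumptions rather than part of the answer: \d{1,2} instead of \d{1}, so that line numbers from 10 upwards are captured whole, and [ \t]* instead of \s*, so that the leading whitespace cannot run across line breaks. It uses top-level statements (C# 9 or later).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

string data =
    "1 AA2401 F7 A4 Y7 B7 M7 H7 K7 /DFW A LAX 4 0715 0836 E0.M80 9 3:21\n" +
    "2 AA2421 F7 A1 Y7 B7 M7 H7 K7 DFWLAX 4 1106 1215 E0.777 7 3:09\n" +
    "3UA:US6352 B9 M9 H9 K0 /DFW 1 LAX 1200 1448 E0.733 1:48\n";

// With Multiline, ^ matches at the start of every line, not just at the start of the input.
var re = new Regex(@"^[ \t]*(?<First>\d{1,2})", RegexOptions.Multiline);

var numbers = new List<string>();
foreach (Match m in re.Matches(data))
    numbers.Add(m.Groups["First"].Value); // access the capture by group name

Console.WriteLine(string.Join(", ", numbers)); // 1, 2, 3
```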
Here's a quickie for you RegEx wizards. I need a regular expression that will find groups of words. Any group of words. For instance, I'd like it to find the first two words in any sentence.
Example "Hi there, how are you?" - Return would be "Hi there"
Example "How are you doing?" - Return would be "How are"
Try this:
^\w+\s+\w+
Explanation: starting at the beginning of the string, one or more word characters, then one or more whitespace characters, then one or more word characters.
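Applied in C#, the pattern could be used as in the sketch below. FirstTwoWords is a hypothetical helper name, and the code uses top-level statements (C# 9 or later).

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: returns the first two words, or an empty string if there is no match.
string FirstTwoWords(string sentence)
{
    Match m = Regex.Match(sentence, @"^\w+\s+\w+");
    return m.Success ? m.Value : string.Empty;
}

Console.WriteLine(FirstTwoWords("Hi there, how are you?")); // Hi there
Console.WriteLine(FirstTwoWords("How are you doing?"));     // How are
```

Note that the two words must be separated by whitespace only, so a sentence such as "Hi, there" gives no match with this pattern.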
Regular expressions can be used to split natural-language text into words, and they are a natural tool for that. After gathering the words, use a dictionary to see if they're actually words in a particular language.
The premise is to define a regular expression that will split out 99.9% of possible words, with "word" being the key definition.
I assume C#'s regex syntax is broadly compatible with Perl 5.8. (Note that .NET does not support POSIX bracket classes such as [[:punct:]]; use \p{P} there.)
This is my ascii definition of how to split out words (expanded):
regex = '[\s[:punct:]]* (\w (?: \w | [[:punct:]](?=[\w[:punct:]]) )* )'
and unicode (more has to be added/subtracted to suit specific encodings):
regex = '[\s\pP]* ([\pL\pN_-] (?: [\pL\pN_-] | \pP(?=[\pL\pN\pP_-]) )* )'
To find ALL of the words, interpolate the regex string into a regex (I don't know C#):
@matches = $text =~ /$regex/xg
where /xg are the expanded and global modifiers. Note that there is only capture group 1 in the regex string so the intervening text is not captured.
To find just the FIRST TWO:
@matches = $text =~ /(?:$regex)(?:$regex)/x
Below is a Perl sample. Anyway, play around with it. Cheers!
use strict;
use warnings;
binmode (STDOUT,':utf8');
# Unicode
my $regex = qr/ [\s\pP]* ([\pL\pN_-] (?: [\pL\pN_-] | \pP(?=[\pL\pN\pP_-]) )* ) /x;
# Ascii
# my $regex = qr/ [\s[:punct:]]* (\w (?: \w | [[:punct:]](?=[\w[:punct:]]) )* ) /x;
my $text = q(
I confirm that sufficient information and detail have been
reported in this technical report, that it's "scientifically" sound,
and that appropriate conclusion's have been included
);
print "\n**\n$text\n";
my @matches = $text =~ /$regex/g;
print "\nTotal ".scalar(@matches)." words\n",'-'x20,"\n";
for (@matches) {
print "$_\n";
}
# =======================================
my $junk = q(
Hi, there, A écafé and Horse d'oeuvre
hasn't? 'n? '? a-b? -'a-?
);
print "\n\n**\n$junk\n";
# First 2 words
@matches = $junk =~ /(?:$regex)(?:$regex)/;
print "\nFirst 2 words\n",'-'x20,"\n";
for (@matches) {
print "$_\n";
}
# All words
@matches = $junk =~ /$regex/g;
print "\nTotal ".scalar(@matches)." words\n",'-'x20,"\n";
for (@matches) {
print "$_\n";
}
Output:
**
I confirm that sufficient information and detail have been
reported in this technical report, that it's "scientifically" sound,
and that appropriate conclusion's have been included
Total 25 words
--------------------
I
confirm
that
sufficient
information
and
detail
have
been
reported
in
this
technical
report
that
it's
scientifically
sound
and
that
appropriate
conclusion's
have
been
included
**
Hi, there, A écafé and Horse d'oeuvre
hasn't? 'n? '? a-b? -'a-?
First 2 words
--------------------
Hi
there
Total 11 words
--------------------
Hi
there
A
écafé
and
Horse
d'oeuvre
hasn't
n
a-b
a-
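Since the answer above leaves the C# side open, here is a tentative port of the Unicode variant. The names wordPattern and AllWords are made up for illustration, the pattern is written without the expanded (x) layout, and .NET spells the classes \p{P}, \p{L} and \p{N}. Treat it as a sketch that follows the Perl logic, not as a verified drop-in replacement. It uses top-level statements (C# 9 or later).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Same structure as the Perl $regex: skip leading whitespace and punctuation, then capture
// a word that may contain punctuation as long as more word material follows it.
string wordPattern = @"[\s\p{P}]*([\p{L}\p{N}_-](?:[\p{L}\p{N}_-]|\p{P}(?=[\p{L}\p{N}\p{P}_-]))*)";

// Hypothetical helper: collects capture group 1 of every match.
List<string> AllWords(string text)
{
    var words = new List<string>();
    foreach (Match m in Regex.Matches(text, wordPattern))
        words.Add(m.Groups[1].Value);
    return words;
}

string junk = "Hi, there, A écafé and Horse d'oeuvre\nhasn't? 'n? '? a-b? -'a-?";

// All words
List<string> all = AllWords(junk);
Console.WriteLine("Total " + all.Count + " words");
foreach (string w in all) Console.WriteLine(w);

// First two words: the pattern twice in a row gives capture groups 1 and 2.
Match firstTwo = Regex.Match(junk, "^(?:" + wordPattern + ")(?:" + wordPattern + ")");
Console.WriteLine(firstTwo.Groups[1].Value + " " + firstTwo.Groups[2].Value);
```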
@Rubens Farias:
Per my comment, here's the code I used:
public int startAt = 0;
private void btnGrabWordPairs_Click(object sender, EventArgs e)
{
Regex regex = new Regex(@"\b\w+\s+\w+\b"); //Start at word boundary, find one or more word chars, one or more whitespaces, one or more word chars, end at word boundary
if (startAt <= txtTest.Text.Length)
{
Match match = regex.Match(txtTest.Text, startAt);
if (match.Success)
{
MessageBox.Show(match.Value);
startAt = match.Index + match.Length; //update the starting position to the end of the last match
}
}
}
Each time the button is clicked it grabs pairs of words quite nicely, proceeding through the text in the txtTest TextBox and finding the pairs sequentially until the end of the string is reached.
@sln: Thanks for the extremely detailed response!