Regex loop is slow over large files: how can I solve this? - C#

I am trying to run a regex inside a loop, but it is slow for files larger than 500 KB.
Please help me.
using System;
using System.IO;
using System.Text.RegularExpressions;
class Program
{
    static void Main()
    {
        string Value3 = @"((#40y#.*?xlrz)(.*?)(#41y#[\s\S\r\n]*?)(\2))";
        var match = Regex.Match(File.ReadAllText(@"C:\Users\diego\Desktop\pruebatabladinamica\pruebassregex.txt"), Value3);
        while (match.Success)
        {
            var match2 = Regex.Match(File.ReadAllText(@"C:\Users\diego\Desktop\pruebatabladinamica\pruebassregex.txt"), Value3);
            Regex rgx = new Regex(Value3);
            match = match2;
            string strFile3 = rgx.Replace(File.ReadAllText(@"C:\Users\diego\Desktop\pruebatabladinamica\pruebassregex.txt"), "$1$5$3+", 1);
            File.WriteAllText(@"C:\Users\diego\Desktop\pruebatabladinamica\pruebassregex.txt", strFile3);
            File.WriteAllText(@"C:\Users\diego\Desktop\pruebatabladinamica\backupregex.txt", string.Concat(File.ReadAllText(@"C:\Users\diego\Desktop\pruebatabladinamica\backupregex.txt"), strFile3.Substring(0, match2.Index + match2.Length)));
            File.WriteAllText(@"C:\Users\diego\Desktop\pruebatabladinamica\pruebassregex.txt", strFile3.Substring(match2.Index + match2.Length, strFile3.Length - match2.Index - match2.Length));
            strFile3 = null;
            int oldCacheSize = Regex.CacheSize;
            Regex.CacheSize = 0;
            GC.Collect();
            Regex.CacheSize = oldCacheSize;
        }
    }
}
How can I solve this? In each iteration of the loop, the regex receives a string of about 10 MB and makes one replacement, but the process is too slow. Is there any method to solve this problem?
When the file is small, the regex processes each replacement quickly.
My code divides the string in two, then takes the second part and finds a word near its beginning, then uses the match index and length to divide the new string, and repeats the same process in a loop.
The regex finds the match quickly in a small string (500 KB),
but for a large string (1 MB) it becomes slow.
Input:
word1 word2 word3 replace
word1 word2 word3 replace1
word1 word2 word3 replace2
Output:
word1 word2 word3 replace
word1 word2 word3 replace+replace1
word1 word2 word3 replace+replace1+replace2
002-0759586-1#39y#REPARTO 01#40y#002-075958655xlrz10,4#41y##42y#-10.20
002-0759586-2#39y#REPARTO 01#40y#002-0759586xlrz54#41y#0#42y#
002-0759586-2#39y#REPARTO 01#40y#002-0759586xlrz56#41y#0#42y#
002-0759586-2#39y#REPARTO 01#40y#002-0759586xlrz57#41y#0#42y#

An attempt at an answer.
Problem 1. Regex
Existing:
( # (1 start)
( \# 40y \# .*? xlrz ) # (2)
( .*? ) # (3)
( \# 41y \# [\s\S\r\n]*? ) # (4)
( \2 ) # (5)
) # (1 end)
What it should be:
( # (1 start)
\# 40y \#
.*?
xlrz
) # (1 end)
( .*? ) # (2)
\# 41y \#
(?s: .*? )
\1
Capture group changes (old -> new):
1 -> 0
2 -> 1
3 -> 2
4 -> N/A
5 -> 1
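The group renumbering above also changes the replacement string. A minimal sketch, assuming the question's replacement "$1$5$3+" and a made-up one-line sample (the class and method names are hypothetical): under the old -> new mapping the replacement becomes "$0$1$2+", and both versions should produce the same text.

```csharp
using System;
using System.Text.RegularExpressions;

static class GroupRemapDemo
{
    // Pattern from the question (five groups, backreference \2).
    const string OldPattern = @"((#40y#.*?xlrz)(.*?)(#41y#[\s\S\r\n]*?)(\2))";
    // Simplified pattern from the answer (two groups, backreference \1).
    const string NewPattern = @"(#40y#.*?xlrz)(.*?)#41y#(?s:.*?)\1";

    // Replacement exactly as written in the question, first match only.
    public static string ReplaceOld(string input) =>
        new Regex(OldPattern).Replace(input, "$1$5$3+", 1);

    // Same replacement after remapping: 1 -> 0, 5 -> 1, 3 -> 2.
    public static string ReplaceNew(string input) =>
        new Regex(NewPattern).Replace(input, "$0$1$2+", 1);

    static void Main()
    {
        string sample = "#40y#AxlrzB#41y#C#40y#AxlrzD"; // made-up sample, not real data
        Console.WriteLine(ReplaceOld(sample));
        Console.WriteLine(ReplaceNew(sample));
    }
}
```

Without RegexOptions.IgnorePatternWhitespace the # characters are literals, so the \# escapes of the expanded form are not needed here.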
Benchmark
Regex1: ((\#40y\#.*?xlrz)(.*?)(\#41y\#[\s\S\r\n]*?)(\2))
Options: < none >
Completed iterations: 50 / 50 ( x 1000 )
Matches found per iteration: 1
Elapsed Time: 4.04 s, 4042.77 ms, 4042771 µs
Regex2: ((\#40y\#.*?xlrz)(.*?)\#41y\#(?s:.*?)\2)
Options: < none >
Completed iterations: 50 / 50 ( x 1000 )
Matches found per iteration: 1
Elapsed Time: 1.91 s, 1913.65 ms, 1913650 µs
The general regex problem:
You are delimiting a string with what is captured in Group 1.
The problem is group 1 contains a middle sub-expression .*? which is
a hole that backtracking drives a truck through.
This probably can't be avoided, but since you are only matching once each
time, it might not make a difference.
Problem 2. Regex
Never construct a regex inside a loop. Construct it once, outside the loop.
If you feel you need to use the same regex twice within the same loop,
red flags should go up, as this is never necessary.
If you do, however, make two separate Regex variables from the same pattern,
i.e. Regex rgx1 = new Regex(Value3); Regex rgx2 = new Regex(Value3);
then use the instance methods (not the static class methods) for matching.
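As a rough illustration of this advice (a sketch only: the class and method names are hypothetical, the pattern is the simplified one from Problem 1, and RegexOptions.Compiled is an optional assumption that may or may not pay off for a given workload), the Regex is built once and the loop only calls instance methods on it:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class HoistedRegexDemo
{
    // Constructed once; nothing inside the loop creates or parses a pattern.
    static readonly Regex Rx = new Regex(
        @"(#40y#.*?xlrz)(.*?)#41y#(?s:.*?)\1",
        RegexOptions.Compiled);

    // Returns the start index of every match, walking the text with NextMatch().
    public static List<int> MatchIndexes(string text)
    {
        var indexes = new List<int>();
        for (Match m = Rx.Match(text); m.Success; m = m.NextMatch())
            indexes.Add(m.Index);
        return indexes;
    }

    static void Main()
    {
        string sample = "xx#40y#AxlrzB#41y#C#40y#AxlrzD"; // made-up sample
        Console.WriteLine(string.Join(",", MatchIndexes(sample)));
    }
}
```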
Problem 3. Program flow
You are manipulating a single file, C:\Users\diego\Desktop\pruebatabladinamica\pruebassregex.txt
Within a loop, you constantly read it, manipulate it, then write it.
This is problematic at best.
The only reason to ever do this would be to make temporary copies, not what
you're doing with it now.
If you think for some reason you're saving memory resources, you'd be mistaken.
For one, the stack is of limited size as compared to the heap.
Once you have the ORIGINAL file read into a string variable, do all the
operations on that string, then as a last step, write it out to the file.
This part of your code is inappropriate, but used as an example:
string strFile3 = rgx.Replace(File.ReadAllText(@"C:\Users\diego\Desktop\pruebatabladinamica\pruebassregex.txt"), "$1$5$3+", 1);
File.WriteAllText(@"C:\Users\diego\Desktop\pruebatabladinamica\pruebassregex.txt", strFile3);
File.WriteAllText(@"C:\Users\diego\Desktop\pruebatabladinamica\backupregex.txt", string.Concat(File.ReadAllText(@"C:\Users\diego\Desktop\pruebatabladinamica\backupregex.txt"), strFile3.Substring(0, match2.Index + match2.Length)));
File.WriteAllText(@"C:\Users\diego\Desktop\pruebatabladinamica\pruebassregex.txt", strFile3.Substring(match2.Index + match2.Length, strFile3.Length - match2.Index - match2.Length));
All can be reduced to the manipulation of strFile3 without doing the
intense overhead of reading and writing and thrashing the disk.
In actuality, this is what is causing the performance lag you see on a larger
file.
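One possible in-memory version of that loop, offered as a sketch rather than a drop-in fix: it assumes the simplified pattern from Problem 1, and the names Process, Done and Rest are hypothetical. The text that the original code appends to the backup file is collected in a StringBuilder, and the remainder that the original code writes back to the working file stays in a string variable.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

static class InMemoryLoopDemo
{
    static readonly Regex Rx = new Regex(@"(#40y#.*?xlrz)(.*?)#41y#(?s:.*?)\1");

    // Done: everything up to the end of each match (what went to the backup file).
    // Rest: the remainder still to be searched (what went back to the working file).
    public static (string Done, string Rest) Process(string text)
    {
        var done = new StringBuilder();
        string rest = text;
        Match m = Rx.Match(rest);
        while (m.Success)
        {
            int end = m.Index + m.Length;
            done.Append(rest, 0, end);
            // Equivalent to replacing with "$0$1$2+" and cutting at the end of the match.
            rest = m.Groups[1].Value + m.Groups[2].Value + "+" + rest.Substring(end);
            m = Rx.Match(rest);
        }
        return (done.ToString(), rest);
    }

    static void Main()
    {
        var result = Process("#40y#AxlrzB#41y#C#40y#AxlrzD"); // made-up sample
        Console.WriteLine(result.Done);
        Console.WriteLine(result.Rest);
    }
}
```

With this shape, File.ReadAllText would be called once before Process and File.WriteAllText once after it. The Substring call still copies the remainder on every iteration, so this removes the disk traffic but not all of the copying.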
End of attempt ..

Related

C# - Getting multiple values with a single key, from a text file

I store multiple values that shares a single key on a text file. The text file looks like that:
Brightness 36 , Manual
BacklightCompensation 3 , Manual
ColorEnable 0 , None
Contrast 16 , Manual
Gain 5 , Manual
Gamma 122 , Manual
Hue 0 , Manual
Saturation 100 , Manual
Sharpness 2 , Manual
WhiteBalance 5450 , Auto
Now I want to store the int value & string value of each key (Brightness, for example).
I'm new to C# and couldn't find anything that worked yet.
Thanks
I'd recommend to use custom types to store these settings like these:
public enum DisplaySettingType
{
Manual, Auto, None
}
public class DisplaySetting
{
public string Name { get; set; }
public decimal Value { get; set; }
public DisplaySettingType Type { get; set; }
}
Then you could use following LINQ query using string.Split to get all settings:
decimal value = 0;
DisplaySettingType type = DisplaySettingType.None;
IEnumerable<DisplaySetting> settings = File.ReadLines(path)
.Select(l => l.Trim().Split(new[] { ' ', ',' }, StringSplitOptions.RemoveEmptyEntries))
.Where(arr => arr.Length >= 3 && decimal.TryParse(arr[1], out value) && Enum.TryParse(arr[2], out type))
.Select(arr => new DisplaySetting { Name = arr[0], Value = value, Type = type });
With a regex and a little bit of LINQ you can do many things.
Here I assume you know how to read a text file.
Pros: if the file is not perfect, the regex will just ignore the misformatted line and won't throw an error.
Here is a hard-coded version of your file. Note that a \r will appear in the matches because of the line endings inside the literal. That depends on how you read your file, but it should not be the case with File.ReadLines().
string input =
@"Brightness 36 , Manual
BacklightCompensation 3 , Manual
ColorEnable 0 , None
Contrast 16 , Manual
Gain 5 , Manual
Gamma 122 , Manual
Hue 0 , Manual
Saturation 100 , Manual
Sharpness 2 , Manual
WhiteBalance 5450 , Auto";
string regEx = @"(.*) (\d+) , (.*)";
var RegexMatch = Regex.Matches(input, regEx).Cast<Match>();
var outputlist = RegexMatch.Select(x => new { setting = x.Groups[1].Value
                                            , value = x.Groups[2].Value
                                            , mode = x.Groups[3].Value });
Regex explanation: /(.*) (\d+) , (.*)/g
1st Capturing Group (.*)
.* matches any character (except for line terminators)
* Quantifier: matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
' ' matches the space character literally (case sensitive)
2nd Capturing Group (\d+)
\d+ matches a digit (equal to [0-9])
+ Quantifier: matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
' , ' matches the characters ' , ' literally (case sensitive)
3rd Capturing Group (.*)
.* matches any character (except for line terminators)
* Quantifier: matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
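A variation on the same idea, as a sketch only: the pattern below is not the one from the answer. It swaps the (.*) groups for \S+ and \w+ so that a trailing \r is never captured, and it converts the number to int. The class name, the Parse method and the tuple field names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

static class SettingsParseDemo
{
    // name, whitespace, digits, optional whitespace around the comma, mode word
    static readonly Regex LineRx = new Regex(@"(\S+)\s+(\d+)\s*,\s*(\w+)");

    // Lines that do not fit the pattern are skipped silently.
    public static List<(string Name, int Value, string Mode)> Parse(string input)
    {
        var result = new List<(string Name, int Value, string Mode)>();
        foreach (Match m in LineRx.Matches(input))
        {
            int value = int.Parse(m.Groups[2].Value, CultureInfo.InvariantCulture);
            result.Add((m.Groups[1].Value, value, m.Groups[3].Value));
        }
        return result;
    }

    static void Main()
    {
        foreach (var s in Parse("Brightness 36 , Manual\r\nWhiteBalance 5450 , Auto"))
            Console.WriteLine($"{s.Name} = {s.Value} ({s.Mode})");
    }
}
```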
Disclaimer:
Never trust an input! Even if it's a file that some other program produced, or one sent by a customer.
From my experience, you then have two ways of handling a bad format:
Read line by line, and log every bad line.
Or ignore them. You don't fit, you don't sit!
And don't tell yourself it won't happen, it will!

How to write a regular expression that captures tags in a comma-separated list?

Here is my input:
#
tag1, tag with space, !##%^, 🦄
I would like to match it with a regex and yield the following elements easily:
tag1
tag with space
!##%^
🦄
I know I could do it this way:
var match = Regex.Match(input, @"^#[\n](?<tags>[\S ]+)$");
// if match is a success
var tags = match.Groups["tags"].Value.Split(',').Select(x => x.Trim());
But that's cheating, as it involves messing around with C#. There must be a neat way to do this with a regex. Just must be... right? ;D
The question is: how to write a regular expression that would allow me to iterate through captures and extract tags, without the need of splitting and trimming?
This works (?ms)^\#\s+(?:\s*((?:(?!,|^\#\s+).)*?)\s*(?:,|$))+
It uses C#'s Capture Collection to find a variable amount of field data
in a single record.
You could extend the regex further to get all records at once.
Where each record contains its own variable amount of field data.
The regex has built-in trimming as well.
Expanded:
(?ms) # Inline modifiers: multi-line, dot-all
^ \# \s+ # Beginning of record
(?: # Quantified group, 1 or more times, get all fields of record at once
\s* # Trim leading wsp
( # (1 start), # Capture collector for variable fields
(?: # One char at a time, but not comma or begin of record
(?!
,
| ^ \# \s+
)
.
)*?
) # (1 end)
\s*
(?: , | $ ) # End of this field, comma or EOL
)+
C# code:
string sOL = @"
#
tag1, tag with space, !##%^, 🦄";
Regex RxOL = new Regex(@"(?ms)^\#\s+(?:\s*((?:(?!,|^\#\s+).)*?)\s*(?:,|$))+");
Match _mOL = RxOL.Match(sOL);
while (_mOL.Success)
{
    CaptureCollection ccOL1 = _mOL.Groups[1].Captures;
    Console.WriteLine("-------------------------");
    for (int i = 0; i < ccOL1.Count; i++)
        Console.WriteLine(" '{0}'", ccOL1[i].Value);
    _mOL = _mOL.NextMatch();
}
Output:
-------------------------
'tag1'
'tag with space'
'!##%^'
'??'
''
Press any key to continue . . .
Nothing wrong with cheating ;]
string input = @"#
tag1, tag with space, !##%^, 🦄";
string[] tags = Array.ConvertAll(input.Split('\n').Last().Split(','), s => s.Trim());
You can pretty much make it without regex. Just split it like this:
var result = input.Split(new []{'\n','\r'}, StringSplitOptions.RemoveEmptyEntries).Skip(1).SelectMany(x=> x.Split(new []{','},StringSplitOptions.RemoveEmptyEntries).Select(y=> y.Trim()));

Regex help - match any number of characters

I have following kind of string-sets in a text file:
<< /ImageType 1
/Width 986 /Height 1
/BitsPerComponent 8
/Decode [0 1 0 1 0 1]
/ImageMatrix [986 0 0 -1 0 1]
/DataSource <
803fe0503824160d0784426150b864361d0f8844625138a4562d178c466351b8e4763d1f904864523924964d27944a6552b964b65d2f984c665339a4d66d379c4e6753b9e4f67d3fa05068543a25168d47a4526954ba648202
> /LZWDecode filter >> image } def
There are 100s of Images defined like above.
I need to find all such images defined in the document.
Here is my code -
string txtFile = @"text file path";
string fileContents = File.ReadAllText(txtFile);
string pattern = @"<< /ImageType 1.*(\n|\r|\r\n)*image } def"; //match any number of characters between `<< /ImageType 1` and `image } def`
MatchCollection matchCollection = Regex.Matches(fileContents, pattern, RegexOptions.Singleline);
int count = matchCollection.Count; // returns 1
However, I am getting just one match - whereas there are around 600 images defined.
But it seems they all are matched in one because of 'newline' character used in pattern.
Can anyone please guide what do I need to modify the correct result of regex match as 600.
The reason is that regex quantifiers are greedy by default, i.e. each match is as long as possible. Thus, every intermediate image } def is swallowed by the .* and the single match only ends at the last one. I think the best approach here would be to perform two separate regex queries, one for << /ImageType 1 and one for image } def. Every match of the first pattern corresponds to exactly one match of the second one, and as these matches carry their indices in the original string, you can reconstruct each image by taking the appropriate substring.
Instead of .* you should use the non-greedy quantifier .*?:
string pattern = @"<< /ImageType 1.*?image } def";
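To see the difference on a small scale, here is a sketch with a made-up two-image sample (the class name, the Count helper and the Sample string are hypothetical). Both patterns run with RegexOptions.Singleline, as in the question:

```csharp
using System;
using System.Text.RegularExpressions;

static class LazyQuantifierDemo
{
    // Two shortened image definitions, invented for this illustration.
    public const string Sample =
        "<< /ImageType 1\n/Width 1\n> /LZWDecode filter >> image } def\n" +
        "<< /ImageType 1\n/Width 2\n> /LZWDecode filter >> image } def\n";

    // Counts matches with '.' allowed to cross line breaks.
    public static int Count(string text, string pattern) =>
        Regex.Matches(text, pattern, RegexOptions.Singleline).Count;

    static void Main()
    {
        // Greedy: .* runs to the last "image } def", so everything is one match.
        Console.WriteLine(Count(Sample, @"<< /ImageType 1.*image } def"));
        // Lazy: .*? stops at the first "image } def" after each header.
        Console.WriteLine(Count(Sample, @"<< /ImageType 1.*?image } def"));
    }
}
```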
Here is a site I use that can help you out with regex: http://webcheatsheet.com/php/regular_expressions.php.
if (preg_match('/^[a-z]/i', $string, $matches)) {
    echo "Match was found <br />";
    echo $matches[0];
}

Regular Expression to find separate words?

Here's a quickie for your RegEx wizards. I need a regular expression that will find groups of words. Any group of words. For instance, I'd like for it to find the first two words in any sentence.
Example "Hi there, how are you?" - Return would be "Hi there"
Example "How are you doing?" - Return would be "How are"
Try this:
^\w+\s+\w+
Explanation: starting at the beginning of the string, one or more word characters, then one or more whitespace characters, then one or more word characters.
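A small C# sketch of this pattern applied to the two examples from the question (the class and method names are hypothetical). Note that Match.Value is an empty string when nothing matches, which happens for example when punctuation directly follows the first word:

```csharp
using System;
using System.Text.RegularExpressions;

static class FirstTwoWordsDemo
{
    static readonly Regex Rx = new Regex(@"^\w+\s+\w+");

    // Returns the first two words, or "" when the text does not start with
    // two words separated only by whitespace.
    public static string FirstTwoWords(string text) => Rx.Match(text).Value;

    static void Main()
    {
        Console.WriteLine(FirstTwoWords("Hi there, how are you?"));
        Console.WriteLine(FirstTwoWords("How are you doing?"));
    }
}
```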
Regular expressions can be used to parse language, and they are a natural tool for it. After gathering the words, use a dictionary to see if they're actually words in a particular language.
The premise is to define a regular expression that will split out 99.9% of possible words, "word" being a key definition.
I assume C# is going to use a PCRE based on 5.8 Perl.
This is my ASCII definition of how to split out words (expanded):
regex = '[\s[:punct:]]* (\w (?: \w | [[:punct:]](?=[\w[:punct:]]) )* )'
and Unicode (more has to be added/subtracted to suit specific encodings):
regex = '[\s\pP]* ([\pL\pN_-] (?: [\pL\pN_-] | \pP(?=[\pL\pN\pP_-]) )* )'
To find ALL of the words, put the regex string into a regex (I don't know C#):
@matches = $text =~ /$regex/xg
where /xg are the expanded and global modifiers. Note that there is only capture group 1 in the regex string, so the intervening text is not captured.
To find just the FIRST TWO:
@matches = $text =~ /(?:$regex)(?:$regex)/x
Below is a Perl sample. Anyway, play around with it. Cheers!
use strict;
use warnings;
binmode (STDOUT,':utf8');
# Unicode
my $regex = qr/ [\s\pP]* ([\pL\pN_-] (?: [\pL\pN_-] | \pP(?=[\pL\pN\pP_-]) )* ) /x;
# Ascii
# my $regex = qr/ [\s[:punct:]]* (\w (?: \w | [[:punct:]](?=[\w[:punct:]]) )* ) /x;
my $text = q(
I confirm that sufficient information and detail have been
reported in this technical report, that it's "scientifically" sound,
and that appropriate conclusion's have been included
);
print "\n**\n$text\n";
my @matches = $text =~ /$regex/g;
print "\nTotal ".scalar(@matches)." words\n",'-'x20,"\n";
for (@matches) {
    print "$_\n";
}
# =======================================
my $junk = q(
Hi, there, A écafé and Horse d'oeuvre
hasn't? 'n? '? a-b? -'a-?
);
print "\n\n**\n$junk\n";
# First 2 words
@matches = $junk =~ /(?:$regex)(?:$regex)/;
print "\nFirst 2 words\n",'-'x20,"\n";
for (@matches) {
    print "$_\n";
}
# All words
@matches = $junk =~ /$regex/g;
print "\nTotal ".scalar(@matches)." words\n",'-'x20,"\n";
for (@matches) {
    print "$_\n";
}
Output:
**
I confirm that sufficient information and detail have been
reported in this technical report, that it's "scientifically" sound,
and that appropriate conclusion's have been included
Total 25 words
--------------------
I
confirm
that
sufficient
information
and
detail
have
been
reported
in
this
technical
report
that
it's
scientifically
sound
and
that
appropriate
conclusion's
have
been
included
**
Hi, there, A écafé and Horse d'oeuvre
hasn't? 'n? '? a-b? -'a-?
First 2 words
--------------------
Hi
there
Total 11 words
--------------------
Hi
there
A
écafé
and
Horse
d'oeuvre
hasn't
n
a-b
a-
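For readers working in C#, here is a rough adaptation of the Unicode pattern above, not a verified port. It assumes two .NET specifics: Unicode categories are written with braces (\p{P}, \p{L}, \p{N}), and RegexOptions.IgnorePatternWhitespace plays the role of Perl's /x. The class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class WordSplitDemo
{
    // Skip leading whitespace/punctuation, then capture a word in group 1.
    // Punctuation stays inside a word only if more word material follows it.
    static readonly Regex WordRx = new Regex(
        @"[\s\p{P}]* ( [\p{L}\p{N}_-] (?: [\p{L}\p{N}_-] | \p{P}(?=[\p{L}\p{N}\p{P}_-]) )* )",
        RegexOptions.IgnorePatternWhitespace);

    // Returns group 1 of every match, i.e. the words without the skipped text.
    public static List<string> Words(string text)
    {
        var words = new List<string>();
        foreach (Match m in WordRx.Matches(text))
            words.Add(m.Groups[1].Value);
        return words;
    }

    static void Main()
    {
        foreach (string w in Words("that it's \"scientifically\" sound,"))
            Console.WriteLine(w);
    }
}
```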
@Rubens Farias:
Per my comment, here's the code I used:
public int startAt = 0;
private void btnGrabWordPairs_Click(object sender, EventArgs e)
{
    // Start at a word boundary, find one or more word chars, one or more whitespace chars, one or more word chars, end at a word boundary
    Regex regex = new Regex(@"\b\w+\s+\w+\b");
    if (startAt <= txtTest.Text.Length)
    {
        Match match = regex.Match(txtTest.Text, startAt);
        if (!match.Success) return; // no more word pairs
        MessageBox.Show(match.Value);
        startAt = match.Index + match.Length; // continue from the end of this match
    }
}
Each time the button is clicked it grabs pairs of words quite nicely, proceeding through the text in the txtTest TextBox and finding the pairs sequentially until the end of the string is reached.
@sln: Thanks for the extremely detailed response!

How does this regex find triangular numbers?

Part of a series of educational regex articles, this is a gentle introduction to the concept of nested references.
The first few triangular numbers are:
1 = 1
3 = 1 + 2
6 = 1 + 2 + 3
10 = 1 + 2 + 3 + 4
15 = 1 + 2 + 3 + 4 + 5
There are many ways to check if a number is triangular. There's this interesting technique that uses regular expressions as follows:
Given n, we first create a string of length n filled with the same character
We then match this string against the pattern ^(\1.|^.)+$
n is triangular if and only if this pattern matches the string
Here are some snippets to show that this works in several languages:
PHP (on ideone.com)
$r = '/^(\1.|^.)+$/';
foreach (range(0,50) as $n) {
if (preg_match($r, str_repeat('o', $n))) {
print("$n ");
}
}
Java (on ideone.com)
for (int n = 0; n <= 50; n++) {
String s = new String(new char[n]);
if (s.matches("(\\1.|^.)+")) {
System.out.print(n + " ");
}
}
C# (on ideone.com)
Regex r = new Regex(@"^(\1.|^.)+$");
for (int n = 0; n <= 50; n++) {
if (r.IsMatch("".PadLeft(n))) {
Console.Write("{0} ", n);
}
}
So this regex seems to work, but can someone explain how?
Similar questions
How to determine if a number is a prime with regex?
Explanation
Here's a schematic breakdown of the pattern:
from beginning…
|         …to end
|         |
^(\1.|^.)+$
 \______/|___match
  group 1    one-or-more times
The (…) brackets define capturing group 1, and this group is matched repeatedly with +. This subpattern is anchored with ^ and $ to see if it can match the entire string.
Group 1 tries to match this|that alternates:
\1., that is, what group 1 matched (self reference!), plus one of "any" character,
or ^., that is, just "any" one character at the beginning
Note that in group 1, we have a reference to what group 1 matched! This is a nested/self reference, and is the main idea introduced in this example. Keep in mind that when a capturing group is repeated, generally it only keeps the last capture, so the self reference in this case essentially says:
"Try to match what I matched last time, plus one more. That's what I'll match this time."
Similar to a recursion, there has to be a "base case" with self references. At the first iteration of the +, group 1 had not captured anything yet (which is NOT the same as saying that it starts off with an empty string). Hence the second alternation is introduced, as a way to "initialize" group 1, which is that it's allowed to capture one character when it's at the beginning of the string.
So as it is repeated with +, group 1 first tries to match 1 character, then 2, then 3, then 4, etc. The sum of these numbers is a triangular number.
Further explorations
Note that for simplification, we used strings that consists of the same repeating character as our input. Now that we know how this pattern works, we can see that this pattern can also match strings like "1121231234", "aababc", etc.
Note also that if we find that n is a triangular number, i.e. n = 1 + 2 + … + k, the length of the string captured by group 1 at the end will be k.
Both of these points are shown in the following C# snippet (also seen on ideone.com):
Regex r = new Regex(@"^(\1.|^.)+$");
Console.WriteLine(r.IsMatch("aababc")); // True
Console.WriteLine(r.IsMatch("1121231234")); // True
Console.WriteLine(r.IsMatch("iLoveRegEx")); // False
for (int n = 0; n <= 50; n++) {
Match m = r.Match("".PadLeft(n));
if (m.Success) {
Console.WriteLine("{0} = sum(1..{1})", n, m.Groups[1].Length);
}
}
// 1 = sum(1..1)
// 3 = sum(1..2)
// 6 = sum(1..3)
// 10 = sum(1..4)
// 15 = sum(1..5)
// 21 = sum(1..6)
// 28 = sum(1..7)
// 36 = sum(1..8)
// 45 = sum(1..9)
Flavor notes
Not all flavors support nested references. Always familiarize yourself with the quirks of the flavor that you're working with (and consequently, it almost always helps to provide this information whenever you're asking regex-related questions).
In most flavors, the standard regex matching mechanism tries to see if a pattern can match any part of the input string (possibly, but not necessarily, the entire input). This means that you should remember to always anchor your pattern with ^ and $ whenever necessary.
Java is slightly different in that String.matches, Pattern.matches and Matcher.matches attempt to match a pattern against the entire input string. This is why the anchors can be omitted in the above snippet.
Note that in other contexts, you may need to use \A and \Z anchors instead. For example, in multiline mode, ^ and $ match the beginning and end of each line in the input.
One last thing is that in .NET regex, you CAN actually get all the intermediate captures made by a repeated capturing group. In most flavors, you can't: all intermediate captures are lost and you only get to keep the last.
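A short sketch of that .NET feature, using Group.Captures to list the length of every intermediate capture of group 1 (the class and method names are hypothetical). For a triangular n the lengths come out as 1, 2, 3, …, k:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class TriangularCapturesDemo
{
    static readonly Regex Rx = new Regex(@"^(\1.|^.)+$");

    // Lengths of all captures of group 1 for a string of n identical characters;
    // an empty list means n is not triangular.
    public static List<int> CaptureLengths(int n)
    {
        var lengths = new List<int>();
        Match m = Rx.Match(new string('o', n));
        if (m.Success)
        {
            foreach (Capture c in m.Groups[1].Captures)
                lengths.Add(c.Length);
        }
        return lengths;
    }

    static void Main()
    {
        Console.WriteLine(string.Join("+", CaptureLengths(10)));
    }
}
```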
Related questions
(Java) method matches not work well - with examples on how to do prefix/suffix/infix matching
Is there a regex flavor that allows me to count the number of repetitions matched by * and + (.NET!)
Bonus material: Using regex to find powers of two!!!
With a very slight modification, you can use the same techniques presented here to find powers of two.
Here's the basic mathematical property that you want to take advantage of:
1 = 1
2 = (1) + 1
4 = (1+2) + 1
8 = (1+2+4) + 1
16 = (1+2+4+8) + 1
32 = (1+2+4+8+16) + 1
The solution is given below (but do try to solve it yourself first!!!!)
(see on ideone.com in PHP, Java, and C#):
^(\1\1|^.)*.$
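A minimal C# sketch for trying this pattern out, written for this page rather than copied from the linked ideone snippets (the class and method names are hypothetical). Each repetition of group 1 doubles the previous capture, and the final . supplies the "+ 1" from the table above:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class PowerOfTwoDemo
{
    static readonly Regex Rx = new Regex(@"^(\1\1|^.)*.$");

    // Tests a string of n identical characters against the pattern.
    public static bool IsPowerOfTwo(int n) =>
        n >= 0 && Rx.IsMatch(new string('o', n));

    static void Main()
    {
        var hits = new List<int>();
        for (int n = 0; n <= 64; n++)
        {
            if (IsPowerOfTwo(n)) hits.Add(n);
        }
        Console.WriteLine(string.Join(" ", hits));
    }
}
```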
