I wrote a file routing utility (.NET) some time ago to examine a file's location and name pattern and move it to some other preconfigured place based on the match. Fairly simple, straightforward kinda stuff. I had included the possibility of minor transformations through a series of regular expression search-and-replace actions that could be assigned to the file "route", with the intent of adding header rows, replacing commas with pipes, that sort of thing.
So now I have a new text feed that consists of a file header, a batch header, and a multitude of detail records under the batches. The file header contains a count of all detail records in the file, and I have been asked to "split" the file in the assigned transformations, essentially producing a file for each batch record. This is fairly straightforward, as well, but the kicker is, there is an expectation to update the file header for each file to reflect the detail count.
I do not even know if this is possible with pure regular expressions. Can I count the number of matches of a group in a given text document and replace the count value in the original text, or am I going to have to write a custom transformer for this one file?
If I have to write another transformer, are there suggestions on how to make it generic enough to be reusable? I'm considering adding an XSLT transformer option, but my understanding of XSLT is not so great.
I've been asked for an example. Say I have a file like so:
FILE001DETAILCOUNT002
BATCH01
DETAIL001FOO
BATCH02
DETAIL001BAR
This file will be split and stored in two locations. The files will look like this:
FILE001DETAILCOUNT001
BATCH01
DETAIL001FOO
and
FILE001DETAILCOUNT001
BATCH01
DETAIL001BAR
So the sticking point for me is the file header's DETAILCOUNT value.
Regular expressions by themselves can't count the number of matches they've made (or, better put, they don't expose that to the regex user), so you do need additional program code to keep track of this.
A regex can only capture text that exists somewhere in the source material; it can't generate new text. So unless the number you need appears explicitly at some point in the source, you're out of luck. Sorry.
My program first breaks the text into batches.
I think you'll agree that resequencing the detail number is the trickiest part. You can do it with a MatchEvaluator delegate.
Regex.Replace (
    text,                                 // the text to search
    @"(?<=^DETAIL)\d+",                   // the regex pattern to find
    m => (detailNum++).ToString ("000"),  // replacement (evaluated for each match)
    RegexOptions.Multiline);
The snippet above increments detailNum once per match. In the full program that follows, detailNum is reset to 1 at the beginning of each batch.
using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

var contents =
@"FILE001DETAILCOUNT002
BATCH01
DETAIL001FOO
BATCH02
DETAIL001BAR";
// foreach batch....
foreach (Match match in Regex.Matches (contents, @"BATCH\d+\s+(?:(?!BATCH\d+).*\s*)+"))
{
    Console.WriteLine ("==============\r\nFile\r\n================");
    // Every output file holds exactly one batch, so both counters restart at 1.
    int batchNum = 1;
    int detailNum = 1;
    StringBuilder temp = new StringBuilder ();
    TextWriter file = new StringWriter (temp);
    // Your file here instead of my StringBuilder/StringWriter
    string batchText = match.Value;
    // Count the detail records in this batch for the new file header.
    int count = Regex.Matches (batchText, @"^DETAIL\d+", RegexOptions.Multiline).Count;
    file.WriteLine ("FILE001DETAILCOUNT{0:000}", count);
    string newText = Regex.Replace (batchText, @"(?<=^BATCH)\d+", batchNum.ToString ("000"), RegexOptions.Multiline);
    newText = Regex.Replace (
        newText,
        @"(?<=^DETAIL)\d+",
        m => (detailNum++).ToString ("000"), // replacement (evaluated for each match)
        RegexOptions.Multiline);
    file.Write (newText);
    Console.WriteLine (temp.ToString ());
}
prints
==============
File
================
FILE001DETAILCOUNT001
BATCH001
DETAIL001FOO
==============
File
================
FILE001DETAILCOUNT001
BATCH001
DETAIL001BAR
I have a little project in C# where I am manipulating files. My task now is to delete specific rows from files.
For example my file looks like this:
1-this is the first line
2-this is the second line
3-this is the third line
4-this is the fourth line
Now how can I keep only the first two rows and delete only the last two rows?
Note: this is how I read the file from my local machine:
string[] lines = File.ReadAllLines(@"C:\Users\admin\Desktop\COMMANDS.dat");
I have tried something like this, but I think it's not so "efficient":
string text = File.ReadAllText(@"C:\Users\admin\Desktop\COMMANDS.dat");
text = text.Replace(lines[2], "");
text = text.Replace(lines[3], "");
File.WriteAllText(@"C:\Users\admin\Desktop\COMMANDS.dat", text);
This does the job in the sense that the two lines are replaced with empty strings, but when I look at the file it still has 4 lines: two with real text and two that are empty. I don't want the empty ones. Can I do this another way?
Try removing each line together with the line break in front of it, so that no empty line is left behind:
string path = @"C:\Users\admin\Desktop\COMMANDS.dat";
string[] lines = File.ReadAllLines(path);
string text = File.ReadAllText(path);
// Assumes the file uses the platform's line ending (Environment.NewLine).
text = text.Replace(Environment.NewLine + lines[2], "");
text = text.Replace(Environment.NewLine + lines[3], "");
File.WriteAllText(path, text);
async Task Example()
{
var inputLines = await File.ReadAllLinesAsync("path/to/file.txt");
var outputLines = inputLines.Where((l, i) => i < 2);
await File.WriteAllLinesAsync("target/file.txt", outputLines);
}
What it does
Read data but not as one string but as a collection of lines
Create a new collection containing only the lines you want in your output
Write the filtered lines
Notes:
This example is not optimized for memory usage: it reads all lines at once, so it will fail for very large files (e.g. multiple GB). See the existing answers for a memory-optimized version. It's totally fine to do it this way if you know you have just a few thousand lines, and it's faster.
Try not to "modify" strings. This will always create a copy and needs a lot of memory.
In this "LINQ style" (functional) approach, we should treat data as immutable. That means we have one variable that represents the input file and one variable that represents the result. We use declarative LINQ to describe what the output should look like: "output is input where the line index is less than 2" instead of "if xy, remove line" in an imperative style.
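For larger files, the same filter can be applied lazily. The following is a minimal sketch, assuming the lines to keep are the first ones in the file; the names LineTrimmer and KeepFirstLines and the temporary-file approach are illustrative and not taken from the answers above. File.ReadLines streams the file, so only one line is held in memory at a time.

```csharp
using System.IO;
using System.Linq;

public static class LineTrimmer
{
    // Keeps the first `count` lines of the file at `path` and drops the rest.
    public static void KeepFirstLines(string path, int count)
    {
        // Write to a temporary file first: File.ReadLines keeps the source
        // open while it is being enumerated, so it cannot be overwritten in place.
        string tempPath = path + ".tmp";
        File.WriteAllLines(tempPath, File.ReadLines(path).Take(count));

        // Swap the trimmed copy in for the original.
        File.Delete(path);
        File.Move(tempPath, path);
    }
}
```

Calling LineTrimmer.KeepFirstLines(path, 2) would leave only the first two lines in the file. Take stops reading as soon as it has enough lines, so the rest of the file is never loaded.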
In the text shown below, I would need to extract the info in between the double quotes (The input is a text file)
Tag = "571EC002A-TD"
Tag = "571GI001-RUN"
Tag = "571GI001-TD"
The output should be,
571EC002A-TD
571GI001-RUN
571GI001-TD
How should I frame my regex in C# to match this and save it to a text file.
I was successful till reading all the lines into my code, but the regex gives me some undesirable values.
thanks and appreciate in advance.
A simple regex could be:
Regex tagRegex = new Regex(@"Tag\s?=\s?""(.+?)""");
Example with your input
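A minimal sketch of how this pattern could be applied to the input and saved to a text file. The class name TagExtractor, its method names, and the file paths passed in are hypothetical; only the pattern comes from the answer.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

public static class TagExtractor
{
    // Group 1 captures the text between the double quotes.
    private static readonly Regex TagRegex = new Regex(@"Tag\s?=\s?""(.+?)""");

    // Returns the quoted value of every Tag = "..." entry found in the lines.
    public static List<string> Extract(IEnumerable<string> lines)
    {
        return lines
            .Select(line => TagRegex.Match(line))
            .Where(match => match.Success)           // ignore lines without a tag
            .Select(match => match.Groups[1].Value)  // value without the quotes
            .ToList();
    }

    // Reads inputPath and writes one tag value per line to outputPath.
    public static void ExtractToFile(string inputPath, string outputPath)
    {
        File.WriteAllLines(outputPath, Extract(File.ReadLines(inputPath)));
    }
}
```

This sketch takes at most one tag per line, which matches the sample input. If a line can hold several tags, Regex.Matches would be needed instead of Regex.Match.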
UPDATE
For those who ask why not use String.Substring: the great advantage of regular expressions over string operations is that they don't generate temporary strings until you actually ask for a matched value. Matches and groups contain only indexes into the source string. This can be a huge advantage when processing log files.
You can match the content of a tag using a regex like
Tag\s*=\s*"(?<tagValue>.*?)"
The ? in .*? results in a non-greedy search, i.e. only text up to the first double quote is extracted. Otherwise the pattern would match everything up to the last double quote.
(?<tagValue>.*?) defines a named group. This way you can refer to the captured value by name and even use LINQ to process the values.
The resulting C# code may look like this after escaping:
var myRegex=new Regex("Tag\\s*=\\s*\"(?<tagValue>.*?)\"");
...
var tags=myRegex.Matches(someText)
.OfType<Match>()
.Select(match=>match.Groups["tagValue"].Value);
The result is an IEnumerable with all tag values. You can convert it to an array or List using ToArray() or ToList() just like any other IEnumerable
The equivalent code using a loop would be
var myRegex=new Regex("Tag\\s*=\\s*\"(?<tagValue>.*?)\"");
...
List<string> tagValues=new List<string>();
foreach(Match m in myRegex.Matches(someText))
{
tagValues.Add(m.Groups["tagValue"].Value);
}
The LINQ version though can be extended very easily. For example, File.ReadLines returns an IEnumerable and doesn't wait to load everything in memory before returning. You could write something like:
var tags=File.ReadLines(myBigLog)
.SelectMany(line=>myRegex.Matches(line).OfType<Match>())
.Select(match=>match.Groups["tagValue"].Value);
If the tag names changed, you could also capture the tag name. If, e.g., tags have a tag prefix, you could use the pattern:
(?<tagName>tag\w+)\s*=\s*"(?<tagValue>.*?)"
And extract both tag name and value in the Select function, eg :
.Select(match=>new {
TagName=match.Groups["tagName"].Value,
Value=match.Groups["tagValue"].Value
});
Regex.Matches is thread safe which means you can create one static Regex object and use it repeatedly, or even use PLINQ to match multiple lines in parallel simply by adding AsParallel() before the call to SelectMany.
If those strings will always be like that, you can go for a simpler approach by just using Substring:
line.Substring(7, line.Length - 8)
That will give you your desired output.
I have a CSV whose author, annoyingly enough, has decided to 'introduce' the file before the contents themselves. So in all, I have a CSV that looks like:
This file was created by XXXXYY and represents the crossover between YY and QQQ.
Additional information can be found through the website GG, blah blah blah...
Jacob, Hybrid
Dan, Pure
Lianne, Hybrid
Jack, Hatchback
So the problem here is that I want to get rid of the first few lines before the 'real content' of the CSV file begins. I'm looking for robustness here, so using StreamReader and removing all content before the 4th line, for example, is not ideal (plus the length of the text can vary).
Is there a way in which one can read only what matters and write a new CSV into a directory path?
Regards,
genesis
(edit - I'm looking for C sharp code)
The solution depends on the files you have to parse. You need to look for a reliable pattern that distinguishes data from comment.
In your example, there are some possibilities that might be the same in other files:
there are 4 lines of text. But you say this isn't consistent across files
The text lines may not contain the same number of commas as the data table. But that is unlikely to be reliable for all files.
there is a blank/whitespace only line between the text and the data.
the data appears to be in the form word-comma-word. If this is true it should be easy to identify non data lines (any line which doesn't contain exactly one comma, or has multiple words etc)
You may be able to use a combination of these heuristics to more reliably detect the data.
You could scan by line (looking for the \r\n) and ignore lines that don't have a comma count that matches your CSV.
You should be able to read the file into a string pretty easily unless it is really massive.
e.g.
var csv = "some test\r\nsome more text\r\na,b,c\r\nd,e,f\r\n";
var lines = csv.Split(new[] { "\r\n" }, StringSplitOptions.RemoveEmptyEntries);
var csvLines = lines.Where(l => l.Count(c => c == ',') == 2);
// now csvLines contains only the lines you are after
List<string> info = new List<string>();
int counter = 0;
// Open the file to read from.
info = System.IO.File.ReadAllLines(path).ToList();
// Find the lines up until (& including) the empty one
foreach (string s in info)
{
counter++;
if(string.IsNullOrEmpty(s))
break; //exit from the loop
}
// Remove the lines including the blank one.
info.RemoveRange(0,counter);
Something like this should work. You should probably add some checks to make sure counter is not greater than the list length (which happens when the file has no empty line), plus other tests to handle errors.
You could adapt this code so that it just finds the empty line number using LINQ or something, but I don't like the overhead of LINQ (yeah, ironic considering I'm using C#).
Regards,
Slipoch
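The heuristics from the answers above can be combined. This is a minimal sketch, assuming the introduction is always followed by a blank line and that every data row has the same number of commas; the names CsvPreamble, DataLines and Clean are hypothetical. If the file contains no blank line at all, this version returns nothing, so a real implementation would need a fallback.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class CsvPreamble
{
    // Skips everything before the first blank line, then keeps only the
    // lines that contain the expected number of commas.
    public static List<string> DataLines(IEnumerable<string> lines, int expectedCommas)
    {
        return lines
            .SkipWhile(line => !string.IsNullOrWhiteSpace(line))        // skip the intro text
            .Where(line => line.Count(c => c == ',') == expectedCommas) // also drops the blank line
            .ToList();
    }

    // Reads inputPath and writes the cleaned rows to outputPath.
    public static void Clean(string inputPath, string outputPath, int expectedCommas)
    {
        File.WriteAllLines(outputPath, DataLines(File.ReadLines(inputPath), expectedCommas));
    }
}
```

For the sample file, CsvPreamble.Clean(inputPath, outputPath, 1) would write only the four name/type rows. Skipping to the blank line first means an introduction sentence that happens to contain one comma is not mistaken for data.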
I don't have my code with me at home but I realized that I'll need to do a regex replace on a certain expression and I was wondering if there is a best practices for this.
What my code is currently doing is searching for matches in files, taking those matches out of a file (replacing them with ""), and then, once all the files are processed, calling the .NET Process class to do some command line stuff. Specifically, I'll be taking a group of files and copying (merging) them into one final output file. But there is the case where every file to be merged has the exact same first line, which, let's just say for the example, is:
FIRST_NAME|MIDDLE_NAME|LAST_NAME|ADDRESS
Now, the first file having that is okay. And I figure that I'm going to do this final match and replace once the file is merged. But I only want to replace matches of that specific expression AFTER the first occurrence.
So, I read that C# has superb support for a Regex look behind? Would that be the proper way to implement "replacing matches after the first occurrence of a match" and if so how would you implement it given a sample regular expression?
My own personal solution to this was to return the amount of files in the folder with Directory.GetFiles and then in my foreach (string file in matches) I would declare a quick condition that says
if (count == directoryCount)
do not match and replace
count minus 1
elseif (count < directoryCount)
strip matching expression
and then every iteration through the foreach after the first run-through will strip out the matching expression from the file leaving only the first file with the expression I want to keep.
Thank you for any suggestions.
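How about using replaceFirst() to back up the first occurrence by swapping it for a marker character that does not appear elsewhere in the text, then replaceAll() to strip the remaining matches, and replaceFirst() again to restore the first match from the marker? (Those are the Java method names; in .NET the closest equivalent to replaceFirst() is the instance method Regex.Replace(input, replacement, count) with a count of 1.)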
Regex.Replace has a couple of overloads that take a MatchEvaluator, which is a delegate that receives each Match and returns the replacement string.
So you can use something like re.Replace(input, m => { if (first) { first = false; return m.Value; } return ""; }), where first is a bool set to true before the call (but I'm a VB programmer and have put this in without any syntax checking).
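The MatchEvaluator idea can be sketched as follows. The class and method names (DuplicateHeaderRemover, RemoveAllButFirst) are hypothetical, and the sketch assumes the merged text fits in memory as a single string.

```csharp
using System.Text.RegularExpressions;

public static class DuplicateHeaderRemover
{
    // Removes every match of `pattern` in `input` except the first one.
    public static string RemoveAllButFirst(string input, string pattern)
    {
        bool first = true;
        return Regex.Replace(input, pattern, match =>
        {
            if (first)
            {
                first = false;
                return match.Value; // keep the first occurrence untouched
            }
            return string.Empty;    // strip every later occurrence
        });
    }
}
```

For the header in the question, the pattern could be built as Regex.Escape("FIRST_NAME|MIDDLE_NAME|LAST_NAME|ADDRESS") + @"\r?\n". Escaping matters because | means alternation in a regex, and including the line break avoids leaving empty lines where the repeated headers were.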
I currently have two strings assigned - domain,subdomain
How could I delete any matched occurrences of these strings in a text file?
string domain = "127.0.0.1 test.com";
string subdomain = "127.0.0.1 sub.test.com";
I don't think using a regex would be ideal in this situation.
How can this be done?
You need to:
Open the existing file for input
Open a new file for output
Repeatedly:
Read a line of text from the input
See if it matches your pattern (it's unclear at the moment what pattern you're looking for)
If it doesn't, write the line to the output (or if you're only trying to remove bits of lines, work out which bit you want to write out)
Close both the input and output (a using statement will do this automatically)
Optionally delete the original file and rename the new one if you want to effectively replace the original.
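The steps above can be sketched as follows. This is one possible reading, assuming the goal is to drop whole lines that equal one of the given strings (ignoring case and surrounding whitespace); the names LineRemover and RemoveLines are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class LineRemover
{
    // Copies inputPath to outputPath, skipping any line that equals one of
    // the unwanted entries after trimming surrounding whitespace.
    public static void RemoveLines(string inputPath, string outputPath, IEnumerable<string> unwanted)
    {
        var blocked = new HashSet<string>(unwanted, StringComparer.OrdinalIgnoreCase);

        // The using statements close both files even if an exception is thrown.
        using (var reader = new StreamReader(inputPath))
        using (var writer = new StreamWriter(outputPath))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                if (!blocked.Contains(line.Trim()))
                {
                    writer.WriteLine(line);
                }
            }
        }
    }
}
```

To replace the original, the caller would then delete the input file and rename the output, as described in the last step. Because only one line is in memory at a time, this also works for large files.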
var result = Regex.Replace(File.ReadAllText("file.txt"),
    @"127\.0\.0\.1 test\.com|127\.0\.0\.1 sub\.test\.com", string.Empty);
Then write the result back to the file.