Stripping text line with regular expression with c #

Stripping text line with regular expression with c # - c#

In the text shown below, I would need to extract the info in between the double quotes (The input is a text file)
Tag = "571EC002A-TD"
Tag = "571GI001-RUN"
Tag = "571GI001-TD"
The output should be,
571EC002A-TD
571GI001-RUN
571GI001-TD
How should I frame my regex in C# to match this and save it to a text file.
I was successful till reading all the lines into my code, but the regex gives me some undesirable values.
thanks and appreciate in advance.

A simple regex could be:
Regex tagRegex = new Regex(#"Tag\s?=\s?""(.+?)""");
Example with your input

UPDATE
For those that ask why not use String.Substring: The great advantage of regular expressions over string operations is that they don't generate temporary strings untily you actually ask for a matched value. Matches and groups contain only indexes to the source string. This cane be a huge advantage when processing log files.
You can match the content of a tag using a regex like
Tag\s*=\s*"(<tagValue>.*?)"
The ? in .*? results in a non-greedy search, ie only text up to the first double quote is extracted. Otherwise the pattern would match everything up to the last double quote.
(<tagValue>.*?) defines a named group. This way you can refer to the actual value captured by name and even use LINQ to process the values
The resulting C# code may look like this after escaping:
var myRegex=new Regex("Tag\\s*=\\s*\"(<tagValue>.*?)\"");
...
var tags=myRegex.Matches(someText)
.OfType<Match>()
.Select(match=>match.Groups["tagValue"].Value);
The result is an IEnumerable with all tag values. You can convert it to an array or List using ToArray() or ToList() just like any other IEnumerable
The equivalent code using a loop would be
var myRegex=new Regex("Tag\\s*=\\s*\"(<tagValue>.*?)\"");
...
List<string> tagValues=new List<string>();
foreach(Match m in myRegex.Matches(someText))
{
tagValues.Add(m.Groups["tagValue"].Value;
}
The LINQ version though can be extended very easily. For example, File.ReadLines returns an IEnumerable and doesn't wait to load everything in memory before returning. You could write something like:
var tags=File.ReadLines(myBigLog)
.SelectMany(line=>myRegex.Matches(line))
.OfType<Match>()
.Select(match=>match.Groups["tagValue"].Value);
If the tag names changed, you could also capture the tag name. If eg tags have a tag prefix you could use the pattern:
(?<tagName>tag\w+)\s*=\s*"(<tagValue>.*?)"
And extract both tag name and value in the Select function, eg :
.Select(match=>new {
TagName=match.Groups["tagName"].Value,
Value=match.Groups["tagValue"].Value
});
Regex.Matches is thread safe which means you can create one static Regex object and use it repeatedly, or even use PLINQ to match multiple lines in parallel simply by adding AsParallel() before the call to SelectMany.

If those strings will always be like that, you can go for a simpler approach by just using Substring:
line.Substring(7, line.Length - 8)
That will give you your desired output.

Related

C# regex to parse /simple1/1.2-SNAPSHOT/

I need to find the last two values at the end of such a string, "simple1" and "1.2-SNAPSHOT" in the sample url below. But my code below (try to get simple1/1.2-SNAPSHOT/) doesn't work, can anyone help?
http://localhost:8060/nexus/service/local/repositories/snapshots/content/org/sonatype/mavenbook/simple1/1.2-SNAPSHOT/
List<string> artifacts = new List<string>(); // this is already foler URL
// store all URLs to the artifacts be deleted
artifacts = nexusAPI.findArtifacts(repository, contents, days, pattern);
var regex = new Regex(".*\\/(.*\\/.*\\/)$");
foreach (string url in artifacts)
{
Console.WriteLine("group/artifact: {0}", regex.Matches(url));
}

I would just split the string on '/' and get the last two parts. The regex isn't going to do anything more then that.

If you must use RegEx, you're encountering an issue in that regexes are greedy - that means it puts as much in each .* as it possibly can. So your first step is to make the regex not greedy. Simply use this as your pattern:
(.*?)/
Here's a simple test showing how that this works.
This tells the regex to look for any character up to the slash, and then stop.
When you call Regex.Matches(url, "(.*?)/"), you will get returned an array of the matching data. From there, you can just look at the last two elements.
Of course, as SledgeHammer mentioned, this is one case where regex is unnecessary and even cumbersome. Simply working with url.Split(new char[] {'/'}) will give you the results you need.

Reading in a text file more 'intelligently'

I have a text file which contains a list of alphabetically organized variables with their variable numbers next to them formatted something like follows:
aabcdef 208
abcdefghijk 1191
bcdefga 7
cdefgab 12
defgab 100
efgabcd 999
fgabc 86
gabcdef 9
h 11
ijk 80
...
...
I would like to read each text as a string and keep it's designated id# something like read "aabcdef" and store it into an array at spot 208.
The 2 issues I'm running into are:
I've never read from file in C#, is there a way to read, say from
start of line to whitespace as a string? and then the next string as
an int until the end of line?
given the nature and size of these files I do not know the highest ID value of each file (not all numbers are used so some
files could house a number like 3000, but only actually list 200
variables) So how could I make a flexible way to store these
variables when I don't know how big the array/list/stack/etc.. would
need to be.

Basically you need a Dictionary instead of an array or list. You can read all lines with File.ReadLines method then split each of them based on space and \t (tab), like this:
var values = File.ReadLines("path")
.Select(line => line.Split(new [] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries))
.ToDictionary(parts => int.Parse(parts[1]), parts => parts[0]);
Then values[208] will give you aabcdef. It looks like an array doesn't it :)
Also make sure you have no duplicate numbers because Dictionary keys should be unique otherwise you will get an exception.

I've been thinking about how I would improve other answers and I've found this alternative solution based on Regex which makes the search into the whole string (either coming from a file or not) safer.
Check that you can alter the whole regular expression to include other separators. Sample expression will detect spaces and tabs.
At the end of the day, I found that MatchCollection returns a safer result, since you always know that 3rd group is an integer and 2nd group is a text because regular expression does a lot of checking for you!
StringBuilder builder = new StringBuilder();
builder.AppendLine("djdodjodo\t\t3893983");
builder.AppendLine("dddfddffd\t\t233");
builder.AppendLine("djdodjodo\t\t39838");
builder.AppendLine("djdodjodo\t\t12");
builder.AppendLine("djdodjodo\t\t444");
builder.AppendLine("djdodjodo\t\t5683");
builder.Append("djdodjodo\t\t33");
// Replace this line with calling File.ReadAllText to read a file!
string text = builder.ToString();
MatchCollection matches = Regex.Matches(text, #"([^\s^\t]+)(?:[\s\t])+([0-9]+)", RegexOptions.IgnoreCase | RegexOptions.Multiline);
// Here's the magic: we convert an IEnumerable<Match> into a dictionary!
// Check that using regexps, int.Parse should never fail because
// it matched numbers only!
IDictionary<int, string> lines = matches.Cast<Match>()
.ToDictionary(match => int.Parse(match.Groups[2].Value), match => match.Groups[1].Value);
// Now you can access your lines as follows:
string value = lines[33]; // <-- By value
Update:
As we discussed in chat, this solution wasn't working in some actual use case you showed me, but it's not the approach what's not working but your particular case, because keys are "[something].[something]" (for example: address.Name).
I've changed given regular expression to ([\w\.]+)[\s\t]+([0-9]+) so it covers the case of key having a dot.
It's about improving the matching regular expression to fit your requirements! ;)
Update 2:
Since you told me that you need keys having any character, I've changed the regular expression to ([^\s^\t]+)(?:[\s\t])+([0-9]+).
Now it means that key is anything excepting spaces and tabs.
Update 3:
Also I see you're stuck in .NET 3.0 and ToDictionary was introduced in .NET 3.5. If you want to get the same approach in .NET 3.0, replace ToDictionary(...) with:
Dictionary<int, string> lines = new Dictionary<int, string>();
foreach(Match match in matches)
{
lines.Add(int.Parse(match.Groups[2].Value), match.Groups[1].Value);
}

Replace every occurrence of a Regex match EXCEPT for the first?

I don't have my code with me at home but I realized that I'll need to do a regex replace on a certain expression and I was wondering if there is a best practices for this.
What my code is currently doing is searching for matches in files, taking those matches out of a file(replacing them with ""), and then once all the files are processed I make a call to the .NET Process class to do some command line stuff. Specifically what i'll be doing is taking a group of files and copying them(merging)into one final output file. But there is the instance where every file to be merged has the exact same first line, which let's just say for the example is:
FIRST_NAME|MIDDLE_NAME|LAST_NAME|ADDRESS
Now, the first file having that is okay. And I figure that I'm going to do this final match and replace once the file is merged. But I only want to replace matches of that specific expression AFTER the first occurrence.
So, I read that C# has superb support for a Regex look behind? Would that be the proper way to implement "replacing matches after the first occurrence of a match" and if so how would you implement it given a sample regular expression?
My own personal solution to this was to return the amount of files in the folder with Directory.GetFiles and then in my foreach (string file in matches) I would declare a quick condition that says
if (count == directoryCount)
do not match and replace
count minus 1
elseif (count < directoryCount)
strip matching expression
and then every iteration through the foreach after the first run-through will strip out the matching expression from the file leaving only the first file with the expression I want to keep.
Thank you for any suggestions.

how about using replaceFirst() to backup the first occurency and mark it with some char. then use replaceAll(), and replaceFirst() again to roll back the first match.

Regex.Replace has a couple of overloads that provide for MatchEvaluator evaluator which is delegate on a Match returning the replacement String.
So you can use something like re.Replace(input, m => first ? (first=false, m.Value) : "") (but I've a VB programmer and have put this in without any syntax checking).

Regular expression for filenames that doesn't exclude whitespaces

I have been using this regular expression to extract file names out of file path strings:
Regex r = new Regex(#"\w+[.]\w+$+");
This works, as long as there is no space in the file name. For example:
r.Match("c:\somestuff\myfile.doc").Value = "myfile.doc"
r.Match("c:\somestuff\my file.doc").Value = "file.doc"
I need my regular expression to give me "my file.doc", and not just "file.doc"
I tried messing around with the expression myself. In particular I tried adding \s+ after learning that that is for matching whitespaces. I didn't get the results I hoped for.
I did devise a solution just to get the job done: I started at the end of the string, went backwards until a backslash was reached. This gave me the file name in reverse order (i.e. cod.elifym) into an array of chars, then I used Array.Reverse() to turn it around. However I'd like to learn how to achieve this by simply modifying my original regular expression.

Does it have to be a regular expression? Use System.IO.Path.GetFileName() instead.

Regex r = new Regex(#"[\w ]+\.\w+$");

A working regex might simply look like:
[^\\]+$
Consider using:
System.IO.Path.GetFileName(path)

Regex BBCode to HTML

I writing BBcode converter to html.
Converter should skip unclosed tags.
I thought about 2 options to do it:
1) match all tags in once using one regex call, like:
Regex re2 = new Regex(#"\[(\ /?(?:b|i|u|quote|strike))\]");
MatchCollection mc = re2.Matches(sourcestring);
and then, loop over MatchCollection using 2 pointers to find start and open tags and than replacing with right html tag.
2) call regex multiple time for every tag and replace directly:
Regex re = new Regex(#"\[b\](.*?)\[\/b\]");
string s1 = re.Replace(sourcestring2,"<b>$1</b>");
What is more efficient?
The first option uses one regex but will require me to loop through all tags and find all pairs, and skip tags that don't have a pair.
Another positive thins is that I don't care about the content between the tags, i just work and replace them using the position.
In second option I don't need to worry about looping and making special replace function.
But will require to execute multiple regex and replaces.
What can you suggest?
If the second option is the right one,
there is a problem with regex
\[b\](.*?)\[\/b\]
how can i fix it to also match multi lines like:
[b]
test 1
[/b]
[b]
test 2
[/b]

One option would be to use more SAX-like parsing, where instead of looking for a particular regex you look for [, then have your program handle that even in some manner, look for the ], handle that even, etc. Although more verbose than the regex it may be easier to understand, and wouldn't necessarily be slower.

r = new System.Text.RegularExpressions.Regex(#"(?:\[b\])(?<name>(?>\[b\](?<DEPTH>)|\[/b\](?<-DEPTH>)|.)+)(?(DEPTH)(?!))(?:\[/b\])", System.Text.RegularExpressions.RegexOptions.Singleline);
var s = r.Replace("asdfasdf[b]test[/b]asdfsadf", "<b>$1</b>");
That should give you only elements that have matched closing tags and also handle multi line (even though i specified the option of SingleLine it actually treats it as a single line)
It should also handle [b][b][/b] properly by ignoring the first [b].
As to whether or not this method is better than your first method I couldn't say. But hopefully this will point you in the right direction.
Code that works with your example below:
System.Text.RegularExpressions.Regex r;
r = new System.Text.RegularExpressions.Regex(#"(?:\[b\])(?<name>(?>\[b\](?<DEPTH>)|\[/b\](?<-DEPTH>)|.)+)(?(DEPTH)(?!))(?:\[/b\])", System.Text.RegularExpressions.RegexOptions.Singleline);
var s = r.Replace("[b]bla bla[/b]bla bla[b] " + "\r\n" + "bla bla [/b]", "<b>$1</b>");

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Stripping text line with regular expression with c # - c#

A simple regex could be: Regex tagRegex = new Regex(#"Tag\s?=\s?""(.+?)"""); Example with your input

If those strings will always be like that, you can go for a simpler approach by just using Substring: line.Substring(7, line.Length - 8) That will give you your desired output.

Related

C# regex to parse /simple1/1.2-SNAPSHOT/

Reading in a text file more 'intelligently'

Replace every occurrence of a Regex match EXCEPT for the first?

Regular expression for filenames that doesn't exclude whitespaces

Regex BBCode to HTML

Categories

Resources