Just having a little problem in attempting a string or regex replace on specific numbers in a string.
For example, in the string
#1 is having lunch with #10 #11
I would like to replace "#1", "#10" and "#11" with the respective values as indicated below.
"#1" replace with "#bob"
"#10" replace with "#joe"
"#11" replace with "#sam"
So the final output would look like
"#bob is having lunch with #joe #sam"
Attempts with
String.Replace("#1", "#bob")
results in the following
#bob is having lunch with #bob0 #bob1
Any thoughts on what the solution might be?
I would prefer a more declarative way of doing this. What if more replacements are needed later, for example #2 changed to luke? You would have to change the code (add another Replace call).
My proposal, with the replacements declared up front:
string input = "#1 is having lunch with #10 #11";
var rules = new Dictionary<string,string>()
{
{ "#1", "#bob" },
{ "#10", "#joe" },
{ "#11", "#sam"}
};
string output = Regex.Replace(input,
@"#\d+",
match => rules[match.Value]);
Explanation:
The regular expression searches for the pattern #\d+, which means # followed by one or more digits. The MatchEvaluator then replaces each match with the proper entry from the rules dictionary, using the match value itself as the key.
Assuming all placeholders start with # and contain only digits, you can use the Regex.Replace overload that accepts a MatchEvaluator delegate to pick the replacement value from a dictionary:
var regex = new Regex(@"#\d+");
var dict = new Dictionary<string, string>
{
{"#1","#bob"},
{"#10","#joe"},
{"#11","#sam"},
};
var input = "#1 is having lunch with #10 #11";
var result=regex.Replace(input, m => dict[m.Value]);
The result will be "#bob is having lunch with #joe #sam"
There are a few advantages compared to multiple String.Replace calls:
1) The code is more concise for an arbitrary number of placeholders
2) You avoid mistakes due to the order of the replacements (e.g. #11 must come before #1)
3) It's faster because you don't need to search and replace the placeholders multiple times
4) It doesn't create temporary strings for each parameter. This can be an issue for server applications because a large number of orphaned strings will put pressure on the garbage collector
The reason for advantages 3-4 is that the regex will parse the input and create an internal representation that contains the indexes for any match. When the time comes to create the final string, it uses a StringBuilder to read characters from the original string but substitute the replacement values when a match is encountered.
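One thing to watch with the dictionary lookup: indexing the dictionary with the match value throws a KeyNotFoundException when the input contains a placeholder that has no rule, such as #2. A minimal sketch of a more forgiving variant follows. PlaceholderReplacer and Apply are hypothetical names, and leaving unknown placeholders untouched is an assumption about the desired behavior.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PlaceholderReplacer
{
    // Hypothetical helper: replaces every "#<digits>" token that has a rule
    // and leaves tokens without a rule unchanged (an assumed policy).
    public static string Apply(string input, IDictionary<string, string> rules)
    {
        return Regex.Replace(input, @"#\d+", match =>
        {
            string replacement;
            // TryGetValue avoids the exception thrown by rules[match.Value].
            return rules.TryGetValue(match.Value, out replacement)
                ? replacement
                : match.Value;
        });
    }
}
```

Called with the rules for #1 and #10 only, an input containing #2 keeps that token as-is instead of failing.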
Start with the biggest (read longest) number like #11 and #10 first and then replace #1.
string finalstring = mystring.Replace("#11", "#sam")
.Replace("#10", "#joe")
.Replace("#1", "#bob");
Make your regular expression look for the string "#1 " with a trailing space (written here as #1_, where _ stands for the space).
The trailing space ensures that it matches only the number #1 and not #10 or #11.
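Requiring a trailing space misses a placeholder at the very end of the string, or one followed by punctuation. A hedged alternative is a negative lookahead, which only requires that no further digit follows. ExactPlaceholder and ReplaceExact are hypothetical names used for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ExactPlaceholder
{
    // Hypothetical helper: replaces a placeholder only when it is not
    // immediately followed by another digit, so "#1" never touches "#10".
    public static string ReplaceExact(string input, string placeholder, string value)
    {
        // Regex.Escape guards against special characters in the placeholder.
        string pattern = Regex.Escape(placeholder) + @"(?!\d)";
        // Using an evaluator keeps "$" in the value from being read as a substitution.
        return Regex.Replace(input, pattern, match => value);
    }
}
```

With this, the calls for #1, #10 and #11 can be made in any order.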
The ReadOnlySpan<char> is said to be perfect for parsing so I tried to use it and I came across a use case that I don't know how to handle.
I have a command-line string where the argument prefix - and the separator (space) are escaped (I know I could quote them here but for the sake of this problem let's assume it's not an option):
var str = @"foo -bar \-baz\ qux".AsMemory();
The tokenizer should return the following tokens:
foo - command name
bar - argument name
-baz qux - argument value
Cases 1 & 2 are simple because here I can just use str.Slice(i, length) but how can I create the 3rd case and return only a single ReadOnlySpan<char>? The Slice method doesn't allow me to specify multiple start/length ranges which would be necessary in order to jump over the escape char \.
Example:
str.Slice((10, 4), (15, 4));
where (10,4) = "-baz" and (15,4) = " qux"
With StringBuilder you can just skip a couple of characters and Append the others later. How would I achieve the same result with ReadOnlySpan<char>?
A Span/ReadOnlySpan is a contiguous block of memory. It cannot contain multiple ranges. This design is necessary for performance. Span/ReadOnlySpan is supposed to be roughly as fast as an array is. Arrays are fast because they are contiguous memory blocks with no further abstractions.
I don't see a way to do this without allocating a new string. You can use Span/ReadOnlySpan for all contiguous substrings but it seems your parsing problem is not suitable to use span to store results.
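To illustrate the point, here is a minimal sketch that copies an escaped token into a caller-supplied buffer while skipping the escape characters. It still copies the data, so it does not contradict the answer above; it only moves the allocation decision to the caller. EscapedToken and Unescape are hypothetical names, and treating the backslash as the escape character is taken from the question.

```csharp
using System;

public static class EscapedToken
{
    // Hypothetical helper: copies source into destination without the escape
    // characters and returns the number of characters written.
    // Assumes destination is at least as long as source.
    public static int Unescape(ReadOnlySpan<char> source, Span<char> destination, char escape = '\\')
    {
        int written = 0;
        for (int i = 0; i < source.Length; i++)
        {
            char c = source[i];
            // Skip the escape character and take the character after it verbatim.
            if (c == escape && i + 1 < source.Length)
                c = source[++i];
            destination[written++] = c;
        }
        return written;
    }
}
```

Tokens without escapes can still be returned as plain slices of the original span; only tokens that contain an escape need the copy.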
have a look at:
https://github.com/nemesissoft/Nemesis.TextParsers
and more precisely at:
TokenSequence.cs
Usage:
var tokens = "ABC|CD\|E".AsSpan().Tokenize('|', '\\', false); //no allocation. Result in 2 elements: "ABC", "CD\|E".
Consume via:
var result = new List<string>();
foreach (var part in tokens)
result.Add(part.ToString());
Unescaping can be done via:
ParsedSequence.cs
or
SpanParserHelper.UnescapeCharacter()
Hope this helps
My colleague and I have different versions of Visual Studio. He used interpolated strings, so I couldn't build the solution and had to convert them all to string.Format.
Now I thought it might be good exercise for regex.
So how to convert this:
$"alpha: {alphaID}, betaValue: {beta.Value}"
To this:
string.Format("alpha: {0}, betaValue: {1}", alphaID, beta.Value)
Now the number of variables can vary (let's say 1 - 20, but should be generic)
I came up with this regex, to match the first variable
\$.*?{(\w+)}
but I couldn't figure out how to repeat the part after dollar sign, so I can repeat the result.
Regex.Replace has an overload which takes a function, called the MatchEvaluator. You might use it like this;
var paramNumber = 0;
var idNames = new List<string>();
myCSharpString = Regex.Replace(myCSharpString, @"\{([^{}]+)\}", match => {
// remember the expression inside the braces
idNames.Add(match.Groups[1].Value);
// return "{0}", then "{1}", etc.
return "{" + (paramNumber++) + "}";
});
At the end of this process, your string like "this is {foo} not {bar}" will have been replaced to "this is {0} not {1}" and you will have a list containing { "foo" , "bar" } which you can use to assemble the parameter list.
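As a rough sketch of the final assembly step, the numbered format string and the collected names can be joined into the text of a string.Format call. InterpolationConverter and ToStringFormat are hypothetical names, and the sketch assumes simple interpolations only: no format specifiers, no escaped braces ({{ }}) and no nested quotes.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class InterpolationConverter
{
    // Hypothetical helper: turns the source text of a simple interpolated
    // string into the source text of an equivalent string.Format call.
    public static string ToStringFormat(string interpolatedSource)
    {
        // Drop the leading '$' so that a plain string literal remains.
        string literal = interpolatedSource.TrimStart('$');
        var arguments = new List<string>();
        string format = Regex.Replace(literal, @"\{([^{}]+)\}", match =>
        {
            // Remember the expression and emit its positional index instead.
            arguments.Add(match.Groups[1].Value);
            return "{" + (arguments.Count - 1) + "}";
        });
        if (arguments.Count == 0)
            return format; // nothing to format, keep the plain literal
        return "string.Format(" + format + ", " + string.Join(", ", arguments) + ")";
    }
}
```

Applied to the example from the question, this produces the string.Format line shown there.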
You can use C# 6 features in older versions of Visual Studio by using the C# 6 compiler NuGet package.
Essentially, just use
Install-Package Microsoft.Net.Compilers
On all the projects.
I don't have experience in C#, but this may help you:
\{(.+)}\gU
Explanation
{ matches the character { literally
. matches any character (except newline)
Quantifier: + Between one and unlimited times, as few times as possible,
expanding as needed [lazy]
} matches the character } literally
g modifier: global. All matches (don't return on first match)
U modifier: Ungreedy
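The g and U modifiers above are PCRE-style flags and have no direct counterpart in .NET. A rough .NET equivalent, as a sketch: make the quantifier lazy with .+? and use Regex.Matches, which already returns every match. BraceMatcher and InsideBraces are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class BraceMatcher
{
    // Hypothetical helper: returns the text inside each {...} pair,
    // in order of appearance.
    public static List<string> InsideBraces(string text)
    {
        var result = new List<string>();
        // .+? is the lazy form of .+, so each match stops at the first '}'.
        foreach (Match m in Regex.Matches(text, @"\{(.+?)\}"))
            result.Add(m.Groups[1].Value);
        return result;
    }
}
```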
Here are few options that do not involve regex:
You can use ReSharper. It has such conversions built in.
Write your custom code fix with Roslyn if you don't want to pay for ReSharper. Here is an example that does from string.format to interpolated string, you just have to reverse it.
Do it manually
Ask your colleague to do it for you if he can.
Update VS
In the text shown below, I would need to extract the info in between the double quotes (The input is a text file)
Tag = "571EC002A-TD"
Tag = "571GI001-RUN"
Tag = "571GI001-TD"
The output should be,
571EC002A-TD
571GI001-RUN
571GI001-TD
How should I frame my regex in C# to match this and save it to a text file?
I managed to read all the lines into my code, but the regex gives me some undesirable values.
Thanks in advance.
A simple regex could be:
Regex tagRegex = new Regex(@"Tag\s?=\s?""(.+?)""");
Example with your input
UPDATE
For those who ask why not use String.Substring: the great advantage of regular expressions over string operations is that they don't generate temporary strings until you actually ask for a matched value. Matches and groups contain only indexes into the source string. This can be a huge advantage when processing log files.
You can match the content of a tag using a regex like
Tag\s*=\s*"(?<tagValue>.*?)"
The ? in .*? results in a non-greedy search, i.e. only text up to the first double quote is extracted. Otherwise the pattern would match everything up to the last double quote.
(?<tagValue>.*?) defines a named group. This way you can refer to the captured value by name and even use LINQ to process the values.
The resulting C# code may look like this after escaping:
var myRegex=new Regex("Tag\\s*=\\s*\"(?<tagValue>.*?)\"");
...
var tags=myRegex.Matches(someText)
.OfType<Match>()
.Select(match=>match.Groups["tagValue"].Value);
The result is an IEnumerable with all tag values. You can convert it to an array or List using ToArray() or ToList() just like any other IEnumerable
The equivalent code using a loop would be
var myRegex=new Regex("Tag\\s*=\\s*\"(?<tagValue>.*?)\"");
...
List<string> tagValues=new List<string>();
foreach(Match m in myRegex.Matches(someText))
{
tagValues.Add(m.Groups["tagValue"].Value);
}
The LINQ version though can be extended very easily. For example, File.ReadLines returns an IEnumerable and doesn't wait to load everything in memory before returning. You could write something like:
var tags=File.ReadLines(myBigLog)
.SelectMany(line=>myRegex.Matches(line).OfType<Match>())
.Select(match=>match.Groups["tagValue"].Value);
If the tag names changed, you could also capture the tag name. If eg tags have a tag prefix you could use the pattern:
(?<tagName>tag\w+)\s*=\s*"(?<tagValue>.*?)"
And extract both tag name and value in the Select function, eg :
.Select(match=>new {
TagName=match.Groups["tagName"].Value,
Value=match.Groups["tagValue"].Value
});
Regex.Matches is thread safe which means you can create one static Regex object and use it repeatedly, or even use PLINQ to match multiple lines in parallel simply by adding AsParallel() before the call to SelectMany.
If those strings will always be like that, you can go for a simpler approach by just using Substring:
line.Substring(7, line.Length - 8)
That will give you your desired output.
I have a text file which contains a list of alphabetically organized variables with their variable numbers next to them formatted something like follows:
aabcdef 208
abcdefghijk 1191
bcdefga 7
cdefgab 12
defgab 100
efgabcd 999
fgabc 86
gabcdef 9
h 11
ijk 80
...
...
I would like to read each text as a string and keep its designated ID, e.g. read "aabcdef" and store it in an array at index 208.
The 2 issues I'm running into are:
1) I've never read from a file in C#. Is there a way to read, say, from the start of a line to whitespace as a string, and then the next string as an int until the end of the line?
2) Given the nature and size of these files, I do not know the highest ID value of each file (not all numbers are used, so some files could contain a number like 3000 but only actually list 200 variables). So how could I store these variables flexibly when I don't know how big the array/list/stack/etc. would need to be?
Basically you need a Dictionary instead of an array or list. You can read all lines with File.ReadLines method then split each of them based on space and \t (tab), like this:
var values = File.ReadLines("path")
.Select(line => line.Split(new [] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries))
.ToDictionary(parts => int.Parse(parts[1]), parts => parts[0]);
Then values[208] will give you aabcdef. It looks like an array doesn't it :)
Also make sure you have no duplicate numbers because Dictionary keys should be unique otherwise you will get an exception.
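If duplicate numbers or malformed lines are possible, a loop-based variant avoids the exception. The sketch below is only one possible policy: VariableTable and Parse are hypothetical names, and letting the last duplicate win and skipping unparsable lines are assumptions, not requirements from the question.

```csharp
using System;
using System.Collections.Generic;

public static class VariableTable
{
    // Hypothetical helper: parses "name id" lines into a dictionary keyed by id.
    // Lines that do not parse are skipped; on duplicate ids the last one wins.
    public static Dictionary<int, string> Parse(IEnumerable<string> lines)
    {
        var table = new Dictionary<int, string>();
        foreach (string line in lines)
        {
            string[] parts = line.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries);
            int id;
            if (parts.Length == 2 && int.TryParse(parts[1], out id))
                table[id] = parts[0]; // the indexer overwrites instead of throwing
        }
        return table;
    }
}
```

It can be fed directly from File.ReadLines, so the file is still read lazily.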
I've been thinking about how to improve on the other answers, and I've found this alternative solution based on Regex, which makes searching the whole string (whether it comes from a file or not) safer.
Note that you can alter the regular expression to include other separators. The sample expression detects spaces and tabs.
At the end of the day, I found that MatchCollection returns a safer result, since you always know that group 2 is an integer and group 1 is text, because the regular expression does a lot of checking for you!
StringBuilder builder = new StringBuilder();
builder.AppendLine("djdodjodo\t\t3893983");
builder.AppendLine("dddfddffd\t\t233");
builder.AppendLine("djdodjodo\t\t39838");
builder.AppendLine("djdodjodo\t\t12");
builder.AppendLine("djdodjodo\t\t444");
builder.AppendLine("djdodjodo\t\t5683");
builder.Append("djdodjodo\t\t33");
// Replace this line with calling File.ReadAllText to read a file!
string text = builder.ToString();
MatchCollection matches = Regex.Matches(text, @"([^\s^\t]+)(?:[\s\t])+([0-9]+)", RegexOptions.IgnoreCase | RegexOptions.Multiline);
// Here's the magic: we convert an IEnumerable<Match> into a dictionary!
// Because the regex matched digits only, int.Parse can only fail here
// if a number is too large for an int.
IDictionary<int, string> lines = matches.Cast<Match>()
.ToDictionary(match => int.Parse(match.Groups[2].Value), match => match.Groups[1].Value);
// Now you can access your lines as follows:
string value = lines[33]; // <-- Look up by key (the number)
Update:
As we discussed in chat, this solution wasn't working in an actual use case you showed me. The approach itself is fine; the problem is your particular case, because the keys look like "[something].[something]" (for example: address.Name).
I've changed given regular expression to ([\w\.]+)[\s\t]+([0-9]+) so it covers the case of key having a dot.
It's about improving the matching regular expression to fit your requirements! ;)
Update 2:
Since you told me that you need keys having any character, I've changed the regular expression to ([^\s^\t]+)(?:[\s\t])+([0-9]+).
Now it means that key is anything excepting spaces and tabs.
Update 3:
Also I see you're stuck in .NET 3.0 and ToDictionary was introduced in .NET 3.5. If you want to get the same approach in .NET 3.0, replace ToDictionary(...) with:
Dictionary<int, string> lines = new Dictionary<int, string>();
foreach(Match match in matches)
{
lines.Add(int.Parse(match.Groups[2].Value), match.Groups[1].Value);
}
I am writing a BBCode to HTML converter.
The converter should skip unclosed tags.
I thought about 2 options to do it:
1) match all tags in once using one regex call, like:
Regex re2 = new Regex(@"\[(\/?(?:b|i|u|quote|strike))\]");
MatchCollection mc = re2.Matches(sourcestring);
and then loop over the MatchCollection using 2 pointers to find the opening and closing tags, and then replace them with the right HTML tags.
2) call regex multiple time for every tag and replace directly:
Regex re = new Regex(@"\[b\](.*?)\[\/b\]");
string s1 = re.Replace(sourcestring2,"<b>$1</b>");
What is more efficient?
The first option uses one regex but will require me to loop through all tags and find all pairs, and skip tags that don't have a pair.
Another positive thing is that I don't care about the content between the tags; I just work with the tags and replace them using their positions.
In second option I don't need to worry about looping and making special replace function.
But will require to execute multiple regex and replaces.
What can you suggest?
If the second option is the right one,
there is a problem with regex
\[b\](.*?)\[\/b\]
how can i fix it to also match multi lines like:
[b]
test 1
[/b]
[b]
test 2
[/b]
One option would be to use more SAX-like parsing, where instead of looking for a particular regex you look for [, then have your program handle that event in some manner, look for the ], handle that event, etc. Although more verbose than the regex, it may be easier to understand, and it wouldn't necessarily be slower.
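A minimal sketch of that scanning idea, under stated assumptions: BbCodeScanner and ToHtml are hypothetical names, the tag set comes from the question, the HTML names (blockquote for quote, s for strike) are guesses, and tags that never find a partner are left in the text unchanged.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class BbCodeScanner
{
    // Assumed mapping from BBCode tag names to HTML element names.
    private static readonly Dictionary<string, string> HtmlNames = new Dictionary<string, string>
    {
        { "b", "b" }, { "i", "i" }, { "u", "u" }, { "quote", "blockquote" }, { "strike", "s" }
    };

    private sealed class Tag
    {
        public int Start;
        public int Length;
        public string Name;
        public bool IsClose;
        public bool Paired;
    }

    public static string ToHtml(string source)
    {
        // Pass 1: scan for '[' ... ']' and record the known tags with their positions.
        var tags = new List<Tag>();
        int pos = 0;
        while ((pos = source.IndexOf('[', pos)) >= 0)
        {
            int end = source.IndexOf(']', pos + 1);
            if (end < 0) break;
            string body = source.Substring(pos + 1, end - pos - 1);
            bool isClose = body.StartsWith("/", StringComparison.Ordinal);
            string name = isClose ? body.Substring(1) : body;
            if (HtmlNames.ContainsKey(name))
            {
                tags.Add(new Tag { Start = pos, Length = end - pos + 1, Name = name, IsClose = isClose });
                pos = end + 1;
            }
            else
            {
                pos++; // not a known tag, continue after this '['
            }
        }

        // Pass 2: pair each closing tag with the nearest open tag of the same name.
        var open = new Stack<Tag>();
        foreach (Tag tag in tags)
        {
            if (!tag.IsClose) { open.Push(tag); continue; }
            if (!open.Any(t => t.Name == tag.Name)) continue; // stray closing tag
            while (open.Peek().Name != tag.Name) open.Pop();  // drop unclosed inner tags
            open.Pop().Paired = true;
            tag.Paired = true;
        }

        // Pass 3: rebuild the text, replacing only the paired tags.
        var html = new StringBuilder();
        int last = 0;
        foreach (Tag tag in tags.Where(t => t.Paired))
        {
            html.Append(source, last, tag.Start - last);
            html.Append(tag.IsClose ? "</" : "<").Append(HtmlNames[tag.Name]).Append('>');
            last = tag.Start + tag.Length;
        }
        html.Append(source, last, source.Length - last);
        return html.ToString();
    }
}
```

Because the scanner works on positions rather than on the content between tags, multi-line content needs no special handling.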
r = new System.Text.RegularExpressions.Regex(@"(?:\[b\])(?<name>(?>\[b\](?<DEPTH>)|\[/b\](?<-DEPTH>)|.)+)(?(DEPTH)(?!))(?:\[/b\])", System.Text.RegularExpressions.RegexOptions.Singleline);
var s = r.Replace("asdfasdf[b]test[/b]asdfsadf", "<b>$1</b>");
That should give you only elements that have matching closing tags, and it also handles multiple lines (the Singleline option makes . match newlines, so the whole input is treated as a single line).
It should also handle [b][b][/b] properly by ignoring the first [b].
As to whether or not this method is better than your first method I couldn't say. But hopefully this will point you in the right direction.
Code that works with your example below:
System.Text.RegularExpressions.Regex r;
r = new System.Text.RegularExpressions.Regex(@"(?:\[b\])(?<name>(?>\[b\](?<DEPTH>)|\[/b\](?<-DEPTH>)|.)+)(?(DEPTH)(?!))(?:\[/b\])", System.Text.RegularExpressions.RegexOptions.Singleline);
var s = r.Replace("[b]bla bla[/b]bla bla[b] " + "\r\n" + "bla bla [/b]", "<b>$1</b>");