Replace characters using Regex in c# [duplicate] - c#

I like to know how to replace a regex-match of unknown amount of equal-signs, thats not less than 2... to the same amount of underscores
So far I got this:
text = Regex.Replace(text, "(={2,})", "");
What should I use as the 3rd parameter ?
EDIT: Prefferably a regex solution thats compatible in all languages

You can use Regex.Replace(String, MatchEvaluator) instead and analyze math:
string result = new Regex("(={2,})")
.Replace(text, match => new string('_', match.ToString().Length));

A much less clear answer (in term of code clarity):
text = Regex.Replace(text, "=(?==)|(?<==)=", "_");
If there are more than 2 = in a row, then at every =, we will find a = ahead or behind.
This only works if the language supports look-behind, which includes C#, Java, Python, PCRE... and excludes JavaScript.
However, since you can pass a function to String.replace function in JavaScript, you can write code similar to Alexei Levenkov's answer. Actually, Alexei Levenkov's answer works in many languages (well, except Java).

Related

C# parse Apple formatted strings (printf)

I'm dealing some with Apple formatted strings, for example:
[%d] fsPurgeable type: %#, count: %lld bytes for %lld files
According to Apple's documentation here, the format string specification follows the IEEE printf specification, with some modifications it appears.
I need to parse these strings, and replace the % type placeholders with the actual data that belongs there. My initial thought was to use a Regex, however as these claim to adhere to printf specifications, I was wondering if there was anything already in .net that might help with this.
I've done some reading, but I can't see anything that immediately jumped out as useful.
Any suggestions?
To somewhat answer my own question, I ended up writing a regex to handle the parsing:
var regex = new Regex(#"(?i)%[dspublicxfegahztj{}#%]*");
(?i) = case insensitive flag
% = literal percent sign
[dspublicxfegahztj{}#%] = all valid characters that could follow the %
* = match zero or multiple characters, e.g. %lld
Still curious if there is anything in .Net to handle this sort of thing

Child regex pattern in variable [duplicate]

Say I have a regex matching a hexadecimal 32 bit number:
([0-9a-fA-F]{1,8})
When I construct a regex where I need to match this multiple times, e.g.
(?<from>[0-9a-fA-F]{1,8})\s*:\s*(?<to>[0-9a-fA-F]{1,8})
Do I have to repeat the subexpression definition every time, or is there a way to "name and reuse" it?
I'd imagine something like (warning, invented syntax!)
(?<from>{hexnum=[0-9a-fA-F]{1,8}})\s*:\s*(?<to>{=hexnum})
where hexnum= would define the subexpression "hexnum", and {=hexnum} would reuse it.
Since I already learnt it matters: I'm using .NET's System.Text.RegularExpressions.Regex, but a general answer would be interesting, too.
RegEx Subroutines
When you want to use a sub-expression multiple times without rewriting it, you can group it then call it as a subroutine. Subroutines may be called by name, index, or relative position.
Subroutines are supported by PCRE, Perl, Ruby, PHP, Delphi, R, and others. Unfortunately, the .NET Framework is lacking, but there are some PCRE libraries for .NET that you can use instead (such as https://github.com/ltrzesniewski/pcre-net).
Syntax
Here's how subroutines work: let's say you have a sub-expression [abc] that you want to repeat three times in a row.
Standard RegEx
Any: [abc][abc][abc]
Subroutine, by Name
Perl:     (?'name'[abc])(?&name)(?&name)
PCRE: (?P<name>[abc])(?P>name)(?P>name)
Ruby:   (?<name>[abc])\g<name>\g<name>
Subroutine, by Index
Perl/PCRE: ([abc])(?1)(?1)
Ruby:          ([abc])\g<1>\g<1>
Subroutine, by Relative Position
Perl:     ([abc])(?-1)(?-1)
PCRE: ([abc])(?-1)(?-1)
Ruby:   ([abc])\g<-1>\g<-1>
Subroutine, Predefined
This defines a subroutine without executing it.
Perl/PCRE: (?(DEFINE)(?'name'[abc]))(?P>name)(?P>name)(?P>name)
Examples
Matches a valid IPv4 address string, from 0.0.0.0 to 255.255.255.255:
((?:25[0-5])|(?:2[0-4][0-9])|(?:[0-1]?[0-9]?[0-9]))\.(?1)\.(?1)\.(?1)
Without subroutines:
((?:25[0-5])|(?:2[0-4][0-9])|(?:[0-1]?[0-9]?[0-9]))\.((?:25[0-5])|(?:2[0-4][0-9])|(?:[0-1]?[0-9]?[0-9]))\.((?:25[0-5])|(?:2[0-4][0-9])|(?:[0-1]?[0-9]?[0-9]))\.((?:25[0-5])|(?:2[0-4][0-9])|(?:[0-1]?[0-9]?[0-9]))
And to solve the original posted problem:
(?<from>(?P<hexnum>[0-9a-fA-F]{1,8}))\s*:\s*(?<to>(?P>hexnum))
More Info
http://regular-expressions.info/subroutine.html
http://regex101.com/
Why not do something like this, not really shorter but a bit more maintainable.
String.Format("(?<from>{0})\s*:\s*(?<to>{0})", "[0-9a-zA-Z]{1,8}");
If you want more self documenting code i would assign the number regex string to a properly named const variable.
.NET regex does not support pattern recursion, and if you can use (?<from>(?<hex>[0-9a-fA-F]{1,8}))\s*:\s*(?<to>(\g<hex>)) in Ruby and PHP/PCRE (where hex is a "technical" named capturing group whose name should not occur in the main pattern), in .NET, you may just define the block(s) as separate variables, and then use them to build a dynamic pattern.
Starting with C#6, you may use an interpolated string literal that looks very much like a PCRE/Onigmo subpattern recursion, but is actually cleaner and has no potential bottleneck when the group is named identically to the "technical" capturing group:
C# demo:
using System;
using System.Text.RegularExpressions;
public class Test
{
public static void Main()
{
var block = "[0-9a-fA-F]{1,8}";
var pattern = $#"(?<from>{block})\s*:\s*(?<to>{block})";
Console.WriteLine(Regex.IsMatch("12345678 :87654321", pattern));
}
}
The $#"..." is a verbatim interpolated string literal, where escape sequences are treated as combinations of a literal backslash and a char after it. Make sure to define literal { with {{ and } with }} (e.g. $#"(?:{block}){{5}}" to repeat a block 5 times).
For older C# versions, use string.Format:
var pattern = string.Format(#"(?<from>{0})\s*:\s*(?<to>{0})", block);
as is suggested in Mattias's answer.
If I am understanding your question correctly, you want to reuse certain patterns to construct a bigger pattern?
string f = #"fc\d+/";
string e = #"\d+";
Regex regexObj = new Regex(f+e);
Other than this, using backreferences will only help if you are trying to match the exact same string that you have previously matched somewhere in your regex.
e.g.
/\b([a-z])\w+\1\b/
Will only match : text, spaces in the above text :
This is a sample text which is not the title since it does not end with 2 spaces.
There is no such predefined class. I think you can simplify it using ignore-case option, e.g.:
(?i)(?<from>[0-9a-z]{1,8})\s*:\s*(?<to>[0-9a-z]{1,8})
To reuse regex named capture group use this syntax: \k<name> or \k'name'
So the answer is:
(?<from>[0-9a-fA-F]{1,8})\s*:\s*\k<from>
More info: http://www.regular-expressions.info/named.html

Working with Unicode Blocks in Regex

I am trying to add a feature that works with certain unicode groups from a string. I found this question that suggests the following solution, which does work on the unicodes inside of the stated range:
s = Regex.Replace(s, #"[^\u0000-\u007F]", string.Empty);
This works fine.
In my research, though, I came across the use of unicode blocks, which I find to be far more readable.
InBasic_Latin = U+0000–U+007F
More often, I saw recommendations pointing people to use the actual codes themselves (\u0000-\u007F) rather than these blocks (InBasic_Latin). I could see the benefit of explicitly declaring a range when you need some subset of that block or a specific unicode, but when you really just want that entire grouping using the block declaration it seems more friendly to readability and even programmability to use the block name instead.
So, generally, my question is why would \u0000–\u007F be considered a better syntax than InBasic_Latin?
It depends on your regex engine, but some (like .NET, Java, Perl) do support Unicode blocks:
if (Regex.IsMatch(subjectString, #"\p{IsBasicLatin}")) {
// Successful match
}
Others don't (like JavaScript, PCRE, Python, Ruby, R and most others), so you need to spell out those codepoints manually or use an extension like Steve Levithan's XRegExp library for JavaScript.

Correction in this simple regular expression

I am new to regular expressions and the one that i have written might be a very simple one but donot know where I am wrong.
#"^([a-zA-Z._]+)#([\d]+)"
This RE is for the following string:
somename#somenumber
Now i am trying to retrieve the somename and somenumber. This is what i did:
ac.name = m.Groups[0].Value;
ac.number = m.Groups[1].Value;
Here ac.name reads the complete string, and ac.number reads somenumber. Where am I wrong in ac.name?
i guess the regex is correct, the problem is, you get the ac.name not from group 1 but group(0), which is the whole string. try this:
ac.name = m.Groups[1].Value;
ac.number = m.Groups[2].Value;
This regex is correct. I think your mistake is in somewhere else. You seem to use C#. So, you should think about the regex usage in the language.
Looking to the code sample in MSDN, you need to use 1-based indexes while accessing Groups instead of zero-based (as also Kent suggested). So, use this:
String name = m.Groups[1].Value;
String number = m.Groups[2].Value;
use this regex (\w+)#(\d+([.,]\d+)?)
Groups[1] will be contain name
Groups[2] will be contain number
I think you should move the + into the capture group:
#"^([a-zA-Z._]+)#([\d]+)"
If this is C#, try without the ^
([a-zA-Z\._]+)#([\d]+)
I just tried it out and it groups properly
Update: escaped the .
If you want only one match (and hence the ^ in original expression), use .Match instead of .Matches method. See MSDN documentation on Regular Expression Classes.

What is equivalent in JavaScript to C#'s # language feature for ignoring escape chars in string literals?

I think this question may be fairly evident from the title alone: I am a C# developer who often uses # before a string literal to make it more readable (e.g. string drive = #"C:\")
I am in the process of scripting out a lot of my processes (using the wonderful V8 for .Net) and I am wondering whether there's a similar feature in JavaScript?
No, unfortunately there is no such feature. In JavaScript you can enclose strings in double quotes "..." or in single quotes '...', but in both cases you will have to escape that quote character as well as the backslash.
It's not what you're after, but you can have a string that spans multiple lines like so
var s = "hello\
world";
Although this isn't part of the standard, so your engine might not support it.
I found out about it here: http://helephant.com/2007/05/spanning-javascript-strings-across-multiple-lines/

Categories

Resources