I'm trying to make a small scripting language using C#.
I'm currently writing a block parser,
and I'm stuck on writing a regex for blocks.
Blocks can contain any number of nested sub-blocks.
This is what I need to match:
{
    naber();
}
{
    int x = 5;
    x = 2;
    if (x == 5) {
        x = 5;
    }
}
I tried this, but it is not working:
\{[^{}]*|(\{[^\{\}]\})*\}
This is my first post, please have mercy on me.
Regex will not help you with this. If you are designing a scripting language (possibly one that will be executed) with blocks and sub-blocks, you need a context-free grammar, as opposed to a regular grammar, which is all that regular expressions can express.
To interpret a context-free language you need the following steps (simplified):
Convert the code string to a list of tokens/symbols. This process is done by a component usually called Lexer.
Convert the tokens into a structured tree (AST - Abstract Syntax Tree) based on grammar rules (things like operator precedence, nested code blocks, etc). This is done by a component usually called Parser.
From here several options arise: you can translate the AST into native code or intermediate code (like bytecode), transpile it into another language, or run it directly in memory, which is the simplest approach and probably what you want/need.
These should already be plenty of concepts to search for, but all of this can be achieved easily with tools like ANTLR. There might be alternatives to ANTLR obviously, I just don’t recall any just now.
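As a rough illustration of the first two steps, here is a minimal hand-written sketch in C#. The names Block, BlockParser, Tokenize and Parse are made up for this example, and the "lexer" only knows about braces and runs of other text, so a `{` inside a string literal or comment would confuse it. A real language needs a proper tokenizer.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical tree node: a block holds raw statement text and nested blocks.
public sealed class Block
{
    public List<string> Statements { get; } = new List<string>();
    public List<Block> Children { get; } = new List<Block>();
}

public static class BlockParser
{
    // Lexer: split the source into "{", "}" and trimmed runs of other text.
    public static List<string> Tokenize(string source)
    {
        var tokens = new List<string>();
        var text = new StringBuilder();
        foreach (char c in source)
        {
            if (c == '{' || c == '}')
            {
                Flush(text, tokens);
                tokens.Add(c.ToString());
            }
            else
            {
                text.Append(c);
            }
        }
        Flush(text, tokens);
        return tokens;
    }

    private static void Flush(StringBuilder text, List<string> tokens)
    {
        string s = text.ToString().Trim();
        if (s.Length > 0) tokens.Add(s);
        text.Clear();
    }

    // Parser: recursive descent over the tokens; returns a root node
    // whose Children are the top-level blocks.
    public static Block Parse(string source)
    {
        List<string> tokens = Tokenize(source);
        int pos = 0;
        return ParseBody(tokens, ref pos, insideBraces: false);
    }

    private static Block ParseBody(List<string> tokens, ref int pos, bool insideBraces)
    {
        var block = new Block();
        while (pos < tokens.Count)
        {
            string t = tokens[pos++];
            if (t == "{")
            {
                // Recursion is what handles arbitrary nesting depth.
                block.Children.Add(ParseBody(tokens, ref pos, insideBraces: true));
            }
            else if (t == "}")
            {
                if (!insideBraces) throw new FormatException("Unexpected '}'");
                return block;
            }
            else
            {
                block.Statements.Add(t);
            }
        }
        if (insideBraces) throw new FormatException("Missing '}'");
        return block;
    }
}
```

Calling `BlockParser.Parse(code)` on the sample from the question should give a root with two child blocks, the second of which has one nested child for the `if` body.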
I agree with those saying that regex isn't what you should use for parsing code.
With that said, it is possible with some regex engines to match balanced characters and get the code in a block.
This might work for you: {((?>[^{}]+|(?R))*)}. If the regex engine supports recursive patterns, then it is possible to do some code parsing with it.
More about it here: Match balanced curly braces
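Note that the .NET regex engine does not support recursion with (?R), but it has balancing groups, which can do a similar job. Below is a minimal sketch, assuming braces never appear inside strings or comments. BlockMatcher, FindBlocks and the group name depth are names chosen for this example.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class BlockMatcher
{
    // \{(?<depth>)   pushes onto the "depth" stack for every nested '{'
    // \}(?<-depth>)  pops it for every nested '}' (fails if the stack is empty)
    // (?(depth)(?!)) fails the match if any '{' is still unclosed
    private static readonly Regex TopLevelBlock = new Regex(
        @"\{(?>[^{}]+|\{(?<depth>)|\}(?<-depth>))*(?(depth)(?!))\}");

    // Returns every outermost { ... } block, including its braces.
    public static string[] FindBlocks(string source)
    {
        return TopLevelBlock.Matches(source)
                            .Cast<Match>()
                            .Select(m => m.Value)
                            .ToArray();
    }
}
```

This only finds the outermost blocks; to get at the sub-blocks you would have to run it again on the content of each match, which is one more reason a real parser tends to be the better choice.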
I am trying to add a feature that works with certain unicode groups from a string. I found this question that suggests the following solution, which does work on the unicodes inside of the stated range:
s = Regex.Replace(s, @"[^\u0000-\u007F]", string.Empty);
This works fine.
In my research, though, I came across the use of unicode blocks, which I find to be far more readable.
InBasic_Latin = U+0000–U+007F
More often, I saw recommendations pointing people to use the actual codes themselves (\u0000-\u007F) rather than these blocks (InBasic_Latin). I could see the benefit of explicitly declaring a range when you need some subset of that block or a specific unicode, but when you really just want that entire grouping using the block declaration it seems more friendly to readability and even programmability to use the block name instead.
So, generally, my question is why would \u0000–\u007F be considered a better syntax than InBasic_Latin?
It depends on your regex engine, but some (like .NET, Java, Perl) do support Unicode blocks:
if (Regex.IsMatch(subjectString, @"\p{IsBasicLatin}")) {
    // Successful match
}
Others don't (like JavaScript, PCRE, Python, Ruby, R and most others), so you need to spell out those codepoints manually or use an extension like Steve Levithan's XRegExp library for JavaScript.
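For .NET specifically, the two spellings can be compared side by side. This is only a sketch; AsciiFilter and its method names are invented for the example. \P{IsBasicLatin} (capital P) is the negated form of the block, so it plays the same role as the negated range in the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AsciiFilter
{
    // Explicit code point range: portable to most regex engines.
    public static string StripByRange(string s)
    {
        return Regex.Replace(s, @"[^\u0000-\u007F]", string.Empty);
    }

    // Named Unicode block: more readable, but engine-specific syntax.
    public static string StripByBlock(string s)
    {
        return Regex.Replace(s, @"\P{IsBasicLatin}", string.Empty);
    }
}
```

Both methods should return the same result for any input, so the choice is mostly about readability versus portability of the pattern.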
I have some strings of some defined format like Foo.<Whatever>.$(Something) and I would like to split them in parts and have each part automatically assigned to a variable.
I once wrote something resembling the bash/shell pipe command option '<' with C# classes and operator overloading. The usage was something like
ParseExpression ex = pex("item1") > ".<" > pex("item2") > ">.$(" > pex("item3") > ")";
ParseResult r = new ParseResult(ex, "Foo.<Whatever>.$(Something)");
ParseResult then had a Dictionary with the keys item1 through item3 set to the strings found in the given string. The method pex generated some object that could be used with the > operator, eventually having a chain of ParseExpressionParts which constitute a ParseExpression.
I don't have the code at hand at the moment, and before I start coding it from scratch again I thought I had better ask whether someone has already done and published it.
The parse expressions remind me of parser combinators like Parsec and FParsec (for F#). How complex is the syntax going to be? As it is, it could be handled by a regex with groups.
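To make the "regex with groups" idea concrete, here is a small sketch for the Foo.<Whatever>.$(Something) format. The class name FormatParser is made up, and the group names item1 through item3 are borrowed from the question. It assumes the parts themselves never contain '.', '<', '>' or ')'.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class FormatParser
{
    private static readonly Regex Pattern = new Regex(
        @"^(?<item1>[^.<]+)\.<(?<item2>[^>]+)>\.\$\((?<item3>[^)]+)\)$");

    // Returns the named parts, or null if the input does not fit the format.
    public static Dictionary<string, string> Parse(string input)
    {
        Match m = Pattern.Match(input);
        if (!m.Success) return null;

        var result = new Dictionary<string, string>();
        foreach (string name in new[] { "item1", "item2", "item3" })
        {
            result[name] = m.Groups[name].Value;
        }
        return result;
    }
}
```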
If you want to create a more complex grammar using parser combinators you can use FParsec, one of the better known parser combinators, targeting F#. In general, functional languages like F# are used a lot in such situations. CSharp-monad is a parser combinator targeting C#. The project isn't very active though.
You can also use a full-blown parser generator like ANTLR 4. ANTLR is used by ASP.NET MVC to parse Razor syntax views. ANTLR 4 creates a parse tree and lets you process it with either a Visitor or a Listener, which are similar to DOM and SAX processing respectively. A Listener calls your code as soon as an element is encountered (e.g. the opening <, the content, etc.), while a Visitor works on the finished tree.
The Visual Studio extension for ANTLR will generate both the parser classes as well as base Visitor and Listener classes for your grammar. The NetBeans-based ANTLRWorks IDE makes creating and testing grammars very easy.
A rough grammar for your example would be :
format: tag '.' '<' category '>' '.' '$' '(' value ')';
tag : ID;
category : ID;
value : ID;
ID : [A-Za-z0-9]+;
Or you could define keywords like FOO : 'FOO' that have special meaning for your grammar. A visitor or listener could handle the tag, e.g. to format a string, execute an operation on the values, etc.
There are no hard and fast rules. Personally, I use regular expressions for simpler cases, e.g. processing relatively simple log files, and ANTLR for more complex cases like screen-scraping mainframe data. I haven't looked into parser combinators as I never had the time to get comfortable with F#. They would be really handy, though, for handling some messed-up log4net log files.
I started with Heinzi's suggestion and eventually came up with the following code:
const string tokenPrefix = "px";
const string tokenSuffix = "sx";
const string tokenVar = "var";
// Named groups: the text before $(...), the variable inside it, and the text after it.
string r = string.Format(@"(?<{0}>.*)\$\((?<{1}>.*)\)(?<{2}>.*)",
    tokenPrefix, tokenVar, tokenSuffix);
Regex regex = new Regex(r);
Match match = regex.Match("Foo$(Something)Else");
if (match.Success)
{
    string prefix = match.Groups[tokenPrefix].Value;   // = "Foo"
    string suffix = match.Groups[tokenSuffix].Value;   // = "Else"
    string variable = match.Groups[tokenVar].Value;    // = "Something"
}
After talking to a colleague about this, I was told to consider using the C# parser construction library named "Sprache" (which sits somewhere between regex and ANTLR-like toolsets) once my pattern usage increases and I want better maintainability.
I am making a programming language in native C++, and I am making a basic editor for it in C# .NET WinForms. I am using a SyntaxRTB, and I would like the Regex to catch the following error:
if declare is not succeeded by string / int / float / bool / array / char
How would I do that?
(The syntax to declare a variable is declare variable_type variable_name) - A whitespace would have to be accounted for too)
I have declare(?!string), but am still confused.
If you want a regex, you need a zero-width negative lookahead.
But if you're constructing a language, this isn't the way to go. Full-blown language parsers are a different entity.
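For the regex route, a negative lookahead could look roughly like the sketch below. DeclareChecker and FindErrors are hypothetical names; the list of types comes from the question. The lookahead includes the whitespace and a trailing \b, so "declare integer x" is still reported as an error.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class DeclareChecker
{
    // Matches "declare" only when it is NOT followed by whitespace
    // and one of the known type keywords.
    private static readonly Regex BadDeclare = new Regex(
        @"\bdeclare\b(?!\s+(?:string|int|float|bool|array|char)\b)");

    // Returns the character offsets of every invalid "declare".
    public static List<int> FindErrors(string source)
    {
        var positions = new List<int>();
        foreach (Match m in BadDeclare.Matches(source))
        {
            positions.Add(m.Index);
        }
        return positions;
    }
}
```

The returned offsets could then be used to highlight the offending keyword in the editor.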
Although I agree with @fejesjoco, this is the expression I used here:
(declare)[\s](int|string|float|bool|array|char)[\s](.*)
Check for !match(pattern) to further diagnose an issue.
You are going to want to use Lookahead assertion. To be honest, I'm decent in Regex but I'm not really the guy you want explaining it to you.
This link will explain it better than I can, and this link provides a fairly decent Regex editor.
I am given a string that contains several different combinations of data.
For example:
string data = "(age=20&gender=male) or (city=newyork)";
string data1 = "(job=engineer&gender=female)";
string data2 = "(foo =1 or foo = 2) & (bar =1)";
I need to parse this string, create a structure out of it, and evaluate it as a condition against another object, e.g. if the object has these properties, then do something, else skip, etc.
What are the best practices to do this?
Should I use a parser such as ANTLR and generate tokens out of the string?
Reminder: there are several combinations of how this string is created, but it is all and/or.
Something like ANTLR is probably overkill for this.
A simple implementation of the shunting-yard algorithm would probably do the trick quite nicely.
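A rough sketch of what that could look like for the strings in the question is below. ConditionEvaluator and its members are invented names, and several things are assumptions: "&" and "and" are treated as the same operator, "and" binds tighter than "or", keys and values are plain word characters, and the object's properties are represented as a string dictionary. Characters the tokenizer does not recognize are silently skipped.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ConditionEvaluator
{
    // Tokens: parentheses, operators, or a "key = value" condition.
    private static readonly Regex Token = new Regex(
        @"\(|\)|&|\b(?:and|or)\b|(?<key>\w+)\s*=\s*(?<value>\w+)",
        RegexOptions.IgnoreCase);

    // Assumption: "and" (&) has higher precedence than "or".
    private static int Precedence(string op)
    {
        return op == "&" ? 2 : 1;
    }

    // Shunting-yard: converts the infix token stream to postfix (RPN).
    public static List<string> ToPostfix(string expression)
    {
        var output = new List<string>();
        var ops = new Stack<string>();
        foreach (Match m in Token.Matches(expression))
        {
            string t = m.Value.ToLowerInvariant();
            if (m.Groups["key"].Success)
            {
                output.Add(m.Groups["key"].Value + "=" + m.Groups["value"].Value);
            }
            else if (t == "(")
            {
                ops.Push(t);
            }
            else if (t == ")")
            {
                while (ops.Count > 0 && ops.Peek() != "(") output.Add(ops.Pop());
                if (ops.Count == 0) throw new FormatException("Unbalanced ')'");
                ops.Pop(); // discard the "("
            }
            else
            {
                string op = t == "and" ? "&" : t; // normalize "and" to "&"
                while (ops.Count > 0 && ops.Peek() != "(" &&
                       Precedence(ops.Peek()) >= Precedence(op))
                {
                    output.Add(ops.Pop());
                }
                ops.Push(op);
            }
        }
        while (ops.Count > 0)
        {
            string op = ops.Pop();
            if (op == "(") throw new FormatException("Unbalanced '('");
            output.Add(op);
        }
        return output;
    }

    // Evaluates the postfix form against the given properties.
    public static bool Evaluate(string expression, IDictionary<string, string> properties)
    {
        var stack = new Stack<bool>();
        foreach (string t in ToPostfix(expression))
        {
            if (t == "&" || t == "or")
            {
                bool right = stack.Pop();
                bool left = stack.Pop();
                stack.Push(t == "&" ? (left && right) : (left || right));
            }
            else
            {
                string[] kv = t.Split('=');
                string actual;
                stack.Push(properties.TryGetValue(kv[0], out actual) && actual == kv[1]);
            }
        }
        return stack.Pop();
    }
}
```

Usage would be something like `ConditionEvaluator.Evaluate(data, properties)`, where properties holds the values read from the object being tested.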
Using regular expressions may work if the example is very simple, but it will more likely lead to a code that is impossible to maintain. Using some other approach to parsing seems like a good idea.
I would take a look at NCalc - it is mainly focused on parsing mathematical expressions, but it seems to be quite customizable (you can specify your functions and constants), so it may work in your scenario as well.
If this is too complex for your purpose, you can use any "parser generator" for C#. Using ANTLR is one great option - here is an example that shows how to start writing something like your example Five minute introduction to ANTLR
You could also try using F#, which is a great language for this kind of problem. See for example FsLex Sample by Chris Smith, which shows a simple mathematical evaluator - processing the parsed expression in F# would be a lot easier than in C#. In F#, you could also use FParsec, which is very lightweight, but may be a bit difficult to follow if you're not used to F#.
I suggest you have a look at regular expressions: http://www.codeproject.com/KB/dotnet/regextutorial.aspx
Antlr is a great tool, but you can probably do this with regular expressions. One of the nice things about the .NET regex engine is support for nested constructs. See
http://retkomma.wordpress.com/2007/10/30/nested-regular-expressions-explained/
and this SO post.
Seems like you might want to use Regular Expressions to do this.
Read up a little bit on Regular Expressions in .NET. Here are some good articles:
http://msdn.microsoft.com/en-us/library/hs600312.aspx
http://www.regular-expressions.info/dotnet.html
When it comes time to write/test your regular expression, I would highly recommend using RegExLib.com's regex tester.
I have a string of text like this:
<customtag>hey</customtag>
I want to use a RegEx to modify the text between the "customtag" tags so that it might look like this:
<customtag>hey, this is changed!</customtag>
I know that I can use a MatchEvaluator to modify the text, but I'm unsure of the proper RegEx syntax to use. Any help would be much appreciated.
I wouldn't use regex for this either, but if you must, this expression should work:
<customtag>(.+?)</customtag>
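Combined with a MatchEvaluator, that could look like the following sketch. TagEditor and AppendText are made-up names, and the opening and closing tags are captured in extra groups so they can be put back around the modified text.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagEditor
{
    // Group 1: opening tag, group 2: inner text, group 3: closing tag.
    private static readonly Regex Tag = new Regex(
        @"(<customtag>)(.+?)(</customtag>)", RegexOptions.Singleline);

    // Appends suffix to the text inside every <customtag> element.
    public static string AppendText(string input, string suffix)
    {
        // The lambda is converted to a MatchEvaluator delegate.
        return Tag.Replace(input, m =>
            m.Groups[1].Value + m.Groups[2].Value + suffix + m.Groups[3].Value);
    }
}
```

For example, `TagEditor.AppendText("<customtag>hey</customtag>", ", this is changed!")` produces the string from the question.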
I'd chew my own leg off before using a regular expression to parse and alter HTML.
Use XSL or DOM.
Two comments have asked me to clarify. The regular expression substitution works in the specific case in the OP's question, but in general regular expressions are not a good solution. Regular expressions can match regular languages, i.e. a sequence of input which can be accepted by a finite state machine. HTML can contain nested tags to any arbitrary depth, so it's not a regular language.
What does this have to do with the question? Using a regular expression for the OP's question as it is written works, but what if the content between the <customtag> tags contains other tags? What if a literal < character occurs in the text? It has been 11 months since Jon Tackabury asked the question, and I'd guess that in that time, the complexity of his problem may have increased.
Regular expressions are great tools and I do use them all the time. But using them in lieu of a real parser for input that needs one is going to work in only very simple cases. It's practically inevitable that these cases grow beyond what regular expressions can handle. When that happens, you'll be tempted to write a more complex regular expression, but these quickly become very laborious to develop and debug. Be ready to scrap the regular expression solution when the parsing requirements expand.
XSL and DOM are two standard technologies designed to work with XML or XHTML markup. Both technologies know how to parse structured markup files, keep track of nested tags, and allow you to transform tags attributes or content.
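As a small illustration of the DOM-style approach, here is a sketch using LINQ to XML (System.Xml.Linq), which is one of several XML APIs in .NET. TagEditorDom and AppendText are hypothetical names. It assumes the input is well-formed XML; real-world HTML usually is not, and would need an HTML-aware parser instead.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class TagEditorDom
{
    // Appends suffix as text at the end of every <customtag> element,
    // leaving any nested child elements intact.
    public static string AppendText(string xml, string suffix)
    {
        XElement root = XElement.Parse(xml);

        // ToList() so the tree is not modified while it is being enumerated.
        foreach (XElement tag in root.DescendantsAndSelf("customtag").ToList())
        {
            tag.Add(suffix);
        }
        return root.ToString(SaveOptions.DisableFormatting);
    }
}
```

Unlike the regex, this keeps working when the tag contains other tags or sits several levels deep in the document.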
Here are a couple of articles on how to use XSL with C#:
http://www.csharpfriends.com/Articles/getArticle.aspx?articleID=63
http://www.csharphelp.com/archives/archive78.html
Here are a couple of articles on how to use DOM with C#:
http://msdn.microsoft.com/en-us/library/aa290341%28VS.71%29.aspx
http://blogs.msdn.com/tims/archive/2007/06/13/programming-html-with-c.aspx
Here's a .NET library that assists DOM and XSL operations on HTML:
http://www.codeplex.com/Wiki/View.aspx?ProjectName=htmlagilitypack
If there won't be any other tags between the two tags, this regex is a little safer, and more efficient:
<customtag>[^<>]*</customtag>
Most people use HTML Agility Pack for HTML text parsing. However, I find it a little heavyweight and complicated for my own needs. I create a web browser control in memory, load the page, and copy the text from it (see the example below).
You can find 3 simple examples here:
http://jakemdrew.wordpress.com/2012/02/03/getting-only-the-text-displayed-on-a-webpage-using-c/
// Strip all HTML tags
var re = new RegExp("<[^>]*>", "g");
var x2 = Content.replace(re, "");
// Remove all non-breaking spaces (&nbsp;, U+00A0)
var x3 = x2.replace(/\u00a0/g, '');