Using C# Regex to find C++ code patterns (defines & enums) - c#

I am writing a C# program that runs over C++ source files and looks for the following things:
#define SOMETHING_A 99
and
typedef enum {
EX_A,
EX_B,
EX_C,
EX_D,
EX_E
} Examples;
and
enum EXAMPLE2
{
EX2_A=0,
EX2_B=1,
EX2_C=2,
EX2_D=3,
EX2_LAST = EX2_D
};
My objective is to get the following list of pairs as output:
{SOMETHING_A,99}
{EX_A,0}
{EX_B,1}
..
..
{EX2_A,0}
{EX2_B,1}
..
..
Can you help me to find the correct regular expressions that match the above 3 patterns?

If you want a solution that will work on any C++ file, use a parser instead of regexes. There are just too many possibilities to account for (different code styles, code that is commented out, etc.).
If you only want to do this on a known set of files, and they have a predictable format and style, a regex is probably ok. Actually, you are better off using several regexes:
/^#define\s+(\S+)\s+(\S+)/
This only matches define statements that are at the beginning of a line.
Here is the typedef enum:
/^\s*typedef\s+enum\s*\{[^\}]+\}[^;]+;/
(It's not clear what you want to grab from this one, so I haven't captured anything).
And here is the enum. This is best done in two steps:
/^\s*enum\s+(\S+)\s*\{\s*([^\}]+?)\s*\}\s*;/
The first step gets the name of the enum in the first capture group and the content in the second group. Perform a regex on the second capture group to get the fields and values:
/(\S+)\s*=\s*([^\s\,]+)/
Each match of this will give you one name/value pair.
These regexes should handle your examples, and they should do a decent job of handling the most common usage in C++ code. But they are not perfect; if you want a solution that covers all possible constructs, don't use a regex.
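Note: make sure the single-line flag (RegexOptions.Singleline in .NET) is off when using these. In .NET you also need RegexOptions.Multiline so that ^ matches at the start of every line rather than only at the start of the input.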
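As a rough illustration, the #define regex and the two-step enum regexes above could be wired together in C# along these lines. This is only a sketch: the class and method names (CppConstantScanner, Scan) are made up for the example, it only picks up enum members that have an explicit "= value", and it inherits all the limitations mentioned above (comments, unusual formatting, and so on).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class CppConstantScanner
{
    // Multiline makes ^ match at the start of every line.
    static readonly Regex Define = new Regex(
        @"^#define\s+(\S+)\s+(\S+)", RegexOptions.Multiline);

    // Step 1: enum name in group 1, raw body in group 2.
    static readonly Regex EnumBlock = new Regex(
        @"^\s*enum\s+(\S+)\s*\{\s*([^\}]+?)\s*\}\s*;", RegexOptions.Multiline);

    // Step 2: one NAME = VALUE pair per match inside an enum body.
    static readonly Regex EnumField = new Regex(@"(\S+)\s*=\s*([^\s\,]+)");

    public static List<KeyValuePair<string, string>> Scan(string source)
    {
        var pairs = new List<KeyValuePair<string, string>>();

        foreach (Match d in Define.Matches(source))
            pairs.Add(new KeyValuePair<string, string>(d.Groups[1].Value, d.Groups[2].Value));

        foreach (Match e in EnumBlock.Matches(source))
            foreach (Match f in EnumField.Matches(e.Groups[2].Value))
                pairs.Add(new KeyValuePair<string, string>(f.Groups[1].Value, f.Groups[2].Value));

        return pairs;
    }
}
```

Values are returned as raw text, so an entry such as EX2_LAST = EX2_D comes back as the pair {EX2_LAST,EX2_D}; resolving it to a number would need an extra lookup pass.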

Related

.NET Regular Expression (perl-like) for detecting text that was pasted twice in a row

I've got a ton of json files that, due to a UI bug with the program that made them, often have text that was accidentally pasted twice in a row (no space separating them).
Example: {FolderLoc = "C:\testC:\test"}
I'm wondering if it's possible for a regular expression to match this. It would be per-line. If I can do this, I can use FNR, which is a batch text processing tool that supports .NET RegEx, to get rid of the accidental duplicates.
I regret not having an example of one of my attempts to show, but this is a very unique problem and I wasn't able to find anything on search engines resembling it to even start to base a solution off of.
Any help would be appreciated.
One approach is to capture text along the string (.+ style) and follow it with a lookahead that checks for a repetition of what has been captured up to that point, like
/(.+)(?=\1)/; # but need more restrictions
However, this gets tripped even just on double leTTers, so it needs at least a little more. For example, our pattern can require the text which gets repeated to be at least two words long.
Here is a basic and raw example. Please also see the note on regex at the end.
use warnings;
use strict;
use feature 'say';

my @lines = (
    q(It just wasn't able just wasn't able no matter how hard it tried.),
    q(This has no repetitions.),
    q({FolderLoc = "C:\testC:\test"}),
);

my $re_rep = qr/(\w+\W+\w+.+)(?=\1)/; # at least two words, and then some

for (@lines) {
    if (/$re_rep/) {
        # Other conditions/filtering on $1 (the capture) ?
        say $1
    }
}
This matches at least two words: word (\w+) + non-word-chars + word + anything. That'll still get some legitimate data, but it's a start that can now be customized to your data. We can tweak the regex and/or further scrutinize our catch inside that if branch.
The pattern doesn't allow for any intervening text (the repetition must follow immediately), which is easily changed if needed; the question is whether some legitimate repetitions could then get flagged.
The program above prints
just wasn't able
C:\test
Note on regex: This quest, to find repeated text, is much too generic as it stands and it will surely flag someone's good data. It is enough to note that I had to require at least two words (with only one word required, "that that" is flagged), which is arbitrary and still insufficient. For one, repeated numbers realistically found in data files (3,3,3,3,3) will be matched as well.
So this needs further specialization, for which we need to know more about the data.
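Since the question asks for something usable from a .NET regex tool, here is the same pattern applied in C#, as a sketch only. The class and method names (DoublePasteFixer, RemoveRepeat) are invented for the example. Replacing the match with an empty string drops the first copy of the repeated text and leaves the second one in place.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class DoublePasteFixer
{
    // Same pattern as the Perl example: at least two words plus one more
    // character, immediately followed by an identical copy (checked by the lookahead).
    static readonly Regex Repeated = new Regex(@"(\w+\W+\w+.+)(?=\1)");

    // Removes the first copy of each immediately repeated run on the line.
    public static string RemoveRepeat(string line)
    {
        return Repeated.Replace(line, "");
    }
}
```

In a search-and-replace tool, the equivalent would be to search for the pattern and replace with nothing. The same caveats apply: legitimate repetitions in the data get removed too.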

c# regex - changing pattern matches until find specific word

Usually I can work around things and get everything working by myself, but this one is kind of tricky; even the MSDN references and examples confuse more than they help.
I have been testing some code and am stuck at mixing a capturing group for the replacement with a non-capturing group that stops the matching where I want.
A simpler version of the code I want to change is:
stats = "label:100,value:7878,label:110,value:7879,something,label:200,value:8888";
valor = "value:8080";
I know that if I use
pattern = @"value:(\d+)";
I can change every value number to 8080 when I do
Regex.Replace(stats, pattern, valor);
but I need it to stop changing these once it finds the 'something' string.
I managed to replace every single character up to 'something' with 'valor' using
pattern = @"^(?:(?!something).)*";
Is there a way to change only the 'value:(\d+)' numbers to 'valor', combined with the (?:(?!something).) part so that matching stops there, in the same pattern?
I have seen lots of examples, but none of them covered something like this, so I don't know whether it is possible to combine both conditions.
You can make use of a look-behind solution that makes sure there is no something before the value:
(?<!\bsomething\b.*)value:\d+
Note that something is matched as a whole word due to \b word boundaries.
The result of the replace operation: label:100,value:8080,label:110,value:8080,something,label:200,value:8888
Note that (?:(?!something).) is very inefficient and should be used only when no other means work. In .NET, there is a powerful variable-width look-behind, which is the right tool for this task.
Also note that if you are not using capture group backreferences, you do not need those capturing groups in your pattern (I removed the parentheses from around \d+).
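For completeness, a minimal sketch of how the look-behind pattern could be used from code. The wrapper class and method (StopAtMarker, ReplaceValuesBefore) are invented names; the marker is passed through Regex.Escape in case it contains regex metacharacters. Keep in mind that the replacement string is still interpreted by Regex.Replace, so a literal $ in it would need to be written as $$.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class StopAtMarker
{
    // Replaces every value:<digits> that is NOT preceded (on the same line)
    // by the marker as a whole word.
    public static string ReplaceValuesBefore(string input, string marker, string replacement)
    {
        string pattern = @"(?<!\b" + Regex.Escape(marker) + @"\b.*)value:\d+";
        return Regex.Replace(input, pattern, replacement);
    }
}
```

With the variables from the question, the call would be StopAtMarker.ReplaceValuesBefore(stats, "something", valor).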

Regex to identify C# functions

I need to find all functions in my VS solution with a certain attribute and insert a line of code at the end and at the beginning of each one. For identifying the functions, I've got as far as
\[attribute\]\r?\n(.*)void(.*)\r?\n.*\{\r?\n([^\{\}]*)\}
But that only works on functions that don't contain any other blocks of code delimited by braces. If I set the last capturing group to [\s\S] (all characters), it simply selects all text from the start of the first function to the end of the last one. Is there a way to get around this and select just one whole function?
I am afraid balancing constructs by themselves are not enough, since you may have an unbalanced number of braces in the method body. You can still try this regex, which handles most of the caveats:
\[attribute\](?<signature>[^{]*)(?<body>(?:\{[^}]*\}|//.*\r?\n|"[^"]*"|[\S\s])*?\{(?:\{[^}]*\}|//.*\r?\n|"[^"]*"|[\S\s])*?)\}
See demo on RegexStorm
The regex will ignore all { and } in the string literals and //-like comments, and will consume {...} blocks. The only thing it does not support is /*...*/ multiline comments. Please let me know if you also need to account for them.
The bad news is that you can't do that with the Search-And-Replace feature because it doesn't support balancing groups. You can write a separate program in C# that does it for you.
The construct to get the matching closing brace is:
(?=\{)(?:(?<open>\{)|(?<-open>\})|[^\{\}])+?(?(open)(?!))
This matches a block of {...}. But as @DmitryBychenko mentioned it doesn't respect comments or strings.
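Here is a sketch of what such a separate C# program might do with a balancing group: find each [attribute] method and insert text right after the opening brace and right before the matching closing brace. The names (MethodWrapper, Wrap) are invented, and the pattern is a common variant of the balancing construct above rather than the exact regex from either answer. Like that construct, it does not account for braces inside strings or comments.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class MethodWrapper
{
    static readonly Regex MethodWithAttribute = new Regex(
        @"(?<head>\[attribute\][^{]*\{)" +                  // attribute, signature, opening brace
        @"(?<body>(?>[^{}]+|(?<open>\{)|(?<-open>\}))*)" +  // body; nested braces push/pop 'open'
        @"(?(open)(?!))" +                                  // fail if any nested brace is still open
        @"(?<tail>\})");                                    // the matching closing brace

    // Inserts 'first' at the start and 'last' at the end of every matched method body.
    public static string Wrap(string source, string first, string last)
    {
        return MethodWithAttribute.Replace(source, m =>
            m.Groups["head"].Value + first +
            m.Groups["body"].Value + last +
            m.Groups["tail"].Value);
    }
}
```

A MatchEvaluator lambda is used instead of a replacement string so that the inserted code is taken literally and does not need $-escaping.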

Extracting content from text files with generic rules

I have a lot of text data with different structures. I need to extract parts of these texts based on some text-based rules. I would use regular expressions, but unfortunately the people who are using the application have never heard of them.
Basically the app does the following thing:
Load the data into a textbox
Type the structure of the output as a simple set of rules into another textbox
Receive the results in a 3rd textbox
Examples of data structures (I have megabytes of this data):
Label1: value1, measurement
Label2; value2; something else
Nr, value3 (comment)
...
I need some other approach that I could use instead of regular expressions. It can be extremely simple because all I need is one value from every row.
From the example above I have to obtain the following structure:
"value1, value2, value3"
Is there a simpler alternative to regex? Did someone already implement something like this?
I can also imagine that I am approaching the problem from the wrong angle, like forcing the simple user to write data extraction rules. In this case the question is transformed to something more generic like "How can I build an application that lets a very simple user extract data from separate texts?"
Edit:
I have implemented the following, simplest possible matching for them:
File content:
"Strain at break Ax2";"Unknown"
"Strain at break Ax1";"Unknown"
"Strain at break";"Unknown"
"Yield point strain";"Unknown"
"Uniform elongation";25.4087;"%"
"Tensile strength";261.323;"MPa"
"End test phase Yield point";1;"%"
"Maximum tensile force";5.22647;"kN"
Pattern:
"Tensile strength";(?<value>[^;\n]*);
"Maximum tensile force";(?<value>[^;\n]*);
Still too complex. The problem is if I start replacing the ugly part with another string to obtain for example:
"Tensile strength", [First value after]
I lose all the generic nature of the extraction because every file looks different from this one.
Take a look at the FileHelpers library. It allows runtime generation of file layouts and I think the one that would help in your example is the DelimitedClassBuilder.
In your case, I'd probably use FileHelpers to parse the record definitions into the DelimitedClassBuilder and then use the result to parse your records.
I have solved the issue by defining the rules as regular expressions. After the rules were defined I defined a wrapper rule-set that was easier to read by the users.
Ex. to extract a value from a line
Maximum amount of Sheet Drawing Force= 35.659695[kN]
I defined the regular expression
{0}=\s*(?<value>[^[\n\r]*)
then let the user define the name of the field. The {0} placeholder was then replaced with the name of the field and the regular expression applied.
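Roughly, the placeholder substitution can be sketched like this. The names (FieldExtractor, Extract) are made up for illustration, and the use of Regex.Escape on the field name is an addition of this sketch so that names containing characters like ( or . do not change the pattern.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class FieldExtractor
{
    // Rule template from above; {0} stands for the user-supplied field name.
    const string Template = @"{0}=\s*(?<value>[^[\n\r]*)";

    // Returns the text after "<fieldName>=" up to the first '[' or line end,
    // or null when the field is not found.
    public static string Extract(string text, string fieldName)
    {
        // string.Replace is used instead of string.Format so that other
        // braces in a template (such as quantifiers like \d{2}) are left alone.
        string pattern = Template.Replace("{0}", Regex.Escape(fieldName));
        Match m = Regex.Match(text, pattern);
        return m.Success ? m.Groups["value"].Value.Trim() : null;
    }
}
```

The user only ever types the field name; the regular expression itself stays hidden in the template.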

How to convert words to links?

I have an XML file with two properties: word and link.
How can I replace the words in a text with links using the XML information?
Ex.:
XML
<word>dog</word>
<link>http://www.dog.com</link>
Text: The dog is nice.
Result: The <a href="http://www.dog.com">dog</a> is nice.
Results OK.
The problems:
1- If the text has the word dogs, the result is incorrect because of the "s".
2- I've tried splitting the text by spaces to fix it, but if the word is a compound like new year, the result is incorrect again.
Does anyone have any suggestions to do it and fix these problems (plural and compound words)?
Thanks for the help.
You can use Lucene.Net's contrib package Snowball for stemming (words->word, came->come, having->have, etc.), but you will still have trouble with compound words.
If you roll your own solution, I have had good success with the .NET pluralization capabilities:
http://msdn.microsoft.com/en-us/library/system.data.entity.design.pluralizationservices.pluralizationservice.aspx
Essentially, you can pass a word in its plural form and receive a singular version and vice versa.
This could be fairly intensive depending on how often the content changed, i.e. this wouldn't be a good choice to search thousands of words in real time.
Assuming that you can pre-process/cache the results or that the source file is small, you could:
Run Once
Identify all candidate words from the source file.
Parse/split phrases and pass them through the pluralization libraries to determine their plural counterparts.
Generate (and precompile) simple regular expressions to locate the words that you do want to match. For example, if you want to match "dog" but not "dogs" you could create a regex like dog[^s] which could then be executed against the text.
Run Whenever a Search/Replace is Needed
Run your list of source expressions against the text in question. I would suggest ordering the expressions from shortest to longest (otherwise a short expression may replace a word that was just parsed by a longer expression).
Again, this would be processor intensive to run in real-time (most solutions will be). As always, if you are parsing HTML, you should use an HTML parser, not a regular expression. In this case, you might use a proper parser to locate all text nodes and then perform the search/replace on them.
An alternative solution would be to put the text and keyword list into a database and use SQL Server Full Text Indexing which tends to be pretty smart about these things and supports intelligent match predicates. You could even combine this with a CLR stored procedure to handle things that .NET excels at (like string parsing).
Regardless of the approach, this will not be an exact science.
You're likely going to need a dictionary. Create a text file/XML file that contains both the singular and plural forms of the words you want. At runtime, load them into a Dictionary<String, String>. Then look up the value of <word/> in the dictionary and extract its singular value.
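A minimal sketch of the dictionary idea, under the assumption that the dictionary already lists every form you want linked (singular, plural, compound) together with its URL. The names (WordLinker, Linkify) are invented. Whole-word matching with \b keeps "dog" from matching inside other words, and a compound such as "new year" is simply one more key.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class WordLinker
{
    // 'links' maps each word form (e.g. "dog", "dogs", "new year") to its URL.
    public static string Linkify(string text, IDictionary<string, string> links)
    {
        // Case-insensitive copy so "Dog" finds the entry stored as "dog".
        var lookup = new Dictionary<string, string>(links, StringComparer.OrdinalIgnoreCase);

        // Longest keys first so a longer phrase is tried before a shorter word.
        var alternatives = lookup.Keys
            .OrderByDescending(k => k.Length)
            .Select(k => Regex.Escape(k));

        string pattern = @"\b(?:" + string.Join("|", alternatives) + @")\b";

        return Regex.Replace(
            text,
            pattern,
            m => "<a href=\"" + lookup[m.Value] + "\">" + m.Value + "</a>",
            RegexOptions.IgnoreCase);
    }
}
```

Because everything is done in one pass with a single alternation, text that was already turned into a link is not scanned again by a later, shorter expression.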
