Is there a way to make the code analysis spell-checker accept an acronym that contains a number?
I'm getting CA1704 and CA1709 warnings from code analysis in a C# application where I have identifiers with an acronym that contains a number, for example "CheckAbc2deStatus". CA1704 wants to correct the spelling of Abc, while CA1709 wants "de" changed to "DE". I found "Code analysis, Lost between CA1709 and CA1704" and have tried putting "Abc2de" in the code analysis dictionary as Words/Recognized/Word, Words/Compound/Term, and Acronyms/CasingExceptions/Acronym, but none of those entries makes the code analyzer happy. Other entries in the custom dictionary for "normal" acronyms work as expected.
I got it to work with:
Code:
public static bool CheckABC2DEStatus()
{
    return true;
}
And in the Code Analysis Dictionary:
<Acronyms>
    <CasingExceptions>
        <Acronym>ABC</Acronym>
        <Acronym>DE</Acronym>
    </CasingExceptions>
</Acronyms>
The number seems to be treated as a word break, so I had to put the two halves in separately.
If CheckABC2DEStatus isn't your preferred method name, let me know and I'll try and adjust the dictionary entry accordingly.
That might not be the best way to phrase it, but I'm considering writing a tool that converts identifiers separated by spaces in my code to camel case. A quick example:
var zoo animals = GetZooAnimals(); // i can't help but type this
var zooAnimals = GetZooAnimals(); // i want it to rewrite it like this
I was wondering if writing a tool like this would run into any ambiguities assuming it ignores all keywords. The only reason I can think of is if there is a syntactically valid expression with 2 identifiers only separated by white space.
Looking through the grammar I could not immediately find a place that allows it, but perhaps someone else would know better.
On a side note, I realize this is not a practical solution to a real problem a lot of people have, but just something I do all the time and wanted to take a stab at fixing with tools instead of forcing myself to always write camel case.
It is hard to tell whether a space-separated sequence of identifiers represents a single variable or not without doing full semantic analysis. For example:
Myclass myVariable;
is a pair of space-separated identifiers which are perfectly valid. This would cause an ambiguity if you want to camel-case both type names and variable names.
If one enters:
csharp> var i j = 3;
(1,7): error CS1525: Unexpected symbol `j', expecting `,', `;', or `='
in the csharp interactive shell, one gets an error generated by the parser (an (LA)LR parser keeps track of which tokens it expects next). Such a parser works left to right, so it doesn't know which characters will follow; it only knows that the next token must be one of those listed above.
So there is probably no way to use spaces in a name, at least not when declaring a variable.
Furthermore, based on this context-free grammar for C#, there doesn't seem to be a case where one can place two identifiers next to each other. It is, for instance, possible for a primary expression to be an identifier, but there is no situation where a primary expression is placed next to an identifier (or with an optional part in between).
As @dasblinkenlight says, you can indeed see the rule "local-variable-declaration":
type variable-declarator
with type that can be evaluated to an identifier and variable-declarator starting with an identifier. You can however know that the type is the first identifier (or the var keyword). Some kind of rewrite rule is thus:
(\w+)(\s+\w+)+ -> \1 concat(\2)
where you need to combine (concat) the identifiers of the second group, for instance in the case of an assignment.
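As a rough illustration of that rewrite rule, here is a minimal sketch that only handles `var` declarations followed by `=`, so the first word is always known to be the keyword rather than a type name. The class and method names (`IdentifierJoiner`, `Join`) are made up for this example, and the sketch makes no attempt to skip strings or comments.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class IdentifierJoiner
{
    // "var", then two or more whitespace-separated words, then "=".
    // Declarations with an explicit type are left alone, because there the
    // first word is the type name and must not be merged.
    private static readonly Regex SpacedDeclaration =
        new Regex(@"\bvar\s+(\w+(?:\s+\w+)+)\s*=");

    public static string Join(string line)
    {
        return SpacedDeclaration.Replace(line, match =>
        {
            string[] words = match.Groups[1].Value
                .Split((char[])null, StringSplitOptions.RemoveEmptyEntries);

            // Keep the first word as typed; capitalize the first letter of the rest.
            string name = words[0] + string.Concat(
                words.Skip(1).Select(w => char.ToUpperInvariant(w[0]) + w.Substring(1)));

            return "var " + name + " =";
        });
    }
}
```

With this sketch, `IdentifierJoiner.Join("var zoo animals = GetZooAnimals();")` returns `var zooAnimals = GetZooAnimals();`, while a line such as `var i = 3;` is returned unchanged.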
I'm matching words to create a simple lexical analyzer.
Here is my example code and output.
Example code:
public class
{
    public static void main (String args[])
    {
        System.out.println("Hello");
    }
}
Output:
public = identifier
void = identifier
main = identifier
class = identifier
As you can see, my output is not in the same order as the input: void and main come after class in the input, but in the output class comes last. I want to print each result in the order in which it is matched in the input.
C# code:
private void button1_Click(object sender, EventArgs e)
{
    if (richTextBox1.Text.Contains("public"))
        richTextBox2.AppendText("public = identifier\n");
    if (richTextBox1.Text.Contains("void"))
        richTextBox2.AppendText("void = identifier\n");
    if (richTextBox1.Text.Contains("class"))
        richTextBox2.AppendText("class = identifier\n");
    if (richTextBox1.Text.Contains("main"))
        richTextBox2.AppendText("main = identifier\n");
}
Your code is asking the following questions:
Does the input contain the text "public"? If so, write down "public = identifier".
Does the input contain the text "void"? If so, write down "void = identifier".
Does the input contain the text "class"? If so, write down "class = identifier".
Does the input contain the text "main"? If so, write down "main = identifier".
The answer to all of these questions is yes, and since they're executed in that exact order, the output you get should not be surprising. Note: public, void and class are keywords, not identifiers; only main is an identifier.
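If the only goal is to report those four words in the order in which they occur, one option is to scan the input once instead of asking four separate yes/no questions. The following is a minimal sketch of that idea; the class name `OrderedMatcher` is invented for illustration, and this is still plain word matching, not a tokenizer.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class OrderedMatcher
{
    // One pattern for all four words; \b keeps "main" from matching inside "domain".
    private static readonly Regex Words = new Regex(@"\b(public|void|class|main)\b");

    public static List<string> Describe(string input)
    {
        var lines = new List<string>();

        // Matches() yields results from left to right, i.e. in input order.
        foreach (Match m in Words.Matches(input))
        {
            string kind = m.Value == "main" ? "identifier" : "keyword";
            lines.Add(m.Value + " = " + kind);
        }

        return lines;
    }
}
```

Unlike the Contains checks, this reports every occurrence, so a word that appears twice in the input is listed twice.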
Splitting on whitespace?
So your approach is not going to help you tokenize that input. Something slightly more in the right direction would be input.Split(): that will cut up the input at whitespace boundaries and give you an array of strings. Still, there are a lot of empty entries in there.
input.Split(new char[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries) is a little better, giving us the following output: public, class, {, public, static, void, main, (String, args[]), {, System.out.println("Hello");, } and }.
But you'll notice that some of these strings contain multiple 'tokens': (String, args[]) and System.out.println("Hello");. And if you had a string literal with whitespace in it, it would get split into multiple tokens. Clearly, just splitting on whitespace is not sufficient.
Tokenizing
At this point, you would start writing a loop that goes over every character in the input, checking if it's whitespace or a punctuation character (such as (, ), {, }, [, ], ., ;, and so on). Those characters should be treated as the end of the previous token, and punctuation characters should also be treated as a token of their own. Whitespace can be skipped.
You'll also have to take things like string literals and comments into account: anything in-between two double-quotes should not be tokenized, but be treated as part of a single 'string' token (including whitespace). Also, strings can contain escape sequences, such as \", that produce a single character (that double quote should not be treated as the end of the string, but as part of its content).
Anything that comes after two forward slashes should be ignored (or parsed as a single 'comment' token, if you want to process comments somehow), until the next newline (newline characters/sequences differ across operating systems). Anything after a /* should be ignored until you encounter a */ sequence.
Numbers can optionally start with a minus sign, can contain a dot (or start with a dot), a scientific notation part (e..), which can also be negative, and there are type suffixes...
In other words, you're writing a state machine, with different behaviour depending on what state you're in: 'string', 'comment', 'block comment', 'numeric literal', and so on.
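To make the state-machine idea concrete, here is a small sketch of such a loop. It is only an assumption of how one might start: it handles whitespace, a fixed set of punctuation characters, double-quoted strings with backslash escapes, and both comment styles, and it deliberately ignores numeric literals (so 3.14 would be split at the dot). The names `SimpleTokenizer` and `Tokenize` are invented for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class SimpleTokenizer
{
    // Characters that end the previous token and are tokens themselves.
    private const string Punctuation = "(){}[].;,";

    public static List<string> Tokenize(string input)
    {
        var tokens = new List<string>();
        var current = new StringBuilder();
        int i = 0;

        while (i < input.Length)
        {
            char c = input[i];
            bool hasNext = i + 1 < input.Length;

            if (c == '"')
            {
                // 'string' state: read up to the closing quote, skipping escaped characters.
                Flush(tokens, current);
                int start = i++;
                while (i < input.Length && input[i] != '"')
                    i += input[i] == '\\' ? 2 : 1;
                i = Math.Min(i + 1, input.Length);
                tokens.Add(input.Substring(start, i - start));
            }
            else if (c == '/' && hasNext && input[i + 1] == '/')
            {
                // 'comment' state: ignore everything up to the end of the line.
                Flush(tokens, current);
                while (i < input.Length && input[i] != '\n') i++;
            }
            else if (c == '/' && hasNext && input[i + 1] == '*')
            {
                // 'block comment' state: ignore everything up to the closing */.
                Flush(tokens, current);
                int end = input.IndexOf("*/", i + 2, StringComparison.Ordinal);
                i = end < 0 ? input.Length : end + 2;
            }
            else if (char.IsWhiteSpace(c))
            {
                Flush(tokens, current);
                i++;
            }
            else if (Punctuation.IndexOf(c) >= 0)
            {
                Flush(tokens, current);
                tokens.Add(c.ToString());
                i++;
            }
            else
            {
                current.Append(c);
                i++;
            }
        }

        Flush(tokens, current);
        return tokens;
    }

    // Emits the word collected so far, if any.
    private static void Flush(List<string> tokens, StringBuilder current)
    {
        if (current.Length == 0) return;
        tokens.Add(current.ToString());
        current.Clear();
    }
}
```

For the line `System.out.println("Hello"); // greet`, this sketch yields the tokens System, ., out, ., println, (, "Hello", ), and ;, with the comment dropped.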
Lexing
It's useful to assign a type to each token, either while tokenizing or as a separate step (lexing). public is a keyword, main is an identifier, 1234 is an integer literal, "Hello" is a string literal, and so on. This will help during the next step.
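A classification step along those lines might look like the sketch below. The `TokenKind` enum, the `TokenClassifier` class, and the deliberately short keyword list are all assumptions made for illustration; a real lexer would use the full keyword list of the language being analyzed and would distinguish many more literal kinds.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public enum TokenKind { Keyword, Identifier, IntegerLiteral, StringLiteral, Punctuation }

public static class TokenClassifier
{
    // Partial keyword list, for illustration only.
    private static readonly HashSet<string> Keywords = new HashSet<string>
    {
        "public", "private", "protected", "static", "void", "class", "return", "while"
    };

    public static TokenKind Classify(string token)
    {
        if (string.IsNullOrEmpty(token))
            throw new ArgumentException("Token must not be empty.", nameof(token));

        if (Keywords.Contains(token))
            return TokenKind.Keyword;

        if (token.Length >= 2 && token[0] == '"' && token[token.Length - 1] == '"')
            return TokenKind.StringLiteral;

        if (token.All(char.IsDigit))
            return TokenKind.IntegerLiteral;

        if (char.IsLetter(token[0]) || token[0] == '_')
            return TokenKind.Identifier;

        // Everything else is treated as punctuation in this sketch.
        return TokenKind.Punctuation;
    }
}
```

With this sketch, public is classified as Keyword, main as Identifier, 1234 as IntegerLiteral, and "Hello" (including its quotes) as StringLiteral.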
Parsing
You can now move on to parsing: turning a list of tokens into an abstract syntax tree (AST). At this point you can check if a list of tokens is actually valid code. You basically repeat the above step, but at a higher level.
For example, public, protected and private are keyword tokens, and they're all access modifiers. As soon as you encounter one of these, you know that either a class, a function, a field or a property definition must follow. If the next token is a while keyword, then you signal an error: public while is not a valid C# construct. If, however, the next token is a class keyword, then you know it's a class definition and you continue parsing.
So you've got a state machine once again, but this time you've got states like 'class definition', 'function definition', 'expression', 'binary expression', 'unary expression', 'statement', 'assignment statement', and so on.
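The access-modifier example can be sketched as a single parser state. This is a toy under stated assumptions, not a real C# parser: the names `DeclarationParser` and `ParseDeclarationKind` are hypothetical, the list of statement keywords is partial, and anything that is neither `class` nor a statement keyword is simply reported as a member definition.

```csharp
using System;
using System.Collections.Generic;

public static class DeclarationParser
{
    private static readonly HashSet<string> AccessModifiers =
        new HashSet<string> { "public", "protected", "private" };

    // Keywords that start a statement and so cannot follow an access modifier (partial list).
    private static readonly HashSet<string> StatementKeywords =
        new HashSet<string> { "while", "if", "for", "return" };

    public static string ParseDeclarationKind(IReadOnlyList<string> tokens)
    {
        if (tokens.Count < 2 || !AccessModifiers.Contains(tokens[0]))
            throw new FormatException("Expected an access modifier followed by a declaration.");

        string next = tokens[1];

        if (next == "class")
            return "class definition";

        if (StatementKeywords.Contains(next))
            throw new FormatException($"'{tokens[0]} {next}' is not a valid C# construct.");

        // A function, field, or property; telling them apart needs more lookahead.
        return "member definition";
    }
}
```

Calling it with the tokens public, class, Foo returns "class definition", while public, while throws a FormatException.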
Conclusion
This is by no means complete, but hopefully it'll give you a better idea of all the steps involved and how to approach this. There are also tools available that can generate parsing code from a grammar specification, which can ease the job somewhat (though you still need to learn how to write such grammars).
You may also want to read the C# language specification, specifically the part about its grammar and lexical structure. The spec can be downloaded for free from one of Microsoft's websites.
CodeCaster is right. You are not on the right path.
I have a lexical analyzer that I wrote some time ago as a project.
I know, I know, I'm not supposed to hand things over on a plate here, but the analyzer is for C++, so you'll have to change a few things.
Take a look at the source code and please try to understand how it works at least: C++ Lexical Analyzer
In the strictest sense, the reason for the described behaviour is that in the evaluating code, the search for void comes before the search for class. However, the approach in total seems far too simple for a lexical analysis, as it simply checks for substrings. I totally second the comments above; depending on what you are trying to achieve in the big picture, a more sophisticated approach might be necessary.
I am writing a C# program that runs over C++ source files and looks for the following things:
#define SOMETHING_A 99
and
typedef enum {
EX_A,
EX_B,
EX_C,
EX_D,
EX_E
} Examples;
and
enum EXAMPLE2
{
EX2_A=0,
EX2_B=1,
EX2_C=2,
EX2_D=3,
EX2_LAST = EX2_D
};
My objective is to get the following list of pairs as output:
{SOMETHING_A,99}
{EX_A,0}
{EX_B,1}
..
..
{EX2_A,0}
{EX2_B,1}
..
..
Can you help me to find the correct regular expressions that match the above 3 patterns?
If you want a solution that will work on any C++ file, use a parser instead of regexes. There are just too many possibilities to account for (different code styles, code that is commented out, etc.).
If you only want to do this on a known set of files, and they have a predictable format and style, a regex is probably ok. Actually, you are better off using several regexes:
/^#define\s+(\S+)\s+(\S+)/
This only matches define statements that are at the beginning of a line.
Here is the typedef enum:
/^\s*typedef\s+enum\s*\{[^\}]+\}[^;]+;/
(It's not clear what you want to grab from this one, so I haven't captured anything).
And here is the enum. This is best done in two steps:
/^\s*enum\s+(\S+)\s*\{\s*([^\}]+?)\s*\}\s*;/
The first step gets the name of the enum in the first capture group and the content in the second group. Perform a regex on the second capture group to get the fields and values:
/(\S+)\s*=\s*([^\s\,]+)/
Each match of this will give you one name/value pair.
These regexes should handle your examples, and they should do a decent job of handling the most common usage in C++ code. But they are not perfect; if you want a solution that covers all possible constructs, don't use a regex.
Note: make sure the single-line matching flag is off when using these, so that ^ matches at the start of every line. In .NET, that means enabling RegexOptions.Multiline.
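Since the question is about a C# program, here is a minimal sketch of how the #define pattern and the two-step enum patterns might be applied with .NET's Regex class. The class name `CppConstantScanner` and the choice of returning key/value pairs are assumptions for illustration. The sketch only reports enum members that have an explicit `= value`; numbering members with implicit values (as in the typedef enum example) is left out, and because \s also matches newlines, a #define without a value would pick up the first word of the next line.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CppConstantScanner
{
    // RegexOptions.Multiline makes ^ match at the start of every line.
    private static readonly Regex Define =
        new Regex(@"^#define\s+(\S+)\s+(\S+)", RegexOptions.Multiline);

    private static readonly Regex NamedEnum =
        new Regex(@"^\s*enum\s+(\S+)\s*\{\s*([^\}]+?)\s*\}\s*;", RegexOptions.Multiline);

    private static readonly Regex Field =
        new Regex(@"(\S+)\s*=\s*([^\s\,]+)");

    public static List<KeyValuePair<string, string>> Scan(string source)
    {
        var pairs = new List<KeyValuePair<string, string>>();

        foreach (Match define in Define.Matches(source))
            pairs.Add(new KeyValuePair<string, string>(
                define.Groups[1].Value, define.Groups[2].Value));

        // Step 1: find each enum body. Step 2: pull name/value pairs out of it.
        foreach (Match enumMatch in NamedEnum.Matches(source))
            foreach (Match field in Field.Matches(enumMatch.Groups[2].Value))
                pairs.Add(new KeyValuePair<string, string>(
                    field.Groups[1].Value, field.Groups[2].Value));

        return pairs;
    }
}
```

Values are returned as written in the source, so an entry such as EX2_LAST = EX2_D comes back as the pair {EX2_LAST, EX2_D} and would still need to be resolved to a number.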
FxCop 10 is complaining about the following:
using XYZ.Blah; //CA1709 - "XYZ"
using Xyz.Blah; //No complaint.
using XylophoneSuperDuperLongFullName.Blah; //I don't want to have a long full name for my company name.
The problem is... I want my company name to show up in all UPPERCASE because XYZ is an abbreviation. The long version of the name is much too long to be a useful namespace. Microsoft gets away with this kind of stuff because their acronym is only 2 letters.
using MS.Something; //No Complaint.
using Microsoft.SomethingElse; //No Complaint.
So, I was looking at adding a SuppressMessageAttribute to suppress this warning. But I'm not sure how to do so properly (or even where to put it) so that it ONLY affects this one instance. I don't want to suppress anything within that namespace, because I want to catch any other mistakes I make. I did look at MSDN and searched Google, but I can't find anything that shows how to target just this instance. The closest I found was Scope = "namespace", but I wasn't sure whether that affects the actual namespace name or everything WITHIN that namespace.
MSDN - CA1709: Identifiers should be cased correctly:
It is safe to suppress this warning if
you have your own naming conventions,
or if the identifier represents a
proper name, for example, the name of
a company or a technology.
You can also add specific terms,
abbreviations, and acronyms that to a
code analysis custom dictionary. Terms
specified in the custom dictionary
will not cause violations of this
rule. For more information, see How
to: Customize the Code Analysis
Dictionary.
That being said, if you feel justified to suppress the message, it really isn't hard at all. In FxCop 10 right click on any message you want to suppress and go to Copy As>Suppress-Message or Copy As>Module-level Suppress Message.
You should place the SuppressMessageAttributes in the appropriate locations. Attributes that suppress a single location should be placed on that location, for example, above a method, field, property, or class.
In your instance, there is no specific location to place the attribute (by default it should copy over as [module: SuppressMessage(...)]). This is a good indication that it belongs either at the top of a file, if it is a module-level suppression particular to that file (for example, to a resource specific to a file), or, more likely, in a GlobalSuppressions.cs file.
using System.Diagnostics.CodeAnalysis;
[module: SuppressMessage("Microsoft.Naming", "CA1709:IdentifiersShouldBeCasedCorrectly", Justification = "Because I said so!", MessageId = "XYZ", Scope = "namespace", Target = "XYZ.Blah")]
You can also shorten the CheckId property if you want to, but it's good to know what CA1709 means. If you don't feel like it, this also works:
using System.Diagnostics.CodeAnalysis;
[module: SuppressMessage("Microsoft.Naming", "CA1709", Justification = "Because I said so!", MessageId = "XYZ", Scope = "namespace", Target = "XYZ.Blah")]
And lastly... all this will be fruitless unless you include the 'CODE_ANALYSIS' symbol in your build. Go to Properties>Build and add the conditional compilation symbol.
Acronyms aren't meant to be all upper case in .NET naming conventions; see HttpResponse, for example.
From the capitalization conventions:
Casing of acronyms depends on the length of the acronym. All acronyms are at least two characters long. For the purposes of these guidelines, if an acronym is exactly two characters, it is considered a short acronym. An acronym of three or more characters is a long acronym.
The following guidelines specify the proper casing for short and long acronyms. The identifier casing rules take precedence over acronym casing rules.
Do capitalize both characters of two-character acronyms, except the first word of a camel-cased identifier.
A property named DBRate is an example of a short acronym (DB) used as the first word of a Pascal-cased identifier. A parameter named ioChannel is an example of a short acronym (IO) used as the first word of a camel-cased identifier.
Do capitalize only the first character of acronyms with three or more characters, except the first word of a camel-cased identifier.
A class named XmlWriter is an example of a long acronym used as the first word of a Pascal-cased identifier. A parameter named htmlReader is an example of a long acronym used as the first word of a camel-cased identifier.
If you were checking names via StyleCop, you could use StyleCop+ (custom rules), which supports a configurable list of abbreviations.
I am writing a code generator in which the variable names are given by the user.
Previous answers have suggested using Regex or CodeDomProvider. The former will tell you if the identifier is valid but doesn't check keywords; the latter checks keywords but doesn't appear to check all types known to the code.
How to determine if a string is a valid variable name?
For instance, a user could name a variable List, or Type, but that is not desirable. How would I prevent this?
The easiest way is to add a list of C# keywords to your application. MSDN has a complete list here.
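A keyword list combines naturally with a shape check and a deny-list of type names. The sketch below is only an assumption of how that might look: `IdentifierValidator` is an invented name, the keyword set is a small excerpt rather than the complete list, the shape check is ASCII-only (C# itself allows many Unicode letters), and `ReservedTypeNames` is a hypothetical list you would fill with whatever types your generated code depends on.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class IdentifierValidator
{
    // Simplified, ASCII-only shape check. \A and \z anchor to the whole string.
    private static readonly Regex Shape = new Regex(@"\A[A-Za-z_][A-Za-z0-9_]*\z");

    // Excerpt only; a real tool should contain the complete C# keyword list.
    private static readonly HashSet<string> Keywords = new HashSet<string>
    {
        "class", "int", "private", "public", "return", "string", "void", "while"
    };

    // Hypothetical deny-list of type names the generated code relies on.
    private static readonly HashSet<string> ReservedTypeNames = new HashSet<string>
    {
        "List", "Type", "String", "Object"
    };

    public static bool IsAcceptable(string name)
    {
        return !string.IsNullOrEmpty(name)
            && Shape.IsMatch(name)
            && !Keywords.Contains(name)
            && !ReservedTypeNames.Contains(name);
    }
}
```

With these lists, zooAnimals is accepted, while class, List, and 2fast are rejected.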
If you really want to get fancy, you could dynamically compile your generated code and check for the specific errors that you're concerned about. In this case, you're specifically looking for error CS1041:
error CS1041: Identifier expected; '**' is a keyword
You'll probably want to ignore any errors regarding unresolved references, undeclared identifiers, etc.
As others have suggested, you could just prepend your identifiers with @, which is fine if you don't want the user to examine the generated code. If it's something they're going to have to maintain, however, I'd avoid that as (in my opinion) it makes the code noisy, just like $ all over the place in PHP or guys that insist on putting this. in front of every freaking field reference.
I'm not sure there is a full API available which will give you what you're looking for. However, the end result you seem to be looking for is generated code that does not conflict with reserved C# keywords or existing types. If that is the case, one approach you can take is to escape all identifiers given by the user with the @ symbol. This allows even reserved keywords in C# to be treated as identifiers.
For example, the following is a completely valid C# program:
class Program
{
    static void Main(string[] args)
    {
        int @byte = 42;
        int @string = @byte;
        int @Program = 0;
    }
}
One option here would be to have your code generator prefix the user-specified name with @. As described in 2.4.2, the @ sign (verbatim identifier):
The prefix "@" enables the use of keywords as identifiers, which is useful when interfacing with other programming languages. The character @ is not actually part of the identifier, so the identifier might be seen in other languages as a normal identifier, without the prefix. An identifier with an @ prefix is called a verbatim identifier. Use of the @ prefix for identifiers that are not keywords is permitted, but strongly discouraged as a matter of style.
This would allow you to check for the main keywords, and deny them as needed, but not worry about all of the conflicting type information, etc.
You could just prepend an @ character to the variable name; for instance, @private is a valid variable name.