i'm matching words to create simple lexical analyzer.
here is my example code and output
example code:
public class
{
public static void main (String args[])
{
System.out.println("Hello");
}
}
output:
public = identifier
void = identifier
main = identifier
class = identifier
as you all can see my output is not arranged as the input comes. void and main comes after class but in output the class comes at the end. i want to print result as the input is matched.
c# code:
private void button1_Click(object sender, EventArgs e)
{
if (richTextBox1.Text.Contains("public"))
richTextBox2.AppendText("public = identifier\n");
if (richTextBox1.Text.Contains("void"))
richTextBox2.AppendText("void = identifier\n");
if (richTextBox1.Text.Contains("class"))
richTextBox2.AppendText("class = identifier\n");
if (richTextBox1.Text.Contains("main"))
richTextBox2.AppendText("main = identifier\n");
}
Your code is asking the following qustions:
Does the input contain the text "public"? If so, write down "public = identifier".
Does the input contain the text "void"? If so, write down "void = identifier".
Does the input contain the text "class"? If so, write down "class = identifier".
Does the input contain the text "main"? If so, write down "main = identifier".
The answer to all of these questions is yes, and since they're executed in that exact order, the output you get should not be surprising. Note: public, void, class and main are keywords, not identifiers.
Splitting on whitespace?
So your approach is not going to help you tokenize that input. Something slightly more in the right direction would be input.Split() - that will cut up the input at whitespace boundaries and give you an array of strings. Still, there's a lot of whitespace entries in there.
input.Split(new char[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries) is a little better, giving us the following output: public, class, {, public, static, void, main, (String, args[]), {, System.out.println("Hello");, } and }.
But you'll notice that some of these strings contain multiple 'tokens': (String, args[]) and System.out.println("Hello");. And if you had a string with whitespace in it it would get split into multiple tokens. Apparently, just splitting on whitespace is not sufficient.
Tokenizing
At this point, you would start writing a loop that goes over every character in the input, checking if it's whitespace or a punctuation character (such as (, ), {, }, [, ], ., ;, and so on). Those characters should be treated as the end of the previous token, and punctuation characters should also be treated as a token of their own. Whitespace can be skipped.
You'll also have to take things like string literals and comments into account: anything in-between two double-quotes should not be tokenized, but be treated as part of a single 'string' token (including whitespace). Also, strings can contain escape sequences, such as \", that produce a single character (that double quote should not be treated as the end of the string, but as part of its content).
Anything that comes after two forward slashes should be ignored (or parsed as a single 'comment' token, if you want to process comments somehow), until the next newline (newline characters/sequences differ across operating systems). Anything after a /* should be ignored until you encounter a */ sequence.
Numbers can optionally start with a minus sign, can contain a dot (or start with a dot), a scientific notation part (e..), which can also be negative, and there are type suffixes...
In other words, you're writing a state machine, with different behaviour depending on what state you're in: 'string', 'comment', 'block comment', 'numeric literal', and so on.
Lexing
It's useful to assign a type to each token, either while tokenizing or as a separate step (lexing). public is a keyword, main is an identifier, 1234 is an integer literal, "Hello" is a string literal, and so on. This will help during the next step.
Parsing
You can now move on to parsing: turning a list of tokens into an abstract syntax tree (AST). At this point you can check if a list of tokens is actually valid code. You basically repeat the above step, but at a higher level.
For example, public, protected and private are keyword tokens, and they're all access modifiers. As soon as you encounter one of these, you know that either a class, a function, a field or a property definition must follow. If the next token is a while keyword, then you signal an error: public while is not a valid C# construct. If, however, the next token is a class keyword, then you know it's a class definition and you continue parsing.
So you've got a state machine once again, but this time you've got states like 'class definition', 'function definition', 'expression', 'binary expression', 'unary expression', 'statement', 'assignment statement', and so on.
Conclusion
This is by no means complete, but hopefully it'll give you a better idea of all the steps involved and how to approach this. There are also tools available that can generate parsing code from a grammar specification, which can ease the job somewhat (though you still need to learn how to write such grammars).
You may also want to read the C# language specification, specifically the part about its grammar and lexical structure. The spec can be downloaded for free from one of Microsofts websites.
CodeCaster is right. You are not on the right path.
I have an lexical analyzer made by me some time ago as a project.
I know, I know I'm not supposed to put things on a plate here, but the analyzer is for c++ so you'll have to change a few things.
Take a look at the source code and please try to understand how it works at least: C++ Lexical Analyzer
In the strictest sense, the reason for the described behaviour is that in the evaluating code, the search for void comes before the search for class. However, the approach in total seems far too simple for a lexical analysis, as it simply checks for substrings. I totally second the comments above; depending on what you are trying to achieve in the big picture, a more sophisticated approach might be necessary.
Related
I've got a ton of json files that, due to a UI bug with the program that made them, often have text that was accidentally pasted twice in a row (no space separating them).
Example: {FolderLoc = "C:\testC:\test"}
I'm wondering if it's possible for a regular expression to match this. It would be per-line. If I can do this, I can use FNR, which is a batch text processing tool that supports .NET RegEx, to get rid of the accidental duplicates.
I regret not having an example of one of my attempts to show, but this is a very unique problem and I wasn't able to find anything on search engines resembling it to even start to base a solution off of.
Any help would be appreciated.
Can collect text along the string (.+ style) followed by a lookahead check for what's been captured up to that point, so what would be a repetition of it, like
/(.+)(?=\1)/; # but need more restrictions
However, this gets tripped even just on double leTTers, so it needs at least a little more. For example, our pattern can require the text which gets repeated to be at least two words long.
Here is a basic and raw example. Please also see the note on regex at the end.
use warnings;
use strict;
use feature 'say';
my #lines = (
q(It just wasn't able just wasn't able no matter how hard it tried.),
q(This has no repetitions.),
q({FolderLoc = "C:\testC:\test"}),
);
my $re_rep = qr/(\w+\W+\w+.+)(?=\1)/; # at least two words, and then some
for (#lines) {
if (/$re_rep/) {
# Other conditions/filtering on $1 (the capture) ?
say $1
}
}
This matches at least two words: word (\w+) + non-word-chars + word + anything. That'll still get some legitimate data, but it's a start that can now be customized to your data. We can tweak the regex and/or further scrutinize our catch inside that if branch.
The pattern doesn't allow for any intervening text (the repetition must follow immediately), what is changed easily if needed; the question is whether then some legitimate repetitions could get flagged.
The program above prints
just wasn't able
C:\test
Note on regex This quest, to find repeated text, is much too generic
as it stands and it will surely pick on someone's good data. It is enough to note that I had to require at least two words (with one word that that is flagged), which is arbitrary and still insufficient. For one, repeated numbers realistically found in data files (3,3,3,3,3) will be matched as well.
So this needs further specialization, for what we need to know about data.
That might not be the best way to phrase it, but I'm considering writing a tool that converts identifiers separated by spaces in my code to camel case. A quick example:
var zoo animals = GetZooAnimals(); // i can't help but type this
var zooAnimals = GetZooAnimals(); // i want it to rewrite it like this
I was wondering if writing a tool like this would run into any ambiguities assuming it ignores all keywords. The only reason I can think of is if there is a syntactically valid expression with 2 identifiers only separated by white space.
Looking through the grammar I could not immediately find a place that allows it, but perhaps someone else would know better.
On a side note, I realize this is not a practical solution to a real problem a lot of people have, but just something I do all the time and wanted to take a stab at fixing with tools instead of forcing myself to always write camel case.
It is hard to tell whether a space-separated sequence of identifiers represents a single variable or not without doing full semantic analysis. For example
Myclass myVariable;
is a pair of space-separated identifiers which are perfectly valid. This would cause an ambiguity if you want to camel-case both type names and variable names.
If one enters:
csharp> var i j = 3;
(1,7): error CS1525: Unexpected symbol `j', expecting `,', `;', or `='
in the csharp interactive shell, one gets an error generated by the parser (a (LA)LR parser does bookkeeping what to expect next). Such parser works left-to-right so it doesn't know which characters to come next. It simply knows that the next characters are one of the list shown above.
So that means that there is probably no way to - at least declare a variable - with spaces.
Furthermore based on this context-free grammar for C# there doesn't seem to be a case where one can place two identifiers next to each other. It is for instance possible that a primary expressions is an identifier, but there is no situation where a primary expression is placed next to an identifier (or with an optional part in between).
As #dasblinkenlight says, you can indeed see the rule "local-variable-declaration":
type variable-declarator
with type that can be evaluated to an identifier and variable-declarator starting with an identifier. You can however know that the type is the first identifier (or the var keyword). Some kind of rewrite rule is thus:
(\w+)(\s+\w+)+ -> \1 concat(\2)
where you need to combine (concat) the identifiers of the second group. In case of an assignment.
This question already has answers here:
How to get distinct characters?
(9 answers)
Closed 8 years ago.
Lets say we have variable myString="blabla" or mystring=998769
myString.Length; //will get you your result
myString.Count(char.IsLetter); //if you only want the count of letters:
How to get, unique character count? I mean for "blabla" result must be 3, doe "998769" it will be 4. Is there ready to go function? any suggestions?
You can use LINQ:
var count = myString.Distinct().Count();
It uses a fact, that string implements IEnumerable<char>.
Without LINQ, you can do the same stuff Distinct does internally and use HashSet<char>:
var count = (new HashSet<char>(myString)).Count;
If you handle only ANSI text in English (or characters from BMP) then 80% times if you write:
myString.Distinct().Count()
You will live happy and won't ever have any trouble. Let me post this answer only for who will really need to handle that in the proper way. I'd say everyone should but I know it's not true (quote from Wikipedia):
Because the most commonly used characters are all in the Basic Multilingual Plane, handling of surrogate pairs is often not thoroughly tested. This leads to persistent bugs and potential security holes, even in popular and well-reviewed application software (e.g. CVE-2008-2938, CVE-2012-2135)
Problem of our first naïve solution is that it doesn't handle Unicode properly and it also doesn't consider what user perceive as character. Let's try "𠀑".Distinct().Count() and your code will wrongly return...2 because its UTF-16 representation is 0xD840 0xDC11 (BTW each of them, alone, is not a valid Unicode character because they're high and low surrogate, respectively).
Here I won't be very strict about terms and definitions so please refer to www.unicode.org as reference. For a (much) more broad discussion please read How can I perform a Unicode aware character by character comparison?, encoding isn't only issue you have to consider.
1) It doesn't take into account that .NET System.Char doesn't represent a character (or more specifically a grapheme) but a code unit of a UTF-16 encoded text (possible, for example, with ideographic characters). Often they coincide but now always.
2) If you're counting what user thinks (or perceives) as a character then this will fail again because it doesn't check combined characters like ا́ (many examples of this in Arabic language). There are duplicates that exists for historical reasons: for example é it's both a single Unicode code point and a combination (then that code will fail).
3) We're talking about a western/American definition of character. If you're counting characters for end-users you may need to change your definition to what they expect (for example in Korean language definition of character may not be so obvious, another example is Czech text ch that is always counted as a single character). Finally don't forget some strange things when you convert characters to upper case/lower case (for example in German language ß is SS in upper case, see also this post).
Encoding
C# strings are encoded as UTF-16 (char is two bytes) but UTF-16 isn't a fixed size encoding and char should be properly called code unit. What does it mean? That you may have a string where Length is 2 but actually user will see (and it's actually is) just one character (then count should be 1).
If you need to handle this properly then you have to make things much more complicated (and slow). Fortunately Char class has some helpful methods to handle surrogates.
Following code is untested (and for illustration purposes so absolutely not optimized, I'm sure it can be done much better than this) so get it just as starting point for further investigations:
int CountCharacters(string text)
{
HashSet<string> characters = new HashSet<string>();
string currentCharacter = "";
for (int i = 0; i < text.Length; ++i)
{
if (Char.IsHighSurrogate(text, i))
{
// Do not count this, next one will give the full pair
currentCharacter = text[i].ToString();
continue;
}
else if (Char.IsLowSurrogate(text, i))
{
// Our "character" is encoded as previous one plus this one
currentCharacter += text[i];
}
else
currentCharacter = text[i].ToString();
if (!characters.Contains(currentCharacter))
characters.Add(currentCharacter);
}
return characters.Count;
}
Note that this example doesn't handle duplicates (when same character may have different codes or can be a single code point or a combined character).
Combined Characters
If you have to handle combined characters (and of course encoding) then best way to do it is to use StringInfo class. You'll enumerate (and then count) both combined and encoded characters:
StringInfo.GetTextElementEnumerator(text).Walk()
.Distinct().Count();
Walk() is a trivial to implement extension method that simply walks through all IEnumerator elements (we need it because GetTextElementEnumerator() returns IEnumerator instead of IEnumerable).
Please note that after text has been properly splitted it can be counted with our first solution (the point is that brick isn't char but a sequence of char (for simplicity here returned as string itself). Again this code doesn't handle duplicates.
Culture
There is not much you can do to handle issues listed at point 3. Each language has its own rules and to support them all can be a pain. More examples about culture issues on this longer specific post.
It's important to be aware of them (so you have to know little bit about languages you're targeting) and don't forget that Unicode and few translated resx files won't make your application global.
If text processing is important in your application you can solve many issues using specialized DLLs for each locale you support (to count characters, to count words and so on) like Word Processors do. For example, issues I listed can be simply solved using dictionaries. What I usually do is to do not use standard .NET functions for strings (also because of some bugs), I create a Unicode class with static methods for everything I need (character counting, conversions, comparison) and many specialized derived classes for each supported language. At run-time that static methods will user current thread culture name to pick proper implementation from a dictionary and to delegate work to that. A skeleton may be something like this:
abstract class Unicode
{
public static string CountCharacters(string text)
{
return GetConcreteClass().CountCharactersCore(text);
}
protected virtual string CountCharactersCore(string text)
{
// Default implementation, overridden in derived classes if needed
return StringInfo.GetTextElementEnumerator(text).Cast<string>()
.Distinct().Count();
}
private Dictionary<string, Unicode> _implementations;
private Unicode GetConcreteClass()
{
string cultureName = Thread.Current.CurrentCulture.Name;
// Check if concrete class has been loaded and put in dictionary
...
return _implementations[cultureName];
}
}
If you're using C# then Linq comes nicely to the rescue - again:
"blabla".Distinct().Count()
will do it.
I am writing a code generator in which the variable names are given by the user.
Previous answers have suggested using Regex or CodeDomProvider, the former will tell you if the identifier is valid, but doesn't check keywords, the latter checks keywords, but doesn't appear to check all Types known to the code.
How to determine if a string is a valid variable name?
For instance, a user could name a variable List, or Type, but that is not desirable. How would I prevent this?
The easiest way is to add a list of C# keywords to your application. MSDN has a complete list here.
If you really want to get fancy, you could dynamically compile your generated code and check for the specific errors that you're concerned about. In this case, you're specifically looking for error CS1041:
error CS1041: Identifier expected; '**' is a keyword
You'll probably want to ignore any errors regarding unresolved references, undeclared identifiers, etc.
As others have suggested, you could just prepend your identifiers with #, which is fine if you don't want the user to examine the generated code. If it's something they're going to have to maintain, however, I'd avoid that as (in my opinion) it makes the code noisy, just like $ all over the place in PHP or guys that insist on putting this. in front of every freaking field reference.
I'm not sure there is a full API available which will give you what you're looking for. However the end result you seem to be looking for is the generation of code which will not cause conflicts with reserved C# keywords or existing types. If that is the case one approach you can take is to escape all identifiers given by the user with the # symbol. This allows even reserved keywords in C# to be treated as identifiers.
For example the following is completely valid C# program
class Program
{
static void Main(string[] args)
{
int #byte = 42;
int #string = #byte;
int #Program = 0;
}
}
One option here would be to have your code generator prefix the user-specified name with #. As described in 2.4.2, the # sign (verbatim identifier):
prefix "#" enables the use of keywords as identifiers, which is useful when interfacing with other programming languages. The character # is not actually part of the identifier, so the identifier might be seen in other languages as a normal identifier, without the prefix. An identifier with an # prefix is called a verbatim identifier. Use of the # prefix for identifiers that are not keywords is permitted, but strongly discouraged as a matter of style.
This would allow you to check for the main keywords, and deny them as needed, but not worry about all of the conflicting type information, etc.
You could just prepend a # character to the variable - for instance, #private is a valid variable name.
I'm working on a function that given some settings - such as line spacing, the output (in string form) is modified. In order to test such scenarios, I'm using string literals, as shown below for the expected result.
The method, using a string builder, (AppendLine) generates the said output. One issue I have run into is that of comparing such strings. In the example below, both are equal in terms of what they represent. The result is the area which I care about, however when comparing two strings, one literal, one not, equality naturally fails. This is because one of the strings emits line spacing, while the other only demonstrates the formatting it contains.
What would be the best way of solving this equality problem? I do care about formatting such as new lines from the result of the method, this is crucially important.
Code:
string expected = #"Test\n\n\nEnd Test.";
string result = "Test\n\n\nEnd Test";
Console.WriteLine(expected);
Console.WriteLine(result);
Output:
Test\n\n\nEnd Test.
Test
End Test
The # prefix tells the compiler to take the string exactly as it is written. So, it doesn't format the \n characters to carriage returns and line feeds.
Since you don't have the same prefix for the string assigned to your result variable, the compiler formats it. If you would like to continue to use the # prefix, just do the following:
string expected = #"Test
End Test";
You'll have to input the carriage returns and line feed within the string as invisible characters.
You're using the term "literal" incorrectly. "Literal" simply means an actual value that exists in code. In other words, values exist in code either as variables (for the sake of simplicity I'm including constants in this group) and literals. Variables are an abstract notion of a value, whereas literals are a value.
All this is to say that both of your strings are string literals, as they're hard-coded into your application. The # prefix simply states that the compiler is to include escape characters (indeed, anything other than a double-quote) in the string, rather than evaluating the escape sequences when compiling the string literal into the assembly.
First of all, whatever your function returns (either a string that contains standard escape sequences for newlines rather than newlines themselves, or a string that actually contains newlines) is what your test variable should contain. Make your tests as close to the actual output as possible, as the more work you do to massage the values into a comparable form the more code paths you have to test. If you're looking to be able to compare a string with formatting escape sequences embedded into it to a string where those sequences have been evaluated (essentially comparing the two strings in your example), then I would say this:
Be sure that this is really want you want to do.
You'll have to duplicate the functionality of the C# compiler in interpreting these values and turning your "format string" into a "formatted string".
For doing #2, a RegEx processor is probably going to be the simplest option. See this page for a list of C# string escape sequences.
I feel somewhat enlightened, yet annoyed at what I discovered.
This is my first project using MSTest, and after a failing test I was selecting View Test Details to see how and why my test failed. The formatting for string output in this details display is very poor, for example you get:
Assert.AreEqual failed. Expected:<TestTest End>. Actual:<TestTest End>.
This is for formatted text - the strange thing is if you have /r (line feeds) instead of line breaks (/n) the formatting is actually somewhat correct.
It turns out to view the correct output you need to run the tests in debug mode. In other words, when you have a failing test, run the test in debug and the exception will be caught and displayed as follows:
Assert.AreEqual failed. Expected:<Test
Test End>. Actual:<Test
Test End>.
The above obviously containing the correct formatting.
In the end it turns out my initial method of storing the expectations (with formatting) in strings was correct, yet my unfamiliarity of MSTest made me question my means as it appeared to be valid input, yet was simply being displayed back to myself in what appeared a valid output.
Use a regex to strip white space before you do your compare?