Say I have a Currency class:
public class Currency
{
    string Name;   // e.g. "US Dollars"
    string Symbol; // e.g. "$"
    decimal Rate;  // e.g. 1.7
}
So, the symbol is $, £, etc., i.e. it's going to be a single character.
I used string simply because it makes translation to/from views easier, but is there any reason why I should be using char? What are the advantages/disadvantages of using a char field instead of a string? Or is this a case like byte/int, where you should basically always prefer int?
Not all currency symbols are a single character: Symbols List.
Well, if you use char:
It's guaranteed to be exactly a single character, and that's obvious to the developer (probably the biggest "pro" for char)
It can never be null (although it could be U+0000)
You won't be able to handle characters which aren't in the Basic Multilingual Plane
I don't know if any of those are important to you.
Or is this a case like byte/int where you should basically always prefer int.
First off, I would argue that there are valid uses for byte. byte takes less space than int - if you only need a byte, use a byte.
System.Char has a couple of advantages over string, in the right case. First, it's a single, immutable value type. This is more efficient if you only need a single character.
In addition, using a char in your API makes it impossible to put more than one character into that field. This may eliminate or simplify some of the validation required.
If it truly makes sense that you'd only ever want a single char, I'd say to use a char. It simplifies the code (less validation), makes it more efficient (less memory as you don't have another object reference), and most importantly, expresses the intent more clearly, since you're saying "I just want one character here."
string is acceptable here; you just have to ensure that the string always contains exactly one character, which means a little extra code on that end to prevent exceptions. Really, you can pick your poison here.
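As a rough illustration of the "little extra code" the string option needs, here is a minimal sketch. The constructor, the public properties and the use of StringInfo to count user-perceived characters are assumptions added for illustration; they are not part of the original class.

```csharp
using System;
using System.Globalization;

public class Currency
{
    public string Name { get; }
    public string Symbol { get; }
    public decimal Rate { get; }

    public Currency(string name, string symbol, decimal rate)
    {
        if (symbol == null)
            throw new ArgumentNullException(nameof(symbol));

        // Count text elements (user-perceived characters), not UTF-16 code units,
        // so a symbol outside the Basic Multilingual Plane is still accepted.
        if (new StringInfo(symbol).LengthInTextElements != 1)
            throw new ArgumentException("Symbol must be exactly one character.", nameof(symbol));

        Name = name;
        Symbol = symbol;
        Rate = rate;
    }
}
```

With a char field none of this validation would be needed, which is the trade-off described above.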
I'm matching words to create a simple lexical analyzer.
Here is my example code and output.
Example code:
public class
{
public static void main (String args[])
{
System.out.println("Hello");
}
}
output:
public = identifier
void = identifier
main = identifier
class = identifier
As you can see, my output is not in the same order as the input. void and main come after class in the input, but in the output class comes at the end. I want to print the results in the order in which the input is matched.
C# code:
private void button1_Click(object sender, EventArgs e)
{
    if (richTextBox1.Text.Contains("public"))
        richTextBox2.AppendText("public = identifier\n");
    if (richTextBox1.Text.Contains("void"))
        richTextBox2.AppendText("void = identifier\n");
    if (richTextBox1.Text.Contains("class"))
        richTextBox2.AppendText("class = identifier\n");
    if (richTextBox1.Text.Contains("main"))
        richTextBox2.AppendText("main = identifier\n");
}
Your code is asking the following questions:
Does the input contain the text "public"? If so, write down "public = identifier".
Does the input contain the text "void"? If so, write down "void = identifier".
Does the input contain the text "class"? If so, write down "class = identifier".
Does the input contain the text "main"? If so, write down "main = identifier".
The answer to all of these questions is yes, and since they're executed in that exact order, the output you get should not be surprising. Note: public, void and class are keywords, not identifiers; only main is an identifier.
Splitting on whitespace?
So your approach is not going to help you tokenize that input. Something slightly more in the right direction would be input.Split(): that will cut up the input at whitespace boundaries and give you an array of strings. Still, there will be a lot of empty entries in there.
input.Split(new char[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries) is a little better, giving us the following output: public, class, {, public, static, void, main, (String, args[]), {, System.out.println("Hello");, } and }.
But you'll notice that some of these strings contain multiple 'tokens': (String, args[]) and System.out.println("Hello");. And if you had a string with whitespace in it it would get split into multiple tokens. Apparently, just splitting on whitespace is not sufficient.
Tokenizing
At this point, you would start writing a loop that goes over every character in the input, checking if it's whitespace or a punctuation character (such as (, ), {, }, [, ], ., ;, and so on). Those characters should be treated as the end of the previous token, and punctuation characters should also be treated as a token of their own. Whitespace can be skipped.
You'll also have to take things like string literals and comments into account: anything in-between two double-quotes should not be tokenized, but be treated as part of a single 'string' token (including whitespace). Also, strings can contain escape sequences, such as \", that produce a single character (that double quote should not be treated as the end of the string, but as part of its content).
Anything that comes after two forward slashes should be ignored (or parsed as a single 'comment' token, if you want to process comments somehow), until the next newline (newline characters/sequences differ across operating systems). Anything after a /* should be ignored until you encounter a */ sequence.
Numbers can optionally start with a minus sign, can contain a dot (or start with a dot), a scientific notation part (e..), which can also be negative, and there are type suffixes...
In other words, you're writing a state machine, with different behaviour depending on what state you're in: 'string', 'comment', 'block comment', 'numeric literal', and so on.
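The loop described above can be sketched roughly as follows. This is a deliberately small illustration, not a complete lexer: the class name SimpleTokenizer and its punctuation set are hypothetical, and it only covers whitespace, single-character punctuation, double-quoted strings with backslash escapes, and // comments. Block comments and numeric literals are left out.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class SimpleTokenizer
{
    // Hypothetical punctuation set; a real lexer needs more, including multi-character operators.
    private const string Punctuation = "(){}[].;,";

    public static List<string> Tokenize(string input)
    {
        var tokens = new List<string>();
        var current = new StringBuilder();
        int i = 0;

        while (i < input.Length)
        {
            char c = input[i];

            if (c == '"')
            {
                // 'string' state: consume up to the closing quote, honouring backslash escapes.
                Flush(tokens, current);
                int start = i;
                i++;
                while (i < input.Length && input[i] != '"')
                {
                    if (input[i] == '\\' && i + 1 < input.Length) i++; // skip the escaped character
                    i++;
                }
                i = Math.Min(i + 1, input.Length); // step past the closing quote if there is one
                tokens.Add(input.Substring(start, i - start));
            }
            else if (c == '/' && i + 1 < input.Length && input[i + 1] == '/')
            {
                // 'comment' state: ignore everything up to the end of the line.
                Flush(tokens, current);
                while (i < input.Length && input[i] != '\n') i++;
            }
            else if (char.IsWhiteSpace(c))
            {
                Flush(tokens, current); // whitespace ends the previous token and is skipped
                i++;
            }
            else if (Punctuation.IndexOf(c) >= 0)
            {
                Flush(tokens, current); // punctuation ends the previous token...
                tokens.Add(c.ToString()); // ...and is a token of its own
                i++;
            }
            else
            {
                current.Append(c);
                i++;
            }
        }

        Flush(tokens, current);
        return tokens;
    }

    private static void Flush(List<string> tokens, StringBuilder current)
    {
        if (current.Length > 0)
        {
            tokens.Add(current.ToString());
            current.Clear();
        }
    }
}
```

Because tokens are emitted while walking the input from left to right, they come out in the same order as they appear in the source text.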
Lexing
It's useful to assign a type to each token, either while tokenizing or as a separate step (lexing). public is a keyword, main is an identifier, 1234 is an integer literal, "Hello" is a string literal, and so on. This will help during the next step.
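A separate lexing step could look something like the sketch below. The TokenClassifier class, its category names and its very short keyword list are assumptions made for illustration; the real keyword set is defined by the language specification.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TokenClassifier
{
    // Deliberately tiny keyword list, for illustration only.
    private static readonly HashSet<string> Keywords = new HashSet<string>
    {
        "public", "private", "protected", "static", "void", "class", "while"
    };

    public static string Classify(string token)
    {
        if (string.IsNullOrEmpty(token))
            throw new ArgumentException("Token must not be empty.", nameof(token));

        if (Keywords.Contains(token)) return "keyword";

        if (token.Length >= 2 && token[0] == '"' && token[token.Length - 1] == '"')
            return "string literal";

        if (token.All(char.IsDigit)) return "integer literal";

        // Identifiers start with a letter or underscore, followed by letters, digits or underscores.
        if ((char.IsLetter(token[0]) || token[0] == '_') &&
            token.All(ch => char.IsLetterOrDigit(ch) || ch == '_'))
            return "identifier";

        return "punctuation";
    }
}
```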
Parsing
You can now move on to parsing: turning a list of tokens into an abstract syntax tree (AST). At this point you can check if a list of tokens is actually valid code. You basically repeat the above step, but at a higher level.
For example, public, protected and private are keyword tokens, and they're all access modifiers. As soon as you encounter one of these, you know that either a class, a function, a field or a property definition must follow. If the next token is a while keyword, then you signal an error: public while is not a valid C# construct. If, however, the next token is a class keyword, then you know it's a class definition and you continue parsing.
So you've got a state machine once again, but this time you've got states like 'class definition', 'function definition', 'expression', 'binary expression', 'unary expression', 'statement', 'assignment statement', and so on.
Conclusion
This is by no means complete, but hopefully it'll give you a better idea of all the steps involved and how to approach this. There are also tools available that can generate parsing code from a grammar specification, which can ease the job somewhat (though you still need to learn how to write such grammars).
You may also want to read the C# language specification, specifically the part about its grammar and lexical structure. The spec can be downloaded for free from one of Microsoft's websites.
CodeCaster is right. You are not on the right path.
I have a lexical analyzer that I wrote some time ago as a project.
I know, I know, I'm not supposed to hand things over on a plate here, but the analyzer is for C++, so you'll have to change a few things.
Take a look at the source code and please at least try to understand how it works: C++ Lexical Analyzer
In the strictest sense, the reason for the described behaviour is that in the evaluating code, the search for void comes before the search for class. However, the approach in total seems far too simple for a lexical analysis, as it simply checks for substrings. I totally second the comments above; depending on what you are trying to achieve in the big picture, a more sophisticated approach might be necessary.
Let's say we have a variable myString = "blabla" or myString = "998769".
myString.Length; //will get you your result
myString.Count(char.IsLetter); //if you only want the count of letters:
How do I get the unique character count? I mean, for "blabla" the result must be 3, and for "998769" it will be 4. Is there a ready-made function? Any suggestions?
You can use LINQ:
var count = myString.Distinct().Count();
It relies on the fact that string implements IEnumerable<char>.
Without LINQ, you can do the same stuff Distinct does internally and use HashSet<char>:
var count = (new HashSet<char>(myString)).Count;
If you only handle ANSI text in English (or characters from the BMP), then 80% of the time you can write:
myString.Distinct().Count()
and you will live happily and never have any trouble. Let me post this answer only for those who really need to handle this properly. I'd say everyone should, but I know that's not what happens (quote from Wikipedia):
Because the most commonly used characters are all in the Basic Multilingual Plane, handling of surrogate pairs is often not thoroughly tested. This leads to persistent bugs and potential security holes, even in popular and well-reviewed application software (e.g. CVE-2008-2938, CVE-2012-2135)
The problem with our first naïve solution is that it doesn't handle Unicode properly, and it also doesn't consider what the user perceives as a character. Try "𠀑".Distinct().Count() and the code will wrongly return 2, because the UTF-16 representation of that character is 0xD840 0xDC11 (by the way, neither of them alone is a valid Unicode character, because they're the high and low surrogate, respectively).
Here I won't be very strict about terms and definitions, so please refer to www.unicode.org as a reference. For a (much) broader discussion, please read How can I perform a Unicode aware character by character comparison?; encoding isn't the only issue you have to consider.
1) It doesn't take into account that .NET System.Char doesn't represent a character (or, more specifically, a grapheme) but a code unit of UTF-16 encoded text. A single character may need more than one code unit (for example, some ideographic characters). Often the two coincide, but not always.
2) If you're counting what the user thinks of (or perceives) as a character, then this will fail again, because it doesn't handle combined characters like ا́ (there are many examples of this in Arabic). Some duplicates exist for historical reasons: for example, é is both a single Unicode code point and a combination of a base letter and a combining accent, and the code above will count the two forms differently.
3) We're talking about a western/American definition of character. If you're counting characters for end users, you may need to change your definition to what they expect (for example, in Korean the definition of a character may not be so obvious; another example is Czech, where ch is always counted as a single character). Finally, don't forget some strange things that happen when you convert characters to upper case/lower case (for example, in German ß is SS in upper case; see also this post).
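To make point 2 concrete, here is a small sketch comparing the naïve count with a count taken after Unicode normalization. The class and method names are hypothetical, and normalizing to Form C is only one possible way to fold such duplicates together; it does not address points 1 or 3.

```csharp
using System;
using System.Linq;
using System.Text;

public static class CombiningDemo
{
    // The naïve approach: counts distinct UTF-16 code units.
    public static int CountDistinctChars(string text)
    {
        return text.Distinct().Count();
    }

    // Form C composes a base letter plus combining mark into a single
    // code point where such a precomposed code point exists.
    public static int CountDistinctCharsNormalized(string text)
    {
        return text.Normalize(NormalizationForm.FormC).Distinct().Count();
    }
}
```

For example, "\u00E9" (precomposed é) and "e\u0301" (e followed by a combining acute accent) look identical to the user, but the naïve count sees one code unit in the first and two in the second; after normalization both are counted as one.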
Encoding
C# strings are encoded as UTF-16 (char is two bytes), but UTF-16 isn't a fixed-size encoding, and char should properly be called a code unit. What does that mean? You may have a string whose Length is 2 although the user sees (and it actually is) just one character, so the count should be 1.
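The difference between code units and characters can be checked with a couple of one-liners. This is only a sketch; the LengthDemo class is a hypothetical name, and "\U00020011" is the C# escape for the 𠀑 character used earlier.

```csharp
using System;
using System.Globalization;

public static class LengthDemo
{
    // Number of UTF-16 code units, which is what string.Length reports.
    public static int CodeUnits(string text) => text.Length;

    // Number of text elements (user-perceived characters).
    public static int TextElements(string text) => new StringInfo(text).LengthInTextElements;
}
```

For "\U00020011", CodeUnits returns 2 (one surrogate pair) while TextElements returns 1.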
If you need to handle this properly then you have to make things much more complicated (and slow). Fortunately Char class has some helpful methods to handle surrogates.
The following code is untested (and is for illustration purposes only, so it is absolutely not optimized; I'm sure it can be done much better than this), so treat it just as a starting point for further investigation:
int CountCharacters(string text)
{
    HashSet<string> characters = new HashSet<string>();
    string currentCharacter = "";

    for (int i = 0; i < text.Length; ++i)
    {
        if (Char.IsHighSurrogate(text, i))
        {
            // Do not count this, next one will give the full pair
            currentCharacter = text[i].ToString();
            continue;
        }
        else if (Char.IsLowSurrogate(text, i))
        {
            // Our "character" is encoded as previous one plus this one
            currentCharacter += text[i];
        }
        else
            currentCharacter = text[i].ToString();

        // HashSet<T>.Add ignores values that are already present
        characters.Add(currentCharacter);
    }

    return characters.Count;
}
Note that this example doesn't handle duplicates (when same character may have different codes or can be a single code point or a combined character).
Combined Characters
If you have to handle combined characters (and of course encoding) then best way to do it is to use StringInfo class. You'll enumerate (and then count) both combined and encoded characters:
StringInfo.GetTextElementEnumerator(text).Walk()
.Distinct().Count();
Walk() is a trivial to implement extension method that simply walks through all IEnumerator elements (we need it because GetTextElementEnumerator() returns IEnumerator instead of IEnumerable).
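One possible implementation of such a Walk() method is sketched below. The class names and the choice of IEnumerable<object> as the return type are assumptions; the original only says that the method walks through the enumerator.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

public static class EnumeratorExtensions
{
    // Adapts a non-generic IEnumerator to IEnumerable<object> so LINQ operators can be applied.
    public static IEnumerable<object> Walk(this IEnumerator enumerator)
    {
        while (enumerator.MoveNext())
            yield return enumerator.Current;
    }
}

public static class TextElementCounter
{
    public static int CountDistinctTextElements(string text)
    {
        // Each element yielded by the enumerator is a string holding one text element.
        return StringInfo.GetTextElementEnumerator(text).Walk()
            .Distinct().Count();
    }
}
```

Distinct() compares the boxed elements with Equals, which for strings is a value comparison, so equal text elements are counted once.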
Please note that once the text has been properly split, it can be counted with our first solution (the point is that the building block isn't a char but a sequence of char, for simplicity returned here as a string). Again, this code doesn't handle duplicates.
Culture
There is not much you can do to handle the issues listed in point 3. Each language has its own rules, and supporting them all can be a pain. There are more examples of culture issues in this longer, more specific post.
It's important to be aware of them (so you have to know a little bit about the languages you're targeting), and don't forget that Unicode and a few translated resx files won't make your application global.
If text processing is important in your application, you can solve many issues by using specialized DLLs for each locale you support (to count characters, to count words and so on), like word processors do. For example, the issues I listed can be solved simply by using dictionaries. What I usually do is avoid the standard .NET string functions (also because of some bugs): I create a Unicode class with static methods for everything I need (character counting, conversions, comparison) and many specialized derived classes, one for each supported language. At run time those static methods use the current thread's culture name to pick the proper implementation from a dictionary and delegate the work to it. A skeleton may be something like this:
abstract class Unicode
{
    public static int CountCharacters(string text)
    {
        return GetConcreteClass().CountCharactersCore(text);
    }

    protected virtual int CountCharactersCore(string text)
    {
        // Default implementation, overridden in derived classes if needed.
        // Walk() is the IEnumerator-to-IEnumerable extension method described above.
        return StringInfo.GetTextElementEnumerator(text).Walk()
            .Distinct().Count();
    }

    private static Dictionary<string, Unicode> _implementations;

    private static Unicode GetConcreteClass()
    {
        string cultureName = Thread.CurrentThread.CurrentCulture.Name;
        // Check if concrete class has been loaded and put in dictionary
        ...
        return _implementations[cultureName];
    }
}
If you're using C#, then LINQ comes nicely to the rescue again:
"blabla".Distinct().Count()
will do it.
Does C# have an equivalent to Ruby's string successor method?
No, there's no built-in equivalent, although of course you can create your own. I'd suggest the skeleton of something like this:
char[] array = text.ToArray();
// Perform mutation...
// - find last alphanumeric character, and work backwards from that...
// - or go from final character if there aren't any alphanumerics
return new string(array);
Personally, I think I'd want rather more restricted semantics, with a specific set of characters to use. The description given in the Ruby docs seems to be along the lines of "we'll work with anything" rather than failing if you're trying to do something which doesn't really make sense. I certainly wouldn't want to increment some arbitrary UTF-16 code unit when I was expecting A-Z, 0-9.
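As one possible reading of those "more restricted semantics", here is a sketch that only accepts A-Z and 0-9 and throws on anything else. The StringSuccessor class and its carry rules (digits wrap 9 to 0, letters wrap Z to A, and the string grows when every position wraps) are assumptions for illustration, not a port of Ruby's implementation.

```csharp
using System;

public static class StringSuccessor
{
    public static string Successor(string text)
    {
        if (string.IsNullOrEmpty(text))
            throw new ArgumentException("Text must not be empty.", nameof(text));

        // Fail early instead of incrementing arbitrary UTF-16 code units.
        foreach (char c in text)
        {
            bool isDigit = c >= '0' && c <= '9';
            bool isUpper = c >= 'A' && c <= 'Z';
            if (!isDigit && !isUpper)
                throw new ArgumentException("Only A-Z and 0-9 are supported.", nameof(text));
        }

        char[] array = text.ToCharArray();

        // Work backwards from the final character, carrying while characters wrap around.
        for (int i = array.Length - 1; i >= 0; i--)
        {
            if (array[i] == '9') { array[i] = '0'; continue; }
            if (array[i] == 'Z') { array[i] = 'A'; continue; }

            array[i] = (char)(array[i] + 1);
            return new string(array);
        }

        // Every position wrapped, so the string grows by one character.
        char lead = array[0] == '0' ? '1' : 'A';
        return lead + new string(array);
    }
}
```

Under these rules "AZ" becomes "BA", "A9" becomes "B0", "ZZ" becomes "AAA" and "99" becomes "100".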
I'd like to String.Split() the following string using a comma as the delimiter:
John,Smith,123 Main Street,212-555-1212
The above content is entered by a user. If they enter a comma in their address, the resulting string would cause problems to String.Split() since you now have 5 fields instead of 4:
John,Smith,123 Main Street, Apt 101,212-555-1212
I can use String.Replace() on all user input to replace commas with something else, and then use String.Replace() again to convert things back to commas:
value = value.Replace(",", "*");
However, this can still be fooled if a user happens to use the placeholder delimiter "*" in their input. Then you'd end up with extra commas and no asterisks in the result.
I see solutions online for dealing with escaped delimiters, but I haven't found a solution for this seemingly common situation. What am I missing?
EDIT: This is called delimiter collision.
This is a common scenario — you have some arbitrary string values that you would like to compose into a structure, which is itself a string, but without allowing the values to interfere with the delimiters in structure around them.
You have several options:
Input restriction: If it is acceptable for your scenario, the simplest solution is to restrict the use of delimiters in the values. In your specific case, this means disallow commas.
Encoding: If input restriction is not appropriate, the next easiest option would be to encode the entire input value. Choose an encoding that does not have delimiters in its range of possible outputs (e.g. Base64 does not feature commas in its encoded output)
Escaping delimiters: A slightly more complex option is to come up with a convention for escaping delimiters. If you're working with something mainstream like CSV it is likely that the problem of escaping is already solved, and there's a standard library that you can use. If not, then it will take some thought to come up with a complete escaping system, and implement it.
If you have the flexibility to not use CSV for your data representation this would open up a host of other options. (e.g. Consider the way in which parameterised SQL queries sidestep the complexity of input escaping by storing the parameter values separately from the query string.)
This may not be an option for you, but would it not be easier to use a very uncommon character, say a pipe |, as your delimiter and not allow this character to be entered in the first place?
If this is CSV, the address should be surrounded by quotes. CSV parsers are widely available that take this into account when parsing the text.
John,Smith,"123 Main Street, Apt. 6",212-555-1212
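To show what quote-aware parsing involves, here is a minimal sketch of writing and reading a single CSV line. The CsvLine class is hypothetical and handles only commas, double quotes and doubled quotes inside quoted fields; for real data an existing CSV library is the safer choice.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class CsvLine
{
    // Quote a field if it contains a comma, quote or line break; double any embedded quotes.
    public static string Escape(string field)
    {
        if (field.IndexOfAny(new[] { ',', '"', '\r', '\n' }) < 0) return field;
        return "\"" + field.Replace("\"", "\"\"") + "\"";
    }

    public static List<string> Parse(string line)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;

        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (inQuotes)
            {
                if (c == '"' && i + 1 < line.Length && line[i + 1] == '"')
                {
                    current.Append('"'); // a doubled quote is a literal quote
                    i++;
                }
                else if (c == '"') inQuotes = false; // closing quote
                else current.Append(c);              // commas are ordinary text here
            }
            else if (c == '"') inQuotes = true;
            else if (c == ',')
            {
                fields.Add(current.ToString()); // an unquoted comma ends the field
                current.Clear();
            }
            else current.Append(c);
        }

        fields.Add(current.ToString());
        return fields;
    }
}
```

Parsing the quoted line above yields four fields, with the comma kept inside the address.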
One foolproof solution would be to convert the user input to base64 and then delimit with a comma. It will mean that you will have to convert back after parsing.
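A sketch of that Base64 approach, assuming UTF-8 as the intermediate byte encoding (the Base64Fields class and its method names are hypothetical):

```csharp
using System;
using System.Linq;
using System.Text;

public static class Base64Fields
{
    // Encode every field so that no comma can appear inside it, then join with commas.
    public static string Join(params string[] fields)
    {
        return string.Join(",", fields.Select(f => Convert.ToBase64String(Encoding.UTF8.GetBytes(f))));
    }

    // Split on commas first, then decode each field back to its original text.
    public static string[] Split(string line)
    {
        return line.Split(',')
            .Select(part => Encoding.UTF8.GetString(Convert.FromBase64String(part)))
            .ToArray();
    }
}
```

The Base64 alphabet contains no comma, so splitting is unambiguous; the cost is that the stored line is no longer human-readable.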
You could try putting quotes, or some other begin and end delimiters, around each of the user inputs, and ignore any special character between a set of quotes.
This really comes down to a situation of cleansing user inputs. You should only allow desired characters in the user input and reject/strip invalid inputs from the user. This way you could use your asterisk delimiter.
The best solution is to define the valid characters and reject invalid characters somehow, then use an invalid character (which cannot appear in the input, since it is "banned") as your delimiter.
Don't allow the user to enter the character you are using as a delimiter. I personally feel this is the best way.
Funny solution (works if the address is the only field with a comma):
Split the string by comma. The first two pieces will be the first name and last name; the last piece is the telephone number. Take those away, then join the rest back together with commas: that is the address ;)
In a sense, the user is already "escaping" the comma with the space afterward.
So, try this:
string[] values = Regex.Split(value, ",(?![ ])");
The user can still break this if they don't put a space, and there is a more foolproof method (using the standard CSV method of quoting values that contain commas), but this will do the trick for the use case you've presented.
One more solution: provide an "Address 2" field, which is where things like apartment numbers would traditionally go. Users can still break it if they are lazy, though what they'll actually break is the fields after Address 2.
Politely remind your users that properly-formed street addresses in the United States and Canada should NEVER contain any punctuation whatsoever, perhaps?
The process of automatically converting corrupted data into useful data is non-trivial without heuristic logic. You could try to outsource the parsing by calling a third-party address-formatting library to apply the USPS formatting rules.
Even USPS requires the user to perform much of the work, by having components of the address entered into distinct fields on their address "canonicalizer" page (http://zip4.usps.com/zip4/welcome.jsp).
I'm kind of stuck here. I'm developing a custom Pipeline component for Commerce Server 2009, but that has little to do with my problem.
In the setup of the pipe, I give the user a windows form to enter some values for configuration. One of those values is a URL for a SharePoint site. Commerce Server uses C++ components behind all this pipeline stuff, so the entered values are put into an IDictionary and eventually persisted to the DB via the C++ component from Microsoft.
When I read the string in during pipeline execution, it is handed to me in an IDictionary object from C++. My C# code sees that URL suffixed with \0\0. I'm not sure where those are coming from, but my code blows up because it's not a valid URI. I am trimming the string before I save it and trimming it when I read it and still can't get rid of those.
Any ideas what is causing this and how I can get rid of it? I prefer not to have a hack like substring it, but something that gets at the root cause.
Thanks,
Corey
Would this help:
string sFixedUrl = "hello\0\0".Trim('\0');
As the others' posts explained, strings in C are null-terminated. (Notice that C++, however, already provides a string type which doesn't depend on that.)
Your case is just a bit different because you're getting double-null-terminated string. I'm not an expert here, so anyone should feel free to correct me if I'm wrong. But this looks like a typical string representation for unicode/i18n aware applications in Windows which use wide characters. Please, take a look at this.
One guess is that the application which is persisting the string into the database is not using a "portable" strategy. For example, it might be persisting the string buffer considering its size in raw bytes instead of its actual length. The former would be counting the extra two zeros in the end (and, consequently, persisting them too) while the latter would discard them.
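To illustrate that guess (and only as an illustration, since the actual behaviour of the Commerce Server component is not known here), the sketch below shows how decoding a UTF-16 buffer by its full byte size keeps the terminator as trailing '\0' characters, and how trimming them before parsing gives a usable URI. The class name, helper names and the example.com URL are hypothetical.

```csharp
using System;
using System.Text;

public static class NullTerminatorDemo
{
    // Decodes every byte of the buffer, terminator included, as UTF-16.
    public static string DecodeWholeBuffer(byte[] buffer)
    {
        return Encoding.Unicode.GetString(buffer);
    }

    // Hypothetical clean-up helper: drop trailing NULs before validating the URL.
    public static bool TryParseUrl(string raw, out Uri uri)
    {
        return Uri.TryCreate(raw.TrimEnd('\0'), UriKind.Absolute, out uri);
    }
}
```

Note that a plain Trim() without arguments only removes whitespace, which would explain why trimming before saving and after reading did not remove the '\0' characters.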
From this site:
A string in C is simply an array of characters, with the final character set to the NUL character (ascii/unicode point 0). This null-terminator is required; a string is ill-formed if it isn't there. The string literal token in C/C++ ("string") guarantees this.
const char *str = "foo";
is, as far as the stored characters are concerned, the same as
const char str[] = {'f', 'o', 'o', 0};
So as soon as the C++ component gets your IDictionary, it will add the null terminator to the end of the string. If you want to remove it, you will have to strip the null terminator from the end before sending back the dictionary. See this post on how to remove a null terminator. Basically you need to know the exact size and trim it off.
Another technique you can use is an array of characters and the length of the array. An array of characters does not need a terminating null character.
When you pass this data structure, you must pass the length also. The convention for the C-style strings is to determine the end of the string by searching for a '\0' (or in Unicode, '\0\0'). Since the array doesn't have the terminating characters, the length is always needed.
A much better solution is to use std::string. It tracks its own length, so a terminating null character is not part of its contents. When you need compatibility with the C-style format, use the c_str() method. I have to use this technique in my program because the GUI framework has its own string data type that is incompatible with std::string.