Quick question asking for insight from this community: Which one is preferable?
Option ①
// How many spaces are there in the beginning of string? (and remove them)
int spaces = text.Length;
text = text.TrimStart(' ');
spaces -= text.Length;
Advantage: Assignment on a separate line, thus side-effect is explicit
Disadvantage: The first line looks nonsensical by itself; you have to notice the third line to understand it
Option ②
// How many spaces are there in the beginning of string? (and remove them)
int spaces = text.Length - (text = text.TrimStart(' ')).Length;
Advantage: Statement makes sense in terms of the computation it performs
Disadvantage: Assignment kinda hidden inside the expression; side-effect can be overlooked
I don't like either of them. Some guidelines for writing clear code:
The meaning of a variable should remain the same throughout the lifetime of the variable.
Option (1) violates this guideline; the variable "spaces" is commented as meaning "how many spaces are in text" but it at no time actually has this meaning! It begins its lifetime by being the number of characters in text, and ends its lifetime as being the number of spaces that used to be in text. It means two different things throughout its lifetime and neither of them is what it is documented to mean.
An expression statement has exactly one side effect. (An "expression statement" is a statement that consists of a single expression; in C# the legal statement expressions are method calls, object constructions, increments, decrements and assignments.)
An expression has no side effects, except when the expression is the single side effect of an expression statement.
Option (2) obviously violates these guidelines. Expression statements that do multiple side effects are hard to reason about, they're hard to debug because you can't put the breakpoints where you want them, it's all bad.
I would rewrite your fragment to follow these guidelines.
string originalText = text;
string trimmedText = originalText.TrimStart(' ');
int removedSpaces = originalText.Length - trimmedText.Length;
text = trimmedText;
One side effect per line, and every variable means exactly the same thing throughout its entire lifetime.
I'd do option 1b:
int initial_length = text.Length;
text = text.TrimStart(' ');
int spaces = initial_length - text.Length;
Sure, it's almost a duplicate of option one, but it's a little clearer (and you might need the initial length of your string later on).
I personally prefer option 1. Although option 2 is more concise, and works correctly, I think of the guy who has to maintain this after I've moved on and I want to make my code as understandable as possible. I may know that an assignment as an expression evaluates to the value assigned, but the next guy may not.
What about an overload?
public static string TrimStart(this string s, char c, out int numCharsTrimmed)
{
numCharsTrimmed = s.Length;
s = s.TrimStart(c);
numCharsTrimmed -= s.Length;
}
Option ① All day. It's readable. Option ② is much more difficult to maintain.
From the point of view of your question itself, I'd say not to do assignments within expressions because it's not supported in all languages, Python for example, so if you want to remain consistent in your own personal coding style, you could stick with the traditional assignments.
Related
That might not be the best way to phrase it, but I'm considering writing a tool that converts identifiers separated by spaces in my code to camel case. A quick example:
var zoo animals = GetZooAnimals(); // i can't help but type this
var zooAnimals = GetZooAnimals(); // i want it to rewrite it like this
I was wondering if writing a tool like this would run into any ambiguities assuming it ignores all keywords. The only reason I can think of is if there is a syntactically valid expression with 2 identifiers only separated by white space.
Looking through the grammar I could not immediately find a place that allows it, but perhaps someone else would know better.
On a side note, I realize this is not a practical solution to a real problem a lot of people have, but just something I do all the time and wanted to take a stab at fixing with tools instead of forcing myself to always write camel case.
It is hard to tell whether a space-separated sequence of identifiers represents a single variable or not without doing full semantic analysis. For example
Myclass myVariable;
is a pair of space-separated identifiers which are perfectly valid. This would cause an ambiguity if you want to camel-case both type names and variable names.
If one enters:
csharp> var i j = 3;
(1,7): error CS1525: Unexpected symbol `j', expecting `,', `;', or `='
in the csharp interactive shell, one gets an error generated by the parser (a (LA)LR parser does bookkeeping what to expect next). Such parser works left-to-right so it doesn't know which characters to come next. It simply knows that the next characters are one of the list shown above.
So that means that there is probably no way to - at least declare a variable - with spaces.
Furthermore based on this context-free grammar for C# there doesn't seem to be a case where one can place two identifiers next to each other. It is for instance possible that a primary expressions is an identifier, but there is no situation where a primary expression is placed next to an identifier (or with an optional part in between).
As #dasblinkenlight says, you can indeed see the rule "local-variable-declaration":
type variable-declarator
with type that can be evaluated to an identifier and variable-declarator starting with an identifier. You can however know that the type is the first identifier (or the var keyword). Some kind of rewrite rule is thus:
(\w+)(\s+\w+)+ -> \1 concat(\2)
where you need to combine (concat) the identifiers of the second group. In case of an assignment.
This question already has answers here:
How to get distinct characters?
(9 answers)
Closed 8 years ago.
Lets say we have variable myString="blabla" or mystring=998769
myString.Length; //will get you your result
myString.Count(char.IsLetter); //if you only want the count of letters:
How to get, unique character count? I mean for "blabla" result must be 3, doe "998769" it will be 4. Is there ready to go function? any suggestions?
You can use LINQ:
var count = myString.Distinct().Count();
It uses a fact, that string implements IEnumerable<char>.
Without LINQ, you can do the same stuff Distinct does internally and use HashSet<char>:
var count = (new HashSet<char>(myString)).Count;
If you handle only ANSI text in English (or characters from BMP) then 80% times if you write:
myString.Distinct().Count()
You will live happy and won't ever have any trouble. Let me post this answer only for who will really need to handle that in the proper way. I'd say everyone should but I know it's not true (quote from Wikipedia):
Because the most commonly used characters are all in the Basic Multilingual Plane, handling of surrogate pairs is often not thoroughly tested. This leads to persistent bugs and potential security holes, even in popular and well-reviewed application software (e.g. CVE-2008-2938, CVE-2012-2135)
Problem of our first naïve solution is that it doesn't handle Unicode properly and it also doesn't consider what user perceive as character. Let's try "𠀑".Distinct().Count() and your code will wrongly return...2 because its UTF-16 representation is 0xD840 0xDC11 (BTW each of them, alone, is not a valid Unicode character because they're high and low surrogate, respectively).
Here I won't be very strict about terms and definitions so please refer to www.unicode.org as reference. For a (much) more broad discussion please read How can I perform a Unicode aware character by character comparison?, encoding isn't only issue you have to consider.
1) It doesn't take into account that .NET System.Char doesn't represent a character (or more specifically a grapheme) but a code unit of a UTF-16 encoded text (possible, for example, with ideographic characters). Often they coincide but now always.
2) If you're counting what user thinks (or perceives) as a character then this will fail again because it doesn't check combined characters like ا́ (many examples of this in Arabic language). There are duplicates that exists for historical reasons: for example é it's both a single Unicode code point and a combination (then that code will fail).
3) We're talking about a western/American definition of character. If you're counting characters for end-users you may need to change your definition to what they expect (for example in Korean language definition of character may not be so obvious, another example is Czech text ch that is always counted as a single character). Finally don't forget some strange things when you convert characters to upper case/lower case (for example in German language ß is SS in upper case, see also this post).
Encoding
C# strings are encoded as UTF-16 (char is two bytes) but UTF-16 isn't a fixed size encoding and char should be properly called code unit. What does it mean? That you may have a string where Length is 2 but actually user will see (and it's actually is) just one character (then count should be 1).
If you need to handle this properly then you have to make things much more complicated (and slow). Fortunately Char class has some helpful methods to handle surrogates.
Following code is untested (and for illustration purposes so absolutely not optimized, I'm sure it can be done much better than this) so get it just as starting point for further investigations:
int CountCharacters(string text)
{
HashSet<string> characters = new HashSet<string>();
string currentCharacter = "";
for (int i = 0; i < text.Length; ++i)
{
if (Char.IsHighSurrogate(text, i))
{
// Do not count this, next one will give the full pair
currentCharacter = text[i].ToString();
continue;
}
else if (Char.IsLowSurrogate(text, i))
{
// Our "character" is encoded as previous one plus this one
currentCharacter += text[i];
}
else
currentCharacter = text[i].ToString();
if (!characters.Contains(currentCharacter))
characters.Add(currentCharacter);
}
return characters.Count;
}
Note that this example doesn't handle duplicates (when same character may have different codes or can be a single code point or a combined character).
Combined Characters
If you have to handle combined characters (and of course encoding) then best way to do it is to use StringInfo class. You'll enumerate (and then count) both combined and encoded characters:
StringInfo.GetTextElementEnumerator(text).Walk()
.Distinct().Count();
Walk() is a trivial to implement extension method that simply walks through all IEnumerator elements (we need it because GetTextElementEnumerator() returns IEnumerator instead of IEnumerable).
Please note that after text has been properly splitted it can be counted with our first solution (the point is that brick isn't char but a sequence of char (for simplicity here returned as string itself). Again this code doesn't handle duplicates.
Culture
There is not much you can do to handle issues listed at point 3. Each language has its own rules and to support them all can be a pain. More examples about culture issues on this longer specific post.
It's important to be aware of them (so you have to know little bit about languages you're targeting) and don't forget that Unicode and few translated resx files won't make your application global.
If text processing is important in your application you can solve many issues using specialized DLLs for each locale you support (to count characters, to count words and so on) like Word Processors do. For example, issues I listed can be simply solved using dictionaries. What I usually do is to do not use standard .NET functions for strings (also because of some bugs), I create a Unicode class with static methods for everything I need (character counting, conversions, comparison) and many specialized derived classes for each supported language. At run-time that static methods will user current thread culture name to pick proper implementation from a dictionary and to delegate work to that. A skeleton may be something like this:
abstract class Unicode
{
public static string CountCharacters(string text)
{
return GetConcreteClass().CountCharactersCore(text);
}
protected virtual string CountCharactersCore(string text)
{
// Default implementation, overridden in derived classes if needed
return StringInfo.GetTextElementEnumerator(text).Cast<string>()
.Distinct().Count();
}
private Dictionary<string, Unicode> _implementations;
private Unicode GetConcreteClass()
{
string cultureName = Thread.Current.CurrentCulture.Name;
// Check if concrete class has been loaded and put in dictionary
...
return _implementations[cultureName];
}
}
If you're using C# then Linq comes nicely to the rescue - again:
"blabla".Distinct().Count()
will do it.
I've got an app with a textbox in it. The user enters text into this box.
I have the following function triggered in an OnKeyUp event in that textbox
private void bxItemText_KeyUp(object sender, System.Windows.Input.KeyEventArgs e)
{
// rules is an array of regex strings
foreach (string rule in rules)
{
Regex regex = new Regex(rule);
if (regex.Match(text.Trim().ToLower()))
{
// matched rule is a variable
matchedRule = rule;
break;
}
}
}
I've got about 12 strings in rules, although this can vary.
Once the length of the text in my textbox gets bigger than 80 characters performance starts to degrade. typing a letter after 100 characters takes a second to show up.
How do I optimise this? Should I match on every 3rd KeyUp ? Should I abandon KeyUp altogether and just auto match every couple of seconds?
How do I optimise this? Should I match on every 3rd KeyUp ? Should I abandon KeyUp altogether and just auto match every couple of seconds?
I would go with the second option, that is abandon KeyUp and trigger the validation every couple of seconds, or better yet trigger the validation when the TextBox loses focus.
On the other hand, I should suggest to cache the regular expressions beforehand and compile them because it seems like you are using them over and over again, in other words instead of storing the rules as strings in that array, you should store them as compiled regular expression objects when they are added or loaded.
Use static method calls instead of create a new object each time, static calls use a caching feature : Optimizing Regular Expression Performance, Part I: Working with the Regex Class and Regex Objects.
That will be a major improvement in performance, then you can provide your regexes (rules) to see if some optimization can be done in the regexes.
Other resources :
Optimizing Regular Expression Performance, Part II: Taking Charge of Backtracking
Optimizing Regex Performance, Part 3
Combining strings to one on Regex level will work faster than foreach in code.
Combining two Regex to one
If you need pattern determination for Each new symbol, and you care about performance, than Final State Machine seems to be the best option...
That is much harder way. You should specify for each symbol list of next symbols, that are allowed.
And OnKeyUp you just walk on next state, if possible. And you will have the amount of patterns, that input text currently matches.
Some useful references, that I could found:
FSM example
Guy explaining how to convert Regex to FSM
Regex - FSM converting discussion
You don't need to create a new regex object each time. Also using static call will cache the pattern if used before (since .Net 2). Here is how I would rewrite it
matchedRule = Rules.FirstOrDefault( rule => Regex.IsMatch(text.Trim().ToLower(), rule));
Given that you seem to be matching keywords, can you just perform the match on the current portion of text that's been edited (i.e. in the vicinity of the cursor)? Might be tricky to engineer, especially for operations like paste or undo, but scope for a big performance gain.
Pre-compile your regexes (using RegexOptions.Compiled). Also, can you make the Trim and ToLower redundant by expanding your regex? You're running Trim and ToLower once for each rule, which is inefficient even if you can't eliminate them altogether
You can try and make your rules mutually exclusive - this should speed things up. I did a short test: matching against the following
"cat|car|cab|box|balloon|button"
can be sped up by writing it like this
"ca(t|r|b)|b(ox|alloon|utton)"
I am writing a Search application that tokenize a big textual corpus.
The text parser needs to remove any gibberish from the text (i.e. [^a-zA-Z0-9])
I had 2 ideas in my head how to do this:
1) Put the text in a string, transform it to a charArray using String.tocharArray and then run char by char with a loop -> while(position < string.length)
Doing so I can tokenize the entire string array in one run over the text.
2) Strip all non digit/alpha using string.replace, and then string.split with some delimiters, this means i have to run twice on the entire string.
Once to remove bad chars and then again to split it.
I assumed, that since #1 does the same as #2 but in O(n) it would be quicker, but after testing both, #2 is way (way!) faster.
I went even further and viewed the code behind String.Strip using red-gate .net reflector.
It runs unmanaged char by char just like #1, but still much much faster.
I have no clue why #2 is way faster than #1.
Any ideas?
How about this idea:
Create a string
Load the entire data set into the string
Create a StringBuilder with enough pre-allocated space to hold the entire string
Go character by character through the string and if the character is alphanumeric, add it to the StringBuilder.
At the end, get the string out of the StringBuilder.
I don't know if this will be any faster to what you've already tried, but timing the above should at least answer that question.
djTeller,
The fact that #2 is faster is merely relative to your #1 method.
You might want to share your #1 method with us; maybe it's just very slow and is possible to make it faster than #2, even.
Yes both are essentially O(n), but is the ACTUAL implementation O(n); how'd you actually do #1?
Also, when you said you tested both, I hope you did with large amounts of input to overcome the margin of error and see a significant difference between the two.
I need to validate the contents of a C# method.
I do not care about syntax errors that do not affect the method's scope.
I do care about characters that will invalidate parsing of the rest of the code. For example:
method()
{
/* valid comment */
/* <-- bad
for (i..) {
}
for (i..) { <-- bad
}
I need to validate/fix any non-paired characters.
This includeds /* */, { }, and maybe others.
How should I go about this?
My first thought was Regex, but that clearly isn't going to get the job done.
You'll need to scope your problem more carefully in order to get a sensible answer.
For example, what are you going to do about methods that contain preprocessor directives?
void M()
{
#if FOO
for(foo;bar;blah) {
#else
while(abc) {
#endif
Blah();
}
}
This is silly but legal, so you have to handle it. Are you going to count that as a mismatched brace or not?
Can you provide a detailed specification of exactly what you want to determine? As we've seen several times on this site, people cannot successfully build a routine that divides two numbers without a specification. You're talking about analysis that is far more complex than dividing two numbers; the code which does what you're describing in the actual compiler is tens of thousands of lines long.
A regex is certainly not the answer to this problem. Regex's are useful tools for certain types of data validation. But once you get into the business of more complicated data like matching braces or comment blocks a regex no longer gets the job done.
Here is a blog article on the limitations encountered when using a regex to validate input.
http://blogs.msdn.com/ianhu/archive/2009/11/16/intellitrace-itrace-files.aspx
In order to do this you will have to write a parser of sorts which does the validation.
A regular expression isn't a very convenient thing for such a task. This is often implemented using a stack with an algorithm like the following:
Create an empty stack S.
While( there are characters left ){
Read a character ch.
If is ch an opening paren (of any kind), push it onto S
Else
If ch is a closing paren (of any kind), look at the top of S.
If S is empty as this point, report failure.
If the top of S is the opening paren that corresponds to c,
then pop S and continue to 1, this paren matches OK.
Else report failure.
If at the end of input the stack S is not empty, return failure.
Else return success.
for more information check http://www.ccs.neu.edu/home/sbratus/com1101/lab4.html and http://codeidol.com/csharp/csharpckbk2/Data-Structures-and-Algorithms/Determining-Where-Characters-or-Strings-Do-Not-Balance/
If you're trying to "validate" the contents of a string defining a method, then you may be better off just trying to use the CodeDom classes and compile the method on the fly into an in memory assembly.
Writing your own fully-functional parser to do validation will be very, very difficult, especially if you want to support C# 3 or later. Lambda expressions and other constructs like that will be very difficult to "validate" cleanly.
You're drawing a false dichotomy between "characters that will invalidating parsing the rest of the code" and "syntax errors". Lacking a closing curly brace (one of the problems you mention) is a syntax error. It looks like you mean you're looking for syntax errors that potentially break scope boundaries? Unfortunately, there's no robust way to do this short of using a full parser.
As an example:
method()
{ <-- is missing closing brace
/* valid comment */
/* <-- bad
for (i..) {
}
for (i..) {
} <-- will be interpreted as the closing brace for the for loop
There's no general, practical way to infer that it's the for loop that's missing its closing brace, rather than the method.
If you're really interested in looking for these sort of things, you should consider running the compiler programmatically and parsing the results - that's the best approach with the lowest entry threshold.