Lucene and Special Characters

Lucene and Special Characters - c#

I am not that expert user of stackoverflow but what i know is that my question is somewhat related to
Lucene and Special Characters
But i have a slight different environment.
I have an index with Lucene.NET but i am searching it with SOLR. Is it possible to search the special characters without re-indexing? while re-indexing i can change my analyzer but is it or not possible to search without re-indexing?

You will need to set up your query analyzer in Solr to match the analyzer config used at index-time.
Solr has a very handy tool -- Field Analysis (solr/admin/analysis.jsp) -- for analyzing analyzer configurations. Check the verbose check boxes, and inspect how analyzers process your query terms. Lucid Imagination has a section about it.
If you're not sure what analyzers were run at index time, then you will also have to inspect what the terms actually look like in the index (although it will likely be very hard to prove that your query analysis is correct). You can use the LukeRequestHandler for this.
If you can concluded a one-to-one mapping of terms (between index time analysis and query time analysis), then you're home safe, otherwise you might be better of re-indexing.

Related

Regular expression detecting hard coded messages

I work on a solution potentially having hard coded strings. The task is to identify in the solution's (C# in my case) code, all the phrases that aren't touched by the translation mechanism.
Usually the translations mechanisms are something like
myTranslator.Get("messageKey", myLang)
so I supposed "\w .*" would potentially give me phrases (I use a space after a word, cause in the Message Keys you rarely have spaces...), but that gave me also the HTML attributes like class="alpha beta gamma" that are not phrases...
"fortunately", the supposed hard coded strings are in French, so I tried to find
".*[äÄëËüÜïÏöÖâÂêÊûÛîÎôÔèÈàÀùÚçÇéÉ].*"
in the *.cs;*.cshtml files (the solution is an ASP.NET one)...
That works rather well, but finds only phrases having accents... is there a smarter way to identify the hard-coded (non-translated) strings in solution's code?
What are the general recommendations to identify, label and remove such kind of strings from a (localisable) solution?

General purpose plain-text linting tool

I'm looking for a command-line tool where I can specify regex patterns (or similar) for certain file extensions (e.g. cs files, js files, xaml files) that can provide errors/warnings when run, like during a build. These would scan plain-text source code of all types.
I know there are tools for specific languages... I plan on using those too. This tool is for quick patterns we want to flag where we don't want to invest in writing a Rosyln rule, for example. I'd like to flag certain patterns or API usages in an easy way where anyone can add a new rule without thinking too hard. Often times we don't add rules because it is hard.
Features like source tokenization is bonus. Open-source / free is mega bonus.
Is there such a tool?

If you want to go old-skool, you can dust-off Awk for this one.
It scans file line by line (for some configurable definition of line, with a sane default) cuts them in pieces (on whitespace IMMSMR) and applies a set of regexes and fires the code behind the matching regex. There are some conditions to match the beginning and end of a file to print headers/footers.
It seems to be what you want, but IMHO, a perl or ruby script is easier and has replaced AWK for me a long time ago. But it IS simple and straightforward for your use-case AFAICT.

Our Source Code Search Engine (SCSE) can do this.
SCSE lexes (using language-accurate tokenization including skipping language-specific whitespace but retaining comments) a set of source files, and then builds a token index for each token type. One can provide SCSE with token-based search queries such as:
'if' '(' I '='
to search for patterns in the source code; this example "lints" C-like code for the common mistake of assigning a variable (I for "identifier") in an IF statement caused by accidental use of '=' instead of the intended '=='.
The search is accomplished using the token indexes to speed up the search. Typically SCSE can search millions of lines of codes in a few seconds, far faster than grep or other scheme that insists on reading the file content for each query. It also produces fewer false positives because the token checks are accurate, and the queries are much easier to write because one does not have to worry about white space/line breaks/comments.
A list of hits on the pattern can be logged or merely counted.
Normally SCSE is used interactively; queries produce a list of hits, and clicking on a hit produces a view of a page of the source text with the hit superimposed. However, one can also script calls on the SCSE.
SCSE can be obtained with langauge-accurate lexers for some 40 languages.

'Regular Expression' VS 'String Comparison operators / functions'

This question is designed around the performance within PHP but you may broaden it to any language if you wish to.
After many years of using PHP and having to compare strings I've learned that using string comparison operators over regular expressions is beneficial when it comes to performance.
I fully understand that some operations have to be done with Regular Expressions down to there complexity but for operations that can be resolved via regex AND string functions.
take this example:
PHP
preg_match('/^[a-z]*$/','thisisallalpha');
C#
new Regex("^[a-z]*$").IsMatch('thisisallalpha');
can easily be done with
PHP
ctype_alpha('thisisallalpha');
C#
VFPToolkit.Strings.IsAlpha('thisisallalpha');
There are many other examples but you should get the point I'm trying to make.
What version of string comparison should you try and lean towards and why?

Looks like this question arose from our small argument here, so i feel myself somehow obliged to respond.
php developers are being actively brainwashed about "performance", whereat many rumors and myths arise, including sheer stupid things like "double quotes are slower". Regexps being "slow" is one of these myths, unfortunately supported by the manual (see infamous comment on the preg_match page). The truth is that in most cases you don't care. Unless your code is repeated 10,000 times, you don't even notice a difference between string function and a regular expression. And if your code does repeat 10,000 times, you must be doing something wrong in any case, and you will gain performance by optimizing your logic, not by stripping down regular expressions.
As for readability, regexps are admittedly hard to read, however, the code that uses them is in most cases shorter, cleaner and simpler (compare yours and mine answers on the above link).
Another important concern is flexibility, especially in php, whose string library doesn't support unicode out of the box. In your concrete example, what happens when you decide to migrate your site to utf8? With ctype_alpha you're kinda out of luck, preg_match would require another pattern, but will keep working.
So, regexes are not slower, more readable and more flexible. Why on earth should we avoid them?

Regular expressions actually lead to a performance gain (not that such microoptimizations are in any way sensible) when they can replace multiple atomic string comparisons. So typically around five strpos() checks it gets advisable to use a regular expression instead. Moreso for readability.
And here's another thought to round things up: PCRE can handle conditionals faster than the Zend kernel can handle IF bytecode.
Not all regular expressions are designed equal, though. If the complexetiy gets too high, regex recursion can kill its performance advantage. Therefore it's often reconsiderworthy to mix regex matching and regular PHP string functions. Right tool for the job and all.

PHP itself recommends using string functions over regex functions when the match is straightforward. For example, from the preg_match manual page:
Do not use preg_match() if you only want to check if one string is contained in another string. Use strpos() or strstr() instead as they will be faster.
Or from the str_replace manual page:
If you don't need fancy replacing rules (like regular expressions), you should always use this function instead of ereg_replace() or preg_replace().
However, I find that people try to use the string functions to solve problems that would be better solved by regex. For instance, when trying to create a full-word string matcher, I have encountered people trying to use strpos($string, " $word ") (note the spaces), for the sake of "performance", without stopping to think about how spaces aren't the only way to delineate a word (think about how many string functions calls would be needed to fully replace preg_match('/\bword\b/', $string)).
My personal stance is to use string functions for matching static strings (ie. a match of a distinct sequence of characters where the match is always the same) and regular expressions for everything else.

Agreed that PHP people tend to over-emphasise performance of one function over another. That doesn't mean the performance differences don't exists -- they definitely do -- but most PHP code (and indeed most code in general) has much worse bottlenecks than the choice of regex over string-comparison. To find out where your bottlenecks are, use xdebug's profiler. Fix the issues it comes up with before worrying about fine-tuning individual lines of code.

They're both part of the language for a reason. IsAlpha is more expressive. For example, when an expression you're looking at is inherently alpha or not, and that has domain meaning, then use it.
But if it is, say, an input validation, and could possibly be changed to include underscores, dashes, etc., or if it is with other logic that requires regex, then I would use regex. This tends to be the majority of the time for me.

Using reflection for code gen?

I'm writing a console tool to generate some C# code for objects in a class library. The best/easiest way I can actual generate the code is to use reflection after the library has been built. It works great, but this seems like a haphazard approch at best. Since the generated code will be compiled with the library, after making a change I'll need to build the solution twice to get the final result, etc. Some of these issues could be mitigated with a build script, but it still feels like a bit too much of a hack to me.
My question is, are there any high-level best practices for this sort of thing?

Its pretty unclear what you are doing, but what does seem clear is that you have some base line code, and based on some its properties, you want to generate more code.
So the key issue here are, given the base line code, how do you extract interesting properties, and how do you generate code from those properties?
Reflection is a way to extract properties of code running (well, at least loaded) into the same execution enviroment as the reflection user code. The problem with reflection is it only provides a very limited set of properties, typically lists of classes, methods, or perhaps names of arguments. IF all the code generation you want to do can be done with just that, well, then reflection seems just fine. But if you want more detailed properties about the code, reflection won't cut it.
In fact, the only artifact from which truly arbitrary code properties can be extracted is the the source code as a character string (how else could you answer, is the number of characters between the add operator and T in middle of the variable name is a prime number?). As a practical matter, properties you can get from character strings are generally not very helpful (see the example I just gave :).
The compiler guys have spent the last 60 years figuring out how to extract interesting program properties and you'd be a complete idiot to ignore what they've learned in that half century.
They have settled on a number of relatively standard "compiler data structures": abstract syntax trees (ASTs), symbol tables (STs), control flow graphs (CFGs), data flow facts (DFFs), program triples, ponter analyses, etc.
If you want to analyze or generate code, your best bet is to process it first into such standard compiler data structures and then do the job. If you have ASTs, you can answer all kinds of question about what operators and operands are used. If you have STs, you can answer questions about where-defined, where-visible and what-type. If you have CFGs, you can answer questions about "this-before-that", "what conditions does statement X depend upon". If you have DFFs, you can determine which assignments affect the actions at a point in the code. Reflection will never provide this IMHO, because it will always be limited to what the runtime system developers are willing to keep around when running a program. (Maybe someday they'll keep all the compiler data structures around, but then it won't be reflection; it will just finally be compiler support).
Now, after you have determined the properties of interest, what do you do for code generation? Here the compiler guys have been so focused on generation of machine code that they don't offer standard answers. The guys that do are the program transformation community (http://en.wikipedia.org/wiki/Program_transformation). Here the idea is to keep at least one representation of your program as ASTs, and to provide special support for matching source code syntax (by constructing pattern-match ASTs from the code fragments of interest), and provide "rewrite" rules that say in effect, "when you see this pattern, then replace it by that pattern under this condition".
By connecting the condition to various property-extracting mechanisms from the compiler guys, you get relatively easy way to say what you want backed up by that 50 years of experience. Such program transformation systems have the ability to read in source code,
carry out analysis and transformations, and generally to regenerate code after transformation.
For your code generation task, you'd read in the base line code into ASTs, apply analyses to determine properties of interesting, use transformations to generate new ASTs, and then spit out the answer.
For such a system to be useful, it also has to be able to parse and prettyprint a wide variety of source code langauges, so that folks other than C# lovers can also have the benefits of code analysis and generation.
These ideas are all reified in the
DMS Software Reengineering Toolkit. DMS handles C, C++, C#, Java, COBOL, JavaScript, PHP, Verilog, ... and a lot of other langauges.
(I'm the architect of DMS, so I have a rather biased view. YMMV).

Have you considered using T4 templates for performing the code generation? It looks like it's getting much more publicity and attention now and more support in VS2010.
This tutorial seems database centric but it may give you some pointers: http://www.olegsych.com/2008/09/t4-tutorial-creatating-your-first-code-generator/ in addition there was a recent Hanselminutes on T4 here: http://www.hanselminutes.com/default.aspx?showID=170.
Edit: Another great place is the T4 tag here on StackOverflow: https://stackoverflow.com/questions/tagged/t4
EDIT: (By asker, new developments)
As of VS2012, T4 now supports reflection over an active project in a single step. This means you can make a change to your code, and the compiled output of the T4 template will reflect the newest version, without requiring you to perform a second reflect/build step. With this capability, I'm marking this as the accepted answer.

You may wish to use CodeDom, so that you only have to build once.
First, I would read this CodeProject article to make sure there are not language-specific features you'd be unable to support without using Reflection.

From what I understand, you could use something like Common Compiler Infrastructure (http://ccimetadata.codeplex.com/) to programatically analyze your existing c# source.
This looks pretty involved to me though, and CCI apparently only has full support for C# language spec 2. A better strategy may be to streamline your existing method instead.

I'm not sure of the best way to do this, but you could do this
As a post-build step on your base dll, run the code generator
As another post-build step, run csc or msbuild to build the generated dll
Other things which depend on the generated dll will also need to depend on the base dll, so the build order remains correct

When not to use Regex in C# (or Java, C++, etc.)

It is clear that there are lots of problems that look like a simple regex expression will solve, but which prove to be very hard to solve with regex.
So how does someone that is not an expert in regex, know if he/she should be learning regex to solve a given problem?
(See "Regex to parse C# source code to find all strings" for way I am asking this question.)
This seems to sums it up well:
Some people, when confronted with a problem, think “I know, I'll use
regular expressions.”
Now they have two problems...
(I have just changed the title of the question to make it more specific, as some of the problems with Regex in C# are solved in Perl and JScript, for example the fact that the two levels of quoting makes a Regex so unreadable.)

Don't try to use regex to parse hierarchical text like program source (or nested XML): they are proven to be not powerful enough for that, for example, they can't, for a string of parens, figure out whether they're balanced or not.
Use parser generators (or similar technologies) for that.
Also, I'd not recommend using regex to validate data with strict formal standards, like e-mail addresses.
They're harder than you want, and you'll either have unaccurate or a very long regex.

There are two aspects to consider:
Capability: is the language you are trying to recognize a Type-3 language (a regular one)? if so, then you might use regex, if not, you need a more powerful tool.
Maintainability: If it takes more time write, test and understand a regular expression than its programmatic counterpart, then it's not appropriate. How to check this is complicated, I'd recommend peer review with your fellows (if they say "what the ..." when they see it, then it's too complicated) or just leave it undocumented for a few days and then take a look by yourself and measure how long does it take to understand it.

I'm a beginner when it comes to regex, but IMHO it is worthwhile to spend some time learning basic regex, you'll realise that many, many problems you've solved differently could (and maybe should) be solved using regex.
For a particular problem, try to find a solution at a site like regexlib, and see if you can understand the solution.
As indicated above, regex might not be sufficient to solve a specific problem, but browsing a browsing a site like regexlib will certainly tell you if regex is the right solution to your problem.

You should always learn regular expressions - only this way you can judge when to use them. Normally they get problematic, when you need very good performance. But often it is a lot easier to use a regex than to write a big switch statement.
Have a look at this question - which shows you the elegance of a regex in contrast to the similar if() construct ...

Use regular expressions for recognizing (regular) patterns in text. Don't use it for parsing text into data structures. Don't use regular expressions when the expression becomes very large.
Often it's not clear when not to use a regular expression. For example, you shouldn't use regular expressions for proper email address verification. At first it may seem easy, but the specification for valid email addresses isn't as regular as you might think. You could use a regular expression to initial searching of email address candidates. But you need a parser to actually verify if the address candidate conforms to the given standard.

At the very least, I'd say learn regular expressions just so that you understand them fully and be able to apply them in situations where they would work. Off the top of my head I'd use regular expressions for:
Identifying parts of a string.
Checking whether a string conforms to a certain format or construction.
Finding substrings that match a certain pattern.
Transforming strings that fit a certain pattern into a different form (search-replace, capitalization, etc.).
Regular expressions at a theoretical level form the foundations of what a state machine is -- in computer science, you have Deterministic Finite Automata (DFA) and Non-deterministic Finite Automata (NFA). You can use regular expressions to enforce some kind of validation on inputs -- regular expression engines simply interpret or convert regular expression patterns/strings into actual runtime operations.
Once you know whether the string (or data) you want to determine to be valid could be tested by a DFA, you have a choice of whether to implement that DFA yourself using your own code or using a regular expression engine. You'll find that knowing about regular expressions will actually enhance your toolbox and your understanding of how string processing can actually get complex.
Based on simple regular expressions you can then look into learning about parsers and how parsers work. At the lowest level you're looking at lexical analysis (where regular expressions work) and at a higher level a grammar and semantic actions. These are the bases upon which compilers and interpreters work, as well as protocol parser implementations, and document rendering/transformation applications rely on.

The main concern here is maintainability.
It is obvious to me, that any programmer worth his salt must know regular expressions. Not knowing them is like, say, not knowing what abstraction and encapsulation is, only, probably, worse. So this is out of the question.
On the other hand, one should consider, that maintiaining regex-driven code (written in any language) can be a nightmare even for someone who is really good at them. So, in my opinion, the correct approach here is to only use them when it is inevitable and when the code using regex' will be more readable than its non-regex variant. And, of course, as has been already indicated, do not use them for something, that they are not meant to do (like xml). And no email address validation either (one of my pet peeves :P)!
But seriously, doesn't it feel wrong when you use all those substrs for something, that can be solved with a handful of characters, looking like line noise? I know it did for me.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.