I'm looking for a command-line tool where I can specify regex patterns (or similar) for certain file extensions (e.g. cs files, js files, xaml files) that can provide errors/warnings when run, like during a build. These would scan plain-text source code of all types.
I know there are tools for specific languages... I plan on using those too. This tool is for quick patterns we want to flag where we don't want to invest in writing a Roslyn rule, for example. I'd like to flag certain patterns or API usages in an easy way, where anyone can add a new rule without thinking too hard. Oftentimes we don't add rules because doing so is hard.
Features like source tokenization are a bonus. Open-source / free is a mega bonus.
Is there such a tool?
If you want to go old-school, you can dust off Awk for this one.
It scans files line by line (for a configurable definition of "line", with a sane default), cuts each line into fields (on whitespace, if memory serves), applies a set of regexes, and runs the code behind each matching regex. There are also special patterns for matching the beginning and end of a file, so you can print headers/footers.
It seems to be what you want, but IMHO a Perl or Ruby script is easier; they replaced Awk for me a long time ago. But Awk IS simple and straightforward for your use case, AFAICT.
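If you would rather keep it inside the .NET build itself, the same idea (a table of regexes applied line by line, each with a message) fits in a short C# console tool. This is only a sketch: the rule table is made up, and the MSBuild-style "warning" output format is an assumption about how you would surface the hits.

using System;
using System.IO;
using System.Text.RegularExpressions;

class RuleScanner
{
    // Hypothetical rules: file pattern, regex to flag, message to emit.
    static readonly (string Glob, Regex Pattern, string Message)[] Rules =
    {
        ("*.cs",   new Regex(@"\bDateTime\.Now\b"),  "Use DateTime.UtcNow instead of DateTime.Now."),
        ("*.js",   new Regex(@"\beval\s*\("),        "Avoid eval()."),
        ("*.xaml", new Regex(@"Foreground=""Red"""), "No hard-coded colors."),
    };

    static int Main(string[] args)
    {
        string root = args.Length > 0 ? args[0] : ".";
        int hits = 0;
        foreach (var (glob, pattern, message) in Rules)
        {
            foreach (string file in Directory.EnumerateFiles(root, glob, SearchOption.AllDirectories))
            {
                int lineNumber = 0;
                foreach (string line in File.ReadLines(file))
                {
                    lineNumber++;
                    if (pattern.IsMatch(line))
                    {
                        // "file(line): warning CODE: message" is the canonical format many build
                        // systems (including MSBuild's Exec task) pick up from console tools.
                        Console.WriteLine($"{file}({lineNumber}): warning RX0001: {message}");
                        hits++;
                    }
                }
            }
        }
        return hits == 0 ? 0 : 1; // non-zero exit code if you want hits to fail the build
    }
}

Adding a rule is then just adding a line to the table, which is the "anyone can add one without thinking too hard" property you're after.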
Our Source Code Search Engine (SCSE) can do this.
SCSE lexes (using language-accurate tokenization including skipping language-specific whitespace but retaining comments) a set of source files, and then builds a token index for each token type. One can provide SCSE with token-based search queries such as:
'if' '(' I '='
to search for patterns in the source code; this example "lints" C-like code for the common mistake of assigning a variable (I for "identifier") in an IF statement caused by accidental use of '=' instead of the intended '=='.
The search uses the token indexes to speed things up. Typically SCSE can search millions of lines of code in a few seconds, far faster than grep or any other scheme that insists on reading the file contents for each query. It also produces fewer false positives because the token checks are accurate, and the queries are much easier to write because one does not have to worry about whitespace/line breaks/comments.
A list of hits on the pattern can be logged or merely counted.
Normally SCSE is used interactively; queries produce a list of hits, and clicking on a hit produces a view of a page of the source text with the hit superimposed. However, one can also script calls on the SCSE.
SCSE can be obtained with language-accurate lexers for some 40 languages.
I need an efficient and (reasonably) reliable way to strip HTML tags from documents. It needs to be able to handle some fairly adverse circumstances:
It's not known ahead of time whether a document contains HTML at all.
More likely than not, any HTML will be very poorly formatted.
Individual documents might be very large, perhaps hundreds of megabytes.
Non-HTML content might still be littered with angle brackets for whatever odd reason, so naive regular expressions along the lines of <.+/?> are a no go. (And stripping XML is less desirable, anyway.)
I'm currently using HTML Agility Pack, and it's just not cutting the mustard. Performance is poorer than I'd like, it doesn't always handle truly awful formatting as gracefully as it could, and lately I've been running into problems with stack overflows on some of the more upsettingly large files.
I suspect that all of these problems stem from the fact that it's trying to actually parse the data, which makes it a poor fit for my needs. I don't want a syntax tree; I just want (most of) the tags to go away.
Using regular expressions seems like the obvious candidate. But then I remember this famous answer and it makes me worry that it's not such a great idea. But that diatribe's points are focused on parsing, not necessarily on dumb tag-stripping. So are regexes OK for this purpose?
Assuming it isn't a terrible idea, suggestions for regex that would do a good job are very welcome.
This regex finds all tags while avoiding angle brackets inside quoted attribute values:
<[a-zA-Z0-9/_-]+?((".*?")|([^<"']+?)|('.*?'))*?>
It can't handle escaped quotes inside quotes (but I think that is unnecessary in HTML).
Having a list of all allowed tags and substituting it into the first part of the regex, like <(tag1|tag2|...), could give a more precise solution. I'm afraid an exact solution can't be found given your assumption about stray angle brackets; think, for example, of something like b<a ...
EDIT:
Updated regex (it performs much better than the previous one). Also, if you need to strip out scripts, I suggest a little cleaning before the first pass, such as replacing <script.+?</script> with nothing.
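To show how the two passes fit together in C#, here is a minimal sketch; the RegexOptions flags and the timeout remark are my additions, not part of the original suggestion.

using System;
using System.Text.RegularExpressions;

static class TagStripper
{
    // Pre-pass: remove whole <script>...</script> blocks so their contents don't survive the strip.
    static readonly Regex ScriptBlocks = new Regex(@"<script.+?</script>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // The tag regex from above; Singleline lets quoted attribute values span line breaks.
    static readonly Regex Tags = new Regex(@"<[a-zA-Z0-9/_-]+?(("".*?"")|([^<""']+?)|('.*?'))*?>",
        RegexOptions.Singleline);

    public static string Strip(string html)
    {
        string withoutScripts = ScriptBlocks.Replace(html, string.Empty);
        return Tags.Replace(withoutScripts, string.Empty);
    }
}

On inputs in the hundreds-of-megabytes range it may also be worth using the Regex constructor overload that takes a matchTimeout, so a pathological stretch of text can't hang the whole run.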
I'm just thinking outside the box here, but you may consider leveraging something like Microsoft Word, or maybe OpenOffice.
I've used Word automation to translate HTML to DOC, RTF, or TXT. The HTML-to-TXT conversion native to Word would give you exactly what you want, stripping all of the HTML tags and converting the document to text format. Of course this wouldn't be efficient at all if you're processing tons of tiny HTML files, since there's some overhead in all of this. But if you're dealing with massive files this may not be a bad choice, as I'm sure Word has plenty of optimizations around these conversions. You could test this theory by manually opening one of your largest HTML files in Word, re-saving it as a TXT file, and seeing how long Word takes.
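For what it's worth, the Word route would look roughly like this with the Office primary interop assemblies. This is an untested sketch; the exact overloads differ between Word/PIA versions, and it assumes Word 2010+ for SaveAs2.

// Requires a reference to Microsoft.Office.Interop.Word and an installed copy of Word.
using Word = Microsoft.Office.Interop.Word;

static class WordConvert
{
    public static void HtmlToText(string htmlPath, string txtPath)
    {
        var app = new Word.Application { Visible = false };
        try
        {
            Word.Document doc = app.Documents.Open(htmlPath, ReadOnly: true);
            doc.SaveAs2(txtPath, Word.WdSaveFormat.wdFormatText); // Word does the tag stripping
            doc.Close(SaveChanges: false);
        }
        finally
        {
            app.Quit(); // always release the Word instance, even on failure
        }
    }
}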
And although I haven't tried it, I bet it's possible to programmatically interact with OpenOffice to accomplish something similar.
I need to develop an application that will read and understand a text file containing a custom language that describes a list of operations (i.e., a cooking recipe). This language has not been defined yet, but it will probably take one of the following shapes:
C++-like code
(This code is randomly generated, just for example purposes):
begin
    repeat(10)
    {
        bar(toto, 10, 1999, xxx);
    }
    result = foo(xxxx, 10);
    if(foo == ok)
    {
        ...
    }
    else
    {
        ...
    }
end
XML code
(This code is randomly generated, just for example purposes):
<recipe>
  <action name="foo" argument="bar, toto, xxx" repeat="10"/>
  <action name="bar" argument="xxxxx;10" condition="foo == ok">
    <true>...</true>
    <false>...</false>
  </action>
</recipe>
No matter which language is chosen, it will have to handle simple conditions and loops.
I have never done such a thing, but at first sight it seems to me that describing those operations in XML would be simpler, yet less powerful.
After browsing Stack Overflow, I've found some discussion of a tool called "ANTLR"... I started reading "The Definitive ANTLR Reference", but since I've never done this kind of thing, I find it hard to know whether it's really the kind of tool I need...
In other words, what do I need in order to read a text file, interpret it properly, and perform actions in my C# code? Those operations will interact with each other through simple conditions like:
If operation1 fails, do operation2, else do operation3.
Repeat operation4 10 times.
What would be the best language to describe those text files (XML, my own)? What are the key points in such a development?
I hope I'm being clear :)
Thanks a lot for your help and advices !
XML is great for storing hierarchical data in a verbose way. I think it is a terrible candidate for writing logic such as a program, however.
Have you considered using an existing grammar/scripting language that you can embed, rather than writing your own (a sketch follows the list)? E.g.:
Lua
Python
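To make the embedding idea concrete in C#, here is a rough sketch using IronPython purely as an illustration (a Lua binding such as MoonSharp would play the same role); the foo/bar operations are stand-ins for whatever your recipe steps really do.

// Illustration only: IronPython via the DLR hosting API.
using System;
using IronPython.Hosting;
using Microsoft.Scripting.Hosting;

class RecipeHost
{
    // Stand-ins for your real operations, exposed to the script.
    static string Foo(string what, int n) => "ok";
    static void Bar(string a, int b, int c, string d) { /* ... */ }

    static void Main()
    {
        ScriptEngine engine = Python.CreateEngine();
        ScriptScope scope = engine.CreateScope();
        scope.SetVariable("foo", (Func<string, int, string>)Foo);
        scope.SetVariable("bar", (Action<string, int, int, string>)Bar);

        // The "recipe" is ordinary Python, so loops and conditions come for free.
        string recipe = @"
for i in range(10):
    bar('toto', 10, 1999, 'xxx')
result = foo('xxxx', 10)
";
        engine.Execute(recipe, scope);
    }
}

You write the host once, and recipe authors only ever touch the script text.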
In one of my projects I actually started with an XML like language as I already had an XML parser and parsed the XML structure into an expression tree in memory to be interpreted/run.
This works out very nicely to get past the problem of tokenizing/parsing text files, and lets you concentrate instead on your 'language' and the logic of the operations in it. The downside is that writing the text files is a little strange and very wordy. It's also very unnatural for a programmer used to C/C++ syntax.
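A sketch of that "parse the XML into a tree and interpret it" approach, using the asker's <recipe>/<action> shape (the attribute names and the success/failure convention are assumptions):

using System;
using System.Xml.Linq;

static class RecipeInterpreter
{
    public static void Run(string xml)
    {
        XDocument doc = XDocument.Parse(xml);
        foreach (XElement action in doc.Root.Elements("action"))
            Execute(action);
    }

    static void Execute(XElement action)
    {
        string name = (string)action.Attribute("name");
        string[] args = ((string)action.Attribute("argument") ?? "").Split(',');
        int repeat = (int?)action.Attribute("repeat") ?? 1;

        for (int i = 0; i < repeat; i++)
        {
            bool ok = Invoke(name, args);                       // dispatch to the real operation
            XElement branch = action.Element(ok ? "true" : "false");
            if (branch != null)
                foreach (XElement nested in branch.Elements("action"))
                    Execute(nested);                            // conditions and loops nest naturally
        }
    }

    static bool Invoke(string name, string[] args)
    {
        Console.WriteLine($"{name}({string.Join(", ", args)})");
        return true;                                            // stand-in for the operation's result
    }
}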
Eventually you could replace the XML with a full-blown lexer and parser that reads a more natural C++-like text format into the same expression tree.
As for writing that lexer and parser, I found it easier to write them by hand: simple loops and logic for the lexer, and a recursive-descent parser on top of it.
That said, ANTLR is great at letting you write out rules for your language and generating the lexer and parser for you. This allows for a much more dynamic language that can change easily without you having to refactor everything when new things are added. So it might be worth learning, as it would save you a lot of time in rewrites as things change, compared with maintaining a hand-written one.
I'd recommend writing the app in F#. It has many useful features for parsing strings and XML, like pattern matching and active patterns.
For parsing C-like code I would recommend F# (I just did an interpreter in F#; it works like a charm).
For parsing XML I would recommend C#/F# plus the XmlDocument class.
You basically need to work on two files:
Operator dictionary
Code file in YourLanguage
Load and interpret the operators and then apply them recursively to your code file.
The best prefab answer: S-expressions
C-like syntax and XML are good first candidates. They have roughly opposite disadvantages. The C-like syntax won't add a ton of extra characters, but it's going to be hard to parse due to ambiguity, the variety of tokens, and probably a bunch more issues I can't think of. XML is relatively easy to parse and there's tons of example code, but it will also contain tons of extra text. It might also give you too many options for where to put language features - for example, is the number of times to repeat a loop an attribute, an element, or text?
S-expressions are more terse than XML for sure, maybe even more terse than the C-like syntax. At the same time, they're a natural fit for the task of applying operations to data. They don't admit ambiguity. Parsers are simple, and example code is easy to find.
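To make that concrete: the recipe could be written as something like (recipe (repeat 10 (bar toto 10 1999 xxx)) (if (foo xxxx 10) (op2) (op3))), and a reader for that shape fits in a few dozen lines of C#. The recipe syntax itself is my own invention here, just to mirror the earlier examples.

using System;
using System.Collections.Generic;

static class SExpr
{
    // Tokenize on parentheses and whitespace, then build nested lists of atoms.
    public static object Parse(string text)
    {
        var tokens = new Queue<string>(
            text.Replace("(", " ( ").Replace(")", " ) ")
                .Split((char[])null, StringSplitOptions.RemoveEmptyEntries));
        return ReadForm(tokens);
    }

    static object ReadForm(Queue<string> tokens)
    {
        string token = tokens.Dequeue();
        if (token != "(")
            return token;                      // an atom
        var list = new List<object>();
        while (tokens.Peek() != ")")
            list.Add(ReadForm(tokens));        // recurse for nested lists
        tokens.Dequeue();                      // consume ")"
        return list;
    }

    static void Main()
    {
        var tree = (List<object>)Parse("(recipe (repeat 10 (bar toto 10 1999 xxx)) (if (foo xxxx 10) (op2) (op3)))");
        Console.WriteLine(tree.Count);         // 3: the symbol 'recipe', the repeat form, the if form
    }
}

Walking the resulting lists to run the operations is then the same kind of recursive interpretation as in the XML answer above.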
This might save you from having to learn too much theory before you start experimenting. I'll emphasize MerickOWA's point that ANTLR and other parser generators are probably a bigger battle than you want to fight right now. See this discussion on programmers.stackexchange for some background on when the full generality of this type of tool could help.
It is clear that there are lots of problems that look like they could be solved with a simple regex, but which prove to be very hard to solve with one.
So how does someone that is not an expert in regex, know if he/she should be learning regex to solve a given problem?
(See "Regex to parse C# source code to find all strings" for way I am asking this question.)
This seems to sum it up well:
Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems...
(I have just changed the title of the question to make it more specific, as some of the problems with regex in C# are solved in Perl and JScript; for example, the fact that the two levels of quoting make a regex so unreadable.)
Don't try to use regexes to parse hierarchical text like program source (or nested XML): they are provably not powerful enough for that. For example, they can't tell, for a string of parentheses, whether it is balanced or not.
Use parser generators (or similar technologies) for that.
Also, I'd not recommend using regexes to validate data with strict formal standards, like e-mail addresses.
They're harder than you want, and you'll end up with a regex that is either inaccurate or very long.
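As a concrete illustration of the parentheses point, the check that a formal regular expression cannot express takes only a few lines of ordinary code:

static class Parens
{
    // Balanced-parenthesis check: trivial with a counter, but it needs unbounded counting,
    // which is exactly what a (formal) regular expression cannot do.
    public static bool IsBalanced(string s)
    {
        int depth = 0;
        foreach (char c in s)
        {
            if (c == '(') depth++;
            else if (c == ')' && --depth < 0) return false; // a ')' with nothing open
        }
        return depth == 0;
    }
}

(Some engines, .NET's balancing groups included, bolt on extensions that can do this, but at that point the pattern is usually harder to read than the loop.)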
There are two aspects to consider:
Capability: is the language you are trying to recognize a Type-3 (regular) language? If so, you might use a regex; if not, you need a more powerful tool.
Maintainability: if it takes more time to write, test, and understand a regular expression than its programmatic counterpart, then it's not appropriate. Checking this is tricky; I'd recommend peer review with your colleagues (if they say "what the ..." when they see it, it's too complicated), or just leave it undocumented for a few days, then take another look yourself and measure how long it takes to understand.
I'm a beginner when it comes to regex, but IMHO it is worthwhile to spend some time learning basic regex, you'll realise that many, many problems you've solved differently could (and maybe should) be solved using regex.
For a particular problem, try to find a solution at a site like regexlib, and see if you can understand the solution.
As indicated above, regex might not be sufficient to solve a specific problem, but browsing a site like regexlib will certainly tell you whether regex is the right solution to your problem.
You should always learn regular expressions; only that way can you judge when to use them. They normally become problematic when you need very good performance, but often it is a lot easier to use a regex than to write a big switch statement.
Have a look at this question, which shows the elegance of a regex in contrast to the equivalent if() construct...
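Something in the same spirit (a made-up example, not the one from the linked question): checking that a string is a three-letter uppercase code followed by four digits.

using System.Text.RegularExpressions;

static class CodeCheck
{
    // One readable line with a regex...
    public static bool IsCodeRegex(string s) =>
        Regex.IsMatch(s, @"^[A-Z]{3}\d{4}$");

    // ...versus the equivalent hand-rolled checks.
    public static bool IsCodeManual(string s)
    {
        if (s == null || s.Length != 7) return false;
        for (int i = 0; i < 3; i++)
            if (s[i] < 'A' || s[i] > 'Z') return false;
        for (int i = 3; i < 7; i++)
            if (!char.IsDigit(s[i])) return false;
        return true;
    }
}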
Use regular expressions for recognizing (regular) patterns in text. Don't use it for parsing text into data structures. Don't use regular expressions when the expression becomes very large.
Often it's not clear when not to use a regular expression. For example, you shouldn't use regular expressions for proper email address verification. At first it may seem easy, but the specification for valid email addresses isn't as regular as you might think. You could use a regular expression for an initial search for email address candidates, but you need a parser to actually verify that a candidate conforms to the standard.
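One pragmatic version of that split in .NET looks like the following. It's a sketch of the idea only: the candidate pattern is deliberately loose, and MailAddress is a convenient parser rather than a full RFC 5322 validator.

using System;
using System.Collections.Generic;
using System.Net.Mail;
using System.Text.RegularExpressions;

static class EmailScan
{
    // Loose regex: just finds things that *look* like addresses in a blob of text.
    static readonly Regex Candidate = new Regex(@"[^\s@]+@[^\s@]+\.[^\s@]+");

    public static IEnumerable<string> FindAddresses(string text)
    {
        foreach (Match m in Candidate.Matches(text))
        {
            string parsed = null;
            try { parsed = new MailAddress(m.Value).Address; }   // the real verification step
            catch (FormatException) { /* looked like an address, but isn't one */ }
            if (parsed != null) yield return parsed;
        }
    }
}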
At the very least, I'd say learn regular expressions just so that you understand them fully and are able to apply them in situations where they would work. Off the top of my head I'd use regular expressions for the following (a small illustration of each comes after the list):
Identifying parts of a string.
Checking whether a string conforms to a certain format or construction.
Finding substrings that match a certain pattern.
Transforming strings that fit a certain pattern into a different form (search-replace, capitalization, etc.).
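A throwaway illustration of those four uses; the log line and the patterns are made up for the example.

using System;
using System.Text.RegularExpressions;

class RegexUses
{
    static void Main()
    {
        string log = "2024-05-01 ERROR disk full";

        // 1. Identifying parts of a string (named groups).
        Match m = Regex.Match(log, @"^(?<date>\d{4}-\d{2}-\d{2}) (?<level>\w+) (?<msg>.+)$");
        Console.WriteLine(m.Groups["level"].Value);                 // ERROR

        // 2. Checking whether a string conforms to a certain format.
        bool looksLikeDate = Regex.IsMatch("2024-05-01", @"^\d{4}-\d{2}-\d{2}$");
        Console.WriteLine(looksLikeDate);                           // True

        // 3. Finding substrings that match a certain pattern.
        foreach (Match word in Regex.Matches(log, @"\b[A-Z]{2,}\b"))
            Console.WriteLine(word.Value);                          // ERROR

        // 4. Transforming strings that fit a pattern (search-replace).
        Console.WriteLine(Regex.Replace(log, @"\d", "#"));          // ####-##-## ERROR disk full
    }
}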
Regular expressions at a theoretical level are the foundation of state machines: in computer science you have deterministic finite automata (DFAs) and non-deterministic finite automata (NFAs). You can use regular expressions to enforce some kind of validation on inputs; regular expression engines simply interpret or convert regular expression patterns into actual runtime operations.
Once you know whether the string (or data) you want to determine to be valid could be tested by a DFA, you have a choice of whether to implement that DFA yourself using your own code or using a regular expression engine. You'll find that knowing about regular expressions will actually enhance your toolbox and your understanding of how string processing can actually get complex.
Building on simple regular expressions, you can then look into learning about parsers and how they work. At the lowest level you have lexical analysis (where regular expressions work), and at a higher level a grammar and semantic actions. These are the foundations on which compilers and interpreters, as well as protocol parsers and document rendering/transformation applications, are built.
The main concern here is maintainability.
It is obvious to me that any programmer worth his salt must know regular expressions. Not knowing them is like, say, not knowing what abstraction and encapsulation are - only probably worse. So that is out of the question.
On the other hand, one should consider that maintaining regex-driven code (written in any language) can be a nightmare even for someone who is really good at them. So, in my opinion, the correct approach is to use them only when it is inevitable and when code using a regex will be more readable than its non-regex variant. And, of course, as has already been said, do not use them for things they are not meant to do (like XML). And no email address validation either (one of my pet peeves :P)!
But seriously, doesn't it feel wrong when you use all those substrs for something that can be solved with a handful of characters that merely look like line noise? I know it did for me.
C# and Java allow almost any character in class names, method names, local variables, etc. Is it bad practice to use non-ASCII characters, testing the boundaries of poor editors and analysis tools and making the code difficult for some people to read, or is American arrogance the only argument against it?
I would stick to English, simply because you usually never know who will work on that code, and because some third-party tools used in the build/testing/bug-tracking process may have problems with it. Typing äöüß on a non-German keyboard is simply a PITA, and I believe that anyone involved in software development should speak English - but maybe that's just my arrogance as a non-native English speaker.
What you call "American arrogance" is not whether or not your program uses international variable names, it's when your program thinks "Währung" and "Wahrung" are the same words.
I'd say it entirely depends on who's working on the codebase.
If you have a small group of developers who all share a common language and you don't ever plan needing anyone who doesn't speak the language to work on the code then go ahead and use whatever characters you want.
If you need to have people of varying cultures and languages working on the code then it's probably best to stick with English since it's the common denominator for just about everyone in the world.
If your business users are non-English speakers, and you think Domain-Driven Design has something to it, then there is another aspect: how do we, as developers, use the same domain language as our business without any translation overhead?
That means not only translating between languages, say English and Norwegian, but also using the very same words: we should use exactly the terms our business uses for our entity classes and services.
I have found it easier to just give in and use my native language. Now that my code uses the same words, it's easier to have a conversation with my domain experts. And after a while you get used to it, just like you got used to code without Hungarian notation.
I used to work in a development team that happily wiped their asses with any naming (and, for that matter, any other coding) conventions. Believe it or not, having to cope with ä's and ö's in the code was a contributing factor in my resigning. Though I'm Finnish, I prefer writing code with US keyboard settings, because curly and square brackets are a pain to type on a Finnish keyboard (try Right Alt + 7 and 0 for curlies).
So I say stick with the ASCII characters.
Here's an example of where I've used non-ASCII identifiers, because I found it more readable than replacing the Greek letters with their English names - even though I don't have θ or φ on my keyboard (I relied on copy-and-paste).
However these are all local variables. I would keep non-ASCII identifiers out of public interfaces.
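Something in this style, say (my own example, since the original snippet isn't shown here); the Greek letters stay in the locals while the public surface remains ASCII:

using System;

static class Spherical
{
    public static (double R, double Polar, double Azimuth) FromCartesian(double x, double y, double z)
    {
        double r = Math.Sqrt(x * x + y * y + z * z);
        double θ = Math.Acos(z / r);      // polar angle, as in the usual math notation
        double φ = Math.Atan2(y, x);      // azimuthal angle
        return (r, θ, φ);
    }
}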
It depends:
Does your team conform to any existing standards that require you to use ASCII?
Is your code ever going to be feasibly reused or read by someone who doesn't speak your native language?
Do you envision a scenario where you'll need to ask for help online and will therefore not be able to copy-paste your code sample in as-is?
Are you unsure whether your entire suite of tools supports your code's encoding?
If you answered 'yes' to any of the above, stay ASCII only. If not, go forward at your own risk.
Part of the problem is that the Java/C# languages and their libraries are based on English words like if and toString(). I personally would not like to switch between a non-English language and English while reading code.
However, if your database, UI, and business logic (including metaphors) are already in some non-English language, there's no need to translate every method name and variable into English.
If you get past the other prerequisites, you then have one extra (IMHO more important) one: how difficult the symbol is to type.
On my regular en-US keyboard, the only way I know of to type the letter ç is to hold Alt and hit 0231 on the numeric keypad, or copy and paste.
That would be a HUGE roadblock in the way of typing quickly. You don't want to slow your coding down with trivial stuff like this if you aren't forced to. International keyboards may alleviate this, but then what happens if you have to code on your laptop, which doesn't have an international keyboard, and so on?
I would stick to ASCII characters, because if anyone on your development team is using an SDK that only supports ASCII, or you wanted to make your code open source, a lot of problems could arise. Personally, I would not do it even if you are not planning on bringing anyone who doesn't speak the language onto the project, because you are running a business, and it seems to me that someone running a business would want that business to expand, which in this day and age means transcending national borders. My opinion is that English is the language of the realm, and even if you name your variables in a different language, there is little to no point in using non-ASCII characters in your source. Leave it to the language to deal with UTF-8 data: my iPhone program (which involves tons of user data going back and forth between the phone and server) has full UTF-8 support, but has no UTF-8 in the source code. It just seems to open such a large can of worms for almost no benefit.
There is another hazard to using non-ASCII characters, though it will probably only bite in obscure cases. The allowed characters are defined in terms of the methods Character.isJavaIdentifierStart(int) and Character.isJavaIdentifierPart(int), which are defined in terms of Unicode. However, the exact version of Unicode used depends on the version of the Java platform, as specified in the documentation for java.lang.Character.
Since character properties change slightly from one Unicode version to the next, it's possible (but probably very unlikely) you could have identifiers that are valid in one version of Java, but not in the next.
As already pointed out, unless method names mostly match the language, it is a bit weird to constantly switch languages while reading.
For the Scandinavian languages and German, which I can speak and thus speak for, I would at least recommend using the standard substitutions, i.e.
ä/æ -> ae, ö/ø -> oe, å -> aa, ü -> ue
etc., just in case, as others may find it difficult to type the original letters without keyboard/keymap changes. Think what would happen if you suddenly had to work with a codebase where the developers used a third language (for instance one including the French ç) and didn't do this. Switching between more than two keymaps to type efficiently would be painful in my experience.