I'm looking for a command-line tool where I can specify regex patterns (or similar) for certain file extensions (e.g. cs files, js files, xaml files) that can provide errors/warnings when run, like during a build. These would scan plain-text source code of all types.
I know there are tools for specific languages... I plan on using those too. This tool is for quick patterns we want to flag where we don't want to invest in writing a Roslyn rule, for example. I'd like to flag certain patterns or API usages in an easy way where anyone can add a new rule without thinking too hard. Oftentimes we don't add rules because it is hard.
Features like source tokenization are a bonus. Open-source / free is a mega bonus.
Is there such a tool?
If you want to go old-skool, you can dust off Awk for this one.
It scans files line by line (for some configurable definition of line, with a sane default), cuts them into pieces (on whitespace, if memory serves) and applies a set of regexes, firing the code behind each matching regex. There are also conditions that match the beginning and end of a file so you can print headers/footers.
It seems to be what you want, but IMHO a Perl or Ruby script is easier; one of those replaced Awk for me a long time ago. But it IS simple and straightforward for your use case, AFAICT.
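For example, a crude rule that flags a banned API in C# files could look something like this (my sketch, not a ready-made rule set; Thread.Sleep is just a stand-in for whatever pattern you want to flag):

awk '/Thread\.Sleep/ { printf("%s:%d: warning: banned API: %s\n", FILENAME, FNR, $0); bad = 1 }
     END { exit bad }' *.cs

A non-zero exit code then fails whatever build step runs it.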
Our Source Code Search Engine (SCSE) can do this.
SCSE lexes (using language-accurate tokenization including skipping language-specific whitespace but retaining comments) a set of source files, and then builds a token index for each token type. One can provide SCSE with token-based search queries such as:
'if' '(' I '='
to search for patterns in the source code; this example "lints" C-like code for the common mistake of assigning a variable (I for "identifier") in an IF statement caused by accidental use of '=' instead of the intended '=='.
The search is accomplished using the token indexes to speed it up. Typically SCSE can search millions of lines of code in a few seconds, far faster than grep or other schemes that insist on reading the file content for each query. It also produces fewer false positives because the token checks are accurate, and the queries are much easier to write because one does not have to worry about whitespace/line breaks/comments.
A list of hits on the pattern can be logged or merely counted.
Normally SCSE is used interactively; queries produce a list of hits, and clicking on a hit produces a view of a page of the source text with the hit superimposed. However, one can also script calls on the SCSE.
SCSE can be obtained with language-accurate lexers for some 40 languages.
I work on a solution that potentially has hard-coded strings. The task is to identify, in the solution's code (C# in my case), all the phrases that aren't touched by the translation mechanism.
Usually the translation mechanisms are something like
myTranslator.Get("messageKey", myLang)
so I supposed "\w .*" would potentially give me phrases (I use a space after a word, because Message Keys rarely contain spaces...), but that also gave me HTML attributes like class="alpha beta gamma", which are not phrases...
"Fortunately", the supposed hard-coded strings are in French, so I tried to find
".*[äÄëËüÜïÏöÖâÂêÊûÛîÎôÔèÈàÀùÚçÇéÉ].*"
in the *.cs;*.cshtml files (the solution is an ASP.NET one)...
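In code, that brute-force search looks roughly like this (a sketch, not part of the original question; solutionDir is a placeholder and encoding handling is simplified):

using System;
using System.IO;
using System.Text.RegularExpressions;

var solutionDir = @"C:\path\to\solution";   // placeholder
// Same accent character class as the search pattern above.
var accented = new Regex("[äÄëËüÜïÏöÖâÂêÊûÛîÎôÔèÈàÀùÚçÇéÉ]");

// "*.cs*" catches .cs and .cshtml (and, admittedly, .csproj).
foreach (var file in Directory.EnumerateFiles(solutionDir, "*.cs*", SearchOption.AllDirectories))
    foreach (var line in File.ReadLines(file))
        if (accented.IsMatch(line))
            Console.WriteLine($"{file}: {line.Trim()}");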
That works rather well, but it only finds phrases containing accents... Is there a smarter way to identify the hard-coded (non-translated) strings in the solution's code?
What are the general recommendations to identify, label and remove such strings from a (localisable) solution?
Question:
I want to render MediaWiki syntax (and I mean MediaWiki syntax as used by Wikipedia, not some other wiki format from some other engine such as WikiPlex), and do it in C#.
Input: MediaWiki Markup string
Output: HTML string
There are some alternative MediaWiki parsers, but nothing in C#, and additionally P/Invoking the C/C++ ones looks bleak because of the structure of those libraries.
As syntax guidance, I use
http://en.wikipedia.org/wiki/Wikipedia:Cheatsheet
My first goal is to render that page's markup correctly.
Markup can be seen here:
http://en.wikipedia.org/w/index.php?title=Wikipedia:Cheatsheet&action=edit
Now, if I use regex, it's not of much use, because one can't exactly say which tag ends which starting one, especially when some elements, such as italic, become an attribute of the parent element.
On the other hand, parsing character by character is not a good approach either, because
for example ''' means bold, '' means italic, and ''''' means bold and italic...
I looked into porting some of the other parsers' code, but the Java implementations are obscure, and the Python implementations have a very different regex syntax.
The best approach I see so far would be to port mwlib to IronPython
http://www.mediawiki.org/wiki/Alternative_parsers
But frankly, I'm not looking forward to having the IronPython runtime added as a dependency to my application, and even if I would want to, the documentation is bad at best.
Update per 2017:
You can use ParseoidSharp to get a fully compatible MediaWiki-renderer.
It uses the official Wikipedia Parsoid library via NodeServices.
(NetStandard 2.0)
Since Parsoid is GPL 2.0, and the GPL code is invoked in Node.js in a separate process via the network, you can even use any license you like ;)
Pre-2017
Problem solved.
As originally assumed, the solution lies in using one of the existing alternative parsers in C#.
WikiModel (Java) works well for that purpose.
First attempt was to P/Invoke kiwi.
It worked, but failed because:
kiwi uses char* (fails on anything non-English/ASCII)
it is not thread-safe
you need to ship a native DLL in the code for every architecture
(I did add x86 and amd64, then it went kaboom on my ARM processor)
Second attempt was mwlib.
That failed because somehow IronPython doesn't work as it should.
Third attempt was Swebele, which essentially turned out to be academic vaporware.
The fourth attempt was using the original mediawiki renderer, using Phalanger. That failed because the MediaWiki renderer is not really modular.
The fifth attempt was using Wiky.php via Phalanger, which worked, but was slow, and Wiky.php doesn't implement MediaWiki very completely.
The sixth attempt was using bliki via ikvmc, which failed because of the excessive use of 3rd-party libraries ==> it compiles, but yields only null-reference exceptions.
The seventh attempt was using JavaScript in C#, which worked but was very slow, plus the MediaWiki functionality implemented was very incomplete.
The 8th attempt was writing my own "parser" via regex.
But the time required to make it work was just excessive, so I stopped.
The 9th attempt was successful.
Using ikvmc on WikiModel yields a useful dll.
The problem there was that the example code was hopelessly out of date.
But using Google and the WikiModel source code, I was able to piece it together.
The end-result can be found here:
https://github.com/ststeiger/MultiWikiParser
Why shouldn't this be possible with regular expressions?
inputString = Regex.Replace(inputString, @"(?:''''')(.*?)(?:''''')", "<strong><em>$1</em></strong>");
inputString = Regex.Replace(inputString, @"(?:''')(.*?)(?:''')", "<strong>$1</strong>");
inputString = Regex.Replace(inputString, @"(?:'')(.*?)(?:'')", "<em>$1</em>");
This will, as far as I can see, render all 'Bold and italic', 'Bold' and 'Italic' text.
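For example, applied in the order shown (my own quick check, not part of the original answer):

string inputString = "Some '''''really''''' important text";
// ... run the three Regex.Replace calls above ...
// inputString is now: "Some <strong><em>really</em></strong> important text"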
Here is how I once implemented a solution (a rough code sketch follows the list below):
define your regular expressions for Markup->HTML conversion
the regular expressions must be non-greedy
collect the regular expressions in a Dictionary<char, List<Regex>>
The char is the first (markup) character of each regex, and the regexes must be sorted by markup keyword length, descending, e.g. === before ==.
Iterate through the characters of the input string and check Dictionary.ContainsKey(char). If it does contain the character, search the list for a matching regex. The first matching regex wins.
As MediaWiki allows recursive markup (except inside <pre> and a few others), the string inside the markup must also be processed recursively in this fashion.
If there is a match, skip ahead in the input string by the number of characters the regex matched. Otherwise proceed to the next character.
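In C#, that loop might look roughly like this (my own sketch, not the original code; only the bold/italic rules are shown and <pre> handling is omitted):

using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

class WikiMarkupSketch
{
    // Keyed by the first markup character; longer keywords listed first, as described above.
    static readonly Dictionary<char, List<(Regex Pattern, string Open, string Close)>> Rules = new()
    {
        ['\''] = new()
        {
            (new Regex("'''''(.+?)'''''"), "<strong><em>", "</em></strong>"),
            (new Regex("'''(.+?)'''"),     "<strong>", "</strong>"),
            (new Regex("''(.+?)''"),       "<em>", "</em>"),
        }
    };

    public static string Render(string input)
    {
        var output = new StringBuilder();
        int i = 0;
        while (i < input.Length)
        {
            Match hit = null;
            string open = null, close = null;
            if (Rules.TryGetValue(input[i], out var candidates))
            {
                foreach (var (pattern, o, c) in candidates)
                {
                    var m = pattern.Match(input, i);          // first matching regex wins
                    if (m.Success && m.Index == i) { hit = m; open = o; close = c; break; }
                }
            }
            if (hit != null)
            {
                // Markup nests, so the inner text is rendered recursively.
                output.Append(open).Append(Render(hit.Groups[1].Value)).Append(close);
                i += hit.Length;                              // skip ahead by the matched length
            }
            else
            {
                output.Append(input[i]);
                i++;
            }
        }
        return output.ToString();
    }
}

Render("'''''x'''''") then yields <strong><em>x</em></strong>, and characters that match no rule fall through untouched.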
Kiwi (https://github.com/aboutus/kiwi, mentioned on http://mediawiki.org/wiki/Alternative_parsers) may be a solution. Since it is C-based and I/O is done simply via stdin/stdout, it should not be too hard to create a "PInvoke"-able DLL from it.
As with the accepted solution, I found Parsoid is the best way forward, as it's the official library and has the greatest support for the MediaWiki markup. That said, I found ParseoidSharp to be using obsolete methods such as Microsoft.AspNetCore.NodeServices, and really it's just a wrapper around a fairly old version of Parsoid's npm package.
Since there is a fairly current version of Parsoid on node.js, you can use Jering.Javascript.NodeJS to do the same thing as ParseoidSharp; the steps are fairly similar too.
Install Node.js.
Download Parsoid (https://www.npmjs.com/package/parsoid) and place the required files in your project.
In PowerShell, cd to your project and run
npm install
Then it's as simple as
output = Await StaticNodeJSService.InvokeFromFileAsync(Of String)(HttpContext.Current.Request.PhysicalApplicationPath & "./NodeScripts/parsee.js", args:=New Object() {Markup})
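For a C# project, the equivalent call looks roughly like this (a sketch; appRoot and markup are placeholders for your application root and the wiki text, and parsee.js is the same wrapper script as above):

using System.IO;
using Jering.Javascript.NodeJS;

string html = await StaticNodeJSService.InvokeFromFileAsync<string>(
    Path.Combine(appRoot, "NodeScripts", "parsee.js"),
    args: new object[] { markup });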
Bonus: it's now much easier than with ParseoidSharp's method to add the options required; e.g., you'll probably want to set the domain to your own domain.
I need to develop an application that will read and understand a text file containing a custom language that describes a list of operations (e.g. a cooking recipe). This language has not been defined yet, but it will probably take one of the following shapes:
C++-like code
(This code is randomly generated, just for example purposes):
begin
    repeat(10)
    {
        bar(toto, 10, 1999, xxx);
    }
    result = foo(xxxx, 10);
    if (foo == ok)
    {
        ...
    }
    else
    {
        ...
    }
end
XML code
(This code is randomly generated, just for example purposes):
<recipe>
    <action name="foo" argument="bar, toto, xxx" repeat="10"/>
    <action name="bar" argument="xxxxx;10" condition="foo == ok">
        <true>...</true>
        <false>...</false>
    </action>
</recipe>
No matter which language is chosen, it will have to handle simple conditions and loops.
I have never done such a thing, but at first sight it occurs to me that describing those operations in XML would be simpler yet less powerful.
After browsing Stack Overflow, I found some discussions about a tool called "ANTLR"... I started reading "The Definitive ANTLR Reference", but since I have never done that kind of stuff, I find it hard to know whether it's really the kind of tool I need...
In other words, what do I need to read a text file, interpret it properly and perform actions in my C# code? Those operations will interact with each other through simple conditions like:
If operation1 failed, I do operation2 else operation3.
Repeat the operation4 10 times.
What would be the best language to describe those text files (XML, my own)? What are the key points during such a development?
I hope I'm being clear :)
Thanks a lot for your help and advices !
XML is great for storing relational data in a verbose way. I think it is a terrible candidate for writing logic such as a program, however.
Have you considered using an existing grammar/scripting language that you can embed, rather than writing your own? E.g:
Lua
Python
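For example, embedding Lua in a C# host takes only a few lines (a sketch assuming the MoonSharp NuGet package, which this answer does not name; bar is a stand-in for one of the recipe operations):

using System;
using MoonSharp.Interpreter;

var script = new Script();

// Expose a host operation to the recipe script as a callback.
script.Globals["bar"] = (Action<string, int>)((what, count) =>
    Console.WriteLine($"bar({what}, {count})"));

// The "recipe" itself is then ordinary Lua, with loops and conditions for free.
script.DoString(@"
    for i = 1, 10 do
        bar('toto', 10)
    end
");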
In one of my projects I actually started with an XML-like language, as I already had an XML parser, and parsed the XML structure into an expression tree in memory to be interpreted/run.
This works out very nicely for getting past the problem of tokenizing/parsing text files, so you can concentrate instead on your 'language' and the logic of its operations. The downside is that writing the text files is a little strange and very wordy. It's also very unnatural for a programmer used to C/C++ syntax.
Eventually you could easily replace the XML with a full-blown scanner & parser that turns a more 'natural', C++-like text format into your expression tree.
As for writing a scanner & parser, I found it easier to write these by hand: simple logic flow/loops for the scanner, and a recursive descent parser on top of it.
That said, ANTLR is great at letting you write out rules for your language and generating the scanner & parser for you. This allows for a much more dynamic language which can easily change without you having to refactor everything when new things are added. So it might be worth looking into, as it would save you a lot of time in rewrites as things change, compared to hand-writing your own.
I'd recommend writing the app in F#. It has many useful features for parsing strings and XML, such as pattern matching and active patterns.
For parsing C-like code I would recommend F# (I just wrote an interpreter in F#; it works like a charm).
For parsing XML I would recommend C#/F# + the XmlDocument class.
You basically need to work on two files:
Operator dictionary
Code file in YourLanguage
Load and interpret the operators and then apply them recursively to your code file.
The best prefab answer: S-expressions
C and XML are good first steps. They have sort of opposite disadvantages. The C-like syntax won't add a ton of extra characters, but it's going to be hard to parse due to ambiguity, the variety of tokens, and probably a bunch more issues I can't think of. XML is relatively easy to parse and there's tons of example code, but it will also contain tons of extra text. It might also give you too many options for where to stick language features - for example, is the number of times to repeat a loop an attribute, element or text?
S-expressions are more terse than XML for sure, maybe even C. At the same time, they're specific to the task of applying operations to data. They don't admit ambiguity. Parsers are simple and easy to find example code for.
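For instance, the C-like recipe from the question might look roughly like this as S-expressions (my sketch; the operation names are taken from the question's example):

(repeat 10
  (bar toto 10 1999 xxx))
(set result (foo xxxx 10))
(if (= result ok)
    (...)
    (...))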
This might save you from having to learn too much theory before you start experimenting. I'll emphasize MerickOWA's point that ANTLR and other parser generators are probably a bigger battle than you want to fight right now. See this discussion on programmers.stackexchange for some background on when the full generality of this type of tool could help.
I am not that expert a user of Stack Overflow, but what I know is that my question is somewhat related to
Lucene and Special Characters
But I have a slightly different environment.
I have an index built with Lucene.NET, but I am searching it with Solr. Is it possible to search for the special characters without re-indexing? When re-indexing I can change my analyzer, but is it possible to search without re-indexing?
You will need to set up your query analyzer in Solr to match the analyzer config used at index-time.
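In schema.xml that typically means the index-time and query-time analyzer chains of the field type line up, roughly like this (a sketch; the actual tokenizer and filters must mirror whatever your Lucene.NET analyzer did):

<fieldType name="text_general" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <!-- same chain as at index time, so special characters survive both sides -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>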
Solr has a very handy tool -- Field Analysis (solr/admin/analysis.jsp) -- for analyzing analyzer configurations. Check the verbose check boxes, and inspect how analyzers process your query terms. Lucid Imagination has a section about it.
If you're not sure what analyzers were run at index time, then you will also have to inspect what the terms actually look like in the index (although it will likely be very hard to prove that your query analysis is correct). You can use the LukeRequestHandler for this.
If you can conclude a one-to-one mapping of terms (between index-time analysis and query-time analysis), then you're home safe; otherwise you might be better off re-indexing.
I need to build an assembler for a CPU architecture that I've built. The architecture is similar to MIPS, but this is of no importance.
I started using C#, although C++ would be more appropriate. (C# means faster development time for me).
My only problem is that I can't come up with a good design for this application. I am building a two-pass assembler. I know what I need to do in each pass.
I've implemented the first pass, and I realised that if I have two lines of assembly code on the same line... no error is thrown. This means only one thing: poor parsing techniques.
So, almighty programmers, fathers of assemblers, enlighten me how I should proceed.
I just need to support symbols and data declaration. Instructions have fixed size.
Please let me know if you need more information.
I've written three or four simple assemblers. Without using a parser generator, what I did was model them on the S-C assembler for the 6502, which I knew best.
To do this, I used a simple syntax - a line was one of the following:
nothing
[label] [instruction] [comment]
[label] [directive] [comment]
A label was one letter followed by any number of letters or numbers.
An instruction was <whitespace><mnemonic> [operands]
A directive was <whitespace>.XX [operands]
A comment was a * up to end of line.
Operands depended on the instruction and the directive.
Directives included
.EQ equate for defining constants
.OR set origin address of code
.HS hex string of bytes
.AS ascii string of bytes - any delimiter except white space - whatever started it ended it
.TF target file for output
.BS n reserve block storage of n bytes
When I wrote it, I wrote simple parsers for each component. Whenever I encountered a label definition, I put it in a table with its target address. Whenever I encountered a label I didn't know, I marked the instruction as incomplete and put the unknown label in a "to fix" table with a reference to the instruction that needed fixing.
After all source lines had been processed, I looked through the "to fix" table and tried to find each entry in the symbol table. If I found it, I patched the instructions; if not, it was an error.
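In C#, that bookkeeping boils down to something like this (a sketch, not the original code; Patch stands in for whatever writes the address into the emitted instruction):

using System;
using System.Collections.Generic;

class SymbolFixups
{
    readonly Dictionary<string, int> symbols = new();          // label -> address
    readonly List<(string Label, int Site)> pending = new();   // unresolved references

    public void Define(string label, int address) => symbols[label] = address;

    public void Reference(string label, int site)
    {
        if (symbols.TryGetValue(label, out var target))
            Patch(site, target);                               // label already known
        else
            pending.Add((label, site));                        // fix up in the final sweep
    }

    public void Resolve()
    {
        foreach (var (label, site) in pending)
        {
            if (!symbols.TryGetValue(label, out var target))
                throw new InvalidOperationException($"Undefined label: {label}");
            Patch(site, target);
        }
    }

    void Patch(int site, int target)
    {
        // Placeholder: write 'target' into the operand field of the instruction at 'site'.
    }
}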
I kept a table of instruction names and all the valid addressing modes for operands. When I got an instruction, I tried to parse each addressing mode in turn until something worked.
Given this structure, it should take a day maybe two to do the whole thing.
Look at this Assembler Development Kit from Randy Hyde, author of the famous "The Art of Assembly Language":
The Assembler Developer's Kit
The first pass of a two-pass assembler assembles the code and puts placeholders for the symbols (as you don't know how big everything is until you've run the assembler). The second pass fills in the addresses. If the assembled code subsequently needs to be linked to external references, this is the job of the eponymous linker.
If you want to write an assembler that just works and spits out a hex file to be loaded on a microcontroller, it can be simple and easy. Part of my ciforth library is a full Pentium assembler, used to add inline definitions, of about 150 lines. There is an assembler for the 8080 of a couple dozen lines.
The principle is explained at http://home.hccnet.nl/a.w.m.van.der.horst/postitfixup.html.
It amounts to applying the blackboard design pattern to the problem. You start by laying down the instruction, leaving holes for any and all operands. Then you fill in the holes as you encounter the parameters.
There is a strict separation between the generic tool and the instruction set.
In case the assembler you need is just for yourself, and there are no requirements other than usability (i.e. not a homework assignment), there is an example implementation at http://home.hccnet.nl/a.w.m.van.der.horst/forthassembler.html. If you dislike Forth, there is also an example implementation in Perl. If the Pentium instruction set is too much to chew on, you should still be able to understand the principle and the generic part.
You're advised to have a look at the asi8080.frt file first. This is 389 WOC (Words Of Code, not Lines Of Code). An experienced Forther familiar with the instruction set can crank out an assembler like that in an evening. The Pentium is a bitch.