I need to build a program that can read a live log stream, identify each line, and parse data out of it.
Problems I'm facing:
The log output changes from version to version of the logger.
There are about 500 different types of lines that the logger can output.
There is no easy way to do this, as far as I can tell. Is there an established way of approaching this kind of problem? It seems pretty overwhelming.
My current solution is to read the logs and run each line through all my regexes to test for a match.
I have an array of a type I call DataReader; each DataReader contains multiple regex patterns to read different versions of the logger's lines.
First it tests if the DataReader can read it, using:
bool canUse(String text);
If it returns false, it tries another DataReader until it returns true.
If canUse returns true, it will then construct the data structure using
CompiledLogData constructData(String text);
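In code, the matching loop is roughly this (a simplified sketch of my current setup, with DataReader written as an interface; the real readers each hold several compiled Regex patterns, one per logger version):

    using System.Collections.Generic;

    // CompiledLogData is the data structure built from a matched line (fields omitted here).
    public class CompiledLogData { }

    public interface IDataReader
    {
        bool canUse(string text);                     // do any of this reader's regexes match?
        CompiledLogData constructData(string text);   // parse the line into the data structure
    }

    public class LogLineParser
    {
        private readonly List<IDataReader> readers;

        public LogLineParser(List<IDataReader> readers)
        {
            this.readers = readers;
        }

        public CompiledLogData Parse(string line)
        {
            foreach (var reader in readers)
            {
                if (reader.canUse(line))
                    return reader.constructData(line);
            }
            return null;   // no DataReader recognised this line type
        }
    }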
I am not asking for someone to code this; I'm just asking whether this is the right way, or whether there is a better, perhaps more efficient, way to handle this kind of thing. I am sure someone has worked with a situation like this before? :)
Hope someone can help, thanks.
This is not necessarily a solution, but what I prefer to do in situations like these is run the regex statements against the whole log: read the entire log into memory and build your lines programmatically from there. Performance-wise it works out the same as your approach, just in the reverse direction (you do "for each line, run each regex"; I do "for the entire log, run each regex").
What I like about my way is that I can shrink the log on each match by putting the match into a separate variable and replacing it in the log with nothing, then moving on to the next regex. A different way of going about it, I guess...
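A rough C# sketch of what I mean (the pattern names and regexes here are just placeholders, not anything from your actual logger):

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;
    using System.Text.RegularExpressions;

    class WholeLogExample
    {
        static void Main()
        {
            // The entire log in memory at once.
            string log = File.ReadAllText("app.log");

            // Placeholder patterns - you'd have one per line type / logger version.
            var patterns = new Dictionary<string, Regex>
            {
                { "error", new Regex(@"^.*ERROR.*$", RegexOptions.Multiline) },
                { "login", new Regex(@"^.*logged in.*$", RegexOptions.Multiline) }
            };

            var matchesByType = new Dictionary<string, List<string>>();
            foreach (var entry in patterns)
            {
                // Collect every match for this pattern...
                matchesByType[entry.Key] = entry.Value.Matches(log)
                                                      .Cast<Match>()
                                                      .Select(m => m.Value)
                                                      .ToList();
                // ...then strip the matches out so the next pattern scans less text.
                log = entry.Value.Replace(log, "");
            }

            Console.WriteLine("Unmatched text left over: {0} characters", log.Length);
        }
    }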
If you are bound to C#, this might be a good approach. At least, I do not know a better one.
If you just need to transform the data into a different output format (e.g. into a file), you might use an awk script (awk was made for tasks like this). This script could read your log messages from stdin and write the transformed data to stdout, so you can pipe your log messages to your awk script and get the transformed data back for further processing.
Reading the log and transforming the data are then separate applications, and you have more flexibility in how you use the tools.
Especially when the regular expressions change often, a scripting language can speed up your work by avoiding the additional compile step you need in C#.
Here is my situation:
My company has a little C# console app with a config file. In that config file are approximately 20 calibration parameters for our equipment (a vibration measurement device). We go through a manual process where we use in-house knowledge to tweak the calibration parameters, do a test run, tweak again, and iterate until we get correct calibration parameters. It's effectively a human-in-the-loop process.
Now this is a bit inefficient and time consuming, and I can't help thinking that this process is something we can automate. However, I'm not sure if I'm thinking about this properly and what techniques are available to me to try and solve this issue.
I can, for example, just write some code that loops through every parameter combination and brute-force finds the most accurate parameter set, but this is ugly and we quickly get into billions of iterations.
It seems to me this is some kind of learning or neural network problem, and I could possibly refactor the code to somehow use this. However I'm not sure if this is actually the case.
Based on the above, what is an appropriate technique to use here? What is available in C# to facilitate this?
Thanks in Advance!
This problem cries for a genetic algorithm. You could for example try GeneticSharp for a fast and easy way to achieve what you want.
Every parameter that has to be optimized becomes one of the values encoded in a FloatingPointChromosome.
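A minimal sketch of the idea with GeneticSharp could look like the following. The exact namespace layout and constructor details can vary between library versions, only two parameters are shown instead of your ~20, and MeasureCalibrationError is a hypothetical stand-in for your real test run:

    using System;
    using GeneticSharp;   // namespace layout differs between GeneticSharp versions

    class CalibrationTuner
    {
        static void Main()
        {
            // One FloatingPointChromosome encoding every calibration parameter
            // (only two are shown here; you'd list all ~20).
            var chromosome = new FloatingPointChromosome(
                new double[] { 0.0, 0.0 },    // minimum per parameter
                new double[] { 10.0, 5.0 },   // maximum per parameter
                new int[] { 16, 16 },         // bits used to encode each parameter
                new int[] { 4, 4 });          // decimal digits of precision

            var fitness = new FuncFitness(c =>
            {
                var values = ((FloatingPointChromosome)c).ToFloatingPoints();
                double error = MeasureCalibrationError(values);
                return 1.0 / (1.0 + error);   // lower measurement error -> higher fitness
            });

            var ga = new GeneticAlgorithm(
                new Population(50, 100, chromosome),
                fitness,
                new EliteSelection(),
                new UniformCrossover(),
                new UniformMutation(true));
            ga.Termination = new GenerationNumberTermination(200);
            ga.Start();

            var best = ((FloatingPointChromosome)ga.BestChromosome).ToFloatingPoints();
            Console.WriteLine(string.Join(", ", best));
        }

        // Hypothetical stand-in for your real test run with the vibration device.
        static double MeasureCalibrationError(double[] parameters)
        {
            return 0.0;
        }
    }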
Sounds like a version of 'hyper-parameter tuning'. You could use for instance the hyperopt library to help you come to the optimal solution faster. You can see an example here.
I'm working on some tools for a game I'm making. The tools serve as a front end that makes editing game files easier. Several of the files are Python scripting files. For instance, I have an Items.py file that contains the following (trimmed down for the example):
from ItemModule import *
import copy
class ScriptedItem(Item):
    def __init__(self, name, description, itemtypes, primarytype, flags, usability, value, throwpower):
        Item.__init__(self, name, description, itemtypes, primarytype, flags, usability, value, throwpower, Item.GetNextItemID())
    def Clone(self):
        return copy.deepcopy(self)
ItemLibrary.AddItem(ScriptedItem("Abounding Crystal", "A colourful crystal composed of many smaller crystals. It gives off a warm glow.", ItemType.SynthesisMaterial, ItemType.SynthesisMaterial, 0, ItemUsage.Unusable, 0, 50))
As I mentioned, I want to provide a front end for editing this file without requiring the editor to know Python or edit the file directly. My editor needs to be able to:
Find and list all the class types (in this example, it'd be only ScriptedItem)
Find and list all created items (in this case there'd only be one, Abounding Crystal). I'd need to find the type (in this case ScriptedItem) and all the parameter values
Allow editing of parameters and the creation/removal of items.
To do this, I started writing my own parser, looking for the class keyword and for where these recorded classes are used to construct objects. This worked for simple data, but when I started using classes with complex constructors (lists, maps, etc.) it became increasingly difficult to parse correctly.
After searching around, I found that IronPython made it easy to parse Python files, so that's what I went about doing. Once I built the abstract syntax tree, I used PythonWalkers to identify and find all the information I need. This works perfectly for reading in data, but I don't see an easy way to push updated data back into the Python file. As far as I can tell, there's no way to change the values in the AST, much less convert the AST back into a script file. If I'm wrong, I'd love for someone to tell me how I could do this. What I'd need to do now is search through the file until I find the correct line, then try to push the data into the constructor, ensuring correct ordering.
Is there some obvious solution I'm not seeing? Should I just keep working on my parser and make it support more complex data types? I really thought I had it with the IronPython parser, but I didn't think about how tricky it'd be to push modified data back into the file.
Any suggestions would be appreciated
You want a source-to-source program transformation tool.
Such a tool parses a language to an internal data structure (invariably an AST), allows you to modify the AST, and then can regenerate source text from the modified AST without changing essentially anything about the source except where the AST changes were made.
Such a program transformation tool has to parse text to ASTs, and "anti-parse" (prettyprint) ASTs back to text. If IronPython has a prettyprinter, that's what you need.
If it doesn't, you can build one with some (maybe a lot of) effort; as you've observed, this isn't as easy as one might think. See my answer to the question "Compiling an AST back to source code".
If that doesn't work, our DMS Software Reengineering Toolkit with its Python front end might do the trick. It has all the above properties.
Provided you can find a complete and up-to-date context-free grammar file for Python, you could use the CoCo/R parser generator to generate a Python parser in C#.
You can add production code to the grammar file itself to populate a data structure in your C# app. That data structure can hold all the information you need (methods and their arguments, properties, constructors, destructors, etc.). Once you have this data structure, it's just a task of designing a front end for the user and representing the data structure in a way that makes it editable for them (this is more of a design task than a complicated programming task).
Finally, iterate through your data structure and write out a .py file.
You can use the Python inspect module to print the source of an object. In your case, to print the source of your module - the file you just parsed with IronPython. I haven't checked whether inspect works with IronPython yet, though.
As for adding stuff, well, it's a module, right? You can just add stuff to a module... I'd load the module, alter it, use inspect to print it, and save it to disk.
From your post, it looks like you're already deep in the trenches and having fun, so I'd be really happy to see a post here on how you solved this problem!
To me it sounds more like you are at the point where you should shove it all into an SQLite database and start editing it that way. Hooking up some forms to edit the tables makes for a simpler UI. At that point you generate new Python files by dumping your tables out with some formatting to produce the surrounding Python scripts.
SVN / Git / whatever can merge the updated changes via the python files.
This is what I ended up doing for my project, at any rate. I started using Python to hook up the various items using their computed keys, and then just added some forms UI to avoid editing mistakes in the Python files.
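If it's useful, a rough sketch of the "dump the tables back out as Python" step in C# might look like this. The Microsoft.Data.Sqlite package, the table layout, and the shortened ScriptedItem call are all assumptions for illustration only:

    using System.IO;
    using Microsoft.Data.Sqlite;

    class ItemFileGenerator
    {
        static void Main()
        {
            // Assumed database and table layout; adjust to your schema.
            using (var connection = new SqliteConnection("Data Source=items.db"))
            using (var writer = new StreamWriter("Items.py"))
            {
                connection.Open();
                writer.WriteLine("from ItemModule import *");
                writer.WriteLine();

                var command = connection.CreateCommand();
                command.CommandText = "SELECT name, description, value FROM items";
                using (var reader = command.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        // The real ScriptedItem constructor takes more arguments;
                        // shortened here to keep the sketch readable.
                        writer.WriteLine("ItemLibrary.AddItem(ScriptedItem(\"{0}\", \"{1}\", {2}))",
                            reader.GetString(0), reader.GetString(1), reader.GetInt32(2));
                    }
                }
            }
        }
    }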
I have written a console application in Delphi that queries information from several locations. This application will be launched by another process, and the output to STDOUT will be captured by the launching process.
The information I am retrieving is to be interpreted by the calling application for reporting purposes. What is the best way to output this data to STDOUT so that it can be easily parsed? JSON? XML? CSV? The data, specifically, is remote workstation information, so it will pull things back like running processes, and details about each process.
Does anyone have any experience with this or suggestions?
If you want something that can be easily parsed, especially if it has to be done quickly, go with the simplest format that can effectively communicate the information you need. CSV if you can, otherwise try JSON. Definitely not XML unless you really, really need all the extra complexity for some reason.
I'd go for a tab-delimited format, provided your data (as it seems) doesn't contain that character, because it allows the fastest and simplest processing.
All the other formats are slower and more complicated (even if they give you more power).
The closest match is CSV, but CSV needs to quote an item if it contains one of the special characters defined by CSV (space, comma, quotes, etc.).
Because of that, the tab-delimited format is the most compact one, and hence has the greatest speed over the wire. (Since you're talking about remote workstations, I assume you're on some kind of network.)
Also, another thing worth mentioning is that the tab-delimited format is very readable, which makes debugging much easier if needed.
As an aside, if the tab character is present in your data stream, you can choose another character which you are sure cannot appear (for example #1, etc.). Of course, this is only if your usage scenario permits it.
HTH
It would depend entirely on what the launching process has available. If it's a small Delphi app, CSV is easy to parse with just TStringList. XML may be more heavy weight than JSON, but Delphi ships with an XML parser, and AFAIK, not a JSON parser.
The XML output format has the advantage that you can pipe it to an XSL formatter, so that the XML data can be converted to a user-friendly HTML document. (You can almost have your cake and eat it too.)
I was reading the following article:
http://odetocode.com/articles/294.aspx
This article raised a lot of questions for me regarding logs.
(I don’t know if I should have split this into separate questions… but I don’t want to spam stackoverflow.com with my questions.)
The first one is whether I should store logs in a .txt or .xml file… or even in a table inside the database.
Saving to a .txt file is probably better for performance. But when someone needs to find something in the .txt file, it may become a pain in the... neck.
So… which one should I use, and why?
The second one: is there any specific class to deal with logging?
I have read several threads about this subject, and I didn’t find the answers to my questions.
Thanks in advance.
The easiest approach I've taken in the past is using log4net. That way you can configure the logging in the config file. If you need it to go to a database, set it up as such. If you want to be notified when a major error occurs, set it up that way.
As far as sorting through the logs, it really depends on the approach you want to take, and how much you plan on logging. Normally I log to a flat text file as I don't enable a lot of logging in my applications. So parsing through them isn't a big deal.
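For reference, the code side of log4net stays small once the config file is in place; a minimal sketch (the appenders themselves - file, database, SMTP - are declared in the config file):

    using System;
    using log4net;
    using log4net.Config;

    public class Worker
    {
        // One logger per class is the usual pattern.
        private static readonly ILog Log = LogManager.GetLogger(typeof(Worker));

        public static void Main()
        {
            XmlConfigurator.Configure();   // reads the log4net section of the app config

            Log.Info("Application started");
            try
            {
                // ... do the actual work ...
            }
            catch (Exception ex)
            {
                Log.Error("Something went wrong", ex);   // routed to file/db/email per config
            }
        }
    }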
Unless you want to write a system for educational purposes, I honestly think that you'd be best off sticking with log4net or NLog.
And further, you would probably be better off studying the code to those systems instead of writing your own.
As to your question, I would stick to a text file and buffer the messages before spitting them to disk.
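A bare-bones version of "buffer, then write to a text file" might look like this (a sketch only, with no locking or error handling; the class name and threshold are made up):

    using System;
    using System.Collections.Generic;
    using System.IO;

    public class BufferedFileLogger
    {
        private readonly string _path;
        private readonly List<string> _buffer = new List<string>();
        private readonly int _flushThreshold;

        public BufferedFileLogger(string path, int flushThreshold = 100)
        {
            _path = path;
            _flushThreshold = flushThreshold;
        }

        public void Log(string message)
        {
            _buffer.Add(string.Format("{0:u} {1}", DateTime.Now, message));
            if (_buffer.Count >= _flushThreshold)
                Flush();
        }

        public void Flush()
        {
            File.AppendAllLines(_path, _buffer);   // one entry per line
            _buffer.Clear();
        }
    }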
Why bother reinventing the wheel? You can check out the MS Enterprise Library Logging Block.
Definitely not XML.
With XML, you will need to read it all, parse it, add whatever, then generate the whole XML again and write it back to disk - every single time you log something.
Unless, of course, you append the nodes to the XML file manually, in which case you lose most of XML's advantages.
Warnings to fatal errors - whatever will help you debug the application if it crashes - those logs I would store in a .txt file.
Append a new line for every entry.
This way you can also ask your user to check it out (if you are assisting them over the phone).
If it's not a meta log such as mentioned above - in other words, if it's anything related to the program itself that you may need to analyze - keep it in the DB.
Regarding file vs database, it's up to you to choose.
File logs give greater performance, but at the cost of harder access.
If the logs are there just to rarely provide information (e.g. the app crashes and you need to know why), you're better off storing the logs in a file.
If you want to give access to those logs, analyze them, etc, you should store them in a database.
.NET is really not my area, but there are lots of reasons why you should use the framework's logging classes.
For my apps I have chosen to write to the DB. It's easier (for me) to read the logs this way. However, I do not go log-crazy as some people do; I only log what I need to log and nothing else.
I gave log4net a shot not too long ago and did not like it at all. It was a whole lot of junk just to write to a DB and send an email. I ended up writing a custom logging class; it was only ~200 lines and took just a few hours. It works great, I don't have another dependency, and it can be easily changed.
If you're dealing with ASP.NET, ELMAH is another good logging tool. It's apparently what Microsoft's Scott Hanselman uses.
It does need some additional code to get it to work with ASP.NET MVC's HandleError attribute, though.
NLog and log4net both provide a rich logging API, but neither addresses the challenges you face in managing and analyzing all the data in your log files.
If you're willing to consider a commercial tool, take a look at Gibraltar - it works with NLog and log4net and also collects useful performance metrics. Most importantly, Gibraltar provides great tools for managing and analyzing logs.
I have a computation that calculates a resulting percentage based on certain input. But these calculations can take quite some time, which can be annoying. Since there are about 12500 possible inputs, I thought it would be a good idea to precompute all the data, and look this up during normal program execution.
My first idea was to just create a simple file which is read at program initialization and populates some arrays. Although this will work, I would like to know if there are other options - for example, having the array populated at compile time.
BTW, I'm writing my code in C#.
This tutorial here implements a serializer, which you can use to easily convert an object to a binary file and back. Once you have the serializer in hand, you can just create an object that holds all your data and serialize it; when you actually run your program, just deserialize the object and use it.
This has all the benefits of saving an object to the hard drive, with an implementation that is object-agnostic (meaning you don't have to write much code for any object you want to serialize) and outputs in binary (thus saving space, if that is a concern).
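Without the tutorial to hand, the core of the idea in plain .NET is roughly this (a sketch using BinaryFormatter, which was the standard choice at the time but is marked obsolete in recent .NET versions; the class and field names are placeholders):

    using System;
    using System.IO;
    using System.Runtime.Serialization.Formatters.Binary;

    [Serializable]
    public class PrecomputedData
    {
        public double[] Percentages;   // one entry per possible input
    }

    public static class DataStore
    {
        public static void Save(PrecomputedData data, string path)
        {
            using (var stream = File.Create(path))
            {
                new BinaryFormatter().Serialize(stream, data);
            }
        }

        public static PrecomputedData Load(string path)
        {
            using (var stream = File.OpenRead(path))
            {
                return (PrecomputedData)new BinaryFormatter().Deserialize(stream);
            }
        }
    }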
A file with data is probably the easiest and most flexible way to implement it.
If you wanted it in memory without having to read it from somewhere, I would write a program to output your data in C#-like CSV format suitable for copying and pasting into an array/collection initializer, and thereby generate the source code for your precomputed data.
Create a program that outputs valid C# code which initializes your lookup tables. Make this part of your build process so that it will automatically create the source file and then build the rest of your project.
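A generator along those lines can be very small; a sketch (ComputePercentage is a placeholder for your real, slow calculation, and the file names are made up):

    using System.Globalization;
    using System.IO;
    using System.Text;

    // Run as a pre-build step: writes a C# source file containing the lookup table.
    public static class TableGenerator
    {
        public static void Main()
        {
            var sb = new StringBuilder();
            sb.AppendLine("// <auto-generated> regenerated by the build - do not edit by hand");
            sb.AppendLine("public static class PrecomputedTable");
            sb.AppendLine("{");
            sb.AppendLine("    public static readonly double[] Percentages = new double[]");
            sb.AppendLine("    {");
            for (int input = 0; input < 12500; input++)
                sb.AppendFormat(CultureInfo.InvariantCulture, "        {0},\n", ComputePercentage(input));
            sb.AppendLine("    };");
            sb.AppendLine("}");
            File.WriteAllText("PrecomputedTable.cs", sb.ToString());
        }

        private static double ComputePercentage(int input)
        {
            return input / 125.0;   // placeholder for the real (slow) calculation
        }
    }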
As Daniel Lew said, serialize it into a binary file.
If you need speed, go for a Dictionary. A Dictionary is indexed on its key and should allow rapid lookup even with large amounts of data.
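For example, something like this (assuming each input can be reduced to a single key; the class and the precomputed array are placeholders):

    using System.Collections.Generic;

    public static class Lookup
    {
        // Key = the input (or a composite key built from several inputs),
        // value = the precomputed percentage.
        private static readonly Dictionary<int, double> Table = new Dictionary<int, double>(12500);

        public static void Populate(double[] precomputed)   // e.g. the deserialized data
        {
            for (int input = 0; input < precomputed.Length; input++)
                Table[input] = precomputed[input];
        }

        public static double Get(int input)
        {
            return Table[input];   // O(1) lookup during normal execution
        }
    }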
I would always start by considering whether there is any way to avoid precomputing. If there are 12,500 possible inputs, how many are required per user request? Will all 12,500 be needed at the same time, or will they be spread out over time? If you can get by with calculating a few at a time, I'd do that with lazy initialization. I prefer this solution simply because I'll have fewer issues with it in the long run. What do you do when the persistent format changes, or the data changes? How will you handle it when the file is missing or corrupted? Persisting to a file does not result in less code.
I would serialize such a file to a human-readable format if I had to persist a pre-loaded version. I'd probably use XML serialization since it's simple. But quite often there are issues of invalidation and recalculation. Do the values never change, or only very infrequently?
I agree with mquander and Trent. Use your favorite language or script to generate the whole C# file you need to define your data (no copy-pasting, that's a manual step and error-prone). Add it as a Pre-Build event in Visual Studio. You could even detect that you have an up-to-date file and avoid regeneration for most builds.
There is definitely a way to statically generate almost any data using template metaprogramming in C++, although it can be painful. It's not worth it unless you need many sets of different data in several parts of your program. I am not familiar enough with metaprogramming in C# to evaluate the general effort in your case. You should look into that.