I have a series of parsers which parse the same basic sort of text for relevant data, but they come from various sources, so they differ subtly. I am parsing millions of documents per day, so any speed optimization helps.
Here is a simplified example to show the fundamental issue. The parsers are set up so that there is an abstract base parser that the actual parsers implement:
abstract class BaseParser
{
    protected abstract string SomeRegex { get; }

    public string ParseSomethingCool(string text)
    {
        return Regex.Match(text, SomeRegex).Value;
    }
    ....
}
class Parser1 : BaseParser
{
    protected override string SomeRegex { get { return "^.*"; } } // example regex
    ...
}

class Parser2 : BaseParser
{
    protected override string SomeRegex { get { return "^[0-9]+"; } } // example regex
    ...
}
So my questions are:
If I were to make the values returned by the getters constants, would it speed things up?
Theoretically, if it didn't use a property and everything was a straight-up constant, would that speed things up more?
What sort of speed increase, if any, could I see?
Am I just clutching at straws?
I don't think converting the properties to constants will give you any appreciable performance boost. The JITted code probably has those inlined anyway (since you return constant strings).
I think the best approach is to profile your code first and see which parts have the most potential for optimization. My suggestions of things to look at:
RegEx - as you already know, a well-constructed regex expression sometimes spells the difference between fast and extremely slow. It's really a case-by-case basis, depending on the expression used and the text you feed it.
Alternatives - I'm not sure what kind of matching you perform, but it might be worth considering other approaches, especially if what you are trying to match is not that complex. Then benchmark the results.
Other parts of your code - see where the bottleneck occurs. Is it in disk IO or CPU? See if more threads will help, or maybe revisit the function that reads the file contents.
Whatever you end up doing, it's always a big help to measure. Identify the areas with opportunity, find a faster way to do them, then measure again to verify that it is indeed faster.
The things in the get are already constants (string literals).
I bet the jitter is already optimizing away the property accessors, so you probably won't see much performance gain by refactoring them out.
I don't think you'd see appreciable speed improvements from this kind of optimisation. Your best bet, though, is to try it and benchmark the results.
One change that would make a difference is to not use Regex if you can get away without it. Regex is a pretty big and useful hammer, but not every nail needs a hammer that big.
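For example, here is a minimal sketch of a non-regex alternative to Parser2's "^[0-9]+" (the helper name is made up; whether it beats a compiled Regex is something to benchmark on your real data):

static string ParseLeadingDigits(string text)
{
    int i = 0;
    while (i < text.Length && text[i] >= '0' && text[i] <= '9')
    {
        i++;
    }
    return text.Substring(0, i); // empty string when nothing matches, like Match.Value
}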
From the code you show, it's not clear why you need an abstract class and inheritance.
Using virtual members is slower. Moreover, your child classes aren't sealed.
Why don't you do something like this:
public class Parser
{
    private readonly Regex regex;

    public Parser(string someRegex)
    {
        regex = new Regex(someRegex, RegexOptions.Compiled);
    }

    public string ParseSomethingCool(string text)
    {
        return regex.Match(text).Value;
    }
}
or like this
public static class Parser
{
    public static string ParseSomethingCool(string text, string someRegex)
    {
        return Regex.Match(text, someRegex).Value;
    }
}
However, I think the greatest gain in performance would come from multi-threading. Probably you already do this; if you don't, take a look at the Task Parallel Library.
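To illustrate, a minimal sketch that feeds documents through the Parser class above with Parallel.ForEach (LoadDocuments is a hypothetical stand-in for your input source):

using System;
using System.Collections.Generic;
using System.Threading.Tasks;

class ParallelParsing
{
    static void Main()
    {
        var parser = new Parser("^[0-9]+"); // example regex from above

        // Matching on a shared Regex instance is thread-safe, so one
        // Parser can be used from many threads at once.
        Parallel.ForEach(LoadDocuments(), text =>
        {
            string value = parser.ParseSomethingCool(text);
            // ...store or process the result
        });
    }

    // Hypothetical placeholder for whatever feeds you documents.
    static IEnumerable<string> LoadDocuments()
    {
        yield return "12345 example document";
    }
}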
I think I need to reformulate my question.
My question is: what is the best way to get the same List<T> that I can use throughout my whole project?
My code looks like this now:
public static class MessagingController
{
    static List<MessagingDelivery> MessagingDeliveryList = Messaging.GetMessagingDeliveryList();
}

internal static class Messaging
{
    static List<MessagingDelivery> MessagingDeliveryList;

    static Messaging()
    {
        MessagingDeliveryList = new List<MessagingDelivery>();
    }

    internal static void CreateMessagingText(short reference, short number, string text)
    {
        MessagingDeliveryList.Add(new MessagingDelivery(reference, number, text));
    }

    internal static void ChangeMessagingDelivery(short reference, string status, string error)
    {
        MessagingDelivery.ChangeStatus(reference, status, error);
    }

    internal static List<MessagingDelivery> GetMessagingDeliveryList()
    {
        return MessagingDeliveryList;
    }
}
Old question:
What is "best practice" for get a static List<T> and why?
Code 1:
public static List<MessagingDelivery> messagingDeliveryList
= Messaging.GetMessagingDeliveryList();
Code 2:
static List<MessagingDelivery> messagingDeliveryList
= Messaging.GetMessagingDeliveryList();
public static List<MessagingDelivery> MessagingDeliveryList
{ get { return messagingDeliveryList; } }
I assume Code 1 is the fastest way. Is there a good reason to use Code 2?
Neither. A static List<T> with a name that sounds like an actively used object (rather than, say, immutable configuration data) is not fast or slow: it is simply broken. It doesn't matter how fast broken code can run (although the faster it runs, the sooner and more often you will notice it break).
That aside, by the time the JIT has done inlining, there will rarely, if ever, be any appreciable difference between the two options shown.
Besides which: that simply isn't your bottleneck. For example, what are you going to do with the list? Search? Append? Remove from the right? From the left? Fetch by index? All of these are where the actual time is spent, not in the list reference lookup.
While the first is going to be a hair faster, I would say that the second is going to be easier to maintain in the long-run by restricting access to the accessor.
To pick a mediocre example: If, in a few weeks, you suddenly need to deal with encryption or limited access rights, you only have one place to make the change. In the first example, you'd need to search the program for places that access your list, which is a far less effective use of your time. For security, especially, it might even be dangerous, if you start dumping access tokens or keys throughout the program.
So, it depends on what you need. In production, unless the few extra cycles for a method call/return are going to be significant for the purpose (which may well be the case, in some situations), I'd go with the second.
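For illustration, a sketch of the second form that also addresses the mutability concern raised above. AsReadOnly is just one option; whether it fits depends on how the list is actually used:

using System.Collections.Generic;
using System.Collections.ObjectModel;

public static class MessagingController
{
    static List<MessagingDelivery> messagingDeliveryList
        = Messaging.GetMessagingDeliveryList();

    // Callers can enumerate and index, but cannot add or remove items.
    // (The wrapper could be cached in a field if this getter is hot.)
    public static ReadOnlyCollection<MessagingDelivery> MessagingDeliveryList
    {
        get { return messagingDeliveryList.AsReadOnly(); }
    }
}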
I know that any running application (whether it is built with C#, C, C++, Java, etc.) will have elements exposed in memory. I'm curious how to control what is exposed in memory and how it is exposed.
I'm curious because I know that many games get hacked or modified by a user viewing the contents of the game's memory and altering them. I just want to know more details about how this works. I know special programs must be used to even dive into the memory, and there are conversions and such that must happen for it to even be somewhat readable.
Let's take an extremely simple example, and I'll ask some questions about it.
using System.Security;

static class Program2
{
    private static SecureString fSecureString;
    public static string fPublicString = "Test123";
    private static string fPrivateString = "321tesT";

    static void Main2()
    {
    }
}

class TestClass
{
    private string fInstancedPrivateString;

    public TestClass()
    {
        fInstancedPrivateString = "InstancedSet";
    }

    private string DoSomething()
    {
        return fInstancedPrivateString.ToLower();
    }
}
Given the code above, I imagine that fPublicString is pretty visible. What can someone reading memory see? Can they read the variable name, or do they just see a memory address and an assigned value (Test123)? What about methods like DoSomething that are inside an instanced class? Can someone see that in memory and write malicious code to execute it at will?
I'm just curious how much of this I need to keep in mind while writing applications (or games). I understand the general idea of access modifiers (public/private/etc.) and their relation to other code having visibility of a member, but I'm curious whether they have any bearing on how it is represented in memory.
My final question is very specific: EverQuest (a game) has a hack called MacroQuest, which from my understanding reads memory by knowing the proper offsets, and can then execute code on the EQ client side or simply change values stored in memory for the client. How did EQ get this so wrong? Was it poor programming on their end? A technology limitation that is more or less resolved now? Or can this technically be done to virtually every piece of software written, given the right amount of knowledge?
Overall, I guess I could use a good tutorial, article, or book that provides some details on how code looks in memory, etc.
Knowing that your application's memory can be read should not be something a "normal" developer needs to worry about. The number of users who are able to exploit this in a useful way is very small (in the grand scheme of things), and it only really matters for sensitive parts of your application anyway (licensing, passwords, and other personally identifiable information). Otherwise, the risk is really negligible.
If the effort of protecting it can't be justified by the cost of doing so, then why should the person/group/etc. paying to have it built worry? It isn't worth investing the time to care when there's always a ton of other things that could otherwise use the time investment.
Should Notepad or MS Word care that you can write a sniffer to listen to what is being typed? Probably not. Why? Because it really doesn't affect the bottom line or pose any realistic risk.
In Java, using aspects, we can get the number of calls to a function, e.g.:
SomeClass sc = new SomeClass();
sc.m1();
sc.m1();
int no = sc.getNumberOfCalls("m1");
System.out.println(no); //2
How can I do this in C#?
The canonical, easiest way would probably be to simply use a profiler application. Personally, I have good experiences with JetBrains dotTrace, but there are more out there.
Basically, what you do is let the profiler fire up your app, and it will keep track of all the method calls in your code. It will then show you how much time was spent executing those methods, and how many times they were called.
Assuming your reason for wanting to know this is actually performance, I think it's a good idea to look at a profiler. You can try to optimize your code by making an educated guess at where the bottlenecks are, but if you use a profiler you can actually measure that. And we all know: measure twice, cut once ;-)
Another option is an external .NET product: AQtime does that with no difficulty.
How about simply doing this in your class?
public class SomeClass
{
    private int counter = 0;

    public void m1()
    {
        counter++;
        // ...the method's real work goes here
    }

    public int getMethodCalls()
    {
        return counter;
    }
}
This capability is not built into .NET. However, you can use any one of the mock object frameworks to do this. For example, with RhinoMocks you can set the expected number of calls to a method and check it.
You can also accomplish this if you create a dynamic, runtime proxy for your objects and have your proxy keep track. That might make the cure worse than the disease, though!
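To make the proxy idea concrete, here is a minimal sketch using System.Reflection.DispatchProxy (available in .NET Core and later; the calculator interface and class names are made up for illustration):

using System;
using System.Collections.Generic;
using System.Reflection;

public interface ICalculator
{
    int Add(int a, int b);
}

public class Calculator : ICalculator
{
    public int Add(int a, int b) { return a + b; }
}

public class CountingProxy<T> : DispatchProxy where T : class
{
    private T target;
    private readonly Dictionary<string, int> counts = new Dictionary<string, int>();

    public static T Wrap(T target, out CountingProxy<T> counter)
    {
        // Create returns an object that implements T and routes
        // every call through Invoke below.
        T proxy = Create<T, CountingProxy<T>>();
        counter = (CountingProxy<T>)(object)proxy;
        counter.target = target;
        return proxy;
    }

    public int GetNumberOfCalls(string methodName)
    {
        int n;
        return counts.TryGetValue(methodName, out n) ? n : 0;
    }

    protected override object Invoke(MethodInfo targetMethod, object[] args)
    {
        // Count the call, then forward it to the real object.
        int n;
        counts.TryGetValue(targetMethod.Name, out n);
        counts[targetMethod.Name] = n + 1;
        return targetMethod.Invoke(target, args);
    }
}

Used like the Java example:

ICalculator calc = CountingProxy<ICalculator>.Wrap(new Calculator(), out var counter);
calc.Add(1, 2);
calc.Add(3, 4);
Console.WriteLine(counter.GetNumberOfCalls("Add")); // 2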
I believe there's no such built-in feature for that, so you may want to code something this way:
public class SomeClass
{
    public Int32 MethodCallCount { get; set; }

    public void Method()
    {
        this.MethodCallCount++;
        //Your custom code goes here!
    }
}
If you want to go deeper, you may want to look at AOP (aspect-oriented programming) interceptors; if so, you may start by looking at the Spring.NET framework.
If you are within a test scenario, the most appropriate solution would be to use a mock framework for that.
You can also do aspects in C#. For example:
Spring.Net
EOS
I am in the final stages of creating an MP4 tag parser in .NET. Those who have experience with tagging music will be aware that there are on average 30 or so tags. I've tested out different types of loops, and it seems that a switch statement with const values is the way to go with regard to catching the tags in binary.
The switch allows me to search the binary without needing to know which order the tags are stored in, or whether some are absent, but I wonder if anyone would be against using a switch statement for so many conditionals.
Any insight is much appreciated.
EDIT: One thing I should add now that we're discussing this: the function is recursive. Should I pull out this conditional and pass the data off to a method I can kill?
It'll probably work fine with the switch, but I think your function will become very long.
One way you could solve this is to create a handler class for each tag type and then register each handler with the corresponding tag in a dictionary. When you need to parse a tag, you can look up in the dictionary which handler should be used.
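A minimal sketch of that registry idea (the handler interface and tag names are made up; substitute the 30 or so real tags):

using System.Collections.Generic;

interface ITagHandler
{
    void Handle(byte[] data); // parse this tag's payload
}

class TitleHandler : ITagHandler
{
    public void Handle(byte[] data) { /* parse the title payload */ }
}

class ArtistHandler : ITagHandler
{
    public void Handle(byte[] data) { /* parse the artist payload */ }
}

class TagDispatcher
{
    // One entry per known tag; unknown tags are simply skipped,
    // which also covers tags that are absent or out of order.
    private readonly Dictionary<string, ITagHandler> handlers =
        new Dictionary<string, ITagHandler>
        {
            { "titleTag",  new TitleHandler()  },
            { "artistTag", new ArtistHandler() },
            // ...roughly 30 entries in your case
        };

    public void Dispatch(string tagName, byte[] data)
    {
        ITagHandler handler;
        if (handlers.TryGetValue(tagName, out handler))
        {
            handler.Handle(data);
        }
    }
}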
Personally, if you must, I would go this way. A switch statement is much easier to read than if/else statements (and at your size it will be optimized for you).
Here is a related question. Note that the accepted answer is the incorrect one.
Is there any significant difference between using if/else and switch-case in C#?
Another option (Python-inspired) is a dictionary that maps a tag to a lambda function, or an event, or something like that. It would require some re-architecture.
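A minimal sketch of the lambda variant, again with made-up tag names:

using System;
using System.Collections.Generic;

class LambdaDispatcher
{
    private readonly Dictionary<string, Action<byte[]>> handlers =
        new Dictionary<string, Action<byte[]>>
        {
            { "titleTag",  data => { /* parse the title payload */ } },
            { "artistTag", data => { /* parse the artist payload */ } },
        };

    public void Dispatch(string tagName, byte[] data)
    {
        Action<byte[]> handler;
        if (handlers.TryGetValue(tagName, out handler))
        {
            handler(data);
        }
    }
}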
For something low level like this I don't see a problem. Just make sure you place each case in a separate method. You will thank yourself later.
To me, having so many conditions in a switch statement is cause for thought. It might be better to refactor the code and rely on virtual methods, an association between tags and methods, or some other mechanism to avoid spaghetti code.
If you have only one place that has that particular structure of switch and case statements, then it's a matter of style. If you have more than one place that has the same structure, you might want to rethink how you do it to minimize maintenance headaches.
It's hard to tell without seeing your code, but if you're just trying to catch each tag, you could define an array of acceptable tags, then loop through the file, checking to see if each tag is in the array.
ID3Sharp on SourceForge (http://sourceforge.net/projects/id3sharp/) uses a more object-oriented approach, with a FrameRegistry that hands out derived classes for each frame type.
It's fast, it works well, and it's easy to maintain. The 'overhead' of creating a small class object in C# is negligible compared to opening an MP4 file to read the header.
One design that might be useful in some cases (but from what I've seen would be overkill here):
abstract class DoStuff
{
    public void Do(TagType it, Context context)
    {
        switch (it)
        {
            case TagType.Case1: doCase1(context); break;
            case TagType.Case2: doCase2(context); break;
            //...
        }
    }

    protected abstract void doCase1(Context context);
    protected abstract void doCase2(Context context);
    //...
}

class DoRealStuff : DoStuff
{
    protected override void doCase1(Context context) { /* ... */ }
    protected override void doCase2(Context context) { /* ... */ }
    //...
}
I'm not familiar with the MP4 technology, but I would explore the possibility of using some interfaces here. Pass in an object and try to cast it to the interface.
public void SomeMethod(object obj)
{
    ITag it = obj as ITag;
    if (it != null)
    {
        it.SomeProperty = "SomeValue";
        it.DoSomethingWithTag();
    }
}
I wanted to add my own answer just to bounce it off people...
Create an object that holds the binary tag name, the data, and the property name.
Create a list of these, one for each known tag, filling in the tag name and property name.
When parsing, use LINQ to match the found name against the object's BinaryTagName and add the data.
Then reflect into the property and set the data.
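A minimal sketch of this approach (class, tag, and property names are made up, and string-valued tags are assumed):

using System.Collections.Generic;
using System.Linq;
using System.Reflection;
using System.Text;

class TagMapping
{
    public string BinaryTagName;  // the name as it appears in the file
    public string PropertyName;   // the property to fill on the result object
    public byte[] Data;           // raw payload, captured during parsing
}

class TrackInfo
{
    public string Title { get; set; }
    public string Artist { get; set; }
}

static class MappingParser
{
    static readonly List<TagMapping> Mappings = new List<TagMapping>
    {
        new TagMapping { BinaryTagName = "titleTag",  PropertyName = "Title"  },
        new TagMapping { BinaryTagName = "artistTag", PropertyName = "Artist" },
        // ...one entry per known tag
    };

    public static void Apply(string foundName, byte[] data, TrackInfo track)
    {
        // LINQ match of the found name against the known mappings.
        TagMapping mapping = Mappings.FirstOrDefault(m => m.BinaryTagName == foundName);
        if (mapping == null) return;
        mapping.Data = data;

        // Reflect into the property and set the value.
        PropertyInfo prop = typeof(TrackInfo).GetProperty(mapping.PropertyName);
        prop.SetValue(track, Encoding.UTF8.GetString(data), null);
    }
}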
What about a good old for loop? I think you can design it that way. Isn't switch-case just transformed if-else anyway? I always try to write code using a loop if the number of case statements gets higher than acceptable. And 30 cases in a switch is too many for me.
You almost certainly have a Chain of Responsibility in your problem. Refactor.
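For reference, a minimal Chain of Responsibility sketch (handler classes and tag names are illustrative):

abstract class TagLink
{
    private TagLink next;

    public TagLink SetNext(TagLink link)
    {
        next = link;
        return link; // allows chaining: a.SetNext(b).SetNext(c)
    }

    public void Handle(string tagName, byte[] data)
    {
        if (!TryHandle(tagName, data) && next != null)
        {
            next.Handle(tagName, data);
        }
    }

    // Returns true if this link recognized and consumed the tag.
    protected abstract bool TryHandle(string tagName, byte[] data);
}

class TitleLink : TagLink
{
    protected override bool TryHandle(string tagName, byte[] data)
    {
        if (tagName != "titleTag") return false;
        /* parse the title payload */
        return true;
    }
}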
Rails introduced some core extensions to Ruby, like 3.days.from_now, which returns, as you'd expect, a date three days in the future. With extension methods in C#, we can now do something similar:
static class Extensions
{
    public static TimeSpan Days(this int i)
    {
        return new TimeSpan(i, 0, 0, 0, 0);
    }

    public static DateTime FromNow(this TimeSpan ts)
    {
        return DateTime.Now.Add(ts);
    }
}
class Program
{
    static void Main(string[] args)
    {
        Console.WriteLine(
            3.Days().FromNow()
        );
    }
}
Or how about:
static class Extensions
{
    public static IEnumerable<int> To(this int from, int to)
    {
        return Enumerable.Range(from, to - from + 1);
    }
}
class Program
{
    static void Main(string[] args)
    {
        foreach (var i in 10.To(20))
        {
            Console.WriteLine(i);
        }
    }
}
Is this fundamentally wrong, or are there times when it is a good idea, like in a framework like Rails?
I like extension methods a lot, but I do feel that when they are used outside of LINQ, they improve readability at the expense of maintainability.
Take 3.Days().FromNow() as an example. This is wonderfully expressive and anyone could read this code and tell you exactly what it does. That is a truly beautiful thing. As coders it is our joy to write code that is self-describing and expressive so that it requires almost no comments and is a pleasure to read. This code is paramount in that respect.
However, as coders we are also responsible to posterity, and those who come after us will spend most of their time trying to comprehend how this code works. We must be careful not to be so expressive that debugging our code requires leaping around amongst a myriad of extension methods.
Extension methods veil the "how" to better express the "what". I guess that makes them a double-edged sword that is best used (like all things) in moderation.
First, my gut feeling: 3.Minutes.from_now looks totally cool, but does not demonstrate why extension methods are good. This also reflects my general view: cool, but I've never really missed them.
Question: Is 3.Minutes a timespan, or an angle?
Namespaces referenced through a using statement "normally" only affect types; now they suddenly decide what 3.Minutes means.
So the best approach is to "not let them escape".
All public extension methods in a likely-to-be-referenced namespace end up being "kind of global" - with all the potential problems associated with that. Keep them internal to your assembly, or put them into a separate namespace that is added to each file separately.
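A sketch of that opt-in arrangement (the namespace name is made up):

namespace MyCompany.FluentExtensions
{
    public static class TimeExtensions
    {
        public static System.TimeSpan Days(this int i)
        {
            return System.TimeSpan.FromDays(i);
        }
    }
}

// In any file that wants the fluent style, opt in explicitly:
// using MyCompany.FluentExtensions;   // only now does 3.Days() resolve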
Personally I like int.To, I am ambivalent about int.Days, and I dislike TimeSpan.FromNow.
I dislike what I see as a bit of a fad for 'fluent' interfaces that let you write pseudo-English code but do so by implementing methods with names that can be baffling in isolation.
For example, this doesn't read well to me:
TimeSpan.FromSeconds(4).FromNow()
Clearly, it's a subjective thing.
I agree with siz and lean conservative on this issue. Rails has that sort of stuff baked in, so it's never really that confusing. When you write your own "Days" and "FromNow" methods, there is no guarantee that your code is bug-free. Also, you are adding a dependency to your code: if you put your extension methods in their own file, you need that file in every project, and if you put them in their own project, you need to include that project whenever you need them.
All that said, for really simple extension methods (like Jeff's usage of "left" or thatismatt's usage of days.fromnow above) that exist in other frameworks/worlds, I think it's OK. Anyone who is familiar with dates should understand what "3.Days().FromNow()" means.
I'm on the conservative side of the spectrum, at least for the time being, and am against extension methods. They are just syntactic sugar that, to me, is not that important. I think they can also be a nightmare for junior developers who are new to C#. I'd rather encapsulate the extensions in my own objects or static methods.
If you are going to use them, just please don't overuse them to a point that you are making it convenient for yourself but messing with anyone else who touches your code. :-)
Each language has its own perspective on what a language should be. Rails and Ruby are designed with their own, very distinct opinions. PHP has clearly different opinions, as does C(++/#)...as does Visual Basic (though apparently we don't like their style).
The balance is between having many easily-read, built-in functions and having nitty-gritty control over everything. I wouldn't want so many functions that you have to go to a lookup every time you want to do anything (and there's got to be a performance overhead to a bloated framework), but I personally love Rails, because what it has saves me a lot of development time.
I guess what I'm saying here is that if you were designing a language, you should take a stance, go from there, and build in the functions you (or your target developers) would use most often.
My personal preference would be to use them sparingly for now and wait to see how Microsoft and other big organizations use them. If we start seeing a lot of code, tutorials, and books using code like 3.Days().FromNow(), then it makes sense to use it a lot. If only a small number of people use it, then you run the risk of your code being overly difficult to maintain because not enough people are familiar with how extensions work.
On a related note, I wonder how the performance compares between a normal for loop and the foreach version. It would seem like the second method involves a lot of extra work for the computer, but I'm not familiar enough with the concept to know for sure.