Is this possible? Given that C# uses immutable strings, one could expect that there would be a method along the lines of:
var expensive = ReadHugeStringFromAFile();
var cheap = expensive.SharedSubstring(1);
If there is no such function, why bother with making strings immutable?
Or, alternatively, if strings are already immutable for other reasons, why not provide this method?
The specific reason I'm looking into this is doing some file parsing. Simple recursive descent parsers (such as the one generated by TinyPG, or ones easily written by hand) use Substring all over the place. This means that if you give them a large file to parse, memory churn is unbelievable. Sure, there are workarounds - basically roll your own SubString class, and then of course forget about being able to use String methods such as StartsWith or String libraries such as Regex, so you need to roll your own versions of these as well. I assume parser generators such as ANTLR basically do that, but my format is simple enough not to justify using such a monster tool. Even TinyPG is probably overkill.
Somebody please tell me I am missing some obvious or not-so-obvious standard C# method call somewhere...
No, there's nothing like that.
.NET strings contain their text data directly, unlike Java strings which have a reference to a char array, an offset and a length.
Both solutions have "wins" in some situations, and losses in others.
If you're absolutely sure this will be a killer for you, you could implement a Java-style string for use in your own internal APIs.
As far as I know, all larger parsers use streams to parse from. Isn't that suitable for your situation?
The .NET framework supports string interning. This is a partial solution, but it does not offer the possibility of reusing parts of a string. I think reusing substrings would cause some problems that aren't obvious at first glance. If you have to do a lot of string manipulation, using the StringBuilder is the way to go.
Nothing in C# provides you the out-of-the-box functionality you're looking for.
What you want is a Rope data structure, an immutable data structure that supports O(1) concats and O(log n) substrings. I can't find any C# implementations of a rope, but here's a Java one.
Barring that, there's nothing wrong with using TinyPG or ANTLR if that's the easiest way to get things done.
Well you could use "unsafe" to do the memory management yourself, which might allow you to do what you are looking for. Also the StringBuilder class is great for situations where a string needs to be manipulated numerous times, since it doesn't make a new string with each manipulation.
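For example (untested, just a minimal sketch of the StringBuilder approach; the file name is made up):

using System.Text;

// Build one large string from many pieces without allocating
// a new intermediate string for every concatenation.
var sb = new StringBuilder();
foreach (var line in System.IO.File.ReadLines("input.txt")) // hypothetical input file
{
    sb.Append(line).Append(' ');
}
string result = sb.ToString(); // one final allocation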
You could easily write a trivial class to represent "cheap". It would just hold the index of the start of the substring and the length of the substring. A couple of methods would allow you to read the substring out when needed - a string cast operator would be ideal as you could use
string text = myCheapObject;
and it would work seamlessly as if it were an actual string. Adding support for a few handy methods like StartsWith would be quick and easy (they'd all be one liners).
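Something along these lines, as a rough sketch (the name CheapSubstring and its members are made up - it's not a framework type):

public readonly struct CheapSubstring
{
    private readonly string _source;
    private readonly int _start;
    private readonly int _length;

    public CheapSubstring(string source, int start, int length)
    {
        _source = source;
        _start = start;
        _length = length;
    }

    public int Length => _length;

    // Only materialize a real string when someone actually asks for one.
    public static implicit operator string(CheapSubstring s) =>
        s._source.Substring(s._start, s._length);

    // One-liners like these avoid allocating anything at all.
    public char this[int index] => _source[_start + index];

    public bool StartsWith(string prefix) =>
        _length >= prefix.Length &&
        string.CompareOrdinal(_source, _start, prefix, 0, prefix.Length) == 0;
}

Note that the StartsWith above is ordinal; culture-sensitive variants would need the CompareInfo overloads.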
The other option is to write a regular parser and store your tokens in a Dictionary from which you share references to the tokens rather than keeping multiple copies.
I need to store about 60,000 IP-address-like things such that I can quickly determine whether the store contains an address, or return all the addresses that match a pattern like 3.4.*.* or *.5.*.*. The previous implementation used Hashtables nested four levels deep. It's not fully thread safe, and this is causing us bugs. I could make it thread safe by locking at the outer layer, or I could change all of those to ConcurrentDictionaries, but neither option seemed quite right. Using a byte as a key in a dictionary never felt quite right to me in general, especially in a heavy-weight dictionary. Suggestions?
Guava uses a prefix trie for storing IP lookup matches. You can see the code here:
https://code.google.com/p/google-collections/source/browse/trunk/src/com/google/common/collect/PrefixTrie.java?r=2
This is Java code, but I'm sure you could easily adapt it to C#. The technique of a prefix trie is applicable independent of the language and gets you trailing wildcard matches for free. If you want arbitrary wildcards, you'll still need to implement that yourself. Alternatively, you could build a data structure similar to a Directed acyclic word graph (DAWG). This will let you more directly implement the arbitrary wild card matches.
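To make the trie idea concrete, here is a rough C# sketch (the types and method names are mine, not taken from Guava, and it ignores the thread-safety question, which you would still need to solve separately):

using System.Collections.Generic;

// Each node maps one octet to the next level; a node four levels deep
// marks a complete stored address.
public class IpTrieNode
{
    public Dictionary<byte, IpTrieNode> Children { get; } = new Dictionary<byte, IpTrieNode>();
    public bool IsTerminal { get; set; }
}

public class IpTrie
{
    private readonly IpTrieNode _root = new IpTrieNode();

    public void Add(byte[] address)
    {
        var node = _root;
        foreach (var octet in address)
        {
            if (!node.Children.TryGetValue(octet, out var next))
                node.Children[octet] = next = new IpTrieNode();
            node = next;
        }
        node.IsTerminal = true;
    }

    // Pattern entries are either an octet value or null for '*'.
    public bool Matches(byte?[] pattern) => Matches(_root, pattern, 0);

    private static bool Matches(IpTrieNode node, byte?[] pattern, int depth)
    {
        if (depth == pattern.Length)
            return node.IsTerminal;

        if (pattern[depth] is byte octet)
            return node.Children.TryGetValue(octet, out var next)
                && Matches(next, pattern, depth + 1);

        // Wildcard: try every stored octet at this level.
        foreach (var child in node.Children.Values)
            if (Matches(child, pattern, depth + 1))
                return true;
        return false;
    }
}

A pattern like 3.4.*.* then becomes new byte?[] { 3, 4, null, null }, and the wildcard branching handles both trailing and arbitrary wildcards.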
This question is focused on performance within PHP, but you may broaden it to any language if you wish.
After many years of using PHP and having to compare strings I've learned that using string comparison operators over regular expressions is beneficial when it comes to performance.
I fully understand that some operations have to be done with regular expressions due to their complexity, but here I'm talking about operations that can be handled by either regex or plain string functions.
Take this example:
PHP
preg_match('/^[a-z]*$/','thisisallalpha');
C#
new Regex("^[a-z]*$").IsMatch("thisisallalpha");
can easily be done with
PHP
ctype_alpha('thisisallalpha');
C#
VFPToolkit.Strings.IsAlpha("thisisallalpha");
There are many other examples but you should get the point I'm trying to make.
What version of string comparison should you try and lean towards and why?
Looks like this question arose from our small argument here, so I feel somewhat obliged to respond.
PHP developers are being actively brainwashed about "performance", which breeds rumors and myths, including sheer nonsense like "double quotes are slower". Regexps being "slow" is one of these myths, unfortunately supported by the manual (see the infamous comment on the preg_match page). The truth is that in most cases you don't care. Unless your code runs 10,000 times, you won't even notice a difference between a string function and a regular expression. And if your code does run 10,000 times, you're probably doing something wrong anyway, and you'll gain performance by optimizing your logic, not by stripping out regular expressions.
As for readability, regexps are admittedly hard to read; however, the code that uses them is in most cases shorter, cleaner and simpler (compare your answer and mine at the link above).
Another important concern is flexibility, especially in PHP, whose string library doesn't support Unicode out of the box. In your concrete example, what happens when you decide to migrate your site to UTF-8? With ctype_alpha you're pretty much out of luck; preg_match would need a different pattern, but would keep working.
So: regexes are not slower, the code that uses them is more readable, and they are more flexible. Why on earth should we avoid them?
Regular expressions actually lead to a performance gain (not that such micro-optimizations are in any way sensible) when they can replace multiple atomic string comparisons. Typically, at around five strpos() checks it becomes advisable to use a regular expression instead. More so for readability.
And here's another thought to round things up: PCRE can handle conditionals faster than the Zend kernel can handle IF bytecode.
Not all regular expressions are created equal, though. If the complexity gets too high, regex recursion can kill the performance advantage. Therefore it's often worth considering a mix of regex matching and regular PHP string functions. Right tool for the job and all that.
PHP itself recommends using string functions over regex functions when the match is straightforward. For example, from the preg_match manual page:
Do not use preg_match() if you only want to check if one string is contained in another string. Use strpos() or strstr() instead as they will be faster.
Or from the str_replace manual page:
If you don't need fancy replacing rules (like regular expressions), you should always use this function instead of ereg_replace() or preg_replace().
However, I find that people try to use the string functions to solve problems that would be better solved by regex. For instance, when trying to create a full-word string matcher, I have encountered people trying to use strpos($string, " $word ") (note the spaces), for the sake of "performance", without stopping to think about how spaces aren't the only way to delineate a word (think about how many string function calls would be needed to fully replace preg_match('/\bword\b/', $string)).
My personal stance is to use string functions for matching static strings (i.e. a match of a distinct sequence of characters where the match is always the same) and regular expressions for everything else.
Agreed that PHP people tend to over-emphasise the performance of one function over another. That doesn't mean the performance differences don't exist -- they definitely do -- but most PHP code (and indeed most code in general) has much worse bottlenecks than the choice of regex over string comparison. To find out where your bottlenecks are, use xdebug's profiler. Fix the issues it turns up before worrying about fine-tuning individual lines of code.
They're both part of the language for a reason. IsAlpha is more expressive: when the value you're looking at is inherently alphabetic or not, and that distinction has domain meaning, use it.
But if it is, say, an input validation, and could possibly be changed to include underscores, dashes, etc., or if it is with other logic that requires regex, then I would use regex. This tends to be the majority of the time for me.
I have a Portable Bridge Notation formatted file that I have to work with. I already have a few simple examples working, using indexing and substrings to extract what I need, and I suppose for this PBN business that would suit me well, since it isn't run too often. If, on the other hand, I were to run that code all the time (I'm thinking of working with vCards), then I'd be worried about memory usage under high workload because of the large number of temporary strings created by all the substrings and splits.
There are two other options available that I know of. Regex and StringReader / TextReader and I wanted some general opinion on what to use.
Intended usage is to extract to objects and serialize to json so that I can more easily work with or persist this information. Hell if it's fast enough I might even do it on the fly.
So hit me, what would you chose?
Personally, I would read the file line by line and store it in an internal representation and then query it with LINQ.
The advantage of storing an internal representation is that you just read the file from top to bottom, which is easy. And when you need to query it, you have powerful LINQ queries at your disposal, which make life a lot easier.
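A rough sketch of that approach, assuming you only care about the tag lines of a PBN file (PbnTag and game.pbn are my own inventions for illustration):

using System.IO;
using System.Linq;

// Internal representation: one tag/value pair per PBN tag line like [Event "Some event"].
var tags = File.ReadLines("game.pbn")               // read top to bottom, one line at a time
    .Where(line => line.StartsWith("[") && line.EndsWith("]"))
    .Select(line => line.Trim('[', ']'))
    .Select(line =>
    {
        int space = line.IndexOf(' ');
        string name  = space < 0 ? line : line.Substring(0, space);
        string value = space < 0 ? ""   : line.Substring(space + 1).Trim('"');
        return new PbnTag(name, value);
    })
    .ToList();

// Querying the internal representation with LINQ is then trivial.
var declarer = tags.FirstOrDefault(t => t.Name == "Declarer")?.Value;

public record PbnTag(string Name, string Value);

From the list of PbnTag objects it's a short step to serializing to JSON with whatever serializer you prefer.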
I am getting to the last stage of my rope (a more scalable version of String) implementation. Obviously, I want all operations to give the same result as the operations on Strings whenever possible.
Doing this for ordinal operations is pretty simple, but I am worried about implementing culture-sensitive operations correctly. Especially since I know only two languages and in both of them culture-sensitive operations behave precisely the same as ordinal operations do!
So are there any specific things that I could test and get at least some confidence that I am doing things correctly? I know, for example, about ß being equal to SS when ignoring cases in German; about dotted and undotted i in Turkish.
Surrogate pairs, if you plan to support them - including invalid combinations (e.g. only one part of one).
If you're doing encoding and decoding, make sure you retain enough state to cope with being given arbitrary blocks of binary data to decode, which may end halfway through a character, with the remaining half coming in the next block.
The Turkish test is the best I know :)
You should mimic the String method implementations and use the core library to do this for you. It is very hard to take into account every possible aspect of every culture. Instead of reinventing the wheel, use Reflector on the String methods and look at the internal calls. For example, String.Compare uses CultureInfo.CurrentCulture.CompareInfo.Compare to compare two strings in the current culture.
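As a concrete example of the kind of check this suggests, you can run a few known-tricky cases through the framework and require your rope to give the same answers (whatever the framework returns is, by definition, the behaviour you want to match):

using System;
using System.Globalization;

var turkish = CultureInfo.GetCultureInfo("tr-TR");
var german  = CultureInfo.GetCultureInfo("de-DE");

// The dotted/dotless i issue: under tr-TR, 'i' and 'I' are not a
// case-insensitive pair, unlike in most other cultures.
Console.WriteLine(string.Compare("file", "FILE", turkish, CompareOptions.IgnoreCase));

// The ß/ss case mentioned in the question; the result can differ between
// ordinal and culture-sensitive comparison (and between NLS and ICU),
// which is exactly why it belongs in a test.
Console.WriteLine(string.Compare("straße", "strasse", german, CompareOptions.None));

// String.Compare ultimately delegates to CompareInfo, which you can call directly:
Console.WriteLine(german.CompareInfo.Compare("straße", "strasse"));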
This must be a classic .NET question for anyone migrating from Java.
.NET does not seem to have a direct equivalent to java.io.StreamTokenizer, however the JLCA provides a SupportClass that attempts to implement it. I believe the JLCA also provides a Tokenizer SupportClass that takes a String as the source, which I thought a StreamTokenizer would be derived from, but isn't.
What is the preferred way to tokenize both a Stream and a String? Or is there one? How are streams tokenized in .NET? I'd like to have the flexibility that java.io.StreamTokenizer provides. Any thoughts?
There isn't anything in .NET that is completely equivalent to StreamTokenizer. For simple cases, you can use String.Split(), but for more advanced token parsing, you'll probably end up using System.Text.RegularExpressions.Regex.
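For instance, a rough sketch of Regex-based token classification (the pattern and token names are just an example, not an attempt to reproduce StreamTokenizer exactly):

using System;
using System.Text.RegularExpressions;

// Named groups classify each match, roughly like StreamTokenizer's
// number/word/ordinary-character categories.
var pattern = new Regex(@"(?<number>\d+(\.\d+)?)|(?<word>[A-Za-z_]\w*)|(?<symbol>\S)");

foreach (Match m in pattern.Matches("foo = 42 + bar7"))
{
    string kind = m.Groups["number"].Success ? "number"
                : m.Groups["word"].Success   ? "word"
                : "symbol";
    Console.WriteLine($"{kind}: {m.Value}");
}

To tokenize a Stream rather than a String, you'd wrap it in a StreamReader and either read it fully or feed it to the regex line by line.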
Use System.String.Split if you need to split a string based on a collection of specific characters. Use System.Text.RegularExpressions.Regex.Split to split based on matching patterns.
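For example (a trivial sketch of both):

using System;
using System.Text.RegularExpressions;

string input = "one, two;three   four";

// Split on a fixed set of separator characters.
string[] byChars = input.Split(new[] { ',', ';', ' ' }, StringSplitOptions.RemoveEmptyEntries);

// Split on a pattern: any run of separators and whitespace.
string[] byPattern = Regex.Split(input, @"[,;\s]+");

Console.WriteLine(string.Join("|", byChars));   // one|two|three|four
Console.WriteLine(string.Join("|", byPattern)); // one|two|three|four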
There's a tokenizer in the Nextem library -- you can see an example here: http://trac.assembla.com/nextem/browser/trunk/Examples/Parsing.n
It's implemented as a Nemerle macro, but you can write this and then use it from C# easily.
I don't think so. For very simple tokenizing, have a look at System.String.Split().
More complex tokenizing can be achieved by System.Text.RegularExpressions.Regex.
We had the same problem of finding a StreamTokenizer equivalent when porting tuProlog from Java to C#. We ended up writing what as far as I know is a straight conversion of StreamTokenizer which takes a TextReader as a "stream" for input purposes. You will find the code in the download for tuProlog.NET 2.1 (LGPL-licensed) so feel free to reuse and adapt it to your needs.
To tokenize a string, use string.Split(...).