C strtol vs C# long.Parse

I wonder why C# does not have a version of long.Parse that accepts an offset into the string and a length. As it stands, I am forced to call string.Substring first.
This is unlike C's strtol, where one does not need to extract a substring first.
If I need to parse millions of rows, I suspect there will be overhead from creating all those small strings that immediately become garbage.
Is there a way to parse a string into numbers efficiently, without creating temporary, short-lived garbage strings on the heap? (Essentially, doing it the C way.)

Unless I'm reading this wrong, strtol doesn't take an offset into the string. It takes a memory address, which the caller can set to any position within a character buffer (or outside the buffer, if they aren't paying attention).
This presents a couple of issues:
Computing the offset requires an understanding of how the string is encoded. I believe C# currently uses UTF-16 for in-memory strings; if that were ever to change, your offsets would be off, possibly with disastrous results.
Computing the address could easily go stale for managed objects, since they are not pinned in memory: the GC can move them around at any time. You'd have to pin the string using something like GCHandle.Alloc, and when you're done you'd better unpin it, or you could have serious problems!
If you get the address wrong, e.g. outside your buffer, your program is likely to blow up.
I think C programmers are more accustomed to managing memory-mapped objects themselves and have no issue computing offsets and addresses and monkeying around with them like you would in assembly. With a managed language like C#, those sorts of things require more work and aren't typically done; the only time we pin things in memory is when we have to pass objects off to unmanaged code, and when we do, it incurs overhead. I wouldn't advise it if your overall goal is to improve performance.
But if you are hell-bent on getting down to the bare metal on this, you could try a solution where one clever C# programmer reads the string as an array of ASCII-encoded bytes and computes the numbers from those. With that solution, you can specify start and length to your heart's content. You'd have to write something different if your strings contain non-ASCII text. I would go this route rather than trying to hack the string object's memory layout.
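The same digit-by-digit idea can be sketched in safe C# without any pinning at all: walk the characters of the string between an offset and a length and accumulate the value. This is a minimal sketch (the SpanFreeParser name is made up, and it handles only non-negative base-10 numbers, with no sign, whitespace, or overflow handling):

```csharp
using System;

static class SpanFreeParser
{
    // Parses a non-negative long from s[start .. start+length) without
    // allocating a substring.
    public static long ParseLong(string s, int start, int length)
    {
        long result = 0;
        for (int i = start; i < start + length; i++)
        {
            char c = s[i];
            if (c < '0' || c > '9')
                throw new FormatException($"Unexpected character '{c}' at index {i}.");
            result = result * 10 + (c - '0');
        }
        return result;
    }
}
```

For example, SpanFreeParser.ParseLong("id=12345;", 3, 5) returns 12345 with no intermediate substring. Note also that newer runtimes (.NET Core 2.1 and later) added a long.Parse(ReadOnlySpan&lt;char&gt;) overload, so long.Parse(s.AsSpan(start, length)) achieves the same thing without custom code.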


Mutable String in unmanaged memory useable in managed space

NOTE: My case is in the ecosystem of an old API that only works with strings - no modern .NET additions.
So I have a strong need for a mutable string that causes no allocations. The string is updated every X ms, so you can imagine how much garbage it can produce in just a few minutes (StringBuilder is not even close to being relevant here). My current approach is to pre-allocate a string of fixed size and mutate it via pinning, writing characters directly, and either failing silently or throwing when capacity is reached.
This works fine. The allocated string is long-lived, so eventually the GC will promote it to Gen2, where pinning won't bother it that much, minimizing overhead. There are two major issues, though:
Because the string is fixed-size, I have to pad it with \0. While this has worked fine so far with all default .NET/Mono functionality and third-party code, there is no telling how something else will react to a string that is 1024 characters long but whose last 100 characters are \0.
I can't resize it, because that would incur an allocation. I could accept one allocation once in a blue moon, but since the string is fairly dynamic, I can't be sure when it will try to expand or shrink next. I COULD use an "expand only" approach, allocating only when expansion is needed; however, this carries a padding overhead (if the string expanded to 5k characters but the next string is just 3k, then 2k characters are padded for extra cycles) and extra memory usage. I'm also not sure how the GC will feel about a huge, often-pinned string that sits in Gen2 rather than the LOH. Another option would be to pool reusable string objects, but that has higher memory and GC overhead, plus lookup overhead.
Since the target string has to live for quite some time, I was thinking about moving it into unmanaged memory via a byte buffer. This removes the burden (the pinning penalty) from the GC, and I can resize/reallocate at less cost than on the managed heap.
What I'm having a hard time understanding is: how can I slice a specific part of an allocated unmanaged buffer and wrap it as a normal .NET string for use in managed code? For example, to pass it to Console.WriteLine, or to some third-party library that draws a UI label on screen and accepts a string. Is this even doable?
P.S. As far as I know, the plan for .NET 5 (to be finalized in .NET 6, I think) is that you will no longer be able to mutate things like strings (either blocked at runtime or undefined behavior). Their solution seems to be the POH (pinned object heap), which is essentially what I describe, with the same limitations.
how can I possibly slice specific part of allocated unmanaged buffer and wrap it as a normal net string to use in managed space/code
As far as I know, this is not possible. .NET has its own way of laying out objects (object headers etc.); you cannot treat an arbitrary memory region as a .NET object. Pinning and mutating a string also seems dangerous, since strings are intended to be immutable, and some things might not work correctly (using the string as a dictionary key, for example).
The correct way would be (as Canton7 mentions) to use a char[] buffer and Span&lt;char&gt; / Memory&lt;char&gt; for slicing. When passing to other methods, you can convert a slice to an actual string object. When calling methods like Console.WriteLine or UI methods, the overhead of allocating that string will be irrelevant compared to everything else that is going on.
If you have old code that only accepts string, you will either need to accept the limitations this entails or rewrite the code to accept Memory/Span representations.
I would highly recommend profiling to see whether the frequent allocations are an actual problem. As long as the string fits in the small object heap (SOH, i.e. less than 85,000 bytes) and is not promoted to Gen2, the overhead might not be huge. Allocations on the SOH are fast, and the time to run a Gen0 GC does not scale directly with the amount allocated. So updating every few milliseconds might not be terrible; I would be more worried if you were talking about microseconds.
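To make the char[] plus Span&lt;char&gt; suggestion concrete, here is a minimal sketch (the MutableText class and its fixed 1024-character capacity are made up for illustration): the text lives in one reusable char[], slices are handed out as ReadOnlySpan&lt;char&gt;, and a string is allocated only at the boundary to string-only APIs.

```csharp
using System;

class MutableText
{
    private readonly char[] _buffer = new char[1024];
    private int _length;

    // Overwrites the current contents; throws if value exceeds capacity.
    public void Set(ReadOnlySpan<char> value)
    {
        value.CopyTo(_buffer);
        _length = value.Length;
    }

    // Allocation-free view of the current text, sliceable at will.
    public ReadOnlySpan<char> AsSpan() => _buffer.AsSpan(0, _length);

    // Allocates only when a legacy string-typed API forces it.
    public override string ToString() => new string(_buffer, 0, _length);
}
```

Usage would look like: var t = new MutableText(); t.Set("hello world"); then pass t.AsSpan().Slice(0, 5) to span-aware code, and t.ToString() to anything that insists on a string.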

C# string to byte array speed

So coming off of this question:
Which is fast comparison: Convert.ToInt32(stringValue)==intValue or stringValue==intValue.ToString()
I am looking for a base type for my networked application to store in packets.
The Idea:
Packet class stores a list of (type)
Add objects to the packet class
Serialize and send it between machines
Deserialize into (type)
Convert (type) into the type of object you added originally.
Originally, I was using strings as (type). However, I am a bit dubious, as every time I want to convert an int to a string it seems like a taxing process. When I am communicating packets containing lots of uints as strings at 30 FPS, I would like to make this process as fast as possible.
Therefore, I was wondering if byte[] would be a more suitable type. How fast is converting back and forth between byte[] and ints/strings versus just strings to ints? BTW, I will not be sending many strings over the network; almost all of what I will be sending will be uints.
If you are using the same program on both ends, use binary serialization if possible. You are worried about speed, but unless this is just going between two processes on localhost, actual wire time, let alone latency, will be orders of magnitude slower than any real conversion process.
Of course, don't concatenate strings; you will make a liar out of me.
The thing you need to save here is your coding time, plus the possibility of errors from rolling your own serialization. If you properly encapsulate the data-transfer parts of your program, upgrading them later will be easy. Spending extra time up front making something fast is called premature optimization (google it - it's a valid argument, most of the time). If it does turn out to be a bottleneck, leverage your encapsulated design and change it. You won't spend much more time then than if you'd done it first, and you likely won't end up spending that time at all.
A warning about binary serialization: the types you are sending must have the same version and type name on both ends. If you can easily put the same version into production on both ends, it's no worry. If you need more than this, or binary serialization is too slow, look into FastJson, which makes big promises and is free, or something similar.
byte[] is the "natural" data type for socket operations, so this seems a good fit; ints/uints will be very fast to convert as well. Strings are a bit different, but if you choose the natural encoding of the platform, this will be fast too.
Convert.ToInt32 is decently fast, provided it does not fail. If it fails, you incur the overhead of a thrown/caught exception, which is massive.
The "byte[] vs. some other type" dichotomy is false. The network transports all information as, essentially, an array of bytes. So whether a StreamReader wrapped around a NetworkStream is turning the byte[] into a String, or you are doing it yourself, it still gets done.
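As a concrete sketch of the byte[]-as-wire-format idea, uints can be packed and unpacked with BitConverter. The PacketCodec name is made up, and note that BitConverter uses the machine's byte order, so a real protocol should fix endianness explicitly:

```csharp
using System;

static class PacketCodec
{
    // Packs each uint into 4 bytes, back to back.
    public static byte[] Encode(uint[] values)
    {
        var bytes = new byte[values.Length * sizeof(uint)];
        for (int i = 0; i < values.Length; i++)
            BitConverter.GetBytes(values[i]).CopyTo(bytes, i * sizeof(uint));
        return bytes;
    }

    // Reverses Encode: reads 4 bytes per uint.
    public static uint[] Decode(byte[] bytes)
    {
        var values = new uint[bytes.Length / sizeof(uint)];
        for (int i = 0; i < values.Length; i++)
            values[i] = BitConverter.ToUInt32(bytes, i * sizeof(uint));
        return values;
    }
}
```

This sends 4 bytes per uint instead of up to 10 characters (20 bytes in UTF-16) for its decimal string, and skips the digit-formatting work entirely.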

Copy string to memory buffer in C#

What is the best way to copy a string into a raw memory buffer in C#?
Note that I already know how and when to use String and StringBuilder, so don't suggest that ;) - I'm going to need some text processing & rendering code, and currently it looks like a memory buffer is both easier to code and more performant, as long as I can get the data into it. (I'm thinking of B-tree editor buffers and memory-mapped files, something which doesn't map well onto managed C# objects but is easily coded with pointers.)
Things I have already considered:
C++/CLI can do this: there is PtrToStringChars in vcclr.h, whose result can then be passed to memcpy. But I usually prefer having only one assembly, and merging the IL from multiple languages is something I like to avoid. Is there any way to rewrite that function in C#?
System.Runtime.InteropServices.Marshal has functions which copy the string, but only to a newly allocated buffer. I couldn't find any function that copies into an existing buffer.
I could use String.CopyTo with an array instead of a memory buffer, but then I need to pin that buffer a lot (or keep it pinned all the time), which is going to be bad for the GC. (By using a memory buffer in the first place, I can allocate it outside the managed heap so it doesn't interfere with the GC.)
If there were a way to pin or copy a StringBuilder, that would probably work too. My text usually comes from either a file or a StringBuilder, so if I can move it into the memory buffer at that point, it never needs to go through a String instance. (Note that going from StringBuilder to String doesn't matter for performance, because this is optimized not to make a copy if you stop using the StringBuilder afterwards.)
Can I generate IL which pins a String or StringBuilder? Then, instead of writing the copy function in C#, I could generate a DynamicMethod by emitting the required IL. I only thought of this while writing the question, so I might just disassemble the C++/CLI approach and reproduce its IL.
Enable unsafe code (somewhere in the project options), then use:
unsafe
{
    fixed (char* pc = myString)
    {
        // pc now points at the string's first UTF-16 character. Copy
        // myString.Length * sizeof(char) bytes into your buffer; here
        // "destination" is assumed to be a pointer into unmanaged memory
        // of at least "destinationSizeInBytes" bytes.
        Buffer.MemoryCopy(pc, destination, destinationSizeInBytes,
                          (long)myString.Length * sizeof(char));
    }
}
and then just use low-level memory copies like this.

Is there a string type with 8 BIT chars?

I need to store many strings in RAM, but they do not contain any special Unicode characters; they all contain only characters from ISO 8859-1, each of which fits in one byte.
Now I could convert every string, store it in memory, and convert it back to use it with .Contains() and methods like it, but this would be overhead (in my opinion) and slow.
Is there a string class that is fast and reliable and offers some of the methods of the original string class, like .Contains()?
I need this to store more strings in memory with less RAM used. Or is there another way to do it?
Update:
Thank you for your comments and your answer.
I have a class that stores strings. With one method call, I need to figure out whether I already have a given string in memory. I do about 1000 of these lookups per second, against hundreds of millions of strings in total.
The average size of a string is about 20 chars. It is really the RAM that concerns me.
I even thought about compressing some millions of strings and storing those packages in memory. But then I would need to decompress a package every time I need to access its values.
I also tried using a HashSet, but the amount of memory needed was even higher.
I don't need the actual value, just to know whether the value is in the list. So if there is a hash value that can do it, even better. But everything I found needs more memory than the pure strings.
Currently there is no plan for further internationalization, so that is something I would deal with when the time comes :-)
I don't know if using a database would solve it. I don't need to fetch anything, just to know whether the value was stored in the class. And I need to do this fast.
It is very unlikely that you will win any significant performance from this. However, if you need to save memory, this strategy may be appropriate.
To convert a string to a byte[] for this purpose, use Encoding.Default.GetBytes()[1].
To convert a byte[] back to a string for display or other string-based processing, use Encoding.Default.GetString().
You can make your code look nicer if you use extension methods defined on string and byte[]. Alternatively, you can wrap the byte[] in a wrapper type and put the methods there. Make this wrapper type a struct, not a class, otherwise it will incur extra heap allocations, which is what you’re trying to avoid.
I want to warn you, though — you are throwing away the ability to have Unicode in your application. You should normally have all alarm bells go off every time you think you need to do this. It is best if you structure your code in such a way that you can easily go back to using string when memory sizes will have gone up and memory consumption stops being an issue.
[1] Encoding.Default returns the current 8-bit codepage of the running operating system. The default for this on English-language Windows is Windows-1252, which is what you want. For Russian Windows it will be Windows-1251 (Cyrillic) etc.
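The extension-method idea might look like the following sketch. It uses ISO 8859-1 explicitly (the encoding the question names) rather than Encoding.Default, so the mapping does not vary with the operating system's code page; the method names are made up:

```csharp
using System.Text;

static class ByteStringExtensions
{
    private static readonly Encoding Latin1 = Encoding.GetEncoding("ISO-8859-1");

    // One byte per character, since every ISO 8859-1 character fits in a byte.
    public static byte[] ToLatin1Bytes(this string s) => Latin1.GetBytes(s);

    // Back to a regular string for display or string-based processing.
    public static string FromLatin1Bytes(this byte[] bytes) => Latin1.GetString(bytes);
}
```

With this, "Hello".ToLatin1Bytes() takes 5 bytes instead of the 10 bytes of UTF-16 character data, halving the payload per string (the object and array overheads remain).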
As per the comments, this is basically a bad idea. If you have to do it, byte[] is your friend; there is no byte-oriented string class in .NET.
Check out the string.Intern method - that could help you out:
http://www.yoda.arachsys.com/csharp/strings.html
http://en.csharp-online.net/CSharp_String_Theory%E2%80%94String_intern_pool
However, looking at your requirements, I think you are over-engineering this. You have 1000 strings at 20 chars each = 1000 * 20 * 2 = 40,000 bytes; that's not much memory.
If you really do have a large amount, store it in a DB with an index. That will be much faster than anything the average programmer can come up with.
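To illustrate what string.Intern does: it maps equal strings to a single pooled instance, so duplicate values stop costing memory. A small sketch (the InternDemo name is made up; it relies on the fact that the runtime interns string literals):

```csharp
using System;

static class InternDemo
{
    // Builds a fresh "hi" on the heap, then interns it. The interned result
    // is the pooled instance, which is the same object the literal "hi"
    // refers to; the freshly built copy is a separate object.
    public static bool InternedMatchesLiteral()
    {
        string built = new string(new[] { 'h', 'i' });
        return ReferenceEquals(string.Intern(built), "hi")
            && !ReferenceEquals(built, "hi");
    }
}
```

Note that interning only helps if the same values recur; interned strings also live for the lifetime of the process, so it is a poor fit for transient data.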

Memory Efficient Recursion

I have written an application in C# that generates all the words that can exist as combinations of letters, numbers and a few special characters.
The problem is that it isn't memory-efficient, since it relies on recursion and also on collections like List.
Is there any way I can make it run in a limited-memory environment?
Umair
Convert it to an iterative function.
Unfortunately, the C# compiler does not perform tail-call optimization, which is something you would want to happen in this case. The CLR supports it, sort of, but you shouldn't rely on it.
Perhaps out of left field, but maybe you can write the recursive part of your program in F#? That way you can leverage guaranteed tail-call optimization and reuse bits of your C# code. Whilst it has a steep learning curve, F# is a more suitable language for these combinatorial tasks.
Well... I am not sure which of you I side with, but I found a solution. I am using more than one process: one that interacts with the user, and another that finds the word combinations. The second process finds 5000 words, saves them, and quits; communication is achieved through WCF. This works out pretty well, because when the process quits, its memory is freed.
Well, you obviously cannot store the intermediate results in memory (unless you've got some sort of absurd computer at your disposal); you will have to write the results to disk.
The recursion depth isn't a result of the number of characters considered - it's determined by the maximum string length you're willing to consider.
For instance, my install of Python 2.6.2 has its default recursion limit set to 1000. Arguably, I should be able to generate all possible strings of length 1 to 1000 over a given character set within this limitation (though I think the recursion limit applies to total stack depth, so the actual limit may be less than 1000).
Edit (added python sample):
The following Python snippet will produce what you're asking for (limiting itself to the given runtime stack limits):
from string import ascii_lowercase

def generate(base="", charset=ascii_lowercase):
    for c in charset:
        next = base + c
        yield next
        try:
            for s in generate(next, charset):
                yield s
        except:
            continue

for s in generate():
    print s
One could produce essentially the same thing in C# by catching the equivalent depth-limit error (note, though, that since .NET 2.0 a StackOverflowException cannot be caught, so in practice you would track the recursion depth yourself). As I'm typing this update, the script is running, chewing up one of my cores, yet memory usage is constant at less than 7 MB. Now, I'm only printing to stdout since I'm not interested in capturing the result, but I think it proves the point above. ;)
Addendum to the example:
An interesting note: looking more closely at the running processes, Python is actually I/O-bound in the above example. It's only using 7% of my CPU, while the rest of the core is busy rendering the results in my command window. Minimizing the window allows Python to climb to 40% of total CPU usage; this is on a 2-core machine.
One more consideration: when you concatenate strings, or use some other method to generate a string in C#, each string occupies its own memory and may stick around for a while. If you are generating millions of strings, you are likely to notice some performance drag.
If you don't need to keep your many strings around, I would see if there's a way to avoid generating them at all. For example, you might keep a character array that you update as you move through the character combinations, and if you're outputting them to a file, write them one character at a time so you never have to build the string.
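The character-array suggestion above can be sketched as an iterative, odometer-style enumerator that reuses one buffer and never recurses (the Combinations class and OfLength method names are made up):

```csharp
using System;
using System.Collections.Generic;

static class Combinations
{
    // Enumerates every string of the given length over the charset,
    // iteratively: indices acts like an odometer over the charset,
    // and one char buffer is reused for every combination.
    public static IEnumerable<string> OfLength(int length, char[] charset)
    {
        var indices = new int[length];
        var buffer = new char[length];
        while (true)
        {
            for (int i = 0; i < length; i++)
                buffer[i] = charset[indices[i]];
            yield return new string(buffer);   // or write buffer straight to a file

            int pos = length - 1;
            while (pos >= 0 && ++indices[pos] == charset.Length)
                indices[pos--] = 0;            // carry into the next position
            if (pos < 0)
                yield break;                   // odometer rolled over: done
        }
    }
}
```

Memory use is O(length) regardless of how many combinations are produced; if the results go straight to a stream, even the per-item string allocation can be dropped by writing the buffer directly.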
