create fixed size string in a struct in C#? - c#

I m a newbie in C#.I want to create a struct in C# which consist of string variable of fixed size. example DistributorId of size [20]. What is the exact way of giving the string a fixed size.
public struct DistributorEmail
{
public String DistributorId;
public String EmailId;
}

If you need fixed, preallocated buffers, String is not the correct datatype.
This type of usage would only make sense in an interop context though, otherwise you should stick to Strings.
You will also need to compile your assembly with allow unsafe code.
unsafe public struct DistributorEmail
{
public fixed char DistributorId[20];
public fixed char EmailID[20];
public DistributorEmail(string dId)
{
fixed (char* distId = DistributorId)
{
char[] chars = dId.ToCharArray();
Marshal.Copy(chars, 0, new IntPtr(distId), chars.Length);
}
}
}
If for some reason you are in need of fixed size buffers, but not in an interop context, you can use the same struct but without unsafe and fixed. You will then need to allocate the buffers yourself.
Another important point to keep in mind, is that in .NET, sizeof(char) != sizeof(byte). A char is at the very least 2 bytes, even if it is encoded in ANSI.

If you really need a fixed length, you can always use a char[] instead of a string. It's easy to convert to/from, if you also need string manipulation.
string s = "Hello, world";
char[] ca = s.ToCharArray();
string s1 = new string(ca);
Note that, aside from some special COM interop scenarios, you can always just use strings, and let the framework worry about sizes and storage.

You can create a new fixed length string by specifying the length when you create it.
string(char c, int count)
This code will create a new string of 40 characters in length, filled with the space character.
string newString = new string(' ', 40);

As string extension, covers source string longer and shorter thand fixed:
public static string ToFixedLength(this string inStr, int length)
{
if (inStr.Length == length)
return inStr;
if(inStr.Length > length)
return inStr.Substring(0, length);
var blanks = Enumerable.Range(1, length - inStr.Length).Select(v => " ").Aggregate((a, b) => $"{a}{b}");
return $"{inStr}{blanks}";
}

Related

Marshalling utf8 encoded chinese characters from C# to C++

I'm marshaling some Chinese characters which have the decimal representation (utf8) as
228,184,145,230,161,148
however when I receive this in C++ I end up with the chars
-77,-13,-67,-37
I can solve this using a sbyte[] instead of string in c#, but now I'm trying to marshal a string[] so I can't use this method. Anyone have an idea as to why this is happening?
EDIT: more detailed code:
C#
[DllImport("mydll.dll",CallingConvention=CallingConvention.Cdecl)]
static extern IntPtr inputFiles(IntPtr pAlzObj, string[] filePaths, int fileNum);
string[] allfiles = Directory.GetFiles("myfolder", "*.jpg", SearchOption.AllDirectories);
string[] allFilesutf8 = allfiles.Select(i => Encoding.UTF8.GetString(Encoding.Default.GetBytes(i))).ToArray();
IntPtr pRet = inputFiles(pObj, allfiles, allfiles.Length);
C++
extern __declspec(dllexport) char* inputFiles(Alz* pObj, char** filePaths, int fileNum);
char* massAdd(Alz* pObj, char** filePaths, int fileNum)
{
if (pObj != NULL) {
try{
std::vector<const char*> imgPaths;
for (int i = 0; i < fileNum; i++)
{
char* s = *(filePaths + i);
//Here I would print out the string and the result in bytes (decimals representation) are already different.
imgPaths.push_back(s);
}
string ret = pAlzObj->myfunc(imgPaths);
const char* retTemp = ret.c_str();
char* retChar = _strdup(retTemp);
return retChar;
}
catch (const std::runtime_error& e) {
cout << "some runtime error " << e.what() << endl;
}
}
}
Also, something I found is that if I change the windows universal encoding (In language settings) to use unicode UTF-8, it works fine. Not sure why though.
When marshaling to unsigned char* (or unsigned char** as it's an array) I end up with another output, which is literally just 256+the nummbers shown when in char. 179,243,189,219. This leads me to believe there is something happening during marshaling rather than a conversion mistake on the C++ side of things.
That is because C++ strings uses standard char when stored. The char type is indeed signed and that makes those values being interpreted as negative ones.
I guess that traits may be handled inside the <xstring> header on windows (as far as I know). Specifically in:
_STD_BEGIN
template <class _Elem, class _Int_type>
struct _Char_traits { // properties of a string or stream element
using char_type = _Elem;
using int_type = _Int_type;
using pos_type = streampos;
using off_type = streamoff;
using state_type = _Mbstatet;
#if _HAS_CXX20
using comparison_category = strong_ordering;
#endif // _HAS_CXX20
I have some ideas: You solve problem by using a sbyte[] instead of string in c#, and now you are trying to marshal a string[], just use List<sbyte[]> for string array.
I am not experienced with c++ but I guess there are another libraries for strings use one of them. Look this link, link show string types can marshalling to c#. https://learn.microsoft.com/en-us/dotnet/api/system.runtime.interopservices.unmanagedtype?view=net-7.0
The issue was in the marshaling. I think it was because as the data is transferred, the locale setting in the C++ dll was set to GBK (at least not UTF-8). The trick was to convert the incoming strings into UTF-8 from GBK, which I was able to do with the following function:
std::string gb_to_utf8(char* src)
{
wchar_t* strA;
int i = MultiByteToWideChar(CP_ACP, 0, src, -1, NULL, 0);
strA = (wchar_t*)malloc(i * 2);
MultiByteToWideChar(CP_ACP, 0, src, -1, strA, i);
if (!strlen((char*)strA)) {
throw std::runtime_error("error converting");
}
char utf8[1024]; //Unsure how long converted string could be, set as large number
int n = 0;
n = wcstombs(utf8, strA, sizeof(utf8));
std::string resStr = utf8;
free(strA);
return resStr;
}
Also needed to set setlocale(LC_ALL, "en_US.UTF-8"); in order for the function above to work.

How can I marshal this character array parameter from C to a string in C#?

I have the following function, written as part of another class in C:
int example(char *remoteServerName)
{
if (doSomething(job))
return getError(job);
if (job->server != NULL) {
int length = strlen(jobPtr->server->name); // name is a char * of length 1025
remoteServerName = malloc (length * sizeof(char));
strncpy(remoteServerName, jobPtr->server->name, length);
}
return 0;
}
How can I get the remoteServerName back from it? I have tried the following:
[DllImport("example.dll")]
public static extern int example(StringBuilder remoteServerName);
var x = new StringBuilder();
example(x);
Console.WriteLine(x.ToString());
But the string is always empty.
You need to allocate some space for the string to be returned in. Instead of:
var x = new StringBuilder();
provide a capacity value:
var x = new StringBuilder(1024);
You should also remove your call to malloc. The caller allocates the memory. That is the purpose of marshalling with StringBuilder.
You are not using strncpy correctly, and so fail to write a null terminator. You could pass the buffer length like this:
int example(char *remoteServerName)
{
if (doSomething(job))
return getError(job);
if (job->server != NULL) {
// note that new StringBuilder(N) means a buffer of length N+1 is marshaled
strncpy(remoteServerName, jobPtr->server->name, 1025);
}
return 0;
}
But that would be a bit wasteful, with all the zero padding that is implied. Really, strncpy is next to useless and you should use a different function to copy, as has been discussed many times before here. I don't really want to get drawn into that because it's a little off to the side of the question.
It would be prudent to design your API to allow the caller to also pass the length of the character array so that the callee can make sure not to overrun the buffer, and so that you don't need to use magic constants as the code here does.

What's the correct way to count the bytes needed for a UTF8 conversion?

I need to count the size, in bytes, that a substring will be once converted into a UTF8 byte array. This needs to happen without actually doing the conversion of that substring. The string I'm working with is very large, unfortunately, and I've got to be careful not to create another large string (or byte array) in memory.
There's a method on the Encoding.UTF8 object called GetByteCount, but I'm not seeing an overload that does it where I don't have to copy the string into a byte array. This doesn't work for me:
Encoding.UTF8.GetByteCount(stringToCount.ToCharArray(), startIndex, count);
because stringToCount.ToCharArray() will create a copy of my string.
Here's what I have right now:
public static int CalculateTotalBytesForUTF8Conversion(string stringToCount, int startIndex, int endIndex)
{
var totalBytes = 0;
for (int i = startIndex ; i < endIndex; i++)
totalBytes += Encoding.UTF8.GetByteCount(new char[] { stringToCount[i] });
return totalBytes;
}
The GetByteCount method doesn't appear to have the ability to take in just a char, so this was the compromise I'm at.
Is this the right way to determine the byte count of a substring, after conversion to UTF8, without actually doing that conversion? Or is there a better method to do this?
There doesn't appear to be a built-in method for doing this, so you could either analyze the characters yourself or do the sort of thing you're doing above. The only thing I would recommend -- reuse a char[1] array, rather than creating a new array with each iteration. Here's an extension method that jives well with the built-in methods.
public static class EncodingExtensions
{
public static int GetByteCount(this Encoding encoding, string s, int index, int count)
{
var output = 0;
var end = index + count;
var charArray = new char[1];
for (var i = index; i < end; i++)
{
charArray[0] = s[i];
output += Encoding.UTF8.GetByteCount(charArray);
}
return output;
}
}
So, there is an overload which doesn't require the caller create an array of characters first: Encoding.GetByteCount Method (Char*, Int32)
The issue is that this isn't a CLS-compliant method and will require you do some exotic coding:
public static unsafe int CalculateTotalBytesForUTF8Conversion(
string stringToCount,
int startIndex,
int endIndex)
{
// Fix the string in memory so we can grab a pointer to its location.
fixed (char* stringStart = stringToCount)
{
// Get a pointer to the start of the substring.
char* substring = stringStart + startIndex;
return Encoding.UTF8.GetByteCount(substring, endIndex - startIndex);
}
}
Key things to note here:
The method has to be marked unsafe, since we're working with pointers and direct memory manipulation.
The string is fixed for the duration of the call in order prevent the runtime moving it around - it gives us a constant location to point to, but it prevents the runtime doing memory optimization.
You should consider doing thorough performance profiling on this method to ensure it gives you a better performance profile than simply copying the string to an array.
A bit of basic profiling (a console application executing the algorithms in sequence on my desktop machine) shows that this approach executes ~35 times faster than looping over the string or converting it to a character-array.
Using pointer: ~86ms
Looping over string: ~2957ms
Converting to char array: ~3156ms
Take these figures with a pinch of salt, and also consider other factors besides just execution speed, such as long-term execution overheads (i.e. in a service process), or memory usage.

Why does .NET create new substrings instead of pointing into existing strings?

From a brief look using Reflector, it looks like String.Substring() allocates memory for each substring. Am I correct that this is the case? I thought that wouldn't be necessary since strings are immutable.
My underlying goal was to create a IEnumerable<string> Split(this String, Char) extension method that allocates no additional memory.
One reason why most languages with immutable strings create new substrings rather than refer into existing strings is because this will interfere with garbage collecting those strings later.
What happens if a string is used for its substring, but then the larger string becomes unreachable (except through the substring). The larger string will be uncollectable, because that would invalidate the substring. What seemed like a good way to save memory in the short term becomes a memory leak in the long term.
Not possible without poking around inside .net using String classes. You would have to pass around references to an array which was mutable and make sure no one screwed up.
.Net will create a new string every time you ask it to. Only exception to this is interned strings which are created by the compiler (and can be done by you) which are placed into memory once and then pointers are established to the string for memory and performance reasons.
Each string has to have it's own string data, with the way that the String class is implemented.
You can make your own SubString structure that uses part of a string:
public struct SubString {
private string _str;
private int _offset, _len;
public SubString(string str, int offset, int len) {
_str = str;
_offset = offset;
_len = len;
}
public int Length { get { return _len; } }
public char this[int index] {
get {
if (index < 0 || index > len) throw new IndexOutOfRangeException();
return _str[_offset + index];
}
}
public void WriteToStringBuilder(StringBuilder s) {
s.Write(_str, _offset, _len);
}
public override string ToString() {
return _str.Substring(_offset, _len);
}
}
You can flesh it out with other methods like comparison that is also possible to do without extracting the string.
Because strings are immutable in .NET, every string operation that results in a new string object will allocate a new block of memory for the string contents.
In theory, it could be possible to reuse the memory when extracting a substring, but that would make garbage collection very complicated: what if the original string is garbage-collected? What would happen to the substring that shares a piece of it?
Of course, nothing prevents the .NET BCL team to change this behavior in future versions of .NET. It wouldn't have any impact on existing code.
Adding to the point that Strings are immutable, you should be that the following snippet will generate multiple String instances in memory.
String s1 = "Hello", s2 = ", ", s3 = "World!";
String res = s1 + s2 + s3;
s1+s2 => new string instance (temp1)
temp1 + s3 => new string instance (temp2)
res is a reference to temp2.

ReverseString, a C# interview-question

I had an interview question that asked me for my 'feedback' on a piece of code a junior programmer wrote. They hinted there may be a problem and said it will be used heavily on large strings.
public string ReverseString(string sz)
{
string result = string.Empty;
for(int i = sz.Length-1; i>=0; i--)
{
result += sz[i]
}
return result;
}
I couldn't spot it. I saw no problems whatsoever.
In hindsight I could have said the user should resize but it looks like C# doesn't have a resize (i am a C++ guy).
I ended up writing things like use an iterator if its possible, [x] in containers could not be random access so it may be slow. and misc things. But I definitely said I never had to optimize C# code so my thinking may have not failed me on the interview.
I wanted to know, what is the problem with this code, do you guys see it?
-edit-
I changed this into a wiki because there can be several right answers.
Also i am so glad i explicitly said i never had to optimize a C# program and mentioned the misc other things. Oops. I always thought C# didnt have any performance problems with these type of things. oops.
Most importantly? That will suck performance wise - it has to create lots of strings (one per character). The simplest way is something like:
public static string Reverse(string sz) // ideal for an extension method
{
if (string.IsNullOrEmpty(sz) || sz.Length == 1) return sz;
char[] chars = sz.ToCharArray();
Array.Reverse(chars);
return new string(chars);
}
The problem is that string concatenations are expensive to do as strings are immutable in C#. The example given will create a new string one character longer each iteration which is very inefficient. To avoid this you should use the StringBuilder class instead like so:
public string ReverseString(string sz)
{
var builder = new StringBuilder(sz.Length);
for(int i = sz.Length-1; i>=0; i--)
{
builder.Append(sz[i]);
}
return builder.ToString();
}
The StringBuilder is written specifically for scenarios like this as it gives you the ability to concatenate strings without the drawback of excessive memory allocation.
You will notice I have provided the StringBuilder with an initial capacity which you don't often see. As you know the length of the result to begin with, this removes needless memory allocations.
What normally happens is it allocates an amount of memory to the StringBuilder (default 16 characters). Once the contents attempts to exceed that capacity it doubles (I think) its own capactity and carries on. This is much better than allocating memory each time as would happen with normal strings, but if you can avoid this as well it's even better.
A few comments on the answers given so far:
Every single one of them (so far!) will fail on surrogate pairs and combining characters. Oh the joys of Unicode. Reversing a string isn't the same as reversing a sequence of chars.
I like Marc's optimisation for null, empty, and single character inputs. In particular, not only does this get the right answer quickly, but it also handles null (which none of the other answers do)
I originally thought that ToCharArray followed by Array.Reverse would be the fastest, but it does create one "garbage" copy.
The StringBuilder solution creates a single string (not char array) and manipulates that until you call ToString. There's no extra copying involved... but there's a lot more work maintaining lengths etc.
Which is the more efficient solution? Well, I'd have to benchmark it to have any idea at all - but even so that's not going to tell the whole story. Are you using this in a situation with high memory pressure, where extra garbage is a real pain? How fast is your memory vs your CPU, etc?
As ever, readability is usually king - and it doesn't get much better than Marc's answer on that front. In particular, there's no room for an off-by-one error, whereas I'd have to actually put some thought into validating the other answers. I don't like thinking. It hurts my brain, so I try not to do it very often. Using the built-in Array.Reverse sounds much better to me. (Okay, so it still fails on surrogates etc, but hey...)
Since strings are immutable, each += statement will create a new string by copying the string in the last step, along with the single character to form a new string. Effectively, this will be an O(n2) algorithm instead of O(n).
A faster way would be (O(n)):
// pseudocode:
static string ReverseString(string input) {
char[] buf = new char[input.Length];
for(int i = 0; i < buf.Length; ++i)
buf[i] = input[input.Length - i - 1];
return new string(buf);
}
You can do this in .NET 3.5 instead:
public static string Reverse(this string s)
{
return new String((s.ToCharArray().Reverse()).ToArray());
}
Better way to tackle it would be to use a StringBuilder, since it is not immutable you won't get the terrible object generation behavior that you would get above. In .net all strings are immutable, which means that the += operator there will create a new object each time it is hit. StringBuilder uses an internal buffer, so the reversal could be done in the buffer w/ no extra object allocations.
You should use the StringBuilder class to create your resulting string. A string is immutable so when you append a string in each interation of the loop, a new string has to be created, which isn't very efficient.
I prefer something like this:
using System;
using System.Text;
namespace SpringTest3
{
static class Extentions
{
static private StringBuilder ReverseStringImpl(string s, int pos, StringBuilder sb)
{
return (s.Length <= --pos || pos < 0) ? sb : ReverseStringImpl(s, pos, sb.Append(s[pos]));
}
static public string Reverse(this string s)
{
return ReverseStringImpl(s, s.Length, new StringBuilder()).ToString();
}
}
class Program
{
static void Main(string[] args)
{
Console.WriteLine("abc".Reverse());
}
}
}
x is the string to reverse.
Stack<char> stack = new Stack<char>(x);
string s = new string(stack.ToArray());
This method cuts the number of iterations in half. Rather than starting from the end, it starts from the beginning and swaps characters until it hits center. Had to convert the string to a char array because the indexer on a string has no setter.
public string Reverse(String value)
{
if (String.IsNullOrEmpty(value)) throw new ArgumentNullException("value");
char[] array = value.ToCharArray();
for (int i = 0; i < value.Length / 2; i++)
{
char temp = array[i];
array[i] = array[(array.Length - 1) - i];
array[(array.Length - 1) - i] = temp;
}
return new string(array);
}
Necromancing.
As a public service, this is how you actually CORRECTLY reverse a string (reversing a string is NOT equal to reversing a sequence of chars)
public static class Test
{
private static System.Collections.Generic.List<string> GraphemeClusters(string s)
{
System.Collections.Generic.List<string> ls = new System.Collections.Generic.List<string>();
System.Globalization.TextElementEnumerator enumerator = System.Globalization.StringInfo.GetTextElementEnumerator(s);
while (enumerator.MoveNext())
{
ls.Add((string)enumerator.Current);
}
return ls;
}
// this
private static string ReverseGraphemeClusters(string s)
{
if(string.IsNullOrEmpty(s) || s.Length == 1)
return s;
System.Collections.Generic.List<string> ls = GraphemeClusters(s);
ls.Reverse();
return string.Join("", ls.ToArray());
}
public static void TestMe()
{
string s = "Les Mise\u0301rables";
// s = "noël";
string r = ReverseGraphemeClusters(s);
// This would be wrong:
// char[] a = s.ToCharArray();
// System.Array.Reverse(a);
// string r = new string(a);
System.Console.WriteLine(r);
}
}
See:
https://vimeo.com/7403673
By the way, in Golang, the correct way is this:
package main
import (
"unicode"
"regexp"
)
func main() {
str := "\u0308" + "a\u0308" + "o\u0308" + "u\u0308"
println("u\u0308" + "o\u0308" + "a\u0308" + "\u0308" == ReverseGrapheme(str))
println("u\u0308" + "o\u0308" + "a\u0308" + "\u0308" == ReverseGrapheme2(str))
}
func ReverseGrapheme(str string) string {
buf := []rune("")
checked := false
index := 0
ret := ""
for _, c := range str {
if !unicode.Is(unicode.M, c) {
if len(buf) > 0 {
ret = string(buf) + ret
}
buf = buf[:0]
buf = append(buf, c)
if checked == false {
checked = true
}
} else if checked == false {
ret = string(append([]rune(""), c)) + ret
} else {
buf = append(buf, c)
}
index += 1
}
return string(buf) + ret
}
func ReverseGrapheme2(str string) string {
re := regexp.MustCompile("\\PM\\pM*|.")
slice := re.FindAllString(str, -1)
length := len(slice)
ret := ""
for i := 0; i < length; i += 1 {
ret += slice[length-1-i]
}
return ret
}
And the incorrect way is this (ToCharArray.Reverse):
func Reverse(s string) string {
runes := []rune(s)
for i, j := 0, len(runes)-1; i < j; i, j = i+1, j-1 {
runes[i], runes[j] = runes[j], runes[i]
}
return string(runes)
}
Note that you need to know the difference between
- a character and a glyph
- a byte (8 bit) and a codepoint/rune (32 bit)
- a codepoint and a GraphemeCluster [32+ bit] (aka Grapheme/Glyph)
Reference:
Character is an overloaded term than can mean many things.
A code point is the atomic unit of information. Text is a sequence of
code points. Each code point is a number which is given meaning by the
Unicode standard.
A grapheme is a sequence of one or more code points that are displayed
as a single, graphical unit that a reader recognizes as a single
element of the writing system. For example, both a and ä are
graphemes, but they may consist of multiple code points (e.g. ä may be
two code points, one for the base character a followed by one for the
diaresis; but there's also an alternative, legacy, single code point
representing this grapheme). Some code points are never part of any
grapheme (e.g. the zero-width non-joiner, or directional overrides).
A glyph is an image, usually stored in a font (which is a collection
of glyphs), used to represent graphemes or parts thereof. Fonts may
compose multiple glyphs into a single representation, for example, if
the above ä is a single code point, a font may chose to render that as
two separate, spatially overlaid glyphs. For OTF, the font's GSUB and
GPOS tables contain substitution and positioning information to make
this work. A font may contain multiple alternative glyphs for the same
grapheme, too.
static string reverseString(string text)
{
Char[] a = text.ToCharArray();
string b = "";
for (int q = a.Count() - 1; q >= 0; q--)
{
b = b + a[q].ToString();
}
return b;
}

Categories

Resources