IndexOf char within a ReadOnlySpan<byte> of UTF8 bytes - c#

I'm looking for an efficient, allocation-free (!) implementation of
public static int IndexOf(this ReadOnlySpan<byte> utf8Bytes, char @char)
{
// Should return the index of the first byte of @char within utf8Bytes
// (not the character index of @char within the string)
}
I've not found a way to iterate through the span char by char yet. Utf8Parser does not have an overload supporting single characters.
And System.Text.Encoding seems to work mostly on the entire span, and does allocate internally while doing so.
Is there any builtin functionality I haven't spotted yet? If not, can anyone think of a reasonable custom implementation?

Rather than trying to iterate through the utf8Bytes character by character, it may be easier to convert the character to a short stackalloc'ed utf8 byte sequence, and search for that:
public static class StringExtensions
{
const int MaxBytes = 4;
public static int IndexOf(this ReadOnlySpan<byte> utf8Bytes, char @char)
{
Rune rune;
try
{
rune = new Rune(@char);
}
catch (ArgumentOutOfRangeException)
{
// Malformed unicode character, return -1 or throw?
return -1;
}
return utf8Bytes.IndexOf(rune);
}
public static int IndexOf(this ReadOnlySpan<byte> utf8Bytes, Rune @char)
{
Span<byte> charBytes = stackalloc byte[MaxBytes];
var n = @char.EncodeToUtf8(charBytes);
charBytes = charBytes.Slice(0, n);
for (int i = 0, thisLength = 1; i <= utf8Bytes.Length - charBytes.Length; i += thisLength)
{
thisLength = Utf8ByteSequenceLength(utf8Bytes[i]);
if (thisLength == charBytes.Length && charBytes.CommonPrefixLength(utf8Bytes.Slice(i)) == charBytes.Length)
return i;
}
return -1;
}
static int Utf8ByteSequenceLength(byte firstByte)
{
//https://en.wikipedia.org/wiki/UTF-8#Encoding
if ( (firstByte & 0b11111000) == 0b11110000) // 11110xxx
return 4;
else if ((firstByte & 0b11110000) == 0b11100000) // 1110xxxx
return 3;
else if ((firstByte & 0b11100000) == 0b11000000) // 110xxxxx
return 2;
return 1; // Either a 1-byte sequence (matching 0xxxxxxx) or an invalid start byte.
}
}
Notes:
Rune is a struct introduced in .NET Core 3.x that represents a Unicode scalar value. If you need to search your utf8Bytes for a Unicode codepoint that is not in the basic multilingual plane, you will need to use Rune.
Rune has the added advantage that its method Rune.TryEncodeToUtf8() is lightweight and allocation-free.
If char @char is a lone surrogate (and therefore not a valid Unicode scalar value), the Rune constructor throws an ArgumentOutOfRangeException. The above code catches the exception and returns -1. You may wish to rethrow the exception.
As an alternative, Rune.DecodeFromUtf8(ReadOnlySpan<Byte>, out Rune, out Int32) can be used to iterate through a utf8 byte span Rune by Rune. You could use that to locate an incoming Rune by index. However, I suspect doing so would be less efficient than the method above.
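For reference, here is a minimal sketch of that Rune-by-Rune alternative, assuming .NET Core 3.0 or later. The names Utf8RuneSearch and IndexOfRune are made up for illustration, and ill-formed sequences are simply skipped, which may or may not be what you want.

```csharp
using System;
using System.Buffers;
using System.Text;

public static class Utf8RuneSearch
{
    // Hypothetical helper: returns the byte offset of the first occurrence
    // of target in utf8Bytes, or -1 if it does not occur.
    public static int IndexOfRune(ReadOnlySpan<byte> utf8Bytes, Rune target)
    {
        int offset = 0;
        while (offset < utf8Bytes.Length)
        {
            // Decode exactly one scalar value starting at the current offset.
            OperationStatus status = Rune.DecodeFromUtf8(
                utf8Bytes.Slice(offset), out Rune current, out int consumed);

            if (status == OperationStatus.Done && current == target)
                return offset;

            // Always advance by at least one byte so the loop terminates
            // even on ill-formed input.
            offset += Math.Max(consumed, 1);
        }
        return -1;
    }
}
```

This never allocates, but it decodes every scalar value on the way, so it is probably only worth it if you need the decoded Runes for something else as well.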

You can avoid allocations with stackalloc. A first approximation could look like this:
static (int Found, int Processed) IndexOf(ReadOnlySpan<byte> utf8Bytes, char @char)
{
Span<char> chars = stackalloc char[utf8Bytes.Length]; // "worst" case every byte is a separate char
var proc = Encoding.UTF8.GetChars(utf8Bytes, chars);
var indexOf = chars.IndexOf(@char);
if (indexOf > 0)
{
Span<byte> bytes = stackalloc byte[indexOf * 4];
var result = Encoding.UTF8.GetBytes(chars.Slice(0, indexOf), bytes);
return (result, proc);
}
return (indexOf, proc);
}
A few notes:
Large incoming spans can cause a stack overflow.
Decoding the whole array is not optimal.
The span can contain "partial" code points at its start and end, so Processed has to be handled accordingly.
The first two points can be mitigated by processing the incoming span in smaller slices (for example, reading 4 bytes into a 4-char span).
I believe System.IO.Pipelines deals with the same issues (via System.Buffers, I think), though 1) it may not be completely allocation-free and 2) I have not investigated it enough to provide a completely working example.

From .NET 5 onwards, there's a library method EncodingExtensions.GetChars to help you.
Specifically, you want the overload that gets the byte data from a ReadOnlySpan and writes to an IBufferWriter<char>, which you can then implement to receive your characters one by one and run whatever on them (your matching algorithm, for example). This solution is allocation-free of course, as long as you put your custom buffer writer in a static field and allocate it only once.

Related

How can I marshal this character array parameter from C to a string in C#?

I have the following function, written as part of another class in C:
int example(char *remoteServerName)
{
if (doSomething(job))
return getError(job);
if (job->server != NULL) {
int length = strlen(jobPtr->server->name); // name is a char * of length 1025
remoteServerName = malloc (length * sizeof(char));
strncpy(remoteServerName, jobPtr->server->name, length);
}
return 0;
}
How can I get the remoteServerName back from it? I have tried the following:
[DllImport("example.dll")]
public static extern int example(StringBuilder remoteServerName);
var x = new StringBuilder();
example(x);
Console.WriteLine(x.ToString());
But the string is always empty.
You need to allocate some space for the string to be returned in. Instead of:
var x = new StringBuilder();
provide a capacity value:
var x = new StringBuilder(1024);
You should also remove your call to malloc. The caller allocates the memory. That is the purpose of marshalling with StringBuilder.
You are not using strncpy correctly, and so fail to write a null terminator. You could pass the buffer length like this:
int example(char *remoteServerName)
{
if (doSomething(job))
return getError(job);
if (job->server != NULL) {
// note that new StringBuilder(N) means a buffer of length N+1 is marshaled
strncpy(remoteServerName, jobPtr->server->name, 1025);
}
return 0;
}
But that would be a bit wasteful, with all the zero padding that is implied. Really, strncpy is next to useless and you should use a different function to copy, as has been discussed many times before here. I don't really want to get drawn into that because it's a little off to the side of the question.
It would be prudent to design your API to allow the caller to also pass the length of the character array so that the callee can make sure not to overrun the buffer, and so that you don't need to use magic constants as the code here does.

What's the correct way to count the bytes needed for a UTF8 conversion?

I need to count the size, in bytes, that a substring will be once converted into a UTF8 byte array. This needs to happen without actually doing the conversion of that substring. The string I'm working with is very large, unfortunately, and I've got to be careful not to create another large string (or byte array) in memory.
There's a method on the Encoding.UTF8 object called GetByteCount, but I'm not seeing an overload that does it where I don't have to copy the string into a byte array. This doesn't work for me:
Encoding.UTF8.GetByteCount(stringToCount.ToCharArray(), startIndex, count);
because stringToCount.ToCharArray() will create a copy of my string.
Here's what I have right now:
public static int CalculateTotalBytesForUTF8Conversion(string stringToCount, int startIndex, int endIndex)
{
var totalBytes = 0;
for (int i = startIndex ; i < endIndex; i++)
totalBytes += Encoding.UTF8.GetByteCount(new char[] { stringToCount[i] });
return totalBytes;
}
The GetByteCount method doesn't appear to have the ability to take in just a char, so this was the compromise I'm at.
Is this the right way to determine the byte count of a substring, after conversion to UTF8, without actually doing that conversion? Or is there a better method to do this?
There doesn't appear to be a built-in method for doing this, so you could either analyze the characters yourself or do the sort of thing you're doing above. The only thing I would recommend -- reuse a char[1] array, rather than creating a new array with each iteration. Here's an extension method that jives well with the built-in methods.
public static class EncodingExtensions
{
public static int GetByteCount(this Encoding encoding, string s, int index, int count)
{
var output = 0;
var end = index + count;
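var charArray = new char[1];
for (var i = index; i < end; i++)
{
// Note: counting one char at a time miscounts surrogate pairs,
// because each half is an invalid code unit on its own.
charArray[0] = s[i];
output += encoding.GetByteCount(charArray);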
}
return output;
}
}
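On newer runtimes the loop may be unnecessary. As far as I know, Encoding.GetByteCount has had a ReadOnlySpan<char> overload since .NET Core 2.1, and slicing a string with AsSpan does not copy it. A sketch under that assumption (Utf8Count and CountUtf8Bytes are illustrative names, not framework APIs):

```csharp
using System;
using System.Text;

public static class Utf8Count
{
    // Hypothetical helper: counts the UTF-8 bytes of a substring
    // without materialising the substring or a char array.
    public static int CountUtf8Bytes(string s, int startIndex, int count)
    {
        // AsSpan creates a view over the string's existing memory.
        ReadOnlySpan<char> slice = s.AsSpan(startIndex, count);
        return Encoding.UTF8.GetByteCount(slice);
    }
}
```

Because the whole slice is passed in one call, surrogate pairs inside it are counted as a single 4-byte character rather than as two invalid halves.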
So, there is an overload which doesn't require the caller create an array of characters first: Encoding.GetByteCount Method (Char*, Int32)
The issue is that this isn't a CLS-compliant method and will require you do some exotic coding:
public static unsafe int CalculateTotalBytesForUTF8Conversion(
string stringToCount,
int startIndex,
int endIndex)
{
// Fix the string in memory so we can grab a pointer to its location.
fixed (char* stringStart = stringToCount)
{
// Get a pointer to the start of the substring.
char* substring = stringStart + startIndex;
return Encoding.UTF8.GetByteCount(substring, endIndex - startIndex);
}
}
Key things to note here:
The method has to be marked unsafe, since we're working with pointers and direct memory manipulation.
The string is fixed for the duration of the call in order to prevent the runtime from moving it around - it gives us a constant location to point to, but it prevents the runtime from doing memory optimization.
You should consider doing thorough performance profiling on this method to ensure it gives you a better performance profile than simply copying the string to an array.
A bit of basic profiling (a console application executing the algorithms in sequence on my desktop machine) shows that this approach executes ~35 times faster than looping over the string or converting it to a character-array.
Using pointer: ~86ms
Looping over string: ~2957ms
Converting to char array: ~3156ms
Take these figures with a pinch of salt, and also consider other factors besides just execution speed, such as long-term execution overheads (i.e. in a service process), or memory usage.

Why does every Char static "Is..." have a string overload, e.g. IsWhiteSpace(string, Int32)?

http://msdn.microsoft.com/en-us/library/1x308yk8.aspx
This allows me to do this:
var str = "string ";
Char.IsWhiteSpace(str, 6);
Rather than:
Char.IsWhiteSpace(str[6]);
Seems unusual, so I looked at the reflection:
[TargetedPatchingOptOut("Performance critical to inline across NGen image boundaries")]
public static bool IsWhiteSpace(char c)
{
if (char.IsLatin1(c))
{
return char.IsWhiteSpaceLatin1(c);
}
return CharUnicodeInfo.IsWhiteSpace(c);
}
[SecuritySafeCritical]
public static bool IsWhiteSpace(string s, int index)
{
if (s == null)
{
throw new ArgumentNullException("s");
}
if (index >= s.Length)
{
throw new ArgumentOutOfRangeException("index");
}
if (char.IsLatin1(s[index]))
{
return char.IsWhiteSpaceLatin1(s[index]);
}
return CharUnicodeInfo.IsWhiteSpace(s, index);
}
Three things struck me:
Why does it bother to do the limit check only on the upper bound? That throws an ArgumentOutOfRangeException, while an index below 0 would give string's standard IndexOutOfRangeException.
The presence of SecuritySafeCriticalAttribute, which I've read the general blurb about, but I'm still unclear what it is doing here and whether it is linked to the upper bound check.
TargetedPatchingOptOutAttribute is not present on other Is...(char) methods. Example IsLetter, IsNumber etc.
Because not every character fits in a C# char. For instance, "𠀀" takes 2 C# chars, and you couldn't get any information about that character with just a char overload. With String and an index, the methods can see if the character at index i is a High Surrogate char, and then read the Low Surrogate char at next index, add them up according to the algorithm, and retrieve info about the code point U+20000.
This is how UTF-16 can encode over a million different code points: it's a variable-width encoding. It takes 2 or 4 bytes to encode a character, i.e. 1 or 2 C# chars.
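A small sketch of the difference, using Char.IsLetter rather than IsWhiteSpace because U+20000 is a letter. SurrogateDemo and Inspect are illustrative names only:

```csharp
using System;

public static class SurrogateDemo
{
    // Hypothetical helper: compares the char overload with the
    // (string, index) overload on a character outside the BMP.
    public static (bool ViaChar, bool ViaString, int CodePoint) Inspect()
    {
        string s = "\uD840\uDC00"; // U+20000, stored as two C# chars

        bool viaChar = char.IsLetter(s[0]);       // sees only the high surrogate
        bool viaString = char.IsLetter(s, 0);     // can read both halves of the pair
        int codePoint = char.ConvertToUtf32(s, 0);

        return (viaChar, viaString, codePoint);
    }
}
```

The char overload only ever sees half of the pair, so it cannot classify the character; the string overload can combine both halves first.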
Why does it bother to do the limit check only on the upper bound?
It doesn't. It performs an unsigned comparison, so every negative number will compare larger than the length and cause the appropriate exception to be thrown. This happens to not get decompiled accurately.
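To illustrate the trick (this is a sketch of the idea, not the framework's actual source; BoundsCheck and IsOutOfRange are illustrative names): casting both values to uint turns every negative index into a very large number, so one comparison covers both bounds.

```csharp
public static class BoundsCheck
{
    // Hypothetical helper: true when index is negative or >= length.
    public static bool IsOutOfRange(int index, int length)
    {
        // unchecked makes the wrap-around explicit: -1 becomes 4294967295.
        return unchecked((uint)index >= (uint)length);
    }
}
```

A decompiler that drops the casts shows this as a plain index >= s.Length, which is why the lower bound check appears to be missing.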

.NET Regular expressions on bytes instead of chars

I'm trying to do some parsing that will be easier using regular expressions.
The input is an array (or enumeration) of bytes.
I don't want to convert the bytes to chars for the following reasons:
Computation efficiency
Memory consumption efficiency
Some non-printable bytes might be complex to convert to chars. Not all the bytes are printable.
So I can't use Regex.
The only solution I know, is using Boost.Regex (which works on bytes - C chars), but this is a C++ library that wrapping using C++/CLI will take considerable work.
How can I use regular expressions on bytes in .NET directly, without working with .NET strings and chars?
Thank you.
There is a bit of impedance mismatch going on here. You want to work with Regular expressions in .Net which use strings (multi-byte characters), but you want to work with single byte characters. You can't have both at the same time using .Net as per usual.
However, to break this mismatch down, you could deal with a string in a byte oriented fashion and mutate it. The mutated string can then act as a re-usable buffer. In this way you will not have to convert bytes to chars, or convert your input buffer to a string (as per your question).
An example:
//BLING
byte[] inputBuffer = { 66, 76, 73, 78, 71 };
string stringBuffer = new string('\0', 1000);
Regex regex = new Regex("ING", RegexOptions.Compiled);
unsafe
{
fixed (char* charArray = stringBuffer)
{
byte* buffer = (byte*)(charArray);
//Hard-coded example of string mutation, in practice you would
//loop over your input buffers and regex\match so that the string
//buffer is re-used.
buffer[0] = inputBuffer[0];
buffer[2] = inputBuffer[1];
buffer[4] = inputBuffer[2];
buffer[6] = inputBuffer[3];
buffer[8] = inputBuffer[4];
Console.WriteLine("Mutated string:'{0}'.",
stringBuffer.Substring(0, inputBuffer.Length));
Match match = regex.Match(stringBuffer, 0, inputBuffer.Length);
Console.WriteLine("Position:{0} Length:{1}.", match.Index, match.Length);
}
}
Using this technique you can allocate a string "buffer" which can be re-used as the input to Regex, but you can mutate it with your bytes each time. This avoids the overhead of converting\encoding your byte array into a new .Net string each time you want to do a match. This could prove to be very significant as I have seen many an algorithm in .Net try to go at a million miles an hour only to be brought to its knees by string generation and the subsequent heap spamming and time spent in GC.
Obviously this is unsafe code, but it is .Net.
The results of the Regex will generate strings though, so you have an issue here. I'm not sure if there is a way of using Regex that will not generate new strings. You can certainly get at the match index and length information but the string generation violates your requirements for memory efficiency.
Update
Actually after disassembling Regex\Match\Group\Capture, it looks like it only generates the captured string when you access the Value property, so you may at least not be generating strings if you only access index and length properties. However, you will be generating all the supporting Regex objects.
Well, if I faced this problem, I would DO the C++/CLI wrapper, except I'd create specialized code for what I want to achieve. Eventually you could develop the wrapper to do more general things, but this is just an option.
The first step is to wrap the Boost::Regex input and output only. Create specialized functions in C++ that do all the stuff you want and use CLI just to pass the input data to the C++ code and then fetch the result back with the CLI. This doesn't look to me like too much work to do.
Update:
Let me try to clarify my point. Even though I may be wrong, I believe you won't be able to find any .NET binary Regex implementation that you could use. That is why, whether you like it or not, you will be forced to choose between a CLI wrapper and a bytes-to-chars conversion to use .NET's Regex. In my opinion the wrapper is the better choice, because it will work faster. I did not do any benchmarking; this is just an assumption based on:
Using the wrapper you just have to cast the pointer type (bytes <-> chars).
Using .NET's Regex you have to convert each byte of the input.
As an alternative to using unsafe, just consider writing a simple, recursive comparer like:
static bool Evaluate(byte[] data, byte[] sequence, int dataIndex=0, int sequenceIndex=0)
{
if (sequence[sequenceIndex] == data[dataIndex])
{
if (sequenceIndex == sequence.Length - 1)
return true;
else if (dataIndex == data.Length - 1)
return false;
else
return Evaluate(data, sequence, dataIndex + 1, sequenceIndex + 1);
}
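else
{
// Restart one byte after the position where the current partial match began,
// otherwise overlapping candidates (e.g. "ab" in "aab") are missed.
int nextStart = dataIndex - sequenceIndex + 1;
if (nextStart < data.Length)
return Evaluate(data, sequence, nextStart, 0);
else
return false;
}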
}
You could improve efficiency in a number of ways (i.e. seeking the first byte match instead of iterating, etc.) but this could get you started... hope it helps.
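If you only need to find a fixed byte sequence (no pattern syntax), the span-based IndexOf extension in System.MemoryExtensions does this without recursion. A small sketch, assuming .NET Core 2.1 or later (or the System.Memory package); ByteSearch and Find are illustrative names:

```csharp
using System;

public static class ByteSearch
{
    // Hypothetical helper: returns the index of the first occurrence
    // of sequence within data, or -1 if it is not present.
    public static int Find(ReadOnlySpan<byte> data, ReadOnlySpan<byte> sequence)
    {
        return data.IndexOf(sequence);
    }
}
```

Byte arrays convert implicitly to ReadOnlySpan<byte>, so ByteSearch.Find(inputBuffer, pattern) works directly on existing buffers, and there is no risk of exhausting the stack on large inputs.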
I personally went a different approach and wrote a small state machine that can be extended. I believe if parsing protocol data this is much more readable than regex.
bool ParseUDSResponse(PassThruMsg rxMsg, UDScmd.Mode txMode, byte txSubFunction, out UDScmd.Response functionResponse, out byte[] payload)
{
payload = new byte[0];
functionResponse = UDScmd.Response.UNKNOWN;
bool positiveReponse = false;
var rxMsgBytes = rxMsg.GetBytes();
//Iterate the reply bytes to find the echoed ECU index, response code, function response and payload data if there is any
//If we could use some kind of HEX regex this would be a bit neater
//Iterate until we get past any and all null padding
int stateMachine = 0;
for (int i = 0; i < rxMsgBytes.Length; i++)
{
switch (stateMachine)
{
case 0:
if (rxMsgBytes[i] == 0x07) stateMachine = 1;
break;
case 1:
if (rxMsgBytes[i] == 0xE8) stateMachine = 2;
else return false;
break; // C# does not allow falling through into the next case
case 2:
if (rxMsgBytes[i] == (byte)txMode + (byte)OBDcmd.Reponse.SUCCESS)
{
//Positive response to the requested mode
positiveReponse = true;
}
else if(rxMsgBytes[i] != (byte)OBDcmd.Reponse.NEGATIVE_RESPONSE)
{
//This is an invalid response, give up now
return false;
}
stateMachine = 3;
break;
case 3:
functionResponse = (UDScmd.Response)rxMsgBytes[i];
if (positiveReponse && rxMsgBytes[i] == txSubFunction)
{
//We have a positive response and a positive subfunction code (subfunction is reflected)
int payloadLength = rxMsgBytes.Length - i;
if(payloadLength > 0)
{
payload = new byte[payloadLength];
Array.Copy(rxMsgBytes, i, payload, 0, payloadLength);
}
return true;
} else
{
//We had a positive response but a negative subfunction error
//we return the function error code so it can be relayed
return false;
}
default:
return false;
}
}
return false;
}

ReverseString, a C# interview-question

I had an interview question that asked me for my 'feedback' on a piece of code a junior programmer wrote. They hinted there may be a problem and said it will be used heavily on large strings.
public string ReverseString(string sz)
{
string result = string.Empty;
for(int i = sz.Length-1; i>=0; i--)
{
result += sz[i];
}
return result;
}
I couldn't spot it. I saw no problems whatsoever.
In hindsight I could have said the user should resize, but it looks like C# doesn't have a resize (I am a C++ guy).
I ended up writing things like: use an iterator if it's possible; [x] on containers might not be random access, so it may be slow; and other miscellaneous things. But I definitely said I had never had to optimize C# code, so my thinking may have not failed me on the interview.
I wanted to know, what is the problem with this code, do you guys see it?
-edit-
I changed this into a wiki because there can be several right answers.
Also i am so glad i explicitly said i never had to optimize a C# program and mentioned the misc other things. Oops. I always thought C# didnt have any performance problems with these type of things. oops.
Most importantly? That will suck performance wise - it has to create lots of strings (one per character). The simplest way is something like:
public static string Reverse(string sz) // ideal for an extension method
{
if (string.IsNullOrEmpty(sz) || sz.Length == 1) return sz;
char[] chars = sz.ToCharArray();
Array.Reverse(chars);
return new string(chars);
}
The problem is that string concatenations are expensive to do as strings are immutable in C#. The example given will create a new string one character longer each iteration which is very inefficient. To avoid this you should use the StringBuilder class instead like so:
public string ReverseString(string sz)
{
var builder = new StringBuilder(sz.Length);
for(int i = sz.Length-1; i>=0; i--)
{
builder.Append(sz[i]);
}
return builder.ToString();
}
The StringBuilder is written specifically for scenarios like this as it gives you the ability to concatenate strings without the drawback of excessive memory allocation.
You will notice I have provided the StringBuilder with an initial capacity which you don't often see. As you know the length of the result to begin with, this removes needless memory allocations.
What normally happens is that the StringBuilder allocates an initial amount of memory (default 16 characters). Once the contents attempt to exceed that capacity, it doubles (I think) its own capacity and carries on. This is much better than allocating memory each time as would happen with normal strings, but if you can avoid this as well it's even better.
A few comments on the answers given so far:
Every single one of them (so far!) will fail on surrogate pairs and combining characters. Oh the joys of Unicode. Reversing a string isn't the same as reversing a sequence of chars.
I like Marc's optimisation for null, empty, and single character inputs. In particular, not only does this get the right answer quickly, but it also handles null (which none of the other answers do)
I originally thought that ToCharArray followed by Array.Reverse would be the fastest, but it does create one "garbage" copy.
The StringBuilder solution creates a single string (not char array) and manipulates that until you call ToString. There's no extra copying involved... but there's a lot more work maintaining lengths etc.
Which is the more efficient solution? Well, I'd have to benchmark it to have any idea at all - but even so that's not going to tell the whole story. Are you using this in a situation with high memory pressure, where extra garbage is a real pain? How fast is your memory vs your CPU, etc?
As ever, readability is usually king - and it doesn't get much better than Marc's answer on that front. In particular, there's no room for an off-by-one error, whereas I'd have to actually put some thought into validating the other answers. I don't like thinking. It hurts my brain, so I try not to do it very often. Using the built-in Array.Reverse sounds much better to me. (Okay, so it still fails on surrogates etc, but hey...)
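If the extra "garbage" char array is a concern, newer runtimes offer string.Create, which lets you fill the new string's buffer in place. A sketch, assuming .NET Core 2.1 or later; StringReverser is an illustrative name. Like the other answers it reverses UTF-16 code units, so it still breaks on surrogate pairs and combining characters.

```csharp
using System;

public static class StringReverser
{
    // Hypothetical helper: reverses the chars of s with a single allocation
    // (the result string itself).
    public static string Reverse(string s)
    {
        if (string.IsNullOrEmpty(s) || s.Length == 1) return s;

        return string.Create(s.Length, s, (span, source) =>
        {
            // Copy the source chars into the new string's buffer,
            // then reverse that buffer in place.
            source.AsSpan().CopyTo(span);
            span.Reverse();
        });
    }
}
```

Whether this is measurably faster than ToCharArray plus Array.Reverse would have to be benchmarked; the only thing it clearly saves is the intermediate array.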
Since strings are immutable, each += statement will create a new string by copying the string in the last step, along with the single character to form a new string. Effectively, this will be an O(n²) algorithm instead of O(n).
A faster way would be (O(n)):
// pseudocode:
static string ReverseString(string input) {
char[] buf = new char[input.Length];
for(int i = 0; i < buf.Length; ++i)
buf[i] = input[input.Length - i - 1];
return new string(buf);
}
You can do this in .NET 3.5 instead:
public static string Reverse(this string s)
{
return new String((s.ToCharArray().Reverse()).ToArray());
}
Better way to tackle it would be to use a StringBuilder, since it is not immutable you won't get the terrible object generation behavior that you would get above. In .net all strings are immutable, which means that the += operator there will create a new object each time it is hit. StringBuilder uses an internal buffer, so the reversal could be done in the buffer w/ no extra object allocations.
You should use the StringBuilder class to create your resulting string. A string is immutable, so when you append a string in each iteration of the loop, a new string has to be created, which isn't very efficient.
I prefer something like this:
using System;
using System.Text;
namespace SpringTest3
{
static class Extentions
{
static private StringBuilder ReverseStringImpl(string s, int pos, StringBuilder sb)
{
return (s.Length <= --pos || pos < 0) ? sb : ReverseStringImpl(s, pos, sb.Append(s[pos]));
}
static public string Reverse(this string s)
{
return ReverseStringImpl(s, s.Length, new StringBuilder()).ToString();
}
}
class Program
{
static void Main(string[] args)
{
Console.WriteLine("abc".Reverse());
}
}
}
x is the string to reverse.
Stack<char> stack = new Stack<char>(x);
string s = new string(stack.ToArray());
This method cuts the number of iterations in half. Rather than starting from the end, it starts from the beginning and swaps characters until it hits center. Had to convert the string to a char array because the indexer on a string has no setter.
public string Reverse(String value)
{
if (String.IsNullOrEmpty(value)) throw new ArgumentNullException("value");
char[] array = value.ToCharArray();
for (int i = 0; i < value.Length / 2; i++)
{
char temp = array[i];
array[i] = array[(array.Length - 1) - i];
array[(array.Length - 1) - i] = temp;
}
return new string(array);
}
Necromancing.
As a public service, this is how you actually CORRECTLY reverse a string (reversing a string is NOT equal to reversing a sequence of chars)
public static class Test
{
private static System.Collections.Generic.List<string> GraphemeClusters(string s)
{
System.Collections.Generic.List<string> ls = new System.Collections.Generic.List<string>();
System.Globalization.TextElementEnumerator enumerator = System.Globalization.StringInfo.GetTextElementEnumerator(s);
while (enumerator.MoveNext())
{
ls.Add((string)enumerator.Current);
}
return ls;
}
// this
private static string ReverseGraphemeClusters(string s)
{
if(string.IsNullOrEmpty(s) || s.Length == 1)
return s;
System.Collections.Generic.List<string> ls = GraphemeClusters(s);
ls.Reverse();
return string.Join("", ls.ToArray());
}
public static void TestMe()
{
string s = "Les Mise\u0301rables";
// s = "noël";
string r = ReverseGraphemeClusters(s);
// This would be wrong:
// char[] a = s.ToCharArray();
// System.Array.Reverse(a);
// string r = new string(a);
System.Console.WriteLine(r);
}
}
See:
https://vimeo.com/7403673
By the way, in Golang, the correct way is this:
package main
import (
"unicode"
"regexp"
)
func main() {
str := "\u0308" + "a\u0308" + "o\u0308" + "u\u0308"
println("u\u0308" + "o\u0308" + "a\u0308" + "\u0308" == ReverseGrapheme(str))
println("u\u0308" + "o\u0308" + "a\u0308" + "\u0308" == ReverseGrapheme2(str))
}
func ReverseGrapheme(str string) string {
buf := []rune("")
checked := false
index := 0
ret := ""
for _, c := range str {
if !unicode.Is(unicode.M, c) {
if len(buf) > 0 {
ret = string(buf) + ret
}
buf = buf[:0]
buf = append(buf, c)
if checked == false {
checked = true
}
} else if checked == false {
ret = string(append([]rune(""), c)) + ret
} else {
buf = append(buf, c)
}
index += 1
}
return string(buf) + ret
}
func ReverseGrapheme2(str string) string {
re := regexp.MustCompile("\\PM\\pM*|.")
slice := re.FindAllString(str, -1)
length := len(slice)
ret := ""
for i := 0; i < length; i += 1 {
ret += slice[length-1-i]
}
return ret
}
And the incorrect way is this (ToCharArray.Reverse):
func Reverse(s string) string {
runes := []rune(s)
for i, j := 0, len(runes)-1; i < j; i, j = i+1, j-1 {
runes[i], runes[j] = runes[j], runes[i]
}
return string(runes)
}
Note that you need to know the difference between
- a character and a glyph
- a byte (8 bit) and a codepoint/rune (32 bit)
- a codepoint and a GraphemeCluster [32+ bit] (aka Grapheme/Glyph)
Reference:
Character is an overloaded term that can mean many things.
A code point is the atomic unit of information. Text is a sequence of
code points. Each code point is a number which is given meaning by the
Unicode standard.
A grapheme is a sequence of one or more code points that are displayed
as a single, graphical unit that a reader recognizes as a single
element of the writing system. For example, both a and ä are
graphemes, but they may consist of multiple code points (e.g. ä may be
two code points, one for the base character a followed by one for the
diaeresis; but there's also an alternative, legacy, single code point
representing this grapheme). Some code points are never part of any
grapheme (e.g. the zero-width non-joiner, or directional overrides).
A glyph is an image, usually stored in a font (which is a collection
of glyphs), used to represent graphemes or parts thereof. Fonts may
compose multiple glyphs into a single representation, for example, if
the above ä is a single code point, a font may choose to render that as
two separate, spatially overlaid glyphs. For OTF, the font's GSUB and
GPOS tables contain substitution and positioning information to make
this work. A font may contain multiple alternative glyphs for the same
grapheme, too.
static string reverseString(string text)
{
Char[] a = text.ToCharArray();
string b = "";
for (int q = a.Count() - 1; q >= 0; q--)
{
b = b + a[q].ToString();
}
return b;
}
