I'm trying to do some parsing that will be easier using regular expressions.
The input is an array (or enumeration) of bytes.
I don't want to convert the bytes to chars for the following reasons:
Computation efficiency
Memory consumption efficiency
Some non-printable bytes might be complex to convert to chars. Not all the bytes are printable.
So I can't use Regex.
The only solution I know of is Boost.Regex (which works on bytes, i.e. C chars), but that is a C++ library, and wrapping it with C++/CLI would take considerable work.
How can I use regular expressions on bytes in .NET directly, without working with .NET strings and chars?
Thank you.
There is a bit of an impedance mismatch going on here. You want to use regular expressions in .NET, which operate on strings (sequences of 2-byte UTF-16 chars), but you want to feed them single-byte characters. You can't have both at the same time with the standard .NET APIs.
However, to break this mismatch down, you could deal with a string in a byte oriented fashion and mutate it. The mutated string can then act as a re-usable buffer. In this way you will not have to convert bytes to chars, or convert your input buffer to a string (as per your question).
An example:
//BLING
byte[] inputBuffer = { 66, 76, 73, 78, 71 };
string stringBuffer = new string('\0', 1000);
Regex regex = new Regex("ING", RegexOptions.Compiled);
unsafe
{
fixed (char* charArray = stringBuffer)
{
byte* buffer = (byte*)(charArray);
//Hard-coded example of string mutation, in practice you would
//loop over your input buffers and regex\match so that the string
//buffer is re-used.
buffer[0] = inputBuffer[0];
buffer[2] = inputBuffer[1];
buffer[4] = inputBuffer[2];
buffer[6] = inputBuffer[3];
buffer[8] = inputBuffer[4];
Console.WriteLine("Mutated string:'{0}'.",
stringBuffer.Substring(0, inputBuffer.Length));
Match match = regex.Match(stringBuffer, 0, inputBuffer.Length);
Console.WriteLine("Position:{0} Length:{1}.", match.Index, match.Length);
}
}
Using this technique you can allocate a string "buffer" which can be re-used as the input to Regex, but you can mutate it with your bytes each time. This avoids the overhead of converting\encoding your byte array into a new .Net string each time you want to do a match. This could prove to be very significant as I have seen many an algorithm in .Net try to go at a million miles an hour only to be brought to its knees by string generation and the subsequent heap spamming and time spent in GC.
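Obviously this is unsafe code, and it mutates a string that .NET assumes is immutable (so never do this to a literal or interned string), but it is .NET.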
The results of the Regex will generate strings though, so you have an issue here. I'm not sure if there is a way of using Regex that will not generate new strings. You can certainly get at the match index and length information but the string generation violates your requirements for memory efficiency.
Update
Actually after disassembling Regex\Match\Group\Capture, it looks like it only generates the captured string when you access the Value property, so you may at least not be generating strings if you only access index and length properties. However, you will be generating all the supporting Regex objects.
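To make the Index/Length point concrete, here is a minimal sketch. It is not the unsafe buffer trick above: for simplicity it decodes the bytes as Latin-1 (ISO-8859-1), which maps every byte value to the char with the same code point, so match offsets equal byte offsets. That does allocate one string per call, which the question wanted to avoid. ByteRegexSketch and FindFirst are hypothetical names, and Encoding.Latin1 assumes .NET 5 or later.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

static class ByteRegexSketch
{
    // Hypothetical helper: returns the byte offset and length of the first match,
    // or (-1, 0) when nothing matches. Only Index and Length are read, never Value,
    // so no captured string is materialized.
    public static (int Index, int Length) FindFirst(byte[] input, Regex regex)
    {
        // Assumption: Latin-1 decoding is acceptable here (one char per byte).
        string text = Encoding.Latin1.GetString(input);
        Match match = regex.Match(text);
        return match.Success ? (match.Index, match.Length) : (-1, 0);
    }
}
```

Non-printable bytes can be written in the pattern with \xNN escapes, e.g. new Regex(@"\xFF\x00") to find the byte 0xFF followed by 0x00.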
Well, if I faced this problem, I would write the C++/CLI wrapper, except I'd create specialized code for what I want to achieve. Over time you could develop the wrapper to do more general things, but this is just an option.
The first step is to wrap the Boost::Regex input and output only. Create specialized functions in C++ that do all the stuff you want and use CLI just to pass the input data to the C++ code and then fetch the result back with the CLI. This doesn't look to me like too much work to do.
Update:
Let me try to clarify my point. Even though I may be wrong, I believe you won't be able to find any .NET binary Regex implementation that you could use. That is why, whether you like it or not, you will be forced to choose between a CLI wrapper and a bytes-to-chars conversion to use .NET's Regex. In my opinion the wrapper is the better choice, because it will be faster. I did not do any benchmarking; this is just an assumption based on:
Using the wrapper, you just have to cast the pointer type (bytes <-> chars).
Using .NET's Regex, you have to convert each byte of the input.
As an alternative to using unsafe, just consider writing a simple, recursive comparer like:
static bool Evaluate(byte[] data, byte[] sequence, int dataIndex=0, int sequenceIndex=0)
{
if (sequence[sequenceIndex] == data[dataIndex])
{
if (sequenceIndex == sequence.Length - 1)
return true;
else if (dataIndex == data.Length - 1)
return false;
else
return Evaluate(data, sequence, dataIndex + 1, sequenceIndex + 1);
}
else
{
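// Restart one byte after the position where this attempt began,
// so candidates that overlap the failed partial match are not skipped.
int restart = dataIndex - sequenceIndex + 1;
if (restart < data.Length)
return Evaluate(data, sequence, restart, 0);
else
return false;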
}
}
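You could improve efficiency in a number of ways (e.g. seeking the first byte match instead of iterating), and note that the recursion depth grows with the length of data, so very large inputs can overflow the stack, but this could get you started... hope it helps.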
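For comparison, here is a non-recursive sketch of the same search. It leans on the built-in span IndexOf overload that takes a sequence, so it assumes a runtime where System.Memory spans are available (.NET Core 2.1 or later); ByteSearch and IndexOfSequence are hypothetical names.

```csharp
using System;

static class ByteSearch
{
    // Returns the index of the first occurrence of sequence in data, or -1.
    public static int IndexOfSequence(byte[] data, byte[] sequence)
    {
        ReadOnlySpan<byte> haystack = data;
        ReadOnlySpan<byte> needle = sequence;
        // MemoryExtensions.IndexOf searches for the whole needle, not a single byte.
        return haystack.IndexOf(needle);
    }
}
```

Unlike the recursive version, this returns the position of the match rather than a bool, and it does not use stack space proportional to the input.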
I personally took a different approach and wrote a small state machine that can be extended. I believe that when parsing protocol data this is much more readable than a regex.
bool ParseUDSResponse(PassThruMsg rxMsg, UDScmd.Mode txMode, byte txSubFunction, out UDScmd.Response functionResponse, out byte[] payload)
{
payload = new byte[0];
functionResponse = UDScmd.Response.UNKNOWN;
bool positiveReponse = false;
var rxMsgBytes = rxMsg.GetBytes();
//Iterate the reply bytes to find the echoed ECU index, response code, function response and payload data if there is any
//If we could use some kind of HEX regex this would be a bit neater
//Iterate until we get past any and all null padding
int stateMachine = 0;
for (int i = 0; i < rxMsgBytes.Length; i++)
{
switch (stateMachine)
{
case 0:
if (rxMsgBytes[i] == 0x07) stateMachine = 1;
break;
case 1:
if (rxMsgBytes[i] == 0xE8) stateMachine = 2;
else return false;
break; //C# does not allow falling through into the next case
case 2:
if (rxMsgBytes[i] == (byte)txMode + (byte)OBDcmd.Reponse.SUCCESS)
{
//Positive response to the requested mode
positiveReponse = true;
}
else if(rxMsgBytes[i] != (byte)OBDcmd.Reponse.NEGATIVE_RESPONSE)
{
//This is an invalid response, give up now
return false;
}
stateMachine = 3;
break;
case 3:
functionResponse = (UDScmd.Response)rxMsgBytes[i];
if (positiveReponse && rxMsgBytes[i] == txSubFunction)
{
//We have a positive response and a positive subfunction code (subfunction is reflected)
int payloadLength = rxMsgBytes.Length - i;
if(payloadLength > 0)
{
payload = new byte[payloadLength];
Array.Copy(rxMsgBytes, i, payload, 0, payloadLength);
}
return true;
} else
{
//We had a positive response but a negative subfunction error
//we return the function error code so it can be relayed
return false;
}
default:
return false;
}
}
return false;
}
I'm looking for an efficient, allocation-free (!) implementation of
public static int IndexOf(this ReadOnlySpan<byte> utf8Bytes, char @char)
{
// Should return the index of the first byte of @char within utf8Bytes
// (not the character index of @char within the string)
}
I've not found a way to iterate through the span char by char yet. Utf8Parser does not have an overload supporting single characters.
And System.Text.Encoding seems to work mostly on the entire span, and does allocate internally while doing so.
Is there any builtin functionality I haven't spotted yet? If not, can anyone think of a reasonable custom implementation?
Rather than trying to iterate through the utf8Bytes character by character, it may be easier to convert the character to a short stackalloc'ed utf8 byte sequence, and search for that:
public static class StringExtensions
{
const int MaxBytes = 4;
public static int IndexOf(this ReadOnlySpan<byte> utf8Bytes, char @char)
{
Rune rune;
try
{
rune = new Rune(@char);
}
catch (ArgumentOutOfRangeException)
{
// Malformed unicode character, return -1 or throw?
return -1;
}
return utf8Bytes.IndexOf(rune);
}
public static int IndexOf(this ReadOnlySpan<byte> utf8Bytes, Rune @char)
{
Span<byte> charBytes = stackalloc byte[MaxBytes];
var n = @char.EncodeToUtf8(charBytes);
charBytes = charBytes.Slice(0, n);
for (int i = 0, thisLength = 1; i <= utf8Bytes.Length - charBytes.Length; i += thisLength)
{
thisLength = Utf8ByteSequenceLength(utf8Bytes[i]);
if (thisLength == charBytes.Length && charBytes.CommonPrefixLength(utf8Bytes.Slice(i)) == charBytes.Length)
return i;
}
return -1;
}
static int Utf8ByteSequenceLength(byte firstByte)
{
//https://en.wikipedia.org/wiki/UTF-8#Encoding
if ( (firstByte & 0b11111000) == 0b11110000) // 11110xxx
return 4;
else if ((firstByte & 0b11110000) == 0b11100000) // 1110xxxx
return 3;
else if ((firstByte & 0b11100000) == 0b11000000) // 110xxxxx
return 2;
return 1; // Either a 1-byte sequence (matching 0xxxxxxx) or an invalid start byte.
}
}
Notes:
Rune is a struct introduced in .NET Core 3.x that represents a Unicode scalar value. If you need to search your utf8Bytes for a Unicode codepoint that is not in the basic multilingual plane, you will need to use Rune.
Rune has the added advantage that its method Rune.TryEncodeToUtf8() is lightweight and allocation-free.
If char @char is not a valid Unicode scalar value (for example, a lone surrogate), the Rune constructor will throw an ArgumentOutOfRangeException. The above code catches the exception and returns -1. You may wish to rethrow the exception.
As an alternative, Rune.DecodeFromUtf8(ReadOnlySpan<Byte>, Rune, Int32) can be used to iterate through a utf8 byte span Rune by Rune. You could use that to locate an incoming Rune by index. However, I suspect doing so would be less efficient than the method above.
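For reference, the Rune-by-Rune alternative from the last note could look roughly like the sketch below. Utf8Scan and IndexOfRune are hypothetical names, and I have not measured it against the method above.

```csharp
using System;
using System.Buffers;
using System.Text;

static class Utf8Scan
{
    // Returns the byte offset of the first occurrence of target in utf8Bytes, or -1.
    public static int IndexOfRune(ReadOnlySpan<byte> utf8Bytes, Rune target)
    {
        int offset = 0;
        while (offset < utf8Bytes.Length)
        {
            OperationStatus status = Rune.DecodeFromUtf8(
                utf8Bytes.Slice(offset), out Rune current, out int consumed);
            if (status == OperationStatus.Done && current == target)
                return offset;
            // For invalid or truncated input, consumed says how many bytes to skip;
            // Math.Max guards against ever standing still.
            offset += Math.Max(consumed, 1);
        }
        return -1;
    }
}
```

Invalid sequences are simply skipped here; whether that is the right policy depends on how much you trust the input.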
You can avoid heap allocations with stackalloc. A first approximation can look like:
static (int Found, int Processed) IndexOf(ReadOnlySpan<byte> utf8Bytes, char @char)
{
Span<char> chars = stackalloc char[utf8Bytes.Length]; // "worst" case every byte is a separate char
var proc = Encoding.UTF8.GetChars(utf8Bytes, chars);
var indexOf = chars.IndexOf(@char);
if (indexOf > 0)
{
Span<byte> bytes = stackalloc byte[indexOf * 4];
var result = Encoding.UTF8.GetBytes(chars.Slice(0, indexOf), bytes);
return (result, proc);
}
return (indexOf, proc);
}
There are a few notes here:
Large incoming spans can result in a stack overflow, since the whole buffer is stackalloc'ed
Decoding the whole array is not optimal
The span can contain "partial" code points at its start and end, so Processed should be handled accordingly
The first two points can be mitigated by processing the incoming span in smaller slices (for example, reading 4 bytes into a 4-char span).
I believe System.IO.Pipelines handles the same issues (via System.Buffers), though 1) it may not be completely allocation-free, and 2) I have not investigated it enough to provide a complete working example.
From .NET 5 onwards, there's a library method EncodingExtensions.GetChars to help you.
Specifically, you want the overload that gets the byte data from a ReadOnlySpan and writes to an IBufferWriter<char>, which you can then implement to receive your characters one by one and run whatever on them (your matching algorithm, for example). This solution is allocation-free of course, as long as you put your custom buffer writer in a static field and allocate it only once.
This is a common question but I hope this does not get tagged as a duplicate since the nature of the question is different (please read the whole not only the title)
Unaware of the existence of String.Replace I wrote the following:
int theIndex = 0;
while ((theIndex = message.IndexOf(separationChar, theIndex)) != -1) //we found the character
{
theIndex++;
if (theIndex < message.Length)//not in the last position
{
message = message.Insert(theIndex, theTime);
}
else
{
// I don't think this is really necessary
break;
}
} //while finding characters
As you can see, I am inserting a String called "theTime" after each occurrence of separationChar in the message String.
Now, this works OK for small strings, but I have been given a really huge String (on the order of several hundred kilobytes; by the way, is there a limit for String or StringBuilder?) and it takes a lot of time...
So my questions are:
1) Is it more efficient if I just do
oldString = separationChar.ToString();
newString = oldString + theTime; // string.Insert needs an index argument, so plain concatenation is used
message = message.Replace(oldString, newString);
2) Is there any other way I can process very long Strings to insert a String (theTime) when finding some char in a very fast and efficient way??
Thanks a lot
As Danny already mentioned, string.Insert() actually creates a new instance each time you use it, and these also have to be garbage collected at some point.
You could instead start with an empty StringBuilder to construct the result string:
public static string Replace(this string str, char find, string replacement)
{
StringBuilder result = new StringBuilder(str.Length); // initial capacity
int pointer = 0;
int index;
while ((index = str.IndexOf(find, pointer)) >= 0)
{
// Append the unprocessed data up to the character
result.Append(str, pointer, index - pointer);
// Append the replacement string
result.Append(replacement);
// Next unprocessed data starts after the character
pointer = index + 1;
}
// Append the remainder of the unprocessed data
result.Append(str, pointer, str.Length - pointer);
return result.ToString();
}
This will not cause a new string to be created (and garbage collected) for each occurrence of the character. Instead, when the internal buffer of the StringBuilder is full, it will create a new buffer chunk "of sufficient capacity". Quote from reference source, when its buffer is full:
Compute the length of the new block we need
We make the new chunk at least big enough for the current need (minBlockCharCount), but also as big as the current length (thus doubling capacity), up to a maximum (so we stay in the small object heap, and never allocate really big chunks even if the string gets really big).
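Applied to the original question (insert theTime after every separationChar rather than replace it), the same StringBuilder idea can be sketched like this. InsertAfterSketch and InsertAfterEach are hypothetical names; unlike the loop in the question, this version also inserts when the separator is the very last character.

```csharp
using System;
using System.Text;

static class InsertAfterSketch
{
    // Copies message once, appending insertion after every occurrence of separator.
    public static string InsertAfterEach(string message, char separator, string insertion)
    {
        var result = new StringBuilder(message.Length);
        foreach (char c in message)
        {
            result.Append(c);
            if (c == separator)
                result.Append(insertion);
        }
        return result.ToString();
    }
}
```

For example, InsertAfterSketch.InsertAfterEach("a;b;c", ';', "T") returns "a;Tb;Tc".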
Thank you for answering my question.
I am writing an answer because I have to report that I tried the solution in my question 1) and it is indeed more efficient according to the results of my program. String.Replace can replace a string(from a char) with another string very fast.
oldString = separationChar.ToString();
newString = oldString + theTime; // string.Insert needs an index argument, so plain concatenation is used
message = message.Replace(oldString, newString);
I'm writing a library to simplify my network programming in future projects. I'm wanting it to be robust and efficient because this will be in nearly all of my projects in the future. (BTW both the server and the client will be using my library so I'm not assuming a protocol in my question) I'm writing a function for receiving strings from a network stream where I use 31 bytes of buffer and one for sentinel. The sentinel value will indicate which byte if any is the EOF. Here's my code for your use or scrutiny...
public string getString()
{
string returnme = "";
while (true)
{
int[] buff = new int[32];
for (int i = 0; i < 32; i++)
{
buff[i] = ns.ReadByte();
}
if (buff[31] > 31) { /*throw some error*/}
for (int i = 0; i < buff[31]; i++)
{
returnme += (char)buff[i];
}
if (buff[31] != 31)
{
break;
}
}
return returnme;
}
Edit: Is this the best (efficient, practical, etc.) way to accomplish what I'm doing?
Is this the best (efficient, practical, etc.) way to accomplish what I'm doing?
No. Firstly, you are limiting yourself to characters in the 0-255 code-point range, and that isn't enough, and secondly: serializing strings is a solved problem. Just use an Encoding, typically UTF-8. As part of a network stream, this probably means "encode the length, encode the data" and "read the length, buffer that much data, decode the data". As another note: you aren't correctly handling the EOF scenario, where ReadByte() returns a negative value (-1).
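To make "encode the length, encode the data" concrete, here is a minimal sketch over a plain Stream. The 4-byte big-endian length prefix is my own assumption for illustration (any fixed or variable-length prefix works as long as both ends agree), and LengthPrefixedStrings, WriteString and ReadString are hypothetical names.

```csharp
using System;
using System.Buffers.Binary;
using System.IO;
using System.Text;

static class LengthPrefixedStrings
{
    public static void WriteString(Stream stream, string value)
    {
        byte[] payload = Encoding.UTF8.GetBytes(value);
        byte[] header = new byte[4];
        // Assumed wire format: byte count as a 32-bit big-endian integer.
        BinaryPrimitives.WriteInt32BigEndian(header, payload.Length);
        stream.Write(header, 0, header.Length);
        stream.Write(payload, 0, payload.Length);
    }

    public static string ReadString(Stream stream)
    {
        byte[] header = new byte[4];
        ReadExactly(stream, header);
        int length = BinaryPrimitives.ReadInt32BigEndian(header);
        if (length < 0) throw new InvalidDataException("Negative length prefix.");
        byte[] payload = new byte[length];
        ReadExactly(stream, payload);
        return Encoding.UTF8.GetString(payload);
    }

    // Stream.Read may return fewer bytes than requested, so loop until the
    // buffer is full and treat a zero-byte read as end of stream.
    private static void ReadExactly(Stream stream, byte[] buffer)
    {
        int offset = 0;
        while (offset < buffer.Length)
        {
            int read = stream.Read(buffer, offset, buffer.Length - offset);
            if (read == 0) throw new EndOfStreamException();
            offset += read;
        }
    }
}
```

A real protocol would also cap the accepted length so a bad peer cannot make you allocate gigabytes.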
As a small corollary, note that appending to a string in a loop is never a good idea; if you did do it that way, use a StringBuilder. But don't do it that way. My code would be something more like (hey, whadya know, here's my actual string-reading code from protobuf-net, simplified a bit):
// read the length
int bytes = (int)ReadUInt32Variant(false);
if (bytes == 0) return "";
// buffer that much data
if (available < bytes) Ensure(bytes, true);
// read the string
string s = encoding.GetString(ioBuffer, ioIndex, bytes);
// update the internal buffer data
available -= bytes;
position += bytes;
ioIndex += bytes;
return s;
As a final note, I would say: if you are sending structured messages, give some serious consideration to using a pre-rolled serialization API that specialises in this stuff. For example, you could then just do something like:
var msg = new MyMessage { Name = "abc", Value = 123, IsMagic = true };
Serializer.SerializeWithLengthPrefix(networkStream, msg);
and at the other end:
var msg = Serializer.DeserializeWithLengthPrefix<MyMessage>(networkStream);
Console.WriteLine(msg.Name); // etc
Job done.
I think you should use a StringBuilder object with a fixed size for better performance.
I'm writing an application that needs to verify HMAC-SHA256 checksums. The code I currently have looks something like this:
static bool VerifyIntegrity(string secret, string checksum, string data)
{
// Verify HMAC-SHA256 Checksum
byte[] key = System.Text.Encoding.UTF8.GetBytes(secret);
byte[] value = System.Text.Encoding.UTF8.GetBytes(data);
byte[] checksum_bytes = System.Text.Encoding.UTF8.GetBytes(checksum);
using (var hmac = new HMACSHA256(key))
{
byte[] expected_bytes = hmac.ComputeHash(value);
return checksum_bytes.SequenceEqual(expected_bytes);
}
}
I know that this is susceptible to timing attacks.
Is there a message digest comparison function in the standard library? I realize I could write my own time hardened comparison method, but I have to believe that this is already implemented elsewhere.
EDIT: Original answer is below - still worth reading IMO, but regarding the timing attack...
The page you referenced gives some interesting points about compiler optimizations. Given that you know the two byte arrays will be the same length (assuming the size of the checksum isn't particularly secret, you can immediately return if the lengths are different) you might try something like this:
public static bool CompareArraysExhaustively(byte[] first, byte[] second)
{
if (first.Length != second.Length)
{
return false;
}
bool ret = true;
for (int i = 0; i < first.Length; i++)
{
ret = ret & (first[i] == second[i]);
}
return ret;
}
Now that still won't take the same amount of time for all inputs - if the two arrays are both in L1 cache for example, it's likely to be faster than if it has to be fetched from main memory. However, I suspect that is unlikely to cause a significant issue from a security standpoint.
Is this okay? Who knows. Different processors and different versions of the CLR may take different amounts of time for an & operation depending on the two operands. Basically this is the same as the conclusion of the page you referenced - that it's probably as good as we'll get in a portable way, but that it would require validation on every platform you try to run on.
At least the above code only uses relatively simple operations. I would personally avoid using LINQ operations here as there could be sneaky optimizations going on in some cases. I don't think there would be in this case - or they'd be easy to defeat - but you'd at least have to think about them. With the above code, there's at least a reasonably close relationship between the source code and IL - leaving "only" the JIT compiler and processor optimizations to worry about :)
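On newer runtimes (.NET Core 2.1 and later) the framework does ship such a comparison: CryptographicOperations.FixedTimeEquals. A minimal sketch of the verification using it follows; it assumes the checksum arrives Base64-encoded, and HmacCheck is a hypothetical name.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

static class HmacCheck
{
    public static bool VerifyIntegrity(string secret, string base64Checksum, string data)
    {
        byte[] key = Encoding.UTF8.GetBytes(secret);
        byte[] value = Encoding.UTF8.GetBytes(data);

        byte[] provided;
        try
        {
            // Assumption: the caller sends the checksum as Base64 text.
            provided = Convert.FromBase64String(base64Checksum);
        }
        catch (FormatException)
        {
            return false; // not valid Base64, so it cannot match
        }

        using (var hmac = new HMACSHA256(key))
        {
            byte[] expected = hmac.ComputeHash(value);
            // Returns false for different lengths; for equal lengths the running
            // time does not depend on where the first differing byte is.
            return CryptographicOperations.FixedTimeEquals(provided, expected);
        }
    }
}
```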
Original answer
There's one significant problem with this: in order to provide the checksum, you have to have a string whose UTF-8 encoded form is the same as the checksum. There are plenty of byte sequences which simply don't represent UTF-8-encoded text. Basically, trying to encode arbitrary binary data as text using UTF-8 is a bad idea.
Base64, on the other hand, is basically designed for this:
static bool VerifyIntegrity(string secret, string checksum, string data)
{
// Verify HMAC-SHA256 Checksum
byte[] key = Encoding.UTF8.GetBytes(secret);
byte[] value = Encoding.UTF8.GetBytes(data);
byte[] checksumBytes = Convert.FromBase64String(checksum);
using (var hmac = new HMACSHA256(key))
{
byte[] expectedBytes = hmac.ComputeHash(value);
return checksumBytes.SequenceEqual(expectedBytes);
}
}
On the other hand, instead of using SequenceEqual on the byte array, you could Base64 encode the actual hash, and see whether that matches:
static bool VerifyIntegrity(string secret, string checksum, string data)
{
// Verify HMAC-SHA256 Checksum
byte[] key = Encoding.UTF8.GetBytes(secret);
byte[] value = Encoding.UTF8.GetBytes(data);
using (var hmac = new HMACSHA256(key))
{
return checksum == Convert.ToBase64String(hmac.ComputeHash(value));
}
}
I don't know of anything better within the framework. It wouldn't be too hard to write a specialized SequenceEqual operator for arrays (or general ICollection<T> implementations) which checked for equal lengths first... but given that the hashes are short, I wouldn't worry about that.
If you're worried about the timing of the SequenceEqual, you could always replace it with something like this:
checksum_bytes.Zip( expected_bytes, (a,b) => a == b ).Aggregate( true, (a,r) => a && r );
This returns the same result as SequenceEqual but always checks every element before giving an answer, so there is less chance of revealing anything through a timing attack.
How is it susceptible to timing attacks? Your code takes the same amount of time whether the digest is valid or invalid. And computing the digest and then comparing it looks like the easiest way to check this.
I had an interview question that asked me for my 'feedback' on a piece of code a junior programmer wrote. They hinted there may be a problem and said it will be used heavily on large strings.
public string ReverseString(string sz)
{
string result = string.Empty;
for(int i = sz.Length-1; i>=0; i--)
{
result += sz[i];
}
return result;
}
I couldn't spot it. I saw no problems whatsoever.
In hindsight I could have said the user should resize, but it looks like C# doesn't have a resize (I am a C++ guy).
I ended up writing things like: use an iterator if possible, since [x] on containers might not be random access and so may be slow, plus miscellaneous other things. But I definitely said I had never had to optimize C# code, so my thinking may not have failed me in the interview.
I wanted to know, what is the problem with this code, do you guys see it?
-edit-
I changed this into a wiki because there can be several right answers.
Also i am so glad i explicitly said i never had to optimize a C# program and mentioned the misc other things. Oops. I always thought C# didnt have any performance problems with these type of things. oops.
Most importantly? That will suck performance wise - it has to create lots of strings (one per character). The simplest way is something like:
public static string Reverse(string sz) // ideal for an extension method
{
if (string.IsNullOrEmpty(sz) || sz.Length == 1) return sz;
char[] chars = sz.ToCharArray();
Array.Reverse(chars);
return new string(chars);
}
The problem is that string concatenations are expensive to do as strings are immutable in C#. The example given will create a new string one character longer each iteration which is very inefficient. To avoid this you should use the StringBuilder class instead like so:
public string ReverseString(string sz)
{
var builder = new StringBuilder(sz.Length);
for(int i = sz.Length-1; i>=0; i--)
{
builder.Append(sz[i]);
}
return builder.ToString();
}
The StringBuilder is written specifically for scenarios like this as it gives you the ability to concatenate strings without the drawback of excessive memory allocation.
You will notice I have provided the StringBuilder with an initial capacity which you don't often see. As you know the length of the result to begin with, this removes needless memory allocations.
What normally happens is that it allocates an amount of memory to the StringBuilder (default 16 characters). Once the contents attempt to exceed that capacity it doubles (I think) its own capacity and carries on. This is much better than allocating memory each time, as would happen with normal strings, but if you can avoid this as well it's even better.
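If you want to check the growth behaviour on your own runtime rather than take my "I think" on trust, a small probe like this records every capacity the builder passes through. CapacityProbe and ObserveCapacities are hypothetical names, and the exact values depend on the StringBuilder implementation you are running on.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class CapacityProbe
{
    // Appends count characters one at a time and returns each distinct
    // Capacity value observed, starting with the initial one.
    public static List<int> ObserveCapacities(int count)
    {
        var builder = new StringBuilder();
        var seen = new List<int> { builder.Capacity };
        for (int i = 0; i < count; i++)
        {
            builder.Append('x');
            if (builder.Capacity != seen[seen.Count - 1])
                seen.Add(builder.Capacity);
        }
        return seen;
    }
}
```

Printing string.Join(", ", CapacityProbe.ObserveCapacities(1000)) shows how few growth steps are needed compared with one new string per character.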
A few comments on the answers given so far:
Every single one of them (so far!) will fail on surrogate pairs and combining characters. Oh the joys of Unicode. Reversing a string isn't the same as reversing a sequence of chars.
I like Marc's optimisation for null, empty, and single character inputs. In particular, not only does this get the right answer quickly, but it also handles null (which none of the other answers do)
I originally thought that ToCharArray followed by Array.Reverse would be the fastest, but it does create one "garbage" copy.
The StringBuilder solution creates a single string (not char array) and manipulates that until you call ToString. There's no extra copying involved... but there's a lot more work maintaining lengths etc.
Which is the more efficient solution? Well, I'd have to benchmark it to have any idea at all - but even so that's not going to tell the whole story. Are you using this in a situation with high memory pressure, where extra garbage is a real pain? How fast is your memory vs your CPU, etc?
As ever, readability is usually king - and it doesn't get much better than Marc's answer on that front. In particular, there's no room for an off-by-one error, whereas I'd have to actually put some thought into validating the other answers. I don't like thinking. It hurts my brain, so I try not to do it very often. Using the built-in Array.Reverse sounds much better to me. (Okay, so it still fails on surrogates etc, but hey...)
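To show what "fails on surrogate pairs" means in practice, here is a small sketch. NaiveReverse is the ToCharArray/Array.Reverse approach from above; IsWellFormedUtf16 is a hypothetical helper of mine that checks whether every surrogate is correctly paired.

```csharp
using System;

static class SurrogateDemo
{
    public static string NaiveReverse(string s)
    {
        char[] chars = s.ToCharArray();
        Array.Reverse(chars);
        return new string(chars);
    }

    // True when every high surrogate is immediately followed by a low
    // surrogate and no low surrogate appears on its own.
    public static bool IsWellFormedUtf16(string s)
    {
        for (int i = 0; i < s.Length; i++)
        {
            if (char.IsHighSurrogate(s[i]))
            {
                if (i + 1 >= s.Length || !char.IsLowSurrogate(s[i + 1]))
                    return false;
                i++; // skip the low half of the pair
            }
            else if (char.IsLowSurrogate(s[i]))
            {
                return false;
            }
        }
        return true;
    }
}
```

With input "a\uD83D\uDE00b" (an emoji between two letters), the input is well formed, but NaiveReverse returns "b\uDE00\uD83Da", where the two halves of the pair are swapped, so IsWellFormedUtf16 returns false for it.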
Since strings are immutable, each += statement will create a new string by copying the string from the previous step, along with the single character, to form a new string. Effectively, this will be an O(n^2) algorithm instead of O(n).
A faster way would be (O(n)):
// pseudocode:
static string ReverseString(string input) {
char[] buf = new char[input.Length];
for(int i = 0; i < buf.Length; ++i)
buf[i] = input[input.Length - i - 1];
return new string(buf);
}
You can do this in .NET 3.5 instead:
public static string Reverse(this string s)
{
return new String((s.ToCharArray().Reverse()).ToArray());
}
A better way to tackle it would be to use a StringBuilder; since it is not immutable, you won't get the terrible object generation behavior that you would get above. In .NET all strings are immutable, which means that the += operator there will create a new object each time it is hit. StringBuilder uses an internal buffer, so the reversal could be done in the buffer with no extra object allocations.
You should use the StringBuilder class to create your resulting string. A string is immutable, so when you append to a string in each iteration of the loop, a new string has to be created, which isn't very efficient.
I prefer something like this:
using System;
using System.Text;
namespace SpringTest3
{
static class Extentions
{
static private StringBuilder ReverseStringImpl(string s, int pos, StringBuilder sb)
{
return (s.Length <= --pos || pos < 0) ? sb : ReverseStringImpl(s, pos, sb.Append(s[pos]));
}
static public string Reverse(this string s)
{
return ReverseStringImpl(s, s.Length, new StringBuilder()).ToString();
}
}
class Program
{
static void Main(string[] args)
{
Console.WriteLine("abc".Reverse());
}
}
}
x is the string to reverse.
Stack<char> stack = new Stack<char>(x);
string s = new string(stack.ToArray());
This method cuts the number of iterations in half. Rather than starting from the end, it starts from the beginning and swaps characters until it hits center. Had to convert the string to a char array because the indexer on a string has no setter.
public string Reverse(String value)
{
if (String.IsNullOrEmpty(value)) throw new ArgumentNullException("value");
char[] array = value.ToCharArray();
for (int i = 0; i < value.Length / 2; i++)
{
char temp = array[i];
array[i] = array[(array.Length - 1) - i];
array[(array.Length - 1) - i] = temp;
}
return new string(array);
}
Necromancing.
As a public service, this is how you actually CORRECTLY reverse a string (reversing a string is NOT equal to reversing a sequence of chars)
public static class Test
{
private static System.Collections.Generic.List<string> GraphemeClusters(string s)
{
System.Collections.Generic.List<string> ls = new System.Collections.Generic.List<string>();
System.Globalization.TextElementEnumerator enumerator = System.Globalization.StringInfo.GetTextElementEnumerator(s);
while (enumerator.MoveNext())
{
ls.Add((string)enumerator.Current);
}
return ls;
}
// this
private static string ReverseGraphemeClusters(string s)
{
if(string.IsNullOrEmpty(s) || s.Length == 1)
return s;
System.Collections.Generic.List<string> ls = GraphemeClusters(s);
ls.Reverse();
return string.Join("", ls.ToArray());
}
public static void TestMe()
{
string s = "Les Mise\u0301rables";
// s = "noël";
string r = ReverseGraphemeClusters(s);
// This would be wrong:
// char[] a = s.ToCharArray();
// System.Array.Reverse(a);
// string r = new string(a);
System.Console.WriteLine(r);
}
}
See:
https://vimeo.com/7403673
By the way, in Golang, the correct way is this:
package main
import (
"unicode"
"regexp"
)
func main() {
str := "\u0308" + "a\u0308" + "o\u0308" + "u\u0308"
println("u\u0308" + "o\u0308" + "a\u0308" + "\u0308" == ReverseGrapheme(str))
println("u\u0308" + "o\u0308" + "a\u0308" + "\u0308" == ReverseGrapheme2(str))
}
func ReverseGrapheme(str string) string {
buf := []rune("")
checked := false
index := 0
ret := ""
for _, c := range str {
if !unicode.Is(unicode.M, c) {
if len(buf) > 0 {
ret = string(buf) + ret
}
buf = buf[:0]
buf = append(buf, c)
if checked == false {
checked = true
}
} else if checked == false {
ret = string(append([]rune(""), c)) + ret
} else {
buf = append(buf, c)
}
index += 1
}
return string(buf) + ret
}
func ReverseGrapheme2(str string) string {
re := regexp.MustCompile("\\PM\\pM*|.")
slice := re.FindAllString(str, -1)
length := len(slice)
ret := ""
for i := 0; i < length; i += 1 {
ret += slice[length-1-i]
}
return ret
}
And the incorrect way is this (ToCharArray.Reverse):
func Reverse(s string) string {
runes := []rune(s)
for i, j := 0, len(runes)-1; i < j; i, j = i+1, j-1 {
runes[i], runes[j] = runes[j], runes[i]
}
return string(runes)
}
Note that you need to know the difference between
- a character and a glyph
- a byte (8 bit) and a codepoint/rune (32 bit)
- a codepoint and a GraphemeCluster [32+ bit] (aka Grapheme/Glyph)
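A quick way to see these differences in C# is to count each unit for the same string. This is only an illustrative sketch; TextUnits and Measure are hypothetical names, and EnumerateRunes assumes .NET Core 3.0 or later.

```csharp
using System;
using System.Globalization;
using System.Text;

static class TextUnits
{
    // Counts UTF-8 bytes, UTF-16 chars, code points (runes) and grapheme
    // clusters (text elements) for the same string.
    public static (int Utf8Bytes, int Chars, int CodePoints, int Graphemes) Measure(string s)
    {
        int codePoints = 0;
        foreach (Rune r in s.EnumerateRunes())
            codePoints++;

        return (Encoding.UTF8.GetByteCount(s),
                s.Length,
                codePoints,
                new StringInfo(s).LengthInTextElements);
    }
}
```

Measure("e\u0301") (an e followed by a combining acute accent) gives (3, 2, 2, 1), while Measure("\uD83D\uDE00") (one emoji stored as a surrogate pair) gives (4, 2, 1, 1).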
Reference:
Character is an overloaded term that can mean many things.
A code point is the atomic unit of information. Text is a sequence of
code points. Each code point is a number which is given meaning by the
Unicode standard.
A grapheme is a sequence of one or more code points that are displayed
as a single, graphical unit that a reader recognizes as a single
element of the writing system. For example, both a and ä are
graphemes, but they may consist of multiple code points (e.g. ä may be
two code points, one for the base character a followed by one for the
diaeresis; but there's also an alternative, legacy, single code point
representing this grapheme). Some code points are never part of any
grapheme (e.g. the zero-width non-joiner, or directional overrides).
A glyph is an image, usually stored in a font (which is a collection
of glyphs), used to represent graphemes or parts thereof. Fonts may
compose multiple glyphs into a single representation, for example, if
the above ä is a single code point, a font may choose to render that as
two separate, spatially overlaid glyphs. For OTF, the font's GSUB and
GPOS tables contain substitution and positioning information to make
this work. A font may contain multiple alternative glyphs for the same
grapheme, too.
static string reverseString(string text)
{
Char[] a = text.ToCharArray();
string b = "";
for (int q = a.Count() - 1; q >= 0; q--)
{
b = b + a[q].ToString();
}
return b;
}