Find nearest safe split in byte array containing UTF-8 data - c#

I want to split a large array of UTF-8 encoded data, so that decoding it into chars can be parallelized.
It seems there's no way to find out how many bytes Encoding.GetCharCount reads. I also can't use GetByteCount(GetChars(...)) since it decodes the entire array anyway, which is what I'm trying to avoid.

UTF-8 has well-defined byte sequences and is considered self-synchronizing, meaning that from any byte position you can find where the character containing that position begins.
The UTF-8 spec (Wikipedia is the easiest link) defines the following byte sequences:
0_______ : ASCII (0-127) char
10______ : Continuation
110_____ : Two-byte character
1110____ : Three-byte character
11110___ : Four-byte character
So, the following method (or something similar) should get your result:
Get the byte count of the data (bytes.Length)
Determine how many sections to split into
Select the byte at byteCount / sectionCount
Test that byte against the table (a small helper sketching this test follows the list):
If byte & 0x80 == 0x00 then you can make this byte part of either section
If byte & 0xE0 == 0xC0 then you need to seek ahead one byte, and keep it with the current section
If byte & 0xF0 == 0xE0 then you need to seek ahead two bytes, and keep it with the current section
If byte & 0xF8 == 0xF0 then you need to seek ahead three bytes, and keep it with the current section
If byte & 0xC0 == 0x80 then you are in a continuation, and should seek ahead until the first byte that does not satisfy val & 0xC0 == 0x80, then keep everything up to (but not including) that byte in the current section
Select byteStart through byteCount + offset, where offset is determined by the test above
Repeat for each section.
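As a sketch of the test above (my own helper, not part of the final solution below), the mask checks translate directly into code:
public static int GetSeekAhead(byte b)
{
    if ((b & 0x80) == 0x00) return 0; // ASCII: safe to split right after this byte
    if ((b & 0xE0) == 0xC0) return 1; // lead byte of a two-byte character
    if ((b & 0xF0) == 0xE0) return 2; // lead byte of a three-byte character
    if ((b & 0xF8) == 0xF0) return 3; // lead byte of a four-byte character
    return -1;                        // continuation byte: move back to the character's start instead
}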
Of course, if we redefine our test as returning the current char start position, we have two cases:
1. If (byte[i] & 0xC0) == 0x80 then we need to move backward through the array
2. Else, return the current i (since it's not a continuation)
This gives us the following method:
public static int GetCharStart(ref byte[] arr, int index) =>
(arr[index] & 0xC0) == 0x80 ? GetCharStart(ref arr, index - 1) : index;
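For instance, a quick sanity check (my own example; the UTF-8 bytes of "aé" are 0x61 0xC3 0xA9):
var bytes = Encoding.UTF8.GetBytes("aé");      // 0x61, 0xC3, 0xA9
Console.WriteLine(GetCharStart(ref bytes, 2)); // 1: index 2 is a continuation byte, so we back up to the lead byte
Console.WriteLine(GetCharStart(ref bytes, 0)); // 0: ASCII, already a character start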
Next, we want to get each section. The easiest way is to use a state-machine (or abuse, depending on how you look at it) to return the sections:
public static IEnumerable<byte[]> GetByteSections(byte[] utf8Array, int sectionCount)
{
    var sectionStart = 0;
    var sectionEnd = 0;
    for (var i = 0; i < sectionCount; i++)
    {
        sectionEnd = i == (sectionCount - 1)
            ? utf8Array.Length
            : GetCharStart(ref utf8Array, (int)Math.Round((double)utf8Array.Length / sectionCount * i));
        yield return GetSection(ref utf8Array, sectionStart, sectionEnd);
        sectionStart = sectionEnd;
    }
}
I built it this way because I want to use Parallel.ForEach to demonstrate the result, which is easy if we have an IEnumerable, and it also lets the processing stay lazy: sections are only gathered when they are needed, so the work is done on demand.
Lastly, we need to be able to get a section of bytes, so we have the GetSection method:
public static byte[] GetSection(ref byte[] array, int start, int end)
{
    var result = new byte[end - start];
    for (var i = 0; i < result.Length; i++)
    {
        result[i] = array[i + start];
    }
    return result;
}
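As an aside (my own note, not part of the original answer), the copy loop does the same work as a single Array.Copy call, which could be used instead:
public static byte[] GetSection(ref byte[] array, int start, int end)
{
    var result = new byte[end - start];
    Array.Copy(array, start, result, 0, result.Length); // bulk copy instead of a manual loop
    return result;
}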
Finally, the demonstration:
var sourceText = "Some test 平仮名, ひらがな string that should be decoded in parallel, this demonstrates that we work flawlessly with Parallel.ForEach. The only downside to using `Parallel.ForEach` the way I demonstrate is that it doesn't take order into account, but oh-well.";
var source = Encoding.UTF8.GetBytes(sourceText);
Console.WriteLine(sourceText);
var results = new ConcurrentBag<string>();
Parallel.ForEach(GetByteSections(source, 10),
    new ParallelOptions { MaxDegreeOfParallelism = 1 },
    x => { Console.WriteLine(Encoding.UTF8.GetString(x)); results.Add(Encoding.UTF8.GetString(x)); });
Console.WriteLine();
Console.WriteLine("Assemble the result: ");
Console.WriteLine(string.Join("", results.Reverse()));
Console.ReadLine();
The result:
Some test ???, ???? string that should be decoded in parallel, this demonstrates that we work flawlessly with Parallel.ForEach. The only downside to using `Parallel.ForEach` the way I demonstrate is that it doesn't take order into account, but oh-well.
Some test ???, ??
?? string that should b
e decoded in parallel, thi
s demonstrates that we work
flawlessly with Parallel.
ForEach. The only downside
to using `Parallel.ForEach`
the way I demonstrate is
that it doesn't take order into account, but oh-well.
Assemble the result:
Some test ???, ???? string that should be decoded in parallel, this demonstrates that we work flawlessly with Parallel.ForEach. The only downside to using `Parallel.ForEach` the way I demonstrate is that it doesn't take order into account, but oh-well.
Not perfect, but it does the job. If we change MaxDegreeOfParallelism to a higher value, our string gets jumbled:
Some test ???, ??
e decoded in parallel, thi
flawlessly with Parallel.
?? string that should b
to using `Parallel.ForEach`
ForEach. The only downside
that it doesn't take order into account, but oh-well.
s demonstrates that we work
the way I demonstrate is
So, as you can see, super easy. You'll want to make modifications to allow for correct order-reassembly (one possible approach is sketched right below), but this should demonstrate the trick.
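Here is that sketch (my own, not part of the original answer): materialize the sections, decode them in parallel into an indexed array with Parallel.For, then join the array, which preserves the original order regardless of which section finishes first.
var sections = GetByteSections(source, 10).ToArray();
var decoded = new string[sections.Length];
Parallel.For(0, sections.Length, i => decoded[i] = Encoding.UTF8.GetString(sections[i]));
Console.WriteLine(string.Concat(decoded));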
If we modify the GetByteSections method as follows, the last section is no longer ~2x the size of the remaining ones:
public static IEnumerable<byte[]> GetByteSections(byte[] utf8Array, int sectionCount)
{
    var sectionStart = 0;
    var sectionEnd = 0;
    var sectionSize = (int)Math.Ceiling((double)utf8Array.Length / sectionCount);
    for (var i = 0; i < sectionCount; i++)
    {
        if (i == (sectionCount - 1))
        {
            sectionEnd = GetCharStart(ref utf8Array, i * sectionSize);
            yield return GetSection(ref utf8Array, sectionStart, sectionEnd);
            sectionStart = sectionEnd;
            sectionEnd = utf8Array.Length;
            yield return GetSection(ref utf8Array, sectionStart, sectionEnd);
        }
        else
        {
            sectionEnd = GetCharStart(ref utf8Array, i * sectionSize);
            yield return GetSection(ref utf8Array, sectionStart, sectionEnd);
            sectionStart = sectionEnd;
        }
    }
}
The result:
Some test ???, ???? string that should be decoded in parallel, this demonstrates that we work flawlessly with Parallel.ForEach. The only downside to using `Parallel.ForEach` the way I demonstrate is that it doesn't take order into account, but oh-well. We can continue to increase the length of this string to demonstrate that the last section is usually about double the size of the other sections, we could fix that if we really wanted to. In fact, with a small modification it does so, we just have to remember that we'll end up with `sectionCount + 1` results.
Some test ???, ???? string that should be de
coded in parallel, this demonstrates that we work flawless
ly with Parallel.ForEach. The only downside to using `Para
llel.ForEach` the way I demonstrate is that it doesn't tak
e order into account, but oh-well. We can continue to incr
ease the length of this string to demonstrate that the las
t section is usually about double the size of the other se
ctions, we could fix that if we really wanted to. In fact,
with a small modification it does so, we just have to rem
ember that we'll end up with `sectionCount + 1` results.
Assemble the result:
Some test ???, ???? string that should be decoded in parallel, this demonstrates that we work flawlessly with Parallel.ForEach. The only downside to using `Parallel.ForEach` the way I demonstrate is that it doesn't take order into account, but oh-well. We can continue to increase the length of this string to demonstrate that the last section is usually about double the size of the other sections, we could fix that if we really wanted to. In fact, with a small modification it does so, we just have to remember that we'll end up with `sectionCount + 1` results.
And finally, if for some reason you split into an abnormally large number of sections compared to the input size (my input of ~578 bytes at 250 chars demonstrates this), you'll hit an IndexOutOfRangeException in GetCharStart. The following version fixes that:
public static int GetCharStart(ref byte[] arr, int index)
{
    if (index >= arr.Length)
    {
        index = arr.Length - 1;
    }
    return (arr[index] & 0xC0) == 0x80 ? GetCharStart(ref arr, index - 1) : index;
}
Of course this leaves you with a bunch of empty results, but when you reassemble, the string doesn't change, so I'm not even going to bother posting the full scenario test here. (I leave it up to you to experiment.)

Great answer, Mathieu and Der. Here is a Python variant, 100% based on your answer, which works great:
def find_utf8_split(data, bytes=None):
    bytes = bytes or len(data)
    while bytes > 0 and data[bytes - 1] & 0xC0 == 0x80:
        bytes -= 1
    if bytes > 0:
        if data[bytes - 1] & 0xE0 == 0xC0: bytes = bytes - 1
        if data[bytes - 1] & 0xF0 == 0xE0: bytes = bytes - 1
        if data[bytes - 1] & 0xF8 == 0xF0: bytes = bytes - 1
    return bytes
This code finds a UTF-8 compatible split point in a given byte string. It does not do the split itself, as that would take more memory; that is left to the rest of the code.
For example, you could:
position = find_utf8_split(data)
leftovers = data[position:]
text = data[:position].decode('utf-8')

Related

.NET BitArray cardinality [duplicate]

I am implementing a library where I am extensively using the .Net BitArray class and need an equivalent to the Java BitSet.Cardinality() method, i.e. a method which returns the number of bits set. I was thinking of implementing it as an extension method for the BitArray class. The trivial implementation is to iterate and count the bits set (like below), but I wanted a faster implementation as I would be performing thousands of set operations and counting the answer. Is there a faster way than the example below?
int count = 0;
for (int i = 0; i < mybitarray.Length; i++)
{
    if (mybitarray[i])
        count++;
}
This is my solution based on the "best bit counting method" from http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel
public static Int32 GetCardinality(BitArray bitArray)
{
    Int32[] ints = new Int32[(bitArray.Count >> 5) + 1];
    bitArray.CopyTo(ints, 0);
    Int32 count = 0;

    // fix for not truncated bits in last integer that may have been set to true with SetAll()
    ints[ints.Length - 1] &= ~(-1 << (bitArray.Count % 32));

    for (Int32 i = 0; i < ints.Length; i++)
    {
        Int32 c = ints[i];
        // magic (http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel)
        unchecked
        {
            c = c - ((c >> 1) & 0x55555555);
            c = (c & 0x33333333) + ((c >> 2) & 0x33333333);
            c = ((c + (c >> 4) & 0xF0F0F0F) * 0x1010101) >> 24;
        }
        count += c;
    }
    return count;
}
According to my tests, this is around 60 times faster than the simple foreach loop and still 30 times faster than the Kernighan approach with around 50% bits set to true in a BitArray with 1000 bits. I also have a VB version of this if needed.
You can accomplish this pretty easily with LINQ:
BitArray ba = new BitArray(new[] { true, false, true, false, false });
var numOnes = (from bool m in ba
               where m
               select m).Count();
BitArray myBitArray = new BitArray(...

int
    bits = myBitArray.Count,
    size = ((bits - 1) >> 3) + 1,
    counter = 0,
    x,
    c;

byte[] buffer = new byte[size];
myBitArray.CopyTo(buffer, 0);

for (x = 0; x < size; x++)
    for (c = 0; buffer[x] > 0; buffer[x] >>= 1)
        counter += buffer[x] & 1;
Taken from "Counting bits set, Brian Kernighan's way" and adapted for bytes. I'm using it for bit arrays of 1 000 000+ bits and it's superb.
If your bits are not n*8 then you can count the mod byte manually.
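For example, a small sketch of that manual handling (my own, reusing the bits, size and buffer variables from the snippet above): before the counting loop, mask off the padding bits of the last byte, since SetAll(true) may have set them in the underlying storage.
if ((bits % 8) != 0)
{
    // Keep only the low (bits % 8) bits of the final byte; the rest is padding
    buffer[size - 1] &= (byte)((1 << (bits % 8)) - 1);
}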
I had the same issue, but had more than just the one Cardinality method to convert. So, I opted to port the entire BitSet class. Fortunately it was self-contained.
Here is the Gist of the C# port.
I would appreciate if people would report any bugs that are found - I am not a Java developer, and have limited experience with bit logic, so I might have translated some of it incorrectly.
A faster and simpler version than the accepted answer, thanks to System.Numerics.BitOperations.PopCount:
C#
Int32[] ints = new Int32[(bitArray.Count >> 5) + 1];
bitArray.CopyTo(ints, 0);
Int32 count = 0;
for (Int32 i = 0; i < ints.Length; i++)
{
    count += BitOperations.PopCount((uint)ints[i]);
}
Console.WriteLine(count);
F#
let ints = Array.create ((bitArray.Count >>> 5) + 1) 0u
bitArray.CopyTo(ints, 0)
ints
|> Array.sumBy BitOperations.PopCount
|> printfn "%d"
See more details in Is BitOperations.PopCount the best way to compute the BitArray cardinality in .NET?
You could use LINQ, but it would be useless and slower:
var sum = mybitarray.OfType<bool>().Count(p => p);
There is no faster way using just BitArray - what it comes down to is that you will have to count them - you could use LINQ to do that or do your own loop, but there is no method offered by BitArray, and the underlying data structure is an int[] array (as seen with Reflector) - so this will always be O(n), n being the number of bits in the array.
The only way I could think of making it faster is using reflection to get a hold of the underlying m_array field, then you can get around the boundary checks that Get() uses on every call (see below) - but this is kinda dirty, and might only be worth it on very large arrays since reflection is expensive.
public bool Get(int index)
{
    if ((index < 0) || (index >= this.Length))
    {
        throw new ArgumentOutOfRangeException("index", Environment.GetResourceString("ArgumentOutOfRange_Index"));
    }
    return ((this.m_array[index / 0x20] & (((int)1) << (index % 0x20))) != 0);
}
If this optimization is really important to you, you should create your own class for bit manipulation, that internally could use BitArray, but keeps track of the number of bits set and offers the appropriate methods (mostly delegate to BitArray but add methods to get number of bits currently set) - then of course this would be O(1).
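A minimal sketch of such a wrapper (my own, trimmed to Get/Set rather than the full BitArray surface) could look like this:
public class CountingBitArray
{
    private readonly BitArray _bits;

    public CountingBitArray(int length) => _bits = new BitArray(length);

    // Maintained incrementally on every Set, so reading it is O(1)
    public int Cardinality { get; private set; }

    public bool Get(int index) => _bits[index];

    public void Set(int index, bool value)
    {
        if (_bits[index] != value)
        {
            Cardinality += value ? 1 : -1;
            _bits[index] = value;
        }
    }
}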
If you really want to maximize the speed, you could pre-compute a lookup table where given a byte-value you have the cardinality, but BitArray is not the most ideal structure for this, since you'd need to use reflection to pull the underlying storage out of it and operate on the integral types - see this question for a better explanation of that technique.
Another, perhaps more useful technique, is to use something like the Kernighan trick, which is O(m) for an n-bit value of cardinality m.
static readonly BitArray ZERO = new BitArray(0);
static readonly BitArray NOT_ONE = new BitArray(1).Not();

public static int GetCardinality(this BitArray bits)
{
    int c = 0;
    var tmp = new BitArray(bits);
    for (; tmp != ZERO; c++)
        tmp = tmp.And(tmp.And(NOT_ONE));
    return c;
}
This too is a bit more cumbersome than it would be in, say, C, because there are no operations defined between integer types and BitArrays (tmp &= tmp - 1, for example, to clear the least significant set bit, has been translated to tmp &= (tmp & ~0x1)).
I have no idea if this ends up being any faster than naively iterating for the case of the BCL BitArray, but algorithmically speaking it should be superior.
EDIT: cited where I discovered the Kernighan trick, with a more in-depth explanation
If you don't mind copying the code of System.Collections.BitArray into your project and editing it, you can write it as follows:
(I think this is the fastest. I've also tried using BitVector32[] to implement my BitArray, but it was still slow.)
public void Set(int index, bool value)
{
    if ((index < 0) || (index >= this.m_length))
    {
        throw new ArgumentOutOfRangeException("index", "Index Out Of Range");
    }
    SetWithOutAuth(index, value);
}

// When batch-setting values, we need one method that won't check the index range
private void SetWithOutAuth(int index, bool value)
{
    int v = ((int)1) << (index % 0x20);
    index = index / 0x20;
    bool NotSet = (this.m_array[index] & v) == 0;
    if (value && NotSet)
    {
        CountOfTrue++; // count the true values
        this.m_array[index] |= v;
    }
    else if (!value && !NotSet)
    {
        CountOfTrue--; // count the true values
        this.m_array[index] &= ~v;
    }
    else
        return;
    this._version++;
}

public int CountOfTrue { get; internal set; }

public void BatchSet(int start, int length, bool value)
{
    if (start < 0 || start >= this.m_length || length <= 0)
        return;
    for (int i = start; i < length && i < this.m_length; i++)
    {
        SetWithOutAuth(i, value);
    }
}
I wrote my own version after not finding one that uses a look-up table:
private int[] _bitCountLookup;

private void InitLookupTable()
{
    _bitCountLookup = new int[256];
    for (var byteValue = 0; byteValue < 256; byteValue++)
    {
        var count = 0;
        for (var bitIndex = 0; bitIndex < 8; bitIndex++)
        {
            count += (byteValue >> bitIndex) & 1;
        }
        _bitCountLookup[byteValue] = count;
    }
}

private int CountSetBits(BitArray bitArray)
{
    var result = 0;
    var numberOfFullBytes = bitArray.Length / 8;
    var numberOfTailBits = bitArray.Length % 8;
    var tailByte = numberOfTailBits > 0 ? 1 : 0;
    var bitArrayInBytes = new byte[numberOfFullBytes + tailByte];

    bitArray.CopyTo(bitArrayInBytes, 0);

    for (var i = 0; i < numberOfFullBytes; i++)
    {
        result += _bitCountLookup[bitArrayInBytes[i]];
    }
    for (var i = (numberOfFullBytes * 8); i < bitArray.Length; i++)
    {
        if (bitArray[i])
        {
            result++;
        }
    }
    return result;
}
The problem is naturally O(n); as a result, your solution is probably the most efficient.
Since you are trying to count an arbitrary subset of bits, you cannot count the bits as they are set (which would provide a speed boost if you are not setting the bits too often).
You could check to see if the processor you are using has a command which will return the number of set bits. For example, a processor with SSE4 could use the POPCNT instruction, according to this post. This would probably not work for you since .NET does not allow assembly (because it is platform independent). Also, ARM processors probably do not have an equivalent.
Probably the best solution would be a lookup table (or a switch, if you could guarantee the switch will be compiled to a single jump to currentLocation + byteValue). This would give you the count for the whole byte. Of course BitArray does not give access to the underlying data type, so you would have to make your own BitArray. You would also have to guarantee that all the bits in the byte will always be part of the intersection, which does not sound likely.
Another option would be to use an array of booleans instead of a BitArray. This has the advantage not needing to extract the bit from the others in the byte. The disadvantage is the array will take up 8x as much space in memory meaning not only wasted space, but also more data push as you iterate through the array to perform your count.
The difference between a standard array look up and a BitArray look up is as follows:
Array:
1. offset = index * indexSize
2. Get memory at location + offset and save to value
BitArray:
1. index = index / indexSize
2. offset = index * indexSize
3. Get memory at location + offset and save to value
4. position = index % indexSize
5. Shift value position bits
6. value = value and 1
With the exception of #2 for Arrays and #3 most of these commands take 1 processor cycle to complete. Some of the commands can be combined into 1 command using x86/x64 processors, though probably not with ARM since it uses a reduced set of instructions.
Which of the two (array or BitArray) perform better will be specific to your platform (processor speed, processor instructions, processor cache sizes, processor cache speed, amount of system memory (Ram), speed of system memory (CAS), speed of connection between processor and RAM) as well as the spread of indexes you want to count (are the intersections most often clustered or are they randomly distributed).
To summarize: you could probably find a way to make it faster, but your solution is the fastest you will get for your data set using a bit per boolean model in .NET.
Edit: make sure you are accessing the indexes you want to count in order. If you access indexes 200, 5, 150, 151, 311, 6 in that order then you will increase the amount of cache misses resulting in more time spent waiting for values to be retrieved from RAM.

How to take array segments out of a byte array after every X step?

I got a big byte array (around 50kb) and I need to extract numeric values from it. Every three bytes represent one value.
What I tried is to work with LINQ's Skip & Take, but it's really slow given the large size of the array.
This is my very slow routine:
List<int> ints = new List<int>();
for (int i = 0; i <= fullFile.Count(); i += 3)
{
    ints.Add(BitConverter.ToInt16(fullFile.Skip(i).Take(i + 3).ToArray(), 0));
}
I think I've got the wrong approach to this.
Your code
First of all, ToInt16 only uses two bytes. So your third byte will be discarded.
You can't use ToInt32 as it would include one extra byte.
Let's review this:
fullFile.Skip(i).Take(i + 3).ToArray()
...and take a careful look at Take(i + 3). It says that you want to copy a larger and larger buffer. For instance, when i is on index 32000 you copy 32003 bytes into your new buffer.
That's why the code is quite slow.
The code is also slow since you allocate a lot of byte buffers (65535 extra buffers of growing size), all of which have to be garbage collected.
You could also have done it like this:
List<int> ints = new List<int>();
var workBuffer = new byte[4];
for (int i = 0; i < fullFile.Length; i += 3)
{
    // Copy the three bytes into the beginning of the temp buffer
    Buffer.BlockCopy(fullFile, i, workBuffer, 0, 3);

    // Now we can use ToInt32 as the last byte always is zero
    var value = BitConverter.ToInt32(workBuffer, 0);
    ints.Add(value);
}
Quite easy to understand, but not the fastest code.
A better solution
So the most efficient way is to do the conversion by yourself (bit shifting).
Something like:
List<int> ints = new List<int>();
for (int i = 0; i < fullFile.Length; i += 3)
{
    // This code assumes little-endianness
    var value = (fullFile[i + 2] << 16)
              + (fullFile[i + 1] << 8)
              + fullFile[i];
    ints.Add(value);
}
This code does not allocate anything extra (except the ints) and should be quite fast.
You can read more about shift operators on MSDN, and about endianness.
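Putting the pieces together, a sketch of a small helper (my own, assuming the array length is a multiple of three) that also pre-sizes the list so it never has to grow while adding:
public static List<int> ReadPackedValues(byte[] fullFile)
{
    // One value per three bytes, so reserve the exact capacity up front
    var ints = new List<int>(fullFile.Length / 3);
    for (int i = 0; i + 2 < fullFile.Length; i += 3)
    {
        // Assemble the 24-bit little-endian value by shifting
        ints.Add((fullFile[i + 2] << 16) | (fullFile[i + 1] << 8) | fullFile[i]);
    }
    return ints;
}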

C# "Tag" 4-byte Hex Chunks for reconstruction to original string later

I am wrestling with a particular issue and would like to ask for guidance on how I can achieve what I seek. Given the function below, a variable-length string is used as input and produces 4-byte hex chunk equivalents. These 4-byte chunks are written to an XML file for storage, and that XML file's schema cannot be altered. My issue is that the application which governs the XML file sorts the 4-byte chunks within the file, so when I read that same XML file back, my string is destroyed. I'd like a way to "tag" each 4-byte chunk with some sort of identifier that I can use in my decoder function in spite of whatever sorting may have been done.
Encoding function (much of which was provided by Antonín Lejsek):
private static string StringEncoder(string strInput)
{
    try
    {
        // instantiate our StringBuilder object and set the capacity for our sb object to the length of our message.
        StringBuilder sb = new StringBuilder(strInput.Length * 9 / 4 + 10);
        int count = 0;

        // iterate through each character in our message and format the sb object to follow Microsofts implementation of ECMA-376 for rsidR values of type ST_LongHexValue
        foreach (char c in strInput)
        {
            // pad the first 4 byte chunk with 2 digit zeros.
            if (count == 0)
            {
                sb.Append("00");
                count = 0;
            }
            // every three bytes add a space and append 2 digit zeros.
            if (count == 3)
            {
                sb.Append(" ");
                sb.Append("00");
                count = 0;
            }
            sb.Append(String.Format("{0:X2}", (int)c));
            count++;
        }

        // handle encoded bytes which are greater than a 1 byte but smaller than 3 bytes so we know how many bytes to pad right with.
        for (int i = 0; i < (3 - count) % 3; ++i)
        {
            sb.Append("00");
        }

        // DEBUG: echo results for testing.
        //Console.WriteLine("");
        //Console.WriteLine("String provided: {0}", strInput);
        //Console.WriteLine("Hex in 8-digit chunks: {0}", sb.ToString());
        //Console.WriteLine("======================================================");

        return sb.ToString();
    }
    catch (NullReferenceException e)
    {
        Console.WriteLine("");
        Console.WriteLine("ERROR : StringEncoder has received null input.");
        Console.WriteLine("ERROR : Please ensure there is something to read in the output.txt file.");
        Console.WriteLine("");
        //Console.WriteLine(e.Message);
        return null;
    }
}
For Example : This function when provided with the following input "coolsss" would produce the following output : 0020636F 006F6C73 00737300
The above (3) 8 digit chunks would get written to the XML file starting with the first chunk and proceeding onto the last. Like so,
0020636F
006F6C73
00737300
Now, there are other 8-digit chunks in the XML file which were not created by the function above. This presents an issue as the Application can reorder these chunks among themselves and the others already present in the file like so,
00737300
00111111
006F6C73
00000000
0020636F
So, can you help me think of any way to add a tag of some sort, or use some C# data structure, to be able to read each chunk and reconstruct my original string despite the reordering?
I appreciate any guidance you can provide. Credit to Antonín Lejsek for his help with the function above.
Thank you,
Gabriel Alicea
Well, I am reluctant to suggest this as a proposed solution because it feels a bit too hackish for me.
Having said that, I suppose you could leverage the second byte as an ordinal so you can track the chunks and "re-assemble" the string later.
You could use the following scheme to track your chunks.
00XY0000
Where the second byte XY could be split up into two 4-bit parts representing an ordinal and a checksum.
X = Ordinal
Y = 16 % X
When reading the chunks you can split up the second byte into its two 4-bit parts just like above and verify that the checksum aligns with the ordinal.
This solution does introduce a 16 character constraint on string length unless you eliminate the checksum and use the entire byte as an ordinal for which you can increase your string lengths to 256 characters.
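A rough sketch of how that tagging and re-ordering could look (this is my own interpretation of the scheme; the layout, the Tag/Order names, and the use of ordinal + 1 in the check value are assumptions, and each chunk gives up one payload byte to make room for the tag):
static string Tag(string payloadHex, int ordinal)
{
    int check = 16 % (ordinal + 1);                     // ordinal + 1 avoids dividing by zero
    int tagByte = (ordinal << 4) | (check & 0xF);       // high nibble: ordinal, low nibble: check value
    return "00" + tagByte.ToString("X2") + payloadHex;  // 00 + tag byte + 2-character (4 hex digit) payload
}

static IEnumerable<string> Order(IEnumerable<string> chunks)
{
    // Keep only chunks whose check value matches their ordinal, then sort by ordinal
    return chunks
        .Where(c =>
        {
            int tag = Convert.ToInt32(c.Substring(2, 2), 16);
            return (tag & 0xF) == 16 % ((tag >> 4) + 1);
        })
        .OrderBy(c => Convert.ToInt32(c.Substring(2, 2), 16) >> 4);
}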

Cut and take new line String with maxchar

I tried to create a function to print all the articles of a bill using some max-length variables, but when a max length is exceeded an Index out of range error appears, or in other cases the total and quantity don't appear on the line:
Max Length variables:
private Int32 MaxCharPage = 36;
private Int32 MaxCharArticleName = 15;
private Int32 MaxCharArticleQuantity = 4;
private Int32 MaxCharArticleSellPrice = 6;
private Int32 MaxCharArticleTotal = 8;
private Int32 MaxCharArticleLineSpace = 1;
This is my current function:
private IEnumerable<String> ArticleLine(Double Quantity, String Name, Double SellPrice, Double Total)
{
    String QuantityChunk = (String)(Quantity).ToString("#,##0.00");
    String PriceChunk = (String)(SellPrice).ToString("#,##0.00");
    String TotalChunk = (String)(Total).ToString("#,##0.00");

    // full chunks with "size" length
    for (int i = 0; i < Name.Length / this.MaxCharArticleName; ++i)
    {
        String Chunk = String.Empty;
        if (i == 0)
        {
            Chunk = QuantityChunk + new String((Char)32, (MaxCharArticleQuantity + MaxCharArticleLineSpace - QuantityChunk.Length)) +
                    Name.Substring(i * this.MaxCharArticleName, this.MaxCharArticleName) +
                    new String((Char)32, (MaxCharArticleLineSpace)) +
                    new String((Char)32, (MaxCharArticleSellPrice - PriceChunk.Length)) +
                    PriceChunk +
                    new String((Char)32, (MaxCharArticleLineSpace)) +
                    new String((Char)32, (MaxCharArticleTotal - TotalChunk.Length)) +
                    TotalChunk;
        }
        else
        {
            Chunk = new String((Char)32, (MaxCharArticleQuantity + MaxCharArticleLineSpace)) +
                    Name.Substring(i * this.MaxCharArticleName, this.MaxCharArticleName);
        }
        yield return Chunk;
    }

    if (Name.Length % this.MaxCharArticleName > 0)
    {
        String chunk = Name.Substring(Name.Length / this.MaxCharArticleName * this.MaxCharArticleName);
        yield return new String((Char)32, (MaxCharArticleQuantity + MaxCharArticleLineSpace)) + chunk;
    }
}
private void AddArticle(Double Quantity, String Name, Double SellPrice, Double Total)
{
    Lines.AddRange(ArticleLine(Quantity, Name, SellPrice, Total).ToList());
}
For example:
private List<String> Lines = new List<String>();

private void Form1_Load(object sender, EventArgs e)
{
    AddArticle(2.50, "EGGS", 0.50, 1.25); // Problem: Dont show the numbers like (Quantity, Sellprice, Total)
    //AddArticle(100.52, "HAND SOAP /w EP", 5.00, 502.60); // OutOfRangeException
    AddArticle(105.6, "LONG NAME ARTICLE DESCRIPTION", 500.03, 100.00);
    //AddArticle(100, "LONG NAME ARTICLE DESCRIPTION2", 1500.03, 150003.00); // OutOfRangeException
    foreach (String line in Lines)
    {
        Console.WriteLine(line);
    }
}
Console Output:
LINE1: EGGS
LINE2:105.6LONG NAME ARTIC 500.03 100.00
LINE3: LE DESCRIPTION
Desired output:
LINE:2.50 EGGS 0.50 1.25
LINE:100. HAND SOAP /w EP 5.00 502.60
LINE: 52
LINE:105. LONG NAME ARTIC 500.03 100.00
LINE: 60 LE DESCRIPTION
LINE:100. LONG NAME ARTIC 1,500. 150,003.
LINE: 00 LE DESCRIPTION2 03 00
You are attempting to take values (string values, ultimately) and extract specified lengths from them (chunks), where each individual chunk of a given group goes on one line, and the next chunk(s) go on the next line(s). There are several challenges in your posted code. One of the biggest is that you may not understand what the / operator does. It is, of course, used for division, but it is integer division, meaning you get a whole number as the result. E.g. 3 / 15 gives you 0, whereas 17 / 15 would give you 1. This means your loop never runs unless the length of the value is greater than the specified chunk limit.
Another possible issue is that you only perform this check against Name, not the other items (though you may have omitted the code for them for brevity reasons).
Your current code is also creating a lot of unnecessary strings, which will lead to performance degradation. Remember that in C# strings are immutable - meaning they cannot be changed. When you "change" the value of a string, you are actually creating another copy of the string.
The crux of your requirement is how to "chunk" the values in such a way that the correct output is achieved. One way to do this is to create an extension method that will take a string value and "chunkify" it for you. One such example is found here: Split String Into Array of Chunks. I've modified it slightly to use a List<T>, but an array would work just as well.
public static List<string> SplitIntoChunks(this string toSplit, int chunkSize)
{
    int stringLength = toSplit.Length;
    int chunksRequired = (int)Math.Ceiling((decimal)stringLength / (decimal)chunkSize);

    List<string> chunks = new List<string>();
    int lengthRemaining = stringLength;

    for (int i = 0; i < chunksRequired; i++)
    {
        int lengthToUse = Math.Min(lengthRemaining, chunkSize);
        int startIndex = chunkSize * i;
        chunks.Add(toSplit.Substring(startIndex, lengthToUse));
        lengthRemaining = lengthRemaining - lengthToUse;
    }
    return chunks;
}
Say you have a string named myString. You would use the above method like this: List<string> chunks = myString.SplitIntoChunks(15);, and you would receive a list of 15-character strings (the last one possibly shorter, depending on the size of the string).
A quick walk-through of the code (as there is not much explanation on the page). The size of the chunk is passed into the extension method. The length of the string is recorded, and the number of chunks for that string is determined using the Math.Ceiling function.
Then a for loop is constructed with the number of chunks required as the limit. Inside the loop, the length of the chunk is determined (using the lower of either the chunk size or the remaining length of the string), the starting index is calculated based on the chunk size and the loop index, and then the chunk is extracted via Substring. Finally remaining length is calculated, and once the loop ends the chunks are returned.
One way to use this extension method in your code would look like this. The extension method will need to be in a separate, static class (I suggest building a library that has a class dedicated solely to extension methods, as they come in quite handy). Note that I haven't had time to test this, and it's a bit kludgy for my tastes, but it should at least get you going in the right direction.
private IEnumerable<string> ArticleLine(double quantity, string name, double sellPrice, double total)
{
    List<string> quantityChunks = quantity.ToString("#,##0.00").SplitIntoChunks(maxCharArticleQuantity);
    List<string> nameChunks = name.SplitIntoChunks(maxCharArticleName);
    List<string> sellPriceChunks = sellPrice.ToString("#,##0.00").SplitIntoChunks(maxCharArticleSellPrice);
    List<string> totalChunks = total.ToString("#,##0.00").SplitIntoChunks(maxCharArticleTotal);

    int maxLines = (new List<int>() { quantityChunks.Count,
                                      nameChunks.Count,
                                      sellPriceChunks.Count,
                                      totalChunks.Count }).Max();

    for (int i = 0; i < maxLines; i++)
    {
        lines.Add(String.Format("{0}{1}{2}{3}",
            quantityChunks.Count > i ?
                quantityChunks[i].PadLeft(maxCharArticleQuantity) :
                String.Empty.PadLeft(maxCharArticleQuantity),
            nameChunks.Count > i ?
                nameChunks[i].PadLeft(maxCharArticleName) :
                String.Empty.PadLeft(maxCharArticleName),
            sellPriceChunks.Count > i ?
                sellPriceChunks[i].PadLeft(maxCharArticleSellPrice) :
                String.Empty.PadLeft(maxCharArticleSellPrice),
            totalChunks.Count > i ?
                totalChunks[i].PadLeft(maxCharArticleTotal) :
                String.Empty.PadLeft(maxCharArticleTotal)));
    }
    return lines;
}
The above code does a couple of things. First, it calls SplitIntoChunks on each of the four variables.
Next, it uses the Max() extension method (you'll need to add a reference to System.Linq if you don't already have one, plus a using directive). to determine the maximum number of lines needed (based on the highest count of the four lists).
Finally, it uses a for loop to build the lines. Note that I use the ternary operator (?:) to build each line. The reason I went this route is so that the lines would be properly formatted, even if one or more of the 4 values didn't have something for that line. In other words:
quantityChunks.Count > i
If the count of items in quantityChunks is greater than i, then there is an entry for that item for that line.
quantityChunks[i].PadLeft(maxCharArticleQuantity, ' ')
If the condition evaluates to true, we use the corresponding entry (and pad it with spaces to keep alignment).
String.Empty.PadLeft(maxCharArticleQuantity, ' ')
If the condition is false, we simply put in spaces for the maximum number of characters for that position in the line.
Then you can print the output like this:
foreach (string line in lines)
{
    Console.WriteLine(line);
}
The portions of the code where I check for the maximum number of lines and the use of a ternary operator in the String.Format feel very kludgy to me, but I don't have time to finesse it into something more respectable. I hope this at least points you in the right direction.
EDIT
Fixed the PadLeft syntax and removed the character specified, as space is used by default.

What is the Fastest way to split ':' seperated string into given number of chunks where result/record length is variable

I have a large string accepted from a TCP listener which is in the following format:
"1,7620257787,0123456789,99,0922337203,9223372036,32.5455,87,12.7857,1/1/2012,9223372036:1,7620257787,0123456789,99,0922337203,9223372036,32.5455,87,12.7857:2/1/2012,234234234:3,7620257787,01234343456789,99,0922337203,9223372036,32.5455,87,12.7857,1/1/2012,9223372036:34,76202343457787,012434343456789,93339,34340922337203,9223372036,32.5455,87,12.7857,1/1/2012,9223372036"
You can see that this is a ':'-separated string which contains records that are comma-separated fields.
I am looking for the best (fastest) way to split the string into a given number of chunks, taking care that each chunk contains only whole records (strings up to ':'),
or, put another way, there should not be any chunk which does not end with ':'.
e.g. a 20 MB string split into 4 chunks of roughly 5 MB each with whole records (the size of each chunk may not be exactly 5 MB, but it will be very near it, and the total of all 4 chunks will be 20 MB).
I hope you can understand my question (sorry for the bad English).
I like the following link, but it does not take care of whole records while splitting, and I don't know whether it is the best and fastest way.
Split String into smaller Strings by length variable
I don't know how large a 'large string' is, but initially I would just try it with the String.Split method.
The idea is to divide the length of your data by the number of blocks required, then look backwards for the last sep in the current block.
private string[] splitToBlocks(string data, int numBlocks, char sep)
{
    // We return an array of the requested length
    if (numBlocks <= 1 || data.Length == 0)
    {
        return new string[] { data };
    }

    string[] result = new string[numBlocks];

    // The optimal size of each block
    int blockLen = (data.Length / numBlocks);

    int idx = 0; int pos = 0; int lastSepPos = blockLen;
    while (idx < numBlocks)
    {
        if (idx == numBlocks - 1)
        {
            // Last block: take everything that is left, so a trailing record
            // without a final sep is not lost
            result[idx] = data.Substring(pos);
            idx++;
            continue;
        }

        // Search backwards for the first sep starting from the lastSepPos
        char c = data[lastSepPos];
        while (c != sep) { lastSepPos--; c = data[lastSepPos]; }

        // Get the block data (including the sep) into the result array
        result[idx] = data.Substring(pos, (lastSepPos + 1) - pos);

        // Reposition for the next block
        idx++;
        pos = lastSepPos + 1;
        lastSepPos = blockLen * (idx + 1);
    }
    return result;
}
Please test it. I have not fully tested for fringe cases.
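For instance, a quick usage check (my own example data):
string data = "1,a,b:2,c,d:3,e,f:4,g,h:5,i,j";
string[] blocks = splitToBlocks(data, 4, ':');
foreach (var block in blocks)
    Console.WriteLine(block); // "1,a,b:", "2,c,d:", "3,e,f:", "4,g,h:5,i,j" - each block ends on a record boundary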
OK, I suggest a way with two steps:
Split string into chunks (see below)
Check chunks for completeness
Splitting the string into chunks with the help of LINQ (the extension method is taken from Split a collection into `n` parts with LINQ?):
string tcpstring = "chunk1 : chunck2 : chunk3: chunk4 : chunck5 : chunk6";
int numOfChunks = 4;

var chunks = (from string z in (tcpstring.Split(':').AsEnumerable()) select z).Split(numOfChunks);

List<string> result = new List<string>();
foreach (IEnumerable<string> chunk in chunks)
{
    result.Add(string.Join(":", chunk));
}
.......

static class LinqExtensions
{
    public static IEnumerable<IEnumerable<T>> Split<T>(this IEnumerable<T> list, int parts)
    {
        int i = 0;
        var splits = from item in list
                     group item by i++ % parts into part
                     select part.AsEnumerable();
        return splits;
    }
}
Did I understand your aims correctly?
[EDIT]
In my opinion, if performance is a consideration, the better way is to use the String.Split method for chunking.
It seems you want to split on ":" (you can use the Split method).
Then you have to add ":" after splitting to each chunk that has been split.
(you can then split on "," for all the strings that have been split by ":".
int index = yourstring.IndexOf(":");
string[] whatever = string.Substring(0,index);
yourstring = yourstring.Substring(index); //make a new string without the part you just cut out.
this is a general view example, all you need to do is establish an iteration that will run while the ":" character is encountered; cheers...
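A rough sketch of that iteration (my own, reusing the yourstring variable from above): peel off one ':'-terminated record at a time until no separator is left.
var records = new List<string>();
string rest = yourstring;
int index;
while ((index = rest.IndexOf(":")) >= 0)
{
    records.Add(rest.Substring(0, index + 1)); // keep the ':' with its record
    rest = rest.Substring(index + 1);
}
if (rest.Length > 0)
    records.Add(rest); // trailing record that has no ':' terminator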
