C# Int vs Byte performance & SQL Int vs Binary performance - c#

In a C# Windows app I handle HEX strings. A single HEX string will have 5-30 HEX parts.
07 82 51 2A F1 C9 63 69 17 C1 1B BA C7 7A 18 20 20 8A 95 7A 54 5A E0 2E D4 3D 29
Currently I take this string and parse it into N integers using Convert.ToInt32(string, 16). I then add these int values to a database. When I extract these values from the database, I read them as ints and then convert them back into a HEX string.
Would it be better, performance-wise, to convert these strings to bytes and then add them as binary data types within the database?
EDIT:
The 5-30 HEX parts correspond to specific tables, where all the parts together make up 1 record. For instance, if I had 5 HEX values, they would correspond to 5 separate columns of 1 record.
EDIT:
To clarify (sorry):
I have 9 tables. Each table has a set number of columns.
table1:30
table2:18
table3:18
table4:18
table5:18
table6:13
table7:27
table8:5
table9:11
Each of these columns in every table corresponds to a specific HEX value.
For example, my app will receive a "payload" of 13 HEX components in a single string format: 07 82 51 2A F1 C9 63 69 17 C1 1B BA C7. Currently I take this string and parse the individual HEX components and convert them to ints, storing them in an int array. I then take these int values and store them in the corresponding table and columns in the database. When I read these values I get them as ints and then convert them to HEX strings.
What I am wondering is if I should convert the HEX string into a byte array and store the bytes as SQL binary variable types.
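For reference, here is a minimal sketch of the two parsing approaches I am comparing; the payload string matches the example above, and the rest is just illustrative:

using System;

class HexParsingSketch
{
    static void Main()
    {
        string payload = "07 82 51 2A F1 C9 63 69 17 C1 1B BA C7";
        string[] parts = payload.Split(' ');

        // Current approach: each HEX part becomes a 4-byte int.
        int[] asInts = new int[parts.Length];
        for (int i = 0; i < parts.Length; i++)
            asInts[i] = Convert.ToInt32(parts[i], 16);

        // Alternative being considered: each HEX part becomes a single byte.
        byte[] asBytes = new byte[parts.Length];
        for (int i = 0; i < parts.Length; i++)
            asBytes[i] = Convert.ToByte(parts[i], 16);

        // Converting back to the HEX-string form on the way out.
        Console.WriteLine(BitConverter.ToString(asBytes).Replace("-", " "));
    }
}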

Well in terms of performance, you should of course test both ways.
However, in terms of readability, if this is just arbitrary data, I'd certainly suggest using a byte array. If it's actually meant to represent a sequence of integers, that's fine - but why would you represent an arbitrary byte array using a collection of 4-byte integers? It doesn't fit in well with anything else:
You have to consider padding if your input data isn't a multiple of 4 bytes
It's a pain to work with in terms of reading and writing the data with streams
It's not clear how you're storing the integers in the database, but I'd expect a blob to be more efficient if you're just trying to store the whole thing
I would suggest writing the code the more natural way, keeping your data close to the kind of thing it's really trying to represent, and then measuring the performance. If it's good enough, then you don't need to look any further. If it's not, you'll have a good basis for tweaking.
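As a rough illustration of the byte-array route, here is a hedged sketch that stores a whole payload in a single VARBINARY column; the table name, column name and connection handling are assumptions, not taken from your schema:

using System;
using System.Data;
using System.Data.SqlClient;

class VarBinarySketch
{
    static void StorePayload(string connectionString, byte[] payload)
    {
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand("INSERT INTO Payloads (Data) VALUES (@data)", conn))
        {
            // Map the byte array straight onto a VARBINARY parameter.
            cmd.Parameters.Add("@data", SqlDbType.VarBinary, payload.Length).Value = payload;
            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }
}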

Yes, by far. Inserting many small rows is far worse than inserting a few bigger rows.

A data model often depends not on just how you want to write, but also how you want to find and read the data.
Some considerations:
If you ever have a need to find a particular "HEX part", even when not at the start of the "HEX string", then each "HEX part" will need to be in a separate row so a database index can pick it up.
Depending on your DBMS/API, it may not be easy to seek through a BLOB or byte array. This may be important for loading non-prefix "HEX parts" or performing modifications in the middle of the "HEX string".
If the "HEX string" needs to be a PRIMARY, UNIQUE or FOREIGN KEY, or needs to be searchable by prefix, then you'll typically need a database type that is actually indexable (BLOBs typically aren't, but most DBMSes have alternate types for smaller byte arrays that are).
All in all, a byte array is probably what you need, but beware of the considerations above.

Related

Surrogate Key Generator state file, is there a way to read file from another prog language?

Good day. I am trying to find a way to read the surrogate key state file, to see what its current value is and how to change it. The problem is that the database is constantly being refreshed, and I need a mechanism to get the max value from the table and then set the surrogate key state file accordingly.
From what I have been reading, it's not like the dataset (.ds) files, where you can use the DataStage Designer tool to read them. I tried making a small C# application to read it as a binary file. Various articles explain that it is an unsigned 64-bit integer. Still, when I try to read it, it gives a seemingly random set of numbers: it starts with one, then numbers ending in 999, and then it repeats. I tried reading it with the BitConverter class, but no luck either.
So far the only solution I have seen is to create a parallel or sequential job that gets the max number from the database and then creates the surrogate key with it as explained in http://it.toolbox.com/blogs/infosphere/datastage-8-tutorial-surrogate-key-state-files-17403.
I am not the first one to try changing it through code and was curious if there was some way to do it.
using DataStage 8.7
Tried with C# BinaryReader.ReadUInt64, BinaryReader.ReadInt64 and BitConverter.ToUInt64.
Update 2016-10-19:
The partial answer is that it can be read as a binary file. It is divided into 4 sets of 8 bytes, something like this (you can see it with a hex editor):
01 00 00 00 00 00 00 00
00 00 00 00 00 00 00 05
00 00 00 00 00 00 00 08
00 00 00 00 00 00 00 08
The first set I think is the incremental number (+1, +5, etc.).
The second set is the initial value.
The third set is the next number to assign.
The fourth set I think is the end of the batch to assign. If you are doing batches of 10, then the third is 10 and the fourth is 20, or at least that is how I think it works.
So to read it from code, you need to use a binary reader and take sets of 8 bytes to convert to UInt64.
The question still stands because I am not sure what they mean.
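For reference, a minimal sketch of reading the file in 8-byte sets as described above; whether each set needs a byte-order swap is exactly what I have not been able to confirm, so it prints both interpretations:

using System;
using System.IO;

class StateFileReaderSketch
{
    static void Main()
    {
        // The path is an assumption; point it at your surrogate key state file.
        string path = @"C:\temp\skg_state_file";

        using (var reader = new BinaryReader(File.OpenRead(path)))
        {
            while (reader.BaseStream.Position + 8 <= reader.BaseStream.Length)
            {
                byte[] set = reader.ReadBytes(8);

                // Some sets in the dump look little-endian and others big-endian,
                // so show the value both ways.
                byte[] swapped = (byte[])set.Clone();
                Array.Reverse(swapped);

                Console.WriteLine("raw: {0}  byte-swapped: {1}",
                    BitConverter.ToUInt64(set, 0),
                    BitConverter.ToUInt64(swapped, 0));
            }
        }
    }
}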
Why do you think you need the Surrogate Key Generator stage?
An alternative and much simpler solution would be a Transformer stage to do the numbering and a Sequential File (or parameter) to initialize it. The new number - after processing - could then be written back to the database sequence. So you just need to handle flat files, with no programming.
To generate unique numbers in a transformer (parallel) you have to consider the partitions - this formula would do:
(@NUMPARTITIONS * (@INROWNUM - 1)) + @PARTITIONNUM + Max_Field1
The reason for all this is that I am looking for a bug in the Surrogate Key State File.
So when you go to the complex transformer (sorry there are no pictures), you will go to the properties and see it has a Surrogate Key Tab. You have three settings. One is for the file, one is for the initial value and another is for the block size.
The file is where the current surrogate key is stored; I will explain shortly how it is formatted. The initial value is the number you wish to start from, and the block size reserves a group of numbers for your transformer.
The file is formatted in 16-byte increments: the first value is the current number (the next number to assign is this number + 1), and the second is the end of the block. It will only be 16 bytes when you do not define the initial value or the block size; if you define these, it will be 32 bytes, where the last two values are the current number and the block end.
So when you have two or more transformers using the same file, each one is assigned a block that still has numbers available before getting a new block, increasing the file size by 16 more bytes if needed.
So here is the error: when you do not define the block size but do define an initial value, the system block size will be around 1000 or so. Let's say you do a small example where all you have is a Row Generator connected to a Transformer that ends in a Sequential File, and all you need is one row. Execute it many times with an initial value of 200. You will get 200, 201, 203, 204, (1), 205. For some reason it is buggy in DataStage 8.7: when you do not define the block size, it eventually returns back to one.
I hope this research on the subject will help someone because I looked and looked and there was not much on how to best use Surrogate Keys.
If you wish for the error to happen faster, just delete the file and create a new one with C#, assigning 4 UInt64 values saved as bytes, for example 1, 1, 200, 300 (a sketch follows below). Eventually it will do what I described.
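A hedged sketch of what that could look like; the path, the byte order and the example values (1, 1, 200, 300) are assumptions based on the format guessed above:

using System;
using System.IO;

class StateFileWriterSketch
{
    static void Main()
    {
        // Path and values are assumptions; adjust them to your environment.
        string path = @"C:\temp\skg_state_file";
        ulong[] values = { 1, 1, 200, 300 };

        using (var writer = new BinaryWriter(File.Create(path)))
        {
            foreach (ulong value in values)
            {
                byte[] bytes = BitConverter.GetBytes(value);
                // If the file turns out to be big-endian, reverse here:
                // Array.Reverse(bytes);
                writer.Write(bytes);
            }
        }
    }
}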

Read Fortran binary file into C# without knowledge of Fortran source code?

Part one of my question is whether this is even possible. I will briefly describe my situation first.
My work has a licence for software that performs a very specific task; however, most of our time is spent exporting data from the results into Excel etc. to perform further analysis. I was wondering if it is possible to dump all of the data into a C# object so that I can then write my own analysis code, which would save us a lot of time.
The software we licence was written in Fortran, but we have no access to the source code. The file looks like it is written out in binary, but I do not know if it is unformatted / sequential etc. (is there any way to discern this?).
I have used some of the other answers on this site to successfully read in the data to a byte[], however this is as far as I have got. I have tried to change portions to doubles (which I assume most of the data is) but the numbers do not strike me as being meaningful (most appear too large or too small).
I have the documentation for the software and I can see that most of the internal variable names are 8 character strings, would this be saved with the data? If not I think it would be almost impossible to match all the data to its corresponding variable. I imagine most of the data will be double arrays of the same length (the number of time points), however there will also be some arrays with a longer length as some data would have been interpolated where shorter time steps were needed for convergence.
Any tips or hints would be appreciated, or even if someone tells me its just not possible so I don't waste any more time trying to solve this.
Thank you.
If it were formatted, you would be able to read it with a text editor: the numbers would be written in plain text.
So yes, it's probably unformatted.
There are different methods still. The file can have a fixed record length, or it might have a variable one.
But it seems to me that the first 4 bytes represent an integer containing the length of that record in bytes. For example, here I've written the numbers 1 to 10, and then 11 to 30 into an unformatted file, and the file looks like this:
40 1 2 3 4 5 6 7 8 9 10 40
80 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 80
(I added the new line) In here, the first 4 bytes represent the number 40, followed by 10 4-byte blocks representing the numbers 1-10, followed by another 40. The next record starts with an 80, and 20 4-byte blocks containing the numbers 11 through 30, followed by another 80.
So that might be a pattern you could try to see. Read the first 4 bytes and convert them to integer, then read that many bytes and convert them to whatever you think it should be (4 byte float, 8 byte float, et cetera), and then check whether the next 4 bytes again represents the number that you read first.
But there are other methods to write data in Fortran that don't have this behaviour, for example direct access and stream I/O. So no guarantees.
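A hedged C# sketch of checking for that pattern; it assumes 4-byte little-endian record markers and, purely for illustration, 4-byte integer payloads (your data may well be 8-byte floats instead):

using System;
using System.IO;

class FortranRecordSketch
{
    static void Main()
    {
        // The file name is an assumption.
        using (var reader = new BinaryReader(File.OpenRead("results.dat")))
        {
            while (reader.BaseStream.Position < reader.BaseStream.Length)
            {
                int length = reader.ReadInt32();          // leading record marker
                byte[] record = reader.ReadBytes(length); // record payload
                int trailing = reader.ReadInt32();        // trailing record marker

                if (trailing != length)
                    throw new InvalidDataException("Doesn't look like a sequential unformatted record.");

                // Interpret the payload as 4-byte integers; swap in
                // BitConverter.ToDouble with 8-byte steps for doubles.
                for (int offset = 0; offset + 4 <= length; offset += 4)
                    Console.Write(BitConverter.ToInt32(record, offset) + " ");
                Console.WriteLine();
            }
        }
    }
}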

Send float array from C++ server to C# client

I'm trying to send some data from a C++ server to a C# client. I was able to send over char arrays. But there is some problem with float array.
This is the code on the C++ server side
float* arr;
arr = new float[12];
//array init...
if((bytecount = send(*csock, (const char*)arr, 12*sizeof(float), 0))==SOCKET_ERROR){
}
So yes, I'm trying to send over a float array of size 12.
Here's the code for the client side. (It seemed strange that there was no easy way to get the floats out of the stream in the first place. I have never used C# before and maybe there's something better?)
//get the data in a char array
streamReader.Read(temp, 0, temp.Length);
//**the problem lies right here in receiving the data itself
//now convert the char array to byte array
for (int i = 0; i < (elems*4); i++) //elems = size of the float array
{
byteArray = BitConverter.GetBytes(temp[i]);
byteMain[i] = byteArray[0];
}
//finally convert it to a float array
for (int i = 0; i < elems; i++)
{
float val = BitConverter.ToSingle(byteMain, i * 4);
myarray[i] = val;
}
Let's look at the memory dump on both sides and the problem will be clear:
//c++ bytes corresponding to the first 5 floats in the array
//(2.1 9.9 12.1 94.9 2.1 ...)
66 66 06 40 66 66 1e 41 9a 99 41 41 cd cc bd 42 66 66 06 40
//c# - this is what i get in the byteMain array
66 66 06 40 66 66 1e 41 fd fd 41 41 fd 3d ? 42 66 66 06 40
There are 2 problems here on the C# side:
1) It does not handle anything above 0x80 (above 127) (incompatible structures?)
2) For some unbelievable reason it drops a byte!
And this happens in 'temp' right at the time of receiving the data.
I've been trying to figure something out but nothing still.
Do you have any idea why this might be happening? I'm sure I'm doing something wrong...
Suggestions for a better approach?
Thanks a lot
It's not clear from your code what the streamReader variable points to (i.e. what's its type?), but I would suggest you use a BinaryReader instead. That way, you can just read data one float at a time and never bother with the byte[] array at all:
var reader = new BinaryReader(/* put source stream here */);
var myFloat = reader.ReadSingle();
// do stuff with myFloat...
// then you can read another
myFloat = reader.ReadSingle();
// etc.
Different readers do different things with data. For instance the text reader (and stream reader) will assume all is text in a particular encoding (like UTF-8) and may interpret the data in a way you didn't expect. The BinaryReader will not do that as it was designed to let you specify exactly the data types you want to read out of your stream.
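For the 12-float payload in the question, the receiving loop might look roughly like this; the NetworkStream parameter is an assumption about where your data is coming from:

using System.IO;
using System.Net.Sockets;

class FloatReceiveSketch
{
    static float[] ReadFloats(NetworkStream stream, int elems)
    {
        var values = new float[elems];
        using (var reader = new BinaryReader(stream))
        {
            // BinaryReader hands back the raw bytes, so values above 0x7F
            // are not mangled the way a text reader would mangle them.
            for (int i = 0; i < elems; i++)
                values[i] = reader.ReadSingle();
        }
        return values;
    }
}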
I'm not sure about C#, but C++ makes no guarantees about the internal, binary representations of floats (or any other data type). For all you know, 0.42 might be represented using these 4 bytes: '0', '.', '4', '2'.
The easiest solution would be transferring human-readable strings such as "2.1 9.9 12.1 94.9 2.1" and using cin/cout/printf/scanf and friends.
When sending data over a network, you should always convert your numbers into a common format and then read them back. In other terms, any data other than raw bytes should be encapsulated, regardless of your programming languages. I cannot comment on what is wrong with your code, but this might solve your issue and will save some headaches later on. Consider whether the architecture is 64-bit or uses a different endianness.
EDIT:
I guess your problem lies with signed vs. unsigned values and can be solved with Isak's answer, but still mind what I said above.
If you need help with encapsulation, check Beej's network guide. It should have a sample of how to encode floats over the network.
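As a sketch of what such encapsulation could look like on the C# side, assuming both ends agree to transmit floats in big-endian ("network") byte order:

using System;

class NetworkOrderSketch
{
    // Convert 4 big-endian bytes starting at offset into a float.
    static float ReadBigEndianSingle(byte[] buffer, int offset)
    {
        byte[] slice = new byte[4];
        Array.Copy(buffer, offset, slice, 0, 4);

        // BitConverter uses the host byte order, so swap on little-endian hosts.
        if (BitConverter.IsLittleEndian)
            Array.Reverse(slice);

        return BitConverter.ToSingle(slice, 0);
    }
}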

Wasted bytes in protocol buffer arrays?

I have a protocol buffer setup like this:
[ProtoContract]
public class Foo
{
    [ProtoMember(1)]
    public Bar[] Bars;
}
A single Bar gets encoded to a 67 byte protocol buffer. This sounds about right because I know that a Bar is pretty much just a 64 byte array, and then there are 3 bytes overhead for length prefixing.
However, when I encode a Foo with an array of 20 Bars it takes 1362 bytes. 20 * 67 is 1340, so there are 22 bytes of overhead just for encoding an array!
Why does this take up so much space? And is there anything I can do to reduce it?
This overhead is quite simply the information it needs to know where each of the 20 objects starts and ends. There is nothing I can do differently here without breaking the format (i.e. doing something contrary to the spec).
If you really want the gory details:
An array or list is (if we exclude "packed", which doesn't apply here) simply a repeated block of sub-messages. There are two layouts available for sub-messages; strings and groups. With a string, the layout is:
[header][length][data]
where header is the varint-encoded mash of the wire-type and field-number (hex 0A in this case, for field 1), length is the varint-encoded size of data, and data is the sub-object itself. For small objects (data less than 128 bytes) this often means 2 bytes of overhead per object, depending on a: the field number (fields above 15 take more space), and b: the size of the data.
With a group, the layout is:
[header][data][footer]
where header is the varint-encoded mash of the wire-type and field-number (hex 0B in this case with field 1), data is the sub-object, and footer is another varint mash to indicate the end of the object (hex 0C in this case with field 1).
Groups are less favored generally, but they have the advantage that they don't incur any overhead as data grows in size. For small field-numbers (less than 16) again the overhead is 2 bytes per object. Of course, you pay double for large field-numbers, instead.
By default, arrays aren't actually passed as arrays, but as repeated members, which have a little more overhead.
So I'd guess you actually have 1 byte of overhead for each repeated array element, plus 2 extra bytes overhead on top.
You can lose the overhead by using a "packed" array. protobuf-net supports this: http://code.google.com/p/protobuf-net/
The documentation for the binary format is here: http://code.google.com/apis/protocolbuffers/docs/encoding.html
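If you want to see the overhead for yourself, here is a hedged sketch using protobuf-net's Serializer that compares the size of one Bar against a Foo holding 20 of them; the Bar definition (a 64-byte array) is only a stand-in for the real type:

using System;
using System.IO;
using ProtoBuf;

[ProtoContract]
public class Bar
{
    [ProtoMember(1)]
    public byte[] Payload = new byte[64]; // stand-in for the real 64-byte content
}

[ProtoContract]
public class Foo
{
    [ProtoMember(1)]
    public Bar[] Bars;
}

class OverheadSketch
{
    static long SizeOf<T>(T obj)
    {
        using (var ms = new MemoryStream())
        {
            Serializer.Serialize(ms, obj);
            return ms.Length;
        }
    }

    static void Main()
    {
        long single = SizeOf(new Bar());

        var foo = new Foo { Bars = new Bar[20] };
        for (int i = 0; i < 20; i++)
            foo.Bars[i] = new Bar();
        long twenty = SizeOf(foo);

        Console.WriteLine("one Bar: {0}, Foo with 20 Bars: {1}, apparent overhead: {2}",
            single, twenty, twenty - 20 * single);
    }
}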

Binary to Ascii and back again

I'm trying to interface with a hardware device via the serial port. When I use software like Portmon to see the messages they look like this:
42 21 21 21 21 41 45 21 26 21 29 21 26 59 5F 41 30 21 2B 21 27
42 21 21 21 21 41 47 21 27 21 28 21 27 59 5D 41 32 21 2A 21 28
When I run them through a hex-to-ASCII converter the commands don't make sense. Are these messages in fact something other than hex? My hope was to see the messages the device is passing and emulate them using C#. What can I do to find out exactly what the messages are?
Does the hardware device specify a protocol? Just because it's a serial port connection doesn't mean it has to be ASCII/readable English text. It could just as well be a sequence of bytes where, for example, 42 is a command and 21 21 21 21 is data for that command. It could be an initialization sequence or whatever.
At the end of the day, all you work with is a series of bytes. The meaning of them can be found in a protocol specification or if you don't have one, you need to manually look at each command. Issue a command to the device, capture the input, issue another command.
Look for patterns. Common Initialization? What could be the commands? What data gets passed?
Yes, it's tedious, but reverse engineering is rarely easy.
The ASCII for the Hex is this:
B!!!!AE!&!)!&Y_A0!+!'
B!!!!AG!'!(!'Y]A2!*!(
That does look like some sort of protocol to me, with some Initialization Sequence (B!!!!) and commands (AE and AG), but that's just guessing.
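That rendering is easy to reproduce, assuming the capture really is space-separated hex bytes:

using System;
using System.Linq;
using System.Text;

class HexToAsciiSketch
{
    static void Main()
    {
        string line = "42 21 21 21 21 41 45 21 26 21 29 21 26 59 5F 41 30 21 2B 21 27";

        byte[] bytes = line.Split(' ')
                           .Select(part => Convert.ToByte(part, 16))
                           .ToArray();

        // Prints B!!!!AE!&!)!&Y_A0!+!'
        Console.WriteLine(Encoding.ASCII.GetString(bytes));
    }
}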
The device is sending data to the computer. All digital data has the form of ones and zeroes, such as 10101001010110010... . Most often one combines groups of eight such bits (binary digits) into bytes, so all data consists of bytes. One byte can thus represent any of the 2^8 values 0 to 2^8 - 1 = 255, or, in hexadecimal notation, any of the numbers 0x00 to 0xFF.
Sometimes the bytes represent a string of alphanumerical (and other) characters, often ASCII encoded. This data format assigns a character to each value from 0 to 127. But all data is not ASCII-encoded characters.
For instance, if the device is a light-intensity sensor, then each byte could give the light intensity as a number between 0 (pitch-black) and 255 (as bright as it gets). Or, the data could be a bitmap image. Then the data would start with a couple of well-defined structures (namely this and this) specifying the colour depth (number of bits per pixel, i.e. more or less the number of colours), the width, the height, and the compression of the bitmap. Then the pixel data would begin. Typically the bytes would go BBGGRRBBGGRRBBGGRR where the first BB is the blue intensity of the first pixel, the first GG is the green intensity of the first pixel, the first RR is the red intensity of the first pixel, the second BB is the blue intensity of the second pixel, and so on.
In fact the data could mean anything. What kind of device is it? Does it have an open specification?
