I got an old LED board to which you send some text and then hang it up somewhere. It was manufactured in 1994/95 and communicates over a serial port with a 16-bit MS-DOS application in which you type in the text.
Because you probably can't run that application anywhere today except under DOSBox or similar tricks, I decided to rewrite it in C#.
After port-monitoring the original DOS exe I found that it is really not interested in being rebuilt: requests must be answered suitably, some bytes vary, "ping" messages are sent beforehand, etc.
Maybe you know a checksum routine or pattern similar to the one my DOS exe uses, or could give me some tips on reverse-engineering it. I am only familiar with programming and haven't spent much time on reversing methods or analyzing protocols, so please don't judge me if this is a bit of a stupid idea. I'll be glad about any help I get.
The message that actually contains the text to be displayed is 143 bytes long (it is always that long because filler bytes are added if your text doesn't use up all the space), and in that message I noticed the following patterns:
The fourth byte (which still belongs to the message header) cycles through a list of 6 or 7 repeating values (in my examples, that byte will always be 0F).
The last two bytes function as a checksum.
Some examples:
displayed text: "123" (hex: "31 32 33"), checksum hex: "45 52"
text: "132" ("31 33 32"), checksum hex: "55 FF"
text: "122" ("31 32 32"), checksum hex: "95 F4"
text: "133" ("31 33 33"), checksum hex: "85 59"
text: "112" ("31 31 32"), checksum hex: "C5 C8"
text: "124" ("31 32 34"), checksum hex: "56 62"
text: "134" ("31 33 34"), checksum hex: "96 69"
text: "211" ("32 31 31"), checksum hex: "5D 63"
text: "212" ("32 31 32"), checksum hex: "3C A8"
text: {empty}, checksum hex: "DB BA"
text: "1" ("31"), checksum hex: "AE 5F"
So far I am completely sure that the checksum really does depend on this fourth byte in the header, because if it changes, the checksums will be completely different for the same text to be displayed.
Here's an example of a full 143-byte message displaying "123", just to give you better orientation:
02 86 04 0F 05 03 01 03 01 03 01 03 00 01 03 00 ...............
00 31 00 32 00 33 00 20 00 20 00 20 00 20 00 20 .1.2.3. . . . .
00 20 00 20 00 20 00 20 00 20 00 20 00 20 00 20 . . . . . . . .
00 20 00 20 00 20 00 20 00 20 00 20 00 20 00 20 . . . . . . . .
00 20 00 20 00 20 00 20 00 20 00 FE 03 01 03 01 . . . . . .þ....
04 01 03 00 01 03 00 00 20 00 20 00 20 00 20 00 ........ . . . .
20 00 20 00 20 00 20 00 20 00 20 00 20 00 20 00 . . . . . . . .
20 00 20 00 20 00 20 00 20 00 20 00 20 00 20 00 . . . . . . . .
20 00 20 00 20 00 20 00 20 00 20 00 20 45 52
(the text information starts with the 2nd byte in the 2nd line: "31 00 32 00 33 00 (...)")
Unfortunately, on the whole web there are no user manuals, no documentation, not even any real evidence that this info board device ever existed.
I'll write F(s) for the checksum you get when feeding in string s.
Observe that:
F("122") xor F("123") = 95 F4 xor 45 52 = D0 A6
F("132") xor F("133") = 55 FF xor 85 59 = D0 A6
F("123") xor F("124") = 45 52 xor 56 62 = 13 30
F("133") xor F("134") = 85 59 xor 96 69 = 13 30
all of which is consistent with the checksum having the following property, which checksums not infrequently have: changing a given bit in the input always XORs the output with the same thing.
I predict, e.g., that F("210") = F("211") xor D0 A6 = 8D C5, and similarly that F("222") = 3C A8 xor C5 C8 xor 95 F4 = 6C 94.
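The observations and predictions above can be re-derived mechanically from the samples in the question. The following is only a sketch: the table holds the checksums as posted (all with header byte 0F), and Delta is a helper name introduced here for illustration.

```csharp
using System;
using System.Collections.Generic;

// Checksums exactly as posted in the question, keyed by the displayed text.
var observed = new Dictionary<string, ushort>
{
    ["123"] = 0x4552, ["132"] = 0x55FF, ["122"] = 0x95F4,
    ["133"] = 0x8559, ["112"] = 0xC5C8, ["124"] = 0x5662,
    ["134"] = 0x9669, ["211"] = 0x5D63, ["212"] = 0x3CA8,
};

// XOR of two observed checksums. If the conjecture holds, this depends only
// on which input bits differ between the two texts.
ushort Delta(string x, string y) => (ushort)(observed[x] ^ observed[y]);

// Last character 2 -> 3 (bit 0 flipped): same delta in both pairs.
Console.WriteLine($"{Delta("122", "123"):X4} {Delta("132", "133"):X4}"); // D0A6 D0A6
// Last character 3 -> 4: same delta in both pairs.
Console.WriteLine($"{Delta("123", "124"):X4} {Delta("133", "134"):X4}"); // 1330 1330
// Middle character 2 -> 3: again one delta, whatever the last character is.
Console.WriteLine($"{Delta("123", "133"):X4} {Delta("122", "132"):X4} {Delta("124", "134"):X4}"); // C00B C00B C00B

// The two predictions, computed only from observed values.
ushort f210 = (ushort)(observed["211"] ^ Delta("122", "123"));
ushort f222 = (ushort)(observed["212"] ^ observed["112"] ^ observed["122"]);
Console.WriteLine($"F(210) = {f210:X4}, F(222) = {f222:X4}"); // F(210) = 8DC5, F(222) = 6C94
```

Whether the board really accepts 8D C5 and 6C 94 for those texts is exactly what would confirm or refute the conjecture.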
If this is true, then the following gives you a brute-force-y way to figure out the checksum in general, provided you have a black box that computes checksums for you (which apparently you have):
1. Find the checksum of an input all of whose bits are 0. Call this a.
2. For each bit position k, find the checksum of an input all of whose bits are 0 except for bit k, which is 1. Call this a XOR b(k).
3. Now the checksum of an arbitrary input is a XORed with every b(k) for which bit k is set in the input.
Usually the b(k) will be closely related to one another -- the usual pattern is that you're feeding bits into a shift register -- so the above is more brute-force-y than you'd need given an understanding of the algorithm. But I expect it works, if you are able to feed in arbitrarily chosen bit patterns as input.
If not, you may still be able to do it. E.g., suppose all you actually get to choose is 29 7-bit ASCII character values, at positions 17,19,...73 of your input. Then you can first of all feed in all spaces (0x20) and then XOR each in turn with 1-bits in positions 0..6. That won't give you all the b(k) but it will give you enough for arbitrary 29-ASCII-character inputs.
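A rough sketch of the all-spaces variant follows. The real black box is the DOS program, which is not available here, so Oracle below is a stand-in: a common CRC-16 that has the same XOR property. It is an assumption used only to exercise the method and is not claimed to be the board's algorithm. Pad, Predict and the variable names are likewise made up for this example.

```csharp
using System;
using System.Text;

const int Chars = 29; // number of text positions assumed to be selectable

// STAND-IN black box (assumption): a CRC-16 over the 29 text bytes.
// Replace this with whatever actually yields the real checksums.
ushort Oracle(byte[] text)
{
    ushort crc = 0xFFFF;
    foreach (byte octet in text)
    {
        crc ^= (ushort)(octet << 8);
        for (int i = 0; i < 8; i++)
            crc = (ushort)((crc & 0x8000) != 0 ? (crc << 1) ^ 0x1021 : crc << 1);
    }
    return crc;
}

// Pads an ASCII text (at most 29 characters) with spaces to the fixed length.
byte[] Pad(string s)
{
    byte[] buf = new byte[Chars];
    for (int i = 0; i < Chars; i++) buf[i] = 0x20;
    Encoding.ASCII.GetBytes(s, 0, s.Length, buf, 0);
    return buf;
}

// Step 1: checksum of the all-spaces baseline.
byte[] blank = Pad("");
ushort baseSum = Oracle(blank);

// Step 2: one probe per character position and bit 0..6, stored as a delta
// against the baseline.
var basis = new ushort[Chars, 7];
for (int pos = 0; pos < Chars; pos++)
    for (int bit = 0; bit < 7; bit++)
    {
        byte[] probe = (byte[])blank.Clone();
        probe[pos] ^= (byte)(1 << bit);
        basis[pos, bit] = (ushort)(Oracle(probe) ^ baseSum);
    }

// Step 3: XOR together the deltas of every bit that differs from a space.
ushort Predict(string s)
{
    byte[] padded = Pad(s);
    ushort sum = baseSum;
    for (int pos = 0; pos < Chars; pos++)
    {
        int diff = padded[pos] ^ 0x20;
        for (int bit = 0; bit < 7; bit++)
            if ((diff & (1 << bit)) != 0) sum ^= basis[pos, bit];
    }
    return sum;
}

Console.WriteLine(Predict("123") == Oracle(Pad("123"))); // True
```

This needs 1 + 29 * 7 = 204 queries to the black box, after which any 29-character ASCII text can be predicted without further queries, as long as the header bytes stay the same.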
What is the memory layout of a .NET array?
Take for instance this array:
Int32[] x = new Int32[10];
I understand that the bulk of the array is like this:
0000111122223333444455556666777788889999
Where each character is one byte, and the digits correspond to indices into the array.
Additionally, I know that there is a type reference, and a syncblock-index for all objects, so the above can be adjusted to this:
ttttssss0000111122223333444455556666777788889999
^
+- object reference points here
Additionally, the length of the array needs to be stored, so perhaps this is more correct:
ttttssssllll0000111122223333444455556666777788889999
^
+- object reference points here
Is this complete? Is there more data in an array?
The reason I'm asking is that we're trying to estimate how much memory a couple of different in-memory representations of a rather large data corpus will take. The size of the arrays varies quite a bit, so the overhead might have a large impact in one solution but not so much in the other.
So basically my question is: how much overhead is there for an array?
And before the "arrays are bad" squad wakes up: this part of the solution is a static build-once-reference-often type of thing, so growable lists are not necessary here.
One way to examine this is to look at the code in WinDbg. So given the code below, let's see how that appears on the heap.
var numbers = new Int32[] { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 };
The first thing to do is to locate the instance. As I have made this a local in Main(), it is easy to find the address of the instance.
From the address we can dump the actual instance, which gives us:
0:000> !do 0x0141ffc0
Name: System.Int32[]
MethodTable: 01309584
EEClass: 01309510
Size: 52(0x34) bytes
Array: Rank 1, Number of elements 10, Type Int32
Element Type: System.Int32
Fields:
None
This tells us that it is our Int32 array with 10 elements and a total size of 52 bytes.
Let's dump the memory where the instance is located.
0:000> d 0x0141ffc0
0141ffc0 [84 95 30 01 0a 00 00 00-00 00 00 00 01 00 00 00 ..0.............
0141ffd0 02 00 00 00 03 00 00 00-04 00 00 00 05 00 00 00 ................
0141ffe0 06 00 00 00 07 00 00 00-08 00 00 00 09 00 00 00 ................
0141fff0 00 00 00 00]a0 20 40 03-00 00 00 00 00 00 00 00 ..... @.........
01420000 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00 ................
01420010 10 6d 99 00 00 00 00 00-00 00 01 40 50 f7 3d 03 .m.........@P.=.
01420020 03 00 00 00 08 00 00 00-00 01 00 00 00 00 00 00 ................
01420030 1c 24 40 03 00 00 00 00-00 00 00 00 00 00 00 00 .$@.............
I have inserted brackets for the 52 bytes.
The first four bytes are the reference to the method table at 01309584.
Then four bytes for the Length of the array.
Following that are the numbers 0 to 9 (each four bytes).
The last four bytes are null. I'm not entirely sure, but I guess that must be where the reference to the syncblock array is stored if the instance is used for locking.
Edit: Forgot length in first posting.
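The listing is slightly incorrect: as romkyns points out, the instance actually begins at the address minus 4, and the first field is the sync block. The four null bytes at the end of the bracketed range therefore belong to the next object, not to this array.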
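For the estimate the question asks about, the layout seen in this dump can be turned into a small calculation. This is a sketch under stated assumptions: a 32-bit CLR, a one-dimensional array of value-type elements, and 4-byte object alignment. EstimateArrayBytes32 is a hypothetical helper, and other runtimes or element types may carry different overhead.

```csharp
using System;

// Estimated heap size of a one-dimensional value-type array, ASSUMING the
// 32-bit layout from the dump: sync block + method table pointer + length,
// followed by the elements.
long EstimateArrayBytes32(long length, int elementSize)
{
    const int header = 4 + 4 + 4;        // sync block, method table, length
    long raw = header + length * elementSize;
    return (raw + 3) & ~3L;              // round up to a multiple of 4 (assumed alignment)
}

Console.WriteLine(EstimateArrayBytes32(10, sizeof(int))); // 52, the size !do reported
```

Under these assumptions the fixed overhead is 12 bytes per array, so it only matters when the corpus is split into many small arrays.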
Great question. I found this article which contains block diagrams for both value types and reference types. Also see this article in which Richter states:
[snip] each array has some additional overhead information associated with it. This information contains the rank of the array (number of dimensions), the lower bounds for each dimension of the array (almost always 0), and the length of each dimension. The overhead also contains the type of each element in the array.
Great question! I wanted to see it for myself, and it seemed a good opportunity to try out CorDbg.exe...
It seems that for simple integer arrays, the format is:
ssssllll000011112222....nnnn0000
where s is the sync block, l the length of the array, followed by the individual elements. There seems to be a final 0 at the end; I'm not sure why that is.
For multidimensional arrays:
ssssttttl1l1l2l2????????
000011112222....nnnn000011112222....nnnn....000011112222....nnnn0000
where s is the sync block, t the total number of elements, l1 the length of the first dimension, l2 the length of the second dimension, then two zeroes?, followed by all the elements sequentially, and finally a zero again.
Object arrays are laid out like the integer array, except that the contents are references. Jagged arrays are object arrays whose references point to other arrays.
An array object would have to store how many dimensions it has and the length of each dimension. So there is at least one more data element to add to your model.
I'm trying to pass a binary string (which represents a file) from classic ASP (vbscript) to a .Net method. This is of course, very simple indeed. Testing with a simple txt file, we find that the data can be reliably passed to .Net and worked with.
However, as soon as we get to files which include chars in the unicode range, we see that they are routinely converted to different characters and therefore the file is corrupted. For example:
3Ü£ÜkÜGÝ
becomes
ýÿýÿýÿýÿ
So I created a couple of methods: one to write out the hex of the string within the .Net world (as passed from VB), and the other to write out the hex in the VB world (the source). Here are the results (using the above example text):
What .Net THINKS we sent it:
fd ff fd ff fd ff fd ff 0d 0a 0d 0a 62 65 63 6f 6d 65 73 0d 0a 0d 0a fd ff fd ff fd ff fd ff 00
What we REALLY sent it:
33 DC A3 DC 6B DC 47 DD 0D 0A 0D 0A 62 65 63 6F 6D 65 73 0D 0A 0D 0A FD FF FD FF FD FF FD FF 00
We can see the word "becomes" has been carried across perfectly (ascii chars) as have the final characters. But the initial characters (3Ü£ÜkÜGÝ) were converted to:
fd ff fd ff fd ff fd ff
Or the string:
ýÿýÿýÿýÿ
What is going on here? Both languages support unicode, is there some odd marshalling of the string happening? Why these particular characters? Thanks.
UPDATE 1
If I take 3Ü in a file and write that out as binary string data in VBScript, we get í°³. So what's going on there? I can understand that we now have 3 characters, as the first is single byte and the second is double byte. But why isn't it 3°³ (or something similar)? Why is 3 converted to í, etc.? How can I get my original data back from this string? Of course, if I say it's encoded as ASCII, I'll get the good old ? (3F) character. I'm obviously missing a fundamental understanding here.
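The symptom can be reproduced inside .NET alone, which hints at where the bytes get lost. The sketch below is not a diagnosis of the ASP-to-.NET marshalling itself. It only shows that the "really sent" byte pairs, read as UTF-16 code units, are all unpaired low surrogates, and that .NET's UTF-16 decoder replaces each of them with U+FFFD (bytes FD FF).

```csharp
using System;
using System.Text;

// The first eight bytes of "what we REALLY sent", as listed in the question.
byte[] sent = { 0x33, 0xDC, 0xA3, 0xDC, 0x6B, 0xDC, 0x47, 0xDD };

// Read as little-endian 16-bit units these are U+DC33, U+DCA3, U+DC6B and
// U+DD47, all in the low-surrogate range U+DC00..U+DFFF.
for (int i = 0; i < sent.Length; i += 2)
{
    char unit = (char)(sent[i] | (sent[i + 1] << 8));
    Console.WriteLine($"U+{(int)unit:X4} low surrogate: {char.IsLowSurrogate(unit)}");
}

// A low surrogate without a preceding high surrogate is not valid UTF-16,
// so the decoder substitutes U+FFFD for each one.
string decoded = Encoding.Unicode.GetString(sent);
byte[] roundTrip = Encoding.Unicode.GetBytes(decoded);
Console.WriteLine(BitConverter.ToString(roundTrip)); // FD-FF-FD-FF-FD-FF-FD-FF
```

If that is what happens on the way in, the data is already gone by the time the string arrives, so it would have to cross the boundary as something other than a text string (for example a byte array). Whether that is practical from classic ASP is not something the post establishes.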
I'm parsing a file (which I don't generate) that contains a string. The string is always preceded by 2 bytes which tell me the length of the string that follows.
For example:
05 00 53 70 6F 72 74
would be:
Sport
Using a C# BinaryReader, I read the string using:
string s = new string(binaryReader.ReadChars(size));
Sometimes there's the odd funky character which seems to push the position of the stream on further than it should. For example:
0D 00 63 6F 6F 6B 20 E2 80 94 20 62 6F 6F 6B
Should be:
cook - book
and although it reads fine the stream ends up two bytes further along than it should?! (Which then messes up the rest of the parsing.)
I'm guessing it has something to do with the 0xE2 in the middle, but I'm not really sure why or how to deal with it.
Any suggestions greatly appreciated!
My guess is that the string is encoded in UTF-8. The 3-byte sequence E2 80 94 corresponds to the single Unicode character U+2014 (EM DASH).
In your first example
05 00 53 70 6F 72 74
none of the bytes are over 0x7F, which happens to be the limit for 7-bit ASCII. UTF-8 retains compatibility with ASCII by using the 8th bit to indicate that a byte belongs to a multi-byte sequence.
0D 00 63 6F 6F 6B 20 E2 80 94 20 62 6F 6F 6B
Just as Ted noticed, your "problems" start with 0xE2, because that is not a 7-bit ASCII character.
The length prefix 0x0D is 13, and there are indeed 13 bytes, but they decode to only 11 characters.
0xE2 tells us that we've found the beginning of a UTF-8 sequence since the most significant bit is set (it's over 127). In this case a sequence that represents — (EM Dash).
As you correctly stated, the E2 byte is the problem. BinaryReader.ReadChars(n) does not read n bytes; it reads n characters, decoded with the reader's encoding (UTF-8 by default). See Wikipedia for Unicode encodings. In UTF-8 only U+0000 to U+007F take a single byte; U+0080 to U+07FF take two bytes, and U+0800 to U+FFFF, which includes U+2014, take three. Since your length prefix counts bytes, ReadChars(13) keeps reading past the 13 bytes until it has 13 characters. This is the reason for your offset mismatch.
You need to use BinaryReader.ReadBytes to fix the offset issue and then pass the result to an Encoding instance.
To make it work you need to read the bytes with BinaryReader and then decode them with the correct encoding. Assuming you are dealing with UTF-8, you need to pass the byte array to
Encoding.UTF8.GetString(byte [] rawData)
to get your correctly encoded string back.
Yours,
Alois Kraus
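Put together, the ReadBytes approach might look like the sketch below. It assumes, as the samples in the question suggest, that the 2-byte prefix is a little-endian byte count and that the payload is UTF-8. ReadPrefixedString is a name chosen here, not part of any API.

```csharp
using System;
using System.IO;
using System.Text;

// Reads one string that is preceded by a 2-byte little-endian BYTE count
// (assumption based on the samples) and decodes the payload as UTF-8.
string ReadPrefixedString(BinaryReader input)
{
    int byteCount = input.ReadUInt16();
    byte[] raw = input.ReadBytes(byteCount);
    if (raw.Length != byteCount)
        throw new EndOfStreamException("String payload is truncated.");
    return Encoding.UTF8.GetString(raw);
}

// The two records from the question, back to back.
byte[] sample =
{
    0x0D, 0x00, 0x63, 0x6F, 0x6F, 0x6B, 0x20, 0xE2, 0x80, 0x94, 0x20, 0x62, 0x6F, 0x6F, 0x6B,
    0x05, 0x00, 0x53, 0x70, 0x6F, 0x72, 0x74,
};

using (var sampleReader = new BinaryReader(new MemoryStream(sample)))
{
    string first = ReadPrefixedString(sampleReader);
    Console.WriteLine(first.Length);                      // 11 characters from 13 bytes
    Console.WriteLine(sampleReader.BaseStream.Position);  // 15, exactly at the next record
    Console.WriteLine(ReadPrefixedString(sampleReader));  // Sport
}
```

Because the position advances by the byte count and nothing else, the following record is parsed from the right offset regardless of how many multi-byte characters the string contains.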
I'm trying to read a null terminated string from a byte array; the parameter to the function is the encoding.
string ReadString(Encoding encoding)
For example, "foo" in the following encodings is:
UTF-32: 66 00 00 00 6f 00 00 00 6f 00 00 00
UTF-8: 66 6f 6f
UTF-7: 66 6f 6f 2b 41 41 41 2d
If I copied all the bytes into an array (reading up to the null terminator) and passed that array into encoding.GetString(), it wouldn't work because if the string was UTF-32 encoded my algorithm would reach the "null terminator" after the second byte.
So I sort of have a double question: Are null terminators part of the encoding? If not, how could I decode the string character by character and check the following byte for the null terminator?
Thanks in advance
(suggestions are also appreciated)
Edit:
If "foo" was null terminated and utf-32 encoded, which would it be?:
1. 66 00 00 00 6f 00 00 00 6f 00 00 00 00
2. 66 00 00 00 6f 00 00 00 6f 00 00 00 00 00 00 00
The null terminator is not "logically" part of the string; it's not considered payload. It's widely used in C/C++ to indicate where the string ends.
Having said that, you can have strings with embedded \0's, but then you have to be careful to ensure the string doesn't appear truncated. For example, std::string doesn't have a problem with embedded \0's. But if you call c_str() and do not account for the reported length(), your string will appear cut off.
Null terminators are not part of the encoding but of the string representation used by some programming languages, such as C. In .NET, System.String is prefixed by the string length as a 32-bit integer and does not rely on a null terminator to mark its end. Internally System.String is always UTF-16, but you can use the encoding to output different representations.
For the second part... Use the classes in System.Text such as UTF8Encoding and UTF32Encoding to read the string. You just have to select the right one based on your parameter...
This seems to work well for me (sample from actual code that reads a unicode, null terminated string from a byte array):
// trim null-termination from end of string
byte[] languageId = ...
string language = Encoding.Unicode.GetString(languageId,
                                             0,
                                             languageId.Length).Trim('\0');
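For the case where the terminator has to be found first, one possible sketch is to scan in steps of the encoding's code unit size, so that the zero bytes inside a UTF-32 or UTF-16 character are not mistaken for the end. It assumes the terminator is one zero code unit (1, 2 or 4 bytes), which fits UTF-8, UTF-16 and UTF-32 but not UTF-7. ReadNullTerminated is a name invented here.

```csharp
using System;
using System.Text;

// Decodes the string starting at 'offset' and ending at the first all-zero
// code unit. ASSUMPTION: the encoding has a fixed code unit size and encodes
// U+0000 as exactly one all-zero unit (true for UTF-8, UTF-16 and UTF-32).
string ReadNullTerminated(byte[] data, int offset, Encoding encoding)
{
    int unit = encoding.GetByteCount("\0");   // 1, 2 or 4 for the encodings above
    int end = offset;
    while (end + unit <= data.Length)
    {
        bool allZero = true;
        for (int i = 0; i < unit; i++)
            if (data[end + i] != 0) { allZero = false; break; }
        if (allZero) break;                   // terminator found
        end += unit;                          // stay aligned to code units
    }
    return encoding.GetString(data, offset, end - offset);
}

byte[] utf32Foo = { 0x66, 0, 0, 0, 0x6F, 0, 0, 0, 0x6F, 0, 0, 0, 0, 0, 0, 0 };
byte[] utf8Foo = { 0x66, 0x6F, 0x6F, 0 };

Console.WriteLine(ReadNullTerminated(utf32Foo, 0, Encoding.UTF32)); // foo
Console.WriteLine(ReadNullTerminated(utf8Foo, 0, Encoding.UTF8));   // foo
```

Under this assumption a null-terminated UTF-32 "foo" is the second variant from the edit in the question, with four zero bytes at the end.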
I want to create a very simple piece of software in C# .NET that I can pass a folder's path to, and that detects all files with a frequency below a given threshold. Any pointers on how I would do this?
You have to read the mp3 files. To do that, you have to find the specification for the format.
Generally an mp3 file starts with an ID3 tag, so you have to read it, find its length and skip it. Let's take ID3v2.3 as an example:
ID3v2/file identifier   "ID3"
ID3v2 version           $03 00
ID3v2 flags             %abc00000
ID3v2 size              4 * %0xxxxxxx
So bytes 6, 7, 8 and 9 store the tag size (not counting the 10-byte header) in big-endian form, using only the low 7 bits of each byte. Here is the start of a sample file:
0 1 2 3 4 5 6 7 8 9 A B C D E F
49 44 33 03 00 00 00 00 07 76 54 43 4f 4e 00 00
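07 76 is the size. Because each byte contributes only 7 bits, the 07 is shifted left by 7 rather than 8, so the actual size is (0x07 << 7) | 0x76 = 0x3F6. Then add 10 (0xA) for the header itself to get the offset 0x400. This is the address where the mp3 frame header starts.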
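The size calculation can be sketched as follows. SkipId3v2 is a hypothetical helper, and it assumes an ID3v2.3-style tag without a footer, as in the example above.

```csharp
using System;

// Returns the file offset just past an ID3v2 tag, or 0 if the data does not
// start with one. ASSUMES no footer (ID3v2.3-style tag).
int SkipId3v2(byte[] head)
{
    if (head.Length < 10 || head[0] != 'I' || head[1] != 'D' || head[2] != '3')
        return 0;

    // Four bytes, most significant first, 7 useful bits each.
    int size = (head[6] & 0x7F) << 21
             | (head[7] & 0x7F) << 14
             | (head[8] & 0x7F) << 7
             | (head[9] & 0x7F);

    return 10 + size; // the 10-byte header is not included in the stored size
}

// First ten bytes of the sample file shown above.
byte[] sampleHead = { 0x49, 0x44, 0x33, 0x03, 0x00, 0x00, 0x00, 0x00, 0x07, 0x76 };
Console.WriteLine($"0x{SkipId3v2(sampleHead):X}"); // 0x400
```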
Then you take description of mp3 header:
The bits are AAAAAAAA AAABBCCD EEEEFFGH IIJJKLMM. We need FF, the sampling frequency index, and convert it to the actual frequency (BB, the MPEG version, selects the column):
bits  MPEG1    MPEG2    MPEG2.5
00    44100    22050    11025
01    48000    24000    12000
10    32000    16000    8000
11    reserv.  reserv.  reserv.
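As a sketch, the table lookup could be coded like this. SampleRateFromHeader is a made-up name. The version bit values used (11 = MPEG1, 10 = MPEG2, 00 = MPEG2.5, 01 = reserved) come from the usual MPEG audio header layout rather than from the text above.

```csharp
using System;

// Returns the sampling frequency in Hz encoded in the first three bytes of
// an MPEG audio frame header, or 0 if the bytes are not a usable header.
int SampleRateFromHeader(byte b0, byte b1, byte b2)
{
    // A: eleven sync bits, all set.
    if (b0 != 0xFF || (b1 & 0xE0) != 0xE0) return 0;

    int version = (b1 >> 3) & 0x03; // BB: 3 = MPEG1, 2 = MPEG2, 0 = MPEG2.5, 1 = reserved
    int index = (b2 >> 2) & 0x03;   // FF: row in the table above
    if (version == 1 || index == 3) return 0; // reserved values

    int[] mpeg1Rates = { 44100, 48000, 32000 };
    // The MPEG2 column is half the MPEG1 rate, the MPEG2.5 column a quarter.
    int divisor = version == 3 ? 1 : version == 2 ? 2 : 4;
    return mpeg1Rates[index] / divisor;
}

Console.WriteLine(SampleRateFromHeader(0xFF, 0xFB, 0x90)); // 44100
```

Combined with the tag offset, a folder scan would read the header at that offset in each file and compare the result with the threshold. Files whose first frame is not exactly at that offset would need a search for the next sync pattern, which this sketch does not attempt.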
You can use UltraID3Lib to get mp3 metadata (bitrate, frequency)
Check value of frequency bits in a file. There is some info about mp3 format.