How to overwrite specific bytes in a dump file in C#

I have a MySQL dump containing some special characters ("Ä, ä, Ö, ö, Ü, ü, ß") that I have to re-import into the latest MySQL version. The import mangles the special characters because of the encoding: the dump is not encoded in UTF-8.
The dump also contains some binary attachments that must not be touched, otherwise the attachments will be corrupted.
I have to overwrite every special character with the byte sequence that UTF-8 expects.
I'm currently trying it this way (this changes the ANSI ü into a ü that is readable as UTF-8):
newByteArray[y] = 195;
if (bytesFromLine[i] == 252)
{
    newByteArray[y + 1] = 188;
}
newByteArray[y + 2] = bytesFromLine[y + 1];
252 displays as 'ü' in Encoding.Default; 195 188 displays as 'ü' in Encoding.UTF8.
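For reference, those byte values can be checked directly. A minimal sketch, assuming Encoding.Default resolves to an ANSI code page such as Windows-1252 (as it does on .NET Framework):
byte[] ansi = Encoding.Default.GetBytes("ü");   // { 252 } on a Windows-1252 system
byte[] utf8 = Encoding.UTF8.GetBytes("ü");      // { 195, 188 }
Console.WriteLine(string.Join(" ", ansi));      // 252
Console.WriteLine(string.Join(" ", utf8));      // 195 188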
Now I need help finding these specific characters in the dump file and overwriting those bytes with the right ones. I can't simply replace every 252 with 195 188, because that would break the attachments.
Thanks in advance.

DISCLAIMER: This might corrupt your data. The best way of dealing with this is to get a proper mysqldump from the source database. This solution should only be used when you don't have that option and are stuck with a potentially broken dump file.
Assuming all strings in the dump file are in quotes (using single quote ') and quotes can be escaped as \':
INSERT INTO `some_table` VALUES (123, 'this is a string', ...
I'm not sure how binary data is represented; that might need more checks. Look at your dump file and see whether these assumptions are correct.
const char quote = '\'';
const char escape = '\\';

using (var dumpOut = new FileStream("dump_out.txt", FileMode.Create, FileAccess.Write))
using (var dumpIn = new FileStream("dump_in.txt", FileMode.Open, FileAccess.Read))
{
    bool inquotes = false;
    byte previousByte = 0;
    var stringBytes = new List<byte>();
    while (true)
    {
        int readByte = dumpIn.ReadByte();
        if (readByte == -1) break;
        var b = (byte)readByte;
        if (b == quote && previousByte != escape)
        {
            if (inquotes) // closing quote
            {
                var buffer = stringBytes.ToArray();
                stringBytes.Clear();
                byte[] converted = Encoding.Convert(Encoding.Default, Encoding.UTF8, buffer);
                dumpOut.Write(converted, 0, converted.Length);
                dumpOut.WriteByte(b);
            }
            else // opening quote
            {
                dumpOut.WriteByte(b);
            }
            inquotes = !inquotes;
            continue;
        }
        previousByte = b;
        if (inquotes)
            stringBytes.Add(b);
        else
            dumpOut.WriteByte(b);
    }
}
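Note that on .NET Core / .NET 5+, Encoding.Default is UTF-8 rather than the machine's ANSI code page, so the Encoding.Convert call above would do nothing there. In that case you'd want to name the source code page explicitly; a sketch, assuming the dump is Windows-1252 and the System.Text.Encoding.CodePages package is available:
// Hypothetical replacement for the Encoding.Convert line above.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance); // register once at startup
Encoding ansi = Encoding.GetEncoding(1252);                    // assumed source code page of the dump
byte[] converted = Encoding.Convert(ansi, Encoding.UTF8, buffer);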

Related

Encoding UTF-16 to UTF-8 C#

Hello everyone, I have a problem with encoding.
I want to convert UTF-16 to UTF-8. I found a lot of code, but none of it worked.
I hope you can help me. Thanks.
This text =>
'\x04\x1a\x040\x04#\x04B\x040\x00 \x00*\x003\x003\x000\x001\x00:\x00 \x000\x001\x00.\x001\x001\x00.\x002\x000\x002\x002\x00 \x001\x004\x00:\x001\x000\x00,\x00 \x04?\x04>\x04?\x04>\x04;\x04=\x045\x04=\x048\x045\x00 \x003\x003\x00.\x003\x003\x00 \x00T\x00J\x00S\x00.\x00 \x00 \x04\x14\x04>\x04A\x04B\x04C\x04?\x04=\x04>\x00 \x003\x002\x002\x003'
I tried this:
string v = Regex.Unescape(text);
and got a result like
♦→♦0♦#♦B♦0 *3301: 01.11.2022 14:10, ♦?♦>♦?♦>♦;♦=♦5♦=♦8♦5 33.33 TJS. ♦¶♦>♦A♦B♦C♦?♦=♦> 3223
and I continued with
public static string Utf16ToUtf8(string utf16String)
{
    // Get UTF16 bytes and convert UTF16 bytes to UTF8 bytes
    byte[] utf16Bytes = Encoding.Unicode.GetBytes(utf16String);
    byte[] utf8Bytes = Encoding.Convert(Encoding.Unicode, Encoding.UTF8, utf16Bytes);
    // Return UTF8 bytes as ANSI string
    return Encoding.Default.GetString(utf8Bytes);
}
but it didn't work.
I need a result like this:
Карта *4411: 01.11.2022 14:10, пополнение 33.33 TJS. Доступно 3223
The code below decodes the text to what you want, but it would be much better to avoid getting into this situation in the first place. If the data is fundamentally text, store it as text in your log files without the extra "convert to UTF-16 then encode that binary data" aspect - that's just causing problems.
The code below "decodes" the text log data into a byte array by treating each \x escape sequence as a single byte (assuming \\ is used to encode backslashes) and treating any other character as a single byte - effectively ISO-8859-1.
It then converts the byte array to a string using big-endian UTF-16. The output is as desired:
Карта *3301: 01.11.2022 14:10, пополнение 33.33 TJS. Доступно 3223
The code is really inefficient - it's effectively a proof of concept to validate the text format you've got. Don't use it as-is; instead, use this as a starting point for improving your storage representation.
using System;
using System.Collections.Generic;
using System.Text;

class Program
{
    static void Main()
    {
        string logText = @"\x04\x1a\x040\x04#\x04B\x040\x00 \x00*\x003\x003\x000\x001\x00:\x00 \x000\x001\x00.\x001\x001\x00.\x002\x000\x002\x002\x00 \x001\x004\x00:\x001\x000\x00,\x00 \x04?\x04>\x04?\x04>\x04;\x04=\x045\x04=\x048\x045\x00 \x003\x003\x00.\x003\x003\x00 \x00T\x00J\x00S\x00.\x00 \x00 \x04\x14\x04>\x04A\x04B\x04C\x04?\x04=\x04>\x00 \x003\x002\x002\x003";
        byte[] utf16 = DecodeLogText(logText);
        string text = Encoding.BigEndianUnicode.GetString(utf16);
        Console.WriteLine(text);
    }

    static byte[] DecodeLogText(string logText)
    {
        List<byte> bytes = new List<byte>();
        for (int i = 0; i < logText.Length; i++)
        {
            if (logText[i] == '\\')
            {
                if (i == logText.Length - 1)
                {
                    throw new Exception("Trailing backslash");
                }
                switch (logText[i + 1])
                {
                    case 'x':
                        if (i >= logText.Length - 3)
                        {
                            throw new Exception("Not enough data for \\x escape sequence");
                        }
                        // This is horribly inefficient, but never mind.
                        bytes.Add(Convert.ToByte(logText.Substring(i + 2, 2), 16));
                        // Consume the x and hex
                        i += 3;
                        break;
                    case '\\':
                        bytes.Add((byte) '\\');
                        // Consume the extra backslash
                        i++;
                        break;
                    // TODO: Any other escape sequences?
                    default:
                        throw new Exception("Unknown escape sequence");
                }
            }
            else
            {
                bytes.Add((byte) logText[i]);
            }
        }
        return bytes.ToArray();
    }
}
This also helped me:
string reg = Regex.Unescape(text2);
byte[] ascii = Encoding.BigEndianUnicode.GetBytes(reg);
byte[] utf8 = Encoding.Convert(Encoding.BigEndianUnicode, Encoding.UTF8, ascii);
Console.WriteLine(Encoding.BigEndianUnicode.GetString(utf8));

C# utf string conversion, characters which don't display correctly get converted to "unknown character" - how to prevent this?

I've got two strings which are derived from Windows filenames, which contain unicode characters that do not display correctly in Windows (they show just the square box "unknown character" instead of the correct character). However the filenames are valid and these files exist without problems in the operating system, which means I need to be able to deal with them correctly and accurately.
I'm loading the filenames the usual way:
string path = @"c:\folder";
foreach (FileInfo file in new DirectoryInfo(path).EnumerateFiles())
{
    string filename = file.FullName;
}
but for the purposes of explaining this problem, these are the two filenames I'm having issues with:
string filename1 = "\ude18.txt";
string filename2 = "\udca6.txt";
Two strings, two filenames with a single unicode character plus an extension, both different. This so far is fine, I can read and write these files no problem, however I need to store these strings in a sqlite db and later retrieve them. Every attempt I make to do so results in both of these characters being changed to the "unknown character", so the original data is lost and I can no longer differentiate the two strings. At first I thought this was an sqlite issue, and I've made sure my db is in UTF16, but it turns out it's the conversion in c# to UTF16 that is causing the problem.
If I ignore sqlite entirely, and simply try to manually convert these strings to UTF16 (or to any other encoding), these characters are converted to the "unknown character" and the original data is lost. If I do this:
System.Text.Encoding enc = System.Text.Encoding.Unicode;
string filename1 = "\ude18.txt";
string filename2 = "\udca6.txt";
byte[] name1Bytes = enc.GetBytes(filename1);
byte[] name2Bytes = enc.GetBytes(filename2);
and then inspect the byte arrays 'name1Bytes' and 'name2Bytes', they are both identical. I can see that the unicode character in both cases has been converted to the byte pair 253 255 - the unknown character. And sure enough, when I convert back
string newFilename1 = enc.GetString(name1Bytes);
string newFilename2 = enc.GetString(name2Bytes);
the original unicode character in each case is lost and replaced with a diamond question mark symbol. I have lost the original filenames altogether.
It seems that these encoding conversions rely on the system font being able to display the characters, and this is a problem as these strings already exist as filenames, and changing the filenames isn't an option. I need to preserve this data somehow when sending it to sqlite, and when it's sent to sqlite it will go through a conversion process to UTF16, and it's this conversion that I need it to survive without losing data.
If you cast a char to an int, you get the numeric value, bypassing the Unicode conversion mechanism:
foreach (char ch in filename1)
{
    int i = ch; // 0x0000de18 == 56856 for the first char in filename1
    // ... do whatever, e.g., create an int array, store it as base64
}
This turns out to work as well, and is perhaps more elegant:
foreach (int ch in filename1)
{
    ...
}
So perhaps something like this:
string Encode(string raw)
{
    byte[] bytes = new byte[2 * raw.Length];
    int i = 0;
    foreach (int ch in raw)
    {
        bytes[i++] = (byte)(ch & 0xff);
        bytes[i++] = (byte)(ch >> 8);
    }
    return Convert.ToBase64String(bytes);
}

string Decode(string encoded)
{
    byte[] bytes = Convert.FromBase64String(encoded);
    char[] chars = new char[bytes.Length / 2];
    for (int i = 0; i < chars.Length; ++i)
    {
        chars[i] = (char)(bytes[i * 2] | (bytes[i * 2 + 1] << 8));
    }
    return new string(chars);
}
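A quick round-trip check (a sketch reusing the two filenames from the question) shows the encoded forms stay distinct and decode back to the originals:
string filename1 = "\ude18.txt";
string filename2 = "\udca6.txt";
string e1 = Encode(filename1);
string e2 = Encode(filename2);
Console.WriteLine(e1 == e2);                // False - the Base64 forms differ
Console.WriteLine(Decode(e1) == filename1); // True - the original string survives
Console.WriteLine(Decode(e2) == filename2); // True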

Replace() working with hex value

I would like to use the Replace() method but with hex values instead of string values.
I have a program in C# that writes a text file.
I don't know why, but when the program writes the '°' (-> Number), it comes out as 'Â°' (in hex: C2 B0 instead of B0).
I would just like to patch the output in order to correct this.
Is it possible to do a replace of C2 B0 with B0? How can I do this?
Thanks a lot :)
Not sure if this is the best solution for your problem but if you want a replace function for a string using hex values this will work:
var newString = HexReplace(sourceString, "C2B0", "B0");

private static string HexReplace(string source, string search, string replaceWith) {
    var realSearch = string.Empty;
    var realReplace = string.Empty;
    if (search.Length % 2 == 1) throw new Exception("Search parameter incorrect!");
    for (var i = 0; i < search.Length / 2; i++) {
        var hex = search.Substring(i * 2, 2);
        realSearch += (char)int.Parse(hex, System.Globalization.NumberStyles.HexNumber);
    }
    for (var i = 0; i < replaceWith.Length / 2; i++) {
        var hex = replaceWith.Substring(i * 2, 2);
        realReplace += (char)int.Parse(hex, System.Globalization.NumberStyles.HexNumber);
    }
    return source.Replace(realSearch, realReplace);
}
C# strings are Unicode. When they are written to a file, an encoding must be applied. The default encoding used by File.WriteAllText is utf-8 with no byte order mark.
The two-byte sequence 0xC2B0 is the representation of the ° degree sign U+00B0 codepoint in utf-8.
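You can verify that in a couple of lines (a small sketch):
byte[] utf8 = Encoding.UTF8.GetBytes("°");
Console.WriteLine(BitConverter.ToString(utf8));   // C2-B0
byte[] latin1 = Encoding.GetEncoding(1252).GetBytes("°");
Console.WriteLine(BitConverter.ToString(latin1)); // B0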
To get rid of the 0xC2 part, apply a different encoding, for example latin-1:
var latin1 = Encoding.GetEncoding(1252);
File.WriteAllText(path, text, latin1);
To address the "hex replace" idea of the question: Best practice to remove the utf-8 leading byte from existing files would be to do a ReadAllText with utf-8, followed by a WriteAllText as shown above (or stream chunking if the files are too big to read to memory as a whole).
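A minimal sketch of that repair step, assuming the files are small enough to read whole and really are UTF-8 encoded:
string text = File.ReadAllText(path, Encoding.UTF8);       // turns the C2 B0 sequence back into '°'
File.WriteAllText(path, text, Encoding.GetEncoding(1252)); // rewrites it as a single B0 byte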
Single-byte character encodings cannot represent all Unicode characters, so substitution will happen for any such character in your data.
The rendition as Â° must be blamed on the viewer/editor you are using to display the file.
Further reading: https://stackoverflow.com/a/17269952/1132334

Notepad++ .NET plugin - get current buffer text -- encoding issues

I have a .NET plugin which needs to get the text of the current buffer. I found this page, which shows a way to do it:
public static string GetDocumentText(IntPtr curScintilla)
{
    int length = (int)Win32.SendMessage(curScintilla, SciMsg.SCI_GETLENGTH, 0, 0) + 1;
    StringBuilder sb = new StringBuilder(length);
    Win32.SendMessage(curScintilla, SciMsg.SCI_GETTEXT, length, sb);
    return sb.ToString();
}
And that's fine, until we reach the character encoding issues. I have a buffer that is set in the Encoding menu to "UTF-8 without BOM", and I write that text to a file:
System.IO.File.WriteAllText(@"C:\Users\davet\BBBBBB.txt", sb.ToString());
when I open that file (in notepad++) the encoding menu shows UTF-8 without BOM, but the ß character is broken (it shows as ÃŸ).
I was able to get as far as finding the encoding for my current buffer:
int currentBuffer = (int)Win32.SendMessage(PluginBase.nppData._nppHandle, NppMsg.NPPM_GETCURRENTBUFFERID, 0, 0);
Console.WriteLine("currentBuffer: " + currentBuffer);
int encoding = (int) Win32.SendMessage(PluginBase.nppData._nppHandle, NppMsg.NPPM_GETBUFFERENCODING, currentBuffer, 0);
Console.WriteLine("encoding = " + encoding);
And that shows "4" for "UTF-8 without BOM" and "0" for "ASCII", but I cannot find what notepad++ or Scintilla thinks those values are supposed to represent.
So I'm a bit lost for where to go next (Windows not being my natural habitat). Anyone know what I'm getting wrong, or how to debug it further?
Thanks.
Removing the StringBuilder fixes this problem.
public static string GetDocumentTextBytes(IntPtr curScintilla) {
    int length = (int) Win32.SendMessage(curScintilla, SciMsg.SCI_GETLENGTH, 0, 0) + 1;
    byte[] sb = new byte[length];
    unsafe {
        fixed (byte* p = sb) {
            IntPtr ptr = (IntPtr) p;
            Win32.SendMessage(curScintilla, SciMsg.SCI_GETTEXT, length, ptr);
        }
        return System.Text.Encoding.UTF8.GetString(sb).TrimEnd('\0');
    }
}
Alternative approach:
The reason for the broken UTF-8 characters is that this line..
Win32.SendMessage(curScintilla, SciMsg.SCI_GETTEXT, length, sb);
..reads the string using [MarshalAs(UnmanagedType.LPStr)], which uses your computer's default ANSI encoding when decoding strings (MSDN). This means you get a string with one character per byte, which breaks for multi-byte UTF-8 characters.
Now, to save the original UTF-8 bytes to disk, you simply need to use the same default ANSI encoding when writing the file:
File.WriteAllText(@"C:\Users\davet\BBBBBB.txt", sb.ToString(), Encoding.Default);

reading large file, wrong file size

I'm trying to read a large file from disk and report the percentage loaded while reading it. The problem is that FileInfo.Length reports a different size than the total of my Encoding.ASCII.GetBytes().Length calls.
public void loadList()
{
    string ListPath = InnerConfig.dataDirectory + core.operation[operationID].Operation.Trim() + "/List.txt";
    FileInfo f = new FileInfo(ListPath);
    int bytesLoaded = 0;
    using (FileStream fs = File.Open(ListPath, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
    using (BufferedStream bs = new BufferedStream(fs))
    using (StreamReader sr = new StreamReader(bs))
    {
        string line;
        while ((line = sr.ReadLine()) != null)
        {
            byte[] array = Encoding.ASCII.GetBytes(line);
            bytesLoaded += array.Length;
        }
    }
    MessageBox.Show(bytesLoaded + "/" + f.Length);
}
The result is
13357/15251
There are about 1,900 bytes 'missing'. The file contains a list of short strings. Any tips on why it's reporting different sizes? Does it have anything to do with the '\r' and '\n' characters in the file? In addition, I have the following line:
int bytesLoaded = 0;
If the file is, let's say, 1 GB, do I have to use 'long' instead? Thank you for your time!
Your intuition is correct; the difference in the reported sizes is due to the newline characters. Per the MSDN documentation on StreamReader.ReadLine:
The string that is returned does not contain the terminating carriage return or line feed.
Depending on the source which created your file, each newline would consist of either one or two characters (most commonly: \r\n on Windows; just \n on Linux).
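If you want to keep the line-by-line progress reporting, you could also account for the stripped terminators yourself; a rough sketch, assuming Windows-style \r\n line endings throughout (use +1 for \n-only files, and note the last line may have no terminator at all):
long bytesLoaded = 0;
using (var sr = new StreamReader(ListPath, Encoding.ASCII))
{
    string line;
    while ((line = sr.ReadLine()) != null)
    {
        // +2 accounts for the "\r\n" that ReadLine strips off each line.
        bytesLoaded += Encoding.ASCII.GetByteCount(line) + 2;
    }
}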
That said, if your intention is to read the file as a sequence of bytes (without regard to lines), you should use the FileStream.Read method, which avoids the overhead of ASCII encoding (as well as returns the correct count in total):
byte[] array = new byte[1024]; // buffer
int total = 0;
using (FileStream fs = File.Open(ListPath, FileMode.Open,
                                 FileAccess.Read, FileShare.ReadWrite))
{
    int read;
    while ((read = fs.Read(array, 0, array.Length)) > 0)
    {
        total += read;
        // process "array" here, up to index "read"
    }
}
Edit: spender raises an important point about character encodings; your code should only be used on ASCII text files. If your file was written using a different encoding – the most popular today being UTF-8 – then results may be incorrect.
Consider, for example, the three-byte hex sequence E2-98-BA. StreamReader, which uses UTF8Encoding by default, would decode this as a single character, ☺. However, this character cannot be represented in ASCII; thus, calling Encoding.ASCII.GetBytes("☺") would return a single byte corresponding to the ASCII value of the fallback character, ?, thereby leading to loss in character count (as well as incorrect processing of the byte array).
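A couple of lines demonstrate that substitution (a small sketch):
string smiley = Encoding.UTF8.GetString(new byte[] { 0xE2, 0x98, 0xBA }); // "☺"
byte[] ascii = Encoding.ASCII.GetBytes(smiley);
Console.WriteLine(ascii.Length);   // 1 - the three UTF-8 bytes collapse to a single byte
Console.WriteLine((char)ascii[0]); // ? - the ASCII fallback character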
Finally, there is also the possibility of an encoding preamble (such as Unicode byte order marks) at the beginning of the text file, which would also be stripped by the ReadLine, resulting in a further discrepancy of a few bytes.
It's the line endings that get swallowed by ReadLine; it could also be that your source file is in a more verbose encoding than ASCII (perhaps it's UTF-8?).
int.MaxValue is 2147483647, so you're going to run into problems using an int for bytesLoaded if your file is >2GB. Switch to a long. After all, FileInfo.Length is defined as a long.
The ReadLine method removes the trailing line termination character.
