I have a class that uses another class to read a text file.
The text file is written in ASCII, or to be precise, in CP1252.
Background info: the text file is generated in Axapta using the ASCIIio class, which writes the text via the writeRaw method.
The class I am using was written by a colleague, and he uses a C# StreamReader to read files. Normally this works fine because the files are written in UTF-8, but in this particular case it doesn't.
So the StreamReader reads the file as UTF-8 and passes the resulting string to me.
I now get some letters, for example the Latin small letter o with diaeresis (ö), that aren't encoded the way I need them to be.
A simple conversion of the string doesn't help in this case, and I can't figure out how to get the right letters back.
So this is basically how he reads it:
char quotationChar = '"';
String line = "";

using (StreamReader reader = new StreamReader(fileName))
{
    if ((line = reader.ReadLine()) != null)
    {
        line = line.Replace(quotationChar.ToString(), "");
    }
}

return line;
What happens now is: in the text file I have the German word "Röhre", which, after being read with the StreamReader, turns into "R�hre" (which looks terrible in a database).
I could try to convert every character:
Encoding enc = Encoding.GetEncoding(1252);
byte[] utf8_Bytes = new byte[line.Length];

for (int i = 0; i < line.Length; ++i)
{
    utf8_Bytes[i] = (byte)line[i];
}

String propEncodeString = enc.GetString(utf8_Bytes, 0, utf8_Bytes.Length);
That doesn't give me the right characters!
byte[] myarr = Encoding.UTF8.GetBytes(line);
String propEncodeString = enc.GetString(myarr);
That also returns the wrong character.
I am aware that I could just solve the problem by using this:
using (StreamReader reader = new StreamReader(fileName, Encoding.Default, true))
But just for fun:
How can I get the right string back from a string that has already been wrongly decoded?
Once the file's bytes have been decoded with the wrong encoding, every byte sequence that isn't valid UTF-8 is collapsed into the same replacement character, so that data is simply lost and you can't just 'convert' back to the right character downstream. See this example: https://dotnetfiddle.net/XWysml
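To make that loss concrete, here is a minimal sketch (the byte values are just "Röhre" hand-encoded in CP1252 for illustration): decoding the CP1252 bytes as UTF-8 turns the invalid sequence into U+FFFD, and nothing done to the resulting string brings the original byte back.

using System;
using System.Text;

class MojibakeDemo
{
    static void Main()
    {
        // "Röhre" encoded as Windows-1252: 'ö' is the single byte 0xF6.
        byte[] cp1252Bytes = { 0x52, 0xF6, 0x68, 0x72, 0x65 };

        // Decoding those bytes as UTF-8 (what a default StreamReader does):
        // 0xF6 starts an invalid UTF-8 sequence, so it becomes U+FFFD.
        string wrong = Encoding.UTF8.GetString(cp1252Bytes);
        Console.WriteLine(wrong); // R�hre

        // The original byte is gone; re-encoding the damaged string to CP1252
        // just maps U+FFFD to '?' (0x3F), never back to 0xF6.
        byte[] roundTrip = Encoding.GetEncoding(1252).GetBytes(wrong);
        Console.WriteLine(BitConverter.ToString(roundTrip)); // 52-3F-68-72-65

        // Decoding with the right encoding in the first place avoids the loss.
        // (On .NET Core / .NET 5+, code page 1252 requires registering the
        // System.Text.Encoding.CodePages provider first.)
        Console.WriteLine(Encoding.GetEncoding(1252).GetString(cp1252Bytes)); // Röhre
    }
}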
I have a text file stored locally. I want to store string data in binary format there and then retrieve the data again. In the following code snippet, I have done the conversion.
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
class ConsoleApplication
{
    const string fileName = "AppSettings.dat";

    static void Main()
    {
        string someText = "settings";
        byte[] byteArray = Encoding.UTF8.GetBytes(someText);
        int byteArrayLength = byteArray.Length;

        using (BinaryWriter writer = new BinaryWriter(File.Open(fileName, FileMode.Create)))
        {
            writer.Write(someText);
        }

        byte[] x = new byte[byteArrayLength];

        if (File.Exists(fileName))
        {
            using (BinaryReader reader = new BinaryReader(File.Open(fileName, FileMode.Open)))
            {
                x = reader.ReadBytes(byteArrayLength);
            }

            string str = Encoding.UTF8.GetString(x);
            Console.Write(str);
            Console.ReadKey();
        }
    }
}
In the AppSettings.dat file the bytes are written in the following way
But when I assign some random values to a byte array and save them to a file using BinaryWriter, as in the following code snippet,
const string fileName = "AppSettings.dat";

static void Main()
{
    byte[] array = new byte[8];
    Random random = new Random();
    random.NextBytes(array);

    using (BinaryWriter writer = new BinaryWriter(File.Open(fileName, FileMode.Create)))
    {
        writer.Write(array);
    }
}
it actually does save the data in binary (non-readable) format, as shown in the picture.
I don't understand why, in the first case, the byte data converted from the string shows up in human-readable form, when what I want is to save the data in non-readable byte form, as in the second case. What's the explanation for this?
Is there any way to store string data in binary format without resorting to brute force?
FYI: I don't want to keep the data as a Base64 string; I want it to be in binary format.
If security isn't a concern and you just don't want the average user to find your data while meddling with the settings files, a simple XOR will do:
const string fileName = "AppSettings.dat";

static void Main()
{
    string someText = "settings";
    byte[] byteArray = Encoding.UTF8.GetBytes(someText);

    // Flip the bits of every byte so the content no longer reads as text.
    for (int i = 0; i < byteArray.Length; i++)
    {
        byteArray[i] ^= 255;
    }

    File.WriteAllBytes(fileName, byteArray);

    if (File.Exists(fileName))
    {
        var x = File.ReadAllBytes(fileName);

        // Flip the bits back to recover the original bytes.
        for (int i = 0; i < x.Length; i++)
        {
            x[i] ^= 255;
        }

        string str = Encoding.UTF8.GetString(x);
        Console.Write(str);
        Console.ReadKey();
    }
}
It takes advantage of an interesting property of character encoding:
In ASCII, the 0-127 range contains the most-used characters (a to z, 0 to 9), and the 128-255 range is only used (by the extended code pages) for special symbols and accented letters.
For compatibility reasons, in UTF-8 the 0-127 range contains the same characters as ASCII, while bytes in the 128-255 range have a special meaning (they tell the decoder that a character is encoded over multiple bytes).
All I do is flip the bits of each byte (XOR with 255), which in particular flips the high bit. Therefore, everything in the 0-127 range ends up in the 128-255 range and vice versa. Thanks to the property described above, whether the text reader tries to parse the file as ASCII or as UTF-8, it will only get gibberish.
Please note that, while it doesn't produce human-readable content, it isn't secure at all. Don't use it to store sensitive data.
Notepad just reads your binary data and interprets it as UTF-8 text.
This code snippet would give you the same result.
byte[] randomBytes = new byte[20];
Random rand = new Random();
rand.NextBytes(randomBytes);
Console.WriteLine(Encoding.UTF8.GetString(randomBytes));
If you want to stop people from converting your data back to a string, then you need to encrypt it. Here is a project that can help you with that.
They will still be able to open the file in a text editor, because the editor will simply interpret your encrypted bytes as UTF-8 text, but they can't convert it back to usable data unless they have the key to decrypt it.
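Just to illustrate the encryption route (this is not the project referred to above, only a minimal sketch using the built-in Aes class; the key and IV live in memory only here, which a real application would need to handle properly):

using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

class EncryptedSettings
{
    static void Main()
    {
        byte[] plain = Encoding.UTF8.GetBytes("settings");

        using (Aes aes = Aes.Create())
        {
            // NOTE: the key/IV are generated per run, fine for a round-trip
            // demo but useless for real persistence.
            byte[] cipher;
            using (ICryptoTransform enc = aes.CreateEncryptor())
            {
                cipher = enc.TransformFinalBlock(plain, 0, plain.Length);
            }

            // A text editor will now only ever show ciphertext gibberish.
            File.WriteAllBytes("AppSettings.dat", cipher);

            byte[] fromDisk = File.ReadAllBytes("AppSettings.dat");
            using (ICryptoTransform dec = aes.CreateDecryptor())
            {
                byte[] roundTrip = dec.TransformFinalBlock(fromDisk, 0, fromDisk.Length);
                Console.WriteLine(Encoding.UTF8.GetString(roundTrip)); // settings
            }
        }
    }
}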
I'm having trouble writing a tab-delimited file, and I've checked around here without finding an answer yet.
So I've got a function that returns the string with the important pieces below (delimiter used and how I build each line):
var delimiter = @"\t";
sb.Append(string.Join(delimiter, itemContent));
sb.Append(Environment.NewLine);
The string returned is like this:
H\t13\t170000000000001\t20150630
D\t1050\t10.0000\tY
D\t1050\t5.0000\tN
And then I write it to a file with this (content below is the string above):
var content = BuildFile(item);
var filePath = tempDirectory + fileName;

// Create the file
using (FileStream fs = File.Create(filePath))
{
    Byte[] info = new UTF8Encoding(true).GetBytes(content);
    fs.Write(info, 0, info.Length);
}
However, the file output looks like this, with no tabs (opened in Notepad++):
H\t13\t170000000000005\t20150630
D\t1050\t20.0000\tN
D\t1050\t2.5000\tY
When it should be more like this (sample file provided):
H 100115980 300010000000003 20150625
D 430181 1 N
D 342130 2 N
D 459961 1 N
Could this be caused by the encoding I used? Appreciate any input you may have, thanks!
With var delimiter = @"\t";, the variable contains a literal \ followed by t. The @ prefix turns off the backslash escape, so the string really holds those two characters. In this case you want
var delimiter = "\t";
to have a tab character.
There is a typo in your code. The @ prefix means that the following string is a verbatim literal, so @"\t" is a two-character string containing the characters \ and t.
You should use "\t" without the prefix.
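A quick way to convince yourself of the difference:

Console.WriteLine(@"\t".Length); // 2: a backslash followed by 't'
Console.WriteLine("\t".Length);  // 1: a single tab character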
You should consider using a StreamWriter instead of constructing the entire string in memory and writing the raw bytes though. StreamWriter uses UTF-8 by default and allows you to write formatted lines just as you would with Console.WriteLine:
var delimiter = "\t";

using (var writer = new StreamWriter(filePath))
{
    var line = string.Join(delimiter, itemContent);
    writer.WriteLine(line);
}
I have a csv file.
When I try to read the file using FileStream and ReadToEnd(), I get inverted commas and \r in many places, which breaks the number of rows in each column.
Is there a way to remove the inverted commas and the \r?
I tried to replace them:
FileStream obj = new FileStream();
string a = obj.ReadToEnd();
a.Replace("\"","");
a.Replace("\r\"","");
When I visualize a, all of the \r and inverted commas appear to be removed.
But when I read the file again from the beginning using ReadLine(), they appear again. Why is that?
First of all, a String is immutable. You might think this is not important for your question, but it is actually important whenever you are developing.
Looking at your code snippet, I'm pretty sure you have no knowledge of immutable objects, so I advise you to make sure you fully understand the concept.
More information about immutable objects can be found here: http://en.wikipedia.org/wiki/Immutable_object
Basically, it means you can never modify a string object; a string variable simply points to a new object whenever you change the value.
That's why the Replace method returns a value. Its documentation can be found here: https://msdn.microsoft.com/en-us/library/system.string.replace%28v=vs.110%29.aspx and it states clearly that it "Returns a new string in which all occurrences of a specified string in the current instance are replaced with another specified string."
In your example, you aren't using the return value of the Replace function.
Could you show us that the string values are actually being replaced in your a variable? Because I don't believe that is the case. When you visualize a string, carriage returns (\r) are not shown as escape sequences but rendered as actual line breaks. If you debug and look at the raw string value, you should still see the \r.
Take the following code snippet:
var someString = "Hello / world";
someString.Replace("/", "");
Console.WriteLine(someString);
You might think that the console will show "Hello world". However, on this fiddle you can see that it still logs "Hello / World": https://dotnetfiddle.net/cp59i3
What you have to do to correctly use String.Replace can be seen in this fiddle: https://dotnetfiddle.net/XCGtOu
Basically, you want to log the return value of the Replace function:
var a = "Some / Value";
var b = a.Replace("/", "");
Console.WriteLine(b);
Also, as mentioned by others in the comment section of your post, you are not replacing the contents of the file, only the string variable in memory.
If you want to save the new string, make sure to use the Write method of the FileStream (or any other way to write to a file); an explanation can be found here: How to Find And Replace Text In A File With C#
Apart from everything I have said in this answer: in most cases you should not strip both the inverted commas and the carriage returns from a file, because they are there for a reason, unless you have a specific reason to remove them.
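For completeness, here is a minimal sketch of the whole read-replace-write cycle (the path is made up, and as said above, only strip these characters if you are sure your CSV doesn't need them):

string path = @"c:\data\input.csv"; // hypothetical path

// Read everything, assign the *returned* strings, then write the result back.
string content = File.ReadAllText(path);
content = content.Replace("\"", "")   // remove the inverted commas
                 .Replace("\r", "");  // remove the carriage returns
File.WriteAllText(path, content);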
At last I succeeded. Thanks to everybody. Here is the code I ended up with.
string a;

using (StreamReader csvr = new StreamReader(csvPath)) // csvPath: path of the original csv file
{
    a = csvr.ReadToEnd();
    a = a.Replace("\"", "");
    a = a.Replace("\r", "");
}

using (StreamWriter Wr = new StreamWriter(TempPath))
{
    Wr.Write(a);
}

using (StreamReader Sr = new StreamReader(TempPath))
{
    Sr.ReadLine();
}
I created a temp path on the system. After this, it was easy to get the data into the database.
Try something like this:

string a;
using (StreamReader sReader = new StreamReader("filename"))
{
    a = sReader.ReadToEnd();
}

a = a.Replace("\"", ""); // Replace returns a new string, so assign the result back
a = a.Replace("\r", "");

using (StringReader reader = new StringReader(a))
{
    string inputLine;
    while ((inputLine = reader.ReadLine()) != null)
    {
        // process each cleaned line here
    }
}
My goal is to convert a .NET string (Unicode) into Windows-1252 and, if necessary, store the original UTF-8 string as a Base64 entity.
For example, the string "DJ Doena" converted to 1252 is still "DJ Doena".
However, if you convert the Japanese kanji for tree (木) into 1252, you end up with a question mark.
These are my test strings:
String doena = "DJ Doena";
String umlaut = "äöüßéèâ";
String allIn = "< ä ß á â & 木 >";
This is how I convert the string in the first place:
using (MemoryStream ms = new MemoryStream())
{
    using (StreamWriter sw = new StreamWriter(ms, Encoding.UTF8))
    {
        sw.Write(decoded);
        sw.Flush();

        ms.Seek(0, SeekOrigin.Begin);

        using (StreamReader sr = new StreamReader(ms, Encoding.GetEncoding(1252)))
        {
            encoded = sr.ReadToEnd();
        }
    }
}
The problem is that, while debugging, string comparison claims that both are indeed identical, so a simple == or .Equals() doesn't suffice.
This is how I try to find out if I need base64 and produce it:
private static String GetBase64Alternate(String utf8Text, String windows1252Text)
{
    Byte[] utf8Bytes;
    Byte[] windows1252Bytes;
    String base64;

    utf8Bytes = Encoding.UTF8.GetBytes(utf8Text);
    windows1252Bytes = Encoding.GetEncoding(1252).GetBytes(windows1252Text);

    base64 = null;

    if (utf8Bytes.Length != windows1252Bytes.Length)
    {
        base64 = Convert.ToBase64String(utf8Bytes);
    }
    else
    {
        for (Int32 i = 0; i < utf8Bytes.Length; i++)
        {
            if (utf8Bytes[i] != windows1252Bytes[i])
            {
                base64 = Convert.ToBase64String(utf8Bytes);
                break;
            }
        }
    }

    return (base64);
}
The first string doena is completely identical and doesn't produce a base64 result
Console.WriteLine(String.Format("{0} / {1}", windows1252Text, base64Text));
results in
DJ Doena /
But the second string (umlaut) already has twice as many bytes in UTF-8 as in 1252 and thus produces a Base64 string, even though that does not appear to be necessary:
äöüßéèâ / w6TDtsO8w5/DqcOow6I=
And the third one does what it's supposed to do (no more "木" but a "?", thus base64 needed):
< ä ß á â & ? > / PCDDpCDDnyDDoSDDoiAmIOacqCA+
Any clues as to how my Base64 getter could be improved, (a) for performance and (b) for better results?
Thank you in advance. :-)
I'm not sure I completely understood the question. But I tried. :) If I do understand correctly, this code does what you want:
static void Main(string[] args)
{
    string[] testStrings = { "DJ Doena", "äöüßéèâ", "< ä ß á â & 木 >" };

    foreach (string text in testStrings)
    {
        Console.WriteLine(ReencodeText(text));
    }
}

private static string ReencodeText(string text)
{
    Encoding encoding = Encoding.GetEncoding(1252);
    string text1252 = encoding.GetString(encoding.GetBytes(text));

    return text.Equals(text1252, StringComparison.Ordinal)
        ? text
        : Convert.ToBase64String(Encoding.UTF8.GetBytes(text));
}
I.e. it encodes the text to Windows-1252, then decodes back to a string object, which it then compares with the original. If the comparison succeeds, it returns the original string, otherwise it encodes it to UTF8, and then to base64.
It produces the following output:
DJ Doena
äöüßéèâ
PCDDpCDDnyDDoSDDoiAmIOacqCA+
In other words, the first two strings are left intact, while the third is encoded as base64.
In your first code snippet you are encoding the string using one encoding and then decoding it using a different encoding. That doesn't give you any reliable result at all; it's the equivalent of writing out a number in octal and then reading it as if it were decimal. It seems to work just fine for numbers up to 7, but after that you get useless results.
The problem with the GetBase64Alternate method is that it's encoding a string to two different encodings, and assumes that the first encoding doesn't support some of the characters if the second encoding resulted in a different set of bytes.
Comparing the byte sequences doesn't tell you whether any of the encodings failed. The sequences will be different if it failed, but it will also be different if there are any characters that are encoded differently between the encodings.
What you want to do is to determine if the encoding actually worked for all characters. You can do that by creating an Encoding instance with a fallback for unsupported characters. There is an EncoderExceptionFallback class that you can use for that, which throws an EncoderFallbackException if it's called.
This code will try use the Windows-1252 encoding on a string, and sets the ok variable to false if the encoding doesn't support all characters in the string:
Encoding e = Encoding.GetEncoding(1252, new EncoderExceptionFallback(), new DecoderExceptionFallback());
bool ok = true;
try {
e.GetByteCount(allIn);
} catch (EncoderFallbackException) {
ok = false;
}
As you are not actually going to use the encoded result for anything, you can use the GetByteCount method. It checks how all the characters would be encoded without producing the encoded result.
Used in your method it would be:
private static String GetBase64Alternate(string text)
{
    Encoding e = Encoding.GetEncoding(1252, new EncoderExceptionFallback(), new DecoderExceptionFallback());
    bool ok = true;

    try
    {
        e.GetByteCount(text);
    }
    catch (EncoderFallbackException)
    {
        ok = false;
    }

    return ok ? null : Convert.ToBase64String(Encoding.UTF8.GetBytes(text));
}
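A hypothetical caller would then fall back to the original text whenever no Base64 alternate is needed:

string[] samples = { "DJ Doena", "äöüßéèâ", "< ä ß á â & 木 >" };

foreach (string s in samples)
{
    string alternate = GetBase64Alternate(s);
    Console.WriteLine(alternate ?? s); // Base64 only for the string containing 木
}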
The following is a line from a UTF-8 file from which I am trying to remove the special character (0x0A), which shows up as a black diamond with a question mark, as in the line below:
2464577 外國法譯評 True s6620178 Unspecified <1>�1009-672
This is generated when SSIS reads a SQL table and then writes it out using a flat file manager set to code page 65001.
When I open the file in Notepad++, it displays as 0x0A.
I'm looking for some C# code that will reliably strip that character out and replace it with either nothing or a blank space.
Here's what I have tried:
string fileLocation = "c:\\MyFile.txt";
var content = string.Empty;

using (StreamReader reader = new System.IO.StreamReader(fileLocation))
{
    content = reader.ReadToEnd();
    reader.Close();
}

content = content.Replace('\u00A0', ' ');
//also tried: content.Replace((char)0X0A, ' ');
//also tried: content.Replace((char)0X0A, '');
//also tried: content.Replace((char)0X0A, (char)'\0');

Encoding encoding = Encoding.UTF8;

using (FileStream stream = new FileStream(fileLocation, FileMode.Create))
{
    using (BinaryWriter writer = new BinaryWriter(stream, encoding))
    {
        writer.Write(encoding.GetPreamble()); //This is for writing the BOM
        writer.Write(content);
    }
}
I also tried this code to get the actual string value:
byte[] bytes = { 0x0A };
string text = Encoding.UTF8.GetString(bytes);
And it comes back as "\n". So in the code above I also tried replacing "\n" with " ", both in double quotes and single quotes, but still no change.
At this point I'm out of ideas. Anyone got any advice?
Thanks.
You may want to have a look at regex replacement; for a good example of this, take a look at the post towards the bottom of this page:
http://social.msdn.microsoft.com/Forums/en-US/1b523d24-dab6-4870-a9ca-5d313d1ee602/invalid-character-returned-from-webservice
You can convert the string to a char array and loop through the array.
Then check what char the black diamond is and just remove it.
string content = "blahblah" + (char)10 + "blahblah";
char find = (char)10;
content = content.Replace(find.ToString(), "");
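Note that a black diamond with a question mark is usually the Unicode replacement character (U+FFFD) rather than 0x0A, so if that is what actually ends up in your string, the same approach can target it directly:

// Target the Unicode replacement character (U+FFFD) directly.
content = content.Replace("\uFFFD", "");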