Different results after encoding/decoding base64 - c#

I have the following base64 string:
R1NNQiBBZ2VuY3kgR21iSCAvIFdlYmRlc2lnbiBBZ2VudHVyIFVsbSAvIE9ubGluZXNob3AgQWdlbnR1ciAvIEFwcCBBZ2VudHVyIFVsbSwgR2VybWFueS==
And using an online base64 decoder I get the following result:
GSMB Agency GmbH / Webdesign Agentur Ulm / Onlineshop Agentur / App Agentur Ulm, Germany
All good, right? But now if I try to convert this text back to base64, the result becomes
R1NNQiBBZ2VuY3kgR21iSCAvIFdlYmRlc2lnbiBBZ2VudHVyIFVsbSAvIE9ubGluZXNob3AgQWdlbnR1ciAvIEFwcCBBZ2VudHVyIFVsbSwgR2VybWFueQ==
Any ideas?
This is the C# code I am using for decoding:
string basestring = "R1NNQiBBZ2VuY3kgR21iSCAvIFdlYmRlc2lnbiBBZ2VudHVyIFVsbSAvIE9ubGluZXNob3AgQWdlbnR1ciAvIEFwcCBBZ2VudHVyIFVsbSwgR2VybWFueS==";
string output = Encoding.UTF8.GetString(Convert.FromBase64String(basestring));
return output;
And here's the encoding part
string basestring = "GSMB Agency GmbH / Webdesign Agentur Ulm / Onlineshop Agentur / App Agentur Ulm, Germany";
string output = Convert.ToBase64String(Encoding.UTF8.GetBytes(basestring));
return output;

This is actually an artefact of moving from 8-bit encoding (UTF8) to a 6-bit encoding (Base64).
As reference, here's the Base64 encoding table
We'll take the string "AB" as an example; A and B are characters 65 and 66 respectively. Written as 8-bit binary groups, 65 and 66 are 01000001 and 01000010.
Encoding
When encoding to Base64, the same bits of your string are separated into groups of 6 instead of 8. So the same 16-bit sequence above is split into 010000/010100/0010 (same bit pattern, just grouped differently).
Now, the first two groups are easy. You look up the encoding table linked above, and you'll see that 010000 = Q / 010100 = U. You then have the last group with only 4 bits instead of the expected 6. This is where things get interesting.
When encoding, the end is usually padded with zeroes to get to 6 bits. So your 0010 becomes 001000, which is I, and "AB" encoded in Base64 becomes "QUI=". The = carries no data; it is only there to make the number of characters a multiple of 4. Some decoders treat it as optional, but Convert.FromBase64String in .NET requires it.
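As a rough illustration of the regrouping described above, here is a minimal sketch that encodes "AB" by hand and compares it with the built-in encoder. EncodeByHand is a hypothetical helper written for this example (it omits the = padding); it is not how Convert.ToBase64String is implemented.

```csharp
using System;
using System.Linq;
using System.Text;

// Standard Base64 alphabet: index 0 is 'A', index 63 is '/'.
const string Alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Hypothetical helper: regroups the bits of the input into 6-bit groups.
// It does not append '=' padding.
string EncodeByHand(byte[] bytes)
{
    // Write every byte as 8 binary digits and concatenate them.
    string bits = string.Concat(bytes.Select(b => Convert.ToString(b, 2).PadLeft(8, '0')));

    var result = new StringBuilder();
    for (int i = 0; i < bits.Length; i += 6)
    {
        // Take up to 6 bits; a short final group is padded with zeroes on the right.
        string group = bits.Substring(i, Math.Min(6, bits.Length - i)).PadRight(6, '0');
        result.Append(Alphabet[Convert.ToInt32(group, 2)]);
    }
    return result.ToString();
}

string byHand = EncodeByHand(Encoding.UTF8.GetBytes("AB"));
string builtIn = Convert.ToBase64String(Encoding.UTF8.GetBytes("AB"));

Console.WriteLine(byHand);   // QUI
Console.WriteLine(builtIn);  // QUI=
```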
Decoding
Remember how your last group of 0010 was padded to become 6 bits? Here's the fun part: the padding bits don't have to be zeroes. The 16 bits (2x8) in your original string became 18 bits (3x6) because of the padding. Since 18 is not a multiple of 8, the decoder knows enough to drop the 2 excess bits. So the two padding bits could be anything, and the result will still decode properly.
0010 when padded could be 001000, 001001, 001010, or 001011, which translate to I, J, K, or L. Bring up a decoder and try decoding QUI=, QUJ=, QUK=, and QUL= (some decoders also accept them without the =). They will all decode to "AB", unless the decoder is a strict one that rejects non-zero padding bits.
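The claim above can be checked with a short sketch. It assumes .NET's Convert.FromBase64String, which requires the trailing = and, as the question itself shows, does not reject non-zero padding bits.

```csharp
using System;
using System.Text;

// Same first two characters; the third differs only in its two lowest (padding) bits.
string[] variants = { "QUI=", "QUJ=", "QUK=", "QUL=" };

foreach (string variant in variants)
{
    string decoded = Encoding.UTF8.GetString(Convert.FromBase64String(variant));
    Console.WriteLine($"{variant} -> {decoded}");
}
```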
Your string
Now, your string, when split into 6-bit groups, looks like the following (see fiddle):
var basestring = "GSMB Agency GmbH / Webdesign Agentur Ulm / Onlineshop Agentur / App Agentur Ulm, Germany";
var sixBitGroups = Encoding.UTF8.GetBytes(basestring)
    .SelectMany(b => $"{Convert.ToString(b, 2).PadLeft(8,'0')}")
    .Chunk(6)
    .Select(c => new string(c.ToArray()));
string.Join("/", sixBitGroups).Dump();
You'll notice that it ends with ../01. That 01 needs to be padded with 4 extra bits. Again, usually they're zeroes, making it 010000, which is Q. So you'll see your encoded string ends with ..FueQ==. But once you realise that they don't have to be all zeroes, you'll see in the table that 01xxxx covers the values 16 to 31, that is everything from Q, R, S up to e, f. This explains why your Base64 ..FueS== still decodes to the exact same string.
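A practical consequence is that two Base64 strings should be compared by their decoded bytes, not as text. The following sketch uses the two strings from the question and assumes the lenient behaviour of Convert.FromBase64String that the question already demonstrates; the variable names are made up for this example.

```csharp
using System;
using System.Linq;

string fromSource = "R1NNQiBBZ2VuY3kgR21iSCAvIFdlYmRlc2lnbiBBZ2VudHVyIFVsbSAvIE9ubGluZXNob3AgQWdlbnR1ciAvIEFwcCBBZ2VudHVyIFVsbSwgR2VybWFueS==";
string canonical  = "R1NNQiBBZ2VuY3kgR21iSCAvIFdlYmRlc2lnbiBBZ2VudHVyIFVsbSAvIE9ubGluZXNob3AgQWdlbnR1ciAvIEFwcCBBZ2VudHVyIFVsbSwgR2VybWFueQ==";

byte[] a = Convert.FromBase64String(fromSource);
byte[] b = Convert.FromBase64String(canonical);

// The text differs, but the decoded bytes are identical.
bool sameText = fromSource == canonical;
bool sameBytes = a.SequenceEqual(b);

// Decoding and re-encoding normalises the padding bits to zeroes.
string normalised = Convert.ToBase64String(a);

Console.WriteLine($"same text: {sameText}, same bytes: {sameBytes}");
Console.WriteLine(normalised == canonical);
```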

Related

Can you please explain how padleft works in C#

Can you please explain this code to me:
return string.Join(string.Empty, checkSum.Select(x => Convert.ToString(x, 2).PadLeft(8, '0')));
checkSum will have a binary value like 10101010100011.
I searched on Google but didn't find a clear explanation.
You can find the PadLeft documentation here:
https://learn.microsoft.com/en-us/dotnet/api/system.string.padleft
It works by "resizing" the string to the desired length by prepending spaces. If the string is already at least that long, it is returned unchanged. That same page includes this example:
string str = "BBQ and Slaw";
Console.WriteLine(str.PadLeft(15)); // Displays "   BBQ and Slaw".
Console.WriteLine(str.PadLeft(5)); // Displays "BBQ and Slaw".
In your code, Select maps each number in checkSum to its binary representation as a string (Convert.ToString(x, 2)), and PadLeft(8, '0') pads each of those strings on the left to 8 characters, using zeroes instead of spaces. string.Join then concatenates the padded strings.
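To make this concrete, here is a small sketch. The question does not show how checkSum is declared, so the byte array below is hypothetical sample data.

```csharp
using System;
using System.Linq;

// Hypothetical sample data; the real checkSum is not shown in the question.
byte[] checkSum = { 0xAA, 0x23, 0x05 };

// Without padding, leading zeroes are lost: 0x05 becomes "101".
string unpadded = Convert.ToString((byte)0x05, 2);

// With PadLeft(8, '0') every byte contributes exactly 8 binary digits.
string bits = string.Join(string.Empty,
    checkSum.Select(x => Convert.ToString(x, 2).PadLeft(8, '0')));

Console.WriteLine(unpadded);  // 101
Console.WriteLine(bits);      // 101010100010001100000101
```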

How does a bullet point become \u2022 in c# for JSON?

This isn't a duplicate of JSON and escaping characters - That doesn't answer this, because those code samples are in JavaScript - I need C#.
What C# method/library converts a bullet point (•) into \u2022? The same converter would convert a newline char into \n. Those are just 2 examples, but the overall solution I'm looking for is to pass in a string (containing a combination of ASCII and special chars), and it converts all that to the same ASCII, but with the special chars escaped. For example, I need the following string:
• 3 Ply 330 3/16in x 1/16in(#77)
• 25 ft Long X 22 in Wide
• 2022 (2) Beltwall Blk Standard 4in (102mm)
...converted to this:
\u2022 3 Ply 330 3/16in x 1/16in(#77)\n\u2022 25 ft Long X 22 in Wide\n\u2022 (2) Beltwall Blk Standard 4in (102mm)
...so it can become a valid JSON string value.
I have been down a dozen rabbit holes trying to find the answer to this, though I have no doubt it's something ridiculously simple.
You need to set which characters are escaped. If you are using Newtonsoft (comments indicate you are), then by default it will only escape control characters (newlines, etc.).
You can pass the option StringEscapeHandling.EscapeNonAscii to have it escape all non-ASCII characters as well.
public string EncodeNonAsciiCharacters(string value) {
    return JsonConvert.SerializeObject(value, Newtonsoft.Json.Formatting.None,
        new JsonSerializerSettings { StringEscapeHandling = StringEscapeHandling.EscapeNonAscii }
    );
}
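For readers who want to see what this kind of escaping amounts to, here is a dependency-free sketch. EscapeNonAscii is a hypothetical helper written for illustration; it is not Newtonsoft's implementation, and unlike SerializeObject it does not add the surrounding quotes.

```csharp
using System;
using System.Text;

// Hypothetical helper: escapes a string so it can sit inside a JSON string literal,
// writing every non-ASCII or control character as \uXXXX.
string EscapeNonAscii(string value)
{
    var sb = new StringBuilder();
    foreach (char c in value)
    {
        switch (c)
        {
            case '"':  sb.Append("\\\""); break;
            case '\\': sb.Append("\\\\"); break;
            case '\n': sb.Append("\\n"); break;
            case '\r': sb.Append("\\r"); break;
            case '\t': sb.Append("\\t"); break;
            default:
                if (c < 0x20 || c > 0x7E)
                    sb.Append("\\u").Append(((int)c).ToString("x4"));  // e.g. the bullet becomes \u2022
                else
                    sb.Append(c);
                break;
        }
    }
    return sb.ToString();
}

string input = "\u2022 3 Ply 330\n\u2022 25 ft Long";
string escaped = EscapeNonAscii(input);
Console.WriteLine(escaped);  // \u2022 3 Ply 330\n\u2022 25 ft Long
```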

Encoding string from reading email

I am using Gmail API to read emails from Gmail account.
In the body I am replacing some characters, which I read in the forums is necessary:
String codedBody = body.Replace("-", "+");
codedBody = codedBody.Replace("_", "/");
The problem is that when I try to convert it
byte[] data = Convert.FromBase64String(codedBody);
an exception is thrown for some emails:
System.FormatException: 'The input is not a valid Base-64 string as it contains a non-base 64 character, more than two padding characters, or an illegal character among the padding characters.'
The string which is coming from the request is:
"0J7QsdGP0LLQsDogSGVhbHRoY2FyZSBTZXJ2aWNlIFJlcHJlc2VudGF0aXZlIHdpdGggRHV0Y2gsIEdlcm1hbiANCiDQktCw0LbQvdC-ISDQnNC-0LvRjywg0L3QtSDQvtGC0LPQvtCy0LDRgNGP0LnRgtC1INC90LAg0YLQvtC30LggZW1haWwuICANCiAg0KLQvtC30LggZW1haWwg0LUg0LjQt9C_0YDQsNGC0LXQvSDQv9GA0LXQtyBqb2JzLmJnINC-0YIg0LjQvNC10YLQviDQvdCwINCa0YDQuNGB0YLQuNCw0L0g0JrRitC90LXQsiAg0JfQsCDQtNCwINGB0LUg0YHQstGK0YDQttC10YLQtSDRgSDQutCw0L3QtNC40LTQsNGC0LAg0YfRgNC10LcgZW1haWwg0LjQt9C_0L7Qu9C30LLQsNC50YLQtToga3Jpc3RpYW5fdG9uaUBhYnYuYmcgIA0KICDQodGK0L7QsdGJ0LXQvdC40LUg0L7RgiDQutCw0L3QtNC40LTQsNGC0LA6ICANCiAg0LHQu9Cw0LHQu9Cw0LHQu9Cw0LHQu9CwDQoNCg0KDQoNCg0KICA=PEhUTUw-PEJPRFk-DQrQntCx0Y_QstCwOiBIZWFsdGhjYXJlIFNlcnZpY2UgUmVwcmVzZW50YXRpdmUgd2l0aCBEdXRjaCwgR2VybWFuPGRpdj48YnI-PGRpdj7QktCw0LbQvdC-ISDQnNC-0LvRjywg0L3QtSDQvtGC0LPQvtCy0LDRgNGP0LnRgtC1INC90LAg0YLQvtC30LggZW1haWwuPC9kaXY-PGRpdj48YnI-PC9kaXY-PGRpdj7QotC-0LfQuCBlbWFpbCDQtSDQuNC30L_RgNCw0YLQtdC9INC_0YDQtdC3IGpvYnMuYmcg0L7RgiDQuNC80LXRgtC-INC90LAg0JrRgNC40YHRgtC40LDQvSDQmtGK0L3QtdCyPC9kaXY-PGRpdj7Ql9CwINC00LAg0YHQtSDRgdCy0YrRgNC20LXRgtC1INGBINC60LDQvdC00LjQtNCw0YLQsCDRh9GA0LXQtyBlbWFpbCDQuNC30L_QvtC70LfQstCw0LnRgtC1OiBrcmlzdGlhbl90b25pQGFidi5iZzwvZGl2PjxkaXY-PGJyPjwvZGl2PjxkaXY-0KHRitC-0LHRidC10L3QuNC1INC-0YIg0LrQsNC90LTQuNC00LDRgtCwOjwvZGl2PjxkaXY-PGJyPjwvZGl2PjxkaXY-0LHQu9Cw0LHQu9Cw0LHQu9Cw0LHQu9CwPGJyPjxicj48YnI-PGJyPjxicj48YnI-PC9kaXY-PC9kaXY-PC9CT0RZPjwvSFRNTD4NCg=="
What is causing this problem?
Your source Base64 string is not valid. It contains a padding character = at position 604 in the middle of the string.
It appears as if you have two valid Base64 strings that have been concatenated together. Go back to your source and ensure that you're collecting them correctly.
The source has to provide some detail on this, as Base64 itself provides no means to determine whether you have two values joined like this. If the first source byte array had had a length which was a multiple of 3, there would have been no padding character in the middle, and the combined string would have decoded successfully and given garbage.
For what it's worth, replacing those characters appears to be correct, as there is no de facto standard for which two symbol characters are used in Base64. However, make sure you've got them the right way around.
Update
Having investigated further (learning is fun) there is a defined Base64 standard, which defines two separate Base64 encodings.
The Base 64 Alphabet defines + and / for the two symbols, and = for the padding character.
The same RFC also specifies The "URL and Filename safe" Base 64 Alphabet which uses - and _ for the two symbols, and = (or %3D) for the padding character.
It appears your source data uses the "URL and Filename safe" format, while FromBase64String() only accepts the normal format. Therefore you are quite correct to replace - with + and _ with / to convert from one to the other.
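Putting the two points together, a minimal sketch might look like the following. DecodeBase64Url is a hypothetical helper, and the assumption that the encoded parts are available separately (rather than already concatenated) is mine; how you obtain them depends on how you read the message.

```csharp
using System;
using System.Text;

// Hypothetical helper: converts the "URL and Filename safe" alphabet to the
// standard one and decodes the result.
byte[] DecodeBase64Url(string input)
{
    string standard = input.Replace('-', '+').Replace('_', '/');

    // Convert.FromBase64String needs a length that is a multiple of 4,
    // so restore any '=' padding that was left off.
    int remainder = standard.Length % 4;
    if (remainder != 0)
        standard = standard.PadRight(standard.Length + (4 - remainder), '=');

    return Convert.FromBase64String(standard);
}

// Decode each encoded part on its own instead of concatenating the encoded text first.
string[] parts = { "PEhUTUw-", "QUI" };
foreach (string part in parts)
    Console.WriteLine(Encoding.UTF8.GetString(DecodeBase64Url(part)));
```

With these sample inputs the first part decodes to `<HTML>` and the second to `AB`.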

Line Of Hex Into ASCII

I am trying to turn a string of hex like
4100200062006C0061006E006B002000630061006E0076006100730020007200650063006F006D006D0065006E00640065006400200066006F007200200046006F007200670065002000650064006900740069006E00670020006F006E006C0079002E
into ASCII so that it looks like:
A blank canvas recommended for Forge editing only.
The variable for the hex is collected from a file that I opened into the program, reading a specific address like so:
BinaryReader br = new BinaryReader(File.OpenRead(ofd.FileName));
string mapdesc = null;
for (int i = 0x1C1; i <= 0x2EF; i++)
{
br.BaseStream.Position = i;
mapdesc += br.ReadByte().ToString("X2");
}
richTextBox1.Text = ("" + mapdesc);
Now that I have the mapdesc, I made it print into the richtextbox, and it just looked like a line of hex. I wanted it to look like readable ASCII.
In Hex Editor, the other side reading in ANSI looks like
A. .b.l.a.n.k. .c.a.n.v.a.s. .r.e.c.o.m.m.e.n.d.e.d. .f.o.r. .F.o.r.g.e. .e.d.i.t.i.n.g. .o.n.l.y
The dots are 00s in the hex view, so I believe with the ASCII format, they should be nothing so that I get the readable sentence which is how the game reads it. What would I have to do to convert mapdesc into ASCII?
To be fair, the output matches the decoded data exactly; the issue is actually with the input data.
If you look closely, you will notice that every other pair of characters is 00. Using some simple heuristics, we can determine that we have 16-bit words here, 4 hex chars each.
The problem that you are facing, and the reason for the . characters, is that when this is decoded with a single-byte encoding (ASCII, ANSI, or UTF-8), every other character is a null.
You have two options to solve this:
To continue decoding in UTF-8, remove the null bytes from the data, i.e. all the 00s.
Or
Decode as UTF-16.
If you choose this option, mind the alignment and byte order of your data. Read as big-endian UTF-16, the very first word is only 8 bits, which shifts every other byte; to fix that, prepend an additional 00 at the beginning of the data blob (or start your loop one position sooner). Read as little-endian UTF-16 (Encoding.Unicode in .NET), the data is already aligned (41 00 is A), and it is the final 2E that is missing its trailing 00.
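Here is a minimal sketch of the UTF-16 option, using the little-endian reading rather than the prepend-00 variant. HexToBytes and DecodeUtf16Le are hypothetical helpers; in the asker's program the bytes could be decoded directly after reading them, without going through a hex string at all.

```csharp
using System;
using System.Text;

// Hypothetical helper: turns a hex string such as "4100" into the bytes { 0x41, 0x00 }.
byte[] HexToBytes(string hex)
{
    var bytes = new byte[hex.Length / 2];
    for (int i = 0; i < bytes.Length; i++)
        bytes[i] = Convert.ToByte(hex.Substring(i * 2, 2), 16);
    return bytes;
}

// Hypothetical helper: decodes the bytes as little-endian UTF-16.
string DecodeUtf16Le(byte[] bytes)
{
    // If the range stopped one byte short, supply the missing high byte
    // (assumed here to be 00) so the last character is complete.
    if (bytes.Length % 2 != 0)
        Array.Resize(ref bytes, bytes.Length + 1);
    return Encoding.Unicode.GetString(bytes);
}

string text = DecodeUtf16Le(HexToBytes("4100200062006C0061006E006B00"));
Console.WriteLine(text);  // A blank
```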

How to retrieve the unicode decimal representation of the chars in a string containing hindi text?

I am using Visual Studio 2010 with C# to convert text into Unicode values. For example, I have a string abc = "मेरा".
There are 4 characters in this string, and I need the Unicode value of all four.
Please help me.
When you write code like string abc = "मेरा";, you already have it as Unicode (specifically, UTF-16), so you don't have to convert anything. If you want to access the individual characters, you can do that using a normal index: e.g. abc[1] is े (DEVANAGARI VOWEL SIGN E).
If you want to see the numeric representations of those characters, just cast them to integers. For example
abc.Select(c => (int)c)
gives the sequence of numbers 2350, 2375, 2352, 2366. If you want to see the hexadecimal representation of those numbers, use ToString():
abc.Select(c => ((int)c).ToString("x4"))
returns the sequence of strings "092e", "0947", "0930", "093e".
Note that when I said numeric representations, I actually meant their encoding using UTF-16. For characters in the Basic Multilingual Plane, this is the same as their Unicode code point. The vast majority of used characters lie in BMP, including those 4 Hindi characters presented here.
If you wanted to handle characters in other planes too, you could use code like the following.
byte[] bytes = Encoding.UTF32.GetBytes(abc);
int codePointCount = bytes.Length / 4;
int[] codePoints = new int[codePointCount];
for (int i = 0; i < codePointCount; i++)
codePoints[i] = BitConverter.ToInt32(bytes, i * 4);
Since UTF-32 encodes all (21-bit) code points directly, this will give you them. (Maybe there is a more straightforward solution, but I haven't found one.)
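One possibly more direct route is char.ConvertToUtf32, which reads a code point straight from the UTF-16 string. The sketch below is an alternative to the UTF-32 byte approach above, not a replacement endorsed by the original answer; GetCodePoints is a hypothetical helper and assumes the string is well-formed UTF-16 (no lone surrogates).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: returns the Unicode code points of a well-formed UTF-16 string.
List<int> GetCodePoints(string text)
{
    var codePoints = new List<int>();
    for (int i = 0; i < text.Length; i++)
    {
        codePoints.Add(char.ConvertToUtf32(text, i));

        // A character outside the BMP occupies two chars (a surrogate pair);
        // skip the second half so it is not processed again.
        if (char.IsHighSurrogate(text[i]))
            i++;
    }
    return codePoints;
}

// The four Devanagari characters from the question, written as escapes.
Console.WriteLine(string.Join(", ", GetCodePoints("\u092E\u0947\u0930\u093E")));  // 2350, 2375, 2352, 2366

// U+1F600 lies outside the BMP and is stored as the surrogate pair D83D DE00.
Console.WriteLine(string.Join(", ", GetCodePoints("\uD83D\uDE00")));  // 128512
```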
Since a .NET char is a UTF-16 code unit (which equals the Unicode code point for characters in the BMP), you can simply enumerate all characters in a string:
var abc = "मेरा";
foreach (var c in abc)
{
Console.WriteLine((int)c);
}
resulting in
2350
2375
2352
2366
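Use
System.Text.Encoding.UTF8.GetBytes(abc)
to get the UTF-8 encoded bytes of the string. Note that these are not the code point values themselves: each of these Devanagari characters takes three bytes in UTF-8.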
If you are trying to convert files from a legacy encoding into Unicode:
Read the file, supplying the correct encoding of the source files, then write the file using the desired Unicode encoding scheme.
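// "x-iscii-de" is the .NET name for ISCII Devanagari (code page 57002); use whichever encoding the source file really has.
// On .NET Core / .NET 5+, legacy code pages require registering CodePagesEncodingProvider first.
using (StreamReader reader = new StreamReader(@"C:\MyFile.txt", Encoding.GetEncoding("x-iscii-de")))
using (StreamWriter writer = new StreamWriter(@"C:\MyConvertedFile.txt", false, Encoding.UTF8))
{
    writer.Write(reader.ReadToEnd());
}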
If you are looking for a mapping of Devanagari characters to the Unicode code points:
You can find the chart at the Unicode Consortium website here.
Note that Unicode code points are traditionally written in hexadecimal. So rather than the decimal number 2350, the code point would be written as U+092E, and it appears as 092E on the code chart.
If you have the string s = "मेरा" then you already have the answer.
This string contains four code points in the BMP which in UTF-16 are represented by 8 bytes. You can access them by index with s[i], with a foreach loop etc.
If you want the underlying 8 bytes you can access them as so:
string str = "मेरा";
byte[] arr = System.Text.Encoding.Unicode.GetBytes(str);
