All,
So I'm uploading a text file from C# to an IBM MVS mainframe. The file is converted to EBCDIC using C# libraries, and that part works well: I can read the data on the mainframe. The problem is the new lines. The text file has 10 rows of data, and when I view it in the mainframe environment all the data is present. But there are no line breaks, because each new line from the text file arrives as 0D25, which is CRLF in EBCDIC. This pair appears as .. on screen.
I don't want the 2 dots that have the hex reading of 0D25, because I need the data to actually be placed on the next line as it is in the text file. The file is variable block length once on the mainframe, by the way. How can I achieve the same formatting as the text file when viewing the uploaded file on MVS?
example:
TEXT FILE VIEW
12345
23456
12346
IBM MAINFRAME VIEW
12345..23456..12346
or if block length has been reached..
12345..2345
6..12346
Thanks
If you're doing the ASCII-EBCDIC translation outside of the FTP transfer process, I have to assume that you're transferring in binary mode (otherwise the translation would be done again and your data would be bad).
If that is the case, then I'm pretty certain you're responsible yourself for the conversion of line endings as well. Binary transfers will not attempt to convert line endings. You'll need to pad out the lines to the desired lengths and remove the line endings altogether, before sending it up to the host.
By way of example, if you transfer this file:
12345
67890
up in binary mode using literal site recfm=vb, you'll get the following (shown in ISPF editor with hex on):
000001
3333300333330044444
12345DA67890DA00000
--------------------------
You can see it's just transferred the bytes as-is, including the CR/LF. If you switch to ASCII mode in FTP and upload again, you get:
000001 12345
FFFFF44444444
1234500000000
--------------------
000002 67890
FFFFF44444444
6789000000000
--------------------
Here, the characters have been converted to the right EBCDIC code points and the line endings have been morphed into padding with EBCDIC spaces.
I suppose my first question to you would be: "Why are you doing the translation outside of FTP?"
IBM invests quite a lot of money in ensuring that it will accept all sorts of different encodings and translate them into the correct code page. It's very unlikely that a stand-alone solution will work on all the internationalised versions of z/OS as well as IBM's own.
If you must convert on the client and transfer in binary mode, you'll either have to have the client do the line ending conversion and padding as well or post-process the file after the transfer, such as with a REXX script.
If you don't know what the properties of the target data set will be (such as if you're transferring into a member in a PDS), the latter option may be the only viable one.
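The client-side option (pad each line and drop the line endings before a binary upload) could look roughly like the sketch below. It assumes a fixed-length target data set and a single-byte target encoding; the class name, method name, and the record length are hypothetical and not from the original post.

```csharp
using System;
using System.IO;
using System.Text;

public static class RecordPadder
{
    // Splits text on CRLF or LF, pads every non-empty line with spaces to
    // recordLength characters, and encodes the result with the given encoding.
    // No line-ending bytes are written to the output.
    // Assumption: 'target' is a single-byte encoding, so characters == bytes.
    public static byte[] ToFixedRecords(string text, int recordLength, Encoding target)
    {
        string[] lines = text.Split(new[] { "\r\n", "\n" }, StringSplitOptions.None);
        using (var output = new MemoryStream())
        {
            foreach (string line in lines)
            {
                // Blank lines are skipped; this also drops the empty entry
                // that Split produces after a trailing newline.
                if (line.Length == 0) continue;
                if (line.Length > recordLength)
                    throw new ArgumentException("Line longer than record length: " + line);

                byte[] record = target.GetBytes(line.PadRight(recordLength, ' '));
                output.Write(record, 0, record.Length);
            }
            return output.ToArray();
        }
    }
}
```

You would pass in the same EBCDIC Encoding object you already use for the conversion, and a record length that matches the LRECL of the target data set.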
I am trying to read a text file and write it to a new text file. The input file could be ANSI or UTF-8. I don't care what the output encoding is, but I want to preserve all characters when writing. How do I do this? Do I need to detect the input file's encoding (seems like a lot of work)?
The following code reads an ANSI file and writes the output as UTF-8, but there are some gibberish characters "�".
I am looking for a way to read the file no matter which of the 2 encoding and write it correctly without knowing the encoding of input file before hand.
File.WriteAllText(outputfile, File.ReadAllText(inputfilepath + @"\ST60_0.csv"));
Note that this batch command reads a UTF-8 and ANSI file and writes the output as ANSI with all chars preserved so I'm looking to do this but in C#:
type ST60_0.csv inputUTF.csv > outputBASH.txt
Q: The following code reads ANSI file and writes output as UTF-8 but
there is some gibberish characters "�".
A: It would definitely be useful to see the hex values of some of these "gibberish" characters. Perhaps you could install a Hex plugin to Notepad++ and tell us?
Q: It blows my mind its so hard to do something in C# that command
prompt can do easy
A: Typically, it IS easy. There seems to be "something special" written into this particular file.
The difference between C# and other, "simpler" approaches is that C# (unlike C character I/O or .bat files) gives you the flexibility to deal with text that doesn't happen to be "standard ASCII".
ANYWAY:
If the "�" you posted (hex 0xefbfbd) is a valid example of your actual text, this might explain what's going on:
https://stackoverflow.com/a/25510366/421195
... %EF%BF%BD is the url-encoded version of the hex representation of
the 3 bytes (EF BF BD) of the UTF-8 replacement character.
See also:
https://en.wikipedia.org/wiki/Specials_(Unicode_block)
The Replacement character � (often displayed as a black rhombus with a
white question mark) is a symbol found in the Unicode standard at code
point U+FFFD in the Specials table. It is used to indicate problems
when a system is unable to render a stream of data to a correct
symbol.[4] It is usually seen when the data is invalid and does not
match any character
You might also be interested in this:
https://learn.microsoft.com/en-us/dotnet/standard/base-types/character-encoding
Best-Fit Fallback When a character does not have an exact match in the target encoding, the encoder can try to map it to a similar
character.
UPDATE:
The offending character was "»", hex 0xc2bb. This is a "Right Angle Quote", a Guillemet. Angle quotes are the quotation marks used in certain languages with an otherwise roman alphabet, such as French.
One possible solution is to specify "iso-8859-1", vs. the default encoding "UTF-8":
File.WriteAllText(outputfile, File.ReadAllText(inputfilepath + @"\ST60_0.csv", System.Text.Encoding.GetEncoding("iso-8859-1")));
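If the input really can be either UTF-8 or ANSI, one possible approach is to try a strict UTF-8 decode first and only fall back to a single-byte encoding when the bytes are not valid UTF-8. This is a sketch, not a general encoding detector; the class and method names are made up for illustration, and "iso-8859-1" as the fallback is an assumption carried over from the line above.

```csharp
using System.IO;
using System.Text;

public static class TextLoader
{
    // Tries strict UTF-8 first; if the bytes are not valid UTF-8,
    // decodes them with the supplied fallback encoding instead.
    public static string DecodeUtf8OrFallback(byte[] bytes, Encoding fallback)
    {
        // Second argument (throwOnInvalidBytes: true) makes invalid input
        // raise an exception instead of silently producing U+FFFD.
        var strictUtf8 = new UTF8Encoding(false, true);
        try
        {
            // GetString does not strip a byte order mark, so remove it here.
            return strictUtf8.GetString(bytes).TrimStart('\uFEFF');
        }
        catch (DecoderFallbackException)
        {
            return fallback.GetString(bytes);
        }
    }

    // Convenience wrapper: reads a file that is either UTF-8 or ISO-8859-1.
    public static string ReadAnsiOrUtf8(string path)
    {
        return DecodeUtf8OrFallback(File.ReadAllBytes(path),
                                    Encoding.GetEncoding("iso-8859-1"));
    }
}
```

The result can then be written out with File.WriteAllText as before. Pure ASCII files decode the same either way, so the fallback only matters for files that contain bytes above 0x7F.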
I am trying to make a FXB file previewer (VST preset banks for those who don't know) for Sylenth1 banks. I have encoded the FXB as an ASCII string and had it print to the console. The preset names show up fine. My issue is that the parameters for the oscillators, filters and effects are encoded as random characters (mainly "?" and fairly big spaces).
Underlined in red: file header (?)
Underlined in blue: preset name (which I want to keep)
Underlined in yellow: osc/FX/filter parameters (which I want to discard from the string)
Here's the code I wrote:
byte[] arr = File.ReadAllBytes(Properties.Resources.pointer); /* pointer is a string in resources I
used to point to the external FXB file for testing */
System.Text.ASCIIEncoding enc = new System.Text.ASCIIEncoding();
string fstr = enc.GetString(arr);
Console.Write(fstr);
Console.ReadKey();
I had written a foreach loop to replace every unwanted character with string.Empty, but it also removes parts of the preset names (e.g. the L from "Lead"), leaves the spaces intact and creates new ones, so I deleted it.
My end goal for those that are curious is this:
Preset 1
Preset 2
Preset 3
Preset 4
...
I'm at a total loss. I've tried different solutions from various websites and Stack Overflow posts, but none gave me the desired result.
(I also noticed that the preset names have almost the same space between them (~ 200 chars apart), can I use the difference to exclude the unwanted parts?)
It looks like a binary file, not ASCII. Some data in the file is easily readable because it is ASCII encoded, but other data, for example numbers, is stored in binary format.
Not all binary data can be converted to printable ASCII characters, so when you print it out like this you get the ???? mess.
It is better to read this file using a binary editor. Visual studio has one, there is probably an extension for vs code, other editors have a binary viewer (e.g. sublime). This will show you data in the file as it is encoded, usually using hex with the ascii in a second column.
But that is just so you can accurately see the content. It does not help you understand the meaning or the layout. You might be able to make something work by reverse engineering like this, but chances are it will not work for all cases. Using an API is going to be way easier.
I'm not familiar with these files but did you find this? https://new.steinberg.net/developers/ There is a forum there that might help.
I found the answer to this myself. I basically somewhat reverse engineered the FXB in a hex editor, and proceeded to load specific bytes of the file (31 to be exact) in order to encode those in a string and have that print to the console.
I managed to do so by literally counting how many bytes there are from the beginning to the 1st preset name, then from the end of the preset name (31 bytes) to the beginning of the other preset name, and so on.
For those who are interested, I am going to develop a GUI version of it in the future. But it does (and probably will) support only Sylenth1 v2 soundbanks/FXBs.
Also thanks to the people who reached out. They helped in their own way.
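For anyone trying the same thing, the fixed-offset approach described above might be sketched like this. The offset of the first name and the distance between names are parameters here because the post does not give them; they are placeholders you would have to measure in a hex editor, and the class and method names are hypothetical. Treating the name field as NUL-padded is also an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class PresetNames
{
    // Reads a name of nameLength bytes at firstOffset, then at every
    // 'stride' bytes after that, until the data runs out.
    public static List<string> Extract(byte[] data, int firstOffset, int stride, int nameLength)
    {
        if (stride <= 0) throw new ArgumentOutOfRangeException("stride");

        var names = new List<string>();
        for (int pos = firstOffset; pos + nameLength <= data.Length; pos += stride)
        {
            string raw = Encoding.ASCII.GetString(data, pos, nameLength);
            // Assumption: names shorter than the field are padded with NUL bytes.
            int nul = raw.IndexOf('\0');
            names.Add((nul >= 0 ? raw.Substring(0, nul) : raw).TrimEnd());
        }
        return names;
    }
}
```

With the 31-byte name field mentioned above, a call would look like PresetNames.Extract(bytes, firstOffset, stride, 31), where firstOffset and stride come from counting bytes in the hex editor.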
I've seen questions where the two characters are the same, but nothing that relates to this specific question, so here goes.
I'm running a C# console app that reads an input file made up of variable-length records. Each record consists of variable-length fields. I've got everything working in terms of parsing out each individual field within each record, not a problem. Except that today I came across the Ã± character in the input file. Now I know this translates to ñ, so I'm ok with it. However, because the input file sees Ã± as 2 characters, the record length changes in the C# app because the app is interpreting those 2 characters as a single ñ. This is causing my record length to change from 154 characters to 153, which then messes up the individual fields during parsing.
I'm ok with the ñ character getting stored in my DB. But my question is this.
Prior to parsing the fields out of the record, how can I go about easily (without checking every single character) detecting that the Ã± exists and trigger a change in the parsing logic? Should I simply do an IndexOf on the character and code it that way? I would think that would add a bit of overhead if I had to put that logic on every single field, although it seems like the easiest way. I would think there's a better way to handle it overall, but I've not encountered this before. Most of the posts I have found are about handling the ñ character in text, as opposed to text being converted (properly) from Ã± to ñ.
Ideas?
the streamreader open I am using is as follows:
System.IO.StreamReader concatenatedFile = new System.IO.StreamReader(@"c:\Testing\test.txt", System.Text.Encoding.UTF8);
The record length changes from 154 characters on the input to 153 interpreted characters.
You must always read a text file in the encoding it was written in. Of course, sometimes you don't know which encoding that was...
Think of the input file as a stream of bytes. Most are 1-byte-1-ASCII-character, but there are 2 bytes (probably) that can be interpreted differently depending on encoding:
UTF8 - 1 character, ñ
(some other encoding) - 2 characters, Ã±
Since you say "the input file sees Ã± as 2 characters", this would probably be the encoding intended by whoever produces the file.
So, you should find out which encoding was originally meant, and use that - it's probably some ANSI encoding. You could try System.Text.Encoding.Default, but beware that this changes on different machines, so your code will now depend on the machine's default encoding.
You should set the StreamReader you use to read your input file to UTF-8 encoding. I don't believe for a second the original input was meant to be Ã±, so why do you care how many bytes the original input was? You care about character length, right?
Refer to this article to understand what's what in text encoding: http://www.joelonsoftware.com/articles/Unicode.html .
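To see why the length moves from 154 to 153, it can help to decode the same bytes both ways and compare the character counts. This is only an illustration; the helper name is made up, and ISO-8859-1 is used as a stand-in for whatever single-byte encoding the file producer had in mind.

```csharp
using System.Text;

public static class EncodingLengthDemo
{
    // Returns how many characters the given bytes decode to
    // under the given encoding.
    public static int CountChars(byte[] bytes, Encoding encoding)
    {
        return encoding.GetString(bytes).Length;
    }
}
```

For the bytes 61 C3 B1 (the letter a followed by the two bytes in question), CountChars returns 2 with Encoding.UTF8 ("añ") and 3 with Encoding.GetEncoding("iso-8859-1") ("aÃ±"). If the record layout was defined in bytes, the single-byte reading keeps the field offsets intact; if it was defined in characters, the UTF-8 reading does.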
I've written a quick-and-dirty utility to parse a text file, but in some cases it's writing out a "�" character. My utility reads from a .txt file which contains "records" in this format:
Biography
Title:George F. Kennan: An American Life
Author:John Lewis Gaddis
Kindle: B0054TVO1G
Hardcover: B007R93I1U
Paperback: 0143122150
Image link: <img src="http://images.amazon.com/images/P/B0054TVO1G.01.MZZZZZZZ.jpg" alt="Book Cover" />
...and writes out lines from that to a CSV file such as:
Biography,"George F. Kennan: An American Life","John Lewis Gaddis",B0054TVO1G,B007R93I1U,0143122150,<img src="http://images.amazon.com/images/P/B0054TVO1G.01.MZZZZZZZ.jpg" alt="Book Cover" />
...but in several cases, as mentioned, that weird character is appending itself to an author's name. In most cases where this is happening, it's what appears to be a space character in the .txt file. I'm trimming the author's name prior to writing it out to the CSV file, so it's obviously not being seen as a space, though.
When I save the text file with these characters, I get the message about non-unicode characters, etc.
What could be the cause of that? And better yet, how can I delete them with a search and replace operation? In Notepad, they are not found, so I have to delete them one-by-one.
Prior to being in the .txt file, this data was in an Open Office/.odt file, if that means anything to anyone.
BTW, I have no idea how that "stackoverflow" got into the href above; it's not in the original text I pasted in...
UPDATE
I am curious how that character got in my files. I sure didn't put it there (deliberately), any more than I added the "stackoverflow" to the URL above. Could it be that a call to Environment.Newline would add that?
Here was my process:
1) Copy and paste info from the interwebs into an Open Office/.odt file
2) Copy and paste that into a text (Notepad) file
3) Open that text file programmatically and loop through it, writing to a new "csv"/.txt file.
UPDATE 2
Silly me - all I had to do was save the file (which wouldn't save those weird characters), then open it again. IOW, when I opened it today (at home, after work) those were gone.
UPDATE 3
I wrote too soon - it replaced the weird character with a question mark (a "normal" one, not a stylized one).
They are almost certainly non-breaking spaces, U+00A0 (although there are other fixed-width space characters which are also possible.) These won't be trimmed as spaces, but will be rendered as spaces if the encoding of the file matches the encoding of the output device.
My guess is that your text file is in CP-1252 (i.e., Windows default one-byte coding) but your output is being rendered as though it were UTF-8.
Normally you would type these characters as AltGr+Space. You might try that with Notepad, but no guarantees.
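If that guess is right, one way to clean the data in code is to decode the file with a single-byte encoding and turn the non-breaking spaces into ordinary ones before trimming. The sketch below uses ISO-8859-1 as a stand-in for CP-1252 (both map byte 0xA0 to U+00A0); the class and method names are hypothetical.

```csharp
using System.Text;

public static class SpaceCleaner
{
    // Decodes the bytes as ISO-8859-1, replaces non-breaking spaces
    // (U+00A0) with ordinary spaces, then trims both ends.
    public static string DecodeAndClean(byte[] fieldBytes)
    {
        string text = Encoding.GetEncoding("iso-8859-1").GetString(fieldBytes);
        return text.Replace('\u00A0', ' ').Trim();
    }
}
```

Decoding the same lone 0xA0 byte as UTF-8 yields the replacement character U+FFFD instead, which would match the "�" showing up after the author's name.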
I am trying to parse some RTF, that i get back from the server. For most text i get back this works fine (and using a RichTextBox control will do the job), however some of the RTF seems to contain an additional "encoding" and some of the characters get corrupted.
The original string is as follows (and contains some of the characters used in Polish):
ąćęłńóśźż
The RTF string with hex-encoded characters that is sent back looks like this:
{\lang1045\langfe1045\f16383 {\'b9\'e6\'ea\'b3{\f7 \'a8\'bd\'a8\'ae}\'9c\'9f\'bf}}
I am having problems decoding the ńó characters in the returned string, they seem to be represented by two hex values each, whereas the rest of the string is represented (as expected) by single hex values.
Using a RichTextBox control to "parse" the RTF results in corrupted text (the two characters in question are displayed as four different unwanted characters).
If i would encode the plain string myself to hex using the expected codepage (1250, Latin 2, the ANSI codepage for lcid 1045) i would get the following:
\'B9\'E6\'EA\'B3\'F1\'F3\'9C\'9F\'BF
I am lost as to how i can correctly decode the {\f7 \'a8\'bd\'a8\'ae} part of the returned string that should correspond to ńó.
Note that there is no font definition for \f7 in the RTF header and the string looks fine when viewed directly on the server meaning that the characters (if they are corrupted) are corrupted somewhere in the conversion before sending.
I am not sure if the problem is on the server side (as i have no control over that), but since the server is used for a lot of translation work i assume that the returned string is ok.
I have been going through the RTF specs but can not find any hint regarding this type of combination of encodings.
I don't know why it's happening, but the encoding appears to be GBK (or something sufficiently similar).
Perhaps the server tries to do some "clever" matching to find the characters, or the server's default character encoding is GBK or so, and those characters (and only those) also occur in GBK so it prefers that.
I found out by adding the offending hex codes (A8 BD A8 AE) as bytes into a simple HTML file, so I could go through my browser's encodings and see if anything matched:
<html><body>¨½¨®</body></html>
To my surprise, my browser came up with "ńó" straight away.
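The same check can be done in C# rather than in a browser. This is a rough sketch that assumes code page 936 (the Windows code page that covers GBK) is available; on .NET Core 3.0 and later that means registering CodePagesEncodingProvider first, as done here. The class and method names are made up for illustration.

```csharp
using System.Text;

public static class GbkCheck
{
    // Decodes the given bytes using Windows code page 936 (GBK).
    public static string DecodeGbk(byte[] bytes)
    {
        // Code-page encodings are not registered by default on
        // .NET Core / .NET 5+, so register the provider before use.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(936).GetString(bytes);
    }
}
```

Calling GbkCheck.DecodeGbk(new byte[] { 0xA8, 0xBD, 0xA8, 0xAE }) should give back the two characters "ńó", in line with what the browser showed.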