Using StreamReader and StreamWriter for a TGZ file copied from Solaris - C#

We have a very old file delivery application (IPGear, if you have heard of it; written in Tcl). We upload our IP files there and our customers download them from the system.
When you upload a file to this application, it adds an .RCA extension to the uploaded file and prepends some metadata. If we view the content of any file in a text editor (usually TGZ, PDF, and text files), we see the metadata the application added at the top of the file (5-10 lines, readable).
If you download a file from the system, it somehow strips this metadata and returns a TGZ file, which works fine (we can extract it).
If we find the RCA file on the storage where this application keeps its files and remove the added metadata in a text editor, we can extract the file without any problem, which is fine too. But we need to do this for 22K files, so we need to script it.
We are able to find the bytes the application adds by opening the file via StreamReader, strip the metadata, and write the file back to disk via StreamWriter. However, the file we write is corrupted if it is a TGZ file; if we do the same thing with text files, they work.
The content of the TGZ file looks like the screenshot below when opened in a text editor; lines 29-38 of that screenshot are the metadata we strip.
It looks like StreamReader is not able to write this content back to disk, even though we tried different encoding settings.
One other note: the file we are trying to read and write was copied from a Solaris-based server to a local machine (Windows 7) via WinSCP.
So, my question is: what is the best way to read a TGZ file into memory (as text) for manipulation, and save it back without corruption? Are StreamReader and StreamWriter not suitable for this purpose?
I tried to give as much information as I can; please add comments if you need more clarification.

"It looks like StreamReader is not able to write this content back to disk, even though we tried different encoding settings."
Yes, because a TGZ file isn't plain text. StreamReader and StreamWriter are for text content, not arbitrary binary content.
"So, my question is: what is the best way to read a TGZ file into memory (as text)?"
You don't. You read it as binary data, because it is binary data.
If the TGZ archive contains text files, you'll need to decompress the TGZ to the TAR format, then extract the relevant data from that. Then you can work with it as text. Before that point, it's just binary data.
But it sounds like you may actually just want to read the text information that comes before the TGZ data... in which case you need to work out where that text ends, and not read any of the TGZ content as text (because it isn't text). This is non-trivial, but if you know the text is ASCII it'll be a bit easier. You will need to work out how to detect the end of the text and the start of the real content, though, and we can't really tell that from the screenshot you've given.
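For illustration, a minimal sketch of the binary approach (the file names are placeholders, and it assumes the metadata is plain text sitting in front of the compressed data): since a .tgz is a gzip stream, one workable heuristic is to locate the first occurrence of the gzip magic bytes (1F 8B, followed by the deflate method byte 08) and copy everything from there onwards.

using System;
using System.IO;

class RcaStripper
{
    static void Main()
    {
        // Read the whole file as raw bytes - no text decoding anywhere.
        byte[] data = File.ReadAllBytes("input.tgz.RCA"); // placeholder name

        // Find the first gzip member header: 0x1F 0x8B, deflate method 0x08.
        int start = -1;
        for (int i = 0; i <= data.Length - 3; i++)
        {
            if (data[i] == 0x1F && data[i + 1] == 0x8B && data[i + 2] == 0x08)
            {
                start = i;
                break;
            }
        }
        if (start < 0)
            throw new InvalidDataException("No gzip header found.");

        // Write the compressed payload back untouched.
        using (FileStream output = File.Create("output.tgz"))
            output.Write(data, start, data.Length - start);
    }
}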

Related

C# Multiple files and file streams

This might not be the best place to ask. I don't have a problem with some code; rather, I'm looking for a code idea.
I want to be able to scan a file to see whether it contains multiple files within it, hidden or not.
For example: take a movie with an .mp4 extension; that movie has a video stream and an audio stream, and/or an .srt file embedded. You can also hide a zip file behind a jpeg file using the standard cmd command line.
So I want to be able to scan a file for those multiple hidden files/streams inside. Is there a way to do this, and can anyone point me to packages, code snippets, or websites?
So far I haven't found anything because I don't know what to google for.
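As a starting point, a hypothetical sketch (not a complete detector): payloads appended with the classic "copy /b image.jpg + hidden.zip" trick can often be found by scanning for known file signatures past offset 0, such as the ZIP local-file-header magic PK\x03\x04. The input file name and the choice of signature below are illustrative.

using System;
using System.IO;

class EmbeddedFileScanner
{
    // ZIP local file header magic: "PK\x03\x04".
    static readonly byte[] ZipMagic = { 0x50, 0x4B, 0x03, 0x04 };

    static void Main()
    {
        byte[] data = File.ReadAllBytes("movie.mp4"); // illustrative input

        // Start at 1 so a file that simply *is* a zip doesn't flag itself.
        for (int i = 1; i <= data.Length - ZipMagic.Length; i++)
        {
            bool match = true;
            for (int j = 0; j < ZipMagic.Length && match; j++)
                match = data[i + j] == ZipMagic[j];
            if (match)
                Console.WriteLine($"Possible embedded ZIP at offset {i}");
        }
    }
}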

Web API action returns FileContentResult that, if saved as .csv, opens as gibberish, while if saved as .txt, it is fine. Why?

I am exporting a file via a http get response, using ASP.NET Web API.
For that, I am returning a FileContentResult object, as in:
return File(Encoding.UTF8.GetBytes(fileContents.ToString()), "text/plain; charset=UTF-8");
After several minutes stuck on encoding issues, I used Google's Advanced REST Client to perform the GET against the Web API controller's action, and the file downloaded just fine.
Well, not exactly. I originally wanted it to be sent/downloaded as a .csv file.
If I set the HTTP request content type to "text/csv", and the File() call sets the response's content type to "text/csv" as well, Advanced REST Client shows the contents properly, but Excel opens it as gibberish data.
If I simply change the content type to "text/plain", save it as a .txt file (I have to rename it after saving; I don't know why it is saved as _.text-plain, while as a csv it is saved with a .csv extension), and finally perform an import in Excel as described here: Excel Import Text Wizard, then Excel opens the file correctly.
Why is the .csv opened as gibberish, while as a .txt it is not? For opening a .csv there is no import wizard like there is for a .txt file (not that I am aware of).
Providing a bit of the source below:
StringBuilder fileContents = new StringBuilder();
//csv header
fileContents.AppendLine(String.Join(CultureInfo.CurrentCulture.TextInfo.ListSeparator, fileData.Select(fileRecord => fileRecord.Name)));
//csv records
foreach (ExportFileField fileField in fileData)
fileContents.AppendLine(fileField.Value);
return File(Encoding.UTF8.GetBytes(fileContents.ToString()), "text/plain; charset=UTF-8");
As requested, screenshots of the binary contents of both files: the text-plain (.txt) version (the one that opens in Excel via import) and the .csv one (the one Excel opens with junk data). The files are the same; only the cropping of the screenshots differs.
I was able to reproduce the issue by saving a file containing Greek characters with BOM. Double clicking attempts to import the file using the system's locale (Greek). When manually importing, Excel detects the codepage and offers to use the 65001 (UTF8) codepage.
This behavior is strange but not a bug. Text files contain no indication that would help detect their codepage, nor is it possible to guess. An ASCII file containing only A-Z characters saved as 1252 is identical to one saved using 1253. That's why Windows uses the system codepage, which is the locale used for all non-Unicode programs and files.
When you double click on a text file, Excel can't ask you for the correct encoding - this could get tedious very quickly. Instead, it opens the file using your regional settings and the system codepage. ASCII files created on your machine are saved using your system's codepage so this behaviour is logical. Files given to you by non-programmers will probably be saved using your country's codepage as well. Programmers typically switch everything to US English and that's how problems start. Your REST client may have saved the text as ASCII using the Latin encoding used by most programmers.
When you import the text file to an empty sheet though, Excel can ask you what to do. It tries to detect the codepage by checking for a BOM or a codepage that may be matching the file's contents and presents the guess in the import dialog box, together with a preview. The decimal and column separators are still those provided by your regional settings (can't guess those). UTF8 is generally easy to guess - the file starts with a BOM or contains NUL entries.
ASCII codepages are harder though. Saving my Greek file as ASCII results in a Japanese guess. That's English humour for you I guess.
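For completeness, on the producing side: Encoding.UTF8.GetBytes does not emit a byte-order mark, so a commonly suggested mitigation is to prepend the UTF-8 preamble explicitly (whether double-click then works varies by Excel version, as discussed above). This builds on the question's own snippet; the download file name is illustrative.

// Prepend the UTF-8 BOM so Excel has a chance to detect the encoding.
byte[] preamble = Encoding.UTF8.GetPreamble(); // EF BB BF
byte[] body = Encoding.UTF8.GetBytes(fileContents.ToString());
byte[] payload = new byte[preamble.Length + body.Length];
Buffer.BlockCopy(preamble, 0, payload, 0, preamble.Length);
Buffer.BlockCopy(body, 0, payload, preamble.Length, body.Length);
return File(payload, "text/csv; charset=utf-8", "export.csv"); // illustrative name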
To my surprise, when I perform the request via a browser instead of Google's Advanced REST Client, clicking on the file that is downloaded just works! Excel opens it correctly. So the problem must be with ARC.
In any case, since the process is not going to be done using an HTTP client other than a browser... my problem is gone. Again, in ARC's output screen the file is displayed correctly. I do not know why, upon clicking it to be opened in Excel, it "gets corrupted".
Strange.
The binary contents of the file show a correctly UTF-8 encoded CSV file with Hebrew characters. If, as you state in the comments, Excel does not allow you to change its guessed file encoding when opening a CSV file, that is rather a misbehavior in Excel itself (call it a bug if you want).
Your options are: use LibreOffice (http://www.libreoffice.org/), whose spreadsheet component does allow you to customize the settings for opening a CSV file.
Another one is to write a small program to explicitly convert your file to the encoding Excel is expecting. If you have a Python 3 interpreter installed, you could for example type:
python -c "open('correct.csv', 'wt', encoding='cp1255').write(open('utf8.csv', encoding='utf8').read())"
However, if your default Windows encoding is not cp1255 for handling Hebrew, as I suppose above, that won't help Excel; it will just give you different gibberish :-) In that case, you should resort to programs that can correctly deal with different encodings.
(NB. there is a Python call to return the default system encoding in Windows, but I forgot which it is, and it is not easily googleable)

Where does file information (like DateCreated) get stored when you create a new file?

Suppose that I would like to add extra information about a file, without writing that information as content of that file. How would I do this? A couple of good examples are:
With Word documents, you can add an Author tag to a document. And,
MP3 files have lots of info stored inside them, but when you play the file, you don't see that info (unless the program playing the file has been programmed to display it).
How does Windows do this?
This information is stored in the file system (on Windows, NTFS).
In NTFS you can actually store another file as part of this information, and it stores much more information about each file than you might expect.
NTFS file streams
Example in C of how to consume them
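A hedged sketch of alternate data streams from C#: on .NET Core / .NET 5+ the ordinary file APIs accept the file:streamname syntax on NTFS (older .NET Framework versions throw on the colon in the path, so there you would need P/Invoke). The file and stream names are illustrative.

using System;
using System.IO;

class AdsDemo
{
    static void Main()
    {
        // Main stream, visible everywhere.
        File.WriteAllText("report.dat", "main content");

        // Alternate data stream - invisible to Explorer and most tools;
        // "dir /R" in cmd lists it.
        File.WriteAllText("report.dat:author", "Jane Doe");

        Console.WriteLine(File.ReadAllText("report.dat:author")); // Jane Doe
    }
}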
About MP3 and Word: in these cases the information is stored inside the file, as part of its format.

Efficiently finding the segment that has undergone changes recently in a Docx File

I am developing an application which backs up DOCX files. For the initial backup I copy the entire file to the destination, but the next time I want to perform an incremental backup, i.e. I want to back up only the segments of the DOCX file that have undergone changes. I need to find the most efficient way to do this.
I would really be thankful for any help in this regard.
The DOCX file is different from the previous Microsoft Word programs, which use the file extension DOC, in the sense that whereas a DOC file uses a text or binary format for storing a document, a DOCX file is based on XML and uses ZIP compression for a smaller file size. In other words, a DOCX file is a set of XML files that have been compressed using ZIP.
It might help if you use ZipFile to dissect the archive, tell which inner file has really changed, and then incrementally save only the changes in your VCS.
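A sketch of that idea (assuming System.IO.Compression is available; the file names are illustrative): open both versions of the .docx as ZIP archives and hash each entry, so only the inner XML parts whose hashes differ need to be stored.

using System;
using System.Collections.Generic;
using System.IO.Compression;
using System.Security.Cryptography;

class DocxDiff
{
    // Hash every entry in the .docx (a ZIP) so two versions can be compared.
    static Dictionary<string, string> HashEntries(string path)
    {
        var hashes = new Dictionary<string, string>();
        using (ZipArchive zip = ZipFile.OpenRead(path))
        using (SHA256 sha = SHA256.Create())
        {
            foreach (ZipArchiveEntry entry in zip.Entries)
                using (var stream = entry.Open())
                    hashes[entry.FullName] = Convert.ToBase64String(sha.ComputeHash(stream));
        }
        return hashes;
    }

    static void Main()
    {
        var before = HashEntries("backup.docx");   // illustrative names
        var after = HashEntries("current.docx");

        foreach (var pair in after)
            if (!before.TryGetValue(pair.Key, out string oldHash) || oldHash != pair.Value)
                Console.WriteLine($"Changed or new part: {pair.Key}");
    }
}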

Options for header in raw byte file

I have a large raw data file (up to 1GB) which contains raw samples from a USB data logger.
I need to store extra information relating to the file (sample rate, description, trigger point, last seek position, etc.) and was looking into adding this as some sort of header.
The header should ideally be human readable and flexible, so I've so far ruled out some sort of binary serialization into a header.
I also want to avoid two separate files, as they could end up separated when copied or backed up. I remember somebody telling me that newer *.*x Microsoft Office documents are actually a number of files in a zip. Is there a simple way to achieve this? Could I still keep the quick seek times into the raw file?
Update
I started using the binary serializer and found it to be a pain. I ended up using the XML serializer, as I'm more comfortable with it.
I reserve some space at the start of the file for the XML. Simple.
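A sketch of that scheme under stated assumptions (the header type, its fields, and the 4 KB size are all illustrative): a fixed-length region at the start of the file holds the XML-serialized header, so the raw samples always begin at a known offset and seeking is unaffected.

using System.IO;
using System.Xml.Serialization;

public class LogHeader
{
    public int SampleRate { get; set; }
    public string Description { get; set; }
    public long LastSeekPosition { get; set; }
}

class HeaderDemo
{
    // Raw samples always start at this offset; the XML lives before it.
    const int HeaderSize = 4096;

    static void WriteHeader(string path, LogHeader header)
    {
        // Serialize into a fixed-size buffer; this throws if the XML outgrows
        // the reserved region, failing loudly rather than corrupting data.
        var buffer = new byte[HeaderSize];
        using (var ms = new MemoryStream(buffer))
            new XmlSerializer(typeof(LogHeader)).Serialize(ms, header);

        // Overwrite only the reserved region; the samples after it are untouched.
        using (var fs = new FileStream(path, FileMode.OpenOrCreate, FileAccess.Write))
            fs.Write(buffer, 0, HeaderSize);
    }
}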
When you say you want to make the header human readable, this suggests opening the file in a text editor. Do you really want to do this, considering the file size and (I'm assuming) the remainder of the file being non-human-readable binary data? If so, just write the text header data to the start of the binary file - it will be visible when the file is opened but, of course, the remainder of the file will look like garbage.
You could create an uncompressed ZIP archive, which may allow you to seek directly to the binary data. See this for information on creating a ZIP archive: http://weblogs.asp.net/jgalloway/archive/2007/10/25/creating-zip-archives-in-net-without-an-external-library-like-sharpziplib.aspx
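A sketch of the ZIP approach (names are illustrative; this uses System.IO.Compression, which post-dates the linked article): storing entries with CompressionLevel.NoCompression keeps the raw bytes verbatim inside the archive, which is what makes direct seeking plausible.

using System.IO;
using System.IO.Compression;

class ZipContainer
{
    static void Create(string containerPath, string headerXml, string rawDataPath)
    {
        using (ZipArchive zip = ZipFile.Open(containerPath, ZipArchiveMode.Create))
        {
            // Human-readable header entry.
            ZipArchiveEntry headerEntry = zip.CreateEntry("header.xml", CompressionLevel.NoCompression);
            using (var writer = new StreamWriter(headerEntry.Open()))
                writer.Write(headerXml);

            // Stored (uncompressed), so the sample bytes stay contiguous.
            zip.CreateEntryFromFile(rawDataPath, "samples.raw", CompressionLevel.NoCompression);
        }
    }
}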
