I'm using a finite-state machine to read an extra-large file. It's not multi-threaded, so there won't be any thread-safety problems.
It contains 3 kinds of content:
a binary number, indicating the length of the following string, counting each character as 1
ANSI text, which takes 1~2 bytes per character
UTF-8 text, which takes 1~4 bytes per character
I've found this question, which might be useful, but the approach failed. The similar Python question isn't helpful either, because Python won't throw any error there. I have to read the content with the proper encoding, or the behavior becomes undefined.
Currently, I'm using a StreamReader, but its CurrentEncoding property cannot be changed once the StreamReader is initialized.
So I've also tried recreating the StreamReader on the same Stream:
reader = new StreamReader(stream, encoding65001); //UTF-8
DoSomething(reader);
reader = new StreamReader(stream, encoding1252); //ANSI
DoSomething(reader);
reader = new StreamReader(stream, encoding936); //ANSI (code page 936, GBK)
//...
But it starts reading strange content from an unknown position. I haven't found a possible cause for this strange behavior.
Have I made a mistake in creating multiple StreamReaders, or is it by design that you can't create more than one on the same stream?
If it is by design, is there any solution for reading such a file?
Thank you for taking the time to read this.
Edit:
I've run the following code on .NET Core 3.1:
Stream stream = File.OpenRead(testFilePath);
Console.WriteLine(stream.Position);
Console.WriteLine(stream.ReadByte());
Console.WriteLine(stream.Position + "\r\n");
StreamReader reader = new StreamReader(stream, Encoding.UTF8);
Console.WriteLine(reader.Read());
Console.WriteLine(stream.Position + "\r\n");
reader = new StreamReader(stream, CodePagesEncodingProvider.Instance.GetEncoding(1252));
Console.WriteLine(reader.Read());
Console.WriteLine(stream.Position);
With the following example text:
abcdefg
And the output:
0
97
1
98
7
-1
7
It's strange and interesting.
The stream readers are going to buffer content from the underlying stream they're reading, and that is what's causing your problems. Just because you read one character from your reader doesn't mean it reads just one character from the underlying stream: it fills a whole buffer with bytes, then yields you one character from that buffer.
If you want to read values from a stream and interpret different sections of bytes as different encodings (for the record, if at all possible you should avoid putting yourself in the position of having mixed encodings in your data), you'll have to pull the bytes out of the stream yourself and then convert them using the appropriate encodings, so that you can be sure you consume exactly the sections of bytes you want and no more.
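For illustration, here's a minimal sketch of that approach (the helper name and the section lengths are hypothetical; the question's length prefix counts characters, so for a variable-width encoding you would still have to derive the byte count first):

using System.IO;
using System.Text;

static byte[] ReadExactly(Stream stream, int count)
{
    // Stream.Read may return fewer bytes than requested, so loop until done.
    byte[] bytes = new byte[count];
    int total = 0;
    while (total < count)
    {
        int read = stream.Read(bytes, total, count - total);
        if (read == 0) throw new EndOfStreamException();
        total += read;
    }
    return bytes;
}

// Hypothetical usage: a 12-byte UTF-8 section followed by an 8-byte ANSI section.
string a = Encoding.UTF8.GetString(ReadExactly(stream, 12));
string b = CodePagesEncodingProvider.Instance.GetEncoding(1252).GetString(ReadExactly(stream, 8));

Because no StreamReader is involved, the stream's Position always reflects exactly what you've consumed.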
I have a big stream (4 GB), and I need to replace one specific character with 2 or 3 other characters in that stream. I get the stream from a service and I have to return a stream back.
This is what I'm doing:
private static Stream UpdateStream(Stream stream, string oldCharacters, string newCharacters, int size = 2048)
{
    stream.Position = 0;
    StreamReader reader = new StreamReader(stream);
    MemoryStream outputStream = new MemoryStream();
    StreamWriter writer = new StreamWriter(outputStream);
    writer.AutoFlush = true;
    char[] buffer = new char[size];
    while (!reader.EndOfStream)
    {
        // Read may return fewer characters than requested; only use what was read.
        int read = reader.Read(buffer, 0, buffer.Length);
        if (read > 0)
        {
            string line = new string(buffer, 0, read);
            string newLine = line.Replace(oldCharacters, newCharacters);
            writer.Write(newLine);
        }
    }
    return outputStream;
}
But I'm getting an OutOfMemory exception at some point on this line, even though, looking at the computer's memory, I still have plenty available:
writer.Write(newLine);
Any advice?
This is not an answer, but I couldn't possibly fit it into a comment.
Currently your problem is not solvable without making some assumptions. The problem, as I understand it, is that you want to replace some parts of a large body of text saved in a file and save the modified text to the file again.
Some unknown variables:
How long are those strings you are replacing?
How long are those strings you are replacing them with? The same length as the replaced strings?
What kinds of strings are you looking to replace? A single word? A whole sentence? A whole paragraph?
A solution to your problem would be to read the file into memory in chunks, replace the necessary text, save the "updated" text to a new file, and then finally rename the "new file" to the name of the old file. However, without knowing the answers to the above points, you could potentially want to replace a string as long as the entire text in the file (unlikely, yes). That means that, in order to do the "replacing", I would have to read the whole file into memory before I can replace any of the text, which causes an OutOfMemoryException. (Yes, you could do some clever scanning to replace such large strings without reading it all into memory at once, but I doubt such a solution is necessary here.)
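To make that concrete, here is a minimal sketch of the chunked approach, assuming the searched-for string is a single character (as in your case), so that a match can never straddle a chunk boundary:

using System.IO;

static void ReplaceInFile(string path, char oldChar, string newText, int chunkSize = 4096)
{
    string tempPath = path + ".tmp"; // hypothetical temp-file convention
    using (var reader = new StreamReader(path))
    using (var writer = new StreamWriter(tempPath))
    {
        char[] buffer = new char[chunkSize];
        int read;
        while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
        {
            // Only use the characters actually read in this pass.
            string chunk = new string(buffer, 0, read);
            writer.Write(chunk.Replace(oldChar.ToString(), newText));
        }
    }
    // Swap the updated file into place of the original.
    File.Delete(path);
    File.Move(tempPath, path);
}

Because only one chunk is ever held in memory, the file size no longer matters.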
Please edit your question to address the above points.
So to make it work I had to:
use the HugeMemoryStream class from this post: Failed to write large amount of data to stream
set the gcAllowVeryLargeObjects config parameter to true (see the snippet below)
and set the build to 64-bit ("Prefer 32-bit" unchecked)
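For reference, gcAllowVeryLargeObjects is a standard .NET Framework runtime element and goes in app.config like this:

<configuration>
  <runtime>
    <!-- allow arrays larger than 2 GB in total size (64-bit only) -->
    <gcAllowVeryLargeObjects enabled="true" />
  </runtime>
</configuration>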
I have homework from my university teacher: I have to write code that encrypts/decrypts a small part of a big file (about 10 GB). I use the Salsa20 algorithm.
The main constraint is not to load everything into RAM. As he said, I should read, for example, 100 lines, then encrypt/decrypt them, write them to a file, and repeat.
I create a List:
List<string> dict = new List<string>();
Then I read lines (because reading all bytes at once uses lots of RAM):
using (StreamReader sReader = new StreamReader(filePath))
{
    // stop early if the file has fewer than 100 lines
    while (dict.Count < 100 && !sReader.EndOfStream)
    {
        dict.Add(sReader.ReadLine());
    }
}
Then I try to create one line from them:
string words = string.Join("", dict.ToArray());
And I encrypt this line:
string encrypted;
using (var salsa = new Salsa20.Salsa20())
using (var mstream_out = new MemoryStream())
{
    salsa.Key = key;
    salsa.IV = iv;
    using (var cstream = new CryptoStream(mstream_out,
        salsa.CreateEncryptor(), CryptoStreamMode.Write))
    {
        var bytes = Encoding.UTF8.GetBytes(words);
        cstream.Write(bytes, 0, bytes.Length);
    }
    encrypted = Encoding.UTF8.GetString(mstream_out.ToArray());
}
Then I need to write the 100 encrypted lines back to the file, but I don't know how to do it! Is there any solution?
OK, so here's what you could do.
Accept a filename, a starting line number and an ending line number.
Read the lines, simply writing them to another file if they are lower than the starting line number or higher than the ending line number.
Once you read a line that is in the range, you can encrypt it with the key and an IV. You will possibly need to encode it to a byte array first, e.g. using UTF-8, as modern ciphers such as Salsa20 operate on bytes, not text.
You can possibly use the line number as the nonce/IV for your stream cipher, if you don't expect the number of lines to change. Otherwise you can prefix the ciphertext with a large, fixed-size, random nonce.
The ciphertext - possibly including the nonce - can be encoded as base64 without line endings. Then you write the base64 line to the other file.
Keep encrypting lines until you reach the ending index. It is up to you whether your ending line is inclusive or exclusive.
Now read the remaining lines, and write them to the other file.
Don't forget to finalize the encryption and close the file. You may possibly want to destroy the source input file.
Encrypting bytes may be easier, as you could then write back to the original file. However, writing encrypted strings will likely always expand the ciphertext compared with the plaintext, so you need to copy the file, as it would have to grow from the middle out.
I haven't got a clue why you would keep a list or dictionary in memory. If that's part of the requirements then I don't see it in the rest of the question. If you read in all the lines of a file that way then clearly you're using up memory.
Of course, if your 10 GB file is just a single line, then you're still using way too much memory. In that case you need to stream everything: parse text from the file, put it in a character buffer, decode the characters, encrypt them, encode the result to base64 and write it to file. Certainly doable, but tricky if you've never done such things.
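To sketch the per-line scheme described above (reusing the Salsa20 class from the question; the method name and the half-open line range are illustrative assumptions):

using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

static void EncryptLineRange(string inPath, string outPath,
                             int startLine, int endLine, byte[] key, byte[] iv)
{
    using (var reader = new StreamReader(inPath))
    using (var writer = new StreamWriter(outPath))
    {
        string line;
        int lineNumber = 0;
        while ((line = reader.ReadLine()) != null)
        {
            if (lineNumber >= startLine && lineNumber < endLine)
            {
                using (var salsa = new Salsa20.Salsa20())
                using (var ms = new MemoryStream())
                {
                    salsa.Key = key;
                    salsa.IV = iv; // or derive a per-line nonce, as suggested above
                    using (var cs = new CryptoStream(ms, salsa.CreateEncryptor(), CryptoStreamMode.Write))
                    {
                        byte[] plain = Encoding.UTF8.GetBytes(line);
                        cs.Write(plain, 0, plain.Length);
                    }
                    // Base64 keeps the ciphertext free of raw newline bytes.
                    writer.WriteLine(Convert.ToBase64String(ms.ToArray()));
                }
            }
            else
            {
                writer.WriteLine(line); // outside the range: copy verbatim
            }
            lineNumber++;
        }
    }
}

Only one line is held in memory at a time, which satisfies the RAM constraint.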
Hey there! I'm trying to read a 150 MB file with a file stream, but every time I do, all I get is |zl instead of the whole stream. Note that the file has some special characters in it.
Does anybody know what the problem could be? Here is my code:
using (FileStream fs = File.OpenRead(path))
{
    byte[] buffer = new byte[fs.Length];
    fs.Read(buffer, 0, buffer.Length);
    extract = Encoding.Default.GetString(buffer);
}
Edit:
I tried to read all the text, but it still returned the same four characters. It works fine on any other file except for these few. When I use ReadAllLines, it only gets the first line.
fs.Read() does not read the whole smash of bytes all at once; it reads some number of bytes and returns the number of bytes read. MSDN has an excellent example of how to use it to get the whole file:
http://msdn.microsoft.com/en-us/library/system.io.filestream.read.aspx
For what it's worth, reading the entire 150 MB of data into memory is really going to put a drain on your client's system -- the preferred option would be to optimize the code so that you don't need the whole file in memory all at once.
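For illustration, a minimal sketch of the loop pattern that page describes (it still loads the whole file into memory, as in the question):

using (FileStream fs = File.OpenRead(path))
{
    byte[] buffer = new byte[fs.Length];
    int total = 0;
    while (total < buffer.Length)
    {
        // Read returns how many bytes it actually delivered on this call.
        int read = fs.Read(buffer, total, buffer.Length - total);
        if (read == 0) break; // end of file reached early
        total += read;
    }
    string extract = Encoding.Default.GetString(buffer, 0, total);
}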
If you want to read text this way, File.ReadAllLines (or ReadAllText) - http://msdn.microsoft.com/en-us/library/s2tte0y1.aspx - is a better option.
My guess is that the file is not a text file to start with, and that the way you display the resulting string stops at the first '\0' character.
As debracey pointed out, Read returns the number of bytes read - check that out. Also, for file operations it is unlikely to stop at 4 characters...
I have a file that contains both text and a binary image. I need to read the text from position 0 to position 30, and from position 31 onward is the image in binary format.
What are the steps that I have to follow to solve this problem?
Currently, I am trying to read it using a FileStream, and then I hand the FileStream to a BinaryReader, as shown below:
FileStream fs = new FileStream(filePath, FileMode.Open, FileAccess.Read);
BinaryReader br = new BinaryReader(fs);
From there forward, I'm lost.
UPDATE
Alright, so I can read my file now.
Up to position 30 is my 30-character string; from position 30 onward is the byte string, which is actually an image.
I wonder how I can read the bytes from position 30 and then save them as an image.
Does anyone have any ideas?
Here is an example from my file so you have some idea:
£ˆ‰¢#‰¢#¢–”…#•…¦#„£################################.-///%<<??#[K}#k{M÷]kðñôôô}ù~øòLKóôòÿg
Note that even the # characters are part of my string, and from there on the bytes are the picture.
Expanding on Roger's answer a bit, with some code.
A string is always encoded in some format, and to read it you need to know that encoding (especially when using BinaryReader). In many cases it's plain ASCII, and you can use Encoding.ASCII.GetString to parse it. If you get unexpected results (weird characters, etc.), then try another encoding.
To parse the image you need to use an image parser. .NET ships several as part of its GUI namespaces. In the sample below I'm using the one from System.Drawing (Windows Forms), but similar ones exist in WPF, and there are many third-party libraries out there.
using (var reader = new BinaryReader(File.Open(someFile, FileMode.Open)))
{
    // assuming your string is in plain ASCII encoding:
    var myString = System.Text.Encoding.ASCII.GetString(reader.ReadBytes(30));

    // The rest of the bytes is image data; use an image library to process it
    var myImage = System.Drawing.Image.FromStream(reader.BaseStream);
}
Now, MSDN has a caution about using BaseStream in conjunction with BinaryReader, but I believe in the above case you should be safe, since you're not using the reader after handing its stream to the image. But keep an eye out for problems. If it fails, you can always read the bytes into a new byte[] and create a new MemoryStream from those bytes.
EDIT:
You indicated in your comment that your string is EBCDIC, which unfortunately means you cannot use any of the built-in Encodings to decode it. A quick Google search revealed a post by Jon Skeet on an EBCDIC .NET Encoding class that may get you started. It will essentially give you ebcdicEncoding.GetString(...);
You can use FileStream to open and read from the file. If you read the first 30 bytes into a buffer, you can then convert them to a string using Encoding.ASCII.GetString(byte[] buffer, int offset, int length).
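A minimal sketch of that approach (assuming, as above, that the text portion is ASCII; the output file name is hypothetical):

using (var fs = new FileStream(filePath, FileMode.Open, FileAccess.Read))
{
    byte[] header = new byte[30];
    int read = fs.Read(header, 0, header.Length);
    string text = Encoding.ASCII.GetString(header, 0, read);

    // fs is now positioned right after the text; the rest is the image data.
    var image = System.Drawing.Image.FromStream(fs);
    image.Save("extracted.png", System.Drawing.Imaging.ImageFormat.Png);
}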
I'm trying to read a (small-ish) file in chunks of a few lines at a time, and I need to return to the beginning of particular chunks.
The problem is, after the very first call to
streamReader.ReadLine();
the streamReader.BaseStream.Position property is set to the end of the file! Now, I assume some caching is done behind the scenes, but I was expecting this property to reflect the number of bytes that I had consumed from the file. And yes, the file has more than one line :-)
For instance, calling ReadLine() again will (naturally) return the next line in the file, which does not start at the position previously reported by streamReader.BaseStream.Position.
How can I find the actual position where the 1st line ends, so I can return there later?
I can only think of doing the bookkeeping manually, by adding up the lengths of the strings returned by ReadLine(), but even here there are a couple of caveats:
ReadLine() strips the newline character(s), which may have a variable length (is it "\n"? Is it "\r\n"? Etc.)
I'm not sure this would work correctly with variable-length characters
...so right now it seems like my only option is to rethink how I parse the file, so I don't have to rewind.
If it helps, I open my file like this:
using (var reader = new StreamReader(
    new FileStream(
        m_path,
        FileMode.Open,
        FileAccess.Read,
        FileShare.ReadWrite)))
{...}
Any suggestions?
If you need to read lines, and you need to go back to previous chunks, why not store the lines you read in a List? That should be easy enough.
You should not depend on calculating a length in bytes based on the length of the string, for the reasons you mention yourself: multi-byte characters, newline characters, etc.
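A minimal sketch of that idea (once the lines are cached, "rewinding" to a chunk is just indexing into the list):

using System.Collections.Generic;
using System.IO;

var lines = new List<string>();
using (var reader = new StreamReader(m_path))
{
    string line;
    while ((line = reader.ReadLine()) != null)
        lines.Add(line);
}
// Returning to the chunk that starts at, say, line 42 is now just lines[42], lines[43], ...

This works because the question says the file is small-ish, so holding all its lines in memory is acceptable.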
I have done a similar implementation where I needed fast access to the n-th line in an extremely big text file.
The reason streamReader.BaseStream.Position points to the end of the file is that the reader has a built-in buffer, as you expected.
Bookkeeping by counting the number of bytes read from each ReadLine() call will work for most plain text files. However, I have encountered cases where control characters, the unprintable ones, were mixed into the text file. The calculated number of bytes was then wrong, and my program was not able to seek to the correct location afterwards.
My final solution was to implement the line reader on my own. It has worked well so far. This should give you an idea of what it looks like:
using (FileStream fs = new FileStream(filePath, FileMode.Open))
{
    int ch;
    int currentLine = 1, offset = 0;
    while ((ch = fs.ReadByte()) >= 0)
    {
        offset++;
        // This covers all cases: \r\n and only \n (for UNIX files)
        if (ch == 10)
        {
            currentLine++;
            // ... do sth such as log current offset with line number
        }
    }
}
And to go back to a logged offset:
using (FileStream fs = new FileStream(filePath, FileMode.Open))
{
    fs.Seek(yourOffset, SeekOrigin.Begin);
    TextReader tr = new StreamReader(fs);
    string line = tr.ReadLine();
}
Also note that there is already a buffering mechanism built into FileStream.
StreamReader isn't designed for this kind of usage, so if this is what you need, I suspect you'll have to write your own wrapper around FileStream.
A problem with the accepted answer is that if ReadLine() throws an exception - say because a logging framework temporarily locks the file right when you call ReadLine() - then you will not have that line "saved" into a list, because it was never returned. If you catch the exception, you cannot retry the ReadLine() a second time, because the StreamReader's internal state and buffer are corrupted from the last ReadLine(): you will only get part of a line returned, and you cannot ignore that broken line and seek back to the beginning of it, as the OP found out.
If you want to get the true seekable location, you need to use reflection to access the StreamReader's private variables, which let you calculate its position inside its own buffer. Granger's solution, seen here: StreamReader and seeking, should work. Or do what answers to other related questions have done: create your own StreamReader that exposes the true seekable location (this answer in this link: Tracking the position of the line of a streamreader). Those are the only two options I've come across while dealing with StreamReader and seeking, which for some reason was designed to rule out seeking in nearly every situation.
Edit: I used Granger's solution and it works. Just be sure you go in this order: call GetActualPosition(), then set BaseStream.Position to that position, then make sure you call DiscardBufferedData(), and finally you can call ReadLine() - you will get the full line starting from the position given by the method.