I have a process which picks up a series of "xml" files. The reason I put xml in quotes is that the text in the files does not have a root element, which makes them invalid XML. In my processing I want to correct this: open each file, add a root node at the beginning and end, and close it up again. Here is what I had in mind, but this involves opening the file, reading the entire file, tacking on the nodes, and then writing the entire file out. These files may be more than 20 MB in size.
foreach (FileInfo file in files)
{
//open the file
StreamReader sr = new StreamReader(file.FullName);
// add the opening and closing tags
string text = "<root>" + sr.ReadToEnd() + "<root>";
sr.Close();
// now open the same file for writing
StreamWriter sw = new StreamWriter(file.FullName, false);
sw.Write(text);
sw.Close();
}
Any recommendations?
To avoid holding the whole file in memory, rename the original file, then open it with StreamReader. Then open the original filename with StreamWriter to create a new file.
Write the <root> prefix to the file, then copy data in large-ish chunks from the reader to the writer. When you've transferred all the data, write the closing </root> (note the forward slash if you want it to be XML). Then close both files and delete the renamed original.
char[] buffer = new char[10000];
string renamedFile = file.FullName + ".orig";
File.Move(file.FullName, renamedFile);
using (StreamReader sr = new StreamReader(renamedFile))
using (StreamWriter sw = new StreamWriter(file.FullName, false))
{
sw.Write("<root>");
int read;
while ((read = sr.Read(buffer, 0, buffer.Length)) > 0)
sw.Write(buffer, 0, read);
sw.Write("</root>");
}
File.Delete(renamedFile);
20 MB is not a huge amount, but when you read it as a string it will use about 40 MB of memory (strings are UTF-16, two bytes per character). That's still not huge, but it's processing that you don't need to do. You can handle it as raw bytes to reduce the memory usage, and to avoid decoding and re-encoding the data:
byte[] start = Encoding.UTF8.GetBytes("<root>");
byte[] ending = Encoding.UTF8.GetBytes("</root>");
byte[] data = File.ReadAllBytes(file.FullName);
int bom = (data[0] == 0xEF) ? 3 : 0; // UTF-8 BOM is EF BB BF
using (FileStream s = File.Create(file.FullName)) {
if (bom > 0) {
s.Write(data, 0, bom);
}
s.Write(start, 0, start.Length);
s.Write(data, bom, data.Length - bom);
s.Write(ending, 0, ending.Length);
}
If you need to reduce the memory usage much more, use a second file as Earwicker suggested.
Edit:
Added code to handle BOM (byte order mark).
I can't see any real improvement on this... which is kind of a bummer. Since there's no way to "shift" a file, you'll always have to move every byte in the file to inject anything at the top.
You may find some performance benefit by using raw streams rather than the StreamReader which has to actually parse the stream as text.
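A minimal sketch of that, combining a raw byte copy with the rename-and-copy approach shown earlier (it assumes the content is UTF-8 without a BOM, so the bytes can be copied without decoding):
byte[] open = Encoding.UTF8.GetBytes("<root>");
byte[] close = Encoding.UTF8.GetBytes("</root>");
byte[] buffer = new byte[81920];

string renamedFile = file.FullName + ".orig";
File.Move(file.FullName, renamedFile);

using (FileStream input = File.OpenRead(renamedFile))
using (FileStream output = File.Create(file.FullName))
{
    // Write the opening tag, copy the original bytes unchanged, then close the tag.
    output.Write(open, 0, open.Length);
    int read;
    while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
        output.Write(buffer, 0, read);
    output.Write(close, 0, close.Length);
}
File.Delete(renamedFile);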
If you do not want to do this in C#, it would be easy to handle at the command line or in a batch file.
ECHO ^<root^> > outfile.xml
TYPE temp.xml >> outfile.xml
ECHO ^</root^> >> outfile.xml
This would assume that you have some existing process for getting the data files that this could be hooked into.
Related
I need to change a file's encoding. The method that I've used loads all the file in memory:
string DestinationString = Encoding.ASCII.GetString(Encoding.ASCII.GetBytes(File.ReadAllText(FileName)));
File.WriteAllText(FileName, DestinationString, new System.Text.ASCIIEncoding());
This works for smaller files (in this case I want to change the file's encoding to ASCII), but it won't work for files larger than 2 GB. How can I change the encoding without loading the whole file's content into memory?
You can't do so by writing to the same file - but you can easily do it to a different file, just by reading a chunk of characters at a time in one encoding and writing each chunk in the target encoding.
public void RewriteFile(string source, Encoding sourceEncoding,
string destination, Encoding destinationEncoding)
{
using (var reader = new StreamReader(source, sourceEncoding))
{
using (var writer = new StreamWriter(destination, false, destinationEncoding))
{
char[] buffer = new char[16384];
int charsRead;
while ((charsRead = reader.Read(buffer, 0, buffer.Length)) > 0)
{
writer.Write(buffer, 0, charsRead);
}
}
}
}
You could always end up with the original filename via renaming, of course.
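For example (hypothetical usage; the temporary file name and the encodings are just for illustration):
// Convert "in place" by writing to a temporary file, then swapping it back
// under the original name.
string tempFile = fileName + ".converted"; // temporary name is just an example
RewriteFile(fileName, Encoding.UTF8, tempFile, Encoding.ASCII);
File.Delete(fileName);
File.Move(tempFile, fileName);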
I am dealing with files in many formats, including Shift-JIS and UTF8 NoBOM. Using a bit of language knowledge, I can detect if the files are being interpreted correctly as UTF8 or Shift-JIS, but if I detect that the file is not of the type I read in, I was wondering if there is a way to just reinterpret my in-memory array without having to re-read the file with a new encoding specified.
Right now, I read in the file assuming Shift-JIS as such:
using (StreamReader sr = new StreamReader(path, Encoding.GetEncoding("shift-jis"), true))
{
String line = sr.ReadToEnd();
// Detection must be done AFTER you read from the file. Silly rabbit.
fileFormatCertain = !sr.CurrentEncoding.Equals(Encoding.GetEncoding("shift-jis"));
codingFromBOM = sr.CurrentEncoding;
}
and after I do my magic to determine if it is either a known format (has a BOM) or that the data makes sense as Shift-JIS, all is well. If the data is garbage though, then I am re-reading the file via:
using (StreamReader sr = new StreamReader(path, Encoding.UTF8))
{
String line = sr.ReadToEnd();
}
I am trying to avoid this re-read step and reinterpret the data in memory if possible.
Or is magic already happening and I am needlessly worrying about double I/O access?
// Read the raw bytes once, then decode them in memory.
var buf = File.ReadAllBytes(path);
// Try UTF-8 first; invalid byte sequences are decoded as U+FFFD.
var text = Encoding.UTF8.GetString(buf);
if (text.Contains("\uFFFD")) // Unicode replacement character
{
    // Fall back to Shift-JIS (code page 932) using the same byte array.
    text = Encoding.GetEncoding(932).GetString(buf);
}
One of the projects on my list is a little text editor.
At one point, you can load all the sub directories and files in a given directory. The program will add each as a node in a TreeView.
What I want the functionality to be is to only add the files that are readable by a normal text reader.
This code currently adds it to the tree:
TreeNode navNode = new TreeNode();
navNode.Text = file.Name;
navNode.Tag = file.FullName;
directoryNode.Nodes.Add(navNode);
I know I could easily create an if statement with something like:
if (file.Extension.Equals(".txt"))
but I would have to expand that statement to contain every single extension that it could possibly be.
Is there an easier way to do this? I'm thinking it may have something to do with the mime types or file encoding.
There is no general way of figuring out what type of information is stored in a file.
Even if you know in advance that it is some sort of text, if you don't know what encoding was used to create the file, you may not be able to load it properly.
Note that HTTP gives you a hint about the type of a file via the Content-Type header, but there is no such information on the file system.
There are a few methods you could use to "best guess" whether or not the file is a text file. Of course, the more encodings you support, the harder this becomes, especially if you plan to support CJK (Chinese, Japanese, Korean) scripts. Let's just start with Encoding.ASCII and Encoding.UTF8 for now.
Fortunately, most non-text files (executables, images, and the like) have a lot of non-parsable characters in their first couple of kilobytes.
What you could do is take a file and scan the first 1-4KB (up to you) and see if any "non-printable" characters come up. This operation shouldn't take much time and will at least give you some certainty of the contents of the file.
public static async Task<bool> IsValidTextFileAsync(string path,
int scanLength = 4096)
{
using(var stream = File.Open(path, FileMode.Open, FileAccess.Read, FileShare.Read))
using(var reader = new StreamReader(stream, Encoding.UTF8))
{
var bufferLength = (int)Math.Min(scanLength, stream.Length);
var buffer = new char[bufferLength];
var bytesRead = await reader.ReadBlockAsync(buffer, 0, bufferLength);
reader.Close();
if(bytesRead != bufferLength)
throw new IOException("There was an error reading from the file.");
for(int i = 0; i < bytesRead; i++)
{
var c = buffer[i];
if(char.IsControl(c))
return false;
}
return true;
}
}
My approach, based on #Rubenisme's comment and #Erik's answer:
public static bool IsValidTextFile(string path)
{
using (var stream = System.IO.File.Open(path, System.IO.FileMode.Open, System.IO.FileAccess.Read, System.IO.FileShare.Read))
using (var reader = new System.IO.StreamReader(stream, System.Text.Encoding.UTF8))
{
var bytesRead = reader.ReadToEnd();
reader.Close();
return bytesRead.All(c => // Are all the characters either a:
c == (char)10 // New line
|| c == (char)13 // Carriage Return
|| c == (char)9 // Tab
|| !char.IsControl(c) // Non-control (regular) character
);
}
}
A hacky way to do it would be to see if the file contains any of the lower control characters (0-31) that aren't forms of white space (carriage return, tab, vertical tab, line feed, and just to be safe null and end of text). If it does, then it is probably binary. If it does not, it probably isn't. I haven't done any testing or anything to see what happens when applying this rule to non-ASCII encodings, so you'd have to investigate further yourself :)
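A rough sketch of that rule, assuming the file can be sampled as raw bytes (the method name and sample size are just illustrative):
public static bool LooksLikeBinary(string path, int sampleSize = 4096)
{
    byte[] sample = new byte[sampleSize];
    int read;
    using (var stream = File.OpenRead(path))
        read = stream.Read(sample, 0, sample.Length);

    for (int i = 0; i < read; i++)
    {
        byte b = sample[i];
        // Allow CR, LF, tab, vertical tab, and (just to be safe) NUL and ETX.
        bool allowedWhitespace = b == 13 || b == 10 || b == 9 || b == 11 || b == 0 || b == 3;
        if (b < 32 && !allowedWhitespace)
            return true; // found a low control character: probably binary
    }
    return false; // probably text (at least for ASCII-like encodings)
}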
I am receiving a stream of data from a web service and trying to save the contents of the stream to a file. The stream contains standard lines of text alongside large chunks of xml data (on a single line). The size of the file is about 800 MB.
Problem: Receiving an out of memory exception when I process the xml section of each line.
==start file
line 1
line 2
<?xml version=.....huge line etc</xml>
line 3
line4
<?xml version=.....huge line etc</xml>
==end file
Here is the current code; as you can see, when it reads in the huge xml line, memory usage spikes.
string readLine;
using (StreamReader reader = new StreamReader(downloadStream))
{
while ((readLine = reader.ReadLine()) != null)
{
streamWriter.WriteLine(readLine); // writes to file
}
}
I was trying to think of a solution where I used both a TextReader/StreamReader and an XmlTextReader in combination to process each section. As I get to the xml section I could switch to the XmlTextReader and use the Read() method to read each node, thus avoiding the memory spike.
Any suggestions on how I could do this? Alternatively, I could create a custom XmlTextReader that was able to read in these lines? Any pointers for this?
Updated
A further problem is that I need to read this file back in and split out the two xml sections into separate xml files! I converted the solution to write the file using a binary writer, and then started to read the file back in using a binary reader. I have text processing to detect the start of the xml section and specifically which xml section, so I can map it to the correct file. However, this causes problems reading in the binary file and doing detection...
using (BinaryReader reader = new BinaryReader(savedFileStream))
{
    while ((streamLine = reader.ReadString()) != null)
    {
        if (streamLine.StartsWith("<?xml version=\"1.0\" ?><tag1"))
        {
            // xml file 1
        }
        else if (streamLine.StartsWith("<?xml version=\"1.0\" ?><tag2"))
        {
            // xml file 2
        }
    }
}
XML may contain all the content as one single line, so you'd probably be better off using a binary reader/writer, where you can decide the read/write size yourself.
An example below, here we read BUFFER_SIZE bytes for each iteration:
Stream s = new MemoryStream();
Stream outputStream = new MemoryStream();
int BUFFER_SIZE = 1024;
using (BinaryReader reader = new BinaryReader(s))
{
BinaryWriter writer = new BinaryWriter(outputStream);
byte[] buffer = new byte[BUFFER_SIZE];
int read = buffer.Length;
while(read != 0)
{
read = reader.Read(buffer, 0, BUFFER_SIZE);
writer.Write(buffer, 0, read);
}
writer.Flush();
writer.Close();
}
I don't know if this causes you problems with encodings etc, but I think you will have to read the file as binary.
If all you want to do is copy one stream to another without modifying the data, you don't need the Stream text or binary helpers (StreamReader, StreamWriter, BinaryReader, BinaryWriter, etc.); simply copy the stream.
internal static class StreamExtensions
{
public static void CopyTo(this Stream readStream, Stream writeStream)
{
byte[] buffer = new byte[4096];
int read;
while ((read = readStream.Read(buffer, 0, buffer.Length)) > 0)
writeStream.Write(buffer, 0, read);
}
}
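Hypothetical usage with the downloadStream from the question (the output path is just an example):
using (FileStream output = File.Create(@"C:\temp\report.dat")) // example path
{
    // Copies raw bytes in 4 KB chunks; no text decoding, no line splitting.
    downloadStream.CopyTo(output);
}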
I think there is a memory leak.
Are you getting the out of memory exception after processing a few lines, or on the first line itself?
Also, there is no streamWriter.Flush() inside the while loop.
Don't you think there should be one?
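That suggestion would look something like this (a sketch based on the question's loop; whether it helps with the memory spike depends on where the memory is actually going):
string readLine;
using (StreamReader reader = new StreamReader(downloadStream))
{
    while ((readLine = reader.ReadLine()) != null)
    {
        streamWriter.WriteLine(readLine); // writes to file
        streamWriter.Flush();             // flush the writer's buffer to the file after each line
    }
}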
I'm reading data (an adCenter report, as it happens), which is supposed to be zipped. Reading the contents with an ordinary stream, I get a couple thousand bytes of gibberish, so this seems reasonable. So I feed the stream to DeflateStream.
First, it reports "Block length does not match with its complement." A brief search suggests that there is a two-byte prefix, and indeed if I call ReadByte() twice before opening DeflateStream, the exception goes away.
However, DeflateStream now returns nothing at all. I've spent most of the afternoon chasing leads on this, with no luck. Help me, StackOverflow, you're my only hope! Can anyone tell me what I'm missing?
Here's the code. Naturally I only enabled one of the two commented blocks at a time when testing.
_results = new List<string[]>();
using (Stream compressed = response.GetResponseStream())
{
// Skip the zlib prefix, which conflicts with the deflate specification
compressed.ReadByte(); compressed.ReadByte();
// Reports reading 3,000-odd bytes, followed by random characters
/*byte[] buffer = new byte[4096];
int bytesRead = compressed.Read(buffer, 0, 4096);
Console.WriteLine("Read {0} bytes.", bytesRead.ToString("#,##0"));
string content = Encoding.ASCII.GetString(buffer, 0, bytesRead);
Console.WriteLine(content);*/
using (DeflateStream decompressed = new DeflateStream(compressed, CompressionMode.Decompress))
{
// Reports reading 0 bytes, and no output
/*byte[] buffer = new byte[4096];
int bytesRead = decompressed.Read(buffer, 0, 4096);
Console.WriteLine("Read {0} bytes.", bytesRead.ToString("#,##0"));
string content = Encoding.ASCII.GetString(buffer, 0, bytesRead);
Console.WriteLine(content);*/
using (StreamReader reader = new StreamReader(decompressed))
while (reader.EndOfStream == false)
_results.Add(reader.ReadLine().Split('\t'));
}
}
As you can probably guess from the last line, the unzipped content should be TDT (tab-delimited text).
Just for fun, I tried decompressing with GZipStream, but it reports that the magic number is not correct. MS' docs just say "The downloaded report is compressed by using zip compression. You must unzip the report before you can use its contents."
Here's the code that finally worked. I had to save the content out to a file and read it back in. This does not seem like it should be necessary, but for the small quantities of data I'm working with it's acceptable; I'll take it!
WebRequest request = HttpWebRequest.Create(reportURL);
WebResponse response = request.GetResponse();
_results = new List<string[]>();
using (Stream compressed = response.GetResponseStream())
{
// Save the content to a temporary location
string zipFilePath = @"\\Server\Folder\adCenter\Temp.zip";
using (StreamWriter file = new StreamWriter(zipFilePath))
{
compressed.CopyTo(file.BaseStream);
file.Flush();
}
// Get the first file from the temporary zip
ZipFile zipFile = ZipFile.Read(zipFilePath);
if (zipFile.Entries.Count > 1) throw new ApplicationException("Found " + zipFile.Entries.Count.ToString("#,##0") + " entries in the report; expected 1.");
ZipEntry report = zipFile[0];
// Extract the data
using (MemoryStream decompressed = new MemoryStream())
{
report.Extract(decompressed);
decompressed.Position = 0; // Note that the stream does NOT start at the beginning
using (StreamReader reader = new StreamReader(decompressed))
while (reader.EndOfStream == false)
_results.Add(reader.ReadLine().Split('\t'));
}
}
You will find that DeflateStream is hugely limited in what data it will decompress. In fact if you are expecting entire files it will be of no use at all.
There are hundreds of (mostly small) variations of ZIP files, and DeflateStream will get along with only two or three of them.
The best way is likely to use a dedicated library for reading Zip files/streams, like DotNetZip or SharpZipLib (somewhat unmaintained).
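With DotNetZip you can also skip the temporary file from the code above and read the zip straight from memory. A sketch, assuming Ionic.Zip's ZipFile.Read(Stream) (which needs a seekable stream, hence the MemoryStream buffer) and a report small enough to fit in memory:
using (Stream compressed = response.GetResponseStream())
using (var zipBuffer = new MemoryStream())
{
    // Buffer the response so the zip reader can seek to the central directory.
    compressed.CopyTo(zipBuffer);
    zipBuffer.Position = 0;

    using (ZipFile zipFile = ZipFile.Read(zipBuffer))
    using (var decompressed = new MemoryStream())
    {
        zipFile[0].Extract(decompressed);
        decompressed.Position = 0;
        using (var reader = new StreamReader(decompressed))
            while (!reader.EndOfStream)
                _results.Add(reader.ReadLine().Split('\t'));
    }
}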
You could write the stream to a file and try my tool Precomp on it. If you use it like this:
precomp -c- -v [name of input file]
any ZIP/gZip stream(s) inside the file will be detected and some verbose information will be reported (position and length of the stream). Additionally, if they can be decompressed and recompressed bit-to-bit identical, the output file will contain the decompressed stream(s).
Precomp detects ZIP/gZip (and some other) streams anywhere in the file, so you won't have to worry about header bytes or garbage at the beginning of the file.
If it doesn't detect a stream like this, try to add -slow, which detects deflate streams even if they don't have a ZIP/gZip header. If this fails, you can try -brute which even detects deflate streams that lack the two byte header, but this will be extremely slow and can cause false positives.
After that, you'll know if there is a (valid) deflate stream in the file and if so, the additional information should help you to decompress other reports correctly using zLib decompression routines or similar.