Read mainframe file and parse data using .NET - C#

I have a file which is very long and has no line breaks, CR/LF, or other delimiters.
Records are fixed length: the first (control) record is 24 bytes, and all other records are fixed at 81 bytes.
I know how to read a fixed-length file on a per-line basis. I am using the FileHelpers MultiRecordEngine and have defined classes for each 81-byte record, but I can't figure out how to read 81 bytes at a time and then parse that string into the actual fields.

You can use a FileStream to read the number of bytes you need - in your case either 24 or 81. Keep in mind that reading advances the stream's position, and the offset argument is an offset into the buffer, not into the stream, so it should stay 0. Also be aware that Read may return fewer bytes than requested when there is not enough data left in the stream.
So you would end up with something like this:
var recordLength = 81;
var buffer = new byte[recordLength];
stream.Read(buffer, 0, recordLength); // offset = 0 is into the buffer; reading starts at the stream's current position
var record = System.Text.Encoding.UTF8.GetString(buffer); // single record - note mainframe extracts are often EBCDIC (e.g. Encoding.GetEncoding(37)) rather than UTF-8
Since the record length is different for the control record, you can extract the read logic into a single method - let's name it Read - and use that method to traverse the stream until you reach the end, like this:
public List<string> Records()
{
    var result = new List<string>();
    using (var stream = new FileStream(@"c:\temp\lipsum.txt", FileMode.Open))
    {
        // first record: the 24-byte control record
        result.Add(Read(stream, 24));
        var record = "";
        do
        {
            record = Read(stream);
            if (!string.IsNullOrEmpty(record)) result.Add(record);
        }
        while (record.Length > 0);
    }
    return result;
}

private string Read(FileStream stream, int length = 81)
{
    // not enough bytes left for a full record: signal the end with an empty string
    if (stream.Length < stream.Position + length) return "";
    var buffer = new byte[length];
    stream.Read(buffer, 0, length); // a local FileStream fills the buffer here since we checked the remaining length
    return System.Text.Encoding.UTF8.GetString(buffer);
}
This will give you a list of records, including the leading control record.
This is far from perfect, but it is an example - also keep in mind that even if the file is empty, the returned list will still contain one (empty) entry, because the control record is read unconditionally.
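Once you have each 81-character record as a string, cutting it into fields is a matter of fixed offsets. Here is a minimal sketch - the class name, field names, offsets, and lengths are invented for illustration, so substitute your actual record layout (and note again that mainframe extracts are often EBCDIC rather than UTF-8):
// Hypothetical 81-byte detail record - positions are examples only
public class DetailRecord
{
    public string CustomerId { get; set; }
    public string Name { get; set; }
    public decimal Amount { get; set; }

    public static DetailRecord Parse(string record)
    {
        return new DetailRecord
        {
            CustomerId = record.Substring(0, 10).Trim(),
            Name = record.Substring(10, 30).Trim(),
            Amount = decimal.Parse(record.Substring(40, 12).Trim())
        };
    }
}
// Usage (requires System.Linq): skip the 24-byte control record, parse the rest
// var details = Records().Skip(1).Select(DetailRecord.Parse).ToList();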

MemoryMapped file Access Exception

I am trying to read a huge file (mixed binary/text) using MemoryMap.
However, it gets to a certain iteration of my loop and just throws an access exception - nothing else; it doesn't say what kind of exception it is, why it couldn't read, etc. I've been trying to figure it out for a few hours but can't reach any conclusion.
Here's the code I am using to read it:
// numOfColors = 6
private static void ReadChunksFromLargeFile(int offsetToBegin, string fName, int numOfColors)
{
    var idx = offsetToBegin;
    int byteSizeForEachColor = (int)(new FileInfo(fName).Length / numOfColors);
    var buffer = new byte[byteSizeForEachColor];
    using (var mmf = MemoryMappedFile.CreateFromFile(fName))
    {
        for (int i = 0; i < numOfColors; i++)
        {
            using (var view = mmf.CreateViewStream(idx, byteSizeForEachColor, MemoryMappedFileAccess.Read))
            {
                view.Seek(idx, SeekOrigin.Begin);
                view.Read(buffer, 0, byteSizeForEachColor);
                var temp = ByteArrayToHexString(buffer);
                File.WriteAllText($@"C:\test\buffertest{i}.hex", temp);
            }
            idx += byteSizeForEachColor;
        }
    }
}
EDIT: offsetToBegin is 937
What I'm trying to do is read huge chunks based on a size I need. However, when it comes to i = 5 it just throws the exception.
The file I'm trying to read is this one: https://drive.google.com/file/d/1DsLaNnAOQDyWJ_g4PPNXGCNfbuirs_Ss/view?usp=sharing
Any input is appreciated. Thanks!
Your calculation is wrong. When you calculate the size of each color you are not taking the starting offset into account. When you call CreateViewStream beginning at that offset, the final view tries to map bytes past the end of the file, causing the access exception.
For example:
Filesize = 60 bytes
offset = 2 bytes
num of colors = 6
Your original calculation would result in:
byteSizeForEachColor = 60 / 6 = 10
So your loop skips the first 2 bytes and then reads 10 bytes for each color, but by the last color it has already gone past the end of the file:
2 + 5 x 10 = 52 // start of the last view
52 + 10 = 62 // The file is only 60 bytes long - it has gone too far
You need to subtract offsetToBegin from the file length before dividing, so the per-color size only covers the bytes after the offset.
Using the above values:
byteSizeForEachColor = (60 - 2) / 6 = 9 (integer division)
So it should only read 9 bytes for each color. You need to change your calculation to:
int byteSizeForEachColor = (int)((new FileInfo(fName).Length - offsetToBegin) / numOfColors);
...
using (var view = mmf.CreateViewStream(idx, byteSizeForEachColor, MemoryMappedFileAccess.Read))
{
    ...
}
Now the loop starts at the 2-byte offset and reads 9 bytes per color - the last view ends at 2 + 6 x 9 = 56, safely within the 60-byte file.
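Putting it together, a corrected version of the loop might look like this (a sketch assuming the rest of the original method is unchanged; it also drops the view.Seek(idx, ...) call, which would otherwise skip idx bytes a second time inside a view that already starts at idx):
long fileLength = new FileInfo(fName).Length;
int byteSizeForEachColor = (int)((fileLength - offsetToBegin) / numOfColors);
var buffer = new byte[byteSizeForEachColor];
long idx = offsetToBegin;

using (var mmf = MemoryMappedFile.CreateFromFile(fName))
{
    for (int i = 0; i < numOfColors; i++)
    {
        // each view already starts at idx, so read from the start of the view
        using (var view = mmf.CreateViewStream(idx, byteSizeForEachColor, MemoryMappedFileAccess.Read))
        {
            view.Read(buffer, 0, byteSizeForEachColor);
            File.WriteAllText($@"C:\test\buffertest{i}.hex", ByteArrayToHexString(buffer));
        }
        idx += byteSizeForEachColor;
    }
}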

Index of the line in StreamWriter

I'm using StreamWriter to write into a file, but I need the index of the line I'm writing to.
int i;
using (StreamWriter s = new StreamWriter("myfilename", true))
{
    i = s.Index(); // or something that works
    s.WriteLine("text");
}
My only idea is to read the whole file and count the lines. Any better solution?
The definition of a line
A line (and therefore a line index) in a file is delimited by the \n character. Typically (and on Windows especially) it is preceded by the carriage return \r character too, but that is not required, and it is not typically present on Linux or macOS.
Correct Solution
So what you are asking for - the line index at the current position - basically means the number of \n characters before the current position in the file you are writing to. Since you are appending, that position is the end of the file, so you can think of it as the number of lines in the file.
You can read the stream and count these, being considerate of your machine's RAM and not just reading the entire file into memory. That way it is safe to use on very large files.
// File to read/write
var filePath = @"C:\Users\luke\Desktop\test.txt";

// Write a file with 3 lines
File.WriteAllLines(filePath,
    new[] {
        "line 1",
        "line 2",
        "line 3",
    });

// Get newline character
const char newLine = '\n';

// Create read buffer
var buffer = new char[1024];

// Keep track of amount of data read
var read = 0;

// Keep track of the number of lines
var numberOfLines = 0;

// Read the file
using (var streamReader = new StreamReader(filePath))
{
    do
    {
        // Read the next chunk
        read = streamReader.ReadBlock(buffer, 0, buffer.Length);

        // If no data read...
        if (read == 0)
            // We are done
            break;

        // We read some data, so go through each character...
        for (var i = 0; i < read; i++)
            // If the character is \n
            if (buffer[i] == newLine)
                // We found a line
                numberOfLines++;
    }
    while (read > 0);
}
The lazy solution
If your files are not that large ("large" being dependent on your target machine's RAM and your program as a whole) and you just want to read the entire file into memory (into your program's RAM), you can use a one-liner:
var numberOfLines = File.ReadAllLines(filePath).Length;
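Either way, when appending, the next line you write lands at that count as its zero-based index (assuming the file ends with a newline). Tying it back to the question's sketch:
int i = numberOfLines; // zero-based index of the line about to be written
using (var s = new StreamWriter("myfilename", true))
{
    s.WriteLine("text"); // this line is written at index i
}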

C# BinaryReader ReadBytes(len) returns different results than Read(bytes, 0, len)

I've got a BinaryReader reading a number of bytes into an array. The underlying stream for the reader is a BufferedStream (whose underlying stream is a network stream). I noticed that sometimes the reader.Read(arr, 0, len) method returns different (wrong) results than reader.ReadBytes(len).
Basically my setup code looks like this:
var httpClient = new HttpClient();
var reader = new BinaryReader(new BufferedStream(await httpClient.GetStreamAsync(url).ConfigureAwait(false)));
Later on down the line, I'm reading a byte array from the reader. I can confirm the sz variable is the same in both scenarios.
int sz = ReadSize(reader); //sz of the array to read
if (bytes == null || bytes.Length <= sz)
{
bytes = new byte[sz];
}
//reader.Read will return different results than reader.ReadBytes sometimes
//everything else is the same up until this point
//var tempBytes = reader.ReadBytes(sz); <- this will return right results
reader.Read(bytes, 0, sz); // <- this will not return the right results sometimes
It seems like the reader.Read method is reading further into the stream than it needs to or something, because the rest of the parsing will break after this happens. Obviously I could stick with reader.ReadBytes, but I want to reuse the byte array to go easy on the GC here.
Would there ever be any reason that this would happen? Is a setting wrong or something?
The key difference is that Read(bytes, 0, len) does NOT guarantee it will read len bytes - it returns the number of bytes actually read, and with a network-backed stream that can be fewer than requested. ReadBytes(len), on the other hand, loops internally until it has read len bytes or reached the end of the stream. Since you are reusing the byte array and Read does not clear it, a short read leaves stale bytes from the previous record in the buffer, which then conflict with the new data. Either check Read's return value and keep reading until you have len bytes, or make sure you only parse up to the number of bytes it reports.
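A minimal read-until-full helper along those lines - a sketch, not from the original post (on .NET 7 and later, Stream.ReadExactly does the same job):
// Fills buffer[0..count) completely, looping over partial reads;
// throws if the stream ends before count bytes arrive.
static void ReadExactly(BinaryReader reader, byte[] buffer, int count)
{
    int total = 0;
    while (total < count)
    {
        int n = reader.Read(buffer, total, count - total);
        if (n == 0)
            throw new EndOfStreamException();
        total += n;
    }
}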

How to read a specific part in text file?

I have a really big text file (500 MB) and I need to get its text.
Of course the problem is an out-of-memory exception, but I want to solve it by taking strings (or char arrays) and putting them in a List.
I searched on Google but I really don't know how to take a specific part.
* It's one long line, if that helps.
Do this:
using (FileStream fsSource = new FileStream(pathSource,
    FileMode.Open, FileAccess.Read))
{
    // Read the source file in chunks.
    int numBytesToRead = 4096; // your amount to read at a time
    byte[] bytes = new byte[numBytesToRead];
    while (true)
    {
        // Read may return anything from 0 to numBytesToRead.
        int n = fsSource.Read(bytes, 0, numBytesToRead);
        // Break when the end of the file is reached.
        if (n == 0)
            break;
        // Do here what you want to do with the n bytes read
        // (e.g. convert to string using Encoding.YourEncoding.GetString(bytes, 0, n))
    }
}
You can use the StreamReader class to read parts of a file.
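For text, a chunked StreamReader loop might look like this (a sketch; pathSource is the path from the answer above, and the 4096-char chunk size is an arbitrary placeholder):
using (var reader = new StreamReader(pathSource))
{
    var buffer = new char[4096]; // placeholder chunk size
    int read;
    while ((read = reader.ReadBlock(buffer, 0, buffer.Length)) > 0)
    {
        var part = new string(buffer, 0, read);
        // process or store 'part' here, e.g. add it to a List<string>
    }
}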

Copying a part of a byte[] array into a PDFReader

This is a continuation of the ongoing struggle to reduce my memory load mentioned in
How do you refill a byte array using SqlDataReader?
So I have a byte array of a set size; for this example, I'll say new byte[400000]. Inside of this array, I'll be placing PDFs of different sizes (less than 400000).
Pseudocode would be:
public void Run()
{
    byte[] fileRetrievedFromDatabase = new byte[400000];
    foreach (var document in documentArray)
    {
        // Refill the file with data from the database
        var currentDocumentSize = PopulateFileWithPDFDataFromDatabase(fileRetrievedFromDatabase);
        var reader = new iTextSharp.text.pdf.PdfReader(fileRetrievedFromDatabase.Take((int)currentDocumentSize).ToArray());
        pageCount = reader.NumberOfPages;
        // DO ADDITIONAL WORK
    }
}

private int PopulateFileWithPDFDataFromDatabase(byte[] fileRetrievedFromDatabase)
{
    // Data access code goes here (myReader / logoCMD come from it)
    int documentSize = 0;
    int bufferSize = 100;                  // Size of the BLOB buffer.
    byte[] outbyte = new byte[bufferSize]; // The BLOB byte[] buffer to be filled by GetBytes.
    long startIndex = 0;
    long retval = 0;
    myReader = logoCMD.ExecuteReader(CommandBehavior.SequentialAccess);
    Array.Clear(fileRetrievedFromDatabase, 0, fileRetrievedFromDatabase.Length);
    if (myReader == null)
    {
        return 0;
    }
    while (myReader.Read())
    {
        documentSize = (int)myReader.GetBytes(0, 0, null, 0, 0);
        // Reset the starting byte for the new BLOB.
        startIndex = 0;
        // Read the bytes into outbyte[] and retain the number of bytes returned.
        retval = myReader.GetBytes(0, startIndex, outbyte, 0, bufferSize);
        // Continue reading and writing while there are bytes beyond the size of the buffer.
        while (retval == bufferSize)
        {
            Array.Copy(outbyte, 0, fileRetrievedFromDatabase, startIndex, retval);
            // Reposition the start index to the end of the last buffer and fill the buffer.
            startIndex += retval;
            retval = myReader.GetBytes(0, startIndex, outbyte, 0, bufferSize);
        }
    }
    return documentSize;
}
The problem with the above code is that I keep getting a "Rebuild trailer not found. Original Error: PDF startxref not found" error when I try to access the PdfReader. I believe it's because the byte array is too long and has trailing 0's. But since I'm reusing the byte array so that I'm not continuously building new objects on the LOH, I need to do this.
So how do I get just the piece of the Array that I need and send it to the PDFReader?
Updated
So I looked at the source and realized I had some variables from my actual code that were confusing. I'm basically reusing the fileRetrievedFromDatabase object in each iteration of the loop. Since arrays are reference types, it gets cleared (set to all zeros) and then refilled inside PopulateFileWithPDFDataFromDatabase. This object is then used to create a new PDF.
If I didn't do it this way, a new large byte array would be created in every iteration, the Large Object Heap would fill up, and eventually an OutOfMemoryException would be thrown.
You have at least two options:
1. Treat your buffer like a circular buffer, with two indexes for the starting and ending positions. You need an index of the last byte written into the buffer, and you have to stop reading when you reach that index.
2. Simply read the same number of bytes as your document actually contains, to avoid reading into the "unknown" parts of the buffer which don't belong to the same file.
In other words, instead of passing bufferSize as the last parameter, pass data.Length.
// Read the bytes into outbyte[] and retain the number of bytes returned.
retval = myReader.GetBytes(0, startIndex, outbyte, 0, data.Length);
If the data length is 10 and your outbyte buffer is 15, then you should only read data.Length bytes, not bufferSize.
However, I still don't see how you're reusing the outbyte "buffer", if that's what you're doing... I'm simply not following based on what you've provided in your answer. Maybe you can clarify exactly what is being reused.
Apparently, the way the while loop is currently structured, it wasn't copying the data on its last iteration. I needed to add this:
if (outbyte != null && outbyte.Length > 0 && retval > 0)
{
    Array.Copy(outbyte, 0, currentDocument.Data, startIndex, retval);
}
It's now working, although I will definitely need to refactor.
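As a side note, the Take(...).ToArray() call in Run() allocates a fresh array for every document anyway, which undercuts the buffer reuse. If your iTextSharp version has a Stream-accepting PdfReader constructor (recent versions do), you can hand it just the filled portion of the reused buffer without copying - a sketch, assuming the rest of Run() stays as-is:
// Wrap only the first currentDocumentSize bytes of the reused buffer - no copy made
using (var ms = new MemoryStream(fileRetrievedFromDatabase, 0, (int)currentDocumentSize, writable: false))
{
    var reader = new iTextSharp.text.pdf.PdfReader(ms);
    pageCount = reader.NumberOfPages;
}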
