I am trying to read a huge file (mixed binary/text) using MemoryMappedFile.
However, at a certain point in my loop iteration it throws an access exception; just that, with no detail about what kind of exception it is or why the read failed. I've been trying to figure it out for a few hours but can't reach any conclusion.
Here's the code I am using to read it:
// numOfColors = 6
private static void ReadChunksFromLargeFile(int offsetToBegin, string fName, int numOfColors)
{
    var idx = offsetToBegin;
    int byteSizeForEachColor = (int)new FileInfo(fName).Length / numOfColors;
    var buffer = new byte[byteSizeForEachColor];
    using (var mmf = MemoryMappedFile.CreateFromFile(fName))
    {
        for (int i = 0; i < numOfColors; i++)
        {
            using (var view = mmf.CreateViewStream(idx, byteSizeForEachColor, MemoryMappedFileAccess.Read))
            {
                view.Seek(idx, SeekOrigin.Begin);
                view.Read(buffer, 0, byteSizeForEachColor);
                var temp = ByteArrayToHexString(buffer);
                File.WriteAllText($@"C:\test\buffertest{i}.hex", temp);
            }
            idx += byteSizeForEachColor;
        }
    }
}
EDIT: offsetToBegin is 937.
What I'm trying to do is read large chunks based on a size I need. However, when it gets to i = 5 it just throws the exception.
The file I'm trying to read is this one: https://drive.google.com/file/d/1DsLaNnAOQDyWJ_g4PPNXGCNfbuirs_Ss/view?usp=sharing
Any input is appreciated. Thanks!
Your calculations are wrong. When you calculate the size of each color you are not taking the offset into account, so the last call to CreateViewStream asks for bytes past the end of the file, which causes the access exception.
For example:
Filesize = 60 bytes
offset = 2 bytes
num of colors = 6
Your original calculation would result in:
byteSizeForEachColor = 60 / 6 = 10
So your loop will skip the first 2 bytes and then read 10 bytes for each color, but by the last color it has already gone past the end of the file:
2 + 5 x 10 = 52 // the first five colors still fit
2 + 6 x 10 = 62 // the file is only 60 bytes long - the sixth read goes too far
You need to subtract offsetToBegin from the file length before dividing, so that the views only ever cover bytes that actually exist.
Using the above values:
byteSizeForEachColor = (60 - 2) / 6 = 9 (integer division)
So it should only read 9 bytes for each color. You need to change the size calculation to:
int byteSizeForEachColor = (int)((new FileInfo(fName).Length - offsetToBegin) / numOfColors);
Now the loop skips the 2-byte offset once and reads 9 bytes for each color - which will not go beyond the length of the file.
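Putting the fix together, a corrected version of the whole method might look like this (a sketch only; `ByteArrayToHexString` and the output path are taken from the question). Note that the view already starts at `idx`, so there is no need to `Seek(idx)` inside it - that would skip the data a second time:

```csharp
// Sketch of the corrected method. Assumes ByteArrayToHexString exists elsewhere.
private static void ReadChunksFromLargeFile(int offsetToBegin, string fName, int numOfColors)
{
    long fileLength = new FileInfo(fName).Length;
    // Subtract the offset once, then divide, so the last view stays inside the file.
    int byteSizeForEachColor = (int)((fileLength - offsetToBegin) / numOfColors);
    var buffer = new byte[byteSizeForEachColor];
    long idx = offsetToBegin;

    using (var mmf = MemoryMappedFile.CreateFromFile(fName, FileMode.Open, null, 0, MemoryMappedFileAccess.Read))
    {
        for (int i = 0; i < numOfColors; i++)
        {
            using (var view = mmf.CreateViewStream(idx, byteSizeForEachColor, MemoryMappedFileAccess.Read))
            {
                // The view already starts at idx, so read from position 0 - do not Seek(idx) again.
                view.Read(buffer, 0, byteSizeForEachColor);
                File.WriteAllText($@"C:\test\buffertest{i}.hex", ByteArrayToHexString(buffer));
            }
            idx += byteSizeForEachColor;
        }
    }
}
```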
Related
I have a file which is very long and has no line breaks, CR/LF, or other delimiters.
Records are fixed length: the first control record is 24 bytes, and all other records are a fixed 81 bytes.
I know how to read a fixed-length file on a per-line basis, and I am using the Multi Record Engine with classes defined for each 81-byte record, but I can't figure out how to read 81 bytes at a time and then parse that string into the actual fields.
You can use a FileStream to read the number of bytes you need - in your case either 24 or 81. Keep in mind that the stream's position advances as you read, so you should not pass a stream offset (the buffer offset should always be 0). Also be aware that when there is no data left on the stream, a read will come up short, so you need to guard against running past the end.
So you would end up with something like this:
var recordlength = 81;
var buffer = new byte[recordlength];
stream.Read(buffer, 0, recordlength); // offset = 0, start at current position
var record = System.Text.Encoding.UTF8.GetString(buffer); // single record
Since the record length is different for the control record, you can move the reading into a single method - let's name it Read - and use it to traverse the stream until you reach the end, like this:
public List<string> Records()
{
    var result = new List<string>();
    using (var stream = new FileStream(@"c:\temp\lipsum.txt", FileMode.Open))
    {
        // first record
        result.Add(Read(stream, 24));
        var record = "";
        do
        {
            record = Read(stream);
            if (!string.IsNullOrEmpty(record)) result.Add(record);
        }
        while (record.Length > 0);
    }
    return result;
}

private string Read(FileStream stream, int length = 81)
{
    if (stream.Length < stream.Position + length) return "";
    var buffer = new byte[length];
    stream.Read(buffer, 0, length);
    return System.Text.Encoding.UTF8.GetString(buffer);
}
This will give you a list of records (including the starting control record).
This is far from perfect, but it's an example. Also keep in mind that even for an empty file there will always be one (empty) entry in the returned list.
I have a number of documents with predictable placement of certain text which I'm trying to extract. For the most part it works very well, but I'm having difficulties with a certain fraction of documents that have slightly thicker text.
Thin text:
Thick text:
I know it's hard to tell the difference at this resolution, but if you look at MO DAY YEAR TIME (2400) portion, you can tell that the second one is thicker.
The thin text gives me exactly what is expected:
09/28/2015
0820
However, the thick version gives me every character tripled, with whitespace in between the duplicated characters:
1 1 11 1 1/ / /1 1 19 9 9/ / /2 2 20 0 01 1 15 5 5
1 1 17 7 70 0 02 2 2
I'm using the following code to extract text from documents:
public static Document GetDocumentInfo(string fileName)
{
    // Using 11 in x 8.5 in dimensions at 72 dpi.
    var boundingBoxes = new[]
    {
        new RectangleJ(446, 727, 85, 14),
        new RectangleJ(396, 702, 43, 14),
        new RectangleJ(306, 680, 58, 7),
        new RectangleJ(378, 680, 58, 7),
        new RectangleJ(446, 680, 45, 7),
        new RectangleJ(130, 727, 29, 10),
        new RectangleJ(130, 702, 29, 10)
    };
    var data = GetPdfData(fileName, 1, boundingBoxes);
    // I would populate the new document with the extracted data
    // here, but it's not important for the example.
    var doc = new Document();
    return doc;
}
public static string[] GetPdfData(string fileName, int pageNum, RectangleJ[] boundingBoxes)
{
    // Omitted safety checks, as they're not important for the example.
    var data = new string[boundingBoxes.Length];
    using (var reader = new PdfReader(fileName))
    {
        if (reader.NumberOfPages < 1)
        {
            return null;
        }
        RenderFilter filter;
        ITextExtractionStrategy strategy;
        for (var i = 0; i < boundingBoxes.Length; ++i)
        {
            filter = new RegionTextRenderFilter(boundingBoxes[i]);
            strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
            data[i] = PdfTextExtractor.GetTextFromPage(reader, pageNum, strategy);
        }
        return data;
    }
}
Obviously, if nothing else works, I can strip the duplicate characters after reading them in, as there is a very apparent pattern, but I'd rather find a proper way than a hack. I've been looking around for the past few hours but couldn't find anyone encountering a similar issue.
EDIT:
I finally came across this SO question:
Text Extraction Duplicate Bold Text
...and in the comments it's indicated that some lower-quality PDF producers duplicate text to simulate boldness, so that may be what's happening here. However, there is a mention of omitting duplicate text at the same location, which I don't know how to achieve, since this portion of my code...
data[i] = PdfTextExtractor.GetTextFromPage(reader, pageNum, strategy);
...reads in the duplicated text completely in any of the specified locations.
EDIT:
I now have come across documents that duplicate contents up to four times to simulate thickness. That's a very strange way of doing things, but I'm sure designers of that method had their reasons.
EDIT:
I produced a solution (see my answer). It processes the data after it's already extracted and removes any repetitions. Ideally this would have been done during the extraction process, but that can get pretty complicated, and this seemed like a very clean and easy way of accomplishing the same thing.
As @mkl has suggested, one way of tackling this issue is to override LocationTextExtractionStrategy; however, things get pretty complicated, since it requires comparing the locations of each character found within the specified boundaries. I tried doing some research to accomplish that, but due to poor documentation it was getting a bit out of hand.
So instead, I created a post-processing method, loosely based on what @TheMuffinMan suggested, to clean up any repetitions. I decided not to deal with pixels, but with character-count anomalies in known static locations. In my case, I know that the second extracted data piece can never be longer than three characters, so it's a good reference point. If you know the document layout, you can use anything on it that you know will always have a fixed length.
After I extract the data with the method listed in my original post, I check whether the second data piece is longer than three characters. If it is, I divide its length by three, since that's the most characters it can legitimately have, and because all repetitions come out to an even multiple of the original length, that division gives me the repetition count:
var data = GetPdfData(fileName, 1, boundingBoxes);
if (data[1].Length > 3)
{
    var count = data[1].Length / 3;
    for (var i = 0; i < data.Length; ++i)
    {
        data[i] = RemoveRepetitions(data[i], count);
    }
}
As you can see, I then loop over the data and pass each piece into the RemoveRepetitions() method:
public static string RemoveRepetitions(string original, int count)
{
    if (original.Length % count != 0)
    {
        return null;
    }
    var temp = new char[original.Length / count];
    for (int i = 0; i < original.Length; i += count)
    {
        // Keep the first character of each run of count repeated characters.
        temp[i / count] = original[i];
    }
    return new string(temp);
}
This method takes the string and the expected repetition count, which we calculated earlier. One thing to note is that I don't have to worry about the whitespace inserted during the duplication process (shown in the original post), because count represents the total number of characters that appear where only one should have been.
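For example, using the RemoveRepetitions method above on a string where every character was tripled, the method keeps every third character; a string whose length is not a multiple of count is rejected:

```csharp
// Each character appears count = 3 times in a row.
var cleaned = RemoveRepetitions("111222333", 3);
Console.WriteLine(cleaned);           // 123

// Length 7 is not divisible by 3, so the method returns null.
var invalid = RemoveRepetitions("1112223", 3);
Console.WriteLine(invalid == null);   // True
```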
I have a huge byte[] data array. I want to take a specific number of bytes at a time (a block size), do some operation on each block, and append the result of each block, one after another, into a new array.
This is my code:
int j = 0;
int number_of_blocks = data.Length / 16;
byte[] one_block = new byte[16];
byte[] one_block_return = new byte[16];
byte[] all_block_return = new byte[data.Length];
for (int i = 0; i < number_of_blocks; i++)
{
    Array.Copy(data, j, one_block, 0, 16);
    one_block_return = one_block_operation(one_block);
    Array.Copy(one_block_return, 0, all_block_return, j, 16);
    Array.Clear(one_block, 0, one_block.Length);
    j = j + 16;
}
The only problem with this code is that it's too slow, since my data array is extremely large. So I am looking for a replacement for Array.Copy() that is faster, or any better way to do this - I'd like to see several approaches and variations in the code as well.
Thanks!
What about simple parallelization?
int number_of_blocks = (int)Math.Ceiling((double)data.Length / 16);
byte[] all_block_return = new byte[data.Length];
Parallel.For(0, number_of_blocks, block_no =>
{
    var blockStart = block_no * 16; // 16 - block size
    var blockLength = Math.Min(16, data.Length - blockStart);
    byte[] one_block = new byte[16];
    byte[] one_block_return = new byte[16];
    Array.Copy(data, blockStart, one_block, 0, blockLength);
    one_block_return = one_block_operation(one_block);
    Array.Copy(one_block_return, 0, all_block_return, blockStart, blockLength);
});
Would it be possible to modify one_block_operation to take data, blockStart and blockLength arguments instead of a buffer (one_block)? That way you could avoid one of the Array.Copy calls.
EDIT:
Here is how it works:
First, we calculate the number of blocks. Then Parallel.For is executed with the specified arguments: an inclusive start index, an exclusive end index, and a delegate that receives one argument - the index currently being processed. In our case, the index is the block number. The sequential equivalent of this code is:
for (var block_no = 0; block_no < number_of_blocks; block_no++)
{
    body(block_no);
}
The only difference is that Parallel.For runs that loop on multiple threads. The number of threads is not fixed - it depends on the ThreadPool size (according to MSDN it also depends on many other factors).
Because the delegate invocations run independently (and we don't know the order in which they are called), we cannot keep the current block's start index in a variable outside the delegate (the way you kept j outside your for loop). But if we know the current block number and the block size, calculating the block's start index is easy (that's the blockStart line).
And no - you can't skip the blockLength line or replace it with the constant 16. Why? Consider the following sequence:
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
We can divide this sequence into two blocks of size 16:
1st: [1-16]
2nd: [17]
So, as you can see, the second block doesn't contain 16 elements, but only 1. The Math.Min line calculates the actual block length, so you can easily avoid an IndexOutOfRangeException.
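A quick sanity check of that block arithmetic, using the 17-element example above (a minimal sketch):

```csharp
using System;

class BlockMath
{
    static void Main()
    {
        int dataLength = 17;   // the example sequence above
        int blockSize = 16;

        // Number of blocks, rounding up so the short tail block is counted.
        int numberOfBlocks = (int)Math.Ceiling((double)dataLength / blockSize);

        int lastBlockStart = (numberOfBlocks - 1) * blockSize;
        int lastBlockLength = Math.Min(blockSize, dataLength - lastBlockStart);

        Console.WriteLine(numberOfBlocks);   // 2
        Console.WriteLine(lastBlockLength);  // 1
    }
}
```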
I have a requirement to test some load issues with regards to file size. I have a windows application written in C# which will automatically generate the files. I know the size of each file, ex. 100KB, and how many files to generate. What I need help with is how to generate a string less than or equal to the required file size.
pseudo code:
long fileSizeInKB = 1024 * 100; // 100KB
int numberOfFiles = 5;
for (var i = 0; i < numberOfFiles; i++)
{
    var dataSize = fileSizeInKB;
    var buffer = new byte[dataSize];
    using (var fs = new FileStream(File, FileMode.Create, FileAccess.Write))
    {
    }
}
You can always use the string constructor that takes a char and the number of times you want that character repeated:
string myString = new string('*', 5000);
This gives you a string of 5000 stars - tweak to your needs.
The easiest way would be the following code:
var content = new string('A', (int)fileSizeInKB);
Now you have a string with as many As as required.
To fill it with Lorem Ipsum or some other repeating string build something like the following pseudocode:
string contentString = "Lorem Ipsum...";
for (int i = 0; i < fileSizeInKB / contentString.Length; i++)
//write contentString to file
if (fileSizeInKB % contentString.Length > 0)
// write remaining substring of contentString to file
Edit: If you're saving in Unicode (UTF-16) you may need to halve the character count, because it uses two bytes per character.
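A minimal sketch of that pseudocode (the method name and path handling are mine, not from the question): it writes the repeating pattern until exactly sizeInBytes bytes are on disk, then tops up with a partial pattern. With ASCII, one character is one byte, which keeps the size arithmetic simple.

```csharp
using System.IO;
using System.Text;

static class ExactSizeWriter
{
    // Fills a file with a repeating ASCII pattern up to an exact byte count.
    public static void WriteFileOfSize(string path, int sizeInBytes)
    {
        byte[] pattern = Encoding.ASCII.GetBytes("Lorem Ipsum ");
        using (var fs = new FileStream(path, FileMode.Create, FileAccess.Write))
        {
            int written = 0;
            while (written + pattern.Length <= sizeInBytes)
            {
                fs.Write(pattern, 0, pattern.Length);
                written += pattern.Length;
            }
            // Top up with a partial pattern so the size matches exactly.
            fs.Write(pattern, 0, sizeInBytes - written);
        }
    }
}
```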
There are so many variations on how you can do this. One would be, fill the file with a bunch of chars. You need 100KB? No problem.. 100 * 1024 * 8 = 819200 bits. A single char is 16 bits. 819200 / 16 = 51200. You need to stick 51,200 chars into a file. But consider that a file may have additional header/meta data, so you may need to account for that and decrease the number of chars to write to file.
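That arithmetic, written out (a .NET char is 2 bytes, i.e. 16 bits):

```csharp
int bits = 100 * 1024 * 8;            // 819200 bits in 100 KB
int bitsPerChar = sizeof(char) * 8;   // 16 bits per UTF-16 char
int charsNeeded = bits / bitsPerChar; // 51200 chars to fill 100 KB
```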
As a partial answer to your question I recently created a portable WPF app that easily creates 'junk' files of almost any size: https://github.com/webmooch/FileCreator
I need a fast and efficient method to read a space-separated file of numbers into an array. The files are formatted this way:
4 6
1 2 3 4 5 6
2 5 4 3 21111 101
3 5 6234 1 2 3
4 2 33434 4 5 6
The first row is the dimension of the array [rows columns]. The lines following contain the array data.
The data may also be formatted without any newlines like this:
4 6
1 2 3 4 5 6 2 5 4 3 21111 101 3 5 6234 1 2 3 4 2 33434 4 5 6
I can read the first line and initialize an array with the row and column values. Then I need to fill the array with the data values. My first idea was to read the file line by line and use the split function. But the second format listed gives me pause, because the entire array data would be loaded into memory at once. Some of these files are hundreds of MBs. The second method would be to read the file in chunks and parse them piece by piece. Maybe somebody has a better way of doing this?
What's your usage pattern for the data once it's loaded? Do you generally need to touch every array element or will you just make sparse/random access?
If you need to touch most array elements, loading it into memory will probably be the best way to go.
If you need to just access certain elements, you might want to lazy load the elements that you need into memory. One strategy would be to determine which of the two layouts the file uses (with/without newline) and create an algorithm to load a particular element directly from disk as needed (seek the given file offset, read and parse). To efficiently re-access the same element it could make sense to keep the element, once read, in a dictionary indexed by the offset. Check the dictionary first before going to the file for a particular value.
On general principle I would take the simple route unless your testing proves that you need the more complicated one (avoid premature optimization).
Read the file a character at a time. If it's whitespace, start a new number. If it's a digit, use it.
for numbers with multiple digits, keep a counter variable:
int counter = 0;
while (fileOpen) {
    char ch = readChar(); // use your imagination to define this method.
    if (isDigit(ch)) {
        counter *= 10;
        counter += asciiToDecimal(ch);
    } else if (isWhitespace(ch)) {
        appendToArray(counter);
        counter = 0;
    } else {
        // Error?
    }
}
Edited for clarification.
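A concrete sketch of that idea in C# (the names are mine; it also tracks whether we are currently inside a number, so that runs of consecutive whitespace don't emit spurious zeros):

```csharp
using System.Collections.Generic;
using System.IO;

static class NumberScanner
{
    // Reads digits one character at a time and emits a number at each separator.
    public static List<int> ParseNumbers(TextReader reader)
    {
        var result = new List<int>();
        int counter = 0;
        bool inNumber = false;   // avoids appending 0 for consecutive whitespace
        int v;
        while ((v = reader.Read()) != -1)
        {
            char ch = (char)v;
            if (ch >= '0' && ch <= '9')
            {
                counter = counter * 10 + (ch - '0');
                inNumber = true;
            }
            else if (inNumber)
            {
                result.Add(counter);
                counter = 0;
                inNumber = false;
            }
        }
        if (inNumber) result.Add(counter);   // flush a trailing number
        return result;
    }
}
```

For example, `NumberScanner.ParseNumbers(new StringReader("4 6\n1 2 3"))` yields 4, 6, 1, 2, 3 regardless of whether the separators are spaces or newlines.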
How about:
static void Main()
{
    // sample data
    File.WriteAllText("my.data", @"4 6
1 2 3 4 5 6
2 5 4 3 21111 101
3 5 6234 1 2 3
4 2 33434 4 5 6");
    using (Stream s = new BufferedStream(File.OpenRead("my.data")))
    {
        int rows = ReadInt32(s), cols = ReadInt32(s);
        int[,] arr = new int[rows, cols];
        for (int y = 0; y < rows; y++)
            for (int x = 0; x < cols; x++)
            {
                arr[y, x] = ReadInt32(s);
            }
    }
}
private static int ReadInt32(Stream s)
{   // edited to improve handling of multiple spaces etc
    int b;
    // skip any preceding non-digit characters
    while ((b = s.ReadByte()) >= 0 && (b < '0' || b > '9')) { }
    if (b < 0) throw new EndOfStreamException();
    int result = b - '0';
    while ((b = s.ReadByte()) >= '0' && b <= '9')
    {
        result = result * 10 + (b - '0');
    }
    return result;
}
Actually, this isn't very specific about the delimiters - it pretty much assumes that anything that isn't a digit is a delimiter, and it only supports ASCII (you could use a reader instead if you need other encodings).
Unless the machine you're parsing these text files on is limited, files of a few hundred MB should still fit in memory. I'd suggest going with your first approach of reading by line and using split.
If memory becomes an issue, your second approach of reading in chunks should work fine.
Basically what I'm saying is just to implement it and measure if performance is a problem.
Let's assume we've read the entire file into a string.
You say the first two values are the rows and columns, so what we definitely need is to parse the numbers.
After that, we can take the first two, create our data structure, and fill it accordingly.
var fileData = File.ReadAllText(...).Split(' ');
var convertedToNumbers = fileData.Select(entry => int.Parse(entry)).ToList();
int rows = convertedToNumbers.First();
int columns = convertedToNumbers.Skip(1).First();
// Now we have the number of rows, the number of columns, and the data.
int[,] resultData = new int[rows, columns];
// Skipping over the rows and columns values.
var indexableData = convertedToNumbers.Skip(2).ToList();
for (int i = 0; i < rows; i++)
    for (int j = 0; j < columns; j++)
        resultData[i, j] = indexableData[i * columns + j];
An alternative would be to read the first two values from a stream, initialize the array, and then read n values at a time, but that would be more complicated. Also, it's best to keep files open for the shortest time possible.
You want to stream the file into memory and parse as you go.
private static IEnumerable<string> StreamAsSpaceDelimited(this StreamReader reader)
{
    StringBuilder builder = new StringBuilder();
    int v;
    while ((v = reader.Read()) != -1)
    {
        char c = (char)v;
        if (char.IsWhiteSpace(c))
        {
            if (builder.Length > 0)
            {
                yield return builder.ToString();
                builder.Clear();
            }
        }
        else
        {
            builder.Append(c);
        }
    }
    if (builder.Length > 0) yield return builder.ToString(); // flush the last token
}
This parses the file into a lazily evaluated sequence of space-delimited strings, and then you can read them as ints like this:
using (StreamReader sr = new StreamReader("filename"))
{
    var nums = sr.StreamAsSpaceDelimited().Select(s => int.Parse(s));
    var enumerator = nums.GetEnumerator();
    enumerator.MoveNext();
    int numRows = enumerator.Current;
    enumerator.MoveNext();
    int numColumns = enumerator.Current;
    int r = 0, c = 0;
    int[,] destArray = new int[numRows, numColumns];
    while (enumerator.MoveNext())
    {
        destArray[r, c] = enumerator.Current;
        c++;
        if (c == numColumns)
        {
            c = 0;
            r++;
            if (r == numRows)
                break; // we are done
        }
    }
}
Because we use iterators, this never buffers more than one token at a time. This is a common approach for parsing large files (for example, it is how LINQ2CSV works).
Here are two methods
IEnumerable<int[]> GetArrays(string filename, bool skipFirstLine)
{
    using (StreamReader reader = new StreamReader(filename))
    {
        if (skipFirstLine && !reader.EndOfStream)
            reader.ReadLine();
        while (!reader.EndOfStream)
        {
            string temp = reader.ReadLine();
            int[] array = temp.Trim().Split().Select(s => int.Parse(s)).ToArray();
            yield return array;
        }
    }
}
int[][] GetAllArrays(string filename, bool skipFirstLine)
{
    int skipNumber = 0;
    if (skipFirstLine)
        skipNumber = 1;
    int[][] array = File.ReadAllLines(filename)
        .Skip(skipNumber)
        .Select(line => line.Trim().Split().Select(s => int.Parse(s)).ToArray())
        .ToArray();
    return array;
}
If you're dealing with large files, the first will likely be preferable. If the files are small, the second can load the entire thing into a jagged array.