I have a binary file that I have loaded into a byte array. The file can be 20 MB or more. I want to parse it or find a particular value in it. I am currently doing this in two ways:
1. By converting the whole file to a char array.
2. By converting the whole file to a hex string (I also have the hex values to look for).
What is the best way to parse the whole file? Or should I work on it in binary form? I am using VS 2005.
In terms of memory consumption, it would be best if you could parse it directly, on the fly.
Converting it to a char array in C# effectively doubles its size in memory (presuming you convert each byte to a char), while a hex string takes at least 4 times the size: two hex digits per byte, and C# chars are 16-bit Unicode characters.
On the other hand, if you need to search and parse the same data repeatedly, you may benefit from keeping it in whatever form suits your needs best.
What's stopping you from searching in the byte[]?
IMHO, if you're simply searching for a byte with a specified value, or for several contiguous bytes, this is the easiest and most efficient way to do it.
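To make the byte[] suggestion concrete, here is a minimal sketch of a naive search for a run of bytes. The class and method names (ByteSearch, IndexOf) are made up for illustration, not part of the framework, and this is the plain quadratic scan rather than an optimized algorithm.

```csharp
using System;

static class ByteSearch
{
    // Returns the index of the first occurrence of pattern in data, or -1 if absent.
    public static int IndexOf(byte[] data, byte[] pattern)
    {
        if (data == null) throw new ArgumentNullException("data");
        if (pattern == null) throw new ArgumentNullException("pattern");
        if (pattern.Length == 0) return 0;

        int last = data.Length - pattern.Length;
        for (int i = 0; i <= last; i++)
        {
            // Compare pattern against data starting at position i.
            int j = 0;
            while (j < pattern.Length && data[i + j] == pattern[j]) j++;
            if (j == pattern.Length) return i;
        }
        return -1;
    }
}
```

For a 20 MB array and a short pattern this is usually fast enough; no conversion to char or hex is involved.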
If I understood your question correctly you need to find strings which can contain any characters in a large binary file. Does the binary file contain text? If so do you know the encoding? If so you can use StreamReader class like so:
using (StreamReader sr = new StreamReader(@"C:\test.dat", System.Text.Encoding.UTF8))
{
string s = sr.ReadLine();
}
In any case I think it's much more efficient using some kind of stream access to the file, instead of loading it all to memory.
You could load it into memory in chunks, and then use a pattern-matching algorithm (such as Knuth-Morris-Pratt or Rabin-Karp) on each chunk.
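A rough sketch of the chunked approach follows. It assumes a simple naive comparison instead of Knuth-Morris-Pratt or Rabin-Karp, and the names (ChunkedSearch, Find, chunkSize) are invented for this example. The one detail that is easy to get wrong is a match that straddles two chunks, so the last pattern.Length - 1 bytes of each chunk are carried over to the next one.

```csharp
using System;
using System.IO;

static class ChunkedSearch
{
    // Returns the offset (relative to where reading started) of the first match, or -1.
    // chunkSize must be greater than zero.
    public static long Find(Stream stream, byte[] pattern, int chunkSize)
    {
        if (pattern.Length == 0) return 0;
        int overlap = pattern.Length - 1;
        byte[] buffer = new byte[chunkSize + overlap];
        int carried = 0;       // bytes kept from the previous chunk
        long bufferStart = 0;  // stream offset corresponding to buffer[0]

        while (true)
        {
            int read = stream.Read(buffer, carried, chunkSize);
            if (read == 0) return -1;
            int valid = carried + read;

            for (int i = 0; i + pattern.Length <= valid; i++)
            {
                int j = 0;
                while (j < pattern.Length && buffer[i + j] == pattern[j]) j++;
                if (j == pattern.Length) return bufferStart + i;
            }

            // Keep the tail so a match spanning two chunks is not missed.
            int keep = Math.Min(overlap, valid);
            Buffer.BlockCopy(buffer, valid - keep, buffer, 0, keep);
            bufferStart += valid - keep;
            carried = keep;
        }
    }
}
```

Memory use stays at roughly chunkSize bytes regardless of the file size, at the cost of a little bookkeeping.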
I want to create a binary file and store string data in it, I used this sample:
FileStream fs = new FileStream("c:\\test.data", FileMode.Create);
BinaryWriter bw = new BinaryWriter(fs);
bw.Write(Encoding.ASCII.GetBytes("david stein"));
bw.Close();
but when I open the file created by this sample (test.data) in Notepad, it contains the string data ("david stein"). So my question is: what's the difference between this binary writing and text writing when the result is a string?
I want to create a binary file whose data the user cannot open and read in Notepad; if the user opens it in Notepad, they should see real binary data.
Some files cannot be read when you open them in a text editor, such as JPG files, and they do not use any encryption. How does that work? How can I write my data like this?
now my question is: what's the difference between this binary writing and text writing when the result is a string?
The data in a file is always "a sequence of bytes". In this case, the sequence of bytes you've written is "the bytes representing the text 'david stein'" in the ASCII encoding. So yes, if you open the file in any editor which tries to interpret the bytes as text in a way which is compatible with ASCII, you'll see the text "david stein". Really it's just a load of bytes though - it all depends on how you interpret them.
If you'd written:
File.WriteAllText("c:\\test.data", "david stein", Encoding.ASCII);
you'd have ended up with the exact same sequence of bytes. There are any number of ways you could have created a file with the same bytes in. There's nothing about File.WriteAllText which "marks" the file as text, and there's nothing about FileStream or BinaryWriter which marks the file as binary.
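As a quick check of that claim, the following sketch writes the same text both ways to two temporary files and compares the resulting bytes. The class and method names are made up for the example; only the framework calls (FileStream, BinaryWriter, File.WriteAllText, File.ReadAllBytes) are real.

```csharp
using System;
using System.IO;
using System.Text;

static class SameBytesDemo
{
    // Writes text via BinaryWriter and via File.WriteAllText, then compares the files.
    public static bool ProducesSameBytes(string text)
    {
        string pathA = Path.GetTempFileName();
        string pathB = Path.GetTempFileName();
        try
        {
            using (FileStream fs = new FileStream(pathA, FileMode.Create))
            using (BinaryWriter bw = new BinaryWriter(fs))
            {
                bw.Write(Encoding.ASCII.GetBytes(text));
            }
            File.WriteAllText(pathB, text, Encoding.ASCII);

            byte[] a = File.ReadAllBytes(pathA);
            byte[] b = File.ReadAllBytes(pathB);
            if (a.Length != b.Length) return false;
            for (int i = 0; i < a.Length; i++)
            {
                if (a[i] != b[i]) return false;
            }
            return true;
        }
        finally
        {
            File.Delete(pathA);
            File.Delete(pathB);
        }
    }
}
```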
EDIT: From comments:
I want to create a binary file whose data the user cannot open and read in Notepad
Well, there are lots of ways of doing that with different levels of security. Ideally, you'd want some sort of encryption - but then the code reading the data would need to be able to decrypt it as well, which means it would need to be able to get a key. That then moves the question to "how do I store a key".
If you only need to verify the data in the file (e.g. check that it matches something from elsewhere) then you could use a cryptographic hash instead.
If you only need to deter the most casual of snoopers, you could use something which is basically just obfuscation: a very weak form of encryption with no "key" as such. Anyone who decompiled your code would easily be able to get at the data in that case, but you may not care too much about that.
It all depends on your requirements.
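For the obfuscation option only, here is a minimal sketch using a fixed XOR mask. This is an illustration I am assuming fits the "casual snooper" case; it is not encryption and offers no real security. The names Obfuscator and Xor and the mask value are hypothetical.

```csharp
static class Obfuscator
{
    // XORs every byte with a fixed mask. Applying it twice restores the input.
    // This only hides text from a casual look in an editor; it is NOT secure.
    public static byte[] Xor(byte[] data, byte mask)
    {
        byte[] result = new byte[data.Length];
        for (int i = 0; i < data.Length; i++)
        {
            result[i] = (byte)(data[i] ^ mask);
        }
        return result;
    }
}
```

Usage would be along the lines of writing Obfuscator.Xor(Encoding.ASCII.GetBytes("david stein"), 0xA5) to the file and applying the same call again after reading it back.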
All data is binary. A text file is binary data that happens to be a limited subset that represent valid characters, but it's still binary.
A common heuristic for telling a text file from a binary file is to scan a portion of the file for zero bytes, \0. These essentially never occur in ASCII or UTF-8 text files and almost always occur in binary files. (UTF-16 text does contain zero bytes, so the heuristic is not foolproof.)
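The zero-byte heuristic can be sketched in a few lines. The names TextSniffer, LooksBinary and sampleSize are made up for the example, and how much of the file a given editor actually inspects is an assumption left to the caller.

```csharp
using System;

static class TextSniffer
{
    // Returns true if a zero byte appears within the first sampleSize bytes.
    public static bool LooksBinary(byte[] data, int sampleSize)
    {
        int limit = Math.Min(data.Length, sampleSize);
        for (int i = 0; i < limit; i++)
        {
            if (data[i] == 0) return true;
        }
        return false;
    }
}
```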
I'm parsing a large CSV file, about 500 MB (many rows, many columns). I only need the first two columns (so up to the second comma on each line). Also, multiple threads need access to this file at the same time, so I can't take an exclusive lock.
What's the fastest/least memory consuming approach to this problem? What classes/methods should I be looking at? I assume that I should stay as low-level as possible - reading character by character, line by line?
Perhaps this is a way to allow simultaneous access?
using ( var filestream = new FileStream( filePath , FileMode.Open , FileAccess.Read , FileShare.Read ) )
{
using ( var reader = new StreamReader( filestream ) )
{
...
}
}
Edit
Decided to check out http://www.codeproject.com/KB/database/CsvReader.aspx
which seems to give me the ability to read just two columns and then skip to the next line.
They also have some benchmarks showing fast performance and low memory profile.
If you want low memory usage, you'll probably want to use a StreamReader and read line by line with ReadLine.
In a similar case the other day, I was able to skip the first 20,000,000 lines in a 500 MB file and build a string (using StringBuilder) for the next 1,000,000 lines in about 7 seconds.
Assuming that the file contains ASCII encoded text (would be typical for csv), your best bet may be to use Stream directly and the Stream.Read method, which allows you to read into a pre-allocated buffer. This has a few advantages:
You only allocate a buffer once, whereas ReadLine() will create a new String for every line.
You don't have to perform the Unicode conversion for the entire line; you can either do this only for the portion up to the second comma or (if you're severely time-constrained), you can write your own numeric parser that operates on the ASCII string data in the buffer (I'm sure there are well-documented algorithms for doing this.) This is assuming you need numeric data, of course.
Additional methods you'll likely need include the ASCII Encoding methods, particularly Encoding.ASCII.GetString.
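The points above can be sketched as follows. This is only an outline under stated assumptions: ASCII data, '\n' or "\r\n" line endings, and no quoted fields containing commas. The names TwoColumnReader and Read are invented for the example. One read buffer is allocated up front, only the bytes before the second comma are collected, and Encoding.ASCII.GetString is called on just those bytes.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

static class TwoColumnReader
{
    // Yields the first two comma-separated fields of each non-empty line.
    public static IEnumerable<KeyValuePair<string, string>> Read(Stream stream)
    {
        byte[] buffer = new byte[64 * 1024];      // allocated once, reused for every read
        MemoryStream prefix = new MemoryStream(); // column 1 bytes followed by column 2 bytes
        int commas = 0, split = -1, read;
        bool sawData = false;

        while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < read; i++)
            {
                byte b = buffer[i];
                if (b == (byte)'\n')
                {
                    if (sawData) yield return Decode(prefix, split);
                    prefix.SetLength(0); commas = 0; split = -1; sawData = false;
                    continue;
                }
                if (b == (byte)'\r') continue;
                sawData = true;
                if (commas >= 2) continue;        // past the second comma: skip the rest
                if (b == (byte)',')
                {
                    commas++;
                    if (commas == 1) split = (int)prefix.Length;
                    continue;
                }
                prefix.WriteByte(b);
            }
        }
        if (sawData) yield return Decode(prefix, split); // last line without a newline
    }

    private static KeyValuePair<string, string> Decode(MemoryStream prefix, int split)
    {
        byte[] raw = prefix.GetBuffer();
        int length = (int)prefix.Length;
        if (split < 0) split = length;            // line had no comma at all
        return new KeyValuePair<string, string>(
            Encoding.ASCII.GetString(raw, 0, split),
            Encoding.ASCII.GetString(raw, split, length - split));
    }
}
```

The stream passed in could be the FileStream opened with FileShare.Read from the question, so that other threads can read the file at the same time.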
I have a byte[] with some data in it, and I would like to write this byte array as-is to the log file using log4net. The problems I am facing are:
There is no overload for byte[] in TextWriter, so even implementing an IObjectRenderer is of no use.
I don't have access to the underlying Stream object of log4net.
I also tried converting the byte[] into a char[], but when I write it, an extra byte is added.
Is this even possible with log4net?
Thanks in advance.
Log files are usually plain text files. It's probably best to log your byte array represented as string.
Have a look at BitConverter.ToString or Convert.ToBase64String.
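For reference, this is roughly what the two suggested calls produce. The wrapper class LogFormat is just a hypothetical helper for the example; the resulting strings can be passed to any normal log4net logging call.

```csharp
using System;

static class LogFormat
{
    // Hex pairs separated by dashes, e.g. "DE-AD-BE-EF".
    public static string ToHex(byte[] data)
    {
        return BitConverter.ToString(data);
    }

    // Base64 is more compact than hex but less readable.
    public static string ToBase64(byte[] data)
    {
        return Convert.ToBase64String(data);
    }
}
```

Both representations are reversible, so the original bytes can be recovered from the log if needed.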
Nope. Have you thought about writing it out as a hex string (see this post)?
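I also think that logging any larger amount of data is kind of useless. However, I guess this is what you are looking for; it converts your bytes to a string. Note that bytes above 0x7F are not valid ASCII and will come out as '?', so this is lossy for arbitrary binary data.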
System.Text.Encoding.ASCII.GetString(byteArray)
I believe you can figure out how to use that for logging.
Pz, the TaskConnect developer
If you are logging into DB then use Binary type with maximum size
Dictionaries usually have an index file and a data file. I'm writing a dictionary application as a hobby project. I'm confused about how to read the offset file in .NET. The index file is 4-5 MB in size. What is the most efficient way to fetch the offset/length value of a word?
EDIT:
I only need to know how to read the offset file when I have a word to search for, i.e. how to search the index file for a word so that I can get the subsequent 8 bytes.
Stream.Seek(long offset, SeekOrigin origin) will be useful for getting to the offset.
4-5 megabytes for the index? That's nothing. Read the entire thing into a byte array and wrap it in a MemoryStream or, more appropriately, parse the entire contents into a data structure suited to quick searching (hash table, B-tree, etc.).
System.IO.BinaryReader has a ReadUInt32 method that reads an unsigned int. It also has different methods for reading binary files.
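Putting the last two answers together, here is a sketch that parses an index into a hash table. The file layout is purely an assumption, since the question does not specify it: each entry is taken to be a zero-terminated UTF-8 word followed by a 4-byte offset and a 4-byte length, both little-endian (which is what BinaryReader reads). If the real format is big-endian or laid out differently, the reading code has to change accordingly. The names IndexLoader and Load are made up.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

static class IndexLoader
{
    // Maps each word to (offset, length) in the data file.
    // ASSUMED layout per entry: word bytes, 0x00, uint32 offset, uint32 length.
    // Requires a seekable stream (FileStream or MemoryStream).
    public static Dictionary<string, KeyValuePair<uint, uint>> Load(Stream stream)
    {
        Dictionary<string, KeyValuePair<uint, uint>> index =
            new Dictionary<string, KeyValuePair<uint, uint>>();
        BinaryReader reader = new BinaryReader(stream);
        MemoryStream word = new MemoryStream();

        while (stream.Position < stream.Length)
        {
            // Collect the bytes of the word up to the terminating zero.
            word.SetLength(0);
            byte b;
            while ((b = reader.ReadByte()) != 0) word.WriteByte(b);
            string key = Encoding.UTF8.GetString(word.GetBuffer(), 0, (int)word.Length);

            // The 8 bytes that follow the word.
            uint offset = reader.ReadUInt32();
            uint length = reader.ReadUInt32();
            index[key] = new KeyValuePair<uint, uint>(offset, length);
        }
        return index;
    }
}
```

Once loaded, a lookup is a single dictionary access, and the offset can then be handed to Stream.Seek on the data file.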
I have the following problem.
I have a binary file to which I write vital system data.
One of the fields is a time, which I format with DateTime.Now.ToString("HHmmssffffff"), i.e. down to microseconds. I convert this string with ToCharArray (I checked it in the debugger and it is fine); it holds a valid time down to the microsecond.
Then I write it and flush it to the file. When I open the file with PSPad, which translates binary to ASCII, I see that the data is corrupted in this field and in another one, but the rest of the message is fine.
The code:
void Write(string strData) {
char[] cD = strData.ToCharArray();
bw.Write(cD); // bw is a BinaryWriter
bw.Flush();
}
You're writing out characters, which BinaryWriter encodes using its own encoding (UTF-8 unless you specify another one), rather than raw ASCII bytes. If you want ASCII bytes, convert the string explicitly with the Encoding class.
byte[] data = Encoding.ASCII.GetBytes(strData);
bw.Write(data);
I strongly recommend reading Joel Spolsky's article on character sets and encodings. It may help you understand why your current code is not working properly.
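To see how much the choice of encoding matters for the bytes that end up in the file, here is a small sketch comparing the byte counts for the same string. The class and method names are hypothetical; the sample string is just twelve digits in the HHmmssffffff shape.

```csharp
using System;
using System.Text;

static class EncodingDemo
{
    // Number of bytes the string occupies once encoded.
    public static int ByteCount(string text, Encoding encoding)
    {
        return encoding.GetBytes(text).Length;
    }
}
```

For the twelve-digit string "123456789012", Encoding.ASCII gives 12 bytes (one per digit) and Encoding.Unicode (UTF-16) gives 24, with a zero byte after every digit. A viewer that expects one byte per character will show the latter as garbage.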