Converting byte[] to string efficiently - c#

I am working on a hobby project (simple / efficient datastore). My current concern is regarding the performance of reading data from disk (binary) and populating my objects.
My goal is to create a simple store optimized for read performance (for mobile) that is much faster than reading from SQL database or CSV.
After profiling the application, I found that when I read data from disk (~1000 records = 240 ms), most of the time is spent in the method "set(byte[])":
// data layout:
// strings are stored as their UTF-8 representation in a byte array
// within a "row", the first two bytes contain the length in bytes of the string data
// my data store also supports other types (which are much faster) - not shown below.
class myObject : IRow
{
public string Name;
public string Path;
// and so on
public void set(byte[] row_buffer)
{
int offset = 0;
short strLength = 0;
// Name - variable about 40 bytes
strLength = BitConverter.ToInt16(row_buffer, offset);
offset += 2;
Name = Encoding.UTF8.GetString(row_buffer, offset, strLength);
offset += strLength;
// Path - variable about 150 bytes
strLength = BitConverter.ToInt16(row_buffer, offset);
offset += 2;
Path = Encoding.UTF8.GetString(row_buffer, offset, strLength);
offset += strLength;
// and so on
}
}
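For reference, the write side implied by the layout comments above would look roughly like this (an illustrative sketch, not the actual store code; it assumes a BinaryWriter over the row stream and the usual System.IO / System.Text usings):
// Sketch: writes a string as a 2-byte length prefix followed by its UTF-8 bytes,
// matching what set(byte[]) reads back above.
static void WriteString(BinaryWriter writer, string value)
{
    byte[] utf8 = Encoding.UTF8.GetBytes(value);
    writer.Write((short)utf8.Length);   // two-byte length prefix (little-endian)
    writer.Write(utf8);                 // the raw string bytes
}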
Further remarks:
The data is read as binary from disk.
for each row in the file, a new object is created and the function set(row_buffer) is called.
Reading the stream into the row_buffer (using br.Read(row_buffer, 0, rowLengths[i])) consumes ~10% of the time.
Converting the bytes (GetString) to string consumes about 88% of the time
-> I don't understand why creating strings is so expensive :(
Any idea how I can improve the performance? I am limited to "safe C#" code only.
Thanks for reading.
EDIT
I need to create the Objects to run my Linq queries. I would like to defer object creation but failed to find a way at this stage. See my other SO question: Implement Linq query on byte[] for my own type

Related

Performance using Span<T> to parse a text file

I am trying to take advantage of Span<T>, using .NETCore 2.2 to improve the performance of parsing text from a text file. The text file contains multiple consecutive rows of data which will each be split into fields that are then each mapped to a data class.
Initially, the parsing routine uses a traditional approach of using StreamReader to read each row, and then using Substring to copy the individual fields from that row.
From what I have read (on MSDN, amongst other places), using Span<T> with Slice should perform more efficiently, as fewer allocations are made; instead, a reference to the byte[] array is passed around and acted upon.
After some experimentation I have compared 3 approaches to parsing the file and used BenchmarkDotNet to compare the results. What I found was that, when parsing a single row from the text file using Span, both mean execution time and allocated memory are indeed significantly less. So far so good. However, when parsing more than one row from the file, the performance gain quickly disappears to the point that it is almost insignificant, even from as little as 50 rows.
I am sure I must be missing something. Something seems to be outweighing the performance gain of Span.
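For reference, the traditional StreamReader/Substring baseline mentioned above looks roughly like this (a reconstruction from the description, not the actual benchmark code; it assumes newline-terminated rows, the same Foo fields, and the usual System.IO / System.Text / System.Collections.Generic usings):
// Baseline sketch: read each row with StreamReader, then Substring out the fields.
// Every Substring call allocates a new string, which is what Span is meant to avoid.
public static List<Foo> WithStreamReader_Substring(string path)
{
    var result = new List<Foo>();
    using (var reader = new StreamReader(path, Encoding.ASCII))
    {
        string row;
        while ((row = reader.ReadLine()) != null)
        {
            result.Add(new Foo
            {
                Field1 = row.Substring(0, 2),
                Field2 = row.Substring(3, 1),
                Field3 = row.Substring(5, 1),
                // ...
                Field30 = row.Substring(246, 3)
            });
        }
    }
    return result;
}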
The best performing approach WithSpan_StringFirst looks like this:
private static byte[] _filecontent;
private const int ROWSIZE = 252;
private readonly Encoding _encoding = Encoding.ASCII;
public void WithSpan_StringFirst()
{
var buffer1 = new Span<byte>(_filecontent).Slice(0, RowCount * ROWSIZE);
var buffer = _encoding.GetString(buffer1).AsSpan();
int cursor = 0;
for (int i = 0; i < RowCount; i++)
{
var row = buffer.Slice(cursor, ROWSIZE);
cursor += ROWSIZE;
Foo.ReadWithSpan(row);
}
}
[Params(1, 50)]
public int RowCount { get; set; }
Implementation of Foo.ReadWithSpan:
public static Foo ReadWithSpan(ReadOnlySpan<char> buffer) => new Foo
{
Field1 = buffer.Read(0, 2),
Field2 = buffer.Read(3, 4),
Field3 = buffer.Read(5, 6),
// ...
Field30 = buffer.Read(246, 249)
};
public static string Read(this ReadOnlySpan<char> input, int startIndex, int endIndex)
{
return new string(input.Slice(startIndex, endIndex - startIndex));
}
Any feedback would be appreciated. I have posted a full working sample on GitHub.
For small files (< 10,000 lines) with a simple line structure to parse, almost any .NET Core method will perform about the same.
For large, multi-gigabyte files with millions of lines of data, optimizations matter more.
If file processing time is measured in hours or even tens of minutes, getting all the C# code together in the same class will drastically speed up processing the file, as the compiler can do better code optimizations. Inlining the called methods into the main processing code can help as well.
It's the same answer as in the 1960s: changing the processing algorithm and how it chunks input and output data is worth an order of magnitude more than small code optimizations.

C# "Tag" 4-byte Hex Chunks for reconstruction to original string later

I am wrestling with a particular issue and would like to ask for guidance on how I can achieve what I seek. Given the function below, a variable-length string is used as input, which produces an equivalent sequence of 4-byte hex chunks. These 4-byte chunks are written to an XML file for storage, and that XML file's schema cannot be altered. However, my issue arises when the application which governs the XML file sorts the 4-byte chunks in the file: the result is that when I read that same XML file back, my string is destroyed. So, I'd like a way to "tag" each 4-byte chunk with some sort of identifier that I can use in my decoder function in spite of whatever sorting may have been done to it.
Encoding function (much of which was provided by Antonín Lejsek):
private static string StringEncoder(string strInput)
{
try
{
// instantiate our StringBuilder object and set the capacity for our sb object to the length of our message.
StringBuilder sb = new StringBuilder(strInput.Length * 9 / 4 + 10);
int count = 0;
// iterate through each character in our message and format the sb object to follow Microsoft's implementation of ECMA-376 for rsidR values of type ST_LongHexValue
foreach (char c in strInput)
{
// pad the first 4 byte chunk with 2 digit zeros.
if (count == 0)
{
sb.Append("00");
count = 0;
}
// every three bytes add a space and append 2 digit zeros.
if (count == 3)
{
sb.Append(" ");
sb.Append("00");
count = 0;
}
sb.Append(String.Format("{0:X2}", (int)c));
count++;
}
// handle trailing chunks that contain fewer than 3 data bytes, so we know how many zero bytes to pad on the right.
for (int i = 0; i < (3 - count) % 3; ++i)
{
sb.Append("00");
}
// DEBUG: echo results for testing.
//Console.WriteLine("");
//Console.WriteLine("String provided: {0}", strInput);
//Console.WriteLine("Hex in 8-digit chunks: {0}", sb.ToString());
//Console.WriteLine("======================================================");
return sb.ToString();
}
catch (NullReferenceException e)
{
Console.WriteLine("");
Console.WriteLine("ERROR : StringEncoder has received null input.");
Console.WriteLine("ERROR : Please ensure there is something to read in the output.txt file.");
Console.WriteLine("");
//Console.WriteLine(e.Message);
return null;
}
}
For example: when provided with the input "coolsss", this function would produce the following output: 0020636F 006F6C73 00737300
The above three 8-digit chunks would get written to the XML file, starting with the first chunk and proceeding to the last, like so:
0020636F
006F6C73
00737300
Now, there are other 8-digit chunks in the XML file which were not created by the function above. This presents an issue, as the application can reorder these chunks among themselves and the others already present in the file, like so:
00737300
00111111
006F6C73
00000000
0020636F
So, can you help me think of any way to add a tag of some sort, or use some C# data structure, so that I can read each chunk and reconstruct my original string despite the reordering?
I appreciate any guidance you can provide. Credit to Antonín Lejsek for his help with the function above.
Thank you,
Gabriel Alicea
Well, I am reluctant to suggest this as a proposed solution because it feels a bit too hackish for me.
Having said that, I suppose you could leverage the second byte as an ordinal so you can track the chunks and re-assemble the string later.
You could use the following scheme to track your chunks.
00XY0000
Where the second byte XY could be split up into two 4-bit parts representing an ordinal and a checksum.
X = Ordinal
Y = 16 % X
When reading the chunks back, you can split the second byte into those two 4-bit parts and verify that the checksum matches the ordinal.
This solution does introduce a limit of 16 chunks per string, unless you eliminate the checksum and use the entire byte as an ordinal, which raises the limit to 256 chunks.
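A minimal sketch of that idea (my own illustration, not the poster's code): each chunk carries two characters of payload, and its second byte packs a 4-bit ordinal X with the 4-bit check value Y = 16 % X. Ordinals start at 1 so that the check value is defined, which limits a string to 15 chunks under this particular layout; the check is weak, so unrelated chunks can still collide with it.
using System;
using System.Collections.Generic;

static class ChunkCodec
{
    // Encode: "00" + tag byte + two payload bytes per chunk (8 hex digits total).
    // This sketch supports at most 15 chunks (ordinals 1-15).
    public static IEnumerable<string> EncodeTagged(string input)
    {
        int ordinal = 1;
        for (int i = 0; i < input.Length; i += 2, ordinal++)
        {
            byte tag = (byte)((ordinal << 4) | (16 % ordinal));          // X in the high nibble, Y in the low
            byte c1 = (byte)input[i];
            byte c2 = (byte)(i + 1 < input.Length ? input[i + 1] : 0);   // pad the last chunk
            yield return string.Format("00{0:X2}{1:X2}{2:X2}", tag, c1, c2);
        }
    }

    // Decode: keep only chunks whose tag passes the X/Y check, then sort by ordinal.
    public static string DecodeTagged(IEnumerable<string> chunks)
    {
        var ordered = new SortedDictionary<int, string>();
        foreach (string chunk in chunks)
        {
            byte tag = Convert.ToByte(chunk.Substring(2, 2), 16);
            int x = tag >> 4;
            int y = tag & 0xF;
            if (x == 0 || y != 16 % x) continue;                         // not one of our tagged chunks
            char c1 = (char)Convert.ToByte(chunk.Substring(4, 2), 16);
            char c2 = (char)Convert.ToByte(chunk.Substring(6, 2), 16);
            ordered[x] = c2 == '\0' ? c1.ToString() : c1.ToString() + c2;
        }
        return string.Concat(ordered.Values);
    }
}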

What's an appropriate search/retrieval method for a VERY long list of strings?

This is not a terribly uncommon question, but I still couldn't seem to find an answer that really explained the choice.
I have a very large list of strings (ASCII representations of SHA-256 hashes, to be exact), and I need to query for the presence of a string within that list.
There will be what is likely in excess of 100 million entries in this list, and I will need to repeatably query for the presence of an entry many times.
Given the size, I doubt I can stuff it all into a HashSet<string>. What would be an appropriate retrieval system to maximize performance?
I CAN pre-sort the list, I CAN put it into a SQL table, I CAN put it into a text file, but I'm not sure what really makes the most sense given my application.
Is there a clear winner in terms of performance among these, or other methods of retrieval?
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Security.Cryptography;
namespace HashsetTest
{
abstract class HashLookupBase
{
protected const int BucketCount = 16;
private readonly HashAlgorithm _hasher;
protected HashLookupBase()
{
_hasher = SHA256.Create();
}
public abstract void AddHash(byte[] data);
public abstract bool Contains(byte[] data);
private byte[] ComputeHash(byte[] data)
{
return _hasher.ComputeHash(data);
}
protected Data256Bit GetHashObject(byte[] data)
{
var hash = ComputeHash(data);
return Data256Bit.FromBytes(hash);
}
public virtual void CompleteAdding() { }
}
class HashsetHashLookup : HashLookupBase
{
private readonly HashSet<Data256Bit>[] _hashSets;
public HashsetHashLookup()
{
_hashSets = new HashSet<Data256Bit>[BucketCount];
for(int i = 0; i < _hashSets.Length; i++)
_hashSets[i] = new HashSet<Data256Bit>();
}
public override void AddHash(byte[] data)
{
var item = GetHashObject(data);
var offset = item.GetHashCode() & 0xF;
_hashSets[offset].Add(item);
}
public override bool Contains(byte[] data)
{
var target = GetHashObject(data);
var offset = target.GetHashCode() & 0xF;
return _hashSets[offset].Contains(target);
}
}
class ArrayHashLookup : HashLookupBase
{
private Data256Bit[][] _objects;
private int[] _offsets;
private int _bucketCounter;
public ArrayHashLookup(int size)
{
size /= BucketCount;
_objects = new Data256Bit[BucketCount][];
_offsets = new int[BucketCount];
for(var i = 0; i < BucketCount; i++) _objects[i] = new Data256Bit[size + 1];
_bucketCounter = 0;
}
public override void CompleteAdding()
{
for(int i = 0; i < BucketCount; i++) Array.Sort(_objects[i]);
}
public override void AddHash(byte[] data)
{
var hashObject = GetHashObject(data);
_objects[_bucketCounter][_offsets[_bucketCounter]++] = hashObject;
_bucketCounter++;
_bucketCounter %= BucketCount;
}
public override bool Contains(byte[] data)
{
var hashObject = GetHashObject(data);
return _objects.Any(o => Array.BinarySearch(o, hashObject) >= 0);
}
}
struct Data256Bit : IEquatable<Data256Bit>, IComparable<Data256Bit>
{
public bool Equals(Data256Bit other)
{
return _u1 == other._u1 && _u2 == other._u2 && _u3 == other._u3 && _u4 == other._u4;
}
public int CompareTo(Data256Bit other)
{
var rslt = _u1.CompareTo(other._u1); if (rslt != 0) return rslt;
rslt = _u2.CompareTo(other._u2); if (rslt != 0) return rslt;
rslt = _u3.CompareTo(other._u3); if (rslt != 0) return rslt;
return _u4.CompareTo(other._u4);
}
public override bool Equals(object obj)
{
if (ReferenceEquals(null, obj))
return false;
return obj is Data256Bit && Equals((Data256Bit) obj);
}
public override int GetHashCode()
{
unchecked
{
var hashCode = _u1.GetHashCode();
hashCode = (hashCode * 397) ^ _u2.GetHashCode();
hashCode = (hashCode * 397) ^ _u3.GetHashCode();
hashCode = (hashCode * 397) ^ _u4.GetHashCode();
return hashCode;
}
}
public static bool operator ==(Data256Bit left, Data256Bit right)
{
return left.Equals(right);
}
public static bool operator !=(Data256Bit left, Data256Bit right)
{
return !left.Equals(right);
}
private readonly long _u1;
private readonly long _u2;
private readonly long _u3;
private readonly long _u4;
private Data256Bit(long u1, long u2, long u3, long u4)
{
_u1 = u1;
_u2 = u2;
_u3 = u3;
_u4 = u4;
}
public static Data256Bit FromBytes(byte[] data)
{
return new Data256Bit(
BitConverter.ToInt64(data, 0),
BitConverter.ToInt64(data, 8),
BitConverter.ToInt64(data, 16),
BitConverter.ToInt64(data, 24)
);
}
}
class Program
{
private const int TestSize = 150000000;
static void Main(string[] args)
{
GC.Collect(3);
GC.WaitForPendingFinalizers();
{
var arrayHashLookup = new ArrayHashLookup(TestSize);
PerformBenchmark(arrayHashLookup, TestSize);
}
GC.Collect(3);
GC.WaitForPendingFinalizers();
{
var hashsetHashLookup = new HashsetHashLookup();
PerformBenchmark(hashsetHashLookup, TestSize);
}
Console.ReadLine();
}
private static void PerformBenchmark(HashLookupBase hashClass, int size)
{
var sw = Stopwatch.StartNew();
for (int i = 0; i < size; i++)
hashClass.AddHash(BitConverter.GetBytes(i * 2));
Console.WriteLine("Hashing and addition took " + sw.ElapsedMilliseconds + "ms");
sw.Restart();
hashClass.CompleteAdding();
Console.WriteLine("Hash cleanup (sorting, usually) took " + sw.ElapsedMilliseconds + "ms");
sw.Restart();
var found = 0;
for (int i = 0; i < size * 2; i += 10)
{
found += hashClass.Contains(BitConverter.GetBytes(i)) ? 1 : 0;
}
Console.WriteLine("Found " + found + " elements (expected " + (size / 5) + ") in " + sw.ElapsedMilliseconds + "ms");
}
}
}
Results are pretty promising. They run single-threaded. The hashset version can hit a little over 1 million lookups per second at 7.9GB RAM usage. The array-based version uses less RAM (4.6GB). Startup times between the two are nearly identical (388 vs 391 seconds). The hashset trades RAM for lookup performance. Both had to be bucketized because of memory allocation constraints.
Array performance:
Hashing and addition took 307408ms
Hash cleanup (sorting, usually) took 81892ms
Found 30000000 elements (expected 30000000) in 562585ms [53k searches per second]
======================================
Hashset performance:
Hashing and addition took 391105ms
Hash cleanup (sorting, usually) took 0ms
Found 30000000 elements (expected 30000000) in 74864ms [400k searches per second]
If the list changes over time, I would put it in a database.
If the list doesn't change, I would put it in a sorted file and do a binary search for every query.
In both cases, I would use a Bloom filter to minimize I/O. And I would stop using strings and use the binary representation with four ulongs (to avoid the object reference cost).
If you have more than 16 GB (2*64*4/3*100M, assuming Base64 encoding) to spare, an option is to make a Set<string> and be happy. Of course it would fit in less than 7 GB if you use the binary representation.
David Haney's answer shows us that the memory cost is not so easily calculated.
With <gcAllowVeryLargeObjects>, you can have arrays that are much larger. Why not convert those ASCII representations of 256-bit hash codes to a custom struct that implements IComparable<T>? It would look like this:
struct MyHashCode: IComparable<MyHashCode>
{
// make these readonly and provide a constructor
ulong h1, h2, h3, h4;
public int CompareTo(MyHashCode other)
{
var rslt = h1.CompareTo(other.h1);
if (rslt != 0) return rslt;
rslt = h2.CompareTo(other.h2);
if (rslt != 0) return rslt;
rslt = h3.CompareTo(other.h3);
if (rslt != 0) return rslt;
return h4.CompareTo(other.h4);
}
}
You can then create an array of these, which would occupy approximately 3.2 GB. You can search it easily enough with Array.BinarySearch.
Of course, you'll need to convert the user's input from ASCII to one of those hash code structures, but that's easy enough.
As for performance, this isn't going to be as fast as a hash table, but it's certainly going to be faster than a database lookup or file operations.
Come to think of it, you could create a HashSet<MyHashCode>. You'd have to override the Equals method on MyHashCode, but that's really easy. As I recall, the HashSet costs something like 24 bytes per entry, and you'd have the added cost of the larger struct. Figure five or six gigabytes, total, if you were to use a HashSet. More memory, but still doable, and you get O(1) lookup.
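Putting the pieces of this answer together, the lookup side might be wired up like this (illustrative only: LoadAllHashes and MyHashCode.FromHexString are hypothetical helpers, and the struct above would also need the constructor its comment asks for, plus Equals/GetHashCode overrides for the HashSet variant):
// Illustrative wiring only - LoadAllHashes() and MyHashCode.FromHexString() are
// hypothetical helpers, not part of the code shown above.
MyHashCode[] hashes = LoadAllHashes();                     // ~100M entries, roughly 3.2 GB
Array.Sort(hashes);                                        // one-time cost after loading

bool Contains(string asciiHash)
{
    MyHashCode probe = MyHashCode.FromHexString(asciiHash);
    return Array.BinarySearch(hashes, probe) >= 0;         // O(log n) per lookup
}

// Or trade memory for O(1) lookups (needs Equals/GetHashCode overrides on the struct):
// var set = new HashSet<MyHashCode>(hashes);
// bool found = set.Contains(MyHashCode.FromHexString(userInput));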
These answers don't factor the memory used by the strings themselves into the application. Strings are not 1 char == 1 byte in .NET. Each string object requires a constant 20 bytes for the object data, and the character buffer requires 2 bytes per character. Therefore, the memory usage estimate for a string instance is 20 + (2 * Length) bytes.
Let's do some math.
100,000,000 UNIQUE strings
SHA-256 = 32 bytes (256 bits), which is 64 hexadecimal characters as an ASCII string
size of each string = 20 + (2 * 64) = 148 bytes
Total required memory: 14,800,000,000 bytes ≈ 13.8 GiB
It is possible to do so, but this will not store well in .NET memory. Your goal should be to load all of this data into a form that can be accessed/paged without holding it all in memory at once. For that I'd use Lucene.net, which will store your data on disk and intelligently search it. Write each string into a searchable index and then search the index for the string. Now you have a scalable app that can handle this problem; your only limitation will be disk space (and it would take a lot of strings to fill up a terabyte drive). Alternatively, put these records in a database and query against it. That's why databases exist: to persist things outside of RAM. :)
For maximum speed, keep them in RAM. It's only ~3GB worth of data, plus whatever overhead your data structure needs. A HashSet<byte[]> should work just fine. If you want to lower overhead and GC pressure, turn on <gcAllowVeryLargeObjects>, use a single byte[], and a HashSet<int> with a custom comparer to index into it.
For speed and low memory usage, store them in a disk-based hash table.
For simplicity, store them in a database.
Whatever you do, you should store them as plain binary data, not strings.
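A rough sketch of the single byte[] plus HashSet<int> idea from the first option above (my illustration, not a reference implementation): the hashes live packed in one array, the set stores indices into it, and the comparer reads the 32-byte slice behind each index. Slot 0 is reserved as a scratch slot so that probes for hashes not in the array are possible. Note that even with <gcAllowVeryLargeObjects> a single byte[] tops out around 2^31 elements, so at the full 100-million-hash scale (3.2 GB) you would shard the storage across a few arrays.
using System;
using System.Collections.Generic;

class PackedHashSet
{
    private const int HashSize = 32;
    private readonly byte[] _storage;
    private readonly HashSet<int> _indices;
    private int _count;

    public PackedHashSet(int capacity)
    {
        _storage = new byte[(capacity + 1) * HashSize];        // slot 0 is the scratch slot
        _indices = new HashSet<int>(new SliceComparer(_storage));
    }

    public void Add(byte[] hash)
    {
        int index = ++_count;                                  // stored entries start at slot 1
        Buffer.BlockCopy(hash, 0, _storage, index * HashSize, HashSize);
        _indices.Add(index);
    }

    public bool Contains(byte[] hash)
    {
        Buffer.BlockCopy(hash, 0, _storage, 0, HashSize);      // write the probe into slot 0
        return _indices.Contains(0);
    }

    private class SliceComparer : IEqualityComparer<int>
    {
        private readonly byte[] _data;
        public SliceComparer(byte[] data) { _data = data; }

        public bool Equals(int x, int y)
        {
            int ox = x * HashSize, oy = y * HashSize;
            for (int i = 0; i < HashSize; i++)
                if (_data[ox + i] != _data[oy + i]) return false;
            return true;
        }

        public int GetHashCode(int index)
        {
            // the stored data is already a cryptographic hash, so 4 of its bytes are enough
            return BitConverter.ToInt32(_data, index * HashSize);
        }
    }
}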
A hashset splits your data into buckets (arrays). On a 64-bit system, the size limit for an array is 2 GB, which is roughly 2,000,000,000 bytes.
Since a string is a reference type, and since a reference takes eight bytes (assuming a 64-bit system), each bucket can hold approximately 250,000,000 (250 million) references to strings. It seems to be way more than what you need.
That being said, as Tim S. pointed out, it's highly unlikely you'll have the necessary memory to hold the strings themselves, even though the references would fit into the hashset. A database would be a much better fit for this.
You need to be careful in this sort of situation, as most collections in most languages are not really designed or optimized for that sort of scale. As you have already identified, memory usage will be a problem too.
The clear winner here is to use some form of database: either a SQL database, or one of the number of NoSQL databases that would be appropriate.
A SQL server is already designed and optimized for keeping track of large amounts of data, indexing it, and searching and querying across those indexes. It's designed for doing exactly what you are trying to do, so it really would be the best way to go.
For performance, you could consider using an embedded database that runs within your process and saves the resulting communication overhead. For Java I could recommend Derby for that purpose; I'm not familiar enough with the C# equivalents to make a recommendation, but I imagine suitable databases exist.
It might take a while (1) to dump all the records in a (clustered indexed) table (preferably use their values, not their string representation (2)) and let SQL do the searching. It will handle binary searching for you, it will handle caching for you and it's probably the easiest thing to work with if you need to make changes to the list. And I'm pretty sure that querying things will be just as fast (or faster) than building your own.
(1): For loading the data have a look at the SqlBulkCopy object, things like ADO.NET or Entity Framework are going to be too slow as they load the data row by row.
(2): SHA-256 = 256 bits, so a binary(32) will do; which is only half of the 64 characters you're using now. (Or a quarter of it if you're using Unicode numbers =P) Then again, if you currently have the information in a plain text-file you could still go the char(64) way and simply dump the data in the table using bcp.exe. The database will be bigger, the queries slightly slower (as more I/O is needed + the cache holds only half of the information for the same amount of RAM), etc... But it's quite straightforward to do, and if you're not happy with the result you can still write your own database-loader.
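For the loading step, a rough SqlBulkCopy sketch might look like this (the dbo.Hashes table with a single binary(32) column named Hash, the batch size, and the connection string are all assumptions, not something given above):
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

static void BulkLoad(IEnumerable<byte[]> hashes, string connectionString)
{
    var table = new DataTable();
    table.Columns.Add("Hash", typeof(byte[]));           // maps to binary(32)

    using (var bulk = new SqlBulkCopy(connectionString))
    {
        bulk.DestinationTableName = "dbo.Hashes";
        foreach (var hash in hashes)
        {
            table.Rows.Add(hash);
            if (table.Rows.Count == 100000)               // flush in batches instead of row by row
            {
                bulk.WriteToServer(table);
                table.Clear();
            }
        }
        if (table.Rows.Count > 0)
            bulk.WriteToServer(table);                    // final partial batch
    }
}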
If the set is constant, then just make a big sorted hash list (in raw format, 32 bytes each). Store all hashes so that they fit into disk sectors (4 KB) and so that the beginning of each sector is also the beginning of a hash. Save the first hash of every Nth sector in a special index list, which will easily fit into memory. Use binary search on this index list to determine the starting sector of the sector cluster where the hash should be, and then use another binary search within this sector cluster to find your hash. The value of N should be determined by measuring with test data.
EDIT: an alternative would be to implement your own hash table on disk. The table should use an open addressing strategy, and the probe sequence should be restricted to the same disk sector as much as possible. An empty slot has to be marked with a special value (all zeroes, for instance), so this special value should be handled specially when querying for existence. To avoid collisions the table should not be more than about 80% full, so in your case, with 100 million entries of 32 bytes each, the table should have at least 100M / 80% = 125 million slots and a size of 125M * 32 bytes = 4 GB. You only need to create the hashing function that converts the 2^256 domain to 125M slots, and some nice probe sequence.
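A condensed sketch of the two-level lookup described above (illustrative only: it indexes the first hash of every sector rather than every Nth one, which costs roughly 25 MB of RAM for 100 million hashes, and it assumes the file is just the sorted 32-byte hashes, so sector boundaries fall on hash boundaries):
using System;
using System.IO;

class SortedHashFile
{
    private const int HashSize = 32;
    private const int SectorSize = 4096;                      // 128 hashes per sector
    private readonly FileStream _file;
    private readonly byte[][] _sectorIndex;                   // first hash of every sector, kept in RAM

    public SortedHashFile(string path)
    {
        _file = new FileStream(path, FileMode.Open, FileAccess.Read);
        long sectors = (_file.Length + SectorSize - 1) / SectorSize;
        _sectorIndex = new byte[sectors][];
        var buf = new byte[HashSize];
        for (long s = 0; s < sectors; s++)
        {
            _file.Seek(s * SectorSize, SeekOrigin.Begin);
            _file.Read(buf, 0, HashSize);
            _sectorIndex[s] = (byte[])buf.Clone();
        }
    }

    public bool Contains(byte[] hash)
    {
        // 1) binary search the in-memory index for the last sector whose first hash <= target
        int lo = 0, hi = _sectorIndex.Length - 1, sector = 0;
        while (lo <= hi)
        {
            int mid = (lo + hi) / 2;
            if (CompareAt(_sectorIndex[mid], 0, hash) <= 0) { sector = mid; lo = mid + 1; }
            else hi = mid - 1;
        }
        // 2) read that one sector and binary search within it
        var block = new byte[SectorSize];
        _file.Seek((long)sector * SectorSize, SeekOrigin.Begin);
        int read = _file.Read(block, 0, SectorSize);
        lo = 0; hi = read / HashSize - 1;
        while (lo <= hi)
        {
            int mid = (lo + hi) / 2;
            int cmp = CompareAt(block, mid * HashSize, hash);
            if (cmp == 0) return true;
            if (cmp < 0) lo = mid + 1; else hi = mid - 1;
        }
        return false;
    }

    private static int CompareAt(byte[] a, int offset, byte[] b)
    {
        for (int i = 0; i < HashSize; i++)
        {
            int d = a[offset + i].CompareTo(b[i]);
            if (d != 0) return d;
        }
        return 0;
    }
}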
You can try a suffix tree; this question goes over how to do it in C#.
Or you can try a search like so:
var matches = list.AsParallel().Where(s => s.Contains(searchTerm)).ToList();
AsParallel will help speed things up, as it parallelizes the query.
1. Store your hashes as UInt32[8].
2a. Use a sorted list. To compare two hashes, first compare their first elements; if they are equal, then compare the second ones, and so on.
2b. Use a prefix tree.
First of all, I would really recommend that you use data compression in order to minimize resource consumption. Cache and memory bandwidth are usually the most limited resources in a modern computer. No matter how you implement this, the biggest bottleneck will be waiting for data.
Also, I would recommend using an existing database engine. Many of them have built-in compression, and any database will make use of the RAM you have available. If you have a decent operating system, the system cache will store as much of the file as it can, but most databases have their own caching subsystem as well.
I can't really tell which DB engine will be best for you; you have to try them out. Personally, I often use H2, which has decent performance, can be used as both an in-memory and a file-based database, and has built-in transparent compression.
I see that some have stated that importing your data into a database and building the search index may take longer than some custom solution. That may be true, but importing is usually quite rare. I am going to assume that you are more interested in fast searches, as they are probably the most common operation.
Also, while SQL databases are both reliable and quite fast, you may want to consider NoSQL databases. Try out a few alternatives; the only way to know which solution will give you the best performance is by benchmarking them.
Also, you should consider whether storing your list as text makes sense. Perhaps you should convert the list to numeric values; that will use less space and therefore give you faster queries. Database import may be significantly slower, but queries may become significantly faster.
If you want something really fast, and the elements are more or less immutable and require exact matches, you can build something that operates like a virus scanner: set the scope to collect the minimum number of potential elements using whatever algorithms are relevant to your entries and search criteria, then iterate through those items, testing against the search item using RtlCompareMemory. You can pull the items from disk if they are fairly contiguous and compare them using something like this (the P/Invoke declarations for VirtualAlloc, VirtualFree, SetFilePointerEx, ReadFile and RtlCompareMemory, and the related constants, are assumed and not shown):
private Boolean CompareRegions(IntPtr hFile, long nPosition, IntPtr pCompare, UInt32 pSize)
{
IntPtr pBuffer = IntPtr.Zero;
UInt32 iRead = 0;
try
{
pBuffer = VirtualAlloc(IntPtr.Zero, pSize, MEM_COMMIT, PAGE_READWRITE);
SetFilePointerEx(hFile, nPosition, IntPtr.Zero, FILE_BEGIN);
if (ReadFile(hFile, pBuffer, pSize, ref iRead, IntPtr.Zero) == 0)
return false;
if (RtlCompareMemory(pCompare, pBuffer, pSize) == pSize)
return true; // equal
return false;
}
finally
{
if (pBuffer != IntPtr.Zero)
VirtualFree(pBuffer, pSize, MEM_RELEASE);
}
}
I would modify this example to grab a large buffer full of entries and loop through those. But managed code may not be the way to go. Fastest is always closest to the calls that do the actual work, so a driver with kernel-mode access built in straight C would be much faster.
Firstly, you say the strings are really SHA256 hashes. Observe that 100 million * 256 bits = 3.2 gigabytes, so it is possible to fit the entire list in memory, assuming you use a memory-efficient data structure.
If you forgive occasional false positives, you can actually use less memory than that. See bloom filters http://billmill.org/bloomfilter-tutorial/
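A minimal Bloom filter sketch to make that concrete (illustrative, not production code; the sizing numbers follow the usual formulas, and the k hash values are simply carved out of the SHA-256 bytes, since those are already uniformly distributed):
using System;
using System.Collections;

class BloomFilter
{
    private readonly BitArray _bits;
    private readonly int _hashCount;

    // ~10 bits per element and 7 hash slices give roughly a 1% false-positive rate,
    // e.g. new BloomFilter(1000000000, 7) uses about 120 MB for 100M hashes.
    public BloomFilter(int bitCount, int hashCount)
    {
        _bits = new BitArray(bitCount);
        _hashCount = hashCount;                  // must be <= 8 (32 bytes / 4 bytes per slice)
    }

    public void Add(byte[] sha256)
    {
        for (int i = 0; i < _hashCount; i++) _bits[Position(sha256, i)] = true;
    }

    public bool MightContain(byte[] sha256)
    {
        for (int i = 0; i < _hashCount; i++)
            if (!_bits[Position(sha256, i)]) return false;
        return true;     // "probably present": false positives possible, false negatives not
    }

    private int Position(byte[] sha256, int i)
    {
        uint v = BitConverter.ToUInt32(sha256, i * 4);   // i-th 32-bit slice of the hash
        return (int)(v % (uint)_bits.Length);
    }
}
Only when MightContain returns true do you go to the authoritative copy (disk, database, sorted file) to confirm, which is how the filter minimizes I/O.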
Otherwise, use a sorted data structure to achieve fast querying (time complexity O(log n)).
If you really do want to store the data in memory (because you're querying frequently and need fast results), try Redis. http://redis.io/
Redis is an open source, BSD licensed, advanced key-value store. It is often referred to as a data structure server since keys can contain strings, hashes, lists, sets and sorted sets.
It has a set datatype http://redis.io/topics/data-types#sets
Redis Sets are an unordered collection of Strings. It is possible to add, remove, and test for existence of members in O(1) (constant time regardless of the number of elements contained inside the Set).
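A short sketch with the StackExchange.Redis client (the key name, the server address and the hashBytes value are illustrative assumptions; storing the raw 32-byte value instead of the 64-character hex string roughly halves the memory the set needs):
using StackExchange.Redis;

var redis = ConnectionMultiplexer.Connect("localhost");    // reuse one multiplexer per app
IDatabase db = redis.GetDatabase();

db.SetAdd("sha256:known", hashBytes);                      // SADD - O(1); hashBytes is a 32-byte value
bool exists = db.SetContains("sha256:known", hashBytes);   // SISMEMBER - O(1)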
Otherwise, use a database that saves the data on disk.
A plain vanilla binary search tree will give excellent lookup performance on large lists. However, if you don't really need to store the strings and simple membership is what you want to know, a Bloom filter may be a terrific solution. Bloom filters are a compact data structure that you train with all the strings. Once trained, a filter can quickly tell you whether it has seen a string before. It rarely reports false positives, but never reports false negatives. Depending on the application, they can produce amazing results quickly and with relatively little memory.
I developed a solution similar to Insta's approach, but with some differences. In effect, it looks a lot like his chunked array solution. However, instead of simply splitting the data, my approach builds an index of chunks and directs the search only to the appropriate chunk.
The way the index is built is very similar to a hashtable, with each bucket being a sorted array that can be searched with a binary search. However, I figured that there's little point in computing a hash of a SHA-256 hash, so instead I simply take a prefix of the value.
The interesting thing about this technique is that you can tune it by extending the length of the index keys. A longer key means a larger index and smaller buckets. My test case of 8 bits is probably on the small side; 10-12 bits would probably be more effective.
I attempted to benchmark this approach, but it quickly ran out of memory so I wasn't able to see anything interesting in terms of performance.
I also wrote a C implementation. The C implementation wasn't able to deal with a data set of the specified size either (the test machine has only 4GB of RAM), but it did manage somewhat more. (The target data set actually wasn't so much of a problem in that case, it was the test data that filled up the RAM.) I wasn't able to figure out a good way to throw data at it fast enough to really see its performance tested.
While I enjoyed writing this, I'd say overall it mostly provides evidence in favor of the argument that you shouldn't be trying to do this in memory with C#.
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
public interface IKeyed
{
int ExtractKey();
}
struct Sha256_Long : IComparable<Sha256_Long>, IKeyed
{
private UInt64 _piece1;
private UInt64 _piece2;
private UInt64 _piece3;
private UInt64 _piece4;
public Sha256_Long(string hex)
{
if (hex.Length != 64)
{
throw new ArgumentException("Hex string must contain exactly 64 digits.");
}
UInt64[] pieces = new UInt64[4];
for (int i = 0; i < 4; i++)
{
pieces[i] = UInt64.Parse(hex.Substring(i * 16, 16), NumberStyles.HexNumber);
}
_piece1 = pieces[0];
_piece2 = pieces[1];
_piece3 = pieces[2];
_piece4 = pieces[3];
}
public Sha256_Long(byte[] bytes)
{
if (bytes.Length != 32)
{
throw new ArgumentException("Sha256 values must be exactly 32 bytes.");
}
_piece1 = BitConverter.ToUInt64(bytes, 0);
_piece2 = BitConverter.ToUInt64(bytes, 8);
_piece3 = BitConverter.ToUInt64(bytes, 16);
_piece4 = BitConverter.ToUInt64(bytes, 24);
}
public override string ToString()
{
return String.Format("{0:X}{0:X}{0:X}{0:X}", _piece1, _piece2, _piece3, _piece4);
}
public int CompareTo(Sha256_Long other)
{
if (this._piece1 < other._piece1) return -1;
if (this._piece1 > other._piece1) return 1;
if (this._piece2 < other._piece2) return -1;
if (this._piece2 > other._piece2) return 1;
if (this._piece3 < other._piece3) return -1;
if (this._piece3 > other._piece3) return 1;
if (this._piece4 < other._piece4) return -1;
if (this._piece4 > other._piece4) return 1;
return 0;
}
//-------------------------------------------------------------------
// Implementation of key extraction
public const int KeyBits = 8;
private static UInt64 _keyMask;
private static int _shiftBits;
static Sha256_Long()
{
_keyMask = 0;
for (int i = 0; i < KeyBits; i++)
{
_keyMask |= (UInt64)1 << i;
}
_shiftBits = 64 - KeyBits;
}
public int ExtractKey()
{
// take the top KeyBits bits of the first piece as the bucket key
return (int)((_piece1 >> _shiftBits) & _keyMask);
}
}
class IndexedSet<T> where T : IComparable<T>, IKeyed
{
private T[][] _keyedSets;
public IndexedSet(IEnumerable<T> source, int keyBits)
{
// Arrange elements into groups by key
var keyedSetsInit = new Dictionary<int, List<T>>();
foreach (T item in source)
{
int key = item.ExtractKey();
List<T> vals;
if (!keyedSetsInit.TryGetValue(key, out vals))
{
vals = new List<T>();
keyedSetsInit.Add(key, vals);
}
vals.Add(item);
}
// Transform the above structure into a more efficient array-based structure
int nKeys = 1 << keyBits;
_keyedSets = new T[nKeys][];
for (int key = 0; key < nKeys; key++)
{
List<T> vals;
if (keyedSetsInit.TryGetValue(key, out vals))
{
_keyedSets[key] = vals.OrderBy(x => x).ToArray();
}
}
}
public bool Contains(T item)
{
int key = item.ExtractKey();
if (_keyedSets[key] == null)
{
return false;
}
else
{
return Search(item, _keyedSets[key]);
}
}
private bool Search(T item, T[] set)
{
int first = 0;
int last = set.Length - 1;
while (first <= last)
{
int midpoint = (first + last) / 2;
int cmp = item.CompareTo(set[midpoint]);
if (cmp == 0)
{
return true;
}
else if (cmp < 0)
{
last = midpoint - 1;
}
else
{
first = midpoint + 1;
}
}
return false;
}
}
class Program
{
//private const int NTestItems = 100 * 1000 * 1000;
private const int NTestItems = 1 * 1000 * 1000;
private static Sha256_Long RandomHash(Random rand)
{
var bytes = new byte[32];
rand.NextBytes(bytes);
return new Sha256_Long(bytes);
}
static IEnumerable<Sha256_Long> GenerateRandomHashes(
Random rand, int nToGenerate)
{
for (int i = 0; i < nToGenerate; i++)
{
yield return RandomHash(rand);
}
}
static void Main(string[] args)
{
Console.WriteLine("Generating test set.");
var rand = new Random();
IndexedSet<Sha256_Long> set =
new IndexedSet<Sha256_Long>(
GenerateRandomHashes(rand, NTestItems),
Sha256_Long.KeyBits);
Console.WriteLine("Testing with random input.");
int nFound = 0;
int nItems = NTestItems;
int waypointDistance = 100000;
int waypoint = 0;
for (int i = 0; i < nItems; i++)
{
if (++waypoint == waypointDistance)
{
Console.WriteLine("Test lookups complete: " + (i + 1));
waypoint = 0;
}
var item = RandomHash(rand);
nFound += set.Contains(item) ? 1 : 0;
}
Console.WriteLine("Testing complete.");
Console.WriteLine(String.Format("Found: {0} / {0}", nFound, nItems));
Console.ReadKey();
}
}

How to use Stream.Write Method to overwrite existing text

I am using StreamWriter to write records into a file. Now I want to overwrite a specific record.
string file="c:\\......";
StreamWriter sw = new StreamWriter(new FileStream(file, FileMode.Open, FileAccess.Write));
sw.Write(...);
sw.Close();
I read somewhere here that I can use the Stream.Write method to do that, but I have no previous experience or knowledge of how to deal with bytes.
public override void Write(
byte[] array,
int offset,
int count
)
So how do I use this method?
I need someone to explain what exactly byte[] array and int count are in this method, and to show simple sample code for using it to overwrite an existing record in a file.
For example: change a record like Mark1287,11100,25| to Bill9654,22100,30|.
If you want to overwrite a particular record, you must use the FileStream.Seek method to put your stream in position.
Example of Seek:
using System;
using System.IO;
class FStream
{
static void Main()
{
const string fileName = "Test####.dat";
// Create random data to write to the file.
byte[] dataArray = new byte[100000];
new Random().NextBytes(dataArray);
using(FileStream
fileStream = new FileStream(fileName, FileMode.Create))
{
// Write the data to the file, byte by byte.
for(int i = 0; i < dataArray.Length; i++)
{
fileStream.WriteByte(dataArray[i]);
}
// Set the stream position to the beginning of the file.
fileStream.Seek(0, SeekOrigin.Begin);
// Read and verify the data.
for(int i = 0; i < fileStream.Length; i++)
{
if(dataArray[i] != fileStream.ReadByte())
{
Console.WriteLine("Error writing data.");
return;
}
}
Console.WriteLine("The data was written to {0} " +
"and verified.", fileStream.Name);
}
}
}
After seeking to the position, use Write, where:
public override void Write(
byte[] array,
int offset,
int count
)
Parameters
array
Type: System.Byte[]
The buffer containing data to write to the stream.
offset
Type: System.Int32
The zero-based byte offset in array from which to begin copying bytes to the stream.
count
Type: System.Int32
The maximum number of bytes to write.
And most importantly: always consult the documentation when unsure!
So... in short:
Your file is text-based (but is allowed to become binary-based).
Your records have various sizes.
This means that, without analyzing your file, there is no way to know where a given record starts and ends. Moreover, if you overwrite a record, the new record can be larger than the old one, so all records further along in that file would have to be moved.
This requires a complex management system. Options could be:
When your application starts, it analyzes your file and stores in memory the start and length of each record.
There is a separate (binary) file which holds, per record, the start and length of that record. This will cost an additional 8 bytes per record (an Int32 each for start and length; perhaps you want to consider Int64).
If you want to rewrite a record, you can use this record/start/length system to know where to start writing. But before you do that, you have to ensure there is space, which means moving all records after the one being rewritten. Of course, you then have to update your management system with the new positions and lengths.
Another option is to do what a database does: every record consists of fixed-width columns, and even text columns have a maximum length. Because of this you can very easily calculate where each record starts in the file. For example: if each record has a size of 200 bytes, then record #0 starts at position 0, the next record at position 200, the one after that at 400, and so on. You do not have to move records when a record is rewritten; a sketch follows below.
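A small sketch of that fixed-width idea (the 200-byte record size from the example, the ASCII encoding and the padding are assumptions; it needs the usual System, System.IO and System.Text usings):
static void OverwriteRecord(string path, int recordIndex, string newRecord)
{
    const int RecordSize = 200;                                // every record is exactly this many bytes
    byte[] bytes = Encoding.ASCII.GetBytes(newRecord.PadRight(RecordSize));
    if (bytes.Length != RecordSize)
        throw new ArgumentException("Record does not fit the fixed width.");

    using (var fs = new FileStream(path, FileMode.Open, FileAccess.Write))
    {
        fs.Seek((long)recordIndex * RecordSize, SeekOrigin.Begin);   // record n starts at n * RecordSize
        fs.Write(bytes, 0, RecordSize);                              // array, offset, count - as documented above
    }
}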
Another suggestion: create a management system similar to how memory is managed. Once a record is written, it stays where it is. The management system keeps a list of allocated portions and free portions of the file. When a new record is written, a free portion that fits is found by the management system and the record is written at that position (optionally leaving a smaller free portion behind). When a record is deleted, that space is freed up. When you rewrite a record, you actually delete the old record and write a new one (possibly at a totally different location).
My last suggestion: Use a database :)

How to read a large (1 GB) txt file in .NET?

I have a 1 GB text file which I need to read line by line. What is the best and fastest way to do this?
private void ReadTxtFile()
{
string filePath = string.Empty;
filePath = openFileDialog1.FileName;
if (!string.IsNullOrEmpty(filePath))
{
using (StreamReader sr = new StreamReader(filePath))
{
String line;
while ((line = sr.ReadLine()) != null)
{
FormatData(line);
}
}
}
}
In FormatData() I check whether the line starts with a particular word and, based on that, increment an integer variable.
void FormatData(string line)
{
if (line.StartsWith(word))
{
globalIntVariable++;
}
}
If you are using .NET 4.0, try MemoryMappedFile, which is a class designed for this scenario.
Otherwise, you can use StreamReader.ReadLine.
Using StreamReader is probably the way to go, since you don't want the whole file in memory at once. MemoryMappedFile is more for random access than sequential reading (sequential reading of a stream is around ten times as fast as memory mapping it, while memory mapping is around ten times as fast for random access).
You might also try creating your StreamReader from a FileStream with FileOptions set to SequentialScan (see the FileOptions enumeration), but I doubt it will make much of a difference.
There are, however, ways to make your example more effective, since you do your formatting in the same loop as the reading. You're wasting clock cycles, so if you want even more performance, a multithreaded asynchronous solution would be better, where one thread reads data and another formats it as it becomes available. Check out BlockingCollection, which might fit your needs:
Blocking Collection and the Producer-Consumer Problem
If you want the fastest possible performance, in my experience the only way is to read in as large a chunk of binary data as possible sequentially and deserialize it into text in parallel, but the code starts to get complicated at that point.
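A minimal producer/consumer sketch with BlockingCollection, as suggested above (illustrative only; it reuses FormatData from the question and needs System.Collections.Concurrent, System.IO and System.Threading.Tasks):
static void ProcessFile(string filePath)
{
    using (var lines = new BlockingCollection<string>(boundedCapacity: 10000))
    {
        // producer: reads lines from disk and hands them off
        var producer = Task.Run(() =>
        {
            foreach (var line in File.ReadLines(filePath))
                lines.Add(line);
            lines.CompleteAdding();
        });

        // consumer: runs the question's FormatData on each line as it arrives
        var consumer = Task.Run(() =>
        {
            foreach (var line in lines.GetConsumingEnumerable())
                FormatData(line);
        });

        Task.WaitAll(producer, consumer);
    }
}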
You can use LINQ:
int result = File.ReadLines(filePath).Count(line => line.StartsWith(word));
File.ReadLines returns an IEnumerable<String> that lazily reads each line from the file without loading the whole file into memory.
Enumerable.Count counts the lines that start with the word.
If you are calling this from a UI thread, use a BackgroundWorker.
Probably best to read it line by line.
You should not try to force the whole file into memory by reading to the end and then processing it.
StreamReader.ReadLine should work fine. Let the framework choose the buffering, unless you know by profiling you can do better.
TextReader.ReadLine()
I was facing the same problem on our production server at Agenty, where we see large files (sometimes 10-25 GB tab-delimited (\t) txt files). After lots of testing and research, I found that the best way is to read large files in small chunks, using a for/foreach loop and setting offset and limit logic with File.ReadLines().
int TotalRows = File.ReadLines(Path).Count(); // Count the number of rows in file with lazy load
int Limit = 100000; // 100000 rows per batch
for (int Offset = 0; Offset < TotalRows; Offset += Limit)
{
var table = Path.FileToTable(heading: true, delimiter: '\t', offset : Offset, limit: Limit);
// Do all your processing here and with limit and offset and save to drive in append mode
// The append mode will write the output in same file for each processed batch.
table.TableToFile(@"C:\output.txt");
}
See the complete code in my Github library : https://github.com/Agenty/FileReader/
Full disclosure - I work for Agenty, the company that owns this library and website.
My file is over 13 GB:
You can use my class:
public static void Read(int length)
{
StringBuilder resultAsString = new StringBuilder();
using (MemoryMappedFile memoryMappedFile = MemoryMappedFile.CreateFromFile(@"D:\_Profession\Projects\Parto\HotelDataManagement\_Document\Expedia_Rapid.jsonl\Expedia_Rapi.json"))
using (MemoryMappedViewStream memoryMappedViewStream = memoryMappedFile.CreateViewStream(0, length))
{
for (int i = 0; i < length; i++)
{
//Reads a byte from a stream and advances the position within the stream by one byte, or returns -1 if at the end of the stream.
int result = memoryMappedViewStream.ReadByte();
if (result == -1)
{
break;
}
char letter = (char)result;
resultAsString.Append(letter);
}
}
}
This code reads the file's text from the start up to the length that you pass to the Read(int length) method and fills the resultAsString variable.
I'd read the file 10,000 bytes at a time. Then I'd analyse those 10,000 bytes and chop them into lines and feed them to the FormatData function.
Bonus points for splitting the reading and the line analysis across multiple threads.
I'd definitely use a StringBuilder to collect all strings, and might build a string buffer to keep about 100 strings in memory at all times.
