The output should be a large text file, where each line has the form "Number. String" and the string is random, for example:
347. Bus
20175. Yes Yes
15. The same
2. Hello world
178. Tree
The file size must be specified in bytes. I am interested in the fastest way to generate files of about 1000 MB and larger.
Here is my code for generating the random text:
public string[] GetRandomTextWithIndexes(int size)
{
var result = new string[size];
var sw = Stopwatch.StartNew();
var indexes = Enumerable.Range(0, size).AsParallel().OrderBy(g => GenerateRandomNumber(0, 5)).ToList();
sw.Stop();
Console.WriteLine("Queue fill: " + sw.Elapsed);
sw = Stopwatch.StartNew();
Parallel.For(0, size, i =>
{
var text = GetRandomText(GenerateRandomNumber(1, 20));
result[i] = $"{indexes[i]}. {text}";
});
sw.Stop();
Console.WriteLine("Text fill: " + sw.Elapsed);
return result;
}
public string GetRandomText(int size)
{
var builder = new StringBuilder();
for (var i = 0; i < size; i++)
{
var character = LegalCharacters[GenerateRandomNumber(0, LegalCharacters.Length)];
builder.Append(character);
}
return builder.ToString();
}
private int GenerateRandomNumber(int min, int max)
{
lock (_synlock)
{
if (_random == null)
_random = new Random();
return _random.Next(min, max);
}
}
I don't know how to make this code work with a size in MB rather than a number of strings. When I set size to about 1000000000 I get an OutOfMemoryException. And maybe there is a faster way to generate the indexes?
Disk is your bottleneck, so there is no need for parallel processing, and no need to store everything in memory before writing:
using (var fs = File.OpenWrite(@"c:\w\test.txt"))
using (var w = new StreamWriter(fs))
{
for (var i = 0; i < size; i++)
{
var text = GetRandomText(GenerateRandomNumber(1, 20));
var number = GenerateRandomNumber(0, 5);
var line = $"{number}. {text}";
w.WriteLine(line);
}
}
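To aim for a file size in bytes rather than a number of lines, you can track how many bytes have been written and stop once the budget is reached. Here is a minimal sketch of that idea (not tested against the original code): it assumes ASCII output so one character is one byte, targetSizeInBytes is a hypothetical parameter, GetRandomText and GenerateRandomNumber are the helpers from the question, and the upper bound of 100000 for the line number is just an example.
using (var fs = File.OpenWrite(@"c:\w\test.txt"))
using (var w = new StreamWriter(fs, Encoding.ASCII))
{
    long bytesWritten = 0;
    while (bytesWritten < targetSizeInBytes)
    {
        var text = GetRandomText(GenerateRandomNumber(1, 20));
        var line = $"{GenerateRandomNumber(0, 100000)}. {text}";
        w.WriteLine(line);
        // One byte per character with ASCII, plus the newline sequence.
        bytesWritten += line.Length + Environment.NewLine.Length;
    }
}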
It's better to put the full exception in the question. I bet it shows at
var result = new string[size];
1000000000 for the size of a string array is too much; try to run this dotnetfiddle and you'll get:
Run-time exception (line 12): Array dimensions exceeded supported range.
Stack Trace:
[System.OutOfMemoryException: Array dimensions exceeded supported range.] at Program.Main() :line 12
Please have a look at the following to see why you are getting that exception and what the workaround is:
What is the Maximum Size that an Array can hold?
Can't create huge arrays
Error when Dictionary count is bigger as 89478457
I haven't found a similar question.
I want to slice an array into smaller batches based on a byte size limit. This is my implementation, which is a little slow. In particular, I want to know if there is built-in functionality that can get me this, or whether I can enhance my approach below.
When I accumulated the sizes of the individual items, the total didn't correlate correctly with the size of the batch; probably extra metadata on the stream itself.
private static List<List<T>> SliceLogsIntoBatches<T>(List<T> data) where T : Log
{
const long batchSizeLimitInBytes = 1048576;
var batches = new List<List<T>>();
while (data.Count > 0)
{
var batch = new List<T>();
batch.AddRange(data.TakeWhile((log) =>
{
var currentBatchSizeInBytes = GetObjectSizeInBytes(batch); // this will slow down as takewhile moves on
return (currentBatchSizeInBytes < batchSizeLimitInBytes);
}));
batches.Add(batch);
data = data.Except(batch).ToList();
}
return batches;
}
private static long GetObjectSizeInBytes(object objectToGetSizeFor)
{
using (var objectAsStream = ConvertObjectToMemoryStream(objectToGetSizeFor))
{
return objectAsStream.Length;
}
}
You keep recalculating the size of the batch you are creating. So you are recalculating the size of some data items a lot.
It would help if you would calculate the data size of each data item and simply add that to a variable to keep track of the current batch size.
Try something like this:
long batchSizeLimitInBytes = 1048576;
var batches = new List<List<T>>();
var currentBatch = new List<T>();
long currentBatchLength = 0; // long, because GetObjectSizeInBytes returns long
for (int i = 0; i < data.Count; i++)
{
    var currentData = data[i];
    var currentDataLength = GetObjectSizeInBytes(currentData);
    if (currentBatchLength + currentDataLength > batchSizeLimitInBytes)
    {
        batches.Add(currentBatch);
        currentBatchLength = 0;
        currentBatch = new List<T>();
    }
    currentBatch.Add(currentData);
    currentBatchLength += currentDataLength;
}
if (currentBatch.Count > 0)
{
    batches.Add(currentBatch); // don't lose the final, partially filled batch
}
As a sidenote, I would probably want to convert the data to byte streams only once, since this is an expensive operation. You currently convert to streams just to check the length; you may want to have this method actually return the streams batched, instead of List<List<T>>.
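For illustration, a sketch of that stream-returning variant (hypothetical code, not from the original post): it reuses ConvertObjectToMemoryStream from the question so each item is converted exactly once, and the caller receives the streams already grouped into batches.
private static List<List<MemoryStream>> SliceLogsIntoStreamBatches<T>(List<T> data) where T : Log
{
    const long batchSizeLimitInBytes = 1048576;
    var batches = new List<List<MemoryStream>>();
    var currentBatch = new List<MemoryStream>();
    long currentBatchLength = 0;
    foreach (var item in data)
    {
        var stream = ConvertObjectToMemoryStream(item); // convert once and keep the stream
        if (currentBatchLength + stream.Length > batchSizeLimitInBytes && currentBatch.Count > 0)
        {
            batches.Add(currentBatch);
            currentBatch = new List<MemoryStream>();
            currentBatchLength = 0;
        }
        currentBatch.Add(stream);
        currentBatchLength += stream.Length;
    }
    if (currentBatch.Count > 0)
    {
        batches.Add(currentBatch); // keep the final batch
    }
    return batches;
}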
I think that your approach can be enhanced using the following idea: we can calculate an approximate size of the batch as the sum of the sizes of the data objects, and then use this approximate size to form an actual batch; the actual batch size is the size of the list of data objects. If we use this idea we can reduce the number of invocations of the GetObjectSizeInBytes method.
Here is the code that implements this idea:
private static List<List<T>> SliceLogsIntoBatches<T>(List<T> data) where T : Log
{
const long batchSizeLimitInBytes = 1048576;
var batches = new List<List<T>>();
var currentBatch = new List<T>();
// At first, we calculate size of each data object.
// We will use them to calculate an approximate size of the batch.
List<long> sizes = data.Select(GetObjectSizeInBytes).ToList();
int index = 0;
// Approximate size of the batch.
long dataSize = 0;
while (index < data.Count)
{
dataSize += sizes[index];
if (dataSize <= batchSizeLimitInBytes)
{
currentBatch.Add(data[index]);
index++;
}
// If approximate size of the current batch is greater
// than max batch size we try to form an actual batch by:
// 1. calculating actual batch size via GetObjectSizeInBytes method;
// and then
// 2. excluding excess data objects if actual batch size is greater
// than max batch size.
if (dataSize > batchSizeLimitInBytes || index >= data.Count)
{
// This loop excludes excess data objects if actual batch size
// is greater than max batch size.
while (GetObjectSizeInBytes(currentBatch) > batchSizeLimitInBytes)
{
index--;
currentBatch.RemoveAt(currentBatch.Count - 1);
}
batches.Add(currentBatch);
currentBatch = new List<T>();
dataSize = 0;
}
}
return batches;
}
Here is a complete sample that demonstrates this approach.
I want to generate a random binary string of one million bits, but my problem is that the code takes too much time and never finishes. Why does that happen?
string result1 = "";
Random rand = new Random();
for (int i = 0; i < 1000000; i++)
{
result1 += ((rand.Next() % 2 == 0) ? "0" : "1");
}
textBox1.Text = result1.ToString();
Concatenating strings is an O(N) operation. Strings are immutable, so when you add to a string the new value is copied into a new string, which requires iterating over the previous string. Since you're adding a value on each iteration, the amount that has to be copied each time grows with each addition, leading to O(N^2) performance overall. Since your N is 1,000,000 this takes a very, very long time, and probably eats all of the memory you have with these intermediate, throw-away strings.
The normal solution when building a string from an arbitrary number of inputs is to use a StringBuilder instead. That said, a 1,000,000-character bit string is still rather unwieldy. Assuming a bit string is what you want/need, you can change your code to something like the following and have a much more performant solution.
public string GetGiantBitString() {
var sb = new StringBuilder();
var rand = new Random();
for(var i = 0; i < 1_000_000; i++) {
sb.Append(rand.Next() % 2);
}
return sb.ToString();
}
This works for me, it takes about 0.035 seconds on my box:
private static IEnumerable<Byte> MillionBits()
{
var rand = new RNGCryptoServiceProvider();
//a million bits is 125,000 bytes, so
var bytes = new List<byte>(125000);
for (var i = 0; i < 125; ++i)
{
byte[] tempBytes = new byte[1000];
rand.GetBytes(tempBytes);
bytes.AddRange(tempBytes);
}
return bytes;
}
private static string BytesAsString(IEnumerable<Byte> bytes)
{
var buffer = new StringBuilder();
foreach (var byt in bytes)
{
buffer.Append(Convert.ToString(byt, 2).PadLeft(8, '0'));
}
return buffer.ToString();
}
and then:
var myStopWatch = new Stopwatch();
myStopWatch.Start();
var lotsOfBytes = MillionBits();
var bigString = BytesAsString(lotsOfBytes);
var len = bigString.Length;
var elapsed = myStopWatch.Elapsed;
The len variable was a million, the string looked like it was all 1s and 0s.
If you really want to fill your textbox full of ones and zeros, just set its Text property to bigString.
I have two files: the first is the source file and the second is the destination file.
Below is my code to intersect and union the two files using byte arrays.
FileStream frsrc = new FileStream("Src.bin", FileMode.Open);
FileStream frdes = new FileStream("Des.bin", FileMode.Open);
int length = 24; // get file length
byte[] src = new byte[length];
byte[] des = new byte[length]; // create buffer
int Counter = 0; // actual number of bytes read
int subcount = 0;
while (frsrc.Read(src, 0, length) > 0)
{
try
{
Counter = 0;
frdes.Position = subcount * length;
while (frdes.Read(des, 0, length) > 0)
{
var data = src.Intersect(des);
var data1 = src.Union(des);
Counter++;
}
subcount++;
Console.WriteLine(subcount.ToString());
}
catch (Exception ex)
{
}
}
It works fine and runs very fast, but now the problem is that I want the counts, and when I use the code below it becomes very slow.
var data = src.Intersect(des).Count();
var data1 = src.Union(des).Count();
So, is there any solution for that? If yes, then please let me know as soon as possible.
Thanks
Intersect and Union are not the fastest operations. The reason you see it being fast is that you never actually enumerate the results!
Both return an enumerable, not the actual results of the operation. You're supposed to go through that and enumerate the enumerable, otherwise nothing happens - this is called "deferred execution". Now, when you do Count, you actually enumerate the enumerable, and incur the full cost of the Intersect and Union - believe me, the Count itself is relatively trivial (though still an O(n) operation!).
You'll need to make your own methods, most likely. You want to avoid the enumerable overhead, and more importantly, you'll probably want a lookup table.
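For example, a minimal hand-rolled sketch for two byte blocks (not taken from the original code) can use 256-entry presence tables instead of LINQ's set machinery; it counts distinct byte values, which is what Intersect and Union produce for byte arrays:
static void CountIntersectAndUnion(byte[] src, byte[] des, out int intersectCount, out int unionCount)
{
    var inSrc = new bool[256]; // which byte values occur in src
    var inDes = new bool[256]; // which byte values occur in des
    foreach (var b in src) inSrc[b] = true;
    foreach (var b in des) inDes[b] = true;
    intersectCount = 0;
    unionCount = 0;
    for (int value = 0; value < 256; value++)
    {
        if (inSrc[value] && inDes[value]) intersectCount++;
        if (inSrc[value] || inDes[value]) unionCount++;
    }
}
Called once per pair of blocks inside the existing read loop, this avoids allocating an enumerator and a hash set for every block.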
A few points: the comment // get file length is misleading as it is the buffer size. Counter is not the number of bytes read, it is the number of blocks read. data and data1 will end up with the result of the last block read, ignoring any data before them. That is assuming that nothing goes wrong in the while loop - you need to remove the try structure to see if there are any errors.
What you can do is count the number of occurrences of each byte in each file; if the count of a byte is greater than zero in any of the files then it is a member of the union of the files, and if the count is greater than zero in all of the files then it is a member of the intersection of the files.
It is just as easy to write the code for more than two files as it is for two files, whereas LINQ is easy for two but a little bit more fiddly for more than two. (I put in a comparison with using LINQ in a naïve fashion for only two files at the end.)
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
namespace ConsoleApplication1
{
class Program
{
static void Main(string[] args)
{
var file1 = @"C:\Program Files (x86)\Electronic Arts\Crysis 3\Bin32\Crysis3.exe"; // 26MB
var file2 = @"C:\Program Files (x86)\Electronic Arts\Crysis 3\Bin32\d3dcompiler_46.dll"; // 3MB
List<string> files = new List<string> { file1, file2 };
var sw = System.Diagnostics.Stopwatch.StartNew();
// Prepare array of counters for the bytes
var nFiles = files.Count;
int[][] count = new int[nFiles][];
for (int i = 0; i < nFiles; i++)
{
count[i] = new int[256];
}
// Get the counts of bytes in each file
int bufLen = 32768;
byte[] buffer = new byte[bufLen];
int bytesRead;
for (int fileNum = 0; fileNum < nFiles; fileNum++)
{
using (var sr = new FileStream(files[fileNum], FileMode.Open, FileAccess.Read))
{
bytesRead = bufLen;
while (bytesRead > 0)
{
bytesRead = sr.Read(buffer, 0, bufLen);
for (int i = 0; i < bytesRead; i++)
{
count[fileNum][buffer[i]]++;
}
}
}
}
// Find which bytes are in any of the files or in all the files
var inAny = new List<byte>(); // union
var inAll = new List<byte>(); // intersect
for (int i = 0; i < 256; i++)
{
Boolean all = true;
for (int fileNum = 0; fileNum < nFiles; fileNum++)
{
if (count[fileNum][i] > 0)
{
if (!inAny.Contains((byte)i)) // avoid adding same value more than once
{
inAny.Add((byte)i);
}
}
else
{
all = false;
}
};
if (all)
{
inAll.Add((byte)i);
};
}
sw.Stop();
Console.WriteLine(sw.ElapsedMilliseconds);
// Display the results
Console.WriteLine("Union: " + string.Join(",", inAny.Select(x => x.ToString("X2"))));
Console.WriteLine();
Console.WriteLine("Intersect: " + string.Join(",", inAll.Select(x => x.ToString("X2"))));
Console.WriteLine();
// Compare to using LINQ.
// N/B. Will need adjustments for more than two files.
var srcBytes1 = File.ReadAllBytes(file1);
var srcBytes2 = File.ReadAllBytes(file2);
sw.Restart();
var intersect = srcBytes1.Intersect(srcBytes2).ToArray().OrderBy(x => x);
var union = srcBytes1.Union(srcBytes2).ToArray().OrderBy(x => x);
Console.WriteLine(sw.ElapsedMilliseconds);
Console.WriteLine("Union: " + String.Join(",", union.Select(x => x.ToString("X2"))));
Console.WriteLine();
Console.WriteLine("Intersect: " + String.Join(",", intersect.Select(x => x.ToString("X2"))));
Console.ReadLine();
}
}
}
The counting-the-byte-occurrences method is roughly five times faster than the LINQ method on my computer, even though the timing for the LINQ method excludes loading the files, and this holds across a range of file sizes (a few KB to a few MB).
I am creating a word list of possible passwords to prove how insecure 8-character passwords are. Using the code below, it will write aaaaaaaa, then aaaaaaab, then aaaaaaac, etc., until zzzzzzzz:
class Program
{
static string path;
static int file = 0;
static void Main(string[] args)
{
new_file();
var alphabet = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ123456789+-*_!$£^=<>§°ÖÄÜöäü.;:,?{}[]";
var q = alphabet.Select(x => x.ToString());
int size = 3;
int counter = 0;
for (int i = 0; i < size - 1; i++)
{
q = q.SelectMany(x => alphabet, (x, y) => x + y);
}
foreach (var item in q)
{
if (counter >= 20000000)
{
new_file();
counter = 0;
}
if (File.Exists(path))
{
using (StreamWriter sw = File.AppendText(path))
{
sw.WriteLine(item);
Console.WriteLine(item);
/*if (!(Regex.IsMatch(item, #"(.)\1")))
{
sw.WriteLine(item);
counter++;
}
else
{
Console.WriteLine(item);
}*/
}
}
else
{
new_file();
}
}
}
static void new_file()
{
path = @"C:\" + "list" + file + ".txt";
if (!File.Exists(path))
{
using (StreamWriter sw = File.CreateText(path))
{
}
}
file++;
}
}
The code is working fine, but it takes weeks to run. Does anyone know a way to speed it up, or do I have to wait? If anyone has an idea, please tell me.
Performance:
size 3: 0.02s
size 4: 1.61s
size 5: 144.76s
Hints:
removed LINQ for combination generation
removed Console.WriteLine for each password
removed StreamWriter
large buffer (128k) for file writing
const string alphabet = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ123456789+-*_!$£^=<>§°ÖÄÜöäü.;:,?{}[]";
var byteAlphabet = alphabet.Select(ch => (byte)ch).ToArray();
var alphabetLength = alphabet.Length;
var newLine = new[] { (byte)'\r', (byte)'\n' };
const int size = 4;
var number = new byte[size];
var password = Enumerable.Range(0, size).Select(i => byteAlphabet[0]).Concat(newLine).ToArray();
var watcher = new System.Diagnostics.Stopwatch();
watcher.Start();
var isRunning = true;
for (var counter = 0; isRunning; counter++)
{
Console.Write("{0}: ", counter);
Console.Write(password.Select(b => (char)b).ToArray());
using (var file = System.IO.File.Create(string.Format(@"list.{0:D5}.txt", counter), 2 << 16))
{
for (var i = 0; i < 2000000; ++i)
{
file.Write(password, 0, password.Length);
var j = size - 1;
for (; j >= 0; j--)
{
if (number[j] < alphabetLength - 1)
{
password[j] = byteAlphabet[++number[j]];
break;
}
else
{
number[j] = 0;
password[j] = byteAlphabet[0];
}
}
if (j < 0)
{
isRunning = false;
break;
}
}
}
}
watcher.Stop();
Console.WriteLine(watcher.Elapsed);
}
Try the following modified code. In LINQPad it runs in < 1 second. With your original code I gave up after 40 seconds. It removes the overhead of opening and closing the file for every WriteLine operation. You'll need to test and ensure it gives the same results because I'm not willing to run your original code for 24 hours to ensure the output is the same.
class Program
{
static string path;
static int file = 0;
static void Main(string[] args)
{
new_file();
var alphabet = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ123456789+-*_!$£^=<>§°ÖÄÜöäü.;:,?{}[]";
var q = alphabet.Select(x => x.ToString());
int size = 3;
int counter = 0;
for (int i = 0; i < size - 1; i++)
{
q = q.SelectMany(x => alphabet, (x, y) => x + y);
}
StreamWriter sw = File.AppendText(path);
try
{
foreach (var item in q)
{
if (counter >= 20000000)
{
sw.Dispose();
new_file();
sw = File.AppendText(path); // reopen the writer on the newly created file
counter = 0;
}
sw.WriteLine(item);
counter++;
Console.WriteLine(item);
}
}
finally
{
if(sw != null)
{
sw.Dispose();
}
}
}
static void new_file()
{
path = @"C:\temp\list" + file + ".txt";
if (!File.Exists(path))
{
using (StreamWriter sw = File.CreateText(path))
{
}
}
file++;
}
}
Your alphabet is missing 0.
With that fixed there would be 89 characters in your set. Let's call it 100 for simplicity. The set you are looking for is all the 8-character strings drawn from that set. There are 100^8 of these, i.e. 10,000,000,000,000,000.
The disk space they will take up depends on how you encode them. Let's be generous: assume you use some 8-bit character set that contains these characters and you don't put in carriage returns, so one byte per character and eight bytes per string, which is 80,000,000,000,000,000 bytes, roughly 80 petabytes.
Do you have 80 petabytes of disk (80,000 TB)?
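As a rough sanity check of that figure (a sketch; the round number of 100 characters is the simplification used above, the real set has 89):
double combinations = Math.Pow(100, 8);            // 1e16 candidate strings
double bytesPerString = 8;                         // one byte per character, no newline
double totalBytes = combinations * bytesPerString; // 8e16 bytes
Console.WriteLine(totalBytes / 1e15 + " PB");      // ~80 petabytes
// With the real 89-character set: Math.Pow(89, 8) * 8 is still about 31 PB.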
[EDIT] In response to 'this is not an answer':
The original motivation is to create the list. The above shows how large the list would be. It's hard to see what could be done with the list if it were actually materialised; it would always be quicker to regenerate it than to load it. Surely whatever point could be made by producing the list can also be made by simply knowing its size, which the above shows how to work out.
There are lots of inefficiencies in your code, but if your question is 'how can I quickly produce this list and write it to disk', the answer is 'you literally cannot'.
[/EDIT]
Up to now, I know two ways to get particular lines from a file (which contains about 30,000 lines):
int[] input = { 100, 50, 377, 15, 26000, 5000, 15000, 30, ... };
string output = "";
for (int i = 0; i < input.Length; i++)
{
output += File.ReadLines("C:\\file").Skip(input[i]).Take(1).First();
}
or
string[] lines = File.ReadAllLines("C:\\file");
int[] input = { 100, 50, 377, 15, 26000, 5000, 15000, 30, ... };
string output = "";
for (int i = 0; i < input.Length; i++)
{
output += lines[input[i]];
}
The lines I want to get need to be ordered according to the input array.
With the first way, I don't need to build a lines array containing 30,000 elements (~4 MB), but I must re-read the file for each element of input.
With the second way, I only need to read the file once, but I must build a large array holding all the data.
Is there any better way to get the lines? Thanks!
You can create a buffered iterator, which will iterate the sequence only once and keep a buffer of the required size:
public class BufferedIterator<T> : IDisposable
{
List<T> buffer = new List<T>();
IEnumerator<T> iterator;
public BufferedIterator(IEnumerable<T> source)
{
iterator = source.GetEnumerator();
}
public T GetItemAt(int index)
{
if (buffer.Count > index) // if item is buffered
return buffer[index]; // return it
// or fill buffer with next items
while(iterator.MoveNext() && buffer.Count <= index)
buffer.Add(iterator.Current);
// if we have read all file, but buffer has not enough items
if (buffer.Count <= index)
throw new IndexOutOfRangeException(); // throw
return buffer[index]; // otherwise return required item
}
public void Dispose()
{
if (iterator != null)
iterator.Dispose();
}
}
Usage:
var lines = File.ReadLines("C:\\file");
using (var iterator = new BufferedIterator<string>(lines))
{
int[] input = { 100, 50, 377 };
for(int i = 0; i < input.Length; i++)
output += iterator.GetItemAt(input[i]);
}
With this sample, only the lines up to index 377 will be read and buffered, and the file's lines will be enumerated only once.
This article shows how to read from a file using a MemoryStream. You can use it to buffer sections of the file at a time, perhaps using a carriage return as the delimiter: http://www.codeproject.com/Articles/164372/Back-to-Basics-Reading-a-File-into-Memory-Stream
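Another option that avoids both re-reading the file and keeping every line is a single sequential pass that picks out only the requested lines. This is just a sketch (not from the linked article) and assumes the values in input are 0-based line indexes, as in the question:
int[] input = { 100, 50, 377, 15, 26000, 5000, 15000, 30 };
var wanted = new HashSet<int>(input);
var found = new Dictionary<int, string>();
int lineNumber = 0;
foreach (var line in File.ReadLines("C:\\file"))
{
    if (wanted.Contains(lineNumber))
    {
        found[lineNumber] = line;
        if (found.Count == wanted.Count)
            break; // stop reading once every requested line has been captured
    }
    lineNumber++;
}
// Reassemble in the order given by the input array.
string output = string.Concat(input.Select(i => found[i]));
Only the part of the file up to the largest requested index is read, and memory use is proportional to the number of requested lines rather than to the whole file.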