Probability of already existing file System.IO.Path.GetRandomFileName() - c#

Recently I got the exception:
Message:
System.IO.IOException: The file 'C:\Windows\TEMP\635568456627146499.xlsx' already exists.
at System.IO.__Error.WinIOError(Int32 errorCode, String maybeFullPath)
This was the result of the following code I used for generating file names:
Path.Combine(Path.GetTempPath(), DateTime.Now.Ticks + ".xlsx");
After realising that it is possible to create two files in one Tick, I changed the code to:
Path.Combine(Path.GetTempPath(), Path.GetRandomFileName() + ".xlsx");
But I am still wondering what is the probability of the above exception in the new case?

Internally, GetRandomFileName uses RNGCryptoServiceProvider to generate an 11-character (name: 8 + ext: 3) string. The string represents a base-32 encoded number, so the total number of possible strings is 32^11, or 2^55.
Assuming uniform distribution, the chances of making a duplicate are about 2^-55, or 1 in 36 quadrillion. That's pretty low: for comparison, your chances of winning the NY lotto are roughly one million times higher.
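(Side note, not part of the original answer: 2^-55 is the chance that one particular pair of names collides. If you keep many temp files around at once, the usual birthday-paradox approximation applies. A rough C# sketch of how the collision probability grows with the number of coexisting files:)
using System;

class CollisionEstimate
{
    static void Main()
    {
        // Illustrative only: birthday-paradox approximation
        // p ~ 1 - exp(-n(n-1) / (2 * 2^55)) for n random names in a 2^55 space.
        double space = Math.Pow(2, 55);
        foreach (long n in new long[] { 1_000, 1_000_000, 100_000_000 })
        {
            double p = 1 - Math.Exp(-(double)n * (n - 1) / (2 * space));
            Console.WriteLine($"{n,12:N0} files -> collision probability ~ {p:E2}");
        }
    }
}
Even at a hundred million coexisting files the probability stays tiny, which matches the answer's conclusion.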

The probability of getting duplicate names with GetRandomFileName is really low, but if you look at its source here, you will see that it doesn't check whether the name is a duplicate (it can't, because it doesn't know the path where the file will be created).
Instead, Path.GetTempFileName returns a unique file name inside the Temp directory
(so it also removes the need to build the temp path in your code).
GetTempFileName uses the Win32 API GetTempFileName, requesting the creation of a unique file name.
The Win32 API creates the file with zero length and releases the handle.
So you don't fall into concurrency scenarios. Better to use this one.
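For illustration, a minimal sketch of using Path.GetTempFileName (the cleanup at the end is only for the demo):
using System;
using System.IO;

class TempFileDemo
{
    static void Main()
    {
        // Path.GetTempFileName creates a uniquely named, zero-byte .tmp file
        // in the temp directory and returns its full path, so the name is
        // already reserved on disk by the time this call returns.
        string tempFile = Path.GetTempFileName();
        Console.WriteLine(tempFile);

        // Caveat (not from the answer above): the extension is always .tmp,
        // so if you need a specific extension such as .xlsx you are back to
        // building the name yourself or renaming the file afterwards.
        File.Delete(tempFile);
    }
}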

GetRandomFileName() returns an 8.3-character string, i.e. 11 characters that can vary. Assuming it contains only letters and digits, this gives an "alphabet" of 36 characters, so the number of variations is on the order of 36^11, which makes the probability of the above exception extremely low.

I would like to put my answer in the comment area rather than here, but I don't have enough reputation to add a comment.
For your first snippet, I think you can pre-check whether the file exists.
For the second one, the code will generate a random name, but random still means there is a teeny-tiny possibility of getting the exception... though I don't think you need to worry about this. An existence check will help.
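If you do want to cover even that tiny possibility, one hedged alternative (not from the answers above; the helper name is made up) is to let the file system do the duplicate check atomically with FileMode.CreateNew and retry on collision, rather than a separate File.Exists test:
using System;
using System.IO;

static class UniqueTempFile
{
    // Sketch: retry with a fresh random name until FileMode.CreateNew succeeds.
    // CreateNew throws IOException if the file already exists, so the
    // check-and-create step cannot race with another process.
    public static string Create(string extension = ".xlsx")
    {
        while (true)
        {
            string path = Path.Combine(Path.GetTempPath(),
                                       Path.GetRandomFileName() + extension);
            try
            {
                using (new FileStream(path, FileMode.CreateNew)) { }
                return path;
            }
            catch (IOException)
            {
                // Name collision (astronomically unlikely): try again.
            }
        }
    }
}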

Related

ASP.NET TimeDate stamp on file gets overridden [duplicate]

This question already has answers here:
How to Generate unique file names in C#
(20 answers)
Closed 4 years ago.
I have this code here that takes a base-64 string and creates bytes; next I create a file name for these bytes.
byte[] bytes = System.Convert.FromBase64String(landingCells.imageBytes);
var filePath = landingCells.jobNo + DateTime.Now.ToString("yyyyMMddHHmmssffffff");
Next I save these bytes:
System.IO.File.WriteAllBytes("C:/app/Images/" + filePath + ".jpg", bytes);
The problem I am having is that I am calling these lines of code in a loop via an iOS app, and sometimes the yyyyMMddHHmmssffffff is the same as for the previous item in the loop. My question: how can I make the file names more unique so this does not happen?
Try this by using Guid.NewGuid():
var uniqueCode = Guid.NewGuid();
var filePath = landingCells.jobNo + DateTime.Now.ToString("yyyyMMddHHmmssffffff") + uniqueCode;
Using a date-based name will limit your file creation rate to the frequency of the system clock (and is also not threadsafe), which is why you are seeing duplicate file names when you complete iterations of your loop too quickly. You have several options to make the names more unique, depending on your requirements:
Add an incrementing counter suffix to the file name when the date is the same as the date of the last file written.
Incorporate a GUID into the file name. This will be less readable than the counter suffix, but it will guarantee uniqueness even across a distributed system and won't require you to maintain a counter (a short sketch follows below).
Incorporate some other original information about the file or its metadata into the name that, when combined with the date, will be unique.
Come up with some custom name-generation algorithm that will generate unique names for every (even repeated) input. How you do this depends on the domain you're working in and the data you're dealing with.
I'm not sure what kind of app you're building, but it's worth re-evaluating whether you actually need to write that many images to disk per second and, if you do, whether a video would be better. Throttling the writes would probably not be a bad idea, and it would also solve the naming problem.
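A minimal sketch of the GUID option mentioned above; the helper name and the .jpg extension are only illustrative:
using System;

static class ImageNaming
{
    // Sketch of the GUID option; "jobNo" is the caller's job number.
    public static string MakeUniqueName(string jobNo)
    {
        // The GUID alone guarantees uniqueness; the timestamp is kept only
        // so files still sort roughly by creation time when listed by name.
        return jobNo
             + DateTime.Now.ToString("yyyyMMddHHmmss")
             + "_" + Guid.NewGuid().ToString("N")
             + ".jpg";
    }
}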

How to read specific line from large file?

I have the problem of reading a single line from a large file encoded in UTF-8. Lines in the file have a constant length (in characters).
The file on average has 300k lines. Time is the main constraint, so I want to do it the fastest way possible.
I've tried LINQ:
File.ReadLines("file.txt").Skip(noOfLines).Take(1).First();
But the time is not satisfactory.
My biggest hope was using a stream and setting its position to the desired line start, but the problem is that line sizes in bytes differ.
Any ideas how to do it?
Now this is where you don't want to use LINQ (-:
You actually want to find the nth occurrence of a newline in the file and read from there until the next newline.
You probably want to check out this documentation on memory-mapped files as well:
https://msdn.microsoft.com/en-us/library/system.io.memorymappedfiles.memorymappedfile(v=vs.110).aspx
There is also a post comparing different access methods
http://cc.davelozinski.com/c-sharp/fastest-way-to-read-text-files
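One possible sketch (not from the answers above) of the "find the nth newline" idea: scan the file once to build a byte-offset index of line starts, then Seek straight to the requested line on every later read. It assumes UTF-8 text with \n line endings:
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

static class LineSeeker
{
    // One-time pass: record the byte offset at which each line starts.
    public static List<long> BuildLineIndex(string path)
    {
        var offsets = new List<long> { 0 };
        using (var fs = File.OpenRead(path))
        {
            int b;
            long pos = 0;
            while ((b = fs.ReadByte()) != -1)
            {
                pos++;
                if (b == '\n') offsets.Add(pos);   // next line starts after the newline
            }
        }
        return offsets;
    }

    // Later reads: seek to the recorded offset and read just that line.
    public static string ReadLineAt(string path, List<long> index, int lineNumber)
    {
        using (var fs = File.OpenRead(path))
        {
            fs.Seek(index[lineNumber], SeekOrigin.Begin);
            using (var reader = new StreamReader(fs, Encoding.UTF8))
                return reader.ReadLine();
        }
    }
}
Building the index costs one full scan up front, but after that each lookup is a single seek plus one line read, which is what the question is after.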

FromBase64 string length must be a multiple of 4 or not?

According to my understanding, a base64-encoded string (i.e. the output of encoding) must always have a length that is a multiple of 4.
The C# Convert.FromBase64String documentation says that its input must be a multiple of 4.
However, if I give it a 25-character string it doesn't complain:
[convert]::FromBase64String("ei5gsIELIki+GpnPGyPVBA==")
[convert]::FromBase64String("1ei5gsIELIki+GpnPGyPVBA==")
Both work. (The first one is 24 characters, the second is 25.)
[convert]::FromBase64String("11ei5gsIELIki+GpnPGyPVBA==")
fails with an invalid length exception.
I assume this is a bug in the C# library, but I just want to make sure. I am writing code that sniffs strings to see if they are valid base64 strings, and I want to be sure that I understand what a valid one looks like (one possible implementation was to give the string to System.Convert and see if it threw; why reinvent perfectly good code?).
Yes, this is a flaw (aka bug). It got started due to a perf optimization in an internal helper function named FromBase64_ComputeResultLength() which calculates the length of the byte[] result. It has this comment (edited to fit):
// For legal input, we can assume that 0 <= padding < 3. But it may be
// more for illegal input.
// We will notice it at decode when we see a '=' at the wrong place.
The "we will notice" remark is not entirely accurate, the decoder does flag an '=' if one isn't expected but it fails to check if there's one too many. Which is the case for the 25-char string.
You can report the problem at connect.microsoft.com, I don't see an existing report that resembles it. Do note that it is fairly unlikely that Microsoft can actually fix it any time soon since the change is going to break existing programs that now successfully parse bad base64 strings. It normally requires a major .NET release update to get rid of such problems, like it was done for .NET 4.0, there isn't one on the horizon afaik.
But yes, the simple workaround for you is to check if the string length is divisible by 4, use the % operator.
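A small sketch of that workaround (the helper name is made up): check the length first, then let Convert.FromBase64String do the character-level validation. Note that the decoder ignores embedded whitespace, so strip whitespace before the length check if your input may contain any.
using System;

static class Base64Sniffer
{
    public static bool LooksLikeBase64(string s)
    {
        // Reject lengths that are not a multiple of 4 up front, which is
        // exactly the case the built-in decoder misses.
        if (string.IsNullOrEmpty(s) || s.Length % 4 != 0)
            return false;
        try
        {
            Convert.FromBase64String(s);
            return true;
        }
        catch (FormatException)
        {
            return false;
        }
    }
}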

Generate unique hash from filename

I'm looking to generate a unique random hash that has a minuscule chance of being duplicated. It should only contain numbers, and I want it to be 4 characters long. I have the file path in the form of
filepath = "c:\\users\\john\\filename.csv"
Now, I'd like to select only the "filename" part of that string and create a hash from that filename, though I want it to be different each time, so that if two users upload a similarly named file it will likely generate a different hash code. What's the best way to go about doing this?
I will be using this hash to append "001", "002", etc. on to create student IDs.
Generating a unique hash from a file's filename is fairly simple.
However...
It should only contain numbers, and I want it to be 4 characters long.
With only 4 numeric characters there are just 10,000 possible values, so you are guaranteed a collision once you have 10,001 different files, and (by the birthday problem) you will likely hit one much sooner, after only around a hundred files. This makes it impossible to have a "minuscule chance of being duplicated".
Edit in response to comments:
You could do some simple type of hash, though this will give quite a few collisions:
string ComputeFourDigitStringHash(string filepath)
{
    // Hash only the file name portion, not the full path.
    string filename = System.IO.Path.GetFileNameWithoutExtension(filepath);
    // GetHashCode can return a negative value, so clear the sign bit before
    // taking the remainder; otherwise ToString("0000") can produce a
    // five-character result such as "-0123".
    int hash = (filename.GetHashCode() & 0x7FFFFFFF) % 10000;
    return hash.ToString("0000");
}
This will give you a 4 digit "hash" from the filename portion of the string. Note that it will have a lot of collisions, but it will give you something you can use.

Big strings: System.OutOfMemoryException

var fileList = Directory.GetFiles("./", "split*.dat");
int fileCount = fileList.Length;
int i = 0;
foreach (string path in fileList)
{
    string[] contents = File.ReadAllLines(path); // OutOfMemoryException
    Array.Sort(contents);
    string newpath = path.Replace("split", "sorted");
    File.WriteAllLines(newpath, contents);
    File.Delete(path);
    contents = null;
    GC.Collect();
    SortChunksProgressChanged(this, (double)i / fileCount);
    i++;
}
For a file that consists of ~20-30 big lines (every line ~20 MB) I get an OutOfMemoryException when I call the ReadAllLines method. Why is this exception raised? And how do I fix it?
P.S. I use Mono on MacOS
You should always be very careful about performing operations with potentially unbounded results. In your case that is reading a file: as you mention, the file size and/or line length is unbounded.
The answer lies in reading 'enough' of a line to sort by, then skipping characters until the next line and reading the next 'enough'. You probably want to aim to build a line-index lookup so that when you reach an ambiguous sort order between lines you can go back and get more data from those lines (seek to a file position). When you go back, you only need to read the next sortable chunk to disambiguate the conflicting lines.
You may need to think about the file encoding; don't go straight to bytes unless you know it is one byte per char.
The built-in sort is not as fast as you'd like.
Side notes:
If you call GC.*, you've probably done it wrong.
Setting contents = null does not help you.
If you are using a foreach and maintaining the index, then you may be better off with a for (int i ...) loop, for readability.
Okay, let me give you a hint to help you with your homework. Loading the complete file into memory will, as you know, not work, because it is given as a precondition of the assignment. You need to find a way to lazily load the data from disk as you go and throw away as much data as possible as soon as you can. Because single lines could be too big, you will have to do this one char at a time.
Try creating a class that represents an abstraction over a line, for instance by wrapping the starting index and ending index of that line. When you let this class implement IComparable<T>, it allows you to sort that line against other lines. Again, the trick is to be able to read characters from the file one at a time. You will need to work with Streams (File.Open) directly.
When you do this, you will be able to write your application code like this:
List<FileLine> lines = GetLines("fileToSort.dat");
lines.Sort();
foreach (var line in lines)
{
    line.AppendToFile("sortedFile.dat");
}
Your task will be to implement GetLines(string path) and create the FileLine class.
Note that I assume the actual number of lines will be small enough that the List<FileLine> fits into memory (which means an approximate maximum of 40,000,000 lines). If the number of lines can be higher, you would need an even more flexible approach, but since you are talking about 20 to 30 lines, this should not be a problem.
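To make the shape of this concrete, here is a rough, intentionally incomplete skeleton of what such a FileLine class could look like. The prefix size and the stubbed AppendToFile are placeholders; GetLines and the "read more characters on ties" logic are the actual homework:
using System;
using System.IO;
using System.Text;

class FileLine : IComparable<FileLine>
{
    private readonly string _path;
    public long StartOffset { get; }

    public FileLine(string path, long startOffset)
    {
        _path = path;
        StartOffset = startOffset;
    }

    // Read at most 'count' characters of this line, one char at a time,
    // stopping at the end of the line, so the 20 MB line never has to be
    // held in memory as a whole.
    public string ReadPrefix(int count)
    {
        using (var fs = File.OpenRead(_path))
        {
            fs.Seek(StartOffset, SeekOrigin.Begin);
            using (var reader = new StreamReader(fs, Encoding.UTF8))
            {
                var sb = new StringBuilder(count);
                for (int i = 0; i < count; i++)
                {
                    int c = reader.Read();
                    if (c == -1 || c == '\n' || c == '\r') break;
                    sb.Append((char)c);
                }
                return sb.ToString();
            }
        }
    }

    public int CompareTo(FileLine other)
    {
        // Simplification: compare a fixed-size prefix only. Equal prefixes
        // would need to go back for more characters, as described above.
        return string.CompareOrdinal(ReadPrefix(4096), other.ReadPrefix(4096));
    }

    public void AppendToFile(string destination)
    {
        // Left as the exercise: stream the full line from StartOffset to the
        // next newline into 'destination' without loading it all at once.
        throw new NotImplementedException();
    }
}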
Basically your approach is flawed. You are violating a constraint of the homework you were given, and this constraint has been put there to make you think more.
As you said:
I must implement external sort and show my teacher that it works for files bigger than my RAM
OK, so how do you think you will ever read the whole file in? ;) That constraint is there on purpose. ReadAllLines does NOT implement an incremental external sort. As a result, it blows up.
