Binary search on file with different line length

Binary search on file with different line length - c#

I have some code which does a binary search over a file with sorted hex values (SHA1 hashes) on each line. This is used to search the HaveIBeenPwned database. The latest version contains a count of the number of times each password hash was found, so some lines have extra characters at the end, in the format ':###'
The length of this additional check isn't fixed, and it isn't always there. This causes the buffer to read incorrect values and fail to find values that actually exist.
Current code:
static bool Check(string asHex, string filename)
{
const int LINELENGTH = 40; //SHA1 hash length
var buffer = new byte[LINELENGTH];
using (var sr = File.OpenRead(filename))
{
//Number of lines
var high = (sr.Length / (LINELENGTH + 2)) - 1;
var low = 0L;
while (low <= high)
{
var middle = (low + high + 1) / 2;
sr.Seek((LINELENGTH + 2) * ((long)middle), SeekOrigin.Begin);
sr.Read(buffer, 0, LINELENGTH);
var readLine = Encoding.ASCII.GetString(buffer);
switch (readLine.CompareTo(asHex))
{
case 0:
return true;
case 1:
high = middle - 1;
break;
case -1:
low = middle + 1;
break;
default:
break;
}
}
}
return false;
}
My idea is to seek forward from the middle until a newline character is found, then seek backwards for the same point, which should give me a complete line which I can split by the ':' delimiter. I then compare the first part of the split string array which should be just a SHA1 hash.
I think this should still centre on the correct value, however I am wondering if there is a neater way to do this? If the midpoint isn't that actual midpoint between the end of line characters, should it be adjusted before the high and low values are?

I THINK this may be a possible simpler (faster) solution without the backtracking to the beginning of the line. I think you can just use byte file indexes instead of trying to work with a full "record/line. Because the middle index will not always be at the start of a line/record, the "readline" can return a partial line/record. If you were to immediately do a second "readline", you would get a full line/record. It wouldn't be quite optimal, because you would actually be comparing a little ahead of the middle index.
I downloaded the pwned-passwords-update-1 and pulled out about 30 records at the start, end, and in the middle, it seemed to find them all. What do you think?
const int HASHLENGTH = 40;
static bool Check(string asHex, string filename)
{
using (var fs = File.OpenRead(filename))
{
var low = 0L;
// We don't need to start at the very end
var high = fs.Length - (HASHLENGTH - 1); // EOF - 1 HASHLENGTH
StreamReader sr = new StreamReader(fs);
while (low <= high)
{
var middle = (low + high + 1) / 2;
fs.Seek(middle, SeekOrigin.Begin);
// Resync with base stream after seek
sr.DiscardBufferedData();
var readLine = sr.ReadLine();
// 1) If we are NOT at the beginning of the file, we may have only read a partial line so
// Read again to make sure we get a full line.
// 2) No sense reading again if we are at the EOF
if ((middle > 0) && (!sr.EndOfStream)) readLine = sr.ReadLine() ?? "";
string[] parts = readLine.Split(':');
string hash = parts[0];
// By default string compare does a culture-sensitive comparison we may not be what we want?
// Do an ordinal compare (0-9 < A-Z < a-z)
int compare = String.Compare(asHex, hash, StringComparison.Ordinal);
if (compare < 0)
{
high = middle - 1;
}
else if (compare > 0)
{
low = middle + 1;
}
else
{
return true;
}
}
}
return false;
}

My way of solving your problem was to create a new binary file containing the hashes only. 16 byte/hash and a faster binary search  ( I don't have 50 reps needed to comment only )

Related

How do you do a string split with 2 chars counts in C#?

I know how to do a string split if there's a letter, number, that I want to replace.
But how could I do a string.Split() by 2 char counts without replacing any existing letters, number, etc...?
Example:
string MAC = "00122345"
I want that string to output: 00:12:23:45

You could create a LINQ extension method to give you an IEnumerable<string> of parts:
public static class Extensions
{
public static IEnumerable<string> SplitNthParts(this string source, int partSize)
{
if (string.IsNullOrEmpty(source))
{
throw new ArgumentException("String cannot be null or empty.", nameof(source));
}
if (partSize < 1)
{
throw new ArgumentException("Part size has to be greater than zero.", nameof(partSize));
}
return Enumerable
.Range(0, (source.Length + partSize - 1) / partSize)
.Select(pos => source
.Substring(pos * partSize,
Math.Min(partSize, source.Length - pos * partSize)));
}
}
Usage:
var strings = new string[] {
"00122345",
"001223453"
};
foreach (var str in strings)
{
Console.WriteLine(string.Join(":", str.SplitNthParts(2)));
}
// 00:12:23:45
// 00:12:23:45:3
Explanation:
Use Enumerable.Range to get number of positions to slice string. In this case its the length of the string + chunk size - 1, since we need to get a big enough range to also fit leftover chunk sizes.
Enumerable.Select each position of slicing and get the startIndex using String.Substring using the position multiplied by 2 to move down the string every 2 characters. You will have to use Math.Min to calculate the smallest size leftover size if the string doesn't have enough characters to fit another chunk. You can calculate this by the length of the string - current position * chunk size.
String.Join the final result with ":".
You could also replace the LINQ query with yield here to increase performance for larger strings since all the substrings won't be stored in memory at once:
for (var pos = 0; pos < source.Length; pos += partSize)
{
yield return source.Substring(pos, Math.Min(partSize, source.Length - pos));
}

You can use something like this:
string newStr= System.Text.RegularExpressions.Regex.Replace(MAC, ".{2}", "$0:");
To trim the last colon, you can use something like this.
newStr.TrimEnd(':');
Microsoft Document

Try this way.
string MAC = "00122345";
MAC = System.Text.RegularExpressions.Regex.Replace(MAC,".{2}", "$0:");
MAC = MAC.Substring(0,MAC.Length-1);
Console.WriteLine(MAC);

A quite fast solution, 8-10x faster than the current accepted answer (regex solution) and 3-4x faster than the LINQ solution
public static string Format(this string s, string separator, int length)
{
StringBuilder sb = new StringBuilder();
for (int i = 0; i < s.Length; i += length)
{
sb.Append(s.Substring(i, Math.Min(s.Length - i, length)));
if (i < s.Length - length)
{
sb.Append(separator);
}
}
return sb.ToString();
}
Usage:
string result = "12345678".Format(":", 2);

Here is a one (1) line alternative using LINQ Enumerable.Aggregate.
string result = MAC.Aggregate("", (acc, c) => acc.Length % 3 == 0 ? acc += c : acc += c + ":").TrimEnd(':');

An easy to understand and simple solution.
This is a simple fast modified answer in which you can easily change the split char.
This answer also checks if the number is even or odd , to make the suitable string.Split().
input : 00122345
output : 00:12:23:45
input : 0012234
output : 00:12:23:4
//The List that keeps the pairs
List<string> MACList = new List<string>();
//Split the even number into pairs
for (int i = 1; i <= MAC.Length; i++)
{
if (i % 2 == 0)
{
MACList.Add(MAC.Substring(i - 2, 2));
}
}
//Make the preferable output
string output = "";
for (int j = 0; j < MACList.Count; j++)
{
output = output + MACList[j] + ":";
}
//Checks if the input string is even number or odd number
if (MAC.Length % 2 == 0)
{
output = output.Trim(output.Last());
}
else
{
output += MAC.Last();
}
//input : 00122345
//output : 00:12:23:45
//input : 0012234
//output : 00:12:23:4

Text File - Read and fix delimiter problems - Too slow

I am looking for a bit of advice on ways I can make this function quicker.
The function is designed to run through a delimited text file (with CRLF row ends) and remove any carriage returns or line breaks in between data rows.
E.g. A file of -
A|B|C|D
A|B|C|D
A|B|
C|D
A|B|C|D
Would become -
A|B|C|D
A|B|C|D
A|B|C|D
A|B|C|D
The function seems to work well, however when we start processing large files, the performance is too slow. An example is - for 800k rows it takes 3 seconds, for 130 million rows it takes over an hour....
The code is -
private void CleanDelimitedFile(string readFilePath, string writeFilePath, string delimiter, string problemFilePath, string rejectsFilePath, int estimateNumberOfRows)
{
ArrayList rejects = new ArrayList();
ArrayList problems = new ArrayList();
int safeSameLengthBreak = 0;
int numberOfLinesSameLength = 0;
int lineCount = 0;
int maxCount = 0;
string previousLine = string.Empty;
string currentLine = string.Empty;
// determine after how many rows with the same number of delimiter chars that we can safety
// say that we have found the expected length of a row (to save reading the full file twice)
if (estimateNumberOfRows > 100000000)
safeSameLengthBreak = estimateNumberOfRows / 200; // set the safe check limit as 0.5% of the file (minimum of 500,000)
else if (estimateNumberOfRows > 10000000)
safeSameLengthBreak = estimateNumberOfRows / 50; // set the safe check limit as 2% of the file (minimum of 200,000)
else
safeSameLengthBreak = 50000; // set the safe check limit as 50,000 (if there are less than 50,000 this wont be required anyway)
// open a reader
using (var reader = new StreamReader(readFilePath))
{
// check the file is still being read
while (!reader.EndOfStream)
{
// append the line count (for debugging)
lineCount += 1;
// get the current line
currentLine = reader.ReadLine();
// get the number of chars in the new line
int chars = (currentLine.Length - currentLine.Replace(delimiter, "").Length);
// if the number is higher than the previous maximum set the new maximum
if (maxCount < chars)
{
maxCount = chars;
// the maximum has changed, reset the number of lines in a row with the same delimiter
numberOfLinesSameLength = 0;
}
else
{
// the maximum has not changed, add to the number of lines in a row with the same delimiter
numberOfLinesSameLength += 1;
}
// is the number of lines parsed in a row with the same number of delimiter chars above the safe limit? If so break the loop
if (numberOfLinesSameLength > safeSameLengthBreak)
{
break;
}
}
}
// reset the line count
lineCount = 0;
// open a writer for the duration of the next read
using (var writer = new StreamWriter(writeFilePath))
{
using (var reader = new StreamReader(readFilePath))
{
// check the file is still being read
while (!reader.EndOfStream)
{
// append the line count (for debugging)
lineCount += 1;
// get the current line
currentLine = reader.ReadLine();
// get the number of chars in the new line
int chars = (currentLine.Length - currentLine.Replace(delimiter, "").Length);
// check the number of chars in the line matches the required number
if (chars == maxCount)
{
// write line
writer.WriteLine(currentLine);
// clear the previous line variable as this was a valid write
previousLine = string.Empty;
}
else
{
// add the line to problems
problems.Add(currentLine);
// append the new line to the previous line
previousLine += currentLine;
// get the number of chars in the new appended previous line
int newPreviousChars = (previousLine.Length - previousLine.Replace(delimiter, "").Length);
// check the number of chars in the previous appended line matches the required number
if (newPreviousChars == maxCount)
{
// write line
writer.WriteLine(previousLine);
// clear the previous line as this was a valid write
previousLine = string.Empty;
}
else if (newPreviousChars > maxCount)
{
// the number of delimiter chars in the new line is higher than the file maximum, add to rejects
rejects.Add(previousLine);
// clear the previous line and move on
previousLine = string.Empty;
}
}
}
}
}
// rename the original file as _original
System.IO.File.Move(readFilePath, readFilePath.Replace(".txt", "") + "_Original.txt");
// rename the new file as the original file name
System.IO.File.Move(writeFilePath, readFilePath);
// Write rejects
using (var rejectWriter = new StreamWriter(rejectsFilePath))
{
// loop through the problem array list and write the problem row to the problem file
foreach (string reject in rejects)
{
rejectWriter.WriteLine(reject);
}
}
// Write problems
using (var problemWriter = new StreamWriter(problemFilePath))
{
// loop through the reject array list and write the reject row to the problem file
foreach (string problem in problems)
{
problemWriter.WriteLine(problem);
}
}
}
Any pointers would be greatly appreciated.
Thanks in advance.

A few ideas
List<String>
For rejects and problems and allocate an initial capacity to you think they will need
Don't process over the network
Get an SSD, copy to it, process, write lines to it, and then copy the file back
This does not look like an efficient way to me to count delimeters
int chars = (currentLine.Length - currentLine.Replace(delimiter, "").Length);
This is wastefully expensive: currentLine.Replace(delimiter, "")
int chars = 0;
foreach(char c in currentLine) if (c == delimeter) chars++;
This is not efficient
previousLine += currentLine;
Use StringBuilder
And allocate StringBuilder once outside the loop
In the loop call .Clear()

erroneous character fixing of strings in c#

I have five strings like below,
ABBCCD
ABBDCD
ABBDCD
ABBECD
ABBDCD
all the strings are basically same except for the fourth characters. But only the character that appears maximum time will take the place. For example here D was placed 3 times in the fourth position. So, the final string will be ABBDCD. I wrote following code, but it seemed to be less efficient in terms of time. Because this function can be called million times. What should I do to improve the performance?
Here changedString is the string to be matched with other 5 strings. If Any position of the changed string is not matched with other four, then the maxmum occured character will be placed on changedString.
len is the length of the strings which is same for all strings.
for (int i = 0; i < len;i++ )
{
String findDuplicate = string.Empty + changedString[i] + overlapStr[0][i] + overlapStr[1][i] + overlapStr[2][i] +
overlapStr[3][i] + overlapStr[4][i];
char c = findDuplicate.GroupBy(x => x).OrderByDescending(x => x.Count()).First().Key;
if(c!=changedString[i])
{
if (i > 0)
{
changedString = changedString.Substring(0, i) + c +
changedString.Substring(i + 1, changedString.Length - i - 1);
}
else
{
changedString = c + changedString.Substring(i + 1, changedString.Length - 1);
}
}
//string cleanString = new string(findDuplicate.ToCharArray().Distinct().ToArray());
}

I'm not quite sure what you are going to do, but if it is about sorting strings by some n-th character, then the best way is to use Counting Sort http://en.wikipedia.org/wiki/Counting_sort It is used for sorting array of small integers and is quite fine for chars. It has linear O(n) time. The main idea is that if you know all your possible elements (looks like they can be only A-Z here) then you can create an additional array and count them. For your example it will be {0, 0, 1 ,3 , 1, 0,...} if we use 0 for 'A', 1 for 'B' and so on.

There is a function that might help performance-wise as it runs five times faster. The idea is to count occurrences yourself using a dictionary to convert character to a position into counting array, increment value at this position and check if it is greater than previously highest number of occurrences. If it is, current character is top and is stored as result. This repeats for each string in overlapStr and for each position within the strings. Please read comments inside code to see details.
string HighestOccurrenceByPosition(string[] overlapStr)
{
int len = overlapStr[0].Length;
// Dictionary transforms character to offset into counting array
Dictionary<char, int> char2offset = new Dictionary<char, int>();
// Counting array. Each character has an entry here
int[] counters = new int[overlapStr.Length];
// Highest occurrence characters found so far
char[] topChars = new char[len];
for (int i = 0; i < len; ++i)
{
char2offset.Clear();
// faster! char2offset = new Dictionary<char, int>();
// Highest number of occurrences at the moment
int highestCount = 0;
// Allocation of counters - as previously unseen character arrives
// it is given a slot at this offset
int lastOffset = 0;
// Current offset into "counters"
int offset = 0;
// Small optimization. As your data seems very similar, this helps
// to reduce number of expensive calls to TryGetValue
// You might need to remove this optimization if you don't have
// unused value of char in your dataset
char lastChar = (char)0;
for (int j = 0; j < overlapStr.Length; ++ j)
{
char thisChar = overlapStr[j][i];
// If this is the same character as last one
// Offset already points to correct cell in "counters"
if (lastChar != thisChar)
{
// Get offset
if (!char2offset.TryGetValue(thisChar, out offset))
{
// First time seen - allocate & initialize cell
offset = lastOffset;
counters[offset] = 0;
// Map character to this cell
char2offset[thisChar] = lastOffset++;
}
// This is now last character
lastChar = thisChar;
}
// increment and get count for character
int charCount = ++counters[offset];
// This is now highestCount.
// TopChars receives current character
if (charCount > highestCount)
{
highestCount = charCount;
topChars[i] = thisChar;
}
}
}
return new string(topChars);
}
P.S. This is certainly not the best solution. But as it is significantly faster than original I thought I should help out.

c# getting a string within another string

i have a string like this:
some_string = "A simple demo of SMS text messaging.\r\n+CMGW: 3216\r\n\r\nOK\r\n\"
im coming from vb.net and i need to know in c#, if i know the position of CMGW, how do i get "3216" out of there?
i know that my start should be the position of CMGW + 6, but how do i make it stop as soon as it finds "\r" ??
again, my end result should be 3216
thank you!

Find the index of \r from the start of where you're interested in, and use the Substring overload which takes a length:
// Production code: add validation here.
// (Check for each index being -1, meaning "not found")
int cmgwIndex = text.IndexOf("CMGW: ");
// Just a helper variable; makes the code below slightly prettier
int startIndex = cmgwIndex + 6;
int crIndex = text.IndexOf("\r", startIndex);
string middlePart = text.Substring(startIndex, crIndex - startIndex);

If you know the position of 3216 then you can just do the following
string inner = some_string.SubString(positionOfCmgw+6,4);
This code will take the substring of some_string starting at the given position and only taking 4 characters.
If you want to be more general you could do the following
int start = positionOfCmgw+6;
int endIndex = some_string.IndexOf('\r', start);
int length = endIndex - start;
string inner = some_string.SubString(start, length);

One option would be to start from your known index and read characters until you hit a non-numeric value. Not the most robust solution, but it will work if you know your input's always going to look like this (i.e., no decimal points or other non-numeric characters within the numeric part of the string).
Something like this:
public static int GetNumberAtIndex(this string text, int index)
{
if (index < 0 || index >= text.Length)
throw new ArgumentOutOfRangeException("index");
var sb = new StringBuilder();
for (int i = index; i < text.Length; ++i)
{
char c = text[i];
if (!char.IsDigit(c))
break;
sb.Append(c);
}
if (sb.Length > 0)
return int.Parse(sb.ToString());
else
throw new ArgumentException("Unable to read number at the specified index.");
}
Usage in your case would look like:
string some_string = #"A simple demo of SMS text messaging.\r\n+CMGW: 3216\r\n...";
int index = some_string.IndexOf("CMGW") + 6;
int value = some_string.GetNumberAtIndex(index);
Console.WriteLine(value);
Output:
3216

If you're looking to extract the number portion of 'CMGW: 3216' then a more reliable method would be to use regular expressions. That way you can look for the entire pattern, and not just the header.
var some_string = "A simple demo of SMS text messaging.\r\n+CMGW: 3216\r\n\r\nOK\r\n";
var match = Regex.Match(some_string, #"CMGW\: (?<number>[0-9]+)", RegexOptions.Multiline);
var number = match.Groups["number"].Value;

More general, if you don't know the start position of CMGW but the structure remains as before.
String s;
char[] separators = {'\r'};
var parts = s.Split(separators);
parts.Where(part => part.Contains("CMGW")).Single().Reverse().TakeWhile(c => c != ' ').Reverse();

Non-exponential formatted float

I have a UTF-8 formatted data file that contains thousands of floating point numbers. At the time it was designed the developers decided to omit the 'e' in the exponential notation to save space. Therefore the data looks like:
1.85783+16 0.000000+0 1.900000+6-3.855418-4 1.958263+6 7.836995-4
-2.000000+6 9.903130-4 2.100000+6 1.417469-3 2.159110+6 1.655700-3
2.200000+6 1.813662-3-2.250000+6-1.998687-3 2.300000+6 2.174219-3
2.309746+6 2.207278-3 2.400000+6 2.494469-3 2.400127+6 2.494848-3
-2.500000+6 2.769739-3 2.503362+6 2.778185-3 2.600000+6 3.020353-3
2.700000+6 3.268572-3 2.750000+6 3.391230-3 2.800000+6 3.512625-3
2.900000+6 3.750746-3 2.952457+6 3.872690-3 3.000000+6 3.981166-3
3.202512+6 4.437824-3 3.250000+6 4.542310-3 3.402356+6 4.861319-3
The problem is float.Parse() will not work with this format. The intermediate solution I had was,
protected static float ParseFloatingPoint(string data)
{
int signPos;
char replaceChar = '+';
// Skip over first character so that a leading + is not caught
signPos = data.IndexOf(replaceChar, 1);
// Didn't find a '+', so lets see if there's a '-'
if (signPos == -1)
{
replaceChar = '-';
signPos = data.IndexOf('-', 1);
}
// Found either a '+' or '-'
if (signPos != -1)
{
// Create a new char array with an extra space to accomodate the 'e'
char[] newData = new char[EntryWidth + 1];
// Copy from string up to the sign
for (int i = 0; i < signPos; i++)
{
newData[i] = data[i];
}
// Replace the sign with an 'e + sign'
newData[signPos] = 'e';
newData[signPos + 1] = replaceChar;
// Copy the rest of the string
for (int i = signPos + 2; i < EntryWidth + 1; i++)
{
newData[i] = data[i - 1];
}
return float.Parse(new string(newData), NumberStyles.Float, CultureInfo.InvariantCulture);
}
else
{
return float.Parse(data, NumberStyles.Float, CultureInfo.InvariantCulture);
}
}
I can't call a simple String.Replace() because it will replace any leading negative signs. I could use substrings but then I'm making LOTS of extra strings and I'm concerned about the performance.
Does anyone have a more elegant solution to this?

string test = "1.85783-16";
char[] signs = { '+', '-' };
int decimalPos = test.IndexOf('.');
int signPos = test.LastIndexOfAny(signs);
string result = (signPos > decimalPos) ?
string.Concat(
test.Substring(0, signPos),
"E",
test.Substring(signPos)) : test;
float.Parse(result).Dump(); //1.85783E-16
The ideas I'm using here ensure the decimal comes before the sign (thus avoiding any problems if the exponent is missing) as well as using LastIndexOf() to work from the back (ensuring we have the exponent if one existed). If there is a possibility of a prefix "+" the first if would need to include || signPos < decimalPos.
Other results:
"1.85783" => "1.85783"; //Missing exponent is returned clean
"-1.85783" => "-1.85783"; //Sign prefix returned clean
"-1.85783-3" => "-1.85783e-3" //Sign prefix and exponent coexist peacefully.
According to the comments a test of this method shows only a 5% performance hit (after avoiding the String.Format(), which I should have remembered was awful). I think the code is much clearer: only one decision to make.

In terms of speed, your original solution is the fastest I've tried so far (#Godeke's is a very close second). #Godeke's has a lot of readability, for only a minor amount of performance degradation. Add in some robustness checks, and his may be the long term way to go. In terms of robustness, you can add that in to yours like so:
static char[] signChars = new char[] { '+', '-' };
static float ParseFloatingPoint(string data)
{
if (data.Length != EntryWidth)
{
throw new ArgumentException("data is not the correct size", "data");
}
else if (data[0] != ' ' && data[0] != '+' && data[0] != '-')
{
throw new ArgumentException("unexpected leading character", "data");
}
int signPos = data.LastIndexOfAny(signChars);
// Found either a '+' or '-'
if (signPos > 0)
{
// Create a new char array with an extra space to accomodate the 'e'
char[] newData = new char[EntryWidth + 1];
// Copy from string up to the sign
for (int ii = 0; ii < signPos; ++ii)
{
newData[ii] = data[ii];
}
// Replace the sign with an 'e + sign'
newData[signPos] = 'e';
newData[signPos + 1] = data[signPos];
// Copy the rest of the string
for (int ii = signPos + 2; ii < EntryWidth + 1; ++ii)
{
newData[ii] = data[ii - 1];
}
return Single.Parse(
new string(newData),
NumberStyles.Float,
CultureInfo.InvariantCulture);
}
else
{
Debug.Assert(false, "data does not have an exponential? This is odd.");
return Single.Parse(data, NumberStyles.Float, CultureInfo.InvariantCulture);
}
}
Benchmarks on my X5260 (including the times to just grok out the individual data points):
Code Average Runtime Values Parsed
--------------------------------------------------
Nothing (Overhead) 13 ms 0
Original 50 ms 150000
Godeke 60 ms 150000
Original Robust 56 ms 150000

Thanks Godeke for your contiually improving edits.
I ended up changing the parameters of the parsing function to take a char[] rather than a string and used your basic premise to come up with the following.
protected static float ParseFloatingPoint(char[] data)
{
int decimalPos = Array.IndexOf<char>(data, '.');
int posSignPos = Array.LastIndexOf<char>(data, '+');
int negSignPos = Array.LastIndexOf<char>(data, '-');
int signPos = (posSignPos > negSignPos) ? posSignPos : negSignPos;
string result;
if (signPos > decimalPos)
{
char[] newData = new char[data.Length + 1];
Array.Copy(data, newData, signPos);
newData[signPos] = 'E';
Array.Copy(data, signPos, newData, signPos + 1, data.Length - signPos);
result = new string(newData);
}
else
{
result = new string(data);
}
return float.Parse(result, NumberStyles.Float, CultureInfo.InvariantCulture);
}
I changed the input to the function from string to char[] because I wanted to move away from ReadLine(). I'm assuming this would perform better then creating lots of strings. Instead I get a fixed number of bytes from the data file (since it will ALWAYS be 11 char width data), converting the byte[] to char[], and then performing the above processing to convert to a float.

Could you possibly use a regular expression to pick out each occurrence?
Some information here on suitable expresions:
http://www.regular-expressions.info/floatingpoint.html

Why not just write a simple script to reformat the data file once and then use float.Parse()?
You said "thousands" of floating point numbers, so even a terribly naive approach will finish pretty quickly (if you said "trillions" I would be more hesitant), and code that you only need to run once will (almost) never be performance critical. Certainly it would take less time to run then posting the question to SO takes, and there's much less opportunity for error.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Binary search on file with different line length - c#

My way of solving your problem was to create a new binary file containing the hashes only. 16 byte/hash and a faster binary search ( I don't have 50 reps needed to comment only )

Related

How do you do a string split with 2 chars counts in C#?

Text File - Read and fix delimiter problems - Too slow

erroneous character fixing of strings in c#

c# getting a string within another string

Non-exponential formatted float

Categories

Resources