Non-exponential formatted float - c#

I have a UTF-8 formatted data file that contains thousands of floating point numbers. At the time it was designed the developers decided to omit the 'e' in the exponential notation to save space. Therefore the data looks like:
1.85783+16 0.000000+0 1.900000+6-3.855418-4 1.958263+6 7.836995-4
-2.000000+6 9.903130-4 2.100000+6 1.417469-3 2.159110+6 1.655700-3
2.200000+6 1.813662-3-2.250000+6-1.998687-3 2.300000+6 2.174219-3
2.309746+6 2.207278-3 2.400000+6 2.494469-3 2.400127+6 2.494848-3
-2.500000+6 2.769739-3 2.503362+6 2.778185-3 2.600000+6 3.020353-3
2.700000+6 3.268572-3 2.750000+6 3.391230-3 2.800000+6 3.512625-3
2.900000+6 3.750746-3 2.952457+6 3.872690-3 3.000000+6 3.981166-3
3.202512+6 4.437824-3 3.250000+6 4.542310-3 3.402356+6 4.861319-3
The problem is float.Parse() will not work with this format. The intermediate solution I had was,
protected static float ParseFloatingPoint(string data)
{
int signPos;
char replaceChar = '+';
// Skip over first character so that a leading + is not caught
signPos = data.IndexOf(replaceChar, 1);
// Didn't find a '+', so lets see if there's a '-'
if (signPos == -1)
{
replaceChar = '-';
signPos = data.IndexOf('-', 1);
}
// Found either a '+' or '-'
if (signPos != -1)
{
// Create a new char array with an extra space to accommodate the 'e'
char[] newData = new char[EntryWidth + 1];
// Copy from string up to the sign
for (int i = 0; i < signPos; i++)
{
newData[i] = data[i];
}
// Replace the sign with an 'e + sign'
newData[signPos] = 'e';
newData[signPos + 1] = replaceChar;
// Copy the rest of the string
for (int i = signPos + 2; i < EntryWidth + 1; i++)
{
newData[i] = data[i - 1];
}
return float.Parse(new string(newData), NumberStyles.Float, CultureInfo.InvariantCulture);
}
else
{
return float.Parse(data, NumberStyles.Float, CultureInfo.InvariantCulture);
}
}
I can't call a simple String.Replace() because it will replace any leading negative signs. I could use substrings but then I'm making LOTS of extra strings and I'm concerned about the performance.
Does anyone have a more elegant solution to this?

string test = "1.85783-16";
char[] signs = { '+', '-' };
int decimalPos = test.IndexOf('.');
int signPos = test.LastIndexOfAny(signs);
string result = (signPos > decimalPos) ?
string.Concat(
test.Substring(0, signPos),
"E",
test.Substring(signPos)) : test;
float.Parse(result).Dump(); //1.85783E-16
The ideas I'm using here ensure the decimal comes before the sign (thus avoiding any problems if the exponent is missing) as well as using LastIndexOf() to work from the back (ensuring we have the exponent if one existed). If there is a possibility of a prefix "+" the first if would need to include || signPos < decimalPos.
Other results:
"1.85783" => "1.85783"; //Missing exponent is returned clean
"-1.85783" => "-1.85783"; //Sign prefix returned clean
"-1.85783-3" => "-1.85783E-3" //Sign prefix and exponent coexist peacefully.
According to the comments a test of this method shows only a 5% performance hit (after avoiding the String.Format(), which I should have remembered was awful). I think the code is much clearer: only one decision to make.
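For reference, the same idea can be wrapped in a small helper that also handles a leading "+" and surrounding spaces. This is only a sketch: the class and method names (CompactExponent, Normalize, Parse) are made up for illustration, and it assumes the input never already contains an 'E'.

```csharp
using System;
using System.Globalization;

public static class CompactExponent
{
    private static readonly char[] Signs = { '+', '-' };

    // Inserts the missing 'E' in front of the exponent sign, if there is one.
    public static string Normalize(string text)
    {
        string trimmed = text.Trim();
        int decimalPos = trimmed.IndexOf('.');
        int signPos = trimmed.LastIndexOfAny(Signs);

        // A sign at index 0 belongs to the mantissa, and a sign located
        // before the decimal point cannot start the exponent.
        if (signPos <= 0 || signPos < decimalPos)
        {
            return trimmed;
        }
        return trimmed.Substring(0, signPos) + "E" + trimmed.Substring(signPos);
    }

    // Parses with the invariant culture so '.' is always the decimal separator.
    public static float Parse(string text)
    {
        return float.Parse(Normalize(text), NumberStyles.Float, CultureInfo.InvariantCulture);
    }
}
```

For example, CompactExponent.Normalize("1.85783-16") gives "1.85783E-16", while "-1.85783" and "+1.85783" come back unchanged.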

In terms of speed, your original solution is the fastest I've tried so far (@Godeke's is a very close second). @Godeke's has a lot of readability, for only a minor amount of performance degradation. Add in some robustness checks, and his may be the long term way to go. In terms of robustness, you can add that in to yours like so:
static char[] signChars = new char[] { '+', '-' };
static float ParseFloatingPoint(string data)
{
if (data.Length != EntryWidth)
{
throw new ArgumentException("data is not the correct size", "data");
}
else if (data[0] != ' ' && data[0] != '+' && data[0] != '-')
{
throw new ArgumentException("unexpected leading character", "data");
}
int signPos = data.LastIndexOfAny(signChars);
// Found either a '+' or '-'
if (signPos > 0)
{
// Create a new char array with an extra space to accommodate the 'e'
char[] newData = new char[EntryWidth + 1];
// Copy from string up to the sign
for (int ii = 0; ii < signPos; ++ii)
{
newData[ii] = data[ii];
}
// Replace the sign with an 'e + sign'
newData[signPos] = 'e';
newData[signPos + 1] = data[signPos];
// Copy the rest of the string
for (int ii = signPos + 2; ii < EntryWidth + 1; ++ii)
{
newData[ii] = data[ii - 1];
}
return Single.Parse(
new string(newData),
NumberStyles.Float,
CultureInfo.InvariantCulture);
}
else
{
Debug.Assert(false, "data does not have an exponential? This is odd.");
return Single.Parse(data, NumberStyles.Float, CultureInfo.InvariantCulture);
}
}
Benchmarks on my X5260 (including the times to just grok out the individual data points):
Code                  Average Runtime   Values Parsed
------------------------------------------------------
Nothing (Overhead)    13 ms             0
Original              50 ms             150000
Godeke                60 ms             150000
Original Robust       56 ms             150000

Thanks Godeke for your continually improving edits.
I ended up changing the parameters of the parsing function to take a char[] rather than a string and used your basic premise to come up with the following.
protected static float ParseFloatingPoint(char[] data)
{
int decimalPos = Array.IndexOf<char>(data, '.');
int posSignPos = Array.LastIndexOf<char>(data, '+');
int negSignPos = Array.LastIndexOf<char>(data, '-');
int signPos = (posSignPos > negSignPos) ? posSignPos : negSignPos;
string result;
if (signPos > decimalPos)
{
char[] newData = new char[data.Length + 1];
Array.Copy(data, newData, signPos);
newData[signPos] = 'E';
Array.Copy(data, signPos, newData, signPos + 1, data.Length - signPos);
result = new string(newData);
}
else
{
result = new string(data);
}
return float.Parse(result, NumberStyles.Float, CultureInfo.InvariantCulture);
}
I changed the input to the function from string to char[] because I wanted to move away from ReadLine(). I'm assuming this would perform better than creating lots of strings. Instead I read a fixed number of bytes from the data file (since each entry will ALWAYS be 11 characters wide), convert the byte[] to char[], and then perform the above processing to convert to a float.
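To illustrate just the fixed-width part of this, here is a rough sketch that slices one line into 11-character entries and parses each one. It is deliberately simplified: it still works on a string per line rather than on raw bytes, and the names FixedWidthReader and ParseLine are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class FixedWidthReader
{
    // Entry width taken from the post: every value occupies 11 characters.
    public const int EntryWidth = 11;

    private static readonly char[] Signs = { '+', '-' };

    public static List<float> ParseLine(string line)
    {
        var values = new List<float>();
        for (int start = 0; start < line.Length; start += EntryWidth)
        {
            // The last entry on a line may be shorter if trailing blanks were stripped.
            int length = Math.Min(EntryWidth, line.Length - start);
            string field = line.Substring(start, length).Trim();
            if (field.Length == 0)
            {
                continue;
            }

            // A sign anywhere but the first character marks the exponent.
            int signPos = field.LastIndexOfAny(Signs);
            if (signPos > 0)
            {
                field = field.Insert(signPos, "E");
            }
            values.Add(float.Parse(field, NumberStyles.Float, CultureInfo.InvariantCulture));
        }
        return values;
    }
}
```

Slicing by position rather than splitting on spaces matters here, because a negative value uses the column that would otherwise hold the separating blank.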

Could you possibly use a regular expression to pick out each occurrence?
Some information here on suitable expressions:
http://www.regular-expressions.info/floatingpoint.html
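As a sketch of what that could look like: the pattern below is my own guess at a suitable expression for this file (it is not taken from the linked page), and RegexFieldParser / ParseAll are made-up names. It assumes every value has a decimal point and that the exponent, when present, is a sign followed by digits.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

public static class RegexFieldParser
{
    // Optionally signed mantissa, then an optional signed exponent with no 'E'.
    private static readonly Regex Field = new Regex(
        @"(?<mantissa>[+-]?\d+\.\d+)(?<exponent>[+-]\d+)?",
        RegexOptions.Compiled | RegexOptions.CultureInvariant);

    public static List<double> ParseAll(string line)
    {
        var values = new List<double>();
        foreach (Match match in Field.Matches(line))
        {
            string text = match.Groups["mantissa"].Value;
            if (match.Groups["exponent"].Success)
            {
                // Put back the 'E' that the file format left out.
                text += "E" + match.Groups["exponent"].Value;
            }
            values.Add(double.Parse(text, NumberStyles.Float, CultureInfo.InvariantCulture));
        }
        return values;
    }
}
```

Because the mantissa must contain a decimal point, the "-3" in "1.900000+6-3.855418-4" is not mistaken for a second exponent; it starts the next value instead.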

Why not just write a simple script to reformat the data file once and then use float.Parse()?
You said "thousands" of floating point numbers, so even a terribly naive approach will finish pretty quickly (if you said "trillions" I would be more hesitant), and code that you only need to run once will (almost) never be performance critical. Certainly it would take less time to run than posting the question to SO takes, and there's much less opportunity for error.

Related

C# Parse String To Double Without Scientific Notation [duplicate]

How to convert a double into a floating-point string representation without scientific notation in the .NET Framework?
"Small" samples (effective numbers may be of any size, such as 1.5E200 or 1e-200) :
3248971234698200000000000000000000000000000000
0.00000000000000000000000000000000000023897356978234562
None of the standard number formats are like this, and a custom format also doesn't seem to allow having an open number of digits after the decimal separator.
This is not a duplicate of How to convert double to string without the power to 10 representation (E-05) because the answers given there do not solve the issue at hand. The accepted solution in this question was to use a fixed point (such as 20 digits), which is not what I want. A fixed point formatting and trimming the redundant 0 doesn't solve the issue either because the max width for fixed width is 99 characters.
Note: the solution has to deal correctly with custom number formats (e.g. other decimal separator, depending on culture information).
Edit: The question is really only about displaying the aforementioned numbers. I'm aware of how floating point numbers work and what numbers can be used and computed with them.
For a general-purpose¹ solution you need to preserve 339 places:
doubleValue.ToString("0." + new string('#', 339))
The maximum number of non-zero decimal digits is 16. 15 are on the right side of the decimal point. The exponent can move those 15 digits a maximum of 324 places to the right. (See the range and precision.)
It works for double.Epsilon, double.MinValue, double.MaxValue, and anything in between.
The performance will be much greater than the regex/string manipulation solutions since all formatting and string work is done in one pass by unmanaged CLR code. Also, the code is much simpler to prove correct.
For ease of use and even better performance, make it a constant:
public static class FormatStrings
{
public const string DoubleFixedPoint = "0.###################################################################################################################################################################################################################################################################################################################################################";
}
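Since the question asks for culture-aware output, here is a small sketch of the same format string used with an explicit format provider. The class and method names are hypothetical, and the format is built at runtime only to keep the listing short.

```csharp
using System;
using System.Globalization;

public static class FixedPointFormat
{
    // Same shape as the constant above: "0." followed by 339 '#' placeholders.
    private static readonly string Format = "0." + new string('#', 339);

    // The provider decides which decimal separator ends up in the output.
    public static string ToFixedPoint(double value, IFormatProvider provider)
    {
        return value.ToString(Format, provider);
    }
}
```

Passing CultureInfo.InvariantCulture yields a '.' separator, while a NumberFormatInfo whose NumberDecimalSeparator is "," yields a comma in the same position.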
¹ Update: I mistakenly said that this was also a lossless solution. In fact it is not, since ToString does its normal display rounding for all formats except r. Live example. Thanks, @Loathing! Please see Loathing’s answer if you need the ability to roundtrip in fixed point notation (i.e., if you’re using .ToString("r") today).
I had a similar problem and this worked for me:
doubleValue.ToString("F99").TrimEnd('0')
F99 may be overkill, but you get the idea.
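One thing to watch with this approach: for a whole number, TrimEnd('0') leaves a dangling separator ("150."). A possible variant, sketched here with the invariant culture and a made-up helper name, trims the separator as well. Values smaller than about 1e-99 would still come out as "0", which is the 99-digit limit the question mentions.

```csharp
using System;
using System.Globalization;

public static class TrimmedFixedPoint
{
    public static string ToTrimmed(double value)
    {
        // "F99" always writes 99 digits after the decimal point.
        string text = value.ToString("F99", CultureInfo.InvariantCulture);

        // Drop the padding zeros first, then a separator left dangling at the end.
        // The separator protects the zeros of the integer part from being trimmed.
        return text.TrimEnd('0').TrimEnd('.');
    }
}
```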
This is a string parsing solution where the source number (double) is converted into a string and parsed into its constituent components. It is then reassembled by rules into the full-length numeric representation. It also accounts for locale as requested.
Update: The tests of the conversions only include single-digit whole numbers, which is the norm, but the algorithm also works for something like: 239483.340901e-20
using System;
using System.Text;
using System.Globalization;
using System.Threading;
public class MyClass
{
public static void Main()
{
Console.WriteLine(ToLongString(1.23e-2));
Console.WriteLine(ToLongString(1.234e-5)); // 0.00001234
Console.WriteLine(ToLongString(1.2345E-10)); // 0.00000000012345
Console.WriteLine(ToLongString(1.23456E-20)); // 0.0000000000000000000123456
Console.WriteLine(ToLongString(5E-20));
Console.WriteLine("");
Console.WriteLine(ToLongString(1.23E+2)); // 123
Console.WriteLine(ToLongString(1.234e5)); // 123400
Console.WriteLine(ToLongString(1.2345E10)); // 12345000000
Console.WriteLine(ToLongString(-7.576E-05)); // -0.00007576
Console.WriteLine(ToLongString(1.23456e20));
Console.WriteLine(ToLongString(5e+20));
Console.WriteLine("");
Console.WriteLine(ToLongString(9.1093822E-31)); // mass of an electron
Console.WriteLine(ToLongString(5.9736e24)); // mass of the earth
Console.ReadLine();
}
private static string ToLongString(double input)
{
string strOrig = input.ToString();
string str = strOrig.ToUpper();
// if string representation was collapsed from scientific notation, just return it:
if (!str.Contains("E")) return strOrig;
bool negativeNumber = false;
if (str[0] == '-')
{
str = str.Remove(0, 1);
negativeNumber = true;
}
string sep = Thread.CurrentThread.CurrentCulture.NumberFormat.NumberDecimalSeparator;
char decSeparator = sep.ToCharArray()[0];
string[] exponentParts = str.Split('E');
string[] decimalParts = exponentParts[0].Split(decSeparator);
// fix missing decimal point:
if (decimalParts.Length==1) decimalParts = new string[]{exponentParts[0],"0"};
int exponentValue = int.Parse(exponentParts[1]);
string newNumber = decimalParts[0] + decimalParts[1];
string result;
if (exponentValue > 0)
{
result =
newNumber +
GetZeros(exponentValue - decimalParts[1].Length);
}
else // negative exponent
{
result =
"0" +
decSeparator +
GetZeros(exponentValue + decimalParts[0].Length) +
newNumber;
result = result.TrimEnd('0');
}
if (negativeNumber)
result = "-" + result;
return result;
}
private static string GetZeros(int zeroCount)
{
if (zeroCount < 0)
zeroCount = Math.Abs(zeroCount);
StringBuilder sb = new StringBuilder();
for (int i = 0; i < zeroCount; i++) sb.Append("0");
return sb.ToString();
}
}
You could cast the double to decimal and then do ToString().
(0.000000005).ToString() // 5E-09
((decimal)(0.000000005)).ToString() // 0,000000005
I haven't done performance testing which is faster, casting from 64-bit double to 128-bit decimal or a format string of over 300 chars. Oh, and there might possibly be overflow errors during conversion, but if your values fit a decimal this should work fine.
Update: The casting seems to be a lot faster. Using a prepared format string as given in the other answer, formatting a million times takes 2.3 seconds and casting only 0.19 seconds. Repeatable. That's 10x faster. Now it's only about the value range.
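To make the value-range caveat concrete, here is a sketch of a guarded version of the cast. TryToPlainString is a hypothetical helper; it relies on the documented behaviour that converting a double to decimal throws OverflowException for values that are too large (or NaN/infinity) and returns 0 for values that are too small.

```csharp
using System;
using System.Globalization;

public static class DecimalCast
{
    // Returns false when the double cannot be represented as a decimal.
    public static bool TryToPlainString(double value, out string text)
    {
        text = null;
        decimal converted;
        try
        {
            converted = (decimal)value;
        }
        catch (OverflowException)
        {
            // Too large for decimal, or NaN / infinity.
            return false;
        }

        // Magnitudes too small for decimal silently collapse to zero.
        if (converted == 0m && value != 0.0)
        {
            return false;
        }

        text = converted.ToString(CultureInfo.InvariantCulture);
        return true;
    }
}
```

When this returns false, one of the string-based approaches from the other answers would still be needed as a fallback.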
This is what I've got so far, seems to work, but maybe someone has a better solution:
private static readonly Regex rxScientific = new Regex(@"^(?<sign>-?)(?<head>\d+)(\.(?<tail>\d*?)0*)?E(?<exponent>[+\-]\d+)$", RegexOptions.IgnoreCase|RegexOptions.ExplicitCapture|RegexOptions.CultureInvariant);
public static string ToFloatingPointString(double value) {
return ToFloatingPointString(value, NumberFormatInfo.CurrentInfo);
}
public static string ToFloatingPointString(double value, NumberFormatInfo formatInfo) {
string result = value.ToString("r", NumberFormatInfo.InvariantInfo);
Match match = rxScientific.Match(result);
if (match.Success) {
Debug.WriteLine("Found scientific format: {0} => [{1}] [{2}] [{3}] [{4}]", result, match.Groups["sign"], match.Groups["head"], match.Groups["tail"], match.Groups["exponent"]);
int exponent = int.Parse(match.Groups["exponent"].Value, NumberStyles.Integer, NumberFormatInfo.InvariantInfo);
StringBuilder builder = new StringBuilder(result.Length+Math.Abs(exponent));
builder.Append(match.Groups["sign"].Value);
if (exponent >= 0) {
builder.Append(match.Groups["head"].Value);
string tail = match.Groups["tail"].Value;
if (exponent < tail.Length) {
builder.Append(tail, 0, exponent);
builder.Append(formatInfo.NumberDecimalSeparator);
builder.Append(tail, exponent, tail.Length-exponent);
} else {
builder.Append(tail);
builder.Append('0', exponent-tail.Length);
}
} else {
builder.Append('0');
builder.Append(formatInfo.NumberDecimalSeparator);
builder.Append('0', (-exponent)-1);
builder.Append(match.Groups["head"].Value);
builder.Append(match.Groups["tail"].Value);
}
result = builder.ToString();
}
return result;
}
// test code
double x = 1.0;
for (int i = 0; i < 200; i++) {
x /= 10;
}
Console.WriteLine(x);
Console.WriteLine(ToFloatingPointString(x));
The problem using #.###...### or F99 is that it doesn't preserve precision at the ending decimal places, e.g:
String t1 = (0.0001/7).ToString("0." + new string('#', 339)); // 0.0000142857142857143
String t2 = (0.0001/7).ToString("r"); // 1.4285714285714287E-05
The problem with DoubleConverter.cs is that it is slow. This code is the same idea as Sasik's answer, but twice as fast. Unit test method at bottom.
public static class RoundTrip {
private static String[] zeros = new String[1000];
static RoundTrip() {
for (int i = 0; i < zeros.Length; i++) {
zeros[i] = new String('0', i);
}
}
private static String ToRoundTrip(double value) {
String str = value.ToString("r");
int x = str.IndexOf('E');
if (x < 0) return str;
int x1 = x + 1;
String exp = str.Substring(x1, str.Length - x1);
int e = int.Parse(exp);
String s = null;
int numDecimals = 0;
if (value < 0) {
int len = x - 3;
if (e >= 0) {
if (len > 0) {
s = str.Substring(0, 2) + str.Substring(3, len);
numDecimals = len;
}
else
s = str.Substring(0, 2);
}
else {
// remove the leading minus sign
if (len > 0) {
s = str.Substring(1, 1) + str.Substring(3, len);
numDecimals = len;
}
else
s = str.Substring(1, 1);
}
}
else {
int len = x - 2;
if (len > 0) {
s = str[0] + str.Substring(2, len);
numDecimals = len;
}
else
s = str[0].ToString();
}
if (e >= 0) {
e = e - numDecimals;
String z = (e < zeros.Length ? zeros[e] : new String('0', e));
s = s + z;
}
else {
e = (-e - 1);
String z = (e < zeros.Length ? zeros[e] : new String('0', e));
if (value < 0)
s = "-0." + z + s;
else
s = "0." + z + s;
}
return s;
}
private static void RoundTripUnitTest() {
StringBuilder sb33 = new StringBuilder();
double[] values = new [] { 123450000000000000.0, 1.0 / 7, 10000000000.0/7, 100000000000000000.0/7, 0.001/7, 0.0001/7, 100000000000000000.0, 0.00000000001,
1.23e-2, 1.234e-5, 1.2345E-10, 1.23456E-20, 5E-20, 1.23E+2, 1.234e5, 1.2345E10, -7.576E-05, 1.23456e20, 5e+20, 9.1093822E-31, 5.9736e24, double.Epsilon };
foreach (int sign in new [] { 1, -1 }) {
foreach (double val in values) {
double val2 = sign * val;
String s1 = val2.ToString("r");
String s2 = ToRoundTrip(val2);
double val2_ = double.Parse(s2);
double diff = Math.Abs(val2 - val2_);
if (diff != 0) {
throw new Exception(string.Format("Value {0} did not pass ToRoundTrip.", val.ToString("r")));
}
sb33.AppendLine(s1);
sb33.AppendLine(s2);
sb33.AppendLine();
}
}
}
}
The obligatory Logarithm-based solution. Note that this solution, because it involves doing math, may reduce the accuracy of your number a little bit. Not heavily tested.
private static string DoubleToLongString(double x)
{
int shift = (int)Math.Log10(x);
if (Math.Abs(shift) <= 2)
{
return x.ToString();
}
if (shift < 0)
{
double y = x * Math.Pow(10, -shift);
return "0.".PadRight(-shift + 2, '0') + y.ToString().Substring(2);
}
else
{
double y = x * Math.Pow(10, 2 - shift);
return y + "".PadRight(shift - 2, '0');
}
}
Edit: If the decimal point crosses non-zero part of the number, this algorithm will fail miserably. I tried for simple and went too far.
In the old days when we had to write our own formatters, we'd isolate the mantissa and exponent and format them separately.
In this article by Jon Skeet (https://csharpindepth.com/articles/FloatingPoint) he provides a link to his DoubleConverter.cs routine that should do exactly what you want. Skeet also refers to this at extracting mantissa and exponent from double in c#.
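For a rough idea of what isolating the mantissa and exponent looks like, here is a sketch based on the IEEE 754 double layout (1 sign bit, 11 exponent bits, 52 fraction bits). It is not Skeet's code; the class and method names are made up, and it only decomposes the value without formatting it.

```csharp
using System;

public static class DoubleParts
{
    // Splits a finite double into sign, integer mantissa and binary exponent,
    // so that |value| == mantissa * 2^exponent exactly.
    public static void Decompose(double value, out bool negative, out long mantissa, out int exponent)
    {
        if (double.IsNaN(value) || double.IsInfinity(value))
        {
            throw new ArgumentException("value must be finite", nameof(value));
        }

        long bits = BitConverter.DoubleToInt64Bits(value);
        negative = bits < 0;
        exponent = (int)((bits >> 52) & 0x7FFL);
        mantissa = bits & 0xFFFFFFFFFFFFFL;

        if (exponent == 0)
        {
            exponent = 1;          // subnormal: there is no implicit leading bit
        }
        else
        {
            mantissa |= 1L << 52;  // normal: restore the implicit leading bit
        }
        exponent -= 1075;          // bias (1023) plus the 52 fraction bits

        if (mantissa == 0)
        {
            exponent = 0;          // zero has no meaningful exponent
            return;
        }

        // Strip trailing zero bits so the mantissa is as small as possible.
        while ((mantissa & 1) == 0)
        {
            mantissa >>= 1;
            exponent++;
        }
    }
}
```

With the value in this form, an exact decimal expansion can be produced using arbitrary-precision integer arithmetic, which is the general approach a converter of this kind takes.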
I have just improvised on the code above to make it work for negative exponential values.
using System;
using System.Text.RegularExpressions;
using System.IO;
using System.Text;
using System.Threading;
namespace ConvertNumbersInScientificNotationToPlainNumbers
{
class Program
{
private static string ToLongString(double input)
{
string str = input.ToString(System.Globalization.CultureInfo.InvariantCulture);
// if string representation was collapsed from scientific notation, just return it:
if (!str.Contains("E")) return str;
var positive = true;
if (input < 0)
{
positive = false;
}
string sep = Thread.CurrentThread.CurrentCulture.NumberFormat.NumberDecimalSeparator;
char decSeparator = sep.ToCharArray()[0];
string[] exponentParts = str.Split('E');
string[] decimalParts = exponentParts[0].Split(decSeparator);
// fix missing decimal point:
if (decimalParts.Length == 1) decimalParts = new string[] { exponentParts[0], "0" };
int exponentValue = int.Parse(exponentParts[1]);
string newNumber = decimalParts[0].Replace("-", "").
Replace("+", "") + decimalParts[1];
string result;
if (exponentValue > 0)
{
if (positive)
result =
newNumber +
GetZeros(exponentValue - decimalParts[1].Length);
else
result = "-" +
newNumber +
GetZeros(exponentValue - decimalParts[1].Length);
}
else // negative exponent
{
if (positive)
result =
"0" +
decSeparator +
GetZeros(exponentValue + decimalParts[0].Replace("-", "").
Replace("+", "").Length) + newNumber;
else
result =
"-0" +
decSeparator +
GetZeros(exponentValue + decimalParts[0].Replace("-", "").
Replace("+", "").Length) + newNumber;
result = result.TrimEnd('0');
}
float temp = 0.00F;
if (float.TryParse(result, out temp))
{
return result;
}
throw new Exception();
}
private static string GetZeros(int zeroCount)
{
if (zeroCount < 0)
zeroCount = Math.Abs(zeroCount);
StringBuilder sb = new StringBuilder();
for (int i = 0; i < zeroCount; i++) sb.Append("0");
return sb.ToString();
}
public static void Main(string[] args)
{
//Get Input Directory.
Console.WriteLine(@"Enter the Input Directory");
var readLine = Console.ReadLine();
if (readLine == null)
{
Console.WriteLine(@"Enter the input path properly.");
return;
}
var pathToInputDirectory = readLine.Trim();
//Get Output Directory.
Console.WriteLine(@"Enter the Output Directory");
readLine = Console.ReadLine();
if (readLine == null)
{
Console.WriteLine(@"Enter the output path properly.");
return;
}
var pathToOutputDirectory = readLine.Trim();
//Get Delimiter.
Console.WriteLine("Enter the delimiter;");
var columnDelimiter = (char)Console.Read();
//Loop over all files in the directory.
foreach (var inputFileName in Directory.GetFiles(pathToInputDirectory))
{
var outputFileWithouthNumbersInScientificNotation = string.Empty;
Console.WriteLine("Started operation on File : " + inputFileName);
if (File.Exists(inputFileName))
{
// Read the file
using (var file = new StreamReader(inputFileName))
{
string line;
while ((line = file.ReadLine()) != null)
{
String[] columns = line.Split(columnDelimiter);
var duplicateLine = string.Empty;
int lengthOfColumns = columns.Length;
int counter = 1;
foreach (var column in columns)
{
var columnDuplicate = column;
try
{
if (Regex.IsMatch(columnDuplicate.Trim(),
@"^[+-]?[0-9]+(\.[0-9]+)?[E]([+-]?[0-9]+)$",
RegexOptions.IgnoreCase))
{
Console.WriteLine("Regular expression matched for this :" + column);
columnDuplicate = ToLongString(Double.Parse
(column,
System.Globalization.NumberStyles.Float));
Console.WriteLine("Converted this no in scientific notation " +
"" + column + " to this number " +
columnDuplicate);
}
}
catch (Exception)
{
}
duplicateLine = duplicateLine + columnDuplicate;
if (counter != lengthOfColumns)
{
duplicateLine = duplicateLine + columnDelimiter.ToString();
}
counter++;
}
duplicateLine = duplicateLine + Environment.NewLine;
outputFileWithouthNumbersInScientificNotation = outputFileWithouthNumbersInScientificNotation + duplicateLine;
}
file.Close();
}
var outputFilePathWithoutNumbersInScientificNotation
= Path.Combine(pathToOutputDirectory, Path.GetFileName(inputFileName));
//Create Directory If it does not exist.
if (!Directory.Exists(pathToOutputDirectory))
Directory.CreateDirectory(pathToOutputDirectory);
using (var outputFile =
new StreamWriter(outputFilePathWithoutNumbersInScientificNotation))
{
outputFile.Write(outputFileWithouthNumbersInScientificNotation);
outputFile.Close();
}
Console.WriteLine("The transformed file is here :" +
outputFilePathWithoutNumbersInScientificNotation);
}
}
}
}
}
This code takes an input directory and based on the delimiter converts all values in scientific notation to numeric format.
Thanks
try this one:
public static string DoubleToFullString(double value,
NumberFormatInfo formatInfo)
{
string[] valueExpSplit;
string result, decimalSeparator;
int indexOfDecimalSeparator, exp;
valueExpSplit = value.ToString("r", formatInfo)
.ToUpper()
.Split(new char[] { 'E' });
if (valueExpSplit.Length > 1)
{
result = valueExpSplit[0];
exp = int.Parse(valueExpSplit[1]);
decimalSeparator = formatInfo.NumberDecimalSeparator;
if ((indexOfDecimalSeparator
= valueExpSplit[0].IndexOf(decimalSeparator)) > -1)
{
exp -= (result.Length - indexOfDecimalSeparator - 1);
result = result.Replace(decimalSeparator, "");
}
if (exp >= 0) result += new string('0', Math.Abs(exp));
else
{
exp = Math.Abs(exp);
if (exp >= result.Length)
{
result = "0" + decimalSeparator + new string('0', exp - result.Length)
+ result;
}
else
{
result = result.Insert(result.Length - exp, decimalSeparator);
}
}
}
else result = valueExpSplit[0];
return result;
}
With millions of programmers worldwide, it's always good practice to search for whether someone has already run into your problem. Sometimes their solutions are garbage, which means it's time to write your own, and sometimes they are great, such as the following:
http://www.yoda.arachsys.com/csharp/DoubleConverter.cs
(details: http://www.yoda.arachsys.com/csharp/floatingpoint.html)
string strdScaleFactor = dScaleFactor.ToString(); // where dScaleFactor = 3.531467E-05
decimal decimalScaleFactor = Decimal.Parse(strdScaleFactor, System.Globalization.NumberStyles.Float);
I don't know if my answer to the question can still be helpful. But in this case I suggest the "decomposition of the double variable into decimal places" to store it in an Array / Array of data of type String.
This process of decomposition and storage in parts (number by number) from double to string, would basically work with the use of two loops and an "alternative" (if you thought of workaround, I think you got it), where the first loop will extract the values from double without converting to String, resulting in blessed scientific notation and storing number by number in an Array. And this will be done using MOD - the same method to check a palindrome number, which would be for example:
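// maxDigits is a placeholder: an upper bound on the number of digits your double can reach; you must estimate it yourself
String[] Array_ = new String[maxDigits];
// Assumes variableDoubleMonstrous holds a non-negative whole number
for (int i = 0; i < maxDigits && variableDoubleMonstrous >= 1; i++){
double x = variableDoubleMonstrous % 10;
Array_[i] = x.ToString();
variableDoubleMonstrous = Math.Floor(variableDoubleMonstrous / 10);
}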
And the second loop inverts the Array values (because in this process of checking a palindrome, the values come out from the last place to the first, from the penultimate to the second, and so on. Remember?) to get the original value:
String[] ArrayFinal = new String[Array_.Length]; // the same number of "places" / indices as the other array
int lengthArray = Array_.Length;
for (int i = 0; i < Array_.Length; i++){
ArrayFinal[i] = Array_[lengthArray - 1];
lengthArray--;
}
***Warning: There's a catch that I didn't pay attention to. In that case there will be no "." (floating point decimal separator or double), so this solution is not generalized. But if it is really important to use decimal separators, unfortunately the only possibility (If done well, it will have a great performance) is:
**Use a routine to get the position of the decimal point of the original value, the one with scientific notation - the important thing is that you know that this floating point is before a number such as the "Length" position x, and after a number such as the y position - extracting each digit using the loops - as shown above - and at the end "export" the data from the last Array to another one, including the decimal place divider (the comma, or the period , if variable decimal, double or float) in the imaginary position that was in the original variable, in the "real" position of that matrix.
*** The concept of position is, find out how many numbers occur before the decimal point, so with this information you will be able to store in the String Array the point in the real position.
NEEDS THAT CAN BE MADE:
But then you ask:
But what about when I'm going to convert String to a floating point value?
My answer is that you use the second matrix of this entire process (the one that receives the inversion of the first matrix that obtains the numbers by the palindrome method) and use it for the conversion, but always making sure, when necessary, of the position of the decimal place in future situations, in case this conversion (Double -> String) is needed again.
But what if the problem is to use the value of the converted Double (Array of Strings) in a calculation. Then in this case you went around in circles. Well, the original variable will work anyway even with scientific notation. The only difference between floating point and decimal variable types is in the rounding of values, which depending on the purpose, it will only be necessary to change the type of data used, but it is dangerous to have a significant loss of information, look here
I could be wrong, but isn't it like this?
data.ToString("n");
http://msdn.microsoft.com/en-us/library/dwhawy9k.aspx
i think you need only to use IFormat with
ToString(doubleVar, System.Globalization.NumberStyles.Number)
example:
double d = double.MaxValue;
string s = d.ToString(d, System.Globalization.NumberStyles.Number);
My solution was using the custom formats.
try this:
double d;
d = 1234.12341234;
d.ToString("#########0.#########");
Just to build on what jcasso said: you can adjust your double value by changing the exponent so that your favorite format will handle it, apply the format, and then pad the result with zeros to compensate for the adjustment.
This works fine for me...
double number = 1.5E+200;
string s = number.ToString("#");
//Output: "150000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000"

Compare numeric values as string

I have a method which gets two strings. These strings can contain numbers, ASCII chars or both at the same time.
The algorithm works like this:
Split both strings into char Arrays A and B.
Try to parse element Ai and Bi to an int
Compare element Ai with element Bi, in case of integers use direct comparison, in case of chars use ordinal string comparison.
Do work based on the result
Now, I'm wondering: Do I really need to parse the elements to int? I simply could compare each element in an ordinal string comparison and would get the same result, right?
What are the performance implications here? Is parsing and normal comparison faster than ordinal string comparison? Is it slower?
Is my assumption (using ordinal string comparison instead of parsing and comparing) correct?
Here is the method in question:
internal static int CompareComponentString(this string componentString, string other)
{
bool componentEmpty = string.IsNullOrWhiteSpace(componentString);
bool otherEmpty = string.IsNullOrWhiteSpace(other);
if (componentEmpty && otherEmpty)
{
return 0;
}
if (componentEmpty)
{
return -1;
}
if (otherEmpty)
{
return 1;
}
string[] componentParts = componentString.Split(new[] { '.' }, StringSplitOptions.RemoveEmptyEntries);
string[] otherParts = other.Split(new[] { '.' }, StringSplitOptions.RemoveEmptyEntries);
for (int i = 0; i < Math.Min(componentParts.Length, otherParts.Length); i++)
{
string componentChar = componentParts[i];
string otherChar = otherParts[i];
int componentNumVal, otherNumVal;
bool componentIsNum = int.TryParse(componentChar, out componentNumVal);
bool otherIsNum = int.TryParse(otherChar, out otherNumVal);
if (componentIsNum && otherIsNum)
{
if (componentNumVal.CompareTo(otherNumVal) == 0)
{
continue;
}
return componentNumVal.CompareTo(otherNumVal);
}
else
{
if (componentIsNum)
{
return -1;
}
if (otherIsNum)
{
return 1;
}
int comp = string.Compare(componentChar, otherChar, StringComparison.OrdinalIgnoreCase);
if (comp != 0)
{
return comp;
}
}
}
return componentParts.Length.CompareTo(otherParts.Length);
}
These are strings that might be used. I might add that only the part after the minus sign is used.
1.0.0-alpha
1.0.0-alpha.1
1.0.0-alpha.beta
1.0.0-beta.2
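On the question of whether an ordinal string comparison could simply replace the int parsing: a quick check suggests the two only agree when the numeric parts have the same number of digits. The sketch below uses made-up helper names and assumes both inputs are plain non-negative integers.

```csharp
using System;
using System.Globalization;

public static class ComparisonDemo
{
    // Character-by-character comparison, normalised to -1, 0 or 1.
    public static int CompareAsText(string a, string b)
    {
        return Math.Sign(string.CompareOrdinal(a, b));
    }

    // Numeric comparison after parsing both parts.
    public static int CompareAsNumbers(string a, string b)
    {
        int left = int.Parse(a, CultureInfo.InvariantCulture);
        int right = int.Parse(b, CultureInfo.InvariantCulture);
        return left.CompareTo(right);
    }
}
```

For "10" and "9", CompareAsText returns -1 (because '1' sorts before '9') while CompareAsNumbers returns 1, so an identifier like alpha.10 would sort before alpha.9 if the parsing were dropped.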
With this method you can create a compare string for each of your string. These strings are comparable by simple alphanumeric comparison.
Assumptions:
There is a minus in the string separating the common part and the indiv part
before the minus is always a substring of three integer values divided by a dot
These integer values are not higher than 999 (look at variable "MaxWidth1")
behind the minus is another substring consisting of several parts, also divided by a dot
The second substring's parts may be numeric or alphanumeric with a max. width of 7 (look at "MaxWidth2")
The second substring consists of max. 5 parts (MaxIndivParts)
Put this method wherever you want:
public string VersionNumberCompareString(string versionNumber, int MaxWidth1=3, int MaxWidth2=7,int MaxIndivParts=5){
string result = null;
int posMinus = versionNumber.IndexOf('-');
string part1 = versionNumber.Substring(0, posMinus);
string part2 = versionNumber.Substring(posMinus+1);
var integerValues=part1.Split('.');
result = integerValues[0].PadLeft(MaxWidth1, '0');
result += integerValues[1].PadLeft(MaxWidth1, '0');
result += integerValues[2].PadLeft(MaxWidth1, '0');
var alphaValues = part2.Split('.');
for (int i = 0; i < MaxIndivParts;i++ ) {
if (i <= alphaValues.GetUpperBound(0)) {
var s = alphaValues[i];
int casted;
if (int.TryParse(s, out casted)) //if int: treat as number
result += casted.ToString().PadLeft(MaxWidth2, '0');
else //treat as string
result += s.PadRight(MaxWidth2, ' ');
}
else
result += new string(' ', MaxWidth2);
}
return result; }
You call it like this:
var s1 = VersionNumberCompareString("1.3.0-alpha.1.12");
//"001003000alpha  00000010000012              "
var s2 = VersionNumberCompareString("0.11.4-beta");
//"000011004beta                               "
var s3 = VersionNumberCompareString("2.10.11-beta.2");
//"002010011beta   0000002                     "
Note the position of the closing quotation mark: the keys are padded with trailing spaces, so all of them have the same length (44 characters with the default parameters).
Hope this helps...
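To illustrate how such keys might be used for sorting, here is a minimal sketch. It repeats a condensed version of the method above and orders the four sample strings with an ordinal comparison. The names VersionKeyDemo, Key and SortVersions are made up for this example, and the code relies on the same assumptions as listed above (no input validation).

```csharp
using System;
using System.Linq;

static class VersionKeyDemo
{
    // Condensed form of VersionNumberCompareString with the same default widths.
    public static string Key(string version, int w1 = 3, int w2 = 7, int maxParts = 5)
    {
        int minus = version.IndexOf('-');
        string key = string.Concat(
            version.Substring(0, minus).Split('.').Select(p => p.PadLeft(w1, '0')));
        string[] parts = version.Substring(minus + 1).Split('.');
        for (int i = 0; i < maxParts; i++)
        {
            string part = i < parts.Length ? parts[i] : "";
            int n;
            key += int.TryParse(part, out n)
                ? n.ToString().PadLeft(w2, '0')   // numeric part: zero-padded on the left
                : part.PadRight(w2, ' ');         // text or missing part: space-padded on the right
        }
        return key;
    }

    public static string[] SortVersions(string[] versions)
    {
        // Ordinal comparison: space sorts before digits, digits before letters.
        return versions.OrderBy(v => Key(v), StringComparer.Ordinal).ToArray();
    }
}
```

With the sample strings from above, SortVersions should return them in the order 1.0.0-alpha, 1.0.0-alpha.1, 1.0.0-alpha.beta, 1.0.0-beta.2.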
That's the .NET comparison logic for ASCII strings:
private unsafe static int CompareOrdinalIgnoreCaseHelper(String strA, String strB)
{
Contract.Requires(strA != null);
Contract.Requires(strB != null);
Contract.EndContractBlock();
int length = Math.Min(strA.Length, strB.Length);
fixed (char* ap = &strA.m_firstChar) fixed (char* bp = &strB.m_firstChar)
{
char* a = ap;
char* b = bp;
while (length != 0)
{
int charA = *a;
int charB = *b;
Contract.Assert((charA | charB) <= 0x7F, "strings have to be ASCII");
// uppercase both chars - notice that we need just one compare per char
if ((uint)(charA - 'a') <= (uint)('z' - 'a')) charA -= 0x20;
if ((uint)(charB - 'a') <= (uint)('z' - 'a')) charB -= 0x20;
//Return the (case-insensitive) difference between them.
if (charA != charB)
return charA - charB;
// Next char
a++; b++;
length--;
}
return strA.Length - strB.Length;
}
}
Having said that, unless you have a strict performance constraint, I would say that if you get the same result from an already implemented and tested function, it's better to reuse it and not reinvent the wheel.
It saves a lot of time in implementation, unit testing, debugging and bug fixing, and it helps keep the software simple.
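As a small illustration of reusing the built-in comparison instead of a hand-written loop, the sketch below simply wraps string.Compare with StringComparison.OrdinalIgnoreCase. The class and method names are made up for this example.

```csharp
using System;

static class BuiltInCompareDemo
{
    // Delegates to the framework's case-insensitive ordinal comparison.
    public static int CompareIgnoreCase(string a, string b)
    {
        return string.Compare(a, b, StringComparison.OrdinalIgnoreCase);
    }
}
```

CompareIgnoreCase("alpha", "ALPHA") returns 0, and CompareIgnoreCase("alpha", "beta") returns a negative value.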

Remove additional spacing in string [Fastest Way]

I need to remove all additional spaces in a string.
I use regexes to match strings, and I replace the matched strings with others.
For better understanding please see examples below:
3 input strings:
Hello, how are you?
Hello , how are you?
Hello , how are you ?
These are 3 strings that should be matched by a single regex pattern.
It looks something like this:
Hello\s*,\s+how\s+are\s+you\s*\?
It works fine, but there is a performance problem.
If I have a lot of patterns (~20k) and try to execute each pattern, it runs very slowly (3-5 minutes).
Maybe there is a better way of doing this?
For example, using some third-party libs?
UPD: Folks, this question is not about how to do this. It's about how to do this with the best performance. :)
Let me explain in more detail. The main goal is to tokenize text (replace some tokens with special symbols).
For example I have a token "nice try".
Then I input text "this is nice try".
result: "this is #tokenizedtext#", where #tokenizedtext# stands for some special symbols. It doesn't matter in this case.
Next I have string "Mike said it was a nice try".
result should be "Mike said it was a #tokenizedtext#".
I think the main idea is clear.
So I can have a lot of tokens. When I process them, I convert each token from "nice try" to the pattern "nice\s+try" and try to replace matches of this pattern in the input text.
It works fine. But if the tokens contain more spaces and also punctuation, my regexes become bigger and run very slowly.
Do you have some suggestions (technical or logic) for solving this problem?
I can suggest a few solutions.
First of all, avoid the static Regex method. Create an instance of it (and store it, don't call the constructor for each replacement!) and, if possible, use RegexOptions.Compiled. It should improve your performance.
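A minimal sketch of that first suggestion, assuming tokens are plain phrases whose words are separated by spaces: the pattern is built once per token (as in the question, "nice try" becomes "nice\s+try") and the Regex instance is stored for reuse. The class name TokenReplacer and its members are hypothetical, and Regex.Escape is added here so that punctuation in a token is matched literally.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

sealed class TokenReplacer
{
    // Created once in the constructor and reused for every replacement.
    private readonly Regex _regex;

    public TokenReplacer(string token)
    {
        // Escape each word, then allow any run of whitespace between words.
        var words = token
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(w => Regex.Escape(w));
        _regex = new Regex(string.Join(@"\s+", words), RegexOptions.Compiled);
    }

    public string Replace(string text, string replacement)
    {
        return _regex.Replace(text, replacement);
    }
}
```

Usage: keep one TokenReplacer per token (for example in a list built at startup) and call Replace(text, "#tokenizedtext#") for each input text. Whether RegexOptions.Compiled pays off for ~20k patterns is something to measure, since compiling each pattern has its own startup cost.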
Second, you can try to review your pattern. I'll do some profiling, but I'm currently undecided between:
@"(?<=\s)\s+"
With the replacement being an empty string, or:
@"\s+"
With a space as the replacement. You can try this code in the meantime:
var s = "Hello , how are you?";
var pattern = @"\s+";
var regex = new Regex(pattern, RegexOptions.Compiled);
var replaced = regex.Replace(s, " ");
EDIT: After having done some measurement, the second pattern seems to be faster. I'm editing my sample to adapt it.
EDIT 2: I've written an unsafe method. It's much faster than the other ones presented here, including the Regex ones, but, as the word itself says, it's unsafe. I don't think that there's any problem with the code I've written but I may be wrong -- So please, check it again and again in case there's a bug in the method.
static unsafe string TrimInternal(string input)
{
var length = input.Length;
var array = stackalloc char[length];
fixed (char* fix = input)
{
var ptr = fix;
var counter = 0;
var lastWasSpace = false;
while (*ptr != '\x0')
{
//Current char is a space?
var isSpace = *ptr == ' ';
//If it's a space but the last one wasn't
//Or if it's not a space
if (isSpace && !lastWasSpace || !isSpace)
//Write into the result array
array[counter++] = *ptr;
//The last character (before the next loop) was a space
lastWasSpace = isSpace;
//Increase the pointer
ptr++;
}
return new string(array, 0, counter);
}
}
Usage (compile with /unsafe):
var s = TrimInternal("Hello , how are you?");
Profiling made in Release build, optimizations on, 1000000 iterations:
My above solution with Regex: 00:00:03.2130121
The unsafe solution: 00:00:00.2063467
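The harness used for these numbers is not shown. As a rough sketch of how such a measurement might be set up, the helper below times an arbitrary action with System.Diagnostics.Stopwatch. The names TimingSketch and Time are made up for this example.

```csharp
using System;
using System.Diagnostics;

static class TimingSketch
{
    // Runs the action the given number of times and returns the total elapsed time.
    public static TimeSpan Time(Action action, int iterations)
    {
        action(); // warm-up call so JIT compilation is not part of the measurement
        var watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            action();
        }
        watch.Stop();
        return watch.Elapsed;
    }
}
```

Usage: TimingSketch.Time(() => regex.Replace(s, " "), 1000000), and likewise for the other candidates, in a Release build.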
This might work for you. It should be pretty fast. Note that it also removes spaces at the end of the string; that might not be what you want...
using System;
namespace Demo
{
class Program
{
static void Main(string[] args)
{
Console.WriteLine(">{0}<", RemoveExtraSpaces("Hello, how are you?"));
Console.WriteLine(">{0}<", RemoveExtraSpaces("Hello , how are you?"));
Console.WriteLine(">{0}<", RemoveExtraSpaces("Hello , how are you ?"));
}
public static string RemoveExtraSpaces(string text)
{
var buffer = new char[text.Length];
bool isSpaced = false;
int n = 0;
foreach (char c in text)
{
if (c == ' ')
{
isSpaced = true;
}
else
{
if (isSpaced)
{
if ((c != ',') && (c != '?'))
{
buffer[n++] = ' ';
}
isSpaced = false;
}
buffer[n++] = c;
}
}
return new string(buffer, 0, n);
}
}
}
Something of my own:
Find all the positions of space characters in the string:
private static IEnumerable<int> GetWhiteSpacePos(string input)
{
int iPos = -1;
while ((iPos = input.IndexOf(" ", iPos + 1, StringComparison.Ordinal)) > -1)
{
yield return iPos;
}
}
Then remove the spaces that occur in consecutive runs among the positions returned by GetWhiteSpacePos:
string original_string = "Hello , how are you ?";
var poss = GetWhiteSpacePos(original_string).ToList();
int startPos;
int endPos;
StringBuilder builder = new StringBuilder(original_string);
for (int i = poss.Count -1; i > 1; i--)
{
endPos = poss[i];
while ((poss[i] == poss[i - 1] + 1) && i > 1)
{
i--;
}
startPos = poss[i];
if (endPos - startPos > 1)
{
builder.Remove(startPos, endPos - startPos);
}
}
string new_string = builder.ToString();
You are using a very complex regex. Simplifying the regex should improve the performance.
Use \s+ and replace it with a single space
Well, these kinds of problems really trouble us. Use this code, and I'm sure you'll get the result you asked for. This call collapses any run of extra white space in a string into a single space.
cleanString = Regex.Replace(originalString, @"\s+", " ");
Hope that works for you. Thanks.
And since this is a single call, the code stays simple. Note that a single statement is not automatically cheap in CPU time, so measure it against the alternatives before relying on it for performance.
If it's just a matter of spaces, try this.
Source : http://www.codeproject.com/Articles/10890/Fastest-C-Case-Insenstive-String-Replace
private static string ReplaceEx(string original,
string pattern, string replacement)
{
int count, position0, position1;
count = position0 = position1 = 0;
string upperString = original.ToUpper();
string upperPattern = pattern.ToUpper();
int inc = (original.Length / pattern.Length) *
(replacement.Length - pattern.Length);
char[] chars = new char[original.Length + Math.Max(0, inc)];
while ((position1 = upperString.IndexOf(upperPattern,
position0)) != -1)
{
for (int i = position0; i < position1; ++i)
chars[count++] = original[i];
for (int i = 0; i < replacement.Length; ++i)
chars[count++] = replacement[i];
position0 = position1 + pattern.Length;
}
if (position0 == 0) return original;
for (int i = position0; i < original.Length; ++i)
chars[count++] = original[i];
return new string(chars, 0, count);
}
Usage:
string original_string = "Hello , how are you ?";
while (original_string.Contains("  "))
{
original_string = ReplaceEx(original_string, "  ", " ");
}
Replacing the regex way:
string resultString = null;
try {
resultString = Regex.Replace(subjectString, @"\s+", " ", RegexOptions.Compiled);
} catch (ArgumentException ex) {
// Syntax error in the regular expression
}

Eliminate redundant letters in string? (e.g. gooooooooood -> good)

I'm trying to set up some sample data for a Naive Bayesian Classifier for Twitter.
One of the post-processing steps I'd like to apply to the tweets is to remove unnecessary repeated characters.
For example, one of the tweets reads: Twizzlers. mmmmm goooooooooooood!
I'd like to reduce the number of o's down to just two. Why two? That's what the article I'm following did. Any individual word that is less than 2 characters is discarded (see mmmmm above). And as far as gooooooood, I would imagine double letters are the most common to be uber repeated.
So, that said, what's the fastest way (in terms of execution time) to reduce words such as gooooooooood to simply good?
[Edit]
I'll be processing 800,000 tweets in this app, hence the requirement for fastest execution
[/Edit]
[Edit2]
I just ran some simple benchmarking based on elapsed time to iterate through 1000 records & save to a text file. I repeated this iteration 100 times on each method. The average results are here:
Method 1: 386 ms [LINQ - answer was deleted]
Method 2: 407 ms [Regex]
Method 3: 303 ms [StringBuilder]
Method 4: 301 ms [StringBuilder part 2]
Method 1: LINQ (answer was apparently deleted)
static string doIt(string a)
{
var l = a.Select((p, i) => new { ch = p, index = i }).
Where(p => (p.index < a.Length - 2) && (a[p.index + 1] == p.ch) && (a[p.index + 2] == p.ch))
.Select(p => p.index).ToList();
l.Sort();
l.Reverse();
l.ForEach(i => a = a.Remove(i, 1));
return a;
}
Method 2:
Regex.Replace(tweet,@"(\S)\1{2,}","$1$1");
Method 3:
static string StringB(string s)
{
string input = s;
StringBuilder sb = new StringBuilder();
for (int i = 0; i < input.Length; i++)
{
if (i < 2 || input[i] != input[i - 1] || input[i] != input[i - 2])
sb.Append(input[i]);
}
string output = sb.ToString();
return output;
}
Method 4:
static string sb2(string s)
{
string input = s;
var sb = new StringBuilder(input);
char p2 = '\0';
char p1 = '\0';
int pos = 0, len = sb.Length;
while (pos < len)
{
if (p2 == p1) for (; pos < len && (sb[pos] == p2); len--)
sb.Remove(pos, 1);
if (pos < len)
{
p2 = p1;
p1 = sb[pos];
pos++;
}
}
return sb.ToString();
}
Regexen look to be the simplest. Simple proof of concept in the REPL:
using System.Text.RegularExpressions;
var regex = new Regex(@"(\S)\1{2,}"); // or @"([aeiouy])\1{2,}" etc?
regex.Replace("mmmmm gooood griieeeeefff", "$1$1");
-->
"mm good griieeff"
For raw performance, use something more like this: see it live on https://ideone.com/uWG68
using System;
using System.Text;
class Program
{
public static void Main(string[] args)
{
string input = "mmmm gooood griiiiiiiiiieeeeeeefffff";
var sb = new StringBuilder(input);
char p2 = '\0';
char p1 = '\0';
int pos = 0, len=sb.Length;
while (pos < len)
{
if (p2==p1) for (; pos<len && (sb[pos]==p2); len--)
sb.Remove(pos, 1);
if (pos<len)
{
p2=p1;
p1=sb[pos];
pos++;
}
}
Console.WriteLine(sb);
}
}
This is also (easily) doable via a regular expression:
var re = @"((.)\2)\2*";
Regex.Replace("god", re, "$1") // god
Regex.Replace("good", re, "$1") // good
Regex.Replace("gooood", re, "$1") // good
Is it faster than the other approaches? Well, that's for the benchmarks ;-) Regular expressions can be quite efficient in non-degenerate backtracking situations. The above may need to be altered (this will also match spaces for instance), but it's a small example.
Happy coding.
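One possible alteration, sketched under the assumption that only letters should be collapsed: replacing the dot with \p{L} leaves runs of spaces and punctuation untouched. The names RepeatCollapser and Collapse are made up for this example.

```csharp
using System.Text.RegularExpressions;

static class RepeatCollapser
{
    // Group 2 captures one letter, group 1 is that letter doubled,
    // and the trailing \2* consumes the rest of the run.
    private static readonly Regex RepeatedLetters =
        new Regex(@"((\p{L})\2)\2*", RegexOptions.Compiled);

    public static string Collapse(string text)
    {
        // Replace each run of two or more identical letters with exactly two.
        return RepeatedLetters.Replace(text, "$1");
    }
}
```

For example, Collapse("gooood") gives "good", while Collapse("god") and a run of spaces or exclamation marks come back unchanged.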
I would recommend looking into NLP solutions rather than C#/regex. In that world, Python is preferred. See NLTK. I would also recommend Nodebox Linguistics, which gives you spelling corrections. You can even stem words and go down to the infinitive.
I agree with the comments that this will not work in the general case, especially in "Twitter speak". Having said that the rules you mentioned are simple - eliminate every character that is the same as the previous two characters:
string input = "goooooooooooood";
StringBuilder sb = new StringBuilder(input.Length);
sb.Append(input.Substring(0, 2));
for (int i = 2; i < input.Length; i++)
{
if (input[i] != input[i - 1] || input[i] != input[i - 2])
sb.Append(input[i]);
}
string output = sb.ToString();

Format string with dashes

I have a compressed string value I'm extracting from an import file. I need to format this into a parcel number, which is formatted as follows: ##-##-##-###-###. So therefore, the string "410151000640" should become "41-01-51-000-640". I can do this with the following code:
String.Format("{0:##-##-##-###-###}", Convert.ToInt64("410151000640"));
However, The string may not be all numbers; it could have a letter or two in there, and thus the conversion to the int will fail. Is there a way to do this on a string so every character, regardless of if it is a number or letter, will fit into the format correctly?
Regex.Replace("410151000640", @"^(.{2})(.{2})(.{2})(.{3})(.{3})$", "$1-$2-$3-$4-$5");
Or the slightly shorter version
Regex.Replace("410151000640", @"^(..)(..)(..)(...)(...)$", "$1-$2-$3-$4-$5");
I would approach this by having your own formatting method, as long as you know that the "Parcel Number" always conforms to a specific rule.
public static string FormatParcelNumber(string input)
{
if(input.Length != 12)
throw new FormatException("Invalid parcel number. Must be 12 characters");
return String.Format("{0}-{1}-{2}-{3}-{4}",
input.Substring(0,2),
input.Substring(2,2),
input.Substring(4,2),
input.Substring(6,3),
input.Substring(9,3));
}
This should work in your case:
string value = "410151000640";
for( int i = 2; i < value.Length; i+=3){
value = value.Insert( i, "-");
}
Now value contains the string with dashes inserted.
EDIT
I just now saw that you don't have a dash after every second character all the way through, so this will require a small tweak (which also makes it a bit more clumsy, I'm afraid):
string value = "410151000640";
for( int i = 2; i < value.Length-1; i+=3){
if( value.Count( c => c == '-') >= 3) i++;
value = value.Insert( i, "-");
}
If it's part of the UI, you can use MaskedTextProvider in System.ComponentModel:
MaskedTextProvider prov = new MaskedTextProvider("aa-aa-aa-aaa-aaa");
prov.Set("41x151000a40");
string result = prov.ToDisplayString();
Here is a simple extension method with some utility:
public static string WithMask(this string s, string mask)
{
var slen = Math.Min(s.Length, mask.Length);
var charArray = new char[mask.Length];
var sPos = s.Length - 1;
for (var i = mask.Length - 1; i >= 0 && sPos >= 0;)
if (mask[i] == '#') charArray[i--] = s[sPos--];
else
charArray[i] = mask[i--];
return new string(charArray);
}
Use it as follows:
var s = "276000017812008";
var mask = "###-##-##-##-###-###";
var dashedS = s.WithMask(mask);
You can use it with any string and any character other than # in the mask will be inserted. The mask will work from right to left. You can tweak it to go the other way if you want.
Have fun.
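For the "other way", a left-to-right variant might look like the sketch below. It is not the original author's code: the names MaskExtensions and WithMaskLeftToRight are made up, and stopping as soon as the input runs out is a design choice of this sketch.

```csharp
using System.Text;

static class MaskExtensions
{
    // Fills '#' placeholders from left to right and copies every other
    // mask character as-is. Stops once the input has been used up.
    public static string WithMaskLeftToRight(this string s, string mask)
    {
        var sb = new StringBuilder(mask.Length);
        int sPos = 0;
        foreach (char m in mask)
        {
            if (m == '#')
            {
                if (sPos >= s.Length) break; // no input left for this placeholder
                sb.Append(s[sPos++]);
            }
            else
            {
                sb.Append(m);
            }
        }
        return sb.ToString();
    }
}
```

With the parcel number from the question, "410151000640".WithMaskLeftToRight("##-##-##-###-###") yields "41-01-51-000-640", and letters pass through unchanged because no numeric conversion is involved.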
If I understood you correctly, you're looking for a function that removes all letters from a string, aren't you?
I created this on the fly; maybe you can convert it into C# if it's what you're looking for:
Dim str As String = "410151000vb640"
str = String.Format("{0:##-##-##-###-###}", Convert.ToInt64(MakeNumber(str)))
Public Function MakeNumber(ByVal stringInt As String) As String
Dim sb As New System.Text.StringBuilder
For i As Int32 = 0 To stringInt.Length - 1
If Char.IsDigit(stringInt(i)) Then
sb.Append(stringInt(i))
End If
Next
Return sb.ToString
End Function
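A possible C# conversion of the VB snippet above, as a sketch: the class name ParcelNumberDemo and the method name Format are made up, while MakeNumber mirrors the VB function.

```csharp
using System;
using System.Linq;

static class ParcelNumberDemo
{
    // C# counterpart of the VB MakeNumber function: keeps only the digit characters.
    public static string MakeNumber(string input)
    {
        return new string(input.Where(c => char.IsDigit(c)).ToArray());
    }

    // Strips non-digits, then applies the numeric format string from the question.
    public static string Format(string input)
    {
        return string.Format("{0:##-##-##-###-###}", Convert.ToInt64(MakeNumber(input)));
    }
}
```

Keep in mind that this drops the letters instead of keeping them in place, and that going through Int64 loses leading zeros, so it only fits if letters really should be discarded.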
