Replace the start of line in a file quickly - c#

I have an initial file containing lines such as:
34 964:0.049759 1123:0.0031 2507:0.015979
32,48 524:0.061167 833:0.030133 1123:0.002549
34,52 534:0.07349 698:0.141667 1123:0.004403
106 389:0.013396 417:0.016276 534:0.023859
The first part of a line is the class number. A line can have several classes.
For each class, I create a new file.
For instance for class 34 the resulting file will be :
+1 964:0.049759 1123:0.0031 2507:0.015979
-1 524:0.061167 833:0.030133 1123:0.002549
+1 534:0.07349 698:0.141667 1123:0.004403
-1 389:0.013396 417:0.016276 534:0.023859
For class 106 the resulting file will be :
-1 964:0.049759 1123:0.0031 2507:0.015979
-1 524:0.061167 833:0.030133 1123:0.002549
-1 534:0.07349 698:0.141667 1123:0.004403
+1 389:0.013396 417:0.016276 534:0.023859
The problem is that I have 13 files to write for each of 200 classes.
I already ran a less optimized version of my code and it took several hours.
With my code below it takes 1 hour to generate the 2600 files.
Is there a faster way to perform such a replacement? Is a regex a viable option?
Below is my implementation (works on LINQPAD with this data file)
static void Main()
{
const string filePath = @"C:\data.txt";
const string generatedFilesFolderPath = @"C:\";
const string fileName = "data";
using (new TimeIt("Whole process"))
{
var fileLines = File.ReadLines(filePath).Select(l => l.Split(new[] { ' ' }, 2)).ToList();
var classValues = GetClassValues();
foreach (var classValue in classValues)
{
var directoryPath = Path.Combine(generatedFilesFolderPath, classValue);
if (!Directory.Exists(directoryPath))
Directory.CreateDirectory(directoryPath);
var classFilePath = Path.Combine(directoryPath, fileName);
using (var file = new StreamWriter(classFilePath))
{
foreach (var line in fileLines)
{
var lineFirstPart = line.First();
string newFirstPart = "-1";
var hashset = new HashSet<string>(lineFirstPart.Split(','));
if (hashset.Contains(classValue))
{
newFirstPart = "+1";
}
file.WriteLine("{0} {1}", newFirstPart, line.Last());
}
}
}
}
Console.Read();
}
public static List<string> GetClassValues()
{
// In real life there are 200 class values.
return Enumerable.Range(0, 2).Select(c => c.ToString()).ToList();
}
public class TimeIt : IDisposable
{
private readonly string _name;
private readonly Stopwatch _watch;
public TimeIt(string name)
{
_name = name;
_watch = Stopwatch.StartNew();
}
public void Dispose()
{
_watch.Stop();
Console.WriteLine("{0} took {1}", _name, _watch.Elapsed);
}
}
The output:
Whole process took 00:00:00.1175102
EDIT: I also ran a profiler and it looks like the split method is the hottest spot.
EDIT 2: Simple example:
2,1 1:0.8 2:0.2
3 1:0.4 3:0.6
12 1:0.02 4:0.88 5:0.1
Expected output for class 2:
+1 1:0.8 2:0.2
-1 1:0.4 3:0.6
-1 1:0.02 4:0.88 5:0.1
Expected output for class 3:
-1 1:0.8 2:0.2
+1 1:0.4 3:0.6
-1 1:0.02 4:0.88 5:0.1
Expected output for class 4:
-1 1:0.8 2:0.2
-1 1:0.4 3:0.6
-1 1:0.02 4:0.88 5:0.1

I have eliminated the hottest paths from your code by removing the split and using a bigger buffer on the FileStream.
Instead of Split, I now call ToCharArray and scan the leading characters up to the first space, matching them against classValue character by character as I go. The boolean found indicates an exact match with one of the tokens delimited by a comma or by the first space. The rest of the handling is the same.
var fsw = new FileStream(classFilePath,
FileMode.Create,
FileAccess.Write,
FileShare.None,
64*1024*1024); // use a large buffer
using (var file = new StreamWriter(fsw)) // use the filestream
{
foreach(var line in fileLines) // for( int i = 0;i < fileLines.Length;i++)
{
char[] chars = line.ToCharArray();
int matched = 0;
int parsePos = -1;
bool takeClass = true;
bool found = false;
bool space = false;
// parse until space
while (parsePos < chars.Length - 1 && !space) // stop before indexing past the end of the array
{
parsePos++;
space = chars[parsePos] == ' '; // end
// tokens
if (chars[parsePos] == ' ' ||
chars[parsePos] == ',')
{
if (takeClass
&& matched == classValue.Length)
{
found = true;
takeClass = false;
}
else
{
// reset matching
takeClass = true;
matched = 0;
}
}
else
{
if (takeClass
&& matched < classValue.Length
&& chars[parsePos] == classValue[matched])
{
matched++; // on the next iteration, match next
}
else
{
takeClass = false; // no match!
}
}
}
chars[parsePos - 1] = '1'; // replace 1 in front of space
var correction = 1;
if (parsePos > 1)
{
// is classValue before the comma (or before space)
if (found)
{
chars[parsePos - 2] = '+';
}
else
{
chars[parsePos - 2] = '-';
}
correction++;
}
else
{
// is classValue before the comma (or before space)
if (found)
{
// not enough space in the array, write a single char
file.Write('+');
}
else
{
file.Write('-');
}
}
file.WriteLine(chars, parsePos - correction, chars.Length - (parsePos - correction));
}
}
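The same idea can also be written without copying each line into a char array. The following is only a sketch and not the code above: `ClassLabeler`, `HasClass` and `Relabel` are hypothetical names, and it assumes that a line without a space consists of the class prefix only. It scans the comma-separated prefix in place and compares each token with `string.CompareOrdinal`, so no `Split` call or `HashSet` is created per line.

```csharp
using System;

public static class ClassLabeler
{
    // Returns true if classValue appears as a whole token in the
    // comma-separated prefix that precedes the first space of the line.
    public static bool HasClass(string line, string classValue)
    {
        int end = line.IndexOf(' ');
        if (end < 0) end = line.Length;   // assumption: no space means the line is all prefix

        int tokenStart = 0;
        while (tokenStart <= end)
        {
            // Look for the next comma, but only inside the prefix.
            int comma = line.IndexOf(',', tokenStart, end - tokenStart);
            int tokenEnd = comma < 0 ? end : comma;
            int tokenLength = tokenEnd - tokenStart;

            // Lengths must match so that "3" does not match the token "34".
            if (tokenLength == classValue.Length &&
                string.CompareOrdinal(line, tokenStart, classValue, 0, tokenLength) == 0)
            {
                return true;
            }
            tokenStart = tokenEnd + 1;
        }
        return false;
    }

    // Replaces the class prefix with "+1" or "-1" and keeps the rest of the line.
    public static string Relabel(string line, string classValue)
    {
        int space = line.IndexOf(' ');
        string rest = space < 0 ? string.Empty : line.Substring(space);
        return (HasClass(line, classValue) ? "+1" : "-1") + rest;
    }
}
```

Calling `ClassLabeler.Relabel("2,1 1:0.8 2:0.2", "2")` gives the first expected output line for class 2 from the simple example in the question. Whether this is faster than the char-array version should be measured on the real data.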

Instead of iterating over the un-parsed lines 200 times, how about parsing the lines upfront into a data structure and then iterating over that 200 times? This should minimize the number of string manipulation operations.
I also read with a StreamReader and yield each Detail as it is parsed, instead of materializing all the lines first, so the entire file is not held in memory twice (once as string[] and again as Detail[]).
static void Main(string[] args)
{
var details = ReadDetail("data.txt").ToArray();
var classValues = Enumerable.Range(0, 10).ToArray();
foreach (var classValue in classValues)
{
// Create file/directory etc
using (var file = new StreamWriter("out.txt"))
{
foreach (var detail in details)
{
file.WriteLine("{0} {1}", detail.Classes.Contains(classValue) ? "+1" : "-1", detail.Line);
}
}
}
}
static IEnumerable<Detail> ReadDetail(string filePath)
{
using (StreamReader reader = new StreamReader(filePath))
{
while (!reader.EndOfStream)
{
string line = reader.ReadLine();
int separator = line.IndexOf(' ');
Detail detail = new Detail
{
Classes = line.Substring(0, separator).Split(',').Select(c => Int32.Parse(c)).ToArray(),
Line = line.Substring(separator + 1)
};
yield return detail;
}
}
}
public class Detail
{
public int[] Classes { get; set; }
public string Line { get; set; }
}
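To check the parse-once idea against the simple example from the question, here is a small in-memory sketch. It is a variation on the code above rather than a copy: `ParsedLine`, `Parse` and `Label` are hypothetical names, and it stores the classes in a `HashSet<int>` instead of an `int[]` (an assumption that lookups matter more than memory here).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed class ParsedLine
{
    public HashSet<int> Classes { get; }
    public string Rest { get; }

    public ParsedLine(HashSet<int> classes, string rest)
    {
        Classes = classes;
        Rest = rest;
    }

    // Splits "2,1 1:0.8 2:0.2" into the class set {2, 1} and the remainder "1:0.8 2:0.2".
    // The parsing happens once per input line, not once per class.
    public static ParsedLine Parse(string line)
    {
        int separator = line.IndexOf(' ');
        var classes = new HashSet<int>(
            line.Substring(0, separator).Split(',').Select(c => int.Parse(c)));
        return new ParsedLine(classes, line.Substring(separator + 1));
    }

    // Produces the output line for one class value.
    public string Label(int classValue)
    {
        return (Classes.Contains(classValue) ? "+1 " : "-1 ") + Rest;
    }
}
```

For the line `2,1 1:0.8 2:0.2`, `Label(2)` yields `+1 1:0.8 2:0.2` and `Label(3)` yields `-1 1:0.8 2:0.2`, which matches the expected output listed in the question.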

Related

Efficiently replace unwanted separator from csv field

As input, I have a csv file of approximately 1,000,000 lines (about 300Mb) which contains columns separated by semi-colons.
colA;colB;colC;colD;colE
aaaa;bbbb;cccc;dddd;"eeee;"
aaaa;bbbb;cccc;dddd;evfvdfeee
aaaa;bb1bb;cc2cc;dd3dd;evfve
Some of the fields on the 5th column may have a semi-colon. Not all of them though. So, I want to remove all semi-colons after the 4th occurrence. The code below works, but it takes ages (approx. 10min) to save the csv to the file. How can I speed up this?
void Main()
{
string finput = @"myfile.csv";
FileInfo finfo = new FileInfo(finput);
var Lines = File.ReadLines(finfo.FullName);
List<string> output = new List<string>();
Stopwatch sw = new Stopwatch();
sw.Start();
int count = 0;
foreach (string s in Lines)
{
count++;
if (count % 10000 == 0)
count.Dump();
output.Add((StringExtender.ReplaceAfterNthOccurrency(s, ";", ".", 5)));
}
sw.Elapsed.Dump();
File.WriteAllLines(finfo.DirectoryName + finfo.Name + "_conv.csv", output);
}
public static class StringExtender
{
public static string ReplaceAfterNthOccurrency(string input, string to_replace, string to_add, int n)
{
var cont = true;
int count = 0;
int start = 0;
while (cont)
{
int i = input.IndexOf(to_replace, start);
if (i != -1)
{
count++;
start = i + 1;
if (count >= n)
{
input = input.Remove(i, 1);
input = input.Insert(i, to_add);
}
}
else
cont = false;
}
return input;
}
}
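One thing that stands out in the code above is that every replacement calls `Remove` and `Insert`, each of which allocates a new string. A possible alternative, sketched below under the assumption that both the separator and its replacement are single characters, is to walk the line once and patch a char array. `SeparatorCleaner` and `ReplaceFromNthOccurrence` are hypothetical names, not part of the original code.

```csharp
using System;

public static class SeparatorCleaner
{
    // Replaces the n-th and every later occurrence of target with replacement,
    // visiting each character once instead of rebuilding the string per match.
    public static string ReplaceFromNthOccurrence(string input, char target, char replacement, int n)
    {
        char[] chars = null;   // allocated only if something has to change
        int count = 0;
        for (int i = 0; i < input.Length; i++)
        {
            if (input[i] != target) continue;
            count++;
            if (count < n) continue;
            if (chars == null) chars = input.ToCharArray();
            chars[i] = replacement;
        }
        // Lines with fewer than n separators are returned unchanged.
        return chars == null ? input : new string(chars);
    }
}
```

Writing each converted line straight to a `StreamWriter` instead of collecting everything in a `List<string>` would also keep the output out of memory. Whether either change removes the actual bottleneck is something to confirm with a profiler.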

Reading lines from a text file, converting them, then writing back to new file

I have some basic knowledge of C#, but I am having trouble coding something that seems simple in concept. I want to read a file (.asm) containing values such as
#1
#12
#96
#2
#46
etc.
on multiple consecutive lines. I then want to get rid of the # symbols (if they are present), convert the remaining number values to binary, then write these binary values back to a new file (.hack) on their own lines. There isn't a set limit on the number of lines, which is my biggest issue as I don't know how to check for lines dynamically. So far I can only read and convert lines if I code to look for them, then I can't figure out how to write these values on their own lines in the new file. Sorry if this sounds a bit convoluted, but any help would be appreciated. Thanks!
if (openFileDialog1.ShowDialog() == DialogResult.OK)
{
var line = File.ReadAllText(openFileDialog1.FileName);
using (StreamWriter sw = File.CreateText("testCode.hack"))
{
var str = line;
var charsToRemove = new string[] {"#"};
foreach (var c in charsToRemove)
{
str = str.Replace(c, string.Empty);
}
int value = Convert.ToInt32(str);
string value2 = Convert.ToString(value, 2);
if (value2.Length < 16)
{
int zeroes = 16 - value2.Length;
if(zeroes == 12)
{
sw.WriteLine("000000000000" + value2);
}
}
else
{
sw.WriteLine(value2);
}
}
This code should help you get going real fast:
static void Main(string[] args)
{
string line = string.Empty;
System.IO.StreamReader reader = new System.IO.StreamReader(@"C:\test.txt");
System.IO.StreamWriter writer = new System.IO.StreamWriter(@"C:\test.hack");
while ((line = reader.ReadLine()) != null) // Read until there is nothing more to read
{
if (line.StartsWith("#"))
{
line = line.Remove(0, 1); // Remove '#'
}
int value = -1;
if (Int32.TryParse(line, out value)) // Check if the rest string is an integer
{
// Convert the rest string to its binary representation and write it to the file
writer.WriteLine(intToBinary(value));
}
else
{
// Couldn't convert the string to an integer..
}
}
reader.Close();
writer.Close();
Console.WriteLine("Done!");
Console.Read();
}
//http://www.dotnetperls.com/binary-representation
static string intToBinary(int n)
{
char[] b = new char[32];
int pos = 31;
int i = 0;
while (i < 32)
{
if ((n & (1 << i)) != 0)
{
b[pos] = '1';
}
else
{
b[pos] = '0';
}
pos--;
i++;
}
return new string(b);
}
My suggestion is to create a List<string>. Here are the steps:
Read the input (.asm) file into the List.
Open a StreamWriter for the output (.hack) file.
Loop through the List<string>, modify each string and write it to the file.
Code Example:
List<string> lstInput = new List<string>();
using (StreamReader reader = new StreamReader(@"input.asm"))
{
string sLine = string.Empty;
//read one line at a time
while ((sLine = reader.ReadLine()) != null)
{
lstInput.Add(sLine);
}
}
using (StreamWriter writer = new StreamWriter(@"output.hack"))
{
foreach(string sFullLine in lstInput)
{
string sNumber = sFullLine;
//remove leading # sign
if(sFullLine.StartsWith("#"))
sNumber = sFullLine.Substring(1);
int iNumber;
if(int.TryParse(sNumber, out iNumber))
{
writer.WriteLine(IntToBinaryString(iNumber));
}
}
}
public static string IntToBinaryString(int number)
{
const int mask = 1;
// Zero never enters the loop below, so handle it explicitly
if (number == 0)
return "0";
var binary = string.Empty;
while(number > 0)
{
// Bitwise AND extracts the lowest bit, which is prepended to the result string
binary = (number & mask) + binary;
number = number >> 1;
}
return binary;
}
Reference: IntToBinaryString method.
NOTE: The int-to-binary-string method in @TheDutchMan's answer is the better choice.
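Since the question already uses `Convert.ToString(value, 2)`, the 16-digit padding can also be done with `PadLeft` instead of counting zeroes by hand. This is a minimal sketch, assuming the values are non-negative and below 65536 so that they fit in 16 binary digits; `HackConverter` and `ToBinary16` are hypothetical names.

```csharp
using System;

public static class HackConverter
{
    // Strips leading '#' characters, parses the number and returns it as a
    // 16-character binary string, or null when the line is not a valid integer.
    public static string ToBinary16(string line)
    {
        string text = line.Trim().TrimStart('#');
        int value;
        if (!int.TryParse(text, out value)) return null;

        // Convert.ToString(value, 2) gives the shortest binary form;
        // PadLeft fills it with zeroes up to 16 characters.
        return Convert.ToString(value, 2).PadLeft(16, '0');
    }
}
```

Each line read from the .asm file can be passed through `ToBinary16` and, if the result is not null, written with `WriteLine` to the .hack file.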

Best way to split string into lines with maximum length, without breaking words

I want to break a string up into lines of a specified maximum length, without splitting any words, if possible (if there is a word that exceeds the maximum line length, then it will have to be split).
As always, I am acutely aware that strings are immutable and that one should preferably use the StringBuilder class. I have seen examples where the string is split into words and the lines are then built up using the StringBuilder class, but the code below seems "neater" to me.
I mentioned "best" in the description and not "most efficient" as I am also interested in the "eloquence" of the code. The strings will never be huge, generally splitting into 2 or three lines, and it won't be happening for thousands of lines.
Is the following code really bad?
private static IEnumerable<string> SplitToLines(string stringToSplit, int maximumLineLength)
{
stringToSplit = stringToSplit.Trim();
var lines = new List<string>();
while (stringToSplit.Length > 0)
{
if (stringToSplit.Length <= maximumLineLength)
{
lines.Add(stringToSplit);
break;
}
var indexOfLastSpaceInLine = stringToSplit.Substring(0, maximumLineLength).LastIndexOf(' ');
lines.Add(stringToSplit.Substring(0, indexOfLastSpaceInLine >= 0 ? indexOfLastSpaceInLine : maximumLineLength).Trim());
stringToSplit = stringToSplit.Substring(indexOfLastSpaceInLine >= 0 ? indexOfLastSpaceInLine + 1 : maximumLineLength);
}
return lines.ToArray();
}
Even though this post is 3 years old, I wanted to offer a better solution that uses Regex to accomplish the same thing:
If you want the string split so that the text can be displayed, you can use this:
public string SplitToLines(string stringToSplit, int maximumLineLength)
{
return Regex.Replace(stringToSplit, @"(.{1," + maximumLineLength +@"})(?:\s|$)", "$1\n");
}
If on the other hand you need a collection you can use this:
public MatchCollection SplitToLines(string stringToSplit, int maximumLineLength)
{
return Regex.Matches(stringToSplit, @"(.{1," + maximumLineLength +@"})(?:\s|$)");
}
NOTES
Remember to import regex (using System.Text.RegularExpressions;)
You can use string interpolation on the match:
$@"(.{{1,{maximumLineLength}}})(?:\s|$)"
The MatchCollection works almost like an Array
Matching example with explanation here
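To get plain strings rather than a MatchCollection, the first capture group of each match can be collected. The sketch below uses the same pattern; `RegexLineSplitter` is a hypothetical class name, and the method returns a `List<string>` as an assumption about what callers want.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class RegexLineSplitter
{
    public static List<string> SplitToLines(string text, int maximumLineLength)
    {
        // Up to maximumLineLength characters, followed by whitespace or the end of the input.
        string pattern = @"(.{1," + maximumLineLength + @"})(?:\s|$)";
        return Regex.Matches(text, pattern)
                    .Cast<Match>()
                    .Select(m => m.Groups[1].Value)   // group 1 excludes the trailing whitespace
                    .ToList();
    }
}
```

With a limit of 10, "The quick brown fox" comes back as "The quick" and "brown fox". Note that a word longer than the limit cannot be matched whole by this pattern, so part of it is skipped; inputs like that need one of the loop-based approaches.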
How about this as a solution:
IEnumerable<string> SplitToLines(string stringToSplit, int maximumLineLength)
{
var words = stringToSplit.Split(' ').Concat(new [] { "" });
return
words
.Skip(1)
.Aggregate(
words.Take(1).ToList(),
(a, w) =>
{
var last = a.Last();
while (last.Length > maximumLineLength)
{
a[a.Count() - 1] = last.Substring(0, maximumLineLength);
last = last.Substring(maximumLineLength);
a.Add(last);
}
var test = last + " " + w;
if (test.Length > maximumLineLength)
{
a.Add(w);
}
else
{
a[a.Count() - 1] = test;
}
return a;
});
}
I reworked this, as I prefer this version:
IEnumerable<string> SplitToLines(string stringToSplit, int maximumLineLength)
{
var words = stringToSplit.Split(' ');
var line = words.First();
foreach (var word in words.Skip(1))
{
var test = $"{line} {word}";
if (test.Length > maximumLineLength)
{
yield return line;
line = word;
}
else
{
line = test;
}
}
yield return line;
}
I don't think your solution is too bad. I do, however, think you should break up your ternary into an if else because you are testing the same condition twice. Your code might also have a bug. Based on your description, it seems you want lines <= maxLineLength, but your code counts the space after the last word and uses it in the <= comparison resulting in effectively < behavior for the trimmed string.
Here is my solution.
private static IEnumerable<string> SplitToLines(string stringToSplit, int maxLineLength)
{
string[] words = stringToSplit.Split(' ');
StringBuilder line = new StringBuilder();
foreach (string word in words)
{
if (word.Length + line.Length <= maxLineLength)
{
line.Append(word + " ");
}
else
{
if (line.Length > 0)
{
yield return line.ToString().Trim();
line.Clear();
}
string overflow = word;
while (overflow.Length > maxLineLength)
{
yield return overflow.Substring(0, maxLineLength);
overflow = overflow.Substring(maxLineLength);
}
line.Append(overflow + " ");
}
}
yield return line.ToString().Trim();
}
It is a bit longer than your solution, but it should be more straightforward. It also uses a StringBuilder so it is much faster for large strings. I performed a benchmarking test for 20,000 words ranging from 1 to 11 characters each split into lines of 10 character width. My method completed in 14ms compared to 1373ms for your method.
Try this (untested)
private static IEnumerable<string> SplitToLines(string value, int maximumLineLength)
{
var words = value.Split(' ');
var line = new StringBuilder();
foreach (var word in words)
{
if ((line.Length + word.Length) >= maximumLineLength)
{
yield return line.ToString();
line = new StringBuilder();
}
line.AppendFormat("{0}{1}", (line.Length>0) ? " " : "", word);
}
yield return line.ToString();
}
~6x faster than the accepted answer
More than 1.5x faster than the Regex version in Release Mode (dependent on line length)
Optionally keep the space at the end of the line or not (the regex version always keeps it)
static IEnumerable<string> SplitToLines(string stringToSplit, int maximumLineLength, bool removeSpace = true)
{
int start = 0;
int end = 0;
for (int i = 0; i < stringToSplit.Length; i++)
{
char c = stringToSplit[i];
if (c == ' ' || c == '\n')
{
if (i - start > maximumLineLength)
{
string substring = stringToSplit.Substring(start, end - start);
start = removeSpace ? end + 1 : end; // + 1 to remove the space on the next line
yield return substring;
}
else
end = i;
}
}
yield return stringToSplit.Substring(start); // remember last line
}
Here is the example code used to test speeds (again, run on your own machine and test in Release mode to get accurate timings)
https://dotnetfiddle.net/h5I1GC
Timings on my machine in release mode .Net 4.8
Accepted Answer: 667ms
Regex: 368ms
My Version: 117ms
My requirement was to have a line break at the last space before the 30-character limit.
So here is how I did it. I hope this helps anyone looking.
private string LineBreakLongString(string input)
{
var outputString = string.Empty;
var found = false;
int pos = 0;
int prev = 0;
while (!found)
{
var p = input.IndexOf(' ', pos);
{
if (pos <= 30)
{
pos++;
if (p < 30) { prev = p; }
}
else
{
found = true;
}
}
outputString = input.Substring(0, prev) + System.Environment.NewLine + input.Substring(prev, input.Length - prev).Trim();
}
return outputString;
}
An approach using recursive method and ReadOnlySpan (Tested)
public static void SplitToLines(ReadOnlySpan<char> stringToSplit, int index, ref List<string> values)
{
if (stringToSplit.IsEmpty || index < 1) return;
var nextIndex = stringToSplit.IndexOf(' ');
var slice = stringToSplit.Slice(0, nextIndex < 0 ? stringToSplit.Length : nextIndex);
if (slice.Length <= index)
{
values.Add(slice.ToString());
nextIndex++;
}
else
{
values.Add(slice.Slice(0, index).ToString());
nextIndex = index;
}
if (stringToSplit.Length <= index) return;
SplitToLines(stringToSplit.Slice(nextIndex), index, ref values);
}

Override StreamReader's ReadLine method

I'm trying to override a StreamReader's ReadLine method, but having difficulty doing so due to inability to access some private variables. Is this possible, or should I just write my own StreamReader class?
Assuming you want your custom StreamReader to be usable anywhere that a TextReader can be used there are typically two options.
Inherit from StreamReader and override the functions that you want to have work differently. In your case this would be StreamReader.ReadLine.
Inherit from TextReader and implement the reader functionality completely to your requirements.
NB: For option 2 above, you can maintain an internal reference to a StreamReader instance and delegate all the functions to the internal instance, except for the piece of functionality that you want to replace. In my view, this is just an implementation detail of option 2 rather than a 3rd option.
Based on your question I assume you have tried option 1 and found that overriding StreamReader.ReadLine is rather difficult because you could not access the internals of the class. Well for StreamReader you are lucky and can achieve this without having access to the internal implementation of the StreamReader.
Here is a simple example:
Disclaimer: The ReadLine() implementation is for demonstration purposes and is not intended to be a robust or complete implementation.
class CustomStreamReader : StreamReader
{
public CustomStreamReader(Stream stream)
: base(stream)
{
}
public override string ReadLine()
{
int c;
c = Read();
if (c == -1)
{
return null;
}
StringBuilder sb = new StringBuilder();
do
{
char ch = (char)c;
if (ch == ',')
{
return sb.ToString();
}
else
{
sb.Append(ch);
}
} while ((c = Read()) != -1);
return sb.ToString();
}
}
You will notice that I simply used the StreamReader.Read() method to read the characters from the stream. While definitely less performant than working directly with the internal buffers, the Read() method does use the internal buffering, so it should still yield pretty good performance, but that should be tested to confirm.
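To see the override in action, here is a small sketch that feeds such a reader from an in-memory stream and uses it through a plain `TextReader` reference. `CommaDelimitedReader` is a condensed copy of the class above so that the sketch compiles on its own; `ReaderDemo` and `ReadAll` are hypothetical helper names.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

// Condensed copy of the reader above: ReadLine() stops at a comma.
public class CommaDelimitedReader : StreamReader
{
    public CommaDelimitedReader(Stream stream) : base(stream) { }

    public override string ReadLine()
    {
        int c = Read();
        if (c == -1) return null;          // end of stream
        var sb = new StringBuilder();
        do
        {
            if ((char)c == ',') return sb.ToString();
            sb.Append((char)c);
        } while ((c = Read()) != -1);
        return sb.ToString();
    }
}

public static class ReaderDemo
{
    // Reads every "line" (here: comma-separated token) from an in-memory stream.
    public static List<string> ReadAll(string content)
    {
        var tokens = new List<string>();
        using (var stream = new MemoryStream(Encoding.UTF8.GetBytes(content)))
        using (TextReader reader = new CommaDelimitedReader(stream))
        {
            string token;
            while ((token = reader.ReadLine()) != null)
                tokens.Add(token);
        }
        return tokens;
    }
}
```

Because the variable is typed as `TextReader`, the call to `ReadLine()` goes through the virtual method and reaches the override, which is the point of inheriting instead of wrapping.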
For fun, here is an example of option 2. I used the encapsulated StreamReader to reduce the actual code; this is not tested at all.
class EncapsulatedReader : TextReader
{
private StreamReader _reader;
public EncapsulatedReader(Stream stream)
{
_reader = new StreamReader(stream);
}
public Stream BaseStream
{
get
{
return _reader.BaseStream;
}
}
public override string ReadLine()
{
int c;
c = Read();
if (c == -1)
{
return null;
}
StringBuilder sb = new StringBuilder();
do
{
char ch = (char)c;
if (ch == ',')
{
return sb.ToString();
}
else
{
sb.Append(ch);
}
} while ((c = Read()) != -1);
return sb.ToString();
}
protected override void Dispose(bool disposing)
{
if (disposing)
{
_reader.Close();
}
base.Dispose(disposing);
}
public override int Peek()
{
return _reader.Peek();
}
public override int Read()
{
return _reader.Read();
}
public override int Read(char[] buffer, int index, int count)
{
return _reader.Read(buffer, index, count);
}
public override int ReadBlock(char[] buffer, int index, int count)
{
return _reader.ReadBlock(buffer, index, count);
}
public override string ReadToEnd()
{
return _reader.ReadToEnd();
}
public override void Close()
{
_reader.Close();
base.Close();
}
}
This class can help you:
public class MyStreamReader : System.IO.StreamReader
{
public MyStreamReader(string path)
: base(path)
{
}
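public override string ReadLine()
{
// Read() returns a decoded character (or -1 at end of stream), not a raw byte
int c = base.Read();
if (c == -1)
return null; // signal end of stream, as TextReader.ReadLine does
var result = new System.Text.StringBuilder();
while (c != -1 && c != ',')
{
result.Append((char)c);
c = base.Read();
}
return result.ToString();
}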
}
Try this. I wrote it because I have some very large '|'-delimited files that contain \r\n inside some of the columns, and I needed to use \r\n as the end-of-line delimiter. I was trying to import the files using SSIS packages, but corrupted data in the files prevented that. The file was over 5 GB, so it was too large to open and fix manually. After reading through lots of forums to understand how streams work, I came up with a solution that reads each character in a file and emits a line based on the definitions I added to it. It is meant for use in a command-line application, complete with help :). I hope this helps some other people; I haven't found a solution quite like it anywhere else, although the ideas were inspired by this forum and others. This will not fix the files, it only splits them. Please be aware that this is still a work in progress :).
class Program
{
static long _fileposition = 0;
static void Main(string[] args)
{
// Check information passed in
if (args.Any())
{
if (args[0] == "/?")
{
var message = "Splits a file into smaller pieces";
message += "\n";
message += "\n";
message += "SplitFile [sourceFileName] [destinationFileName] [RowBatchAmount] [FirstRowHasHeader]";
message += "\n";
message += "\n";
message += " [sourceFileName] (STRING) required";
message += "\n";
message += " [destinationFileName] (STRING) will default to the same location as the sourceFileName";
message += "\n";
message += " [RowBatchAmount] (INT) will create files that have this many rows";
message += "\n";
message += " [FirstRowHasHeader] (True/False) Will Add Header Row to each new file";
Console.WriteLine(message);
}
else
{
string sourceFileName = args[0];
string destFileLocation = args.Count() >= 2 ? args[1] : sourceFileName.Substring(0, sourceFileName.LastIndexOf("\\"));
int RowCount = args.Count() >= 3 ? int.Parse(args[2]) : 500000;
bool FirstRowHasHeader = true;
FirstRowHasHeader = args.Count() != 4 || bool.Parse(args[3]);
// Create Directory If Needed
if (!Directory.Exists(destFileLocation))
{
Directory.CreateDirectory(destFileLocation);
}
string line = "";
int linecount = 0;
int FileNum = 1;
string newFileName = Path.Combine(destFileLocation, Path.GetFileNameWithoutExtension(sourceFileName));
newFileName += FileNum + Path.GetExtension(sourceFileName);
// Always add Header Line
string HeaderLine = GetLine(sourceFileName, _fileposition);
int HeaderCount = HeaderLine.Split('|').Count();
do
{
// Add Header Line
if ((linecount == 0 & FirstRowHasHeader) | (_fileposition == 1 & !FirstRowHasHeader))
{
using (FileStream NewFile = new FileStream(newFileName, FileMode.Append))
{
System.Text.ASCIIEncoding encoding = new System.Text.ASCIIEncoding();
Byte[] bytes = encoding.GetBytes(HeaderLine);
int length = encoding.GetByteCount(HeaderLine);
NewFile.Write(bytes, 0, length);
}
}
//Evaluate Line
line = GetLine(sourceFileName, _fileposition, HeaderCount);
if (line == null) continue;
// Create File if it doesn't exist and write to it
using (FileStream NewFile = new FileStream(newFileName, FileMode.Append))
{
System.Text.ASCIIEncoding encoding = new System.Text.ASCIIEncoding();
Byte[] bytes = encoding.GetBytes(line);
int length = encoding.GetByteCount(line);
NewFile.Write(bytes, 0, length);
}
//Add to the line count
linecount++;
//Create new FileName if needed
if (linecount == RowCount)
{
FileNum++;
// Create a new sub File, and read into it
newFileName = Path.Combine(destFileLocation, Path.GetFileNameWithoutExtension(sourceFileName));
newFileName += FileNum + Path.GetExtension(sourceFileName);
linecount = 0;
}
} while (line != null);
}
}
else
{
Console.WriteLine("You must provide sourcefile!");
Console.WriteLine("use /? for help");
}
}
static string GetLine(string sourceFileName, long position, int NumberOfColumns = 0)
{
byte[] buffer = new byte[65536];
var builder = new StringBuilder();
var finishedline = false;
using (Stream source = File.OpenRead(sourceFileName))
{
source.Position = position;
var crlf = "\r\n";
var lf = "\n";
var length = source.Length;
while (source.Position = 0 & finishedline == false & _fileposition = NumberOfColumns) | NumberOfColumns == 0)
{
// Remove all Control Line Feeds before the end of the line.
builder = builder.Replace(crlf, lf);
// Add Final Control Line Feed
var x = (char)NewLine.Read();
builder.Append(x);
finishedline = true;
_fileposition++;
continue;
}
}
break;
}
default:
builder.Append(c);
break;
}
}
}
break;
}
}
return (builder.ToString() == "" ? null: builder.ToString());
}
}
References: http://social.msdn.microsoft.com/forums/en-US/csharpgeneral/thread/b0d4cba1-471a-4260-94c1-fddd4244fa23/
this one helped me the most: https://stackoverflow.com/a/668003/1582188

Splitting CamelCase

This is all asp.net c#.
I have an enum
public enum ControlSelectionType
{
NotApplicable = 1,
SingleSelectRadioButtons = 2,
SingleSelectDropDownList = 3,
MultiSelectCheckBox = 4,
MultiSelectListBox = 5
}
The numerical value of this is stored in my database. I display this value in a datagrid.
<asp:boundcolumn datafield="ControlSelectionTypeId" headertext="Control Type"></asp:boundcolumn>
The ID means nothing to a user so I have changed the boundcolumn to a template column with the following.
<asp:TemplateColumn>
<ItemTemplate>
<%# Enum.Parse(typeof(ControlSelectionType), DataBinder.Eval(Container.DataItem, "ControlSelectionTypeId").ToString()).ToString()%>
</ItemTemplate>
</asp:TemplateColumn>
This is a lot better... However, it would be great if there was a simple function I can put around the Enum to split it by Camel case so that the words wrap nicely in the datagrid.
Note: I am fully aware that there are better ways of doing all this. This screen is purely used internally and I just want a quick hack in place to display it a little better.
I used:
public static string SplitCamelCase(string input)
{
return System.Text.RegularExpressions.Regex.Replace(input, "([A-Z])", " $1", System.Text.RegularExpressions.RegexOptions.Compiled).Trim();
}
Taken from http://weblogs.asp.net/jgalloway/archive/2005/09/27/426087.aspx
vb.net:
Public Shared Function SplitCamelCase(ByVal input As String) As String
Return System.Text.RegularExpressions.Regex.Replace(input, "([A-Z])", " $1", System.Text.RegularExpressions.RegexOptions.Compiled).Trim()
End Function
Here is a dotnet Fiddle for online execution of the c# code.
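To tie this back to the enum in the question, here is a short sketch of how the helper behaves on the enum names. `CamelCaseDemo` and `DisplayName` are hypothetical names, and `RegexOptions.Compiled` is left out only for brevity.

```csharp
using System.Text.RegularExpressions;

public enum ControlSelectionType
{
    NotApplicable = 1,
    SingleSelectRadioButtons = 2,
    SingleSelectDropDownList = 3,
    MultiSelectCheckBox = 4,
    MultiSelectListBox = 5
}

public static class CamelCaseDemo
{
    public static string SplitCamelCase(string input)
    {
        // Insert a space before every capital letter, then trim the leading one.
        return Regex.Replace(input, "([A-Z])", " $1").Trim();
    }

    // Turns the numeric id stored in the database into display text.
    public static string DisplayName(int controlSelectionTypeId)
    {
        var value = (ControlSelectionType)controlSelectionTypeId;
        return SplitCamelCase(value.ToString());
    }
}
```

`DisplayName(3)` gives "Single Select Drop Down List". Because a space goes in front of every capital, a name containing an acronym such as "MyHTTPServer" becomes "My H T T P Server", which is fine for this enum but worth knowing before reusing the helper elsewhere.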
Indeed, a regex replace is the way to go, as described in the other answer. However, this might also be of use to you if you want to go in a different direction:
using System.ComponentModel;
using System.Reflection;
...
public static string GetDescription(System.Enum value)
{
FieldInfo fi = value.GetType().GetField(value.ToString());
DescriptionAttribute[] attributes = (DescriptionAttribute[])fi.GetCustomAttributes(typeof(DescriptionAttribute), false);
if (attributes.Length > 0)
return attributes[0].Description;
else
return value.ToString();
}
This will allow you to define your enums as:
public enum ControlSelectionType
{
[Description("Not Applicable")]
NotApplicable = 1,
[Description("Single Select Radio Buttons")]
SingleSelectRadioButtons = 2,
[Description("Completely Different Display Text")]
SingleSelectDropDownList = 3,
}
Taken from
http://www.codeguru.com/forum/archive/index.php/t-412868.html
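A minimal, self-contained sketch of how the attribute lookup behaves, including the fallback for members that have no attribute. `SampleSelectionType` and `EnumText` are hypothetical names used only for this illustration.

```csharp
using System;
using System.ComponentModel;
using System.Reflection;

public enum SampleSelectionType
{
    [Description("Not Applicable")]
    NotApplicable = 1,
    SingleSelectRadioButtons = 2   // no attribute: falls back to the member name
}

public static class EnumText
{
    public static string GetDescription(Enum value)
    {
        // Look up the enum member by name and read its [Description], if any.
        FieldInfo field = value.GetType().GetField(value.ToString());
        var attributes = (DescriptionAttribute[])field.GetCustomAttributes(typeof(DescriptionAttribute), false);
        return attributes.Length > 0 ? attributes[0].Description : value.ToString();
    }
}
```

`EnumText.GetDescription(SampleSelectionType.NotApplicable)` returns "Not Applicable", while the undecorated member returns its own name, so the two approaches can be mixed: decorate only the members whose display text should differ from what a camel-case split would produce.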
This regex (^[a-z]+|[A-Z]+(?![a-z])|[A-Z][a-z]+) can be used to extract all words from the camelCase or PascalCase name. It also works with abbreviations anywhere inside the name.
MyHTTPServer will contain exactly 3 matches: My, HTTP, Server
myNewXMLFile will contain 4 matches: my, New, XML, File
You could then join them into a single string using string.Join.
string name = "myNewUIControl";
string[] words = Regex.Matches(name, "(^[a-z]+|[A-Z]+(?![a-z])|[A-Z][a-z]+)")
.OfType<Match>()
.Select(m => m.Value)
.ToArray();
string result = string.Join(" ", words);
As @DanielB noted in the comments, that regex won't work for numbers (or underscores), so here is an improved version that supports any identifier with words, acronyms, numbers and underscores (a slightly modified version of @JoeJohnston's), see online demo (fiddle):
([A-Z]+(?![a-z])|[A-Z][a-z]+|[0-9]+|[a-z]+)
Extreme example: __snake_case12_camelCase_TLA1ABC → snake, case, 12, camel, Case, TLA, 1, ABC
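The improved pattern can be wrapped the same way as the earlier snippet. This is only a sketch; `IdentifierWords` and `Split` are hypothetical names.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class IdentifierWords
{
    // Acronyms, capitalized words, digit runs and lowercase words, in that order of preference.
    private static readonly Regex WordPattern =
        new Regex("([A-Z]+(?![a-z])|[A-Z][a-z]+|[0-9]+|[a-z]+)");

    public static string[] Split(string identifier)
    {
        return WordPattern.Matches(identifier)
                          .Cast<Match>()
                          .Select(m => m.Value)
                          .ToArray();
    }
}
```

Underscores match none of the alternatives, so they are simply skipped between matches; `string.Join(" ", IdentifierWords.Split(name))` then produces the display text.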
Tillito's answer does not handle strings already containing spaces well, or Acronyms. This fixes it:
public static string SplitCamelCase(string input)
{
return Regex.Replace(input, "(?<=[a-z])([A-Z])", " $1", RegexOptions.Compiled);
}
If C# 3.0 is an option you can use the following one-liner to do the job:
Regex.Matches(YOUR_ENUM_VALUE_NAME, "[A-Z][a-z]+").OfType<Match>().Select(match => match.Value).Aggregate((acc, b) => acc + " " + b).TrimStart(' ');
Here's an extension method that handles numbers and multiple uppercase characters sanely, and also allows for upper-casing specific acronyms in the final string:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Globalization;
using System.Text.RegularExpressions;
using System.Web.Configuration;
namespace System
{
/// <summary>
/// Extension methods for the string data type
/// </summary>
public static class ConventionBasedFormattingExtensions
{
/// <summary>
/// Turn CamelCaseText into Camel Case Text.
/// </summary>
/// <param name="input"></param>
/// <returns></returns>
/// <remarks>Use AppSettings["SplitCamelCase_AllCapsWords"] to specify a comma-delimited list of words that should be ALL CAPS after split</remarks>
/// <example>
/// wordWordIDWord1WordWORDWord32Word2
/// Word Word ID Word 1 Word WORD Word 32 Word 2
///
/// wordWordIDWord1WordWORDWord32WordID2ID
/// Word Word ID Word 1 Word WORD Word 32 Word ID 2 ID
///
/// WordWordIDWord1WordWORDWord32Word2Aa
/// Word Word ID Word 1 Word WORD Word 32 Word 2 Aa
///
/// wordWordIDWord1WordWORDWord32Word2A
/// Word Word ID Word 1 Word WORD Word 32 Word 2 A
/// </example>
public static string SplitCamelCase(this string input)
{
if (input == null) return null;
if (string.IsNullOrWhiteSpace(input)) return "";
var separated = input;
separated = SplitCamelCaseRegex.Replace(separated, @" $1").Trim();
//Set ALL CAPS words
if (_SplitCamelCase_AllCapsWords.Any())
foreach (var word in _SplitCamelCase_AllCapsWords)
separated = SplitCamelCase_AllCapsWords_Regexes[word].Replace(separated, word.ToUpper());
//Capitalize first letter
var firstChar = separated.First(); //NullOrWhiteSpace handled earlier
if (char.IsLower(firstChar))
separated = char.ToUpper(firstChar) + separated.Substring(1);
return separated;
}
private static readonly Regex SplitCamelCaseRegex = new Regex(@"
(
(?<=[a-z])[A-Z0-9] (?# lower-to-other boundaries )
|
(?<=[0-9])[a-zA-Z] (?# number-to-other boundaries )
|
(?<=[A-Z])[0-9] (?# cap-to-number boundaries; handles a specific issue with the next condition )
|
(?<=[A-Z])[A-Z](?=[a-z]) (?# handles longer strings of caps like ID or CMS by splitting off the last capital )
)"
, RegexOptions.Compiled | RegexOptions.IgnorePatternWhitespace
);
private static readonly string[] _SplitCamelCase_AllCapsWords =
(WebConfigurationManager.AppSettings["SplitCamelCase_AllCapsWords"] ?? "")
.Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries)
.Select(a => a.ToLowerInvariant().Trim())
.ToArray()
;
private static Dictionary<string, Regex> _SplitCamelCase_AllCapsWords_Regexes;
private static Dictionary<string, Regex> SplitCamelCase_AllCapsWords_Regexes
{
get
{
if (_SplitCamelCase_AllCapsWords_Regexes == null)
{
_SplitCamelCase_AllCapsWords_Regexes = new Dictionary<string,Regex>();
foreach(var word in _SplitCamelCase_AllCapsWords)
_SplitCamelCase_AllCapsWords_Regexes.Add(word, new Regex(@"\b" + word + @"\b", RegexOptions.Compiled | RegexOptions.IgnoreCase));
}
return _SplitCamelCase_AllCapsWords_Regexes;
}
}
}
}
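The extension method above reads its acronym list through WebConfigurationManager, which ties it to System.Web. If you only want the boundary rules, the following is a stripped-down sketch with the same four alternatives written on one line and no configuration lookup. CamelCaseSplitter and Split are names chosen for this example, and the ALL CAPS word handling is deliberately left out.

```csharp
using System.Text.RegularExpressions;

// Hypothetical, config-free variant for illustration only.
public static class CamelCaseSplitter
{
    // Same four boundary rules as above: lower-to-other, number-to-letter,
    // cap-to-number, and the last capital of a run that starts a new word.
    private static readonly Regex Boundary = new Regex(
        @"(?<=[a-z])[A-Z0-9]|(?<=[0-9])[a-zA-Z]|(?<=[A-Z])[0-9]|(?<=[A-Z])[A-Z](?=[a-z])");

    public static string Split(string input)
    {
        if (input == null) return null;
        if (string.IsNullOrWhiteSpace(input)) return "";

        // $0 is the whole match, i.e. the single character that starts a new word.
        var separated = Boundary.Replace(input, " $0").Trim();

        // Capitalise the first letter, as the original method does.
        return char.ToUpper(separated[0]) + separated.Substring(1);
    }
}
```

With this sketch, CamelCaseSplitter.Split("wordWordIDWord1WordWORDWord32Word2") returns "Word Word ID Word 1 Word WORD Word 32 Word 2", matching the first example in the doc comment, and "HTMLParser" becomes "HTML Parser".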
You can use a C# extension method:
public static string SpacesFromCamel(this string value)
{
if (value.Length > 0)
{
var result = new List<char>();
char[] array = value.ToCharArray();
foreach (var item in array)
{
if (char.IsUpper(item) && result.Count > 0)
{
result.Add(' ');
}
result.Add(item);
}
return new string(result.ToArray());
}
return value;
}
Then you can use it like this:
var result = "TestString".SpacesFromCamel();
Result will be
Test String
Using LINQ:
var chars = ControlSelectionType.NotApplicable.ToString().SelectMany((x, i) => i > 0 && char.IsUpper(x) ? new char[] { ' ', x } : new char[] { x });
Console.WriteLine(new string(chars.ToArray()));
I also have an enum whose names I had to separate. In my case this method solved the problem:
string SeparateCamelCase(string str)
{
for (int i = 1; i < str.Length; i++)
{
if (char.IsUpper(str[i]))
{
str = str.Insert(i, " ");
i++;
}
}
return str;
}
public enum ControlSelectionType
{
NotApplicable = 1,
SingleSelectRadioButtons = 2,
SingleSelectDropDownList = 3,
MultiSelectCheckBox = 4,
MultiSelectListBox = 5
}
public class NameValue
{
public string Name { get; set; }
public object Value { get; set; }
}
public static List<NameValue> EnumToList<T>(bool camelcase)
{
var array = (T[])(Enum.GetValues(typeof(T)).Cast<T>());
var array2 = Enum.GetNames(typeof(T)).ToArray<string>();
List<NameValue> lst = null;
for (int i = 0; i < array.Length; i++)
{
if (lst == null)
lst = new List<NameValue>();
string name = "";
if (camelcase)
{
name = array2[i].CamelCaseFriendly();
}
else
name = array2[i];
T value = array[i];
lst.Add(new NameValue { Name = name, Value = value });
}
return lst;
}
public static string CamelCaseFriendly(this string pascalCaseString)
{
Regex r = new Regex("(?<=[a-z])(?<x>[A-Z])|(?<=.)(?<x>[A-Z])(?=[a-z])");
return r.Replace(pascalCaseString, " ${x}");
}
//In your form
protected void Button1_Click1(object sender, EventArgs e)
{
DropDownList1.DataSource = GeneralClass.EnumToList<ControlSelectionType>(true);
DropDownList1.DataTextField = "Name";
DropDownList1.DataValueField = "Value";
DropDownList1.DataBind();
}
The solution from Eoin Campbell works well unless you have a web service.
In that case you need to do the following, because the Description attribute is not serializable.
[DataContract]
public enum ControlSelectionType
{
[EnumMember(Value = "Not Applicable")]
NotApplicable = 1,
[EnumMember(Value = "Single Select Radio Buttons")]
SingleSelectRadioButtons = 2,
[EnumMember(Value = "Completely Different Display Text")]
SingleSelectDropDownList = 3,
}
public static string GetDescriptionFromEnumValue(Enum value)
{
EnumMemberAttribute attribute = value.GetType()
.GetField(value.ToString())
.GetCustomAttributes(typeof(EnumMemberAttribute), false)
.SingleOrDefault() as EnumMemberAttribute;
return attribute == null ? value.ToString() : attribute.Value;
}
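To show how the lookup behaves when a member has no attribute, here is a small self-contained sketch. DemoSelectionType and EnumText are invented for this example, and the null check on the field is an addition that the original method does not have (without it, a value that is not a defined member would cause a NullReferenceException).

```csharp
using System;
using System.Linq;
using System.Runtime.Serialization;

// Hypothetical enum for illustration; the second member deliberately has no attribute.
[DataContract]
public enum DemoSelectionType
{
    [EnumMember(Value = "Not Applicable")]
    NotApplicable = 1,
    MultiSelectCheckBox = 4
}

public static class EnumText
{
    public static string GetDescriptionFromEnumValue(Enum value)
    {
        var field = value.GetType().GetField(value.ToString());

        // Added guard (not in the original): undefined values have no field.
        if (field == null) return value.ToString();

        var attribute = field
            .GetCustomAttributes(typeof(EnumMemberAttribute), false)
            .SingleOrDefault() as EnumMemberAttribute;

        // Fall back to the member name when no EnumMember attribute is present.
        return attribute == null ? value.ToString() : attribute.Value;
    }
}
```

Here EnumText.GetDescriptionFromEnumValue(DemoSelectionType.NotApplicable) returns "Not Applicable", while DemoSelectionType.MultiSelectCheckBox falls back to "MultiSelectCheckBox".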
And if you don't fancy using regex - try this:
public static string SeperateByCamelCase(this string text, char splitChar = ' ') {
var output = new StringBuilder();
for (int i = 0; i < text.Length; i++)
{
var c = text[i];
//if not the first and the char is upper
if (i > 0 && char.IsUpper(c)) {
var wasLastLower = char.IsLower(text[i - 1]);
if (i + 1 < text.Length) //is there a next
{
var isNextUpper = char.IsUpper(text[i + 1]);
if (!isNextUpper) //if next is not upper (start of a word).
{
output.Append(splitChar);
}
else if (wasLastLower) //last was lower, current is upper and next is upper (start of an acronym): 'abcdHTTP' becomes 'abcd HTTP'
{
output.Append(splitChar);
}
}
else
{
//last letter: if it is upper and the previous letter was lower, 'abcdA' becomes 'abcd A'
if (wasLastLower)
{
output.Append(splitChar);
}
}
}
output.Append(c);
}
return output.ToString();
}
It passes these tests. It doesn't handle numbers, but I didn't need it to.
[TestMethod()]
public void ToCamelCaseTest()
{
var testData = new string[] { "AAACamel", "AAA", "SplitThisByCamel", "AnA", "doesnothing", "a", "A", "aasdasdAAA" };
var expectedData = new string[] { "AAA Camel", "AAA", "Split This By Camel", "An A", "doesnothing", "a", "A", "aasdasd AAA" };
for (int i = 0; i < testData.Length; i++)
{
var actual = testData[i].SeperateByCamelCase();
var expected = expectedData[i];
Assert.AreEqual(expected, actual);
}
}
#JustSayNoToRegex
Takes a C# identifier, with underscores and numbers, and converts it to a space-separated string.
public static class StringExtensions
{
public static string SplitOnCase(this string identifier)
{
if (identifier == null || identifier.Length == 0) return string.Empty;
var sb = new StringBuilder();
if (identifier.Length == 1) sb.Append(char.ToUpperInvariant(identifier[0]));
else if (identifier.Length == 2) sb.Append(char.ToUpperInvariant(identifier[0])).Append(identifier[1]);
else {
if (identifier[0] != '_') sb.Append(char.ToUpperInvariant(identifier[0]));
for (int i = 1; i < identifier.Length; i++) {
var current = identifier[i];
var previous = identifier[i - 1];
if (current == '_' && previous == '_') continue;
else if (current == '_') {
sb.Append(' ');
}
else if (char.IsLetter(current) && previous == '_') {
sb.Append(char.ToUpperInvariant(current));
}
else if (char.IsDigit(current) && char.IsLetter(previous)) {
sb.Append(' ').Append(current);
}
else if (char.IsLetter(current) && char.IsDigit(previous)) {
sb.Append(' ').Append(char.ToUpperInvariant(current));
}
else if (char.IsUpper(current) && char.IsLower(previous)
&& (i < identifier.Length - 1 && char.IsUpper(identifier[i + 1]) || i == identifier.Length - 1)) {
sb.Append(' ').Append(current);
}
else if (char.IsUpper(current) && i < identifier.Length - 1 && char.IsLower(identifier[i + 1])) {
sb.Append(' ').Append(current);
}
else {
sb.Append(current);
}
}
}
return sb.ToString();
}
}
Tests:
[TestFixture]
static class HelpersTests
{
[Test]
public static void Basic()
{
Assert.AreEqual("Foo", "foo".SplitOnCase());
Assert.AreEqual("Foo", "_foo".SplitOnCase());
Assert.AreEqual("Foo", "__foo".SplitOnCase());
Assert.AreEqual("Foo", "___foo".SplitOnCase());
Assert.AreEqual("Foo 2", "foo2".SplitOnCase());
Assert.AreEqual("Foo 23", "foo23".SplitOnCase());
Assert.AreEqual("Foo 23 A", "foo23A".SplitOnCase());
Assert.AreEqual("Foo 23 Ab", "foo23Ab".SplitOnCase());
Assert.AreEqual("Foo 23 Ab", "foo23_ab".SplitOnCase());
Assert.AreEqual("Foo 23 Ab", "foo23___ab".SplitOnCase());
Assert.AreEqual("Foo 23", "foo__23".SplitOnCase());
Assert.AreEqual("Foo Bar", "Foo_bar".SplitOnCase());
Assert.AreEqual("Foo Bar", "Foo____bar".SplitOnCase());
Assert.AreEqual("AAA", "AAA".SplitOnCase());
Assert.AreEqual("Foo A Aa", "fooAAa".SplitOnCase());
Assert.AreEqual("Foo AAA", "fooAAA".SplitOnCase());
Assert.AreEqual("Foo Bar", "FooBar".SplitOnCase());
Assert.AreEqual("Mn M", "MnM".SplitOnCase());
Assert.AreEqual("AS", "aS".SplitOnCase());
Assert.AreEqual("As", "as".SplitOnCase());
Assert.AreEqual("A", "a".SplitOnCase());
Assert.AreEqual("_", "_".SplitOnCase());
}
}
A simple version, similar to some of the above, but with logic to not auto-insert the separator (a space by default, but it can be any char) if there's already one at the current position.
Uses a StringBuilder rather than 'mutating' strings.
public static string SeparateCamelCase(this string value, char separator = ' ') {
var sb = new StringBuilder();
var lastChar = separator;
foreach (var currentChar in value) {
if (char.IsUpper(currentChar) && lastChar != separator)
sb.Append(separator);
sb.Append(currentChar);
lastChar = currentChar;
}
return sb.ToString();
}
Example:
Input : 'ThisIsATest'
Output : 'This Is A Test'
Input : 'This IsATest'
Output : 'This Is A Test' (Note: Still only one space between 'This' and 'Is')
Input : 'ThisIsATest' (with separator '_')
Output : 'This_Is_A_Test'
Try this:
using System;
using System.Linq;
using System.Collections.Generic;
public class Program
{
public static void Main()
{
Console
.WriteLine(
SeparateByCamelCase("TestString") == "Test String" // True
);
}
public static string SeparateByCamelCase(string str)
{
return String.Join(" ", SplitByCamelCase(str));
}
public static IEnumerable<string> SplitByCamelCase(string str)
{
if (str.Length == 0)
return new List<string>();
return
new List<string>
{
Head(str)
}
.Concat(
SplitByCamelCase(
Tail(str)
)
);
}
public static string Head(string str)
{
return new String(
str
.Take(1)
.Concat(
str
.Skip(1)
.TakeWhile(IsLower)
)
.ToArray()
);
}
public static string Tail(string str)
{
return new String(
str
.Skip(
Head(str).Length
)
.ToArray()
);
}
public static bool IsLower(char ch)
{
return ch >= 'a' && ch <= 'z';
}
}
See sample online
