I need to process a large amount of csv data in real time as it is spat out by a TCP port. Here is an example as displayed by Putty:
MSG,3,1920,742,4009C5,14205994,2017/01/29,20:14:27.065,2017/01/29,20:14:27.972,,8000,,,51.26582,-0.33783,,,0,0,0,0
MSG,4,1920,742,4009C5,14205994,2017/01/29,20:14:27.065,2017/01/29,20:14:27.972,,,212.9,242.0,,,0,,,,,
MSG,1,1920,742,4009C5,14205994,2017/01/29,20:14:27.065,2017/01/29,20:14:27.972,BAW469,,,,,,,,,,,
MSG,3,1920,742,4009C5,14205994,2017/01/29,20:14:27.284,2017/01/29,20:14:27.972,,8000,,,51.26559,-0.33835,,,0,0,0,0
MSG,4,1920,742,4009C5,14205994,2017/01/29,20:14:27.284,2017/01/29,20:14:27.972,,,212.9,242.0,,,0,,,,,
I need to put each line of data in string (line) into an array (linedata[]) so that I can read and process certain elements, but linedata = line.Split(','); seems to ignore the many empty elements, with the result that linedata[20], for example, may or may not exist, and if it doesn't I get an error if I try to read it. Even if element 20 in the line contains a value it won't necessarily be the 20th element in the array. And that's no good.
I can work out how to parse line character by character into linedata[], inserting an empty string where appropriate, but surely there must be a better way? Have I missed something obvious?
Many Thanks. Perhaps I'd better add that I'm quite new to C#, my past experience is all with Delphi 7. I really miss stringlists.
Edited: sorry, this is now resolved with the help of MSDN's documentation. This code works: lineData = line.Split(separators, StringSplitOptions.None); after setting "string[] separators = { "," };". My big mistake was to follow examples on tutorial sites which gave no hint that the Split method has any options.
https://msdn.microsoft.com/en-us/library/system.stringsplitoptions(v=vs.110).aspx
That link has an example section, look at example 1b specifically. There is an extra parameter to Split called StringSplitOptions which does this.
For Example:
char[] charSeparators = { ',' };
string[] linedata = line.Split(charSeparators, StringSplitOptions.None);
foreach (string field in linedata)
{
Console.Write("<{0}>", field);
}
Console.Write("\n\n");
The way to find this sort of information is to start with the Reference Documentation for the function, and hope it has an option or a link to a similar function.
If you want to also start validating types, handling variants in the format etc... you could move up to a CSV library. If you do not need that functionality, this is the easiest way and efficient for small files.
Some of the overloads for String.Split() take a StringSplitOptions argument, and if you use the RemoveEmptyEntries option, it will...remove the empty entries. So you can specify the None option:
linedata = line.Split(new [] { ',' }, StringSplitOptions.None);
Or better yet, use the overload that doesn't take a StringSplitOptions, which treats it as None by default:
linedata = line.Split(',');
The code in your question indicates that you are doing this, but your description of the problem suggests that you are not.
However, you're probably better off using an actual CSV parser, which would handle things like unescaping and so on.
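To make the difference between the two options concrete, here is a minimal sketch. The class and method names (SplitDemo, CountFields) are made up for illustration; only String.Split and StringSplitOptions come from the framework.

```csharp
using System;

public static class SplitDemo
{
    // Hypothetical helper: counts the fields that Split produces for one CSV line.
    public static int CountFields(string line, StringSplitOptions options)
    {
        return line.Split(new[] { ',' }, options).Length;
    }

    // Equivalent to the plain overload, which keeps empty entries by default.
    public static int CountFieldsDefault(string line)
    {
        return line.Split(',').Length;
    }
}
```

For the first sample line in the question, StringSplitOptions.None and the plain Split(',') both give 22 fields, so linedata[20] always exists, while RemoveEmptyEntries gives only 17 and shifts the positions.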
The StringReader class provides methods for reading lines, characters, or blocks of characters from a string. Hope this could be the clue
string str = @"MSG,3,1920,742,4009C5,14205994,2017/01/29,20:14:27.065,2017/01/29,20:14:27.972,,8000,,,51.26582,-0.33783,,,0,0,0,0
MSG,4,1920,742,4009C5,14205994,2017/01/29,20:14:27.065,2017/01/29,20:14:27.972,,,212.9,242.0,,,0,,,,,
MSG,1,1920,742,4009C5,14205994,2017/01/29,20:14:27.065,2017/01/29,20:14:27.972,BAW469,,,,,,,,,,,
MSG,3,1920,742,4009C5,14205994,2017/01/29,20:14:27.284,2017/01/29,20:14:27.972,,8000,,,51.26559,-0.33835,,,0,0,0,0
MSG,4,1920,742,4009C5,14205994,2017/01/29,20:14:27.284,2017/01/29,20:14:27.972,,,212.9,242.0,,,0,,,,,";
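using (StringReader reader = new StringReader(str))
{
string line;
// ReadLine returns null at the end of the input; calling Read() here would swallow the first character of the next line.
while ((line = reader.ReadLine()) != null)
{
string[] linedata = line.Split(',');
}
}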
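The same line-by-line loop works for any TextReader, not just a StringReader. The sketch below is one way to package it; CsvLines, ReadRecords and FieldOrEmpty are hypothetical names, and skipping blank lines is an assumption about what the feed needs.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class CsvLines
{
    // Hypothetical helper: yields the fields of every non-blank line.
    public static IEnumerable<string[]> ReadRecords(TextReader reader)
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            if (line.Length == 0) continue; // assumption: blank lines carry no data
            yield return line.Split(',');   // empty fields are kept
        }
    }

    // Returns the field at index, or an empty string if the line was shorter than expected.
    public static string FieldOrEmpty(string[] fields, int index)
    {
        return index >= 0 && index < fields.Length ? fields[index] : string.Empty;
    }
}
```

For the TCP case in the question, the reader could be a StreamReader wrapped around the NetworkStream returned by TcpClient.GetStream(); ReadLine then blocks until the next full line arrives.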
While you should look into the various ways the String class can help you here, sometimes the quick and dirty "MAKE it fit" option is called for. In this case, that'd be to roll through the strings in advance and ensure you have at least one character between the commas.
public static string FixIt(string s)
{
return s.Replace(",,", ", ,");
}
You should be able to:
var lineData = FixIt(line).Split(',');
Edit: In response to the question below, I'm not sure what you meant, but if you mean doing it without creating a helper method, you can do so easily. The code will be harder to read and troubleshoot if you do it in one line though. My personal rule is, if you have to do it a LOT, it should probably be a method. If you only had to do it once, this is particularly clean. I'd actually do it this way and just wrap it in a method that does all the work for you.
var lineData = line.Replace(",,", ", ,").Split(',');
As a method, it'd be:
public static string[] GiveMeAnArray(string s)
{
return s.Replace(",,", ", ,").Split(',');
}
Is it possible to add some descriptive text to a string format specifier?
Example:
string.Format ("{0:ForeName} is not at home", person.ForeName);
In the example ForeName is added as description.
The above syntax is obviously incorrect, but just to show the idea.
The reason I am asking, is because in my case the strings are in a resource file, so in the resource file you currently only see
{0} is not at home
in some cases it is hard to grasp what the context of {0} is.
EDIT:
In C# 6, string interpolation with the $ prefix was introduced, so string.Format is no longer needed when the string lives in code:
$"{person.ForeName} is not at home";
We usually put comments into our resources file e.g. {0} = Forename.
Then anybody who might be translating the string knows what {0} represents and can translate accordingly.
Also if you use ReSharper, you can enter the comment at the same time when you are adding your string to resources.
Phil Haack and Peli have written a couple of interesting blog posts about alternatives to the default string.format function. They might interest you.
Basically they allow you to use object properties inside the format string like this:
string s = NamedFormat("Hello {FullName} ({EmailAddress})!", person);
You can find the related blog posts here:
http://blog.dotnetwiki.org/2009/01/16/NamedFormatsPexTestimonium.aspx
http://haacked.com/archive/2009/01/14/named-formats-redux.aspx/
http://haacked.com/archive/2009/01/04/fun-with-named-formats-string-parsing-and-edge-cases.aspx/
Perhaps one of the solutions covered in those blog posts would suit your needs.
For strings your method should work, since strings will ignore any format specifiers. However you run the risk of accidentally using that for non-string types, in which case the string will either be translated as format codes or literally displayed:
string.Format ("{0:ForeName} is not at home", "Johnny");
//"Johnny is not at home"
string.Format ("{0:ForeName} will be home at {1:HomeTime}", "Johnny", DateTime.Today);
//Johnny will be home at 0o0eTi0e -- H, h, and m are DateTime format codes.
However since you're storing these in a resource file, I would instead use the "comment" field in the resource file - you could store a copy of the format string and add your descriptions there.
There is no built-in C# function for that. The best I can propose is to insert a comment (this has no performance impact):
string.Format ("{0"/*ForeName*/+"} is not at home", person.ForeName);
Personally, I don't find it readable. The best approach is to use a third-party tool, as David Khaykin suggested in a comment (see this answer).
IDEOne.com demo
Here is a somewhat naive implementation of StackOverflow's formatUnicorn method:
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.Reflection;
public class Test
{
public static void Main()
{
string formatString = "{firstName} {lastName} is awesome.";
Console.WriteLine(formatString.FormatUnicorn(new {
firstName = "joe",
lastName = "blow"
}));
}
}
public static class StringExtensions {
public static string FormatUnicorn(this string str, object arguments) {
string output = str;
Type type = arguments.GetType();
foreach (PropertyInfo property in type.GetProperties())
{
Regex regex = new Regex(@"\{" + property.Name + @"\}");
output = regex.Replace(output, property.GetValue(arguments, null).ToString());
}
return output;
}
}
The biggest drawback here is the use of reflection, which can be slow. The other is that it doesn't allow for format specifiers.
A better approach might be to create a more complex regular expression that just strips out the comments.
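A minimal sketch of that comment-stripping idea follows. It assumes the question's convention, where everything after the colon in a numbered placeholder is a description, so genuine format specifiers such as {0:N2} cannot be used in the same strings. FormatDescriptions and its methods are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FormatDescriptions
{
    // Matches {0:AnyDescription} and captures the index.
    private static readonly Regex Described = new Regex(@"\{(\d+):[^{}]*\}");

    // Turns "{0:ForeName} is not at home" into "{0} is not at home".
    public static string Strip(string format)
    {
        return Described.Replace(format, "{$1}");
    }

    // Convenience wrapper: strip the descriptions, then format as usual.
    public static string Format(string format, params object[] args)
    {
        return string.Format(Strip(format), args);
    }
}
```

With this, a resource string can read "{0:ForeName} will be home at {1:HomeTime}" and still format a DateTime normally, because the description is removed before string.Format sees it.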
string.Format ("{0} is not at home {1} ", person.ForeName, person.Something);
This prints ForeName in place of {0} and Something in place of {1}. There is no way to add a description inside the placeholder as you described.
As of Visual Studio 2015 you can do this with interpolated strings (it's a compiler trick, so it doesn't matter which version of the .NET Framework you target).
The code then looks something like this
string txt = $"{person.ForeName} is not at home {person.Something}";
It's not ideal if you want to put the strings into resource files for translation, but it often makes the code more readable and less error-prone.
In my controller method for handling a (potentially hostile) user input field I have the following code:
string tmptext = comment.Replace(System.Environment.NewLine, "{break was here}"); //marks line breaks for later re-insertion
tmptext = Encoder.HtmlEncode(tmptext);
//other sanitizing goes in here
tmptext = tmptext.Replace("{break was here}", "<br />");
var regex = new Regex("(<br /><br />)\\1+");
tmptext = regex.Replace(tmptext, "$1");
My goal is to preserve line breaks for typical non-malicious use and display user input in safe, HTML-encoded strings. I take the user input, parse it for newline characters and place a delimiter at the line breaks. I perform the HTML encoding and reinsert the breaks. (I will likely change this to reinserting paragraphs as p tags instead of br, but for now I'm using br.)
Now actually inserting real html breaks opens me up to a subtle vulnerability: the enter key. The regex.replace code is there to strip out a malicious user just standing on the enter key and filling the page with crap.
This is a fix for big crap floods of just white but still leaves me open to abuse like entering one character, two line breaks, one character, two line breaks all down the page.
My question is for a method of determining that this is abusive and failing it on validation. I'm scared that there might not be a simple procedural method to do it and instead will need heuristic techniques or bayesian filters. Hopefully, someone has an easier, better way.
EDIT: perhaps I wasn't clear in the problem description, the regex handles seeing multiple line breaks in a row and converting them to just one or two. That problem is solved. The real problem is distinguishing legitimate text from crap flood like this:
a
a
a
...imagine 1000 of these...
a
a
a
a
A random suggestion, inspired by slashdot.org's comment filters: compress your user input with a System.IO.Compression.DeflateStream, and if it is too small in comparison with the original (you'll have to do some experimentation to find a useful cut-off) reject it.
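A rough sketch of that compression check is below. The names (FloodDetector, CompressionRatio, LooksLikeFlood) and the default thresholds are made up for illustration; the cut-off in particular is a guess that would need tuning against real comments.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

public static class FloodDetector
{
    // Compressed size divided by original size; small values mean very repetitive input.
    public static double CompressionRatio(string text)
    {
        byte[] raw = Encoding.UTF8.GetBytes(text);
        if (raw.Length == 0) return 1.0;
        using (var buffer = new MemoryStream())
        {
            // leaveOpen: true so the buffer can still be measured after the deflater is flushed.
            using (var deflate = new DeflateStream(buffer, CompressionMode.Compress, true))
            {
                deflate.Write(raw, 0, raw.Length);
            }
            return (double)buffer.Length / raw.Length;
        }
    }

    // Assumed thresholds: ignore short inputs, flag anything that shrinks below 10%.
    public static bool LooksLikeFlood(string text, int minLength = 200, double minRatio = 0.1)
    {
        return text.Length >= minLength && CompressionRatio(text) < minRatio;
    }
}
```

Short inputs are skipped because the compressed output has a fixed overhead, so the ratio is not meaningful for a few dozen characters.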
I would HttpUtility.HtmlEncode the string, then convert newline characters to <br/>.
HttpUtility.HtmlEncode(subject).Replace("\r\n", "<br/>").Replace("\r", "<br/>").Replace("\n", "<br/>");
Also you should perform this logic when you are outputting to the user, not when saving in the database. The only validation I do on the database is make sure it's properly escaped (other than normal business rules that is).
EDIT: To fix the actual problem however, you can use Regex to replace multiple newlines with a single newline beforehand.
subject = Regex.Replace(subject, @"(\r\n|\r|\n)+", "\n");
RegexOptions.Singleline is not needed here: it only changes what . matches, and this pattern contains no dot.
It sounds like you're tempted to try something "clever" with a regex, but IMO the simplest approach is to just loop through the characters of the string copying them to a StringBuilder, filtering as you go.
Whitespace characters (those for which char.IsWhiteSpace() returns true) are not copied verbatim. (If one of these is a newline, insert a <br/> and don't allow any more <br/>s to be added until you have hit a non-whitespace character.)
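One way that loop could look is sketched here. It goes slightly beyond the description by also collapsing runs of spaces and tabs into a single space, which is an assumption; WhitespaceFilter and Collapse are hypothetical names, and the input is expected to be HTML-encoded already.

```csharp
using System;
using System.Text;

public static class WhitespaceFilter
{
    // Copies non-whitespace characters; each whitespace run becomes one <br/> or one space.
    public static string Collapse(string encoded)
    {
        var sb = new StringBuilder();
        bool pendingBreak = false;
        bool pendingSpace = false;
        foreach (char c in encoded)
        {
            if (char.IsWhiteSpace(c))
            {
                if (c == '\n' || c == '\r') pendingBreak = true;
                else pendingSpace = true;
                continue;
            }
            // Emit at most one separator per run, and none before the first real character.
            if (sb.Length > 0)
            {
                if (pendingBreak) sb.Append("<br/>");
                else if (pendingSpace) sb.Append(' ');
            }
            pendingBreak = false;
            pendingSpace = false;
            sb.Append(c);
        }
        return sb.ToString();
    }
}
```

Trailing whitespace is dropped because a separator is only written when another non-whitespace character follows it.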
edit
If you want to stop the user entering any old crap, give up now. You will never find a way of filtering that a user can't get around in less than a minute, if they really want to.
You will be much better off putting a limit on the number of newlines, or the total number of characters, in the input.
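A limit like that is only a few lines of validation. The sketch below uses made-up names (CommentLimits, IsWithinLimits) and arbitrary default limits.

```csharp
using System;

public static class CommentLimits
{
    // Rejects null input, overly long input, and input with too many line breaks.
    public static bool IsWithinLimits(string comment, int maxChars = 2000, int maxLineBreaks = 20)
    {
        if (comment == null || comment.Length > maxChars) return false;
        int breaks = 0;
        foreach (char c in comment)
        {
            // Counting '\n' covers both "\r\n" and "\n"; a lone '\r' is not counted.
            if (c == '\n') breaks++;
        }
        return breaks <= maxLineBreaks;
    }
}
```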
Think of how much effort it will take to do something clever to sanitise "bad input", and then consider how likely it is that this will happen. Probably there is no point. Probably all the sanitisation you really need is to ensure the data is legal (not too large for your system to handle, all dangerous characters stripped or escaped, etc.). (This is exactly why forums have human moderators who can filter the posts based on whatever criteria are appropriate.)
This is neither the most efficient nor the smartest way of handling this (disclaimer), but if your text is not too big that doesn't matter much, short of any smarter algorithm. (Note: it's hard to detect something like char\nchar\nchar\n..., though you could set a limit on the line length.)
You could just Split on whitespace characters (add any you can think of, except \n), then Join with a single space, then Split on \n to get the lines and Join those with <br />. While joining the lines you can test each one, e.g. for line.Length > 2.
To make this faster you can iterate with a more efficient algorithm, char by char, using IndexOf etc.
Again, this is not the most efficient or perfect way of handling this, but it would give you something fast.
EDIT: to filter 'same lines' you could use e.g. DistinctUntilChanged, which is from Ix, the Interactive Extensions (see NuGet, Ix-experimental I think). It should filter consecutive identical lines, and you could add a line test for those.
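If pulling in Ix for this is not wanted, consecutive identical lines can be dropped with a small iterator. This is a hand-rolled stand-in for the idea, not the Ix implementation; LineFilters and DropConsecutiveDuplicates are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public static class LineFilters
{
    // Yields each line unless it is identical to the line immediately before it.
    public static IEnumerable<string> DropConsecutiveDuplicates(IEnumerable<string> lines)
    {
        bool first = true;
        string previous = null;
        foreach (string line in lines)
        {
            if (first || line != previous) yield return line;
            previous = line;
            first = false;
        }
    }
}
```

Applied to the lines of the flood example in the question, a thousand consecutive "a" lines collapse to one.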
Rather than attempting to replace the newlines with filtered text and then attempting to use regular expressions on that, why not sanitize your data before inserting the <br /> tags? Don't forget to sanitize the input with HttpUtility.HtmlEncode first.
In an attempt to take care of multiple short lines in a row, here's my best attempt:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
class Program {
static void Main() {
// Arbitrary cutoff used to join short strings.
const int Cutoff = 6;
string input =
"\r\n\r\n\n\r\r\r\n\nthisisatest\r\nstring\r\nwith\nsome\r\n" +
"unsanatized\r\nbreaks\r\nand\ra\nsh\nor\nt\r\n\na\na\na\na" +
"\na\na\na\na\na\na\na\na\na\na\na\na\na\na\na\na\na";
input = (input ?? String.Empty).Trim(); // Don't forget to HtmlEncode it.
StringBuilder temp = new StringBuilder();
List<string> result = new List<string>();
var items = input.Split(
new[] { '\r', '\n' },
StringSplitOptions.RemoveEmptyEntries)
.Select(i => new { i.Length, Value = i });
foreach (var item in items) {
if (item.Length > Cutoff) {
if (temp.Length > 0) {
result.Add(temp.ToString());
temp.Clear();
}
result.Add(item.Value);
continue;
}
if (temp.Length > 0) { temp.Append(" "); }
temp.Append(item.Value);
}
if (temp.Length > 0) {
result.Add(temp.ToString());
}
Console.WriteLine(String.Join("<br />", result));
}
}
Produces the following output:
thisisatest<br />string with some<br />unsanatized<br />breaks and a sh or t a a
a a a a a a a a a a a a a a a a a a a
I'm sure you've already come up with this solution, but unfortunately what you're asking for isn't very straightforward.
For those interested, here's my first attempt:
using System;
using System.Text.RegularExpressions;
class Program {
static void Main() {
string input = "\r\n\r\n\n\r\r\r\n\nthisisatest\r\nstring\r\nwith\nsome" +
"\r\nunsanatized\r\nbreaks\r\n\r\n";
input = (input ?? String.Empty).Trim().Replace("\r", String.Empty);
string output = Regex.Replace(
input,
"\\\n+",
"<br />",
RegexOptions.Multiline);
Console.WriteLine(output);
}
}
producing the following output:
thisisatest<br />string<br />with<br />some<br />unsanatized<br />breaks
I am working on a console application that receives a pretty long list of parameters. For debugging purposes I need to print the parameters passed to an output file. Right now, I am using the following code to concatenate the command line parameters.
static void Main(string[] args)
{
string Params = string.Empty;
foreach(string arg in args)
{
Params += arg + ",";
}
}
Is there any better way to accomplish this?
What about
Params = string.Join(",", args);
Your foreach approach is not very performant. Since a string is immutable, each iteration of the loop throws the old string away for garbage collection and generates a new one. In the string.Join case, only one string is generated.
To get roughly the same performance inside the loop you would have to use a StringBuilder, but in this case there's really no reason not to use string.Join, since the code will be much more readable.
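For comparison, the StringBuilder version of the loop might look like the sketch below (ArgJoiner and JoinWithBuilder are hypothetical names). It produces the same string as string.Join(",", args), with no trailing comma.

```csharp
using System;
using System.Text;

public static class ArgJoiner
{
    // Appends into one buffer instead of allocating a new string per iteration.
    public static string JoinWithBuilder(string[] args)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < args.Length; i++)
        {
            if (i > 0) sb.Append(','); // separator only between items
            sb.Append(args[i]);
        }
        return sb.ToString();
    }
}
```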
You could use this piece of code (note that Environment.GetCommandLineArgs() also returns the program name as its first element):
String.Join(", ", Environment.GetCommandLineArgs())
You can use String.Join(",",args)
All of the answers here combining the arguments with a single comma will work, but I found that approach lacking somewhat because there's not a clear indicator of "quoted arguments" and those that might contain a comma.
Using the example: Foo.exe an example "is \"fine\", too" okay
The simple join suggestions will yield: an, example, is "fine", too, okay. Not bad, but not very clear and somewhat misleading.
Here's what I threw together that works well enough for me. I'm sure it could be improved further.
String.Join(", ", (from a in args select '"' + a.Replace("\"", @"\""") + '"'));
It returns the string: "an", "example", "is \"fine\", too", "okay". I think this does a better job indicating the actual parameters.
// params is a reserved word in C#, so the variable needs a different name.
string parameters = String.Join(",", args);
You should use:
string.Join(",", args);
Strictly speaking, the Join function creates a StringBuilder with a capacity of strings.Length * 16 (the 16 is a fixed number). If your arguments are typically much longer than that and performance is crucial to you, use a StringBuilder with a specific capacity.