Remove excessive whitespace in user input field

Remove excessive whitespace in user input field - c#

In my controller method for handling a (potentially hostile) user input field I have the following code:
string tmptext = comment.Replace(System.Environment.NewLine, "{break was here}"); //marks line breaks for later re-insertion
tmptext = Encoder.HtmlEncode(tmptext);
//other sanitizing goes in here
tmptext = tmptext.Replace("{break was here}", "<br />");
var regex = new Regex("(<br /><br />)\\1+");
tmptext = regex.Replace(tmptext, "$1");
My goal is to preserve line breaks for typical non-malicious use and display user input in safe, htmlencoded strings. I take the user input, parse it for newline characters and place a delimiter at the line breaks. I perform the HTML encoding and reinsert the breaks. (i will likely change this to reinserting paragraphs as p tags instead of br, but for now i'm using br)
Now actually inserting real html breaks opens me up to a subtle vulnerability: the enter key. The regex.replace code is there to strip out a malicious user just standing on the enter key and filling the page with crap.
This is a fix for big crap floods of just white but still leaves me open to abuse like entering one character, two line breaks, one character, two line breaks all down the page.
My question is for a method of determining that this is abusive and failing it on validation. I'm scared that there might not be a simple procedural method to do it and instead will need heuristic techniques or bayesian filters. Hopefully, someone has an easier, better way.
EDIT: perhaps I wasn't clear in the problem description, the regex handles seeing multiple line breaks in a row and converting them to just one or two. That problem is solved. The real problem is distinguishing legitimate text from crap flood like this:
a
a
a
...imagine 1000 of these...
a
a
a
a

A random suggestion, inspired by slashdot.org's comment filters: compress your user input with a System.IO.Compression.DeflateStream, and if it is too small in comparison with the original (you'll have to do some experimentation to find a useful cut-off) reject it.

I would HttpUtility.HtmlEncode the string, then convert newline characters to <br/>.
HttpUtility.HtmlEncode(subject).Replace("\r\n", "<br/>").Replace("\r", "<br/>").Replace("\n", "<br/>");
Also you should perform this logic when you are outputting to the user, not when saving in the database. The only validation I do on the database is make sure it's properly escaped (other than normal business rules that is).
EDIT: To fix the actual problem however, you can use Regex to replace multiple newlines with a single newline beforehand.
subject = Regex.Replace(#"(\r\n|\r|\n)+", #"\n", RegexOptions.Singleline);
I'm not sure if you would need RegexOptions.Singleline.

It sounds like you're tempted to try something "clever" with a regex, but IMO the simplest approach is to just loop through the characters of the string copying them to a StringBuilder, filtering as you go.
Any that fail a char.IsWhiteSpace() test are not copied. (If one of these is a newline, then insert a <br/> and don't allow any more <br/>'s to be added until you have hit a non-whitespace character).
edit
If you want to stop the user entering any old crap, give up now. You will never find a way filtering that a user can't find a way around in less than a minute, if they really want to.
You will be much better off putting a limit on the number of newlines, or the total number of characters, in the input.
Think of how much effort it will take to do something clever to sanitise "bad input", and then consider how likely it is that this will happen. Probbaly there is no point. Probably all the sanitisation you really need is to ensure the data is legal (not too large for your system to handle, all dangerous characters stripped or escaped, etc). (This is exactly why forums have human moderators who can filter the posts based on whatever criteria are approriate).

This is not the most efficient way of handling this, nor the smartest (disclaimer),
but if your text is not too big it doesn't matter much and short of any smarter algorithms (note: it's hard to detect something like char\nchar\nchar\n... though you could set a limit on the line len)
You could just Split on white characters (add any you can think of, short of \n) - then Join with just one space and then split on \n (to get lines) - join with <br />. While joining the lines you can test for line.Length > 2 e.g. or something.
To make this faster you can iterate with a more efficient algorithm, char by char, using IndexOf etc..
Again not the most efficient or perfect way of handling this but would give you something fast.
EDIT: to filter 'same lines' - you could use e.g. DistinctUntilChanged - that's from the Ix - Interactive extensions (see NuGet Ix-experimental I think) which should filter 'same lines' consecutive + you could add line test for those.

Rather than attempting to replace the newlines with filtered text and then attempting to use regular expressions on that, why not sanitize your data before inserting the <br /> tags? Don't forget to sanitize the input with HttpUtility.HtmlEncode first.
In an attempt to take care of multiple short lines in a row, here's my best attempt:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
class Program {
static void Main() {
// Arbirary cutoff used to join short strings.
const int Cutoff = 6;
string input =
"\r\n\r\n\n\r\r\r\n\nthisisatest\r\nstring\r\nwith\nsome\r\n" +
"unsanatized\r\nbreaks\r\nand\ra\nsh\nor\nt\r\n\na\na\na\na" +
"\na\na\na\na\na\na\na\na\na\na\na\na\na\na\na\na\na";
input = (input ?? String.Empty).Trim(); // Don't forget to HtmlEncode it.
StringBuilder temp = new StringBuilder();
List<string> result = new List<string>();
var items = input.Split(
new[] { '\r', '\n' },
StringSplitOptions.RemoveEmptyEntries)
.Select(i => new { i.Length, Value = i });
foreach (var item in items) {
if (item.Length > Cutoff) {
if (temp.Length > 0) {
result.Add(temp.ToString());
temp.Clear();
}
result.Add(item.Value);
continue;
}
if (temp.Length > 0) { temp.Append(" "); }
temp.Append(item.Value);
}
if (temp.Length > 0) {
result.Add(temp.ToString());
}
Console.WriteLine(String.Join("<br />", result));
}
}
Produces the following output:
thisisatest<br />string with some<br />unsanatized<br />breaks and a sh or t a a
a a a a a a a a a a a a a a a a a a a
I'm sure you've already come up with this solution but unfortunately what you're asking for isn't very straight forward.
For those interested, here's my first attempt:
using System;
using System.Text.RegularExpressions;
class Program {
static void Main() {
string input = "\r\n\r\n\n\r\r\r\n\nthisisatest\r\nstring\r\nwith\nsome" +
"\r\nunsanatized\r\nbreaks\r\n\r\n";
input = (input ?? String.Empty).Trim().Replace("\r", String.Empty);
string output = Regex.Replace(
input,
"\\\n+",
"<br />",
RegexOptions.Multiline);
Console.WriteLine(output);
}
}
producing the following output:
thisisatest<br />string<br />with<br />some<br />unsanatized<br />breaks

Related

Split string to array when some elements are empty

I need to process a large amount of csv data in real time as it is spat out by a TCP port. Here is an example as displayed by Putty:
MSG,3,1920,742,4009C5,14205994,2017/01/29,20:14:27.065,2017/01/29,20:14:27.972,,8000,,,51.26582,-0.33783,,,0,0,0,0
MSG,4,1920,742,4009C5,14205994,2017/01/29,20:14:27.065,2017/01/29,20:14:27.972,,,212.9,242.0,,,0,,,,,
MSG,1,1920,742,4009C5,14205994,2017/01/29,20:14:27.065,2017/01/29,20:14:27.972,BAW469,,,,,,,,,,,
MSG,3,1920,742,4009C5,14205994,2017/01/29,20:14:27.284,2017/01/29,20:14:27.972,,8000,,,51.26559,-0.33835,,,0,0,0,0
MSG,4,1920,742,4009C5,14205994,2017/01/29,20:14:27.284,2017/01/29,20:14:27.972,,,212.9,242.0,,,0,,,,,
I need to put each line of data in string (line) into an array (linedata[]) so that I can read and process certain elements, but linedata = line.Split(','); seems to ignore the many empty elements, with the result that linedata[20], for example, may or may not exist, and if it doesn't I get an error if I try to read it. Even if element 20 in the line contains a value it won't necessarily be the 20th element in the array. And that's no good.
I can work out how to parse line character by character into linedata[], inserting an empty string where appropriate, but surely there must be a better way ? Have I missed something obvious ?
Many Thanks. Perhaps I'd better add that I'm quite new to C#, my past experience is all with Delphi 7. I really miss stringlists.
Edited: sorry, this is now resolved with the help of MSDN's documentation. This code works: lineData = line.Split(separators, StringSplitOptions.None); after setting "string[] separators = { "," };". My big mistake was to follow examples found on tutorial sites which didn't give any clues that the .split method had any options.

https://msdn.microsoft.com/en-us/library/system.stringsplitoptions(v=vs.110).aspx
That link has an example section, look at example 1b specifically. There is an extra parameter to Split called StringSplitOptions which does this.
For Example:
string[] linedata = line.Split(charSeparators, StringSplitOptions.None);
foreach (string line in linedata)
{
Console.Write("<{0}>", line);
}
Console.Write("\n\n");
The way to find this sort of information is to start with the Reference Documentation for the function, and hope it has an option or a link to a similar function.
If you want to also start validating types, handling variants in the format etc... you could move up to a CSV library. If you do not need that functionality, this is the easiest way and efficient for small files.

Some of the overloads for String.Split() take a StringSplitOptions argument, and if you use the RemoveEmptyEntries option, it will...remove the empty entries. So you can specify the None option:
linedata = line.Split(new [] { ',' }, StringSplitOptions.None);
Or better yet, use the overload that doesn't take a StringSplitOptions, which treats it as None by default:
linedata = line.Split(',');
The code in your question indicates that you are doing this, but your description of the problem suggests that you are not.
However, you're probably better off using an actual CSV parser, which would handle things like unescaping and so on.

The StringReader class provides methods for reading lines, characters, or blocks of characters from a string. Hope this could be the clue
string str = #"MSG,3,1920,742,4009C5,14205994,2017/01/29,20:14:27.065,2017/01/29,20:14:27.972,,8000,,,51.26582,-0.33783,,,0,0,0,0
MSG,4,1920,742,4009C5,14205994,2017/01/29,20:14:27.065,2017/01/29,20:14:27.972,,,212.9,242.0,,,0,,,,,
MSG,1,1920,742,4009C5,14205994,2017/01/29,20:14:27.065,2017/01/29,20:14:27.972,BAW469,,,,,,,,,,,
MSG,3,1920,742,4009C5,14205994,2017/01/29,20:14:27.284,2017/01/29,20:14:27.972,,8000,,,51.26559,-0.33835,,,0,0,0,0
MSG,4,1920,742,4009C5,14205994,2017/01/29,20:14:27.284,2017/01/29,20:14:27.972,,,212.9,242.0,,,0,,,,,";
using (StringReader reader = new StringReader(str))
do
{
string[] linedata = reader.ReadLine().Split(',');
} while (reader.Read() != -1);

While you should look into the various ways the String class can help you here, sometimes the quick and dirty "MAKE it fit" option is called for. In this case, that'd be to roll through the strings in advance and ensure you have at least one character between the commas.
public static string FixIt(string s)
{
return s.Replace(",,", ", ,");
}
You should be able to:
var lineData = FixIt(line).Split(',');
Edit: In response to the question below, I'm not sure what you meant, but if you mean doing it without creating a helper method, you can do so easily. The code will be harder to read and troubleshoot if you do it in one line though. My personal rule is, if you have to do it a LOT, it should probably be a method. If you only had to do it once, this is particularly clean. I'd actually do it this way and just wrap it in a method that does all the work for you.
var lineData = line.Replace(",,", ", ,").Split(',');
As a method, it'd be:
public static string[] GiveMeAnArray(string s)
{
return s.Replace(",,", ", ,").Split(',');
}

Regex ignore a pattern

I am trying to figure out a viable way to go about parsing this CSV file. Currently I am using filehelpers which is great. But with this csv file it seems to be having issues.
Each record in the the csv file is contained in quotes and delimited by a comma.
The records have commas within them and 1 record out of the 90,000 records im dealing with has one single " that mucks up the Readline.
The record looks like this "24" Blah ",
So I'm looking to write a regex to insert into the BeforeReadRecord that will go through and replace all instances of " with a space.
I'm newer to regex but I'm not finding any way to exclude three cases.
Case one: each line starts with a "
Case two: each line ends with a "
Case three: each field is separated by ","
I am trying to figure out how I could exclude those three cases and be left to just replace any straggler " .
So far I've been failing miserably and am not even sure if there is a way to accomplish this. Perhaps someone knows of a better csv parser that handles this one odd case as well?
EDIT: Well here's what I ended up with. It takes a little time to process(also just changes any outlier " to ' which is fine since the data that contains quotes is needed for any queries) but looking for any pitfalls I may be falling in to make it faster but it seemed to be the quickest solution so far(took about 7 seconds for 92,000 records) but there doesn't seem any way around checking every line so... My previous solution was a nasty nested if that seemed to 30 seconds or so over the course of processing the records. It accounts for all scenarios except for where someone decides to put a random ", at the end of a field... hoping I don't run into a record like this but it wouldn't surprise me.
in its own method{
engine.BeforeReadRecord += (sender, args) =>
args.RecordLine = checkQuote(args.RecordLine);
var records = engine.ReadFile(reportFilePath);
}
private static string checkQuote(string checkString)
{
if (checkString.Substring(0, 1) == #"""")
{
string removeQuote = #"""" + checkString.Replace(#"""", "'").Replace(#"','", #""",""").Remove(checkString.Length-1,1).Remove(0,1) + #"""";
return removeQuote;
}
else
return checkString; }

File format readers typically don't handle malformed input well. Why should they? If you give a CSV reader bad data, I would expect it to barf. I've rarely had good luck with computer software that makes assumptions about what I meant.
Do you really need a regular expression? If you define a straggler as the last quote character when the number is odd, then it's trivial to remove the last one: just count them and if the number is odd, remove the last one.
For example:
var quoteCount = inputString.Count(c => c == '\"');
if ((quoteCount % 2) == 1)
{
inputString = inputString.Remove(inputString.LastIndexOf('\"'));
}
Done and done.
You could also do it in a single pass with a loop, but that's probably overkill. I strongly suspect that sanitizing the input is not a major bottleneck in your program.
For more complex patterns (i.e. you're looking for "," or for a quote at the start and end, you just write a simple state machine. It's probably a dozen lines of code.
I realize that you might be able to do this with regular expressions. I find regex great for finding stuff and doing simple replacements. For more complicated rules like "replace quote with space unless the quote is at the beginning or end of line or next to a comma", I find it hard to come up with a good expression. For example, what about this case:
"first name","last name","","phone"
You have to take that blank field (i.e. "") into account. You also have to take into account spaces between fields (i.e. "first" , "last" , ""), and a whole host of other things. I'm reasonably sure that regex can do it. My experience has been that I can usually write the simple state machine and prove that it's correct faster than I can puzzle out the required regex. And it's certain that I'll more easily understand the state machine six months later.

How to read text file between ""

I need an "idea" on how to read text file data between quotes. For example:
line 1: "read a title"
line 2: "read a descr"
line 1: "read a title"
line 2: "read a descr"
I want to do a foreach type of thing, and I want to read all Line 1's, and Line 2's as a pair, but between the ".
In my program I am going to output (foreach of course):
readTerminatedNull(file1);
readTerminatedNull(file2);
I would read line by line, but some of the text could be:
line 1: "read a super long
title that goes off"
line 2: "read a descr"
So that's why I want to read between the ".
Sorry if that is too complicated, and it's a little hard to explain.
Edit:
Thanks for all the feed back guys, but I'm not sure you are getting what I am trying to do :p not your faults, I wrote this kinda wierd.
I will have a text file full of refrences, and text. like so.
text inside:
Refren: "myrefrence_1"
String: "This is a string of a refrence"
Refren: "myrefrence_2"
String: "hello world"
Refren: "myrefrence_3"
String: "I like cookies."
I want it to to read myrefrence_1 in the quotes of the first line, and then read the string in the next line between the ".
I will then stuff into my program that matches the refrence with the string.
But sometimes the text will be more than one line.
Refren: "this is text that goes and then
return keys on some parts."
and I still want it to read through the ".

(not tested, but you'll get the idea)
// Read all text from file
string sData = File.ReadAllText(#"c:/file.txt");
// Match strings between " "
Match match = Regex.Match(sData , "\"(\w|\d|\s|\\\")*\"",
RegexOptions.IgnoreCase);
// Read results and strip " out of them
foreach (var sResult in match) {
sResult = sResult.Remove(0,1).Remove(sResult.length-2, 1);
// Do whatever with sResult
}

You could learn some new tricks by looking into state machines. Basically: Read each character at a time and figure out what state you are in now. First, code this as a big while loop with a big switch statement inside. Then, go and read up on the state pattern for how to do this in an object oriented way. Then, ditch that and use delegates, because c# makes this stuff so easy to do.
Then, scrap it all, write some crappy Regular Expression with a multiline flag and slurp it the Perl way. Meditate on why this is the same as your original state machine solution.
Then, get really stuck in and learn about parser generators (lexx/yacc or some .NET variant) and write a simple BNF grammar for your problem. Take special note of how the trivial grammars used in the tutorials are all way more complicated than the one you need to write. Why is that so? Check out what Noam Chomsky had to say about that.
Eventually, you'll burn out. We all do. But you'll have so much fun digging into what makes programming the coolest activity on the planet. Burn-out is just the realization that that's a pipe dream ;)
When you're done, go outside. Meet people. Talk. Smile a lot. Be friendly. You're now a zen infused developer with a wicked grin. Yay for you! You rock!

What you're describing sounds like a single-column CSV file. The easiest way to access that is probably to use the Microsoft.VisualBasic.FileIO.TextFieldParser class, something like:
using (var csvParser = new TextFieldParser(new StringReader(content))
{
Delimiters = new[] {","},
HasFieldsEnclosedInQuotes = true
})
{
while (!csvParser.EndOfData)
{
var fields = csvParser.ReadFields();
Console.Print(fields[0]); //do something with the first (in your case only) field found.
}
}
Probably the easiest way to determine whether this approach makes sense, is to think about what happens if the string you're reading actually contains a double quote. Would it end up as "He said ""this is quoted"", but I wasn't listening" (doubling up the quotes), or is this situation impossible?
If the quotes would be doubled up in this way, then a standard CSV reader like this built-in framework one is probably your best bet.

To read all of the lines of the file you can use:
File.ReadAllLines(pathToFile);
to strip the text from "" you can use the substring method of string: http://msdn.microsoft.com/en-us/library/aka44szs.aspx
you can do it like that:
string strippedString = original.Substring(1, original.length -2);

Try this one
var text = File.ReadAllLines(pathToFile);
var lines = text.Split(':')
.Where((s,i) => i % 2 != 0)
.Select(s => s.trim('"'));

First of all you need to read in the file using:
File.ReadAllLines(filePath);
Then you could split all the lines using the string.Split function.
Splitting on the closing bracket would be your best bet.

As i have understood from you question is you want to read and write text file with some specific settings. is it ?
I would like to refer to to INI files which are the text files it self and provide the settings configurations as you wish to achieve. here are some links these could help you.
http://www.codeproject.com/Articles/1966/An-INI-file-handling-class-using-C
http://jachman.wordpress.com/2006/09/11/how-to-access-ini-files-in-c-net/

Regex BBCode to HTML

I writing BBcode converter to html.
Converter should skip unclosed tags.
I thought about 2 options to do it:
1) match all tags in once using one regex call, like:
Regex re2 = new Regex(#"\[(\ /?(?:b|i|u|quote|strike))\]");
MatchCollection mc = re2.Matches(sourcestring);
and then, loop over MatchCollection using 2 pointers to find start and open tags and than replacing with right html tag.
2) call regex multiple time for every tag and replace directly:
Regex re = new Regex(#"\[b\](.*?)\[\/b\]");
string s1 = re.Replace(sourcestring2,"<b>$1</b>");
What is more efficient?
The first option uses one regex but will require me to loop through all tags and find all pairs, and skip tags that don't have a pair.
Another positive thins is that I don't care about the content between the tags, i just work and replace them using the position.
In second option I don't need to worry about looping and making special replace function.
But will require to execute multiple regex and replaces.
What can you suggest?
If the second option is the right one,
there is a problem with regex
\[b\](.*?)\[\/b\]
how can i fix it to also match multi lines like:
[b]
test 1
[/b]
[b]
test 2
[/b]

One option would be to use more SAX-like parsing, where instead of looking for a particular regex you look for [, then have your program handle that even in some manner, look for the ], handle that even, etc. Although more verbose than the regex it may be easier to understand, and wouldn't necessarily be slower.

r = new System.Text.RegularExpressions.Regex(#"(?:\[b\])(?<name>(?>\[b\](?<DEPTH>)|\[/b\](?<-DEPTH>)|.)+)(?(DEPTH)(?!))(?:\[/b\])", System.Text.RegularExpressions.RegexOptions.Singleline);
var s = r.Replace("asdfasdf[b]test[/b]asdfsadf", "<b>$1</b>");
That should give you only elements that have matched closing tags and also handle multi line (even though i specified the option of SingleLine it actually treats it as a single line)
It should also handle [b][b][/b] properly by ignoring the first [b].
As to whether or not this method is better than your first method I couldn't say. But hopefully this will point you in the right direction.
Code that works with your example below:
System.Text.RegularExpressions.Regex r;
r = new System.Text.RegularExpressions.Regex(#"(?:\[b\])(?<name>(?>\[b\](?<DEPTH>)|\[/b\](?<-DEPTH>)|.)+)(?(DEPTH)(?!))(?:\[/b\])", System.Text.RegularExpressions.RegexOptions.Singleline);
var s = r.Replace("[b]bla bla[/b]bla bla[b] " + "\r\n" + "bla bla [/b]", "<b>$1</b>");

How to display word differences using c#?

I would like to show the differences between two blocks of text. Rather than comparing lines of text or individual characters, I would like to just compare words separated by specified characters ('\n', ' ', '\t' for example). My main reasoning for this is that the block of text that I'll be comparing generally doesn't have many line breaks in it and letter comparisons can be hard to follow.
I've come across the following O(ND) logic in C# for comparing lines and characters, but I'm sort of at a loss for how to modify it to compare words.
In addition, I would like to keep track of the separators between words and make sure they're included with the diff. So if space is replaced by a hard return, I would like that to come up as a diff.
I'm using Asp.net to display the entire block of text including the deleted original text and added new text (both will be highlighted to show that they were deleted/added). A solution that works with those technologies would be appreciated.
Any advice on how to accomplish this is appreciated?
Thanks!

Microsoft has released a diff project on CodePlex that allows you to do word, character, and line diffs. It is licensed under Microsoft Public License (Ms-PL).
https://github.com/mmanela/diffplex

Other than a few general optimizations, if you need to include the separators in the comparison you are essentially doing a character by character comparison with breaks. Though you could use the O(ND) you linked, you are going to make as many changes to it as you would basically writing your own.
The main problem with difference comparison is finding the continuation (if I delete a single word, but leave the rest the same).
If you want to use their code start with the example and do not write the deleted characters, if there are replaced characters in the same place, do not output this result. You then need to compute the longest continuous run of "changed" words, highlight this string and output.
Sorry thats not much of an answer, but for this problem the answer is basically writing and tuning the function.

Well String.Split with '\n', ' ' and '\t' as the split characters will return you an array of words in your block of text.
You could then compare each array for differences. A simple 1:1 comparison would tell you if any word had been changed. Comparing:
hello world how are you
and:
hello there how are you
would give you that world and changed to there.
What it wouldn't tell you was if words had been inserted or removed and you would still need to parse the text blocks character by character to see if any of the separator characters had been changed.

string string1 = "hello world how are you";
string string2 = "hello there how are you";
var first = string1.Split(' ');
var second = string2.Split(' ');
var primary = first.Length > second.Length ? first : second;
var secondary = primary == second ? first : second;
var difference = primary.Except(secondary).ToArray();

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.