Regex Replace different kinds of single quotes - c#

Those nasty single quotes that love to cause havoc in MySQL seem to have cousins! We have a system where users will do job updates from clients, either pasting in content from emails or copying and pasting from almost anywhere, and every time we cater for one single quote, another pops up. Here are the different ones: ’ ´ ' ` <--
As my regex is pretty weak, my fellow developer said I should just remove them using:
return Regex.Replace(oldText, @"[’´'`""]", @"");
I don't like this; it's unprofessional to remove all single quotes. What I want to do, as many forums suggest, is just double up the single quotes. Would this be correct?
return Regex.Replace(oldText, @"[’´'`""]", @"''");
This is only needed, though, because in his DAL he constructs his insert statements with single quotes:
sql.Append(",`to_complete_by`='" + obj.toCompleteBy + "'");
Would I be able to avoid this error by changing ^^ to this?
sql.Append(",`to_complete_by`=\"" + obj.toCompleteBy + "\"");
Or is regex replace the preferred method?

If you're working with an old version of MySQL that doesn't support prepared statements, you should at least take advantage of the fact that modern database interfaces can emulate such features if they aren't supported by the database itself. In your case it's not so much the potential performance gain that matters, but the added security through separation of statement syntax and value representations.
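To make the separation of statement syntax and values concrete, here is a minimal sketch in Python. It uses the standard-library sqlite3 module purely as a stand-in for a MySQL driver (an assumption, made so the example is self-contained). The table name jobs and the helper save_job_update are hypothetical; only the column name comes from the question. The value travels as a bound parameter, so none of the quote variants needs to be replaced or doubled.

```python
import sqlite3

def save_job_update(conn, to_complete_by):
    # The "?" placeholder keeps the value out of the SQL text entirely,
    # so quotes inside it can never terminate the string literal.
    conn.execute("INSERT INTO jobs (to_complete_by) VALUES (?)", (to_complete_by,))

# In-memory database standing in for the real MySQL server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (to_complete_by TEXT)")

# Text containing every quote variant listed in the question.
pasted = "client’s note: it´s 'urgent', see `ref` and \"quote\""
save_job_update(conn, pasted)

stored = conn.execute("SELECT to_complete_by FROM jobs").fetchone()[0]
print(stored == pasted)
```

ADO.NET providers generally offer the same mechanism through command parameters, so the string concatenation in the DAL can be replaced without any regex at all.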

Related

.NET Regular Expression (perl-like) for detecting text that was pasted twice in a row

I've got a ton of json files that, due to a UI bug with the program that made them, often have text that was accidentally pasted twice in a row (no space separating them).
Example: {FolderLoc = "C:\testC:\test"}
I'm wondering if it's possible for a regular expression to match this. It would be per-line. If I can do this, I can use FNR, which is a batch text processing tool that supports .NET RegEx, to get rid of the accidental duplicates.
I regret not having an example of one of my attempts to show, but this is a very unique problem and I wasn't able to find anything on search engines resembling it to even start to base a solution off of.
Any help would be appreciated.
We can capture text along the string (with .+) and follow it with a lookahead that checks for a repetition of what has been captured up to that point, like
/(.+)(?=\1)/; # but need more restrictions
However, this gets tripped even just on double leTTers, so it needs at least a little more. For example, our pattern can require the text which gets repeated to be at least two words long.
Here is a basic and raw example. Please also see the note on regex at the end.
use warnings;
use strict;
use feature 'say';

my @lines = (
    q(It just wasn't able just wasn't able no matter how hard it tried.),
    q(This has no repetitions.),
    q({FolderLoc = "C:\testC:\test"}),
);

my $re_rep = qr/(\w+\W+\w+.+)(?=\1)/; # at least two words, and then some

for (@lines) {
    if (/$re_rep/) {
        # Other conditions/filtering on $1 (the capture) ?
        say $1
    }
}
This matches at least two words: word (\w+) + non-word-chars + word + anything. That'll still get some legitimate data, but it's a start that can now be customized to your data. We can tweak the regex and/or further scrutinize our catch inside that if branch.
The pattern doesn't allow for any intervening text (the repetition must follow immediately), which is easily changed if needed; the question is whether some legitimate repetitions could then get flagged.
The program above prints
just wasn't able
C:\test
Note on regex: This quest, to find repeated text, is much too generic as it stands and it will surely pick on someone's good data. It is enough to note that I had to require at least two words (with just one word, a legitimate "that that" is flagged), which is arbitrary and still insufficient. For one, repeated numbers realistically found in data files (3,3,3,3,3) will be matched as well.
So this needs further specialization, for which we need to know more about the data.
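For reference, the same pattern carries over to other regex engines that support backreferences inside a lookahead. Below is a small Python sketch; the helper names find_repeat and drop_repeat are made up for illustration. Because the lookahead does not consume the second copy, replacing the match with an empty string removes only the first copy, which is one possible way to do the cleanup the question asks for. The same caveats about false positives apply.

```python
import re

# Same pattern as the Perl example: at least two words, immediately repeated.
REPEAT = re.compile(r"(\w+\W+\w+.+)(?=\1)")

def find_repeat(line):
    """Return the text that is immediately repeated, or None."""
    m = REPEAT.search(line)
    return m.group(1) if m else None

def drop_repeat(line):
    """Remove the first copy of the first repeated run found in the line."""
    return REPEAT.sub("", line, count=1)

lines = [
    "It just wasn't able just wasn't able no matter how hard it tried.",
    "This has no repetitions.",
    r'{FolderLoc = "C:\testC:\test"}',
]
for line in lines:
    print(repr(find_repeat(line)), "->", drop_repeat(line))
```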

Regex ignore a pattern

I am trying to figure out a viable way to go about parsing this CSV file. Currently I am using filehelpers which is great. But with this csv file it seems to be having issues.
Each record in the csv file is contained in quotes and delimited by a comma.
The records have commas within them, and 1 record out of the 90,000 records I'm dealing with has one single " that mucks up the Readline.
The record looks like this "24" Blah ",
So I'm looking to write a regex to insert into the BeforeReadRecord that will go through and replace all instances of " with a space.
I'm newer to regex but I'm not finding any way to exclude three cases.
Case one: each line starts with a "
Case two: each line ends with a "
Case three: each field is separated by ","
I am trying to figure out how I could exclude those three cases and be left to just replace any straggler " .
So far I've been failing miserably and am not even sure if there is a way to accomplish this. Perhaps someone knows of a better csv parser that handles this one odd case as well?
EDIT: Well, here's what I ended up with. It just changes any outlier " to ', which is fine since the data that contains quotes is needed for any queries. It takes a little time to process (about 7 seconds for 92,000 records), but it seemed to be the quickest solution so far, and there doesn't seem to be any way around checking every line. My previous solution was a nasty nested if that seemed to take 30 seconds or so over the course of processing the records. I'm still looking for any pitfalls I may be falling into and for ways to make it faster. It accounts for all scenarios except where someone decides to put a random ", at the end of a field. I'm hoping I don't run into a record like this, but it wouldn't surprise me.
// in its own method
{
    engine.BeforeReadRecord += (sender, args) =>
        args.RecordLine = checkQuote(args.RecordLine);
    var records = engine.ReadFile(reportFilePath);
}
private static string checkQuote(string checkString)
{
    // Only touch lines that start with a double quote.
    if (checkString.Substring(0, 1) == @"""")
    {
        // Turn every " into ', restore the "," delimiters, then swap the outer ' pair back to ".
        string removeQuote = @"""" + checkString.Replace(@"""", "'").Replace(@"','", @""",""").Remove(checkString.Length - 1, 1).Remove(0, 1) + @"""";
        return removeQuote;
    }
    else
        return checkString;
}
File format readers typically don't handle malformed input well. Why should they? If you give a CSV reader bad data, I would expect it to barf. I've rarely had good luck with computer software that makes assumptions about what I meant.
Do you really need a regular expression? If you define a straggler as the last quote character when the number is odd, then it's trivial to remove the last one: just count them and if the number is odd, remove the last one.
For example:
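// Requires: using System.Linq;
var quoteCount = inputString.Count(c => c == '\"');
if ((quoteCount % 2) == 1)
{
    // Remove(index, 1) deletes just that one character; Remove(index) alone would cut off the rest of the line.
    inputString = inputString.Remove(inputString.LastIndexOf('\"'), 1);
}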
Done and done.
You could also do it in a single pass with a loop, but that's probably overkill. I strongly suspect that sanitizing the input is not a major bottleneck in your program.
For more complex patterns (e.g. you're looking for "," or for a quote at the start and end), you just write a simple state machine. It's probably a dozen lines of code.
I realize that you might be able to do this with regular expressions. I find regex great for finding stuff and doing simple replacements. For more complicated rules like "replace quote with space unless the quote is at the beginning or end of line or next to a comma", I find it hard to come up with a good expression. For example, what about this case:
"first name","last name","","phone"
You have to take that blank field (i.e. "") into account. You also have to take into account spaces between fields (i.e. "first" , "last" , ""), and a whole host of other things. I'm reasonably sure that regex can do it. My experience has been that I can usually write the simple state machine and prove that it's correct faster than I can puzzle out the required regex. And it's certain that I'll more easily understand the state machine six months later.
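As a rough illustration of how little code the non-regex route needs, here is a Python sketch of the rule from the question: keep a quote if it is the first or last character of the line or part of a "," delimiter, and replace every other quote. This is a single-pass neighbour check rather than a full state machine, and the function name replace_stray_quotes is made up for this example. It assumes the line has already had its line ending stripped, it does not handle spaces around the delimiting commas, and it cannot tell a real delimiter from a stray ", inside a field.

```python
def replace_stray_quotes(line, replacement=" "):
    """Replace every double quote that is not structural in a fully quoted CSV line."""
    out = []
    last = len(line) - 1
    for i, ch in enumerate(line):
        if ch != '"':
            out.append(ch)
            continue
        at_edge = i == 0 or i == last                    # quote opening or closing the line
        closes_field = line[i + 1:i + 3] == ',"'         # quote directly before ,"
        opens_field = i >= 2 and line[i - 2:i] == '",'   # quote directly after ",
        keep = at_edge or closes_field or opens_field
        out.append(ch if keep else replacement)
    return "".join(out)

print(replace_stray_quotes('"24" Blah ","x"'))
print(replace_stray_quotes('"first name","last name","","phone"'))
```

The blank field in the second line survives untouched because each of its two quotes sits next to a "," delimiter.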

SQL Like Operator (single quotation) [duplicate]

The MySQL documentation says that it should be \'. However, both scite and mysql show that '' works. I tried that and it works. What should I do?
The MySQL documentation you cite actually says a little bit more than you mention. It also says,
A “'” inside a string quoted with “'” may be written as “''”.
(Also, you linked to the MySQL 5.0 version of Table 8.1. Special Character Escape Sequences, and the current version is 5.6 — but the current Table 8.1. Special Character Escape Sequences looks pretty similar.)
I think the Postgres note on the backslash_quote (string) parameter is informative:
This controls whether a quote mark can be represented by \' in a string literal. The preferred, SQL-standard way to represent a quote mark is by doubling it ('') but PostgreSQL has historically also accepted \'. However, use of \' creates security risks...
That says to me that using a doubled single-quote character is a better overall and long-term choice than using a backslash to escape the single-quote.
Now if you also want to add choice of language, choice of SQL database and its non-standard quirks, and choice of query framework to the equation, then you might end up with a different choice. You don't give much information about your constraints.
Standard SQL uses doubled-up quotes; MySQL has to accept that to be reasonably compliant.
'He said, "Don''t!"'
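A quick way to check the doubling rule is to build such a literal and ask a database to evaluate it. The sketch below does this in Python with the standard-library sqlite3 module, used here only as a convenient engine that follows the standard rule (an assumption; it is not MySQL). The helper quote_literal is hypothetical. Note that MySQL in its default SQL mode also treats the backslash as an escape character, so doubling quotes alone is not sufficient there if values may contain backslashes; bound parameters avoid the issue altogether.

```python
import sqlite3

def quote_literal(text):
    # SQL-standard escaping: double each single quote, then wrap in single quotes.
    return "'" + text.replace("'", "''") + "'"

conn = sqlite3.connect(":memory:")
original = 'He said, "Don\'t!"'
literal = quote_literal(original)

# Let the database parse the literal and hand the value back.
round_trip = conn.execute("SELECT " + literal).fetchone()[0]
print(literal)
print(round_trip == original)
```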
What I believe user2087510 meant was:
name = 'something'
name = name.replace("'", "\\'")
I have also used this with success.
There are three ways I am aware of. The first not being the prettiest and the second being the common way in most programming languages:
Use another single quote: 'I mustn''t sin!'
Use the escape character \ before the single quote: 'I mustn\'t sin!'
Use double quotes to enclose string instead of single quotes: "I mustn't sin!"
Just write '' in place of ', i.e. the single quote two times.
Here's an example:
SELECT * FROM pubs WHERE name LIKE "%John's%"
Just use double quotes to enclose the single quote.
If you insist on using single quotes (and so need to escape the character):
SELECT * FROM pubs WHERE name LIKE '%John\'s%'
Possibly off-topic, but maybe you came here looking for a way to sanitise text input from an HTML form, so that when a user inputs the apostrophe character, it doesn't throw an error when you try to write the text to an SQL-based table in a DB. There are a couple of ways to do this, and you might want to read about SQL injection too.
Here's an example of using prepared statements and bound parameters in PHP:
$input_str = "Here's a string with some apostrophes (')";
// sanitise it before writing to the DB (assumes PDO)
$sql = "INSERT INTO `table` (`note`) VALUES (:note)";
try {
    $stmt = $dbh->prepare($sql);
    $stmt->bindParam(':note', $input_str, PDO::PARAM_STR);
    $stmt->execute();
} catch (PDOException $e) {
    return $dbh->errorInfo();
}
return "success";
In the special case where you may want to store your apostrophes using their HTML entity references, PHP has the htmlspecialchars() function, which will convert them to &#039; (when the ENT_QUOTES flag is in effect). As the comments indicate, this should not be used as a substitute for proper sanitisation, as per the example given.
Replace the string
value = value.replace(/'/g, "\\'");
where value is the string you are going to store in your database.
There is also an NPM package for this that you can look into:
https://www.npmjs.com/package/mysql-apostrophe
I think if you have any data point with an apostrophe, you can add one apostrophe before the apostrophe.
e.g. 'This is John's place'
Here MySQL assumes two sentences: 'This is John' 's place'
You can put 'This is John''s place'. I think it should work that way.
In PHP I like using mysqli_real_escape_string() which escapes special characters in a string for use in an SQL statement.
see https://www.php.net/manual/en/mysqli.real-escape-string.php

a cleaner way of representing double quote?

really simple question... just want to represent double quote " without needing to do "" or \"
cases that I'm aware of:
var s=@"123 "" 456 """;
var s="123 \" 456 \"";
It'd make a reasonable difference if I could remove this noise somehow. The reason is that the escape sequence \ and the double quote have meaning in a domain specific language (DSL) that we're using. Sometimes it's convenient to throw some syntax inline into a C# string.
What I'd like is a way to tell .net not to touch it. Perhaps some kind of catch all via the DLR?
Within a C# literal, there's nothing you can do - don't forget this is all done at compile-time.
If you don't use single quotes, you could always do:
var s = "123 ' 456 '".Replace("'", "\"");
(Or choose some other character you don't use much, and replace that afterwards instead.)
Other than that, avoiding storing lots of data in your source code helps a lot with this sort of thing - for test data, I often use an embedded resource and load that in at execution time.
I don't suppose you could just read them in from a file or database?
Yeah, there's definitely a way to do that, and I use it all the time for exactly that reason.
You create a string resource collection (open Project Properties, Resources, make sure it's on Strings) and put your literal strings in there. Then, when you need one of those strings, use the Properties.Resources.{insert string resource name} reference to collect it in a pure and unadulterated form!
For completeness, I'll mention that you can use hex in a C# string, so in this case, \x0022. Note that you can omit the leading 0's if the character immediately following isn't hex.

How do I use a regular expression to add linefeeds?

I have a really long string. I would like to add a linefeed every 80 characters. Is there a regular expression replacement pattern I can use to insert "\r\n" every 80 characters? I am using C# if that matters.
I would like to avoid using a loop.
I don't need to worry about being in the middle of a word. I just want to insert a linefeed exactly every 80 characters.
I don't know the exact C# names, but it should be something like
Regex.Replace(str, "(.{80})", "$1\r\n");
The idea is to grab 80 characters and save it in a group, then put it back in (I think "$1" is the right syntax) along with the "\r\n".
(Edit: The original regex had a + in it, which you definitely don't want. That would completely eliminate everything except the last line and any leftover pieces--a decidedly suboptimal result.)
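To see both behaviours side by side, here is the same replacement sketched in Python (where the group reference is written \1 instead of $1). The function names insert_linefeeds and insert_linefeeds_with_plus are made up for this example; the second one reproduces the mistaken pattern with the trailing + to show what it throws away.

```python
import re

def insert_linefeeds(text, width=80):
    # Capture exactly `width` characters and re-emit them followed by CRLF.
    pattern = "(.{" + str(width) + "})"
    return re.sub(pattern, "\\1\r\n", text)

def insert_linefeeds_with_plus(text, width=80):
    # Mistaken variant: the + repeats the group, so only the LAST 80-character
    # chunk is still captured when the replacement is written out.
    pattern = "(.{" + str(width) + "})+"
    return re.sub(pattern, "\\1\r\n", text)

sample = "a" * 80 + "b" * 80 + "c" * 40
print(insert_linefeeds(sample).split("\r\n"))
print(insert_linefeeds_with_plus(sample).split("\r\n"))
```

With the correct pattern the sample comes back as three lines (80 a's, 80 b's, 40 c's); with the + variant the a's are lost. Since . does not match a newline by default, text that already contains line breaks restarts the count at each one.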
Note that this way, you will most likely split inside words, so it might look pretty ugly.
You should be looking more into word wrapping if this is indeed supposed to be readable text. A little googling turned up a couple of functions; or if this is a text box, you can just turn on the WordWrap property.
Also, check out the .Net page at regular-expressions.info. It's by far the best reference site for regexes that I know of. (Jan Goyvaerts is on SO, but nobody told me to say that.)
