Fix html encoded text stored in the database - c#

I have a SQL Server db that has a table which stores a plain text value in an nvarchar column. Unfortunately there was a bug in the C# code that was running Encoder.HtmlEncode() on Chinese characters before inserting them into the table, e.g. a text value of 您好 is being stored in the table as &#x60A8;&#x597D;
Is there any way I can clean up this data using just T-SQL? This database is heavily locked down, so I can't easily run any code against it other than T-SQL.

One option is a lookup table.
You could create a table that maps each HTML entity to the character it encodes. As an example:
CREATE TABLE dbo.TempHost
(
Entity varchar(255),
[Character] nvarchar(255)
)
You can find the mapping data online (http://www.khngai.com/chinese/charmap/tbluni.php?page=0; export it as CSV or copy and paste it into Excel) and import it into the table. From there, all you need to do is scan the data, call REPLACE() for each entity found, and update the rows.

This is a fun challenge, and by fun I mean not really fun. T-SQL is quite bad at string manipulation. To make it even better, HTML entities actually encode a Unicode code point, and there is no simple way of converting that to a Unicode character in T-SQL.
Using a lookup table is probably the most viable method, in that it's likely to be more efficient than what I'm going to propose here: use a function to do the entity replacement. Warning: scalar-valued functions perform horribly in T-SQL and string manipulation is none too fast either. Nevertheless, I present this for, um, inspirational purposes:
CREATE FUNCTION dbo._ConvertEntities(@in NVARCHAR(MAX)) RETURNS NVARCHAR(MAX) AS BEGIN
WHILE 1 = 1 BEGIN;
-- locate the next hexadecimal entity, if any
DECLARE @entityStart INT = CHARINDEX('&#x', @in);
IF @entityStart = 0 BREAK;
DECLARE @entityEnd INT = CHARINDEX(';', @in, @entityStart)
DECLARE @entity VARCHAR(MAX) = SUBSTRING(@in, @entityStart + LEN('&#x'), @entityEnd - @entityStart - LEN('&#x'));
-- bail out on anything that is not exactly four hex digits
IF @entity NOT LIKE '[0-9A-F][0-9A-F][0-9A-F][0-9A-F]' RETURN @in;
DECLARE @entityChar NCHAR(1) = CONVERT(NCHAR(1), CONVERT(BINARY(2), REVERSE(CONVERT(BINARY(2), @entity, 2))));
SET @in = STUFF(@in, @entityStart, @entityEnd - @entityStart + 1, @entityChar);
END;
RETURN @in;
END;
Aside from performance issues, this function has the major shortcoming that it only works for entities of the form &#x????;, with ???? four hexadecimal digits. It fails quite badly for other entities (like those needing surrogates, those coded as decimal, or special entities like &quot;). I've made it bail out in this case. Although it's fairly easy to extend it to handle single-byte entities, extending it to >4 would be agony.
Realistically, you want to do this in client software using a real programming language. Even if the database is sufficiently locked down that you cannot directly execute queries, you are presumably able to query data if it's not too much, and you can insert data back using generated statements (a lot of them if need be). Terribly slow, but more or less viable.
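To make the client-side route concrete, here is a minimal sketch in Python (not the asker's C#, and not anything from the original answer). It assumes the rows have already been read as (id, text) pairs; the table name dbo.MyTable and the columns Id and TextValue are hypothetical. It decodes each value with the standard library's html.unescape and generates one UPDATE statement per row that actually changed.

```python
import html

def decode_entities(value: str) -> str:
    # html.unescape handles hexadecimal (&#x60A8;), decimal (&#24744;)
    # and named (&quot;) character references in one pass.
    return html.unescape(value)

def make_update(row_id: int, fixed: str) -> str:
    # Hypothetical table and column names. Single quotes are doubled so the
    # value is safe inside an N'...' literal.
    escaped = fixed.replace("'", "''")
    return f"UPDATE dbo.MyTable SET TextValue = N'{escaped}' WHERE Id = {row_id};"

# Sample rows standing in for data queried from the table.
rows = [
    (1, "&#x60A8;&#x597D;"),
    (2, "plain text"),
    (3, "it&#39;s &quot;ok&quot;"),
]

# Only rows whose text changes after decoding need an UPDATE.
statements = [
    make_update(row_id, decode_entities(text))
    for row_id, text in rows
    if decode_entities(text) != text
]

for statement in statements:
    print(statement)
```

Note that a value which legitimately contained an ampersand sequence before the bug would also be decoded, so it may be worth reviewing the generated statements before running them.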
For completeness, I also mention the option of running CLR code in SQL Server using CLR integration. This requires that the server already allows this or that you can reconfigure it to allow it (improbable if it's "heavily locked down"). The main reason this would be attractive is because it's definitely easier and faster to decode the entities in CLR code, and using CLR integration means you're not using client code (so the data doesn't leave the server). On the other hand, since you need administrative access to the machine to deploy the assembly, this would seem to be a theoretical advantage at best. As far as performance goes, though, it probably can't be beat.

You could take advantage of the fact that the stored entities all start with "&#x" and are eight characters long. You could loop through the table, cutting out the bad characters with something like the example below.
DECLARE @str VARCHAR(100)
SET @str = 'Hello &#x9836;&#x9834;World'
DECLARE @pos int SELECT @pos = CHARINDEX('&#x', @str)
WHILE @pos > 0
BEGIN
-- keep everything before the entity and everything after its 8 characters
SET @str = LEFT(@str, @pos - 1) + RIGHT(@str, LEN(@str) - @pos - 7)
SELECT @pos = CHARINDEX('&#x', @str)
END
SELECT @str

HTML encoding is not the same as XML encoding, but thanks to this question, I've realized there is an embarrassingly simple way of achieving this:
SELECT
REPLACE(
CONVERT(NVARCHAR(MAX),
CONVERT(XML,
REPLACE(REPLACE(_column_, '<', '&lt;'), '"', '&quot;')
)
),
'&lt;', '<'
)
Stick this in an UPDATE and you're done. Well, almost -- if the code contains non-XML escaped entities like &eacute;, you'd need to replace these separately. Also, we do need to dance around the issue of XML escaping (hence the &lt; replacing in case there's a < somewhere).
It may still need some refinement, but this sure looks a lot more promising than a scalar-valued function. :-)

Related

RegEx to extract parameters from a SQL stored procedure definition

I'm attempting to come up with a RegEx pattern to identify the string of parameters in a stored procedure definition. It's rather obtuse, considering all the various possibilities, but here's what I have so far (with global flags set for case insensitivity and single-line mode):
(?:create proc.*?)((?:[@]\w+\s+[A-Za-z0-9_()]+(?:\s*=\s*\S+)?(?:\s*,\s*)?)+)(?:.*?with\s+(?:native_compilation|schemabinding|execute\s+as\s+\S+))?(?:[\s\)]*?as)
I've set up several unit tests with a variety of stored proc definitions - with or without parameters, with or without defaults, etc. - and it seems to work in all cases except for one (that I've found so far):
CREATE PROCEDURE [dbo].[sp_creatediagram]
(
@diagramname sysname,
@owner_id int = null,
@version int,
@definition varbinary(max)
)
WITH EXECUTE AS 'dbo'
AS
BEGIN
set nocount on
...
Obviously I'd expect the first capture group to capture all four parameters...
@diagramname sysname,
@owner_id int = null,
@version int,
@definition varbinary(max)
...but for some reason, the RegEx search halts after the second parameter - notably the one that includes a default assignment - and doesn't proceed to capture the remaining two parameters. The first capture group ends up looking like this:
@diagramname sysname,
@owner_id int = null,
I won't be the least bit surprised to learn that I'm grossly overcomplicating this, but I do feel like I'm really close. I imagine there must be something about the way the RegEx engine works that I'm not quite understanding. Any help is hugely appreciated. Thanks very much in advance.

How to prevent Sql Injection in stored queries designed for later use? Automation Systems

I am working with a .NET web tool and our clients require some extremely large queries that are just plain not viable for our web servers to run. Our solution was to kick the query back to our automation system and feed the data back to the user when it's finished. The problem is, passing the query from the web tool to automation requires that we store the query as a string for later use. This does not allow us to use parameterized inputs.
What is best practice here? How can we scrub this data before running it? Obviously, the inputs should be validated initially, but I am looking for more of a catch-all solution.
Any help would be greatly appreciated!
There are some assumptions made in the question that need to be validated, such as the statement "we store the query as a string for later use. This does not allow us to use parameterized inputs." It is possible to use parameterized queries while still storing the complete SQL content as a string. Below is an example of a parameterized query:
declare @sql nvarchar(max)
declare @monthNo int = 2
declare @minAmount decimal = 100
set @sql = N'select column1, column2 from dbo.Mytable where mon = @monthNo and amount > @minAmount'
exec sp_executesql @sql, N'@monthNo int, @minAmount decimal', @monthNo, @minAmount
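The same idea can be sketched from the client side: store the SQL text (with placeholders) and the parameter values as separate fields, and bind them only when the job finally runs. The sketch below is in Python and uses an in-memory SQLite database purely as a stand-in for SQL Server, so the placeholder syntax (:monthNo) and the JSON job format are assumptions made for illustration, not part of the original answer.

```python
import json
import sqlite3

def make_job(sql, params):
    # What the web tool would persist: query text and values kept apart.
    return json.dumps({"sql": sql, "params": params})

def run_stored_job(conn, job_text):
    # What the automation system would do later: bind values at execution time.
    job = json.loads(job_text)
    return conn.execute(job["sql"], job["params"]).fetchall()

QUERY = ("SELECT column1, column2 FROM Mytable "
         "WHERE mon = :monthNo AND amount > :minAmount")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Mytable (column1 TEXT, column2 TEXT, mon INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO Mytable VALUES (?, ?, ?, ?)",
    [("a", "b", 2, 150.0), ("c", "d", 2, 50.0), ("e", "f", 3, 500.0)],
)

result = run_stored_job(conn, make_job(QUERY, {"monthNo": 2, "minAmount": 100}))

# A hostile value is treated as data, not as SQL: it simply matches no rows.
hostile = run_stored_job(
    conn, make_job(QUERY, {"monthNo": "2; DROP TABLE Mytable", "minAmount": 100})
)
remaining = conn.execute("SELECT COUNT(*) FROM Mytable").fetchone()[0]
print(result, hostile, remaining)
```

This only protects the values. If the users supply the query text itself, that text still has to be trusted or validated separately.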
If the above example is not a viable option, here is some syntax that rejects a string containing ';' or '--', which catches simple injection attempts:
-- check for sql injection
declare @sql nvarchar(max)
set @sql = N'My fancy sql query; truncate table dbo.VeryImportantTable'
IF CHARINDEX(';', ISNULL(@sql,'')) != 0 OR CHARINDEX('--', ISNULL(@sql,'')) != 0
BEGIN
RAISERROR('Invalid input parameter', 16, 1)
RETURN -1
END
Since there are no examples in the question of code or its implementation, the answer contains some degree of speculation.
Summary:
Just to make it clearer for me, here is my understanding:
1) You have users who provide large queries via a web tool.
2) Your web tool hands them to an automation system via SQL Server.
3) Your automation system runs them with dynamic SQL and feeds the results back to your users.
I assume that you double every single quote at step 2) in order to reduce your query to something "storable".
Solution:
I think it will be nearly impossible to protect yourself from your users at this point without changing your architecture. But if you really want to keep it, the simplest thing you can do is analyze the query at step 1) and store it without any edits at step 2).
Example:
I assume that you already have your own set of rules to analyze your query, so I will just talk about storing the query.
There are two ways to go. One is to store the query outside SQL Server, for example as a file (that is the safest solution and probably the worst in terms of performance).
The second is to replace the single quote character with a special replacement and convert it back at execution time (it's safe until someone learns your replacement character(s)).

SQL server: replace characters from string within a specified range

I have a question about replacing characters at a specific location in a string. My C# application can read files such as TXT, CSV, Excel, database, etc. and import them to a SQL Server on the network (the user is able to choose to which server we import the files). The application is meant to compare the two imported files with each other. To improve the comparison between the two tables I want to be able to replace some specific characters. To give you an example: one column in the first imported file has part numbers without any special characters. The second imported file also has a column with the same part numbers, only those part numbers are separated with a dot after every third character. To improve the search I remove ALL the dots from the second imported file. This can easily be done with a REPLACE statement. The query (which is executed from my C# application; I don't want to make a stored procedure because the server can be changed by the user's choice) looks like this:
UPDATE myTable SET myColumn = REPLACE(myColumn , '.', '');
This REPLACE statement works just fine. The hard part is what I want to achieve next. Let's say a column in the first imported file has part numbers that look like this:
132105213.000
452993424.001
436345332.002
etc...
And the second imported file has the same partnumbers only they look like this:
132.105.213.000
452.993.424.001
436.345.332.002
etc...
To improve the comparison between those two columns I only want to remove the FIRST TWO dots and leave the third dot. So the REPLACE should only be applied to characters 4 to 8. Is there any way to do this on the server side?
Some things to consider:
I don't want to use the STUFF function, because the character string on each row could be different and mess up the replacing.
The user specifies the range the REPLACE should be applied to, e.g. the user enters: replace from character position 4 to position 8.
Preferably without making a stored procedure, as the comparison server can differ by the user's choice.
Preferably from SQL Server 2005 and up. Absolute lowest version will be SQL Server 2008.
If more info is needed, place a comment below so I can edit my question!
I don't understand your aversion to STUFF; the following seems like it would work fine:
DECLARE @Start INT = 4,
@End INT = 9,
@Replace NCHAR(1) = '.';
SELECT s = STUFF(t.String,
@Start,
@End - @Start,
REPLACE(SUBSTRING(t.String, @Start, @End - @Start), @Replace, '')
)
FROM (VALUES
('132.105.213.000'),
('452.993.424.001'),
('436.345.332.002'),
('132105213.000'),
('452993424.001'),
('436345332.002')
) AS t (String);
Basically you are extracting the string between the specified characters (SUBSTRING(t.String, @Start, @End - @Start)), then performing the replace on this extract, and stuffing what is left back into the original string.
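To see the extract, replace and stuff-back steps outside of T-SQL, here is a small Python sketch of the same logic. The function name replace_in_range is made up for illustration; the start position is 1-based and inclusive and the end position is 1-based and exclusive, mirroring the start and end variables in the query above.

```python
def replace_in_range(s, start, end, old, new=""):
    # Split into the part before the range, the range itself, and the rest.
    head = s[:start - 1]
    middle = s[start - 1:end - 1]
    tail = s[end - 1:]
    # Only the middle slice is touched, like REPLACE over the SUBSTRING.
    return head + middle.replace(old, new) + tail

samples = ["132.105.213.000", "452.993.424.001", "132105213.000"]
cleaned = [replace_in_range(value, 4, 9, ".") for value in samples]
print(cleaned)
```

With positions 4 to 8 the dotted values lose their first two dots, while the already clean value is left alone because its only dot sits outside the range.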
Try this. It will leave the last period.
DECLARE @t table(val varchar(50))
INSERT @t values
('132.105.213.000'),
('452.993.424.001'),
('436.345.332.002'),
('123')
SELECT replace(left(val, len(val) - len(rightval)), '.', '') + rightval
FROM @t t
OUTER APPLY
(SELECT right(val, charindex('.',reverse(val))) rightval) x
The answers you provided helped me figure out the solution; at least I think it is a good solution. This is what the query looks like:
UPDATE myTable
SET myColumn = STUFF(myColumn, fromCharPosition, toCharPosition, REPLACE(SUBSTRING(myColumn, fromCharPosition, toCharPosition), charToReplace, charReplacement));
So the query will look like this for the example I made in my question:
UPDATE myTable
SET partNumber = STUFF(partNumber, 4, 8, REPLACE(SUBSTRING(partNumber, 4, 8), '.', ''));
Again thanks for helping out this trainee!

Using IN operator with Stored Procedure Parameter

I am building a website in ASP.NET 2.0. Some description of the page I am working on:
ListView displaying a table (of posts) from my access db, and a ListBox with Multiple select mode used to filter rows (by forum name, value=forumId).
I am converting the ListBox selected values into a List, then running the following query.
Parameter:
OleDbParameter("@Q",list.ToString());
Procedure:
SELECT * FROM sp_feedbacks WHERE forumId IN ([@Q])
The problem is, well, it doesn't work. Even when I run it from MSACCESS 2007 with the string 1,4, "1","4" or "1,4" I get zero results. The query works when only one forum is selected. (In (1) for instance).
SOLUTION?
So I guess I could use WHERE with many OR's but I would really like to avoid this option.
Another solution is to convert the DataTable into list then filter it using LINQ, which seems very messy option.
Thanks in advance,
BBLN.
I see 2 problems here:
1) list.ToString() doesn't do what you expect. Try this:
List<int> foo = new List<int>();
foo.Add(1);
foo.Add(4);
string x = foo.ToString();
The value of "x" will be "System.Collections.Generic.List`1[System.Int32]" not "1,4"
To create a comma separated list, use string.Join().
2) OleDbParameter does not understand arrays or lists. You have to do something else. Let me explain:
Suppose that you successfully use string.Join() to create the parameter. The resulting SQL will be:
SELECT * FROM sp_feedbacks WHERE forumId IN ('1,4')
The OLEDB provider knows that strings must have quotation marks around them. This is to protect you from SQL injection attacks. But you didn't want to pass a string: you wanted to pass either an array, or a literal unchanged value to go into the SQL.
You aren't the first to ask this question, but I'm afraid OLEDB doesn't have a great solution. If it were me, I would discard OLEDB entirely and use dynamic SQL. However, a Google search for "parameterized SQL array" resulted in some very good solutions here on Stack Overflow:
WHERE IN (array of IDs)
Passing an array of parameters to a stored procedure
Good Luck! Post which approach you go with!
When you have:
col in ('1,4')
This tests that col is equal to the string '1,4'. It is not testing for the values individually.
One way to solve this is using like:
where ',' & @Q & ',' like '*,' & col & ',*'
The idea is to add delimiters to each string. So, a value of "1" in the column becomes ",1,". A value of "1,4" for @Q becomes ",1,4,". Now when you do the comparison, there is no danger that "1" will match "10".
Note (for those who do not know). The wildcard for like is * rather than the SQL standard %. However, this might differ depending on how you are connecting, so use the appropriate wildcard.
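The delimiter trick is easy to check outside the database. This Python sketch (the helper name in_list is hypothetical) wraps both the list and the column value in commas before doing a substring test, which is what the LIKE expression above does.

```python
def in_list(csv_ids, col):
    # Wrap both sides in delimiters so only whole items can match.
    return f",{col}," in f",{csv_ids},"

# A naive substring test wrongly finds "1" inside "10,40".
naive_hit = "1" in "10,40"

print(in_list("1,4", 1), in_list("1,4", 4), in_list("10,40", 1), naive_hit)
```

Because the pattern has a wildcard on both sides, this kind of comparison generally cannot use an index on the column, so it fits small tables best.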
Passing such a condition to a query has always been a problem. To a stored procedure it is worse because you can't even adjust the query to suit. 2 options currently:
use a table valued parameter and pass in multiple values that way (a bit of a nuisance to be honest)
write a "split" multi-value function as either a UDF or via SQL/CLR and call that from the query
For the record, "dapper" makes this easy for raw commands (not sprocs) via:
int[] ids = ...
var list = conn.Query<Foo>(
"select * from Foo where Id in @ids",
new { ids } ).ToList();
It figures out how to turn that into parameters etc for you.
Just in case anyone is looking for an SQL Server Solution:
CREATE FUNCTION [dbo].[SplitString]
(
@Input NVARCHAR(MAX),
@Character CHAR(1)
)
RETURNS @Output TABLE (
Item NVARCHAR(1000)
)
AS BEGIN
DECLARE @StartIndex INT, @EndIndex INT
SET @StartIndex = 1
-- make sure the input ends with the delimiter so the last item is picked up
IF RIGHT(@Input, 1) <> @Character
BEGIN
SET @Input = @Input + @Character
END
WHILE CHARINDEX(@Character, @Input) > 0
BEGIN
SET @EndIndex = CHARINDEX(@Character, @Input)
INSERT INTO @Output(Item)
SELECT SUBSTRING(@Input, @StartIndex, @EndIndex - 1)
-- drop the item just emitted, together with its delimiter
SET @Input = SUBSTRING(@Input, @EndIndex + 1, LEN(@Input))
END
RETURN
END
Given an array of strings, I convert it to a comma-separated list of strings using the following code:
var result = string.Join(",", arr);
Then I can pass the parameter as follows:
Command.Parameters.AddWithValue("@Parameter", result);
Then, in the stored procedure definition, I use the parameter from above as follows:
select * from [dbo].[WhateverTable] where [WhateverColumn] in (select Item from dbo.SplitString(@Parameter, ','))

How to pass and Return a Array to stored procedure?

I would like to know how to pass an array to a stored procedure which in turn will return an array from c# .net?
There are different options here depending on the scenario. I'm using SQL Server for a lot of the examples below, but much of it is broadly transferable between systems.
For a relatively small array (ideally vector) you can construct a delimited string (tab delimited, comma delimited, whatever), and pass that into your DB and parse - usually manually (DBMS often lack a "split" routine), but it is very easy to obtain a pre-written "split" implementation (for example, as a UDF in SQL Server). Typical usage:
SELECT st.*
FROM dbo.SplitUDF(@myarg) udf
INNER JOIN SOME_TABLE st ON st.ID = udf.Value
Xml is another option, especially for complex data; SQL Server 2005 and above has inbuilt xml parsing, but that should not be assumed in general.
Table-valued parameters are another option, but they require SQL Server 2008 or later - it might well be what you are looking for, though.
Another option, especially for large data, is to pump the data into the server separately (bulk insert, SQLBulkCopy, "bcp", SSIS whatever) and process the data via SQL once it is there.
To get array/tabular data out, a standard SELECT should be your default option, although you can of course also construct xml or delimited character data. The latter can be accomplished via a quirk of SQL:
DECLARE @foo varchar(max)
SET @foo = ''
SELECT @foo = @foo + st.SomeColumn + '|' -- pipe-delimited, note trailing |
FROM SOME_TABLE st
