Displaying special characters on html C# - c#

I've been trying to figure this out for quite a while now and I seem to be getting nowhere.
I've been trying to display special chars on a webpage but I'm having lots of trouble achieving my goal.
I'll use the † symbol as an example to explain my issue.
I'm loading the string from a database so the symbol is stored as \u0086
Let's use the word †est as an example.. When I load the string on CodeBehind I get "\u0086est" and on the webpage I get est instead of the correct symbol.
I've been trying to encode the string in multiple ways but it seems that I'm out of luck as I can't seem to get it to work.
The closest I got was using this:
System.Text.Encoding.GetEncoding(1252).GetString(HttpUtility.UrlDecodeToBytes(myString))
which returned †est ... However I'm not certain if it would be a good idea to specify the codepage like this as I'm loading the strings from a database and other special characters may appear as well.
It would be great if I could get some help.
EDIT:
I'm simply setting a label's text to the string I'm loading from the database.
myLabel.InnerText = myString;

Related

Detect non-displayable characters in string C#

Hopefully someone can help me with this, because I haven't found any solution online so far.
I am processing strings with special characters and I want to detect if any character in a string can't be displayed properly by for instance a webbrowser or even Visual Studio itself. The following string shows such characters. This comes from the Text vizualizer in VS2019:
TargetsforReduceCO
I've checked similar questions, but the answers were mostly limited to checking if the character code exceeds 255. However, there are lots of characters that can still be displayed, like Greek and Cyrillic symbols.
I also found this website that has an overview of all Unicode characters and show how they are displayed in the browser, but there doesn't seem to be any logic in which characters can't be displayed and their character code.
I can imagine that VS doesn't know which characters can't be displayed in various browsers, but I'm hoping that there is at least a way of checking if VS can display them.
Thanks in advance for your help!
Edit:
Right now I'm using
input.Any(c => !char.IsLetterOrDigit(c) && c > 255);
Because the input shouldn't normally contain other symbols than what you can usually find in a text, but I'm sure it will be triggered on symbols that can actually be displayed by VS or a webbrowser.
Type char has a number of static member methods like IsPunctuation() that should help you "categorize" character by character. See example on this page System.Char reference. Each of those methods' documentation explains what characters it applies to. As commenters have mentioned, your "displayable" criterion is more a font-presentation problem than a character value problem but you'll be able to narrow down what your system can work with using these methods. Look out for other methods like GetUnicodeCategory().
It may be that something as simple as !char.IsControl(c) will do the trick.
See similar Q&A here C# Printable Characters

How to use AntiXss with a Web API

This is a question that has been asked before, but I've not found the information I'm looking for or maybe I'm just missing the point so please bear with me. I can always adjust my question if I'm asking it the wrong way.
If for example, I have a POST endpoint that use a simply DTO object with 2 properties (i.e. companyRequestDto) and contains a script tag in one of its properties. When I call my endpoint from Postman I use the following:
{
"company": "My Company<script>alert(1);</script>",
"description": "This is a description"
}
When it is received by the action in my endpoint,
public void Post(CompanyRequestDto companyRequestDto)
my DTO object will automatically be set and its properties will be set to:
companyDto.Company = "My Brand<script>alert(1);</script>";
companyDto.Description = "This is a description";
I clearly don't want this information to be stored in our database as is, nor do I want it stored as an escaped string as displayed above.
1) Request: So my first question is how do I throw an error if the DTO posted contains some invalid content such as the tag?
I've looked at Microsoft AntiXss but I don't understand how to handle this as the data provided in the properties of a DTO object is not an html string but just a string, so What I am missing here as I don't understand how this is helping sanitizing or validating the passed data.
When I call
var test = AntiXss.AntiXssEncoder.HtmlEncode(companyRequestDto.Company, true);
It returns an encoded string, but then what??
Is there a way to remove disallowed keywords or just simply throw an error?
2) Response: Assuming 1) was not implemented or didn't work properly and it ended up being stored in our database, am I suppose to return encoded data as a json string, so instead of returning:
"My company"
Am I suppose to return:
"My Company<script>alert(1)</script>"
Is the browser (or whatever app) just supposed to display as below then?:
"My Company<script>alert(1)</script>"
3) Code: Assuming there is a way to sanitize or throw an error, should I use this at the property level using attribute on all the properties of my various DTO objects or is there a way to apply this at the class level using an attribute that will validate and/or sanitize all string properties of a DTO object for example?
I found interesting articles but none really answering my problems or I'm having other problems with some of the answers:
asp.net mvc What is the difference between AntiXss.HtmlEncode and HttpUtility.HtmlEncode?
Stopping XSS when using WebAPI (currently looking into this one but don't see how example is solving problem as property is always failing whether I use the script tag or not)
how to sanitize input data in web api using anti xss attack (also looking at this one but having a problem calling ReadFromStreamAsync from my project at work. Might be down to some of the settings in my web.config but haven't figured out why but it always seems to return an empty string)
Thanks.
UPDATE 1:
I've just finished going through the answer from Stopping XSS when using WebAPI
This is probably the closest one to what I am looking for. Except I don't want to encode the data, as I don't want to store it in my database, so I'll see if I can figure out how to throw an error but I'm not sure what the condition will be. Maybe I should just look for characters such as <, >, ; , etc... as these will not likely be used in any of our fields.
You need to consider where your data will be used when you think about encoding, so that data with in it is only a problem if it's rendered as HTML so if you are going to display data that has been provided by users anywhere, it's probably at the point you are going to display it that you would want to html encode it for display (you want to avoid repeatedly html encoding the same string when saving it for example).
Again, it depends what the response is going to be used for... you probably want to html encode it at the point it's going to be displayed... remember if you are encoding something in the response it may not match whats in data so if the calling code could do something like call your API to search for a company with that name that could cause problems. If the browser does display the html encoded version it might look ugly but it's better than users being compromised by XSS attacks.
It's quite difficult to sanitize text for things like tags if you allow most characters for normal use. It's easier if you can whitelist characters allowed and only allow, say, alphanumeric but that isn't often possible. This can be done using a regex validation attribute on the DTO object. The best approach I think is to encode values for display if you can't stop certain characters. It's really difficult to try to allow all characters but avoid things like as people can start using ascii characters etc.

replace a string if it Contains Format in c#

I´m developing a Quality Management software to show Errors of a Provisioning tool.
therefore I read all errors from an XML and group them
but there are errors like: "Forcedtime 1.08.2016 17:51:00 is in the past"
Is It possible to find Date-Values like this in a string and delete them ?
I can't work with a hard coded replace cause there are many different values for the Date Time.
Thanks for helping me
Yet another suggestion is to read the XML as such and discard the nodes you don't like/need. Then process only the nodes left.
In fact, while the Regex will do the trick for replacing something in the text, it might not be a good fit since this is structured data as opposed to bunch of data in a string format.
(0?[1-9]|[12][0-9]|3[01])\.(0?[1-9]|1[0-2])\.(\d{4}) (00|[0-9]|1[0-9]|2[0-3]):([0-9]|[0-5][0-9]):([0-9]|[0-5][0-9])
It checks for date time 1.08.2016 17:51:00.
You have to compare this date format in your string if matches then replace it
Try to learn reg exp

Find a specific string within another string without getting similar results

In a console application I get as input the RTF (Rich text Format) code of a file. The source is a database and data gathered via query.
My goal is to search whether in the input code, as string, is present the code: \par (end of carriage in RTF).
I tried with string.IndexOf and string.Contains but both returns me bad results since they match also code like: "\pard".
Given a string like:
{\rtf1\ansi\deff0{\fonttbl{\f0\fnil\fcharset0 Times New Roman;}
{\f1\fnil\fcharset0 MS Sans Serif;}
\deflang1033\pard\plain\tx0\f2\lang1033\fs20\cf1 Payment}
How can I build my condition so that it return false, since the string does not contain \par? Eventually how could I set a regex to say that exactly the keyword "\par" (so length 4 chars) and no other will match? Thanks.
EDIT: The language used is C# and I am developing the console application with VS 2010.
You don't tell us the language you are using, but generally you need a word boundary something like this:
\\par\b
to ensure that there is not a word character following

Removing <div>'s from text file?

Ive made a small program in C#.net which doesnt really serve much of a purpose, its tells you the chance of your DOOM based on todays news lol. It takes an RSS on load from the BBC website and will then look for key words which either increment of decrease the percentage chance of DOOM.
Crazy little project which maybe one day the classes will come uin handy to use again for something more important.
I recieve the RSS in an xml format but it contains alot of div tags and formatting characters which i dont really want to be in the database of keywords,
What is the best way of removing these unwanted characters and div's?
Thanks,
Ash
If you want to remove the DIV tags WITH content as well:
string start = "<div>";
string end = "</div>";
string txt = Regex.Replace(htmlString, Regex.Escape(start) + "(?<data>[^" + Regex.Escape(end) + "]*)" + Regex.Escape(end), string.Empty);
Input: <xml><div>junk</div>XXX<div>junk2</div></xml>
Output: <xml>XXX</xml>
IMHO the easiest way is to use regular expressions. Something like:
string txt = Regex.Replace(htmlString, #"<(.|\n)*?>", string.Empty);
Depending on which tags and characters you want to remove you will modify the regex, of course. You will find a lot of material on this and other methods if you do a web search for 'strip html C#'.
SO question Render or convert Html to ‘formatted’ Text (.NET) might help you, too.
Stripping HTML tags from a given string is a common requirement and you can probably find many resources online that do it for you.
The accepted method, however, is to use a Regular expression based Search and Replace. This article provides a good sample along with benchmarks. Another point worth mentioning is that you would require separate Regex based lookups for the different kinds of unwanted characters you are seeing. (Perhaps showing us an example of the HTML you receive would help)
Note that your requirements may vary based on which tags you want to remove. In your question, you only mention DIV tags. If that is the only tag you need to replace, a simple string search and replace should suffice.
A regular expression such as this:
<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>
Would highlight all HTML tags.
Use this to remove them form your data.

Categories

Resources