Is there a method to get escaped unicode characters from a string? - c#

I need to created a method which validates if a string contains an escaped unicode character. First thing that passed through my mind was to convert the string to char[] in order to get each element and then just to extract the "\uxxxx" from there. It didn't work. Here is a little thing:
public static string Quoted(string text)
=> $"\"{text}\"";
string inputData = Quoted(#"a \u26Be b");
Console.WriteLine(inputData) // ==> output will be: "a \u26Be b"
So there is obviously an large unicode character. I feel those quotes make my work harder. Also, I'm a begginer who just passed the alghoritm part of learning, now I'm learning and working with .Json, github, gitbash, etc. This task is one unit test in order to validate .Json file. Any help or hints would be appreciated. Thanks in advance!

Related

unicode to human readable string c# .net

This is probably a very basic question, but really appreciate if you could help me with this:
I want to convert an string that contains characters like \u000d\u000a\u000d\u000 to a human readable string, however I don't want to use .Replace method since the Unicode characters might be much more than what I include the software to check and replace.
string = "Test \u000d\u000a\u000d\u000aTesting with new line. \u000d\u000a\u000d\u000aone more new line"
I receive this string as a json Object from my server.
Do you even need that?
For example, the following code will print abc which is the actual decoded value:
var unicodeString = "\u0061\u0062\u0063";
Console.WriteLine(unicodeString);

How to arrange the matched output in c#?

i'm matching words to create simple lexical analyzer.
here is my example code and output
example code:
public class
{
public static void main (String args[])
{
System.out.println("Hello");
}
}
output:
public = identifier
void = identifier
main = identifier
class = identifier
as you all can see my output is not arranged as the input comes. void and main comes after class but in output the class comes at the end. i want to print result as the input is matched.
c# code:
private void button1_Click(object sender, EventArgs e)
{
if (richTextBox1.Text.Contains("public"))
richTextBox2.AppendText("public = identifier\n");
if (richTextBox1.Text.Contains("void"))
richTextBox2.AppendText("void = identifier\n");
if (richTextBox1.Text.Contains("class"))
richTextBox2.AppendText("class = identifier\n");
if (richTextBox1.Text.Contains("main"))
richTextBox2.AppendText("main = identifier\n");
}
Your code is asking the following qustions:
Does the input contain the text "public"? If so, write down "public = identifier".
Does the input contain the text "void"? If so, write down "void = identifier".
Does the input contain the text "class"? If so, write down "class = identifier".
Does the input contain the text "main"? If so, write down "main = identifier".
The answer to all of these questions is yes, and since they're executed in that exact order, the output you get should not be surprising. Note: public, void, class and main are keywords, not identifiers.
Splitting on whitespace?
So your approach is not going to help you tokenize that input. Something slightly more in the right direction would be input.Split() - that will cut up the input at whitespace boundaries and give you an array of strings. Still, there's a lot of whitespace entries in there.
input.Split(new char[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries) is a little better, giving us the following output: public, class, {, public, static, void, main, (String, args[]), {, System.out.println("Hello");, } and }.
But you'll notice that some of these strings contain multiple 'tokens': (String, args[]) and System.out.println("Hello");. And if you had a string with whitespace in it it would get split into multiple tokens. Apparently, just splitting on whitespace is not sufficient.
Tokenizing
At this point, you would start writing a loop that goes over every character in the input, checking if it's whitespace or a punctuation character (such as (, ), {, }, [, ], ., ;, and so on). Those characters should be treated as the end of the previous token, and punctuation characters should also be treated as a token of their own. Whitespace can be skipped.
You'll also have to take things like string literals and comments into account: anything in-between two double-quotes should not be tokenized, but be treated as part of a single 'string' token (including whitespace). Also, strings can contain escape sequences, such as \", that produce a single character (that double quote should not be treated as the end of the string, but as part of its content).
Anything that comes after two forward slashes should be ignored (or parsed as a single 'comment' token, if you want to process comments somehow), until the next newline (newline characters/sequences differ across operating systems). Anything after a /* should be ignored until you encounter a */ sequence.
Numbers can optionally start with a minus sign, can contain a dot (or start with a dot), a scientific notation part (e..), which can also be negative, and there are type suffixes...
In other words, you're writing a state machine, with different behaviour depending on what state you're in: 'string', 'comment', 'block comment', 'numeric literal', and so on.
Lexing
It's useful to assign a type to each token, either while tokenizing or as a separate step (lexing). public is a keyword, main is an identifier, 1234 is an integer literal, "Hello" is a string literal, and so on. This will help during the next step.
Parsing
You can now move on to parsing: turning a list of tokens into an abstract syntax tree (AST). At this point you can check if a list of tokens is actually valid code. You basically repeat the above step, but at a higher level.
For example, public, protected and private are keyword tokens, and they're all access modifiers. As soon as you encounter one of these, you know that either a class, a function, a field or a property definition must follow. If the next token is a while keyword, then you signal an error: public while is not a valid C# construct. If, however, the next token is a class keyword, then you know it's a class definition and you continue parsing.
So you've got a state machine once again, but this time you've got states like 'class definition', 'function definition', 'expression', 'binary expression', 'unary expression', 'statement', 'assignment statement', and so on.
Conclusion
This is by no means complete, but hopefully it'll give you a better idea of all the steps involved and how to approach this. There are also tools available that can generate parsing code from a grammar specification, which can ease the job somewhat (though you still need to learn how to write such grammars).
You may also want to read the C# language specification, specifically the part about its grammar and lexical structure. The spec can be downloaded for free from one of Microsofts websites.
CodeCaster is right. You are not on the right path.
I have an lexical analyzer made by me some time ago as a project.
I know, I know I'm not supposed to put things on a plate here, but the analyzer is for c++ so you'll have to change a few things.
Take a look at the source code and please try to understand how it works at least: C++ Lexical Analyzer
In the strictest sense, the reason for the described behaviour is that in the evaluating code, the search for void comes before the search for class. However, the approach in total seems far too simple for a lexical analysis, as it simply checks for substrings. I totally second the comments above; depending on what you are trying to achieve in the big picture, a more sophisticated approach might be necessary.

Regular Expression to match a quoted string embedded in another quoted string

I have a data source that is comma-delimited, and quote-qualified. A CSV. However, the data source provider sometimes does some wonky things. I've compensated for all but one of them (we read in the file line-by-line, then write it back out after cleansing), and I'm looking to solve the last remaining problem when my regex-fu is pretty weak.
Matching a Quoted String inside of another Quoted String
So here is our example string...
"foobar", 356, "Lieu-dit "chez Métral", Chilly, FR", "-1,000.09", 467, "barfoo", 1,345,456,235,231, "935.18"
I am looking to match the substring "chez Métral", in order to replace it with the substring chez Métral. Ideally, in as few lines of code as possible. The final goal is to write the line back out (or return it as a method return value) with the replacement already done.
So our example string would end up as...
"foobar", 356, "Lieu-dit chez Métral, Chilly, FR", "-1,000.09", 467, "barfoo", 1,345,456,235,231, "935.18"
I know I could define a pattern such as (?<quotedstring>\"\w+[^,]+\") to match quoted strings, but my regex-fu is weak (database developer, almost never use C#), so I'm not sure how to match another quoted string within the named group quotedstring.
FYI: For those noticing the large integer that is formatted with commas but not quote-qualified, that's already handled. As is the random use of row-delimiters (sometimes CR, sometimes LF). As other problems...
Replace with this regex
(?<!,\s*|^)"([^",]*)"
now replace it with $1
try it here
escaping " with "" it would become
(?<!,\s*|^)""([^"",]*)""

Convert string to char

I get from another class string that must be converted to char. It usually contains only one char and that's not a problem. But control chars i receive like '\\n' or '\\t'.
Is there standard methods to convert this to endline or tab char or i need to parse it myself?
edit:
Sorry, parser eat one slash. I receive '\\t'
I assume that you mean that the class that sends you the data is sending you a string like "\n". In that case you have to parse this yourself using:
Char.Parse(returnedChar)
Otherwise you can just cast it to a string like this
(string)returnedChar
New line:
string escapedNewline = #"\\n";
string cleanupNewLine = escapedNewline.Replace(#"\\n", Environment.NewLine);
OR
string cleanupNewLine = escapedNewline.Replace(#"\\n", "\n");
Tab:
string escapedTab = #"\\t";
string cleanupTab= escapedTab.Replace(#"\\t", "\t");
Note the lack of the literal string (i.e. i did not use #"\t" because that will not represent a Tab)
Alternatively you could consider Regular Expressions if you need to replace a range of different string patterns.
You should probably write a utility function to encapsulate the common behaviour above for all the possible Escape Sequences
Then you'd write some Unit Tests to cover each of the cases you can think of.
As you encounter any bugs you add more unit tests to cover those cases.
UPDATE
You could represent a tab in the XML with a special character sequence:
see this article
This article applies to SQL Server but may well be relevant to C# also?
To be absolutely sure, you could try generating a string with a tab in it and putting it into some XML (programmatically) and using XmlSerializer to serialize that to a file to see what the output is, then you can be sure that this will faithfully 'round-trip' the string with the tab still in it.
how about using string.ToCharArray()
You can then add the appropriate logic to process whatever was in the string.
char.parse(string); is used to convert string to char and you can do vice versa
char.tostring();
100% solved

Generate Public "ID Text" (like Stackoverflow)

I'm trying to create a simple method to turn a name (first name, last name, middle initial) into a public URL-friendly ID (like Stackoverflow does with question titles). Now people could enter all kinds of crazy characters, umlauts etc., is there something in .NET I can use to normalize it to URL-acceptable/english characters or do I need to write my own method to get this done?
Thank you!
Edit: An example (e.g. via RegEx or other way) would be super helpful!!! :)
Sounds like what you're after is a Slug Generator!
Simple method using UrlEncode
You obviously have to do something to deal with the collisions (prevent them on user creation being sensible but that means you are tied to this structure)
s => Regex.Replace(HttpUtility.UrlEncode(s), "%..", "")
This is relying on the output of UrlEncode always using two characters for the encoded form and that you are happy to have space convert to '+'
A regular expression to validate the string with the characters and lengths you wish to allow.
Think you'll have to write your own method...
Safelist of characters...
A to Z
For Each c As char In SafeList
If safe ... etc.
Next

Categories

Resources