C# .NET isn't rendering Russian characters

I'm using regex to match a string of unicode and store it in a string. For example:
NOTE: The following content must be read from an outside text file or else visual studio will automagically render it into russian.
"Name": "\u0412\u0438\u043d\u043d\u0438\u0446\u0430, \u0443\u043b. \u041a\u0438\u0435\u0432\u0441\u043a\u0430\u044f, 14-\u0431",
I'm using the pattern:
"\"Name\":\\s*\"(?<match>[^\"]+)\""
However, when I store the match in a string, the string is saved as:
match = "\\u0412\\u0438\\u043d\\u043d\\u0438\\u0446\\u0430, \\u0443\\u043b. \\u041a\\u0438\\u0435\\u0432\\u0441\\u043a\\u0430\\u044f, 14-\\u0431"
.NET is storing the string with an extra "\"
I tried using:
match = match.Replace(@"\\", @"\");
but .NET doesn't recognize @"\\" as existing because it is looking at the 'visualizer version'.
How can I store my unicode without c# adding an extra '\'?
EDIT:
Another point:
// this works!
string russianCharacters = "\u041b\u044c\u0432\u043e\u0432, \u0414\u043e\u043b\u0438\u043d\u0430, \u0432\u0443\u043b. \u0427\u043e\u0440\u043d\u043e\u0432\u043e\u043b\u0430, 18";
This renders correctly in the visualizer as Russian characters. But when I store characters from a regex match FROM AN OUTSIDE TEXT FILE, they are stored as an escaped sequence.
How can I render my string as russian characters instead of an escaped sequence of unicode?

It seems you read the string from a text file that actually contains literal \uXXXX escape sequences (a backslash, a "u" and four hex digits), not the actual Unicode characters. That is, your C# variable looks like:
var match = "\\u0412\\u0438\\u043d\\u043d\\u0438\\u0446\\u0430, \\u0443\\u043b. \\u041a\\u0438\\u0435\\u0432\\u0441\\u043a\\u0430\\u044f, 14-\\u0431"
or
var match = @"\u0412\u0438\u043d\u043d\u0438\u0446\u0430, \u0443\u043b. \u041a\u0438\u0435\u0432\u0441\u043a\u0430\u044f, 14-\u0431"
In this case, to get the actual Unicode string, you need to use Regex.Unescape:
Converts any escaped characters in the input string.
C# demo:
var s = "\\u0412\\u0438\\u043d\\u043d\\u0438\\u0446\\u0430, \\u0443\\u043b. \\u041a\\u0438\\u0435\\u0432\\u0441\\u043a\\u0430\\u044f, 14-\\u0431";
Console.WriteLine(s);
// \u0412\u0438\u043d\u043d\u0438\u0446\u0430, \u0443\u043b. \u041a\u0438\u0435\u0432\u0441\u043a\u0430\u044f, 14-\u0431
Console.WriteLine(Regex.Unescape(s));
// Винница, ул. Киевская, 14-б
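To tie this back to the question, here is a minimal sketch that applies the asker's pattern to a line as it would appear in the text file and then decodes the captured value with Regex.Unescape. The class and method names (UnescapeDemo, ExtractName) are made up for illustration, and the sample line is shortened. Keep in mind that Regex.Unescape also converts other escapes such as \n and \t and throws an ArgumentException on an unrecognized escape sequence, so this assumes the captured text contains only escapes you want decoded.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UnescapeDemo
{
    // Hypothetical helper: captures the "Name" value with the question's pattern
    // and turns literal \uXXXX sequences into the characters they stand for.
    public static string ExtractName(string line)
    {
        Match m = Regex.Match(line, "\"Name\":\\s*\"(?<match>[^\"]+)\"");
        return m.Success ? Regex.Unescape(m.Groups["match"].Value) : null;
    }

    public static void Main()
    {
        // Verbatim string: the backslashes stay literal, as they would in the file.
        string line = @"""Name"": ""\u0412\u0438\u043d\u043d\u0438\u0446\u0430, 14-\u0431"",";
        Console.WriteLine(ExtractName(line));
    }
}
```

Whether the console shows Cyrillic correctly depends on its output encoding; the string itself holds the decoded characters either way.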

The extra '\' is just an escape character. I'm guessing you are viewing the value in the debugger window in which case it is showing the extra '\' but the underlying value will not have the extra '\'. Try using the actual value and you will see this.
This code works as expected:
var myString = "\"Name\": \"\u0412\u0438\u043d\u043d\u0438\u0446\u0430, \u0443\u043b.\u041a\u0438\u0435\u0432\u0441\u043a\u0430\u044f, 14 - \u0431\",";
var pattern = "\"Name\":\\s*\"(?<match>[^\"]+)\"";
Regex rgx = new Regex(pattern, RegexOptions.IgnoreCase);
MatchCollection matches = rgx.Matches(myString);
if (matches.Count > 0)
{
    foreach (Match match in matches)
    {
        var ma = System.Web.HttpUtility.HtmlDecode(match.ToString());
    }
}

Related

Use RegEx to uppercase and lowercase the string

I am trying to convert a string to uppercase and lowercase based on the index.
My string is a LanguageCode like cc-CC where cc is the language code and CC is the country code. The user can enter in any format like "cC-Cc". I am using the regular expression to match whether the data is in the format cc-CC.
var regex = new Regex("^[a-z]{2}-[A-Z]{2}$", RegexOptions.IgnoreCase);
// I could use the CultureInfo class from the .NET Framework to check whether the code is valid,
// but the requirement is that invalid language codes are also allowed as long as
// the entered code is in cc-CC format.
Now when the user enters something cC-Cc I'm trying to lowercase the first two characters and then uppercase last two characters.
I can split the string using - and then concatenate them.
var languageDetails = languageCode.Split('-');
var languageCodeUpdated = $"{languageDetails[0].ToLowerInvariant()}-{languageDetails[1].ToUpperInvariant()}";
I wondered whether I could avoid creating multiple strings and use the regex itself to uppercase and lowercase accordingly.
While searching, I found some solutions that use \L and \U, but I am not able to use them because the C# compiler shows an error. Also, Regex.Replace() has a MatchEvaluator delegate parameter which I don't understand.
Is there any way in C# to use a regex to replace uppercase with lowercase and vice versa?
.NET regex does not support case modifying operators.
You may use MatchEvaluator:
var result = Regex.Replace(s, @"(?i)^([a-z]{2})-([a-z]{2})$", m =>
    $"{m.Groups[1].Value.ToLower()}-{m.Groups[2].Value.ToUpper()}");
See the C# demo.
Details
(?i) - the inline version of the RegexOptions.IgnoreCase modifier
^ - start of the string
([a-z]{2}) - Capturing group #1: 2 ASCII letters
- - a hyphen
([a-z]{2}) - Capturing group #2: 2 ASCII letters
$ - end of string.
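The MatchEvaluator approach described above can be sketched as a small self-contained program. The names LanguageCodeDemo and Normalize are hypothetical, and the sketch uses ToLowerInvariant/ToUpperInvariant (as in the question's own Split version) so the result does not depend on the current culture.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LanguageCodeDemo
{
    // Hypothetical helper: returns the code normalized to cc-CC.
    // Input that does not match the pattern is returned unchanged.
    public static string Normalize(string code)
    {
        return Regex.Replace(code, @"(?i)^([a-z]{2})-([a-z]{2})$", m =>
            m.Groups[1].Value.ToLowerInvariant() + "-" + m.Groups[2].Value.ToUpperInvariant());
    }

    public static void Main()
    {
        Console.WriteLine(Normalize("cC-Cc"));   // cc-CC
        Console.WriteLine(Normalize("EN-us"));   // en-US
        Console.WriteLine(Normalize("english")); // english (no match, left as-is)
    }
}
```

The lambda is the MatchEvaluator: Regex.Replace calls it once per match and inserts whatever string it returns.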
TLDR: This is Regex.Replace with \U and \L support.
private static string EnhancedReplace(string input, string pattern, string replacement, RegexOptions options)
{
    replacement = Regex.Replace(replacement, @"(?<mode>\\[UL])(?<group>\$((\d+)|({[^}]+})))", @"<!<mode:${mode}>%&${group}&%>");
    var output = Regex.Replace(input, pattern, replacement, options);
    output = Regex.Replace(output, @"<!<mode:\\L>%&(?<value>[\w\W]*?)&%>", x => x.Groups["value"].Value.ToLower());
    output = Regex.Replace(output, @"<!<mode:\\U>%&(?<value>[\w\W]*?)&%>", x => x.Groups["value"].Value.ToUpper());
    return output;
}
How To Use
Call the function with \U followed by the group to be uppercase
var result = EnhancedReplace(input, @"(public \w+ )(\w)", @"$1\U$2", RegexOptions.None);
Will replace this:
public string test12 { get; set; } = "test3";
With that:
public string Test12 { get; set; } = "test3";
Details
I'm currently working on an app which allows the user to define a batch of Regex Replace operations.
For example the user enters json and the batch converts it to a C#-Class.
Therefore, speed is no key requirement. But it would be very handy to be able to use \U and \L.
This method will apply Regex.Replace 3 times to the whole content and one time to the replacement string. Therefore it’s at least three times slower than Regex.Replace without \U \L support.
Step by Step
1. The first Regex.Replace enhances the replacement string.
It replaces: \U$1 with <!<mode:\U>%&$1&%>
(Also works for named groups: ${groupName})
2. The new replacement is applied to the content.
3. & 4. The inserted placeholder is now relatively unique. That allows you to search only for <!<mode:\U>%&Actual Value&%> (written as <!<mode:\\U>... in the regex pattern, where the backslash must be escaped) and use the MatchEvaluator to replace it with its uppercase version. The same is done for \L.
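To make the steps concrete, the following sketch prints the intermediate strings for the "How To Use" example. It reuses the patterns from EnhancedReplace; the class name PlaceholderTrace and the helper Enhance are hypothetical, and only the \U branch is traced.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PlaceholderTrace
{
    // Step 1: wrap every \U$n or \L$n in the replacement string in a placeholder.
    public static string Enhance(string replacement)
    {
        return Regex.Replace(replacement,
            @"(?<mode>\\[UL])(?<group>\$((\d+)|({[^}]+})))",
            @"<!<mode:${mode}>%&${group}&%>");
    }

    // Steps 2 to 4 for the \U case only.
    public static string ReplaceUpper(string input, string pattern, string replacement)
    {
        string withPlaceholders = Regex.Replace(input, pattern, Enhance(replacement));
        return Regex.Replace(withPlaceholders, @"<!<mode:\\U>%&(?<value>[\w\W]*?)&%>",
            x => x.Groups["value"].Value.ToUpper());
    }

    public static void Main()
    {
        string enhanced = Enhance(@"$1\U$2");
        Console.WriteLine(enhanced);
        // $1<!<mode:\U>%&$2&%>

        string step2 = Regex.Replace("public string test12", @"(public \w+ )(\w)", enhanced);
        Console.WriteLine(step2);
        // public string <!<mode:\U>%&t&%>est12

        Console.WriteLine(ReplaceUpper("public string test12", @"(public \w+ )(\w)", @"$1\U$2"));
        // public string Test12
    }
}
```

The placeholder works only as long as the input never contains the marker text itself, which is why the answer calls it "relatively unique".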
Regex101 Demo:
Step 1: Enhance pattern with placeholder
https://regex101.com/r/ZtqigN/1
Step 2 Use new replacement pattern
https://regex101.com/r/PWLTFD/1
Step 3&4 Resolve new placeholders
https://regex101.com/r/5DIIUo/1
Answer
var result = EnhancedReplace(input, @"(cc)(-)(cc)", @"\L$1$2\U$3", RegexOptions.IgnoreCase);

C# - Save text into variable

I want to save an e-mail-address out of a .txt-file into a string variable. This is my code:
String path = "C:\\Users\\test.txt";
string from;
var fro = new Regex("from: (?<fr>)");
using (var reader = new StreamReader(File.OpenRead(@path)))
{
    while (true)
    {
        var nextLine = reader.ReadLine();
        if (nextLine == null)
            break;
        var matchb = fro.Match(nextLine);
        if (matchb.Success)
        {
            from = matchb.Groups["fr"].Value;
            Console.WriteLine(from);
        }
    }
}
I know that matchb.Success is true, however from won't be displayed correctly. I'm afraid it has something to do with the escape sequence, but I was unable to find anything helpful on the internet.
The textfile might look like this:
LOG 00:01:05 processID=123456-12345 from: test@test.org
LOG 00:01:06 processID=123456-12345 OK
Your (?<fr>) pattern defines a named group "fr" that matches an empty string.
To fill the group with some value you need to define the group pattern.
If you plan to match the rest of the line, you may use .*. To match a sequence of non-whitespace chars, use \S+. To match a sequence of non-whitespace chars that has a @ inside, use \S+@\S+. All three approaches will work for the current scenario.
In C#, it will look like
var fro = new Regex(@"from: *(?<fr>\S+@\S+)");
Note that @"..." is a verbatim string literal where a single backslash defines a literal backslash, so you do not have to double it. I also suggest using the * quantifier to match 0 or more spaces before the email. You might want to use \s* (to match any 0+ whitespace chars) or [\p{Zs}\t]* (to match only horizontal whitespace chars) instead.
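As a quick check, here is a minimal sketch that runs the corrected pattern over the two sample log lines, held in an array instead of being read from a file. The names FromAddressDemo and ExtractFrom are hypothetical, and the sample address is assumed to be test@test.org.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FromAddressDemo
{
    private static readonly Regex From = new Regex(@"from: *(?<fr>\S+@\S+)");

    // Hypothetical helper: returns the address after "from:", or null if the line has none.
    public static string ExtractFrom(string line)
    {
        Match m = From.Match(line);
        return m.Success ? m.Groups["fr"].Value : null;
    }

    public static void Main()
    {
        string[] lines =
        {
            "LOG 00:01:05 processID=123456-12345 from: test@test.org",
            "LOG 00:01:06 processID=123456-12345 OK",
        };
        foreach (string line in lines)
        {
            string address = ExtractFrom(line);
            if (address != null)
                Console.WriteLine(address); // test@test.org (first line only)
        }
    }
}
```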

Regex to find next word (which contains special character) after given word

I am facing problem with writing REGEX to get desired output from a string.
I have a string like string simpleInput = @"Website address www.yahoo[mail].com AND Following is the";
I want to specify the word "address" and get the next word after it as the result, i.e. "www.yahoo[mail].com".
I have written the following piece of code.
string pattern = @"address (?<after>\w+)";
MatchCollection matches = Regex.Matches(simpleInput, pattern, RegexOptions.Multiline | RegexOptions.IgnoreCase);
string nextWord = string.Empty;
foreach (Match match in matches)
{
nextWord = match.Groups["after"].ToString();
}
Console.WriteLine("Word is: " + nextWord );
This gives me output as:
Word is: www
Whereas I expect the output to be www.yahoo[mail].com
Can anyone please help?
I tried \D+, but that matches everything up to the end of the string, so additional text like "AND Following is the" also ends up in the result,
whereas I just want the single word "www.yahoo[mail].com"
\w+ doesn't match . or some other characters in the string you want to match. Try using \S+ instead which means non-space characters:
string pattern = @"address (\S+)";
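A minimal runnable sketch of the \S+ suggestion follows. It keeps the question's named group "after" so the existing Groups["after"] lookup still works; the class and method names (NextWordDemo, WordAfter) and the use of Regex.Escape on the keyword are illustrative additions.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NextWordDemo
{
    // Hypothetical helper: returns the run of non-whitespace characters that
    // follows the keyword, or an empty string if the keyword is not found.
    public static string WordAfter(string input, string keyword)
    {
        Match m = Regex.Match(input, Regex.Escape(keyword) + @"\s+(?<after>\S+)",
            RegexOptions.IgnoreCase);
        return m.Success ? m.Groups["after"].Value : string.Empty;
    }

    public static void Main()
    {
        string simpleInput = @"Website address www.yahoo[mail].com AND Following is the";
        Console.WriteLine("Word is: " + WordAfter(simpleInput, "address"));
        // Word is: www.yahoo[mail].com
    }
}
```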

regex to split line (csv file)

I am not good at regex. Can someone help me write a regex for this?
I may have values like this while reading csv file.
"Artist,Name",Album,12-SCS
"val""u,e1",value2,value3
Output:
Artist,Name
Album
12-SCS
Val"u,e1
Value2
Value3
Update:
I like the idea of using the OleDb provider. We have a file upload control on the web page, and I read the content of the file using a stream reader without actually saving the file on the file system. Is there any way I can use the OleDb provider, given that the file name has to be specified in the connection string and in my case the file is not saved on the file system?
Just adding the solution I worked on this morning.
var regex = new Regex("(?<=^|,)(\"(?:[^\"]|\"\")*\"|[^,]*)");
foreach (Match m in regex.Matches("<-- input line -->"))
{
var s = m.Value;
}
As you can see, you need to call regex.Matches() per line. It will then return a MatchCollection with the same number of items you have as columns. The Value property of each match is, obviously, the parsed value.
This is still a work in progress, but it happily parses CSV strings like:
2,3.03,"Hello, my name is ""Joshua""",A,B,C,,,D
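For reference, a sketch of how that pattern might be wrapped for the question's sample lines. The regex returns quoted fields with their surrounding quotes, so the sketch adds an unquoting step (strip the outer quotes, turn "" into ") that is not part of the answer above; the names CsvRegexDemo and ParseLine are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CsvRegexDemo
{
    // Same pattern as above: a field starts at the line start or after a comma and is
    // either a quoted string (with "" as an escaped quote) or a run of non-commas.
    private static readonly Regex Field = new Regex("(?<=^|,)(\"(?:[^\"]|\"\")*\"|[^,]*)");

    // Hypothetical helper: parses one CSV line into unquoted field values.
    public static List<string> ParseLine(string line)
    {
        var fields = new List<string>();
        foreach (Match m in Field.Matches(line))
        {
            string s = m.Value;
            if (s.Length >= 2 && s[0] == '"' && s[s.Length - 1] == '"')
                s = s.Substring(1, s.Length - 2).Replace("\"\"", "\"");
            fields.Add(s);
        }
        return fields;
    }

    public static void Main()
    {
        foreach (string field in ParseLine("\"Artist,Name\",Album,12-SCS"))
            Console.WriteLine(field);
        foreach (string field in ParseLine("\"val\"\"u,e1\",value2,value3"))
            Console.WriteLine(field);
    }
}
```

This handles one line at a time, so quoted fields that contain line breaks are out of scope for the sketch.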
Actually, it's pretty easy to match CSV lines with a regex. Try this one out:
StringCollection resultList = new StringCollection();
try {
    Regex pattern = new Regex(@"
        # Parse CSV line. Capture next value in named group: 'val'
        \s*                      # Ignore leading whitespace.
        (?:                      # Group of value alternatives.
          ""                     # Either a double quoted string,
          (?<val>                # Capture contents between quotes.
            [^""]*(""""[^""]*)*  # Zero or more non-quotes, allowing
          )                      # doubled "" quotes within string.
          ""\s*                  # Ignore whitespace following quote.
        | (?<val>[^,]*)          # Or... zero or more non-commas.
        )                        # End value alternatives group.
        (?:,|$)                  # Match end is comma or EOS",
        RegexOptions.Multiline | RegexOptions.IgnorePatternWhitespace);
    Match matchResult = pattern.Match(subjectString);
    while (matchResult.Success) {
        resultList.Add(matchResult.Groups["val"].Value);
        matchResult = matchResult.NextMatch();
    }
} catch (ArgumentException ex) {
    // Syntax error in the regular expression
}
Disclaimer: The regex has been tested in RegexBuddy, (which generated this snippet), and it correctly matches the OP test data, but the C# code logic is untested. (I don't have access to C# tools.)
Regex is not the suitable tool for this. Use a CSV parser. Either the builtin one or a 3rd party one.
Give the TextFieldParser class a look. It's in the Microsoft.VisualBasic assembly and does delimited and fixed width parsing.
Give CsvHelper a try (a library I maintain). It's available via NuGet.
You can easily read a CSV file into a custom class collection. It's also very fast.
var streamReader = // Create a StreamReader to your CSV file
var csvReader = new CsvReader( streamReader );
var myObjects = csvReader.GetRecords<MyObject>();
Regex might get overly complex here. Split the line on commas, and then iterate over the resultant bits and concatenate them where "the number of double quotes in the concatenated string" is not even.
"hello,this",is,"a ""test"""
...split...
"hello | this" | is | "a ""test"""
...iterate and merge 'til you've an even number of double quotes...
"hello,this" - even number of quotes (note comma removed by split inserted between bits)
is - even number of quotes
"a ""test""" - even number of quotes
...then strip off the leading and trailing quote if present and replace "" with ".
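The split-and-merge idea above can be sketched roughly as follows. This is one possible reading of the description, not a tested CSV parser; the names SplitMergeCsv and SplitCsvLine are hypothetical, and a line with unbalanced quotes simply keeps its unfinished tail as a raw field.

```csharp
using System;
using System.Collections.Generic;

public static class SplitMergeCsv
{
    private static int CountQuotes(string s)
    {
        int n = 0;
        foreach (char c in s)
            if (c == '"') n++;
        return n;
    }

    // Strips one pair of surrounding quotes and collapses "" to ".
    private static string Unquote(string s)
    {
        if (s.Length >= 2 && s[0] == '"' && s[s.Length - 1] == '"')
            return s.Substring(1, s.Length - 2).Replace("\"\"", "\"");
        return s;
    }

    // Hypothetical helper: split on commas, then merge pieces until the
    // accumulated text contains an even number of double quotes.
    public static List<string> SplitCsvLine(string line)
    {
        var fields = new List<string>();
        string pending = null;
        foreach (string piece in line.Split(','))
        {
            // Re-insert the comma that Split removed when continuing a quoted field.
            pending = pending == null ? piece : pending + "," + piece;
            if (CountQuotes(pending) % 2 == 0)
            {
                fields.Add(Unquote(pending));
                pending = null;
            }
        }
        if (pending != null)
            fields.Add(pending); // unbalanced quotes: keep the raw remainder
        return fields;
    }

    public static void Main()
    {
        string line = "\"hello,this\",is,\"a \"\"test\"\"\"";
        Console.WriteLine(string.Join(" | ", SplitCsvLine(line)));
        // hello,this | is | a "test"
    }
}
```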
It could be done using below code:
using System.IO;
using Microsoft.VisualBasic.FileIO;

// Quotes inside the C# literal are escaped with \"; quotes inside a quoted CSV field are doubled ("").
string csv = "1,2,3,\"4,3\",\"a,\"\"b\"\",c\",end";
TextFieldParser parser = new TextFieldParser(new StringReader(csv));
// To read from a file instead:
// TextFieldParser parser = new TextFieldParser("csvfile.csv");
parser.HasFieldsEnclosedInQuotes = true;
parser.SetDelimiters(",");
string[] fields = null;
while (!parser.EndOfData)
{
    fields = parser.ReadFields();
}
parser.Close();

C# Regex.Split - Subpattern returns empty strings

Hey, first time poster on this awesome community.
I have a regular expression in my C# application to parse an assignment of a variable:
NewVar = 40
which is entered in a Textbox. I want my regular expression to return (using Regex.Split) the name of the variable and the value, pretty straightforward. This is the Regex I have so far:
var r = new Regex(@"^(\w+)=(\d+)$", RegexOptions.IgnorePatternWhitespace);
var mc = r.Split(command);
My goal was to do the trimming of whitespace in the Regex and not use the Trim() method of the returned values. Currently, it works but it returns an empty string at the beginning of the MatchCollection and an empty string at the end.
Using the above input example, this is what's returned from Regex.Split:
mc[0] = ""
mc[1] = "NewVar"
mc[2] = "40"
mc[3] = ""
So my question is: why does it return an empty string at the beginning and the end?
Thanks.
The reason Regex.Split is returning four values is that you have exactly one match, so Regex.Split is returning:
All the text before your match, which is ""
All () groups within your match, which are "NewVar" and "40"
All the text after your match, which is ""
Regex.Split's primary purpose is to extract the text between matches of the regex; for example, you could use Regex.Split with a pattern of "[,;]" to split text on either commas or semicolons. In .NET Framework 1.0 and 1.1, Regex.Split only returned the split values, in this case "" and "", but in .NET Framework 2.0 it was modified to also include values matched by () within the Regex, which is why you are seeing "NewVar" and "40" at all.
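A small sketch may make the difference visible. It joins the pieces returned by Regex.Split with "|" so that empty strings show up; the class name SplitBehaviorDemo and the helper Joined are hypothetical, and the input is written without spaces so that the question's pattern matches.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SplitBehaviorDemo
{
    // Hypothetical helper: shows the array returned by Regex.Split as one string.
    public static string Joined(string input, string pattern)
    {
        return string.Join("|", Regex.Split(input, pattern));
    }

    public static void Main()
    {
        // The whole input is one match: empty text before it, the two groups, empty text after it.
        Console.WriteLine(Joined("NewVar=40", @"^(\w+)=(\d+)$")); // |NewVar|40|

        // No capturing groups: only the text between delimiters is returned.
        Console.WriteLine(Joined("a,b;c", "[,;]"));               // a|b|c

        // With a capturing group, the captured delimiters are returned as well.
        Console.WriteLine(Joined("a,b;c", "([,;])"));             // a|,|b|;|c
    }
}
```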
What you were looking for is Regex.Match, not Regex.Split. It will do exactly what you want:
var r = new Regex(@"^(\w+)=(\d+)$");
var match = r.Match(command);
var varName = match.Groups[1].Value;   // Groups[0] is the entire match
var valueText = match.Groups[2].Value;
Note that RegexOptions.IgnorePatternWhitespace means you can include extra spaces in your pattern - it has nothing to do with the matched text. Since you have no extra whitespace in your pattern it is unnecessary.
From the docs, Regex.Split() uses the regular expression as the delimiter to split on. It does not split the captured groups out of the input string. Also, IgnorePatternWhitespace ignores unescaped whitespace in your pattern, not in the input.
Instead, try the following:
var r = new Regex(@"\s*=\s*");
var mc = r.Split(command);
Note that the whitespace is actually consumed as a part of the delimiter.
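A minimal sketch of this delimiter-based split, using the question's sample input. The names AssignmentSplitDemo and SplitAssignment are hypothetical. Only the whitespace around "=" is consumed, so leading or trailing whitespace of the whole command would remain in the parts.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AssignmentSplitDemo
{
    private static readonly Regex Delimiter = new Regex(@"\s*=\s*");

    // Hypothetical helper: splits "name = value" on "=" plus any surrounding whitespace.
    public static string[] SplitAssignment(string command)
    {
        return Delimiter.Split(command);
    }

    public static void Main()
    {
        string[] parts = SplitAssignment("NewVar = 40");
        Console.WriteLine(parts.Length); // 2
        Console.WriteLine(parts[0]);     // NewVar
        Console.WriteLine(parts[1]);     // 40
    }
}
```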
