UTF-8 escape sequence as string: surely a better way - c#

Reviewing some old code of mine, and wondered if there was a better way to create a literal string with unicode symbols...
I have a REST interface that requires certain escaped characters; for example, a property called username with value of john%foobar+Smith that must be requested like this:
{"username":"john\u0025foobar\u002bSmith"}
My c# method to replace certain characters like % and + is pretty basic:
public static string EncodeUTF8(string unescaped) {
string utf8_ampersand = @"\u0026";
string utf8_percent = @"\u0025";
string utf8_plus = @"\u002b";
return unescaped.Replace("&", utf8_ampersand).Replace("+", utf8_plus).Replace("%", utf8_percent);
}
This seems an antiquated way to do this; surely there is some single line method using Encoding that would output literal UTF code, but I can't find any examples that aren't essentially replace statements like mine... is there a better way?

You could do it with Regex:
static readonly Regex ReplacerRegex = new Regex("[&+%]");
public static string Replace(Match match)
{
// 4-digits hex of the matched char
return @"\u" + ((int)match.Value[0]).ToString("x4");
}
public static string EncodeUTF8(string unescaped)
{
return ReplacerRegex.Replace(unescaped, Replace);
}
But I wouldn't recommend it unless you have dozens of characters to replace. I expect it to be slower than chained Replace calls, and it takes more code to write.
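A middle ground between chained Replace calls and a regex is a single pass with a StringBuilder. The following is only a sketch: the class name JsonEscapeSketch and the ToEscape set are made up for illustration, and the set holds just the three characters mentioned in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class JsonEscapeSketch
{
    // Hypothetical set of characters to escape; taken from the question (&, +, %).
    private static readonly HashSet<char> ToEscape = new HashSet<char> { '&', '+', '%' };

    public static string EncodeUTF8(string unescaped)
    {
        var sb = new StringBuilder(unescaped.Length);
        foreach (char c in unescaped)
        {
            if (ToEscape.Contains(c))
            {
                // Emit a literal backslash-u followed by the 4-digit hex code of the char.
                sb.Append(@"\u").Append(((int)c).ToString("x4"));
            }
            else
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(EncodeUTF8("john%foobar+Smith"));
        // prints: john\u0025foobar\u002bSmith
    }
}
```

Adding another character to escape then means adding one entry to the set rather than another Replace call.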

Related

How to ignore case sensitive in this xpath c# selenium

I have a simple xpath:
driver.FindElement(By.XPath("//li[contains(text(), 'chain')]")).Click();
This code works, but it does not match chain when it is written in uppercase. How can I make this XPath case-insensitive?
You can use the contains and translate functions together like this:
//li[contains(translate(text(), 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'chain')]
Use translate, but to make it shorter you can translate only the characters you are looking for:
//li[contains(translate(text(), 'CHAIN', 'chain'), 'chain')]
If you are going to use this a lot, you may even write a method for that. I'll write in Java (sorry, not familiar with C#):
public static void main(String[] args) {
String l = "//li[" + containsTextIgnoringCase("StackOverflow") + "]";
System.out.println(l);
}
public static String containsTextIgnoringCase(String s) {
// Compare against the lower-cased text, since translate() lower-cases the node text
return String.format("contains(translate(text(), '%s', '%s'), '%s')", s.toUpperCase(), s.toLowerCase(), s.toLowerCase());
}
output:
//li[contains(translate(text(), 'STACKOVERFLOW', 'stackoverflow'), 'stackoverflow')]
There is some room for optimization (not sure it is worth the effort, but still): translate only the unique characters, and handle quotes if they appear in the string passed in.
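Since the question is about C#, here is a rough C# counterpart of the Java helper above. It is a sketch only: the names XPathHelper and ContainsTextIgnoringCase are made up for illustration, and it assumes the search text contains no single quotes.

```csharp
using System;

public static class XPathHelper
{
    // Builds a contains(translate(...)) predicate that matches the text regardless of case.
    // Assumes s contains no single quotes; those would need separate handling.
    public static string ContainsTextIgnoringCase(string s)
    {
        return string.Format(
            "contains(translate(text(), '{0}', '{1}'), '{1}')",
            s.ToUpperInvariant(),
            s.ToLowerInvariant());
    }

    public static void Main()
    {
        Console.WriteLine("//li[" + ContainsTextIgnoringCase("StackOverflow") + "]");
        // prints: //li[contains(translate(text(), 'STACKOVERFLOW', 'stackoverflow'), 'stackoverflow')]
    }
}
```

The resulting string can be passed to By.XPath in the same way as the hand-written expression.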
Assign the value chain to a separate string variable and then call .ToUpper() on it.
string strChain = "chain";
string getUpper = strChain.ToUpper();
It will give you chain in uppercase.
You can use it directly inside driver.findelement.

Assess if a c# string is a single Emoji OR an Emoji ZWJ Sequence?

What would be a way to tell if a C# string is a single Emoji, or a valid Emoji ZWJ Sequence?
I would like to basically be able to find any Emoji from the official unicode list, http://www.unicode.org/reports/tr51/tr51-15.html#emoji_data
I don't seem to find a nuget package for this, and most SO questions don't seem to be easily applicable to my case (i.e. Is there a way to check if a string in JS is one single emoji? )
I ended up using Unicode regex, which are partially implemented in .NET.
Using this question (C# - Regular expression to find a surrogate pair of a unicode codepoint from any string?), I came up with the following.
Regex
//Returns the Emoji
@"([\uD800-\uDBFF][\uDC00-\uDFFF]\p{M}*){1,5}|\p{So}"
//Returns true if the string is a single Emoji
@"^(?>(?>[\uD800-\uDBFF][\uDC00-\uDFFF]\p{M}*){1,5}|\p{So})$"
Tests
public class EmojiTests
{
private static readonly Regex IsEmoji = new Regex(@"^(?>(?>[\uD800-\uDBFF][\uDC00-\uDFFF]\p{M}*){1,5}|\p{So})$", RegexOptions.Compiled);
[Theory]
[InlineData("⭐")]
[InlineData("😁")]
[InlineData("🃏")]
[InlineData("🏴")]
[InlineData("👪🏿")]
[InlineData("🤌")]//pinched fingers, coming soon :p
public void ValidEmojiCases(string input)
{
Assert.Matches(IsEmoji, input);
}
[Theory]
[InlineData("")]
[InlineData(":p")]
[InlineData("a")]
[InlineData("<")]
[InlineData("⭐⭐")]
[InlineData("🃏a")]
[InlineData("‼️")]
[InlineData("↔️")]
public void InvalidEmojiCases(string input)
{
Assert.DoesNotMatch(IsEmoji, input);
}
}
It is not perfect (i.e. returns true for "™️", false for "◻️"), but that will do.

Casting HexNumber as character to string

I need to process a numeral as a string.
My value is 0x28 and this is the ascii code for '('.
I need to assign this to a string.
The following lines do this.
char c = (char)0x28;
string s = c.ToString();
string s2 = ((char)0x28).ToString();
My usecase is a function that only accepts strings.
My call ends up looking cluttered:
someCall( ((char)0x28).ToString() );
Is there a way of simplifying this and make it more readable without writing '(' ?
The Hexnumber in the code is always paired with a Variable that contains that hex value in its name, so "translating" it would destroy that visible connection.
Edit:
A List of tuples is initialised with this where the first item has the character in its name and the second item results from a call with that character.
One of the answers below is exactly what I was looking for, so I have incorporated it here.
{ existingStaticVar0x28, someCall("\u0028") }
The reader can now instinctively see the connection between item1 and item2 and is less likely to run into a trap when this gets refactored.
You can use a Unicode character escape sequence in place of the hex value to avoid the cast (C# requires exactly four hex digits after \u):
string s2 = '\u0028'.ToString();
or
someCall("\u0028");
If the input is not fixed, you could write an extension method:
namespace MyExtensions
{
public static class MyStringExtensions
{
public static string ConvertFromHex(this string hexData)
{
// Base 16 parsing accepts an optional "0x" prefix
int c = Convert.ToInt32(hexData, 16);
return new string(new char[] {(char)c});
}
}
}
Now you could call it in your code with
string hexNumber = "0x28"; // or whatever hexcode you need to convert
string result = hexNumber.ConvertFromHex();
A bit of error handling should be added to the above conversion.
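One way that error handling might look is a Try-style variant that reports failure instead of throwing. This is a sketch under assumptions: the names HexStringExtensions and TryConvertFromHex are invented here, and the input is assumed to be a single UTF-16 code unit written in hex with an optional "0x" prefix.

```csharp
using System;
using System.Globalization;

public static class HexStringExtensions
{
    // Hypothetical Try-style variant of ConvertFromHex: returns false instead of throwing.
    public static bool TryConvertFromHex(this string hexData, out string result)
    {
        result = null;
        if (string.IsNullOrEmpty(hexData))
        {
            return false;
        }

        // int.TryParse with AllowHexSpecifier rejects a "0x" prefix, so strip it first.
        string digits = hexData.StartsWith("0x", StringComparison.OrdinalIgnoreCase)
            ? hexData.Substring(2)
            : hexData;

        int code;
        if (!int.TryParse(digits, NumberStyles.AllowHexSpecifier, CultureInfo.InvariantCulture, out code))
        {
            return false;
        }

        // Only values that fit in a single char are accepted.
        if (code < char.MinValue || code > char.MaxValue)
        {
            return false;
        }

        result = ((char)code).ToString();
        return true;
    }

    public static void Main()
    {
        string s;
        Console.WriteLine("0x28".TryConvertFromHex(out s) ? s : "invalid");  // prints: (
        Console.WriteLine("0xZZ".TryConvertFromHex(out s) ? s : "invalid");  // prints: invalid
    }
}
```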

words stemmer class c#

I am trying the following stemming class :
static class StemmerSteps
{
public static string stepSufixremover(this string str, string suffex)
{
if (str.EndsWith(suffex))
{
................
}
return str;
}
public static string stepPrefixemover(this string str, string prefix)
{
if (str.StartsWith(prefix))
{
.....................
}
return str;
}
}
This class works with a single prefix or suffix. Is there a way to pass in a list of prefixes or suffixes and compare each one against str? Any help is appreciated.
Instead of creating your own class from scratch (unless this is homework), I would definitely use an existing library. This answer provides an example of code that implements the Porter Stemming Algorithm:
https://stackoverflow.com/questions/7611455/how-to-perform-stemming-in-c
Put your suffix/prefixes in a collection (like a List<>), and loop through and apply each possible one. This collection would need to be passed into the method.
List<string> suffixes = ...;
foreach (string suffix in suffixes)
if (str.EndsWith(suffix))
str = str.Remove(str.Length - suffix.Length, suffix.Length);
EDIT
Considering your comment:
"just want to look if the string starts-/endswith any of the passed strings"
may be something like this can fit your needs:
public static string stepSufixremover(this string str, IEnumerable<string> suffex)
{
// FirstOrDefault avoids the exception SingleOrDefault throws when several suffixes match
string suf = suffex.FirstOrDefault(x => str.EndsWith(x));
if(!string.IsNullOrEmpty(suf))
{
str = str.Remove(str.Length - suf.Length, suf.Length);
}
return str;
}
If you use this like:
"hello".stepSufixremover(new string[]{"lo","l"}).Dump();
it produces:
hel
The simplest code would involve regular expressions.
For example, this would identify some English suffixes:
'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'
One problem is that stemming is not as accurate as lemmatization. Lemmatization would require POS tagging for accuracy. For example, you don't want to add an -ing suffix to dove if it's a noun.
Another problem is that some suffixes also require prefixes. For example, you must add en- to -rich- to add a -ment suffix in en-rich-ment -- unlike a root like -govern- where you can add the suffix without any prefix.
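As a rough illustration of the regular-expression approach above, the pattern can be wrapped in a small C# helper. The names RegexStemSketch and StripSuffix are hypothetical, and the suffix list is just the one from the example pattern, not a complete stemmer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexStemSketch
{
    // Group 1 is the lazily matched stem, group 2 the optional suffix.
    private static readonly Regex SuffixRegex =
        new Regex(@"^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$", RegexOptions.Compiled);

    public static string StripSuffix(string word)
    {
        Match m = SuffixRegex.Match(word);
        return m.Success ? m.Groups[1].Value : word;
    }

    public static void Main()
    {
        Console.WriteLine(StripSuffix("walking")); // prints: walk
        Console.WriteLine(StripSuffix("ladies"));  // prints: lad
        Console.WriteLine(StripSuffix("dove"));    // prints: dove
    }
}
```

Because the stem group is lazy, the longest suffix that reaches the end of the word is removed, which is why "ladies" loses "ies" rather than only "s".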

C# 2.0 function which will return the formatted string

I am using C# 2.0 and I have got below type of strings:
string id = "tcm:481-191820"; or "tcm:481-191820-32"; or "tcm:481-191820-8"; or "tcm:481-191820-128";
The last part of the string (-32, -8, -128) doesn't matter; whatever it is, the result below should be the same.
Now I need to write a function that takes one of the above strings as input and returns "tcm:0-481-1", something like this:
public static string GetPublicationID(string id)
{
//this function will return as below output
return "tcm:0-481-1";
}
Please suggest!!
If final "-1" is static you could use:
public static string GetPublicationID(string id)
{
int a = 1 + id.IndexOf(':');
string first = id.Substring(0, a);
string second = id.Substring(a, id.IndexOf('-') - a);
return String.Format("{0}0-{1}-1", first, second);
}
Or, if "-1" is the start of the next token, try this:
public static string GetPublicationID(string id)
{
int a = 1 + id.IndexOf(':');
string first = id.Substring(0, a);
string second = id.Substring(a, id.IndexOf('-') - a + 2);
return String.Format("{0}0-{1}", first, second);
}
This syntax works even for different length patterns, assuming that your string is
first_part:second_part-anything_else
All you need is:
string.Format("{0}0-{1}", id.Substring(0,4), id.Substring(4,5));
This just uses substring to get the first four characters and then the next five and put them into the format with the 0- in there.
This does assume that your format is a fixed number of characters in each position (which it is in your example). If the string might be abcd:4812... then you will have to modify it slightly to pick up the right length of strings. See Marco's answer for that technique. I'd advise using his if you need the variable length and mine if the lengths stay the same.
Also as an additional note your original function of returning a static string does work for all of those examples you provided. I have assumed there are other numbers visible but if it is only the suffix that changes then you could happily use a static string (at which point declaring a constant or something rather than using a method would probably work better).
Obligatory Regular Expression Answer:
using System.Text.RegularExpressions;
public static string GetPublicationID(string id)
{
// Regex.Match takes the input first, then the pattern
Match m = Regex.Match(id, @"tcm:(\d+-\d)");
if(m.Success)
return string.Format("tcm:0-{0}", m.Groups[1].Captures[0].Value.ToString());
else
return string.Empty;
}
Regex regxMatch = new Regex("(?<prefix>tcm:)(?<id>\\d+-\\d)(?<suffix>.)*",RegexOptions.Singleline|RegexOptions.Compiled);
string regxReplace = "${prefix}0-${id}";
string GetPublicationID(string input) {
return regxMatch.Replace(input, regxReplace);
}
string test = "tcm:481-191820-128";
string result = GetPublicationID(test);
//result: tcm:0-481-1
