Using regular expressions to calculate Unique users? - c#

I want to define a regular expression pattern that gives the number of unique users in a file. I also want to apply a length check so that user values longer than 15 characters are discarded.
For the logs below, my code should therefore return 2.
Log file format:
User:fd441f1f-22c0-45d2-b020-32e1e6a15a73
User:fd441f1f-22c0-45d2-b020-32e1e6a15f43
User:fd441f1f-24g0-45d2-b050-32e1e6a15a73
User: karansha
User: gulanand
Code I tried:
Regex regex = new Regex(@"User:\s*(?<username>.*?)\s");
MatchCollection matches = regex.Matches(x);
foreach (Match match in matches)
{
var user = match.Groups["username"].Value;
if (!users.Contains(user)) users.Add(user);
}
int numberOfUsers = users.Count;

You can do that with LINQ:
int numberOfUsers = regex.Matches(x)
.Cast<Match>()
.Select(m => m.Groups["username"].Value)
.Distinct() // pick only unique names
.Count(name => name.Length <= 15); // count names of at most 15 characters
Or without regular expressions:
int numberOfUsers = File.ReadLines("log.txt")
.Select(line => line.Replace("User:", "").Trim())
.Distinct()
.Count(name => name.Length <= 15);

I wouldn't use a Regex for this.
Try using string.Split() and Distinct instead.
int numberOfUsers = x.Split(new string[] { "User:" }, StringSplitOptions.RemoveEmptyEntries)
.Select(name => name.Trim()) // drop the spaces and line breaks left around each name
.Distinct()
.Count(name => name.Length <= 15);
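As a rough sketch of how the regex and the length check could be combined in one place: the class name UserCounter, the method CountUniqueUsers and the maxLength parameter are hypothetical, and the pattern assumes one "User:" entry per line with no spaces inside the name.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class UserCounter
{
    // Hypothetical helper: counts distinct user names of at most maxLength characters.
    public static int CountUniqueUsers(string log, int maxLength = 15)
    {
        // Multiline makes ^ and $ match at line boundaries;
        // the trailing \s* also absorbs a '\r' from Windows line endings.
        return Regex.Matches(log, @"^User:\s*(?<username>\S+)\s*$", RegexOptions.Multiline)
            .Cast<Match>()
            .Select(m => m.Groups["username"].Value)
            .Where(name => name.Length <= maxLength) // discard overlong values first
            .Distinct()
            .Count();
    }
}
```

Called on the sample log from the question, this should discard the three GUID-style values and count only karansha and gulanand.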

Related

find repeated substring in a string

I have a substring
string subString = "ABC";
Every time all three characters appear in an input, you get one point.
For example, if the input is:
"AABKM" = 0 points
"AAKLMBDC" = 1 point
"ABCC" = 1 point because all three occurs once
"AAZBBCC" = 2 points because ABC is repeated twice;
etc..
The only solution I could come up with is
Regex.Matches(input, "[ABC]").Count
But it does not give me what I'm looking for.
Thanks
You could use a ternary operation, where first we determine that all the characters are present in the string (else we return 0), and then select only those characters, group by each character, and return the minimum count from the groups:
For example:
string subString = "ABC";
var inputStrings = new[] {"AABKM", "AAKLMBDC", "ABCC", "AAZBBCC"};
foreach (var input in inputStrings)
{
var result = subString.All(input.Contains)
? input
.Where(subString.Contains)
.GroupBy(c => c)
.Min(g => g.Count())
: 0;
Console.WriteLine($"{input}: {result}");
}
Output:
AABKM: 0
AAKLMBDC: 1
ABCC: 1
AAZBBCC: 2
It can be done in a single statement using LINQ, although I am not confident that this is a good solution:
string subString = "ABC";
string input = "AAZBBBCCC";
var arr = input.ToCharArray()
.Where(x => subString.Contains(x))
.GroupBy(x => x)
.OrderBy(a => a.Count())
.First()
.Count();
The result is 2 because the letter A appears only twice.
Let's try to explain the LINQ expression.
First, transform the input string into a sequence of chars, then keep only the chars that are contained in the substring. Now group these chars and order the groups by their number of occurrences. At this point, take the first group and read the count of chars in that group.
Let's see if someone has a better solution.
Try this code:
string subString = "ABC";
var input = new[] { "AABKM", "AAKLMBDC", "ABCC", "AAZBBCC" };
foreach (var item in input)
{
List<int> a = new List<int>();
for (int i = 0; i < subString.Length; i++)
{
// count how often the i-th character occurs; Escape guards against regex metacharacters
a.Add(Regex.Matches(item, Regex.Escape(subString[i].ToString())).Count);
}
Console.WriteLine($"{item} : {a.Min()}");
}
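The answers above all reduce to "take the smallest per-character count". A minimal sketch of that idea without grouping or regex follows. The names PointCounter and CountPoints are hypothetical, and it assumes the characters in letters are distinct and that letters is not empty (Min throws on an empty sequence).

```csharp
using System;
using System.Linq;

public static class PointCounter
{
    // Hypothetical helper: the score is the smallest occurrence count
    // among the required letters, so a missing letter yields 0.
    public static int CountPoints(string input, string letters)
    {
        return letters.Min(c => input.Count(ch => ch == c));
    }
}
```

Because a missing character contributes a count of 0, no separate "all characters present" check is needed here.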

Splitting text and integers into array/list

I'm trying to find a way to split a string by its letters and numbers, but I've had no luck.
An example:
I have a string "AAAA000343BBB343"
I need to split it either into 2 values, "AAAA000343" and "BBB343", or into 4 values, "AAAA", "000343", "BBB" and "343".
Any help would be much appreciated
Thanks
Here is a RegEx approach to split your string into 4 values
string input = "AAAA000343BBB343";
string[] result = Regex.Matches(input, @"[a-zA-Z]+|\d+")
.Cast<Match>()
.Select(x => x.Value)
.ToArray(); //"AAAA" "000343" "BBB" "343"
You can use a regex for both cases.
For "AAAA000343" and "BBB343":
var regex = new Regex(@"[a-zA-Z]+\d+");
var result = regex
.Matches("AAAA000343BBB343")
.Cast<Match>()
.Select(x => x.Value);
// result contains "AAAA000343" and "BBB343"
For the four values "AAAA", "000343", "BBB" and "343", see @fubo's answer.
Try this:
var numAlpha = new Regex("(?<Alpha>[a-zA-Z]*)(?<Numeric>[0-9]*)");
var match = numAlpha.Match("codename123");
var Character = match.Groups["Alpha"].Value;
var Integer = match.Groups["Numeric"].Value;
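The named-group pattern above only inspects the first match. As a sketch of how it might be applied to every letters-plus-digits run in a string, the hypothetical helper below (AlphaNumericSplitter.SplitPairs is an illustrative name) uses + instead of * so that empty matches are not returned. It assumes C# 7 or later for the tuple syntax.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class AlphaNumericSplitter
{
    // Hypothetical helper: returns one (letters, digits) pair per run of letters followed by digits.
    public static List<(string Alpha, string Numeric)> SplitPairs(string input)
    {
        return Regex.Matches(input, "(?<Alpha>[a-zA-Z]+)(?<Numeric>[0-9]+)")
            .Cast<Match>()
            .Select(m => (m.Groups["Alpha"].Value, m.Groups["Numeric"].Value))
            .ToList();
    }
}
```

For "AAAA000343BBB343" this should give two pairs, one for "AAAA"/"000343" and one for "BBB"/"343".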

Extract integer from the end of a string

I have multiple IDs in a List<string>()
List<string> IDList = new List<string>() {
"ID101", //101
"I2D102", //102
"103", //103
"I124D104", //104
"ID-105", //105
"-1006" }; //1006
Rule: The string always ends with the id which has length 1 to n and is int only
I need to extract them to int values. But my solution doesn't work
List<int> intList = IDList.Select(x => int.Parse(Regex.Match(x, @".*\d*").Value)).ToList();
If ID is always at the end, you could use LINQ solution instead of Regex:
var query = IDList.Select(id =>
int.Parse(new string(id.Reverse()
.TakeWhile(x => char.IsNumber(x))
.Reverse().ToArray())));
The idea is to take characters from the end of the string until a non-digit is found, and then convert what you collected into an int. The good thing about this solution is that it directly represents the rule you specified.
Well, according to
Rule: The string always ends with the id which has length 1 to n and is int only
the pattern is nothing but
[0-9]{1,n}$
[0-9] - ints only
{1,n} - from 1 to n (both 1 and n are included)
$ - string always ends with
and possible implementation could be something like this
int n = 5; //TODO: put actual value
String pattern = "[0-9]{1," + n.ToString() + "}$";
List<int> intList = IDList
.Select(line => int.Parse(Regex.Match(line, pattern).Value))
.ToList();
In case there're some broken lines, say "abc" (and you want to filter them out):
List<int> intList = IDList
.Select(line => Regex.Match(line, pattern))
.Where(match => match.Success)
.Select(match => int.Parse(match.Value))
.ToList();
Here's another LINQ approach which works if the number is always at the end and negative values aren't possible. Skips invalid strings:
List<int> intList = IDList
.Select(s => s.Reverse().TakeWhile(Char.IsDigit))
.Where(digits => digits.Any())
.Select(digits => int.Parse(String.Concat(digits.Reverse())))
.ToList();
( Edit: similar to Ian's approach )
The code below extracts the trailing id as an integer from each item and ignores items that do not end with an integer:
List<int> intList = IDList.Where(a => Regex.IsMatch(a, @"\d+$"))
.Select(x => int.Parse(Regex.Match(x, @"\d+$").Value)).ToList();
I assume you want the trailing numbers:
var res = IDList.Select(x => int.Parse(Regex.Match(x, @"\d+$").Value)).ToList();
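A defensive variant of the trailing-digits idea could look like the sketch below. The names IdExtractor and ExtractTrailingIds are hypothetical. It assumes ASCII digits only (hence [0-9] rather than \d, which in .NET also matches other Unicode digits), skips strings without a trailing number, and also skips digit runs too long to fit in an int. The out-variable syntax needs C# 7 or later.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

public static class IdExtractor
{
    // Hypothetical helper: returns the integer at the end of each string,
    // skipping strings that do not end in digits or overflow int.
    public static List<int> ExtractTrailingIds(IEnumerable<string> ids)
    {
        var result = new List<int>();
        foreach (var id in ids)
        {
            var match = Regex.Match(id, "[0-9]+$");
            if (match.Success &&
                int.TryParse(match.Value, NumberStyles.None, CultureInfo.InvariantCulture, out int value))
            {
                result.Add(value);
            }
        }
        return result;
    }
}
```

Using int.TryParse instead of int.Parse means a single bad entry is dropped rather than aborting the whole query with an exception.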

Split a string into an array

I want to split a string into an array of substrings. The string is delimited by spaces, but a space may also appear inside the substrings. The resulting substrings must all have the same length.
Example:
"a b aab bb  aaa" -> "a b", "aab", "bb ", "aaa"
I have the following code:
var T = Regex.Split(S, @"(?<=\G.{4})").Select(x => x.Substring(0, 3));
But I need to parameterize this code to split by various lengths (3, 4, 5 or n), and I don't know how to do this. Please help.
If it is impossible to parameterize the regex, a pure LINQ version is fine.
You can use the same regex, but "parameterize" it by inserting the desired number into the string.
In C# 6.0, you can do it like this:
var n = 5;
var T = Regex.Split(S, $@"(?<=\G.{{{n}}})").Select(x => x.Substring(0, n-1));
Prior to that you could use string.Format:
var n = 5;
var regex = string.Format(@"(?<=\G.{{{0}}})", n);
var T = Regex.Split(S, regex).Select(x => x.Substring(0, n-1));
It seems rather easy with LINQ:
var source = "a b aab bb  aaa";
var results =
Enumerable
.Range(0, source.Length / 4 + 1)
.Select(n => source.Substring(n * 4, 3))
.ToList();
Or you can use the Interactive Extensions from Microsoft's Reactive Framework team (NuGet "Ix-Main") and do this:
var results =
source
.Buffer(3, 4)
.Select(x => new string(x.ToArray()))
.ToList();
Both give you the output you require.
A lookbehind (?<=pattern) matches a zero-length string. To split using spaces as delimiters, the match has to actually return a " " (the space has to be in the main pattern, outside the lookbehind).
Regex for length = 3: @"(?<=\G.{3}) " (note the trailing space)
Code for length n:
var n = 3;
var S = "a b aab bb  aaa";
var regex = @"(?<=\G.{" + n + @"}) ";
var T = Regex.Split(S, regex);
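For comparison, here is a sketch of the same fixed-width split with a plain loop and no regex. FixedWidthSplitter and SplitFixed are hypothetical names, and the code assumes every field is exactly width characters long, followed by a single delimiter character (except possibly the last field).

```csharp
using System;
using System.Collections.Generic;

public static class FixedWidthSplitter
{
    // Hypothetical helper: takes `width` characters, skips one delimiter, and repeats.
    public static List<string> SplitFixed(string s, int width)
    {
        var parts = new List<string>();
        for (int i = 0; i < s.Length; i += width + 1)
        {
            // Math.Min keeps a short final field from throwing in Substring.
            parts.Add(s.Substring(i, Math.Min(width, s.Length - i)));
        }
        return parts;
    }
}
```

With the example string (two spaces after "bb") and width 3, this should produce "a b", "aab", "bb " and "aaa".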

C# Accurately Replace String/SubString

Currently I have a large array of Pinyin tone notation entries. Some strings are combined, for example Diànnǎo = Diàn + nǎo.
The problem is that I want to replace strings in which two or more keys match, for example:
string[] Py = { "xi", "xia", "xian" };
string[] Km = { "shi", "shie", "shien" };
string Input = "xiaguo";
for (int i = 0; i < Py.Length; i++)
if (Input.Contains(Py[i]))
Input = Input.Replace(Py[i], Km[i]);
The code above has a problem caused by the loop order: "xiaguo" contains "xi", so it becomes "shiaguo" instead of "shieguo", because xi is tried before xia.
How do I achieve this and make sure xia is matched instead of xi?
Full code I posted on GitHub: https://github.com/Anime4000/py2km/blob/beta/py2km.api/Converter.cs#L15
Assuming longer tokens take precedence over shorter tokens, the 2 arrays can be converted to a dictionary and then sorted by the length of the key:
var dic = new Dictionary<string, string>
{
{"xi","shi"},
{"xia","shie"},
{"xian","shien"},
}.OrderByDescending(x => x.Key.Length)
.ThenBy(x => x.Key)
.ToDictionary(x => x.Key, x => x.Value);
string input = "xiaguo";
foreach(var d in dic)
input = input.Replace(d.Key, d.Value);
Console.WriteLine(input);
The above example sorts the dictionary:
by the length of the key
then by the alpha sort of the key
finally, the LINQ query is converted back to a dictionary.
From there, just iterate over the dictionary and replace all the tokens; there's no need to check to see if the key/token exists.
You could use a regular expression for this.
I modified your code so the regex will only match xi and not xia.
The regex @"xi\b" matches xi; the \b means word boundary, so it only matches that exact word.
string[] Py = { "xi", "xia", "xian" };
string[] Km = { "shi", "shie", "shien" };
string Input = "xiaguo";
string pattern = @"xi\b";
Regex rgx = new Regex(pattern, RegexOptions.IgnoreCase);
for (int i = 0; i < Py.Length; i++)
{
MatchCollection matches = rgx.Matches(Py[i]);
if (matches.Count > 0)
{
Input = Input.Replace(Py[i], Km[i]);
}
}
Tone and language specifics may not have a simple structure, so you may assume some pattern and then find out later that it is not right for some 'word'.
Anyway, to handle the scenario described, you must order the target tones by descending length, and then perform only a single replacement for each 'word' (this avoids replacing xi or xia when processing xian).
The steps would be:
for each replacement ordered by length descending
try to find tone
if found: replace and mark as done (jump to next 'word')
The idea here is the same as when replacing two numbers in a list, say 2 with 1 and 3 with 2. The order really matters: if you replace 3 with 2 first, the later replacement of 2 with 1 turns both the original 2s and the original 3s into 1.
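One way the "longest first, replace only once" steps could be sketched is a single Regex.Replace pass over an alternation of the keys, ordered by descending length. PinyinConverter and ReplaceLongestFirst are hypothetical names; the code assumes a non-empty map and case-sensitive keys. Because each position in the input is matched at most once, an already replaced fragment is never replaced again, and the result does not depend on dictionary enumeration order.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class PinyinConverter
{
    // Hypothetical helper: replaces every key with its mapped value in one pass,
    // preferring longer keys (xian before xia before xi).
    public static string ReplaceLongestFirst(string input, IDictionary<string, string> map)
    {
        // The regex engine tries alternatives left to right, so longer keys go first.
        var pattern = string.Join("|",
            map.Keys.OrderByDescending(k => k.Length).Select(Regex.Escape));
        return Regex.Replace(input, pattern, m => map[m.Value]);
    }
}
```

With the three keys from the question, "xiaguo" should come out as "shieguo", since "xia" is tried before "xi".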
