To Count Occurrences of all sub strings in string C# - c#

Question: I have a long string and I require to find the count of occurrences of all sub strings present under that string and print a list of all sub strings and their count (if count is > 1) in decreasing order of count.
Example:
String = "abcdabcd"
Result:
Substrings Count
abcd 2
abc 2
bcd 2
ab 2
bc 2
cd 2
a 2
b 2
c 2
d 2
Problem: My string can be 5000 character long and I am not able to find a efficient way to achieve this.( Efficiency is very important for application)
Is there any algorithm present or by multi threading it is possible. please help.

Example using: Find a common string within a list of strings
void Main()
{
"abcdabcd".getAllSubstrings()
.AsParallel()
.GroupBy(x => x)
.Select(g => new {g.Key, count=g.Count()})
.Dump();
}
// Define other methods and classes here
public static class Ext
{
public static IEnumerable<string> getAllSubstrings(this string word)
{
return from charIndex1 in Enumerable.Range(0, word.Length)
from charIndex2 in Enumerable.Range(0, word.Length - charIndex1 + 1)
where charIndex2 > 0
select word.Substring(charIndex1, charIndex2);
}
}
Produces:
a 2
dabc 1
abcdabc 1
b 2
abc 2
dabcd 1
bc 2
bcda 1
abcd 2
ab 2
bcdab 1
cdabc 1
abcda 1
d 2
bcdabc 1
dab 1
bcd 2
abcdab 1
c 2
bcdabcd 1
abcdabcd 1
cd 2
da 1
cdab 1
cda 1
cdabcd 1

Related

Regular expression to include starting string

I have managed to match into groups as follows using the below expression but its incomplete.
\([^\)]*\)
Example strings are,
s11(h 1 1 c)(h 1 1 c) x="" y="" z="" phi="" theta=""
e(45,10,h 1 1 c,1,cross,max) x="" y="" z="" phi="" theta=""
With the above expression I can match (h 1 1 c)(h 1 1 c) and (45,10,h 1 1 c,1,cross,max)
But I want to capture the starting string s11 and e along with (h 1 1 c)(h 1 1 c) and (45,10,h 1 1 c,1,cross,max)
You can use
var lines = new List<string> { "s11(h 1 1 c)(h 1 1 c) x=\"\" y=\"\" z=\"\" phi=\"\" theta=\"\"",
"e(45,10,h 1 1 c,1,cross,max) x=\"\" y=\"\" z=\"\" phi=\"\" theta=\"\""};
foreach (var s in lines)
{
Console.WriteLine("==== Next string: \"" + s + "\" =>");
Console.WriteLine(string.Join(", ",
Regex.Matches(s, #"\w+(?:\([^()]*\))+").Cast<Match>().Select(x => x.Value)));
Console.WriteLine("=== With groups and captures:");
var results = Regex.Matches(s, #"(\w+)(?:(\([^()]*\)))+");
foreach (Match m in results)
{
Console.WriteLine(m.Groups[1].Value);
Console.WriteLine(string.Join(", ", m.Groups[2].Captures.Cast<Capture>().Select(z => z.Value)));
}
}
See the C# demo. Output:
==== Next string: "s11(h 1 1 c)(h 1 1 c) x="" y="" z="" phi="" theta=""" =>
s11(h 1 1 c)(h 1 1 c)
=== With groups and captures:
s11
(h 1 1 c), (h 1 1 c)
==== Next string: "e(45,10,h 1 1 c,1,cross,max) x="" y="" z="" phi="" theta=""" =>
e(45,10,h 1 1 c,1,cross,max)
=== With groups and captures:
e
(45,10,h 1 1 c,1,cross,max)
Depending on what exact results you want to get, you may use a regex with or without capturing groups:
\w+(?:\([^()]*\))+
(\w+)(?:(\([^()]*\)))+
See the regex 1 demo and regex 2 demo.
Details
\w+ - one or more word chars (letters, digits and some connector puncutation)
(?:\([^()]*\))+ - one or more repetitions of
\( - a ( char
[^()]* - zero or more chars other than ( and )
\) - a ) char.

Regex to get numbers after a period in a string

I'm trying to find the right regex to extract the numbers after the . in the string below. E.g, the first line should return and array of 1 1 1 1 1, the second should return 2 1 0 1 2. I can't seem to figure the correct regex expression to achieve this. Any help would be appreciated.
line = 0.1, 1.1, 2.1, 3.1, 4.1 // payline 0
line = 0.2, 1.1, 2.0, 3.1, 4.2 // payline 1
So far, I have the code below, but it just returns all the the numbers in the sting instead. eg, the first line returns 0 1 1 1 2 1 3 1 4 1 0 and the second returns 0 2 1 1 2 0 3 1 4 2 1
foreach (var line in Paylines)
{
int[] lines = (from Match m in Regex.Matches(line.ToString(), #"\d+")
select int.Parse(m.Value)).ToArray();
foreach (var x in lines)
{
Console.WriteLine(x.ToString());
}
}
You may use a lookbehind-based regex solution:
#"(?<=\.)\d+"
It matches 1+ digits after a dot without placing the dot into a match value.
See the regex demo.
In C#, you may use
var myVals = Regex.Matches(line, #"(?<=\.)\d+", RegexOptions.ECMAScript)
.Cast<Match>()
.Select(m => int.Parse(m.Value))
.ToList();
See the C# demo.
The RegexOptions.ECMAScript option is passed for the \d to only match ASCII digits in the [0-9] range and avoid matching other Unicode digits.

what is the best way to split string in c#

I have a string that look like that "a,b,c,d,e,1,4,3,5,8,7,5,1,2,6.... and so on.
I am looking for the best way to split it and make it look like that:
a b c d e
1 4 3 5 8
7 5 1 2 6
Assuming, that you have a fix number of columns (5):
string Input = "a,b,c,d,e,11,45,34,33,79,65,75,12,2,6";
int i = 0;
string[][] Result = Input.Split(',').GroupBy(s => i++/5).Select(g => g.ToArray()).ToArray();
First I split the string by , character, then i group the result into chunks of 5 items and select those chunks into arrays.
Result:
a b c d e
11 45 34 33 79
65 75 12 2 6
to write that result into a file you have to
using (System.IO.StreamWriter writer =new System.IO.StreamWriter(path,false))
{
foreach (string[] line in Result)
{
writer.WriteLine(string.Join("\t", line));
}
};

Remove single alphabets from a string

I need help in removing letters but not words from an incoming data string. Like the following,
String A = "1 2 3A 4 5C 6 ABCD EFGH 7 8D 9";
to
String A = "1 2 3 4 5 6 ABCD EFGH 7 8 9";
You need to match a letter and ensure that there is no letter before and after. So match
(?<!\p{L})\p{L}(?!\p{L})
and replace with an empty string.
Look around assertions on regular-expresssion.info
Unicode properties on regular-expresssion.info
In C#:
string s = "1 2 3A 4 5C 6 ABCD EFGH 7 8D 9";
string result = Regex.Replace(s, #"(?<!\p{L}) # Negative lookbehind assertion to ensure not a letter before
\p{L} # Unicode property, matches a letter in any language
(?!\p{L}) # Negative lookahead assertion to ensure not a letter following
", String.Empty, RegexOptions.IgnorePatternWhitespace);
The "obligatory" Linq approach:
string[] words = A.Split();
string result = string.Join(" ",
words.Select(w => w.Any(c => Char.IsDigit(c)) ?
new string(w.Where(c => Char.IsDigit(c)).ToArray()) : w));
This approach looks if each word contains a digit. Then it filters out the non-digit chars and creates a new string from the result. Otherwise it just takes the word.
And here comes the old school:
Dim A As String = "1 2 3A 4 5C 6 ABCD EFGH 7 8D 9"
Dim B As String = "1 2 3 4 5 6 ABCD EFGH 7 8 9"
Dim sb As New StringBuilder
Dim letterCount As Integer = 0
For i = 0 To A.Length - 1
Dim ch As Char = CStr(A(i)).ToLower
If ch >= "a" And ch <= "z" Then
letterCount += 1
Else
If letterCount > 1 Then sb.Append(A.Substring(i - letterCount, letterCount))
letterCount = 0
sb.Append(A(i))
End If
Next
Debug.WriteLine(B = sb.ToString) 'prints True

Best way to Find which cell of string array contins text

I have a block of text that im taking from a Gedcom (Here and Here) File
The text is flat and basically broken into "nodes"
I am splitting each node on the \r char and thus subdividing it into each of its parts( amount of "lines" can vary)
I know the 0 address will always be the ID but after that everything can be anywhere so i want to test each Cell of the array to see if it contains the correct tag for me to proccess
an example of what two nodes would look like
0 #ind23815# INDI <<<<<<<<<<<<<<<<<<< Start of node 1
1 NAME Lawrence /Hucstepe/
2 DISPLAY Lawrence Hucstepe
2 GIVN Lawrence
2 SURN Hucstepe
1 POSITION -850,-210
2 BOUNDARY_RECT (-887,-177),(-813,-257)
1 SEX M
1 BIRT
2 DATE 1521
1 DEAT Y
2 DATE 1559
1 NOTE * Born: Abt 1521, Kent, England
2 CONT * Marriage: Jane Pope 17 Aug 1546, Kent, England
2 CONT * Died: Bef 1559, Kent, England
2 CONT
1 FAMS #fam08318#
0 #ind23816# INDI <<<<<<<<<<<<<<<<<<<<<<< Start of Node 2
1 NAME Jane /Pope/
2 DISPLAY Jane Pope
2 GIVN Jane
2 SURN Pope
1 POSITION -750,-210
2 BOUNDARY_RECT (-787,-177),(-713,-257)
1 SEX F
1 BIRT
2 DATE 1525
1 DEAT Y
2 DATE 1609
1 NOTE * Born: Abt 1525, Tenterden, Kent, England
2 CONT * Marriage: Lawrence Hucstepe 17 Aug 1546, Kent, England
2 CONT * Died: 23 Oct 1609
2 CONT
1 FAMS #fam08318#
0 #ind23817# INDI <<<<<<<<<<< start of Node 3
So a when im done i have an array that looks like
address , string
0 = "1 NAME Lawrence /Hucstepe/"
1 = "2 DISPLAY Lawrence Hucstepe"
2 = "2 GIVN Lawrence"
3 = "2 SURN Hucstepe"
4 = "1 POSITION -850,-210"
5 = "2 BOUNDARY_RECT (-887,-177),(-813,-257)"
6 = "1 SEX M"
7 = "1 BIRT "
8 = "1 FAMS #fam08318#"
So my question is what is the best way to search the above array to see which Cell has the SEX tag or the NAME Tag or the FAMS Tag
this is the code i have
private int FindIndexinArray(string[] Arr, string search)
{
int Val = -1;
for (int i = 0; i < Arr.Length; i++)
{
if (Arr[i].Contains(search))
{
Val = i;
}
}
return Val;
}
But it seems inefficient because i end up calling it twice to make sure it doesnt return a -1
Like so
if (FindIndexinArray(SubNode, "1 BIRT ") != -1)
{
// add birthday to Struct
I.BirthDay = SubNode[FindIndexinArray(SubNode, "1 BIRT ") + 1].Replace("2 DATE ", "").Trim();
}
sorry this is a longer post but hopefully you guys will have some expert advice
Can use the static method FindAll of the Array class:
It will return the string itself though, if that works..
string[] test = { "Sex", "Love", "Rock and Roll", "Drugs", "Computer"};
Array.FindAll(test, item => item.Contains("Sex") || item.Contains("Drugs") || item.Contains("Computer"));
The => indicates a lamda expression. Basically a method without a concrete implementation.
You can also do this if the lamda gives you the creeps.
//Declare a method
private bool HasTag(string s)
{
return s.Contains("Sex") || s.Contains("Drugs") || s.Contains("Computer");
}
string[] test = { "Sex", "Love", "Rock and Roll", "Drugs", "Computer"};
Array.FindAll(test, HasTag);
What about a simple regular expression?
^(\d)\s=\s\"\d\s(SEX|BIRT|FAMS){1}.*$
First group captures the address, second group the tag.
Also, it might be quicker to dump all array items into a string and do your regex on the whole lot at once.
"But it seems inefficient because i end up calling it twice to make sure it doesnt return a -1"
Copy the returned value to a variable before you test to prevent multiple calls.
IndexResults = FindIndexinArray(SubNode, "1 BIRT ")
if (IndexResults != -1)
{
// add birthday to Struct
I.BirthDay = SubNode[IndexResults].Replace("2 DATE ", "").Trim();
}
The for loop in method FindIndexinArray shd break once you find a match if you are interested in only the first match.

Categories

Resources