Question: I have a long string, and I need to count the occurrences of every substring in it, then print each substring with its count (only where the count is greater than 1), in decreasing order of count.
Example:
String = "abcdabcd"
Result:
Substrings Count
abcd 2
abc 2
bcd 2
ab 2
bc 2
cd 2
a 2
b 2
c 2
d 2
Problem: My string can be up to 5000 characters long, and I cannot find an efficient way to do this. (Efficiency is very important for my application.)
Is there an existing algorithm for this, or can it be done with multithreading? Please help.
Example using the approach from "Find a common string within a list of strings":
void Main()
{
    "abcdabcd".getAllSubstrings()
        .AsParallel()
        .GroupBy(x => x)
        .Select(g => new { g.Key, count = g.Count() })
        .Dump(); // Dump() is a LINQPad extension method that prints the result
}
// Define other methods and classes here
public static class Ext
{
    // Yields every non-empty substring of word, one per (start, length) pair
    public static IEnumerable<string> getAllSubstrings(this string word)
    {
        return from charIndex1 in Enumerable.Range(0, word.Length)
               from charIndex2 in Enumerable.Range(0, word.Length - charIndex1 + 1)
               where charIndex2 > 0
               select word.Substring(charIndex1, charIndex2);
    }
}
Produces:
a 2
dabc 1
abcdabc 1
b 2
abc 2
dabcd 1
bc 2
bcda 1
abcd 2
ab 2
bcdab 1
cdabc 1
abcda 1
d 2
bcdabc 1
dab 1
bcd 2
abcdab 1
c 2
bcdabcd 1
abcdabcd 1
cd 2
da 1
cdab 1
cda 1
cdabcd 1
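The output above still includes substrings that occur only once and is not sorted. Below is a minimal sketch of the filtering and ordering the question asks for, using a plain dictionary instead of PLINQ. The names `SubstringCounter` and `RepeatedSubstrings` are made up for illustration, and the tie-breaking order (longer substrings first, then ordinal) is an assumption chosen to reproduce the example in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of the answer above.
public static class SubstringCounter
{
    // Returns every substring that occurs more than once, highest count first.
    public static List<KeyValuePair<string, int>> RepeatedSubstrings(string word)
    {
        var counts = new Dictionary<string, int>();
        for (int start = 0; start < word.Length; start++)
        {
            for (int length = 1; length <= word.Length - start; length++)
            {
                string sub = word.Substring(start, length);
                counts.TryGetValue(sub, out int current); // current is 0 if sub is new
                counts[sub] = current + 1;
            }
        }

        return counts
            .Where(pair => pair.Value > 1)                     // keep repeats only
            .OrderByDescending(pair => pair.Value)             // decreasing count
            .ThenByDescending(pair => pair.Key.Length)         // assumed tie-break
            .ThenBy(pair => pair.Key, StringComparer.Ordinal)  // assumed tie-break
            .ToList();
    }
}
```

A string of length n has n(n+1)/2 substrings, so a 5000-character input enumerates about 12.5 million of them this way. Treat this as a correctness baseline rather than the efficient solution the question is after.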
I have a string that looks like this: "a,b,c,d,e,1,4,3,5,8,7,5,1,2,6.... and so on.
I am looking for the best way to split it so that it looks like this:
a b c d e
1 4 3 5 8
7 5 1 2 6
Assuming that you have a fixed number of columns (5):
string Input = "a,b,c,d,e,11,45,34,33,79,65,75,12,2,6";
int i = 0;
string[][] Result = Input.Split(',').GroupBy(s => i++/5).Select(g => g.ToArray()).ToArray();
First I split the string on the , character, then I group the resulting items into chunks of 5 (integer division of the running index by 5 gives the row number) and select those chunks into arrays.
Result:
a b c d e
11 45 34 33 79
65 75 12 2 6
To write that result to a file:
using (System.IO.StreamWriter writer = new System.IO.StreamWriter(path, false))
{
    foreach (string[] line in Result)
    {
        writer.WriteLine(string.Join("\t", line));
    }
}
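The grouping step above relies on a captured counter (`i++`) that is mutated inside the query. Here is a sketch of the same chunking using the index overload of `Select`, so the query has no side effects. The names `RowSplitter` and `SplitIntoRows` are hypothetical, and the column count is passed in as a parameter instead of being hard-coded.

```csharp
using System;
using System.Linq;

// Hypothetical helper illustrating the same "group by index / columns" idea.
public static class RowSplitter
{
    public static string[][] SplitIntoRows(string input, int columns)
    {
        return input.Split(',')
            .Select((item, index) => new { item, index })   // pair each item with its position
            .GroupBy(x => x.index / columns)                // integer division gives the row number
            .Select(g => g.Select(x => x.item).ToArray())
            .ToArray();
    }
}
```

If the number of items is not a multiple of `columns`, the last row is simply shorter.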
I need to export entities to a CSV file using CsvHelper. I got a trial version working, but I had to write every field manually. What I want is to write each record prefixed with either an 'H' or a 'D' and to end every line with a single space. My demo models:
PersonId FirstName LastName DateOfBirth
1 Randy Smith 1968-08-31
2 Zachary Smith 2002-01-10
3 Angie Smith 1969-11-20
4 Khelzie Smith 1996-07-27
AutoId Year Make Model OwnerId
1 2000 Toyota 4Runner 1
2 1995 Ford Mustang 1
3 2014 Chevrolet Corvette Stingray Coupe 2
4 2014 Volkswagen Beetle Coupe 4
5 1980 Ford F-150 2
6 1968 Chevrolet Camaro 3
7 2000 Tonka Truck 3
8 1993 Honda Accord 4
Into a CSV File Like this:
H 1 Randy Smith 8/31/1968
D 1 2000 Toyota 4Runner
D 2 1995 Ford Mustang
H 2 Zachary Smith 1/10/2002
D 3 2014 Chevy Corevett
D 5 1980 Ford F-150
H 3 Angie Smith 11/20/1969
D 6 1968 Chevrolet Camaro
D 7 2000 Tonka Truck
H 4 Khelzie Smith 7/27/1996
D 4 2014 Volkswagen Beetle Coupe
This is the Code I finally got to work:
StreamWriter textWriter = File.CreateText(fileName);
var csv = new CsvWriter(textWriter);
csv.Configuration.Delimiter = delimiter;
csv.Configuration.QuoteNoFields = true;
// This will skip those people who don't own a vehicle
foreach (Person person in people.Where(person => person.Vehicles.Count > 0))
{
    // The letter 'H' must prefix every Header line
    csv.WriteField(@"H " + person.PersonId);
    csv.WriteField(person.FirstName);
    csv.WriteField(person.LastName);
    // Header lines must end with a single space.
    csv.WriteField(person.DateOfBirth.ToShortDateString() + " ");
    csv.NextRecord();
    foreach (Automobile auto in person.Vehicles)
    {
        // The letter 'D' must prefix every Detail line
        csv.WriteField(@"D " + auto.AutoId);
        csv.WriteField(auto.Year);
        csv.WriteField(auto.Make);
        // Detail lines must end with a single space.
        csv.WriteField(auto.Model + " ");
        csv.NextRecord();
    }
}
The real tables have ~70 fields apiece.
Just for those that have as thick a skull as mine, here is a solution:
foreach (TransactionHeader header in headers)
{
    csv.WriteField("H");
    csv.WriteRecord(header);
    csv.WriteField(" ");
    csv.NextRecord();
    foreach (TransactionDetail detail in header.TransactionDetail)
    {
        csv.WriteField("D");
        csv.WriteRecord(detail);
        csv.WriteField(" ");
        csv.NextRecord();
    }
}
Thanks to everyone who saw this as pretty obvious and patiently waited for me to bash my head down on my desk enough times and then figure this out myself.
I need help in removing letters but not words from an incoming data string. Like the following,
String A = "1 2 3A 4 5C 6 ABCD EFGH 7 8D 9";
to
String A = "1 2 3 4 5 6 ABCD EFGH 7 8 9";
You need to match a letter and ensure that there is no letter before and after. So match
(?<!\p{L})\p{L}(?!\p{L})
and replace with an empty string.
Lookaround assertions on regular-expressions.info
Unicode properties on regular-expressions.info
In C#:
string s = "1 2 3A 4 5C 6 ABCD EFGH 7 8D 9";
string result = Regex.Replace(s, @"(?<!\p{L}) # Negative lookbehind assertion to ensure not a letter before
    \p{L}       # Unicode property, matches a letter in any language
    (?!\p{L})   # Negative lookahead assertion to ensure not a letter following
    ", String.Empty, RegexOptions.IgnorePatternWhitespace);
The "obligatory" LINQ approach:
string[] words = A.Split();
string result = string.Join(" ",
    words.Select(w => w.Any(c => Char.IsDigit(c)) ?
        new string(w.Where(c => Char.IsDigit(c)).ToArray()) : w));
This approach checks whether each word contains a digit. If it does, it filters out the non-digit characters and creates a new string from the rest; otherwise it keeps the word unchanged.
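The regex and LINQ approaches agree on the sample input but are not equivalent in general. The regex removes only letters that have no letter neighbour, while the LINQ version strips every non-digit from any word that contains a digit. A small sketch that puts both side by side; the class and method names (`LetterStripper`, `WithRegex`, `WithLinq`) are invented for this comparison.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical wrapper around the two approaches shown above.
public static class LetterStripper
{
    // Removes a letter only when it is not adjacent to another letter.
    public static string WithRegex(string input)
    {
        return Regex.Replace(input, @"(?<!\p{L})\p{L}(?!\p{L})", string.Empty);
    }

    // Keeps only the digits of any word that contains at least one digit.
    public static string WithLinq(string input)
    {
        return string.Join(" ",
            input.Split(' ').Select(w => w.Any(c => char.IsDigit(c))
                ? new string(w.Where(c => char.IsDigit(c)).ToArray())
                : w));
    }
}
```

For "1 2 3A 4 5C 6 ABCD EFGH 7 8D 9" both methods return "1 2 3 4 5 6 ABCD EFGH 7 8 9". For a word such as "AB1" they differ: the regex leaves it as "AB1" because A and B are adjacent letters, while the LINQ version reduces it to "1". Which behaviour is wanted depends on the incoming data.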
And here comes the old school:
Dim A As String = "1 2 3A 4 5C 6 ABCD EFGH 7 8D 9"
Dim B As String = "1 2 3 4 5 6 ABCD EFGH 7 8 9"
Dim sb As New StringBuilder
Dim letterCount As Integer = 0
For i = 0 To A.Length - 1
Dim ch As Char = CStr(A(i)).ToLower
If ch >= "a" And ch <= "z" Then
letterCount += 1
Else
If letterCount > 1 Then sb.Append(A.Substring(i - letterCount, letterCount))
letterCount = 0
sb.Append(A(i))
End If
Next
Debug.WriteLine(B = sb.ToString) 'prints True
I need to parse fixed-width records using C# and regular expressions.
Each record contains a number of fixed width fields, with each field potentially having non-trivial validation rules. The problem I'm having is with a match being applied across the fixed width field boundaries.
Without the rules it is easy to break apart a fixed width string of length 13 into 4 parts like this:
(?=^.{13}$).{1}.{5}.{6}.{1}
Here is a sample field rule:
Field can be all spaces OR start with [A-Z] and be right padded with spaces. Spaces cannot occur between letters
If the field was the only thing I have to validate I could use this:
(?=^[A-Z ]{5}$)([ ]{5}|[A-Z]+[ ]*)
When I add this validation as part of a longer list I have to remove the ^ and $ from the lookahead and I start to get matches that are not of length 5.
Here is the full regex along with some sample text that should match and not match the expression.
(?=^[A-Z ]{13}$)A(?=[A-Z ]{5})([ ]{5}|(?>[A-Z]{1,5})[ ]{0,4})(?=[A-Z ]{6})([ ]{6}|(?>[A-Z]{1,6})[ ]{0,5})Z
How do I implement the rules so that, for each field, the immediate next XX characters are used for the match and ensure that matches do not overlap?
Lines that should match:
ABCDEFGHIJKLZ
A Z
AB Z
A G Z
AB G Z
ABCDEF Z
ABCDEFG Z
A GHIJKLZ
AB GHIJKLZ
Lines that should not match:
AB D Z
AB D F Z
AB F Z
A G I Z
A G I LZ
A G LZ
AB FG LZ
AB D FG Z
AB FG I Z
AB D FG i Z
The following 3 should not match but do.
AB FG Z
AB FGH Z
AB EFGH Z
EDIT:
General solution (based on Ωmega's answer) with named captures for clarity:
(?<F1>F1Regex)(?<=^.{Len(F1)})
(?<F2>F2Regex)(?<=^.{Len(F1+F2)})
(?<F3>F3Regex)(?<=^.{Len(F1+F2+F3)})
...
(?<Fn>FnRegex)
Another example. The spaces between each field regex and its zero-width positive lookbehind (?<= are there only for clarity.
(?<F1>\d{2}) (?<=^.{2})
(?<F2>[A-Z]{5}) (?<=^.{7})
(?<F3>\d{4}) (?<=^.{11})
(?<F4>[A-Z]{6}) (?<=^.{17})
(?<F5>\d{4})
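A sketch of how the second example might be run in C#. The pattern is the one above, compiled with `RegexOptions.IgnorePatternWhitespace` so that the clarifying spaces are ignored. The surrounding `^` and `$` anchors, the class name `FixedWidthSketch`, and the sample record are assumptions added for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper; only the field pattern comes from the example above.
public static class FixedWidthSketch
{
    private static readonly Regex RecordPattern = new Regex(@"
        ^                                  # assumption: anchor at record start
        (?<F1>\d{2})      (?<=^.{2})       # field 1 must end at offset 2
        (?<F2>[A-Z]{5})   (?<=^.{7})       # field 2 must end at offset 7
        (?<F3>\d{4})      (?<=^.{11})      # field 3 must end at offset 11
        (?<F4>[A-Z]{6})   (?<=^.{17})      # field 4 must end at offset 17
        (?<F5>\d{4})
        $                                  # assumption: nothing after field 5
        ", RegexOptions.IgnorePatternWhitespace);

    public static Match Parse(string record)
    {
        return RecordPattern.Match(record);
    }
}
```

With the made-up record "12ABCDE3456FGHIJK7890", `Parse(...).Groups["F2"].Value` is "ABCDE" and `Groups["F5"].Value` is "7890". The lookbehinds only matter once a field regex can match a variable number of characters; with fixed quantifiers as here they are redundant but harmless.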
If the input string is fixed in size, then you can match a specific position using look-aheads and look-behinds, like this:
(?<=^.{s})(?<fieldName>.*)(?=.{e}$)
where:
s = start position
e = string length - match length - s
If you concatenate multiple regexes, like this one, then you will get all the fields with specific positioning.
Example
Fixed length: 10
Field 1: start 0, length 3
Field 2: start 3, length 5
Field 3: start 8, length 2
Use this regex, ignoring white spaces:
var match = Regex.Match("0123456789", @"
    (?<=^.{0})(?<name1>.*)(?=.{7}$)
    (?<=^.{3})(?<name2>.*)(?=.{2}$)
    (?<=^.{8})(?<name3>.*)(?=.{0}$)",
    RegexOptions.IgnorePatternWhitespace);
var field1 = match.Groups["name1"].Value; // "012"
var field2 = match.Groups["name2"].Value; // "34567"
var field3 = match.Groups["name3"].Value; // "89"
You can place whatever rule you want to match the fields.
I used .* for all of them, but you can place anything there.
Example 2
var match = Regex.Match(" 1a any-8888", @"
    (?<=^.{0})(?<name1>\s*\d*[a-zA-Z])(?=.{9}$)
    (?<=^.{3})(?<name2>.*)(?=.{4}$)
    (?<=^.{8})(?<name3>(?<D>\d)\k<D>*)(?=.{0}$)
    ",
    RegexOptions.IgnorePatternWhitespace);
var field1 = match.Groups["name1"].Value; // " 1a"
var field2 = match.Groups["name2"].Value; // " any-"
var field3 = match.Groups["name3"].Value; // "8888"
Here is your regex.
I tested all of your samples. This one uses an input that you said should not match but did; this time, it does not match:
var match = Regex.Match("AB FG Z", @"
    ^A
    (?<=^.{1}) (?<name1>([ ]{5}|(?>[A-Z]{1,5})[ ]{0,4})) (?=.{7}$)
    (?<=^.{6}) (?<name2>([ ]{6}|(?>[A-Z]{1,6})[ ]{0,5})) (?=.{1}$)
    Z$
    ",
    RegexOptions.IgnorePatternWhitespace);
// no match with this input string
Match match = Regex.Match(
    Regex.Replace(text, @"^(.)(.{5})(.{6})(.)$", "$1,$2,$3,$4"),
    @"^[A-Z ],[A-Z]*[ ]*,[A-Z]*[ ]*,[A-Z ]$");
Check this code here.
I think it is possible to validate it with a single regex pattern:
^[A-Z ][A-Z]*[ ]*(?<=^.{6})[A-Z]*[ ]*(?<=^.{12})[A-Z ]$
If you also need to capture each of the groups, use
^([A-Z ])([A-Z]*[ ]*)(?<=^.{6})([A-Z]*[ ]*)(?<=^.{12})([A-Z ])$
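A sketch of how the capturing pattern could be used from C#. The lookbehinds `(?<=^.{6})` and `(?<=^.{12})` force the second and third groups to end exactly at offsets 6 and 12, so each group takes its full field width. The class name `SingleRegexSketch`, the method `Fields`, and the choice to return null for a non-matching record are assumptions for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper around the capturing pattern above.
public static class SingleRegexSketch
{
    private static readonly Regex Pattern = new Regex(
        @"^([A-Z ])([A-Z]*[ ]*)(?<=^.{6})([A-Z]*[ ]*)(?<=^.{12})([A-Z ])$");

    // Returns the four fields, or null when the record does not validate.
    public static string[] Fields(string record)
    {
        Match m = Pattern.Match(record);
        if (!m.Success)
        {
            return null;
        }
        return new[]
        {
            m.Groups[1].Value, m.Groups[2].Value,
            m.Groups[3].Value, m.Groups[4].Value
        };
    }
}
```

For "ABCDEFGHIJKLZ" this yields "A", "BCDEF", "GHIJKL" and "Z". A record whose 5-character field has a space between letters, such as "A" + "B D  " + six spaces + "Z", returns null.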
I have already posted this before, but this answer is more specific to your question, and not generalized.
This solves all the cases you have presented in your question, the way you wanted.
Program to test all cases in your question
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        var strMatch = new string[]
        {
            // Lines that should match:
            "ABCDEFGHIJKLZ",
            "A Z",
            "AB Z",
            "A G Z",
            "AB G Z",
            "ABCDEF Z",
            "ABCDEFG Z",
            "A GHIJKLZ",
            "AB GHIJKLZ",
        };
        var strNotMatch = new string[]
        {
            // Lines that should not match:
            "AB D Z",
            "AB D F Z",
            "AB F Z",
            "A G I Z",
            "A G I LZ",
            "A G LZ",
            "AB FG LZ",
            "AB D FG Z",
            "AB FG I Z",
            "AB D FG i Z",
            // The following 3 should not match but do.
            "AB FG Z",
            "AB FGH Z",
            "AB EFGH Z",
        };
        var pattern = @"
            ^A
            (?<=^.{1}) (?<name1>([ ]{5}|(?>[A-Z]{1,5})[ ]{0,4})) (?=.{7}$)
            (?<=^.{6}) (?<name2>([ ]{6}|(?>[A-Z]{1,6})[ ]{0,5})) (?=.{1}$)
            Z$
            ";
        // Every string in strMatch must be accepted by the pattern
        foreach (var eachStrThatMustMatch in strMatch)
        {
            var match = Regex.Match(eachStrThatMustMatch,
                pattern, RegexOptions.IgnorePatternWhitespace);
            if (!match.Success)
                throw new Exception("Should match.");
        }
        // Every string in strNotMatch must be rejected by the pattern
        foreach (var eachStrThatMustNotMatch in strNotMatch)
        {
            var match = Regex.Match(eachStrThatMustNotMatch,
                pattern, RegexOptions.IgnorePatternWhitespace);
            if (match.Success)
                throw new Exception("Should not match.");
        }
    }
}