I am trying to find the numbers of apples and oranges in different strings using Pidgin, but I can't seem to skip over variable lengths of text. I want to find the numbers 1, 2, 3 and 4 in the following:
List<string> testStrings = new List<string> {
"12 blerg",
"1 apples",
" 2 apples ignore_this",
"3 oranges ignore_this",
"this 5 6 should be ignored but fails 4 apples ignore_this"};
I can get 1, 2 and 3 to work, i.e. skip whitespace and ignore text after the keywords. I have tried SkipUntil but can't get it to work: the string "this 5 6 should be ignored but fails 4 apples ignore_this" should return the number 4, but it is skipped completely.
The code is in this (non-working) fiddle, so you need to run it locally :/
https://dotnetfiddle.net/jEppA8
edit: full listing below:
using System;
using System.Collections.Generic;
using Pidgin;
using static Pidgin.Parser;
using static Pidgin.Parser<char>;
public class Program
{
public static void Main()
{
Parser<char, string> multiThings = OneOf(
String("apples"),
String("oranges")
);
Parser<char, string> amountOfThings = Digit.ManyString().Between(Whitespaces, Whitespaces);
Parser<char, string> amountOfThingsFollowedBy = amountOfThings.Before(Whitespaces.Before(multiThings));
Parser<char, string> simpleSkipBefore = amountOfThings.Before(Any.SkipUntil(amountOfThings));
Parser<char, string> simpleSkipAfter = Any.SkipUntil(amountOfThings).Then(amountOfThingsFollowedBy);
Parser<char, string> amountOfThingsAnyWhere = OneOf(
amountOfThingsFollowedBy,
simpleSkipBefore,
simpleSkipAfter
);
List<string> testStrings = new List<string> {
"12 blerg",
"1 apples",
" 2 apples ignore_this",
"3 oranges ignore_this",
"this 5 6 should be ignored but fails 4 apples ignore_this"};
foreach (var str in testStrings)
{
try
{
Console.WriteLine(str + " ----> " + amountOfThingsAnyWhere.ParseOrThrow(str));
} catch (Exception e)
{
Console.WriteLine(str + " exception: " + e);
}
}
}
}
If you are OK with using Regular Expressions, here is how you can do it.
I am only showing for "apple"/"apples" but it would be similar for oranges, or any other word whose plural form requires suffixing with "s".
private List<string> CountApples(string s)
{
Regex rg = new Regex(@"([0-9]+)\s\bapples?\b");
MatchCollection matches = rg.Matches(s);
var result = new List<string>();
foreach (Match match in matches)
{
var quantifierGroup = match.Groups[1];
result.Add(quantifierGroup.Value);
}
return result;
}
Input:
"this 5 6 should be ignored but fails 4 apples ignore_this and 42 apples again"
Output:
[4, 42]
The regex ([0-9]+)\s\bapples?\b looks for any number [0-9]+ and puts it into a group () so that it can be retrieved later.
It then expects a single whitespace character \s, followed by exactly "apple" or "apples" (thanks to the \b word delimiters).
It will not match words that merely start with "apple", such as "applewood".
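To cover oranges as well, the same idea can be extended with an alternation. This is only a minimal sketch and not part of the answer above: the names FruitCounter and CountFruit are made up for illustration, and it assumes that "apple(s)" and "orange(s)" are the only words of interest.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class FruitCounter
{
    // Digits, at least one whitespace character, then the whole word
    // "apple(s)" or "orange(s)". Group 1 is the number, group 2 the fruit.
    private static readonly Regex FruitRegex =
        new Regex(@"([0-9]+)\s+(apples?|oranges?)\b");

    // Returns every number that is directly followed by one of the fruit words.
    public static List<string> CountFruit(string s)
    {
        var result = new List<string>();
        foreach (Match match in FruitRegex.Matches(s))
        {
            result.Add(match.Groups[1].Value);
        }
        return result;
    }

    public static void Main()
    {
        var input = "this 5 6 should be ignored but fails 4 apples ignore_this and 3 oranges";
        Console.WriteLine(string.Join(", ", CountFruit(input))); // prints "4, 3"
    }
}
```

Numbers that are not followed by a fruit word (such as "12 blerg") produce no match, so the returned list is simply empty for those inputs.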
I was able to make it work with the following:
Parser<char, string> amount = Digit.ManyString();
Parser<char, string> amountOfThings =
amount.Before(Whitespace.AtLeastOnce()).Before(multiThings);
Parser<char, string> amountOfThingsAnyWhere =
Any.SkipUntil(Try(Lookahead(amountOfThings))).Then(amountOfThings);
This works with your given input as well as:
"12 oranges",
" 12 oranges",
"xx12 oranges",
" all text no digits ",
"2 apples ignore_this",
"xx2 apples ignore_this",
"xx2 2 apples ignore_this",
"apples",
"apples ignore_this"
Lookahead is necessary to prevent SkipUntil from consuming the terminator (amountOfThings).
Try is necessary to allow backtracking when only the start of the pattern is encountered (in this case, any digit that is not followed by an apple or an orange).
I am still trying to get it to work when amountOfThings is modified to
Parser<char, string> amountOfThings =
amount.Then(Whitespace.AtLeastOnce()).Then(multiThings);
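I was hoping this would return "4 apples" and so on, but it only returns the name of the fruit, because Then discards the result of the parser on its left and keeps only the result of the one on its right.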
EDIT:
Here is how to make it say the number and name of the fruit as well:
Parser<char, string> amount = Digit.ManyString();
Parser<char, string> amountOfThings =
amount
.Then(Whitespace.AtLeastOnce())
.Then(multiThings);
Parser<char, string> amountOfThingsEntire =
amount
.Before(Whitespace.AtLeastOnce())
.Then(multiThings, (amountResult, fruitResult) => $"{amountResult} {fruitResult}");
Parser<char, string> amountOfThingsAnyWhere =
Any.SkipUntil(Try(Lookahead(amountOfThings))).Then(amountOfThingsEntire);
Suppose I have a list of strings {"boy", "car", "ball"} and a text "the boy sold his car to buy a ball".
Given another string list {"dog", "bar", "bone"}, my objective is to find all occurrences of the first list inside the text and swap them for the strings of the second list:
BEFORE: the [boy] sold his [car] to buy a [ball]
AFTER: the [dog] sold his [bar] to buy a [bone]
My first thought was to use Regex but I have no idea how to associate a list of strings into a regex and I don't want to write Aho-Corasick.
What is the right way to go for that?
Another example:
Text: aaa bbb abab aabb bbaa ubab
replacing {aa, bb, ab, ub} for {11, 22, 35, &x}
BEFORE: [aa]a [bb]b [ab][ab] [aa][bb] [bb][aa] [ub][ab]
AFTER: [11]a [22]b [35][35] [11][22] [22][11] [&x][35]
If you want to use regex, you may use something like this:
var findList = new List<string>() { "boy", "car", "ball" };
var replaceList = new List<string>() { "dog", "bar", "bone" };
// Create a dictionary from the lists or have a dictionary from the beginning.
var dictKeywords = findList.Select((s, i) => new { s, i })
.ToDictionary(x => x.s, x => replaceList[x.i]);
string input = "the boy sold his car to buy a ball";
// Construct the regex pattern by joining the dictionary keys with an 'OR' operator.
string pattern = string.Join("|", dictKeywords.Keys.Select(s => Regex.Escape(s)));
string output =
Regex.Replace(input, pattern, delegate (Match m)
{
string replacement;
if (dictKeywords.TryGetValue(m.Value, out replacement)) return replacement;
return m.Value;
});
Console.WriteLine(output);
Output:
the dog sold his bar to buy a bone
There is no need to use Regex; string.Replace would suffice:
var input = "the boy sold his car to buy a ball";
var oldvalues = new List<string>() { "boy", "car", "ball" };
var newValues = new List<string>() { "dog", "bar", "bone" };
var output = input;
for (int i = 0; i < oldvalues.Count; i++)
{
output = output.Replace(oldvalues[i], newValues[i]);
}
Console.WriteLine(output);
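One thing to watch for when comparing the two approaches: chained string.Replace calls each see the output of the previous call, so a replacement value that also appears in the find list gets replaced again, whereas a single Regex.Replace pass only looks at the original text. The sketch below illustrates this with a made-up input; the names MultiReplace, Sequential and SinglePass are hypothetical, and it assumes the find list is non-empty and the two lists have the same length.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class MultiReplace
{
    // Chained replacement: every call works on the result of the previous one.
    public static string Sequential(string input, IList<string> oldValues, IList<string> newValues)
    {
        var output = input;
        for (int i = 0; i < oldValues.Count; i++)
        {
            output = output.Replace(oldValues[i], newValues[i]);
        }
        return output;
    }

    // Single pass: one alternation pattern, each match is looked up in a dictionary.
    public static string SinglePass(string input, IList<string> oldValues, IList<string> newValues)
    {
        var map = new Dictionary<string, string>();
        for (int i = 0; i < oldValues.Count; i++)
        {
            map[oldValues[i]] = newValues[i];
        }
        string pattern = string.Join("|", oldValues.Select(s => Regex.Escape(s)));
        return Regex.Replace(input, pattern, m => map[m.Value]);
    }

    public static void Main()
    {
        var find = new List<string> { "car", "bar" };
        var replace = new List<string> { "bar", "pub" };
        var text = "the car is at the bar";
        Console.WriteLine(Sequential(text, find, replace)); // the pub is at the pub
        Console.WriteLine(SinglePass(text, find, replace)); // the bar is at the pub
    }
}
```

When no replacement value overlaps with a search value, as in both examples from the question, the two approaches give the same result.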
I am trying to split a string into an array of strings.
My current string looks like this and this is all in one string. It also has newlines (\r\n) and spaces. I put a better-looking example here.
BFFPPB14 Dark Chocolate Dried Cherries 14 oz (397g)
INGREDIENTS: DARK CHOCOLATE (SUGAR, CHOCOLATE LIQUOR, COCOA BUTTER,
ANHYDROUS MILK FAT, SOYA LECITHIN, VANILLIN [AN ARTIFICIAL FLAVOR]), DRIED
TART CHERRIES (CHERRIES, SUGAR), GUM ARABIC, CONFECTIONER'S GLAZE.
CONTAINS: MILK, SOY
ALLERGEN INFORMATION: MAY CONTAIN TREE NUTS, PEANUTS, EGG AND
WHEAT.
01/11/2019
Description: Sweetened dried Montmorency cherries that are panned with dark chocolate.
Storage Conditions: Store at ambient temperatures with a humidity less than 50%.
Shelf Life: 9 months
Company Name
Item No.: 701804
Bulk: 415265
Supplier: Cherryland's Best
WARNING: CHERRIES MAY CONTAIN PITS
My Regex looks like this
List<string> result = Regex.Split(text, @"INGREDIENTS: |CONTAINS: |ALLERGEN INFORMATION: |(\d{1,2}/\d{1,2}/\d{2,4})|Description: |Storage Conditions: |Shelf Life: |Company Name|Item No.: |Bulk: |Supplier: |WARNING: ").ToList();
This is what result looks like
Note: The first string is the product name
Sometimes I get strings that don't have a supplier or a warning. In that case I want the result to contain an empty string for each split value that was not found.
Example:
result[0] = "blabla"
result[1] = ""
result[2] = "blabla"
That way I know that result[1] was split on the value (INGREDIENTS: ) and I can assign it to something.
Using a regex may have performance concerns if you are using this in a high volume application. Below is one possible regex you could use. It is somewhat difficult to parse the product line and the "company name" line since it wasn't clear if the product code had a pattern and the company name line didn't have a ':' like the other fields, so the regex is somewhat "hacky" in those areas:
using System;
using System.Text.RegularExpressions;
using System.Linq;
namespace so20190113_01 {
class Program {
static void Main(string[] args) {
string text =
@"BFFPPB14 Dark Chocolate Dried Cherries 14 oz (397g)
INGREDIENTS: DARK CHOCOLATE (SUGAR, CHOCOLATE LIQUOR, COCOA BUTTER, ANHYDROUS MILK FAT, SOYA LECITHIN, VANILLIN [AN ARTIFICIAL FLAVOR]), DRIED TART CHERRIES (CHERRIES, SUGAR), GUM ARABIC, CONFECTIONER'S GLAZE.
CONTAINS: MILK, SOY
ALLERGEN INFORMATION: MAY CONTAIN TREE NUTS, PEANUTS, EGG AND WHEAT.
01/11/2019
Description: Sweetened dried Montmorency cherries that are panned with dark chocolate.
Storage Conditions: Store at ambient temperatures with a humidity less than 50%. Shelf Life: 9 months
Company Name
Item No.: 701804
Bulk: 415265
Supplier: Cherryland's Best
WARNING: CHERRIES MAY CONTAIN PITS";
string pat =
@"^\s*(?<product>\w+\s+\w+\s+\w*[^:]+)$
|^ingredients:\s*(?<ingredients>.*)$
|^contains:\s*(?<contains>.*)$
|^allergen\s+information:\s*(?<allergen>.*)$
|^(?<date>(\d{1,2}/\d{1,2}/\d{2,4}))$
|^description:\s*(?<description>.*)$
|^storage\sconditions:\s*(?<storage>.*)$
|^shelf\slife:\s*(?<shelf>.*)$
|^company\sname\s*(?<company>.*)$
|^item\sno\.:\s*(?<item>.*)$
|^bulk:\s*(?<bulk>.*)$
|^supplier:\s*(?<supplier>.*)$
|^warning:\s*(?<warning>.*)$
";
Regex r = new Regex(pat, RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace | RegexOptions.Multiline);
// Match the regular expression pattern against a text string.
Match m = r.Match(text); // you might want to use the overload that supports a timeout value
Console.WriteLine("Start---");
while (m.Success) {
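foreach (Group g in m.Groups.Cast<Group>().Where(x => x.Success)) { // Cast<Group>() also works on runtimes where GroupCollection is not IEnumerable<Group>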
switch (g.Name) {
case "product":
Console.WriteLine($"Product({g.Success}): '{g.Value.Trim()}'");
break;
case "ingredients":
Console.WriteLine($"Ingredients({g.Success}): '{g.Value.Trim()}'");
break;
// etc.
}
}
m = m.NextMatch();
}
Console.WriteLine("End---");
}
}
}
I think a parser is the only way. Originally, I tried using this regex:
^([\w \.]+?):([\s\S]+?)(?=((^[\w \.]+?):))
The key component there is the look-ahead ?= which allows the string to match all text from label to label. However, it doesn't work on the final line item since it does not precede another label and I could not find a regex that stops matching at a pattern that may not exist. If that regex exists, you can do it all in one line of code:
KeyValuePair<string, string>[] kvs = null;
//one line of code if the look-ahead would also consider non-existent matches
kvs = Regex.Matches(text, @"^([\w \.]+?):([\s\S]+?)(?=((^[\w \.]+?):))", RegexOptions.Multiline)
.Cast<Match>()
.Select(x => new KeyValuePair<string, string>(x.Groups[1].Value, x.Groups[2].Value.Trim(' ', '\r', '\n', '\t')))
.ToArray();
The following code does it well enough. Note that the document is not formatted consistently, in that Company Name is not followed by a colon. A label at the start of a line followed by a colon is the only anchor pattern that will work, since several values are broken across multiple lines.
KeyValuePair<string, string>[] kvs = null;
//Otherwise, you have to write a parser
//get all start indexes of labels
var matches = Regex.Matches(text, @"^.+?:", RegexOptions.Multiline).Cast<Match>().ToArray();
kvs = new KeyValuePair<string, string>[matches.Length];
KeyValuePair<string, string> GetKeyValuePair(Match match1, int match1EndIndex)
{
//get the label
var label = text.Substring(match1.Index, match1.Value.Length - 1);
//get the desc and trim white space
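// start right after the colon; the Trim below removes the separating space,
// and this avoids running past the end of the text when a label has no value
var descStart = match1.Index + match1.Value.Length;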
var desc = text
.Substring(descStart, match1EndIndex - descStart)
.Trim(' ', '\r', '\n', '\t');
return new KeyValuePair<string, string>(label, desc);
}
for (int i = 0; i < matches.Length - 1; i++)
{
kvs[i] = GetKeyValuePair(matches[i], matches[i + 1].Index);
}
kvs[kvs.Length - 1] = GetKeyValuePair(matches[matches.Length - 1], text.Length);
foreach (var kv in kvs)
{
Console.WriteLine($"{kv.Key}: {kv.Value}");
}
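For what it's worth, the look-ahead regex from earlier in this answer can probably be made to handle the final item by letting the look-ahead also accept the end of the string (\z). The following is a hedged sketch rather than a tested drop-in: the class name LabelParser is invented for illustration, and it shares the limitation that any line consisting of word characters, spaces or dots followed by a colon is treated as a label.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class LabelParser
{
    // Group 1: the label at the start of a line (up to the first colon).
    // Group 2: everything up to the next label line OR the end of the text (\z).
    private static readonly Regex LabelRegex = new Regex(
        @"^([\w \.]+?):([\s\S]+?)(?=^[\w \.]+?:|\z)",
        RegexOptions.Multiline);

    public static KeyValuePair<string, string>[] Parse(string text)
    {
        return LabelRegex.Matches(text)
            .Cast<Match>()
            .Select(m => new KeyValuePair<string, string>(
                m.Groups[1].Value,
                m.Groups[2].Value.Trim()))
            .ToArray();
    }

    public static void Main()
    {
        var text = "Product line\r\nINGREDIENTS: SUGAR,\r\nCOCOA\r\nCONTAINS: MILK\r\nWARNING: MAY CONTAIN PITS";
        foreach (var kv in Parse(text))
        {
            Console.WriteLine($"{kv.Key}: {kv.Value}");
        }
    }
}
```

Lines without a colon, such as the product line or the date, are not returned as labels; they are either skipped or end up inside the value of the preceding label.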
So, if your requirements are to:
find a line starting with a specific word
use LINQ
use StartsWith
then the code below does that:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
namespace ConsoleApp12
{
class Program
{
public static void Main(string[] args)
{
// test string
var str = @"BFFPPB10 Dark Chocolate Macadamia Nuts 11 oz (312g)\r\nINGREDIENTS: DARK CHOCOLATE (SUGAR, CHOCOLATE, COCOA BUTTER, \r\nANHYDROUS MILK FAT, SOY LECITHIN, VANILLA), MACADAMIA NUTS, SEA SALT.\r\nCONTAINS: MACADAMIA NUTS, MILK, SOY.\r\nALLERGEN INFORMATION: MAY CONTAIN OTHER TREE NUTS, PEANUTS, EGG AND\r\nWHEAT.\r\n01/11/2019\r\nDescription: Dry roasted, salted macadamias covered in dark chocolate.\r\nStorage Conditions: Store at ambient temperatures with a humidity less than 50%. \r\nShelf Life: 12 months\r\nBlain's Farm & Fleet\r\nItem No.: 701772\r\nBulk: 421172\r\nSupplier: Devon's\r\n";
// Keys
const string KEY_INGREDIENTS = "INGREDIENTS:";
const string KEY_CONTAINS = "CONTAINS:";
const string KEY_ALLERGEN_INFORMATION = "ALLERGEN INFORMATION:";
const string KEY_DESCRPTION = "Description:";
const string KEY_STORAGE_CONDITION = "Storage Conditions:";
const string KEY_SHELFLIFE = "Shelf Life:";
const string KEY_ITEM_NO = "Item No.:";
const string KEY_BULK = "Bulk:";
const string KEY_SUPPLIER = "Supplier:";
const string KEY_WARNING = "WARNING:";
const string KEY_YEAR_Regex = @"^\d{1,2}/\d{1,2}/\d{4}$";
const string KEY_AFTER_COMPANY_NAME = KEY_ITEM_NO;
// Helpers
var keys = new string[]
{ KEY_INGREDIENTS, KEY_CONTAINS, KEY_ALLERGEN_INFORMATION, KEY_DESCRPTION, KEY_STORAGE_CONDITION,
KEY_SHELFLIFE, KEY_ITEM_NO, KEY_BULK, KEY_SUPPLIER, KEY_WARNING };
var lines = str.Split(new string[] { @"\r\n" }, StringSplitOptions.RemoveEmptyEntries);
void log(string key, string val)
{
Console.WriteLine($"{key} => {val}");
Console.WriteLine();
}
void removeLine(string line)
{
if (line != null) lines = lines.Where(w => w != line).ToArray();
}
// get Multi Line Item with key
string getMultiLine(string key)
{
var line = lines
.Select((linetxt, index) => new { linetxt, index })
.Where(w => w.linetxt.StartsWith(key))
.FirstOrDefault();
if (line == null) return string.Empty;
var result = line.linetxt;
for (int i = line.index + 1; i < lines.Length; i++)
{
if (!keys.Any(a => lines[i].StartsWith(a)))
result += lines[i];
else
break;
}
return result;
}
// get the single line before a specific key, if that line is not itself a key
string getLinebefore(string the_after_key)
{
var the_after_line = lines
.Select((linetxt, index) => new { linetxt, index })
.Where(w => w.linetxt.StartsWith(the_after_key))
.FirstOrDefault();
if (the_after_line == null) return string.Empty;
var the_before_line_text = lines[the_after_line.index - 1];
//not a key
if (!keys.Any(a => the_before_line_text.StartsWith(a)))
return the_before_line_text;
else
return null;
}
// 1st get item without key
var itemName = lines.FirstOrDefault();
removeLine(itemName);
var year = lines.Where(w => Regex.Match(w, KEY_YEAR_Regex).Success).FirstOrDefault();
removeLine(year);
var companyName = getLinebefore(KEY_AFTER_COMPANY_NAME);
removeLine(companyName);
//2nd get item with Keys
var ingredients = getMultiLine(KEY_INGREDIENTS);
var contanins = getMultiLine(KEY_CONTAINS);
var allergenInfromation = getMultiLine(KEY_ALLERGEN_INFORMATION);
var description = getMultiLine(KEY_DESCRPTION);
var storageConditions = getMultiLine(KEY_STORAGE_CONDITION);
var shelfLife = getMultiLine(KEY_SHELFLIFE);
var itemNo = getMultiLine(KEY_ITEM_NO);
var bulk = getMultiLine(KEY_BULK);
var supplier = getMultiLine(KEY_SUPPLIER);
var warning = getMultiLine(KEY_WARNING);
// 3rd log
log("ItemName", itemName);
log("Ingredients", ingredients);
log("contanins", contanins);
log("Allergen Infromation", allergenInfromation);
log("Year", year);
log("Description", description);
log("Storage Conditions", storageConditions);
log("Shelf Life", shelfLife);
log("CompanyName", companyName);
log("Item No", itemNo);
log("Bulk", bulk);
log("Supplier", supplier);
log("warning", warning);
Console.ReadLine();
}
}
}
will output
ItemName => BFFPPB10 Dark Chocolate Macadamia Nuts 11 oz (312g)
Ingredients => INGREDIENTS: DARK CHOCOLATE (SUGAR, CHOCOLATE, COCOA
BUTTER, ANHYDROUS MILK FAT, SOY LECITHIN, VANILLA), MACADAMIA NUTS,
SEA SALT.
contanins => CONTAINS: MACADAMIA NUTS, MILK, SOY.
Allergen Infromation => ALLERGEN INFORMATION: MAY CONTAIN OTHER TREE
NUTS, PEANUTS, EGG ANDWHEAT.
Year => 01/11/2019
Description => Description: Dry roasted, salted macadamias covered in
dark chocolate.
Storage Conditions => Storage Conditions: Store at ambient
temperatures with a humidity less than 50%.
Shelf Life => Shelf Life: 12 months
CompanyName => Blain's Farm & Fleet
Item No => Item No.: 701772
Bulk => Bulk: 421172
Supplier => Supplier: Devon's
warning =>
I want to parse a date string written in Chinese (Hong Kong).
I have the month and year as below:
二零一六年六月份 ===>June 2016
二零一五年六月份 ===>June 2015
I have used the culture info (zh-HK) to get the month.
But how do I get the year? Please help.
Basically, you need to create a dictionary that uses the Chinese characters as the key and the corresponding numbers as the value:
var dict = new Dictionary<String, String>() {
{"零", "0"},
{"一", "1"},
{"二", "2"},
{"三", "3"},
{"四", "4"},
{"五", "5"},
{"六", "6"},
{"七", "7"},
{"八", "8"},
{"九", "9"},
{"十", "1"} // this is needed for the months to work. If you know Chinese you would know what I mean
};
Then, you split the input string with the separator "年":
string[] twoParts = inputString.Split('年');
You loop through each character of the first part. Using the dictionary you created, you can easily get 2016 from "二零一六".
For the second part, check whether "份" is present at the end. If it is, substring it off. (sometimes months can be written without "份"). After that, do one more substring to get rid of the "月".
Now you use the dictionary above again to turn something like "十二" to "12"
Now you have the year and the month, just create a new instance of DateTime!
Here's the full code:
string inputString = ...;
var dict = new Dictionary<String, String>() {
{"零", "0"},
{"一", "1"},
{"二", "2"},
{"三", "3"},
{"四", "4"},
{"五", "5"},
{"六", "6"},
{"七", "7"},
{"八", "8"},
{"九", "9"},
{"十", "1"} // this is needed for the months to work. If you know Chinese you would know what I mean
};
string[] twoParts = inputString.Split ('年');
StringBuilder yearBuilder = new StringBuilder ();
foreach (var character in twoParts[0]) {
yearBuilder.Append (dict [character.ToString ()]);
}
string month = twoParts [1];
if (month [month.Length - 1] == '份') {
month = month.Substring (0, month.Length - 1);
}
month = month.Substring (0, month.Length - 1);
StringBuilder monthBuilder = new StringBuilder ();
foreach (var character in month) {
monthBuilder.Append (dict [character.ToString ()]);
}
var date = new DateTime (Convert.ToInt32 (yearBuilder.ToString()), Convert.ToInt32 (monthBuilder.ToString()), 1);
Console.WriteLine (date);
EDIT:
I just realized that this doesn't work if the month is October, in which case it will parse to January. To fix this, you need to use a separate dictionary for the months. Since the SE editor doesn't allow me to enter too many Chinese characters, I will try to tell you what to put in this dictionary in the comments.
When you parse the months, please use the new dictionary. So now the month parsing code will look like this:
month = month.Substring (0, month.Length - 1);
string monthNumberString = newDict[month];
There is no need for the foreach loop.
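Putting the pieces together, the whole conversion might look like the sketch below. It is not the answerer's exact code: the class name ChineseDateParser and the Months table are assumptions made for illustration (the table simply maps the standard numerals for one to twelve), and it assumes the input always has the form year + 年 + month + 月, optionally followed by 份.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class ChineseDateParser
{
    // Digit-by-digit mapping, used for the year (e.g. 二零一六 -> 2016).
    private static readonly Dictionary<char, char> Digits = new Dictionary<char, char>
    {
        {'零', '0'}, {'一', '1'}, {'二', '2'}, {'三', '3'}, {'四', '4'},
        {'五', '5'}, {'六', '6'}, {'七', '7'}, {'八', '8'}, {'九', '9'}
    };

    // Hypothetical month table: whole-string lookup, so 十 is 10 and not 1.
    private static readonly Dictionary<string, int> Months = new Dictionary<string, int>
    {
        {"一", 1}, {"二", 2}, {"三", 3}, {"四", 4}, {"五", 5}, {"六", 6},
        {"七", 7}, {"八", 8}, {"九", 9}, {"十", 10}, {"十一", 11}, {"十二", 12}
    };

    public static DateTime Parse(string input)
    {
        string[] twoParts = input.Split('年');

        // Year: translate each character to its digit.
        var yearBuilder = new StringBuilder();
        foreach (char c in twoParts[0])
        {
            yearBuilder.Append(Digits[c]);
        }

        // Month: strip the optional 份 and then the 月 before the lookup.
        string month = twoParts[1];
        if (month.EndsWith("份", StringComparison.Ordinal))
            month = month.Substring(0, month.Length - 1);
        if (month.EndsWith("月", StringComparison.Ordinal))
            month = month.Substring(0, month.Length - 1);

        return new DateTime(int.Parse(yearBuilder.ToString()), Months[month], 1);
    }

    public static void Main()
    {
        Console.WriteLine(Parse("二零一六年六月份").ToString("yyyy-MM")); // 2016-06
        Console.WriteLine(Parse("二零一五年十月").ToString("yyyy-MM"));   // 2015-10
    }
}
```

Unknown characters cause a KeyNotFoundException in this sketch; real input would likely need validation first.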
I need to pull the city and state out of strings of data that look as follows:
8 mi SSW of Newtown, PA
10 mi SE of Milwaukee, WI
29 Miles E of Orlando, FL
As of right now I am passing each string individually into a method
string statusLocation = "8 mi SSW of Newtown, PA";
etc. one at a time.
What would be the best way to search this string for the city and state? I was thinking of either a regex, or Substring and the index of the comma, etc. I wasn't quite sure what kind of issues I would run into if a state is 3 characters or a city has a comma in it, because this includes Canadian data as well and I am not sure how they abbreviate things.
You could split the string on spaces, limiting the result to five parts:
string str = "8 mi SSW of Newtown, PA";
var parts = str.Split(new[] {' '}, 5);
parts then looks like this: { "8", "mi", "SSW", "of", "Newtown, PA" }, and you can access the "Newtown, PA" easily with parts[4].
You could use this regular expression:
of (.*), ([a-zA-Z]{2})$
That will capture everything after the "of", up to a comma that is followed by a space, then two letters, then the end of the line. For example:
var regex = new Regex("of (.*), ([a-zA-Z]{2})$");
var strings = new[]
{
"8 mi SSW of Newtown, PA",
"10 mi SE of Milwaukee, WI",
"29 Miles E of Orlando, FL"
};
foreach (var str in strings)
{
var match = regex.Match(str);
var city = match.Groups[1];
var state = match.Groups[2];
Console.Out.WriteLine("state = {0}", state);
Console.Out.WriteLine("city = {0}", city);
}
This of course assumes some consistency with the data, like the state being two letters.
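If the region code might be two or three letters, or the city itself might contain a comma, the pattern can be loosened a little. The sketch below is one possible variation under those assumptions and not a statement about how Canadian data is actually abbreviated; LocationParser and TryParse are names invented for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LocationParser
{
    // Assumption: the region code is 2 or 3 letters and is the last thing
    // on the line. The greedy city group keeps any earlier commas.
    private static readonly Regex LocationRegex =
        new Regex(@"\bof\s+(?<city>.+),\s*(?<state>[A-Za-z]{2,3})\s*$");

    // Returns false (and empty strings) when the input does not fit the pattern.
    public static bool TryParse(string statusLocation, out string city, out string state)
    {
        Match match = LocationRegex.Match(statusLocation);
        city = match.Success ? match.Groups["city"].Value : string.Empty;
        state = match.Success ? match.Groups["state"].Value : string.Empty;
        return match.Success;
    }

    public static void Main()
    {
        if (TryParse("8 mi SSW of Newtown, PA", out string city, out string state))
        {
            Console.WriteLine("city = {0}, state = {1}", city, state); // city = Newtown, state = PA
        }
    }
}
```

Checking match.Success before reading the groups avoids silently printing empty values for lines that do not follow the expected layout.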
I'm working on application which parses Google Calendar via Google API to DDay.iCal
The main attributes, properties are handled easily... ev.Summary = evt.Title.Text;
The problem is that when I get a recurring event, the XML contains a field like:
<gd:recurrence>
DTSTART;VALUE=DATE:20100916
DTEND;VALUE=DATE:20100917
RRULE:FREQ=YEARLY
</gd:recurrence>
or
<gd:recurrence>
DTSTART:20100915T220000Z
DTEND:20100916T220000Z
RRULE:FREQ=YEARLY;BYMONTH=9;WKST=SU"
</gd:recurrence>
using the following code:
String[] lines =
evt.Recurrence.Value.Split(new char[] { '\n', '\r' }, StringSplitOptions.RemoveEmptyEntries);
foreach (String line in lines)
{
if (line.StartsWith("R"))
{
RecurrencePattern rp = new RecurrencePattern(line);
ev.RecurrenceRules.Add(rp);
}
else
{
ISerializationContext ctx = new SerializationContext();
ISerializerFactory factory = new DDay.iCal.Serialization.iCalendar.SerializerFactory();
ICalendarProperty property = new CalendarProperty();
IStringSerializer serializer = factory.Build(property.GetType(), ctx) as IStringSerializer;
property = (ICalendarProperty)serializer.Deserialize(new StringReader(line));
ev.Properties.Add(property);
Console.Out.WriteLine(property.Name + " - " + property.Value);
}
}
RRULEs are parsed correctly, but the problem is that other property (datetimes) values are empty...
Here is the starting point of what I'm doing, going off of the RFC-5545 spec's recurrence rule. It isn't complete to the spec and may break given certain input, but it should get you going. I think this should all be doable using RegEx, and something as heavy as a recursive descent parser would be overkill.
RRULE:(?:FREQ=(SECONDLY|MINUTELY|HOURLY|DAILY|WEEKLY|MONTHLY|YEARLY);?)?(?:COUNT=([0-9]+);?)?(?:INTERVAL=([0-9]+);?)?(?:BYDAY=([A-Z,]+);?)?(?:UNTIL=([0-9]+);?)?
I am building this up using http://regexstorm.net/tester.
The test input I'm using is:
DTSTART;TZID=America/Chicago:20140711T133000\nDTEND;TZID=America/Chicago:20140711T163000\nRRULE:FREQ=WEEKLY;INTERVAL=8;BYDAY=FR;UNTIL=20141101
DTSTART;TZID=America/Chicago:20140711T133000\nDTEND;TZID=America/Chicago:20140711T163000\nRRULE:FREQ=WEEKLY;COUNT=5;INTERVAL=8;BYDAY=FR;UNTIL=20141101
DTSTART;TZID=America/Chicago:20140711T133000\nDTEND;TZID=America/Chicago:20140711T163000\nRRULE:FREQ=WEEKLY;BYDAY=FR;UNTIL=20141101
Sample matching results would look like:
Index | Position | Matched String | $1 | $2 | $3 | $4 | $5
0 | 90 | RRULE:FREQ=WEEKLY;INTERVAL=8;BYDAY=FR;UNTIL=20141101 | WEEKLY | (empty) | 8 | FR | 20141101
1 | 236 | RRULE:FREQ=WEEKLY;COUNT=5;INTERVAL=8;BYDAY=FR;UNTIL=20141101 | WEEKLY | 5 | 8 | FR | 20141101
2 | 390 | RRULE:FREQ=WEEKLY;BYDAY=FR;UNTIL=20141101 | WEEKLY | (empty) | (empty) | FR | 20141101
Usage is like:
string freqPattern = @"RRULE:(?:FREQ=(SECONDLY|MINUTELY|HOURLY|DAILY|WEEKLY|MONTHLY|YEARLY);?)?(?:COUNT=([0-9]+);?)?(?:INTERVAL=([0-9]+);?)?(?:BYDAY=([A-Z,]+);?)?(?:UNTIL=([0-9]+);?)?";
MatchCollection mc = Regex.Matches(rule, freqPattern, System.Text.RegularExpressions.RegexOptions.IgnoreCase);
foreach (Match m in mc)
{
string frequency = m.Groups[1].ToString();
string count = m.Groups[2].ToString();
string interval = m.Groups[3].ToString();
string byday = m.Groups[4].ToString();
string until = m.Groups[5].ToString();
System.Console.WriteLine("recurrence => frequency: \"{0}\", count: \"{1}\", interval: \"{2}\", byday: \"{3}\", until: \"{4}\"", frequency, count, interval, byday, until);
}
This is a great example of when to use regular expressions. Try this out for general parsing:
\s*(\w+):((\w+=\w+;)+(\w+=\w+)?|\w+)
Or, you might decide to have something more schema-specific.
\s*(?:DTSTART:)(?'Start'\w+)
\s*(?:DTEND:)(?'End'\w+)
\s*(?:RRULE:)(?'Rule'(\w+=\w+;)+(\w+=\w+)?)
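The schema-specific patterns above expect a colon directly after the property name, so lines with parameters such as DTSTART;VALUE=DATE:20100916 would need an extra optional part. A rough sketch of one way to handle both shapes is shown below. It is not DDay.iCal code and not a complete RFC 5545 parser: RecurrenceParser, TryParseLine and ParseRule are hypothetical names, and quoted parameter values containing ':' or ';' are not handled.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RecurrenceParser
{
    // name, optional ";PARAM=VALUE" parameters, a colon, then the value.
    private static readonly Regex LineRegex = new Regex(
        @"^\s*(?<name>[A-Za-z-]+)(?<params>(?:;[^:;]+)*):(?<value>.*?)\s*$");

    public static bool TryParseLine(string line, out string name, out string parameters, out string value)
    {
        Match m = LineRegex.Match(line);
        name = m.Success ? m.Groups["name"].Value : string.Empty;
        parameters = m.Success ? m.Groups["params"].Value.TrimStart(';') : string.Empty;
        value = m.Success ? m.Groups["value"].Value : string.Empty;
        return m.Success;
    }

    // Splits an RRULE value such as "FREQ=YEARLY;BYMONTH=9" into key/value pairs.
    public static Dictionary<string, string> ParseRule(string ruleValue)
    {
        var parts = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        foreach (string part in ruleValue.Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries))
        {
            int eq = part.IndexOf('=');
            if (eq > 0)
            {
                parts[part.Substring(0, eq)] = part.Substring(eq + 1);
            }
        }
        return parts;
    }

    public static void Main()
    {
        if (TryParseLine("DTSTART;VALUE=DATE:20100916", out string name, out string prms, out string value))
        {
            Console.WriteLine("{0} [{1}] = {2}", name, prms, value); // DTSTART [VALUE=DATE] = 20100916
        }
        Console.WriteLine(ParseRule("FREQ=YEARLY;BYMONTH=9;WKST=SU")["BYMONTH"]); // 9
    }
}
```

The value of an RRULE line can then be passed to ParseRule, while DTSTART and DTEND values would still need to be converted to dates separately.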