Parsing Google calendar to DDay.iCal - c#

I'm working on an application that parses Google Calendar data, retrieved via the Google API, into DDay.iCal.
The main attributes and properties are handled easily, e.g. ev.Summary = evt.Title.Text;
The problem is that when I get a recurring event, the XML contains a field like:
<gd:recurrence>
DTSTART;VALUE=DATE:20100916
DTEND;VALUE=DATE:20100917
RRULE:FREQ=YEARLY
</gd:recurrence>
or
<gd:recurrence>
DTSTART:20100915T220000Z
DTEND:20100916T220000Z
RRULE:FREQ=YEARLY;BYMONTH=9;WKST=SU
</gd:recurrence>
using the following code:
String[] lines =
evt.Recurrence.Value.Split(new char[] { '\n', '\r' }, StringSplitOptions.RemoveEmptyEntries);
foreach (String line in lines)
{
if (line.StartsWith("R"))
{
RecurrencePattern rp = new RecurrencePattern(line);
ev.RecurrenceRules.Add(rp);
}
else
{
ISerializationContext ctx = new SerializationContext();
ISerializerFactory factory = new DDay.iCal.Serialization.iCalendar.SerializerFactory();
ICalendarProperty property = new CalendarProperty();
IStringSerializer serializer = factory.Build(property.GetType(), ctx) as IStringSerializer;
property = (ICalendarProperty)serializer.Deserialize(new StringReader(line));
ev.Properties.Add(property);
Console.Out.WriteLine(property.Name + " - " + property.Value);
}
}
The RRULEs are parsed correctly, but the values of the other properties (the date-times) come out empty...

Here is the starting point of what I'm doing, going off the recurrence rule definition in the RFC 5545 spec. It doesn't cover the whole spec and may break on certain input, but it should get you going. I think this should all be doable with a regex, and something as heavy as a recursive descent parser would be overkill.
RRULE:(?:FREQ=(SECONDLY|MINUTELY|HOURLY|DAILY|WEEKLY|MONTHLY|YEARLY);?)?(?:COUNT=([0-9]+);?)?(?:INTERVAL=([0-9]+);?)?(?:BYDAY=([A-Z,]+);?)?(?:UNTIL=([0-9]+);?)?
I am building this up using http://regexstorm.net/tester.
The test input I'm using is:
DTSTART;TZID=America/Chicago:20140711T133000\nDTEND;TZID=America/Chicago:20140711T163000\nRRULE:FREQ=WEEKLY;INTERVAL=8;BYDAY=FR;UNTIL=20141101
DTSTART;TZID=America/Chicago:20140711T133000\nDTEND;TZID=America/Chicago:20140711T163000\nRRULE:FREQ=WEEKLY;COUNT=5;INTERVAL=8;BYDAY=FR;UNTIL=20141101
DTSTART;TZID=America/Chicago:20140711T133000\nDTEND;TZID=America/Chicago:20140711T163000\nRRULE:FREQ=WEEKLY;BYDAY=FR;UNTIL=20141101
Sample matching results would look like:
Index  Position  Matched String                                                $1      $2      $3      $4  $5
0      90        RRULE:FREQ=WEEKLY;INTERVAL=8;BYDAY=FR;UNTIL=20141101          WEEKLY  (none)  8       FR  20141101
1      236       RRULE:FREQ=WEEKLY;COUNT=5;INTERVAL=8;BYDAY=FR;UNTIL=20141101  WEEKLY  5       8       FR  20141101
2      390       RRULE:FREQ=WEEKLY;BYDAY=FR;UNTIL=20141101                     WEEKLY  (none)  (none)  FR  20141101
Usage is like:
string freqPattern = @"RRULE:(?:FREQ=(SECONDLY|MINUTELY|HOURLY|DAILY|WEEKLY|MONTHLY|YEARLY);?)?(?:COUNT=([0-9]+);?)?(?:INTERVAL=([0-9]+);?)?(?:BYDAY=([A-Z,]+);?)?(?:UNTIL=([0-9]+);?)?";
MatchCollection mc = Regex.Matches(rule, freqPattern, System.Text.RegularExpressions.RegexOptions.IgnoreCase);
foreach (Match m in mc)
{
string frequency = m.Groups[1].ToString();
string count = m.Groups[2].ToString();
string interval = m.Groups[3].ToString();
string byday = m.Groups[4].ToString();
string until = m.Groups[5].ToString();
System.Console.WriteLine("recurrence => frequency: \"{0}\", count: \"{1}\", interval: \"{2}\", byday: \"{3}\", until: \"{4}\"", frequency, count, interval, byday, until);
}
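The regex above expects the rule parts in one fixed order (FREQ, COUNT, INTERVAL, BYDAY, UNTIL) and only knows those five keys. As a hedged alternative, here is a minimal sketch that splits the RRULE line on ';' and '=' into a dictionary, so the order does not matter. RRuleParser and Parse are hypothetical names made up for this illustration; they are not part of DDay.iCal or the Google API.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only; not part of DDay.iCal.
public static class RRuleParser
{
    // Turns "RRULE:FREQ=WEEKLY;INTERVAL=8" into { FREQ: WEEKLY, INTERVAL: 8 }.
    // Returns an empty dictionary when the line is not an RRULE line.
    public static Dictionary<string, string> Parse(string line)
    {
        var parts = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        const string prefix = "RRULE:";
        if (line == null || !line.StartsWith(prefix, StringComparison.OrdinalIgnoreCase))
            return parts;

        string body = line.Substring(prefix.Length);
        foreach (string pair in body.Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries))
        {
            int eq = pair.IndexOf('=');
            if (eq <= 0) continue; // skip fragments without a key
            parts[pair.Substring(0, eq).Trim()] = pair.Substring(eq + 1).Trim();
        }
        return parts;
    }
}
```

A caller would then look up only the keys it cares about, for example with parts.TryGetValue("UNTIL", out var until), and treat a missing key as "not specified".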

This is a great example of when to use regular expressions. Try this out for general parsing:
\s*(\w+):((\w+=\w+;)+(\w+=\w+)?|\w+)
Or, you might decide to have something more schema-specific.
\s*(?:DTSTART:)(?'Start'\w+)
\s*(?:DTEND:)(?'End'\w+)
\s*(?:RRULE:)(?'Rule'(\w+=\w+;)+(\w+=\w+)?)

Related

C# Regex Split but include empty string if fails to split

I am trying to split a string into an array of strings.
My current string looks like this and this is all in one string. It also has newlines (\r\n) and spaces. I put a better-looking example here.
BFFPPB14 Dark Chocolate Dried Cherries 14 oz (397g)
INGREDIENTS: DARK CHOCOLATE (SUGAR, CHOCOLATE LIQUOR, COCOA BUTTER,
ANHYDROUS MILK FAT, SOYA LECITHIN, VANILLIN [AN ARTIFICIAL FLAVOR]), DRIED
TART CHERRIES (CHERRIES, SUGAR), GUM ARABIC, CONFECTIONER'S GLAZE.
CONTAINS: MILK, SOY
ALLERGEN INFORMATION: MAY CONTAIN TREE NUTS, PEANUTS, EGG AND
WHEAT.
01/11/2019
Description: Sweetened dried Montmorency cherries that are panned with dark chocolate.
Storage Conditions: Store at ambient temperatures with a humidity less than 50%.
Shelf Life: 9 months
Company Name
Item No.: 701804
Bulk: 415265
Supplier: Cherryland's Best
WARNING: CHERRIES MAY CONTAIN PITS
My Regex looks like this
List<string> result = Regex.Split(text, @"INGREDIENTS: |CONTAINS: |ALLERGEN INFORMATION: |(\d{1,2}/\d{1,2}/\d{2,4})|Description: |Storage Conditions: |Shelf Life: |Company Name|Item No.: |Bulk: |Supplier: |WARNING: ").ToList();
This is what result looks like
Note: The first string is the product name
Sometimes I get strings that don't have a supplier or a warning. In that case I want the split to contain an empty string wherever it doesn't find that split value.
EX:
result[0] = "blabla"
result[1] = ""
result[2] = "blabla"
That way I know that result 1 was split on the value (INGREDIENTS: ) and I can assign it to something
Using a regex may have performance concerns if you are using this in a high volume application. Below is one possible regex you could use. It is somewhat difficult to parse the product line and the "company name" line since it wasn't clear if the product code had a pattern and the company name line didn't have a ':' like the other fields, so the regex is somewhat "hacky" in those areas:
using System;
using System.Text.RegularExpressions;
using System.Linq;
namespace so20190113_01 {
class Program {
static void Main(string[] args) {
string text =
@"BFFPPB14 Dark Chocolate Dried Cherries 14 oz (397g)
INGREDIENTS: DARK CHOCOLATE (SUGAR, CHOCOLATE LIQUOR, COCOA BUTTER, ANHYDROUS MILK FAT, SOYA LECITHIN, VANILLIN [AN ARTIFICIAL FLAVOR]), DRIED TART CHERRIES (CHERRIES, SUGAR), GUM ARABIC, CONFECTIONER'S GLAZE.
CONTAINS: MILK, SOY
ALLERGEN INFORMATION: MAY CONTAIN TREE NUTS, PEANUTS, EGG AND WHEAT.
01/11/2019
Description: Sweetened dried Montmorency cherries that are panned with dark chocolate.
Storage Conditions: Store at ambient temperatures with a humidity less than 50%. Shelf Life: 9 months
Company Name
Item No.: 701804
Bulk: 415265
Supplier: Cherryland's Best
WARNING: CHERRIES MAY CONTAIN PITS";
string pat =
@"^\s*(?<product>\w+\s+\w+\s+\w*[^:]+)$
|^ingredients:\s*(?<ingredients>.*)$
|^contains:\s*(?<contains>.*)$
|^allergen\s+information:\s*(?<allergen>.*)$
|^(?<date>(\d{1,2}/\d{1,2}/\d{2,4}))$
|^description:\s*(?<description>.*)$
|^storage\sconditions:\s*(?<storage>.*)$
|^shelf\slife:\s*(?<shelf>.*)$
|^company\sname\s*(?<company>.*)$
|^item\sno\.:\s*(?<item>.*)$
|^bulk:\s*(?<bulk>.*)$
|^supplier:\s*(?<supplier>.*)$
|^warning:\s*(?<warning>.*)$
";
Regex r = new Regex(pat, RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace | RegexOptions.Multiline);
// Match the regular expression pattern against a text string.
Match m = r.Match(text); // you might want to use the overload that supports a timeout value
Console.WriteLine("Start---");
while (m.Success) {
foreach (Group g in m.Groups.Where(x => x.Success)) {
switch (g.Name) {
case "product":
Console.WriteLine($"Product({g.Success}): '{g.Value.Trim()}'");
break;
case "ingredients":
Console.WriteLine($"Ingredients({g.Success}): '{g.Value.Trim()}'");
break;
// etc.
}
}
m = m.NextMatch();
}
Console.WriteLine("End---");
}
}
}
I think a parser is the only way. Originally, I tried using this regex:
^([\w \.]+?):([\s\S]+?)(?=((^[\w \.]+?):))
The key component there is the look-ahead ?= which allows the string to match all text from label to label. However, it doesn't work on the final line item since it does not precede another label and I could not find a regex that stops matching at a pattern that may not exist. If that regex exists, you can do it all in one line of code:
KeyValuePair<string, string>[] kvs = null;
//one line of code if the look-ahead would also consider non-existent matches
kvs = Regex.Matches(text, @"^([\w \.]+?):([\s\S]+?)(?=((^[\w \.]+?):))", RegexOptions.Multiline)
.Cast<Match>()
.Select(x => new KeyValuePair<string, string>(x.Groups[1].Value, x.Groups[2].Value.Trim(' ', '\r', '\n', '\t')))
.ToArray();
This code does it well enough. Also, the document is not formatted consistently in that Company Name does not precede a colon. This is the only anchor pattern that will work since various lines are broken by new lines.
KeyValuePair<string, string>[] kvs = null;
//Otherwise, you have to write a parser
//get all start indexes of labels
var matches = Regex.Matches(text, @"^.+?:", RegexOptions.Multiline).Cast<Match>().ToArray();
kvs = new KeyValuePair<string, string>[matches.Length];
KeyValuePair<string, string> GetKeyValuePair(Match match1, int match1EndIndex)
{
//get the label
var label = text.Substring(match1.Index, match1.Value.Length - 1);
//get the desc and trim white space
var descStart = match1.Index + match1.Value.Length + 1;
var desc = text
.Substring(descStart, match1EndIndex - descStart)
.Trim(' ', '\r', '\n', '\t');
return new KeyValuePair<string, string>(label, desc);
}
for (int i = 0; i < matches.Length - 1; i++)
{
kvs[i] = GetKeyValuePair(matches[i], matches[i + 1].Index);
}
kvs[kvs.Length - 1] = GetKeyValuePair(matches[matches.Length - 1], text.Length);
foreach (var kv in kvs)
{
Console.WriteLine($"{kv.Key}: {kv.Value}");
}
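The look-ahead limitation described above (the final field has no following label) can usually be sidestepped by giving the look-ahead a second alternative, \z, which matches at the very end of the input. Below is a minimal sketch under that assumption. LabelParser and its Parse method are hypothetical names for this illustration; labels that never appear in the text are pre-filled with empty strings, which is what the question asks for.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class LabelParser
{
    // A label at the start of a line, a colon, then lazily everything up to
    // either the next "label:" at a line start or the end of the input (\z).
    private static readonly Regex Field = new Regex(
        @"^([\w \.]+?):([\s\S]*?)(?=^[\w \.]+?:|\z)",
        RegexOptions.Multiline);

    public static Dictionary<string, string> Parse(string text, params string[] expectedLabels)
    {
        var result = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);

        // Every expected label starts out empty, so missing fields stay visible.
        foreach (string label in expectedLabels)
            result[label] = string.Empty;

        // Overwrite with whatever the text actually contains.
        foreach (Match m in Field.Matches(text))
            result[m.Groups[1].Value.Trim()] = m.Groups[2].Value.Trim();

        return result;
    }
}
```

The same caveats as above still apply: lines without a colon, such as the product line, the date, and "Company Name", are not recognised as labels and end up inside the preceding field's value, and any continuation line that happens to begin with words followed by a colon would be treated as a new label.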
If your requirement is to find the lines that start with a specific word, you can:
use LINQ
use StartsWith
Code:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
namespace ConsoleApp12
{
class Program
{
public static void Main(string[] args)
{
// test string
var str = @"BFFPPB10 Dark Chocolate Macadamia Nuts 11 oz (312g)\r\nINGREDIENTS: DARK CHOCOLATE (SUGAR, CHOCOLATE, COCOA BUTTER, \r\nANHYDROUS MILK FAT, SOY LECITHIN, VANILLA), MACADAMIA NUTS, SEA SALT.\r\nCONTAINS: MACADAMIA NUTS, MILK, SOY.\r\nALLERGEN INFORMATION: MAY CONTAIN OTHER TREE NUTS, PEANUTS, EGG AND\r\nWHEAT.\r\n01/11/2019\r\nDescription: Dry roasted, salted macadamias covered in dark chocolate.\r\nStorage Conditions: Store at ambient temperatures with a humidity less than 50%. \r\nShelf Life: 12 months\r\nBlain's Farm & Fleet\r\nItem No.: 701772\r\nBulk: 421172\r\nSupplier: Devon's\r\n";
// Keys
const string KEY_INGREDIENTS = "INGREDIENTS:";
const string KEY_CONTAINS = "CONTAINS:";
const string KEY_ALLERGEN_INFORMATION = "ALLERGEN INFORMATION:";
const string KEY_DESCRPTION = "Description:";
const string KEY_STORAGE_CONDITION = "Storage Conditions:";
const string KEY_SHELFLIFE = "Shelf Life:";
const string KEY_ITEM_NO = "Item No.:";
const string KEY_BULK = "Bulk:";
const string KEY_SUPPLIER = "Supplier:";
const string KEY_WARNING = "WARNING:";
const string KEY_YEAR_Regex = @"^\d{1,2}/\d{1,2}/\d{4}$";
const string KEY_AFTER_COMPANY_NAME = KEY_ITEM_NO;
// Helpers
var keys = new string[]
{ KEY_INGREDIENTS, KEY_CONTAINS, KEY_ALLERGEN_INFORMATION, KEY_DESCRPTION, KEY_STORAGE_CONDITION,
KEY_SHELFLIFE, KEY_ITEM_NO, KEY_BULK, KEY_SUPPLIER, KEY_WARNING };
var lines = str.Split(new string[] { @"\r\n" }, StringSplitOptions.RemoveEmptyEntries);
void log(string key, string val)
{
Console.WriteLine($"{key} => {val}");
Console.WriteLine();
}
void removeLine(string line)
{
if (line != null) lines = lines.Where(w => w != line).ToArray();
}
// get Multi Line Item with key
string getMultiLine(string key)
{
var line = lines
.Select((linetxt, index) => new { linetxt, index })
.Where(w => w.linetxt.StartsWith(key))
.FirstOrDefault();
if (line == null) return string.Empty;
var result = line.linetxt;
for (int i = line.index + 1; i < lines.Length; i++)
{
if (!keys.Any(a => lines[i].StartsWith(a)))
result += lines[i];
else
break;
}
return result;
}
// get the single line before a specific key, provided that line is not itself a key
string getLinebefore(string the_after_key)
{
var the_after_line = lines
.Select((linetxt, index) => new { linetxt, index })
.Where(w => w.linetxt.StartsWith(the_after_key))
.FirstOrDefault();
if (the_after_line == null) return string.Empty;
var the_before_line_text = lines[the_after_line.index - 1];
//not a key
if (!keys.Any(a => the_before_line_text.StartsWith(a)))
return the_before_line_text;
else
return null;
}
// 1st get item without key
var itemName = lines.FirstOrDefault();
removeLine(itemName);
var year = lines.Where(w => Regex.Match(w, KEY_YEAR_Regex).Success).FirstOrDefault();
removeLine(year);
var companyName = getLinebefore(KEY_AFTER_COMPANY_NAME);
removeLine(companyName);
//2nd get item with Keys
var ingredients = getMultiLine(KEY_INGREDIENTS);
var contanins = getMultiLine(KEY_CONTAINS);
var allergenInfromation = getMultiLine(KEY_ALLERGEN_INFORMATION);
var description = getMultiLine(KEY_DESCRPTION);
var storageConditions = getMultiLine(KEY_STORAGE_CONDITION);
var shelfLife = getMultiLine(KEY_SHELFLIFE);
var itemNo = getMultiLine(KEY_ITEM_NO);
var bulk = getMultiLine(KEY_BULK);
var supplier = getMultiLine(KEY_SUPPLIER);
var warning = getMultiLine(KEY_WARNING);
// 3rd log
log("ItemName", itemName);
log("Ingredients", ingredients);
log("contanins", contanins);
log("Allergen Infromation", allergenInfromation);
log("Year", year);
log("Description", description);
log("Storage Conditions", storageConditions);
log("Shelf Life", shelfLife);
log("CompanyName", companyName);
log("Item No", itemNo);
log("Bulk", bulk);
log("Supplier", supplier);
log("warning", warning);
Console.ReadLine();
}
}
}
will output
ItemName => BFFPPB10 Dark Chocolate Macadamia Nuts 11 oz (312g)
Ingredients => INGREDIENTS: DARK CHOCOLATE (SUGAR, CHOCOLATE, COCOA
BUTTER, ANHYDROUS MILK FAT, SOY LECITHIN, VANILLA), MACADAMIA NUTS,
SEA SALT.
contanins => CONTAINS: MACADAMIA NUTS, MILK, SOY.
Allergen Infromation => ALLERGEN INFORMATION: MAY CONTAIN OTHER TREE
NUTS, PEANUTS, EGG ANDWHEAT.
Year => 01/11/2019
Description => Description: Dry roasted, salted macadamias covered in
dark chocolate.
Storage Conditions => Storage Conditions: Store at ambient
temperatures with a humidity less than 50%.
Shelf Life => Shelf Life: 12 months
CompanyName => Blain's Farm & Fleet
Item No => Item No.: 701772
Bulk => Bulk: 421172
Supplier => Supplier: Devon's
warning =>

How to remove pieces of data from string

I have a text file with multiple entries of this format:
Page: 1 of 1
Report Date: January 15 2018
Mr. Gerald M. Abridge ID #: 0000008 1 Route 81 Mr. Gerald Michael Abridge Pittaburgh PA 15668 SSN: XXX-XX-XXXX
Birthdate: 01/00/1998 Sex: M
COURSE Course Title CRD GRD GRDPT COURSE Course Title CRD GRD GRDPT
FALL 2017 (08/28/2017 to 12/14/2017) CS102F FUND. OF IT & COMPUTING 4.00 A 16.00 CS110 C++ PROGRAMMING I 3.00 A- 11.10 EL102 LANGUAGE AND RHETORIC 3.00 B+ 9.90 MA109 CALC WITH APPLICATIONS I 4.00 A 16.00 SP203 INTERMEDIATE SPANISH I 3.00 A 12.00
EHRS QHRS QPTS GPA Term 17.00 17.00 65.00 3.824 Cum 17.00 17.00 65.00 3.824
Current Program(s): Bachelor of Science in Computer Science
End of official record.
So far, I have read the text file into a string, full. I want to be able to remove the first two lines of each of the entries. How would I go about doing this?
Here's the code that I used to read it in:
using (StreamReader sr = new StreamReader(fileName, Encoding.Default))
{
string full = sr.ReadToEnd();
}
If all the lines you want to skip begin with the same strings, you can put those prefixes in a list and then, when you're reading the lines, skip any that begin with one of the prefixes.
This will leave you with a list of strings that represent all the file lines that don't begin with one of the specified prefixes:
var filePath = @"f:\public\temp\temp.txt";
var ignorePrefixes = new List<string> {"Page:", "Report Date:"};
var filteredContent = File.ReadAllLines(filePath)
.Where(line => ignorePrefixes.All(prefix => !line.StartsWith(prefix)))
.ToList();
If you want all the content as a single string, you can use String.Join:
var filteredAsString = string.Join(Environment.NewLine, filteredContent);
If Linq isn't your thing, or you don't understand what it's doing, here's the "old school" way of doing the same thing:
List<string> filtered = new List<string>();
foreach (string line in File.ReadLines(filePath))
{
bool okToAdd = true;
foreach (string prefix in ignorePrefixes)
{
if (line.StartsWith(prefix))
{
okToAdd = false;
break;
}
}
if (okToAdd)
{
filtered.Add(line);
}
}
public static IEnumerable<string> ReadReportFile(FileInfo file)
{
var line = String.Empty;
var page = "Page:";
var date = "Report Date:";
using(var reader = File.OpenText(file.FullName))
while((line = reader.ReadLine()) != null)
if (line.IndexOf(page) == -1 && line.IndexOf(date) == -1)
yield return line;
}
The code is pretty straightforward: for each line read, if it doesn't contain the page or date marker, return it. You could condense it or get fancier, for example by building a lookup for your prefixes, but if nothing more complex is needed, this should suffice.
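Since the question already has the whole file in a string (full), the same prefix filter can also be applied in memory instead of re-reading the file. This is a minimal sketch; ReportCleaner and RemoveLines are hypothetical names made up for this illustration.

```csharp
using System;
using System.Linq;

// Hypothetical helper for illustration only.
public static class ReportCleaner
{
    // Drops every line of 'full' that starts with one of the given prefixes
    // and joins the remaining lines back into a single string.
    public static string RemoveLines(string full, params string[] prefixes)
    {
        var kept = full
            .Split(new[] { "\r\n", "\n" }, StringSplitOptions.None)
            .Where(line => !prefixes.Any(
                p => line.TrimStart().StartsWith(p, StringComparison.Ordinal)));

        return string.Join(Environment.NewLine, kept);
    }
}
```

Usage would look like full = ReportCleaner.RemoveLines(full, "Page:", "Report Date:"). Note that this removes every line with those prefixes, which is only equivalent to "the first two lines of each entry" if those prefixes never occur elsewhere in an entry.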

Splitting article by sentences using delimiters

I have a small assignment where I have an article in a format that is like this
<REUTERS TOPICS="NO" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="5545" NEWID="2">
<TITLE>STANDARD OIL <SRD> TO FORM FINANCIAL UNIT</TITLE>
<DATELINE> CLEVELAND, Feb 26 - </DATELINE><BODY>Standard Oil Co and BP North America
Inc said they plan to form a venture to manage the money market
borrowing and investment activities of both companies.
BP North America is a subsidiary of British Petroleum Co
Plc <BP>, which also owns a 55 pct interest in Standard Oil.
The venture will be called BP/Standard Financial Trading
and will be operated by Standard Oil under the oversight of a
joint management committee.
Reuter
</BODY></TEXT>
</REUTERS>
and I am writing it to a new xml file with this format
<article id= some id >
<subject>articles subject </subject>
<sentence> sentence #1 </sentence>
.
.
.
<sentence> sentence #n </sentence>
</article>
I have written code that does all of this, and it works fine.
The problem is that I am splitting sentences using the delimiter '.', so if there is a number like 2.00, the code thinks that 2 is one sentence and 00 is a different sentence.
Does anyone have any idea on how to identify sentences better so it will keep the numbers and such in same sentence?
Without having to go over all of the array?
Is there a way I can have the string.Split() method ignore the split if there is a number before and after the delimiter?
My code:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.IO;
using System.Data;
using System.Xml;
namespace project
{
class Program
{
static void Main(string[] args)
{
string[] lines = System.IO.File.ReadAllLines(@"path");
string body = "";
REUTERS article = new REUTERS();
string sentences = "";
for (int i = 0; i<lines.Length;i++){
string line = lines[i];
// finding the first tag of the article
if (line.Contains("<REUTERS"))
{
//extracting the id from the tag
int Id = line.IndexOf("NEWID=\"") + "NEWID=\"".Length;
article.NEWID = line.Substring(Id, line.Length-2 - Id);
}
if (line.Contains("TITLE"))
{
string subject = line;
subject = subject.Replace("<TITLE>", "").Replace("</TITLE>", "");
article.TITLE = subject;
}
if( line.Contains("<BODY"))
{
int startLoc = line.IndexOf("<BODY>") + "<BODY>".Length;
sentences = line.Substring(startLoc, line.Length - startLoc);
while (!line.Contains("</BODY>"))
{
i++;
line = lines[i];
sentences = sentences +" " + line;
}
int endLoc = sentences.IndexOf("</BODY>");
sentences = sentences.Substring(0, endLoc);
char[] delim = {'.'};
string[] sentencesSplit = sentences.Split(delim);
using (System.IO.StreamWriter file =
new System.IO.StreamWriter(@"path",true))
{
file.WriteLine("<articles>");
file.WriteLine("\t <article id = " + article.NEWID + ">");
file.WriteLine("\t \t <subject>" + article.TITLE + "</subject>");
foreach (string sentence in sentencesSplit)
{
file.WriteLine("\t \t <sentence>" + sentence + "</sentence>");
}
file.WriteLine("\t </article>");
file.WriteLine("</articles>");
}
}
}
}
public class REUTERS
{
public string NEWID;
public string TITLE;
public string Body;
}
}
}
OK, so I found a solution using the ideas I received here.
I used the overload of Split that takes string separators, like this:
.Split(new string[] { ". " }, StringSplitOptions.None);
and it looks much better now.
You can also use a regular expression that looks for the sentence terminators with white space:
var pattern = @"(?<=[\.!\?])\s+";
var sentences = Regex.Split(input, pattern);
foreach (var sentence in sentences) {
//do something with the sentence
var node = string.Format("\t \t <sentence>{0}</sentence>", sentence);
file.WriteLine(node);
}
Note that this applies to the English language as there may be other rules for sentences in other languages.
The following example
using System;
using System.Text.RegularExpressions;
public class Program
{
public static void Main()
{
var input = #"Standard Oil Co and BP North America
Inc said they plan to form a venture to manage the money market
borrowing and investment activities of both companies.
BP North America is a subsidiary of British Petroleum Co
Plc <BP>, which also owns a 55 pct interest in Standard Oil.
The venture will be called BP/Standard Financial Trading
and will be operated by Standard Oil under the oversight of a
joint management committee.";
var pattern = @"(?<=[\.!\?])\s+";
var sentences = Regex.Split(input, pattern);
foreach (var sentence in sentences)
{
var innerText = sentence.Replace("\n", " ").Replace('\t', ' ');
//do something with the sentence
var node = string.Format("\t \t <sentence>{0}</sentence>", innerText);
Console.WriteLine(node);
}
}
}
Produces this output
<sentence>Standard Oil Co and BP North America Inc said they plan to form a venture to manage the money market borrowing and investment activities of both companies.</sentence>
<sentence>BP North America is a subsidiary of British Petroleum Co Plc <BP>, which also owns a 55 pct interest in Standard Oil.</sentence>
<sentence>The venture will be called BP/Standard Financial Trading and will be operated by Standard Oil under the oversight of a joint management committee.</sentence>
I would make a list of the index points of all the '.' characters.
For each index point, check the characters on either side; if both are digits, remove that index point from the list.
Then, when you are outputting, simply use Substring with the remaining index points to get each sentence individually.
Bad quality code follows (it's late):
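// IndexPoints is assumed to be a List<int> holding the position of every '.' in sentence.
List<int> indexesToRemove = new List<int>();
int count = 0;
foreach (int indexPoint in IndexPoints)
{
// A '.' with a digit on both sides is a decimal point, not a sentence end.
bool digitBefore = indexPoint > 0 && char.IsDigit(sentence[indexPoint - 1]);
bool digitAfter = indexPoint + 1 < sentence.Length && char.IsDigit(sentence[indexPoint + 1]);
if (digitBefore && digitAfter)
indexesToRemove.Add(count);
count++;
}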
The next line is so that we do not have to alter the removal number as we traverse the list in the last step.
indexesToRemove = indexesToRemove.OrderByDescending(i => i).ToList();
Now we simply remove all the locations of the '.'s that have numbers on either side.
foreach(int indexPoint in indexesToRemove)
{
IndexPoints.RemoveAt(indexPoint);
}
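Now, when you write the sentences out in the new file format, you just loop over the remaining index points and take sentence.Substring(lastIndexPoint + 1, currentIndexPoint - lastIndexPoint) for each one, starting with lastIndexPoint = -1.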
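The steps above can be folded into a single pass, which avoids building and pruning the index list. This is a minimal sketch of that idea rather than the answer's exact code; SentenceSplitter and Split are hypothetical names made up for this illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only.
public static class SentenceSplitter
{
    // Splits on '.' unless the period sits between two digits (e.g. 2.00).
    public static List<string> Split(string text)
    {
        var sentences = new List<string>();
        int start = 0;

        for (int i = 0; i < text.Length; i++)
        {
            if (text[i] != '.') continue;

            bool digitBefore = i > 0 && char.IsDigit(text[i - 1]);
            bool digitAfter = i + 1 < text.Length && char.IsDigit(text[i + 1]);
            if (digitBefore && digitAfter) continue; // decimal point, keep going

            // Keep the period with its sentence and drop surrounding whitespace.
            string sentence = text.Substring(start, i - start + 1).Trim();
            if (sentence.Length > 0) sentences.Add(sentence);
            start = i + 1;
        }

        // Whatever follows the last period (if anything) is the final sentence.
        string tail = text.Substring(start).Trim();
        if (tail.Length > 0) sentences.Add(tail);
        return sentences;
    }
}
```

Abbreviations such as "Co." or "Inc." would still be split here; handling them needs a list of known abbreviations or a proper sentence tokenizer.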
Spent much time on this - thought you might like to see it as it really doesn't use any awkward code whatsoever - it is producing output 99% similar to yours.
<articles>
<article id="2">
<subject>STANDARD OIL <SRD> TO FORM FINANCIAL UNIT</subject>
<sentence>Standard Oil Co and BP North America</sentence>
<sentence>Inc said they plan to form a venture to manage the money market</sentence>
<sentence>borrowing and investment activities of both companies.</sentence>
<sentence>BP North America is a subsidiary of British Petroleum Co</sentence>
<sentence>Plc <BP>, which also owns a 55.0 pct interest in Standard Oil.</sentence>
<sentence>The venture will be called BP/Standard Financial Trading</sentence>
<sentence>and will be operated by Standard Oil under the oversight of a</sentence>
<sentence>joint management committee.</sentence>
</article>
</articles>
The console app is as follows:
using System.Xml;
using System.IO;
namespace ReutersXML
{
class Program
{
static void Main()
{
XmlDocument xmlDoc = new XmlDocument();
xmlDoc.Load("reuters.xml");
var reuters = xmlDoc.GetElementsByTagName("REUTERS");
var article = reuters[0].Attributes.GetNamedItem("NEWID").Value;
var subject = xmlDoc.GetElementsByTagName("TITLE")[0].InnerText;
var body = xmlDoc.GetElementsByTagName("BODY")[0].InnerText;
string[] sentences = body.Split(new string[] { System.Environment.NewLine },
System.StringSplitOptions.RemoveEmptyEntries);
using (FileStream fileStream = new FileStream("reuters_new.xml", FileMode.Create))
using (StreamWriter sw = new StreamWriter(fileStream))
using (XmlTextWriter xmlWriter = new XmlTextWriter(sw))
{
xmlWriter.Formatting = Formatting.Indented;
xmlWriter.Indentation = 4;
xmlWriter.WriteStartElement("articles");
xmlWriter.WriteStartElement("article");
xmlWriter.WriteAttributeString("id", article);
xmlWriter.WriteElementString("subject", subject);
foreach (var s in sentences)
if (s.Length > 10)
xmlWriter.WriteElementString("sentence", s);
xmlWriter.WriteEndElement();
}
}
}
}
I hope you like it :)

Regex masking of words that contain a digit

Trying to come up with a 'simple' regex to mask bits of text that look like they might contain account numbers.
In plain English:
any word containing a digit (or a train of such words) should be matched
leave the last 4 digits intact
replace all previous part of the matched string with four X's (xxxx)
So far
I'm using the following:
[\-0-9 ]+(?<m1>[\-0-9]{4})
replacing with
xxxx${m1}
But this misses on the last few samples below
sample data:
123456789
a123b456
a1234b5678
a1234 b5678
111 22 3333
this is a a1234 b5678 test string
Actual results
xxxx6789
a123b456
a1234b5678
a1234 b5678
xxxx3333
this is a a1234 b5678 test string
Expected results
xxxx6789
xxxxb456
xxxx5678
xxxx5678
xxxx3333
this is a xxxx5678 test string
Is such an arrangement possible with a regex replace?
I think I'm going to need some greediness and lookahead functionality, but I have zero experience in those areas.
This works for your example:
var result = Regex.Replace(
input,
@"(?<!\b\w*\d\w*)(?<m1>\s?\b\w*\d\w*)+",
m => "xxxx" + m.Value.Substring(Math.Max(0, m.Value.Length - 4)));
If you have a value like 111 2233 33, it will print xxxx3 33. If you want this to be free from spaces, you could turn the lambda into a multi-line statement that removes whitespace from the value.
To explain the regex pattern a bit, it's got a negative lookbehind, so it makes sure that the word behind it does not have a digit in it (with optional word characters around the digit). Then it's got the m1 portion, which looks for words with digits in them. The last four characters of this are grabbed via some C# code after the regex pattern resolves the rest.
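The multi-line lambda mentioned above might look like the following sketch. It uses the same pattern, strips whitespace from the matched run before taking the last four characters, and puts back a leading whitespace character when the match started with one, so the masked value does not fuse with the word before it. AccountMasker and Mask are hypothetical names made up for this illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class AccountMasker
{
    // Same pattern as in the answer: a run of words that each contain a digit.
    private static readonly Regex DigitWords =
        new Regex(@"(?<!\b\w*\d\w*)(?<m1>\s?\b\w*\d\w*)+");

    public static string Mask(string input)
    {
        return DigitWords.Replace(input, m =>
        {
            // The match may have swallowed the whitespace before the first word.
            string lead = char.IsWhiteSpace(m.Value[0]) ? m.Value.Substring(0, 1) : "";

            // Remove all whitespace so the last four characters come from the data itself.
            string compact = new string(m.Value.Where(c => !char.IsWhiteSpace(c)).ToArray());
            string tail = compact.Substring(Math.Max(0, compact.Length - 4));

            return lead + "xxxx" + tail;
        });
    }
}
```

With this variant a value such as 111 2233 33 keeps its last four digits without the embedded space.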
I don't think that regex is the best way to solve this problem, which is why I am posting this answer. For situations this complex, building the right regex is difficult and, worse, the result is much less clear and adaptable than a longer, plain-code approach.
The code below these lines delivers the exact functionality you are after, it is clear enough and can be easily extended.
string input = "this is a a1234 b5678 test string";
string output = "";
string[] temp = input.Trim().Split(' ');
bool previousNum = false;
string tempOutput = "";
foreach (string word in temp)
{
if (word.ToCharArray().Where(x => char.IsDigit(x)).Count() > 0)
{
previousNum = true;
tempOutput = tempOutput + word;
}
else
{
if (previousNum)
{
if (tempOutput.Length >= 4) tempOutput = "xxxx" + tempOutput.Substring(tempOutput.Length - 4, 4);
output = output + " " + tempOutput;
previousNum = false;
tempOutput = ""; // reset so a later run of digit words starts fresh
}
output = output + " " + word;
}
}
if (previousNum)
{
if (tempOutput.Length >= 4) tempOutput = "xxxx" + tempOutput.Substring(tempOutput.Length - 4, 4);
output = output + " " + tempOutput;
previousNum = false;
}
Have you tried this:
.*(?<m1>[\d]{4})(?<m2>.*)
with replacement
xxxx${m1}${m2}
This produces
xxxx6789
xxxx5678
xxxx5678
xxxx3333
xxxx5678 test string
You are not going to get 'a123b456' to match ... until 'b' becomes a number. ;-)
Here is my really quick attempt:
(\s|^)([a-z]*\d+[a-z,0-9]+\s)+
This will select all of those test cases. Now as for C# code, you'll need to check each match to see if there is a space at the beginning or end of the match sequence (e.g., the last example will have the space before and after selected)
here is the C# code to do the replace:
var redacted = Regex.Replace(record, @"(\s|^)([a-z]*\d+[a-z,0-9]+\s)+",
match => "xxxx" /*new String("x",match.Value.Length - 4)*/ +
match.Value.Substring(Math.Max(0, match.Value.Length - 4)));

Need multiple regular expression matches using C#

So I have this list of flight data and I need to be able to parse through it using regular expressions (this isn't the entire list).
1 AA2401 F7 A4 Y7 B7 M7 H7 K7 /DFW A LAX 4 0715 0836 E0.M80 9 3:21
2 AA2421 F7 A1 Y7 B7 M7 H7 K7 DFWLAX 4 1106 1215 E0.777 7 3:09
3UA:US6352 B9 M9 H9 K0 /DFW 1 LAX 1200 1448 E0.733 1:48
For example, I might need from the first line 1, AA, 2401, and so on and so on. Now, I'm not asking for someone to come up with a regular expression for me because for the most part I'm getting to where I can pretty much handle that myself. My issue has more to do with being able to store the data some where and access it.
So I'm just trying to initially just "match" the first piece of data I need, which is the line number '1'. My "pattern" for just getting the first number is: ".?(\d{1,2}).*" . The reason it's {1,2} is because obviously once you get past 10 it needs to be able to take 2 numbers. The rest of the line is set up so that it will definitely be a space or a letter.
Here's the code:
var assembly = Assembly.GetExecutingAssembly();
var textStreamReader = new StreamReader(
assembly.GetManifestResourceStream("FlightParser.flightdata.txt"));
List<string> lines = new List<string>();
do
{
lines.Add(textStreamReader.ReadLine());
} while (!textStreamReader.EndOfStream);
Regex sPattern = new Regex(@".?(\d{1,2}).*");//whatever the pattern is
foreach (string line in lines)
{
System.Console.Write("{0,24}", line);
MatchCollection mc = sPattern.Matches(line);
if ( sPattern.IsMatch(line))
{
System.Console.WriteLine(" (match for '{0}' found)", sPattern);
}
else
{
System.Console.WriteLine();
}
System.Console.WriteLine(mc[0].Groups[0].Captures);
System.Console.WriteLine(line);
}//end foreach
System.Console.ReadLine();
With the code I'm writing, I'm basically just trying to get '1' into the match collection and somehow access it and write it to the console (for the sake of testing, that's not the ultimate goal).
Your regex pattern includes an asterisk which matches any number of characters - ie. the whole line. Remove the "*" and it will only match the "1". You may find an online RegEx tester such as this useful.
Assuming your file is not actually formatted as you posted and has each of the fields separated by something, you can match the first two-digit number of the line with this regex (ignoring 0 and leading zeros):
^\s*([1-9]\d?)
Since it is grouped, you can access the matched part through the Groups property of the Match object.
var line = "12 foobar blah 123 etc";
var re = new Regex(@"^\s*([1-9]\d?)");
var match = re.Match(line);
if (match.Success)
{
Console.WriteLine(match.Groups[1].Value); // "12"
}
else
{
Console.WriteLine("No match");
}
The following expression matches the first digit, that you wanted to capture, in the group "First".
^\s*(?<First>\d{1})
I find this regular expression tool highly useful when dealing with regex. Give it a try.
Also set RegexOptions.Multiline if you match against the whole text rather than line by line, so that ^ anchors at the start of every line.
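For the broader goal in the question (pulling several pieces out of each line and storing them somewhere accessible), named groups plus a small class are one option. This is a minimal sketch that only covers the start of each line. FlightLine and FlightLineParser are hypothetical names, and the pattern assumes a one- or two-digit line number, an optional two-letter operator prefix followed by a colon (as in 3UA:US6352), a two-letter airline code and up to four flight-number digits, which is inferred from the three sample lines only.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical container for the parsed fields.
public sealed class FlightLine
{
    public int LineNumber { get; set; }
    public string Airline { get; set; }
    public string FlightNumber { get; set; }
}

// Hypothetical helper for illustration only.
public static class FlightLineParser
{
    // line number, optional "XX:" operator prefix, airline code, flight number
    private static readonly Regex Head = new Regex(
        @"^\s*(?<line>\d{1,2})\s*(?:[A-Z]{2}:)?(?<airline>[A-Z]{2})(?<flight>\d{1,4})");

    // Returns null when the line does not start with the expected fields.
    public static FlightLine TryParse(string line)
    {
        Match m = Head.Match(line);
        if (!m.Success) return null;

        return new FlightLine
        {
            LineNumber = int.Parse(m.Groups["line"].Value),
            Airline = m.Groups["airline"].Value,
            FlightNumber = m.Groups["flight"].Value
        };
    }
}
```

Each parsed FlightLine can then go into a List<FlightLine>, which makes the values available by name instead of by index into a MatchCollection.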
