Need help formulating regular expression to parse data - c#

I have text I want to extract from a larger block of text. I have the regular expression mostly worked out, but it matches either too little (it skips a section) or too much (it reads part of the next section). The input is text that I extracted from a bank statement. I have tried reading up on regular expressions, but I still have no clue what to do.
Here's a sample so you can see what I'm trying to do.
_4XXXXXXXXXXXXXX9_
_SOU THE HOME DEPOT 431 POMPANO BEACH * FL
AUT 020112 DDA PURCHASE_
_2/1_DEBIT POS_3.15_
The underscores mark the boundaries of the parts I want to extract: everything except the DEBIT POS.
The regex I'm using is:
\A
(?<SerialNumber>\b[0-9]{13,16}\b)
(?<Description>.) 'PROBLEM HERE'
(?<PostingDate>
(?:1[0-2]|[1-9])/(?:3[01]|[12][0-9]|[1-9]))
(?<Amount>[,0-9]+\.[0-9]{2})
\Z
I can't give the Description a fixed length because I don't know the maximum length of the text portion. I also don't know whether the description is two lines or just one. That's mainly what's confusing me.

I imagine you want to join every four lines together as one line first:
var file = @"C:\temp.txt";
var lines = System.IO.File.ReadAllLines(file);
var buffer = new List<String>();
for (var i = 0; i < lines.Length; i++ )
{
if (i % 4 == 0) { buffer.Add(""); }
buffer[buffer.Count - 1] += lines[i] + " ";
}
buffer.ForEach(b => Console.WriteLine(b));
Then you can actually parse each entry in buffer as if it's one line. This can be done easily using either regex or just string Substrings. Far easier than trying to do it across lines.
The above code isn't the cleanest, but it works.

Looks like another case where the simple answer is: don't use Regex. If each of these is on its own line, it wouldn't be hard to call File.ReadAllLines() and parse the file line by line.
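// Requires: using System; using System.Collections.Generic;
// using System.Globalization; using System.IO; using System.Linq;
public class Order
{
public string SerialNumber { get; set; }
public string Description { get; set; }
public DateTime PostingDate { get; set; }
public Decimal Amount { get; set; }
public void SetSerialNumberFromRaw(string serialNumber)
{
// Convert to required type, etc. Here the raw line is only trimmed.
this.SerialNumber = serialNumber.Trim();
}
public void SetPostingDateAndAmount(string line)
{
// Assumption: the line looks like "2/1 DEBIT POS 3.15" (date first, amount last).
string[] parts = line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
this.PostingDate = DateTime.ParseExact(parts[0], "M/d", CultureInfo.InvariantCulture);
this.Amount = decimal.Parse(parts[parts.Length - 1], CultureInfo.InvariantCulture);
}
public void SetAppendDescription(string line)
{
// The description may span several lines, so append instead of overwriting.
this.Description = string.IsNullOrEmpty(this.Description) ? line.Trim() : this.Description + " " + line.Trim();
}
}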
List<string> lines = File.ReadAllLines("<filename>").ToList();
List<Order> orders = new List<Order>();
Order currentOrder = null;
foreach (string line in lines)
{
if (currentOrder == null)
{
currentOrder = new Order();
orders.Add(currentOrder);
currentOrder.SetSerialNumberFromRaw(line);
}
else
{
if (line.IndexOf("DEBIT POS", StringComparison.OrdinalIgnoreCase) >= 0)
{
currentOrder.SetPostingDateAndAmount(line);
currentOrder = null;
}
else
{
currentOrder.SetAppendDescription(line);
}
}
}
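If you do want to stay with a regex, the usual fix for a field of unknown length is a lazy quantifier: let Description match as little as possible and let the pattern for the next field decide where it stops. The following is only a sketch under assumptions that are not in the question: the masked serial number is really 13 to 16 digits, the fields are separated by whitespace or line breaks, and every entry ends with "DEBIT POS" followed by the amount. StatementParser and ParseEntry are made-up names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StatementParser
{
    // Singleline lets '.' match line breaks, so the description can span
    // one line or several. '.+?' is lazy: it grows one character at a time
    // until the posting date, "DEBIT POS" and the amount match right after it.
    private static readonly Regex EntryPattern = new Regex(
        @"\A\s*(?<SerialNumber>[0-9]{13,16})\s+" +
        @"(?<Description>.+?)\s+" +
        @"(?<PostingDate>(?:1[0-2]|[1-9])/(?:3[01]|[12][0-9]|[1-9]))\s+" +
        @"DEBIT\s+POS\s+" +
        @"(?<Amount>[,0-9]+\.[0-9]{2})\s*\z",
        RegexOptions.Singleline);

    // Returns the Match for one statement entry; check Success before
    // reading the named groups.
    public static Match ParseEntry(string entry)
    {
        return EntryPattern.Match(entry);
    }
}
```

Calling StatementParser.ParseEntry on one entry and reading m.Groups["Description"].Value gives the whole description, however many lines it covers, because the lazy group cannot stop before something that looks like the posting date followed by DEBIT POS.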


How to split up and divide long string

I have a string of data that I would like to split up, for example my one string contains multiple characters, their stats and abilities they each have.
Full String:
"Andy,135,Punch-Kick-Bite-Headbutt|Tom,120,Bite-Slap-Dodge-Heal|Nathan,105,Bite-Scratch-Tackle-Kick"
So the above string has the characters separated by "|" and the abilities separated by "-".
I managed to divide them up by character, so that "Andy,135,Punch-Kick-Bite-Headbutt" ends up in one index of an array, by doing this:
string myString = "Andy,135,Punch-Kick-Bite-Headbutt|Tom,120,Bite-Slap-Dodge-Heal|Nathan,105,Bite-Scratch-Tackle-Kick";
string[] character = myString.ToString().Split('|');
for (int i = 0; i < character.Length; i++)
{
Debug.Log("Character data: " + character[i].ToString());
}
Now how would I take something like "Andy,135,Punch-Kick-Bite-Headbutt" and retrieve only the stats into a stat array, so it is "Andy,135", and pull the abilities into a string array, so it is "Punch-Kick-Bite-Headbutt"?
So I would have my statArray as "Andy,135" and abilityArray as "Punch-Kick-Bite-Headbutt".
Well, I would strongly recommend defining a class to store that data:
public class Character
{
public string Name { get; set; }
public int Stat { get; set; }
public string[] Abilities { get; set; }
}
Then I would write the following LINQ:
// First split by pipe character to get each character (person)
// in raw format separately
var characters = longString.Split('|')
// Another step is to separate each property of a character,
// so it can be used in next Select method.
// Here we split by comma
.Select(rawCharacter => rawCharacter.Split(','))
// Finally we use splitted raw data and upon this, we create
// concrete object with little help of casting to int and
// assign abilities by splitting abilities list by hyphen -
.Select(rawCharacter => new Character()
{
Name = rawCharacter[0],
Stat = int.Parse(rawCharacter[1]),
Abilities = rawCharacter[2].Split('-'),
})
.ToArray();

How to fill object from contents of a string and populate a List?

I have a string that I have sent through a HTTP Web Request compressing with GZIP with the following data:
[Route("Test")]
public IActionResult Test()
{
var data = "[0].meetingDate=2019-07-12&[0].courseId=12&[0].raceNumber=1&[0].horseCode=000000331213&[1].meetingDate=2019-07-12&[1].courseId=12&[1].raceNumber=1&[1].horseCode=000000356650";
try
{
var req = WebRequest.Create("https://localhost:44374/HorseRacingApi/Prices/GetPriceForEntries");
req.Method = "POST";
req.Headers.Add(HttpRequestHeader.AcceptEncoding, "gzip, deflate");
req.Headers.Add(HttpRequestHeader.ContentEncoding, "gzip");
if (!string.IsNullOrEmpty(data))
{
var dataBytes = Encoding.ASCII.GetBytes(data);
using (var requestDS = req.GetRequestStream())
{
using (var zipStream = new GZipStream(requestDS, CompressionMode.Compress))
{
zipStream.Write(dataBytes, 0, dataBytes.Length);
}
requestDS.Flush();
}
}
HttpWebResponse response = (HttpWebResponse)req.GetResponse();
Stream receiveStream = response.GetResponseStream();
StreamReader readStream = new StreamReader(receiveStream, Encoding.UTF8);
Debug.WriteLine("Response stream received.");
Debug.WriteLine(readStream.ReadToEnd());
response.Close();
readStream.Close();
return Ok("Sent!");
}
catch(Exception ex)
{
throw ex;
}
}
I am receiving the http data in this function and decompressing it:
[HttpPost]
[Route("GetPriceForEntries")]
[DisableRequestSizeLimit]
public JsonResult GetPriceForEntries(bool? ShowAll)
{
string contents = null;
using (GZipStream zip = new GZipStream(Request.Body, CompressionMode.Decompress))
{
using (StreamReader unzip = new StreamReader(zip))
{
contents = unzip.ReadToEnd();
}
}
//CONVERT CONTENTS TO LIST HERE?
return Json("GOT");
}
I have an object/model set up:
public class JsonEntryKey
{
public DateTime meetingDate { get; set; }
public int courseId { get; set; }
public int raceNumber { get; set; }
public string horseCode { get; set; }
}
How do I convert this 'string' to the List object above?
The reason I am sending this data by compressing is because sometimes the data will be very big.
Cheers
EDIT: Here is my attempt at creating my own 'Converter'
//Convert string to table.
string[] unzipString = contents.Split('=','&');
List<Core.Models.JsonEntryKey> entries = new List<Core.Models.JsonEntryKey>();
for (int i = 1; i < unzipString.Length; i += 8)
{
DateTime meetingDate = Convert.ToDateTime(unzipString[i]);
int courseId = int.Parse(unzipString[i + 2]);
int raceNumber = int.Parse(unzipString[i + 4]);
string horseCode = unzipString[i + 6];
entries.Add(new Core.Models.JsonEntryKey
{
meetingDate = meetingDate,
courseId = courseId,
raceNumber = raceNumber,
horseCode = horseCode
});
}
Is there a better way?
The basic parsing can be done in 3 steps.
1) Split the entire string by '&'
string [] parts = data.Split('&');
You end up with the single parts:
[0].meetingDate=2019-07-12
[0].courseId=12
[0].raceNumber=1
[0].horseCode=000000331213
[1].meetingDate=2019-07-12
[1].courseId=12
[1].raceNumber=1
[1].horseCode=000000356650
2) Now you can GroupBy the number in the square brackets, since it seems to denote the index of the object: [0], [1], .... Split by the '.' and take the first element:
var items = parts.GroupBy(x => x.Split('.').First());
3) Now, for each group (which is basically a collection of property information about one object), you need to iterate through the properties, find the corresponding property via reflection and set the value. In the end, don't forget to collect your newly created objects into a collection:
List<JsonEntryKey> collection = new List<JsonEntryKey>();
foreach (var item in items)
{
var entry = new JsonEntryKey();
foreach (var property in item)
{
// here the position propInfo[1] has the property name and propInfo[2] has the value
string [] propInfo = property.Split(new string[] {"].", "="}, StringSplitOptions.RemoveEmptyEntries);
// extract here the corresponding property information
PropertyInfo info = typeof(JsonEntryKey).GetProperties().Single(x => x.Name == propInfo[1]);
info.SetValue(entry, Convert.ChangeType(propInfo[2], info.PropertyType));
}
collection.Add(entry);
}
An alternative solution that I wanted to share is a Regex-based one. The regular expression that I built for this string works once an & character is appended to the end of the string; it then picks out every value that sits between an = and the next &. This is just an example of how you can use regular expressions for this kind of string handling. Regarding performance, the official documentation says:
The regular expression engine in .NET is a powerful, full-featured tool that processes text based on pattern matches rather than on comparing and matching literal text. In most cases, it performs pattern matching rapidly and efficiently. However, in some cases, the regular expression engine can appear to be very slow. In extreme cases, it can even appear to stop responding as it processes a relatively small input over the course of hours or even days.
The performance of a regular expression is based on the length of the string and the complexity of the regular expression. Regarding your string data, I have prepared a DEMO here.
The code looks like:
using System;
using System.Text.RegularExpressions;
public class Program
{
public static void Main()
{
var data = "[0].meetingDate=2019-07-12&[0].courseId=12&[0].raceNumber=1&[0].horseCode=000000331213&[1].meetingDate=2019-07-12&[1].courseId=12&[1].raceNumber=1&[1].horseCode=000000356650";
var dataRegex=data+"&";
//Console.WriteLine(dataRegex);
showMatch(dataRegex, @"(?<==)(.*?)(?=&)");
}
private static void showMatch(string text, string expr) {
MatchCollection mc = Regex.Matches(text, expr);
foreach (Match m in mc) {
Console.WriteLine(m);
}
}
}
And the output is:
2019-07-12
12
1
000000331213
2019-07-12
12
1
000000356650
Regular expression used: (?<==)(.*?)(?=&)
Explanation:
Positive Lookbehind (?<==): Matches the character = literally (case sensitive)
1st Capturing Group (.*?): .*? matches any character (except for line terminators). *? Quantifier — Matches between zero and unlimited times, as few times as possible, expanding as needed.
Positive Lookahead (?=&): Matches the character & literally (case sensitive)
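The regex above returns only the values, so the caller still has to know which value belongs to which property. A variation is to capture the index, the property name and the value together with named groups and build the objects directly. This is a rough sketch, not the code from either answer: EntryKeySketch stands in for the JsonEntryKey model, EntryKeyParser is a made-up name, and it assumes the values are not URL-encoded and the dates are always yyyy-MM-dd.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical stand-in for the JsonEntryKey model from the question.
public class EntryKeySketch
{
    public DateTime meetingDate { get; set; }
    public int courseId { get; set; }
    public int raceNumber { get; set; }
    public string horseCode { get; set; }
}

public static class EntryKeyParser
{
    // Matches "[0].courseId=12": index in brackets, property name, then
    // everything up to the next '&' (or the end of the string) as the value.
    private static readonly Regex PairPattern =
        new Regex(@"\[(?<index>\d+)\]\.(?<name>\w+)=(?<value>[^&]*)");

    public static List<EntryKeySketch> Parse(string contents)
    {
        // Keyed by the bracketed index so pairs may arrive in any order.
        var byIndex = new SortedDictionary<int, EntryKeySketch>();
        foreach (Match m in PairPattern.Matches(contents))
        {
            int index = int.Parse(m.Groups["index"].Value, CultureInfo.InvariantCulture);
            EntryKeySketch entry;
            if (!byIndex.TryGetValue(index, out entry))
            {
                entry = new EntryKeySketch();
                byIndex[index] = entry;
            }
            string value = m.Groups["value"].Value;
            switch (m.Groups["name"].Value)
            {
                case "meetingDate":
                    entry.meetingDate = DateTime.ParseExact(value, "yyyy-MM-dd", CultureInfo.InvariantCulture);
                    break;
                case "courseId":
                    entry.courseId = int.Parse(value, CultureInfo.InvariantCulture);
                    break;
                case "raceNumber":
                    entry.raceNumber = int.Parse(value, CultureInfo.InvariantCulture);
                    break;
                case "horseCode":
                    entry.horseCode = value;
                    break;
            }
        }
        return new List<EntryKeySketch>(byIndex.Values);
    }
}
```

Because the value group stops at the next & or at the end of the input, no trailing & needs to be appended. The explicit switch trades the flexibility of reflection for compile-time checking of the property names.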

Converting Arabic Words to Unicode format in C#

I am designing an API where the API user needs Arabic text to be returned in Unicode format, to do so I tried the following:
public static class StringExtensions
{
public static string ToUnicodeString(this string str)
{
StringBuilder sb = new StringBuilder();
foreach (var c in str)
{
sb.Append("\\u" + ((int)c).ToString("X4"));
}
return sb.ToString();
}
}
The issue with the above code is that it returns the Unicode value of each letter regardless of its position in the word.
Example: let us assume we have the following word:
"سمير" which consists of:
'س' which is written like 'سـ' because it is the first letter in word.
'م' which is written like 'ـمـ' because it is in the middle of word.
'ي' which is written like 'ـيـ' because it is in the middle of word.
'ر' which is written like 'ـر' because it is last letter of word.
The above code returns unicode of { 'س', 'م' , 'ي' , 'ر'} which is:
\u0633\u0645\u064A\u0631
instead of { 'سـ' , 'ـمـ' , 'ـيـ' , 'ـر'} which is
\uFEB3\uFEE4\uFEF4\uFEAE
Any ideas on how to update code to get correct Unicode?
The string is just a sequence of Unicode code points; it does not know the rules of Arabic. You're getting out exactly the data you put in; if you want different data out, then put different data in!
Try this:
Console.WriteLine("\u0633\u0645\u064A\u0631");
Console.WriteLine("\u0633\u0645\u064A\u0631".ToUnicodeString());
Console.WriteLine("\uFEB3\uFEE4\uFEF4\uFEAE");
Console.WriteLine("\uFEB3\uFEE4\uFEF4\uFEAE".ToUnicodeString());
As expected the output is
سمير
\u0633\u0645\u064A\u0631
ﺳﻤﻴﺮ
\uFEB3\uFEE4\uFEF4\uFEAE
Those two sequences of Unicode code points render the same in the browser, but they're different sequences. If you want to write out the second sequence, then don't pass in the first sequence.
Based on Eric's answer I worked out how to solve my problem, and I have created a solution on GitHub.
You will find a simple tool to run on Windows, and if you want to use the code in your projects, just copy and paste UnicodesTable.cs and Unshaper.cs.
Basically you need a table of the Unicode code points for each Arabic letter's positional forms; then you can use something like the following extension method.
public static string GetUnShapedUnicode(this string original)
{
original = Regex.Unescape(original.Trim());
var words = original.Split(' ');
StringBuilder builder = new StringBuilder();
var unicodesTable = UnicodesTable.GetArabicGliphes();
foreach (var word in words)
{
string previous = null;
for (int i = 0; i < word.Length; i++)
{
string shapedUnicode = @"\u" + ((int)word[i]).ToString("X4");
if (!unicodesTable.ContainsKey(shapedUnicode))
{
builder.Append(shapedUnicode);
previous = null;
continue;
}
else
{
if (i == 0 || previous == null)
{
builder.Append(unicodesTable[shapedUnicode][1]);
}
else
{
if (i == word.Length - 1)
{
if (!string.IsNullOrEmpty(previous) && unicodesTable[previous][4] == "2")
{
builder.Append(unicodesTable[shapedUnicode][0]);
}
else
builder.Append(unicodesTable[shapedUnicode][3]);
}
else
{
bool previouChar = unicodesTable[previous][4] == "2";
if (previouChar)
builder.Append(unicodesTable[shapedUnicode][1]);
else
builder.Append(unicodesTable[shapedUnicode][2]);
}
}
}
previous = shapedUnicode;
}
if (words.ToList().IndexOf(word) != words.Length - 1)
builder.Append(@"\u" + ((int)' ').ToString("X4"));
}
return builder.ToString();
}

Populating empty elements of a column in a list

Please refer to this sample data:
# |IDNum |Date |data |SomeDate |TranCode
1|888888| 12/16/10|aaaaa| |a10
2|888888| 12/16/10|bbbbb| 11/16/15|a8
3|888888| 12/16/10|ccccc| |a11
4|888888| 11/16/10|aaaaa| |a6
5|888888| 11/16/10|bbbbb| |a5
6|888888| 11/16/10|ccccc| 10/16/15|a9
7|888888| 11/16/10|aaaaa| |a11
8|888888| 11/15/10|bbbbb| |a3
9|888888| 10/16/10|ccccc| |a6
10|888888| 10/16/10|aaaaa| |a5
11|888888| 10/16/10|bbbbb| 09/16/15|a9
12|888888| 10/16/10|ccccc| |a11
13|888888| 09/16/10|aaaaa| |a6
14|888888| 09/16/10|bbbbb| 08/16/15|a5
15|888888| 09/16/10|ccccc| |a9
16|111111| 03/02/15|aaaaa| |a9
17|111111| 02/27/15|bbbbb| 12/01/15|a6
18|111111| 02/10/15|ccccc| |a1
19|111111| 02/01/15|aaaaa| |a10
20|111111| 02/01/15|bbbbb| 11/01/15|a9
21|111111| 01/05/15|ccccc| |a10
22|111111| 01/05/15|aaaaa| 10/01/15|a9
23|111111| 12/31/14|bbbbb| |a12
24|111111| 12/30/14|ccccc| |a2
25|111111| 12/01/14|aaaaa| |a6
26|111111| 12/01/14|bbbbb| 10/01/15|a10
I have the above data stored as a list delimited by pipes and sorted by Date descending. I would need the "SomeDate" field to populate using the last date available in the row for that particular IDNumber.
So for example:
Row 1 should show a date of 11/16/15.
Row 3:5 should show a date of 10/16/15.
Row 7:10 should show a date of 09/16/15
Row 15 should show no date since there is no preceding date for that IDNum.
Row 16 should show a date of 12/01/15
Any logic recommendations would be much appreciated.
EDIT: To clarify - The data posted above is currently stored in a list. What I need help with is coming up with logic of how to solve my problem.
Here is a full writeup of how to solve this issue. Note that I put the sample data into C:\test\sample.txt for ease of use.
public class FileData
{
public string ID { get; set; }
public string IDNum { get; set; }
public string Date { get; set; }
public string Data { get; set; }
public string SomeDate { get; set; }
public string TranCode { get; set; }
}
public class ReadFile
{
public string SampleFile = @"C:\test\sample.txt";
public ReadFile()
{
StreamReader reader = new StreamReader(SampleFile);
string sampleFile = reader.ReadToEnd();
reader.Close();
string[] lines = sampleFile.Split(new string[] { "\r\n" }, StringSplitOptions.RemoveEmptyEntries);
string previousDate = "";
List<FileData> fileDatas = new List<FileData>();
for (int i = lines.Length - 1; i >= 0; i--)
{
FileData data = new FileData();
string[] columns = lines[i].Split('|');
data.ID = columns[0].Trim();
data.IDNum = columns[1].Trim();
data.Date = columns[2].Trim();
data.Data = columns[3].Trim();
string someDate = columns[4].Trim();
if (someDate.Equals(""))
{
data.SomeDate = previousDate;
}
else
{
previousDate = someDate;
data.SomeDate = someDate;
}
data.TranCode = columns[5].Trim();
fileDatas.Add(data);
}
}
}
Please notice that I created a "FileData" class to use to store the values.
Also notice that I am going through this data backwards, as it's easier to assign the dates this way.
What this does:
This reads all the data from the file into a string. That string is then split by line ends (\r\n).
Once you have a list of lines, we go BACKWARDS through it (int i = lines.Length - 1; i >= 0; i--).
Going backwards, we simply assign data, except for the "somedate" column. Here we check to see if somedate has a value. If it does, we assign a "previousDate" variable that value, and then assign the value. If it doesn't have a value, we use the value from previousDate. This ensures it will change appropriately.
The one issue with this is actually a potential issue with the data. If the end of the file does not have a date, you will have blank values for the SomeDate column until the first time you encounter a date.
Compiled, tested, and working.
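One thing to watch for: the loop above carries previousDate across IDNum boundaries, so row 15 would pick up 12/01/15 from row 17 even though the question wants it left blank. A possible variation, sketched under the same assumptions about the pipe-delimited layout (header line already removed, IDNum in column 1, SomeDate in column 4), resets the remembered date whenever the IDNum changes. SomeDateFiller and FillSomeDates are made-up names.

```csharp
using System;
using System.Collections.Generic;

public static class SomeDateFiller
{
    // Takes the data lines in file order (Date descending) and returns the
    // SomeDate value each row should end up with, in the same order.
    public static List<string> FillSomeDates(IList<string> lines)
    {
        var filled = new string[lines.Count];
        string previousDate = "";
        string previousIdNum = null;

        // Walk from the last row to the first, as in the answer above.
        for (int i = lines.Count - 1; i >= 0; i--)
        {
            string[] columns = lines[i].Split('|');
            string idNum = columns[1].Trim();
            string someDate = columns[4].Trim();

            // A new IDNum starts a new group: forget the date from the old one.
            if (idNum != previousIdNum)
            {
                previousDate = "";
                previousIdNum = idNum;
            }

            if (someDate.Length > 0)
            {
                previousDate = someDate;
            }
            filled[i] = previousDate;
        }
        return new List<string>(filled);
    }
}
```

With rows 13 to 17 of the sample as input, rows 13 and 14 get 08/16/15, row 15 stays empty, and rows 16 and 17 get 12/01/15.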

Replace Date-Code Sections of String

I'm trying to parse a date-formatted file name, e.g.
C:\Documents/<yyyy>\<MMM>\Example_CSV_<ddMM>.csv
and return "Todays" filename.
So for the example above, I would return (for 9th August 2013),
C:\Documents\2013\Aug\Example_CSV_0908.csv
I wondered if Regex would work, but I'm just having a mental block as to how to approach it!
I can't just replace the xth to yth sections with the date, as the files I will be processing are stored in different folders all over the system (not my idea). All of the date codes will be contained in <> however, so as far as I'm aware, I couldn't do something like
Return DateTime.Today.ToString(RawFileName);
Plus I imagine it would have unintended consequences if a part of the ordinary filename could be interpreted as a date code!
If someone could give me a pointer in the right direction, that would be great. If you need a little bit more context, here is the class that will contain this method:
public class ImportSetting
{
public string ID { get; private set; }
public List<ImportMapping> Mappings { get; set; }
public string RawFileName { get; set; }
public string GetFileName()
{
string ToFormat = RawFileName; //e.g. C:\Documents/<yyyy>\<MMM>\Example_CSV_<ddMM>.csv
//Do some clever stuff.
return ToFormat; //C:\Documents\2013\Aug\Example_CSV_0908.csv
}
public int GetCSVColumn(string AttributeName) { return Mappings.First(x => x.Attribute == AttributeName).ColumnID; }
public ImportSetting(string Name)
{
ID = Name;
Mappings = new List<ImportMapping>();
}
}
Thankyou very much for your help!
There is no need to replace anything in the text, as you can use the DateTime.ToString() method with a format string like this:
public string GetFileName(DateTime date)
{
string format = @"'C:\\Documents'\\yyyy\\MMM'\\Example_CSV_'ddMM'.csv'";
return date.ToString(format);
}
Call GetFileName with today's date:
Console.WriteLine(GetFileName(DateTime.Now));
Output:
C:\Documents\2013\Aug\Example_CSV_0908.csv
Anything that you don't want to be parsed as a date, put in single quotes ' to have it parsed as a string literal. A full list of the date format strings can be found here: http://msdn.microsoft.com/en-us/library/8kb3ddd4.aspx
var path = new Regex("<([dMy]+)>").Replace(pathFormat, o => DateTime.Now.ToString(o.Groups[1].Value));
Nb: Add all the possible letters/symbols that could occur to the character class inside the square brackets.
Nb2: This will, however, not reject weird DateTime format strings. If you want to enforce a uniform format, you could make the Regex more restrictive, like so:
var path = new Regex("<(ddMM|MMM|yyyy)>").Replace(pathFormat, o => DateTime.Now.ToString(o.Groups[1].Value));
Edit: Gotta love one-liners :)
What you could do (although I can't imagine this being a real scenario, but that might be my lack of imagination) is use the following regex:
<([fdDmMyYs]+?)>
This matches anything between the < and > symbols, as short as possible.
Then strip the first and last symbol, or use some fancier regex functions to do this for you.
Then pass the match, without the < and >, to DateTime.Now.ToString(),
and replace the match with the output.
So a code example (untested, but i'm feeling confident ;-)) would be:
public string GetFileName(string fileName)
{
Regex regex = new Regex(@"<([fdDmMyYs]+?)>");
foreach(Match m in regex.Matches(fileName))
{
fileName = fileName.Replace(m.Value, DateTime.Now.ToString(m.Value.Substring(1, m.Value.Length - 2)));
}
return fileName;
}
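The regex-based answers above all call DateTime.Now inside the replacement, which makes them awkward to test and ties the month name to the current culture. One way to tidy that up, offered only as a sketch, is to pass the date in and format with the invariant culture. DatedFileName and Resolve are made-up names, and the whitelist of yyyy, MMM and ddMM is just the set of codes that appears in the question.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class DatedFileName
{
    // Only the codes listed here are treated as date placeholders, so an
    // ordinary part of the path in angle brackets is left untouched.
    private static readonly Regex Placeholder = new Regex(@"<(yyyy|MMM|ddMM)>");

    // Replaces each <code> with the given date formatted by that code.
    public static string Resolve(string rawFileName, DateTime date)
    {
        return Placeholder.Replace(
            rawFileName,
            m => date.ToString(m.Groups[1].Value, CultureInfo.InvariantCulture));
    }
}
```

For example, DatedFileName.Resolve(@"C:\Documents\<yyyy>\<MMM>\Example_CSV_<ddMM>.csv", new DateTime(2013, 8, 9)) returns C:\Documents\2013\Aug\Example_CSV_0908.csv. In GetFileName you would pass RawFileName and DateTime.Today.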
