Processing a text file where the fields are not consistent - C#

A vendor is providing a delimited text file, but the file can, and likely will, be custom for each customer. So if the specification provides 100 fields, I may only receive 10 of them.
My concern is the overhead of each loop. In all, I am using a while and 2 for loops just for the header, and there will be at least as many for the detail.
My current approach is as follows:
using (StreamReader sr = new StreamReader(flName))
{
    //Process first line to get field names
    flHeader = sr.ReadLine().Split(charDelimiters);
    //Check first field to determine header or detail file
    if (flHeader[0].ToUpper() == "ORDERID")
    {
        header = true;
    }
    else if (flHeader[0].ToUpper() == "ORDERITEMID")
    {
        detail = true;
    }
}
//Use TextFieldParser to read and parse files
using (TextFieldParser parser = new TextFieldParser(flName))
{
    parser.TextFieldType = FieldType.Delimited;
    parser.SetDelimiters(delimiters);
    while (!parser.EndOfData)
    {
        string[] fields = parser.ReadFields();
        //Send read line to header or detail processor
        if (header == true)
        {
            if (flHeader[0] != fields[0])
            {
                ProcessHeader(fields);
            }
        }
        if (detail == true)
        {
            if (flHeader[0] != fields[0])
            {
                ProcessDetail(fields);
            }
        }
    }
}
//Header Processor snippet
//Declare header class
Data.BLL.OrderExportHeader_BLL OrderHeaderBLL = new Data.BLL.OrderExportHeader_BLL();
foreach (string field in fields)
{
    int fldCnt = fields.Length;
    //Loop through each field then use the switch to determine which field is to be filled in
    for (int flds = 0; flds < fldCnt; flds++)
    {
        string strField = field.Trim();
        switch (flHeader[flds].ToUpper())
        {
            case "ORDERID":
                OrderHeaderBLL.OrderID = strField;
                break;
        }
    }
}
//header file
OrderID ManufacturerID CustomerID SalesRepID PONumber OrderDate CustomerName CustomerNumber RepNumber Discount Terms ShipVia Notes ShipToCompanyName ShipToContactName ShipToContactPhone ShipToFax ShipToContactEmail ShipToAddress1 ShipToAddress2 ShipToCity ShipToState ShipToZip ShipToCountry ShipDate BillingAddress1 BillingAddress2 BillingCity BillingState BillingZip BillingCountry FreightTerm PriceLevel OrderType OrderStatus IsPlaced ContactName ContactPhone ContactEmail ContactFax Exported ExportDate Source ContainerName ContainerCubes Origin MarketName FOB SubTotal OrderTotal TaxRate TaxTotal ShippingTotal IsDeleted IsContainer OrderGUID CancelDate DoNotShipBefore WrittenByName WrittenForName WrittenForRepNumber CatalogCode CatalogName ShipToCode
491975 18 0 2621 1234 7/17/2014 RepZio 2499174 0 Test 561-351-7416 max@repzio.com 465 Ocean Ridge Way Juno Beach FL 33408 7/18/2014 465 Ocean Ridge Way Juno Beach FL 33408 USA 0 ShopZio True Max Fraser 561-351-7416 max@repzio.com False ShopZio 0.00 ShopZio 1500.0000 1500.0000 0.000 0.0000 0.0000 False False 63960a7b-86b7-47a2-ad11-9763a6b52fd0 7/31/2014 7/18/2014

Your sample data is the key, and your sample is currently obscure, but I think it matches the description that follows.
Per your example of 10 fields out of a possible 100:
In parsing each line, you only need to split it into 10 fields. It looks like you are delimited by whitespace, but you have a problem in that fields can contain embedded whitespace. Perhaps your data is actually tab delimited, in which case you are OK.
For simplicity, I am going to assume your 100 fields are named 'fld0', 'fld1', ..., 'fld99'.
Now, assuming the received file contains this header
fld10, fld50, fld0, fld20, fld80, fld70, fld30, fld90, fld40, fld60
and a line of data looks like
Alpha Bravo Charlie Delta Echo Foxtrot Golf Hotel India Juliet
e.g.
split[0] = "Alpha", split[1] = "Bravo", etc.
You parse the header and find that the indexes into your master list of 100 fields are 10, 50, 0, etc.
So you build a lookupFld array with these index values, i.e., lookupFld[0] = 10, lookupFld[1] = 50, etc.
Now, as you process each line, split it into 10 fields and you have an immediate indexed lookup of the correct corresponding field in your master field list.
Now MasterList[0] = "fld0", MasterList[1] = "fld1", ..., MasterList[99] = "fld99"
for (int ii = 0; ii < lookupFld.Count; ++ii)
{
    // MasterField[lookupFld[ii]] is represented by split[ii]
    // when ii == 0:
    //   lookupFld[0] is 10
    //   so MasterField[10] /* fld10 */ is represented by split[0] /* "Alpha" */
}
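A minimal, self-contained sketch of this idea (the field names and sample values below are invented to match the example above, not taken from the vendor spec): parse the received header once into an index array, then map every split data line straight onto the master list.

```csharp
using System;
using System.Linq;

class LookupDemo
{
    static void Main()
    {
        // Master list of all 100 possible fields: "fld0" ... "fld99"
        string[] masterList = Enumerable.Range(0, 100).Select(i => "fld" + i).ToArray();

        // Header actually received from the vendor (only 3 fields here, for brevity)
        string[] received = "fld10\tfld50\tfld0".Split('\t');

        // Build the lookup once: lookupFld[i] = index into the master list
        int[] lookupFld = received.Select(name => Array.IndexOf(masterList, name)).ToArray();

        // Each data line then maps by position - no per-field searching in the loop
        string[] split = "Alpha\tBravo\tCharlie".Split('\t');
        for (int ii = 0; ii < lookupFld.Length; ++ii)
        {
            // e.g. when ii == 0: masterList[10] ("fld10") holds split[0] ("Alpha")
            Console.WriteLine($"{masterList[lookupFld[ii]]} = {split[ii]}");
        }
    }
}
```

The per-line cost is just the split plus an array index per field, which addresses the loop-overhead concern in the question.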

Related

I want to split column data into Different column

I have data in a column which I want to split into different columns.
The data in the column is not consistent.
e.g.:
974/mt (ICD TKD)
974/mt (+AD 91.27/mt, ICD/TKD)
970-980/mt
970-980/mt
I have tried with substring but have not found a solution.
The output should be:
min  | max | unit | description
-----+-----+------+-----------------------
NULL | 974 | /mt  | ICD TKD
NULL | 974 | /mt  | +AD 91.27/mt, ICD/TKD
970  | 980 | /mt  | NULL
You can use Regex to parse the information, and then add columns with the parsed data.
Assumptions (due to lack of clarity in the OP):
Min Value is optional.
If present, Min Value is followed by a "-", then Max Value.
Description is optional.
Since the OP hasn't mentioned what to assume when Min Value is not available, I have used the string type for the Min/Max values, but they should ideally be replaced by an appropriate data type.
public Sample Split(string columnValue)
{
    var regex = new Regex(@"(?:(?<min>\d+)-)?(?<max>\d+)(?<unit>[\/a-zA-Z]+)\s?(\((?<description>.+)\))?", RegexOptions.Compiled);
    var match = regex.Match(columnValue);
    if (match.Success)
    {
        return new Sample
        {
            Min = match.Groups["min"].Value,
            Max = match.Groups["max"].Value,
            Unit = match.Groups["unit"].Value,
            Description = match.Groups["description"].Value
        };
    }
    return default;
}
public class Sample
{
    public string Min { get; set; }
    public string Max { get; set; }
    public string Unit { get; set; }
    public string Description { get; set; }
}
For example,
var list = new[]
{
    "974/mt (ICD TKD)",
    "974/mt (+AD 91.27/mt, ICD/TKD)",
    "970-980/mt",
    "970-980/mt"
};
foreach (var item in list)
{
    var result = Split(item);
    Console.WriteLine($"Min={result.Min},Max={result.Max},Unit={result.Unit},Description={result.Description}");
}
Output
Min=,Max=974,Unit=/mt,Description=ICD TKD
Min=,Max=974,Unit=/mt,Description=+AD 91.27/mt, ICD/TKD
Min=970,Max=980,Unit=/mt,Description=
Min=970,Max=980,Unit=/mt,Description=

How to remove pieces of data from string

I have a text file with multiple entries of this format:
Page: 1 of 1
Report Date: January 15 2018
Mr. Gerald M. Abridge ID #: 0000008 1 Route 81 Mr. Gerald Michael Abridge Pittaburgh PA 15668 SSN: XXX-XX-XXXX
Birthdate: 01/00/1998 Sex: M
COURSE Course Title CRD GRD GRDPT COURSE Course Title CRD GRD GRDPT
FALL 2017 (08/28/2017 to 12/14/2017) CS102F FUND. OF IT & COMPUTING 4.00 A 16.00 CS110 C++ PROGRAMMING I 3.00 A- 11.10 EL102 LANGUAGE AND RHETORIC 3.00 B+ 9.90 MA109 CALC WITH APPLICATIONS I 4.00 A 16.00 SP203 INTERMEDIATE SPANISH I 3.00 A 12.00
EHRS QHRS QPTS GPA Term 17.00 17.00 65.00 3.824 Cum 17.00 17.00 65.00 3.824
Current Program(s): Bachelor of Science in Computer Science
End of official record.
So far, I have read the text file into a string, full. I want to be able to remove the first two lines of each of the entries. How would I go about doing this?
Here's the code that I used to read it in:
using (StreamReader sr = new StreamReader(fileName, Encoding.Default))
{
    string full = sr.ReadToEnd();
}
If all the lines you want to skip begin with the same strings, you can put those prefixes in a list and then, when you're reading the lines, skip any that begin with one of the prefixes.
This will leave you with a list of strings representing all the file lines that don't begin with one of the specified prefixes:
var filePath = @"f:\public\temp\temp.txt";
var ignorePrefixes = new List<string> { "Page:", "Report Date:" };
var filteredContent = File.ReadAllLines(filePath)
    .Where(line => ignorePrefixes.All(prefix => !line.StartsWith(prefix)))
    .ToList();
If you want all the content as a single string, you can use String.Join:
var filteredAsString = string.Join(Environment.NewLine, filteredContent);
If Linq isn't your thing, or you don't understand what it's doing, here's the "old school" way of doing the same thing:
List<string> filtered = new List<string>();
foreach (string line in File.ReadLines(filePath))
{
    bool okToAdd = true;
    foreach (string prefix in ignorePrefixes)
    {
        if (line.StartsWith(prefix))
        {
            okToAdd = false;
            break;
        }
    }
    if (okToAdd)
    {
        filtered.Add(line);
    }
}
public static IEnumerable<string> ReadReportFile(FileInfo file)
{
    string line;
    var page = "Page:";
    var date = "Report Date:";
    using (var reader = File.OpenText(file.FullName))
    {
        while ((line = reader.ReadLine()) != null)
        {
            if (line.IndexOf(page) == -1 && line.IndexOf(date) == -1)
                yield return line;
        }
    }
}
The code is straightforward: while line is not null, return it if it doesn't contain page or date. You could condense it or get fancier, building lookups for your prefixes, etc., but if the code is simple or doesn't need to be that complex, this should suffice.
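A self-contained version of this streaming approach (the temp-file path and sample lines are invented for the demo): read lazily, and yield only the lines that contain neither marker.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

class ReportDemo
{
    // Yields only lines containing neither marker, without loading the whole file
    static IEnumerable<string> ReadReportFile(string path)
    {
        using (var reader = File.OpenText(path))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                if (line.IndexOf("Page:") == -1 && line.IndexOf("Report Date:") == -1)
                    yield return line;
            }
        }
    }

    static void Main()
    {
        // Write a tiny stand-in for the report file
        string path = Path.Combine(Path.GetTempPath(), "report-demo.txt");
        File.WriteAllLines(path, new[]
        {
            "Page: 1 of 1",
            "Report Date: January 15 2018",
            "Birthdate: 01/00/1998 Sex: M",
            "End of official record."
        });

        // Only the two record lines survive the filter
        List<string> kept = ReadReportFile(path).ToList();
        kept.ForEach(Console.WriteLine);
    }
}
```

Because the method is an iterator, nothing is read from disk until the caller enumerates it, which keeps memory flat for large report files.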

Unpivot a table in C#

I'm building a function in C# to unpivot a complex table in a CSV file and insert it into a SQL table. The file looks something like this:
             | 1/5/2018 | 1/5/2018 | 1/6/2018 | 1/6/2018 ...
City:        | min:     | max:     | min:     | max:
Boston(KBOS) | 1        | 10       | 5        | 12
My goal is to unpivot it like so:
airport_code | localtime | MinTemp | MaxTemp
KBOS         | 1/5/2018  | 1       | 10
KBOS         | 1/6/2018  | 5       | 12
My strategy is:
Store the first row of dates and the second row of headers into arrays
Use a CSV parser to read each following line and loop through each field
If the date that corresponds to the current field is the same as the previous one, it belongs in the same row. Put the data into the appropriate field.
Since there are only two temperature fields for each row, the row is complete and can now be inserted.
Otherwise, start a new row and put the data into the appropriate field.
However, I'm running into a problem: once insertRow is populated and inserted, I can't overwrite it or null all the fields and use it again - that throws an error that the row has already been inserted. I can't move the declaration of insertRow inside the for loop because I need to preserve the data through multiple iterations to completely fill out the row. So instead I tried to declare it outside the loop but only initialize it inside the loop, something like:
if (insertRow == null)
{
    insertRow = MyDataSet.tblForecast.NewtblForecastRow();
}
But that throws a "use of unassigned local variable" error. Any ideas about how I can preserve insertRow on some iterations and dispose of it on others? Or, any suggestions about a better way to do what I'm looking for? The relevant portion of the code is below:
using (TextFieldParser csvParser = new TextFieldParser(FileName))
{
    csvParser.SetDelimiters(new string[] { "," });
    csvParser.ReadLine(); //Skip top line
    string[] dateList = csvParser.ReadFields(); //Get dates from second line
    string[] fieldNames = csvParser.ReadFields(); //Get headers from third line
    //Read through file
    while (!csvParser.EndOfData)
    {
        DataSet1.tblForecastRow insertRow = MyDataSet.tblForecast.NewtblForecastRow();
        string[] currRec = csvParser.ReadFields();
        //Get airport code
        string airportCode = currRec[0].Substring(currRec[0].LastIndexOf("(") + 1, 4);
        //Unpivot record
        DateTime currDate = DateTime.Parse("1/1/1900"); //initialize
        DateTime prevDate;
        for (int i = 1; i < fieldNames.Length; i++) //skip first col
        {
            prevDate = currDate; //previous date is the prior current date
            DateTime.TryParse(dateList[i], out currDate); //set new current date
            int val;
            int.TryParse(currRec[i], out val);
            switch (fieldNames[i].ToLower())
            {
                case "min:":
                    insertRow["MinTemp"] = val;
                    break;
                case "max:":
                    insertRow["MaxTemp"] = val;
                    break;
            }
            if (currDate == prevDate) //if same date, at end of row, insert
            {
                insertRow["airport_code"] = airportCode;
                insertRow["localTime"] = currDate;
                insertRow["Forecasted_date"] = DateTime.Today;
                MyDataSet.tblForecast.AddtblForecastRow(insertRow);
                ForecastTableAdapter.Update(MyDataSet.tblForecast);
            }
        }
    }
}
You create a new row when you've finished handling the current one. And you already know where that is:
if (currDate == prevDate) //if same date, at end of row, insert
{
    insertRow["airport_code"] = airportCode;
    insertRow["localTime"] = currDate;
    insertRow["Forecasted_date"] = DateTime.Today;
    // we're storing insertRow
    MyDataSet.tblForecast.AddtblForecastRow(insertRow);
    // now it gets saved (man, that is often)
    ForecastTableAdapter.Update(MyDataSet.tblForecast);
    // OKAY, let's create the new insertRow instance
    insertRow = MyDataSet.tblForecast.NewtblForecastRow();
    // and now, the next time we end up in this if,
    // the row we just created will be inserted
}
Your initial row can be created outside the loop:
// first row creation
DataSet1.tblForecastRow insertRow = MyDataSet.tblForecast.NewtblForecastRow();
//Read through file
while (!csvParser.EndOfData)
{
    // the row-creation line has been moved out of the while loop
    string[] currRec = csvParser.ReadFields();
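The same add-then-renew pattern can be demonstrated with a plain System.Data.DataTable (the table and column names below are invented for illustration; the answer's typed NewtblForecastRow/AddtblForecastRow calls behave the same way): a row object may only be added to its table once, so after adding it you ask the table for a fresh instance.

```csharp
using System;
using System.Data;

class RowReuseDemo
{
    static void Main()
    {
        // Hypothetical table standing in for MyDataSet.tblForecast
        var table = new DataTable("Forecast");
        table.Columns.Add("airport_code", typeof(string));
        table.Columns.Add("MinTemp", typeof(int));

        DataRow insertRow = table.NewRow(); // first row created before the loop
        foreach (var (code, min) in new[] { ("KBOS", 1), ("KJFK", 5) })
        {
            insertRow["airport_code"] = code;
            insertRow["MinTemp"] = min;
            table.Rows.Add(insertRow); // insertRow is now attached to the table...
            insertRow = table.NewRow(); // ...so create a fresh one for the next pass
        }

        Console.WriteLine(table.Rows.Count); // 2
    }
}
```

Reusing the attached row object is what triggers the "row has already been inserted" error; renewing it immediately after the Add avoids both that error and the "use of unassigned local variable" compile error.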

c# load csv file and sort columns

I have a datagridview set up with elements and some identifying characteristics as column headers.
Col 1  | Col 2  | Col 3        | 4  | 5 | 6  | 7 ...
Sample | Symbol | Symbol Color | Na | K | Mg | Mn ...
I can currently load CSV or tab-delimited text files, but the formatting has to match the datagridview. Is there a way to load a CSV of element data with column headers in random order, and then place them in the columns you desire?
Currently, the csv must be formatted in the same order as the datagridview:
Na, K, Mg, Mn....
88, 5, 6, 16...
56, 7, 33, 12...
Is it possible, if the data is in a different order, to have it sorted to match the format of the existing datagridview:
Mg, Mn, Na, K....
6, 16, 88, 5...
33, 12, 56, 7...
There may be missing columns in the imported file sometimes, and that's OK. I have figured out how to hide the columns of empty data.
I would suggest using a file library like FileHelpers to perform these actions. It's free and open source, with some great features like reading CSV or other formatted data files, reading data asynchronously, and defining the column order for an entity.
Edit:
Handling missing values: create a custom handler for the missing value, or define the data as nullable. For more info: http://www.filehelpers.net/example/MissingValues/MissingValuesNullable/
Custom order: use the ColumnOrder attribute to define the column order.
This is a very simple way to do it.
First create a class like this:
private class MyColumns
{
    public string Na { get; set; }
    public string K { get; set; }
    public string Mg { get; set; }
    public string Mn { get; set; }
}
Then you can parse your csv like this:
var allLines = File.ReadAllLines(@"C:\kosalaw\myfile.csv"); //read all lines from the csv file
MyColumns[] AllColumns = new MyColumns[allLines.Length - 1]; //create an array of MyColumns
var colHeaders = allLines[0].Split(new[] { "\",\"" }, StringSplitOptions.None).ToList(); //identify column headers
for (int index = 1; index < allLines.Length; index++) //loop through the lines; we skip the first line as it is the column header
{
    var line = allLines[index];
    var lineColumns = line.Split(new[] { "\",\"" }, StringSplitOptions.None); //split each line into columns
    AllColumns[index - 1] = new MyColumns //now use the column headers to identify the exact column
    {
        K = lineColumns[colHeaders.IndexOf("K")],
        Mg = lineColumns[colHeaders.IndexOf("Mg")],
        Mn = lineColumns[colHeaders.IndexOf("Mn")],
        Na = lineColumns[colHeaders.IndexOf("Na")]
    };
}
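One caveat with the sketch above: the question allows missing columns, and List.IndexOf returns -1 for an absent header, so lineColumns[-1] would throw. A small helper (hypothetical, not part of the original answer) can default missing columns to an empty string:

```csharp
using System;
using System.Collections.Generic;

class SafeLookupDemo
{
    // Returns the value under the named header, or "" when that header is absent
    static string GetField(string[] lineColumns, List<string> colHeaders, string name)
    {
        int i = colHeaders.IndexOf(name);
        return (i >= 0 && i < lineColumns.Length) ? lineColumns[i] : "";
    }

    static void Main()
    {
        var colHeaders = new List<string> { "Mg", "Na" }; // note: "K" is missing
        var lineColumns = new[] { "6", "88" };

        Console.WriteLine(GetField(lineColumns, colHeaders, "Na")); // 88
        Console.WriteLine(GetField(lineColumns, colHeaders, "K") == "" ? "(empty)" : "?");
    }
}
```

With a helper like this, the object-initializer assignments keep working whether or not a given element column is present in the imported file.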

Array help Index out of range exception was unhandled

I am trying to populate combo boxes from a text file, using a comma as a delimiter. Everything was working fine, but now when I debug I get the "Index out of range exception was unhandled" warning. I guess I need a fresh pair of eyes to see where I went wrong. I commented the line that gets the error: //Fname = fields[1];
private void xViewFacultyMenuItem_Click(object sender, EventArgs e)
{
    const string fileStaff = "source\\Staff.txt";
    const char DELIM = ',';
    string Lname, Fname, Depart, Stat, Sex, Salary, cDept, cStat, cSex;
    double Gtotal;
    string recordIn;
    string[] fields;
    cDept = this.xDeptComboBox.SelectedItem.ToString();
    cStat = this.xStatusComboBox.SelectedItem.ToString();
    cSex = this.xSexComboBox.SelectedItem.ToString();
    FileStream inFile = new FileStream(fileStaff, FileMode.Open, FileAccess.Read);
    StreamReader reader = new StreamReader(inFile);
    recordIn = reader.ReadLine();
    while (recordIn != null)
    {
        fields = recordIn.Split(DELIM);
        Lname = fields[0];
        Fname = fields[1]; // this is where the error appears
        Depart = fields[2];
        Stat = fields[3];
        Sex = fields[4];
        Salary = fields[5];
        Fname = fields[1].TrimStart(null);
        Depart = fields[2].TrimStart(null);
        Stat = fields[3].TrimStart(null);
        Sex = fields[4].TrimStart(null);
        Salary = fields[5].TrimStart(null);
        Gtotal = double.Parse(Salary);
        if (Depart == cDept && cStat == Stat && cSex == Sex)
        {
            this.xEmployeeListBox.Items.Add(recordIn);
        }
        recordIn = reader.ReadLine();
    }
}
Source file --
Anderson, Kristen, Accounting, Assistant, Female, 43155
Ball, Robin, Accounting, Instructor, Female, 42723
Chin, Roger, Accounting, Full, Male,59281
Coats, William, Accounting, Assistant, Male, 45371
Doepke, Cheryl, Accounting, Full, Female, 52105
Downs, Clifton, Accounting, Associate, Male, 46887
Garafano, Karen, Finance, Associate, Female, 49000
Hill, Trevor, Management, Instructor, Male, 38590
Jackson, Carole, Accounting, Instructor, Female, 38781
Jacobson, Andrew, Management, Full, Male, 56281
Lewis, Karl, Management, Associate, Male, 48387
Mack, Kevin, Management, Assistant, Male, 45000
McKaye, Susan, Management, Instructor, Female, 43979
Nelsen, Beth, Finance, Full, Female, 52339
Nelson, Dale, Accounting, Full, Male, 54578
Palermo, Sheryl, Accounting, Associate, Female, 45617
Rais, Mary, Finance, Instructor, Female, 27000
Scheib, Earl, Management, Instructor, Male, 37389
Smith, Tom, Finance, Full, Male, 57167
Smythe, Janice, Management, Associate, Female, 46887
True, David, Accounting, Full, Male, 53181
Young, Jeff, Management, Assistant, Male, 43513
For the sake of anyone who doesn't want to look at the mammoth code you've posted, here's the relevant bit:
while (recordIn != null)
{
    fields = recordIn.Split(DELIM);
    Lname = fields[0];
    Fname = fields[1]; // this is where the error appears
Given the exception you've seen, that basically means that recordIn doesn't contain the delimiter DELIM (a comma). I suggest you explicitly check for the expected size and throw an exception giving more details if you get an inappropriate line. Or if it's a blank line, as others have suggested (and which does indeed seem likely) you may want to just skip it.
Alternatively, here's a short but complete console application which should help you find the problem:
using System;
using System.IO;

class Test
{
    static void Main()
    {
        string[] lines = File.ReadAllLines("source\\Staff.txt");
        for (int i = 0; i < lines.Length; i++)
        {
            string line = lines[i];
            string[] fields = line.Split(',');
            if (fields.Length != 6)
            {
                Console.WriteLine("Invalid line ({0}): '{1}'",
                    i + 1, line);
            }
        }
    }
}
That could be because of a blank line that appears at the top of the text file.
Have you checked for an empty row at the end of your text file?
After this:
fields = recordIn.Split(DELIM);
you need this:
if (fields.Length < 6)
{
    // the current recordIn is the problem!
}
else
{
    Lname = fields[0];
    // etc.
}
recordIn = reader.ReadLine(); // make sure to put this after the else block!
You should do this routinely when reading from files, because there are often leading or trailing blank lines.
You've most likely got an extra blank line at the end of your input file, which therefore only has one (empty) field, giving you index out of range at index 1.
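Putting these suggestions together, a minimal defensive read loop (the demo file path and rows are invented; the field count of 6 matches the source file above) skips blank lines and checks the field count before indexing:

```csharp
using System;
using System.IO;

class DefensiveReadDemo
{
    static void Main()
    {
        // Tiny stand-in for Staff.txt, including the trailing blank line that
        // typically causes the index-out-of-range error
        string path = Path.Combine(Path.GetTempPath(), "staff-demo.txt");
        File.WriteAllLines(path, new[]
        {
            "Anderson, Kristen, Accounting, Assistant, Female, 43155",
            "Ball, Robin, Accounting, Instructor, Female, 42723",
            ""
        });

        int good = 0;
        foreach (string line in File.ReadLines(path))
        {
            if (string.IsNullOrWhiteSpace(line))
                continue; // skip blank lines instead of crashing
            string[] fields = line.Split(',');
            if (fields.Length != 6)
                continue; // or log/throw with the line number for diagnosis
            good++;
        }
        Console.WriteLine(good); // 2
    }
}
```

Validating before indexing turns a runtime crash into something you can log and investigate, which is especially useful for vendor-supplied files you don't control.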
