I'm trying to parse Excel (.xls, .xlsx) files. The structure of the files is the same except for the number of records.
I need to parse the industry, which in this case is "FinTech". Since it sits in a single cell, I guess I have to use a regex such as ^Industry: (.*)$?
The code has to find the row/column where the list of people starts and put it into an IEnumerable<Person>. It could use the following regexes.
Number always consists of 6 digits. ^[0-9]{6}$
The name consists of at least two words, each starting with a capital letter. ^([a-zA-Z]+\s?\b){2,}$
A test .xlsx file can be found here https://docs.google.com/spreadsheets/d/15SR04cHXgGLWe0cuOOuuB5vUZigebh96/edit?usp=sharing&ouid=112418126731411268789&rtpof=true&sd=true.
List of people
Normal condition
Industry: FinTech
# Number Name
1 226250 Zain Griffiths
2 226256 Michael Houghton
3 226259 Hugo Willis Johnson
4 226264 Anna-Maria Rose
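For what it's worth, this is how I've been sanity-checking the patterns against the sample above (a throwaway snippet; note that "Anna-Maria Rose" fails the name pattern because of the hyphen, which is part of why I'm unsure):

using System.Text.RegularExpressions;

var numberPattern = new Regex(@"^[0-9]{6}$");
var namePattern = new Regex(@"^([a-zA-Z]+\s?\b){2,}$");

Console.WriteLine(numberPattern.IsMatch("226250"));        // True
Console.WriteLine(namePattern.IsMatch("Zain Griffiths"));  // True
Console.WriteLine(namePattern.IsMatch("Anna-Maria Rose")); // False - the hyphen breaks the match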
The actual question
First of all, I'm not completely sure my regexes are correct. I was only able to display the rows and columns, but I'm not sure how to actually parse the industry and the list of people into an IEnumerable<Person>. So how do I do that?
Snippet
// Program.cs
var excel = new ExcelParser();
var sheet1 = excel.Import(@"a.xlsx");
Console.OutputEncoding = Encoding.UTF8;
for (var i = 0; i < sheet1.Rows.Count; i++)
{
for (var j = 0; j < sheet1.Columns.Count; j++)
{
var cell = sheet1.Rows[i][j].ToString()?.Trim();
Console.Write($"Column: {cell} | ");
}
Console.WriteLine();
}
Console.ReadLine();
// ExcelParser.cs
public sealed class ExcelParser
{
public ExcelParser()
{
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
}
public DataTable Import(string filePath)
{
// does file exist?
if (!File.Exists(filePath))
{
throw new FileNotFoundException();
}
// .xls or .xlsx allowed
var extension = new FileInfo(filePath).Extension.ToLowerInvariant();
if (extension is not (".xls" or ".xlsx"))
{
throw new NotSupportedException();
}
// read .xls or .xlsx
using var stream = File.Open(filePath, FileMode.Open, FileAccess.Read);
using var reader = ExcelReaderFactory.CreateReader(stream);
var dataSet = reader.AsDataSet(new ExcelDataSetConfiguration
{
ConfigureDataTable = _ => new ExcelDataTableConfiguration
{
UseHeaderRow = false
}
});
// Sheet1
return dataSet.Tables[0];
}
}
The structure of the files is the same except for the number of records
As long as the table is structured (or semi-structured), you can state one or two simple assumptions and parse the tables based on them; whenever a file doesn't follow the assumptions, you return false (throw an exception, etc.).
Designing regexes to parse the table is really just a way of encoding assumptions. I want to keep it simple, so, based on the problem statement, here are mine:
There will be an "industry" (or "industry:"; call .ToLower()) string in a separate cell (the regex does nothing more than find such a string), and the industry's name will be in the same cell.[1]
The first person's name will be next to the first 6-digit-number cell.[2]
Here is the code
public (string industryName, List<string> peopleNames) ParseSheet(DataTable sheet1)
{
// 1. Get the indices of the industry cell and of the first person-name cell
var industryCellIndex = (-1, -1, false);
var peopleFirstCellIndex = (-1, -1, false);
for (var i = 0; i < sheet1.Rows.Count; i++)
{
for (var j = 0; j < sheet1.Columns.Count; j++)
{
// .ToLower() added
var cell = sheet1.Rows[i][j].ToString()?.Trim().ToLower() ?? string.Empty;
if (cell.StartsWith("industry"))
{
industryCellIndex = (i, j, true);
break;
}
// the cell after the first 6-digit number will be the first name in the people records
if (cell.Length == 6 && int.TryParse(cell, out _))
{
peopleFirstCellIndex = (i, j + 1, true);
break;
}
}
if (industryCellIndex.Item3 && peopleFirstCellIndex.Item3)
break;
}
if (!industryCellIndex.Item3 || !peopleFirstCellIndex.Item3)
{
// throw new Exception("Excel file is not normalized!");
return (null, null);
}
// 2. retrieve the desired data
var industryName = sheet1.Rows[industryCellIndex.Item1][industryCellIndex.Item2]
    .ToString()
    .Replace(":", ""); // does nothing if there is no ":"
industryName = industryName
    .Substring(industryName.IndexOf("industry", StringComparison.OrdinalIgnoreCase) + "industry".Length)
    .Trim();
var peopleNames = new List<string>();
var colIndex = peopleFirstCellIndex.Item2;
for (var rowIndex = peopleFirstCellIndex.Item1;
rowIndex < sheet1.Rows.Count;
rowIndex++)
{
peopleNames.Add(sheet1.Rows[rowIndex][colIndex].ToString()?.Trim());
}
return (industryName, peopleNames);
}
[1] If this assumption needs editing (e.g. the industry name might be in the cell after the one containing the "industry" string), the idea is still the same: account for it in the parsing.
[2] Or, for example, two columns and one row after the "#" cell.
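To get from there to the IEnumerable<Person> the question asks for, you can walk the rows once more and pair each 6-digit number with the cell to its right. This is only a sketch: the Person record and the assumption that the name always sits immediately right of the number are mine, based on the sample file.

public sealed record Person(int Number, string Name);

public IEnumerable<Person> ParsePeople(DataTable sheet1)
{
    for (var i = 0; i < sheet1.Rows.Count; i++)
    {
        for (var j = 0; j < sheet1.Columns.Count - 1; j++)
        {
            var cell = sheet1.Rows[i][j].ToString()?.Trim();
            // a 6-digit number marks a person row; the name is the next cell
            if (cell?.Length == 6 && int.TryParse(cell, out var number))
            {
                var name = sheet1.Rows[i][j + 1].ToString()?.Trim();
                if (!string.IsNullOrEmpty(name))
                    yield return new Person(number, name);
            }
        }
    }
}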
My text file contains lines such as:
1/1/2018;0;29;10
1/2/2018;0;16;1
1/3/2018;0;32;1
1/4/2018;0;34;15
1/5/2018;0;19;2
1/6/2018;0;21;2
Further down in the text file there are decimals, which is why I am trying to use double:
1/29/2018;0.32;52;38
1/30/2018;0.06;44;21
I am trying to split on the semicolons and assign each value between them to a 2D array with 31 rows and 4 columns.
private void button1_Click(object sender, EventArgs e)
{
// 2d array
double[,] testarray = new double[31, 4];
string inputFile = File.ReadAllText("testywesty.txt");
char[] separator = { ';', ' ' };
for (int row = 0; row < 31; row++)
{
for (int column = 0; column < 4; column++)
{
string[] strlist = inputFile.Split(separator);
testarray [row,column] = double.Parse(strlist[column]);
}
}
}
I believe that I have the right loops to insert my values into the 2D array; however, I am getting an error on the input, and I believe it is because of the slashes.
Is my code sufficient for holding the text file's contents in my array? And how do I deal with the '/' characters?
You'd be getting an error on the dates because you're trying to convert them to doubles, which is not possible. One thing you can do is first convert each date to a DateTime using that class's Parse method, and then use the ToOADate method to turn it into a double. Note that if you need to convert it back, you can use DateTime.FromOADate, as I've done in the output below.
Also, it might be helpful to use File.ReadAllLines to read the file into an array of strings, where each string is one line of the file (this assumes each line contains the four parts shown in the sample file contents in the question). That way each line represents a row, and we can split the line to get our columns.
For example:
private static void button1_Click(object sender, EventArgs e)
{
var lines = File.ReadAllLines("testywesty.txt");
var items = new double[lines.Length, 4];
var delims = new[] {';', ' '};
for (var line = 0; line < lines.Length; line++)
{
var parts = lines[line].Split(delims);
var maxParts = Math.Min(parts.Length, items.GetLength(1));
for (var part = 0; part < maxParts; part++)
{
if (part == 0)
{
// Parse first item to a date then use ToOADate to make it a double
items[line, part] = DateTime.Parse(parts[part]).ToOADate();
}
else
{
items[line, part] = double.Parse(parts[part]);
}
}
}
// Show the output
var output = new StringBuilder();
for (var row = 0; row < items.GetLength(0); row++)
{
var result = new List<string>();
for (var col = 0; col < items.GetLength(1); col++)
{
result.Add(col == 0
? DateTime.FromOADate(items[row, col]).ToShortDateString()
: items[row, col].ToString());
}
output.AppendLine(string.Join(", ", result));
}
MessageBox.Show(output.ToString(), "Results");
}
Output
Of course, you can read the data and parse it into an array. But since the data is polymorphic, the array needs to be of type object[,]. This is how I would approach it:
class Program
{
static void Main(string[] args)
{
object[,] array = ReadFileAsArray("testywesty.txt");
}
static object[,] ReadFileAsArray(string file)
{
// how long is the file?
// read it twice, once to count the rows
// and a second time to read each row in
int rows = 0;
var fs = File.OpenText(file);
while (!fs.EndOfStream)
{
fs.ReadLine();
rows++;
}
fs.Close();
var array = new object[rows, 4];
fs = File.OpenText(file);
int row = 0;
while (!fs.EndOfStream)
{
// read line
var line = fs.ReadLine();
// split line into string parts at every ';'
var parts = line.Split(';');
// if 1st part is date store in 1st column
if (DateTime.TryParse(parts[0], out DateTime date))
{
array[row, 0] = date;
}
// if 2nd part is float store in 2nd column
if (float.TryParse(parts[1], out float x))
{
array[row, 1] = x;
}
// if 3rd part is integer store in 3rd column
if (int.TryParse(parts[2], out int a))
{
array[row, 2] = a;
}
// if 4th part is integer store in 4th column
if (int.TryParse(parts[3], out int b))
{
array[row, 3] = b;
}
row++;
}
fs.Close();
return array;
}
}
But I feel this is clunky. If the data types represented by the file are predetermined, then filling a collection of a custom type feels more natural in C#, as you let the type handle its own data and parsing. Consider the example below:
class Program
{
static void Main(string[] args)
{
IEnumerable<MyData> list = ReadFileAsEnumerable("testywesty.txt");
Debug.WriteLine(MyData.ToHeading());
foreach (var item in list)
{
Debug.WriteLine(item);
}
// date x a b
// 1/1/2018 0 29 10
// 1/2/2018 0 16 1
// 1/3/2018 0 32 1
// 1/4/2018 0 34 15
// 1/5/2018 0 19 2
// 1/6/2018 0 21 2
// 1/29/2018 0.32 52 38
// 1/30/2018 0.06 44 21
}
public static IEnumerable<MyData> ReadFileAsEnumerable(string file)
{
// using ensures the file is closed even if enumeration stops early
using (var fs = File.OpenText(file))
{
    while (!fs.EndOfStream)
    {
        yield return MyData.Parse(fs.ReadLine());
    }
}
}
}
/// <summary>
/// Stores a row of my data
/// </summary>
/// <remarks>
/// Mutable structures are evil. Make all properties read-only.
/// </remarks>
public struct MyData
{
public MyData(DateTime date, float number, int a, int b)
{
this.Date = date;
this.Number = number;
this.A = a;
this.B = b;
}
public DateTime Date { get; }
public float Number { get; }
public int A { get; }
public int B { get; }
public static MyData Parse(string line)
{
// split line into string parts at every ';'
var parts = line.Split(';');
// TryParse assigns its out variable even on failure (default value),
// so date, number, a and b are all definitely assigned below
if (DateTime.TryParse(parts[0], out DateTime date)) { }
if (float.TryParse(parts[1], out float number)) { }
if (int.TryParse(parts[2], out int a)) { }
if (int.TryParse(parts[3], out int b)) { }
return new MyData(
date,
number,
a,
b);
}
public static string ToHeading()
{
return $"{"date",-11} {"x",-4} {"a",-4} {"b",-4}";
}
public override string ToString()
{
return $"{Date.ToShortDateString(),-11} {Number,4} {A,4} {B,4}";
}
}
So, I was trying to present a CSV document in a console application. However, due to the varying text lengths, the output was not in a presentable format.
To present it, I tried to count the maximum length of text for each column and then pad the shorter values in that column with whitespace, so that every column has a uniform width.
I tried to get the character count, but I can't seem to figure out how to proceed.
var file = File.ReadAllLines(@"E:\File.csv");
var lineList = file.Select(x => x.Split(',').ToList()).ToList();
int maxColumn = lineList.Select(x => x.Count).Max();
List<int> maxElementSize = new List<int>();
for (int i = 0; i < maxColumn; i++)
{
//Some Logic
}
Any help would be highly appreciated.
Here's a sample console application that gets the maximum character length for each column:
static void Main(string[] args)
{
string CSVPath = @"D:\test.csv";
string outputText = "";
using (var reader = File.OpenText(CSVPath))
{
outputText = reader.ReadToEnd();
}
var colSplitter = ',';
var rowSplitter = new char[] { '\n' };
var rows = (from row in outputText.Split(rowSplitter, StringSplitOptions.RemoveEmptyEntries)
            let cols = row.Split(colSplitter)
            select new { totalCols = cols.Length, cols = cols }).ToList();
int[] maxColLengths = new int[rows.Max(o => o.totalCols)];
for (int i = 0; i < rows.Count; i++)
{
for (int j = 0; j < rows[i].cols.Count(); j++)
{
int curLength = rows[i].cols[j].Trim().Length;
if (curLength > maxColLengths[j])
maxColLengths[j] = curLength;
}
}
Console.WriteLine(string.Join(", ", maxColLengths));
}
Hope this helped.
Try with a nested for loop:
var inputLines = File.ReadAllLines(@"E:\File.csv");
Dictionary<int, int> dictIndexLength = new Dictionary<int, int>();
foreach (var line in inputLines)
{
    List<string> columList = line.Split(',').ToList();
    for (int i = 0; i < columList.Count; i++)
    {
        if (dictIndexLength.TryGetValue(i, out int tempVal))
        {
            if (tempVal < columList[i].Length)
            {
                dictIndexLength[i] = columList[i].Length;
            }
        }
        else
            dictIndexLength[i] = columList[i].Length;
    }
}
You can check the result with these lines of code:
for (int i = 0; i < dictIndexLength.Count; i++)
{
    Console.WriteLine("Column {0} : {1}", i, dictIndexLength[i]);
}
Here's how I would do it, very similar to un-lucky's answer, only using a List<int> instead of a Dictionary<int, int>. I added dummy data for testing, but the actual call to read the file is left in there, so you can just remove the dummy data and the line that reads it, and it should work:
static void Main(string[] args)
{
var fileLines = new List<string>
{
"Lorem, Ipsum, is, simply, dummy, text, of, the, printing, and, typesetting,",
"industry., Lorem, Ipsum, has, been, the, industry's, standard, dummy, text,",
"ever, since, the, 1500s, when, an, ",
"unknown, printer, took, a, galley, of, type, and, scrambled, it, to, make,",
"a, type, specimen, book.,",
"It, has, survived, not, only, five, centuries, but, also, the, leap,",
"into, electronic, typesetting, remaining, essentially, unchanged.,",
"It, was, popularised, in, the, 1960s, with, the, release,",
"of, Letraset, sheets, containing, Lorem, Ipsum, passages, and, more, ",
"recently, with, desktop, publishing,",
"software, like, Aldus, PageMaker, including, versions, of, Lorem, Ipsum."
};
var filePath = @"f:\public\temp\temp.csv";
var fileLinesColumns = File.ReadAllLines(filePath).Select(line => line.Split(','));
var colWidths = new List<int>();
// Remove this line to use file data
fileLinesColumns = fileLines.Select(line => line.Split(','));
// Get the max length of each column and add it to our list
foreach (var fileLineColumns in fileLinesColumns)
{
for (int i = 0; i < fileLineColumns.Length; i++)
{
if (i > colWidths.Count - 1)
{
colWidths.Add(fileLineColumns[i].Length);
}
else if (fileLineColumns[i].Length > colWidths[i])
{
colWidths[i] = fileLineColumns[i].Length;
}
}
}
// Write out our columns, padding each one to match the longest line
foreach (var fileLineColumns in fileLinesColumns)
{
for (int i = 0; i < fileLineColumns.Length; i++)
{
Console.Write(fileLineColumns[i].PadRight(colWidths[i]));
}
Console.WriteLine();
}
Console.Write("\nDone!\nPress any key to exit...");
Console.ReadKey();
}
Output
Initialise your list, then loop over your lines, and within each line, loop over its columns:
for (int j = 0; j < maxColumn; j++)
{
    maxElementSize.Add(0);
}
for (int i = 0; i < lineList.Count; i++)
{
    for (int j = 0; j < lineList[i].Count; j++)
    {
        if (lineList[i][j].Length > maxElementSize[j])
            maxElementSize[j] = lineList[i][j].Length;
    }
}
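Once maxElementSize is filled, padding the output is one more pass over the same lineList (a sketch; the + 1 just keeps a gap between columns):
foreach (var line in lineList)
{
    for (int j = 0; j < line.Count; j++)
    {
        Console.Write(line[j].PadRight(maxElementSize[j] + 1));
    }
    Console.WriteLine();
}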
I use the following code to make sure the columns in a database are large enough to take the CSV input data...
#!/usr/bin/python3
import array as arr
from csv import reader
import argparse
def csv_getFldLens(in_file, has_header=0, delimiter=','):
    fldMaxLens = arr.array('i')
    headers = []
    # open file in read mode
with open(in_file, 'r') as read_obj:
# pass the file object to reader() to get the reader object
csv_reader = reader(read_obj, delimiter=delimiter)
# Iterate over each row in the csv using reader object
rcnt = 0
for row in csv_reader:
# row variable is a list that represents a row in csv
# print(row)
if has_header and rcnt == 0:
for fld in row:
headers.append(fld)
rcnt += 1
continue
j = 0
for fld in row:
fldLen = len(fld)
                if j >= len(fldMaxLens):
                    fldMaxLens.append(fldLen)
                else:
                    fldMaxLens[j] = max(fldMaxLens[j], fldLen)
j = j + 1
rcnt += 1
j = 0
if has_header:
for f in headers:
print(f,": ", fldMaxLens[j])
j += 1
else:
        for j in range(len(fldMaxLens)):
            print("Col[", j + 1, "]: ", fldMaxLens[j])
if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Get column lengths of CSV fields.')
parser.add_argument('--in_file', default='', help='The CSV input file')
parser.add_argument('--has_header', action='store_true', help='The CSV file has headers')
parser.add_argument('--delimiter', default=',', help='Sets the delimiter. Default is comma \',\'.')
args = parser.parse_args()
csv_getFldLens(in_file=args.in_file, has_header=args.has_header, delimiter=args.delimiter)
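Invocation looks like this (assuming the script is saved as csv_getFldLens.py; the file name is just an example):
python3 csv_getFldLens.py --in_file data.csv --has_header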
The code below takes in data from an Excel spreadsheet, validates it against a set of predefined rules, and writes any errors to the console.
This works up to a point: the data returns as expected up to column Z, but if any errors are reported past Z (AA, AB, AC, etc.), the column letters start messing up and I get values like ], ~, ?. I believe this is down to the ASCII arithmetic, as I am starting from decimal 65 ('A'). I guess I need to write some kind of method that can cope with this, but I do not know where to start. Any help is much appreciated.
namespace WorksheetValidator
{
public class XcelReader
{
private readonly List<List<IRule>> m_Rules;
public XcelReader(List<List<IRule>> rules)
{
m_Rules = rules;
}
public void ValidateWorksheet(string fileName)
{
bool allRulesPassed = true;
WorkbookProvider workbookProvider = new WorkbookProvider();
IWorkbook workbook;
using (FileStream fileStream = File.OpenRead(fileName))
workbook = workbookProvider.GetWorkbook(fileStream, SpreadsheetType.Xlsx);
for (int rowCounter = 1; rowCounter < workbook.Worksheets[1].Rows.Count; rowCounter++)
{
IRow row = workbook.Worksheets[1].Rows[rowCounter];
for (int columnCounter = 0; columnCounter < row.Cells.Count; columnCounter++)
{
List<string> failedRules = ColumnValueIsValid(row.Cells[columnCounter].Value, m_Rules[columnCounter]);
failedRules.ForEach(failedRule =>
{
allRulesPassed = false;
Console.WriteLine("\n[{0}:{1}] Failed: {2}", rowCounter + 1, (char)(columnCounter + 65), failedRule);
});
}
}
if(allRulesPassed)
Console.WriteLine("\n\n\nWOOHOO! worksheet is hunky dory");
}
private List<string> ColumnValueIsValid(string value, List<IRule> rules)
{
List<string> failedRules = new List<string>();
rules.ForEach(rule =>
{
if(!rule.IsValid(value))
failedRules.Add(rule.GetReasonForFailure(value));
});
return failedRules;
}
}
}
Replace this:
(char)(columnCounter + 65)
With a function that converts 0 to "A", 1 to "B"..... 26 to "AA", 27 to "AB", etc.
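For example, here is a sketch of such a function (the conversion is "bijective base 26"; the 0-based input matches columnCounter in the question):

static string GetExcelColumnName(int columnIndex)
{
    var name = string.Empty;
    columnIndex++; // switch to 1-based, Excel-style
    while (columnIndex > 0)
    {
        columnIndex--; // map the current "digit" onto A..Z
        name = (char)('A' + columnIndex % 26) + name;
        columnIndex /= 26;
    }
    return name;
}

The call site then becomes GetExcelColumnName(columnCounter) instead of (char)(columnCounter + 65).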
I have a binary file, encoded little-endian, containing ~250,000 values of var1 followed by the same number of values of var2. I need to write a method that reads the file and returns a DataSet with those values in the columns var1 and var2.
I am using the MiscUtil library, mentioned on SO multiple times; see also: will there be an update on MiscUtil for .Net 4?
Thanks a lot, Jon Skeet, for making it available. :)
I have the following code working; I am interested in better ideas on how to minimize the loops that read the file and populate the DataTable. Any suggestions?
private static DataSet parseBinaryFile(string filePath)
{
var result = new DataSet();
var table = result.Tables.Add("Data");
table.Columns.Add("Index", typeof(int));
table.Columns.Add("rain", typeof(float));
table.Columns.Add("gnum", typeof(float));
const int samplesCount = 259200; // 720 * 360
float[] vRain = new float[samplesCount];
float[] vStations = new float[samplesCount];
try
{
if (string.IsNullOrWhiteSpace(filePath) || !File.Exists(filePath))
{
throw new ArgumentException(string.Format("Unable to open the file: '{0}'", filePath));
}
// at this point FilePath is valid and exists...
using (FileStream fs = new FileStream(filePath, FileMode.Open))
{
// We are using the library found here: http://www.yoda.arachsys.com/csharp/miscutil/
var reader = new MiscUtil.IO.EndianBinaryReader(MiscUtil.Conversion.LittleEndianBitConverter.Little, fs);
int i = 0;
while (reader.BaseStream.Position < reader.BaseStream.Length) //while (pos < length)
{
// Read Data
float buffer = reader.ReadSingle();
if (i < samplesCount)
{
vRain[i] = buffer;
}
else
{
vStations[i-samplesCount] = buffer;
}
++i;
}
Console.WriteLine("number of reads was: {0}", (i/2).ToString("N0"));
}
for (int j = 0; j < samplesCount; ++j)
{
table.Rows.Add(new object[] { j + 1, vRain[j], vStations[j] });
}
}
catch (Exception exc)
{
Debug.WriteLine(exc.Message);
}
return result;
}
Option #1
Read the entire file into memory (or memory-map it) and loop once.
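For example, a minimal sketch of Option #1, reusing samplesCount and the table from the question. It assumes a little-endian host (true on x86/x64), so plain BitConverter matches the file's byte order:

byte[] bytes = File.ReadAllBytes(filePath);
for (int j = 0; j < samplesCount; j++)
{
    // var1 (rain) fills the first half of the file, var2 (gnum) the second half
    float rain = BitConverter.ToSingle(bytes, j * sizeof(float));
    float gnum = BitConverter.ToSingle(bytes, (samplesCount + j) * sizeof(float));
    table.Rows.Add(new object[] { j + 1, rain, gnum });
}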
Option #2
Add all the DataTable rows as you read the var1 section, with a placeholder value for var2; then fix up the table as you read the var2 section.
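And a sketch of Option #2, using the same reader and table as in the question:

// first pass: add every row, with a placeholder for gnum
for (int j = 0; j < samplesCount; j++)
{
    table.Rows.Add(new object[] { j + 1, reader.ReadSingle(), 0f });
}
// second pass: overwrite the placeholder as the var2 section streams in
for (int j = 0; j < samplesCount; j++)
{
    table.Rows[j]["gnum"] = reader.ReadSingle();
}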