How to read tab delimited lines by skipping alternate lines - c#

I can currently parse and extract data from a large tab-delimited file. I read the file line by line, split each line, and add the items to my DataTable (the row limit adds 3 rows at a time). I need to skip the even lines, i.e. read the first tab-delimited line that has the maximum number of columns, skip the second line, and go straight to the third.
My tab-delimited source file format:
001Mean 26.975 1.1403 910.45
001Stdev 26.975 1.1403 910.45
002Mean 26.975 1.1403 910.45
002Stdev 26.975 1.1403 910.45
I need to skip the Stdev lines.
C# code:
Getting the maximum number of items in a tab-delimited line of the file by splitting each line:
using (var reader = new StreamReader(sourceFileFullName))
{
string line = null;
line = reader.ReadToEnd();
if (!string.IsNullOrEmpty(line))
{
var list_with_max_cols = line.Split('\n').OrderByDescending(y => y.Split('\t').Count()).Take(1);
foreach (var value in list_with_max_cols)
{
var values = value.ToString().Split(new[] { '\t', '\n' }).ToArray();
MAX_NO_OF_COLUMNS = values.Length;
}
}
}
Reading the file line by line, treating the first line that has the maximum number of columns as the header, then parsing and extracting the rest:
using (var reader = new StreamReader(sourceFileFullName))
{
string new_read_line = null;
//Read and display lines from the file until the end of the file is reached.
while ((new_read_line = reader.ReadLine()) != null)
{
var items = new_read_line.Split(new[] { '\t', '\n' }).ToArray();
if (items.Length != MAX_NO_OF_COLUMNS)
continue;
//when reach first line it is column list need to create datatable based on that.
if (firstLineOfFile)
{
columnData = new_read_line;
firstLineOfFile = false;
continue;
}
if (firstLineOfChunk)
{
firstLineOfChunk = false;
chunkDataTable = CreateEmptyDataTable(columnData);
}
AddRow(chunkDataTable, new_read_line);
chunkRowCount++;
if (chunkRowCount == _chunkRowLimit)
{
firstLineOfChunk = true;
chunkRowCount = 0;
yield return chunkDataTable;
chunkDataTable = null;
}
}
}
Creating Data Table:
private DataTable CreateEmptyDataTable(string firstLine)
{
IList<string> columnList = Split(firstLine);
var dataTable = new DataTable("TableName");
for (int columnIndex = 0; columnIndex < columnList.Count; columnIndex++)
{
string c_string = columnList[columnIndex];
if (Regex.Match(c_string, "\\s").Success)
{
string tmp = Regex.Replace(c_string, "\\s", "");
string finaltmp = Regex.Replace(tmp, @" ?\[.*?\]", ""); // Strip the brackets [] and anything inside them
columnList[columnIndex] = finaltmp;
}
}
dataTable.Columns.AddRange(columnList.Select(v => new DataColumn(v)).ToArray());
dataTable.Columns.Add("ID");
return dataTable;
}
How can I skip alternate lines while reading, then split the remaining lines and add them to my DataTable?
AddRow function: I managed to meet my requirement with the following change:
private void AddRow(DataTable dataTable, string line)
{
if (line.Contains("Stdev"))
{
return;
}
else
{
//Rest of Code
}
}

Since each line contains tab-separated values, how about reading only the odd lines and splitting them into arrays? This is just a sample; you can expand on it.
Test data (file.txt)
luck is when opportunity meets preparation
this line needs to be skipped
microsoft visual studio
another line to be skipped
let us all code
Code
var oddLines = File.ReadLines(@"C:\projects\file.txt").Where((item, index) => index % 2 == 0);
foreach (var line in oddLines)
{
var words = line.Split('\t');
}
EDIT
To get lines that don't contain 'Stdev'
var filteredLines = System.IO.File.ReadLines(@"C:\projects\file.txt").Where(item => !item.Contains("Stdev"));
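The filter can also be applied to the label column only, which may be safer if a data value could ever contain the text "Stdev". The following is a minimal sketch, not the asker's actual code: the MeanRowReader class and ReadMeanRows method are hypothetical names, and it assumes the label (for example 001Mean or 001Stdev) is always the first tab-separated field.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class MeanRowReader
{
    // Hypothetical helper: yields the tab-split fields of every line
    // whose first field does not end with "Stdev".
    public static IEnumerable<string[]> ReadMeanRows(TextReader reader)
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            if (line.Length == 0)
                continue; // ignore blank lines

            string[] fields = line.Split('\t');

            // Check the label column only, so a value elsewhere in the row
            // cannot cause the row to be dropped by accident.
            if (fields[0].EndsWith("Stdev", StringComparison.Ordinal))
                continue;

            yield return fields;
        }
    }
}
```

It could be called with a StreamReader opened on sourceFileFullName; because the method streams line by line, the whole file never has to be held in memory.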

Change
using (var reader = new StreamReader(sourceFileFullName))
{
string new_read_line = null;
//Read and display lines from the file until the end of the file is reached.
while ((new_read_line = reader.ReadLine()) != null)
{
var items = new_read_line.Split(new[] { '\t', '\n' }).ToArray();
if (items.Length != MAX_NO_OF_COLUMNS)
continue;
To
using (var reader = new StreamReader(sourceFileFullName))
{
int cnt = 0;
string new_read_line = null;
//Read and display lines from the file until the end of the file is reached.
while ((new_read_line = reader.ReadLine()) != null)
{
cnt++;
if(cnt % 2 == 0)
continue;
var items = new_read_line.Split(new[] { '\t', '\n' }).ToArray();
if (items.Length != MAX_NO_OF_COLUMNS)
continue;

Related

Fill a MDF Database with CSV data

This is what my CSV data looks like:
Artistname;RecordTitle;RecordType;Year;SongTitle
999;Concrete;LP;1981;Mercy Mercy
999;Concrete;LP;1981;Public Enemy No.1
999;Concrete;LP;1981;So Greedy
999;Concrete;LP;1981;Taboo
10cc;Bloody Tourists;LP;1978;Dreadlock Holiday
10cc;Bloody Tourists;LP;1978;Everyhing You've Ever Wanted To Know About!!!
10cc;Bloody Tourists;LP;1978;Shock On The Tube
This is my code where I save this data in the Database:
private void FillDatabase()
{
var firstTime = true;
var lines = File.ReadAllLines("musicDbData.csv");
var list = new List<string>();
foreach (var line in lines)
{
var split = line.Split(";");
if (!firstTime)
{
var artist = new Artist()
{
ArtistName = split[0],
};
db.Artists.Add(artist);
db.SaveChanges();
}
else
{
firstTime = false;
}
}
}
The problem is that every artist should be in the database only once. Right now artist 999 appears 4 times and 10cc appears 3 times; if everything were correct, there would be only one row for 999 and one row for 10cc. What do I have to add to my code to get the expected result?
First, CSV stands for comma-separated values; your file uses semicolons as the delimiter instead.
Also, String.Split accepts a char parameter, so change the call to line.Split(';').
Your CSV file also starts with a column-name line, which you need to skip when reading the file.
"if everything is correct there should only be one row for 999 and one row for 10cc"
Do you want to save only the first row for 999 and 10cc to the database? If so, you can use LINQ to check whether the Artistname already exists in the database before adding it.
private void FillDatabase()
{
var lines = File.ReadAllLines("musicDbData.csv");
int count = 0; // line count
foreach (var line in lines)
{
count++;
if (count == 1) // remove first line
continue;
var split = line.Split(';');
string artistname = split[0];
var artistIndb = db.ArtistTables
.Where(c => c.Artistname == artistname)
.SingleOrDefault();
if (artistIndb == null) // check if exists, if not ...
{
var artist = new ArtistTable()
{
Artistname = split[0],
SongTitle = split[4]
};
db.ArtistTables.Add(artist);
db.SaveChanges();
}
}
}
If you want to merge lines with the same Artistname, you can refer to the following code.
if (artistIndb == null)
{
// code omitted
// ...
}
else
{
artistIndb.SongTitle += ", " + split[4]; // Append this song to the SongTitle column
db.SaveChanges(); // Let failures surface instead of swallowing them in an empty catch
}
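If only one row per artist is needed, the duplicates can also be removed in memory before anything is sent to the database, which avoids one query per CSV line. This is a minimal sketch under stated assumptions: ArtistImport and DistinctArtists are hypothetical names, the artist name is assumed to be the first semicolon-separated field, and names are compared case-insensitively.

```csharp
using System;
using System.Collections.Generic;

public static class ArtistImport
{
    // Hypothetical helper: returns each distinct artist name from the
    // semicolon-separated lines, in first-seen order, skipping the header.
    public static List<string> DistinctArtists(IEnumerable<string> lines)
    {
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        var result = new List<string>();
        bool isHeader = true;

        foreach (string line in lines)
        {
            if (isHeader)
            {
                isHeader = false; // the first line holds the column names
                continue;
            }
            if (string.IsNullOrWhiteSpace(line))
                continue;

            string name = line.Split(';')[0].Trim();

            // HashSet.Add returns false when the name was already present.
            if (name.Length > 0 && seen.Add(name))
                result.Add(name);
        }
        return result;
    }
}
```

Each returned name could then be added as one Artist entity, with a single db.SaveChanges() call after the loop. This does not check for artists that are already stored in the database from an earlier run.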

How to read and separate segments of a txt file?

I have a txt file that has headers followed by 3 columns of values, e.g.:
Description=null
area = 100
1,2,3
1,2,4
2,1,5 ...
... 1,2,1//(these are the values that I need in one list)
Then another segment
Description=null
area = 10
1,2,3
1,2,4
2,1,5 ...
... 1,2,1//(these are the values that I need in one list).
In fact, I just need one list per "table" of values. The values are always in 3 columns, but there are n segments. Any ideas?
Thanks!
List<double> VMM40xyz = new List<double>();
foreach (var item in VMM40blocklines)
{
if (item.Contains(','))
{
VMM40xyz.AddRange(item.Split(',').Select(double.Parse).ToList());
}
}
I tried this, but it only puts all the values into one big list.
It looks like you want your data to end up in a format like this:
public class SetOfData //Feel free to name these parts better.
{
public string Description = "";
public string Area = "";
public List<double> Data = new List<double>();
}
...stored somewhere in...
List<SetOfData> finalData = new List<SetOfData>();
So, here's how I'd read that in:
public static List<SetOfData> ReadCustomFile(string Filename)
{
if (!File.Exists(Filename))
{
throw new FileNotFoundException($"{Filename} does not exist.");
}
List<SetOfData> returnData = new List<SetOfData>();
SetOfData currentDataSet = null;
using (FileStream fs = new FileStream(Filename, FileMode.Open))
{
using (StreamReader reader = new StreamReader(fs))
{
while (!reader.EndOfStream)
{
string line = reader.ReadLine();
//This will start a new object on every 'Description' line.
if (line.Contains("Description="))
{
//Save off the old data set if there is one.
if (currentDataSet != null)
returnData.Add(currentDataSet);
currentDataSet = new SetOfData();
//Now, to make sure there is something after "Description=" and to set the Description if there is.
//Your example data used "null" here, which this will take literally to be a string containing the letters "null". You can check the contents of parts[1] inside the if block to change this.
string[] parts = line.Split('=');
if (parts.Length > 1)
currentDataSet.Description = parts[1].Trim();
}
else if (line.Contains("area = "))
{
//Just in case your file didn't start with a "Description" line for some reason.
if (currentDataSet == null)
currentDataSet = new SetOfData();
//And then we do some string splitting like we did for Description.
string[] parts = line.Split('=');
if (parts.Length > 1)
currentDataSet.Area = parts[1].Trim();
}
else
{
//Just in case your file didn't start with a "Description" line for some reason.
if (currentDataSet == null)
currentDataSet = new SetOfData();
string[] parts = line.Split(',');
foreach (string part in parts)
{
if (double.TryParse(part, out double number))
{
currentDataSet.Data.Add(number);
}
}
}
}
//Make sure to add the last set, unless the file contained no data at all.
if (currentDataSet != null)
returnData.Add(currentDataSet);
}
}
return returnData;
}

New line within CSV column causing issue

I have a large CSV file with millions of rows. Sample lines:
CODE,COMPANY NAME, DATE, ACTION
A,My Name , LLC,2018-01-28,BUY
B,Your Name , LLC,2018-01-25,SELL
C,
All Name , LLC,2018-01-21,SELL
D,World Name , LLC,2018-01-20,BUY
Row C contains a new line, but it is actually the same record. I want to remove the newline character inside the cell/field/column.
I tried \r\n, Environment.NewLine and many other things, but could not make it work.
Here is my code:
private DataTable CSToDataTable(string csvfile)
{
Int64 row = 0;
try
{
string CSVFilePathName = csvfile; //@"C:\test.csv";
string[] Lines = File.ReadAllLines(CSVFilePathName.Replace(Environment.NewLine, ""));
string[] Fields;
Fields = Lines[0].Split(new char[] { ',' });
int Cols = Fields.GetLength(0);
DataTable dt = new DataTable();
//1st row must be column names; force lower case to ensure matching later on.
for (int i = 0; i < Cols; i++)
dt.Columns.Add(Fields[i].ToLower(), typeof(string));
DataRow Row;
for (row = 1; row < Lines.GetLength(0); row++)
{
Fields = Lines[row].Split(new char[] { ',' });
Row = dt.NewRow();
//Console.WriteLine(row);
for (int f = 0; f < Cols; f++)
{
Row[f] = Fields[f];
}
dt.Rows.Add(Row);
if (row == 190063)
{
}
}
return dt;
}
catch (Exception ex)
{
throw ex;
}
}
How can I remove the newline character and read the row correctly? Per the business requirement, I can't skip such rows.
Your CSV file is not in a valid format. To parse and load it successfully, you will have to sanitize it. There are a couple of issues:
The COMPANY NAME column contains the field separator. Fix this by surrounding the value with quotes.
A new line inside a CSV value. This can be fixed by combining adjacent rows into one.
With Cinchoo ETL, you can sanitize and load your large file as shown below:
string csv = @"CODE,COMPANY NAME, DATE, ACTION
A,My Name , LLC,2018-01-28,BUY
B,Your Name , LLC,2018-01-25,SELL
C,
All Name , LLC,2018-01-21,SELL
D,World Name , LLC,2018-01-20,BUY";
string bufferLine = null;
var reader = ChoCSVReader.LoadText(csv)
.WithFirstLineHeader()
.Setup(s => s.BeforeRecordLoad += (o, e) =>
{
string line = (string)e.Source;
string[] tokens = line.Split(",");
if (tokens.Length == 5)
{
//Fix the second and third value with quotes
e.Source = @"{0},""{1},{2}"",{3}, {4}".FormatString(tokens[0], tokens[1], tokens[2], tokens[3], tokens[4]);
}
else
{
//Fix the breaking lines, assume that some csv lines broken into max 2 lines
if (bufferLine == null)
{
bufferLine = line;
e.Skip = true;
}
else
{
line = bufferLine + line;
tokens = line.Split(",");
e.Source = @"{0},""{1},{2}"",{3}, {4}".FormatString(tokens[0], tokens[1], tokens[2], tokens[3], tokens[4]);
bufferLine = null; //Reset the buffer so the next broken record starts fresh
}
}
});
foreach (var rec in reader)
Console.WriteLine(rec.Dump());
//Careful to load millions rows into DataTable
//var dt = reader.AsDataTable();
Hope it helps.
You haven't made clear under which conditions an unwanted new line can appear in the file. Assuming that a 'proper' line in the CSV file does NOT end with a comma, and that a line ending with a comma is therefore incomplete, you could do something like this:
static void Main(string[] args)
{
string path = #"CSVFile.csv";
List<CSVData> data = new List<CSVData>();
using (FileStream fs = new FileStream(path, FileMode.Open, FileAccess.Read))
{
using (StreamReader sr = new StreamReader(fs))
{
sr.ReadLine(); // Header
while (!sr.EndOfStream)
{
var line = sr.ReadLine();
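// Stop at end of file too; otherwise a trailing comma on the last line loops forever.
while (line.EndsWith(",") && !sr.EndOfStream)
{
line += sr.ReadLine();
}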
var items = line.Split(new string[] { "," }, StringSplitOptions.None);
data.Add(new CSVData() { CODE = items[0], NAME = items[1], COMPANY = items[2], DATE = items[3], ACTION = items[4] });
}
}
}
Console.ReadLine();
}
public class CSVData
{
public string CODE { get; set; }
public string NAME { get; set; }
public string COMPANY { get; set; }
public string DATE { get; set; }
public string ACTION { get; set; }
}
Obviously there's a lot of error handling to be done here (for example, when creating a new CSVData object make sure your items contain all the data you want), but I think this is the start you need.
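A variation on the same idea is to keep joining physical lines until the record has the expected number of comma-separated tokens, rather than relying on a trailing comma. The sketch below is an illustration under assumptions, not a general CSV parser: BrokenLineMerger and MergeRecords are hypothetical names, and it assumes every complete data record splits into a known number of tokens (5 in the sample, because the unquoted company name contributes one extra comma).

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class BrokenLineMerger
{
    // Hypothetical helper: joins physical lines until the buffered text
    // holds at least expectedTokens comma-separated tokens.
    public static IEnumerable<string> MergeRecords(TextReader reader, int expectedTokens)
    {
        string buffer = null;
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            // Append the new physical line to any pending partial record.
            buffer = buffer == null ? line : buffer + line;

            if (buffer.Split(',').Length >= expectedTokens)
            {
                yield return buffer;
                buffer = null;
            }
        }

        // Emit whatever is left so a truncated final record is not lost silently.
        if (buffer != null)
            yield return buffer;
    }
}
```

The header line should be read separately first, since in the sample it has only 4 tokens. A break that falls inside the last field would not be detected by this count, so that case would still need its own handling.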

Import two CSV, add specific columns from one CSV and import changes to new CSV (C#)

I have to import 2 CSVs.
CSV 1 [49]: contains about 50 tab-separated columns.
CSV 2 [2]: contains 3 columns, which should replace the values at positions [3], [6] and [11] of my first CSV.
So here's what I do:
1) Import the CSV and split it into an array.
string employeedatabase = "MYPATH";
List<String> status = new List<String>();
StreamReader file2 = new System.IO.StreamReader(filename);
string line = file2.ReadLine();
while ((line = file2.ReadLine()) != null)
{
string[] ud = line.Split('\t');
status.Add(ud[0]);
}
String[] ud_status = status.ToArray();
PROBLEM 1: I have about 50 columns to handle, and ud_status is just the first. Do I need 50 lists and 50 string arrays?
2) Import the second CSV and split it into an array.
List<String> vorname = new List<String>();
List<String> nachname = new List<String>();
List<String> username = new List<String>();
StreamReader file = new System.IO.StreamReader(employeedatabase);
string line3 = file.ReadLine();
while ((line3 = file.ReadLine()) != null)
{
string[] data = line3.Split(';');
vorname.Add(data[0]);
nachname.Add(data[1]);
username.Add(data[2]);
}
String[] db_vorname = vorname.ToArray();
String[] db_nachname = nachname.ToArray();
String[] db_username = username.ToArray();
PROBLEM 2: After loading these two CSVs, I don't know how to combine them and change the columns as mentioned above.
Something like this?
mynewArray = ud_status + "\t" + ud_xy[..n] + "\t" + changed_colum + ud_xy[..n];
Then save "mynewArray" to a tab-separated CSV with UTF-8 encoding.
To read the file into a meaningful format, you should set up a class that defines the format of your CSV:
public class CsvRow
{
public string vorname { get; set; }
public string nachname { get; set; }
public string username { get; set; }
public CsvRow (string[] data)
{
vorname = data[0];
nachname = data[1];
username = data[2];
}
}
Then populate a list of this:
List<CsvRow> rows = new List<CsvRow>();
StreamReader file = new System.IO.StreamReader(employeedatabase);
string line3 = file.ReadLine();
while ((line3 = file.ReadLine()) != null)
{
rows.Add(new CsvRow(line3.Split(';')));
}
Similarly format your other CSV and include unused properties for the new fields. Once you have loaded both, you can populate the new properties from this list in a loop, matching the records by whatever common field the CSVs hopefully share. Then finally output the resulting data to a new CSV file.
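If the two files share no key and simply line up row by row, the replacement step can be sketched on plain string arrays as follows. This is an illustration under assumptions: ColumnMerger and ReplaceColumns are hypothetical names, the target positions are treated as zero-based indexes, the first file is tab-separated and the second semicolon-separated (as in the question), and both inputs have had their header lines removed.

```csharp
using System;
using System.Collections.Generic;

public static class ColumnMerger
{
    // Hypothetical helper: for each pair of rows, copies replacement field k
    // into the main row at position targetIndexes[k] and returns the
    // result as tab-separated lines.
    public static List<string> ReplaceColumns(
        IList<string> mainLines,        // tab-separated rows
        IList<string> replacementLines, // semicolon-separated rows
        int[] targetIndexes)
    {
        if (mainLines.Count != replacementLines.Count)
            throw new ArgumentException("Both files must have the same number of rows.");

        var output = new List<string>(mainLines.Count);
        for (int row = 0; row < mainLines.Count; row++)
        {
            string[] fields = mainLines[row].Split('\t');
            string[] replacement = replacementLines[row].Split(';');

            for (int k = 0; k < targetIndexes.Length; k++)
                fields[targetIndexes[k]] = replacement[k];

            output.Add(string.Join("\t", fields));
        }
        return output;
    }
}
```

The result could be written with File.WriteAllLines, passing new UTF8Encoding(false) as the encoding, and the indexes from the question would be passed as new[] { 3, 6, 11 }. If the files do share a key column, matching on that key is more robust than matching by row position.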
The solution is not to use string arrays for this; that will just drive you crazy. It's better to use the System.Data.DataTable object.
I didn't get a chance to test the LINQ lambda expression at the end of this (or really any of it; I wrote this on a break), but it should get you on the right track.
using (var ds = new System.Data.DataSet("My Data"))
{
ds.Tables.Add("File0");
ds.Tables.Add("File1");
string[] line;
using (var reader = new System.IO.StreamReader("FirstFile"))
{
//first we get columns for table 0
foreach (string s in reader.ReadLine().Split('\t'))
ds.Tables["File0"].Columns.Add(s);
string raw;
//check for null before splitting; ReadLine returns null at end of file
while ((raw = reader.ReadLine()) != null)
{
line = raw.Split('\t');
//and now the rest of the data.
var r = ds.Tables["File0"].NewRow();
for (int i = 0; i < line.Length; i++)
{
r[i] = line[i];
}
ds.Tables["File0"].Rows.Add(r);
}
}
//we could probably do these in a loop or a second method,
//but you may want subtle differences, so for now we just do it the same way
//for file1
using (var reader2 = new System.IO.StreamReader("SecondFile"))
{
foreach (string s in reader2.ReadLine().Split('\t'))
ds.Tables["File1"].Columns.Add(s);
string raw2;
//check for null before splitting; ReadLine returns null at end of file
while ((raw2 = reader2.ReadLine()) != null)
{
line = raw2.Split('\t');
//and now the rest of the data.
var r = ds.Tables["File1"].NewRow();
for (int i = 0; i < line.Length; i++)
{
r[i] = line[i];
}
ds.Tables["File1"].Rows.Add(r);
}
}
//you now have these in functioning datatables. Because we named columns,
//you can call them by name specifically, or by index, to replace in the first datatable.
string[] columnsToReplace = new string[] { "firstColumnName", "SecondColumnName", "ThirdColumnName" };
for(int i = 0; i < ds.Tables[0].Rows.Count; i++)
{
//you didn't give a sign of any relation between the two tables
//so this is just by row, and assumes the row count is equivalent.
//This is also not advised.
//if there is a key these sets of data share
//you should join on them instead.
DataRow dr = ds.Tables[0].Rows[i];
dr[3] = ds.Tables[1].Rows[i][columnsToReplace[0]];
dr[6] = ds.Tables[1].Rows[i][columnsToReplace[1]];
dr[11] = ds.Tables[1].Rows[i][columnsToReplace[2]];
}
//ds.Tables[0] now has the output you want.
string output = String.Empty;
foreach (var s in ds.Tables[0].Columns)
output = String.Concat(output, s ,"\t");
output = String.Concat(output, Environment.NewLine); // columns ready, now the rows.
foreach (DataRow r in ds.Tables[0].Rows)
output = string.Concat(output, string.Join("\t", r.ItemArray), Environment.NewLine);
if(System.IO.File.Exists("MYPATH"))
using (System.IO.StreamWriter file = new System.IO.StreamWriter("MYPATH")) //or a variable instead of string literal
{
file.Write(output);
}
}
With Cinchoo ETL, an open-source file helper library, you can merge the CSV files as shown below. This assumes the 2 CSV files contain the same number of lines.
string CSV1 = @"Id Name City
1 Tom New York
2 Mark FairFax";
string CSV2 = @"Id City
1 Las Vegas
2 Dallas";
dynamic rec1 = null;
dynamic rec2 = null;
StringBuilder csv3 = new StringBuilder();
using (var csvOut = new ChoCSVWriter(new StringWriter(csv3))
.WithFirstLineHeader()
.WithDelimiter("\t")
)
{
using (var csv1 = new ChoCSVReader(new StringReader(CSV1))
.WithFirstLineHeader()
.WithDelimiter("\t")
)
{
using (var csv2 = new ChoCSVReader(new StringReader(CSV2))
.WithFirstLineHeader()
.WithDelimiter("\t")
)
{
while ((rec1 = csv1.Read()) != null && (rec2 = csv2.Read()) != null)
{
rec1.City = rec2.City;
csvOut.Write(rec1);
}
}
}
}
Console.WriteLine(csv3.ToString());
Hope it helps.
Disclaimer: I'm the author of this library.

How to skip headline in csv data when reading from StreamReader?

EDITED:
I have the following code:
private void button1_Click_1(object sender, EventArgs e)
{
var date = new List<String>();
var value = new List<Double>();
string dir = #"C:\Main\test.csv";
using (var reader = new System.IO.StreamReader(dir))
{
var lines = File.ReadLines(dir)
.Skip(1);//Ignore the first line
foreach (var line in lines)
{
var fields = line.Split(new Char[] { ';' }, StringSplitOptions.RemoveEmptyEntries);
date.Add(fields[0]);
if (fields.Length > 1)
value.Add(Convert.ToDouble(fields[1]));
}
String[] _date = date.ToArray();
Double[] _value = value.ToArray();
chart1.Series["Test"].Points.DataBindXY(_date,_value);
chart1.Series["Test"].ChartType = SeriesChartType.Spline;
}
}
Now I want to skip the header line of the CSV data, that is, the first row of the first column and the first row of the second column. How do I do that?
The headers are strings. When there are no headers, it skips the first row, but with headers I get a System.FormatException.
It fails when the first row contains Date in the first column and Value in the second column, like this (opened with a text editor):
"Date";"Value"
"20.04.2010";"82.6619508214314"
"21.04.2010";"33.2262968571519"
"22.04.2010";"25.0174973120814"
Why not just start by reading one line, and doing nothing with it?
using (var reader = new System.IO.StreamReader(dir))
{
reader.ReadLine(); // skip first
string line;
while ((line = reader.ReadLine()) != null)
{
}
}
Add one reader.ReadLine() call before the while loop:
using (var reader = new System.IO.StreamReader(dir))
{
if (reader.ReadLine() != null) //read first line
{
string line;
while ((line = reader.ReadLine()) != null) //read following lines
{
}
}
}
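If the file really contains the double quotes shown in the sample, skipping the header alone may not be enough, because Convert.ToDouble also throws a FormatException on a value that is still wrapped in quotes. The following is a minimal sketch under assumptions: QuotedCsvParser and Parse are hypothetical names, fields are assumed to contain no embedded semicolons or escaped quotes, and the decimal separator is assumed to be a dot as in the sample.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class QuotedCsvParser
{
    // Hypothetical helper: parses lines of the form "date";"value",
    // skipping the header and any row whose value is not numeric.
    public static List<KeyValuePair<string, double>> Parse(IEnumerable<string> lines)
    {
        var points = new List<KeyValuePair<string, double>>();
        bool isHeader = true;

        foreach (string line in lines)
        {
            if (isHeader)
            {
                isHeader = false; // the first line holds the column titles
                continue;
            }

            string[] fields = line.Split(';');
            if (fields.Length < 2)
                continue;

            // Strip the surrounding double quotes before parsing.
            string date = fields[0].Trim('"');
            string raw = fields[1].Trim('"');

            // InvariantCulture treats '.' as the decimal separator regardless
            // of the machine's regional settings.
            double value;
            if (double.TryParse(raw, NumberStyles.Float, CultureInfo.InvariantCulture, out value))
                points.Add(new KeyValuePair<string, double>(date, value));
        }
        return points;
    }
}
```

The keys and values of the returned list could then be copied into the two arrays passed to DataBindXY. Using TryParse rather than Convert.ToDouble means a malformed row is skipped instead of aborting the whole load; whether that is desirable depends on the data.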
