I have 2 CSV files with 7 columns each.
CSV file 1 stores current or old data.
CSV file 2 stores the new data to be updated into CSV file 1.
I'd like to programmatically compare each row of the CSV files column by column, and if a change is detected, generate a SQL script that can be run to automatically update this data into CSV file 1.
E.g. if CSV file 1 has the string value "three" stored under column "number" with ID value 1, and CSV file 2 has the string value "zwei" stored under the same column with the same ID value, then CSV file 1's value of "three" should be changed to "zwei", but this has to be done via a programmatically generated SQL script.
Please assist...
I would load both files into SQL temp tables, process them line by line, and do the updates in SQL. Then overwrite CSV file 1 completely.
This is fast and easy.
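If it helps, here is a minimal C# sketch of the generation step itself: it compares the two files keyed on the first column and writes one UPDATE statement per changed cell. The file names, the MyTable target table, and the naive comma splitting (no quoted fields) are all assumptions to adapt.

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

class GenerateUpdateScript
{
    static void Main()
    {
        // Hypothetical paths and table name; the first CSV column is assumed to be the ID.
        var oldRows = ReadCsv("file1.csv");   // current data
        var newRows = ReadCsv("file2.csv");   // new data
        string[] headers = File.ReadLines("file1.csv").First().Split(',');

        using (var sql = new StreamWriter("update.sql"))
        {
            foreach (var pair in newRows)
            {
                if (!oldRows.TryGetValue(pair.Key, out var oldValues)) continue;

                // Compare every column except the ID column (index 0).
                for (int c = 1; c < headers.Length; c++)
                {
                    if (oldValues[c] != pair.Value[c])
                    {
                        sql.WriteLine(
                            $"UPDATE MyTable SET [{headers[c]}] = '{pair.Value[c].Replace("'", "''")}' " +
                            $"WHERE [{headers[0]}] = '{pair.Key}';");
                    }
                }
            }
        }
    }

    // Naive CSV parsing (no quoted commas); rows keyed by the first column (ID).
    static Dictionary<string, string[]> ReadCsv(string path) =>
        File.ReadLines(path)
            .Skip(1)
            .Select(line => line.Split(','))
            .ToDictionary(fields => fields[0], fields => fields);
}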
I have a high-scale distributed system which downloads a lot of large .csv files and indexes the data every day.
Let's say our file (file.csv) is:
col1 col2 col3
user11 val12 val13
user21 val22 val23
Then we read this file row-wise and store the byte offset of where the row for user11 or user21 is located in this file, e.g.:
Index table -
user11 -> 1120-2130 (byte offset)
user21 -> 2130-3545 (byte offset)
When someone says "delete the data for user11", we refer to this table, download and open the file, and delete those bytes from the file. Please note, this byte offset covers the entire row.
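For reference, the index described above is essentially a map from user ID to a byte range; a purely illustrative sketch (names and types are not from the actual system):

using System.Collections.Generic;

// Illustrative only: maps a user ID to the byte range of its row in file.csv.
public sealed class RowIndex
{
    private readonly Dictionary<string, (long Start, long End)> _ranges =
        new Dictionary<string, (long Start, long End)>();

    public void Add(string userId, long startByte, long endByte)
        => _ranges[userId] = (startByte, endByte);

    public bool TryGetRange(string userId, out (long Start, long End) range)
        => _ranges.TryGetValue(userId, out range);
}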
How can I design the system to process parquet files?
Parquet files are organized column-wise. To get an entire row of, say, 10 columns, will I have to make 10 calls? Then form an entire row, calculate the bytes, and then store them in the table?
Then, while deleting, will I again have to form the row and then delete the bytes?
The other option is to store the byte offset of each column instead and process column-wise, but that will blow up the index table.
How can parquet files be efficiently processed in a row-wise manner?
Current system is a background job in C#.
You can use Cinchoo ETL, an open source library, to convert CSV to a Parquet file easily.
string csv = @"Id,Name
1,Tom
2,Carl
3,Mark";

// Read the CSV text and write it out as a Parquet file.
using (var r = ChoCSVReader.LoadText(csv)
    .WithFirstLineHeader()
    )
{
    using (var w = new ChoParquetWriter("*** PARQUET FILE PATH ***"))
        w.Write(r);
}
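To read the Parquet file back for row-wise processing, the library also provides a ChoParquetReader; a minimal sketch of how that would typically look (treat the exact API as something to confirm in the article linked below):

// Read the Parquet file back record by record; each record behaves like a row.
using (var r = new ChoParquetReader("*** PARQUET FILE PATH ***"))
{
    foreach (dynamic row in r)
        Console.WriteLine($"{row.Id}: {row.Name}");
}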
For more information, please check the https://www.codeproject.com/Articles/5270332/Cinchoo-ETL-Parquet-Reader article.
Sample fiddle: https://dotnetfiddle.net/Ra8yf4
Disclaimer: I'm the author of this library.
I create an xls/xlsx file from C# using OLE DB (with Provider=Microsoft.ACE.OLEDB.12.0). The resulting table has 4 rows (for example). I open the file with Excel, add a 5th row and save the file. When I try to read it from C# over OLE DB with SELECT * FROM [table], I get only the original 4 rows, without the 5th. It seems the provider stores the number of rows somewhere in the XLS file and later reads only those, without the new data entered from Excel or LibreOffice. Is this a known problem, and can I solve it? If I create a new spreadsheet in Excel, all its rows are read from C#.
EDIT: I found some useful information. When the XLS file is first created from C#/OLE DB, there are 2 tables (sheets). If the table name is TABLE, DataTable sheets = conn.GetOleDbSchemaTable(OleDbSchemaGuid.Tables, null) will contain sheets.Rows[0] == "TABLE" and sheets.Rows[1] == "TABLE$". Excel shows only one sheet, "TABLE". After editing, the changes (the 5th row) exist only in the "TABLE$" sheet.
Are you adding the 5th row by code? If yes, could you please share the code lines you are using to do that. One of the following issues might be in your code:
The save/commit is not done properly.
The connection is not refreshed before reading the file.
I think I found the problem. It seems that the internal spreadsheet names created by Excel have a "$" sign at the end, while the sheet name generated by OLE DB is the exact string given in CREATE TABLE. On the other hand, Excel (and LibreOffice) show only one sheet for both the TABLE and TABLE$ sheets. If I edit the table in Excel, after saving, the changes are only in TABLE$; the other sheet, TABLE, is unchanged. When I do SELECT * FROM [TABLE], the result comes from the original OLE DB-generated table without the Excel changes. Now I enumerate the available sheets inside the XLS file, and if the first sheet name does not end with "$" and there is more than one sheet, I add "$" to the first sheet name and open the correct table. I suppose the connection string may include an option to work with the "$"-ending tables...
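A minimal sketch of that workaround, assuming an OleDbConnection to the workbook; the connection string, sheet name, and "Excel 8.0" properties are placeholders to adapt:

using System;
using System.Data;
using System.Data.OleDb;
using System.Linq;

// Prefer the "$"-suffixed sheet (where Excel saves its edits) when it exists.
string connStr =
    @"Provider=Microsoft.ACE.OLEDB.12.0;Data Source=C:\data\book.xls;" +
    @"Extended Properties=""Excel 8.0;HDR=YES""";      // placeholder path/properties

using (var conn = new OleDbConnection(connStr))
{
    conn.Open();
    DataTable sheets = conn.GetOleDbSchemaTable(OleDbSchemaGuid.Tables, null);

    string sheetName = "TABLE";                        // placeholder table name
    bool hasDollarSheet = sheets.Rows.Cast<DataRow>()
        .Any(r => (string)r["TABLE_NAME"] == sheetName + "$");

    string tableToRead = hasDollarSheet ? sheetName + "$" : sheetName;

    using (var cmd = new OleDbCommand($"SELECT * FROM [{tableToRead}]", conn))
    using (var reader = cmd.ExecuteReader())
    {
        while (reader.Read())
        {
            // process the row ...
        }
    }
}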
I have a table with 6 columns, one of which is a date column with a default value.
I want to import a 5-column CSV file, letting that date column take its default.
I get the error "Invalid character value for cast specification".
I also tried a format file, but it doesn't help: the number of table columns and CSV file columns do not match.
How can I fix this?
Create a View with the columns that match the data file, then BCP into the View. Make sure any other columns in the table will allow Null values and/or have Default values.
If you're still having issues:
You may need a format file to tell BCP the file-to-table (or view) mapping; see BCP Format Files.
I would generate the format file and edit it to see what BCP thinks the mapping is. BCP may be mapping one of your CSV file columns to the wrong field, or it's getting thrown off in some other way.
Just modify the file to map the correct CSV columns to the correct table or view columns and you should be good.
I need to read the data from a particular range in an Excel file and upload it to a database.
The required data does not start at cell A1; instead, it starts at A15, and A14 is the header row for the columns. There are seven columns with headers.
(I tried to read the cells via the "get_Range" option.)
We need to read the data in each cell and do a row-by-row update in the database.
There are thousands of files of the same type in a specific folder.
I am trying to do this as a C# console app because this is just a one-time job.
Here is the answer I found.
Step 1: Loop through each file in the source directory.
Step 2: Add an Excel Interop reference and create an Excel Application object, plus objects for the Workbook and the Range (for the used range).
Step 3: Use the get_Range() function and read the rows. (Since this solution is specific to one problem, the start and end ranges of the rows and columns are well known.)
Step 4: Each row that is read can be appended to a string until the end of the file, or the insert can be done after reading each row.
Step 5: Get the connection string and create a SqlConnection object to perform the insert. It is better to use a transaction and commit. (A sketch of these steps follows below.)
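A minimal sketch of steps 2 to 5, assuming the data sits in A15:G<last row> of the first worksheet; the file path, connection string, and MyTable column names are placeholders:

using System;
using System.Data.SqlClient;
using Excel = Microsoft.Office.Interop.Excel;

// Illustrative only: reads A15:G<lastRow> from one workbook and inserts row by row.
var excel = new Excel.Application { Visible = false };
Excel.Workbook wb = excel.Workbooks.Open(@"C:\data\sample.xlsx");   // placeholder path
Excel.Worksheet ws = (Excel.Worksheet)wb.Worksheets[1];

int lastRow = ((Excel.Range)ws.Cells[ws.Rows.Count, 1]).End[Excel.XlDirection.xlUp].Row;
Excel.Range range = ws.Range["A15", "G" + lastRow];
object[,] values = (object[,])range.Value2;                         // 1-based 2D array

using (var conn = new SqlConnection("*** CONNECTION STRING ***"))
{
    conn.Open();
    using (var tran = conn.BeginTransaction())
    {
        for (int r = 1; r <= values.GetLength(0); r++)
        {
            using (var cmd = new SqlCommand(
                "INSERT INTO MyTable (Col1, Col2, Col3, Col4, Col5, Col6, Col7) " +
                "VALUES (@p1, @p2, @p3, @p4, @p5, @p6, @p7)", conn, tran))
            {
                for (int c = 1; c <= 7; c++)
                    cmd.Parameters.AddWithValue("@p" + c, values[r, c] ?? DBNull.Value);
                cmd.ExecuteNonQuery();
            }
        }
        tran.Commit();
    }
}

wb.Close(false);
excel.Quit();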
Done. Thanks to all.
I have an old Excel file that I want to modify.
I need to filter rows from my old Excel file and write the selected rows to a new Excel file.
My plan is to read a single row, store it in a list, and pass this list to a function.
The function does some checking on the list; if the condition is satisfied, I will write the entire row to my new Excel file, else I will go back and read the next row.
I have not found anything to read a single row and save it in a list.
I am able to read cell by cell and do the condition checks, but this is very slow.
Is there a better option?
Read the entire input file into a custom-made object in one go, do your work on it, then write all of that data into the final Excel file in one go as well.
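A minimal sketch of that idea using Excel Interop: one Value2 read for the whole used range, filtering in memory, and one Value2 write for the output. The paths and the filter condition are placeholders.

using System.Collections.Generic;
using Excel = Microsoft.Office.Interop.Excel;

var excel = new Excel.Application { Visible = false };

// Read the whole used range of the source sheet in a single call.
Excel.Workbook source = excel.Workbooks.Open(@"C:\data\old.xls");   // placeholder path
Excel.Worksheet srcSheet = (Excel.Worksheet)source.Worksheets[1];
object[,] data = (object[,])srcSheet.UsedRange.Value2;              // 1-based 2D array

int rows = data.GetLength(0), cols = data.GetLength(1);
var kept = new List<object[]>();
for (int r = 1; r <= rows; r++)
{
    // Placeholder condition: keep rows whose first cell is not empty.
    if (data[r, 1] != null && data[r, 1].ToString().Length > 0)
    {
        var row = new object[cols];
        for (int c = 1; c <= cols; c++) row[c - 1] = data[r, c];
        kept.Add(row);
    }
}

// Copy the surviving rows into a rectangular array and write it in one assignment.
var output = new object[kept.Count, cols];
for (int r = 0; r < kept.Count; r++)
    for (int c = 0; c < cols; c++)
        output[r, c] = kept[r][c];

Excel.Workbook target = excel.Workbooks.Add();
Excel.Worksheet tgtSheet = (Excel.Worksheet)target.Worksheets[1];
tgtSheet.Range["A1"].Resize[kept.Count, cols].Value2 = output;

target.SaveAs(@"C:\data\new.xlsx");                                 // placeholder path
source.Close(false);
target.Close(false);
excel.Quit();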