My program (a console app) basically reads a very big CSV file and processes it. There are columns in the file that I feel can be grouped together and are best served by a class.
For example, the first line contains the titles and the second line onward contains the values; each column has this structure. So I need to group the title, the location of the column, and the values. The easiest way is to create a class.
this is what the data look like:
title1, title2, title3, ...
1,1,2, ...
20,30,5000,...
.
.
.
class tt
{
    string title;
    int column;
    List<int> val = new List<int>();
}
But the problem is there are some 1,000 columns, which translates to 1,000 objects. Is this a good approach? I'm not sure.
A class with 1000 members would sound.... unusual, to be blunt. Since it is unlikely that the code is going to be referring to any of those by name, I would say it would be self-defeating to create members per-value. But for "should I create a class" - well, you don't have many options - it certainly would make a very bad struct.

Actually, I suspect this may be a fair scenario for DataTable - it is not something I usually recommend, but for the data you are describing, it will probably do the job fine. Yes, it has overheads - but it optimises away a number of issues - for example, by storing data internally in typed columns (rather than rows, as you might expect), it avoids having to box all the values during storage (although they still tend to get boxed during access, those boxes are collected during gen-0, so are cheap).
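For example, a minimal sketch of that DataTable approach, assuming the first CSV line holds the titles and every value parses as an int (the method name and file handling here are illustrative, not a specific recommendation):

using System.Data;
using System.IO;
using System.Linq;

static class CsvToDataTable
{
    public static DataTable Load(string path)
    {
        var table = new DataTable("CsvData");

        using (var reader = new StreamReader(path))
        {
            // First line: each title becomes a typed int column, so values are stored unboxed.
            foreach (var title in reader.ReadLine().Split(','))
                table.Columns.Add(title.Trim(), typeof(int));

            // Remaining lines: one DataRow per CSV row.
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                var values = line.Split(',')
                                 .Select(v => (object)int.Parse(v.Trim()))
                                 .ToArray();
                table.Rows.Add(values);
            }
        }
        return table;
    }
}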
"Thousands" are low numbers in most computing scenarios.
Note: if you actually mean 1000 rows, and the columns are actually sane (say, less than 50, all meaningful), then I would say "sure, create a class that maps the data" - for example something like:
class Customer {
    public int Id { get; set; }
    public string Name { get; set; }
    //...
}
There are several tools that can help you read this from CSV into the objects, typically into a List<Customer> or IEnumerable<Customer>.
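A library such as CsvHelper can do that mapping for you; as a rough no-library sketch, assuming a simple comma-separated file with a header row and no quoted fields:

using System.Collections.Generic;
using System.IO;
using System.Linq;

static class CsvLoader
{
    // Uses the Customer class shown above; skips the header line and maps each row.
    public static IEnumerable<Customer> Load(string path)
    {
        return File.ReadLines(path)
                   .Skip(1)
                   .Select(line => line.Split(','))
                   .Select(parts => new Customer
                   {
                       Id = int.Parse(parts[0]),
                       Name = parts[1]
                   });
    }
}

Then something like CsvLoader.Load("customers.csv").ToList() (path hypothetical) gives you the List<Customer> mentioned above.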
but the problem is there are some 1,000 columns , which translate to 1,000 objects. is this a good approach
Yes, translating to 1,000 objects (if you need them) should not bother you.
But for so many columns, creating classes is a hell of a lot of work. I wouldn't do that unless it is absolutely necessary. You can use a DataTable.
If your file is a big CSV file, I suggest you load it into a DataSet. I don't see any benefit in using a class for each column.
The CSV looks like a transposed regular data table, so each column is actually a row.
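To illustrate, a minimal sketch of reading such a file into one object per column - essentially the tt class from the question, renamed here, and assuming every value parses as an int:

using System.Collections.Generic;
using System.IO;
using System.Linq;

class ColumnData
{
    public string Title;
    public int Column;                         // zero-based position in the file
    public List<int> Values = new List<int>();
}

static class TransposedCsvReader
{
    public static List<ColumnData> Read(string path)
    {
        var lines = File.ReadLines(path).ToList();

        // One object per column; ~1,000 small objects is not a problem in itself.
        var columns = lines[0].Split(',')
            .Select((title, index) => new ColumnData { Title = title.Trim(), Column = index })
            .ToList();

        foreach (var line in lines.Skip(1))
        {
            var parts = line.Split(',');
            for (int i = 0; i < columns.Count && i < parts.Length; i++)
                columns[i].Values.Add(int.Parse(parts[i].Trim()));
        }
        return columns;
    }
}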
Related
I have a program that creates a list of objects from a file, and also creates a list of the same type of object from the database, but with fewer (and some different) properties, like:
List from FILE: Address ID, Address, City, State, Zip, other important properties
List from DB: Address ID, Address, City, State
I have implemented IEquatable on this CustObj so that it only compares against Address, City, and State, in the hopes of doing easy comparisons between the two lists.
The ultimate goal is to get the address ID from the database and update the address IDs for each address in the list of objects from the file. These two lists could have quite a lot of objects (over 1,000,000) so I want it to be fast.
The alternative is to offload this to the database and have the DB return the info we need. If that would be significantly faster/more resource efficient, I will go that route, but I want to see if it can be done quickly and efficiently in code first.
Anyways, I see there's a Zip method. I was wondering if I could use that to say "if there's a match between the two lists, keep the data in list 1 but update the address id property of each object in list 1 to the address Id from list 2".
Is that possible?
The answer is, it really depends. There are a lot of parameters you haven't mentioned.
The only way to be sure is to build one solution (preferably using the Zip method, since it involves less work) and, if it works within the parameters of your requirements (time, memory footprint, or any other parameter), you can stop there.
Otherwise you have to offload it to the database. Mind you, you would have to hold the 1 million records from the file and the 1 million records from the DB in memory at the same time if you want to use the Zip method.
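For reference, a minimal sketch of that Zip approach. Note that Zip pairs elements by position, so this assumes both lists are already sorted so that matching records line up; the list and property names are assumed from the question:

// Both lists must be aligned (e.g. sorted by Address, City, State), because
// Zip pairs the i-th element of one list with the i-th element of the other.
var updated = fileList.Zip(dbList, (fromFile, fromDb) =>
{
    if (fromFile.Equals(fromDb))    // IEquatable<CustObj>: compares Address, City, State
        fromFile.AddressId = fromDb.AddressId;
    return fromFile;
}).ToList();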
The problem with pushing everything to the database is that inserting that many records is resource-consuming (time, space, etc.). Moreover, if you want to do that every day, it is going to be more difficult resource-wise.
Your question didn't say if this was going to be a one time thing or a daily event in a production environment. Even that is going to make a difference in which approach to choose.
To repeat, you would have to try different approaches to see which works best for you based on your requirements: is this a one-time thing? How many resources does the process have? How much time does it have? And possibly many more.
It also kind of sounds like a job for .Aggregate(), i.e.
var aggreg = list1.Aggregate(otherListPrefilled, (acc, elemFrom1) =>
{
    // some code to create the joined data, using elemFrom1 to find
    // and modify the correct element in otherListPrefilled
    return acc;
});
Normally I would use an empty otherListPrefilled; not sure how it performs on 100k data items, though.
If it's a one-time thing, it's probably faster to put your data into a CSV file, import it into your database as a temporary table, and join the data in SQL.
I think I have an architecture problem.
For the purpose of the example, let's say I have a table named Dict.Country with two columns, Id and Name, like below. The reason I have such a table and not just an enum in code is that over time we want to dynamically add new values.
1 USA
2 POLAND
3 CHINA
etc.
So now the question is: how do I correctly read and operate on these values? I can create a class DictElement with string fields Id and Name, then read them from the database and work with them, but then we have the problem that we have to operate on string literals:
if (x.country == "POLAND")
...
which I believe is bad practice, because one small misspelling can cause us a lot of trouble.
Is there any good practice for how to work with such dictionaries from the database?
I have to create a database structure. I have a question about foreign keys and good practice:
I have a table which must have a field that can hold one of two different string values, either "A" or "B".
It cannot be anything else (therefore, I cannot use a string type field).
What is the best way to design this table:
1) create an int field which is a foreign key to another table with just two records, one for the string "A" and one for the string "B"
2) create an int field then, in my application, create an enumeration such as this
public enum StringAllowedValues
{
    A = 1,
    B
}
3) ???
In advance, thanks for your time.
Edit: 13 minutes later and I get all this awesome feedback. Thank you all for the ideas and insight.
Many database engines support enumerations as a data type. And there are, indeed, cases where an enumeration is the right design solution.
However...
There are two requirements which may decide that a foreign key to a separate table is better.
The first is: it may be necessary to increase the number of valid options in that column. In most cases, you want to do this without a software deployment; enumerations are "baked in", so in this case, a table into which you can write new data is much more efficient.
The second is: the application needs to reason about the values in this column, in ways that may go beyond "A" or "B". For instance, "A" may be greater/older/more expensive than "B", or there is some other attribute to A that you want to present to the end user, or A is short-hand for something.
In this case, it is much better to explicitly model this as columns in a table, instead of baking this knowledge into your queries.
In 30 years of working with databases, I personally have never found a case where an enumeration was the right decision....
Create a secondary table with the meanings of these integer codes. There's nothing that compels you to JOIN that in, but if you need to that data is there. Within your C# code you can still use an enum to look things up but try to keep that in sync with what's in the database, or vice-versa. One of those should be authoritative.
In practice you'll often find that short strings are easier to work with than rigid enums. In the 1990s when computers were slow and disk space scarce you had to do things like this to get reasonable performance. Now it's not really an issue even on tables with hundreds of millions of rows.
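As a rough sketch of keeping the two in sync, treating the database as authoritative (the table, column, and enum names here are hypothetical):

using System;
using System.Data;

public enum FieldKind
{
    A = 1,
    B = 2
}

public static class FieldKindValidator
{
    // Verifies at startup that the enum still matches the lookup table,
    // so neither side drifts out of sync unnoticed.
    public static void Validate(IDbConnection connection)
    {
        using (var cmd = connection.CreateCommand())
        {
            cmd.CommandText = "SELECT Id, Name FROM dbo.FieldKind"; // hypothetical lookup table
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    int id = reader.GetInt32(0);
                    string name = reader.GetString(1);
                    if (!Enum.IsDefined(typeof(FieldKind), id) || ((FieldKind)id).ToString() != name)
                        throw new InvalidOperationException(string.Format(
                            "Enum FieldKind is out of sync with dbo.FieldKind: {0} = {1}", id, name));
                }
            }
        }
    }
}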
I am working on optimizing some code I have been assigned from a previous employee's code base. Beyond the fact that the code is pretty well "spaghettified" I did run into an issue where I'm not sure how to optimize properly.
The below snippet is not an exact replication, but should detail the question fairly well.
He is taking one DataTable from an Excel spreadsheet and placing rows into a consistently formatted DataTable which later updates the database. This seems logical to me; however, the way he is copying data seems convoluted, and it is a royal pain to modify, maintain, or extend with new formats.
Here is what I'm seeing:
private void VendorFormatOne()
{
    // dtSubmit is declared with its column schema elsewhere
    for (int i = 0; i < dtFromExcelFile.Rows.Count; i++)
    {
        dtSubmit.Rows.Add(i);
        dtSubmit.Rows[i]["reference_no"] = dtFromExcelFile.Rows[i]["VENDOR REF"];
        dtSubmit.Rows[i]["customer_name"] = dtFromExcelFile.Rows[i]["END USER ID"];
        // etc etc etc
    }
}
To me this is completely overkill for mapping columns to a different schema, but I can't think of a way to do this more gracefully. In the actual solution, there are about 20 of these methods, all using different formats from dtFromExcelFile and the column list is much longer. The column schema of dtSubmit remains the same across the board.
I am looking for a way to avoid having to manually map these columns every time the company needs to load a new file from a vendor. Is there a way to do this more efficiently? I'm sure I'm overlooking something here, but did not find any relevant answers on SO or elsewhere.
This might be overkill, but you could define an XML file that describes which Excel column maps to which database field, then input that along with each new Excel file. You'd want to whip up a class or two for parsing and consuming that file, and perhaps another class for validating the Excel file against the XML file.
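For instance, a rough sketch of what such a mapping file and the code that consumes it might look like (the file layout, element names, and helper methods here are all hypothetical, not an established format):

// Example mapping file, e.g. VendorOne.mapping.xml:
//   <mapping>
//     <column from="VENDOR REF"  to="reference_no" />
//     <column from="END USER ID" to="customer_name" />
//   </mapping>

using System.Collections.Generic;
using System.Data;
using System.Linq;
using System.Xml.Linq;

static class ColumnMapping
{
    // Reads the from/to pairs out of the XML mapping file.
    public static Dictionary<string, string> Load(string xmlPath)
    {
        return XDocument.Load(xmlPath)
            .Descendants("column")
            .ToDictionary(c => (string)c.Attribute("from"),
                          c => (string)c.Attribute("to"));
    }

    // Copies rows from the Excel-shaped DataTable into the fixed dtSubmit schema.
    public static void Apply(DataTable source, DataTable target, Dictionary<string, string> map)
    {
        foreach (DataRow sourceRow in source.Rows)
        {
            var targetRow = target.NewRow();
            foreach (var pair in map)
                targetRow[pair.Value] = sourceRow[pair.Key];
            target.Rows.Add(targetRow);
        }
    }
}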
Depending on the size of your organization, this may give you the added bonus of being able to offload that tedious mapping to someone less skilled. However, it is quite a bit of setup work and if this happens only sparingly, you might not get a significant return on investment for creating so much infrastructure.
Alternatively, if you're using MS SQL Server, this is basically what SSIS is built for, though in my experience, most programmers find SSIS quite tedious.
I had originally intended this just as a comment but ran out of space. It's in reply to Micah's answer and your first comment therein.
The biggest problem here is the amount of XML mapping would equal that of the manual mapping in code
Consider building a small tool that, given an Excel file with two columns, produces the XML mapping file. Now you can offload the mapping work to the vendor, or an intern, or indeed anyone who has a copy of the requirement doc for a particular vendor project.

Since the file is then loaded at runtime in your import app or whatever, you can change the mappings without having to redeploy the app.

Having used exactly this kind of system many, many times in the past, I can tell you this: you will be very glad you took the time to do it - especially the first time you get a call right after deployment along the lines of "oops, we need to add a new column to the data we've given you, and we realised that we've misspelled the 19th column, by the way."

About the only thing that can perhaps go wrong is data type conversions, but you can build that into the mapping file (type from/to) and generalise your import routine to perform the conversions for you.
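A rough sketch of what that generalised conversion could look like, assuming the mapping file carries a target type name per column (the attribute and method names are hypothetical):

// e.g. <column from="ORDER DATE" to="order_date" type="System.DateTime" />
using System;

static class MappingConversions
{
    // Converts the raw cell value to the declared target type before it is
    // written into the destination table.
    public static object ConvertCell(object rawValue, string targetTypeName)
    {
        if (rawValue == null || rawValue == DBNull.Value)
            return DBNull.Value;

        var targetType = Type.GetType(targetTypeName); // e.g. "System.DateTime"
        return Convert.ChangeType(rawValue, targetType);
    }
}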
Just my 2c.
A while ago I ran into a similar problem where I had over 400 columns from 30-odd tables to be mapped to about 60 in the actual table in the database. I had the same dilemma of whether to go with a schema or write something custom.
There was so much duplication that I ended up writing a simple helper class with a couple of overridden methods that basically took in a column name from import table and spit out the database column name. Also, for column names, I built a separate class of the format
public static class ColumnName
{
    public const string FirstName = "FirstName";
    public const string LastName = "LastName";
    ...
}
Same thing goes for TableNames as well.
This made it much simpler to maintain table names and column names. Also, this handled duplicate columns across different tables really well, avoiding duplicate code.
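As a rough sketch of the kind of helper described above (the import column names and mappings are made up for illustration):

using System;
using System.Collections.Generic;

public static class ColumnMapper
{
    private static readonly Dictionary<string, string> Map =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "FIRST_NAME", ColumnName.FirstName },
            { "FIRSTNAME",  ColumnName.FirstName },  // duplicate import spellings map to one column
            { "LAST_NAME",  ColumnName.LastName },
        };

    // Returns the database column for an import column, or the input unchanged
    // if no mapping is defined.
    public static string ToDatabaseColumn(string importColumn)
    {
        string dbColumn;
        return Map.TryGetValue(importColumn, out dbColumn) ? dbColumn : importColumn;
    }
}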
To start, I would like to clarify that I'm not extremely well versed in C#. A project I'm working on in C# using .NET 3.5 has me building a class to read from and export files that contain multiple fixed-width formats based on the record type.
There are currently 5 record types, indicated by the first character position in each line of the file, each with its own specific line format. The problem I have is that the types are distinct from each other.
Record type 1 has 5 columns, signifies beginning of the file
Record type 3 has 10 columns, signifies beginning of a batch
Record type 5 has 69 columns, signifies a transaction
Record type 7 has 12 columns, signifies end of the batch, summarizes
(these 3 repeat throughout the file to contain each batch)
Record type 9 has 8 columns, signifies end of the file, summarizes
Is there a good library out there for these kinds of fixed width files? I've seen a few good ones that want to load the entire file in as one spec but that won't do.
Roughly 250 of these files are read at the end of every month and combined filesize on average is about 300 megs. Efficiency is very important to me in this project.
Based on my knowledge of the data, I've built a class hierarchy of what I "think" an object should look like...
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace Extract_Processing
{
    class Extract
    {
        private string mFilePath;
        private string mFileName;
        private FileHeader mFileHeader;
        private FileTrailer mFileTrailer;
        private List<Batch> mBatches; // A file can have many batches

        public Extract(string filePath)
        { /* Using file path some static method from another class would be called to parse in the file somehow */ }

        public string ToString()
        { /* Iterates all objects down the hierarchy to return the file in string format */ }

        public void ToFile()
        { /* Calls some method in the file parse static class to export the file back to storage somewhere */ }
    }

    class FileHeader
    { /* ... contains data types for all fields in this format, ToString etc */ }

    class Batch
    {
        private string mBatchNumber; // Should this be pulled out of the batch header to make LINQ querying simpler for this data set?
        private BatchHeader mBatchHeader;
        private BatchTrailer mBatchTrailer;
        private List<Transaction> mTransactions; // A batch can have multiple transactions

        public string ToString()
        { /* Iterates through batches to return what the entire batch would look like in string format */ }
    }

    class BatchHeader
    { /* ... contains data types for all fields in this format, ToString etc */ }

    class Transaction
    { /* ... contains data types for all fields in this format, ToString etc */ }

    class BatchTrailer
    { /* ... contains data types for all fields in this format, ToString etc */ }

    class FileTrailer
    { /* ... contains data types for all fields in this format, ToString etc */ }
}
I've left out many constructors and other methods, but I think the idea should be pretty solid. I'm looking for ideas and critique of the approach I'm considering since, again, I'm not knowledgeable about C# and execution time is the highest priority.
The biggest question, besides some critique, is: how should I bring in this file? I've brought in many files in other languages, such as VBA using FSO methods, a Microsoft Access ImportSpec to read in the file (5 times, one for each spec... wow, that was inefficient!), and a 'Cursor' object in Visual FoxPro (which was FAAAAAAAST, but again, had to be done five times), but I'm looking for hidden gems in C# if such things exist.
Thanks for reading my novel; let me know if you're having issues understanding it. I'm taking the weekend to go over this design to see if I buy it and want to take the effort to implement it this way.
FileHelpers is nice. It has a couple of drawbacks in that it doesn't seem to be under active development anymore, and it makes you use public variables for your fields instead of letting you use properties. But otherwise good.
What are you doing with these files? Are you loading them into SQL Server? If so, and you're looking for FAST and SIMPLE, I'd recommend a design like this:
Make staging tables in your database that correspond to each of the 5 record types. Consider adding a LineNumber column and a FileName column too just so you can trace problems back to the file itself.
Read the file line by line and parse it out into your business objects, or directly into ADO.NET DataTable objects that correspond to your tables.
If you used business objects, apply your data transformations or business rules and then put the data into DataTable objects that correspond to your tables.
Once each DataTable reaches an appropriate BatchSize (say 1000 records), use the SqlBulkCopy object to pump the data into your staging tables. After each SqlBulkCopy operation, clear out the DataTable and continue processing.
If you didn't want to use business objects, do any final data manipulation in SQL Server.
You could probably accomplish the whole thing in under 500 lines of C#.
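A minimal sketch of the batching and SqlBulkCopy step described above (the connection string, staging table name, and batch size are placeholders):

using System.Data;
using System.Data.SqlClient;

static class StagingLoader
{
    // Call this whenever the DataTable reaches your chosen batch size (say 1,000 rows):
    // it pumps the rows into the staging table and clears the table for the next batch.
    public static void FlushBatch(DataTable batch, string connectionString)
    {
        if (batch.Rows.Count == 0) return;

        using (var bulk = new SqlBulkCopy(connectionString))
        {
            bulk.DestinationTableName = "dbo.Staging_Transaction"; // hypothetical staging table
            bulk.WriteToServer(batch);
        }
        batch.Clear();
    }
}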
Biggest question besides some critique is, how should I bring in this file?
I do not know of any good library for file IO, but the reading is pretty straightforward.
Instantiate a StreamReader using a 64 kB buffer to limit disk IO operations (my estimate is 1,500 transactions on average per file at the end of the month).
Now you can stream over the file:
1) Use Read at the beginning of each line to determine the type of the record.
2) Use the ReadLine method together with String.Split to get the column values.
3) Create the object using the column values.
or
You could just buffer the data from a Stream manually and use IndexOf + Substring for more performance (if done right).
Also if the lines weren't columns but primitive datatypes in binary format, you could use the BinaryReader class for a very easy and performant way to read the objects.
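A rough sketch of the streaming read described in steps 1-3 above, with the 64 kB buffer and the per-record parsing left as stubs (the encoding is assumed):

using System.IO;
using System.Text;

static class ExtractReader
{
    public static void ReadFile(string filePath)
    {
        // 64 kB buffer to limit disk IO operations.
        using (var reader = new StreamReader(filePath, Encoding.ASCII, false, 64 * 1024))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                if (line.Length == 0) continue;

                // The first character identifies the record type and therefore the line format.
                switch (line[0])
                {
                    case '1': /* parse the 5 file-header columns    */ break;
                    case '3': /* parse the 10 batch-header columns  */ break;
                    case '5': /* parse the 69 transaction columns   */ break;
                    case '7': /* parse the 12 batch-trailer columns */ break;
                    case '9': /* parse the 8 file-trailer columns   */ break;
                }
            }
        }
    }
}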
One critique I have is that you are not correctly implementing ToString.
public string ToString()
Should be:
public override string ToString()