I have two .csv files in the following format:
File1
ID,Name,Token
123,ABC,345
555,XYZ,777
777,YYY,765
666,UUU,
543,MNO,
File2
ID,Name,Token
777,ABC,345
125,XYZ,999
976,RRR,
I have to compare these and exclude the employees who have the same ID and/or the same token, because such employees are considered to be the same person.
So, in my example above, in File2 the employee with ID 777 has two matching employees in File1: the one that has 777 as its token number and the one that has 777 as its ID.
I have done file comparisons before, and every time I used hashtables/dictionaries where the ID/token was unique for every employee, so it was simple: I only had to compare keys. But here, as we can see, one employee in File2 matches two employees in File1.
I assume that to tackle this we could use a linear search and do a full scan of File1 for every employee in File2. But I think that would be wasteful and would increase processing time a lot, because my files will have hundreds of rows. For example, if File2 has 400 rows and File1 has 300, then for each employee in File2 the application would scan 300 rows, i.e. 120,000 comparisons in total. This doesn't make sense.
What should I do in this case? Should I fix the data at the source and eliminate these duplicate entries there?
Please let me know if the problem statement doesn't make sense. I will try to explain more clearly. Thanks!
UPDATE- What I tried
I read File1 into a dictionary and then read File2 line by line, comparing the ID and token. This works well for employees with a unique ID and token, but it fails for the duplicates in the data.
// idMap / tokenMap are built from File1; file2Entry is the current File2 row
if (idMap.ContainsKey(file2Entry.ID) || tokenMap.ContainsKey(file2Entry.Token))
{
    //do something
}
The issue with this approach is that when the application checks employee ID 777 from File2 against ID 777 in File1, it thinks it has found its match, processes the data and moves on to the next line in File2. It never considers the employee with token number 777 in File1.
UPDATE 2 - I had forgotten to mention that tokens can be blank too.
I was also wondering about trying some data cleansing: build a step that removes these duplicates first, and then feed the clean file into the application for comparison.
You could use something like the following to look up by both ID and Token (idMap and tokenMap being dictionaries built from File1, keyed by ID and Token respectively, and file2Entry being the current File2 row):

idMap.TryGetValue(file2Entry.ID, out FileEntry foundById);
tokenMap.TryGetValue(file2Entry.Token, out FileEntry foundByToken);

if (foundById != null && foundByToken != null)
{
    // matched on both ID and Token
}
else if (foundById != null)
{
    // matched on ID only
}
else if (foundByToken != null)
{
    // matched on Token only
}
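For completeness, here is a minimal sketch of how those two lookups could be built from File1 in a single pass (the FileEntry class and the file/variable names are assumptions, not your actual code; blank tokens are skipped so they never match each other, per UPDATE 2):

using System.Collections.Generic;
using System.IO;

class FileEntry
{
    public string ID;
    public string Name;
    public string Token;
}

static class File1Loader
{
    public static void Load(string path,
        out Dictionary<string, FileEntry> idMap,
        out Dictionary<string, FileEntry> tokenMap)
    {
        idMap = new Dictionary<string, FileEntry>();
        tokenMap = new Dictionary<string, FileEntry>();

        foreach (var line in File.ReadLines(path))
        {
            var parts = line.Split(',');
            if (parts.Length < 3 || parts[0] == "ID") continue;   // skip header / malformed rows

            var entry = new FileEntry { ID = parts[0].Trim(), Name = parts[1].Trim(), Token = parts[2].Trim() };
            idMap[entry.ID] = entry;
            if (entry.Token.Length > 0)
                tokenMap[entry.Token] = entry;   // blank tokens are not indexed
        }
    }
}

With both maps in place, each File2 row costs two constant-time lookups instead of a full scan of File1, and the employee with token 777 is no longer missed just because ID 777 already matched.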
Related
I have a customers table with 2 columns for the name (firstname, lastname) and it contains around 100k records.
I have a scenario where I have to import new customers but their names come as a single column. Most names are simple (first and last), but some names are double names (with a space or hyphen), double surnames (with a space or hyphen) or even both.
Does it make sense to use an ML.NET classification algorithm to split the full name, based on a model trained from the 100k existing records?
I think it would be unnecessary to use machine learning methods for such a problem. You should try a rule-based method here.
Assuming the data comes in one column:
For example: after splitting the text on spaces, is the word count equal to 2? If so, the first word is the first name and the second is the surname.
Example 2: does the text contain a hyphen? If so, which rule applies, and how do you tell the first name from the surname?
1) What you need to do here is create a training, validation and test set for yourself.
2) Code up the rules you extract from the data in the training set (this is where you need to make careful deductions by examining the data).
3) Determine the most suitable rules using the validation data.
Finally, evaluate your work by getting results on the test set with the rules you found most suitable.
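As a rough illustration of what such a rule-based pass might look like (a minimal sketch only; the method name and the rules themselves are assumptions you would refine against your 100k records and your validation set):

using System;

static class NameSplitter
{
    // Very small rule set: the last space-separated token is treated as the
    // surname, everything before it as the (possibly double) first name.
    // Hyphenated parts stay together because we only split on spaces.
    public static (string FirstName, string LastName) SplitFullName(string fullName)
    {
        var tokens = fullName.Trim().Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

        if (tokens.Length == 1)
            return (tokens[0], "");                 // nothing to split

        if (tokens.Length == 2)
            return (tokens[0], tokens[1]);          // the common first + last case

        var first = string.Join(" ", tokens, 0, tokens.Length - 1);
        return (first, tokens[tokens.Length - 1]);  // last token as surname
    }
}

For example, SplitFullName("Jean-Pierre van Dam") returns ("Jean-Pierre van", "Dam"), which is exactly the kind of case (a double surname containing a space) that your validation set should catch and a better rule, e.g. a list of surname prefixes, would fix.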
I have a folder filled with about 200 csv files, each containing about 6000 rows of mutual fund data. I have to copy that comma-separated data into the database via Entity Framework.
The two major objects are Mutual_Fund_Scheme_Details and Mutual_Fund_NAV_Details.
Mutual_Fund_Scheme_Details - this contains columns like Scheme_Name, Scheme_Code, Id, Last_Updated_On.
Mutual_Fund_NAV_Details - this contains Scheme_Id (foreign key), NAV, NAV_Date.
Each line in the CSV contains all of the above columns, so before inserting I have to:
1. Split each line.
2. Extract the scheme-related data first and check if the scheme exists; if so, get its id. If it does not exist, insert the scheme details and get the id.
3. Using the id obtained in step 2, check if a NAV entry exists for the same date. If not, insert it, otherwise skip it.
4. If an entry was inserted in step 3, the scheme's Last_Updated_On date may need to be updated with the NAV date (depending on whether it is newer than the existing value).
All the exists checks are done using the Any() LINQ extension method, and all new entries are added to the DbContext, but SaveChanges is called only at the end of processing each file. I used to call it after each insert, but that took even longer than the current approach.
Now, since this involves at least two exists checks, at most two inserts and one update per line, processing each file is taking far too long: close to 5-7 minutes per file. I am looking for suggestions to improve this. Any help would be useful.
Specifically, I am looking to:
Reduce the time it takes to process each file
Decrease the number of individual exists checks (if I can possibly club them together in some way)
Decrease individual inserts/updates (if I can possibly club them in some way)
It's going to be hard to optimize this with EF. Here is a suggestion:
Once you have read the whole file (~6000 rows), do the exists check in one go with .Where(x => listOfIdsFromFile.Contains(x.Id)). This works fine for 6000 ids, and it lets you separate the inserts from the updates.
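A minimal sketch of that idea, per file (this is not your actual code: MutualFundContext, FundRow and the property names are assumptions based on the columns you describe):

using System.Collections.Generic;
using System.Linq;

static class FundImporter
{
    public static void ImportFile(MutualFundContext db, List<FundRow> rows)
    {
        // One query instead of ~6000 Any() calls: load every scheme mentioned in this file.
        var codes = rows.Select(r => r.Scheme_Code).Distinct().ToList();
        var existingSchemes = db.Mutual_Fund_Scheme_Details
            .Where(s => codes.Contains(s.Scheme_Code))
            .ToDictionary(s => s.Scheme_Code);

        foreach (var row in rows)
        {
            if (!existingSchemes.TryGetValue(row.Scheme_Code, out var scheme))
            {
                scheme = new Mutual_Fund_Scheme_Details
                {
                    Scheme_Code = row.Scheme_Code,
                    Scheme_Name = row.Scheme_Name,
                    Last_Updated_On = row.NAV_Date
                };
                db.Mutual_Fund_Scheme_Details.Add(scheme);
                existingSchemes[row.Scheme_Code] = scheme;   // later rows in the same file reuse it
            }

            // The NAV rows can be handled the same way: load the existing
            // (Scheme_Id, NAV_Date) pairs for this file in one query up front,
            // then only Add the ones that are missing and bump Last_Updated_On.
        }

        db.SaveChanges();   // still once per file
    }
}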
I've tried researching this question, but most of the answers are for .csv files, which does not help me a lot.
I have a couple of large .dat files containing quite a lot of data (each file around 700 MB), and I am trying to develop software in C# that lets me search for a specific string and locate the line(s) where it occurs (duplicates will occur, so a ListView/ListBox might be a good idea).
Every line follows the exact same data format and the starting index/length of each datatype is well documented.
Example:
Line 1: ZATIXIZ20SWEDENSTACKOVERFLOWCHROME
Documented like this:
Username: 0-6, Age: 7-8, Country: 9-14, Website: 15-27, Browser: 28-33
My guess is that the best approach would be to do some kind of BULK INSERT of the data files into a database and then index it for faster searching later on. I am not quite sure how to do this, though, nor what the best approach would be. (The search also needs to cover all of the files, so maybe it would be a good idea to insert them all into the same table?)
So far I have only tried to read one of the files into memory and run a simple Regex over it, which of course was not a good idea. Unfortunately I am a bit inexperienced with SQL queries, which is why I have not tried much yet.
Thanks in advance!
Insert all data of the same type into a table with indexed columns.
If the properties vary between each file, use multiple tables.
If you want to be able to trace the match back to the original file, use a table with columns:
Key - Internal key from a sequence
FileName - So you know where it came from
Line - The line number
Username
Age
Country
Website
Browser
Where (FileName, Line) is a unique key.
Here is a link to an article on full-text searching in MSSQL, as we don't know which RDBMS you are using: http://msdn.microsoft.com/en-us/library/ms142571.aspx#queries
From your example, the line 'ZATIXIZ20SWEDENSTACKOVERFLOWCHROME' becomes:
| Key | FileName    | Line | Username  | Age | CountryKey | Website         | BrowserKey |
| 1   | 'Data1.dat' | 1    | 'ZATIXIZ' | 20  | 46         | 'STACKOVERFLOW' | 4          |
In this example, you'd need two more tables: Countries and Browsers. These are optional, as you could just include the information directly in the main table.
I must stress, though, that it really depends on how you wish to query this data. The above structure gives you the opportunity to search for 'all Swedish users between 20 and 25' by performing the following query:
select * from TABLENAME where Age < 25 and Age >= 20 and CountryKey = 46
In regards to how you import a fixed-width file, it depends greatly on your RDBMS. If you're using Oracle, you can use SQL*Loader. Remember that it does not necessarily have to be a single-stage process: you can load the data into the tables first and then look up the keys internally after the initial import.
For MSSQL, here is another answer from Stack Overflow: Bulk insert fixed width fields
You can also preprocess it in .NET. Again, it depends on your scenario. If you are piping these into your system at a rate of one 700 MB file every 10 minutes, you're looking at some serious optimization of the bulk-load process (and some serious hardware). But if you only need to load a file like this once a month, the .NET approach is absolutely fine, even if it takes a few hours.
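If you do go the .NET preprocessing route, a minimal sketch of slicing each fixed-width line at the documented offsets might look like this (only the offsets come from your layout; the DatRecord type and the CSV output step are assumptions):

using System;
using System.IO;

class DatRecord
{
    public string Username;
    public int Age;
    public string Country;
    public string Website;
    public string Browser;
}

static class DatParser
{
    // Offsets/lengths follow the documented layout:
    // Username 0-6, Age 7-8, Country 9-14, Website 15-27, Browser 28-33.
    public static DatRecord ParseLine(string line)
    {
        return new DatRecord
        {
            Username = line.Substring(0, 7),
            Age      = int.Parse(line.Substring(7, 2)),
            Country  = line.Substring(9, 6),
            Website  = line.Substring(15, 13),
            Browser  = line.Substring(28, 6)
        };
    }

    // Stream the 700 MB file line by line instead of loading it all into memory,
    // e.g. to write out a delimited file that the database can bulk load.
    public static void ConvertToCsv(string datPath, string csvPath)
    {
        using (var writer = new StreamWriter(csvPath))
        {
            foreach (var line in File.ReadLines(datPath))
            {
                var r = ParseLine(line);
                writer.WriteLine($"{r.Username},{r.Age},{r.Country},{r.Website},{r.Browser}");
            }
        }
    }
}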
I've researched this across various threads and found viable solutions for one or two parts, but I'm having trouble linking the two up. Basically, I want to search a directory for files of a certain name; if there is more than one occurrence of such a file, I want to choose only the most recent one, e.g. if I have three files called DOG, I only want the most recent one. I'm using version 1.0, so I don't think LINQ is an option. Here is the closest match to a solution I've found, which I believe will pull the most recent file:
var targetDirectory = new DirectoryInfo(@"G:\SomeFilePath");
string[] files = System.IO.Directory.GetFiles(Dts.Variables["User::DirectoryPath"].Value.ToString());
string lastFileName = string.Empty;

foreach (string Current_filename in files)
{
    // Note: string.Compare orders by name, so this keeps the alphabetically
    // last file name, not necessarily the most recently written file.
    if (string.Compare(Current_filename, lastFileName) >= 1)
    {
        lastFileName = Current_filename;
    }
}

Dts.Variables["User::LastFileName"].Value = lastFileName;
As the files are all in the same directory, my question is: if this code pulls the most recent file, how can I specify which 'most recent' file I want? For example, I have 3 files called DOG, 2 called CAT, etc., so how can I take just the most recent DOG file and just the most recent CAT file?
Based on answers in this thread I am using this solution https://stackoverflow.com/a/1781668/451518 to return the most recent files. However, I now want to separate these files based on their names, e.g. I want to search the directory for all DOG or CAT txt files and then apply the order-by-most-recent function shown in the link.
JUST TO CLARIFY: the file names aren't exactly the same, I mean DOG2 or DOG3, but it's the DOG part that is important, so based on the DOG prefix I need to pick the most recent one.
In my main I am calling
if(filename.StartsWith("DOG"))
{
GetLatestWritenFileFileInDirectory(directoryInfo);
}
However, I want to edit GetLatestWritenFileFileInDirectory so that it only considers the files whose name starts with DOG.
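For the prefix part, something like the following sketch could replace or feed GetLatestWritenFileFileInDirectory (no LINQ needed, so it also works on older runtimes; the method and parameter names are just placeholders):

using System;
using System.IO;

static class LatestFileFinder
{
    // Returns the most recently written file whose name starts with the given
    // prefix (e.g. "DOG" matches DOG2.txt, DOG3.txt, ...), or null if none match.
    public static FileInfo GetLatestWrittenFileWithPrefix(DirectoryInfo directory, string prefix)
    {
        FileInfo latest = null;

        foreach (FileInfo file in directory.GetFiles(prefix + "*.txt"))
        {
            if (latest == null || file.LastWriteTime > latest.LastWriteTime)
            {
                latest = file;
            }
        }

        return latest;
    }
}

Calling it once with "DOG" and once with "CAT" then gives you the newest file of each kind.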
A text file contains order and order-detail lines.
The first line is an order header.
After that there is a variable number of detail lines.
After the details there is a blank line, followed by the next order, and so on.
For an order line, the first field is the order number.
For a detail line, the first field is always 1.
128502 02.01.2012 20120 02.01.2012
1 Wine 0 1300
1 Meat 5,8333 5,83
128503 02.01.2012 20123 02.01.2012
1 Wine 20 130
1 Meat 1,33 283,23
1 Cow 2,333 333,23
....
This file needs to be read into a list of entities:
class Order {
public string Number; // order number from first field, primary key
public string Date;
... other fields
}
class OrderDetails {
public string Number; // order number from the previous order line, foreign key to Order
public string ProductName;
... other fields
}
(Instead of Number, a custom integer Id column could also be used for the relation.)
How can such a file be read in C# (ASP.NET MVC 2) using the FileHelpers library, or in some other way?
Update
The sample at
http://www.filehelpers.com/example_multirecords.html
referenced from
Multiple CSV strutures with FileHelpers
shows how to read the two record types.
How do I create the relation between those tables? While reading, a foreign key to the order should be added to each detail record. How can this be implemented, i.e. take the order number from the last order line read and add it to each following detail record?
Well, you could go brute force and do it all in one go:
Read in the entire file.
Split on the blank lines.
Create an Order out of the first line of each chunk.
Create the details out of the rest (a small sketch of this follows below).
If the files are huge, you could instead read up to each blank line and process each chunk as above.
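A minimal sketch of that brute-force pass, which also answers the foreign-key question by copying the current order number onto each detail (field positions are guessed from the sample lines above, so treat the Split/index logic as an assumption; Order and OrderDetails are the classes from the question):

using System;
using System.Collections.Generic;
using System.IO;

static class OrderFileReader
{
    public static void Read(string path, List<Order> orders, List<OrderDetails> details)
    {
        Order currentOrder = null;

        foreach (var rawLine in File.ReadLines(path))
        {
            var line = rawLine.Trim();
            if (line.Length == 0) { currentOrder = null; continue; }   // blank line: order finished

            var fields = line.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries);

            if (fields[0] == "1" && currentOrder != null)
            {
                // Detail line: first field is always 1; carry the order number over.
                details.Add(new OrderDetails { Number = currentOrder.Number, ProductName = fields[1] });
            }
            else
            {
                // Order header line: first field is the order number.
                currentOrder = new Order { Number = fields[0], Date = fields[1] };
                orders.Add(currentOrder);
            }
        }
    }
}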
Another solution that might have some advantages:
Run the file through a pre-process that edits each detail line and adds in the order number from its related order line.
Then another pass that splits orders and details out into two separate files.
Then process orders, then details, without having to worry about all this carrying-over of numbers and two structures in one file.
A fourth solution is to give whoever came up with this format a good beating, to encourage them to do it properly...
There's no magic-wand solution for this; the file format is simply wholly unsuitable for efficiently passing the data to anything that might store it sensibly.