Fastest way to fuzzy match two csv files - c#

I have written a very simple C# program, using a NuGet package, that reads in 2 CSV files, fuzzy matches them, and outputs a new CSV file with all the matches. The problem is that I need the program to be able to read files of up to 700k and compare them against 100k, and I haven't been able to find a way to speed up the process. Is there any way I can do this? I will even use another language if need be.
You can ignore all the commented-out code; it is only there from when I was testing. Sorry, I'm a newer programmer.
The ReadCSV function reads in the CSV. The rest is code from inside another function, where I pass in the string arrays to run them through the fuzzy match.
static string[] ReadCSV(string path)
{
List<string> name = new List<string>();
List<string> address = new List<string>();
List<string> city = new List<string>();
List<string> state = new List<string>();
List<string> zip = new List<string>();
using (var reader = new StreamReader(path))
{
reader.ReadLine();
while (!reader.EndOfStream)
{
var line = reader.ReadLine();
var values = line.Split(',');
name.Add(values[0] +", "+ values[1]);
//address.Add(values[1]);
//city.Add(values[2]);
//state.Add(values[3]);
//zip.Add(values[4]);
}
}
string[] name1 = name.ToArray();
return name1;
//foreach (var item in name)
//{
// Console.WriteLine(item.ToString());
//}
}
StringBuilder csvcontent = new StringBuilder();
string csvpath = @"C:\Users\bigel\Documents\outputtest.csv";
csvcontent.AppendLine("Name,Address,Match");
//Console.WriteLine("Levenshtein Edit Distance:");
int x = 1;
foreach (var name in string1)
{
for (int i = 0; i < length; i++)
{
int leven = match[i].LevenshteinDistance(name);
//Console.WriteLine(match[i] + "\t{0} against {1}", leven, name);
if (leven <= 7)
{
output[i] = input[i] + ",match";
csvcontent.AppendLine(output[i]);
//Console.WriteLine(match[i] + " " + leven + " against " + name + " is a Match");
//Console.WriteLine(output[i]);
}
else
{
if (i == 500)
{
Console.WriteLine(x);
x++;
}
}
}
}
File.AppendAllText(csvpath, csvcontent.ToString());

Related

writing in a file using C# StreamWriter

I was practicing writing to a file using C#, but my code is not working (nothing is written to the file).
{
int T, N; //T = testCase , N = number of dice in any Test
int index = 0, straight;
List<string> nDiceFaceValues = new List<string>(); //List for Dice Faces
string line = null; //string to read line from file
string[] lineValues = {}; //array of string to split string line values
string InputFilePath = @"E:\Visual Studio 2017\CodeJam_Dice Straight\A-small-practice.in"; //path of input file
string OuputFilePath = @"E:\Visual Studio 2017\CodeJam_Dice Straight\A-small-practice.out"; //path of output file
StreamReader InputFile = new StreamReader(InputFilePath);
StreamWriter Outputfile = new StreamWriter(OuputFilePath);
T = Int32.Parse(InputFile.ReadLine()); //test cases input
Console.WriteLine("Test Cases : {0}", T);
while (index < T) {
N = Int32.Parse(InputFile.ReadLine());
for (int i = 0; i < N; i++) {
line = InputFile.ReadLine();
lineValues = line.Split(' ');
foreach(string j in lineValues)
{
nDiceFaceValues.Add(j);
}
}
straight = ArrangeDiceINStraight(nDiceFaceValues);
Console.WriteLine("case: {0} , {1}", ++index, straight);
Outputfile.WriteLine("case: {0} , {1}", index, straight);
nDiceFaceValues.Clear();
}
}
What is wrong with this code? How do I fix it? Why is it not working?
Note: I want to write to the file line by line.
What's missing is: closing things down - flushing the buffers, etc:
using(var outputfile = new StreamWriter(ouputFilePath)) {
outputfile.WriteLine("case: {0} , {1}", index, straight);
}
However, if you're going to do that for every line, File.AppendText may be more convenient.
In particular, note that new StreamWriter will be overwriting by default, so you'd also need to account for that:
using(var outputfile = new StreamWriter(ouputFilePath, true)) {
outputfile.WriteLine("case: {0} , {1}", index, straight);
}
the true here is for append.
If you have opened a file for concurrent read/write, you could also try just adding outputfile.Flush();, but... it isn't guaranteed to do anything.
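As a rough sketch of the same idea applied to the whole loop: the writer can be opened once, wrapped in a using block, and every line written inside it. The class and method names here (CaseWriter, WriteCases) and the temp-file path are made up for illustration, and the results array stands in for the values computed in the question.

```csharp
using System;
using System.IO;

public static class CaseWriter
{
    // Writes one "case" line per result. Leaving the using block disposes the
    // writer, which flushes its buffer and closes the file.
    public static void WriteCases(string outputPath, int[] results, bool append = false)
    {
        using (var writer = new StreamWriter(outputPath, append))
        {
            for (int i = 0; i < results.Length; i++)
            {
                writer.WriteLine("case: {0} , {1}", i + 1, results[i]);
            }
        } // Dispose runs here, even if an exception was thrown in the loop
    }

    public static void Main()
    {
        // Hypothetical output location, used only for this demonstration.
        string path = Path.Combine(Path.GetTempPath(), "A-small-practice.out");
        WriteCases(path, new[] { 3, 5, 2 });
        Console.Write(File.ReadAllText(path));
    }
}
```

Opening the writer once avoids reopening the file for every line, and the append flag only matters when the file should keep content from an earlier run.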

c# load txt file and split it to X files based on number of lines

This is the code that I've written so far.
It doesn't do the job; it just rewrites every line to the same file over and over again.
*RecordCntPerFile = 10K
*FileNumberName = 1 (file number one)
*Full File name should be something like this: 1_asci_split
string FileFullPath = DestinationFolder + "\\" + FileNumberName + FileNamePart + FileExtension;
using (System.IO.StreamReader sr = new System.IO.StreamReader(SourceFolder + "\\" + SourceFileName))
{
for (int i = 0; i <= (RecordCntPerFile - 1); i++)
{
using (StreamWriter sw = new StreamWriter(FileFullPath))
{
{ sw.Write(sr.Read() + "\n"); }
}
}
FileNumberName++;
}
Dts.TaskResult = (int)ScriptResults.Success;
}
If I understood correctly, you want to split a big file into smaller files with a maximum of 10k lines each. I see 2 problems in your code:
You never change the FileFullPath variable, so you always write to the same file.
You call sr.Read(), which reads a single character (returned as its integer code) rather than a line, and you reopen the target file on every iteration, which overwrites it.
I rewrote your code to get the behavior described above. You just have to modify the strings.
int maxRecordsPerFile = 10000;
int currentFile = 1;
using (StreamReader sr = new StreamReader("source.txt"))
{
int currentLineCount = 0;
List<string> content = new List<string>();
while (!sr.EndOfStream)
{
content.Add(sr.ReadLine());
if (++currentLineCount == maxRecordsPerFile || sr.EndOfStream)
{
using (StreamWriter sw = new StreamWriter(string.Format("file{0}.txt", currentFile)))
{
foreach (var line in content)
sw.WriteLine(line);
}
content = new List<string>();
currentFile++;
currentLineCount = 0;
}
}
}
Of course you can do better than that: you don't need to collect the lines in that list and then run that foreach loop over them. I just made this quick example to give you the idea. Improving the performance is up to you.
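One possible way to drop the intermediate list is to write each line straight to the current output file and switch files every N lines. This is only a sketch: FileSplitter and Split are made-up names, and the output pattern is a format string such as "file{0}.txt", as in the answer's example.

```csharp
using System;
using System.IO;

public static class FileSplitter
{
    // Copies sourcePath into numbered files of at most maxLinesPerFile lines.
    // outputPattern is a format string such as "file{0}.txt".
    // Returns the number of files written.
    public static int Split(string sourcePath, string outputPattern, int maxLinesPerFile)
    {
        if (maxLinesPerFile <= 0)
            throw new ArgumentOutOfRangeException(nameof(maxLinesPerFile));

        int fileNumber = 0;
        int linesInCurrentFile = 0;
        StreamWriter writer = null;
        try
        {
            // File.ReadLines is lazy, so only one line is held in memory at a time.
            foreach (string line in File.ReadLines(sourcePath))
            {
                // Open the next output file only when there is a line to put in it,
                // so no empty trailing file is created.
                if (writer == null)
                {
                    fileNumber++;
                    writer = new StreamWriter(string.Format(outputPattern, fileNumber));
                    linesInCurrentFile = 0;
                }

                writer.WriteLine(line);

                if (++linesInCurrentFile == maxLinesPerFile)
                {
                    writer.Dispose(); // flush and close the full file
                    writer = null;
                }
            }
        }
        finally
        {
            if (writer != null) writer.Dispose();
        }
        return fileNumber;
    }
}
```

A call such as FileSplitter.Split("source.txt", "file{0}.txt", 10000) would then produce file1.txt, file2.txt, and so on.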

Need help to develop a C# code to put the header validation

I have a flat file with comma-separated values that needs to be transferred to a DataTable. The values on the first line are the header names, which will be used as the column names of the DataTable. Before that, I need to check that all required (mandatory) headers are present in the flat file. Please help me develop C# code for this header validation.
.
.
.
//getting full file path of Uploaded file and read all text
System.IO.StreamReader file = new System.IO.StreamReader(@path);
string line;
while ((line = file.ReadLine()) != null)
{
string[] linetemp = line.Split(new char[] { ',' });
if(tblcsv.Rows.Count==0)
{
foreach (string ColName in linetemp)
{
tblcsv.Columns.Add(ColName); //Creating columns with available headers names
}
}
tblcsv.Rows.Add();
.
.
.
//remaining code
For example, if the flat file contains
datetime,status,Assignee,Reporter,Duration,Col1,Col2,Remarks
1504451523568,Inprogress,ABC,BCD,120,True,B,comments...
1504451523567,Completed,DFG,BCD,120,True,B,comments...
1504451523566,unassigned,VNB,BCD,160,,B,comments...
1504451523565,Inprogress,ERT,FGH,150,True,,comments...
then I need to check that the first line has all the mandatory headers (such as datetime, Status, Assignee and Duration).
I tried to implement your particular requirement with a sample CSV file I found online. The CSV file can be found here. The code may not be sophisticated, but I tried to take the simplest way to solve this particular problem.
Below is the short version of the code that matters for your case.
String firstLine;
var fileStream = new FileStream(@"C:\Users\user\Desktop\AssetsImportCompleteSample.csv", FileMode.Open,
FileAccess.Read);
using(var streamReader = new StreamReader(fileStream, Encoding.UTF8)) {
firstLine = streamReader.ReadLine();
}
var values = firstLine.Split(',');
for (int i = 0; i < values.Length; i++) {
values[i] = values[i].Trim();
}
if (values.Length == 4)
{
int count=0;
IList<string> newList = new List<string> { "MXASSETInterface", "SRM_SaaS_ES", "EN", "AddChange" };
for (int i = 0; i < values.Length; i++)
{
if (newList.Contains(values[i]))
{
count++;
newList.Remove(values[i]);
}
}
if (count == 4)
{
Console.WriteLine("head is correct");
}
else
{
Console.WriteLine("head is incorrect");
}
}
The complete console application is below and can be run directly:
class Program
{
static void Main(string[] args)
{
try
{
String firstLine;
var fileStream = new FileStream(@"C:\Users\user\Desktop\AssetsImportCompleteSample.csv", FileMode.Open,
FileAccess.Read);
using (var streamReader = new StreamReader(fileStream, Encoding.UTF8))
{
firstLine = streamReader.ReadLine();
}
if (firstLine != null)
{
var values = firstLine.Split(',');
Console.WriteLine(firstLine);
for (int i = 0; i < values.Length; i++)
{
values[i] = values[i].Trim();
Console.WriteLine(values[i]);
}
if (values.Length == 4)
{
int count=0;
IList<string> newList = new List<string> { "MXASSETInterface", "SRM_SaaS_ES", "EN", "AddChange" };
for (int i = 0; i < values.Length; i++)
{
if (newList.Contains(values[i]))
{
count++;
newList.Remove(values[i]);
}
}
if (count == 4)
{
Console.WriteLine("head is correct");
}
else
{
Console.WriteLine("head is incorrect");
}
}
else
{
Console.WriteLine("header is Invalid");
}
}
else
{
Console.WriteLine("header is Invalid");
}
Console.ReadLine();
}
catch (Exception e)
{
Console.WriteLine("Please check if file is available or path is correct: {0}", e.Message);
}
Console.ReadLine();
}
}
I suggest using the CsvHelper library for parsing the CSV file. It lets you define a class that represents a row in your file. Header names are the property names by default, or they can be mapped using the fluent API.
var csv = new CsvReader( textReader );
var records = csv.GetRecords();
GetRecords will fail if some headers are missing.
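The mandatory-header check on its own can also be sketched with plain .NET, without any CSV library. The names below (HeaderValidator, MissingHeaders) are made up for illustration, and the sketch assumes the header line contains no quoted fields with embedded commas, so a simple Split is enough. It compares names case-insensitively, since the example file has "status" while the required list says "Status".

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class HeaderValidator
{
    // Returns the required headers that are absent from the header line.
    // Comparison ignores case and surrounding whitespace.
    public static List<string> MissingHeaders(string headerLine, IEnumerable<string> required)
    {
        var present = new HashSet<string>(
            (headerLine ?? string.Empty).Split(',').Select(h => h.Trim()),
            StringComparer.OrdinalIgnoreCase);

        return required.Where(r => !present.Contains(r)).ToList();
    }

    public static void Main()
    {
        // Header line and mandatory names taken from the question's example.
        string header = "datetime,status,Assignee,Reporter,Duration,Col1,Col2,Remarks";
        var required = new[] { "datetime", "Status", "Assignee", "Duration" };

        List<string> missing = MissingHeaders(header, required);
        Console.WriteLine(missing.Count == 0
            ? "All mandatory headers are present"
            : "Missing headers: " + string.Join(", ", missing));
    }
}
```

Returning the list of missing names, rather than a bare true/false, makes it easy to tell the user exactly which columns to add.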

StreamWriter: Starting and ending on a specific line number

I would like to ask for some tips and help with the reading/writing part of my C# program.
Situation:
I have to read a CSV file; - OK
If the CSV file name starts with "Load_", I want to write the data from line 2 to the last line to another CSV;
If the CSV file name starts with "RO_", I want to write to 2 different CSVs: one with lines 1 to 4 and the other with line 4 to the last one;
What I have so far is:
public static void ProcessFile(string[] ProcessFile)
{
// Keeps track of your current position within a record
int wCurrLine = 0;
// Number of rows in the file that constitute a record
const int LINES_PER_ROW = 1;
int ctr = 0;
foreach (string filename in ProcessFile)
{
var sbText = new System.Text.StringBuilder(100000);
int stop_line = 0;
int start_line = 0;
// Used for the output name of the file
var dir = Path.GetDirectoryName(filename);
var fileName = Path.GetFileNameWithoutExtension(filename);
var ext = Path.GetExtension(filename);
var folderbefore = Path.GetFullPath(Path.Combine(dir, @"..\"));
var lineCount = File.ReadAllLines(@filename).Length;
string outputname = folderbefore + "output\\" + fileName;
using (StreamReader Reader = new StreamReader(@filename))
{
if (filename.Contains("RO_"))
{
start_line = 1;
stop_line = 5;
}
else
{
start_line = 2;
stop_line = lineCount;
}
ctr = 0;
while (!Reader.EndOfStream && ctr < stop_line)
{
// Add the text
sbText.Append(Reader.ReadLine());
// Increment our current record row counter
wCurrLine++;
// If we have read all of the rows for this record
if (wCurrLine == LINES_PER_ROW)
{
// Add a line to our buffer
sbText.AppendLine();
// And reset our record row count
wCurrLine = 0;
}
ctr++;
} // end of the while
}
int total_lenght = sbText.Length;
// When all of the data has been loaded, write it to the text box in one fell swoop
using (StreamWriter Writer = new StreamWriter(dir + "\\" + "output\\" + fileName + "_out" + ext))
{
Writer.Write(sbText.ToString());
}
} // end of the foreach
} // end of ProcessFile
I was thinking about using an IF/ELSE around the "using (StreamWriter Writer = new StreamWriter(dir + "\\" + "output\\" + fileName + "_out" + ext))" part. However, I am not sure how to tell StreamWriter to write only from/to a specific line number.
Any help is welcome! If I am missing some information, please let me know (I am pretty new to Stack Overflow).
Thank you.
Your code is way too complicated. A simpler version:
using System.Collections.ObjectModel;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
namespace ConsoleApplication57
{
class Program
{
static void Main(string[] args)
{
}
public static void ProcessFile(string[] ProcessFile)
{
foreach (string filename in ProcessFile)
{
// Used for the output name of the file
var dir = Path.GetDirectoryName(filename);
var fileName = Path.GetFileNameWithoutExtension(filename);
var ext = Path.GetExtension(filename);
var folderbefore = Path.GetFullPath(Path.Combine(dir, @"..\"));
var lineCount = File.ReadAllLines(@filename).Length;
string outputname = folderbefore + "output\\" + fileName;
using (StreamWriter Writer = new StreamWriter(dir + "\\" + "output\\" + fileName + "_out" + ext))
{
int rowCount = 0;
using (StreamReader Reader = new StreamReader(@filename))
{
string inputLine = "";
while ((inputLine = Reader.ReadLine()) != null)
{
rowCount++; // 1-based number of the line just read
if (filename.Contains("RO_"))
{
if (rowCount <= 4)
{
Writer.WriteLine(inputLine);
}
if (rowCount == 4) break;
}
else
{
if (rowCount >= 2)
{
Writer.WriteLine(inputLine);
}
}
} // end of the while
Writer.Flush();
}
}
} // end of the foreach
} // end of ProcessFile
}
}
You can use LINQ to Take and Skip lines.
public abstract class CsvProcessor
{
private readonly IEnumerable<string> processFiles;
public CsvProcessor(IEnumerable<string> processFiles)
{
this.processFiles = processFiles;
}
protected virtual IEnumerable<string> GetAllLinesFromFile(string fileName)
{
using(var stream = new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.Read))
using(var reader = new StreamReader(stream))
{
var line = String.Empty;
while((line = reader.ReadLine()) != null)
{
yield return line;
}
}
}
protected virtual void ProcessFiles()
{
var sb1 = new StringBuilder();
var sb2 = new StringBuilder();
foreach(var file in this.processFiles)
{
var fileName = Path.GetFileNameWithoutExtension(file);
var lines = GetAllLinesFromFile(file);
if(fileName.StartsWith("RO_", StringComparison.InvariantCultureIgnoreCase))
{
sb1.AppendLine(String.Join(Environment.NewLine, lines.Take(4))); //take only the first four lines
sb2.AppendLine(String.Join(Environment.NewLine, lines.Skip(4).TakeWhile(s => !String.IsNullOrEmpty(s)))); //skip the first four lines, take everything up to the first empty line
}
else if(fileName.StartsWith("Load_", StringComparison.InvariantCultureIgnoreCase))
{
sb2.AppendLine(String.Join(Environment.NewLine, lines.Skip(1).TakeWhile(s => !String.IsNullOrEmpty(s))));
}
}
// now write your StringBuilder objects to file...
}
protected virtual void WriteFile(StringBuilder sb1, StringBuilder sb2)
{
// ... etc..
}
}
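A more compact sketch of the Skip/Take idea, writing directly to files instead of going through StringBuilder objects. Everything named here (CsvSlicer, CopyLines, Process, the "_head" and "_body" suffixes) is hypothetical, and it assumes the second "RO_" file should start at line 5, as in the Skip(4) call above; use a skip of 3 if line 4 should appear in both files. The source and target paths must differ, because the source is still being read lazily while the target is written.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class CsvSlicer
{
    // Copies lines from source to target, skipping the first `skip` lines and
    // then taking at most `take` lines (all remaining lines when take is null).
    public static void CopyLines(string source, string target, int skip, int? take = null)
    {
        IEnumerable<string> lines = File.ReadLines(source).Skip(skip); // lazy read
        if (take.HasValue)
        {
            lines = lines.Take(take.Value);
        }
        File.WriteAllLines(target, lines);
    }

    // Applies the rules from the question to one input file.
    public static void Process(string path, string outputDir)
    {
        string name = Path.GetFileNameWithoutExtension(path);
        string ext = Path.GetExtension(path);

        if (name.StartsWith("RO_", StringComparison.OrdinalIgnoreCase))
        {
            // Lines 1-4 go to one file, the rest to another.
            CopyLines(path, Path.Combine(outputDir, name + "_head" + ext), 0, 4);
            CopyLines(path, Path.Combine(outputDir, name + "_body" + ext), 4);
        }
        else if (name.StartsWith("Load_", StringComparison.OrdinalIgnoreCase))
        {
            // Everything from line 2 onward.
            CopyLines(path, Path.Combine(outputDir, name + "_out" + ext), 1);
        }
    }
}
```

Because File.ReadLines streams the file, no line count is needed up front, which also removes the extra File.ReadAllLines pass from the original code.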

using .replace to replace a word in text document (c#)

I currently have the following code:
string[] fileLineString = File.ReadAllLines(Server.MapPath("~") + "/App_Data/Users.txt");
for (int i = 0; i < fileLineString.Length; i++)
{
string[] userPasswordPair = fileLineString[i].Split(' ');
if (Session["user"].ToString() == userPasswordPair[0])
{
userPasswordPair[i].Replace(userPasswordPair[1], newPasswordTextBox.Text);
}
}
}
The text file is laid out as: 'username' 'password'
What I'm trying to do is edit the password and replace it with a new one using my code, but my code seems to do nothing and the text file just stays the same.
string[] fileLineString = File.ReadAllLines(Server.MapPath("~") + "/App_Data/Users.txt");
for (int i = 0; i < fileLineString.Length; i++)
{
string[] userPasswordPair = fileLineString[i].Split(' ');
if (Session["user"].ToString() == userPasswordPair[0])
{
// set the new password in the same list and save the file
fileLineString[i] = Session["user"].ToString() + " " + newPasswordTextBox.Text;
File.WriteAllLines((Server.MapPath("~") + "/App_Data/Users.txt"), fileLineString);
break; // exit from the for loop
}
}
At the moment, you're not storing the file.
Your replace is not assigned to a variable (Replace does not edit or write anything, it just returns the new string object).
Corrected code:
string[] fileLineString = File.ReadAllLines(Server.MapPath("~") + "/App_Data/Users.txt");
for (int i = 0; i < fileLineString.Length; i++)
{
string[] userPasswordPair = fileLineString[i].Split(' ');
if (Session["user"].ToString() == userPasswordPair[0])
{
fileLineString[i] = fileLineString[i].Replace(userPasswordPair[1], newPasswordTextBox.Text);
break;
}
}
File.WriteAllLines(Server.MapPath("~") + "/App_Data/Users.txt", fileLineString);
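A small self-contained illustration of the point about Replace returning a new string. It also shows one thing to watch for: calling Replace on the whole line changes every occurrence of the old password text, including inside the user name. The helper WithNewPassword is a made-up name, and it assumes the 'username password' layout from the question, with exactly one space as separator.

```csharp
using System;

public static class ReplaceDemo
{
    // Rebuilds a "username password" line with a new password, leaving the
    // user name untouched even if it contains the old password text.
    public static string WithNewPassword(string line, string newPassword)
    {
        string[] parts = line.Split(' ');
        return parts[0] + " " + newPassword;
    }

    public static void Main()
    {
        string line = "alice oldSecret";

        line.Replace("oldSecret", "newSecret");        // result is discarded
        Console.WriteLine(line);                        // prints "alice oldSecret"

        line = line.Replace("oldSecret", "newSecret"); // keep the returned string
        Console.WriteLine(line);                        // prints "alice newSecret"

        // Replace on the whole line also rewrites matches in the user name:
        Console.WriteLine("bob b".Replace("b", "x"));   // prints "xox x"
        Console.WriteLine(WithNewPassword("bob b", "x")); // prints "bob x"
    }
}
```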
String _userName = "User";
String _newPassword = "Password";
// Read the whole file as a single string
String _fileContent = System.IO.File.ReadAllText("filePath");
// Pattern matching the line of the user whose password is to be changed
string _regPattern = String.Format(@"^(?<user>{0}) (?<pwd>\S+)", Regex.Escape(_userName));
Regex _regex2 = new Regex(_regPattern, RegexOptions.IgnoreCase | RegexOptions.Multiline);
// Keep the user name and substitute the new password
String _outPut = _regex2.Replace(_fileContent, m => m.Groups["user"].Value + " " + _newPassword);
// Write the result back to the file
System.IO.File.WriteAllText("filePath", _outPut);
