Read and process 100 text files in c# in parallel

Read and process 100 text files in c# in parallel - c#

I have project that reads 100 text file with 5000 words in it.
I insert the words into a list. I have a second list that contains english stop words. I compare the two lists and delete the stop words from first list.
It takes 1 hour to run the application. I want to be parallelize it. How can I do that?
Heres my code:
private void button1_Click(object sender, EventArgs e)
{
List<string> listt1 = new List<string>();
string line;
for (int ii = 1; ii <= 49; ii++)
{
string d = ii.ToString();
using (StreamReader reader = new StreamReader(#"D" + d.ToString() + ".txt"))
while ((line = reader.ReadLine()) != null)
{
string[] words = line.Split(' ');
for (int i = 0; i < words.Length; i++)
{
listt1.Add(words[i].ToString());
}
}
listt1 = listt1.ConvertAll(d1 => d1.ToLower());
StreamReader reader2 = new StreamReader("stopword.txt");
List<string> listt2 = new List<string>();
string line2;
while ((line2 = reader2.ReadLine()) != null)
{
string[] words2 = line2.Split('\n');
for (int i = 0; i < words2.Length; i++)
{
listt2.Add(words2[i]);
}
listt2 = listt2.ConvertAll(d1 => d1.ToLower());
}
for (int i = 0; i < listt1.Count(); i++)
{
for (int j = 0; j < listt2.Count(); j++)
{
listt1.RemoveAll(d1 => d1.Equals(listt2[j]));
}
}
listt1=listt1.Distinct().ToList();
textBox1.Text = listt1.Count().ToString();
}
}
}
}

I fixed many things up with your code. I don't think you need multi-threading:
private void RemoveStopWords()
{
HashSet<string> stopWords = new HashSet<string>();
using (var stopWordReader = new StreamReader("stopword.txt"))
{
string line2;
while ((line2 = stopWordReader.ReadLine()) != null)
{
string[] words2 = line2.Split('\n');
for (int i = 0; i < words2.Length; i++)
{
stopWords.Add(words2[i].ToLower());
}
}
}
var fileWords = new HashSet<string>();
for (int fileNumber = 1; fileNumber <= 49; fileNumber++)
{
using (var reader = new StreamReader("D" + fileNumber.ToString() + ".txt"))
{
string line;
while ((line = reader.ReadLine()) != null)
{
foreach(var word in line.Split(' '))
{
fileWords.Add(word.ToLower());
}
}
}
}
fileWords.ExceptWith(stopWords);
textBox1.Text = fileWords.Count().ToString();
}
You are reading through the list of stopwords many times as well as continually adding to the list and re-attempting to remove the same stopwords over and again due to the way your code is structured. Your needs are also better matched to a HashSet than to a List, as it has set based operations and uniqueness already handled.
If you still wanted to make this parallel, you could do it by reading the stopword list once and passing it to an async method that will read the input file, remove the stopwords and return the resulting list, then you would need to merge the resulting lists after the asynchronous calls came back, but you had better test before deciding you need that, because that is quite a bit more work and complexity than this code already has.

If I understand you correctly, you want to:
Read all words from a file into a List
Remove all "stop words" from the List
Repeat for 99 more files, saving only the unique words
If this is correct, the code is pretty simple:
// The list of words to delete ("stop words")
var stopWords = new List<string> { "remove", "these", "words" };
// The list of files to check - you can get this list in other ways
var filesToCheck = new List<string>
{
#"f:\public\temp\temp1.txt",
#"f:\public\temp\temp2.txt",
#"f:\public\temp\temp3.txt"
};
// This list will contain all the unique words from all
// the files, except the ones in the "stopWords" list
var uniqueFilteredWords = new List<string>();
// Loop through all our files
foreach (var fileToCheck in filesToCheck)
{
// Read all the file text into a varaible
var fileText = File.ReadAllText(fileToCheck);
// Split the text into distinct words (splitting on null
// splits on all whitespace) and ignore empty lines
var fileWords = fileText.Split(null)
.Where(line => !string.IsNullOrWhiteSpace(line))
.Distinct();
// Add all the words from the file, except the ones in
// your "stop list" and those that are already in the list
uniqueFilteredWords.AddRange(fileWords.Except(stopWords)
.Where(word => !uniqueFilteredWords.Contains(word)));
}
This can be condensed into a single line with no explicit loop:
// This list will contain all the unique words from all
// the files, except the ones in the "stopWords" list
var uniqueFilteredWords = filesToCheck.SelectMany(fileToCheck =>
File.ReadAllText(fileToCheck)
.Split(null)
.Where(word => !string.IsNullOrWhiteSpace(word) &&
!stopWords.Any(stopWord => stopWord.Equals(word,
StringComparison.OrdinalIgnoreCase)))
.Distinct());
This code processed over 100 files with more than 12000 words each in less than a second (WAY less than a second... 0.0001782 seconds)

One issue I see here that can help improve performance is listt1.ConvertAll() will run in O(n) on the list. You are already looping to add the items to the list, why not convert them to lower case there. Also why not store the words in a hash set, so you can do look up and insertion in O(1). You could store the list of stop words in a hash set and when you are reading your text input see if the word is a stop word and if its not add it to the hash set to output the user.

Related

How do I take each element off of a list and push it onto a stack?

string filePath = #"C:\Users\Me\Desktop\Palindromes\palindromes.txt";
List<string> lines = File.ReadAllLines(filePath).ToList();
var meStack = new Stack<string>();
for (int i = 0; i < lines.Count; i++)
{
string pali;
pali = lines.RemoveAt(i);
meStack.Push(pali[i]);
}
Basically I need to Remove each element (in the txt there are 40 lines) from the list and then Push each one onto a stack.

Why even make a list List<String>? ReadAllLines responds with a String[]. And Stack takes an array as constructor parameter... So, would code below do the job for you?
string filePath = #"C:\Users\Me\Desktop\Palindromes\palindromes.txt";
var meStack = new Stack<string>(File.ReadAllLines(filePath));

Do not RemoveAt but Clear (if necessary) the lines list at the very end:
for (int i = 0; i < lines.Count; ++i)
meStack.Push(lines[i]);
lines.Clear();
Or even (we can get rid of list at all):
string filePath = #"C:\Users\Me\Desktop\Palindromes\palindromes.txt";
var meStack = new Stack<string>();
foreach (var item in File.ReadLines(filePath))
meStack.Push(item);

You can simplify it to
lines.ForEach(meStack.Push);
lines.Clear();

Your code with some comments:
string filePath = #"C:\Users\Me\Desktop\Palindromes\palindromes.txt";
List<string> lines = File.ReadAllLines(filePath).ToList();
var meStack = new Stack<string>();
for (int i = 0; i < lines.Count; i++)
{
string pali;
pali = lines.RemoveAt(i); // < this will return AND REMOVE the line from the list.
// now, what was line i+1 is now line i, next iteration
// will return and remove (the new) line i+1, though,
// skipping one line.
meStack.Push(pali[i]); // here you push one char (the ith) of the string (the line you
// just removed) to the stack which _may_ cause an
// IndexOutOfBounds! (if "i" >= pali.Length )
}
Now since I do not want to reiterate the other (great) answers, here is one where you can actually use RemoveAt:
while( lines.Count > 0 ) // RemoveAt will decrease Count with each iteration
{
meStack.Push(lines.RemoveAt(0)); // Push the whole line that is returned.
// Mind there is hardcoded "0" -> we always remove and push the first
// item of the list.
}
Which is not the best solution, just another alternative.

Read Text file and access elements of Text file separately in c#

I am reading a text file consisting of two columns (contains x and y coordinates):
static void Main(string[] args)
{
string line;
using (StreamReader file = new StreamReader(#"C:\Point(x,y).txt"))
{
while ((line = file.ReadLine()) != null)
{
char[] delimiters = new char[] { '\t' };
string[] parts = line.Split(delimiters, StringSplitOptions.RemoveEmptyEntries);
for (int i = 0; i < parts.Length; i++)
{
Console.WriteLine(parts[i]);
}
}
}
// Suspend the screen.
Console.ReadLine();
I want to read the data in text file separately in two different variables
How do I access each value?

This is a case where Linq can come to the rescue. You can read all the lines and parse directly with Linq. I am assuming you have a tab delimited text file based on the line.Split you are doing. You can create a new anonymous type to capture row/line in your text file. You then have access to the fields as seen when writing out to the console, by looping through each item/anonymous type in the list. Make sure you have the using System.Linq namespace included which is usually there by default.
var records = File
.ReadAllLines(#"C:\Users\saira\Documents\Visual Studio 2017\Projects\Point(x,y).txt")
.Select(record => record.Split('\t'))
.Select(record => new
{
x = record[0],
y = record[1]
}).ToList();
foreach (var record in records)
{
Console.WriteLine($"The x coordinate is:{record.x}, and the y coordinate is {record.y}");
}

trouble reading and writing to a file c#

I am currently trying to take a file of words that are not in alphabetical, re-order the words so that they are in alphabetical order (I am trying to use a non-built in sort method), and then write the newly ordered list into a new txt file(one that must be created). For example, lets say there is only five words in the txt file that are as follows "dog bat apple rabbit cat". I would want the program to resort these in alphabetical order, and then create a txt file that saves that order. As of right now, the program will iterate through the txt file, but will not save the re-ordered list into the new txt file. What is saved into the new file is this... "System.Collections.Generic.List`1[System.String]"
Truth be told, I am not very savvy with c# yet, so i apologize if my structuring or coding is not very well. The original file that is un-ordered is called "jumbled english FILTERED.ALL.txt", and the file I am trying to write to is called "english FILTERED.ALL.txt".
static void Main(string[] args)
{
// declaring integer for minimum.
int min = 0;
// declare the list for the original file
List<string> LinuxWords = new List<string>();
List<string> lexicalOrder = new List<string>();
// read the text from the file
string[] lines = System.IO.File.ReadAllLines("jumbled english FILTERED.ALL.txt");
string line = string.Empty;
// seperate each word into a string
//foreach (string line in lines)
//{
//add each word into the list.
//LinuxWords.Add(line);
//}
for (int i = 0; i < lines.Length - 1; i++)
{
for (int j = i + 1; j < lines.Length; j++)
{
if (lines[i].Length < lines[j].Length)
{
min = lines[i].Length;
}
else
{
min = lines[j].Length;
}
for (int k = 0; k < min; k++)
{
if (lines[i][k] > lines[j][k])
{
line = lines[i].ToString();
lines[i] = lines[j];
lines[j] = line;
break;
}
else if (lines[i][k] == lines[j][k])
{
continue;
}
else
{
break;
}
}
}
}
for (int i = 0; i < lines.Length; i++)
{
Console.WriteLine("The program is formatting the correct order");
lexicalOrder.Add(lines[i]);
}
//lexicalOrder.ForEach(Console.WriteLine);
//}
//LinuxWords.ForEach(Console.WriteLine);
File.WriteAllText(AppDomain.CurrentDomain.BaseDirectory + "english FILTERED.ALL.txt",
lexicalOrder.ToString());
// write the ordered list back into another .txt file named "english FILTERED.ALL.txt"
// System.IO.File.WriteAllLines("english FILTERED.ALL.txt", lexicalOrder);
Console.WriteLine("Finished");
}

Assuming you mean that you don't get the list saved (if that's not the problem - please be more specific) - you need to change
lexicalOrder.ToString()
to something like
lexicalOrder.Aggregate((s1, s2) => s1 + " " + s2)

StreamWriter C# formatting output

Problem Statement
In order to run gene annotation software, I need to prepare two types of files, vcard files and coverage tables, and there has to be one-to-one match of vcard to coverage table. Since Im running 2k samples, its hard to identify which file is not one-to-one match. I know that both files have unique identifier numbers, hence, if both folders have files that have same unique numbers, i treat that as "same" file
I made a program that compares two folders and reports unique entries in each folder. To do so, I made two list that contains unique file names to each directory.
I want to format the report file (tab delimited .txt file) such that it looks something like below:
Unique in fdr1 Unique in fdr2
file x file a
file y file b
file z file c
I find this difficult to do because I have to iterate twice (since I have two lists), but there is no way of going back to the previous line in StreamWriter as far as I know. Basically, once I iterate through the first list and fill the first column, how can I fill the second column with the second list?
Can someone help me out with this?
Thanks
If design of the code has to change (i.e. one list instead of two), please let me know
As requested by some user, this is how I was going to do (not working version)
// Write report
using (StreamWriter sw = new StreamWriter(dest_txt.Text + #"\" + "Report.txt"))
{
// Write headers
sw.WriteLine("Unique Entries in Folder1" + "\t" + "Unique Entries in Folder2");
// Write unique entries in fdr1
foreach(string file in fdr1FileList)
{
sw.WriteLine(file + "\t");
}
// Write unique entries in fdr2
foreach (string file in fdr2FileList)
{
sw.WriteLine(file + "\t");
}
sw.Dispose();
}
As requested for my approach for finding unique entries, here's my code snippet
Dictionary<int, bool> fdr1Dict = new Dictionary<int, bool>();
Dictionary<int, bool> fdr2Dict = new Dictionary<int, bool>();
List<string> fdr1FileList = new List<string>();
List<string> fdr2FileList = new List<string>();
string fdr1Path = folder1_txt.Text;
string fdr2Path = folder2_txt.Text;
// File names in the specified directory; path not included
string[] fdr1FileNames = Directory.GetFiles(fdr1Path).Select(Path.GetFileName).ToArray();
string[] fdr2FileNames = Directory.GetFiles(fdr2Path).Select(Path.GetFileName).ToArray();
// Iterate through the first directory, and add GL number to dictionary
for(int i = 0; i < fdr1FileNames.Length; i++)
{
// Grabs only the number from the file name
string number = Regex.Match(fdr1FileNames[i], #"\d+").ToString();
int glNumber;
// Make sure it is a number
if(Int32.TryParse(number, out glNumber))
{
fdr1Dict[glNumber] = true;
}
// If number not present, raise exception
else
{
throw new Exception(String.Format("GL Number not found in: {0}", fdr1FileNames[i]));
}
}
// Iterate through the second directory, and add GL number to dictionary
for (int i = 0; i < fdr2FileNames.Length; i++)
{
// Grabs only the number from the file name
string number = Regex.Match(fdr2FileNames[i], #"\d+").ToString();
int glNumber;
// Make sure it is a number
if (Int32.TryParse(number, out glNumber))
{
fdr2Dict[glNumber] = true;
}
// If number not present, raise exception
else
{
throw new Exception(String.Format("GL Number not found in: {0}", fdr2FileNames[i]));
}
}
// Iterate through the first directory, and find files that are unique to it
for (int i = 0; i < fdr1FileNames.Length; i++)
{
int glNumber = Int32.Parse(Regex.Match(fdr1FileNames[i], #"\d+").Value);
// If same file is not present in the second folder add to the list
if(!fdr2Dict[glNumber])
{
fdr1FileList.Add(fdr1FileNames[i]);
}
}
// Iterate through the second directory, and find files that are unique to it
for (int i = 0; i < fdr2FileNames.Length; i++)
{
int glNumber = Int32.Parse(Regex.Match(fdr2FileNames[i], #"\d+").Value);
// If same file is not present in the first folder add to the list
if (!fdr1Dict[glNumber])
{
fdr2FileList.Add(fdr2FileNames[i]);
}

I am a quite confident that this will work as I've tested it:
static void Main(string[] args)
{
var firstDir = #"Path1";
var secondDir = #"Path2";
var firstDirFiles = System.IO.Directory.GetFiles(firstDir);
var secondDirFiles = System.IO.Directory.GetFiles(secondDir);
print2Dirs(firstDirFiles, secondDirFiles);
}
private static void print2Dirs(string[] firstDirFile, string[] secondDirFiles)
{
var maxIndex = Math.Max(firstDirFile.Length, secondDirFiles.Length);
using (StreamWriter streamWriter = new StreamWriter("result.txt"))
{
streamWriter.WriteLine(string.Format("{0,-150}{1,-150}", "Unique in fdr1", "Unique in fdr2"));
for (int i = 0; i < maxIndex; i++)
{
streamWriter.WriteLine(string.Format("{0,-150}{1,-150}",
firstDirFile.Length > i ? firstDirFile[i] : string.Empty,
secondDirFiles.Length > i ? secondDirFiles[i] : string.Empty));
}
}
}
It's a quite simple code but if you need help understanding it just let me know :)

I would construct each line at a time. Something like this:
int row = 0;
string[] fdr1FileList = new string[0];
string[] fdr2FileList = new string[0];
while (row < fdr1FileList.Length || row < fdr2FileList.Length)
{
string rowText = "";
rowText += (row >= fdr1FileList.Length ? "\t" : fdr1FileList[row] + "\t");
rowText += (row >= fdr2FileList.Length ? "\t" : fdr2FileList[row]);
row++;
}

Try something like this:
static void Main(string[] args)
{
Dictionary<int, string> fdr1Dict = FilesToDictionary(Directory.GetFiles("path1"));
Dictionary<int, string> fdr2Dict = FilesToDictionary(Directory.GetFiles("path2"));
var unique_f1 = fdr1Dict.Where(f1 => !fdr2Dict.ContainsKey(f1.Key)).ToArray();
var unique_f2 = fdr2Dict.Where(f2 => !fdr1Dict.ContainsKey(f2.Key)).ToArray();
int f1_size = unique_f1.Length;
int f2_size = unique_f2.Length;
int list_length = 0;
if (f1_size > f2_size)
{
list_length = f1_size;
Array.Resize(ref unique_f2, list_length);
}
else
{
list_length = f2_size;
Array.Resize(ref unique_f1, list_length);
}
using (StreamWriter writer = new StreamWriter("output.txt"))
{
writer.WriteLine(string.Format("{0,-30}{1,-30}", "Unique in fdr1", "Unique in fdr2"));
for (int i = 0; i < list_length; i++)
{
writer.WriteLine(string.Format("{0,-30}{1,-30}", unique_f1[i].Value, unique_f2[i].Value));
}
}
}
static Dictionary<int, string> FilesToDictionary(string[] filenames)
{
Dictionary<int, string> dict = new Dictionary<int, string>();
for (int i = 0; i < filenames.Length; i++)
{
int glNumber;
string filename = Path.GetFileName(filenames[i]);
string number = Regex.Match(filename, #"\d+").ToString();
if (int.TryParse(number, out glNumber))
dict.Add(glNumber, filename);
}
return dict;
}

Count the number of words within an Array or List

I need to count the number of words within an array or a list. The reason I say array or list is because I am not sure which would be the best to use in this situation. The data is static and in a .txt file (It's actually a book). I was able to create an array and break down words from the array but for the life of me I can not count! I have tried many different ways to do this and I'm thinking since it is a string it is unable to count. I have even teetered on the edge of just printing the whole book to a listbox and counting from the listbox but, that's ridiculous.
public partial class mainForm : Form
{
//------------------------
//GLOBAL VARIABLES:
//------------------------
List<string> countWords;
string[] fileWords;
string[] fileLines;
char[] delim = new char[] { ' ', ',','.','?','!' };
string path;
public mainForm()
{
InitializeComponent();
}
private void BookTitle() // TiTleAndAuthor Method will pull the Book Title and display it.
{
for (int i = 0; i < 1; i++)
{
bookTitleLabel.Text = fileLines[i];
}
}
private void BookAuthor() // TiTleAndAuthor Method will pull the Book Author and display it.
{
for (int i = 1; i < 2; i++)
{
bookAuthorLabel.Text = fileLines[i];
}
}
private void FirstLines() // FirstTenWords Method pulls the first ten words of any text file and prints the to a ListBox
{
for (int i = 0; i <= 499; i++)
{
wordsListBox.Items.Add(fileWords[i]);
}
}
private void WordCount() // Count all the words in the file.
{
}
private void openFileButton_Click(object sender, EventArgs e)
{
OpenFileDialog inputFile = new OpenFileDialog();
if (inputFile.ShowDialog() == DialogResult.OK) // check the file the user selected
{
path = inputFile.FileName; // save that path of the file to a string variable for later use
StreamReader fileRead = new StreamReader(path); // read a file at the path outlined in the path variable
fileWords = fileRead.ReadToEnd().Split(delim); // Breakdown the text into lines of text to call them at a later date
fileLines = File.ReadAllLines(path);
countWords = File.ReadLines(path).ToList();
wordsListBox.Items.Clear();
BookTitle();
BookAuthor();
FirstLines();
WordCount();
}
else
{
MessageBox.Show("Not a valid file, please select a text file");
}
}
}

Maybe this is useful:
static void Main(string[] args)
{
string[] lines = File_ReadAllLines();
List<string> words = new List<string>();
foreach(var line in lines)
{
words.AddRange(line.Split(' '));
}
Console.WriteLine(words.Count);
}
private static string[] File_ReadAllLines()
{
return new[] {
"The one book",
"written by gnarf",
"once upon a time ther werent any grammer",
"iso 1-12122-445",
"(c) 2012 under the hills"
};
}

Before I get to the answer, a quick observation on some of the loops:
for (int i = 1; i < 2; i++)
{
bookAuthorLabel.Text = fileLines[i];
}
This'll only run once, so it's pointless to have it in a loop (unless you intended this to actually loop through the whole list, in which case it's a bug). If this is the expected behavior, you might as well just do
bookAuthorLabel.Text = fileLines[1];
You have something similar here:
for (int i = 0; i < 1; i++)
{
bookTitleLabel.Text = fileLines[i];
}
Again, this is pointless.
Now for the answer itself. I'm not sure if you're trying to get total word count or count of individual words, so here's a code sample for doing both:
private static void CountWords()
{
const string fileName = "CountWords.txt";
// Create a dummy file
using (var sw = new StreamWriter(fileName))
{
sw.WriteLine("This is a short sentence");
sw.WriteLine("This is a long sentence");
}
string text = File.ReadAllText(fileName);
string[] result = text.Split(new[] { " ", Environment.NewLine }, StringSplitOptions.RemoveEmptyEntries);
// Total word count
Console.WriteLine("Total count: " + result.Count().ToString());
// Now to illustrate getting the count of individual words
var dictionary = new Dictionary<string, int>();
foreach (string word in result)
{
if (dictionary.ContainsKey(word))
{
dictionary[word]++;
}
else
{
dictionary[word] = 1;
}
}
foreach (string key in dictionary.Keys)
{
Console.WriteLine(key + ": " + dictionary[key].ToString());
}
}
This should be easy to adapt to your particular needs in this case.

Read text file line by line. split by empty character and remove unnecessary spaces. sum this count to total.
var totalWords = 0;
using (StreamReader sr = new StreamReader("abc.txt"))
{
while (!sr.EndOfStream)
{
int count = sr
.ReadLine()
.Split(new char[] {' '}, StringSplitOptions.RemoveEmptyEntries).Count();
totalWords += count;
}
You can also use the below code:
totalWords = fileRead.ReadToEnd().Split(delim, StringSplitOptions.RemoveEmptyEntries).Length;

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Read and process 100 text files in c# in parallel - c#

Related

How do I take each element off of a list and push it onto a stack?

Read Text file and access elements of Text file separately in c#

trouble reading and writing to a file c#

StreamWriter C# formatting output

Count the number of words within an Array or List

Categories

Resources