Get first value in CSV column without duplicates


I am getting a list of items from a csv file via a Web Api using this code:
private List<Item> items = new List<Item>();
public ItemRepository()
{
string filename = HttpRuntime.AppDomainAppPath + "App_Data\\items.csv";
var lines = File.ReadAllLines(filename).Skip(1).ToList();
for (int i = 0; i < lines.Count; i++)
{
var line = lines[i];
var columns = line.Split('$');
//get rid of newline characters in the middle of data lines
while (columns.Length < 9)
{
i += 1;
line = line.Replace("\n", " ") + lines[i];
columns = line.Split('$');
}
//Remove Starting and Trailing open quotes from fields
columns = columns.Select(c => { if (string.IsNullOrEmpty(c) == false) { return c.Substring(1, c.Length - 2); } return string.Empty; }).ToArray();
var temp = columns[5].Split('|', '>');
items.Add(new Item()
{
Id = int.Parse(columns[0]),
Name = temp[0],
Description = columns[2],
Photo = columns[7]
});
}
}
The Name attribute of each item must come from a column whose structure is as follows:
Groups>Subgroup>item
Therefore I use var temp = columns[5].Split('|', '>'); in my code to get the first element of the column before the ">", which in the above case is Groups. And this works fine.
However, I am getting many duplicates in the result. This is because other items in the column may be:
(These are some of the entries in my csv column 9)
Groups>Subgroup2>item2, Groups>Subgroup3>item4, Groups>Subgroup4>item9
All start with Groups, but I only want to get Groups once.
As it is I get a long list of Groups. How do I stop the duplicates?
I want that if an Item in the list is returned with the Name "Groups", that no other item with that name would be returned. How do I make this check and implement it?

If you are successfully getting the list of groups, take that list of groups and use LINQ:
var undupedList = dupedList
.Distinct();
Update: The reason Distinct() did not work is that your code requests not just Name but also Description, etc. If you only ask for Name, Distinct() will work.
Update 2: Try this:
//Check whether an item with this name already exists
if (!items.Any(q => q.Name == temp[0]))
{
items.Add(...);
}

How about using a List to store Item.Name?
Then check List.Contains() before calling items.Add()
Simple, only 3 lines of code, and it works.
IList<string> listNames = new List<string>();
//
for (int i = 0; i < lines.Count; i++)
{
//
var temp = columns[5].Split('|', '>');
if (!listNames.Contains(temp[0]))
{
listNames.Add(temp[0]);
items.Add(new Item()
{
//
});
}
}
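A variation on the same idea, as a minimal sketch: a HashSet<string> can replace the List, because HashSet.Add returns false when the value is already present, so the check and the insert become one call. The Item class and the sample column values below are hypothetical stand-ins for the question's data.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, trimmed-down version of the question's Item class.
class Item
{
    public int Id { get; set; }
    public string Name { get; set; }
}

static class Program
{
    static void Main()
    {
        // Hypothetical stand-ins for the already-cleaned values of columns[5].
        var categoryColumn = new[]
        {
            "Groups>Subgroup>item",
            "Groups>Subgroup2>item2",
            "Tools>Subgroup3>item4"
        };

        var items = new List<Item>();
        var seenNames = new HashSet<string>();

        for (int i = 0; i < categoryColumn.Length; i++)
        {
            string name = categoryColumn[i].Split('|', '>')[0];

            // Add returns false when the name is already in the set,
            // so only the first row for each name produces an Item.
            if (seenNames.Add(name))
            {
                items.Add(new Item { Id = i, Name = name });
            }
        }

        foreach (var item in items)
        {
            Console.WriteLine(item.Name); // prints "Groups", then "Tools"
        }
    }
}
```

For a few hundred rows the List version is just as good; the set mainly helps when the file is large, since Contains on a List scans every element.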

Related

How to read all lines from notepad comma separated and updated specific column conditionally in C#

How can we read all lines from a comma-separated text file and update a specific column conditionally? The code below iterates without any error, but the value is not updated.
string[] existingLines = File.ReadAllLines(filepath);
foreach (var row in existingLines)
{
row.Split(Text_Separator)[0] = "Test Data";
}
var newdata = existingLines;

You need a for loop to replace an element of the array while iterating over it. With foreach, the iteration variable is read-only, and in your code the assignment only changes the temporary array returned by Split, so existingLines never changes.
The code below iterates through each row, modifies a column, and replaces the current row.
string[] existingLines = File.ReadAllLines(filepath);
for (var i = 0; i < existingLines.Length; i++)
{
// retrieve row by index
var row = existingLines[i];
// split into array of columns
var columns = row.Split(Text_Separator);
// update column
columns[0] = "Test Data";
// create row from array of columns
var updatedRow = string.Join(Text_Separator, columns);
// update row in array of rows
existingLines[i] = updatedRow;
}
var newdata = existingLines;

The String.Split method creates a new string array that is not referenced by the existingLines variable. Hence, updating a value in the array returned from Split is not reflected in existingLines. This is how Split builds its result:
String[] stringArray = splitStrings;
if( arrIndex!= maxItems) {
stringArray = new String[arrIndex];
for( int j = 0; j < arrIndex; j++) {
stringArray[j] = splitStrings[j];
}
}
return stringArray;
https://github.com/Microsoft/referencesource/blob/master/mscorlib/system/string.cs
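The effect can be seen in a small sketch. The sample rows and the comma separator are assumptions for illustration; the question uses its own Text_Separator.

```csharp
using System;

static class Program
{
    static void Main()
    {
        // Hypothetical sample data standing in for File.ReadAllLines(filepath).
        string[] rows = { "a,b,c", "d,e,f" };

        // Split allocates a new array; writing to it leaves rows[0] untouched.
        rows[0].Split(',')[0] = "Test Data";
        Console.WriteLine(rows[0]); // prints "a,b,c"

        // Keep a reference to the array, modify it, then join and store it back.
        string[] columns = rows[0].Split(',');
        columns[0] = "Test Data";
        rows[0] = string.Join(",", columns);
        Console.WriteLine(rows[0]); // prints "Test Data,b,c"
    }
}
```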

Read and process 100 text files in c# in parallel

I have a project that reads 100 text files with 5000 words in each.
I insert the words into a list. I have a second list that contains English stop words. I compare the two lists and delete the stop words from the first list.
It takes 1 hour to run the application. I want to parallelize it. How can I do that?
Here's my code:
private void button1_Click(object sender, EventArgs e)
{
List<string> listt1 = new List<string>();
string line;
for (int ii = 1; ii <= 49; ii++)
{
string d = ii.ToString();
using (StreamReader reader = new StreamReader(@"D" + d.ToString() + ".txt"))
while ((line = reader.ReadLine()) != null)
{
string[] words = line.Split(' ');
for (int i = 0; i < words.Length; i++)
{
listt1.Add(words[i].ToString());
}
}
listt1 = listt1.ConvertAll(d1 => d1.ToLower());
StreamReader reader2 = new StreamReader("stopword.txt");
List<string> listt2 = new List<string>();
string line2;
while ((line2 = reader2.ReadLine()) != null)
{
string[] words2 = line2.Split('\n');
for (int i = 0; i < words2.Length; i++)
{
listt2.Add(words2[i]);
}
listt2 = listt2.ConvertAll(d1 => d1.ToLower());
}
for (int i = 0; i < listt1.Count(); i++)
{
for (int j = 0; j < listt2.Count(); j++)
{
listt1.RemoveAll(d1 => d1.Equals(listt2[j]));
}
}
listt1=listt1.Distinct().ToList();
textBox1.Text = listt1.Count().ToString();
}
}
}
}

I fixed many things up with your code. I don't think you need multi-threading:
private void RemoveStopWords()
{
HashSet<string> stopWords = new HashSet<string>();
using (var stopWordReader = new StreamReader("stopword.txt"))
{
string line2;
while ((line2 = stopWordReader.ReadLine()) != null)
{
string[] words2 = line2.Split('\n');
for (int i = 0; i < words2.Length; i++)
{
stopWords.Add(words2[i].ToLower());
}
}
}
var fileWords = new HashSet<string>();
for (int fileNumber = 1; fileNumber <= 49; fileNumber++)
{
using (var reader = new StreamReader("D" + fileNumber.ToString() + ".txt"))
{
string line;
while ((line = reader.ReadLine()) != null)
{
foreach(var word in line.Split(' '))
{
fileWords.Add(word.ToLower());
}
}
}
}
fileWords.ExceptWith(stopWords);
textBox1.Text = fileWords.Count().ToString();
}
You are reading through the list of stopwords many times as well as continually adding to the list and re-attempting to remove the same stopwords over and again due to the way your code is structured. Your needs are also better matched to a HashSet than to a List, as it has set based operations and uniqueness already handled.
If you still wanted to make this parallel, you could do it by reading the stopword list once and passing it to an async method that will read the input file, remove the stopwords and return the resulting list, then you would need to merge the resulting lists after the asynchronous calls came back, but you had better test before deciding you need that, because that is quite a bit more work and complexity than this code already has.
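If you do want to try the parallel route, one possible shape is PLINQ rather than hand-written async methods. This is a sketch under assumptions, not a measured improvement: it reuses the D1.txt to D49.txt naming and stopword.txt from the question, and it drops empty strings produced by repeated spaces, which the original code would have kept.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class Program
{
    static void Main()
    {
        // File names follow the question's D1.txt ... D49.txt pattern.
        var files = Enumerable.Range(1, 49).Select(n => "D" + n + ".txt").ToList();

        // Read the stop words once. Concurrent reads of a HashSet are safe
        // as long as nothing writes to it while the query runs.
        var stopWords = new HashSet<string>(
            File.ReadLines("stopword.txt").Select(w => w.Trim().ToLowerInvariant()));

        int uniqueCount = files
            .AsParallel()                                  // files are processed on several threads
            .SelectMany(path => File.ReadLines(path))
            .SelectMany(line => line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
            .Select(word => word.ToLowerInvariant())
            .Where(word => !stopWords.Contains(word))
            .Distinct()                                    // PLINQ merges the per-thread results
            .Count();

        Console.WriteLine(uniqueCount);
    }
}
```

Since the work is mostly disk reads and cheap string operations, the gain over the single-threaded HashSet version may be small, so measure both before keeping the extra complexity.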

If I understand you correctly, you want to:
Read all words from a file into a List
Remove all "stop words" from the List
Repeat for 99 more files, saving only the unique words
If this is correct, the code is pretty simple:
// The list of words to delete ("stop words")
var stopWords = new List<string> { "remove", "these", "words" };
// The list of files to check - you can get this list in other ways
var filesToCheck = new List<string>
{
@"f:\public\temp\temp1.txt",
@"f:\public\temp\temp2.txt",
@"f:\public\temp\temp3.txt"
};
// This list will contain all the unique words from all
// the files, except the ones in the "stopWords" list
var uniqueFilteredWords = new List<string>();
// Loop through all our files
foreach (var fileToCheck in filesToCheck)
{
// Read all the file text into a variable
var fileText = File.ReadAllText(fileToCheck);
// Split the text into distinct words (splitting on null
// splits on all whitespace) and ignore empty lines
var fileWords = fileText.Split(null)
.Where(line => !string.IsNullOrWhiteSpace(line))
.Distinct();
// Add all the words from the file, except the ones in
// your "stop list" and those that are already in the list
uniqueFilteredWords.AddRange(fileWords.Except(stopWords)
.Where(word => !uniqueFilteredWords.Contains(word)));
}
This can be condensed into a single statement with no explicit loop:
// This list will contain all the unique words from all
// the files, except the ones in the "stopWords" list
var uniqueFilteredWords = filesToCheck.SelectMany(fileToCheck =>
File.ReadAllText(fileToCheck)
.Split(null)
.Where(word => !string.IsNullOrWhiteSpace(word) &&
!stopWords.Any(stopWord => stopWord.Equals(word,
StringComparison.OrdinalIgnoreCase))))
.Distinct();
This code processed over 100 files with more than 12000 words each in less than a second (WAY less than a second... 0.0001782 seconds)

One thing that would improve performance: listt1.ConvertAll() makes an extra O(n) pass over the whole list each time it is called. You are already looping to add the items to the list, so convert each word to lower case there. Also consider storing the words in a hash set, which gives O(1) lookup and insertion. You could store the stop words in a hash set as well and, while reading the input, add a word to the output set only if it is not a stop word.

How to Merge items within a List<> collection C#

I have an implementation where I need to loop through a collection of documents and merge them based on a certain condition.
The merge condition is very simple: if the present document's doctype is the same as the later document's doctype, copy all the pages from the later document, append them to the present document's pages, and remove the later document from the collection.
Note: Both response.documents and response.documents[].pages are List<> collections.
I tried the code below, but I get the following exception once I remove a document:
Collection was modified; enumeration operation may not execute.
Here is the code:
int docindex = 0;
foreach( var document in response.documents)
{
string presentDoctype = string.Empty;
string laterDoctype = string.Empty;
presentDoctype = response.documents[docindex].doctype;
laterDoctype = response.documents[docindex + 1].doctype;
if (laterDoctype == presentDoctype)
{
response.documents[docindex].pages.AddRange(response.documents[docindex + 1].pages);
response.documents.RemoveAt(docindex + 1);
}
docindex = docindex + 1;
}
Ex:
response.documents[0].doctype = "BankStatement" //page count = 1
response.documents[1].doctype = "BankStatement" //page count = 2
response.documents[2].doctype = "BankStatement" //page count = 2
response.documents[3].doctype = "BankStatement" //page count = 1
response.documents[4].doctype = "BankStatement" //page count = 4
Expected result:
response.documents[0].doctype = "BankStatement" //page count = 10
Please suggest. Appreciate your help.

I would recommend you to look at LINQ GroupBy and Distinct to process your response.documents
Example (as I cannot use your class, I give example using my own defined class):
Suppose you have DummyClass
public class DummyClass {
public int DummyInt;
public string DummyString;
public double DummyDouble;
public DummyClass() {
}
public DummyClass(int dummyInt, string dummyString, double dummyDouble) {
DummyInt = dummyInt;
DummyString = dummyString;
DummyDouble = dummyDouble;
}
}
Then doing GroupBy as shown,
DummyClass dc1 = new DummyClass(1, "This dummy", 2.0);
DummyClass dc2 = new DummyClass(2, "That dummy", 2.0);
DummyClass dc3 = new DummyClass(1, "These dummies", 2.0);
DummyClass dc4 = new DummyClass(2, "Those dummies", 2.0);
DummyClass dc5 = new DummyClass(3, "The dummies", 2.0);
List<DummyClass> dummyList = new List<DummyClass>() { dc1, dc2, dc3, dc4, dc5 };
var groupedDummy = dummyList.GroupBy(x => x.DummyInt).ToList();
Will create three groups, marked by DummyInt
Then to process the group you could do
for (int i = 0; i < groupedDummy.Count; ++i){
foreach (DummyClass dummy in groupedDummy[i]) { //this processes the group at index i
//do something on this group
//groupedDummy[0] will consist of "This dummy" and "These dummies", [1] of "That dummy" and "Those dummies", and [2] of "The dummies"
//Try it out!
}
}
In your case, you should create group based on doctype.
Once you create groups based on your doctype, everything else would be pretty "natural" for you to continue.
Another LINQ method which you might be interested in would be Distinct. But I think for this case, GroupBy would be the primary method you would like to use.
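Applied to documents, the grouping might look like the sketch below. The Document class is a hypothetical stand-in for the question's type, with string pages for brevity. Note that GroupBy merges every document that shares a doctype, not only neighbouring ones, which may or may not be what the question intends.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the question's document type.
class Document
{
    public string doctype;
    public List<string> pages = new List<string>();
}

static class Program
{
    static void Main()
    {
        var documents = new List<Document>
        {
            new Document { doctype = "BankStatement", pages = { "p1" } },
            new Document { doctype = "BankStatement", pages = { "p2", "p3" } },
            new Document { doctype = "Payslip", pages = { "p4" } }
        };

        // One merged document per doctype, with the pages concatenated
        // in their original order.
        var merged = documents
            .GroupBy(d => d.doctype)
            .Select(g => new Document
            {
                doctype = g.Key,
                pages = g.SelectMany(d => d.pages).ToList()
            })
            .ToList();

        foreach (var d in merged)
        {
            Console.WriteLine(d.doctype + ": " + d.pages.Count);
        }
        // prints "BankStatement: 3", then "Payslip: 1"
    }
}
```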

Use a for loop instead of foreach.
foreach enumerates the collection, and a List<> cannot be modified while it is being enumerated.
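A minimal sketch of the index-based approach, written as a while loop over an index so that the index only advances when nothing was removed. The Document class and the Doc helper are hypothetical; the sample data mirrors the page counts from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the question's document type.
class Document
{
    public string doctype;
    public List<string> pages = new List<string>();
}

static class Program
{
    // Hypothetical helper that builds a document with the given number of pages.
    static Document Doc(string type, int pageCount)
    {
        return new Document
        {
            doctype = type,
            pages = Enumerable.Repeat("page", pageCount).ToList()
        };
    }

    static void Main()
    {
        var documents = new List<Document>
        {
            Doc("BankStatement", 1), Doc("BankStatement", 2), Doc("BankStatement", 2),
            Doc("BankStatement", 1), Doc("BankStatement", 4)
        };

        int i = 0;
        while (i < documents.Count - 1)
        {
            if (documents[i].doctype == documents[i + 1].doctype)
            {
                // Absorb the next document and stay on the same index,
                // because a new neighbour has just moved into position i + 1.
                documents[i].pages.AddRange(documents[i + 1].pages);
                documents.RemoveAt(i + 1);
            }
            else
            {
                i++;
            }
        }

        Console.WriteLine(documents.Count + " document(s), " + documents[0].pages.Count + " pages");
        // prints "1 document(s), 10 pages"
    }
}
```

The loop bound of Count - 1 also avoids the out-of-range access that docindex + 1 causes on the last element in the original code.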

Here is an example using GroupBy; hope this helps.
//mock a collection
ICollection<string> collection1 = new List<string>();
for (int i = 0; i < 10; i++)
{
collection1.Add("BankStatement");
}
for (int i = 0; i < 5; i++)
{
collection1.Add("BankStatement2");
}
for (int i = 0; i < 4; i++)
{
collection1.Add("BankStatement3");
}
//merge and get count
var result = collection1.GroupBy(c => c).Select(c => new { name = c.First(), count = c.Count().ToString() }).ToList();
foreach (var item in result)
{
Console.WriteLine(item.name + ": " + item.count);
}

Just use AddRange()
response.documents[0].pages.AddRange(response.documents[1].pages);
This appends all pages of documents[1] to the pages of documents[0].

How do I make the foreach instruction iterate in 2 places?

How do I make the foreach instruction iterate over both the "files" variable and the "names" array?
var files = Directory.GetFiles(@".\GalleryImages");
string[] names = new string[8] { "Matt", "Joanne", "Robert","Andrei","Mihai","Radu","Ionica","Vasile"};
I've tried 2 options.. the first one gives me lots of errors and the second one displays 8 images of each kind
foreach(var file in files,var i in names)
{
//Do stuff
}
and
foreach(var file in files)
{
foreach (var i in names)
{
//Do stuff
}
}

You can try using the Zip Extension method of LINQ:
int[] numbers = { 1, 2, 3, 4 };
string[] words = { "one", "two", "three" };
var numbersAndWords = numbers.Zip(words, (first, second) => first + " " + second);
foreach (var item in numbersAndWords)
Console.WriteLine(item);
Would look something like this:
var files = Directory.GetFiles(@".\GalleryImages");
string[] names = new string[] { "Matt", "Joanne", "Robert", "Andrei", "Mihai","Radu","Ionica","Vasile"};
var zipped = files.Zip(names, (f, n) => new { File = f, Name = n });
foreach(var fn in zipped)
Console.WriteLine(fn.File + " " + fn.Name);
But I haven't tested this one.

It's not clear what you're asking. But, you can't iterate two iterators with foreach; but you can increment another variable in the foreach body:
int i = 0;
foreach(var file in files)
{
var name = names[i++];
// TODO: do something with name and file
}
This, of course, assumes that files and names are of the same length.

You can't. Use a for loop instead.
for(int i = 0; i < files.Length; i++)
{
var file = files[i];
var name = names[i];
}
If both arrays have the same length, this should work.

You have two options here; the first works if you are iterating over something that has an indexer, like an array or List, in which case use a simple for loop and access things by index:
for (int i = 0; i < files.Length && i < names.Length; i++)
{
var file = files[i];
var name = names[i];
// Do stuff with names.
}
If you have a collection that doesn't have an indexer, e.g. you just have an IEnumerable and you don't know what it is, you can use the IEnumerable interface directly. Behind the scenes, that's all foreach is doing, it just hides the slightly messier syntax. That would look like:
var filesEnum = files.GetEnumerator();
var namesEnum = names.GetEnumerator();
while (filesEnum.MoveNext() && namesEnum.MoveNext())
{
var file = filesEnum.Current;
var name = namesEnum.Current;
// Do stuff with files and names.
}
Both versions stop as soon as the shorter collection runs out: the for loop's condition checks both lengths, and MoveNext returns false for the enumerator that is exhausted first. If one collection is bigger than the other, the 'extra' items won't get processed, and you'll need to figure out what to do with them.

I guess the files array and the names array have the same indices.
When this is the case AND you always want the same index at one time you do this:
for (int key = 0; key < files.Length; ++key)
{
// access names[key] and files[key] here
}

You can try something like this:
var pairs = files.Zip(names, (f,n) => new {File=f, Name=n});
foreach (var item in pairs)
{
Console.Write(item.File);
Console.Write(item.Name);
}

List sorting by multiple parameters

I have a .csv with the following headers and an example line from the file.
AgentID,Profile,Avatar,In_Time,Out_Time,In_Location,Out_Location,Target_Speed(m/s),Distance_Traveled(m),Congested_Duration(s),Total_Duration(s),LOS_A_Duration(s),LOS_B_Duration(s),LOS_C_Duration(s),LOS_D_Duration(s),LOS_E_Duration(s),LOS_F_Duration(s)
2177,DefaultProfile,DarkGreen_LowPoly,08:00:00,08:00:53,East12SubwayportalActor,EWConcourseportalActor,1.39653,60.2243,5.4,52.8,26.4,23,3.4,0,0,0
I need to sort this .csv by the 4th column (In_time) by increasing time ( 08:00:00, 08:00:01) and the 6th (In_Location) by alphabetical direction (e.g. East, North, etc).
So far my code looks like this:
List<string> list = new List<string>();
using (StreamReader reader = new StreamReader("JourneyTimes.csv"))
{
string line;
while ((line = reader.ReadLine()) != null)
{
line.Split(',');
list.Add(line);
}
I read in the .csv and split it using a comma (there are no other commas so this is not a concern). I then add each line to a list. My issue is how do I sort the list on two parameters and by the headers of the .csv.
I have been looking all day at this, I am relatively new to programming, this is my first program so I apologize for my lack of knowledge.

You can use LINQ OrderBy/ThenBy:
e.g.
listOfObjects.OrderBy (c => c.LastName).ThenBy (c => c.FirstName)
But first off, you should map your CSV line to some object.
To map CSV line to object you can predefine some type or create it dynamically
var rows =
from line in File.ReadLines(fileName).Skip(1) //header
let columns = line.Split(',') //really basic CSV parsing, consider removing empty entries and supporting quotes
select new
{
AgentID = int.Parse(columns[0]),
Profile = columns[1],
Avatar = columns[2]
//other properties
};
And be aware that, like many other LINQ methods, OrderBy and ThenBy use deferred execution: nothing is sorted until the result is enumerated.
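Putting the pieces together for the file in the question, a sketch might look like this. It assumes the file name JourneyTimes.csv from the question, that In_Time is the 4th column and In_Location the 6th, and that no field contains a comma; the property names are made up for the example.

```csharp
using System;
using System.Globalization;
using System.IO;
using System.Linq;

static class Program
{
    static void Main()
    {
        var sorted = File.ReadLines("JourneyTimes.csv")
            .Skip(1)                                        // skip the header row
            .Where(line => !string.IsNullOrWhiteSpace(line))
            .Select(line => line.Split(','))
            .Select(columns => new
            {
                // "08:00:53" parses as hours:minutes:seconds.
                InTime = TimeSpan.Parse(columns[3], CultureInfo.InvariantCulture),
                InLocation = columns[5],
                Columns = columns
            })
            .OrderBy(row => row.InTime)
            .ThenBy(row => row.InLocation, StringComparer.Ordinal)
            .ToList();                                      // runs the deferred query once

        foreach (var row in sorted)
        {
            Console.WriteLine(string.Join(",", row.Columns));
        }
    }
}
```

Parsing In_Time into a TimeSpan makes the ordering chronological rather than textual, and the ToList call keeps the file from being re-read each time the result is enumerated.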

You are dealing with two distinct problems.
First, ordering two columns in C# can be achieved with OrderBy, ThenBy
public class SpreadsheetExample
{
public DateTime InTime { get; set; }
public string InLocation { get; set; }
public SpreadsheetExample(DateTime inTime, string inLocation)
{
InTime = inTime;
InLocation = inLocation;
}
public static List<SpreadsheetExample> LoadMockData()
{
int maxMock = 10;
Random random = new Random();
var result = new List<SpreadsheetExample>();
for (int mockCount = 0; mockCount < maxMock; mockCount++)
{
var genNumber = random.Next(1, maxMock);
var genDate = DateTime.Now.AddDays(genNumber);
result.Add(new SpreadsheetExample(genDate, "Location" + mockCount));
}
return result;
}
}
internal class Class1
{
private static void Main()
{
var mockData = SpreadsheetExample.LoadMockData();
var orderedResult = mockData.OrderBy(m => m.InTime).ThenBy(m => m.InLocation);//Order, ThenBy can be used to perform ordering of two columns
foreach (var item in orderedResult)
{
Console.WriteLine("{0} : {1}", item.InTime, item.InLocation);
}
}
}
Now you can tackle the second issue of moving data into a class from Excel. VSTO is what you are looking for. There are lots of examples online. Follow the example I posted above, and substitute your own class for SpreadsheetExample.

You may use a DataTable:
var lines = File.ReadAllLines("test.csv");
DataTable dt = new DataTable();
var columNames = lines[0].Split(new char[] { ',' });
for (int i = 0; i < columNames.Length; i++)
{
dt.Columns.Add(columNames[i]);
}
for (int i = 1; i < lines.Length; i++)
{
dt.Rows.Add(lines[i].Split(new char[] { ',' }));
}
var rows = dt.Rows.Cast<DataRow>();
var result = rows.OrderBy(i => i["In_Time"])
.ThenBy(i => i["In_Location"]);
// sum
var sum = rows.Sum(i => Int32.Parse(i["AgentID"].ToString()));
