I have a very big CSV file with 244 columns and 4000 rows.
There are a lot of \n\r sequences inside the data, so when I split on them (to find the end of a line) I get around 9000 rows instead of the expected 4000.
So how can I determine which \n\r is inside text (or at the end of a cell) and which one definitively ends a line?
When a CSV field contains \n, \r or a comma, the value is usually wrapped in quotes. To parse CSV correctly I would recommend an existing parser. See this answer as an example.
If you truly want to do it yourself, you have to write a simple state machine that reads the data field by field. While reading a field you have to honor the escaping rules. Only that way can you distinguish line endings inside data from line endings that separate rows.
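A minimal sketch of such a state machine, assuming comma-separated fields, double quotes as the quoting character, and a doubled quote ("") as an escaped quote inside a quoted field. The class and method names (CsvSketch, Parse) are made up for illustration, and the code is not a full CSV implementation: for example, a blank line yields a row with one empty field.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class CsvSketch
{
    // Parses CSV text into rows of fields. Line breaks inside a quoted
    // field are kept as data; only unquoted line breaks end a row.
    public static List<List<string>> Parse(string text)
    {
        var rows = new List<List<string>>();
        var row = new List<string>();
        var field = new StringBuilder();
        bool inQuotes = false;

        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (inQuotes)
            {
                if (c == '"' && i + 1 < text.Length && text[i + 1] == '"')
                {
                    field.Append('"'); // doubled quote = literal quote
                    i++;
                }
                else if (c == '"') inQuotes = false; // closing quote
                else field.Append(c);                // commas and line breaks are data here
            }
            else if (c == '"') inQuotes = true;      // opening quote
            else if (c == ',')
            {
                row.Add(field.ToString());           // end of field
                field.Clear();
            }
            else if (c == '\r' || c == '\n')
            {
                // Treat \r\n as a single line break.
                if (c == '\r' && i + 1 < text.Length && text[i + 1] == '\n') i++;
                row.Add(field.ToString());           // end of field and of row
                field.Clear();
                rows.Add(row);
                row = new List<string>();
            }
            else field.Append(c);
        }

        // Flush the last row if the text does not end with a line break.
        if (field.Length > 0 || row.Count > 0)
        {
            row.Add(field.ToString());
            rows.Add(row);
        }
        return rows;
    }
}
```

With this approach the row count comes from unquoted line breaks only, so a file like the one in the question should yield its real number of rows as long as the cells containing line breaks are quoted.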
Try using Environment.NewLine for splitting instead of \n\r:
string path = yourfilepath;
string csv = System.IO.File.ReadAllText(path);
List<string> rows = csv.Split(new string[] {Environment.NewLine }, System.StringSplitOptions.RemoveEmptyEntries).ToList();
Related
I have this little project in C# where I am manipulating with files. Now my task is that I have to delete specific rows from files.
For example my file looks like this:
1-this is the first line
2-this is the second line
3-this is the third line
4-this is the fourth line
Now how can I keep only the first two rows and delete only the last two rows?
Note- this is how I read the file from my local machine:
string[] lines = File.ReadAllLines(@"C:\Users\admin\Desktop\COMMANDS.dat");
I have tried something like this but I think it's not so "efficient"
string text = File.ReadAllText(@"C:\Users\admin\Desktop\COMMANDS.dat");
text = text.Replace(lines[2], "");
text = text.Replace(lines[3], "");
File.WriteAllText(@"C:\Users\admin\Desktop\COMMANDS.dat", text);
This does the job in the sense that it replaces each of those lines with an empty string, but when I look at the file it still has 4 lines: 2 with real text and 2 that are empty. I don't want the empty lines. Can I do this another way?
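Try removing each line together with the line break in front of it, so that no empty line is left behind:
string text = File.ReadAllText(@"C:\Users\admin\Desktop\COMMANDS.dat");
// lines is the array from File.ReadAllLines in the question.
// This assumes the file uses the platform line break (Environment.NewLine).
text = text.Replace(Environment.NewLine + lines[2], "");
text = text.Replace(Environment.NewLine + lines[3], "");
File.WriteAllText(@"C:\Users\admin\Desktop\COMMANDS.dat", text);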
async Task Example()
{
    var inputLines = await File.ReadAllLinesAsync("path/to/file.txt");
    var outputLines = inputLines.Where((l, i) => i < 2);
    await File.WriteAllLinesAsync("target/file.txt", outputLines);
}
What it does
Read data but not as one string but as a collection of lines
Create a new collection containing only the lines you want in your output
Write the filtered lines
Notes:
This example is not optimized for memory usage: it reads all lines at once, so for larger files (e.g. multiple GB) it will fail. See the existing answers for a memory-optimized version. But it's totally fine to do it this way if you know you have just a few thousand lines (and it's faster).
Try not to "modify" strings. This will always create a copy and needs a lot of memory.
In this "Linq style" (functional) approach, we should treat data as immutable. That means we have one variable that represents the input file and one variable that represents the result. We use declarative Linq to describe what the output should look like: "output is input where the index filter < 2 matches" instead of "if xy, remove line" in an imperative style.
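For the memory-optimized case mentioned in the notes, one possible sketch uses File.ReadLines, which reads lazily, instead of File.ReadAllLines. The class and method names (FirstLines, Keep) are made up for illustration, and the target path must differ from the source path because the source is still open while the target is written.

```csharp
using System.IO;
using System.Linq;

static class FirstLines
{
    // Copies only the first `count` lines of `source` to `target`.
    public static void Keep(string source, string target, int count)
    {
        // File.ReadLines is lazy, so Take(count) stops reading after
        // `count` lines and the rest of the file is never loaded.
        File.WriteAllLines(target, File.ReadLines(source).Take(count));
    }
}
```

To replace the original file, you could write to a temporary file first and then move it over the source once the copy has finished.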
I have the following problem:
I am trying to split the rows of a CSV file but the thing is that sometimes I read the following line:
string input = "a,b,c,d,\"V=12.503,I=0.194\",e,f";
I use the following code
string[] SplittedLine = input.Split(',');
The result is that I get an extra column because the data \"V=12.503,I=0.194\" has a comma inside. But when I open the CSV file with Excel, I noticed that Excel doesn't add an extra column, because it doesn't split that value into two. How can I properly split this CSV file considering this situation?
You are encountering commas in the "cells" of your CSV, which by convention (but not by any standard) are escaped by wrapping the cell data with double quotes. You also need to be aware that the quote-escaped string can contain quote literals.
Let's say you had a name column and someone's name was
Jonathan "Jake" Smith, Jr.
That would be encoded as
"Jonathan ""Jake"" Smith, Jr."
You can certainly improve your code to handle those cases. However, that problem has been solved before. If you don't want to reinvent the wheel, there are a number of solid open source libraries that handle the headache of parsing CSV files. The one I use is
http://www.codeproject.com/Articles/9258/A-Fast-CSV-Reader
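As an illustration of letting an existing parser do the work, here is a small sketch that uses TextFieldParser from the Microsoft.VisualBasic.FileIO namespace. This is a different parser from the library linked above, chosen here only because it ships with .NET (on .NET Framework it requires a reference to Microsoft.VisualBasic.dll). The wrapper name QuotedSplit.SplitLine is made up for illustration.

```csharp
using System.IO;
using Microsoft.VisualBasic.FileIO;

static class QuotedSplit
{
    // Splits one CSV line into fields, honoring double-quoted fields.
    public static string[] SplitLine(string line)
    {
        using (var parser = new TextFieldParser(new StringReader(line)))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(",");
            parser.HasFieldsEnclosedInQuotes = true; // commas inside quotes stay in the field
            return parser.ReadFields();
        }
    }
}
```

Called on the input string from the question, this should return seven fields, with V=12.503,I=0.194 kept together as the fifth one.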
I have a CSV whose author, annoyingly enough, has decided to 'introduce' the file before the contents themselves. So in all, I have a CSV that looks like:
This file was created by XXXXYY and represents the crossover between YY and QQQ.
Additional information can be found through the website GG, blah blah blah...
Jacob, Hybrid
Dan, Pure
Lianne, Hybrid
Jack, Hatchback
So the problem here is that I want to get rid of the first few lines before the 'real content' of the CSV file begins. I'm looking for robustness here, so using Streamreader and removing all content before the 4th line for example, is not ideal (plus the length of the text can vary).
Is there a way in which one can read only what matters and write a new CSV into a directory path?
Regards,
genesis
(edit - I'm looking for C sharp code)
The solution depends on the files you have to parse. You need to look for a reliable pattern that distinguishes data from comment.
In your example, there are some possibilities that might be the same in other files:
there are 4 lines of text. But you say this isn't consistent across files
The text lines may not contain the same number of commas as the data table. But that is unlikely to be reliable for all files.
there is a blank/whitespace only line between the text and the data.
the data appears to be in the form word-comma-word. If this is true it should be easy to identify non data lines (any line which doesn't contain exactly one comma, or has multiple words etc)
You may be able to use a combination of these heuristics to more reliably detect the data.
You could scan line by line (looking for the \r\n) and ignore lines that don't have a comma count that matches your CSV.
You should be able to read the file into a string pretty easily unless it is really massive.
e.g.
var csv = "some test\r\nsome more text\r\na,b,c\r\nd,e,f\r\n";
var lines = csv.Split(new[] { "\r\n" }, StringSplitOptions.RemoveEmptyEntries);
var csvLines = lines.Where(l => l.Count(c => c == ',') == 2);
// now csvLines contains only the lines you are after
List<string> info = new List<string>();
int counter = 0;
// Open the file to read from.
info = System.IO.File.ReadAllLines(path).ToList();
// Find the lines up until (& including) the empty one
foreach (string s in info)
{
    counter++;
    if (string.IsNullOrEmpty(s))
        break; //exit from the loop
}
// Remove the lines including the blank one.
info.RemoveRange(0, counter);
Something like this should work. You should probably add some checks to make sure counter is not greater than the length (if the file has no empty line, the loop above counts every line and everything gets removed), plus other tests to handle errors.
You could adapt this code so that it just finds the empty line number using LINQ or something, but I don't like the overhead of LINQ (yeah, ironic considering I'm using C#).
Regards,
Slipoch
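The blank-line heuristic from the answers above can also be sketched in a way that guards against a file with no blank line and writes the result to a new CSV, as the question asks. This assumes the introduction is always separated from the data by one empty or whitespace-only line; the names PreambleSkipper and DataLines are made up for illustration.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class PreambleSkipper
{
    // Returns the lines that follow the first blank line.
    // If there is no blank line, all lines are returned unchanged.
    public static IEnumerable<string> DataLines(IReadOnlyList<string> lines)
    {
        int blank = -1;
        for (int i = 0; i < lines.Count; i++)
        {
            if (string.IsNullOrWhiteSpace(lines[i]))
            {
                blank = i;
                break;
            }
        }
        return lines.Skip(blank + 1);
    }
}
```

A call such as File.WriteAllLines(outputPath, PreambleSkipper.DataLines(File.ReadAllLines(inputPath))) would then write only the data rows, where inputPath and outputPath are placeholders for your own paths.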
I am trying to read a CSV file, and I have succeeded to some extent.
I tried
string[] values = strCurLine.Split(',');
and I am getting this output:
array[0]="180"
array[1]="LMN"
array[2]="8"
array[3]="5/17/2012 15:00"
array[4]=""
array[5]="name"
array[6]="2nd row"
array[7]="step 2 from 4 to 2"
array[8]="7/9/2012 8:47:00 AM"
But when I try to read content that contains a comma (,), it produces an additional array item, and because of this I am losing data. How can I replace a comma (,) in a string?
My expected output is
array[0]="180"
array[1]="LMN"
array[2]="8"
array[3]="5/17/2012 15:00"
array[4]=""
array[5]="name"
array[6]="2nd row, step 2 from 4 to 2"
array[7]="7/9/2012 8:47:00 AM"
Please suggest
You've hit on what is known as a "Delimiter collision"
You're going to need to change either the delimiter itself, or places where you want to put in a comma.
To replace parts of a string you can use String.Replace() - but that won't really help you in this case because it'll break your csv file.
So what I suggest is the following:
Use a different CSV delimiter in the file (even though it's called CSV, you can delimit with tabs or semicolons), something you're not expecting to be in the data. If you have complete control of the CSV, you could use your own delimiter (maybe ,, or something like that)
or
Put some sort of 'exception' to commas. For instance use \, or ,, to define a comma. Then you can string.replace the entire csv file to a temporary character - and swap it back into a comma after you're done.
Check here, I got an answer to my question: "Dealing with commas in a CSV file", answered by harpo.
Thanks, happy coding.
From this thread, I got the basic info on how to parse CSV to create XML. Unfortunately, the text fields (all enclosed in quotes) sometimes contain commas, so line.split(',') gives me too many columns. I can't figure out how to parse the CSV so line.split(',') distinguishes between commas within a text field, and commas separating fields. Any thoughts on how to do that?
Thanks!
Go grab this code: http://geekswithblogs.net/mwatson/archive/2004/09/04/10658.aspx
Then replace line.Split(",") with SplitCSV(line), like:
var lines = File.ReadAllLines(@"C:\text.csv");
var xml = new XElement("TopElement",
    lines.Select(line => new XElement("Item",
        SplitCSV(line)
            .Select((column, index) => new XElement("Column" + index, column)))));
xml.Save(@"C:\xmlout.xml");
Note that the code at the link above is rather old, and probably could be cleaned up a bit using Linq, but it should do the trick.
Try FileHelpers.
The FileHelpers are a free and easy to use .NET library to import/export data from fixed length or delimited records in files, strings or streams.
What about using the pipe character "|"? Commas inside fields are a common problem with CSV files, and a better approach is to separate on pipes.
If your CSV files are too complex to make writing your own parser practical, use another parser. The Office ACE OLEDB provider may already be available on your system, but may be overkill for your purposes. I haven't used any of the lightweight alternatives, so I can't speak to their suitability.
Here's a little trick if you don't want to use Regex. Instead of splitting on the comma alone, you can split on the comma and the surrounding quotes together: ","
Assuming there's no space before and after comma:
line.Split("\",\"")
You will need to remove the quote before the first field and after the last field however.
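A small sketch of this trick, including the removal of the outer quotes. It assumes every field is quoted, there is no whitespace around the separating commas, and no field contains the sequence "," itself. The name QuoteCommaSplit is made up for illustration.

```csharp
using System;

static class QuoteCommaSplit
{
    public static string[] Split(string line)
    {
        // Drop the opening quote of the first field and the closing
        // quote of the last field.
        string inner = line.Trim();
        if (inner.Length >= 2 && inner[0] == '"' && inner[inner.Length - 1] == '"')
            inner = inner.Substring(1, inner.Length - 2);

        // Split on the three-character sequence quote-comma-quote.
        return inner.Split(new[] { "\",\"" }, StringSplitOptions.None);
    }
}
```

The string-array overload of Split is used here because it is available on both .NET Framework and newer .NET versions.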
While I'm almost always against regular expressions, here's a solution using one.
Assume you have data as such:
"first name","last name","phone number"
"john,jane","doe","555-5555"
Then, the following code:
string csv = GetCSV(); // will load your CSV, or the above data
foreach (string line in csv.Split('\n'))
{
    Console.WriteLine("--- Begin record ---");
    // [^\"]* also matches empty fields (""), which .+? would not.
    foreach (Match m in Regex.Matches(line, "\"[^\"]*\""))
        Console.WriteLine(m.Value);
}
will output this:
--- Begin record ---
"first name"
"last name"
"phone number"
--- Begin record ---
"john,jane"
"doe"
"555-5555"
But I would not recommend the Regex approach if you have like a 2 GB csv file.
So you can use that as your baseline for making up your XML records.