Read and extract from file - c#

I have a huge file with ~3 million rows. Every line contains a record like this:
1|2|3|4|5|6|7|8|9
Exactly 8 separators like '|' on every line. I am looking for a way to read this file then extract last '9' number only from every line and store it into another file.
edit:
OK, here is what I have done so far.
using (StreamReader sr = new StreamReader(filepath))
using (StreamWriter sw = new StreamWriter(filepath1))
{
string line = null;
while ((line = sr.ReadLine()) != null)
sw.WriteLine(line.Split('|')[8]);
}
File.WriteAllLines("filepath", File.ReadAllLines(filepath).Where(l => !string.IsNullOrWhiteSpace(l)));
This reads the file, extracts the last field, writes it to a new file, and removes blank lines. The last field is 10-15 characters long and I want to extract the first 6. I'll keep reading and trying things, and I'll edit again when I'm done or have more questions.
Thanks
Edit 2:
Ok, here I take first 8 digits from the number:
sw.WriteLine(line.Substring(0, Math.Min(line.Length, 8)));
Edit 3:
I have no idea how to match up the numbers that are left in the file. I want to see which number occurs how many times in the file.
Any help?

I am looking for a way to read this file then extract last [..] number only from every line and store it into another file.
What part exactly are you having trouble with? In pseudocode, this is what you want:
fileReader = OpenFile("input")
fileWriter = OpenFile("output")
while !fileReader.EndOfFile
line = fileReader.ReadLine
records[] = line.Split('|')
value = records[8]
fileWriter.WriteLine(value)
end while
So start implementing it and feel free to ask a question on any specific line you're having trouble with. Each line of code I posted contains enough pointers to figure out the C# code or the terms to do a web search for it.

You don't say where you are stuck. Break the problem down:
Write and run minimal C# program
Read lines from file
Break up one line
write result line to a file
Are you stuck on any one of those? Then ask a specific question about that. This decomposition technique is key to many programming tasks, and indeed complex tasks in general.
You might find the string split capability useful.

Because it's a huge file you must read it line by line!
public IEnumerable<string> ReadFileIterator(String filePath)
{
using (StreamReader streamReader = new StreamReader(filePath, Encoding.Default))
{
String line;
while ((line = streamReader.ReadLine()) != null)
{
yield return line;
}
yield break;
}
}
public void WriteToFile(String inputFilePath, String outputFilePath)
{
using (StreamWriter streamWriter = new StreamWriter(outputFilePath, true, Encoding.Default))
{
foreach (String line in ReadFileIterator(inputFilePath))
{
String[] subStrings = line.Split('|');
streamWriter.WriteLine(subStrings[8]);
}
streamWriter.Flush();
streamWriter.Close();
}
}

using (StreamReader sr = new StreamReader("input"))
using (StreamWriter sw = new StreamWriter("output"))
{
string line = null;
while ((line=sr.ReadLine())!=null)
sw.WriteLine(line.Split('|')[8]);
}

Some pointers to start from: StreamReader.ReadLine() and String.Split(). There are examples on both pages.

With LINQ you could do a thing like the following to filter the numbers:
var numbers = from l in File.ReadLines(fileName)
let p = l.Split('|')
select p[8];
and then write them into a new file like that:
File.WriteAllText(newFileName, String.Join("\r\n", numbers));

Use String.Split() to get the line inside an array and get the last element and store it into another file. Repeat the process for each line.

Try this...
// Read the file and display it line by line.
System.IO.StreamReader file =
    new System.IO.StreamReader("c:\\test.txt");
string line;
while ((line = file.ReadLine()) != null)
{
    // Split the current line and take the ninth field (index 8).
    string[] words = line.Split('|');
    string value = words[8];
    Console.WriteLine(value);
}
file.Close();
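For the follow-up in Edit 3 (seeing how many times each extracted number occurs), one possible approach is to tally the values in a dictionary while streaming the lines. This is only a minimal sketch: the class name `ValueCounter`, the method name `CountLastFieldPrefixes`, and the `prefixLength` parameter are made up for illustration, and it assumes the value of interest is the leading part of the last '|'-separated field.

```csharp
using System;
using System.Collections.Generic;

public static class ValueCounter
{
    // Counts how often each key occurs, where the key is the first
    // prefixLength characters of the last '|'-separated field.
    // The names here are illustrative assumptions, not part of the question.
    public static Dictionary<string, int> CountLastFieldPrefixes(
        IEnumerable<string> lines, int prefixLength)
    {
        var counts = new Dictionary<string, int>();
        foreach (string line in lines)
        {
            if (string.IsNullOrWhiteSpace(line))
                continue; // skip blank lines

            string[] fields = line.Split('|');
            string last = fields[fields.Length - 1];
            string key = last.Substring(0, Math.Min(last.Length, prefixLength));

            int current;
            counts.TryGetValue(key, out current); // current is 0 when the key is new
            counts[key] = current + 1;
        }
        return counts;
    }
}
```

It could be called as `ValueCounter.CountLastFieldPrefixes(File.ReadLines(filepath), 6)`. Because File.ReadLines enumerates lazily, only the dictionary of distinct keys is held in memory, not the whole file.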

Related

Trying to get certain line endings when using streamreader

I'm trying to get certain line endings when using streamreader in a C# app.
Code:
public static IEnumerable<string> ReadAllLines(string path)
{
if (!File.Exists(path)) return null;
List<string> lines = new List<string>();
using (var reader = new StreamReader(path))
{
while (!reader.EndOfStream)
{
lines.Add(reader.ReadLine(@"(\r\n|\n)"));
}
}
return lines.ToArray();
}
You can see where I have reader.ReadLine(@"(\r\n|\n)"); If I write reader.ReadLine(); I have no issues, but when I try to add line endings to it like I found online, it tells me there is no overload for ReadLine.
Question: Can someone assist me with figuring out how to add certain line endings so I can successfully scan my CSV files?
Update:
So I found a way to add the line endings I was looking for and attempted it three different ways. But I'm still getting \r only on some lines. It doesn't make a lot of sense. Can anyone see any issues with the below lines of code?
var reader = new StreamReader(path, Encoding.Default);
//string text = reader.ReadToEnd();
////// attempt 1 - this gives the best result but is still splitting on a \r in one of the fields
//// List<string> lines = new List<string>(text.Split(new[] {"\r","\n"}, StringSplitOptions.None));
////// attempt 2 This worked almost identical to the option above but seemed faster.
//var lines = Regex.Split(text, "\r\n");
//// attempt 3 - this split both \r and \n separately
// List<string> lines = new List<string>(text.Split("\r\n".ToCharArray()));
any other suggestions on how to do this would be great!
Based on your comment to your question:
so just to explain what is going on i have a CSV file. when you put it in excel i have some lines that go to ZZ and other lines that go to AZ (not as long). the white space at the end of AZ all the way to ZZ gets added to the next line and screws everything. i assumed it was because the line endings were not correct but they are as you state above
Try a String.TrimEnd() method call before adding the string to your list.
public static IEnumerable<string> ReadAllLines(string path)
{
if (!File.Exists(path)) return null;
var lines = new List<string>();
using (var reader = new StreamReader(path))
{
while (!reader.EndOfStream)
{
// add the TrimEnd call here
lines.Add(reader.ReadLine().TrimEnd());
}
}
return lines.ToArray();
}

Read second line and save it from txt C#

What I have to do is read only the second line in a .txt file and save it as a string, to use later in the code.
The file name is "SourceSetting". In line 1 and 2 I have some words
For line 1, I have this code:
string Location;
StreamReader reader = new StreamReader("SourceSettings.txt");
{
Location = reader.ReadLine();
}
ofd.InitialDirectory = Location;
And that works out great but how do I make it so that it only reads the second line so I can save it as for example:
string Text
You can skip the first line by doing nothing with it, so call ReadLine twice:
string secondLine;
using(var reader = new StreamReader("SourceSettings.txt"))
{
reader.ReadLine(); // skip
secondLine = reader.ReadLine();
}
Another way is the File class that has handy methods like ReadLines:
string secondLine = File.ReadLines("SourceSettings.txt").ElementAtOrDefault(1);
Since ReadLines streams the file, the whole file need not be loaded into memory before processing it. Enumerable.ElementAtOrDefault takes only the second line and doesn't process any further lines. If there are fewer than two lines, the result is null.
Update: I'd advise going with Tim Schmelter's solution.
When you call ReadLine, it advances the reader to the next line. So the second call reads the 2nd line.
string Location;
string Text;
using (var reader = new StreamReader("SourceSettings.txt"))
{
    Location = reader.ReadLine(); // reads line 1 and advances the reader to the beginning of line 2
    Text = reader.ReadLine(); // reads line 2 from the file
}
ofd.InitialDirectory = Location;
Don't forget about using.
Or here is an example of how to do this via the ReadLines method of the File class, if you need just one line from the file. But the solution with ElementAtOrDefault is the best one, as Tim Schmelter points out.
var Text = File.ReadLines(@"C:\Projects\info.txt").Skip(1).First();
The ReadLines and ReadAllLines methods differ as follows: When you use
ReadLines, you can start enumerating the collection of strings before
the whole collection is returned; when you use ReadAllLines, you must
wait for the whole array of strings be returned before you can access
the array. Therefore, when you are working with very large files,
ReadLines can be more efficient.
So it doesn't read all lines into memory in comparison with ReadAllLines.
The line could be read using Linq as follows.
var SecondLine = File.ReadAllLines("SourceSettings.txt").Skip(1).FirstOrDefault();
private string GetLine(string filePath, int line)
{
using (var sr = new StreamReader(filePath))
{
for (int i = 1; i < line; i++)
sr.ReadLine();
return sr.ReadLine();
}
}
Hope this will help :)
If you know that your second line is unique, because it contains a specific keyword that does not appear anywhere else in your file, you also could use linq, the benefit is that the "second" line could be any line in future.
var myLine = File.ReadLines("SourceSettings.txt")
.Where(line => line.Contains("The Keyword"))
.ToList();

Alternative to File.AppendAllText for newline

I am trying to read characters from a file and then append them in another file after removing the comments (which are followed by semicolon).
sample data from parent file:
Name- Harley Brown ;Name is Harley Brown
Age- 20 ;Age is 20 years
Desired result:
Name- Harley Brown
Age- 20
I am trying the following code-
StreamReader infile = new StreamReader(floc + "G" + line + ".NC0");
while (infile.Peek() != -1)
{
letter = Convert.ToChar(infile.Read());
if (letter == ';')
{
infile.ReadLine();
}
else
{
System.IO.File.AppendAllText(path, Convert.ToString(letter));
}
}
But the output i am getting is-
Name- Harley Brown Age-20
It's because AppendAllText is not working for the newline. Is there any alternative?
Sure, why not use File.AppendAllLines. See documentation here.
Appends lines to a file, and then closes the file. If the specified file does not exist, this method creates a file, writes the specified lines to the file, and then closes the file.
It takes in any IEnumerable<string> and adds every line to the specified file. So it always adds the line on a new line.
Small example:
const string originalFile = @"D:\Temp\file.txt";
const string newFile = @"D:\Temp\newFile.txt";
// Retrieve all lines from the file.
string[] linesFromFile = File.ReadAllLines(originalFile);
List<string> linesToAppend = new List<string>();
foreach (string line in linesFromFile)
{
// 1. Split the line at the semicolon.
// 2. Take the first index, because the first part is your required result.
// 3. Trim the trailing and leading spaces.
string appendAbleLine = line.Split(';').FirstOrDefault().Trim();
// Add the line to the list of lines to append.
linesToAppend.Add(appendAbleLine);
}
// Append all lines to the file.
File.AppendAllLines(newFile, linesToAppend);
Output:
Name- Harley Brown
Age- 20
You could even change the foreach-loop into a LINQ-expression, if you prefer LINQ:
List<string> linesToAppend = linesFromFile.Select(line => line.Split(';').FirstOrDefault().Trim()).ToList();
Why use char by char comparison when .NET Framework is full of useful string manipulation functions?
Also, don't use a file write function multiple times when you can use it only one time, it's time and resources consuming!
StreamReader infile = new StreamReader("file1.txt");
string str = "";
string line;
while ((line = infile.ReadLine()) != null) { // Get every line of the file.
    line = line.Split(';')[0].Trim(); // Remove comment (right part of ;) and useless white characters.
    str += line + "\n"; // Add it to our final file contents.
}
infile.Close(); // Release the input file.
File.WriteAllText("file2.txt", str); // Write it to the new file.
You could do this with LINQ, System.IO.File.ReadLines(string), and System.IO.File.WriteAllLines(string, IEnumerable<string>). You could also use System.IO.File.AppendAllLines(string, IEnumerable<string>) in a find-and-replace fashion if that was, in fact, the functionality you were going for. The difference, as the names suggest, is whether it writes everything out as a new file or if it just appends to an existing one.
System.IO.File.WriteAllLines(newPath, System.IO.File.ReadLines(oldPath).Select(c =>
{
int semicolon = c.IndexOf(';');
if (semicolon > -1)
return c.Remove(semicolon);
else
return c;
}));
In case you aren't super familiar with LINQ syntax, the idea here is to loop through each line in the file, and if it contains a semicolon (that is, IndexOf returns something that is over -1) we cut that off, and otherwise, we just return the string. Then we write all of those to the file. The StreamReader equivalent to this would be:
using (StreamReader reader = new StreamReader(oldPath))
using (StreamWriter writer = new StreamWriter(newPath))
{
string line;
while ((line = reader.ReadLine()) != null)
{
int semicolon = line.IndexOf(';');
if (semicolon > -1)
line = line.Remove(semicolon);
writer.WriteLine(line);
}
}
As far as I know, both versions end the file with a trailing line terminator, since File.WriteAllLines, like StreamWriter.WriteLine, writes a newline after every line, so the output should be identical (if someone reading this knows otherwise, I would appreciate a comment).
Another important thing to note, just looking at your original file, you might want to add in some Trim calls, since it looks like you can have spaces before your semicolons, and I don't imagine you want those copied through.
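As a rough illustration of the Trim suggestion, the comment removal and the trimming can be combined in one small helper. This is a sketch under assumptions: the class name `CommentStripper` and method name `StripComment` are invented here, and it assumes only trailing whitespace (the spaces before the semicolon) should be dropped.

```csharp
using System;

public static class CommentStripper
{
    // Removes everything from the first ';' onward, then trims the
    // whitespace that was left in front of the semicolon.
    // The helper name is an illustrative assumption.
    public static string StripComment(string line)
    {
        int semicolon = line.IndexOf(';');
        string content = semicolon > -1 ? line.Remove(semicolon) : line;
        return content.TrimEnd();
    }
}
```

With the LINQ version, this would slot in as `File.ReadLines(oldPath).Select(CommentStripper.StripComment)`. TrimEnd is used rather than Trim so that any leading indentation in the source file is preserved; switch to Trim if that is not wanted.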

Most efficient way of removing lines that contain more than one string from a file?

I want to find the most efficient way of removing string 1 and string 2 when reading a file (host file) and remove the entire lines that contains string 1 or string 2.
Currently I have, and is obviously sluggish. What better methods are there?
using(StreamReader sr = File.OpenText(path)){
while ((stringToRemove = sr.ReadLine()) != null)
{
if (!stringToRemove.Contains("string1"))
{
if (!stringToRemove.Contains("string2"))
{
emptyreplace += stringToRemove + Environment.NewLine;
}
}
}
sr.Close();
File.WriteAllText(path, emptyreplace);
hostFileConfigured = false;
UInt32 result = DnsFlushResolverCache();
MessageBox.Show(removeSuccess, windowOffline);
}
The primary problem that you have is that you are constantly using large regular strings and appending data onto the end. This is re-creating the strings each time and consumes a lot of time and particularly memory. By using string.Join it will avoid the (very large number of) intermediate string values being created.
You can also shorten the code to get the lines of text by using File.ReadLines instead of using the stream directly. It's not really any better or worse, just prettier.
var lines = File.ReadLines(path)
.Where(line => !line.Contains("string1") && !line.Contains("string2"));
File.WriteAllText(path, string.Join(Environment.NewLine, lines));
Another option would be to stream the writing of the output as well. Since there is no good library method for writing out a IEnumerable<string> without eagerly evaluating the input, we'll have to write our own (which is simple enough):
public static void WriteLines(string path, IEnumerable<string> lines)
{
using (var stream = File.CreateText(path))
{
foreach (var line in lines)
stream.WriteLine(line);
}
}
Also note that if we're streaming our output then we'll need a temporary file, since we don't want to be reading and writing to the same file at the same time.
//same code as before
var lines = File.ReadLines(path)
.Where(line => !line.Contains("string1") && !line.Contains("string2"));
//get a temp file path that won't conflict with any other files
string tempPath = Path.GetTempFileName();
//use the method from above to write the lines to the temp file
WriteLines(tempPath, lines);
//replace the original file with the temp file; File.Move throws
//if the destination already exists, so delete the old file first
File.Delete(path);
File.Move(tempPath, path);
The primary advantage of this option, as opposed to the first, is that it will consume far less memory. In fact, it only ever needs to hold one line of the file in memory at a time, rather than the whole file. It does take up a bit of extra space on disk (temporarily) though.
The first thing that stands out to me is the inefficient use of a string variable (emptyreplace) inside the while loop. Use a StringBuilder instead and it will be much more memory efficient.
For example:
StringBuilder emptyreplace = new StringBuilder();
using(StreamReader sr = File.OpenText(path)){
while ((stringToRemove = sr.ReadLine()) != null)
{
if (!stringToRemove.Contains("string1"))
{
if (!stringToRemove.Contains("string2"))
{
//USE StringBuilder.AppendLine, and NOT string concatenation
//(AppendLine already adds the line terminator)
emptyreplace.AppendLine(stringToRemove);
}
}
}
...
}
The rest seems good enough.
There are a number of ways to improve this:
Compile the array of words you're searching for into a regex (eg, word1|word2; beware of special characters) so that you'll only need to loop over the string once. (this would also allow you to use \b to only match words)
Write each line through a StreamWriter to a new file so that you don't need to store the whole thing in memory while building it. (after you finish, delete the original file & rename the new one)
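The regex idea from the first point could look roughly like the sketch below. The class name `StopWordFilter` and both method names are hypothetical, and it assumes every search word starts and ends with a letter, digit, or underscore, since `\b` only behaves as a word boundary next to such characters.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class StopWordFilter
{
    // Builds a single pattern such as \b(?:word1|word2)\b.
    // Regex.Escape neutralizes special characters inside the words.
    public static Regex BuildPattern(IEnumerable<string> stopWords)
    {
        string alternatives = string.Join("|", stopWords.Select(w => Regex.Escape(w)));
        return new Regex(@"\b(?:" + alternatives + @")\b", RegexOptions.Compiled);
    }

    // Lazily yields only the lines that contain none of the stop words,
    // so each line is scanned once no matter how many words there are.
    public static IEnumerable<string> KeepLines(IEnumerable<string> lines, Regex pattern)
    {
        return lines.Where(line => !pattern.IsMatch(line));
    }
}
```

The filtered sequence can then be written line by line through a StreamWriter to a new file, as the second point suggests, before the original file is replaced.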
Is your host file really that big that you need to bother with reading it line by line? Why not simply do this?
var badWords = new[] { "string1", "string2" };
var lines = File.ReadAllLines(path);
var filtered = lines.Where(x => !badWords.Any(y => x.Contains(y))).ToArray();
File.WriteAllLines(path, filtered);
Two suggestions:
Create an array of strings to detect (I'll call them stopWords) and use Linq's Any extension method.
Rather than building the file up and writing it all at once, write each line to an output file one at a time while you're reading the source file, and replace the source file once you're done.
The resulting code:
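string[] stopWords = new string[]
{
    "string1",
    "string2"
};
using (StreamReader sr = File.OpenText(srcPath))
using (StreamWriter sw = new StreamWriter(outPath))
{
    string stringToRemove;
    while ((stringToRemove = sr.ReadLine()) != null)
    {
        // Keep the line only if it contains none of the stop words.
        if (!stopWords.Any(s => stringToRemove.Contains(s)))
        {
            sw.WriteLine(stringToRemove);
        }
    }
}
// File.Move throws if the destination exists, so delete the source file first.
File.Delete(srcPath);
File.Move(outPath, srcPath);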
Update: I just realized that you are actually talking about the "hosts file". Assuming you mean %windir%\system32\drivers\etc\hosts, it is very unlikely that this file has a truly significant size (like more than a couple of KBs). So personally, I would go with the most readable approach. Like, for example, the one by @servy.
In the end you will have to read every line and write every line, that does not match your criteria. So, you will always have the basic IO overhead that you cannot avoid. Depending on the actual (average) size of your files that might overshadow every other optimization technique you use in your code to actually filter the lines.
Having said that, you can be a little less wasteful on the memory side of things by not collecting all output lines in a buffer, but writing them directly to the output file as you read them (again, this might be pointless if your files are not very big).
using (var reader = new StreamReader(inputfile))
{
using (var writer = new StreamWriter(outputfile))
{
string line;
while ((line = reader.ReadLine()) != null)
{
if (line.IndexOf("string1") == -1 && line.IndexOf("string2") == -1)
{
writer.WriteLine(line);
}
}
}
}
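// File.Move throws if the destination exists, so delete the input file first.
File.Delete(inputfile);
File.Move(outputfile, inputfile);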

is there any way to ignore reading in certain lines in a text file?

I'm trying to read in a text file in a C# application, but I don't want to read the first two lines or the last line. There are 8 lines in the file, so effectively I just want to read in lines 3, 4, 5, 6 and 7.
Is there any way to do this?
example file
USE [Shelley's Other Database]
CREATE TABLE db.exmpcustomers(
fName varchar(100) NULL,
lName varchar(100) NULL,
dateOfBirth date NULL,
houseNumber int NULL,
streetName varchar(100) NULL
) ON [PRIMARY]
EDIT
Okay, so, I've implemented Callum Rogers answer into my code and for some reason it works with my edited text file (I created a text file with the lines I didn't want to use omitted) and it does exactly what it should, but whenever I try it with the original text file (above) it throws an exception. I display this information in a DataGrid and I think that's where the exception is being thrown.
Any ideas?
The Answer by Rogers is good, I am just providing another way of doing this.
Try this,
List<string> list = new List<string>();
using (StreamReader reader = new StreamReader(FilePath))
{
string text = "";
while ((text = reader.ReadLine()) != null)
{
list.Add(text);
}
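list.RemoveAt(0); // drop the first line
list.RemoveAt(0); // drop the second line (now at index 0)
list.RemoveAt(list.Count - 1); // drop the last line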
}
Hope this helps
Why do you want to ignore exactly the first two and the last line?
Depending on what your file looks like you might want to analyze the line, e.g. look at the first character whether it is a comment sign, or ignore everything until you find the first empty line, etc.
Sometimes, hardcoding "magic" numbers isn't such a good idea. What if the file format needs to be changed to contain 3 header lines?
As the other answers demonstrate, nothing keeps you from doing whatever you want with a line you have read, so of course you can ignore it, too.
Edit, now that you've provided an example of your file: For your case I'd definitely not use the hardcoded numbers approach. What if some day the SQL statement should contain another field, or if it appears on one instead of 8 lines?
My suggestion: Read in the whole string at once, then analyze it. Safest way would be to use a grammar, but if you presume the SQL statement is never going to be more complicated, you can use a regular expression (still much better than using line numbers etc.):
string content = File.ReadAllText(filename);
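Regex r = new Regex(@"CREATE TABLE [^\(]+\((.*)\) ON", RegexOptions.Singleline); // Singleline lets . match newlines
string whatYouWant = r.Match(content).Groups[1].Value; // group 1 is the column list; group 0 is the whole match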
Why not just use File.ReadAllLines() and then remove the first 2 lines and the last line? With such a small file speed differences will not be noticeable.
string[] allLines = File.ReadAllLines("file.ext");
string[] linesWanted = new string[allLines.Length-3];
Array.Copy(allLines, 2, linesWanted, 0, allLines.Length-3);
If you have a TextReader object wrapping the filestream you could just call ReadLine() two times.
StreamReader inherits from TextReader, which is abstract.
Non-fool proof example:
using (var fs = new FileStream("blah", FileMode.Open))
using (var reader = new StreamReader(fs))
{
reader.ReadLine();
reader.ReadLine();
// Do stuff.
}
string filepath = @"C:\whatever.txt";
using (StreamReader rdr = new StreamReader(filepath))
{
rdr.ReadLine(); // ignore 1st line
rdr.ReadLine(); // ignore 2nd line
string fileContents = "";
while (true)
{
string line = rdr.ReadLine();
if (rdr.EndOfStream)
break; // finish without processing last line
fileContents += line + "\r\n";
}
Console.WriteLine(fileContents);
}
How about a general solution?
To me, the first step is to enumerate over the lines of a file (already provided by ReadAllLines, but that has a performance cost due to populating an entire string[] array; there's also ReadLines, but that's only available as of .NET 4.0).
Implementing this is pretty trivial:
public static IEnumerable<string> EnumerateLines(this FileInfo file)
{
using (var reader = file.OpenText())
{
while (!reader.EndOfStream)
{
yield return reader.ReadLine();
}
}
}
The next step is to simply skip the first two lines of this enumerable sequence. This is straightforward using the Skip extension method.
The last step is to ignore the last line of the enumerable sequence. Here's one way you could implement this:
public static IEnumerable<T> IgnoreLast<T>(this IEnumerable<T> source, int ignoreCount)
{
if (ignoreCount < 0)
{
throw new ArgumentOutOfRangeException("ignoreCount");
}
var buffer = new Queue<T>();
foreach (T value in source)
{
if (buffer.Count < ignoreCount)
{
buffer.Enqueue(value);
continue;
}
T buffered = buffer.Dequeue();
buffer.Enqueue(value);
yield return buffered;
}
}
OK, then. Putting it all together, we have:
var file = new FileInfo(@"path\to\file.txt");
var lines = file.EnumerateLines().Skip(2).IgnoreLast(1);
Test input (contents of file):
This is line number 1.
This is line number 2.
This is line number 3.
This is line number 4.
This is line number 5.
This is line number 6.
This is line number 7.
This is line number 8.
This is line number 9.
This is line number 10.
Output (of Skip(2).IgnoreLast(1)):
This is line number 3.
This is line number 4.
This is line number 5.
This is line number 6.
This is line number 7.
This is line number 8.
This is line number 9.
You can do this:
var valid = new int[] { 3, 4, 5, 6, 7 };
var lines = File.ReadAllLines("file.txt").
Where((line, index) => valid.Contains(index + 1));
Or the opposite:
var invalid = new int[] { 1, 2, 8 };
var lines = File.ReadAllLines("file.txt").
Where((line, index) => !invalid.Contains(index + 1));
If you're looking for a general way to remove the last and the first 2, you can use this:
var allLines = File.ReadAllLines("file.txt");
var lines = allLines
.Take(allLines.Length - 1)
.Skip(2);
But from your example it seems that you're better off looking for the string pattern that you want to read from the file. Try using regexes.
