Text file line by line into string array - c#

I need help, trying to take a large text document ~1000 lines and put it into a string array, line by line.
Example:
string[] s = {firstLineHere, Secondline, etc};
I also want a way to find the first word, and only the first word, of each line, and once that first word is found, copy the entire line.

You can accomplish this with File.ReadAllLines combined with a little LINQ (to cover the addition to the question stated in the comments on Praveen's answer).
string[] identifiers = { /*Your identifiers for needed lines*/ };
string[] allLines = File.ReadAllLines(@"C:\test.txt");
// Split(' ')[0] is the first word; it also works for lines that contain no space.
string[] neededLines = allLines.Where(c => identifiers.Contains(c.Split(' ')[0])).ToArray();
Or make it more of a one liner:
string[] lines = File.ReadAllLines("your path").Where(c => identifiers.Contains(c.Split(' ')[0])).ToArray();
This will give you an array of all the lines in your document that start with the keywords you define in your identifiers string array.
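As a minimal sketch of the same filter, the helper below also tolerates leading whitespace and tab-separated words. The names `LineFilter`, `FirstWord` and `StartingWith` are illustrative and not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LineFilter
{
    // Returns the first whitespace-delimited word of a line, or "" for a blank line.
    public static string FirstWord(string line)
    {
        string trimmed = line.TrimStart();
        int end = trimmed.IndexOfAny(new[] { ' ', '\t' });
        return end < 0 ? trimmed : trimmed.Substring(0, end);
    }

    // Keeps only the lines whose first word is one of the identifiers.
    public static string[] StartingWith(IEnumerable<string> lines, ICollection<string> identifiers)
    {
        return lines.Where(l => identifiers.Contains(FirstWord(l))).ToArray();
    }
}
```

It could be called as `LineFilter.StartingWith(File.ReadAllLines(path), identifiers)`. The comparison is case-sensitive; pass a `HashSet<string>` built with `StringComparer.OrdinalIgnoreCase` as `identifiers` if you need it to ignore case.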

There is an inbuilt method to achieve your requirement.
string[] lines = System.IO.File.ReadAllLines(@"C:\sample.txt");
If you want to read the file line by line
List<string> lines = new List<string>();
using (StreamReader reader = new StreamReader(@"C:\sample.txt"))
{
while (reader.Peek() >= 0)
{
string line = reader.ReadLine();
//Add your conditional logic to add the line to an array
if (line.Contains(searchTerm)) {
lines.Add(line);
}
}
}

Another option you could use would be to read each individual line, while splitting the line into segments and comparing only the first element against
the provided search term. I have provided a complete working demonstration below:
Solution:
class Program
{
static void Main(string[] args)
{
// Get all lines that start with a given word from a file
var result = GetLinesWithWord("The", "temp.txt");
// Display the results.
foreach (var line in result)
{
Console.WriteLine(line + "\r");
}
Console.ReadLine();
}
public static List<string> GetLinesWithWord(string word, string filename)
{
List<string> result = new List<string>(); // A list of strings where the first word of each is the provided search term.
// Create a stream reader object to read a text file.
using (StreamReader reader = new StreamReader(filename))
{
string line = string.Empty; // Contains a single line returned by the stream reader object.
// While there are lines in the file, read a line into the line variable.
while ((line = reader.ReadLine()) != null)
{
// If the line is white space, then there are no words to compare against, so move to next line.
if (line != string.Empty)
{
// Split the line into parts by a white space delimiter.
var parts = line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
// Get only the first word element of the line, trim off any additional white space
// and convert it to lowercase. Compare the word element to the search term provided.
// If they are the same, add the line to the results list.
if (parts.Length > 0)
{
if (parts[0].ToLower().Trim() == word.ToLower().Trim())
{
result.Add(line);
}
}
}
}
}
return result;
}
}
Where the sample text file may contain:
How shall I know thee in the sphere which keeps
The disembodied spirits of the dead,
When all of thee that time could wither sleeps
And perishes among the dust we tread?
For I shall feel the sting of ceaseless pain
If there I meet thy gentle presence not;
Nor hear the voice I love, nor read again
In thy serenest eyes the tender thought.
Will not thy own meek heart demand me there?
That heart whose fondest throbs to me were given?
My name on earth was ever in thy prayer,
Shall it be banished from thy tongue in heaven?
In meadows fanned by heaven's life-breathing wind,
In the resplendence of that glorious sphere,
And larger movements of the unfettered mind,
Wilt thou forget the love that joined us here?
The love that lived through all the stormy past,
And meekly with my harsher nature bore,
And deeper grew, and tenderer to the last,
Shall it expire with life, and be no more?
A happier lot than mine, and larger light,
Await thee there; for thou hast bowed thy will
In cheerful homage to the rule of right,
And lovest all, and renderest good for ill.
For me, the sordid cares in which I dwell,
Shrink and consume my heart, as heat the scroll;
And wrath has left its scar--that fire of hell
Has left its frightful scar upon my soul.
Yet though thou wear'st the glory of the sky,
Wilt thou not keep the same beloved name,
The same fair thoughtful brow, and gentle eye,
Lovelier in heaven's sweet climate, yet the same?
Shalt thou not teach me, in that calmer home,
The wisdom that I learned so ill in this--
The wisdom which is love--till I become
Thy fit companion in that land of bliss?
And you wanted to retrieve every line where the first word of the line is the word 'the' by calling the method like so:
var result = GetLinesWithWord("The", "temp.txt");
Your result should then be the following:
The disembodied spirits of the dead,
The love that lived through all the stormy past,
The same fair thoughtful brow, and gentle eye,
The wisdom that I learned so ill in this--
The wisdom which is love--till I become
Hopefully this answers your question adequately enough.

Related

How to read a portion of a line in a text file?

So, I have a text file with thousands of lines formatted similarly to this:
123456:0.8525000:1590882780:91011
These files are almost always a different length, and I only need to read the first two parts of the line, being 123456:0.8525000.
I know that I can split each line using C#, but I'm unsure how to only read the first 2 parts. Anyone have any idea on how to do this? Sorry if my question doesn't make sense, I can restate it if needed.
The Split function returns a string[], an array of strings.
Just take the 2 first elements of the result of Split (with : as the separator).
var read = "123456:0.8525000:1590882780:91011";
var values = read.Split(":");
Console.WriteLine(values[0]); // 123456
Console.WriteLine(values[1]); // 0.8525000
Don't forget that elements of values are string and not yet int or double values. See How to convert string to integer in C# for how to convert from string to number type.
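A minimal sketch of that conversion step, assuming the first field is an integer ID and the second a decimal value (the question does not say what the fields mean, so the `long` and `decimal` types and the `RecordParser` name are assumptions):

```csharp
using System;
using System.Globalization;

public static class RecordParser
{
    // Parses the first two colon-separated fields as (long id, decimal value).
    // Returns false when the line has fewer than two fields or a field is not numeric.
    public static bool TryParseFirstTwo(string line, out long id, out decimal value)
    {
        id = 0;
        value = 0m;
        string[] parts = line.Split(':');
        if (parts.Length < 2) return false;
        // InvariantCulture makes "." the decimal separator regardless of the machine's locale.
        return long.TryParse(parts[0], NumberStyles.Integer, CultureInfo.InvariantCulture, out id)
            && decimal.TryParse(parts[1], NumberStyles.Number, CultureInfo.InvariantCulture, out value);
    }
}
```

Using `TryParse` rather than `Parse` lets you skip malformed lines without catching exceptions.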
There are many ways to do this, but I am going to suggest options that read the full line, since that is much easier to work with and understand, and your lines are of varying length. I added a suggestion on using StreamReader on a file in the addendum at the end, but with that approach you may need serious workarounds for skipping lines you don't want, restarting a character-iterating loop on new lines, and so on.
I first demonstrate IAsyncEnumerable, available in .NET Core 3.x, followed by a similar string-based approach. By sharing an int example that is slightly advanced and also asynchronous, I hope to help others and demonstrate a fairly modern approach as of 2020. Streaming out only the data you need is a big benefit for keeping it fast with a low memory footprint.
public static async IAsyncEnumerable<int> StreamFileOutAsIntsAsync(string filePathName)
{
if (string.IsNullOrWhiteSpace(filePathName)) throw new ArgumentNullException(nameof(filePathName));
if (!File.Exists(filePathName)) throw new ArgumentException($"{filePathName} is not a valid file path.");
using var streamReader = File.OpenText(filePathName);
string currentLine;
while ((currentLine = await streamReader.ReadLineAsync().ConfigureAwait(false)) != null)
{
if (int.TryParse(currentLine.AsSpan(), out var output))
{
yield return output;
}
}
}
This streams every int out of a file, after checking that the file path is not null or blank and that the file exists.
Streaming may be too much for a beginner; I don't know your level.
You may want to start with just turning the file into a list of strings.
The next example modifies the previous one into something less complex that splits your strings for you. I recommend learning about streaming so you don't have every piece of string in memory while you work on it... or maybe you want them all. I am not here to judge.
Once you get your string line out from a file you can do whatever else needs to be done.
public static async Task<List<string>> GetStringsFromFileAsync(string filePathName)
{
if (string.IsNullOrWhiteSpace(filePathName)) throw new ArgumentNullException(nameof(filePathName));
if (!File.Exists(filePathName)) throw new ArgumentException($"{filePathName} is not a valid file path.");
using var streamReader = File.OpenText(filePathName);
string currentLine;
var strings = new List<string>();
while ((currentLine = await streamReader.ReadLineAsync().ConfigureAwait(false)) != null)
{
var lineAsArray = currentLine.Split(new string[] { ":" }, StringSplitOptions.RemoveEmptyEntries);
// Simple Data Validation
if (lineAsArray.Length == 4)
{
strings.Add($"{lineAsArray[0]}:{lineAsArray[1]}");
strings.Add($"{lineAsArray[2]}:{lineAsArray[3]}");
}
}
return strings;
}
The meat of the code is really simple: open the file for reading
using var streamReader = File.OpenText(filePathName);
and then loop through that file...
while ((currentLine = await streamReader.ReadLineAsync()) != null)
{
var lineAsArray = currentLine.Split(new string[] { ":" }, StringSplitOptions.RemoveEmptyEntries);
// Simple Data Validation
if (lineAsArray.Length == 4)
{
// Do whatever you need to do with the first bits of information.
// In this case, we add them all to a list for return.
strings.Add($"{lineAsArray[0]}:{lineAsArray[1]}");
strings.Add($"{lineAsArray[2]}:{lineAsArray[3]}");
}
}
What this demonstrates is that every non-null line read from the file is broken into parts on the ":" character, with empty entries removed; lines that do not yield exactly four parts are skipped.
We then use a C# feature called string interpolation ($"") to put the first two parts back together with ":" as a string, then the second two. Or do whatever you need with each part of the line.
That's really all there is to it! Hope it helps.
Addendum: If you really need to read parts of a file, use StreamReader.Read() and Peek():
using (var sr = new StreamReader(path))
{
while (sr.Peek() >= 0)
{
Console.Write((char)sr.Read());
}
}
Reading each character
Some bare bones code:
string fileName = @"c:\some folder\path\file.txt";
using (StreamReader sr = new StreamReader(fileName))
{
while (!sr.EndOfStream)
{
String[] values = sr.ReadLine().Split(":".ToCharArray());
if (values.Length >= 2)
{
// ... do something with values[0] and values[1] ...
Console.WriteLine(values[0] + ", " + values[1]);
}
}
}

Splitting text into sentences

I ran into a problem while trying to parse my text into sentences.
Everything works fine if the text is formatted this way: (random text)
Much did had call new drew that kept. Limits expect wonder law she.
Now has you views woman noisy match money rooms.
Program parses text into 3 sentences.
But as soon as there is a line break in the middle of a sentence my program splits text incorrectly.
Much did had call new drew that kept. Limits (new line here) expect wonder law she.
Now has you views woman noisy match money rooms.
Program parses text as 4 sentences.
My code:
public static void ReadData()
{
char[] sentenceSeparators = {'.', '!', '?'};
using (StreamReader reader = new StreamReader(dataFile))
{
string line = null;
while (null != (line = reader.ReadLine()))
{
var split = line.Split(sentenceSeparators, StringSplitOptions.RemoveEmptyEntries);
foreach (var i in split)
{
Console.WriteLine(i);
}
}
}
}
Input #1:
Much did had call new drew that kept. Limits expect wonder law she.
Now has you views woman noisy match money rooms.
Output #1:
Much did had call new drew that kept
Limits expect wonder law she
Now has you views woman noisy match money rooms
Input #2:
Much did had call new drew that kept. Limits expect
wonder law she.
Now has you views woman noisy match money rooms.
Output #2:
Much did had call new drew that kept
Limits expect
wonder law she
Now has you views woman noisy match money rooms
It's because you are using ReadLine. Use ReadToEnd instead.
public static void ReadData()
{
char[] sentenceSeparators = {'.', '!', '?'};
using (StreamReader reader = new StreamReader(dataFile))
{
string line = reader.ReadToEnd();
var split = line.Split(sentenceSeparators, StringSplitOptions.RemoveEmptyEntries);
foreach (var i in split)
{
Console.WriteLine(i);
}
}
}
As already mentioned, don't read it line by line if you don't want \n to influence your splitting. Here is a version that does the job in one line:
string [] split = File.ReadAllText(dataFile).Split(sentenceSeparators, StringSplitOptions.RemoveEmptyEntries);
Also: the display in the console is misleading. It shows the "bad" sentence on two lines, but in the split array it occupies a single element!
Console.WriteLine(split.Length); // will display 3
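If you also want each sentence to print on a single line, one option is to collapse the embedded line breaks after splitting. The sketch below is one way to do that; the `SentenceSplitter` class name is illustrative. It also drops whitespace-only fragments, which a plain split can leave behind when the file ends with a line break after the final period.

```csharp
using System;
using System.Linq;

public static class SentenceSplitter
{
    private static readonly char[] SentenceSeparators = { '.', '!', '?' };
    private static readonly char[] Whitespace = { ' ', '\t', '\r', '\n' };

    // Splits text into sentences and collapses any internal line breaks
    // or runs of whitespace into single spaces.
    public static string[] Split(string text)
    {
        return text
            .Split(SentenceSeparators, StringSplitOptions.RemoveEmptyEntries)
            .Select(s => string.Join(" ", s.Split(Whitespace, StringSplitOptions.RemoveEmptyEntries)))
            .Where(s => s.Length > 0)
            .ToArray();
    }
}
```

It could be called as `SentenceSplitter.Split(File.ReadAllText(dataFile))`.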

reading a CSV issue

I am trying to read a CSV file.
The following is a sample line:
"0734306547 ","9780734306548 ","Jane Eyre Pink PP ","Bronte Charlotte ","FRONT LIST",20/03/2013 0:00:00,0,"PAPERBACK","Y","Pen"
Here is the code I am using to read the CSV:
public void readCSV()
{
StreamReader reader = new StreamReader(File.OpenRead(@"C:\abc\21-08-2013\PNZdatafeed.csv"),Encoding.ASCII);
List<string> ISBN = new List<String>();
while (!reader.EndOfStream)
{
string line = reader.ReadLine();
if (!String.IsNullOrWhiteSpace(line))
{
string[] values = line.Split(',');
if (values[9] == "Pen")
{
ISBN.Add(values[1]);
}
}
}
MessageBox.Show(ISBN.Count().ToString());
}
I am not able to compare the value with if (values[9] == "Pen"), because when I debug the code it shows the value of values[9] as "\"Pen\"".
How do I get rid of the special characters?
The problem here is that you're splitting the line every time you find , and leaving the data like that. For example, if this is the line you're reading in:
"A","B","C"
and you split it at commas, you'll get "A", "B", and "C" as your data. According to your description, you don't want quotes around the data.
To throw away quotes around a string:
Check if the leftmost character is ".
If so, check if the rightmost character is ".
If so, remove the leftmost and rightmost characters.
In pseudocode:
if (data.left(1) == "\"" && data.right(1) == "\"") {
data = data.trimleft(1).trimright(1)
}
At this point you might have a few questions (I'm not sure how much experience you have). If any of these apply to you, feel free to ask them, and I'll explain further.
What does "\"" mean?
How do I extract the leftmost/rightmost character of a string?
How do I extract the middle of a string?
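In C#, the pseudocode above could be sketched as follows. The `CsvField` class and `Unquote` method names are illustrative, not from the answer.

```csharp
using System;

public static class CsvField
{
    // Removes one pair of surrounding double quotes, if present.
    public static string Unquote(string data)
    {
        if (data.Length >= 2 && data[0] == '"' && data[data.Length - 1] == '"')
        {
            return data.Substring(1, data.Length - 2);
        }
        return data;
    }
}
```

The comparison in the question would then become `if (CsvField.Unquote(values[9]) == "Pen")`. Note that splitting on every comma still breaks any quoted field that itself contains a comma, so this only covers data like the sample line shown.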

How to read barcode from text file by specified place in C#?

0000016011071693266104*014482*3 15301 45 VETRO NOVA BLUVETRO NOVA BLUE FLAT STRETCH 115428815150010050 05420 000033 0003
0000072011076993266101*014687*4 15300 45 VETRO NOVA BLUVETRO NOVA BLUE FLAT STRETCH 115428815160010030 05430 000032 0007
I have a text file that contains many barcodes, one per line. As you can see in the sample above, some parts of each line are company codes and other parts represent other things.
So how can I read this text line by line and character by character in C#?
For reading it line by line you can use a StreamReader - see for example on MSDN http://msdn.microsoft.com/en-us/library/db5x7c0d.aspx
Another option is:
string[] AllLines = File.ReadAllLines (@"C:\MyFile.txt");
This gives you all lines in a string array and you can work with them - this uses more memory but is faster... see for example http://msdn.microsoft.com/en-us/library/s2tte0y1.aspx
When you have a line in a string you can split that line, for example:
string[] MyFields = AllLines[1].Split(null); // since your fields seem to be separated by whitespace
The result is that you have the parts of the line in an array and can access for example the second field in the line with MyFields[1] - see http://msdn.microsoft.com/en-us/library/b873y76a.aspx
EDIT - as per comment another option:
IF you exactly know the positions and lengths of your fields you can do this:
string MyIdentity = AllLines[1].Substring(1, 5);
For MSDN reference see http://msdn.microsoft.com/en-us/library/aka44szs.aspx
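Substring throws an ArgumentOutOfRangeException when the requested range runs past the end of the string, which can happen with short or truncated lines. A minimal guarded sketch follows; the `FixedWidth` helper is hypothetical, and the real start positions and lengths depend on your file layout, which the question does not specify.

```csharp
using System;

public static class FixedWidth
{
    // Returns the field at [start, start + length), or null when the line is too short.
    public static string Field(string line, int start, int length)
    {
        if (start < 0 || length < 0 || start + length > line.Length) return null;
        return line.Substring(start, length);
    }
}
```

For example, `FixedWidth.Field(AllLines[1], 1, 5)` mirrors the call above but returns null instead of throwing on a short line.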
You use Microsoft libraries dedicated to files and streams to open a file, and Readline().
Then you use Microsoft libraries dedicated to parsing to parse those lines.
You create, with Microsoft libraries, a regular expression to detect bar codes (not borcod...)
Then you throw away anything that doesn't match your regular expression.
Then you compile and debug (you can use Mono). And voilà, you have a C# program that solves your problem.
Note: you definitely don't need to go "character by character". Microsoft libraries and parsing will be much easier for your simple need.
If all you are after is reading it line-by-line, and character-by-character, then this is a possible solution:
var lines = File.ReadLines(@"pathtotextfile.txt");
foreach (var line in lines)
{
foreach (var character in line)
{
char individualCharacter = character;
}
}
If you need to know which line and character you are on; you can use a for loop instead:
var lines = File.ReadAllLines(@"pathtotextfile.txt");
for (var i = 0; i < lines.Length; i++)
{
var line = lines[i];
for(var j = 0; j < line.Length; j++)
{
var character = line[j];
}
}
Or use SelectMany in LINQ:
var lines = File.ReadLines(@"pathtotextfile.txt");
foreach (char individualCharacter in lines.SelectMany(line => line))
{
}
Now, as far as my opinion goes, doing it "line by line" and "character by character" seems like a difficult choice to me. If you can tell us what exactly each bit of information is in the barcode, we could help you extract it that way.

Need to pick up line terminators with StreamReader.ReadLine()

I wrote a C# program to read an Excel .xls/.xlsx file and output to CSV and Unicode text. I wrote a separate program to remove blank records. This is accomplished by reading each line with StreamReader.ReadLine(), and then going character by character through the string and not writing the line to output if it contains all commas (for the CSV) or all tabs (for the Unicode text).
The problem occurs when the Excel file contains embedded newlines (\x0A) inside the cells. I changed my XLS to CSV converter to find these new lines (since it goes cell by cell) and write them as \x0A, and normal lines just use StreamWriter.WriteLine().
The problem occurs in the separate program to remove blank records. When I read in with StreamReader.ReadLine(), by definition it only returns the string with the line, not the terminator. Since the embedded newlines show up as two separate lines, I can't tell which is a full record and which is an embedded newline for when I write them to the final file.
I'm not even sure I can read in the \x0A because everything on the input registers as '\n'. I could go character by character, but this destroys my logic to remove blank lines.
I would recommend that you change your architecture to work more like a parser in a compiler.
You want to create a lexer that returns a sequence of tokens, and then a parser that reads the sequence of tokens and does stuff with them.
In your case the tokens would be:
Column data
Comma
End of Line
You would treat '\n' ('\x0a') by itself as an embedded new line, and therefore include it as part of a column data token. A '\r\n' would constitute an End of Line token.
This has the advantages of:
Doing only 1 pass over the data
Only storing a max of 1 lines worth of data
Reusing as much memory as possible (for the string builder and the list)
It's easy to change should your requirements change
Here's a sample of what the Lexer would look like:
Disclaimer: I haven't even compiled, let alone tested, this code, so you'll need to clean it up and make sure it works.
enum TokenType
{
ColumnData,
Comma,
LineTerminator
}
class Token
{
public TokenType Type { get; private set;}
public string Data { get; private set;}
public Token(TokenType type)
{
Type = type;
}
public Token(TokenType type, string data)
{
Type = type;
Data = data;
}
}
private IEnumerable<Token> GetTokens(TextReader s)
{
var builder = new StringBuilder();
while (s.Peek() >= 0)
{
var c = (char)s.Read();
switch (c)
{
case ',':
{
if (builder.Length > 0)
{
yield return new Token(TokenType.ColumnData, ExtractText(builder));
}
yield return new Token(TokenType.Comma);
break;
}
case '\r':
{
var next = s.Peek();
if (next == '\n')
{
s.Read();
}
if (builder.Length > 0)
{
yield return new Token(TokenType.ColumnData, ExtractText(builder));
}
yield return new Token(TokenType.LineTerminator);
break;
}
default:
builder.Append(c);
break;
}
}
s.Read();
if (builder.Length > 0)
{
yield return new Token(TokenType.ColumnData, ExtractText(builder));
}
}
private string ExtractText(StringBuilder b)
{
var ret = b.ToString();
b.Remove(0, b.Length);
return ret;
}
Your "parser" code would then look like this:
public void ConvertXLS(TextReader s)
{
var columnData = new List<string>();
bool lastWasColumnData = false;
bool seenAnyData = false;
foreach (var token in GetTokens(s))
{
switch (token.Type)
{
case TokenType.ColumnData:
{
seenAnyData = true;
if (lastWasColumnData)
{
//TODO: do some error reporting
}
else
{
lastWasColumnData = true;
columnData.Add(token.Data);
}
break;
}
case TokenType.Comma:
{
if (!lastWasColumnData)
{
columnData.Add(null);
}
lastWasColumnData = false;
break;
}
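case TokenType.LineTerminator:
{
if (seenAnyData)
{
// Pass the collected columns, not the bool flag.
OutputLine(columnData);
}
seenAnyData = false;
lastWasColumnData = false;
columnData.Clear();
break; // C# does not allow falling out of a switch section
}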
}
}
if (seenAnyData)
{
OutputLine(columnData);
}
}
You can't change StreamReader to return the line terminators, and you can't change what it uses for line termination.
I'm not entirely clear about the problem in terms of what escaping you're doing, particularly in terms of "and write them as \x0A". A sample of the file would probably help.
It sounds like you may need to work character by character, or possibly load the whole file first and do a global replace, e.g.
x.Replace("\r\n", "\u0000") // Or some other unused character
.Replace("\n", "\\x0A") // Or whatever escaping you need
.Replace("\u0000", "\r\n") // Replace the real line breaks
I'm sure you could do that with a regex and it would probably be more efficient, but I find the long way easier to understand :) It's a bit of a hack having to do a global replace though - hopefully with more information we'll come up with a better solution.
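If working character by character turns out to be necessary, one possible shape is a reader that returns each line together with the terminator that ended it, so the caller can tell an embedded \n from a real \r\n record end. This is a sketch under that assumption; `TerminatorReader` is an illustrative name and not an existing framework class.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class TerminatorReader
{
    // Yields each line (Key) together with the terminator that ended it (Value):
    // "\r\n", "\n", "\r", or "" for a final line with no terminator.
    public static IEnumerable<KeyValuePair<string, string>> ReadLines(TextReader reader)
    {
        var builder = new StringBuilder();
        int c;
        while ((c = reader.Read()) >= 0)
        {
            if (c == '\r' || c == '\n')
            {
                string terminator = ((char)c).ToString();
                if (c == '\r' && reader.Peek() == '\n')
                {
                    reader.Read(); // consume the '\n' of a CRLF pair
                    terminator = "\r\n";
                }
                yield return new KeyValuePair<string, string>(builder.ToString(), terminator);
                builder.Clear();
            }
            else
            {
                builder.Append((char)c);
            }
        }
        if (builder.Length > 0)
        {
            yield return new KeyValuePair<string, string>(builder.ToString(), "");
        }
    }
}
```

A caller could then write each line back with `Write(line + terminator)`, keeping the original terminators intact while still deciding per record whether to drop it.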
Essentially, a hard-return in Excel (shift+enter or alt+enter, I can't remember) puts a newline that is equivalent to \x0A in the default encoding I use to write my CSV. When I write to CSV, I use StreamWriter.WriteLine(), which outputs the line plus a newline (which I believe is \r\n).
The CSV is fine and comes out exactly how Excel would save it, the problem is when I read it into the blank record remover, I'm using ReadLine() which will treat a record with an embedded newline as a CRLF.
Here's an example of the file after I convert to CSV...
Reference,Name of Individual or Entity,Type,Name Type,Date of Birth,Place of Birth,Citizenship,Address,Additional Information,Listing Information,Control Date,Committees
1050,"Aziz Salih al-Numan
",Individual,Primary Name,1941 or 1945,An Nasiriyah,Iraqi,,Ba’th Party Regional Command Chairman; Former Governor of Karbala and An Najaf Former Minister of Agriculture and Agrarian Reform (1986-1987),Resolution 1483 (2003),6/27/2003,1518 (Iraq)
1050a,???? ???? ???????,Individual,Original script,1941 or 1945,An Nasiriyah,Iraqi,,Ba’th Party Regional Command Chairman; Former Governor of Karbala and An Najaf Former Minister of Agriculture and Agrarian Reform (1986-1987),Resolution 1483 (2003),6/27/2003,1518 (Iraq)
As you can see, the first record has an embedded new-line after al-Numan. When I use ReadLine(), I get '1050,"Aziz Salih al-Numan' and when I write that out, WriteLine() ends that line with a CRLF. I lose the original line terminator. When I use ReadLine() again, I get the line starting with '1050a'.
I could read the entire file in and replace them, but then I'd have to replace them back afterwards. Basically what I want to do is get the line terminator to determine if it's \x0A or a CRLF, and then if it's \x0A, I'll use Write() and insert that terminator.
I know I'm a little late to the game here, but I was having the same problem and my solution was a lot simpler than most given.
If you are able to determine the column count which should be easy to do since the first line is usually the column titles, you can check your column count against the expected column count. If the column count doesn't equal the expected column count, you simply concatenate the current line with the previous unmatched lines. For example:
string sep = "\",\"";
int columnCount = 0;
while ((currentLine = sr.ReadLine()) != null)
{
if (lineCount == 0)
{
lineData = currentLine.Split(new string[] { sep }, StringSplitOptions.None);
columnCount = lineData.Length;
++lineCount;
continue;
}
string thisLine = lastLine + currentLine;
lineData = thisLine.Split(new string[] { sep }, StringSplitOptions.None);
if (lineData.Length < columnCount)
{
lastLine += currentLine;
continue;
}
else
{
lastLine = null;
}
......
Thank you so much! With your code and some others, I came up with the following solution. I have added a link at the bottom to some code I wrote that uses some of the logic from this page. I figured I'd give honor where honor was due!
Below is an explanation of what I needed:
I wrote this because I have some very large '|'-delimited files that have \r\n inside some of the columns, and I needed to use \r\n as the end-of-line delimiter. I was trying to import the files using SSIS packages, but because of some corrupted data in the files I was unable to. The file was over 5 GB, so it was too large to open and fix manually. After reading through lots of forums to understand how streams work, I ended up with a solution that reads each character in a file and emits lines based on the definitions I added to it. It is meant for use as a command-line application, complete with help :). I hope this helps some other people out; I haven't found a solution quite like it anywhere else, although the ideas were inspired by this forum and others.
https://stackoverflow.com/a/12640862/1582188
