C# Programming: How to grep columns/lines from a text file?

I have a C# console program whose main function is to let a user grep lines / columns from a log text file.
For example, the user may want every related line in the text file between two timestamps, such as "Tue Aug 03 2004 22:58:34" and "Wed Aug 04 2004 00:56:48". After processing, the program should output all the data found in the log text file between the 2 dates.
Could someone please advise on some code that I could use to grep or create a filter to retrieve the necessary text/data from the file? Thanks!
C# Program Files:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Diagnostics;
using System.IO;
namespace Testing
{
class Analysis
{
static void Main()
{
// Read the file lines into a string array.
string[] lines = System.IO.File.ReadAllLines(@"C:\Test\ntfs.txt");
System.Console.WriteLine("Analyzing ntfs.txt:");
foreach (string line in lines)
{
Console.WriteLine("\t" + line);
// ***Trying to filter/grep out dates, file size, etc****
if (lines = "Sun Nov 19 2000")
{
Console.WriteLine("Print entire line");
}
}
// Keep the console window open in debug mode.
Console.WriteLine("Press any key to exit.");
System.Console.ReadKey();
}
}
}
Log Text File Example:
Wed Jul 21 2004 16:58:48 499712 m... r/rrwxrwxrwx 0 0 8360-128-3
C:/Program Files/AccessData/Common Files/AccessData LicenseManager/LicenseManager.exe
Tue Aug 03 2004 22:58:34 23040 m... r/rrwxrwxrwx 0 0 8522-128-3
C:/System Volume Information/_restore{88D7369F-4F7E-44D4-8CD1-
F7FF1F6AC067}/RP4/A0002101.sys
23040 m... r/rrwxrwxrwx 0 0 9132-128-3
C:/WINDOWS/system32/ReinstallBackups/0003/DriverFiles/i386/mouclass.sys
23040 m... r/rrwxrwxrwx 0 0 9135-128-4 C:/System Volume
Information/_restore{88D7369F-4F7E-44D4-8CD1-F7FF1F6AC067}/RP4/A0003123.sys
23040 m... r/rrwxrwxrwx 0 0 9136-128-3
C:/WINDOWS/system32/drivers/mouclass.sys
Tue Aug 03 2004 23:01:16 196864 m... r/rrwxrwxrwx 0 0 4706-128-3
C:/WINDOWS/system32/drivers/rdpdr.sys
Tue Aug 03 2004 23:08:18 24960 m... r/rrwxrwxrwx 0 0 8690-128-3
C:/WINDOWS/system32/drivers/hidparse.sys

You could do this using Regex to select matching lines in a richer way than string.Contains allows.
Not sure why you are reinventing findstr.exe though.
For large files you might find that File.ReadLines (.NET 4 and later) performs better. It reads the same lines but lets you process them in a foreach and other IEnumerable scenarios without loading the entire file into RAM at once.
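As a rough sketch of how File.ReadLines could be combined with date parsing to cover the date-range case from the question: the class and method names below (LogDateFilter, Between) are made up for illustration, and the code assumes that every dated line starts with a 24-character timestamp such as "Tue Aug 03 2004 22:58:34" and that undated lines belong to the most recent dated line above them.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

static class LogDateFilter
{
    // Assumed timestamp layout, e.g. "Tue Aug 03 2004 22:58:34" (24 characters).
    const string StampFormat = "ddd MMM dd yyyy HH:mm:ss";
    const int StampLength = 24;

    // Yields every line whose governing timestamp lies in [from, to].
    public static IEnumerable<string> Between(IEnumerable<string> lines, DateTime from, DateTime to)
    {
        bool inRange = false;
        foreach (string line in lines)
        {
            // A line that starts with a parseable timestamp opens a new group.
            if (line.Length >= StampLength &&
                DateTime.TryParseExact(line.Substring(0, StampLength), StampFormat,
                    CultureInfo.InvariantCulture, DateTimeStyles.None, out DateTime stamp))
            {
                inRange = stamp >= from && stamp <= to;
            }
            // Undated lines inherit the state of the last dated line.
            if (inRange) yield return line;
        }
    }
}
```

It could be called as LogDateFilter.Between(File.ReadLines(path), from, to), so the file is streamed rather than loaded whole. If the real log uses a different timestamp layout, the format string and length would need adjusting.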

Well, as a quick fix for the specific example:
if (line.StartsWith("Sun Nov 19 2000"))
{
Console.WriteLine(line);
}
You could use Contains to find a substring within the line.
Note that loading the whole file in an array won't scale well for very large logs. We can look into fixing that if it's an issue for you - but let's take things slowly :)

Here's a grep style method I use in testing:
public static List<string> FileGrep(string filePath, string searchText)
{
var matches = new List<string>();
// StreamReader opens the file itself; the using block closes it when done.
using (var s = new StreamReader(filePath))
{
string line;
// ReadLine returns null at the end of the file.
while ((line = s.ReadLine()) != null)
{
if (line.Contains(searchText)) matches.Add(line);
}
}
return matches;
}

Related

How to get counted words in files in BODY field?

The following code counts words in all ".sgm" files in a directory.
But I need to count only the words between the BODY tags in all ".sgm" files.
How can I do that?
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml;
using System.Xml.Serialization;
namespace Project2
{
class Program
{
static void Main(string[] args)
{
string[] parcesPlaces = new string[] { "west-germany", "usa", "france", "uk", "canada", "japan" };
DirectoryInfo filePaths = new DirectoryInfo(@"D:\project_IAD");
FileInfo[] Files = filePaths.GetFiles("*.sgm");
List<TotalBody> allNeedBody = new List<TotalBody>();
foreach (FileInfo file in Files)
{
string fileContent = File.ReadAllText(file.FullName);
string fileContentCleared = ReplaceHexadecimalSymbols(fileContent);
string myRootedXml = "<root>" + fileContentCleared + "</root>";
root result = (root)XmlDeserializeFromString(myRootedXml, typeof(root));
Console.WriteLine(" Number of words needed: {0}", result.REUTERS.ToList().Count);
foreach (rootREUTERS rootREUTERS in result.REUTERS)
{
if (rootREUTERS.PLACES.Length != 1)
{
continue;
}
else if (!parcesPlaces.Contains(rootREUTERS.PLACES[0]))
{
continue;
}
else
{
if (rootREUTERS.TEXT.BODY != null)
{
allNeedBody.Add(new TotalBody(rootREUTERS.PLACES[0], rootREUTERS.TEXT.BODY));
}
else
{
continue;
}
}
}
}
Console.WriteLine("Total count words: ");
Console.WriteLine(allNeedBody.Count);
Console.ReadKey();
}
private static object XmlDeserializeFromString(string v, Type type)
{
object result = null;
using (TextReader reader = new StringReader(v))
{
result = new XmlSerializer(type).Deserialize(reader);
}
return result;
}
private static string ReplaceHexadecimalSymbols(string txt)
{
string r = "[\x00-\x08\x0B\x0C\x0E-\x1F\x26]";
return Regex.Replace(txt, r, "", RegexOptions.Compiled);
}
}
}
Example of text in file "reut2-000.sgm":
<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="5544" NEWID="1">
<DATE>26-FEB-1987 15:01:01.79</DATE>
<TOPICS><D>cocoa</D></TOPICS>
<PLACES><D>el-salvador</D><D>usa</D><D>uruguay</D></PLACES>
<PEOPLE></PEOPLE> <ORGS></ORGS> <EXCHANGES></EXCHANGES>
<COMPANIES></COMPANIES> <UNKNOWN> C T
f0704reute u f BC-BAHIA-COCOA-REVIEW 02-26
0105</UNKNOWN> <TEXT> <TITLE>BAHIA COCOA REVIEW</TITLE> <DATELINE>
SALVADOR, Feb 26 - </DATELINE><BODY>**Showers continued throughout the
week in the Bahia cocoa zone, alleviating the drought since early
January and improving prospects for the coming temporao, although
normal humidity levels have not been restored, Comissaria Smith said
in its weekly review.
The dry period means the temporao will be late this year.
Arrivals for the week ended February 22 were 155,221 bags of 60 kilos making a cumulative total for the season of 5.93 mln against
5.81 at the same stage last year. Again it seems that cocoa delivered earlier on consignment was included in the arrivals figures.
Comissaria Smith said there is still some doubt as to how much old crop cocoa is still available as harvesting has practically come to an
end. With total Bahia crop estimates around 6.4 mln bags and sales
standing at almost 6.2 mln there are a few hundred thousand bags still
in the hands of farmers, middlemen, exporters and processors.
There are doubts as to how much of this cocoa would be fit for export as shippers are now experiencing dificulties in obtaining
+Bahia superior+ certificates.
In view of the lower quality over recent weeks farmers have sold a good part of their cocoa held on consignment.
Comissaria Smith said spot bean prices rose to 340 to 350 cruzados per arroba of 15 kilos.
Bean shippers were reluctant to offer nearby shipment and only limited sales were booked for March shipment at 1,750 to 1,780 dlrs
per tonne to ports to be named.
New crop sales were also light and all to open ports with June/July going at 1,850 and 1,880 dlrs and at 35 and 45 dlrs under
New York july, Aug/Sept at 1,870, 1,875 and 1,880 dlrs per tonne FOB.
Routine sales of butter were made. March/April sold at 4,340, 4,345 and 4,350 dlrs.
April/May butter went at 2.27 times New York May, June/July at 4,400 and 4,415 dlrs, Aug/Sept at 4,351 to 4,450 dlrs and at
2.27 and 2.28 times New York Sept and Oct/Dec at 4,480 dlrs and
2.27 times New York Dec, Comissaria Smith said.
Destinations were the U.S., Covertible currency areas, Uruguay and open ports.
Cake sales were registered at 785 to 995 dlrs for March/April, 785 dlrs for May, 753 dlrs for Aug and 0.39 times New York Dec for
Oct/Dec.
Buyers were the U.S., Argentina, Uruguay and convertible currency areas.
Liquor sales were limited with March/April selling at 2,325 and 2,380 dlrs, June/July at 2,375 dlrs and at 1.25 times New York July,
Aug/Sept at 2,400 dlrs and at 1.25 times New York Sept and Oct/Dec at
1.25 times New York Dec, Comissaria Smith said.
Total Bahia sales are currently estimated at 6.13 mln bags against the 1986/87 crop and 1.06 mln bags against the 1987/88 crop.
Final figures for the period to February 28 are expected to be published by the Brazilian Cocoa Trade Commission after carnival which
ends midday on February 27.** Reuter </BODY></TEXT> </REUTERS>
<REUTERS TOPICS="NO" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET"
OLDID="5545" NEWID="2"> <DATE>26-FEB-1987 15:02:20.00</DATE>
<TOPICS></TOPICS> <PLACES><D>usa</D></PLACES> <PEOPLE></PEOPLE>
<ORGS></ORGS> <EXCHANGES></EXCHANGES> <COMPANIES></COMPANIES>
<UNKNOWN> F Y f0708reute d f
BC-STANDARD-OIL-<SRD>-TO 02-26 0082</UNKNOWN>
I need to count words only in the BODY fields (marked in bold in the example), without other characters, etc.
An example file is available for testing purposes.
From your question, you are building XML-formatted content and deserializing it just to count the words. That would be fine if you needed to collect the data, but if the intention is only to count the words between the BODY tags, it is much faster to parse the text and count on the fly.
My strategy is to take the substring of the content that starts after <BODY> and ends before </BODY>, and count the words by splitting it.
Here is the solution:
DirectoryInfo filePaths = new DirectoryInfo(@"D:\Stackoverflow\SgmCount\docs");
FileInfo[] Files = filePaths.GetFiles("*.sgm");
char[] delimiters = { ' ', '\r', '\n' };
int wordCount = 0;
foreach (FileInfo file in Files)
{
string content = File.ReadAllText(file.FullName);
// Locate the first <BODY>...</BODY> block; later blocks in the same file are ignored.
int start = content.IndexOf("<BODY>", StringComparison.Ordinal);
if (start < 0) continue; // no BODY tag in this file
start += "<BODY>".Length;
int end = content.IndexOf("</BODY>", start, StringComparison.Ordinal);
if (end < 0) continue; // unterminated BODY block
string body = content.Substring(start, end - start);
// Accumulate across files instead of overwriting the previous count.
wordCount += body.Split(delimiters, StringSplitOptions.RemoveEmptyEntries).Length;
}
Console.WriteLine($"Total count words: {wordCount} words");
This gives an output:
Total count words: 488 words
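Note that the substring approach above reads only the first BODY block in each file, while a Reuters .sgm file normally holds many articles. One possible way to cover every block is a non-greedy regular expression. This is only a sketch: BodyWordCounter and CountWords are hypothetical names, and it assumes the tags are always written exactly as <BODY> and </BODY> in upper case.

```csharp
using System;
using System.Text.RegularExpressions;

static class BodyWordCounter
{
    // Singleline lets '.' match newlines, since a BODY block spans many lines.
    // The lazy quantifier stops at the nearest closing tag.
    static readonly Regex BodyBlock = new Regex("<BODY>(.*?)</BODY>", RegexOptions.Singleline);
    static readonly char[] Whitespace = { ' ', '\t', '\r', '\n' };

    // Returns the number of whitespace-separated tokens inside all BODY blocks.
    public static int CountWords(string content)
    {
        int total = 0;
        foreach (Match block in BodyBlock.Matches(content))
        {
            string body = block.Groups[1].Value;
            total += body.Split(Whitespace, StringSplitOptions.RemoveEmptyEntries).Length;
        }
        return total;
    }
}
```

Inside the file loop this could be used as wordCount += BodyWordCounter.CountWords(File.ReadAllText(file.FullName)). Like the original, it counts whitespace-separated tokens, so numbers and punctuation attached to words are counted as words too.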

Split a string into lines?

Here is code;
foreach (var file in d.GetFiles("*.xml"))
{
string test = getValuesOneFile(file.ToString());
result.Add(test);
Console.WriteLine(test);
Console.ReadLine();
}
File.WriteAllLines(filepath + @"\MapData.txt", result);
Here is what it looks like in the console;
[30000]
total=5
sp 0 -144 152 999999999
sp 0 -207 123 999999999
sp 0 -173 125 999999999
in00 1 -184 213 999999999
out00 2 1046 94 40000
Here is how it looks like in the text file (when written at end of loop).
[30000]total=5sp 0 -144 152 999999999sp 0 -207 123 999999999sp 0 -173 125 999999999in00 1 -184 213 999999999out00 2 1046 94 40000
I need it to write the lines in the same style as the console output.
WriteAllLines separates the values with the environment's newline string. Historically, though, several different character sequences have been used to represent a new line, and the program you are using to view the text file is expecting a different one. You have three options: view the file in a program that handles this separator (or any separator), configure your program to expect this separator, or replace WriteAllLines with a manual method of writing the strings that uses another newline separator.
Rather than WriteAllLines, you'll probably want to just write the text manually:
string textToWrite = "";
foreach (var res in result)
{
textToWrite += res.Replace("\r","").Replace("\n",""); //Ensure there are no line feeds or carriage returns
textToWrite += "\r\n"; //Add the carriage return and line feed
}
File.WriteAllText(filepath + @"\MapData.txt", textToWrite);
The problem is definitely how you are looking for newlines in your output. Environment.NewLine will get inserted after each string written by WriteAllLines.
I would recommend opening the output file in Notepad++ and turning on View -> Show Symbol -> Show End of Line to see which end-of-line characters are in the file. On my machine, for instance, it is [CR][LF] (Carriage Return / Line Feed) at the end of each line, which is standard for Windows.
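If no such editor is at hand, the same check can be approximated in code. The helper below is a hypothetical sketch (NewlineInspector and Describe are not part of any library) that tallies the three common line-ending sequences in a string.

```csharp
using System;

static class NewlineInspector
{
    // Counts CR LF pairs, lone LF and lone CR characters in the given text.
    public static string Describe(string text)
    {
        int crlf = 0, lf = 0, cr = 0;
        for (int i = 0; i < text.Length; i++)
        {
            if (text[i] == '\r')
            {
                // A CR directly followed by LF is one Windows-style line ending.
                if (i + 1 < text.Length && text[i + 1] == '\n') { crlf++; i++; }
                else cr++;
            }
            else if (text[i] == '\n')
            {
                lf++;
            }
        }
        return $"CRLF={crlf}, LF={lf}, CR={cr}";
    }
}
```

Calling NewlineInspector.Describe(File.ReadAllText(path)) on the written file would show whether the strings being written carry lone LF characters, which some viewers do not render as line breaks.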

Reading from a textfile and writing to another textfile

I have data in Notepad that looks like this, and I'm writing it to an output file:
01 some Data
02 some Data
02 some data
03 some data(End of client 1)
01 some data
02 some data
02 some data
02 some data
03 some data(End of client 2)
I want to count how many times the value 02 appears and display it after the end of each client.
I'm using this piece of code to count
int count = File.ReadLines(@"C:\Exercises\gamenam.dat").Count(
line => line.StartsWith("02")
);
How do I display the count after the end of each client, i.e. after the 03 line?
This might be easier to do with a loop and regex:
int count = 0;
foreach (string line in File.ReadLines(@"C:\Exercises\gamenam.dat"))
{
if (line.StartsWith("02"))
count++;
Match clientMatch = Regex.Match(line, @"(?<=\(End of client )\d+(?=\))");
if (clientMatch.Success)
{
// Replace line below with write to output file
Console.WriteLine("Client {0} has {1} occurrences of \"02\".",
clientMatch.Value, count);
count = 0;
}
}

rename multiple file in a sequence c# or c++

How can I rename multiple files like this:
file.txt , anotherfile.txt , log.txt
into something like this :
file1.txt , file2.txt , file3.txt
How can I do this in c# or in c++ ?
Use the File.Move method:
IEnumerable<FileInfo> files = GetFilesToBeRenamed();
int i = 1;
foreach(FileInfo f in files)
{
File.Move(f.FullName, string.Format("file{0}.txt", i));
i++;
}
To keep each renamed file in its original directory, combine the directory with the new name instead:
File.Move(f.FullName,
Path.Combine(f.DirectoryName, string.Format("file{0}.txt", i)));
This would work if you're using an sh-based shell:
#!/bin/sh
FEXT="txt" # This is the file extension you're searching for
FPRE="file" # This is the base of the new files names file1.txt, file2.txt, etc.
FNUM=1; # This is the initial starting number
find . -name "*.${FEXT}" | while read OFN ; do
# Determine new file name
NFN="${FPRE}${FNUM}.${FEXT}"
# Increment FNUM
FNUM=$(($FNUM + 1))
# Rename File
mv "${OFN}" "${NFN}"
done
The script in action:
[james@fractal renfiles]$ touch abc.txt
[james@fractal renfiles]$ touch test.txt
[james@fractal renfiles]$ touch "filename with spaces.txt"
[james@fractal renfiles]$ ll
total 4
-rw-rw-r-- 1 james james 0 Sep 3 17:45 abc.txt
-rw-rw-r-- 1 james james 0 Sep 3 17:45 filename with spaces.txt
-rwxrwxr-x 1 james james 422 Sep 3 17:41 renfiles.sh
-rw-rw-r-- 1 james james 0 Sep 3 17:45 test.txt
[james@fractal renfiles]$ ./renfiles.sh
[james@fractal renfiles]$ ll
total 4
-rw-rw-r-- 1 james james 0 Sep 3 17:45 file1.txt
-rw-rw-r-- 1 james james 0 Sep 3 17:45 file2.txt
-rw-rw-r-- 1 james james 0 Sep 3 17:45 file3.txt
-rwxrwxr-x 1 james james 422 Sep 3 17:41 renfiles.sh
In c++, you will eventually use
std::rename(frompath, topath);
to perform the action. TR2 proposal N1975 covers this function. Until it is approved, use boost::filesystem::rename; between approval and final placement in the standard, use tr2::rename.
Loop through and use whatever names you want. Don't quite know if you're trying to add numbers, because the current question says 1, 2, 2.
In C# you can use File.Move(source, dest).
Of course you can do it programmatically:
string[] files = new string[] { "file.txt", "anotherfile.txt", "log.txt" };
int index = 0;
string dest;
foreach (string source in files)
{
if (!File.Exists(source)) continue; // skip names that are not on disk
// Advance the counter until the target name is free.
do {
index++;
dest = "file" + index + ".txt";
} while (File.Exists(dest));
File.Move(source, dest);
}

C# How to skip String line numbers in Array after processing from Text File?

The program I created lets users parse a log text file. It groups the various parts of the text file into a "sections" array.
However, is there a way to skip a number of lines within each element of the "sections" array? I have tried using the "split" method, but it does not work because it skips a number of sections instead of a number of lines within each section.
The lines that should be removed from each section are:
Restore Point Info
Description : Installed VMware Tools
Type : Application Install
Creation Time : Mon Nov 29 16:53:12 2010
Could someone please advise on the code? Thanks!
The code:
namespace Testing {
class Program {
static void Main(string[] args) {
TextReader tr = new StreamReader(@"C:\Test\new.txt");
String SplitBy = "----------------------------------------";
// Skip 5 lines of the original text file
for(var i = 0; i < 5; i++) {
tr.ReadLine();
}
// Read the reststring
String fullLog = tr.ReadToEnd();
String[] sections = fullLog.Split(new string[] { SplitBy }, StringSplitOptions.None);
//String[] lines = sections.Skip(5).ToArray();
int t = 0;
// Tried using foreach (String r in sections.skip(4)) but skips sections instead of the Text lines found within each sections
foreach (String r in sections) {
Console.WriteLine("The times are : " + t);
Console.WriteLine(r);
Console.WriteLine(sections[6]);
Console.WriteLine("============================================================");
t++;
}
}
}
}
An Example of the Text log file:
Restore Point Info
Description : System Checkpoint
Type : System Checkpoint
Creation Time : Mon Nov 29 16:51:52 2010
J:\syscrawl\Restore\RP1\snapshot\_REGISTRY_MACHINE_SYSTEM
ControlSet001\Enum\USBStor not found.
----------------------------------------
Restore Point Info
Description : Installed Hex Workshop v5
Type : Application Install
Creation Time : Fri Dec 3 04:35:57 2010
J:\syscrawl\Restore\RP10\snapshot\_REGISTRY_MACHINE_SYSTEM
USBStor
ControlSet001\Enum\USBStor
CdRom&Ven_SanDisk&Prod_Ultra_Backup&Rev_8.32 [Wed Dec 1 07:39:09 2010]
S/N: 2584820A2890B317&1 [Wed Dec 1 07:39:22 2010]
FriendlyName : SanDisk Ultra Backup USB Device
CdRom&Ven_WD&Prod_Virtual_CD_070A&Rev_1032 [Wed Dec 1 07:31:33 2010]
S/N: 575836314331304639303339&1 [Fri Dec 3 03:03:36 2010]
FriendlyName : WD Virtual CD 070A USB Device
Disk&Ven_SanDisk&Prod_Ultra_Backup&Rev_8.32 [Wed Dec 1 07:39:09 2010]
S/N: 2584820A2890B317&0 [Wed Dec 1 07:39:19 2010]
FriendlyName : SanDisk Ultra Backup USB Device
ParentIdPrefix: 8&2f23e350&0
Disk&Ven_WD&Prod_My_Passport_070A&Rev_1032 [Wed Dec 1 07:31:33 2010]
S/N: 575836314331304639303339&0 [Fri Dec 3 03:03:36 2010]
FriendlyName : WD My Passport 070A USB Device
Other&Ven_WD&Prod_SES_Device&Rev_1032 [Wed Dec 1 07:31:33 2010]
S/N: 575836314331304639303339&2 [Fri Dec 3 04:08:49 2010]
----------------------------------------
Restore Point Info
Description : Installed VMware Tools
Type : Application Install
Creation Time : Mon Nov 29 16:53:12 2010
J:\syscrawl\Restore\RP2\snapshot\_REGISTRY_MACHINE_SYSTEM
ControlSet001\Enum\USBStor not found.
There are multiple solutions available, depending on how maintainable you want the code to be:
Hard-coding the text to remove so that you find and replace it with empty string
Read lines one by one and you have a list of all lines to ignore and you check against them
Use a regular expression to extract what you need [PREFERRED]
In reality, the log file you are trying to parse does not seem to be generated by your software, i.e. you do not own the format (VMWare does). The format could therefore change with any update, so hard-coding the text you need or do not need could make your software very brittle.
I would recommend using Regex. You might spend a while writing the expression, but the result is clean and useful.
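A possible shape for such an expression is sketched below. The names SectionCleaner and StripHeader are invented for the example, and the pattern assumes the four header lines always begin with exactly the labels shown in the question ("Restore Point Info", "Description :", "Type :", "Creation Time :") and that no data line begins the same way.

```csharp
using System;
using System.Text.RegularExpressions;

static class SectionCleaner
{
    // Multiline makes ^ match at the start of every line, so each header
    // line is removed together with its line break (CR LF or LF).
    static readonly Regex HeaderLine = new Regex(
        @"^(Restore Point Info|Description\s*:|Type\s*:|Creation Time\s*:).*(\r?\n|$)",
        RegexOptions.Multiline);

    // Returns the section text without its header lines.
    public static string StripHeader(string section)
    {
        return HeaderLine.Replace(section, "");
    }
}
```

In the question's loop, Console.WriteLine(SectionCleaner.StripHeader(r)) would then print each section without its header. If the log format changes, only the pattern needs updating.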
Since the number of lines you want to keep can change, one solution would be to use a token/character at the start of every line you want to remove, that you are sure won't appear on the other log lines. For example:
$Restore Point Info
$Description : Installed VMware Tools
$Type : Application Install
$Creation Time : Mon Nov 29 16:53:12 2010
Now you can do:
if (line.StartsWith("$"))
continue;
EDIT: Since you can only read the file
You could try a dirty way to do it, I think:
bool ShouldSkip(string line)
{
return (line.StartsWith("Restore Point Info") || line.StartsWith("Description") || line.StartsWith("Type") || line.StartsWith("Creation Time"));
}
usage:
//in your main method
foreach(var line in lines)
{
if(ShouldSkip(line))
continue;
}
I don't know if this is what you're looking for.
