How to get counted words in files in BODY field? - c#

The following code counts the words in all ".sgm" files in a directory.
But I need to count only the words between the BODY tags in each ".sgm" file.
How can I do that?
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml;
using System.Xml.Serialization;
namespace Project2
{
class Program
{
static void Main(string[] args)
{
string[] parcesPlaces = new string[] { "west-germany", "usa", "france", "uk", "canada", "japan" };
DirectoryInfo filePaths = new DirectoryInfo(@"D:\project_IAD");
FileInfo[] Files = filePaths.GetFiles("*.sgm");
List<TotalBody> allNeedBody = new List<TotalBody>();
foreach (FileInfo file in Files)
{
string fileContent = File.ReadAllText(file.FullName);
string fileContentCleared = ReplaceHexadecimalSymbols(fileContent);
string myRootedXml = "<root>" + fileContentCleared + "</root>";
root result = (root)XmlDeserializeFromString(myRootedXml, typeof(root));
Console.WriteLine(" Number of needed words: {0}", result.REUTERS.ToList().Count);
foreach (rootREUTERS rootREUTERS in result.REUTERS)
{
if (rootREUTERS.PLACES.Length != 1)
{
continue;
}
else if (!parcesPlaces.Contains(rootREUTERS.PLACES[0]))
{
continue;
}
else
{
if (rootREUTERS.TEXT.BODY != null)
{
allNeedBody.Add(new TotalBody(rootREUTERS.PLACES[0], rootREUTERS.TEXT.BODY));
}
else
{
continue;
}
}
}
}
Console.WriteLine("Total count words: ");
Console.WriteLine(allNeedBody.Count);
Console.ReadKey();
}
private static object XmlDeserializeFromString(string v, Type type)
{
object result = null;
using (TextReader reader = new StringReader(v))
{
result = new XmlSerializer(type).Deserialize(reader);
}
return result;
}
private static string ReplaceHexadecimalSymbols(string txt)
{
string r = "[\x00-\x08\x0B\x0C\x0E-\x1F\x26]";
return Regex.Replace(txt, r, "", RegexOptions.Compiled);
}
}
}
Example of text in file "reut2-000.sgm":
<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="5544" NEWID="1">
<DATE>26-FEB-1987 15:01:01.79</DATE>
<TOPICS><D>cocoa</D></TOPICS>
<PLACES><D>el-salvador</D><D>usa</D><D>uruguay</D></PLACES>
<PEOPLE></PEOPLE> <ORGS></ORGS> <EXCHANGES></EXCHANGES>
<COMPANIES></COMPANIES> <UNKNOWN> C T
f0704reute u f BC-BAHIA-COCOA-REVIEW 02-26
0105</UNKNOWN> <TEXT> <TITLE>BAHIA COCOA REVIEW</TITLE> <DATELINE>
SALVADOR, Feb 26 - </DATELINE><BODY>**Showers continued throughout the
week in the Bahia cocoa zone, alleviating the drought since early
January and improving prospects for the coming temporao, although
normal humidity levels have not been restored, Comissaria Smith said
in its weekly review.
The dry period means the temporao will be late this year.
Arrivals for the week ended February 22 were 155,221 bags of 60 kilos making a cumulative total for the season of 5.93 mln against
5.81 at the same stage last year. Again it seems that cocoa delivered earlier on consignment was included in the arrivals figures.
Comissaria Smith said there is still some doubt as to how much old crop cocoa is still available as harvesting has practically come to an
end. With total Bahia crop estimates around 6.4 mln bags and sales
standing at almost 6.2 mln there are a few hundred thousand bags still
in the hands of farmers, middlemen, exporters and processors.
There are doubts as to how much of this cocoa would be fit for export as shippers are now experiencing dificulties in obtaining
+Bahia superior+ certificates.
In view of the lower quality over recent weeks farmers have sold a good part of their cocoa held on consignment.
Comissaria Smith said spot bean prices rose to 340 to 350 cruzados per arroba of 15 kilos.
Bean shippers were reluctant to offer nearby shipment and only limited sales were booked for March shipment at 1,750 to 1,780 dlrs
per tonne to ports to be named.
New crop sales were also light and all to open ports with June/July going at 1,850 and 1,880 dlrs and at 35 and 45 dlrs under
New York july, Aug/Sept at 1,870, 1,875 and 1,880 dlrs per tonne FOB.
Routine sales of butter were made. March/April sold at 4,340, 4,345 and 4,350 dlrs.
April/May butter went at 2.27 times New York May, June/July at 4,400 and 4,415 dlrs, Aug/Sept at 4,351 to 4,450 dlrs and at
2.27 and 2.28 times New York Sept and Oct/Dec at 4,480 dlrs and
2.27 times New York Dec, Comissaria Smith said.
Destinations were the U.S., Covertible currency areas, Uruguay and open ports.
Cake sales were registered at 785 to 995 dlrs for March/April, 785 dlrs for May, 753 dlrs for Aug and 0.39 times New York Dec for
Oct/Dec.
Buyers were the U.S., Argentina, Uruguay and convertible currency areas.
Liquor sales were limited with March/April selling at 2,325 and 2,380 dlrs, June/July at 2,375 dlrs and at 1.25 times New York July,
Aug/Sept at 2,400 dlrs and at 1.25 times New York Sept and Oct/Dec at
1.25 times New York Dec, Comissaria Smith said.
Total Bahia sales are currently estimated at 6.13 mln bags against the 1986/87 crop and 1.06 mln bags against the 1987/88 crop.
Final figures for the period to February 28 are expected to be published by the Brazilian Cocoa Trade Commission after carnival which
ends midday on February 27.** Reuter </BODY></TEXT> </REUTERS>
<REUTERS TOPICS="NO" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET"
OLDID="5545" NEWID="2"> <DATE>26-FEB-1987 15:02:20.00</DATE>
<TOPICS></TOPICS> <PLACES><D>usa</D></PLACES> <PEOPLE></PEOPLE>
<ORGS></ORGS> <EXCHANGES></EXCHANGES> <COMPANIES></COMPANIES>
<UNKNOWN> F Y f0708reute d f
BC-STANDARD-OIL-<SRD>-TO 02-26 0082</UNKNOWN>
I need to count words only in the BODY fields (marked in bold in the example), without stray characters, etc.
The file above is an example for testing purposes.

From your question, it looks like you are building XML-formatted content and deserializing it just to count the words. That would be fine if you needed to collect the data, but if the intention is only to count the words between the BODY tags, it is much faster to parse the text and count on the fly.
My strategy is to take the substring that starts after <BODY> and ends before </BODY>, then count the words by splitting it on whitespace.
Here is the solution:
DirectoryInfo filePaths = new DirectoryInfo(@"D:\Stackoverflow\SgmCount\docs");
FileInfo[] Files = filePaths.GetFiles("*.sgm");
char[] delimiters = { ' ', '\t', '\r', '\n' };
int wordCount = 0;
foreach (FileInfo file in Files)
{
string content = File.ReadAllText(file.FullName);
int start = content.IndexOf("<BODY>", StringComparison.Ordinal);
if (start < 0) continue; // no BODY tag in this file
// Skip past the whole opening tag ("<BODY>" is 6 characters long).
content = content.Substring(start + "<BODY>".Length);
int end = content.IndexOf("</BODY>", StringComparison.Ordinal);
if (end >= 0) content = content.Substring(0, end);
// Accumulate across files instead of overwriting the previous count.
wordCount += content.Split(delimiters, StringSplitOptions.RemoveEmptyEntries).Length;
}
Console.WriteLine($"Total count words: {wordCount} words");
This gives an output:
Total count words: 488 words
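The snippet above only looks at the first BODY element in each file, while the sample file contains more than one article. Below is a minimal sketch of one way to cover every BODY element. The names BodyWordCounter and CountWords are made up for illustration, and the definition of a word (a run of letters or digits, optionally joined by an internal '.', ',', apostrophe, '/' or '-') is an assumption about what "without different characters" should mean.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BodyWordCounter
{
    // Lazily matches everything between each <BODY> and the next </BODY>, across line breaks.
    private static readonly Regex Body =
        new Regex("<BODY>(.*?)</BODY>", RegexOptions.Singleline);

    // Assumed word definition: letters/digits, optionally joined by internal punctuation,
    // so "5.93" and "June/July" each count as one word and "+Bahia" counts as "Bahia".
    private static readonly Regex Word =
        new Regex(@"[\p{L}\p{N}]+(?:[.,'/-][\p{L}\p{N}]+)*");

    // Sums the word counts of every BODY element found in the text.
    public static int CountWords(string sgml)
    {
        int total = 0;
        foreach (Match body in Body.Matches(sgml))
        {
            total += Word.Matches(body.Groups[1].Value).Count;
        }
        return total;
    }

    public static void Main()
    {
        string sample = "<TEXT><BODY>Showers continued, 5.93 mln.</BODY></TEXT>"
                      + "<TEXT><BODY>+Bahia superior+ certificates</BODY></TEXT>";
        Console.WriteLine(CountWords(sample));
    }
}
```

To use it on real files, pass the result of File.ReadAllText for each ".sgm" file to CountWords and add up the results.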

Related

Parsing JSON response using text summarization API, encoding error in response

I use the service at https://www.meaningcloud.com/products/automatic-summarization
for text summarization. I am using .NET 5.
For example, I want to shorten this news article: https://e.vnexpress.net/news/business/economy/vn-index-rises-for-third-straight-session-4141865.html
string input = "..."; // long content of the news post.
var client = new RestClient("https://api.meaningcloud.com/summarization-1.0");
client.Timeout = -1;
var request = new RestRequest(Method.POST);
request.AddParameter("key", "25870359b682ec3c93f9becd850eb459"); // fake token because this content is public.
request.AddParameter("sentences", 4);
request.AddParameter("txt", JsonEncodedText.Encode(input));
IRestResponse response = client.Execute(request);
System.Threading.Thread.Sleep(3000);
var res = JObject.Parse(response.Content);
// Need to convert \r\n and \r\n\r\n to a space.
string short_content = res["summary"].ToString();
// SysUtil.StringEncodingConvert(short_content, "ISO-8859-1", "UTF-8");
string result = short_content.Replace(" [...] ", " ");
Input
The benchmark VN-Index saw steady growth throughout the day, gradually gaining a total of 10.23 points by the end of the session. The Ho Chi Minh Stock Exchange (HoSE), on which the index is based, saw 300 stocks gain and 78 lose. Total trading volume improved 48 percent over the previous session, reaching VND6.2 trillion ($269 million). The VN30-Index, a basket of HoSE’s 30 largest capped stocks, rose 1.63 percent, with 27 gaining and 2 losing. Its top gainers were SAB of Vietnam’s largest brewer Sabeco, up 4.8 percent, followed by VJC of budget airline Vietjet, up 2.8 percent, and MWG of electronics retailer Mobile World, up 2.2 percent. Of Vietnam’s biggest state-owned lenders by assets, BID of BIDV climbed 0.85 percent, VCB of Vietcombank 0.8 percent, and CTG of VietinBank 0.6 percent. HDB of HDBank and TCB of Techcombank led gains of private banks at 0.85 percent and 0.6 percent respectively. Other gainers included PNJ of Phu Nhuan Jewelry with 1.4 percent, HPG of steel producer Hoa Phat, 1.1 percent, and MSN of conglomerate Masan, 1 percent. The only two VN30 tickers that ended in the red were VIC of conglomerate Vingroup, down 1 percent, and PLX of fuel distributor Petrolimex, down 0.05 percent. The HNX-Index for stocks on the Hanoi Stock Exchange, home to mid and small caps, rose 1.35 percent, and the UPCoM-Index for stocks on the Unlisted Public Companies Market added 0.3 percent. Foreign investors turned net buyers to the tune of VND15.7 billion ($681,600), with buying pressure focused mainly on HPG and VHM of real estate giant Vinhomes.
output after text summarization (4 sentences)
The benchmark VN-Index saw steady growth throughout the day, gradually gaining a total of 10.23 points by the end of the session. The VN30-Index, a basket of HoSE\u2019s 30 largest capped stocks, rose 63 percent, with 27 gaining and 2 losing. Of Vietnam\u2019s biggest state-owned lenders by assets, BID of BIDV climbed 0.85 percent, VCB of Vietcombank 0.8 percent, and CTG of VietinBank 0.6 percent. The HNX-Index for stocks on the Hanoi Stock Exchange, home to mid and small caps, rose 1.35 percent, and the UPCoM-Index for stocks on the Unlisted Public Companies Market added 0.3 percent.
I also tried using a utility class:
using System;
namespace myproj.Controllers
{
public class SysUtil
{
public static String StringEncodingConvert(String strText, String strSrcEncoding, String strDestEncoding)
{
System.Text.Encoding srcEnc = System.Text.Encoding.GetEncoding(strSrcEncoding);
System.Text.Encoding destEnc = System.Text.Encoding.GetEncoding(strDestEncoding);
byte[] bData = srcEnc.GetBytes(strText);
byte[] bResult = System.Text.Encoding.Convert(srcEnc, destEnc, bData);
return destEnc.GetString(bResult);
}
}
}
but without success.
Even a plain replace does not work:
string result2 = result.Replace("\u2019s", "'s");
The problem I ran into:
\u2019s --> I need 's. How can I achieve this?
\u2019 is the Unicode character for the right single quotation mark (a "smart quote"). Just replace it:
result2 = result.Replace('\u2019', '\'');
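One thing worth checking, offered as an assumption rather than a confirmed diagnosis: by default JsonEncodedText.Encode escapes non-ASCII characters, so the text sent to the service may already contain the six literal characters backslash, u, 2, 0, 1, 9 instead of the quote character itself. In that case a replace that targets the real U+2019 character finds nothing. The sketch below handles both forms. QuoteFixer and NormalizeApostrophes are hypothetical names.

```csharp
using System;

public static class QuoteFixer
{
    // Replaces the right single quotation mark with an ASCII apostrophe,
    // whether it arrives as the real character or as the literal text \u2019.
    public static string NormalizeApostrophes(string text)
    {
        return text
            .Replace(@"\u2019", "'")   // verbatim string: six literal characters, no escape processing
            .Replace('\u2019', '\'');  // the actual U+2019 character
    }

    public static void Main()
    {
        string escaped = @"HoSE\u2019s 30 largest capped stocks"; // literal backslash sequence
        string real = "HoSE\u2019s 30 largest capped stocks";     // real smart quote
        Console.WriteLine(NormalizeApostrophes(escaped));
        Console.WriteLine(NormalizeApostrophes(real));
    }
}
```

If the escaping does come from the request side, sending the plain input string instead of the encoded one may avoid the problem at the source.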

C# Regex to match single number among multiple numbers in a string

What regex for C# can I use that matches a "string + some number + string + some number + string"?
Sample Inputs:
Book a hotel room for 10 people -- o/p: 10
Book a hotel room for 15 people at 10AM -- o/p: 15
Book a hotel room for 5 employees for 12 dec at 10 am -- o/p: 5
Book a hotel room in Singapore for 10 people at today -- o/p: 10
Book a hotel room for 12 dec for 10 members -- o/p: 10
So I have to fetch how many members/people/employees the hotel booking is for.
I hope this makes sense.
A regular expression that I could plug into C# would be fantastic.
I tried the pattern below, but it does not match.
[A-Za-z]*\d+\s?(people)|(memebers)|(peoples)|(member)*$
If your number always precedes the keyword, you might not need a regex.
Try the code below.
var parts = line.Split(' ', StringSplitOptions.RemoveEmptyEntries);
// Array.FindIndex returns the position of the keyword (Array.Find would return the word itself).
var index = Array.FindIndex(parts, p => p == "member" || p == "members" || p == "people" || p == "employees");
int count = -1;
var found = index > 0 && int.TryParse(parts[index - 1], out count);
If found is true, count holds a valid value that you can use later on.
Try the following:
string[] inputs = {
"Book a hotel room for 10 people -- o/p: 10",
"Book a hotel room for 15 people at 10AM -- o/p: 15",
"Book a hotel room for 5 employees for 12 dec at 10 am -- o/p: 5",
"Book a hotel room in Singapore for 10 people at today -- o/p: 10",
"Book a hotel room for 12 dec for 10 members -- o/p: 10"
};
string pattern = @"for\s+(?'count'\d+)\s+(?'type'[^\s]+)";
foreach(string input in inputs)
{
MatchCollection matches = Regex.Matches(input, pattern);
foreach (Match match in matches.Cast<Match>().AsEnumerable())
{
Console.WriteLine("Count : '{0}', Type : '{1}'", match.Groups["count"].Value, match.Groups["type"].Value);
}
}
Console.ReadLine();
If you want just the number, without capturing much else, maybe you are looking for something like this:
(?<=for)(?: +)(?<number>\d+)(?= +(?:people|employee|member)s?)
Using the asterisk * after the group in (member)* repeats the group 0 or more times, so you could omit it.
The $ after (member)* means that alternative only matches at the end of the string.
You could use an alternation to match either people, member with an optional s, or employee with an optional s.
If you want to capture the digits as well for further processing, you could also use a capturing group for that part.
\b[A-Za-z]*(\d+)\s?(people|members?|employees?)\b
For example
string pattern = @"\b[A-Za-z]*(\d+)\s?(people|members?|employees?)\b";
string input = @"Book a hotel room for 10 people -- o/p: 10
Book a hotel room for 15 people at 10AM -- o/p: 15
Book a hotel room for 5 employees for 12 dec at 10 am -- o/p: 5
Book a hotel room in Singapore for 10 people at today -- o/p: 10
Book a hotel room for 12 dec for 10 member -- o/p: 10 ";
foreach (Match m in Regex.Matches(input, pattern))
{
Console.WriteLine("Match: {0}\nGroup 1: {1}\nGroup 2: {2}", m.Value, m.Groups[1].Value, m.Groups[2].Value);
}
If all the matches are preceded by "for", you might also use
\bfor (\d+)\s?(people|members?|employees?)\b
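As a rough sketch of how that last pattern could be wrapped for reuse, the helper below returns the head count as an integer. BookingParser and ExtractHeadCount are hypothetical names, and returning -1 when nothing matches is an assumed convention, not something the question specifies.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BookingParser
{
    // "for", a number, then one of the keywords; group 1 holds the digits.
    private static readonly Regex HeadCount =
        new Regex(@"\bfor (\d+)\s?(people|members?|employees?)\b");

    // Returns the number of guests, or -1 when no matching phrase is found.
    public static int ExtractHeadCount(string line)
    {
        Match m = HeadCount.Match(line);
        int count;
        return m.Success && int.TryParse(m.Groups[1].Value, out count) ? count : -1;
    }

    public static void Main()
    {
        string[] inputs =
        {
            "Book a hotel room for 15 people at 10AM",
            "Book a hotel room for 5 employees for 12 dec at 10 am",
            "Book a hotel room for 12 dec for 10 members"
        };
        foreach (string input in inputs)
        {
            Console.WriteLine("{0} -> {1}", input, ExtractHeadCount(input));
        }
    }
}
```

Because the keyword must follow the number directly, "for 12 dec" and "at 10AM" are skipped without any extra checks.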

How to remove pieces of data from string

I have a text file with multiple entries of this format:
Page: 1 of 1
Report Date: January 15 2018
Mr. Gerald M. Abridge ID #: 0000008 1 Route 81 Mr. Gerald Michael Abridge Pittaburgh PA 15668 SSN: XXX-XX-XXXX
Birthdate: 01/00/1998 Sex: M
COURSE Course Title CRD GRD GRDPT COURSE Course Title CRD GRD GRDPT
FALL 2017 (08/28/2017 to 12/14/2017) CS102F FUND. OF IT & COMPUTING 4.00 A 16.00 CS110 C++ PROGRAMMING I 3.00 A- 11.10 EL102 LANGUAGE AND RHETORIC 3.00 B+ 9.90 MA109 CALC WITH APPLICATIONS I 4.00 A 16.00 SP203 INTERMEDIATE SPANISH I 3.00 A 12.00
EHRS QHRS QPTS GPA Term 17.00 17.00 65.00 3.824 Cum 17.00 17.00 65.00 3.824
Current Program(s): Bachelor of Science in Computer Science
End of official record.
So far, I have read the text file into a string, full. I want to be able to remove the first two lines of each entry. How would I go about doing this?
Here's the code that I used to read it in:
using (StreamReader sr = new StreamReader(fileName, Encoding.Default))
{
string full = sr.ReadToEnd();
}
If all the lines you want to skip begin with the same strings, you can put those prefixes in a list and then, when you're reading the lines, skip any that begin with one of the prefixes.
This will leave you with a list of strings that represent all the file lines that don't begin with one of the specified prefixes:
var filePath = @"f:\public\temp\temp.txt";
var ignorePrefixes = new List<string> {"Page:", "Report Date:"};
var filteredContent = File.ReadAllLines(filePath)
.Where(line => ignorePrefixes.All(prefix => !line.StartsWith(prefix)))
.ToList();
If you want all the content as a single string, you can use String.Join:
var filteredAsString = string.Join(Environment.NewLine, filteredContent);
If Linq isn't your thing, or you don't understand what it's doing, here's the "old school" way of doing the same thing:
List<string> filtered = new List<string>();
foreach (string line in File.ReadLines(filePath))
{
bool okToAdd = true;
foreach (string prefix in ignorePrefixes)
{
if (line.StartsWith(prefix))
{
okToAdd = false;
break;
}
}
if (okToAdd)
{
filtered.Add(line);
}
}
Another option is an iterator method that skips those lines while reading:
public static IEnumerable<string> ReadReportFile(FileInfo file)
{
string line;
var page = "Page:";
var date = "Report Date:";
using (var reader = File.OpenText(file.FullName))
{
while ((line = reader.ReadLine()) != null)
{
// Yield only the lines that contain neither marker.
if (line.IndexOf(page, StringComparison.Ordinal) == -1 && line.IndexOf(date, StringComparison.Ordinal) == -1)
yield return line;
}
}
}
The code is pretty straightforward: every line that is not null and contains neither "Page:" nor "Report Date:" is returned. You could condense it or get fancier, for example by building a lookup of prefixes, but if the requirements are simple, this should suffice.

Splitting article by sentences using delimiters

I have a small assignment where I have an article in a format that is like this
<REUTERS TOPICS="NO" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="5545" NEWID="2">
<TITLE>STANDARD OIL <SRD> TO FORM FINANCIAL UNIT</TITLE>
<DATELINE> CLEVELAND, Feb 26 - </DATELINE><BODY>Standard Oil Co and BP North America
Inc said they plan to form a venture to manage the money market
borrowing and investment activities of both companies.
BP North America is a subsidiary of British Petroleum Co
Plc <BP>, which also owns a 55 pct interest in Standard Oil.
The venture will be called BP/Standard Financial Trading
and will be operated by Standard Oil under the oversight of a
joint management committee.
Reuter
</BODY></TEXT>
</REUTERS>
and I am writing it to a new xml file with this format
<article id= some id >
<subject>articles subject </subject>
<sentence> sentence #1 </sentence>
.
.
.
<sentence> sentence #n </sentence>
</article>
I have written a code that does all of this and works fine.
The problem is that I am splitting sentences using the delimiter ".", but if there is a number like 2.00, the code thinks that 2 is one sentence and 00 is another.
Does anyone have any idea on how to identify sentences better so it will keep the numbers and such in same sentence?
Without having to go over all of the array?
Is there a way I can have the string.Split() method ignore the split if there is a number before and after the delimiter?
My code:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.IO;
using System.Data;
using System.Xml;
namespace project
{
class Program
{
static void Main(string[] args)
{
string[] lines = System.IO.File.ReadAllLines(@"path");
string body = "";
REUTERS article = new REUTERS();
string sentences = "";
for (int i = 0; i<lines.Length;i++){
string line = lines[i];
// finding the first tag of the article
if (line.Contains("<REUTERS"))
{
//extracting the id from the tag
int Id = line.IndexOf("NEWID=\"") + "NEWID=\"".Length;
article.NEWID = line.Substring(Id, line.Length-2 - Id);
}
if (line.Contains("TITLE"))
{
string subject = line;
subject = subject.Replace("<TITLE>", "").Replace("</TITLE>", "");
article.TITLE = subject;
}
if( line.Contains("<BODY"))
{
int startLoc = line.IndexOf("<BODY>") + "<BODY>".Length;
sentences = line.Substring(startLoc, line.Length - startLoc);
while (!line.Contains("</BODY>"))
{
i++;
line = lines[i];
sentences = sentences +" " + line;
}
int endLoc = sentences.IndexOf("</BODY>");
sentences = sentences.Substring(0, endLoc);
char[] delim = {'.'};
string[] sentencesSplit = sentences.Split(delim);
using (System.IO.StreamWriter file =
new System.IO.StreamWriter(@"path",true))
{
file.WriteLine("<articles>");
file.WriteLine("\t <article id = " + article.NEWID + ">");
file.WriteLine("\t \t <subject>" + article.TITLE + "</subject>");
foreach (string sentence in sentencesSplit)
{
file.WriteLine("\t \t <sentence>" + sentence + "</sentence>");
}
file.WriteLine("\t </article>");
file.WriteLine("</articles>");
}
}
}
}
public class REUTERS
{
public string NEWID;
public string TITLE;
public string Body;
}
}
}
OK, so I found a solution using the ideas I received here.
I used the Split overload that takes a string array, like this:
.Split(new string[] { ". " }, StringSplitOptions.None);
and it looks much better now.
You can also use a regular expression that looks for the sentence terminators with white space:
var pattern = @"(?<=[\.!\?])\s+";
var sentences = Regex.Split(input, pattern);
foreach (var sentence in sentences) {
//do something with the sentence
var node = string.Format("\t \t <sentence>{0}</sentence>", sentence);
file.WriteLine(node);
}
Note that this applies to the English language as there may be other rules for sentences in other languages.
The following example:
using System;
using System.Text.RegularExpressions;
public class Program
{
public static void Main()
{
var input = @"Standard Oil Co and BP North America
Inc said they plan to form a venture to manage the money market
borrowing and investment activities of both companies.
BP North America is a subsidiary of British Petroleum Co
Plc <BP>, which also owns a 55 pct interest in Standard Oil.
The venture will be called BP/Standard Financial Trading
and will be operated by Standard Oil under the oversight of a
joint management committee.";
var pattern = @"(?<=[\.!\?])\s+";
var sentences = Regex.Split(input, pattern);
foreach (var sentence in sentences)
{
var innerText = sentence.Replace("\n", " ").Replace('\t', ' ');
//do something with the sentence
var node = string.Format("\t \t <sentence>{0}</sentence>", innerText);
Console.WriteLine(node);
}
}
}
Produces this output
<sentence>Standard Oil Co and BP North America Inc said they plan to form a venture to manage the money market borrowing and investment activities of both companies.</sentence>
<sentence>BP North America is a subsidiary of British Petroleum Co Plc <BP>, which also owns a 55 pct interest in Standard Oil.</sentence>
<sentence>The venture will be called BP/Standard Financial Trading and will be operated by Standard Oil under the oversight of a joint management committee.</sentence>
I would make a list of the index points of all the '.' characters.
For each index point, check each side for digits; if there are digits on both sides, remove the index point from the list.
Then, when you are outputting, simply use Substring with the remaining index points to get each sentence individually.
Bad quality code follows (it's late). It assumes sentence is the body text and IndexPoints is a List<int> holding the positions of the '.' characters:
List<int> indexesToRemove = new List<int>();
int count = 0;
foreach (int indexPoint in IndexPoints)
{
// A '.' with a digit on both sides is a decimal point, not a sentence end.
bool digitBefore = indexPoint > 0 && sentence[indexPoint - 1] >= '0' && sentence[indexPoint - 1] <= '9';
bool digitAfter = indexPoint < sentence.Length - 1 && sentence[indexPoint + 1] >= '0' && sentence[indexPoint + 1] <= '9';
if (digitBefore && digitAfter)
indexesToRemove.Add(count);
count++;
}
The next line sorts the positions in descending order so that we do not have to adjust them as we remove entries in the last step.
indexesToRemove = indexesToRemove.OrderByDescending(i => i).ToList();
Now we simply remove all the locations of the '.'s that have digits on both sides.
foreach (int indexPoint in indexesToRemove)
{
IndexPoints.RemoveAt(indexPoint);
}
Now, when you write the sentences out in the new file format, you just loop over sentence.Substring(lastIndexPoint + 1, currentIndexPoint - lastIndexPoint), because Substring takes a start index and a length.
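The same digit check can also be written as a regular expression, which avoids keeping index lists by hand. This is only a sketch under stated assumptions: SentenceSplitter is a hypothetical name, only '.' is treated as a terminator, and abbreviations such as "U.S." would still be split.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class SentenceSplitter
{
    // A '.' is a split point unless it has a digit on both sides (as in "2.00"):
    // either it is not preceded by a digit, or it is not followed by one.
    private static readonly Regex SentenceEnd = new Regex(@"(?<!\d)\.|\.(?!\d)");

    // Splits the text, trims each piece and drops the empty ones.
    public static string[] Split(string text)
    {
        return SentenceEnd.Split(text)
            .Select(s => s.Trim())
            .Where(s => s.Length > 0)
            .ToArray();
    }

    public static void Main()
    {
        foreach (string s in Split("Shares rose 2.00 pct. Volume was light."))
        {
            Console.WriteLine(s);
        }
    }
}
```

Note that the terminating '.' is consumed by the split, so it would have to be appended again if the output needs it.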
Spent much time on this - thought you might like to see it as it really doesn't use any awkward code whatsoever - it is producing output 99% similar to yours.
<articles>
<article id="2">
<subject>STANDARD OIL <SRD> TO FORM FINANCIAL UNIT</subject>
<sentence>Standard Oil Co and BP North America</sentence>
<sentence>Inc said they plan to form a venture to manage the money market</sentence>
<sentence>borrowing and investment activities of both companies.</sentence>
<sentence>BP North America is a subsidiary of British Petroleum Co</sentence>
<sentence>Plc <BP>, which also owns a 55.0 pct interest in Standard Oil.</sentence>
<sentence>The venture will be called BP/Standard Financial Trading</sentence>
<sentence>and will be operated by Standard Oil under the oversight of a</sentence>
<sentence>joint management committee.</sentence>
</article>
</articles>
The console app is as follows:
using System.Xml;
using System.IO;
namespace ReutersXML
{
class Program
{
static void Main()
{
XmlDocument xmlDoc = new XmlDocument();
xmlDoc.Load("reuters.xml");
var reuters = xmlDoc.GetElementsByTagName("REUTERS");
var article = reuters[0].Attributes.GetNamedItem("NEWID").Value;
var subject = xmlDoc.GetElementsByTagName("TITLE")[0].InnerText;
var body = xmlDoc.GetElementsByTagName("BODY")[0].InnerText;
string[] sentences = body.Split(new string[] { System.Environment.NewLine },
System.StringSplitOptions.RemoveEmptyEntries);
using (FileStream fileStream = new FileStream("reuters_new.xml", FileMode.Create))
using (StreamWriter sw = new StreamWriter(fileStream))
using (XmlTextWriter xmlWriter = new XmlTextWriter(sw))
{
xmlWriter.Formatting = Formatting.Indented;
xmlWriter.Indentation = 4;
xmlWriter.WriteStartElement("articles");
xmlWriter.WriteStartElement("article");
xmlWriter.WriteAttributeString("id", article);
xmlWriter.WriteElementString("subject", subject);
foreach (var s in sentences)
if (s.Length > 10)
xmlWriter.WriteElementString("sentence", s);
xmlWriter.WriteEndElement();
}
}
}
}
I hope you like it :)

C# Programming How to grep columns/lines from Text File?

I have a C# console program whose main function should let a user grep lines / columns from a log text file.
For example, the user may wish to grep all the related lines between two dates, e.g. from "Tue Aug 03 2004 22:58:34" to "Wed Aug 04 2004 00:56:48". After processing, the program would then output all the data found in the log text file between the two dates.
Could someone please advise on some code that I could use to grep, or create a filter, to retrieve the necessary text/data from the file? Thanks!
C# program file:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Diagnostics;
using System.IO;
namespace Testing
{
class Analysis
{
static void Main()
{
// Read the file lines into a string array.
string[] lines = System.IO.File.ReadAllLines(@"C:\Test\ntfs.txt");
System.Console.WriteLine("Analyzing ntfs.txt:");
foreach (string line in lines)
{
Console.WriteLine("\t" + line);
// ***Trying to filter/grep out dates, file size, etc****
if (lines = "Sun Nov 19 2000")
{
Console.WriteLine("Print entire line");
}
}
// Keep the console window open in debug mode.
Console.WriteLine("Press any key to exit.");
System.Console.ReadKey();
}
}
}
Log Text File Example:
Wed Jul 21 2004 16:58:48 499712 m... r/rrwxrwxrwx 0 0 8360-128-3
C:/Program Files/AccessData/Common Files/AccessData LicenseManager/LicenseManager.exe
Tue Aug 03 2004 22:58:34 23040 m... r/rrwxrwxrwx 0 0 8522-128-3
C:/System Volume Information/_restore{88D7369F-4F7E-44D4-8CD1-
F7FF1F6AC067}/RP4/A0002101.sys
23040 m... r/rrwxrwxrwx 0 0 9132-128-3
C:/WINDOWS/system32/ReinstallBackups/0003/DriverFiles/i386/mouclass.sys
23040 m... r/rrwxrwxrwx 0 0 9135-128-4 C:/System Volume
Information/_restore{88D7369F-4F7E-44D4-8CD1-F7FF1F6AC067}/RP4/A0003123.sys
23040 m... r/rrwxrwxrwx 0 0 9136-128-3
C:/WINDOWS/system32/drivers/mouclass.sys
Tue Aug 03 2004 23:01:16 196864 m... r/rrwxrwxrwx 0 0 4706-128-3
C:/WINDOWS/system32/drivers/rdpdr.sys
Tue Aug 03 2004 23:08:18 24960 m... r/rrwxrwxrwx 0 0 8690-128-3
C:/WINDOWS/system32/drivers/hidparse.sys
You could do this using Regex to select matching lines in a richer way than string.Contains allows.
Not sure why you are reinventing findstr.exe though.
For large files you might find that File.ReadLines (.NET 4 and later) performs better: it reads the same lines but lets you process them in a foreach and other IEnumerable scenarios without loading the entire file into RAM at once.
Well, as a quick fix for the specific example:
if (line.StartsWith("Sun Nov 19 2000"))
{
Console.WriteLine(line);
}
You could use Contains to find a substring within the line.
Note that loading the whole file in an array won't scale well for very large logs. We can look into fixing that if it's an issue for you - but let's take things slowly :)
Here's a grep style method I use in testing:
public static List<string> FileGrep(string filePath, string searchText)
{
var matches = new List<string>();
// The using block disposes the reader (and the underlying file) when done.
using (var s = new StreamReader(filePath))
{
string line;
while ((line = s.ReadLine()) != null)
{
if (line.Contains(searchText)) matches.Add(line);
}
}
return matches;
}
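None of the snippets above handles the date range the question asks about, so here is a minimal sketch of one way to do it. It assumes that an entry starts with a 24-character timestamp such as "Tue Aug 03 2004 22:58:34" at the very beginning of the line, and that lines without a timestamp belong to the most recent one. LogFilter and Between are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class LogFilter
{
    private const string StampFormat = "ddd MMM dd yyyy HH:mm:ss";
    private const int StampLength = 24; // length of e.g. "Tue Aug 03 2004 22:58:34"

    // Yields every line whose governing timestamp lies within [from, to].
    // Lines without a leading timestamp inherit the last timestamp seen.
    public static IEnumerable<string> Between(IEnumerable<string> lines, DateTime from, DateTime to)
    {
        DateTime? current = null;
        foreach (string line in lines)
        {
            DateTime stamp;
            if (line.Length >= StampLength &&
                DateTime.TryParseExact(line.Substring(0, StampLength), StampFormat,
                    CultureInfo.InvariantCulture, DateTimeStyles.None, out stamp))
            {
                current = stamp;
            }
            if (current.HasValue && current.Value >= from && current.Value <= to)
            {
                yield return line;
            }
        }
    }

    public static void Main()
    {
        var lines = new[]
        {
            "Wed Jul 21 2004 16:58:48 499712 m... r/rrwxrwxrwx 0 0 8360-128-3",
            "C:/Program Files/AccessData/Common Files/AccessData LicenseManager/LicenseManager.exe",
            "Tue Aug 03 2004 23:01:16 196864 m... r/rrwxrwxrwx 0 0 4706-128-3",
            "C:/WINDOWS/system32/drivers/rdpdr.sys"
        };
        var from = new DateTime(2004, 8, 3, 22, 58, 34);
        var to = new DateTime(2004, 8, 4, 0, 56, 48);
        foreach (string line in Between(lines, from, to))
        {
            Console.WriteLine(line);
        }
    }
}
```

For a real log, File.ReadLines can be passed as the first argument so the file is streamed rather than loaded into memory at once.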
