C# Regex to match single number among multiple numbers in a string

What C# regex can I use to match input of the form "string + some number + string + some number + string"?
Sample Inputs:
Book a hotel room for 10 people -- o/p: 10
Book a hotel room for 15 people at 10AM -- o/p: 15
Book a hotel room for 5 employees for 12 dec at 10 am -- o/p: 5
Book a hotel room in Singapore for 10 people at today -- o/p: 10
Book a hotel room for 12 dec for 10 members -- o/p: 10
So I have to fetch how many members/people/employees the hotel booking is for.
I hope this makes sense.
A regular expression that I could plug into C# would be fantastic.
I tried the pattern below, but it does not match.
[A-Za-z]*\d+\s?(people)|(memebers)|(peoples)|(member)*$

If your number always directly precedes the keyword, you might not need a regex.
Try the code below.
var parts = line.Split(' ', StringSplitOptions.RemoveEmptyEntries);
// Array.FindIndex returns the position of the keyword (or -1); Array.Find would return the string itself.
var index = Array.FindIndex(parts, p => p == "member" || p == "members" || p == "people" || p == "employees");
int count = -1;
var found = index > 0 && int.TryParse(parts[index - 1], out count);
If found is true, count holds a valid value that you can use later on.

Try the following:
string[] inputs = {
"Book a hotel room for 10 people -- o/p: 10",
"Book a hotel room for 15 people at 10AM -- o/p: 15",
"Book a hotel room for 5 employees for 12 dec at 10 am -- o/p: 5",
"Book a hotel room in Singapore for 10 people at today -- o/p: 10",
"Book a hotel room for 12 dec for 10 members -- o/p: 10"
};
string pattern = @"for\s+(?'count'\d+)\s+(?'type'[^\s]+)";
foreach(string input in inputs)
{
MatchCollection matches = Regex.Matches(input, pattern);
foreach (Match match in matches.Cast<Match>().AsEnumerable())
{
Console.WriteLine("Count : '{0}', Type : '{1}'", match.Groups["count"].Value, match.Groups["type"].Value);
}
}
Console.ReadLine();

If you want just the number, without capturing much else, you may be looking for something like this:
(?<=for)(?: +)(?<number>\d+)(?= +(?:people|employee|member)s?)
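As a minimal sketch of how the lookaround pattern above could be used from C#: the class and method names here (LookaroundDemo, FindNumber) are hypothetical and only for illustration. The overall match still includes the spaces before the digits, so the code reads the named group rather than the whole match.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaroundDemo
{
    // Pattern copied from the answer above. The lookbehind and lookahead are
    // zero-width, and the named group "number" holds only the digits.
    private static readonly Regex Pattern = new Regex(
        @"(?<=for)(?: +)(?<number>\d+)(?= +(?:people|employee|member)s?)");

    // Returns the digits as a string, or null when no head count is found.
    public static string FindNumber(string sentence)
    {
        Match m = Pattern.Match(sentence);
        return m.Success ? m.Groups["number"].Value : null;
    }
}
```

For "Book a hotel room for 12 dec for 10 members" the first candidate (12) is rejected because "dec" is not one of the keywords, so the group ends up holding 10.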

Placing the asterisk * after the group, as in (member)*, repeats the group 0 or more times, so you could omit it.
The $ after (member)* means that alternative only matches at the end of the string.
Note also that the | characters outside the parentheses split your whole pattern into separate alternatives, so the \d+ part belongs only to the first one (people).
You could use a single alternation group to match either people, member with an optional s, or employee with an optional s.
If you want to capture the digits as well for further processing, you could also use a capturing group for that part.
\b[A-Za-z]*(\d+)\s?(people|members?|employees?)\b
For example
string pattern = @"\b[A-Za-z]*(\d+)\s?(people|members?|employees?)\b";
string input = @"Book a hotel room for 10 people -- o/p: 10
Book a hotel room for 15 people at 10AM -- o/p: 15
Book a hotel room for 5 employees for 12 dec at 10 am -- o/p: 5
Book a hotel room in Singapore for 10 people at today -- o/p: 10
Book a hotel room for 12 dec for 10 member -- o/p: 10 ";
foreach (Match m in Regex.Matches(input, pattern))
{
Console.WriteLine("Match: {0}\nGroup 1: {1}\nGroup: {2}", m.Value, m.Groups[1].Value, m.Groups[2].Value);
}
If all the matches are preceded by for you might also use
\bfor (\d+)\s?(people|members?|employees?)\b
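A small helper built on the "for"-anchored pattern might look like the sketch below. The names HeadCount and Extract are made up for this example, and RegexOptions.IgnoreCase is an added assumption (the original pattern is case-sensitive).

```csharp
using System;
using System.Text.RegularExpressions;

public static class HeadCount
{
    // Pattern from the answer above; IgnoreCase is an assumption added here.
    private static readonly Regex Pattern = new Regex(
        @"\bfor (\d+)\s?(people|members?|employees?)\b",
        RegexOptions.IgnoreCase);

    // Returns the number of guests, or null when the sentence has none.
    public static int? Extract(string sentence)
    {
        Match m = Pattern.Match(sentence);
        int count;
        if (m.Success && int.TryParse(m.Groups[1].Value, out count))
        {
            return count;
        }
        return null;
    }
}
```

Returning int? keeps "no match" distinct from a real count of 0, which a sentinel like -1 would also do but less explicitly.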

Related

How to get counted words in files in BODY field?

The following code counts words from all ".sgm" files in a directory.
But I need to count only the words between the BODY tags in all ".sgm" files.
How can I do that?
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml;
using System.Xml.Serialization;
namespace Project2
{
class Program
{
static void Main(string[] args)
{
string[] parcesPlaces = new string[] { "west-germany", "usa", "france", "uk", "canada", "japan" };
DirectoryInfo filePaths = new DirectoryInfo(@"D:\project_IAD");
FileInfo[] Files = filePaths.GetFiles("*.sgm");
List<TotalBody> allNeedBody = new List<TotalBody>();
foreach (FileInfo file in Files)
{
string fileContent = File.ReadAllText(file.FullName);
string fileContentCleared = ReplaceHexadecimalSymbols(fileContent);
string myRootedXml = "<root>" + fileContentCleared + "</root>";
root result = (root)XmlDeserializeFromString(myRootedXml, typeof(root));
Console.WriteLine(" Number of needed words: {0}", result.REUTERS.ToList().Count);
foreach (rootREUTERS rootREUTERS in result.REUTERS)
{
if (rootREUTERS.PLACES.Length != 1)
{
continue;
}
else if (!parcesPlaces.Contains(rootREUTERS.PLACES[0]))
{
continue;
}
else
{
if (rootREUTERS.TEXT.BODY != null)
{
allNeedBody.Add(new TotalBody(rootREUTERS.PLACES[0], rootREUTERS.TEXT.BODY));
}
else
{
continue;
}
}
}
}
Console.WriteLine("Total count words: ");
Console.WriteLine(allNeedBody.Count);
Console.ReadKey();
}
private static object XmlDeserializeFromString(string v, Type type)
{
object result = null;
using (TextReader reader = new StringReader(v))
{
result = new XmlSerializer(type).Deserialize(reader);
}
return result;
}
private static string ReplaceHexadecimalSymbols(string txt)
{
string r = "[\x00-\x08\x0B\x0C\x0E-\x1F\x26]";
return Regex.Replace(txt, r, "", RegexOptions.Compiled);
}
}
}
Example of text in file "reut2-000.sgm":
<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="5544" NEWID="1">
<DATE>26-FEB-1987 15:01:01.79</DATE>
<TOPICS><D>cocoa</D></TOPICS>
<PLACES><D>el-salvador</D><D>usa</D><D>uruguay</D></PLACES>
<PEOPLE></PEOPLE> <ORGS></ORGS> <EXCHANGES></EXCHANGES>
<COMPANIES></COMPANIES> <UNKNOWN> C T
f0704reute u f BC-BAHIA-COCOA-REVIEW 02-26
0105</UNKNOWN> <TEXT> <TITLE>BAHIA COCOA REVIEW</TITLE> <DATELINE>
SALVADOR, Feb 26 - </DATELINE><BODY>**Showers continued throughout the
week in the Bahia cocoa zone, alleviating the drought since early
January and improving prospects for the coming temporao, although
normal humidity levels have not been restored, Comissaria Smith said
in its weekly review.
The dry period means the temporao will be late this year.
Arrivals for the week ended February 22 were 155,221 bags of 60 kilos making a cumulative total for the season of 5.93 mln against
5.81 at the same stage last year. Again it seems that cocoa delivered earlier on consignment was included in the arrivals figures.
Comissaria Smith said there is still some doubt as to how much old crop cocoa is still available as harvesting has practically come to an
end. With total Bahia crop estimates around 6.4 mln bags and sales
standing at almost 6.2 mln there are a few hundred thousand bags still
in the hands of farmers, middlemen, exporters and processors.
There are doubts as to how much of this cocoa would be fit for export as shippers are now experiencing dificulties in obtaining
+Bahia superior+ certificates.
In view of the lower quality over recent weeks farmers have sold a good part of their cocoa held on consignment.
Comissaria Smith said spot bean prices rose to 340 to 350 cruzados per arroba of 15 kilos.
Bean shippers were reluctant to offer nearby shipment and only limited sales were booked for March shipment at 1,750 to 1,780 dlrs
per tonne to ports to be named.
New crop sales were also light and all to open ports with June/July going at 1,850 and 1,880 dlrs and at 35 and 45 dlrs under
New York july, Aug/Sept at 1,870, 1,875 and 1,880 dlrs per tonne FOB.
Routine sales of butter were made. March/April sold at 4,340, 4,345 and 4,350 dlrs.
April/May butter went at 2.27 times New York May, June/July at 4,400 and 4,415 dlrs, Aug/Sept at 4,351 to 4,450 dlrs and at
2.27 and 2.28 times New York Sept and Oct/Dec at 4,480 dlrs and
2.27 times New York Dec, Comissaria Smith said.
Destinations were the U.S., Covertible currency areas, Uruguay and open ports.
Cake sales were registered at 785 to 995 dlrs for March/April, 785 dlrs for May, 753 dlrs for Aug and 0.39 times New York Dec for
Oct/Dec.
Buyers were the U.S., Argentina, Uruguay and convertible currency areas.
Liquor sales were limited with March/April selling at 2,325 and 2,380 dlrs, June/July at 2,375 dlrs and at 1.25 times New York July,
Aug/Sept at 2,400 dlrs and at 1.25 times New York Sept and Oct/Dec at
1.25 times New York Dec, Comissaria Smith said.
Total Bahia sales are currently estimated at 6.13 mln bags against the 1986/87 crop and 1.06 mln bags against the 1987/88 crop.
Final figures for the period to February 28 are expected to be published by the Brazilian Cocoa Trade Commission after carnival which
ends midday on February 27.** Reuter </BODY></TEXT> </REUTERS>
<REUTERS TOPICS="NO" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET"
OLDID="5545" NEWID="2"> <DATE>26-FEB-1987 15:02:20.00</DATE>
<TOPICS></TOPICS> <PLACES><D>usa</D></PLACES> <PEOPLE></PEOPLE>
<ORGS></ORGS> <EXCHANGES></EXCHANGES> <COMPANIES></COMPANIES>
<UNKNOWN> F Y f0708reute d f
BC-STANDARD-OIL-<SRD>-TO 02-26 0082</UNKNOWN>
I need to count words only in the BODY fields (marked in bold in the example), excluding other characters, etc.
An example file for testing purposes.
From your question, it looks like you are building XML-formatted content and deserializing it just to count the content. That would be fine if you needed to collect the data, but if you only want to count the words between the BODY tags, it is much faster to parse the text and count on the fly.
My strategy is to take the substring of the content that starts after <BODY> and ends before </BODY>, then split it and count the pieces.
Here is the solution:
DirectoryInfo filePaths = new DirectoryInfo(@"D:\Stackoverflow\SgmCount\docs");
FileInfo[] Files = filePaths.GetFiles("*.sgm");
char[] delimiters = { ' ', '\r', '\n' };
int wordCount = 0;
foreach (FileInfo file in Files)
{
    string content = File.ReadAllText(file.FullName);
    // Skip past the opening tag: "<BODY>" is 6 characters long.
    content = content.Substring(content.IndexOf("<BODY>", StringComparison.Ordinal) + 6);
    // Keep everything up to (not including) the closing tag.
    content = content.Substring(0, content.IndexOf("</BODY>", StringComparison.Ordinal));
    // Add to the running total instead of overwriting it for each file.
    wordCount += content.Split(delimiters, StringSplitOptions.RemoveEmptyEntries).Length;
}
Console.WriteLine($"Total count words: {wordCount} words");
This gives an output:
Total count words: 488 words
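The substring approach only looks at the first BODY element in each file, while the sample file holds many REUTERS entries. A minimal sketch that counts every BODY block with a regex is shown below; the class and method names are hypothetical, and it assumes the tags are always written exactly as <BODY> and </BODY>.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BodyWordCounter
{
    // Singleline lets '.' span line breaks; the lazy quantifier stops at the
    // nearest closing tag so each BODY block is matched separately.
    private static readonly Regex BodyBlock =
        new Regex(@"<BODY>(.*?)</BODY>", RegexOptions.Singleline);

    private static readonly char[] Delimiters = { ' ', '\t', '\r', '\n' };

    // Sums the whitespace-separated tokens found inside all BODY blocks.
    public static int CountWords(string content)
    {
        int total = 0;
        foreach (Match m in BodyBlock.Matches(content))
        {
            total += m.Groups[1].Value
                .Split(Delimiters, StringSplitOptions.RemoveEmptyEntries)
                .Length;
        }
        return total;
    }
}
```

To get a total for a directory, you could call CountWords(File.ReadAllText(file.FullName)) for each file and add up the results. Tokens such as "5.93" or "+Bahia" still count as words here; stripping punctuation would need an extra step.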

converting file with unspecified number of lines, by using regex, visual c#

I have an app that converts a file by reading all lines from a source text file and printing only the lines that contain the word 'student'. It also removes some characters and splits each printed line into 5 fields, as shown below:
input text file
Form|01; 23_anna- Member 12569 is student - 12*01*2006
Form|02; 17_smith_ Member 12570 is teacher - 13*01*2007
Form|03; 12_ben_ Member 12571 is student - 14*01*2007
The output file:
Form01 anna 12569 student 12 01 2006
Form03 ben 12571 student 14 01 2007
The code I have tried:
private Regex find = new Regex(@"^(.+?)(?:\|)(\d+)(?:.+?_)(.+?)(?:[_-] Member ?)(\d+)(?:.+?)(student)(?:.+?)(\d\d).(\d\d).(\d\d\d\d)$", RegexOptions.Multiline);
private void MyButton_Click(object sender, EventArgs e)
{
string sample = "Form|01; 23_anna- Member 12569 is student - 12*01*2006\nForm|02; 17_smith_ Member 12570 is teacher - 13*01*2007\nForm|03; 12_ben_ Member 12571 is student - 14*01*2007";
MatchCollection matches = find.Matches(sample);
foreach (Match m in matches)
{
Console.WriteLine("{0}{1} {2} {3} is {4} {5} {6} {7}", m.Groups[1], m.Groups[2], m.Groups[3], m.Groups[4], m.Groups[5], m.Groups[6], m.Groups[7], m.Groups[8]);
}
Console.WriteLine();
}
But how can I change the code if I want to convert a file with more lines (about 500)?
The best way to do this in my opinion is to use File.ReadAllLines() then in a foreach loop do your regex. I also think that you are overcomplicating your regex so I have made a few changes where I think it can be simplified.
I am working under the assumption that the format of the lines you are looking for will always be the same. Since Form and student appear in all of these lines, I see little reason to capture them. In reality there are 6 key pieces of information to capture:
1 – the numbers after form
2 – the name
3 – the 5-digit member number
4,5,6 – the three sections of the date
Everything else is either constant or not used in the output string. So when we come to rewrite the search and replace we get something like:
/^\w+\|([^;]+).+?([a-z]+)[^\d]+(\d{5})[^\d]+(\d{2}).(\d{2}).(\d{4})/m
Console.WriteLine("Form{0} {1} {2} student {3} {4} {5}", m.Groups[1], m.Groups[2], m.Groups[3], m.Groups[4], m.Groups[5], m.Groups[6])
Note that there are assumptions in the regex such as the name is always in lower case and the member number is always 5 digits and some other stuff like there can't be numbers in the names etc. It isn't optimal but I think it is tidier than yours, but this is personal preference I guess.
To get the lines with student use string.Contains("student") or if you really want to include it in your regex I would recommend using a positive lookahead for student (?=.*student)
Here is a bit of example code I wrote for one way that I would do it:
var regex = new Regex(@"^\w+\|([^;]+).+?([a-z]+)[^\d]+(\d{5})[^\d]+(\d{2}).(\d{2}).(\d{4})$", RegexOptions.Multiline);
var file = File.ReadAllLines(@"C:\temp\test.txt");
foreach(var line in file)
{
if (line.Contains("student"))
{
var m = regex.Match(line);
Console.WriteLine("Form{0} {1} {2} student {3} {4} {5}", m.Groups[1], m.Groups[2], m.Groups[3], m.Groups[4], m.Groups[5], m.Groups[6]);
}
}
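The loop above prints a half-empty line if a "student" line does not fit the pattern, because Match returns an unsuccessful match rather than throwing. A hedged sketch of the same conversion with a Success check follows; StudentLineConverter and Convert are illustrative names, not part of the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StudentLineConverter
{
    // Same pattern as in the answer above, applied to one line at a time.
    private static readonly Regex LinePattern = new Regex(
        @"^\w+\|([^;]+).+?([a-z]+)[^\d]+(\d{5})[^\d]+(\d{2}).(\d{2}).(\d{4})$");

    // Returns the converted line, or null when the line should be skipped.
    public static string Convert(string line)
    {
        if (!line.Contains("student"))
        {
            return null;
        }
        Match m = LinePattern.Match(line);
        if (!m.Success)
        {
            return null; // a "student" line that does not fit the format
        }
        return string.Format("Form{0} {1} {2} student {3} {4} {5}",
            m.Groups[1].Value, m.Groups[2].Value, m.Groups[3].Value,
            m.Groups[4].Value, m.Groups[5].Value, m.Groups[6].Value);
    }
}
```

For a file of any length, one option is to pass each line from File.ReadLines through Convert, drop the null results, and hand the rest to File.WriteAllLines; the number of lines does not change the code.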

Trying to match multiple words multiple times, any order using regex

I'm trying to check whether a text contains two or more specific words. The words can be in any order and can show up in the text multiple times, but each must appear at least once.
If the text is a match I will need to get the information about location of the words.
Lets say we have the text :
"Once I went to a store and bought a coke for a dollar and I got another coke for free"
In this example I want to match the words coke and dollar.
So the result should be:
coke : index 37, length 4
dollar : index 48, length 6
coke : index 84, length 4
What I have already is this (which I think is a little bit wrong, because the text should contain each word at least once, so there should be a + instead of the *):
(?:(\bcoke\b))*(?:(\bdollar\b))*
But with that regex, RegexBuddy highlights all three words if I ask it to highlight group 1 and group 2.
But when I run this in C#, I don't get any results.
Can you point me in the right direction?
I don't think what you want is possible using only regular expressions.
Here is a possible solution using regular expressions and linq:
var words = new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "coke", "dollar" };
var regex = new Regex(@"\b(?:"+string.Join("|", words)+@")\b", RegexOptions.IgnoreCase);
var text = @"Once I went to a store and bought a coke
for a dollar and I got another coke for free";
var grouped = regex.Matches(text)
.OfType<Match>()
.GroupBy(m => m.Value, StringComparer.OrdinalIgnoreCase)
.ToArray();
if (grouped.Length != words.Count)
{
//not all words were found
}
else
{
foreach (var g in grouped)
{
Console.WriteLine("Found: " + g.Key);
foreach (var match in g)
Console.WriteLine(" At {0} length {1}", match.Index, match.Length);
}
}
Output:
Found: coke
At 36 length 4
At 72 length 4
Found: dollar
At 47 length 6
How about this? It is pret-tay bad, but I think it has a shot at working, and it is pure regex with no extra tools.
(?:^|\W)[cC][oO][kK][eE](?:$|\W)|(?:^|\W)[dD][oO][lL][lL][aA][rR](?:$|\W)
Get rid of the \W groups if you want it to also match inside cokeDollar or dollarCoKe etc.
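One thing to watch with the pattern above is that (?:^|\W) and (?:$|\W) consume the neighbouring characters, so the reported index and length can include a space. A sketch that sidesteps this with \b and RegexOptions.IgnoreCase is below; WordLocator and Locate are made-up names for this example, and the "every word at least once" check is done in C# rather than in the regex.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordLocator
{
    // Returns (word, index, length) for every hit, or an empty list unless
    // each requested word occurs at least once in the text.
    public static List<Tuple<string, int, int>> Locate(string text, params string[] words)
    {
        // \b is zero-width, so Index and Length describe the word itself.
        string pattern = @"\b(?:" +
            string.Join("|", words.Select(w => Regex.Escape(w))) + @")\b";

        List<Tuple<string, int, int>> hits = Regex
            .Matches(text, pattern, RegexOptions.IgnoreCase)
            .Cast<Match>()
            .Select(m => Tuple.Create(m.Value, m.Index, m.Length))
            .ToList();

        bool allPresent = words.All(w => hits.Any(
            h => string.Equals(h.Item1, w, StringComparison.OrdinalIgnoreCase)));

        return allPresent ? hits : new List<Tuple<string, int, int>>();
    }
}
```

Called with the single-line sample sentence and the words "coke" and "dollar", this yields three hits at indexes 36, 47 and 72.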

Prepend string and Suffix string to record using CSVHelper

I need to export entities to a CSV file using CSVHelper. I got a trial version working, but I would have to write every field manually. What I want is to write each record prefixed with either an 'H' or a 'D' and to end every line with a single space. My demo models:
PersonId FirstName LastName DateOfBirth
1 Randy Smith 1968-08-31
2 Zachary Smith 2002-01-10
3 Angie Smith 1969-11-20
4 Khelzie Smith 1996-07-27
AutoId Year Make Model OwnerId
1 2000 Toyota 4Runner 1
2 1995 Ford Mustang 1
3 2014 Chevrolet Corvette Stingray Coupe 2
4 2014 Volkswagen Beetle Coupe 4
5 1980 Ford F-150 2
6 1968 Chevrolet Camaro 3
7 2000 Tonka Truck 3
8 1993 Honda Accord 4
Into a CSV File Like this:
H 1 Randy Smith 8/31/1968
D 1 2000 Toyota 4Runner
D 2 1995 Ford Mustang
H 2 Zachary Smith 1/10/2002
D 3 2014 Chevy Corevett
D 5 1980 Ford F-150
H 3 Angie Smith 11/20/1969
D 6 1968 Chevrolet Camaro
D 7 2000 Tonka Truck
H 4 Khelzie Smith 7/27/1996
D 4 2014 Volkswagen Beetle Coupe
This is the Code I finally got to work:
StreamWriter textWriter = File.CreateText(fileName);
var csv = new CsvWriter(textWriter);
csv.Configuration.Delimiter = delimiter;
csv.Configuration.QuoteNoFields = true;
// This will skip those people who don't own a vehicle
foreach (Person person in people.Where(person => person.Vehicles.Count > 0))
{
// The letter 'H' must prefix every Header line
csv.WriteField((@"H " + person.PersonId));
csv.WriteField(person.FirstName);
csv.WriteField(person.LastName);
// Headers lines must end with a single space.
csv.WriteField((person.DateOfBirth.ToShortDateString() + " "));
csv.NextRecord();
foreach (Automobile auto in person.Vehicles)
{
// The letter 'D' must prefix every Detail line
csv.WriteField((@"D " + auto.AutoId));
csv.WriteField(auto.Year);
csv.WriteField(auto.Make);
// Details lines must end with a single space.
csv.WriteField((auto.Model + " "));
csv.NextRecord();
}
}
The real tables have ~70 fields apiece.
Just for those that have as thick a skull as mine, here is a solution:
foreach (TransactionHeader header in headers)
{
csv.WriteField("H");
csv.WriteRecord(header);
csv.WriteField(" ");
csv.NextRecord();
foreach (TransactionDetail detail in header.TransactionDetail)
{
csv.WriteField("D");
csv.WriteRecord(detail);
csv.WriteField(" ");
csv.NextRecord();
}
}
Thanks to everyone who saw this as pretty obvious and patiently waited for me to bash my head down on my desk enough times and then figure this out myself.

Best way to Find which cell of string array contains text

I have a block of text that I'm taking from a GEDCOM file.
The text is flat and basically broken into "nodes".
I am splitting each node on the \r character, which subdivides it into its parts (the number of "lines" can vary).
I know index 0 will always be the ID, but after that anything can be anywhere, so I want to test each cell of the array to see whether it contains the tag I need to process.
Here is an example of what two nodes look like:
0 @ind23815@ INDI <<<<<<<<<<<<<<<<<<< Start of node 1
1 NAME Lawrence /Hucstepe/
2 DISPLAY Lawrence Hucstepe
2 GIVN Lawrence
2 SURN Hucstepe
1 POSITION -850,-210
2 BOUNDARY_RECT (-887,-177),(-813,-257)
1 SEX M
1 BIRT
2 DATE 1521
1 DEAT Y
2 DATE 1559
1 NOTE * Born: Abt 1521, Kent, England
2 CONT * Marriage: Jane Pope 17 Aug 1546, Kent, England
2 CONT * Died: Bef 1559, Kent, England
2 CONT
1 FAMS @fam08318@
0 @ind23816@ INDI <<<<<<<<<<<<<<<<<<<<<<< Start of Node 2
1 NAME Jane /Pope/
2 DISPLAY Jane Pope
2 GIVN Jane
2 SURN Pope
1 POSITION -750,-210
2 BOUNDARY_RECT (-787,-177),(-713,-257)
1 SEX F
1 BIRT
2 DATE 1525
1 DEAT Y
2 DATE 1609
1 NOTE * Born: Abt 1525, Tenterden, Kent, England
2 CONT * Marriage: Lawrence Hucstepe 17 Aug 1546, Kent, England
2 CONT * Died: 23 Oct 1609
2 CONT
1 FAMS @fam08318@
0 @ind23817@ INDI <<<<<<<<<<< start of Node 3
So when I'm done, I have an array that looks like this:
address , string
0 = "1 NAME Lawrence /Hucstepe/"
1 = "2 DISPLAY Lawrence Hucstepe"
2 = "2 GIVN Lawrence"
3 = "2 SURN Hucstepe"
4 = "1 POSITION -850,-210"
5 = "2 BOUNDARY_RECT (-887,-177),(-813,-257)"
6 = "1 SEX M"
7 = "1 BIRT "
8 = "1 FAMS @fam08318@"
So my question is: what is the best way to search the above array to see which cell has the SEX tag, the NAME tag, or the FAMS tag?
This is the code I have:
private int FindIndexinArray(string[] Arr, string search)
{
int Val = -1;
for (int i = 0; i < Arr.Length; i++)
{
if (Arr[i].Contains(search))
{
Val = i;
}
}
return Val;
}
But it seems inefficient, because I end up calling it twice to make sure it doesn't return -1,
like so:
if (FindIndexinArray(SubNode, "1 BIRT ") != -1)
{
// add birthday to Struct
I.BirthDay = SubNode[FindIndexinArray(SubNode, "1 BIRT ") + 1].Replace("2 DATE ", "").Trim();
}
Sorry for the long post, but hopefully you have some expert advice.
You can use the static FindAll method of the Array class.
Note that it returns the matching strings themselves rather than their indexes, if that works for you.
string[] test = { "Sex", "Love", "Rock and Roll", "Drugs", "Computer"};
Array.FindAll(test, item => item.Contains("Sex") || item.Contains("Drugs") || item.Contains("Computer"));
The => indicates a lambda expression, which is basically an anonymous method defined inline.
You can also do this if the lambda gives you the creeps.
//Declare a method
private bool HasTag(string s)
{
return s.Contains("Sex") || s.Contains("Drugs") || s.Contains("Computer");
}
string[] test = { "Sex", "Love", "Rock and Roll", "Drugs", "Computer"};
Array.FindAll(test, HasTag);
What about a simple regular expression?
^(\d)\s=\s\"\d\s(SEX|BIRT|FAMS){1}.*$
First group captures the address, second group the tag.
Also, it might be quicker to dump all array items into a string and do your regex on the whole lot at once.
"But it seems inefficient because i end up calling it twice to make sure it doesnt return a -1"
Copy the returned value into a variable before you test it, so the method is only called once.
int indexResults = FindIndexinArray(SubNode, "1 BIRT ");
if (indexResults != -1)
{
    // add birthday to Struct; the date is on the line after the BIRT tag
    I.BirthDay = SubNode[indexResults + 1].Replace("2 DATE ", "").Trim();
}
The for loop in FindIndexinArray should break once it finds a match if you are only interested in the first match.
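Both suggestions (call once, stop at the first match) can be covered by the built-in Array.FindIndex, which returns -1 when nothing matches and stops at the first hit. The sketch below is only an illustration: GedcomNode and DateUnder are hypothetical names, and it assumes the date, when present, sits on the line immediately after the level-1 tag.

```csharp
using System;

public static class GedcomNode
{
    // Returns the value of the "2 DATE" line directly under the given
    // level-1 tag (for example "BIRT"), or null when there is none.
    public static string DateUnder(string[] lines, string tag)
    {
        // FindIndex scans once and stops at the first matching line.
        int i = Array.FindIndex(
            lines, l => l.StartsWith("1 " + tag, StringComparison.Ordinal));
        if (i < 0 || i + 1 >= lines.Length)
        {
            return null;
        }

        const string datePrefix = "2 DATE ";
        string next = lines[i + 1];
        return next.StartsWith(datePrefix, StringComparison.Ordinal)
            ? next.Substring(datePrefix.Length).Trim()
            : null;
    }
}
```

StartsWith is used instead of Contains so that a tag name appearing inside a NOTE or CONT line is not mistaken for the tag itself.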
