Search for some phrases in a text file using Regex C#

Search for some phrases in a text file using Regex C# - c#

The task:
Write a program, which counts the phrases in a text file. Any sequence of characters could be given as phrase for counting, even sequences containing separators. For instance in the text "I am a student in Sofia" the phrases "s", "stu", "a" and "I am" are found respectively 2, 1, 3 and 1 times.
I know the solution with string.IndexOf or with LINQ or with some type of algorithm like Aho-Corasick. I want to do same thing with Regex.
This is what I've done so far:
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;
namespace CountThePhrasesInATextFile
{
class Program
{
static void Main(string[] args)
{
string input = ReadInput("file.txt");
input.ToLower();
List<string> phrases = new List<string>();
using (StreamReader reader = new StreamReader("words.txt"))
{
string line = reader.ReadLine();
while (line != null)
{
phrases.Add(line.Trim());
line = reader.ReadLine();
}
}
foreach (string phrase in phrases)
{
Regex regex = new Regex(String.Format(".*" + phrase.ToLower() + ".*"));
int mathes = regex.Matches(input).Count;
Console.WriteLine(phrase + " ----> " + mathes);
}
}
private static string ReadInput(string fileName)
{
string output;
using (StreamReader reader = new StreamReader(fileName))
{
output = reader.ReadToEnd();
}
return output;
}
}
}
I know my regular expression is incorrect but I don't know what to change.
The output:
Word ----> 2
S ----> 2
MissingWord ----> 0
DS ----> 2
aa ----> 0
The correct output:
Word --> 9
S --> 13
MissingWord --> 0
DS --> 2
aa --> 3
file.txt contains:
Word? We have few words: first word, second word, third word.
Some passwords: PASSWORD123, #PaSsWoRd!456, AAaA, !PASSWORD
words.txt contains:
Word
S
MissingWord
DS
aa

You need to post the file.txt contents first, otherwise it's difficult to verify if the regex is working correctly or not.
That being said, check out the Regex answer here:
Finding ALL positions of a substring in a large string in C#
and see if that helps with your code in the mean time.
edit:
So there's a simple solution, add "(?=(" and "))" to each of your phrases. This is a lookahead assertion in regex. The following code handles what you want.
foreach (string phrase in phrases) {
string MatchPhrase = "(?=(" + phrase.ToLower() + "))";
int mathes = Regex.Matches(input, MatchPhrase).Count;
Console.WriteLine(phrase + " ----> " + mathes);
}
You also had an issue with
input.ToLower();
which should be instead
input = input.ToLower();
as strings in c# are immutable. In total, your code should be:
static void Main(string[] args) {
string input = ReadInput("file.txt");
input = input.ToLower();
List<string> phrases = new List<string>();
using (StreamReader reader = new StreamReader("words.txt")) {
string line = reader.ReadLine();
while (line != null) {
phrases.Add(line.Trim());
line = reader.ReadLine();
}
}
foreach (string phrase in phrases) {
string MatchPhrase = "(?=(" + phrase.ToLower() + "))";
int mathes = Regex.Matches(input, MatchPhrase).Count;
Console.WriteLine(phrase + " ----> " + mathes);
}
Thread.Sleep(50000);
}
private static string ReadInput(string fileName) {
string output;
using (StreamReader reader = new StreamReader(fileName)) {
output = reader.ReadToEnd();
}
return output;
}

here is what happened. I am going to use Word as example.
the regex you built for "word" is ".word.". It is telling regex to match anything starts with anything, contains "word" and ends with anything.
for your input, it matched
Word? We have few words: first word, second word, third word.
which starts with "Word? We have few words: first" and ends with ", second word, third word."
then second line starts with "Some pass" contains "word" and ends with ": PASSWORD123, #PaSsWoRd!456, AAaA, !PASSWORD"
so the count is 2
the regex you want is simple, string "word" is sufficient.
Update:
for ignore case pattern try "(?i)word"
And for the multiple matches within AAaA, try "(?i)(?<=a)a"
?<= is a Zero-width positive lookbehind assertion

Try this code:
string input = File.ReadAllText("file.txt");
foreach (string word in File.ReadLines("words.txt"))
{
var regex = new Regex(word, RegexOptions.IgnoreCase);
int startat = 0;
int count = 0;
Match match = regex.Match(input, startat);
while (match.Success)
{
count++;
startat = match.Index + 1;
match = regex.Match(input, startat);
}
Console.WriteLine(word + "\t" + count);
}
To correctly find all substrings like "aa", had to use the overload Match method with startat parameter.
Note the RegexOptions.IgnoreCase parameter.
A shorter but less clear code:
Match match;
while ((match = regex.Match(input, startat)).Success)
{
count++;
startat = match.Index + 1;
}

Related

Removing special characters from a string with RegEx

Am reading a text file that contains words, numbers and special characters, I want to remove certain special characters like: [](),'
I have this code but it is not working !
using (var reader = new StreamReader ("C://Users//HP//Documents//result2.txt")) {
string line = reader.ReadToEnd ();
Regex rgx = new Regex ("[^[]()',]");
string res = rgx.Replace (line, "");
Message1.text = res;
}
what am I missing, thanks

Some of the characters in your Regex, specifically [ ] ( ) ^, hold special meaning in Regex and in order to use them literally they must be escaped.
Use the following properly escaped Regex:
Regex rgx = new Regex (#"[\^\[\]\(\)',]");
Note that it is necessary to use the # verbatim string, because we don't want to escape these characters from the string, only from the Regex.
Alternatively, double escape the backslashes:
Regex rgx = new Regex ("[\\^\\[\\]\\(\\)',]");
But that's less readable in this case.

You could skip Regex and just maintain a list of characters you want to remove and then replace the old fashioned way:
string[] specialCharsToRemove = new [] { "[", "]", "(", ")", "'", "," };
using (var reader = new StreamReader ("C://Users//HP//Documents//result2.txt"))
{
string line = reader.ReadToEnd();
foreach(string s in specialCharsToRemove)
{
line = line.Replace(s, string.Empty);
}
Message1.text = res;
}
Ideally this would be in its own method, something like this:
private static string RemoveCharacters(string input, string[] specialCharactersToRemove)
{
foreach(string s in specialCharactersToRemove)
{
input = input.Replace(s, string.Empty);
}
return input;
}
I made a fiddle here

Replace them one at a time with String.Replace:
using (var reader = new StreamReader ("C://Users//HP//Documents//result2.txt"))
{
string line = reader.ReadToEnd ();
string res = line.Replace(line, "[", "");
res = res.Replace(line, "]", "");
res = res.Replace(line, "(", "");
res = res.Replace(line, ")", "");
res = res.Replace(line, "'", "");
res = res.Replace(line, ",", "");
Message1.text = res;
}

I agree with avoiding regex for this, but I would not use string.Replace multiple times, either.
Consider implementing a Replace or Remove method that accepts an array of characters to replace, and scan the input string only once. For example:
var builder = new StringBuilder();
foreach (char ch in input)
{
if (!chars.Contains(ch))
{
builder.Append(ch):
}
}
return builder.ToString();

How can i delete space from text file and replace it semicolon?

I have this data into the test text file:
behzad razzaqi xezerlooot abrizii ast
i want delete space and replace space one semicolon character,write this code in c# for that:
string[] allLines = File.ReadAllLines(#"d:\test.txt");
using (StreamWriter sw = new StreamWriter(#"d:\test.txt"))
{
foreach (string line in allLines)
{
if (!string.IsNullOrEmpty(line) && line.Length > 1)
{
sw.WriteLine(line.Replace(" ", ";"));
}
}
}
MessageBox.Show("ok");
behzad;;razzaqi;;xezerlooot;;;abrizii;;;;;ast
but i want one semicolon in space.how can i solve that?

Regex is an option:
string[] allLines = File.ReadAllLines(#"d:\test.txt");
using (StreamWriter sw = new StreamWriter(#"d:\test.txt"))
{
foreach (string line in allLines)
{
if (!string.IsNullOrEmpty(line) && line.Length > 1)
{
sw.WriteLine(Regex.Replace(line,#"\s+",";"));
}
}
}
MessageBox.Show("ok");

Use this code:
string[] allLines = File.ReadAllLines(#"d:\test.txt");
using (StreamWriter sw = new StreamWriter(#"d:\test.txt"))
{
foreach (string line in allLines)
{
string[] words = line.Split(" ", StringSplitOptions.RemoveEmptyEntries);
string joined = String.Join(";", words);
sw.WriteLine(joined);
}
}

You need to use a regular expression:
(\s\s+)
Usage
var input = "behzad razzaqi xezerlooot abrizii ast";
var pattern = "(\s\s+)";
Regex rgx = new Regex(pattern);
string result = rgx.Replace(input, ';');

You can do that with a regular expression.
using System.Text.RegularExpressions;
and:
string pattern = "\\s+";
string replacement = ";";
Regex rgx = new Regex(pattern);
sw.WriteLine(rgx.Replace(line, replacement));
This regular expression matches any series of 1 or more spaces and replaces the entire series with a semicolon.

you can try this
Regex r=new Regex(#"\s+");
string result=r.Replace("YourString",";");
\s+ is for matching all spaces. + is for one or more occurrences.
for more information on regular expression see http://www.w3schools.com/jsref/jsref_obj_regexp.asp

You should check a string length after replacement, not before ;-).
const string file = #"d:\test.txt";
var result = File.ReadAllLines(file).Select(line => Regex.Replace(line, #"\s+", ";"));
File.WriteAllLines(file, result.Where(line => line.Length > 1));
...and don't forget, that for input hello you will get ;hello;.

Regex Replace on a JSON structure

I am currently trying to do a Regex Replace on a JSON string that looks like:
String input = "{\"`####`Answer_Options11\": \"monkey22\",\"`####`Answer_Options\": \"monkey\",\"Answer_Options2\": \"not a monkey\"}";
a
The goal is to find and replace all the value fields who's key field starts with `####`
I currently have this:
static Regex _FieldRegex = new Regex(#"`####`\w+" + ".:.\"(.*)\",");
static public string MatchKey(string input)
{
MatchCollection match = _encryptedFieldRegex.Matches(input.ToLower());
string match2 = "";
foreach (Match k in match )
{
foreach (Capture cap in k.Captures)
{
Console.WriteLine("" + cap.Value);
match2 = Regex.Replace(input.ToLower(), cap.Value.ToString(), #"CAKE");
}
}
return match2.ToString();
}
Now this isn't working. Naturally I guess since it picks up the entire `####`Answer_Options11\": \"monkey22\",\"`####`Answer_Options\": \"monkey\", as a match and replaces it. I want to just replace the match.Group[1] like you would for a single match on the string.
At the end of the day the JSON string needs to look something like this:
String input = "{\"`####`Answer_Options11\": \"CATS AND CAKE\",\"`####`Answer_Options\": \"CAKE WAS A LIE\",\"Answer_Options2\": \"not a monkey\"}";
Any idea how to do this?

you want a positive lookahead and a positive lookbehind :
(?<=####.+?:).*?(?=,)
the lookaheads and lookbehinds will verify that it matches those patterns, but not include them in the match. This site explains the concept pretty well.
Generated code from RegexHero.com :
string strRegex = #"(?<=####.+?:).*?(?=,)";
Regex myRegex = new Regex(strRegex);
string strTargetString = #" ""{\""`####`Answer_Options11\"": \""monkey22\"",\""`####`Answer_Options\"": \""monkey\"",\""Answer_Options2\"": \""not a monkey\""}""";
foreach (Match myMatch in myRegex.Matches(strTargetString))
{
if (myMatch.Success)
{
// Add your code here
}
}
this will match "monkey22" and "monkey" but not "not a monkey"

Working from #Jonesy's answer I got to this which works for what I wanted. It includes the .Replace on the groups that I required. The negative look ahead and behinds were very interesting but I needed to replace some of those values hence groups.
static public string MatchKey(string input)
{
string strRegex = #"(__u__)(.+?:\s*)""(.*)""(,|})*";
Regex myRegex = new Regex(strRegex, RegexOptions.IgnoreCase | RegexOptions.Multiline);
IQS_Encryption.Encryption enc = new Encryption();
int count = 1;
string addedJson = "";
int matchCount = 0;
foreach (Match myMatch in myRegex.Matches(input))
{
if (myMatch.Success)
{
//Console.WriteLine("REGEX MYMATCH: " + myMatch.Value);
input = input.Replace(myMatch.Value, "__e__" + myMatch.Groups[2].Value + "\"c" + count + "\"" + myMatch.Groups[4].Value);
addedJson += "c"+count + "{" +enc.EncryptString(myMatch.Groups[3].Value, Encoding.UTF8.GetBytes("12345678912365478912365478965412"))+"},";
}
count++;
matchCount++;
}
Console.WriteLine("MAC" + matchCount);
return input + addedJson;
}`
Thanks again to #Jonesy for the huge help.

Exception "String cannot be of Zero length"

We are trying to read each word from a text file and replace it with another word.
For smaller text files, it works well. But for larger text files we keep getting the exception: "String cannot be of zero length.
Parameter name: oldValue "
void replace()
{
string s1 = " ", s2 = " ";
StreamReader streamReader;
streamReader = File.OpenText("C:\\sample.txt");
StreamWriter streamWriter = File.CreateText("C:\\sample1.txt");
//int x = st.Rows.Count;
while ((line = streamReader.ReadLine()) != null)
{
char[] delimiterChars = { ' ', '\t' };
String[] words = line.Split(delimiterChars);
foreach (string str in words)
{
s1 = str;
DataRow drow = st.Rows.Find(str);
if (drow != null)
{
index = st.Rows.IndexOf(drow);
s2 = Convert.ToString(st.Rows[index]["Binary"]);
s2 += "000";
// Console.WriteLine(s1);
// Console.WriteLine(s2);
streamWriter.Write(s1.Replace(s1,s2)); // Exception occurs here
}
else
break;
}
}
streamReader.Close();
streamWriter.Close();
}
we're unable to find the reason.
Thanks in advance.

When you do your string.Split you may get empty entries if there are multiple spaces or tabs in sequence. These can't be replaced as the strings are 0 length.
Use the overload that strips empty results using the StringSplitOptions argument:
var words = line.Split(delimiterChars, StringSplitOptions.RemoveEmptyEntries);

The exception occurs because s1 is an empty string at some point. You can avoid this by replacing the line
String[] words = line.Split(delimiterChars);
with this:
String[] words = line.Split(delimiterChars, StringSplitOptions.RemoveEmptyEntries);

You want to change your Split method call like this:
String[] words = line.Split(delimiterChars,StringSplitOptions.RemoveEmptyEntries);

It means that s1 contains an empty string ("") which can happen if you have two consecutive white spaces or tabs in your file.

C# complicated regex

Hey guys, thanks for all the help that you can provide. I need a little bit of regex help thats far beyond my knowledge.
I have a listbox with a file name in it, example 3123~101, a delimited file that has 1 line of text in it. I need to Regex everything after the last "\" before the last "-" in the text file. The ending will could contain a prefix then ###{####-004587}.txt The ~ formula is {### + ~# -1.
File name:
3123~101
So Example 1:
3123|X:directory\Path\Directory|Pre0{0442-0500}.txt
Result:
X:\directory\Path\Directory\Pre00542.txt
File name:
3123~101
So Example 1:
3123|X:directory\Path\Directory|0{0442-0500}.txt
Result:
X:\directory\Path\Directory\00542.txt

According your example I've created the following regexp:
\|(.)(.*)\|(.*)\{\d{2}(\d{2})\-(\d{2}).*(\..*)
The result should be as following:
group1 + "\\" + group2 + "\\" + group3 + group5 + group4 + group6
If you ain't satisfied, you can always give it a spin yourself here.
EDIT:
After remembering me about named groups:
\|(?<drive>.)(?<path>.*)\|(?<prefix>.*)\{\d{2}(?<number2>\d{2})\-(?<number1>\d{2}).*(?<extension>\..*)
drive + "\\" + path + "\\" + prefix + number1 + number2 + extension

public static string AdjustPath(string filename, string line)
{
int tilde = GetTilde(filename);
string[] fields = Regex.Split(line, #"\|");
var addbackslash = new MatchEvaluator(
m => m.Groups[1].Value + "\\" + m.Groups[2].Value);
string dir = Regex.Replace(fields[1], #"^([A-Z]:)([^\\])", addbackslash);
var addtilde = new MatchEvaluator(
m => (tilde + Int32.Parse(m.Groups[1].Value) - 1).
ToString().
PadLeft(m.Groups[1].Value.Length, '0'));
return Path.Combine(dir, Regex.Replace(fields[2], #"\{(\d+)-.+}", addtilde));
}
private static int GetTilde(string filename)
{
Match m = Regex.Match(filename, #"^.+~(\d+)$");
if (!m.Success)
throw new ArgumentException("Invalid filename", "filename");
return Int32.Parse(m.Groups[1].Value);
}
Call AdjustPath as in the following:
public static void Main(string[] args)
{
Console.WriteLine(AdjustPath("3123~101", #"3123|X:directory\Path\Directory|Pre0{0442-0500}.txt"));
Console.WriteLine(AdjustPath("3123~101", #"3123|X:directory\Path\Directory|0{0442-0500}.txt"));
}
Output:
X:\directory\Path\Directory\Pre00542.txt
X:\directory\Path\Directory\00542.txt
If instead you want to write the output to a file, use
public static void WriteAdjustedPaths(string inpath, string outpath)
{
using (var w = new StreamWriter(outpath))
{
var r = new StreamReader(inpath);
string line;
while ((line = r.ReadLine()) != null)
w.WriteLine("{0}", AdjustPath(inpath, line));
}
}
You might call it with
WriteAdjustedPaths("3123~101", "output.txt");
If you want a List<String> instead
public static List<String> AdjustedPaths(string inpath)
{
var paths = new List<String>();
var r = new StreamReader(inpath);
string line;
while ((line = r.ReadLine()) != null)
paths.Add(AdjustPath(inpath, line));
return paths;
}
To avoid repeated logic, we should define WriteAdjustedPaths in terms of the new function:
public static void WriteAdjustedPaths(string inpath, string outpath)
{
using (var w = new StreamWriter(outpath))
{
foreach (var p in AdjustedPaths(inpath))
w.WriteLine("{0}", p);
}
}
The syntax could be streamlined with Linq. See C# File Handling.

A slight variation on gbacon's answer that will also work in older versions of .Net:
static void Main(string[] args)
{
Console.WriteLine(Adjust("3123~101", #"3123|X:directory\Path\Directory|Pre0{0442-0500}.txt"));
Console.WriteLine(Adjust("3123~101", #"3123|X:directory\Path\Directory|0{0442-0500}.txt"));
}
private static string Adjust(string name, string file)
{
Regex nameParse = new Regex(#"\d*~(?<value>\d*)");
Regex fileParse = new Regex(#"\d*\|(?<drive>[A-Za-z]):(?<path>[^\|]*)\|(?<prefix>[^{]*){(?<code>\d*)");
Match nameMatch = nameParse.Match(name);
Match fileMatch = fileParse.Match(file);
int value = Convert.ToInt32(nameMatch.Groups["value"].Value);
int code = Convert.ToInt32(fileMatch.Groups["code"].Value);
code = code + value - 1;
string drive = fileMatch.Groups["drive"].Value;
string path = fileMatch.Groups["path"].Value;
string prefix = fileMatch.Groups["prefix"].Value;
string result = string.Format(#"{0}:\{1}\{2}{3:0000}.txt",
drive,
path,
prefix,
code);
return result;
}

You don't seem to be very clear in your examples.
That said,
/.*\\(.*)-[^-]*$/
will capture all text between the last backslash and the last hyphen in whatever it's matched against.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Search for some phrases in a text file using Regex C# - c#

Related

Removing special characters from a string with RegEx

How can i delete space from text file and replace it semicolon?

Regex Replace on a JSON structure

Exception "String cannot be of Zero length"

C# complicated regex

Categories

Resources