Regex doesn't recognize String C# - c#

I'm working with the Regex class and trying to figure out how many matches one String
has in another String.
Here is the situation:
MainWindow.DetailBLL.Name = "Top Senders By Total Threat Messages"
String detailName = MainWindow.DetailBLL.Name;
Extracted from:
MainWindow = Window Class
DetailBLL = Class
Name = Variable
public String Name
{
get { return _Name; }
set { _Name = value; }
}
CharacterReplacement(openedFile) = "Incoming Mail Domains Top Senders By Total Threat Messages"
String fileName = CharacterReplacement(openedFile);
Extracted from:
OpenFileDialog openedFile = new OpenFileDialog();
Incoming_Mail_Domains_Top_Senders_by_Graymail_Messages_RawData.csv
private String CharacterReplacement(OpenFileDialog file)
{
String input = file.SafeFileName;
String output = input.Replace("_", " ").Replace("RawData", " ").Replace("by", "By").Replace(".csv", " ");
//output: "Incoming Mail Domains Top Senders By Graymail Messages"
return output;
}
This method takes the file's name (the name of a .csv file) and converts it to a readable String, which it returns as shown.
The use of the Regex Class:
String source = detailName;
String searchPattern = fileName;
1st try: Doesn't work
int count = Regex.Matches(searchPattern, source).Count;
or doesn't work
int count = Regex.Matches(fileName, detailName).Count;
if (count > 0)
{
System.Windows.MessageBox.Show("Match!");
}
2nd try: Doesn't work
foreach (Match match in Regex.Matches(fileName, detailName))
or doesn't work
foreach (Match match in Regex.Matches(searchPattern, source))
{
System.Windows.MessageBox.Show("Matches: " + counter++);
}
I've noticed that Regex doesn't work this way. The variables don't seem to be recognized:
String source = detailName;
String searchPattern = fileName;
Only works when the variables are like this:
String source = "Top Senders By Total Threat Messages";
String searchPattern = "Incoming Mail Domains Top Senders By Total Threat Messages";
But this won't work for me. I need them to be evaluated as variables (non-literal Strings), not as hard-coded literals,
because their values change every time.
Is there a way to do this, please?

Well, first of all, you probably do not need regex (still, I recommend reading about regex at https://www.regular-expressions.info/).
I guess that you need to count how many words are contained in both strings. Neither your question nor your comments say whether you want to count the same word twice or just once.
Here is a basic sample:
using System;
using System.Linq;
namespace SearchLinq
{
class Program
{
static void Main(string[] args)
{
string source = "Top Senders By Total Threat Messages";
string find = "Incoming Mail Domains Top Senders By Total Threat Messages";
// first possible solution
int result = 0;
foreach (string word in find.Split(' '))
{
if (source.Contains(word))
{
result++;
}
}
Console.WriteLine(result);
// second possible solution
int result2 = find.Split(' ').Count(w => source.Contains(w));
Console.WriteLine(result2);
}
}
}
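A side note on the original attempt, offered as a hedged sketch rather than a definitive diagnosis: Regex.Matches(input, pattern) does not care whether its arguments are literals or variables, so a miss usually comes from the content of the strings (different wording, differing case, stray spaces, or regex metacharacters in the name). Assuming the goal is to count literal occurrences of one string inside another, something like the following may help. LiteralMatcher and CountLiteralMatches are names made up for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LiteralMatcher
{
    // Counts how often 'needle' occurs inside 'haystack', treating 'needle'
    // as plain text rather than as a regular expression.
    public static int CountLiteralMatches(string haystack, string needle)
    {
        // Trim stray whitespace and escape any regex metacharacters.
        string pattern = Regex.Escape(needle.Trim());
        return Regex.Matches(haystack, pattern, RegexOptions.IgnoreCase).Count;
    }
}
```

With the two strings from the question, LiteralMatcher.CountLiteralMatches(fileName, detailName) would return 1. If it returns 0 for your real data, printing both variables is a quick way to see where they actually differ.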

Related

Remove characters from List<string> in between separators (from text file)

Fast way to replace text in text file.
From this: somename#somedomain.com:hello_world
To This: somename:hello_world
It needs to be FAST and support multiple lines of text file.
I tried splitting the string into three parts, but it seems slow. See the example in the code below.
<pre><code>
public static void Conversion()
{
List<string> list = File.ReadAllLines("ETU/Tut.txt").ToList();
Console.WriteLine("Please wait, converting in progress !");
foreach (string combination in list)
{
if (combination.Contains("#"))
{
write: try
{
using (StreamWriter sw = new
StreamWriter("ETU/UPCombination.txt", true))
{
sw.WriteLine(combination.Split('#', ':')[0] + ":"
+ combination.Split('#', ':')[2]);
}
}
catch
{
goto write;
}
}
else
{
Console.WriteLine("At least one line doesn't contain #");
}
}
}</code></pre>
So I need a fast way to convert every line in a text file from
somename#somedomain.com:hello_world
to: somename:hello_world
and then save the result to a different text file.
!Remember the domain bit always changes!
Most likely not the fastest, but it is pretty fast with an expression similar to,
#[^:]+
and replace that with an empty string.
using System;
using System.Text.RegularExpressions;
public class Example
{
    public static void Main()
    {
        // Match '#' and everything after it up to (not including) the next ':'
        string pattern = @"#[^:]+";
        string substitution = @"";
        string input = @"somename#somedomain.com:hello_world1
somename#some_other_domain.com:hello_world2";
        RegexOptions options = RegexOptions.Multiline;
        Regex regex = new Regex(pattern, options);
        string result = regex.Replace(input, substitution);
        Console.WriteLine(result);
    }
}
If you wish to simplify, modify, or explore the expression, it is explained in the top right panel of regex101.com, where you can also see how it matches against some sample inputs.
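Since the question is mostly about speed, here is a hedged sketch of a regex-free variant. It assumes every line has the shape name#domain:rest and that the slowness comes largely from reopening the output file for each line. LineConverter, StripDomain and ConvertFile are hypothetical names used only for this illustration.

```csharp
using System;
using System.IO;

public static class LineConverter
{
    // Removes the text from the first '#' up to (not including) the next ':'.
    // Lines without that shape are returned unchanged.
    public static string StripDomain(string line)
    {
        int hash = line.IndexOf('#');
        if (hash < 0) return line;
        int colon = line.IndexOf(':', hash);
        if (colon < 0) return line;
        return line.Substring(0, hash) + line.Substring(colon);
    }

    // Streams the input line by line and opens the output file only once.
    public static void ConvertFile(string inputPath, string outputPath)
    {
        using (var writer = new StreamWriter(outputPath))
        {
            foreach (string line in File.ReadLines(inputPath))
            {
                writer.WriteLine(StripDomain(line));
            }
        }
    }
}
```

File.ReadLines reads lazily, so the whole file never has to sit in memory, and keeping a single StreamWriter open avoids one open/close cycle per line.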

A good method to detect any(not just malicious) intentionally obfuscated urls in text

I have a website on which anyone can register and post public content. However certain malicious users go on the site and post links that have to be legally taken down.
So all new users have to be approved before they can post URLs.
However, they attempt to avoid URL detection, like posting URLs similar to these:
h tt.pp//example(dot)com
Mainly URL shorteners and image hosting sites are the problems.
I would like to use a one-time solution.
Our URL detection method is continuously advanced to counter their obfuscation attempts. It uses regex and replace techniques plus blacklisting of known sites that the malicious users can use or have used, but there are too many to blacklist them all.
I have looked into AI solutions, but I can't find one that seems good for my problem, as most AIs are dedicated to detecting phishing sites rather than obfuscated URLs.
Another option is to train our own AI, but working that out would take us more time (we have not done this before) than continuing to advance the URL detector as we do now.
As a last resort, we may try to train an AI ourselves; suggestions on where to find suitable data for this are also appreciated.
We use C#, but any language is an option, we will integrate it one way or another.
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
public static class SecurityHelper
{
    private static string UrlCheckerReplaceRegex;
    private static string UrlCheckerBadStringsRegex;
    private static string UrlCheckerBadStringsSpaceRegex;
    static SecurityHelper()
    {
        var replaceStrings = new string[] { @"\s", "-", @"\[", @"\]", @"\(", @"\)", @"\{", @"\}", "d+o*t+", @"/", @"\\", @"\|", "d*o+t+", "d+o+t+", @"\." };
        // Wrap each alternative in a group. (Array.ForEach with "str = ..." only reassigns the lambda parameter and leaves the array unchanged.)
        UrlCheckerReplaceRegex = string.Join("|", replaceStrings.Select(str => '(' + str + ')'));
        var badStrings = new string[]
        {
            @"u\.*to\.+", @"goo\.+gl"
        };
        UrlCheckerBadStringsRegex = string.Join("|", badStrings.Select(str => '(' + str + ')'));
        var badStringsSpace = new string[]
        {
            "ww+", "h+t+p+", "imgur", "sh4re", "ibb", @"bitly", @"tinyurl", @"cuttly", @"demopolrme", @"doiopcom",
            @"tinycc", @"[0-9]ly", @"shrunkencom", @"[0-9]gp", @"pasia", @"gasia", @"gggg", @"bitdo", @"isgd", @"vgd",
            @"tnysh", @"tinypic", @"0rztw", @"chilpit"
        };
        UrlCheckerBadStringsSpaceRegex = string.Join("|", badStringsSpace.Select(str => '(' + str + ')'));
    }
    public static bool ContainsPossibleUrls(string text)
    {
        var modifiedText = text.ToLower();
        modifiedText = Regex.Replace(modifiedText, string.Join("|", UrlCheckerReplaceRegex), ".");
        // Replace all spaces and other separator characters with '.' to get the unobfuscated text, then look for a blacklisted URL.
        // A '.' is used because some blacklisted URLs are too broad and we do not want to accidentally block harmless input:
        // 'u to' would match when someone writes 'u too' in a message, so we look for "u\.*to\.+" instead.
        if (Regex.Match(modifiedText, UrlCheckerBadStringsRegex, RegexOptions.IgnoreCase).Success)
        {
            return true;
        }
        var modifiedTextSpace = text.ToLower();
        modifiedTextSpace = Regex.Replace(modifiedTextSpace, string.Join("|", UrlCheckerReplaceRegex), "");
        // Remove all spaces and other separator characters to get the unobfuscated text, then look for a blacklisted URL
        if (Regex.Match(modifiedTextSpace, UrlCheckerBadStringsSpaceRegex, RegexOptions.IgnoreCase).Success)
        {
            return true;
        }
        if (FindUrls(text).Count > 0)
        {
            return true;
        }
        return false;
    }
    public static List<string> FindUrls(string text)
    {
        var urls = new List<string>();
        if (string.IsNullOrWhiteSpace(text))
        {
            return urls;
        }
        MatchCollection matches = Regex.Matches(text, @"(http|https):\/\/([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:\/~+#-]*[\w@?^=%&\/~+#-])?"); // find all urls
        foreach (Match match in matches)
        {
            foreach (Capture capture in match.Captures)
            {
                urls.Add(capture.Value);
            }
        }
        return urls.Distinct().ToList();
    }
}
If a human mind can identify the pattern of the URL, then an AI must be able to as well.

SSIS C# Script Task: How to match/replace pattern with increment on a large XML file

There are other similar questions that have been asked and answered, but none of those answers work in what I'm trying to do, or there isn't enough information for me to know how to implement it in my own code. I've been at it for two days and now must ask for help.
I have a script task in an SSIS package where I need to do a match and replace on a large XML file that contains thousands of Record Identifier tags. Each one contains a number. I need those numbers to be consecutive and increment by one. For example, within the xml file, I am able to find tags that appear like this:
<ns1:recordIdentifier>1</ns1:recordIdentifier>
<ns1:recordIdentifier>6</ns1:recordIdentifier>
<ns1:recordIdentifier>223</ns1:recordIdentifier>
<ns1:recordIdentifier>4102</ns1:recordIdentifier>
I need to find and replace those tags with consecutive increments like so:
<ns1:recordIdentifier>1</ns1:recordIdentifier>
<ns1:recordIdentifier>2</ns1:recordIdentifier>
<ns1:recordIdentifier>3</ns1:recordIdentifier>
<ns1:recordIdentifier>4</ns1:recordIdentifier>
The code I have so far is causing all the numbers to be "1" with no incrementation.
I've tried dozens of different methods, but nothing has worked yet.
Any ideas as to how I can modify the below code to increment as desired?
public void Main()
{
string varStart = "<ns1:recordIdentifier>";
string varEnd = "</ns1:recordIdentifier>";
int i = 1;
string path = Dts.Variables["User::xmlFilename"].Value.ToString();
string outPath = Dts.Variables["User::xmlOutputFile"].Value.ToString();
string ptrn = @"<ns1:recordIdentifier>\d{1,4}<\/ns1:recordIdentifier>";
string replace = varStart + i + varEnd;
using (StreamReader sr = File.OpenText(path))
{
string s = "";
while ((s = sr.ReadLine()) != null && i>0)
{
File.WriteAllText(outPath, Regex.Replace(File.ReadAllText(path),
ptrn, replace));
i++;
}
}
}
You were on the right path with the Replace method, but you will need to use the MatchEvaluator parameter so that the counter is incremented for each match.
string inputFile = Dts.Variables["User::xmlFilename"].Value.ToString();
string outPutfile = Dts.Variables["User::xmlOutputFile"].Value.ToString();
string fileText = File.ReadAllText(inputFile);
//get any number (one or more digits) between elements
Regex reg = new Regex("<ns1:recordIdentifier>[0-9]+</ns1:recordIdentifier>");
string xmlStartTag = "<ns1:recordIdentifier>";
string xmlEndTag = "</ns1:recordIdentifier>";
//assuming this starts at 1
int incrementInt = 1;
//the lambda runs once per match, so every tag gets the next number
fileText = reg.Replace(fileText, tag =>
{ return xmlStartTag + incrementInt++.ToString() + xmlEndTag; });
File.WriteAllText(outPutfile, fileText);
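For context on why the question's code produced only "1": the replacement string there was built once, before the loop, so every match received the same text. A MatchEvaluator delegate is invoked per match instead. The following is a minimal, self-contained sketch of that idea outside SSIS. Renumberer and Renumber are hypothetical names, and the pattern assumes the tags look exactly like the ones shown in the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class Renumberer
{
    // Replaces the number inside every recordIdentifier element with
    // consecutive values starting at 1, in document order.
    public static string Renumber(string xml)
    {
        int next = 1;
        var pattern = new Regex(@"(<ns1:recordIdentifier>)\d+(</ns1:recordIdentifier>)");
        // The lambda is the MatchEvaluator: it is called once for each match.
        return pattern.Replace(xml, m => m.Groups[1].Value + (next++) + m.Groups[2].Value);
    }
}
```

For very large files, or when the tags can vary in whitespace or prefix, an XML reader/writer is likely the more robust route; the regex version is only meant to illustrate the per-match callback.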

Remove parts of string

I have the following string
string a = @"\\server\MainDirectory\SubDirectoryA\SubdirectoryB\SubdirectoryC\Test.jpg";
I'm trying to remove part of the string so in the end I want to be left with
string a = @"\\server\MainDirectory\SubDirectoryA\SubdirectoryB";
So currently I'm doing
string b = a.Remove(a.LastIndexOf('\\'));
string c = b.Remove(b.LastIndexOf('\\'));
Console.WriteLine(c);
which gives me the correct result. I was wondering if there is a better way of doing this, because I have to do this in quite a few places.
Note: the length of SubdirectoryC is unknown, as it is made up of the numbers/letters a user inputs.
There is Path.GetDirectoryName:
string a = @"\\server\MainDirectory\SubDirectoryA\SubdirectoryB\SubdirectoryC\Test.jpg";
string b = Path.GetDirectoryName(Path.GetDirectoryName(a));
As explained on MSDN, it also works if you pass a directory:
....passing the returned path back into the GetDirectoryName method will
result in the truncation of one folder level per subsequent call on
the result string
Of course, this is only safe if you have at least two directory levels.
Heyho,
if you just want to get rid of the last part, you can use:
var parentDirectory = Directory.GetParent(Path.GetDirectoryName(path));
https://msdn.microsoft.com/de-de/library/system.io.directory.getparent(v=vs.110).aspx
An alternative answer using Linq:
var b = string.Join("\\", a.Split(new string[] { "\\" }, StringSplitOptions.None)
.Reverse().Skip(2).Reverse());
Some alternatives:
string a = @"\\server\MainDirectory\SubDirectoryA\SubdirectoryB\SubdirectoryC\Test.jpg";
var b = Path.GetFullPath(a + @"\..\..");
var c = a.Remove(a.LastIndexOf('\\', a.LastIndexOf('\\') - 1));
but I do find this kind of string extension generally useful:
// note: extension methods must be declared inside a static class
static string beforeLast(this string str, string delimiter)
{
    int i = str.LastIndexOf(delimiter);
    if (i < 0) return str;
    return str.Remove(i);
}
For such repeated tasks, a good solution is often to write an extension method, e.g.
public static class Extensions
{
public static string ChopPath(this string path)
{
// chopping code here
}
}
Which you then can use anywhere you need it:
var chopped = a.ChopPath();
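To make the placeholder above concrete, here is one possible body for such an extension method, as a sketch only. It assumes backslash-separated Windows/UNC paths like the one in the question and uses plain string operations; the optional levels parameter and the class name PathExtensions are additions for this illustration, not part of the original answer.

```csharp
using System;

public static class PathExtensions
{
    // Removes the last 'levels' backslash-separated segments from 'path'.
    // Stops early if the path runs out of separators.
    public static string ChopPath(this string path, int levels = 1)
    {
        string result = path;
        for (int i = 0; i < levels; i++)
        {
            int cut = result.LastIndexOf('\\');
            if (cut < 0) break;
            result = result.Remove(cut);
        }
        return result;
    }
}
```

For the path in the question, a.ChopPath(2) would drop both Test.jpg and SubdirectoryC. If the paths always point at real file system locations, the Path.GetDirectoryName approach shown earlier may be preferable, since it understands path rules that this string-based version ignores.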

Extract multiple values from string using C#

I'm creating my own forum and have a problem with quoting messages. I know how to add a quoted message to the text box, but I cannot figure out how to extract values from the string after the post is submitted. In the text box I've got something like this:
[quote IdPost=8] Some quoting text [/quote]
[quote IdPost=15] Second quoting text [/quote]
Could you tell me the easiest way to extract all "IdPost" numbers from the string after the form is posted?
By using a regex:
@"\[quote IdPost=(\d+)\]"
something like
Regex reg = new Regex(@"\[quote IdPost=(\d+)\]");
foreach (Match match in reg.Matches(text))
{
...
}
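The loop body is left open above. As a hedged sketch of one way to fill it in, the captured digits are available through match.Groups[1]. QuoteParser and ExtractPostIds are hypothetical names, and the pattern assumes the tag is written exactly as in the question (single space, no extra attributes).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class QuoteParser
{
    // Returns every IdPost number found in [quote IdPost=N] tags, in order of appearance.
    public static List<int> ExtractPostIds(string text)
    {
        var ids = new List<int>();
        foreach (Match match in Regex.Matches(text, @"\[quote IdPost=(\d+)\]"))
        {
            int id;
            // Group 1 holds the digits captured by (\d+); skip values too large for an int.
            if (int.TryParse(match.Groups[1].Value, out id))
            {
                ids.Add(id);
            }
        }
        return ids;
    }
}
```

For the two-line example from the question this would yield 8 and 15.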
var originalstring = "[quote IdPost=8] Some quoting text [/quote]";
//"[quote IdPost=" and "8] Some quoting text [/quote]"
var splits = originalstring.Split('=');
if(splits.Count() == 2)
{
//"8" and "] Some quoting text [/quote]"
var splits2 = splits[1].Split(']');
int id;
if(int.TryParse(splits2[0], out id))
{
return id;
}
}
I do not know exactly what your string looks like, but here is a regex-free solution using Substring:
using System;
public class Program
{
public static void Main()
{
string source = "[quote IdPost=8] Some quoting text [/quote]";
Console.WriteLine(ExtractNum(source, "=", "]"));
Console.WriteLine(ExtractNum2(source, "[quote IdPost="));
}
public static string ExtractNum(string source, string start, string end)
{
int index = source.IndexOf(start) + start.Length;
return source.Substring(index, source.IndexOf(end) - index);
}
// just another solution for fun
public static string ExtractNum2(string source, string junk)
{
source = source.Substring(junk.Length, source.Length - junk.Length); // erase start
return source.Remove(source.IndexOf(']')); // erase end
}
}
