I'm fairly new to C# and I'm looking for a way to search a string for a particular sequence:
string mytext = "I want to find t56b45 in a string";
In the above example I would like to search mytext for the position of "t", but only when it is followed by any two numeric characters, then a "b", then any two more numeric characters. If I find that sequence ("t" + two digits + "b" + two digits), I would like to create a substring up to that position, i.e. the result string would read "I want to find".
Use a Regex:
using System;
using System.Text.RegularExpressions;
public class Program
{
public static void Main()
{
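// .*      : Any characters (greedy); this is the part that ends up in match.Value
// (?=...) : Positive lookahead; what follows must match but is not part of the result
// \s      : Matches a whitespace character
// t       : Exact match t
// \d{2}   : Any digit, 2 repetitions
// b       : Exact match b
// \d{2}   : Any digit, 2 repetitions
var match = Regex.Match("I want to find t56b45 in a string", @".*(?=\st\d{2}b\d{2})");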
if(match.Success)
Console.WriteLine("\"" + match.Value + "\"");
else
Console.WriteLine("Nothing found.");
// Outputs: "I want to find"
}
}
Fiddle:
https://dotnetfiddle.net/kguwDW
So this is the code I ended up using. It seems to get the job done and ignores case:
var tempmatch = Regex.Match(TempCleaned, "(?i)y[0-9]+(?i)z[0-9]+");
if (tempmatch.Success)
{
// Remove everything from YxxZxx onward, plus any whitespace in front of it.
// (Index - 1 would throw if the match started at position 0.)
string NewName = TempCleaned.Substring(0, tempmatch.Index).TrimEnd();
}
Related
I need to check whether a sentence contains any of the words from a string array, but the check should ignore special characters such as commas. The result should still be the original sentence.
For example, I have a sentence "Tesla car price is $ 250,000."
In my word array I have wrdList = new string[]{ "250000", "Apple", "40.00"};
I have written the line of code below, but it returns no result because 250,000 and 250000 do not match.
List<string> res = row.ItemArray.Where(itmArr => wrdList.Any(wrd => itmArr.ToString().ToLower().Contains(wrd.ToString()))).OfType<string>().ToList();
One important thing: I need to get the original sentence back if it matches the string array.
For example, result should be "Tesla car price is $ 250,000."
not like "Tesla car price is $ 250000."
How about Replace(",", "")
itmArr.ToString().ToLower().Replace(",", "").Contains(wrd.ToString())
Side note: .ToLower() isn't required since digits are case insensitive, and a string doesn't need .ToString()
so the result could also be
itmArr.Replace(",", "").Contains(wrd)
https://dotnetfiddle.net/A2zN0d
Update
Since the , could be a different character depending on the culture, you can also use
System.Threading.Thread.CurrentThread.CurrentCulture.NumberFormat.NumberGroupSeparator
instead.
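A minimal sketch of the culture-aware variant, assuming the only goal is to ignore the group separator while comparing. The class and method names are hypothetical, and the culture is passed in explicitly instead of being read from the current thread so the behaviour is easy to check:

```csharp
using System;
using System.Globalization;

public static class SeparatorMatching
{
    // Hypothetical helper: true when `sentence` contains `word` once the
    // culture's number group separator (e.g. "," for the invariant culture)
    // has been stripped from a copy of the sentence.
    public static bool ContainsIgnoringGroupSeparator(string sentence, string word, CultureInfo culture)
    {
        string separator = culture.NumberFormat.NumberGroupSeparator;

        // string.Replace throws on an empty search string, so guard against it.
        string normalized = string.IsNullOrEmpty(separator)
            ? sentence
            : sentence.Replace(separator, "");

        return normalized.Contains(word);
    }
}
```

Only the copy is normalized, so the original sentence is untouched and can be returned as-is when the check succeeds.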
The first option to consider for most text matching problems is to use regular expressions. This will work for your problem. The core part of the solution is to construct an appropriate regular expression to match what you need to match.
You have a list of words, but I'll focus on just one word. Your requirements specify that you want to match on a "word". So to start with, you can use the "word boundary" pattern \b. To match the word "250000", the regular expression would be \b250000\b.
Your requirements also specify that the word can "contain" characters that are "special". For it to work correctly, you need to be clear what it means to "contain" and which characters are "special".
For the "contain" requirement, I'll assume you mean that the special character can be between any two characters in the word, but not the first or last character. So for the word "250000", any of the question marks in this string could be a special character: "2?5?0?0?0?0".
For the "special" requirement, there are options that depend on your requirements. If it's simply punctuation, you can use the character class \p{P}. If you need to specify a specific list of special characters, you can use a character group. For example, if your only special character is comma, the character group would be [,].
To put all that together, you would create a function to build the appropriate regular expression for each target word, then use that to check your sentence. Something like this:
public static void Main()
{
string sentence = "Tesla car price is $ 250,000.";
var targetWords = new string[]{ "250000", "350000", "400000"};
Console.WriteLine($"Contains target word? {ContainsTarget(sentence, targetWords)}");
}
private static bool ContainsTarget(string sentence, string[] targetWords)
{
return targetWords.Any(targetWord => ContainsTarget(sentence, targetWord));
}
private static bool ContainsTarget(string sentence, string targetWord)
{
string targetWordExpression = TargetWordExpression(targetWord);
var re = new Regex(targetWordExpression);
return re.IsMatch(sentence);
}
private static string TargetWordExpression(string targetWord)
{
var sb = new StringBuilder();
// If special characters means a specific list, use this:
string specialCharacterMatch = "[,]?";
// If special characters means any punctuation, then you can use this:
//string specialCharacterMatch = "\\p{P}?";
bool any = false;
foreach (char c in targetWord)
{
if (any)
{
sb.Append(specialCharacterMatch);
}
any = true;
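// Escape the character so that regex metacharacters such as '.' in "40.00" match literally
sb.Append(Regex.Escape(c.ToString()));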
}
return $"\\b{sb}\\b";
}
Working code: https://dotnetfiddle.net/5UJSur
Hope the solution below helps.
It uses a regular expression to remove non-alphanumeric characters (other than space and hyphen) from a copy of the sentence.
It returns the original string if the cleaned copy contains any matching word from wrdList.
string s = "Tesla car price is $ 250,000.";
string[] wrdList = new string[3] { "250000", "Apple", "40.00" };
Regex rgx = new Regex("[^a-zA-Z0-9 -]");
string str = rgx.Replace(s, "");
if (wrdList.Any(str.Contains))
{
Console.Write(s);
}
else
{
Console.Write("No Match Found!");
}
Uploaded to a fiddle for more exploration:
https://dotnetfiddle.net/zbwuDy
In addition, a paragraph can be split into an array of sentences that you then iterate through. See the fiddle below.
https://dotnetfiddle.net/AvO6FJ
I'm having trouble finding a pattern that solves this problem in one step.
The string looks like this:
Text1
Text1$Text2$Text3
Text1$Text2$Text3$Text4$Text5$Text6 etc.
What I want to get is: take up to 4x Text. If there are more than 4x Text, take only the last character of each additional one.
Example:
Text1$Text2$Text3$Text4$Text5$Text6 -> Text1$Text2$Text3$Text4&56
My current solution is:
First pattern:
^([^\$]*)\$?([^\$]*)\$?([^\$]*)\$?([^\$]*)\$?
After this i will do a substitution with the first pattern
New string: Text5$Text6
second pattern is:
([^\$])\b
result: 56
combine both and get the result:
Text1$Text2$Text3$Text4$56
It is not clear to me why I can't simply put the second pattern after the first to form one pattern. Is there something like an anchor that tells the engine to start the pattern from here, as it would if it were the only pattern?
You might use an alternation with a positive lookbehind and then concatenate the matches.
(?<=^(?:[^$]+\$){0,3})[^$]+\$?|[^$](?=\$|$)
Explanation
(?<= Positive lookbehind, assert what is on the left is
^(?:[^$]+\$){0,3} From the start of the string, repeat 0-3 times: 1+ chars other than $ followed by a $
) Close lookbehind
[^$]+\$? Match 1+ times any char except $, then match an optional $
| Or
[^$] Match any char except $
(?=\$|$) Positive lookahead, assert what is directly to the right is either $ or the end of the string
Example
string pattern = @"(?<=^(?:[^$]*\$){0,3})[^$]*\$?|[^$](?=\$|$)";
string[] strings = {
"Text1",
"Text1$Text2$Text3",
"Text1$Text2$Text3$Text4$Text5$Text6"
};
Regex regex = new Regex(pattern);
foreach (String s in strings) {
Console.WriteLine(string.Join("", from Match match in regex.Matches(s) select match.Value));
}
Output
Text1
Text1$Text2$Text3
Text1$Text2$Text3$Text4$56
I strongly believe a regular expression isn't the way to do this, mostly because of readability.
You may consider a simple algorithm like this one to reach your goal:
using System;
public class Program
{
public static void Main()
{
var input = "Text1$Text2$Text3$Text4$Text5$Text6";
var parts = input.Split('$');
var result = "";
for(var i=0; i<parts.Length; i++){
// Indexes 0..3 are the first four parts; later parts keep only what follows "Text"
result += (i < 4 ? parts[i] + "$" : parts[i].Substring(4));
}
Console.WriteLine(result);
}
}
There are also LINQ alternatives:
using System;
using System.Linq;
public class Program
{
public static void Main()
{
var input = "Text1$Text2$Text3$Text4$Text5$Text6";
var parts = input.Split('$');
var first4 = parts.Take(4);
var remainings = parts.Skip(4);
var result2 = string.Join("$", first4) + "$" + string.Join("", remainings.Select( r=>r.Substring(4)));
Console.WriteLine(result2);
}
}
It has to be adjusted to the actual needs but the idea is there
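One possible adjustment, sketched under two assumptions that are mine rather than the question's: parts are separated by '$', and "the last sign" means the last character of each extra part. The names Shortener and KeepFirstParts are made up for illustration. Unlike the snippets above, it does not rely on every part starting with a four-letter prefix and it adds no trailing "$" when there are four parts or fewer:

```csharp
using System;
using System.Linq;

public static class Shortener
{
    // Keeps the first `keep` parts intact and reduces every later part
    // to its last character, e.g. "Text5" -> "5".
    public static string KeepFirstParts(string input, int keep = 4)
    {
        string[] parts = input.Split('$');

        string head = string.Join("$", parts.Take(keep));

        // Empty parts (from "$$") have no last character, so they are skipped.
        string tail = string.Concat(
            parts.Skip(keep)
                 .Where(p => p.Length > 0)
                 .Select(p => p[p.Length - 1]));

        return tail.Length == 0 ? head : head + "$" + tail;
    }
}
```

With the sample inputs from the question, "Text1" and "Text1$Text2$Text3" come back unchanged, and the six-part string becomes "Text1$Text2$Text3$Text4$56".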
Try this code:
var texts = new string[] {"Text1", "Text1$Text2$Text3", "Text1$Text2$Text3$Text4$Text5$Text6" };
var parsed = texts
.Select(s => Regex.Replace(s,
@"(Text\d{1,3}(?:\$Text\d{1,3}){0,3})((?:\$Text\d{1,3})*)",
(match) => match.Groups[1].Value +"$"+ match.Groups[2].Value.Replace("Text", "").Replace("$", "")
)).ToArray();
// parsed is now: string[3] { "Text1$", "Text1$Text2$Text3$", "Text1$Text2$Text3$Text4$56" }
Explanation:
solution uses regex pattern: (Text\d{1,3}(?:\$Text\d{1,3}){0,3})((?:\$Text\d{1,3})*)
(...) - first capturing group
(?:...) - non-capturing group
Text\d{1,3}(?:\$Text\d{1,3}) - match Text literally, then \d{1,3}, which is one to three digits; \$ matches $ literally
The rest is just repetition of that. Basically, the first group captures the first four pieces and the second group captures the rest, if any.
We also use a MatchEvaluator here, which is a delegate type defined as:
public delegate string MatchEvaluator(Match match);
We define such a method as a lambda:
(match) => match.Groups[1].Value +"$"+ match.Groups[2].Value.Replace("Text", "").Replace("$", "")
It is used to evaluate each match: take the first capturing group and concatenate it with the second, stripping the unnecessary text.
It's not clear to me whether your goal can be achieved using exclusively regex. If nothing else, the fact that you want to introduce a new character '&' into the output adds to the challenge, since just plain matching would never be able to accomplish that. Possibly using the Replace() method? I'm not sure that would work though...using only a replacement pattern and not a MatchEvaluator, I don't see a way to recognize but still exclude the "$Text" portion from the fifth instance and later.
But, if you are willing to mix regex with a small amount of post-processing, you can definitely do it:
static readonly Regex regex1 = new Regex(@"(Text\d(?:\$Text\d){0,3})(?:\$Text(\d))*", RegexOptions.Compiled);
static void Main(string[] args)
{
for (int i = 1; i <= 6; i++)
{
string text = string.Join("$", Enumerable.Range(1, i).Select(j => $"Text{j}"));
Console.WriteLine(KeepFour(text));
}
}
private static string KeepFour(string text)
{
Match match = regex1.Match(text);
if (!match.Success)
{
return "[NO MATCH]";
}
StringBuilder result = new StringBuilder();
result.Append(match.Groups[1].Value);
if (match.Groups[2].Captures.Count > 0)
{
result.Append("&");
// Have to iterate (join), because we don't want the whole match,
// just the captured text.
result.Append(JoinCaptures(match.Groups[2]));
}
return result.ToString();
}
private static string JoinCaptures(Group group)
{
return string.Join("", group.Captures.Cast<Capture>().Select(c => c.Value));
}
The above breaks your requirement into two capture groups in a regex: the first holds up to four leading items, and the second collects one capture per remaining digit. It then extracts the captured text and composes the output from it.
I'm trying to parse a bunch of files with the string Replace method. While it does what I expect, it doesn't feel practical: I will process 10K files, and in the first 72 alone I found about 30 values that need to be replaced.
My goal:
Keep a ':' only when one of these rules applies:
1- The 2nd or 3rd character forward is another ':'
2- The 2nd or 3rd character backward is another ':'
All other instances should be replaced.
In other words, any time I find a ':' that is not part of a tag with two or three characters between the colons, like :00: or :12A:, I should replace it with a '*'.
This is the method that I have so far:
private static string cleanMesage(string str)
{
string result = String.Empty;
try
{
result = str.Replace("BNF:", "BNF*").Replace("B/O:", "B/O*").Replace("O/B:", "O/B*");
result = result.Replace("Epsas:", "Epsas*").Replace("2017:", "2017*").Replace("BANK:", "BANK*");
result = result.Replace("CDT:", "CDT*").Replace("ENT:", "").Replace("GB22:", "GB22*");
result = result.Replace("A / C:", "A/C*").Replace("ORD:", "ORD*").Replace("A/C:", "A/C*");
result = result.Replace("REF:", "REF*").Replace("ISIN:", "ISIN*").Replace("PAY:", "PAY*");
result = result.Replace("DEPOSITO:", "DEPOSITO*").Replace("WITH:", "WITH*");
result = result.Replace("Operaciones:", "Operaciones*").Replace("INST:", "INST*");
result = result.Replace("DETAIL:", "DETAIL*").Replace("WITH:", "WITH*").Replace("BO:", "BO*");
result = result.Replace("CUST:", "CUST*").Replace("ISIN:", "ISIN*").Replace("SEDL:", "SEDL*");
result = result.Replace("Enero:", "Enero*").Replace("enero:", "Enero*");
result = result.Replace("agosto:", "agosto*").Replace("febrero:", "febrero*");
result = result.Replace("marzo:", "marzo*").Replace("abril:", "abril*");
result = result.Replace("mayo:", "mayo*").Replace("junio:", "junio*").Replace("RE:", "RE:*");
result = result.Replace("julio:", "julio*").Replace("septiembre:", "septiembre*");
result = result.Replace("NIF:", "NIF*").Replace("INST:", "INST*").Replace("SHS:", "SHS*")
.Replace("SK:", "");
result = result.Replace("PARTY:", "PARTY*").Replace("SEDOL:", "SEDOL*").Replace("PD:", "PD*");
}
catch (Exception e)
{
}
return result;
}
And this is some sample data:
:13: <-- keep /ISIN/XS SVUNSK UXPORTKRUDIT ZX PZY DZTU:<- replace UX DZ
TU:<- replace02ZUG12 RZTU:<- replace W/H TZX RZTU:<- replace0.00000 SHZRUS PZID:<- replace
0.000000 IDDSIN:<- replace
:31: <-- keep 1201000100CD05302,24NSUC20523531001//00520023531014
:13: <-- keep /ISIN/XS0153242003 SVUNSK UXPORTKRUDIT ZX PZY DZTU:<- replace00ZUG12 UX DZ
TU:02ZUG12 RZTU:0.30241 W/H TZX RZTU:<- replace0.00000 SHZRUS PZID:<- replace
0.000000 ISIN:XS0153242003
:31: <-- keep 1201000100DD121253,25S202IMSSMSZUX534C//S0322211DF4301
S F/O 0150001400
:13: <-- keep XNF:<- replace this
If your goal is to replace every ':' that is not followed by 2 or 3 other characters, you could indeed try the System.Text.RegularExpressions library and simplify your cleanMessage function in the following way.
using System.Text.RegularExpressions;
private static string cleanMessage(string str)
{
string pattern = @":(\s)"; // A ':' followed by a whitespace character
Regex rgx = new Regex(pattern);
string replaceResult = rgx.Replace(str, "*$1"); // Replaces the match with a '*' followed by the captured whitespace
return replaceResult;
}
If your goal is to replace every ':' that is not followed by 2 or 3 other characters and where the 2nd or 3rd character forward or backward is not another ':', you could change your cleanMessage to the following instead.
using System.Text.RegularExpressions;
private static string cleanMessage(string str)
{
string pattern = @"([^:]{2}.):(\s[^:]{2})";
// This is 2 characters that cannot be ':' followed by any character, then a ':', then whitespace and 2 more characters that cannot be ':'
// For instance, "BNF: :F" would FAIL and not get replaced, but "BNF: HH" would pass and become "BNF* HH"
Regex rgx = new Regex(pattern);
string replaceResult = rgx.Replace(str, "$1*$2"); // Replaces the ':' with a '*'
return replaceResult;
}
More information on the System.Text.RegularExpressions library replace can be found at
https://msdn.microsoft.com/en-us/library/xwewhkd1(v=vs.110).aspx
As @dymanoid mentioned, regular expressions are a way to handle this. By using the following you'd get what you want:
result = Regex.Replace(str, @"([a-zA-Z0-9]{2,3}):", "$1*");
However for large datasets this won't perform well. In that case I'd look at walking through str character by character using a for-loop. If the current character is not a colon, add it to the result string and to a temporary string. When the current character is a colon (:) and the temporary string has a length of 2 or 3, write an asterisk to the result and clear the temporary string.
In this case you don't do any string replacement, you just select what to write to a new string.
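A rough sketch of such a single pass, written against my reading of the question rather than the exact temporary-string bookkeeping described above: a ':' is kept when another ':' sits exactly 3 or 4 positions away in either direction (two or three characters in between, as in :13: or :12A:), and every other ':' becomes '*'. The class and method names are hypothetical, and the distance rule is an assumption that may need adjusting for the real data:

```csharp
using System;
using System.Text;

public static class ColonCleaner
{
    // Copies the input once, writing '*' in place of every "loose" colon.
    public static string ReplaceLooseColons(string input)
    {
        var sb = new StringBuilder(input.Length);
        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            sb.Append(c == ':' && !IsTagColon(input, i) ? '*' : c);
        }
        return sb.ToString();
    }

    // Assumed rule: a colon belongs to a tag when another colon is
    // 3 or 4 positions before or after it.
    private static bool IsTagColon(string s, int i)
    {
        foreach (int gap in new[] { 3, 4 })
        {
            if (IsColonAt(s, i + gap) || IsColonAt(s, i - gap))
            {
                return true;
            }
        }
        return false;
    }

    private static bool IsColonAt(string s, int i)
    {
        return i >= 0 && i < s.Length && s[i] == ':';
    }
}
```

For example, ":13: /ISIN/XS PZY DZTU: UX" comes back as ":13: /ISIN/XS PZY DZTU* UX". No intermediate strings are created, which is the point of the character-by-character approach.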
See here for a speed comparison between string replacement and regex replacement.
Is there a way to convert a string without spaces into a proper sentence?
E.g. "WhoAmI" needs to be converted to "Who Am I"
A regex replacement would do this, if you're just talking about inserting a space before each capital letter:
using System;
using System.Text.RegularExpressions;
class Test
{
static void Main()
{
var input = "WhoAmI";
var output = Regex.Replace(input, @"\p{Lu}", " $0").TrimStart();
Console.WriteLine(output);
}
}
However, I suspect there will be significant corner cases. Note that the above uses \p{Lu} instead of just [A-Z] to cope with non-ASCII capital letters; you may find A-Z simpler if you only need to deal with ASCII. The TrimStart() call is to remove the leading space you'd get otherwise.
If every word in the string is starting with uppercase you may just convert each part that is starting with uppercase to a space separated string.
You can use LINQ
string words = "WhoAmI";
string sentence = String.Concat(words.Select(letter => Char.IsUpper(letter) ? " " + letter
: letter.ToString()))
.TrimStart();
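One way to sidestep the leading space, sketched under the assumption that a word boundary is always a lowercase letter directly followed by an uppercase letter, is to match the empty position between the two with a lookbehind and a lookahead. The class and method names are made up for illustration:

```csharp
using System.Text.RegularExpressions;

public static class PascalCaseSplitter
{
    // Inserts a space at every position that has a lowercase letter on its
    // left and an uppercase letter on its right. Nothing is consumed, so no
    // TrimStart() is needed afterwards.
    public static string InsertSpaces(string input)
    {
        return Regex.Replace(input, @"(?<=\p{Ll})(?=\p{Lu})", " ");
    }
}
```

"WhoAmI" becomes "Who Am I". Runs of consecutive capitals (acronyms, for instance) are left unsplit, which may or may not be what you want.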
Problem!
I have the following input (rules) from a flat file (talking about numeric input):
Input might be a natural number (below 1000): 1, 10, 100, 999, ...
Input might be a comma-separated number surrounded by quotes (1000 and above): "1,000", "2,000", "3,000", "10,000", ...
I have the following regular expression to validate the input: (?:(\d+)|\x22([0-9]+(?:,[0-9]+)*)\x22). For an input like 10 I expect 10 in the first matching group, which is exactly what I get. But for an input like "10,000" I expect 10,000 in the first matching group, and instead it is stored in the second matching group.
Example
string text1 = "\"" + "10,000" + "\"";
string text2 = "50";
string pattern = @"(\d+)|\x22([0-9]+(?:,[0-9]+){0,})\x22";
Match match1 = Regex.Match(text1, pattern);
Match match2 = Regex.Match(text2, pattern);
if (match1.Success)
{
Console.WriteLine("Match#1 Group#1: " + match1.Groups[1].Value);
Console.WriteLine("Match#1 Group#2: " + match1.Groups[2].Value);
// Outputs
// Match#1 Group#1:
// Match#1 Group#2: 10,000
}
if (match2.Success)
{
Console.WriteLine("Match#2 Group#1: " + match2.Groups[1].Value);
Console.WriteLine("Match#2 Group#2: " + match2.Groups[2].Value);
// Outputs
// Match#2 Group#1: 50
// Match#2 Group#2:
}
Expected Result
Both results on the same matching group, in this case 1
Questions?
What am I doing wrong? I'm just getting bad grouping from the regular expression matches.
Also, I'm using FileHelpers .NET to parse the file. Is there any other way to resolve this problem? Actually, I'm trying to implement a custom converter.
Object File
[FieldConverter(typeof(OOR_Quantity))]
public Int32 Quantity;
OOR_Quantity
internal class OOR_Quantity : ConverterBase
{
public override object StringToField(string from)
{
string pattern = @"(?:(\d+)|\x22([0-9]+(?:,[0-9]+)*)\x22)";
Regex regex = new Regex(pattern);
if (regex.IsMatch(from))
{
Match match = regex.Match(from);
return int.Parse(match.Groups[1].Value);
}
throw new ...
}
}
Group numbers are assigned purely on the basis of their positions in the regex--specifically, the relative position of the opening bracket, (. In your regex, (\d+) is the first group and ([0-9]+(?:,[0-9]+)*) is the second.
If you want to refer to them both with the same identifier, use named groups and give them both the same name:
@"(?:(?<NUMBER>\d+)|\x22(?<NUMBER>[0-9]+(?:,[0-9]+)*)\x22)"
Now you can retrieve the captured value as match.Groups["NUMBER"].Value.
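As a sketch of how the named group could feed the conversion, here is a standalone helper. It leaves out the FileHelpers ConverterBase plumbing; the class name QuantityParser, the added ^ and $ anchors, and the use of the invariant culture are my assumptions, not part of the original converter. Note that int.Parse needs NumberStyles.AllowThousands to accept a value like 10,000:

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class QuantityParser
{
    // Both alternatives capture into the same named group, NUMBER.
    // The ^ and $ anchors are added here so that partial matches are rejected.
    private static readonly Regex Pattern = new Regex(
        @"^(?:(?<NUMBER>\d+)|\x22(?<NUMBER>[0-9]+(?:,[0-9]+)*)\x22)$");

    public static int Parse(string from)
    {
        Match match = Pattern.Match(from);
        if (!match.Success)
        {
            throw new FormatException("Not a recognised quantity: " + from);
        }

        // AllowThousands lets int.Parse accept the "," group separators.
        return int.Parse(
            match.Groups["NUMBER"].Value,
            NumberStyles.AllowThousands,
            CultureInfo.InvariantCulture);
    }
}
```

Calling QuantityParser.Parse("50") gives 50, and passing the quoted text "10,000" (quotes included) gives 10000.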
I tested the regex below with Ruby:
text1 = "\"10,000\""
text2 = "50"
regex = /"?([0-9]+(?:,[0-9]+){0,})"?/
text1 =~ regex
puts "#$1"
text2 =~ regex
puts "#$1"
The result is:
10,000
50
I think you can rewrite this in C#. Would that be enough for you?