Position and length of matches in a regular expression - C#

I have text like this:
This is a sample {text}. I want to inform my {Dada} that I have some
data which is {not useful}. So I need data to start by { and ends with
}. This data needs to {find out}.
The text contains some substrings enclosed in curly braces {}. How can I find the starting position and length of each substring that starts with { and ends with }? Afterwards, I will replace each substring with a processed string.

With Regex.Match, you can check the index of each match by accessing the Index property, and the length of each match by checking the Length property.
If you want to count the curly braces in, you can use \{(.*?)\} regex, like this:
var txt = "This is a sample {text}. I want to inform my {Dada} that I have some data which is {not useful}. So I need data to start by { and ends with }. This data needs to {find out}.";
var rgx1 = new Regex(@"\{(.*?)\}");
var matches = rgx1.Matches(txt);
// Get the 1st capture groups (the text between the braces)
var all_matches = matches.Cast<Match>().Select(p => p.Groups[1].Value).ToList();
// Get the indexes of the matches (braces included)
var idxs = matches.Cast<Match>().Select(p => p.Index).ToList();
// Get the lengths of the matches (braces included)
var lens = matches.Cast<Match>().Select(p => p.Length).ToList();
Outputs:
Perhaps you will want to use a dictionary of search and replace terms, which will be more efficient:
var dic = new Dictionary<string, string>();
dic.Add("old", "new");
var ttxt = "My {old} car";
// And then use the keys to replace with the values
var output = rgx1.Replace(ttxt, match => dic[match.Groups[1].Value]);
Output:

If you know you will not have nested curly braces, you can use the following:
var input = @"This is a sample {text}. I want to inform my {Dada} that I have some data which is {not useful}. So I need data to start by { and ends with }. This data needs to {find out}.";
// the capture group holds the text between the braces
var pattern = @"\{([^{}]*)\}";
foreach (Match match in Regex.Matches(input, pattern)) {
// Groups[1] is the text inside the braces;
// match.Index and match.Length describe the whole token, braces included
string subString = match.Groups[1].Value;
int start = match.Groups[1].Index;
int length = match.Groups[1].Length;
}
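To tie the two answers above together, here is a minimal sketch that reports the position and length of every braced token and then replaces each token with a processed string. The class name BraceSpans and the methods Find and ReplaceAll are hypothetical names chosen for illustration, and the sketch assumes braces are never nested.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class BraceSpans
{
    // Returns the start index and length of every {...} token, braces included.
    public static List<(int Index, int Length)> Find(string text)
    {
        var spans = new List<(int Index, int Length)>();
        foreach (Match m in Regex.Matches(text, @"\{[^{}]*\}"))
            spans.Add((m.Index, m.Length));
        return spans;
    }

    // Replaces every {...} token with process(text between the braces).
    public static string ReplaceAll(string text, Func<string, string> process)
    {
        return Regex.Replace(text, @"\{([^{}]*)\}", m => process(m.Groups[1].Value));
    }

    public static void Main()
    {
        var text = "My {old} car and {not useful} data";
        foreach (var (index, length) in Find(text))
            Console.WriteLine($"{index},{length}: {text.Substring(index, length)}");
        Console.WriteLine(ReplaceAll(text, s => s.ToUpperInvariant()));
    }
}
```

Letting Regex.Replace do the substitution avoids having to adjust the recorded indexes by hand as the string grows or shrinks during replacement.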

Related

Regex How to Match 2 fields

How would I capture both the filenames inside the quotes and the numbers following them as named captures (Regex / C#)?
Files("fileone.txt", 5969784, "file2.txt", 45345333)
For every occurrence in the string, I need to capture the filename (e.g. "fileone.txt") and the integer that follows it, so that a loop can process each pair.
I am trying to use this https://regex101.com/r/MwMzBo/1 but I am having issues matching without the '[' and ']'.
I need to be able to loop over each filename+size pair in turn.
Any help is appreciated!
UPDATE
string file = "Files(\"fileone.txt\", 5969784, \"file2.txt\", 45345333, \"file2.txt\", 45345333)";
var regex = new Regex(@"(?:\G(?!\A)\s*,\s*|\w+\()(?:""(?<file>.*?)""|'(?<file>.*?)')\s*,\s*(?<number>\d+)");
var match = regex.Match(file);
var names = match.Groups["file"].Captures.Cast<Capture>();
var lengths = match.Groups["number"].Captures.Cast<Capture>();
var filelist = names.Zip(lengths, (f, n) => new { file = f.Value, length = long.Parse(n.Value) }).ToArray();
foreach (var item in filelist)
{
// Only returning 1 pair result, ignoring the rest
}
Reading match.Value confirms what is being matched: only the first pair is picked up.
while (match.Success)
{
MessageBox.Show(match.Value);
match = match.NextMatch();
}
Now we are getting all the results properly. I read that Regex.Match only returns the first match, which explains a lot.
You can use
(?:\G(?!\A)\s*,\s*|\w+\()(?:""(?<file>.*?)""|'(?<file>.*?)')\s*,\s*(?<number>\d+)
See the regex demo
Details:
(?:\G(?!\A)\s*,\s*|\w+\() - end of the previous successful match and a comma enclosed with zero or more whitespaces, or a word and an opening ( char
(?:""(?<file>.*?)""|'(?<file>.*?)') - ", Group "file" capturing any zero or more chars other than a newline char as few as possible and then a ", or a ', Group "file" capturing any zero or more chars other than a newline char as few as possible and then a '
\s*,\s* - a comma enclosed with zero or more whitespaces
(?<number>\d+) - Group "number": one or more digits.
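A minimal sketch of how the pattern above could be used to collect every filename and size pair. It relies on Regex.Matches, which returns all matches rather than only the first. The class name FilePairs and the method Parse are hypothetical names used for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class FilePairs
{
    // Same pattern as in the answer above, written as a verbatim string.
    private static readonly Regex PairRegex = new Regex(
        @"(?:\G(?!\A)\s*,\s*|\w+\()(?:""(?<file>.*?)""|'(?<file>.*?)')\s*,\s*(?<number>\d+)");

    // Returns one (file, size) tuple per match.
    public static List<(string File, long Size)> Parse(string input)
    {
        var pairs = new List<(string File, long Size)>();
        foreach (Match m in PairRegex.Matches(input))
            pairs.Add((m.Groups["file"].Value, long.Parse(m.Groups["number"].Value)));
        return pairs;
    }

    public static void Main()
    {
        var input = "Files(\"fileone.txt\", 5969784, \"file2.txt\", 45345333)";
        foreach (var (file, size) in Parse(input))
            Console.WriteLine($"{file}: {size}");
    }
}
```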
I like doing it in smaller pieces :
string input = "cov('Age', ['5','7','9'])";
string pattern1 = @"\((?'key'[^,]+),\s+\[(?'values'[^\]]+)";
Match match = Regex.Match(input, pattern1);
string key = match.Groups["key"].Value.Trim(new char[] {'\''});
string pattern2 = @"'(?'value'[^']+)'";
string values = match.Groups["values"].Value;
MatchCollection matches = Regex.Matches(values, pattern2);
int[] number = matches.Cast<Match>().Select(x => int.Parse(x.Value.Replace("'",string.Empty))).ToArray();

Loop through string and remove any occurrence of specified word

I'm trying to remove all conjunctions and pronouns from an array of strings (let's call it array A). The words to be removed are read from a text file and converted into an array of strings (let's call it array B).
What I need is to take each element of array A and compare it to every word in array B; if the word matches, I want to delete it from array A.
For example:
array A = [0]I [1]want [2]to [3]go [4]home [5]and [6]sleep
array B = [0]I [1]and [2]go [3]to
Result= array A = [0]want [1]home [2]sleep
//remove any duplicates,conjunctions and Pronouns
public IQueryable<All_Articles> removeConjunctionsProNouns(IQueryable<All_Articles> myArticles)
{
//get words to be removed
string text = System.IO.File.ReadAllText("A:\\EnterpriceAssigment\\EnterpriceAssigment\\TextFiles\\conjunctions&ProNouns.txt").ToLower();
//split word into array of strings
string[] wordsToBeRemoved = text.Split(',');
//all articles
foreach (var article in myArticles)
{
//split articles into words
string[] articleSplit = article.ArticleContent.ToLower().Split(' ');
//loop through array of articles words
foreach (var y in articleSplit)
{
//loop through words to be removed from articleSplit
foreach (var x in wordsToBeRemoved)
{
//if word of articles matches word to be removed, remove word from article
if (y == x)
{
//get index of element in array to be removed
int g = Array.IndexOf(articleSplit,y);
//assign elemnt to ""
articleSplit[g] = "";
}
}
}
//re-assign splitted article to string
article.ArticleContent = articleSplit.ToString();
}
return myArticles;
}
If it is possible as well, I need array A to have no duplicates/distinct values.
You are looking for Enumerable.Except: it returns every element of the input sequence that is not present in the sequence passed as the parameter, and each such element is returned only once.
For example
string inputText = "I want this string to be returned without some words , but words should have only one occurence";
string[] excludedWords = new string[] {"I","to","be", "some", "but", "should", "have", "one", ","};
var splitted = inputText.Split(' ');
var result = splitted.Except(excludedWords);
foreach(string s in result)
Console.WriteLine(s);
// ---- Output ----
want
this
string
returned
without
words <<-- This appears only once
only
occurence
And applied to your code is:
string text = File.ReadAllText(......).ToLower();
string[] wordsToBeRemoved = text.Split(',');
foreach (var article in myArticles)
{
string[] articleSplit = article.ArticleContent.ToLower().Split(' ');
var result = articleSplit.Except(wordsToBeRemoved);
article.ArticleContent = string.Join(" ", result);
}
Your code already contains most of the answer, although it could be cleaned up a bit, as all our code could be. You loop through articleSplit and take each word, then compare that word with the words in the wordsToBeRemoved array one by one. When the comparison is true, you try to remove the item from the original array.
I would instead build a second collection with the results and then display or use that collection, which holds the article minus the words to exclude.
Loop through articleSplit:
foreach x in articleSplit
if wordsToBeRemoved does not contain x
newList.Add(x)
However, this is quite a bit of work. C# arrays have no filter method, but Array.FindAll or LINQ's Where do the same job. There are a hundred ways to achieve this.
Here are some helpful articles:
filter an array in C#
https://msdn.microsoft.com/en-us/library/d9hy2xwa(v=vs.110).aspx
These will save you from all that looping.
You want to remove stop words. You can do it with the help of LINQ:
...
string filePath = @"A:\EnterpriceAssigment\EnterpriceAssigment\TextFiles\conjunctions & ProNouns.txt";
// Hashset is much more efficient than array in the context
HashSet<string> stopWords = new HashSet<string>(File
.ReadLines(filePath), StringComparer.OrdinalIgnoreCase);
foreach (var article in myArticles) {
// read article, split into words, filter out stop words...
var cleared = article
.ArticleContent
.Split(' ')
.Where(word => !stopWords.Contains(word));
// ...and join words back into article
article.ArticleContent = string.Join(" ", cleared);
}
...
Please notice that I've preserved the Split() you used in your code, so this is a toy implementation. In real life you have to take at least punctuation into consideration, which is why better code uses regular expressions:
foreach (var article in myArticles) {
// read article, extract words, filter out stop words...
var cleared = Regex
.Matches(article.ArticleContent, @"\w+") // <- extract words
.OfType<Match>()
.Select(match => match.Value)
.Where(word => !stopWords.Contains(word));
// ...and join words back into article
article.ArticleContent = string.Join(" ", cleared);
}
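The question also asks for distinct words, which the Where-based versions above do not provide. A minimal sketch that adds Distinct to the regex-based pipeline follows. The class name StopWords and the method Clean are hypothetical names for illustration, and the sketch relies on the current LINQ-to-Objects behaviour of Distinct, which yields first occurrences in their original order.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class StopWords
{
    // Extracts words, drops stop words (case-insensitively) and removes duplicates.
    public static string Clean(string content, IEnumerable<string> stopWords)
    {
        var stop = new HashSet<string>(stopWords, StringComparer.OrdinalIgnoreCase);
        var words = Regex.Matches(content, @"\w+")
            .Cast<Match>()
            .Select(m => m.Value)
            .Where(w => !stop.Contains(w))
            .Distinct(StringComparer.OrdinalIgnoreCase);
        return string.Join(" ", words);
    }

    public static void Main()
    {
        var stopWords = new[] { "I", "and", "go", "to" };
        Console.WriteLine(Clean("I want to go home and sleep, and sleep", stopWords));
    }
}
```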

Multi Substring from long string

I have a long string. I need to take out only the substrings that are between { and } and turn each one into a JSON object.
This string
sys=t85,fggh{"Name":"5038.zip","Folder":"Root",,"Download":"services/DownloadFile.ashx?"} dsdfg x=565,dfg
{"Name":"5038.zip","Folder":"Root",,"Download":"services/DownloadFile.ashx?"}dfsdfg567
{"Name":"5038.zip","Folder":"Root",,"Download":"services/DownloadFile.ashx?"}sdfs
There is trash around the data, so I need to extract the substrings between { and }.
My code is here, but I'm stuck: I can't remove the data that I have already taken.
List<JsonTypeFile> AllFiles = new List<JsonTypeFile>();
int lenght = -1;
while (temp.Length>3)
{
lenght = temp.IndexOf("}") - temp.IndexOf("{");
temp=temp.Substring(temp.IndexOf("{"), lenght+1);
temp.Remove(temp.IndexOf("{"), lenght + 1);
var result = JsonConvert.DeserializeObject<SnSafe.JsonTypeFile>(temp);
AllFiles.Add(result);
}
Or using regex you can get the strings like this:
var regex = new Regex("{([^}]*)}");
var matches = regex.Matches(str);
var list = (from object m in matches select m.ToString().Replace("{",string.Empty).Replace("}",string.Empty)).ToList();
var jsonList = JsonConvert.SerializeObject(list);
The str variable containing your string as you provided in your question.
You can use a regex for this, but what I would do is use .Split('{') to split the string into sections, skip the first section, and then use .Split('}') to take the first portion of each section.
You can do this using LINQ
var data = temp
.Split('{')
.Skip(1)
.Select(v => v.Split('}').FirstOrDefault());
If I understand correctly, you just want to extract anything in-between the braces and ignore anything else.
The following regular expression should allow you to extract that info:
{[^}]*} (a brace, followed by anything that isn't a brace, followed by a brace)
You can extract all instances and then deserialize them using something along the lines of:
using System.Text.RegularExpressions;
...
List<JsonTypeFile> AllFiles = new List<JsonTypeFile>();
foreach(Match match in Regex.Matches(temp, "{[^}]*}"))
{
var result = JsonConvert.DeserializeObject<SnSafe.JsonTypeFile>(match.Value);
AllFiles.Add(result);
}

Regex.Split on two chars

Input string is: name = valu\=e;
I want to split it with Regex into: name and valu\=e;.
So the split expression should split on a = character that is not preceded by \.
I want to keep the spaces after name and before valu\=e. I couldn't show them here because SO trims them.
EDIT: The input string can contain many name=value pairs. Example: name=value;name2=value2;.
You can use this pattern:
@"(?<!\\)="
(?<!..) is a negative lookbehind assertion and means "not preceded by".
Heinzi's question is interesting. If you decide that an even number of backslashes doesn't escape the equal sign, you must replace the pattern with:
@"(?<![^\\](?:\\{2})*\\)="
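A minimal sketch of the first pattern in use with Regex.Split. The class name EscapedSplit and the method SplitPair are hypothetical names for illustration. Surrounding spaces are kept because nothing trims them.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapedSplit
{
    // Splits on every '=' that is not immediately preceded by a backslash.
    public static string[] SplitPair(string pair)
    {
        return Regex.Split(pair, @"(?<!\\)=");
    }

    public static void Main()
    {
        foreach (var part in SplitPair(@"name = valu\=e;"))
            Console.WriteLine("<<<" + part + ">>>");
    }
}
```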
Instead of using regular expressions, you might use 'regular' code. :)
string items = "name=value;name2 = valu= = e2";
// Split the list on items.
var itemlist = items.Split(';');
// Split each item after the first '='.
var nameValueArrayList = itemlist.Select(s => s.Split("=".ToCharArray(), 2));
// Convert the list of arrays to a dictionary
var nameValues = nameValueArrayList.ToDictionary(i => i[0], i => i[1]);
MessageBox.Show("<<<" + nameValues["name2 "] + ">>>");
Or in short:
string items = "name=value;name2 = valu= = e2";
var nameValues = items
.Split(';')
.Select(s => s.Split("=".ToCharArray(), 2))
.ToDictionary(i => i[0], i => i[1]);
MessageBox.Show("<<<" + nameValues["name2 "] + ">>>");
I personally think that code like this is easier to maintain or pull apart and modify when the specs change. And it gives you an actual dictionary from which you can pull values by their key.
Maybe it's possible to write this even a little shorter, but I'm still practicing with this. :)
(?<name>[^=]+)=(?<value>[^;]+;)
then use the named groups "name" and "value" to retrieve each part separately.
e.g.:
var matches = System.Text.RegularExpressions.Regex.Matches("myInput", @"(?<name>[^=]+)=(?<value>[^;]+;)");
foreach(Match match in matches)
{
var name = match.Groups["name"];
var value = match.Groups["value"];
doSomething(name, value);
}
EDIT:
I don't know why you say it won't work, here is what I get in LinqPad using the input you gave me in the comments:
void Main()
{
var matches = System.Text.RegularExpressions.Regex.Matches(@"zenek=ben\\\;\\ek;juzek=jozek;benek2=true;krowa=-2147483648;du-pa=\\\\\\3/\\\=3\;3\;;", @"(?<name>[^=]+)=(?<value>[^;]+;)");
foreach(Match match in matches)
{
var name = match.Groups["name"].Value;
var value = match.Groups["value"].Value;
("Name: "+name).Dump();
("Value: "+value).Dump();
}
}
Results:
Name: zenek
Value: ben\\\;
Name: \\ek;juzek
Value: jozek;
Name: benek2
Value: true;
Name: krowa
Value: -2147483648;
Name: du-pa
Value: \\\\\\3/\\\=3\;

Replace all alphanumeric characters in a string except pattern

I'm trying to obfuscate a string, but need to preserve a couple of patterns. Basically, all alphanumeric characters need to be replaced with a single character (say 'X'), but the following (example) patterns need to be preserved (note that each pattern has a single space at the beginning):
QQQ"
RRR"
I've looked through a few samples on negative lookaheads/lookbehinds, but still haven't had any luck with this (only testing QQQ).
var test = @"""SOME TEXT AB123 12XYZ QQQ""""empty""""empty""1A2BCDEF";
var regex = new Regex(@"((?!QQQ)(?<!\sQ{1,3}))[0-9a-zA-Z]");
var result = regex.Replace(test, "X");
The correct result should be:
"XXXX XXXX XXXXX XXXXX QQQ""XXXXX""XXXXX"XXXXXXXX
This works for an exact match, but will fail with something like ' QQR"', which returns
"XXXX XXXX XXXXX XXXXX XQR""XXXXX""XXXXX"XXXXXXXX
You can use this:
var regex = new Regex(@"((?> QQQ|[^A-Za-z0-9]+)*)[A-Za-z0-9]");
var result = regex.Replace(test, "$1X");
The idea is to match all that must be preserved first and to put it in a capturing group.
Since the target characters are always preceded by zero or more things that must be preserved, you only need to write this capturing group before [A-Za-z0-9]
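The same idea of matching what must be preserved first can also be sketched with an alternation and a match evaluator instead of a capturing group in front of the character class. This is a swapped-in variant, not the answer's own code. The class name Obfuscator, the method Mask and the group name keep are hypothetical, and the sketch assumes the preserved patterns are plain literal strings.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class Obfuscator
{
    // Replaces every ASCII letter or digit with 'mask', except inside the literal 'keep' patterns.
    public static string Mask(string input, IEnumerable<string> keep, char mask = 'X')
    {
        var escaped = keep.Select(p => Regex.Escape(p)).ToList();
        // Preserved patterns are tried first at each position; otherwise a single alphanumeric is matched.
        var pattern = escaped.Count == 0
            ? "[A-Za-z0-9]"
            : "(?<keep>" + string.Join("|", escaped) + ")|[A-Za-z0-9]";
        return Regex.Replace(input, pattern,
            m => m.Groups["keep"].Success ? m.Value : mask.ToString());
    }

    public static void Main()
    {
        var test = @"""SOME TEXT AB123 12XYZ QQQ""""empty""""empty""1A2BCDEF";
        Console.WriteLine(Mask(test, new[] { " QQQ\"", " RRR\"" }));
    }
}
```

Because a preserved pattern is consumed as its own match, it stays intact even when no alphanumeric character follows it in the input.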
Here's a non-regex solution. It works quite nicely, although it fails when a pattern occurs in the input more than once; it would need a better algorithm for finding occurrences. Note that it masks every character outside the patterns, not only the alphanumeric ones. You can compare it with a regex solution on large strings.
public static string ReplaceWithPatterns(this string input, IEnumerable<string> patterns, char replacement)
{
// locate the first occurrence of each pattern (IndexOf returns -1 when absent)
var patternsPositions = patterns.Select(p =>
new { Pattern = p, Index = input.IndexOf(p, StringComparison.Ordinal) })
.Where(i => i.Index >= 0)
.ToList();
// start from a fully masked string of the same length as the input
var result = new string(replacement, input.Length);
foreach(var p in patternsPositions)
// overwrite the masked characters with the pattern so the length stays unchanged
result = result.Remove(p.Index, p.Pattern.Length).Insert(p.Index, p.Pattern);
return result;
}
