Regular Expression - Get partial string


I have a list of project names that I need to match against. The list of projects could look something like this:
suzu
suzu-domestic
suzu-international
suzuran
suzuran-international
scorpion
scorpion-default
yada
yada-yada
etc
If the project searched for is suzu, I'd like to get the following results from the list:
suzu
suzu-domestic
suzu-international
but not anything containing suzuran. I'd also like to get the following matches if the project searched for is suzuran:
suzuran
suzuran-international
but not anything containing suzu.
In C# code I have something similar to this:
String searchForProject = "suzu";
String regStr = @"THE_REGEX_GOES_HERE"; // The regStr will be in a config file
List<Project> projects = DataWrapper.GetAllProjects();
Regex regEx = new Regex(String.Format(regStr, searchForProject));
result = new List<Project>();
foreach (Project proj in projects)
{
    if (regEx.IsMatch(proj.ProjectName))
    {
        result.Add(proj);
    }
}
The question is: can I have a regex that matches all exact project names, but not the ones that a StartsWith equivalent would also return?
(Today I have a regStr = @"^({0})#", but this does not satisfy the above scenario since it gives more hits than it should)
I'd appreciate if someone can give me a hint in the right direction. Thanks !
Magnus

All you actually need is
var regStr = @"^{0}\b";
The ^ anchor asserts the position at the beginning of the string.
The \b pattern matches at a word boundary: a position between a word character and a non-word character, or at the start or end of the string next to a word character. You do not need to match the rest of the string with .* because Regex.IsMatch only checks whether a match exists, so the extra pattern is redundant overhead.
C# test code:
var projects = new List<string>() { "suzu", "suzu-domestic", "suzu-international", "suzuran", "suzuran-international", "scorpion", "scorpion-default", "yada", "yada-yada" };
var searchForProject = "suzu";
var regStr = @"^{0}\b"; // The regStr will be in a config file
var regEx = new Regex(String.Format(regStr, searchForProject));
var result = new List<string>();
foreach (var proj in projects)
{
    if (regEx.IsMatch(proj))
    {
        result.Add(proj);
    }
}
The foreach may be replaced with a shorter LINQ:
var result = projects.Where(s => regEx.IsMatch(s)).ToList();
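One caveat with String.Format(regStr, searchForProject): the search term is inserted into the pattern verbatim, so a project name containing regex metacharacters (for example a dot) would change the meaning of the pattern. Below is a minimal sketch of one way to guard against that with Regex.Escape. The ProjectFilter class, the Filter method and the extra sample names "su.u-test" and "suzu_old" are made up for illustration and are not part of the original question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class ProjectFilter
{
    // Hypothetical helper: returns the names that start with the search term
    // followed by a word boundary. "template" stands in for the configured regStr.
    public static List<string> Filter(IEnumerable<string> names, string term, string template = @"^{0}\b")
    {
        // Regex.Escape makes metacharacters in the term match literally.
        var regex = new Regex(string.Format(template, Regex.Escape(term)));
        return names.Where(n => regex.IsMatch(n)).ToList();
    }

    public static void Main()
    {
        var projects = new List<string> { "suzu", "suzu-domestic", "suzuran", "su.u-test", "suzu_old" };
        Console.WriteLine(string.Join(", ", Filter(projects, "suzu"))); // suzu, suzu-domestic
        Console.WriteLine(string.Join(", ", Filter(projects, "su.u"))); // su.u-test
    }
}
```

Note that \b treats the underscore as a word character, so "suzu_old" is not returned for the term "suzu". If underscores should count as delimiters too, the template would need something other than \b.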

You can use a regex like this:
^suzu\b.*
If you want suzuran just use:
^suzuran\b.*

You can use "\b{0}\b.*" if you want the match anywhere in the string (but not in the middle of a word), or "^{0}\b.*" if you only want it at the start.

If you want an elegant one-line solution with LINQ and without regex, you can use this working solution:
using System;
using System.Collections.Generic;
using System.Linq;

public class Program
{
    public static void Main()
    {
        string input = "suzu";
        // Verbatim string: the project names must stay unindented.
        string s = @"suzu
suzu-domestic
suzu-international
suzuran
suzuran-international
scorpion
scorpion-default
yada
yada-yada";
        foreach (var line in ExtractLines(s, input))
            Console.WriteLine(line);
    }

    // Works if "-" is your delimiter.
    static IEnumerable<string> ExtractLines(string lines, string input)
    {
        return from line in lines.Split(new char[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries) // split the string into lines
               let cleanLine = line.Contains("-") ? line.Split('-')[0] : line // keep only the part before the first "-"
               where cleanLine.Equals(input) // compare that part with the input
               select line; // return the matching line
    }
}

With negative lookahead:
suzu(?!.*ran).*\b
This also uses \b for a word break

Related

Find pattern to solve regex in one step

I am having trouble finding a pattern that solves the problem in one step.
The string looks like this:
Text1
Text1$Text2$Text3
Text1$Text2$Text3$Text4$Text5$Text6 etc.
What I want to get is: take up to 4x Text. If there are more than 4x Text, take only the last character of each additional one.
Example:
Text1$Text2$Text3$Text4$Text5$Text6 -> Text1$Text2$Text3$Text4&56
My current solution is:
First pattern:
^([^\$]*)\$?([^\$]*)\$?([^\$]*)\$?([^\$]*)\$?
After this I do a substitution with the first pattern
New string: Text5$Text6
second pattern is:
([^\$])\b
result: 56
combine both and get the result:
Text1$Text2$Text3$Text4$56
It is not clear to me why I can't simply put the second pattern after the first to form one pattern. Is there something like an anchor that tells the engine to start the pattern from here, as it would if it were the only pattern?

You might use an alternation with a positive lookbehind and then concatenate the matches.
(?<=^(?:[^$]+\$){0,3})[^$]+\$?|[^$](?=\$|$)
Explanation
(?<= Positive lookbehind, assert what is on the left is
^(?:[^$]+\$){0,3} From the start of the string, match 0-3 repetitions of 1+ chars other than $ followed by a literal $
) Close lookbehind
[^$]+\$? Match 1+ times any char except $, then match an optional $
| Or
[^$] Match any char except $
(?=\$|$) Positive lookahead, assert what is directly to the right is either $ or the end of the string
Example
string pattern = @"(?<=^(?:[^$]*\$){0,3})[^$]*\$?|[^$](?=\$|$)";
string[] strings = {
    "Text1",
    "Text1$Text2$Text3",
    "Text1$Text2$Text3$Text4$Text5$Text6"
};
Regex regex = new Regex(pattern);
foreach (String s in strings) {
    Console.WriteLine(string.Join("", from Match match in regex.Matches(s) select match.Value));
}
Output
Text1
Text1$Text2$Text3
Text1$Text2$Text3$Text4$56

I strongly believe a regular expression isn't the way to do that, mostly because of readability.
You may consider using a simple algorithm like this one to reach your goal:
using System;

public class Program
{
    public static void Main()
    {
        var input = "Text1$Text2$Text3$Text4$Text5$Text6";
        var parts = input.Split('$');
        var result = "";
        for (var i = 0; i < parts.Length; i++)
        {
            // Keep the first four parts (indexes 0-3) whole; for the rest,
            // Substring(4) drops the 4-character "Text" prefix.
            result += (i < 4 ? parts[i] + "$" : parts[i].Substring(4));
        }
        Console.WriteLine(result);
    }
}
There are also LINQ alternatives:
using System;
using System.Linq;
public class Program
{
public static void Main()
{
var input = "Text1$Text2$Text3$Text4$Text5$Text6";
var parts = input.Split('$');
var first4 = parts.Take(4);
var remainings = parts.Skip(4);
var result2 = string.Join("$", first4) + "$" + string.Join("", remainings.Select( r=>r.Substring(4)));
Console.WriteLine(result2);
}
}
It has to be adjusted to the actual needs but the idea is there
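As one possible adjustment, here is a minimal sketch that also handles inputs with fewer than four parts (no trailing "$") and takes the last character of each extra part instead of assuming a 4-character "Text" prefix. The PartShortener class and the Shorten method are hypothetical names, and the sketch assumes "$" is the separator wanted before the shortened tail, as in the combined result shown in the question.

```csharp
using System;
using System.Linq;

public static class PartShortener
{
    // Hypothetical helper: keeps the first "keep" parts whole and reduces
    // every further part to its last character.
    public static string Shorten(string input, int keep = 4)
    {
        string[] parts = input.Split('$');
        string head = string.Join("$", parts.Take(keep));
        if (parts.Length <= keep)
        {
            return head; // nothing to shorten
        }
        // Last character of each remaining part; empty parts contribute nothing.
        string tail = string.Concat(parts.Skip(keep)
            .Select(p => p.Length > 0 ? p.Substring(p.Length - 1) : ""));
        return head + "$" + tail;
    }

    public static void Main()
    {
        Console.WriteLine(Shorten("Text1"));                               // Text1
        Console.WriteLine(Shorten("Text1$Text2$Text3"));                   // Text1$Text2$Text3
        Console.WriteLine(Shorten("Text1$Text2$Text3$Text4$Text5$Text6")); // Text1$Text2$Text3$Text4$56
    }
}
```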

Try this code:
var texts = new string[] {"Text1", "Text1$Text2$Text3", "Text1$Text2$Text3$Text4$Text5$Text6" };
var parsed = texts
    .Select(s => Regex.Replace(s,
        @"(Text\d{1,3}(?:\$Text\d{1,3}){0,3})((?:\$Text\d{1,3})*)",
        (match) => match.Groups[1].Value + "$" + match.Groups[2].Value.Replace("Text", "").Replace("$", "")
    )).ToArray();
// parsed is now: string[3] { "Text1$", "Text1$Text2$Text3$", "Text1$Text2$Text3$Text4$56" }
Explanation:
The solution uses the regex pattern: (Text\d{1,3}(?:\$Text\d{1,3}){0,3})((?:\$Text\d{1,3})*)
(...) - capturing group
(?:...) - non-capturing group
Text\d{1,3}(?:\$Text\d{1,3}) - matches Text literally, then \d{1,3}, which is one to three digits; \$ matches $ literally
The rest is just repetition of it. Basically, the first group captures the first four pieces, and the second group captures the rest, if any.
We also use a MatchEvaluator here, which is a delegate type defined as:
public delegate string MatchEvaluator(Match match);
We define this method:
(match) => match.Groups[1].Value +"$"+ match.Groups[2].Value.Replace("Text", "").Replace("$", "")
We use it to evaluate each match: take the first capturing group and concatenate it with the second, removing the unnecessary text.

It's not clear to me whether your goal can be achieved using exclusively regex. If nothing else, the fact that you want to introduce a new character '&' into the output adds to the challenge, since just plain matching would never be able to accomplish that. Possibly using the Replace() method? I'm not sure that would work though...using only a replacement pattern and not a MatchEvaluator, I don't see a way to recognize but still exclude the "$Text" portion from the fifth instance and later.
But, if you are willing to mix regex with a small amount of post-processing, you can definitely do it:
static readonly Regex regex1 = new Regex(@"(Text\d(?:\$Text\d){0,3})(?:\$Text(\d))*", RegexOptions.Compiled);

static void Main(string[] args)
{
    for (int i = 1; i <= 6; i++)
    {
        string text = string.Join("$", Enumerable.Range(1, i).Select(j => $"Text{j}"));
        Console.WriteLine(KeepFour(text));
    }
}

private static string KeepFour(string text)
{
    Match match = regex1.Match(text);
    if (!match.Success)
    {
        return "[NO MATCH]";
    }
    StringBuilder result = new StringBuilder();
    result.Append(match.Groups[1].Value);
    if (match.Groups[2].Captures.Count > 0)
    {
        result.Append("&");
        // Have to iterate (join), because we don't want the whole match,
        // just the captured text.
        result.Append(JoinCaptures(match.Groups[2]));
    }
    return result.ToString();
}

private static string JoinCaptures(Group group)
{
    return string.Join("", group.Captures.Cast<Capture>().Select(c => c.Value));
}
The above breaks your requirement into two capture groups in a regex: the first holds up to four leading pieces, and the second captures the digit of each later piece. Then it extracts the captured text and composes the result from it.

Substitute only one group when dealing with an unknown number of capturing groups

Assuming I have this input:
/green/blah/agriculture/apple/blah/
I'm only trying to capture and replace the occurrence of apple (need to replace it with orange), so I have this regex
var regex = new Regex("^/(?:green|red){1}(?:/.*)+(apple){1}(?:/.*)");
So I'm grouping sections of the input, but as non-capturing, and only capturing the one I'm concerned with. According to the documentation, $` retrieves everything before the match in the input string and $' retrieves everything after it, so theoretically the following should work:
"$`Orange$'"
But it only retrieves the match ("apple").
Is it possible to do this with just substitutions and NOT match evaluators and looping through groups?
The issue is that apple can occur anywhere in that url scheme, hence an unknown number of capture groups.
Thanks.

To achieve what you want, I slightly changed your regex (see the updated version at the end of the answer).
What I am doing here is turning the parts before and after apple into capturing groups, so that I can use them as follows:
String replacement = "$1Orange$2";
string result = Regex.Replace(text, regex.ToString(), replacement);
I am using groups 1 and 2, and in between them (where I expect 'apple') I insert Orange.
A complete example looks like this:
using System;
using System.Text.RegularExpressions;

public class Test
{
    public static void Main()
    {
        String text = "/green/blah/agriculture/apple/blah/hallo/apple";
        // Group 1: everything before the first apple; group 2: everything after it.
        var regex = new Regex("^(/(?:green|red)/(?:[^/]+/)*?)apple(/.*)");
        String replacement = "$1Orange$2";
        string result = regex.Replace(text, replacement);
        Console.WriteLine(result);
    }
}
I needed to change the regex again to handle input like this:
/green/blah/agriculture/apple/blah/hallo/apple/green/blah/agriculture/apple/blah/hallo/apple
With the previous regex it matched the last apple and not the first, as originally intended. I changed the regex to this:
var regex = new Regex("^(/(?:green|red)/(?:[^/]+/)*?)apple(/.*)");
I updated the code above accordingly.

If you really want to replace only the first occurrence of apple and don't care about the URL structure, you can use one of the following methods.
First, simply use apple as the regex and use the Replace overload that takes a maximum replacement count.
using System;
using System.Text.RegularExpressions;
public class Test
{
public static void Main()
{
String text = "/green/blah/agriculture/apple/blah/hallo/apple/green/blah/agriculture/apple/blah/hallo/apple";
var regex = new Regex(Regex.Escape("apple"));
String replacement = "Orange";
string result = regex.Replace(text, replacement.ToString(), 1);
Console.WriteLine(result);
}
}
Second, use IndexOf and Substring, which may be faster than using the regex classes.
See the following example:
using System;

class Program
{
    static void Main(string[] args)
    {
        string search = "apple";
        string text = "/green/blah/agriculture/apple/blah/hallo/apple/green/blah/agriculture/apple/blah/hallo/apple";
        int idx = text.IndexOf(search);
        // IndexOf returns -1 when the search string is not found.
        if (idx != -1)
        {
            int endIdx = idx + search.Length;
            string first = text.Substring(0, idx);
            string second = text.Substring(endIdx); // everything after the first "apple"
            string result = first + "Orange" + second;
            Console.WriteLine(result);
        }
    }
}

Multi Substring from long string

I have a long string from which I need to extract only the substrings between { and } and turn each of them into a JSON object.
This is the string:
sys=t85,fggh{"Name":"5038.zip","Folder":"Root",,"Download":"services/DownloadFile.ashx?"} dsdfg x=565,dfg
{"Name":"5038.zip","Folder":"Root",,"Download":"services/DownloadFile.ashx?"}dfsdfg567
{"Name":"5038.zip","Folder":"Root",,"Download":"services/DownloadFile.ashx?"}sdfs
There is junk in between, so I need to extract the substrings between { and }.
My code is below, but I'm stuck: I can't remove the data that I have already taken.
List<JsonTypeFile> AllFiles = new List<JsonTypeFile>();
int lenght = -1;
while (temp.Length>3)
{
lenght = temp.IndexOf("}") - temp.IndexOf("{");
temp=temp.Substring(temp.IndexOf("{"), lenght+1);
temp.Remove(temp.IndexOf("{"), lenght + 1);
var result = JsonConvert.DeserializeObject<SnSafe.JsonTypeFile>(temp);
AllFiles.Add(result);
}

Or using regex you can get the strings like this:
var regex = new Regex("{([^}]*)}");
var matches = regex.Matches(str);
var list = (from object m in matches select m.ToString().Replace("{",string.Empty).Replace("}",string.Empty)).ToList();
var jsonList = JsonConvert.SerializeObject(list);
The str variable contains the string you provided in your question.

You can use a regex for this, but what I would do is use .Split('{') to split the string into sections, skip the first section, and then use .Split('}') to take the first portion of each section.
You can do this using LINQ:
var data = temp
.Split('{')
.Skip(1)
.Select(v => v.Split('}').FirstOrDefault());

If I understand correctly, you just want to extract anything in-between the braces and ignore anything else.
The following regular expression should allow you to extract that info:
{[^}]*} (a brace, followed by anything that isn't a brace, followed by a brace)
You can extract all instances and then deserialize them using something along the lines of:
using System.Text.RegularExpressions;
...
List<JsonTypeFile> AllFiles = new List<JsonTypeFile>();
foreach(Match match in Regex.Matches(temp, "{[^}]*}"))
{
var result = JsonConvert.DeserializeObject<SnSafe.JsonTypeFile>(match.Value);
AllFiles.Add(result);
}

How to extract the useful data with regular expression in C#?

Sorry guys, it seems like I didn't explain my question clearly. Please allow me to rephrase my question again.
I use WebClient to download the whole webpage and I got the content as a string
"
.......
.....
var picArr ="/d/manhua/naruto/516/1.png|/d/manhua/naruto/516/2.png|/d/manhua/naruto/516/3.png|/d/manhua/naruto/516/4.png|/d/manhua/naruto/516/5.png|/d/manhua/naruto/516/6.png|/d/manhua/naruto/516/7.png|/d/manhua/naruto/516/8.png|/d/manhua/naruto/516/9.png|/d/manhua/naruto/516/10.png|/d/manhua/naruto/516/11.png|/d/manhua/naruto/516/12.png|/d/manhua/naruto/516/13.png|/d/manhua/naruto/516/14.png|/d/manhua/naruto/516/15.png|/d/manhua/naruto/516/16.png"
......
";
In this content, I want to get only one line, which is
var picArr ="/d/manhua/naruto/516/1.png|/d/manhua/naruto/516/2.png|/d/manhua/naruto/516/3.png|/d/manhua/naruto/516/4.png|/d/manhua/naruto/516/5.png|/d/manhua/naruto/516/6.png|/d/manhua/naruto/516/7.png|/d/manhua/naruto/516/8.png|/d/manhua/naruto/516/9.png|/d/manhua/naruto/516/10.png|/d/manhua/naruto/516/11.png|/d/manhua/naruto/516/12.png|/d/manhua/naruto/516/13.png|/d/manhua/naruto/516/14.png|/d/manhua/naruto/516/15.png|/d/manhua/naruto/516/16.png"
Now I want to use a regular expression to find this line and get the value of picArr.
My regex is
var picArr ="([.]*)"
I thought the dot means any character, but it doesn't work. :(
Any ideas?
Thanks a lot

/picArr =\"([^\"]+)\"/
If I got this right that's what you need.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
namespace ExtractFileNames
{
class Program
{
static void Main(string[] args)
{
string pageData = @"blah blah
var picArr =""/d/manhua/naruto/516/1.png|/d/manhua/naruto/516/2.png|/d/manhua/naruto/516/3.png|/d/manhua/naruto/516/4.png|/d/manhua/naruto/516/5.png|/d/manhua/naruto/516/6.png|/d/manhua/naruto/516/7.png|/d/manhua/naruto/516/8.png|/d/manhua/naruto/516/9.png|/d/manhua/naruto/516/10.png|/d/manhua/naruto/516/11.png|/d/manhua/naruto/516/12.png|/d/manhua/naruto/516/13.png|/d/manhua/naruto/516/14.png|/d/manhua/naruto/516/15.png|/d/manhua/naruto/516/16.png""
more blah decimal blah";
var match = Regex.Match(pageData, @"var\s+picArr\s*=\s*""(.*?)""");
var str = match.Groups[1].Value;
var files = str.Split('|');
foreach(var f in files)
{
Console.WriteLine(f);
}
Console.ReadLine();
}
}
}
Output:
/d/manhua/naruto/516/1.png
/d/manhua/naruto/516/2.png
/d/manhua/naruto/516/3.png
/d/manhua/naruto/516/4.png
/d/manhua/naruto/516/5.png
/d/manhua/naruto/516/6.png
/d/manhua/naruto/516/7.png
/d/manhua/naruto/516/8.png
/d/manhua/naruto/516/9.png
/d/manhua/naruto/516/10.png
/d/manhua/naruto/516/11.png
/d/manhua/naruto/516/12.png
/d/manhua/naruto/516/13.png
/d/manhua/naruto/516/14.png
/d/manhua/naruto/516/15.png
/d/manhua/naruto/516/16.png

If you just want to get the filenames, you could just do a split on the pipe:
var picArr = "/d/manhua/naruto/516/1.png|/d/manhua/naruto/516/2.png|/d/manhua/naruto/516/3.png|/d/manhua/naruto/516/4.png|/d/manhua/naruto/516/5.png|/d/manhua/naruto/516/6.png|/d/manhua/naruto/516/7.png|/d/manhua/naruto/516/8.png|/d/manhua/naruto/516/9.png|/d/manhua/naruto/516/10.png|/d/manhua/naruto/516/11.png|/d/manhua/naruto/516/12.png|/d/manhua/naruto/516/13.png|/d/manhua/naruto/516/14.png|/d/manhua/naruto/516/15.png|/d/manhua/naruto/516/16.png";
var splitPics = picArr.Split('|');
foreach (var pic in splitPics)
{
Console.WriteLine(pic);
}

It looks like you want the value of the string literal in your snippet, "/d/manhua/naruto/516/1.png|..."
Get rid of the square brackets. "." matches any character just as it is, without brackets. Square brackets are for matching a limited set of characters: For example, you'd use "[abc]" to match any "a", "b", or "c".
It looks like the brackets have the effect of escaping the ".", a feature I hadn't known about (or forgot, sometime in the Ordovician). But I tested the regex as you have it with the string value replaced with a series of dots, and the regex matched. It's being treated as a literal "." character, which you would more likely try to match with a backslash escape: "\."
So just get rid of the brackets and it should work. It works in VS2008 for me.
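The difference can be checked with a small sketch like the one below. The shortened sample line, the DotDemo class and the ExtractPicArr method are made up for illustration; only the two patterns come from the discussion above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DotDemo
{
    // Hypothetical helper: returns the quoted value of picArr, or null if absent.
    public static string ExtractPicArr(string text)
    {
        // A bare . matches any character except a newline; the lazy .*?
        // stops at the first closing quote.
        Match m = Regex.Match(text, "var picArr =\"(.*?)\"");
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        string line = "var picArr =\"/d/a/1.png|/d/a/2.png\"";

        // [.] is a character class containing only a literal dot,
        // so ([.]*) can only capture a run of dots.
        Console.WriteLine(Regex.IsMatch(line, "var picArr =\"([.]*)\"")); // False

        Console.WriteLine(ExtractPicArr(line)); // /d/a/1.png|/d/a/2.png
    }
}
```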

Regular expression get string between curly braces

I want to ask about regular expressions in C#.
I have a string, e.g. "{Welcome to {stackoverflow}. This is a question C#}"
Is there a regular expression to get the content between {}? I want to get two strings: "Welcome to stackoverflow. This is a question C#" and "stackoverflow".
Thanks in advance, and sorry about my English.

I wouldn't know how to do that with a single regular expression, but it becomes easier if you apply a simple regex repeatedly, from the innermost braces outward:
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
static class Program {
static void Main() {
string test = "{Welcome to {stackoverflow}. This is a question C#}";
// get whatever is not a '{' between braces, non greedy
Regex regex = new Regex("{([^{]*?)}", RegexOptions.Compiled);
// the contents found
List<string> contents = new List<string>();
// flag to determine if we found matches
bool matchesFound = false;
// start finding innermost matches, and replace them with their
// content, removing braces
do {
matchesFound = false;
// replace with a MatchEvaluator that adds the content to our
// list.
test = regex.Replace(test, (match) => {
matchesFound = true;
var replacement = match.Groups[1].Value;
contents.Add(replacement);
return replacement;
});
} while (matchesFound);
foreach (var content in contents) {
Console.WriteLine(content);
}
}
}

I've written a little regex. I haven't tested it, but you can try something like this:
Regex reg = new Regex("{(.*{(.*)}.*)}");
...and build up on it.

Thanks, everybody. I have a solution: I use a stack instead of a regular expression. I push each "{" onto the stack, and when I meet a "}", I pop the matching "{" and get its index. Then I take the string from that index to the index of the "}". Thanks again.
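The asker did not post the code, so the following is only a minimal sketch of what such a stack-based approach might look like. The BraceExtractor class and the Extract method are hypothetical names, and stripping nested braces from the outer result is an assumption based on the output requested in the question.

```csharp
using System;
using System.Collections.Generic;

public static class BraceExtractor
{
    // Returns the text inside every balanced {...} pair, innermost first,
    // with any nested braces removed from the returned text.
    public static List<string> Extract(string input)
    {
        var results = new List<string>();
        var openIndexes = new Stack<int>(); // positions of unmatched '{'
        for (int i = 0; i < input.Length; i++)
        {
            if (input[i] == '{')
            {
                openIndexes.Push(i);
            }
            else if (input[i] == '}' && openIndexes.Count > 0)
            {
                int start = openIndexes.Pop();
                string inner = input.Substring(start + 1, i - start - 1);
                results.Add(inner.Replace("{", "").Replace("}", ""));
            }
        }
        return results;
    }

    public static void Main()
    {
        foreach (string s in Extract("{Welcome to {stackoverflow}. This is a question C#}"))
        {
            Console.WriteLine(s);
        }
        // Prints:
        // stackoverflow
        // Welcome to stackoverflow. This is a question C#
    }
}
```

Unlike a single regex, this handles any nesting depth in one pass, and an unmatched "}" is simply ignored.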
