How to extract the useful data with regular expression in C#?

Sorry, it seems I didn't explain my question clearly, so let me rephrase it.
I use WebClient to download a whole web page, and I get the content as a string:
"
.......
.....
var picArr ="/d/manhua/naruto/516/1.png|/d/manhua/naruto/516/2.png|/d/manhua/naruto/516/3.png|/d/manhua/naruto/516/4.png|/d/manhua/naruto/516/5.png|/d/manhua/naruto/516/6.png|/d/manhua/naruto/516/7.png|/d/manhua/naruto/516/8.png|/d/manhua/naruto/516/9.png|/d/manhua/naruto/516/10.png|/d/manhua/naruto/516/11.png|/d/manhua/naruto/516/12.png|/d/manhua/naruto/516/13.png|/d/manhua/naruto/516/14.png|/d/manhua/naruto/516/15.png|/d/manhua/naruto/516/16.png"
......
";
In this content, I want to get only one line, which is
var picArr ="/d/manhua/naruto/516/1.png|/d/manhua/naruto/516/2.png|/d/manhua/naruto/516/3.png|/d/manhua/naruto/516/4.png|/d/manhua/naruto/516/5.png|/d/manhua/naruto/516/6.png|/d/manhua/naruto/516/7.png|/d/manhua/naruto/516/8.png|/d/manhua/naruto/516/9.png|/d/manhua/naruto/516/10.png|/d/manhua/naruto/516/11.png|/d/manhua/naruto/516/12.png|/d/manhua/naruto/516/13.png|/d/manhua/naruto/516/14.png|/d/manhua/naruto/516/15.png|/d/manhua/naruto/516/16.png"
Now I want to use a regular expression to find this line and get the value of picArr.
My regular expression is
var picArr ="([.]*)"
I thought the dot means any character, but it doesn't work. :(
Any ideas?
Thanks a lot.

/picArr =\"([^\"]+)\"/
If I got this right that's what you need.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
namespace ExtractFileNames
{
    class Program
    {
        static void Main(string[] args)
        {
            // Verbatim string (@"...") standing in for the downloaded page.
            string pageData = @"blah blah
var picArr =""/d/manhua/naruto/516/1.png|/d/manhua/naruto/516/2.png|/d/manhua/naruto/516/3.png|/d/manhua/naruto/516/4.png|/d/manhua/naruto/516/5.png|/d/manhua/naruto/516/6.png|/d/manhua/naruto/516/7.png|/d/manhua/naruto/516/8.png|/d/manhua/naruto/516/9.png|/d/manhua/naruto/516/10.png|/d/manhua/naruto/516/11.png|/d/manhua/naruto/516/12.png|/d/manhua/naruto/516/13.png|/d/manhua/naruto/516/14.png|/d/manhua/naruto/516/15.png|/d/manhua/naruto/516/16.png""
more blah decimal blah";
            // Capture everything between the quotes that follow "var picArr =".
            var match = Regex.Match(pageData, @"var\s+picArr\s*=\s*""(.*?)""");
            var str = match.Groups[1].Value;
            var files = str.Split('|');
            foreach (var f in files)
            {
                Console.WriteLine(f);
            }
            Console.ReadLine();
        }
    }
}
Output:
/d/manhua/naruto/516/1.png
/d/manhua/naruto/516/2.png
/d/manhua/naruto/516/3.png
/d/manhua/naruto/516/4.png
/d/manhua/naruto/516/5.png
/d/manhua/naruto/516/6.png
/d/manhua/naruto/516/7.png
/d/manhua/naruto/516/8.png
/d/manhua/naruto/516/9.png
/d/manhua/naruto/516/10.png
/d/manhua/naruto/516/11.png
/d/manhua/naruto/516/12.png
/d/manhua/naruto/516/13.png
/d/manhua/naruto/516/14.png
/d/manhua/naruto/516/15.png
/d/manhua/naruto/516/16.png

If you just want to get the filenames, you could just do a split on the pipe:
var picArr = "/d/manhua/naruto/516/1.png|/d/manhua/naruto/516/2.png|/d/manhua/naruto/516/3.png|/d/manhua/naruto/516/4.png|/d/manhua/naruto/516/5.png|/d/manhua/naruto/516/6.png|/d/manhua/naruto/516/7.png|/d/manhua/naruto/516/8.png|/d/manhua/naruto/516/9.png|/d/manhua/naruto/516/10.png|/d/manhua/naruto/516/11.png|/d/manhua/naruto/516/12.png|/d/manhua/naruto/516/13.png|/d/manhua/naruto/516/14.png|/d/manhua/naruto/516/15.png|/d/manhua/naruto/516/16.png";
var splitPics = picArr.Split('|');
foreach (var pic in splitPics)
{
    Console.WriteLine(pic);
}

It looks like you want the value of the string literal in your snippet, "/d/manhua/naruto/516/1.png|..."
Get rid of the square brackets. "." matches any character (except a newline, by default) on its own, without brackets. Square brackets are for matching a limited set of characters: for example, you'd use "[abc]" to match any "a", "b", or "c".
Inside square brackets the "." loses its special meaning and is treated as a literal "." character, a feature I hadn't known about (or forgot, sometime in the Ordovician). I tested the regex as you have it with the string value replaced by a series of dots, and it matched. Outside brackets you would normally match a literal dot with a backslash escape: "\."
So just get rid of the brackets and it should work. It works in VS2008 for me.
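The difference between the bracketed dot and a real "match anything up to the quote" pattern can be checked with a short sketch. It is written with top-level statements (C# 9 or later), and the page string is a made-up stand-in for the downloaded content, not the real page.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical stand-in for the downloaded page; only the picArr line matters.
string page = "var x = 1;\nvar picArr =\"/d/manhua/naruto/516/1.png|/d/manhua/naruto/516/2.png\"\nvar y = 2;";

// [.]* matches only literal dots, so it cannot match the paths.
bool bracketed = Regex.IsMatch(page, @"var picArr =""([.]*)""");

// [^""]+ (a verbatim string, so "" is one quote) captures everything up to the closing quote.
Match picMatch = Regex.Match(page, @"var\s+picArr\s*=\s*""([^""]+)""");
string[] files = picMatch.Success ? picMatch.Groups[1].Value.Split('|') : Array.Empty<string>();

Console.WriteLine("Bracketed dot matched: " + bracketed);
foreach (string f in files)
{
    Console.WriteLine(f);
}
```

Checking Success before reading the group avoids splitting an empty string when the page layout changes.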

Related

How to make regex only match with patterns that have exactly one letter before a =

I am trying to get the regex to match only when there is exactly one letter from A-Z followed by a =, like A=, a=, or B=. Currently it picks up any number of letters before the =, like hem= or ac2=. Usually ^[a-zA-Z] works just fine, but it's not working in this case since I'm using named capture groups.
String pattern = "FL2 (77) Flashing,77,a=1.875,A=90.0,b=3.625,B=95.0,c=1.375,C=175.0,d=2.5,hem=0.5,16GA-AL,";
var regex = new Regex("(?<label>[a-zA-Z]+)=(?<value>[^,]+)");
Other ways I've tried
var regex = new Regex("(?<label>^[a-zA-Z]+)=(?<value>[^,]+)");
var regex = new Regex("(?<label>[^a-zA-Z]+)=(?<value>[^,]+)");

If you want to match l= but not word=, you need a negative look-behind assertion.
new Regex("(?<![a-zA-Z])(?<label>[a-zA-Z])=(?<value>[^,]+)")

If the string pattern you have in your question is really the "haystack" in which you're looking for "needles", a really easy way to solve the problem would be to first split the string on ,, then use RegEx. Then you can use a simpler pattern ^(?<label>[a-zA-Z])=(?<value>.+)$ on each item in the list you get from splitting the string, and only keep the matches.

It's because you have a + after [a-zA-Z], which makes it match one or more characters in that character class. If you remove the +, it will only match one character before the =.
If you want it to match only where there is exactly one alphabetical character before the equals sign, add a negative look-behind to the beginning of the regex to make sure that the character before the letter you want to match is not a letter, like this:
(?<![a-zA-Z])(?<label>[a-zA-Z])=(?<value>[^,]+)
(Note that this only matters when you don't put a ^ before [a-zA-Z], that is, when you want matches that don't start at the beginning of a line.)
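As a rough check of the look-behind pattern above, the following sketch runs it against the sample string from the question and collects the named groups. It uses top-level statements (C# 9 or later); the variable names and the " -> " output format are only illustrative.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

string input = "FL2 (77) Flashing,77,a=1.875,A=90.0,b=3.625,B=95.0,c=1.375,C=175.0,d=2.5,hem=0.5,16GA-AL,";

// The look-behind rejects a letter that is itself preceded by a letter,
// so the "m" in "hem=" is not accepted as a label.
var regex = new Regex(@"(?<![a-zA-Z])(?<label>[a-zA-Z])=(?<value>[^,]+)");

var pairs = new List<string>();
foreach (Match match in regex.Matches(input))
{
    pairs.Add(match.Groups["label"].Value + " -> " + match.Groups["value"].Value);
}

Console.WriteLine(string.Join("; ", pairs));
```

With this input the list should contain the seven single-letter labels a, A, b, B, c, C and d, and nothing for hem.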

Have you tried
var regex = new Regex("(?<label>^[a-zA-Z]?)=(?<value>[^,]+)");
I believe the "+" means 1 or more
"?" means 0 or 1
or exactly 1 should be {1} (at least in python, not sure about C#)
var regex = new Regex("(?<label>^[a-zA-Z]{1})=(?<value>[^,]+)");

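Assuming that each label is preceded by a comma or by the start of the string (which seems to be the case based on your example and code), you can use the following. The (?:...) group is needed so that the alternation applies only to "^" and ",", not to the rest of the pattern:
(?:^|,)(?<label>[A-Za-z])=(?<value>[^,]+)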

I recommend Regex.Matches over capture groups here:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
namespace Rextester
{
    public class Program
    {
        public static void Main(string[] args)
        {
            string content = "FL2 (77) Flashing,77,a=1.875,A=90.0,b=3.625,B=95.0,c=1.375,C=175.0,d=2.5,hem=0.5,16GA-AL,";
            // Inside a character class "|" is a literal pipe, not alternation, so it is left out.
            const string regexPattern = "(?<=[, ])[a-zA-Z]=[0-9.-]+";
            string singleMatch = new Regex(regexPattern).Match(content).ToString();
            Console.WriteLine(singleMatch); // a=1.875
            MatchCollection matchList = Regex.Matches(content, regexPattern);
            var matches = matchList.Cast<Match>().Select(match => match.Value).ToList();
            Console.WriteLine(string.Join(", ", matches)); // a=1.875, A=90.0, b=3.625, B=95.0, c=1.375, C=175.0, d=2.5
        }
    }
}

Regular Expression - Get partial string

I have a list of project names that I need to do some matching on. The list of projects could look something like this:
suzu
suzu-domestic
suzu-international
suzuran
suzuran-international
scorpion
scorpion-default
yada
yada-yada
etc
If the searched for project is suzu, I'd like to have the following result from the list:
suzu
suzu-domestic
suzu-international
but not anything containing suzuran. I'd also like to have the following matches if the searched-for project is suzuran:
suzuran
suzuran-international
but not anything containing suzu.
In C# code I have something similar to this:
String searchForProject = "suzu";
String regStr = @"THE_REGEX_GOES_HERE"; // The regStr will be in a config file
List<Project> projects = DataWrapper.GetAllProjects();
Regex regEx = new Regex(String.Format(regStr, searchForProject));
result = new List<Project>();
foreach (Project proj in projects)
{
    if (regEx.IsMatch(proj.ProjectName))
    {
        result.Add(proj);
    }
}
The question is: can I have a regexp that matches all the exact project names, but not the extra ones that a StartsWith equivalent would return?
(Today I have a regStr = @"^({0})#", but this does not satisfy the above scenario since it gives more hits than it should.)
I'd appreciate it if someone could give me a hint in the right direction. Thanks!
Magnus

All you actually need is
var regStr = @"^{0}\b";
The ^ anchor asserts the position at the beginning of the string.
The \b pattern matches a word boundary: a position between a word character and a non-word character, or between a word character and the start or end of the string. You do not need to match the rest of the string with .* because Regex.IsMatch only has to find a match; the extra .* is redundant overhead.
C# test code:
var projects = new List<string>() { "suzu", "suzu-domestic", "suzu-international", "suzuran", "suzuran-international", "scorpion", "scorpion-default", "yada", "yada-yada" };
var searchForProject = "suzu";
var regStr = @"^{0}\b"; // The regStr will be in a config file
var regEx = new Regex(String.Format(regStr, searchForProject));
var result = new List<string>();
foreach (var proj in projects)
{
    if (regEx.IsMatch(proj))
    {
        result.Add(proj);
    }
}
The foreach may be replaced with a shorter LINQ query:
var result = projects.Where(s => regEx.IsMatch(s)).ToList();
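One thing the ^{0}\b approach leaves open is a project name that contains regex metacharacters (for example a dot). A minimal sketch, assuming the name should always be treated as literal text, passes it through Regex.Escape before formatting. FindProjects is a hypothetical helper name, and the code uses top-level statements (C# 9 or later).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

var projects = new List<string> { "suzu", "suzu-domestic", "suzu-international", "suzuran", "suzuran-international", "scorpion" };

// Hypothetical helper: returns the projects whose name starts with the
// given word, followed by a word boundary.
List<string> FindProjects(string name)
{
    // Regex.Escape makes any metacharacters in the name match literally.
    var regex = new Regex(string.Format(@"^{0}\b", Regex.Escape(name)));
    return projects.Where(p => regex.IsMatch(p)).ToList();
}

Console.WriteLine(string.Join(", ", FindProjects("suzu")));
Console.WriteLine(string.Join(", ", FindProjects("suzuran")));
```

Searching for suzu should return only the three suzu entries, and searching for suzuran only the two suzuran entries.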

You can use a regex like this:
^suzu\b.*
If you want suzuran just use:
^suzuran\b.*

You can use "\b{0}\b.*" if you want the match anywhere in the string (but not in the middle of a word), or "^{0}\b.*" if you only want it at the start.

If you want an elegant solution in one line with LINQ and without regex, you can check this working solution:
using System;
using System.Linq;
using System.Collections.Generic;
public class Program
{
    public static void Main()
    {
        string input = "suzu";
        // The lines of the verbatim string stay flush left so that no leading whitespace ends up in the names.
        string s = @"suzu
suzu-domestic
suzu-international
suzuran
suzuran-international
scorpion
scorpion-default
yada
yada-yada";
        foreach (var line in ExtractLines(s, input))
            Console.WriteLine(line);
    }
    // Works if "-" is your delimiter.
    static IEnumerable<string> ExtractLines(string lines, string input)
    {
        return from line in lines.Split(new char[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries) // split the string by line
               let cleanLine = line.Contains("-") ? line.Split('-')[0] : line // keep only the part before the first "-"
               where cleanLine.Equals(input) // keep the line if that part equals the input
               select line; // return the matching line
    }
}

With negative lookahead:
suzu(?!.*ran).*\b
This also uses \b for a word break

Regex match for comma delimited string

I have the following string which is legal.
1-5,10-15
Using the following regex, I get a false for match.
^[^-\s]*-?[^-\s]*$
It works fine for things like
1-5,10
1,5
which are all legal. But it won't handle comma delimited ranges. What am I missing?

Where's the handling for a comma? Try to visualize your regex in regexper.
Now try this:
^(\d+-?\d+)(?:\,(\d+-?\d+))+$
Update: my regex is not a solution as you might have very specific needs for the captures. However, that nifty tool might help you with the task once you see what your regex does.

Try this pattern,
^\d+(\-\d+)?(\,(\d+(\-\d+)?))*$
it will match on the following strings:
1-5,10-15,5
1,2,3-5,3-4
1-5,10-15
10-15
10-15,5
but not on
1-,10-15,5
1,2,3-5,3-
1-510-15
10-15,
,10-15,5

The best regex I know for splitting comma separated strings is:
",(?=(?:[^\""]*\""[^\""]*\"")*(?![^\""]*\""))"
It will not split an entry in quotations that contains commas.
E.g. Hello, There, "You, People" gives
Hello
There
You, People

^(\s*\d+\s*-?\s*\d*)(\,(\s*\d+\s*-?\s*\d*))*$
It takes care of starting spaces, followed by at least 1 digit. "-" is optional and can be followed by one or more digits. "," is optional and can be followed by the same group as before.
Matches:
1,5
1-5,5-10
15,2,10,4-10

Here's an approach using just splits:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
namespace RangeTester
{
    class Program
    {
        static void Main(string[] args)
        {
            string str = "1,2,3,1-5,10-15,100-200";
            // First split on commas to get the individual numbers and ranges.
            string[] ranges = str.Split(',');
            foreach (string range in ranges)
            {
                Console.WriteLine(GetRange(range.Trim()));
            }
            Console.Read();
        }
        // Splits one item on the dash: "1-5" becomes "1 to 5", "3" stays "3".
        static string GetRange(string range)
        {
            string[] rng = range.Split('-');
            if (rng.Length == 2)
                return rng[0] + " to " + rng[1];
            else
                return rng[0];
        }
    }
}
I was overthinking the solution to this problem, but since you know that your list of numbers/ranges will be delimited first by commas and then by dashes, you can use splits to parse out the individual parts. There is no need to use regular expressions for parsing this string.

Try this regex:
^\d+(-\d+)?(,\d+(-\d+)?)*$
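To see how this pattern behaves on legal and illegal inputs, here is a small sketch that runs it over a few sample strings, some taken from the answers above. It uses top-level statements (C# 9 or later); the sample list is only illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

// A number, optionally followed by "-number", then any count of ",item" repeats.
var rangeList = new Regex(@"^\d+(-\d+)?(,\d+(-\d+)?)*$");

string[] samples = { "1-5,10-15", "1,5", "1-5,10", "10-15,", "1-,10-15", "1-510-15" };
foreach (string sample in samples)
{
    Console.WriteLine(sample + " => " + rangeList.IsMatch(sample));
}
```

The first three samples should be accepted; the trailing comma, the dangling dash and the double dash should be rejected.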

You might want something like (\d+)-(\d+) to get every range.

I think this should also work fine.
^\d*(-\d*)?,\d*(-?\d*)?$
Hope it helps.

matching and replacing text in a string while keeping non replaced text

I know how to use Regex.Split() and Regex.Replace(), but not how to keep certain data when replacing.
Suppose I have the following lines of text in a String[] (split after every ;):
"
using system;
using system.blab;
using system.blab.blabity;
"
How would I loop through and replace every 'using' with '<using>', while matching the whole line with something like 'using (.+;)',
and end up with the following (but not with just Regex.Replace("using", "<using>"))?
"
<using> system;
<using> system.blab;
<using> system.blab.blabity;
"

If str is your current string, then
string str = @"using system;
using system.blab;
using system.blab.blabity;";
str = str.Replace("using ", "<using> ");

Using parentheses in a regex instructs the engine to store that value as a group. Then, when you call Replace, you can reference groups with $n, where n is the number of the group. I haven't tested this, but something like this, applied to each line:
Regex.Replace(input, @"^using( .+;)$", "<using>$1");

This should get you pretty close. You should use a named group for every logical item you're trying to match. In this instance, you're trying to match everything that is not the string "using". You can then use the notation ${yourGroupName} to reference the match in the replacement string. I wrote a tool called RegexPixie that will show you live matching of your content as you type, so you can see what works and what doesn't.
// the named group has the name "everythingElse"
var regex = new Regex(@"using(?<everythingElse>[^\r\n]+)");
var content = new string [] { /* ... */ };
// loop over every line; the bound is content.Length
for (int i = 0; i < content.Length; i++)
{
    content[i] = regex.Replace(content[i], "<using>${everythingElse}");
}

This combines two of the answers. It wraps word boundaries (\b) around using to perform a whole-word search, captures the word in group 1, and references it as $1 in the replacement string:
string str = @"using system;
using system.blab;
using system.blab.blabity;";
str = Regex.Replace(str, @"\b(using)\b", "<$1>");
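For the String[] case described in the question, the group-reference idea can be sketched per line as follows. The third array entry is a made-up line added to show that text not starting with using is left alone; the code uses top-level statements (C# 9 or later).

```csharp
using System;
using System.Text.RegularExpressions;

// Two lines from the question plus one hypothetical line that must stay unchanged.
string[] lines = { "using system;", "using system.blab;", "var usingCount = 0;" };

// Group 1 holds everything after the leading "using", so it can be kept via $1.
var regex = new Regex(@"^using( .+;)$");

for (int i = 0; i < lines.Length; i++)
{
    lines[i] = regex.Replace(lines[i], "<using>$1");
}

foreach (string line in lines)
{
    Console.WriteLine(line);
}
```

Anchoring with ^ and $ means only whole lines of the form "using ...;" are rewritten.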

Regular expression get string between curly braces

I want to ask about regular expressions in C#.
I have a string, e.g. "{Welcome to {stackoverflow}. This is a question C#}"
Any idea for a regular expression to get the content between {}? I want to get two strings: "Welcome to stackoverflow. This is a question C#" and "stackoverflow".
Thanks in advance, and sorry about my English.

I wouldn't know how to do that with a single regular expression, but it is easier with a little repetition, resolving the innermost braces first:
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
static class Program {
    static void Main() {
        string test = "{Welcome to {stackoverflow}. This is a question C#}";
        // get whatever is not a '{' between braces, non greedy
        Regex regex = new Regex("{([^{]*?)}", RegexOptions.Compiled);
        // the contents found
        List<string> contents = new List<string>();
        // flag to determine if we found matches
        bool matchesFound = false;
        // start finding innermost matches, and replace them with their
        // content, removing braces
        do {
            matchesFound = false;
            // replace with a MatchEvaluator that adds the content to our
            // list.
            test = regex.Replace(test, (match) => {
                matchesFound = true;
                var replacement = match.Groups[1].Value;
                contents.Add(replacement);
                return replacement;
            });
        } while (matchesFound);
        foreach (var content in contents) {
            Console.WriteLine(content);
        }
    }
}

I've written a little regex. I haven't tested it, but you can try something like this:
Regex reg = new Regex("{(.*{(.*)}.*)}");
...and build up on it.

Thanks, everybody. I have the solution: I use a stack instead of a regular expression. I push the index of each "{" onto the stack, and when I meet a "}", I pop the matching "{" index and take the string from that index to the index of the "}". Thanks again.
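The asker did not post the code, so the following is only a guess at what the stack approach could look like. It stores the index of each "{" and, on each "}", takes the text in between. Stripping nested braces from the outer string is an assumption based on the expected output in the question. It uses top-level statements (C# 9 or later).

```csharp
using System;
using System.Collections.Generic;

string text = "{Welcome to {stackoverflow}. This is a question C#}";

var openings = new Stack<int>();   // indexes of unmatched '{'
var found = new List<string>();    // extracted contents, innermost first

for (int i = 0; i < text.Length; i++)
{
    if (text[i] == '{')
    {
        openings.Push(i);
    }
    else if (text[i] == '}' && openings.Count > 0)
    {
        int start = openings.Pop();
        // Text strictly between the matching braces.
        string inner = text.Substring(start + 1, i - start - 1);
        // Assumption: nested braces are dropped from the result.
        found.Add(inner.Replace("{", "").Replace("}", ""));
    }
}

foreach (string item in found)
{
    Console.WriteLine(item);
}
```

Because a closing brace always pairs with the most recent opening one, the inner string is found before the outer one, and unbalanced closing braces are simply skipped.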
