Looping through Regex Matches

Looping through Regex Matches - c#

This is my source string:
<box><3>
<table><1>
<chair><8>
This is my Regex Patern:
<(?<item>\w+?)><(?<count>\d+?)>
This is my Item class
class Item
{
string Name;
int count;
//(...)
}
This is my Item Collection;
List<Item> OrderList = new List(Item);
I want to populate that list with Item's based on source string.
This is my function. It's not working.
Regex ItemRegex = new Regex(#"<(?<item>\w+?)><(?<count>\d+?)>", RegexOptions.Compiled);
foreach (Match ItemMatch in ItemRegex.Matches(sourceString))
{
Item temp = new Item(ItemMatch.Groups["item"].ToString(), int.Parse(ItemMatch.Groups["count"].ToString()));
OrderList.Add(temp);
}
Threre might be some small mistakes like missing letter it this example because this is easier version of what I have in my app.
The problem is that In the end I have only one Item in OrderList.
UPDATE
I got it working.
Thans for help.

class Program
{
static void Main(string[] args)
{
string sourceString = #"<box><3>
<table><1>
<chair><8>";
Regex ItemRegex = new Regex(#"<(?<item>\w+?)><(?<count>\d+?)>", RegexOptions.Compiled);
foreach (Match ItemMatch in ItemRegex.Matches(sourceString))
{
Console.WriteLine(ItemMatch);
}
Console.ReadLine();
}
}
Returns 3 matches for me. Your problem must be elsewhere.

For future reference I want to document the above code converted to using a declarative approach as a LinqPad code snippet:
var sourceString = #"<box><3>
<table><1>
<chair><8>";
var count = 0;
var ItemRegex = new Regex(#"<(?<item>[^>]+)><(?<count>[^>]*)>", RegexOptions.Compiled);
var OrderList = ItemRegex.Matches(sourceString)
.Cast<Match>()
.Select(m => new
{
Name = m.Groups["item"].ToString(),
Count = int.TryParse(m.Groups["count"].ToString(), out count) ? count : 0,
})
.ToList();
OrderList.Dump();
With output:

Related

How to highlight only results of PrefixQuery in Lucene and not whole words?

I'm fairly new to Lucene and perhaps doing something really wrong, so please correct me if it is the case. Being searching for the answer for a few days now and not sure where to go from here.
The goal is to use Lucene.NET to search for user names with partial search (like StartsWith) and highlight only the found parts. For instance if I search for abc in a list of ['a', 'ab', 'abc', 'abcd', 'abcde'] it should return just the last three in a form of ['<b>abc</b>', '<b>abc</b>d', '<b>abc</b>de']
Here is how I approached this.
First the index creation:
using var indexDir = FSDirectory.Open(Path.Combine(IndexDirectory, IndexName));
using var standardAnalyzer = new StandardAnalyzer(CurrentVersion);
var indexConfig = new IndexWriterConfig(CurrentVersion, standardAnalyzer);
indexConfig.OpenMode = OpenMode.CREATE_OR_APPEND;
using var indexWriter = new IndexWriter(indexDir, indexConfig);
if (indexWriter.NumDocs == 0)
{
//fill the index with Documents
}
The documents are created like this:
static Document BuildClientDocument(int id, string surname, string name)
{
var document = new Document()
{
new StringField("Id", id.ToString(), Field.Store.YES),
new TextField("Surname", surname, Field.Store.YES),
new TextField("Surname_sort", surname.ToLower(), Field.Store.NO),
new TextField("Name", name, Field.Store.YES),
new TextField("Name_sort", name.ToLower(), Field.Store.NO),
};
return document;
}
The search is done like this:
using var multiReader = new MultiReader(indexWriter.GetReader(true)); //the plan was to use multiple indexes per entity types
var indexSearcher = new IndexSearcher(multiReader);
var queryString = "abc"; //just as a sample
var queryWords = queryString.SplitWords();
var query = new BooleanQuery();
queryWords
.Process((word, index) =>
{
var boolean = new BooleanQuery()
{
{ new PrefixQuery(new Term("Surname", word)) { Boost = 100 }, Occur.SHOULD }, //surnames are most important to match
{ new PrefixQuery(new Term("Name", word)) { Boost = 50 }, Occur.SHOULD }, //names are less important
};
boolean.Boost = (queryWords.Count() - index); //first words in a search query are more important than others
query.Add(boolean, Occur.MUST);
})
;
var topDocs = indexSearcher.Search(query, 50, new Sort( //sort by relevance and then in lexicographical order
SortField.FIELD_SCORE,
new SortField("Surname_sort", SortFieldType.STRING),
new SortField("Name_sort", SortFieldType.STRING)
));
And highlighting:
var htmlFormatter = new SimpleHTMLFormatter();
var queryScorer = new QueryScorer(query);
var highlighter = new Highlighter(htmlFormatter, queryScorer);
foreach (var found in topDocs.ScoreDocs)
{
var document = indexSearcher.Doc(found.Doc);
var surname = document.Get("Surname"); //just for simplicity
var surnameFragment = highlighter.GetBestFragment(standardAnalyzer, "Surname", surname);
Console.WriteLine(surnameFragment);
}
The problem is that the highlighter returns results like this:
<b>abc</b>
<b>abcd</b>
<b>abcde</b>
<b>abcdef</b>
So it "highlights" entire words even though I was searching for partials.
Explain returned NON-MATCH all the way so not sure if it's helpful here.
Is it possible to highlight only the parts which were searched for? Like in my example.

While searching a bit more on this I came to a conclusion that to make such highlighting work one needs to tweak index generation methods and split indices by parts so offsets would be properly calculated. Or else highlighting will highlight only surrounding words (fragments) entirely.
So based on this I've managed to build a simple highlighter of my own.
public class Highlighter
{
private const string TempStartToken = "\x02";
private const string TempEndToken = "\x03";
private const string SearchPatternTemplate = $"[{TempStartToken}{TempEndToken}]*{{0}}";
private const string ReplacePattern = $"{TempStartToken}$&{TempEndToken}";
private readonly ConcurrentDictionary<HighlightKey, Regex> _regexPatternsCache = new();
private static string GetHighlightTypeTemplate(HighlightType highlightType) =>
highlightType switch
{
HighlightType.Starts => "^{0}",
HighlightType.Contains => "{0}",
HighlightType.Ends => "{0}$",
HighlightType.Equals => "^{0}$",
_ => throw new ArgumentException($"Unsupported {nameof(HighlightType)}: '{highlightType}'", nameof(highlightType)),
};
public string Highlight(string text, IReadOnlySet<string> words, string startToken, string endToken, HighlightType highlightType)
{
foreach (var word in words)
{
var key = new HighlightKey
{
Word = word,
HighlightType = highlightType,
};
var regex = _regexPatternsCache.GetOrAdd(key, _ =>
{
var parts = word.Select(w => string.Format(SearchPatternTemplate, Regex.Escape(w.ToString())));
var pattern = string.Concat(parts);
var highlightPattern = string.Format(GetHighlightTypeTemplate(highlightType), pattern);
return new Regex(highlightPattern, RegexOptions.IgnoreCase | RegexOptions.CultureInvariant | RegexOptions.Compiled);
});
text = regex.Replace(text, ReplacePattern);
}
return text
.Replace(TempStartToken, startToken)
.Replace(TempEndToken, endToken)
;
}
private record HighlightKey
{
public string Word { get; init; }
public HighlightType HighlightType { get; init; }
}
}
public enum HighlightType
{
Starts,
Contains,
Ends,
Equals,
}
Use it like this:
var queries = new[] { "abc" }.ToHashSet();
var search = "a ab abc abcd abcde";
var highlighter = new Highlighter();
var outputs = search
.Split((string[])null, StringSplitOptions.RemoveEmptyEntries | StringSplitOptions.TrimEntries)
.Select(w => highlighter.Highlight(w, queries, "<b>", "</b>", HighlightType.Starts))
;
var result = string.Join(" ", outputs).Dump();
Util.RawHtml(result).Dump();
Output looks like this:
a ab <b>abc</b> <b>abc</b>d <b>abc</b>de
a ab abc abcd abcde
I'm open to any other better solutions.

C# Replace regex matched pattern using dictionary

I am trying to replace a pattern in my string where only the words between the tags should be replaced. The word that needs to be replaced resides in a dictionary as key and value pair.
Currently this is what I am trying:
string input = "<a>hello</a> <b>hello world</b> <c>I like apple</c>";
string pattern = (#"(?<=>)(.)?[^<>]*(?=</)");
Regex match = new Regex(pattern, RegexOptions.IgnoreCase);
MatchCollection matches = match.Matches(input);
var dictionary1 = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
dictionary1.Add("hello", "Hi");
dictionary1.Add("world", "people");
dictionary1.Add("apple", "fruit");
string output = "";
output = match.Replace(input, replace => { return dictionary1.ContainsKey(replace.Value) ? dictionary1[replace.Value] : replace.Value; });
Console.WriteLine(output);
Console.ReadLine();
Using this, it does replace but only the first 'hello' and not the second one. I want to replace every occurrence of 'hello' between the tags.
Any help will be much appreciated.

The problem is that the matches are:
hello
hello world
I like apple
so e.g. hello world is not in your dictionary.
Based on your code, this could be a solution:
using System;
using System.Text.RegularExpressions;
using System.Collections.Generic;
public class Program
{
public static void Main()
{
var dictionary1 = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
dictionary1.Add("hello", "Hi");
dictionary1.Add("world", "people");
dictionary1.Add("apple", "fruit");
string input = "<a>hello</a> <b>hello world</b> <c>I like apple</c>";
string pattern = ("(?<=>)(.)?[^<>]list|" + GetKeyList(dictionary1) + "(?=</)");
Regex match = new Regex(pattern, RegexOptions.IgnoreCase);
MatchCollection matches = match.Matches(input);
string output = "";
output = match.Replace(input, replace => {
Console.WriteLine(" - " + replace.Value);
return dictionary1.ContainsKey(replace.Value) ? dictionary1[replace.Value] : replace.Value;
});
Console.WriteLine(output);
}
private static string GetKeyList(Dictionary<string, string> list)
{
return string.Join("|", new List<string>(list.Keys).ToArray());
}
}
Fiddle: https://dotnetfiddle.net/zNkEDv
If someone wants to dig into this an tell me why do I need a "list|" in the list (because the first item is being ignored), I'll appreciate it.

This is another way of doing it - I parse the string into XML and then select elements containing the keys in your dictionary and then replace each element's value.
However, you have to have a valid XML document - your example lacks a root node.
var xDocument = XDocument.Parse("<root><a>hello</a> <b>hello world</b> <c>I like apple</c></root>");
var dictionary1 = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase) { { "hello", "Hi" }, { "world", "people" }, { "apple", "fruit" } };
string pattern = #"\w+";
Regex match = new Regex(pattern, RegexOptions.IgnoreCase);
var xElements = xDocument.Root.Descendants()
.Where(x => dictionary1.Keys.Any(s => x.Value.Contains(s)));
foreach (var xElement in xElements)
{
var updated = match.Replace(xElement.Value,
replace => {
return dictionary1.ContainsKey(replace.Value)
? dictionary1[replace.Value] : replace.Value; });
xElement.Value = updated;
}
string output = xDocument.ToString(SaveOptions.DisableFormatting);
This pattern of "\w+" matches words, not spaces.
This LINQ selects descendants of the root node where the element value contains any of the keys of your dictionary:
var xElements = xDocument.Root.Descendants().Where(x => dictionary1.Keys.Any(s => x.Value.Contains(s)));
I then iterate through the XElement enumerable collection returned and apply your replacement MatchEvaluator to just the string value, which is a lot easier!
The final output is <root><a>Hi</a><b>Hi people</b><c>I like fruit</c></root>. You could then remove the opening and closing <root> and </root> tags, but I don't know what your complete XML looks like.

This will do what you want (from what you have provided so far):
private static Dictionary<string, string> dict;
static void Main(string[] args)
{
dict =
new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
{
{ "hello", "Hi" },
{ "world", "people" },
{ "apple", "fruit" }
};
var input = "<a>hello</a> <b>hello world</b> apple <c>I like apple</c> hello";
var pattern = #"<.>([^<>]+)<\/.>";
var output = Regex.Replace(input, pattern, Replacer);
Console.WriteLine(output);
Console.ReadLine();
}
static string Replacer(Match match)
{
var value = match.Value;
foreach (var kvp in dict)
{
if (value.Contains(kvp.Key)) value = value.Replace(kvp.Key, kvp.Value);
}
return value;
}

How to split and take multiple strings from a url in c#?

I have a string looking something like this:
/Gender=&Age=&Query=&Orgrimmar+l%C3%A4n=01&Stormwind+l%C3%A4n=07&Undercity+l%C3%A4n=09&Pag
I want a list of string with "Orgrimmar", "Stormwind" and "Undercity". How is this possible so that it splits AFTER Query and between & and + in order so that we avoid getting a string like this "Orgrimmar+l%C3%A4n=01&Stormwind".
Let us assume that we don't know the name of the strings.. :)
Updated, i still don't seem to get it to work. I have added a list of counties that i can use to validate this. However i still find it hard in this case. countyList is used to validate that the counties/cities in the url matches a pre-existing Collection.
var countyQuery = Request.Url.Query;
var counties = this._locationService.GetAllCounties();
List<string> countyList = new List<string>();
List<string> selectedCountiesList = new List<string>();
foreach (var i in counties)
{
countyList.Add(i.Name);
}
Regex r = new Regex(#"&(.+?)\+");
MatchCollection mc = r.Matches(countyQuery);
foreach (Match curMatch in mc)
{
if (countyList.Contains(curMatch.Groups[1].Value))
{
selectedCountiesList.Add(curMatch.Groups[1].Value);
}
}
return selectedCountiesList;
Changed url to be/?Gender=&Age=&Query=&county=13&county=08&county=01&Page=1
where 13, 08, 01 and so on is Id of the counties
The final solution was:
var selectedCountyQuery = Request.QueryString
//CountySearch = "county"
[QueryStringParameters.CountySearch];
List countyList = new List();
List<string> selectedCounties = new List<string>();
if (!string.IsNullOrEmpty(selectedCountyQuery))
{
var selectedCountiesArray = selectedCountyQuery.Split(new[]{ ',' });
foreach (var selectedCounty in selectedCountiesArray)
{
selectedCounties.Add(selectedCounty);
}
}
return selectedCounties;

You can get all parameter and value with Substring() and Split() method.
Example :
var URL = "controller/method?var1=&var2=&var3=dsgdf";
var ParameterPart = URL.Split("?")[1];
var ParametersArray = ParameterPart.Split("&");
//output : ["var1=","var2=","var3=dsgdf"];
foreach(var Parameter in ParametersArray)
{
var ParameterName= Parameter.Split("=")[0];
var ParameterValue= Parameter.Split("=")[1];
}

You can use a regex and extract the matches:
Regex r = new Regex(#"&(.+?)\+");
MatchCollection mc = r.Matches(s);
Then you can itterate your desired strings (in this case wow cities) like:
foreach(Match curMatch in mc)
{
Console.WriteLine(curMatch.Groups[1].Value);
}

string[] numbers ={ "/Gender=&Age=&Query=&Orgrimmar+l%C3%A4n=01&Stormwind+l%C3%A4n=07&Undercity+l%C3%A4n=09&Pag"};
string sPattern = #"(?<=&Orgrimmar)+";
foreach (string s in numbers){
if (System.Text.RegularExpressions.Regex.IsMatch(s, sPattern)){
System.Console.WriteLine(" - valid");}
else{System.Console.WriteLine(" - invalid");}
Output: valid
string[] numbers ={ "/Gender=&Age=&Query=Orgrimmar+l%C3%A4n=01&Stormwind+l%C3%A4n=07&Undercity+l%C3%A4n=09&Pag"};
Output: invalid
Further to check two parameters:
string[] numbers ={ "/Gender=&Age=&Query=&Orgrimmar+l%C3%A4n=01&Stormwind+l%C3%A4n=07&Undercity+l%C3%A4n=09&Pag"};
string sPattern = #"(?<=&Orgrimmar)+";
string sPattern2 = #"(?<=&Stormwind)+";
foreach (string s in numbers){
if (System.Text.RegularExpressions.Regex.IsMatch(s, sPattern) && System.Text.RegularExpressions.Regex.IsMatch(s, sPattern2))
...

Method that extracts a list of URLs that match the pattern

Well, I'm trying to create a method, using Regex , that will extract a list of URLs that matches this pattern #"http://(www\.)?([^\.]+)\.com", and so far I've done this :
public static List<string> Test(string url)
{
const string pattern = #"http://(www\.)?([^\.]+)\.com";
List<string> res = new List<string>();
MatchCollection myMatches = Regex.Matches(url, pattern);
foreach (Match currentMatch in myMatches)
{
}
return res;
}
main issue is , which code should I use in foreach loop
res.Add(currentMatch.Groups.ToString());
or
res.Add(currentMatch.Value);
Thanks!

You just need to get all .Match.Values. In your code, you should use
res.Add(currentMatch.Value);
Or, just use LINQ:
res = Regex.Matches(url, pattern).Cast<Match>()
.Select(p => p.Value)
.ToList();

res.Add(currentMatch.Groups.ToString()); will give: System.Text.RegularExpressions.GroupCollection so you didn't test it.
How many matches do you expect from the parameter url?
I would use this:
static readonly Regex _domainMatcher = new Regex(#"http://(www\.)?([^\.]+)\.com", RegexOptions.Compiled);
public static bool IsValidDomain(string url)
{
return _domainMatcher.Match(url).Success;
}
or
public static string ExtractDomain(string url)
{
var match = _domainMatcher.Match(url);
if(match.Success)
return match.Value;
else
return string.Empty;
}
Because the parameter is called url so it should be one url
If there are more possibilities and you want to extract all domainnames that matches the pattern:
public static IEnumerable<string> ExtractDomains(string data)
{
var result = new List<string>();
var match = _domainMatcher.Match(data);
while (match.Success)
{
result.Add(match.Value);
match = match.NextMatch();
}
return result;
}
Notice the IEnumerable<> instead of List<> because there is no need to modify the result by the caller.
Or this lazy variant:
public static IEnumerable<string> ExtractDomains(string data)
{
var match = _domainMatcher.Match(data);
while (match.Success)
{
yield return match.Value;
match = match.NextMatch();
}
}

Regular Expression with Lambda Expression

I've got several text files which should be tab delimited, but actually are delimited by an arbitrary number of spaces. I want to parse the rows from the text file into a DataTable (the first row of the text file has headers for property names). This got me thinking about building an extensible, easy way to parse text files. Here's my current working solution:
string filePath = #"C:\path\lowbirthweight.txt";
//regex to remove multiple spaces
Regex regex = new Regex(#"[ ]{2,}", RegexOptions.Compiled);
DataTable table = new DataTable();
var reader = ReadTextFile(filePath);
//headers in first row
var headers = reader.First();
//skip headers for data
var data = reader.Skip(1).ToArray();
//remove arbitrary spacing between column headers and table data
headers = regex.Replace(headers, #" ");
for (int i = 0; i < data.Length; i++)
{
data[i] = regex.Replace(data[i], #" ");
}
//make ready the DataTable, split resultant space-delimited string into array for column names
foreach (string columnName in headers.Split(' '))
{
table.Columns.Add(new DataColumn() { ColumnName = columnName });
}
foreach (var record in data)
{
//split into array for row values
table.Rows.Add(record.Split(' '));
}
//test prints correctly to the console
Console.WriteLine(table.Rows[0][2]);
}
static IEnumerable<string> ReadTextFile(string fileName)
{
using (var reader = new StreamReader(fileName))
{
while (!reader.EndOfStream)
{
yield return reader.ReadLine();
}
}
}
In my project I've already received several large (gig +) text files that are not in the format in which they are purported to be. So can I see having to write methods such as these with some regularity, albeit with a different regular expression. Is there a way to do something like
data =data.SmartRegex(x => x.AllowOneSpace) where I can use a regular expression to iterate over the collection of strings?
Is something like the following on the right track?
public static class SmartRegex
{
public static Expression AllowOneSpace(this List<string> data)
{
//no idea how to return an expression from a method
}
}
I'm not too overly concerned with performance, just would like to see how something like this works

You should consult with your data source and find out why your data is bad.
As for the API design that you are trying to implement:
public class RegexCollection
{
private readonly Regex _allowOneSpace = new Regex(" ");
public Regex AllowOneSpace { get { return _allowOneSpace; } }
}
public static class RegexExtensions
{
public static IEnumerable<string[]> SmartRegex(
this IEnumerable<string> collection,
Func<RegexCollection, Regex> selector
)
{
var regexCollection = new RegexCollection();
var regex = selector(regexCollection);
return collection.Select(l => regex.Split(l));
}
}
Usage:
var items = new List<string> { "Hello world", "Goodbye world" };
var results = items.SmartRegex(x => x.AllowOneSpace);

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Looping through Regex Matches - c#

Related

How to highlight only results of PrefixQuery in Lucene and not whole words?

C# Replace regex matched pattern using dictionary

How to split and take multiple strings from a url in c#?

Method that extracts a list of URLs that match the pattern

Regular Expression with Lambda Expression

Categories

Resources