I have a class, Symbol_Group, that represents an invertible expression of the form AB(C+DE) + FG. Symbol_Group contains a List<List<iSymbol>>, where iSymbol is an interface implemented by both Symbol_Group and Symbol.
The above expression would be represented as A,B,Sym_Grp + F,G; Sym_Grp = C + D,E, where each + starts a new List<iSymbol>.
I need to be able to invert and expand this expression using an algorithm that can handle any depth of nesting and any number of symbols ANDed or ORed together, to produce a set of Symbol_Group objects, each containing a unique expansion. For the above expression the answer set would be !A!F; !B!F; !C!D!F; !C!E!F; !A!G; !B!G; !C!D!G; !C!E!G;
I know that I will need to use recursion, but I have had very little experience with it. Any help figuring out this algorithm would be appreciated.
Unless you are somehow required to use a List<List<iSymbol>>, I recommend switching to a different class structure, with a base class (or interface) Expression and subclasses (or implementors) SymbolExpression, NotExpression, OrExpression, and AndExpression. A SymbolExpression contains a single symbol; a NotExpression contains one Expression, and OrExpression and AndExpression contain two expressions each. This is a much more standard structure for working with mathematical expressions, and it is probably simpler to perform the transformations on it.
With the above classes, you can model any expression as a binary tree. Negate the expression by replacing the root by a NotExpression whose child is the original root. Then, traverse the tree with a depth-first search, and whenever you hit a NotExpression whose child is an OrExpression or an AndExpression, you can replace that by an AndExpression or an OrExpression (respectively) whose children are NotExpressions with the original children below them. You might also want to eliminate double negations (look for NotExpressions whose child is a NotExpression, and remove both).
(Whether this answer is understandable probably depends on how comfortable you are with working with trees. Let me know if you need clarification.)
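The traversal described above can be sketched roughly as follows. This is only a minimal illustration: the class shapes, the Negation helper and the PushNotsDown name are assumptions made for the example, not existing code. It pushes negations down to the symbols (De Morgan plus double-negation removal); it does not multiply the result out into the final list of terms.

```csharp
using System;

// Hypothetical classes following the structure suggested in this answer.
abstract class Expression { }

sealed class SymbolExpression : Expression
{
    public string Name { get; }
    public SymbolExpression(string name) { Name = name; }
    public override string ToString() => Name;
}

sealed class NotExpression : Expression
{
    public Expression Child { get; }
    public NotExpression(Expression child) { Child = child; }
    public override string ToString() => "!" + Child;
}

sealed class AndExpression : Expression
{
    public Expression Left { get; }
    public Expression Right { get; }
    public AndExpression(Expression left, Expression right) { Left = left; Right = right; }
    public override string ToString() => "(" + Left + " & " + Right + ")";
}

sealed class OrExpression : Expression
{
    public Expression Left { get; }
    public Expression Right { get; }
    public OrExpression(Expression left, Expression right) { Left = left; Right = right; }
    public override string ToString() => "(" + Left + " | " + Right + ")";
}

static class Negation
{
    // Returns an equivalent tree in which every NOT sits directly on a symbol.
    public static Expression PushNotsDown(Expression e)
    {
        switch (e)
        {
            case NotExpression n when n.Child is NotExpression inner:
                // Double negation: !!x => x
                return PushNotsDown(inner.Child);
            case NotExpression n when n.Child is AndExpression a:
                // De Morgan: !(x & y) => !x | !y
                return new OrExpression(
                    PushNotsDown(new NotExpression(a.Left)),
                    PushNotsDown(new NotExpression(a.Right)));
            case NotExpression n when n.Child is OrExpression o:
                // De Morgan: !(x | y) => !x & !y
                return new AndExpression(
                    PushNotsDown(new NotExpression(o.Left)),
                    PushNotsDown(new NotExpression(o.Right)));
            case AndExpression a:
                return new AndExpression(PushNotsDown(a.Left), PushNotsDown(a.Right));
            case OrExpression o:
                return new OrExpression(PushNotsDown(o.Left), PushNotsDown(o.Right));
            default:
                // A symbol, or a NOT directly above a symbol: nothing left to do.
                return e;
        }
    }
}
```

Building AB(C+DE) + FG as a tree, wrapping it in a NotExpression and calling Negation.PushNotsDown on it gives (((!A | !B) | (!C & (!D | !E))) & (!F | !G)). Distributing the ANDs over the ORs from there yields the eight terms listed in the question.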
After much work, this is the method I used to get the minimum terms of inversion.
public List<iSymbol> GetInvertedGroup()
{
TrimSymbolList();
List<List<iSymbol>> symbols = this.CopyListMembers(Symbols);
List<iSymbol> SymList;
while (symbols.Count > 1)
{
symbols.Add(MultiplyLists(symbols[0], symbols[1]));
symbols.RemoveRange(0, 2);
}
SymList = symbols[0];
for(int i=0;i<symbols[0].Count;i++)
{
if (SymList[i] is Symbol)
{
Symbol sym = SymList[i] as Symbol;
SymList.RemoveAt(i--);
Symbol_Group symgrp = new Symbol_Group(null);
symgrp.AddSymbol(sym);
SymList.Add(symgrp);
}
}
for (int i = 0; i < SymList.Count; i++)
{
if (SymList[i] is Symbol_Group)
{
Symbol_Group SymGrp = SymList[i] as Symbol_Group;
if (SymGrp.Symbols.Count > 1)
{
List<iSymbol> list = SymGrp.GetInvertedGroup();
SymList.RemoveAt(i--);
AddElementsOf(list, SymList);
}
}
}
return SymList;
}
public List<iSymbol> MultiplyLists(List<iSymbol> L1, List<iSymbol> L2)
{
List<iSymbol> Combined = new List<iSymbol>(L1.Count + L2.Count);
foreach (iSymbol S1 in L1)
{
foreach (iSymbol S2 in L2)
{
Symbol_Group newGrp = new Symbol_Group(null);
newGrp.AddSymbol(S1);
newGrp.AddSymbol(S2);
Combined.Add(newGrp);
}
}
return Combined;
}
This resulted in a List of Groups of Symbols, with each group representing one OR term in the final result (e.g. !A!F). Some further code was used to reduce this to a List<List<Symbol>>, as there was a reasonable amount of nesting in the answer. To reduce it, I used:
public List<List<Symbol>> ReduceList(List<iSymbol> List)
{
List<List<Symbol>> Output = new List<List<Symbol>>(List.Count);
foreach (iSymbol iSym in List)
{
if (iSym is Symbol_Group)
{
List<Symbol> L = new List<Symbol>();
(iSym as Symbol_Group).GetAllSymbols(L);
Output.Add(L);
}
else
{
throw (new Exception());
}
}
return Output;
}
public void GetAllSymbols(List<Symbol> List)
{
foreach (List<iSymbol> SubList in Symbols)
{
foreach (iSymbol iSym in SubList)
{
if (iSym is Symbol)
{
List.Add(iSym as Symbol);
}
else if (iSym is Symbol_Group)
{
(iSym as Symbol_Group).GetAllSymbols(List);
}
else
{
throw(new Exception());
}
}
}
}
Hope this helps someone else!
I came to this simpler solution after a bit of rejigging. I hope it helps out somebody else with a similar problem! This is the class structure (plus a few other properties)
public class SymbolGroup : iSymbol
{
public SymbolGroup(SymbolGroup Parent, SymRelation Relation)
{
Symbols = new List<iSymbol>();
this.Parent = Parent;
SymbolRelation = Relation;
if (SymbolRelation == SymRelation.AND)
Name = "AND Group";
else
Name = "OR Group";
}
public int Depth
{
get
{
foreach (iSymbol s in Symbols)
{
if (s is SymbolGroup)
{
return (s as SymbolGroup).Depth + 1;
}
}
return 1;
}
}
}
The method of inversion is also contained within this class. It replaces an unexpanded group in the results list with all of the expanded results of that result. It only strips away one level at a time.
public List<SymbolGroup> InvertGroup()
{
List<SymbolGroup> Results = new List<SymbolGroup>();
foreach (iSymbol s in Symbols)
{
if (s is SymbolGroup)
{
SymbolGroup sg = s as SymbolGroup;
sg.Parent = null;
Results.Add(s as SymbolGroup);
}
else if (s is Symbol)
{
SymbolGroup sg = new SymbolGroup(null, SymRelation.AND);
sg.AddSymbol(s);
Results.Add(sg);
}
}
bool AllChecked = false;
while (!AllChecked)
{
AllChecked = true;
for(int i=0;i<Results.Count;i++)
{
SymbolGroup result = Results[i];
if (result.Depth > 1)
{
AllChecked = false;
Results.RemoveAt(i--);
}
else
continue;
if (result.SymbolRelation == SymRelation.OR)
{
Results.AddRange(result.MultiplyOut());
continue;
}
for(int j=0;j<result.nSymbols;j++)
{
iSymbol s = result.Symbols[j];
if (s is SymbolGroup)
{
result.Symbols.RemoveAt(j--); //removes the symbolgroup that is being replaced, so that the rest of the group can be added to the expansion.
AllChecked = false;
SymbolGroup subResult = s as SymbolGroup;
if(subResult.SymbolRelation == SymRelation.OR)
{
List<SymbolGroup> newResults;
newResults = subResult.MultiplyOut();
foreach(SymbolGroup newSg in newResults)
{
newSg.Symbols.AddRange(result.Symbols);
}
Results.AddRange(newResults);
}
break;
}
}
}
}
return Results;
}
I am trying to search for fairly complex queries with Lucene.Net like
"inject* needle*" OR "point* thingy"~2
So basically I need wildcards in regular as well as proximity phrases. However, the basic Lucene.Net QueryParser gets rid of these wildcards.
I understand that ComplexPhraseQueryParser would work for that, unfortunately this is not included in Lucene.Net.
Is there any way of constructing queries like this in Lucene.Net?
I ended up porting the ComplexPhraseQueryParser from Java to C# myself. It was a lot easier than expected and was a good exercise for learning C# a bit better.
I have provided the code below in case it is helpful to anyone else. Please note that it is still very Java-like code, as I am a lot more familiar with Java than I am with C# ;-)
/**
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
// Ported to C# from Java source at http://grepcode.com/file/repo1.maven.org/maven2/org.apache.lucene/lucene-misc/3.0.3/org/apache/lucene/queryParser/complexPhrase/ComplexPhraseQueryParser.java
using Lucene.Net.Analysis;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Search.Spans;
using System;
using System.Collections.Generic;
using Version = Lucene.Net.Util.Version;
public class ComplexPhraseQueryParser : QueryParser
{
private List<ComplexPhraseQuery> complexPhrases = null;
private Boolean isPass2ResolvingPhrases;
private ComplexPhraseQuery currentPhraseQuery = null;
public ComplexPhraseQueryParser(Version matchVersion, String f, Analyzer a) : base(matchVersion, f, a) { }
protected override Query GetFieldQuery(String field, String queryText, int slop)
{
ComplexPhraseQuery cpq = new ComplexPhraseQuery(field, queryText, slop);
complexPhrases.Add(cpq); // add to list of phrases to be parsed once
// we
// are through with this pass
return cpq;
}
public override Query Parse(String query)
{
if (isPass2ResolvingPhrases)
{
RewriteMethod oldMethod = MultiTermRewriteMethod;
try
{
// Temporarily force BooleanQuery rewrite so that Parser will
// generate visible
// collection of terms which we can convert into SpanQueries.
// ConstantScoreRewrite mode produces an
// opaque ConstantScoreQuery object which cannot be interrogated for
// terms in the same way a BooleanQuery can.
// QueryParser is not guaranteed threadsafe anyway so this temporary
// state change should not
// present an issue
MultiTermRewriteMethod = MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE;
return base.Parse(query);
}
finally
{
MultiTermRewriteMethod = oldMethod;
}
}
// First pass - parse the top-level query recording any PhraseQuerys
// which will need to be resolved
complexPhrases = new List<ComplexPhraseQuery>();
Query q = base.Parse(query);
// Perform second pass, using this QueryParser to parse any nested
// PhraseQueries with different
// set of syntax restrictions (i.e. all fields must be same)
isPass2ResolvingPhrases = true;
try
{
using (IEnumerator<ComplexPhraseQuery> enumerator = complexPhrases.GetEnumerator())
{
while (enumerator.MoveNext())
{
currentPhraseQuery = enumerator.Current;
currentPhraseQuery.ParsePhraseElements(this);
}
}
}
finally
{
isPass2ResolvingPhrases = false;
}
return q;
}
// There is No "getTermQuery throws ParseException" method to override so
// unfortunately need
// to throw a runtime exception here if a term for another field is embedded
// in phrase query
protected override Query NewTermQuery(Term term)
{
if (isPass2ResolvingPhrases)
{
try
{
CheckPhraseClauseIsForSameField(term.Field);
}
catch (ParseException pe)
{
throw new SystemException("Error parsing complex phrase", pe);
}
}
return base.NewTermQuery(term);
}
// Helper method used to report on any clauses that appear in query syntax
private void CheckPhraseClauseIsForSameField(String field)
{
if (!field.Equals(currentPhraseQuery.Field))
{
throw new ParseException("Cannot have clause for field \"" + field
+ "\" nested in phrase " + " for field \"" + currentPhraseQuery.Field
+ "\"");
}
}
protected override Query GetWildcardQuery(String field, String termStr)
{
if (isPass2ResolvingPhrases)
{
CheckPhraseClauseIsForSameField(field);
}
return base.GetWildcardQuery(field, termStr);
}
protected override Query GetRangeQuery(String field, String part1, String part2, Boolean inclusive)
{
if (isPass2ResolvingPhrases)
{
CheckPhraseClauseIsForSameField(field);
}
return base.GetRangeQuery(field, part1, part2, inclusive);
}
protected override Query NewRangeQuery(String field, String part1, String part2,
Boolean inclusive)
{
if (isPass2ResolvingPhrases)
{
// Must use old-style RangeQuery in order to produce a BooleanQuery
// that can be turned into SpanOr clause
TermRangeQuery rangeQuery = new TermRangeQuery(field, part1, part2, inclusive, inclusive, RangeCollator);
rangeQuery.RewriteMethod = MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE;
return rangeQuery;
}
return base.NewRangeQuery(field, part1, part2, inclusive);
}
protected override Query GetFuzzyQuery(String field, String termStr, float minSimilarity)
{
if (isPass2ResolvingPhrases)
{
CheckPhraseClauseIsForSameField(field);
}
return base.GetFuzzyQuery(field, termStr, minSimilarity);
}
/*
* Used to handle the query content in between quotes and produced Span-based
* interpretations of the clauses.
*/
class ComplexPhraseQuery : Query
{
public string Field { get; set; }
public string PhrasedQueryStringContents { get; set; }
public int SlopFactor { get; set; }
private Query Contents;
public ComplexPhraseQuery(string Field, string PhrasedQueryStringContents, int SlopFactor)
: base()
{
this.Field = Field;
this.PhrasedQueryStringContents = PhrasedQueryStringContents;
this.SlopFactor = SlopFactor;
}
// Called by ComplexPhraseQueryParser for each phrase after the main
// parse
// thread is through
public void ParsePhraseElements(QueryParser qp)
{
// TODO ensure that field-sensitivity is preserved ie the query
// string below is parsed as
// field+":("+phrasedQueryStringContents+")"
// but this will need code in rewrite to unwrap the first layer of
// boolean query
Contents = qp.Parse(PhrasedQueryStringContents);
}
public override Query Rewrite(IndexReader reader)
{
// ArrayList spanClauses = new ArrayList();
if (Contents is TermQuery)
{
return Contents;
}
// Build a sequence of Span clauses arranged in a SpanNear - child
// clauses can be complex
// Booleans e.g. nots and ors etc
int numNegatives = 0;
if (!(Contents is BooleanQuery))
{
throw new ArgumentException("Unknown query type \""
+ Contents.GetType()
+ "\" found in phrase query string \"" + PhrasedQueryStringContents
+ "\"");
}
BooleanQuery bq = (BooleanQuery)Contents;
BooleanClause[] bclauses = bq.GetClauses();
SpanQuery[] allSpanClauses = new SpanQuery[bclauses.Length];
// For all clauses e.g. one* two~
for (int i = 0; i < bclauses.Length; i++)
{
// HashSet bclauseterms=new HashSet();
Query qc = bclauses[i].Query;
// Rewrite this clause e.g one* becomes (one OR onerous)
qc = qc.Rewrite(reader);
if (bclauses[i].Occur.Equals(Occur.MUST_NOT))
{
numNegatives++;
}
if (qc is BooleanQuery)
{
List<SpanQuery> sc = new List<SpanQuery>();
AddComplexPhraseClause(sc, (BooleanQuery)qc);
if (sc.Count > 0)
{
allSpanClauses[i] = sc[0];
}
else
{
// Insert fake term e.g. phrase query was for "Fred Smithe*" and
// there were no "Smithe*" terms - need to
// prevent match on just "Fred".
allSpanClauses[i] = new SpanTermQuery(new Term(Field,
"Dummy clause because no terms found - must match nothing"));
}
}
else
{
if (qc is TermQuery)
{
TermQuery tq = (TermQuery)qc;
allSpanClauses[i] = new SpanTermQuery(tq.Term);
}
else
{
throw new ArgumentException("Unknown query type \""
+ qc.GetType()
+ "\" found in phrase query string \""
+ PhrasedQueryStringContents + "\"");
}
}
}
if (numNegatives == 0)
{
// The simple case - no negative elements in phrase
return new SpanNearQuery(allSpanClauses, SlopFactor, true);
}
// Complex case - we have mixed positives and negatives in the
// sequence.
// Need to return a SpanNotQuery
List<SpanQuery> positiveClauses = new List<SpanQuery>();
for (int j = 0; j < allSpanClauses.Length; j++)
{
if (!bclauses[j].Occur.Equals(Occur.MUST_NOT))
{
positiveClauses.Add(allSpanClauses[j]);
}
}
//SpanQuery[] includeClauses = positiveClauses.ToArray(new SpanQuery[positiveClauses.Count]);
SpanQuery[] includeClauses = positiveClauses.ToArray();
SpanQuery include = null;
if (includeClauses.Length == 1)
{
include = includeClauses[0]; // only one positive clause
}
else
{
// need to increase slop factor based on gaps introduced by
// negatives
include = new SpanNearQuery(includeClauses, SlopFactor + numNegatives,
true);
}
// Use sequence of positive and negative values as the exclude.
SpanNearQuery exclude = new SpanNearQuery(allSpanClauses, SlopFactor,
true);
SpanNotQuery snot = new SpanNotQuery(include, exclude);
return snot;
}
private void AddComplexPhraseClause(List<SpanQuery> spanClauses, BooleanQuery qc)
{
List<SpanQuery> ors = new List<SpanQuery>();
List<SpanQuery> nots = new List<SpanQuery>();
BooleanClause[] bclauses = qc.GetClauses();
// For all clauses e.g. one* two~
for (int i = 0; i < bclauses.Length; i++)
{
Query childQuery = bclauses[i].Query;
// select the list to which we will add these options
List<SpanQuery> chosenList = ors;
if (bclauses[i].Occur == Occur.MUST_NOT)
{
chosenList = nots;
}
if (childQuery is TermQuery)
{
TermQuery tq = (TermQuery)childQuery;
SpanTermQuery stq = new SpanTermQuery(tq.Term);
stq.Boost = tq.Boost;
chosenList.Add(stq);
}
else if (childQuery is BooleanQuery)
{
BooleanQuery cbq = (BooleanQuery)childQuery;
AddComplexPhraseClause(chosenList, cbq);
}
else
{
// TODO alternatively could call extract terms here?
throw new ArgumentException("Unknown query type:"
+ childQuery.GetType());
}
}
if (ors.Count == 0)
{
return;
}
SpanOrQuery soq = new SpanOrQuery(ors.ToArray());
if (nots.Count == 0)
{
spanClauses.Add(soq);
}
else
{
SpanOrQuery snqs = new SpanOrQuery(nots.ToArray());
SpanNotQuery snq = new SpanNotQuery(soq, snqs);
spanClauses.Add(snq);
}
}
public override String ToString(String field)
{
return "\"" + PhrasedQueryStringContents + "\"";
}
public override int GetHashCode()
{
const int prime = 31;
int result = 1;
result = prime * result + ((Field == null) ? 0 : Field.GetHashCode());
result = prime
* result
+ ((PhrasedQueryStringContents == null) ? 0
: PhrasedQueryStringContents.GetHashCode());
result = prime * result + SlopFactor;
return result;
}
public override Boolean Equals(Object obj)
{
if (this == obj)
return true;
if (obj == null)
return false;
if (GetType() != obj.GetType())
return false;
ComplexPhraseQuery other = (ComplexPhraseQuery)obj;
if (Field == null)
{
if (other.Field != null)
return false;
}
else if (!Field.Equals(other.Field))
return false;
if (PhrasedQueryStringContents == null)
{
if (other.PhrasedQueryStringContents != null)
return false;
}
else if (!PhrasedQueryStringContents
.Equals(other.PhrasedQueryStringContents))
return false;
if (SlopFactor != other.SlopFactor)
return false;
return true;
}
}
}
I have a class ValuePair with two properties defined in it:
public class ValuePair: IEquatable<ValuePair>
{
public string value1;
public string value2;
public ValuePair(string v1, string v2)
{
this.value1 = v1;
this.value2 = v2;
}
...
}
I have some test data in a List as defined below:
List<ValuePair> pairs = new List<ValuePair>();
pairs.Add(new ValuePair("A","B"));
pairs.Add(new ValuePair("A","C"));
pairs.Add(new ValuePair("B","C"));
pairs.Add(new ValuePair("C","D"));
My goal is to keep pairs[0] and pairs[1] because the pairs "A,B" and "A,C" are unique, but to remove pairs[2] because the relationship "B,C" has already been captured by the first two relationships. pairs[3] should remain since the "C,D" relationship is unique.
I have a feeling the solution to this problem will be recursive, which is something that I have very little experience with. I started going down a path of adding a method to the class ValuePair that looks something like this:
public string EqualToEither(ValuePair v)
{
if (v.value1 == this.value1 || v.value1 == this.value2)
return v.value1;
else if (v.value2 == this.value1 || v.value2 == this.value2)
return v.value2;
else
return string.Empty;
}
I've started to try to use the above method inside of a function like this, but I am getting hung up on what to do next:
for (int i = 0; i < pairs.Count; i++)
{
for (int j = pairs.Count - 1; j >= 0; j--)
{
if (pairs[j].EqualToEither(pairs[i]) != string.Empty)
{
pairs[j].EqualToEither(pairs[i]);
}
else
{
continue;
}
}
}
I feel like I am close but still unable to get it. Can anyone please offer some guidance? If I'm approaching this the completely wrong way please let me know, thank you!
I had to solve a similar problem recently; here is how I solved it:
Transitivity is best represented, in my opinion, by grouping interrelated elements together.
For each pair you have to check whether it already belongs to a group (both values are already in the group) or whether it extends a group (only one of the values belongs to the group).
If neither value belongs to any group, the pair becomes a new group.
If the two values belong to different groups, you have to merge those groups.
As mentioned, this is closely related to a spanning tree.
One solution could be to use HashSets to represent the transitivity of your relations (I did not use HashSets in my case, there are many possible solutions).
Each HashSet would represent a group of interrelated elements.
Example implementation using HashSets:
List<ValuePair> pairs = new List<ValuePair>();
pairs.Add(new ValuePair("A", "B"));
pairs.Add(new ValuePair("A", "C"));
pairs.Add(new ValuePair("B", "C"));
pairs.Add(new ValuePair("C", "D"));
List<ValuePair> uniquePairs = new List<ValuePair>();
// this list is not really needed if all you care about
// is getting the resulting groups
List<HashSet<string>> sets = new List<HashSet<string>>();
foreach (ValuePair pair in pairs)
{
int value1Set = -1;
int value2Set = -1;
for (int i = 0; i < sets.Count; i++)
{
HashSet<string> set = sets[i];
if (set.Contains(pair.value1))
value1Set = i;
if (set.Contains(pair.value2))
value2Set = i;
}
if (value1Set == -1 && value2Set == -1)
{
// we have a new set
sets.Add(new HashSet<string> {pair.value1, pair.value2});
}
else if (value1Set == -1)
{
sets[value2Set].Add(pair.value1);
}
else if (value2Set == -1)
{
sets[value1Set].Add(pair.value2);
}
else
{
if (value1Set == value2Set)
{
// duplicate entry, skip the add
continue;
}
// merge the sets at value1Set and value2Set
foreach (string value in sets[value2Set])
{
sets[value1Set].Add(value);
}
sets.RemoveAt(value2Set);
}
uniquePairs.Add(pair);
}
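The same grouping idea can also be expressed with a disjoint-set (union-find) structure, which avoids scanning every set for every pair. This is a different technique from the HashSet code above, shown only as a sketch. DisjointSet, PairFilter and KeepNonRedundant are hypothetical names, and Tuple<string, string> stands in for ValuePair so that the example is self-contained.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: a disjoint-set (union-find) over string values.
class DisjointSet
{
    private readonly Dictionary<string, string> parent = new Dictionary<string, string>();

    // Returns the representative of the group containing value,
    // registering value as its own group the first time it is seen.
    public string Find(string value)
    {
        if (!parent.ContainsKey(value))
            parent[value] = value;

        string root = value;
        while (parent[root] != root)
            root = parent[root];

        // Path compression: point every node on the walk straight at the root.
        while (parent[value] != root)
        {
            string next = parent[value];
            parent[value] = root;
            value = next;
        }
        return root;
    }

    // Merges the groups of a and b; returns false if they were already connected.
    public bool Union(string a, string b)
    {
        string rootA = Find(a);
        string rootB = Find(b);
        if (rootA == rootB)
            return false;
        parent[rootB] = rootA;
        return true;
    }
}

static class PairFilter
{
    // Keeps only the pairs that connect two previously unconnected values.
    public static List<Tuple<string, string>> KeepNonRedundant(
        IEnumerable<Tuple<string, string>> pairs)
    {
        var groups = new DisjointSet();
        var kept = new List<Tuple<string, string>>();
        foreach (var pair in pairs)
        {
            if (groups.Union(pair.Item1, pair.Item2))
                kept.Add(pair);
        }
        return kept;
    }
}
```

For the four pairs in the question, this keeps (A,B), (A,C) and (C,D) and drops (B,C), because B and C are already in the same group by the time that pair is processed.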
I have a large list and I would like to overwrite one value if required. To do this, I create two subsets of the list which seems to give me an OutOfMemoryException. Here is my code snippet:
if (ownRG != "")
{
List<string> maclist = ownRG.Split(',').ToList();
List<IVFile> temp = powlist.Where(a => maclist.Contains(a.Machine)).ToList();
powlist = powlist.Where(a => !maclist.Contains(a.Machine)).ToList(); // OOME Here
temp.ForEach(a => { a.ReportingGroup = ownRG; });
powlist.AddRange(temp);
}
Essentially I'm splitting the list into the part that needs updating and the part that doesn't, then I perform the update and put the list back together. This works fine for smaller lists, but breaks with an OutOfMemoryException on the third row within the if for a large list. Can I make this more efficient?
NOTE
powlist is the large list (>1m items). maclist only has between 1 and 10 items, but even with 1 item this breaks.
Solving your issue
Here is how to rearrange your code using the enumerator code from my answer:
if (!string.IsNullOrEmpty(ownRG))
{
var maclist = new CommaSeparatedStringEnumerable(ownRG);
var temp = powlist.Where(a => maclist.Contains(a.Machine));
foreach (var p in temp)
{
p.ReportingGroup = ownRG;
}
}
Avoid ToList here: each call allocates a new list.
You don't need to remove the contents of temp from powlist (you are re-adding them anyway).
Streaming over a large comma-separated string
You can iterate over the string manually instead of splitting it, by looking for , characters and remembering the positions of the most recently found comma and the one before it. This way the full set of items never has to be held in memory at once.
Code example:
var str = "aaa,bbb,ccc";
var previousComma = -1;
var currentComma = 0;
for (; (currentComma = str.IndexOf(',', previousComma + 1)) != -1; previousComma = currentComma)
{
var currentItem = str.Substring(previousComma + 1, currentComma - previousComma - 1);
Console.WriteLine(currentItem);
}
var lastItem = str.Substring(previousComma + 1);
Console.WriteLine(lastItem);
Custom iterator
If you want to do it 'properly' in a fancy way, you can even write a custom enumerator:
public class CommaSeparatedStringEnumerator : IEnumerator<string>
{
int previousComma = -1;
int currentComma = -1;
string bigString = null;
bool atEnd = false;
public CommaSeparatedStringEnumerator(string s)
{
if (s == null)
throw new ArgumentNullException("s");
bigString = s;
this.Reset();
}
public string Current { get; private set; }
public void Dispose() { /* No need to do anything here */ }
object IEnumerator.Current { get { return this.Current; } }
public bool MoveNext()
{
if (atEnd)
return false;
atEnd = (currentComma = bigString.IndexOf(',', previousComma + 1)) == -1;
if (!atEnd)
Current = bigString.Substring(previousComma + 1, currentComma - previousComma - 1);
else
Current = bigString.Substring(previousComma + 1);
previousComma = currentComma;
return true;
}
public void Reset()
{
previousComma = -1;
currentComma = -1;
atEnd = false;
this.Current = null;
}
}
public class CommaSeparatedStringEnumerable : IEnumerable<string>
{
string bigString = null;
public CommaSeparatedStringEnumerable(string s)
{
if (s == null)
throw new ArgumentNullException("s");
bigString = s;
}
public IEnumerator<string> GetEnumerator()
{
return new CommaSeparatedStringEnumerator(bigString);
}
IEnumerator IEnumerable.GetEnumerator()
{
return this.GetEnumerator();
}
}
Then you can iterate over it like this:
var str = "aaa,bbb,ccc";
var enumerable = new CommaSeparatedStringEnumerable(str);
foreach (var item in enumerable)
{
Console.WriteLine(item);
}
Other thoughts
Can I make this more efficient?
Yes, you can. I suggest either working with a more efficient data format (databases, XML, JSON, etc., depending on your needs) or, if you really want to work with comma-separated items, using my code examples above.
There's no need to create a bunch of sub-lists from powlist and reconstruct it. Simply loop over the powlist and update the ReportingGroup property accordingly.
var maclist = new HashSet<string>( ownRG.Split(',') );
foreach( var item in powlist) {
if( maclist.Contains( item.Machine ) ){
item.ReportingGroup = ownRG;
}
}
Since this changes powlist in place, you won't allocate any extra memory and shouldn't run into an OutOfMemoryException.
In a loop find the next ',' char. Take the substring between the ',' and the previous ',' position. At the end of the loop save a reference to the previous ',' position (which is initially set to 0). So you parse the items one-by-one rather than all at once.
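A minimal sketch of that loop as a lazy iterator, assuming the items are held in a single string; LazySplit and Items are hypothetical names chosen for the example:

```csharp
using System;
using System.Collections.Generic;

static class LazySplit
{
    // Yields the comma-separated items of text one at a time,
    // without first building an array of all the items.
    public static IEnumerable<string> Items(string text)
    {
        int start = 0; // index just past the previous comma
        while (true)
        {
            int comma = text.IndexOf(',', start);
            if (comma == -1)
            {
                // No more commas: the remainder is the last item.
                yield return text.Substring(start);
                yield break;
            }
            yield return text.Substring(start, comma - start);
            start = comma + 1;
        }
    }
}
```

It can be consumed with foreach (var item in LazySplit.Items(ownRG)) { ... }. Each item is produced only when the loop asks for it.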
You can try looping the items of your lists, but this will increase processing time.
foreach(var item in powlist)
{
//do your operations
}
I've got the following BoolExpr class:
class BoolExpr
{
public enum BOP { LEAF, AND, OR, NOT };
//
// inner state
//
private BOP _op;
private BoolExpr _left;
private BoolExpr _right;
private String _lit;
//
// private constructor
//
private BoolExpr(BOP op, BoolExpr left, BoolExpr right)
{
_op = op;
_left = left;
_right = right;
_lit = null;
}
private BoolExpr(String literal)
{
_op = BOP.LEAF;
_left = null;
_right = null;
_lit = literal;
}
//
// accessor
//
public BOP Op
{
get { return _op; }
set { _op = value; }
}
public BoolExpr Left
{
get { return _left; }
set { _left = value; }
}
public BoolExpr Right
{
get { return _right; }
set { _right = value; }
}
public String Lit
{
get { return _lit; }
set { _lit = value; }
}
//
// public factory
//
public static BoolExpr CreateAnd(BoolExpr left, BoolExpr right)
{
return new BoolExpr(BOP.AND, left, right);
}
public static BoolExpr CreateNot(BoolExpr child)
{
return new BoolExpr(BOP.NOT, child, null);
}
public static BoolExpr CreateOr(BoolExpr left, BoolExpr right)
{
return new BoolExpr(BOP.OR, left, right);
}
public static BoolExpr CreateBoolVar(String str)
{
return new BoolExpr(str);
}
public BoolExpr(BoolExpr other)
{
// No share any object on purpose
_op = other._op;
_left = other._left == null ? null : new BoolExpr(other._left);
_right = other._right == null ? null : new BoolExpr(other._right);
_lit = new StringBuilder(other._lit).ToString();
}
//
// state checker
//
Boolean IsLeaf()
{
return (_op == BOP.LEAF);
}
Boolean IsAtomic()
{
return (IsLeaf() || (_op == BOP.NOT && _left.IsLeaf()));
}
}
What algorithm should I use to parse an input boolean expression string like "¬((A ∧ B) ∨ C ∨ D)" and load it into the above class?
TL;DR: If you want to see the code, jump to the second portion of the answer.
I would build a tree from the expression to parse and then traverse it depth first. You can refer to the Wikipedia article on binary expression trees to get a feel for what I'm suggesting.
Start by adding the omitted optional parentheses to make the next step easier
When you read anything that is not an operator or a parenthesis, create a LEAF type node
When you read any operator (in your case not, and, or), create the corresponding operator node
Binary operators get the previous and following nodes as children, unary operators only get the next one.
So, for your example ¬((A ∧ B) ∨ C ∨ D), the algorithm would go like this:
¬((A ∧ B) ∨ C ∨ D) becomes ¬(((A ∧ B) ∨ C) ∨ D)
Create a NOT node, it'll get the result of the following opening paren as a child.
Create A LEAF node, AND node and B LEAF node. AND has A and B as children.
Create OR node, it has the previously created AND as a child and a new LEAF node for C.
Create OR node, it has the previously created OR and a new node for D as children.
At that point, your tree looks like this:
        NOT
         |
         OR
        /  \
       OR   D
      /  \
    AND   C
    /  \
   A    B
You can then add a Node.Evaluate() method that evaluates recursively based on its type (polymorphism could be used here). For example, it could look something like this:
class LeafEx {
bool Evaluate() {
return Boolean.Parse(this.Lit);
}
}
class NotEx {
bool Evaluate() {
return !Left.Evaluate();
}
}
class OrEx {
bool Evaluate() {
return Left.Evaluate() || Right.Evaluate();
}
}
And so on and so forth. To get the result of your expression, you then only need to call
bool result = Root.Evaluate();
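The classes above are only outlined. A minimal runnable sketch of the same idea follows. It assumes the leaves are named variables whose values are looked up in a dictionary rather than parsed from the literal text. Node, AndEx and the values parameter are names made up for this example and are not part of your BoolExpr class.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical node types; one subclass per operator, each knows how to evaluate itself.
abstract class Node
{
    public abstract bool Evaluate(IDictionary<string, bool> values);
}

sealed class LeafEx : Node
{
    private readonly string name;
    public LeafEx(string name) { this.name = name; }

    // A leaf simply looks up the value assigned to its variable.
    public override bool Evaluate(IDictionary<string, bool> values) => values[name];
}

sealed class NotEx : Node
{
    private readonly Node child;
    public NotEx(Node child) { this.child = child; }
    public override bool Evaluate(IDictionary<string, bool> values) => !child.Evaluate(values);
}

sealed class AndEx : Node
{
    private readonly Node left, right;
    public AndEx(Node left, Node right) { this.left = left; this.right = right; }
    public override bool Evaluate(IDictionary<string, bool> values) =>
        left.Evaluate(values) && right.Evaluate(values);
}

sealed class OrEx : Node
{
    private readonly Node left, right;
    public OrEx(Node left, Node right) { this.left = left; this.right = right; }
    public override bool Evaluate(IDictionary<string, bool> values) =>
        left.Evaluate(values) || right.Evaluate(values);
}
```

The tree drawn above would then be built as new NotEx(new OrEx(new OrEx(new AndEx(new LeafEx("A"), new LeafEx("B")), new LeafEx("C")), new LeafEx("D"))). With every variable set to false it evaluates to true, and setting both A and B to true makes it false.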
Alright, since it's not an assignment and it's actually a fun thing to implement, I went ahead. Some of the code I'll post here is not related to what I described earlier (and some parts are missing) but I'll leave the top part in my answer for reference (nothing in there is wrong (hopefully!)).
Keep in mind this is far from optimal and that I made an effort to not modify your provided BoolExpr class. Modifying it could allow you to reduce the amount of code. There's also no error checking at all.
Here's the main method
static void Main(string[] args)
{
//We'll use ! for not, & for and, | for or and remove whitespace
string expr = @"!((A&B)|C|D)";
List<Token> tokens = new List<Token>();
StringReader reader = new StringReader(expr);
//Tokenize the expression
Token t = null;
do
{
t = new Token(reader);
tokens.Add(t);
} while (t.type != Token.TokenType.EXPR_END);
//Use a minimal version of the Shunting Yard algorithm to transform the token list to polish notation
List<Token> polishNotation = TransformToPolishNotation(tokens);
var enumerator = polishNotation.GetEnumerator();
enumerator.MoveNext();
BoolExpr root = Make(ref enumerator);
//Request boolean values for all literal operands
foreach (Token tok in polishNotation.Where(token => token.type == Token.TokenType.LITERAL))
{
Console.Write("Enter boolean value for {0}: ", tok.value);
string line = Console.ReadLine();
booleanValues[tok.value] = Boolean.Parse(line);
Console.WriteLine();
}
//Eval the expression tree
Console.WriteLine("Eval: {0}", Eval(root));
Console.ReadLine();
}
The tokenization phase creates a Token object for all tokens of the expression. It helps keep the parsing separated from the actual algorithm. Here's the Token class that performs this:
class Token
{
static Dictionary<char, KeyValuePair<TokenType, string>> dict = new Dictionary<char, KeyValuePair<TokenType, string>>()
{
{
'(', new KeyValuePair<TokenType, string>(TokenType.OPEN_PAREN, "(")
},
{
')', new KeyValuePair<TokenType, string>(TokenType.CLOSE_PAREN, ")")
},
{
'!', new KeyValuePair<TokenType, string>(TokenType.UNARY_OP, "NOT")
},
{
'&', new KeyValuePair<TokenType, string>(TokenType.BINARY_OP, "AND")
},
{
'|', new KeyValuePair<TokenType, string>(TokenType.BINARY_OP, "OR")
}
};
public enum TokenType
{
OPEN_PAREN,
CLOSE_PAREN,
UNARY_OP,
BINARY_OP,
LITERAL,
EXPR_END
}
public TokenType type;
public string value;
public Token(StringReader s)
{
int c = s.Read();
if (c == -1)
{
type = TokenType.EXPR_END;
value = "";
return;
}
char ch = (char)c;
if (dict.ContainsKey(ch))
{
type = dict[ch].Key;
value = dict[ch].Value;
}
else
{
string str = "";
str += ch;
while (s.Peek() != -1 && !dict.ContainsKey((char)s.Peek()))
{
str += (char)s.Read();
}
type = TokenType.LITERAL;
value = str;
}
}
}
At that point, in the main method, you can see I transform the list of tokens into Polish notation order. It makes the creation of the tree much easier, and I use a modified implementation of the Shunting Yard algorithm for this:
static List<Token> TransformToPolishNotation(List<Token> infixTokenList)
{
Queue<Token> outputQueue = new Queue<Token>();
Stack<Token> stack = new Stack<Token>();
int index = 0;
while (infixTokenList.Count > index)
{
Token t = infixTokenList[index];
switch (t.type)
{
case Token.TokenType.LITERAL:
outputQueue.Enqueue(t);
break;
case Token.TokenType.BINARY_OP:
case Token.TokenType.UNARY_OP:
case Token.TokenType.OPEN_PAREN:
stack.Push(t);
break;
case Token.TokenType.CLOSE_PAREN:
while (stack.Peek().type != Token.TokenType.OPEN_PAREN)
{
outputQueue.Enqueue(stack.Pop());
}
stack.Pop();
if (stack.Count > 0 && stack.Peek().type == Token.TokenType.UNARY_OP)
{
outputQueue.Enqueue(stack.Pop());
}
break;
default:
break;
}
++index;
}
while (stack.Count > 0)
{
outputQueue.Enqueue(stack.Pop());
}
return outputQueue.Reverse().ToList();
}
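After this transformation, our token list becomes NOT, OR, OR, D, C, AND, B, A. The operands come out mirrored because the postfix output is simply reversed, which is harmless here since AND and OR are commutative.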
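Note that the minimal version above pushes every operator without comparing precedence, so it relies on parentheses for grouping: an unparenthesized A&B|C would come out as A AND (B OR C). If you want the usual convention, here is a rough sketch of the precedence step. It assumes NOT binds tighter than AND, and AND tighter than OR, works on single-character operands instead of the Token class, and stops at postfix output. The names PrecedenceSketch and ToPostfix are made up for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class PrecedenceSketch
{
    // Assumed convention: NOT binds tightest, then AND, then OR.
    private static int Precedence(char op)
    {
        switch (op)
        {
            case '!': return 3;
            case '&': return 2;
            case '|': return 1;
            default: return 0; // '(' stays on the stack until its ')'
        }
    }

    // Converts infix such as "A&B|C" to postfix.
    // Operands are single characters to keep the sketch short.
    public static string ToPostfix(string infix)
    {
        var output = new StringBuilder();
        var stack = new Stack<char>();
        foreach (char c in infix)
        {
            if (char.IsLetterOrDigit(c))
            {
                output.Append(c);
            }
            else if (c == '!' || c == '(')
            {
                stack.Push(c); // prefix operator or open paren: always push
            }
            else if (c == '&' || c == '|')
            {
                // Pop operators that bind at least as tightly (left-associative)
                while (stack.Count > 0 && Precedence(stack.Peek()) >= Precedence(c))
                {
                    output.Append(stack.Pop());
                }
                stack.Push(c);
            }
            else if (c == ')')
            {
                while (stack.Count > 0 && stack.Peek() != '(')
                {
                    output.Append(stack.Pop());
                }
                if (stack.Count == 0) throw new FormatException("Unbalanced ')'");
                stack.Pop(); // discard the matching '('
            }
            else if (!char.IsWhiteSpace(c))
            {
                throw new FormatException("Unexpected character: " + c);
            }
        }
        while (stack.Count > 0)
        {
            char op = stack.Pop();
            if (op == '(') throw new FormatException("Unbalanced '('");
            output.Append(op);
        }
        return output.ToString();
    }
}
```

With this, PrecedenceSketch.ToPostfix("A&B|C") returns "AB&C|", i.e. (A AND B) OR C, and the test expression "!((A&B)|C|D)" gives "AB&C|D|!".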
At this point, we're ready to create the expression tree. The properties of Polish Notation allow us to just walk the Token List and recursively create the tree nodes (we'll use your BoolExpr class) as we go:
static BoolExpr Make(ref List<Token>.Enumerator polishNotationTokensEnumerator)
{
if (polishNotationTokensEnumerator.Current.type == Token.TokenType.LITERAL)
{
BoolExpr lit = BoolExpr.CreateBoolVar(polishNotationTokensEnumerator.Current.value);
polishNotationTokensEnumerator.MoveNext();
return lit;
}
else
{
if (polishNotationTokensEnumerator.Current.value == "NOT")
{
polishNotationTokensEnumerator.MoveNext();
BoolExpr operand = Make(ref polishNotationTokensEnumerator);
return BoolExpr.CreateNot(operand);
}
else if (polishNotationTokensEnumerator.Current.value == "AND")
{
polishNotationTokensEnumerator.MoveNext();
BoolExpr left = Make(ref polishNotationTokensEnumerator);
BoolExpr right = Make(ref polishNotationTokensEnumerator);
return BoolExpr.CreateAnd(left, right);
}
else if (polishNotationTokensEnumerator.Current.value == "OR")
{
polishNotationTokensEnumerator.MoveNext();
BoolExpr left = Make(ref polishNotationTokensEnumerator);
BoolExpr right = Make(ref polishNotationTokensEnumerator);
return BoolExpr.CreateOr(left, right);
}
}
return null;
}
Now we're golden! We have the expression tree that represents the expression so we'll ask the user for the actual boolean values of each literal operand and evaluate the root node (which will recursively evaluate the rest of the tree as needed).
My Eval function follows. Keep in mind that I'd use some polymorphism to make this cleaner if I modified your BoolExpr class.
static bool Eval(BoolExpr expr)
{
if (expr.IsLeaf())
{
return booleanValues[expr.Lit];
}
if (expr.Op == BoolExpr.BOP.NOT)
{
return !Eval(expr.Left);
}
if (expr.Op == BoolExpr.BOP.OR)
{
return Eval(expr.Left) || Eval(expr.Right);
}
if (expr.Op == BoolExpr.BOP.AND)
{
return Eval(expr.Left) && Eval(expr.Right);
}
throw new ArgumentException();
}
As expected, feeding our test expression ¬((A ∧ B) ∨ C ∨ D) with values false, true, false, true for A, B, C, D respectively yields the result false.
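To give an idea of what the polymorphic version mentioned above could look like, here is a minimal sketch. It does not use your BoolExpr class: ExprNode, VarNode, NotNode, AndNode, OrNode and PolymorphicEvalDemo are hypothetical names invented for this example, and each node type simply knows how to evaluate itself.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical node hierarchy; not the BoolExpr class from the question.
public abstract class ExprNode
{
    public abstract bool Eval(IDictionary<string, bool> values);
}

public sealed class VarNode : ExprNode
{
    private readonly string name;
    public VarNode(string name) { this.name = name; }
    // A leaf looks its value up in the dictionary
    public override bool Eval(IDictionary<string, bool> values) { return values[name]; }
}

public sealed class NotNode : ExprNode
{
    private readonly ExprNode operand;
    public NotNode(ExprNode operand) { this.operand = operand; }
    public override bool Eval(IDictionary<string, bool> values) { return !operand.Eval(values); }
}

public sealed class AndNode : ExprNode
{
    private readonly ExprNode left;
    private readonly ExprNode right;
    public AndNode(ExprNode left, ExprNode right) { this.left = left; this.right = right; }
    public override bool Eval(IDictionary<string, bool> values)
    {
        return left.Eval(values) && right.Eval(values);
    }
}

public sealed class OrNode : ExprNode
{
    private readonly ExprNode left;
    private readonly ExprNode right;
    public OrNode(ExprNode left, ExprNode right) { this.left = left; this.right = right; }
    public override bool Eval(IDictionary<string, bool> values)
    {
        return left.Eval(values) || right.Eval(values);
    }
}

public static class PolymorphicEvalDemo
{
    // Builds !((A&B)|C|D) by hand and evaluates it with the values used above.
    public static bool Run()
    {
        ExprNode root = new NotNode(
            new OrNode(
                new OrNode(new AndNode(new VarNode("A"), new VarNode("B")), new VarNode("C")),
                new VarNode("D")));
        var values = new Dictionary<string, bool>
        {
            { "A", false }, { "B", true }, { "C", false }, { "D", true }
        };
        return root.Eval(values);
    }
}
```

PolymorphicEvalDemo.Run() returns false, the same result as the Eval function above. The chain of if statements on Op disappears because the dispatch is done by the virtual call.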
From an algorithmic point of view, parsing an expression requires one stack.
We use a two-step algorithm:
Lexing
The aim of lexing is to extract 'keywords', 'identifiers' and 'separators':
- A keyword is 'if', 'then', 'else', '(', ')', '/\', '\/', etc.
- An identifier in your case is 'A', 'B', 'C', etc.
- A separator is a blank space, a tab, end of line, end of file, etc.
Lexing is done with an automaton. You read your input string character by character. When you encounter a character that is compatible with one of your keywords or identifiers, you start a sequence of characters. When you encounter a separator, you stop the sequence and look it up in a dictionary to see whether it is a keyword (if not, it is an identifier); then you push the tuple [sequence, keyword or identifier/class] onto the stack.
I leave as an exercise the case of short keywords such as '(' that can also be seen as separators.
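The lexing step described above could be sketched roughly as follows. This is only an illustration under a few assumptions: the symbolic keywords are '(', ')', '/\' and '\/', the word keywords are 'if', 'then' and 'else', whitespace is the only pure separator, and symbolic keywords also act as separators. The names Lexeme, LexemeKind and LexerSketch are invented for this example, and the result is a list rather than a stack.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public enum LexemeKind { Keyword, Identifier }

public sealed class Lexeme
{
    public readonly LexemeKind Kind;
    public readonly string Text;
    public Lexeme(LexemeKind kind, string text) { Kind = kind; Text = text; }
}

public static class LexerSketch
{
    // Symbolic keywords; they also end the sequence being read.
    private static readonly string[] Symbols = { "/\\", "\\/", "(", ")" };
    // The "dictionary" used to tell word keywords from identifiers.
    private static readonly HashSet<string> WordKeywords =
        new HashSet<string> { "if", "then", "else" };

    public static List<Lexeme> Lex(string input)
    {
        var result = new List<Lexeme>();
        var current = new StringBuilder(); // sequence being accumulated
        int i = 0;
        while (i < input.Length)
        {
            string symbol = MatchSymbol(input, i);
            if (symbol != null)
            {
                Flush(current, result); // a symbol ends the current sequence
                result.Add(new Lexeme(LexemeKind.Keyword, symbol));
                i += symbol.Length;
            }
            else if (char.IsWhiteSpace(input[i]))
            {
                Flush(current, result); // a separator ends the current sequence
                i++;
            }
            else
            {
                current.Append(input[i]);
                i++;
            }
        }
        Flush(current, result); // end of input acts as a separator
        return result;
    }

    // Returns the symbolic keyword starting at index, or null if there is none.
    private static string MatchSymbol(string input, int index)
    {
        foreach (string s in Symbols)
        {
            if (string.CompareOrdinal(input, index, s, 0, s.Length) == 0) return s;
        }
        return null;
    }

    // Classifies the accumulated sequence and resets the buffer.
    private static void Flush(StringBuilder current, List<Lexeme> result)
    {
        if (current.Length == 0) return;
        string text = current.ToString();
        LexemeKind kind = WordKeywords.Contains(text) ? LexemeKind.Keyword : LexemeKind.Identifier;
        result.Add(new Lexeme(kind, text));
        current.Clear();
    }
}
```

For the input (A /\ B) \/ C1 this yields seven lexemes: the keywords '(', '/\', ')' and '\/', and the identifiers A, B and C1.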
Parsing
Parsing means checking the token sequence against a grammar. In your case the only rules to check are parentheses, binary operations, and a simple identifier.
Formally:
expression::
'(' expression ')'
expression /\ expression
expression \/ expression
identifier
This can be written as a recursive function.
First reverse your stack, then:
myParseExpression(stack, myC#ResultObject)
{
if(stack.top = keyword.'(' )
then myParseOpenComma(all stack but top, myC#ResultObject)
if(stack.top = keyword.'/\')
then myParseBinaryAnd(stack, myC#ResultObject)
}
myParseOpenComma(stack, myC#ResultObject)
{
...
}
myParseBinaryAnd(stack, myC#ResultObject)
{
myNewRightPartOfExpr = new C#ResultObject
myParseExpression(stack.top, myNewRightPartOfExpr)
remove top of stack;
myNewLeftPartOfExpr = new C#ResultObject
myParseExpression(stack.top, myNewLeftPartOfExpr)
myC#ResultObject.add("AND", myNewRightPartOfExpr, myNewLeftPartOfExpr)
}
...
There are multiple functions that recurse into each other.
As an exercise, try to add negation.
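For reference, here is one way the mutually recursive functions could look in real C#. It is a sketch and not the only possible design: it assumes the tokens have already been lexed into strings, gives '/\' a higher precedence than '\/', includes '!' for negation, and returns a prefix-style string such as OR(A, B) instead of a result object. ParserSketch and its method names are invented for this example.

```csharp
using System;
using System.Collections.Generic;

public sealed class ParserSketch
{
    private readonly IList<string> tokens;
    private int position;

    private ParserSketch(IList<string> tokens) { this.tokens = tokens; }

    public static string Parse(IList<string> tokens)
    {
        var parser = new ParserSketch(tokens);
        string result = parser.ParseOr();
        if (parser.position != tokens.Count)
            throw new FormatException("Unexpected token: " + tokens[parser.position]);
        return result;
    }

    // Current token, or null at end of input
    private string Peek() { return position < tokens.Count ? tokens[position] : null; }

    // or := and ( '\/' and )*
    private string ParseOr()
    {
        string left = ParseAnd();
        while (Peek() == "\\/")
        {
            position++;
            string right = ParseAnd();
            left = "OR(" + left + ", " + right + ")";
        }
        return left;
    }

    // and := primary ( '/\' primary )*
    private string ParseAnd()
    {
        string left = ParsePrimary();
        while (Peek() == "/\\")
        {
            position++;
            string right = ParsePrimary();
            left = "AND(" + left + ", " + right + ")";
        }
        return left;
    }

    // primary := '!' primary | '(' or ')' | identifier
    private string ParsePrimary()
    {
        string token = Peek();
        if (token == null) throw new FormatException("Unexpected end of input");
        position++;
        if (token == "!") return "NOT(" + ParsePrimary() + ")";
        if (token == "(")
        {
            string inner = ParseOr(); // recursion back to the top rule
            if (Peek() != ")") throw new FormatException("Expected ')'");
            position++;
            return inner;
        }
        if (token == ")" || token == "/\\" || token == "\\/")
            throw new FormatException("Unexpected token: " + token);
        return token; // anything else is treated as an identifier
    }
}
```

Calling ParserSketch.Parse on the tokens (, A, /\, B, ), \/, !, C returns "OR(AND(A, B), NOT(C))". Splitting the grammar into one function per precedence level avoids the left recursion that the rule "expression /\ expression" would cause if it were translated literally.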
Lexing is traditionally done by a lexer (such as the lex tool).
Parsing is traditionally done by a parser (such as the bison tool).
These tools let you write those functions much as I did in the formal grammar above.
These aspects are fundamental to program compilation.
Coding these things will improve your skills a lot, because it is hard and fundamental.
I have an array of words I need to do a find-and-replace by regex operation on, and sometimes this array can be thousands of words long. I've tested and found that stemming the words using common prefixes is much faster than searching for them individually. That is, ^where|why$ is slower than ^wh(ere|y)$. Obviously it's not a noticeable difference in such a short example, but it's considerably faster where there are thousands of alternatives and the subject string is long.
So I'm looking for a way to do this stemming automatically, for instance to convert a string[] { "what", "why", "where", "when", "which" } into wh(at|y|e(re|n)|i(ch))
Is there already a recognized algorithm out there that does this? If not, how would you go about it? It seems to need to be done recursively, but I can't quite get my head round how to do it. I have a method I wrote that works to a limited extent, but it's inelegant, 60 lines long, and uses multiple nested foreach loops, so it's a future maintenance nightmare. I'm sure there's a much better way; if anyone could point me in the right direction, that'd be much appreciated...
This code should work:
using System.Collections.Generic;
using System.Linq;
public static class StemmingUtilities
{
private class Node
{
public char? Value { get; private set; }
public Node Parent { get; private set; }
public List<Node> Children { get; private set; }
public Node(char? c, Node parent)
{
this.Value = c;
this.Parent = parent;
this.Children = new List<Node>();
}
}
public static string GetRegex(IEnumerable<string> tokens)
{
var root = new Node(null,null);
foreach (var token in tokens)
{
var current = root;
for (int i = 0; i < token.Length; i++)
{
char c = token[i];
var node = current.Children.FirstOrDefault(x => x.Value.Value == c);
if (node == null)
{
node = new Node(c,current);
current.Children.Add(node);
}
current = node;
}
}
return BuildRexp(root);
}
private static string BuildRexp(Node root)
{
string s = "";
bool addBracket = root.Children.Count > 1;
// uncomment the following line to avoid wrapping the whole expression in brackets (which happens when the root has multiple children)
// addBracket = addBracket && (root.Parent != null);
if (addBracket)
s += "(";
for(int i = 0; i < root.Children.Count; i++)
{
var child = root.Children[i];
s += child.Value;
s += BuildRexp(child);
if (i < root.Children.Count - 1)
s += "|";
}
if (addBracket)
s += ")";
return s;
}
}
Usage:
var toStem1 = new[] { "what", "why", "where", "when", "which" };
string reg1 = StemmingUtilities.GetRegex(toStem1);
// reg1 = "wh(at|y|e(re|n)|ich)"
string[] toStem2 = new[] { "why", "abc", "what", "where", "apple", "when" };
string reg2 = StemmingUtilities.GetRegex(toStem2);
// reg2 = "(wh(y|at|e(re|n))|a(bc|pple))"
EDIT:
To get reg2 = "wh(y|at|e(re|n))|a(bc|pple)", i.e. without the outer wrapping brackets, just uncomment the marked line in the BuildRexp method.
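Two caveats with the code above: it does not record where a word ends, so if one word is a prefix of another (say "a" and "ab") only the longer one survives in the pattern, and it does not escape regex metacharacters. Below is a sketch of a variant that addresses both. It is an illustration under my own assumptions, not a drop-in replacement: PrefixRegexSketch, TrieNode and Build are made-up names, the children are kept in a SortedDictionary so the alternatives come out in character order rather than insertion order, and the groups are non-capturing, (?:...), so they don't shift capture numbers in a find-and-replace.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class PrefixRegexSketch
{
    private sealed class TrieNode
    {
        public readonly SortedDictionary<char, TrieNode> Children =
            new SortedDictionary<char, TrieNode>();
        public bool IsWordEnd; // true if some input word ends at this node
    }

    public static string Build(IEnumerable<string> words)
    {
        var root = new TrieNode();
        foreach (string word in words)
        {
            TrieNode current = root;
            foreach (char c in word)
            {
                TrieNode next;
                if (!current.Children.TryGetValue(c, out next))
                {
                    next = new TrieNode();
                    current.Children.Add(c, next);
                }
                current = next;
            }
            current.IsWordEnd = true;
        }
        var sb = new StringBuilder();
        AppendSuffixes(root, sb);
        return sb.ToString();
    }

    // Appends a pattern matching every suffix stored below the given node.
    private static void AppendSuffixes(TrieNode node, StringBuilder sb)
    {
        if (node.Children.Count == 0) return;
        // If a word ends here, whatever follows must be optional
        bool optional = node.IsWordEnd;
        bool group = node.Children.Count > 1 || optional;
        if (group) sb.Append("(?:");
        bool first = true;
        foreach (KeyValuePair<char, TrieNode> child in node.Children)
        {
            if (!first) sb.Append('|');
            first = false;
            sb.Append(Regex.Escape(child.Key.ToString())); // escape metacharacters
            AppendSuffixes(child.Value, sb);
        }
        if (group) sb.Append(')');
        if (optional) sb.Append('?');
    }
}
```

With the first sample array this returns "wh(?:at|e(?:n|re)|ich|y)", and for { "a", "ab" } it returns "a(?:b)?", which still matches both words.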