Forward-Only Seeking Lookup Algorithm


Given a string in the format {Length}.{Text} (such as 3.foo), I want to determine which string, from a finite list, the given string is.
The reader starts at the 0-index and can seek forward (skipping characters if desired).
As an example, consider the following list:
10.disconnect
7.dispose
7.distort
The shortest way to determine which of those strings has been presented might look like:
if (reader.Current == "1")
{
    // the word is "disconnect"
}
else
{
    reader.MoveForward(5);
    if (reader.Current == "p")
    {
        // the word is "dispose"
    }
    else
    {
        // the word is "distort"
    }
}
The question has 2 parts, though I hope someone can just point me at the right algorithm or facet of information theory that I need to read more about.
1) Given a finite list of strings, what is the best way to generate logic that requires the least number of seeks & comparisons, on average, to determine which word was presented?
2) As with the first, but allowing weighting such that hotpaths can be accounted for. i.e. if the word "distort" is 4 times more likely than the words "disconnect" and "dispose", the logic shown above would be more performant on average if structured as:
reader.MoveForward(5);
if (reader.Current == "t")
{
    // the word is "distort"
}
else //...
Note: I'm aware that the 6th character in the example set is unique so all you need to do to solve the example set is switch on that character, but please assume there is a longer list of words.
Also, this isn't some homework assignment - I'm writing a parser/interception layer for the Guacamole protocol. I've looked at Binary Trees, Tries, Ulam's Game, and a few others, but none of those fit my requirements.

I don't know if this will be of any help, but I'll throw my 5 cents in anyway.
What about a tree that automatically gets more granular as you add more strings to the list, and whose existing leaves are checked in "hot path" order?
For example, I would have something like this with your list:
10.disconnect
7.dispose
7.distort
root ---- 7 ---- "check 4th letter" ---- if "t" return "distort"
 |               (in the order of    |
 |                hot paths)         --- if "p" return "dispose"
 |
 ---- 10 ---- return "disconnect"
You can have this build up dynamically. For example, if you add 7.display it would be:
root ---- 7 ---- "check 4th letter" ---- if "t" return "distort"
 |               (in the order of    |
 |                hot paths)         --- if "p" --- "check 5th letter" --- if "o" ...
 |                                                                      |
 ---- 10 ---- return "disconnect"                                       --- if "l" ...
So nodes in the tree would have a variable "which index to check", and leaves corresponding to possible results (order is determined statistically). So something like:
# python example
class Node:
    def __init__(self, which_index, letter, weight=0):
        self.which_index = which_index  # which index this node checks to determine the next node
        self.letter = letter            # for which letter we go to this node
        self.weight = weight            # assumed "hotness" score: higher means checked earlier
        self.leaves = []

    def add_leaf(self, node):
        # keep the hottest leaves at the front so they are compared first
        self.leaves.append(node)
        self.leaves.sort(key=lambda leaf: leaf.weight, reverse=True)

    def get_next(self, some_string):
        for leaf in self.leaves:
            if some_string[self.which_index] == leaf.letter:
                return leaf
        raise Exception("not found")
Another alternative is, of course, hashing.
But if you are micro-optimizing, it is hard to say which is best, as other factors come into play (e.g. the time you save from memory caching would probably be very significant).
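The "which index to check" tree described above can also be generated automatically. The following is only a rough sketch, not a known-optimal algorithm: `build_tree` and `lookup` are hypothetical names, the cost function is a simple greedy heuristic (weighted number of words still ambiguous after one comparison), and the forward-only rule is modeled by only allowing indexes at or after the current position.

```python
from collections import defaultdict

def build_tree(words, weights, start=0):
    """Return a word (leaf) or (index, {char: subtree}); None if no forward-only split works."""
    if len(words) == 1:
        return words[0]
    limit = min(len(w) for w in words)  # only indexes valid for every candidate
    candidates = []
    for i in range(start, limit):
        groups = defaultdict(list)
        for w in words:
            groups[w[i]].append(w)
        if len(groups) > 1:
            # heuristic cost: weighted count of words still undecided after reading index i
            cost = sum(sum(weights[w] for w in g) * (len(g) - 1) for g in groups.values())
            candidates.append((cost, i, groups))
    # try the cheapest split first; fall back if a branch cannot be resolved moving forward
    for _, i, groups in sorted(candidates, key=lambda c: (c[0], c[1])):
        branches = {}
        for ch, group in groups.items():
            sub = build_tree(group, weights, i + 1)
            if sub is None:
                break
            branches[ch] = sub
        else:
            return (i, branches)
    return None

def lookup(tree, text):
    """Walk the tree, reading one character per node, until a leaf is reached."""
    while not isinstance(tree, str):
        index, branches = tree
        tree = branches[text[index]]
    return tree

words = ["10.disconnect", "7.dispose", "7.distort"]
tree = build_tree(words, {w: 1.0 for w in words})
print(tree)
```

For the three-word example this picks index 5 (the 6th character), the single position that separates all three words. The sketch does not order the comparisons inside a node by weight; that would be an additional step on top of it.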

Related

Find occurrences of the adjacent sub strings in the text

I have the text of a Word document and an array of strings. The goal is to find all occurrences of those strings in the document's text. I tried a C# implementation of the Aho-Corasick string matching algorithm, but the default implementation doesn't fit my needs.
The typical part of the text looks like
“Activation” means a written notice from Lender to the Bank substantially in the form of Exhibit A.
“Activation Notice” means a written notice from Lender to the Bank substantially in the form of Exhibit A and Activation.
“Business Day” means each day (except Saturdays and Sundays) on which banks are open for general business and Activation Notice.
The array of the keywords looks like
var keywords = new[] {"Activation", "Activation Notice"};
The default implementation of the Aho-Corasick algorithm returns the following count of the occurrences
Activation - 4
Activation Notice - 2
For 'Activation Notice' that's the correct result. But for 'Activation' the correct count should also be 2,
because I do not want to count occurrences inside the adjacent keyword 'Activation Notice'.
Is there a proper algorithm for this case?

I will assume you got your results according to the example you linked.
StringSearchResult[] results = searchAlg.FindAll(textToSearch);
With those results, if you assume that the only overlaps are subsets, you can sort by index and collect your desired results in a single pass.
public class SearchResultComparer : IComparer<StringSearchResult>
{
    // IComparer<T> requires a method named Compare.
    public int Compare(StringSearchResult x, StringSearchResult y)
    {
        // Try ordering by the start index.
        int compare = x.Index.CompareTo(y.Index);
        if (compare == 0)
        {
            // In case of ties, reverse order by keyword length.
            compare = y.Keyword.Length.CompareTo(x.Keyword.Length);
        }
        return compare;
    }
}
// ...
IComparer<StringSearchResult> searchResultComparer = new SearchResultComparer();
Array.Sort(results, searchResultComparer);
int activeEndIndex = -1;
List<StringSearchResult> nonOverlappingResults = new List<StringSearchResult>();
foreach (StringSearchResult r in results)
{
    if (r.Index < activeEndIndex)
    {
        // This range starts before the active range ends.
        // Since it's an overlap, skip it.
        continue;
    }
    // Save this result, track when it ends.
    nonOverlappingResults.Add(r);
    activeEndIndex = r.Index + r.Keyword.Length;
}
Due to the index sorting, the loop guarantees that only non-overlapping ranges will be kept. Some ranges will be rejected, which can only happen for two reasons:
The candidate starts at the same index as the active range. Since the sort breaks these ties so that the longest goes first, the candidate must be shorter than the active range and can be skipped.
The candidate starts after the active range starts but before it ends. Since the only overlaps are subsets, and this candidate overlaps with the active range, it is a subset that starts later but still ends at or before the end of the active range.
Therefore the only rejected candidates are subsets, which end at or before the end of the active range. So the active range remains the only thing to worry about overlapping with.
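The same sort-then-sweep idea can be tried out in a few lines of Python. This is only an illustrative sketch: `find_all` is a naive stand-in for the Aho-Corasick search (it is not the library's API), and `drop_nested` is a hypothetical name for the filtering pass.

```python
from collections import Counter

def find_all(text, keywords):
    """Naive stand-in for the matcher: every (index, keyword) hit, overlaps included."""
    hits = []
    for kw in keywords:
        pos = text.find(kw)
        while pos != -1:
            hits.append((pos, kw))
            pos = text.find(kw, pos + 1)
    return hits

def drop_nested(hits):
    """Keep only hits that do not start inside a previously kept hit."""
    # sort by start index; on ties the longer keyword goes first
    ordered = sorted(hits, key=lambda h: (h[0], -len(h[1])))
    kept, active_end = [], -1
    for index, kw in ordered:
        if index < active_end:
            continue  # starts before the active range ends: nested, skip it
        kept.append((index, kw))
        active_end = index + len(kw)
    return kept

text = ("Activation means a notice. Activation Notice means a notice and Activation. "
        "Business Day means a day and Activation Notice.")
hits = find_all(text, ["Activation", "Activation Notice"])
print(Counter(kw for _, kw in drop_nested(hits)))
```

As in the answer, this assumes the only overlaps are keywords nested inside longer keywords; partially overlapping keywords would need a different rule.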

How to properly check the available path on a maze [duplicate]

What are the possible ways to solve a maze?
I've got two ideas, but I think they are not very elegant.
Base situation: We have a matrix, and the elements in this matrix are arranged so that it represents a maze, with one way in and one way out.
My first idea was to send a robot through the maze, following one side, until it's out of the maze. I think this is a very slow solution.
The second one passes through every successive item marked with 1, checks where it can go (up, right, down, left), chooses one way, and continues its path there. This is even slower than the first one.
Of course it's a bit faster if I make the two bots multi-threaded at every junction, but that's also not the best way.
There must be better ways to send a bot through a maze.
EDIT
First: Thanks for the nice answers!
The second part of my question is: what should we do if we have a multi-dimensional graph? Are there special practices for that, or is the answer of Justin L. usable for that too?
I think it's not the best way for this case.
The third question:
Which of these maze solver algorithms is/are the fastest? (Purely hypothetically)

You can think of your maze as a tree.
          A
         / \
        /   \
       B     C
      / \   / \
     D   E F   G
        / \     \
       H   I     J
      / \
     L   M
        / \
       **  O
(which could possibly represent)
        START
        +   +---+---+
        | A   C   G |
    +---+   +   +   +
    | D   B | F | J |
+---+---+   +---+---+
| L   H   E   I |
+---+   +---+---+
    | M   O |
    +   +---+
    FINISH
(ignoring left-right ordering on the tree)
Where each node is a junction of paths. D, I, J, L and O are dead ends, and ** is the goal.
Of course, in your actual tree, each node has a possibility of having as many as three children.
Your goal is now simply finding which nodes to traverse to reach the finish. Any ol' tree search algorithm will do.
Looking at the tree, it's pretty easy to see your correct solution by simply "tracing up" from the ** at the deepest part of the tree:
A B E H M **
Note that this approach becomes only slightly more complicated when you have "loops" in your maze (i.e., when it is possible, without backtracking, to re-enter a passage you've already traversed). Check the comments for one nice solution.
Now, let's look at your first solution you mentioned, applied to this tree.
Your first solution is basically a Depth-First Search, which really isn't that bad. It's actually a pretty good recursive search. Basically, it says, "Always take the rightmost approach first. If nothing is there, backtrack until the first place you can go straight or left, and then repeat."
A depth-first search will search the above tree in this order:
A B D (backtrack) E H L (backtrack) M ** (backtrack) O (backtrack thrice) I
(backtrack thrice) C F (backtrack) G J
Note that you can stop as soon as you find the **.
However, when you actually code a depth-first search, recursive programming makes everything much easier. Iterative methods work too, and either way you never have to explicitly program how to backtrack. Check out the linked article for implementations.
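To make the recursion concrete, here is a small sketch of a depth-first search over the example tree. The dictionary encoding of the tree and the function name `dfs` are just illustrative choices; backtracking happens implicitly when a recursive call returns without finding the goal.

```python
# The example tree: each junction maps to the junctions reachable from it.
TREE = {
    "A": ["B", "C"], "B": ["D", "E"], "C": ["F", "G"],
    "E": ["H", "I"], "G": ["J"], "H": ["L", "M"], "M": ["**", "O"],
}

def dfs(node, goal, path=()):
    """Return the list of nodes from the start to the goal, or None if not found."""
    path = path + (node,)
    if node == goal:
        return list(path)
    for child in TREE.get(node, []):  # dead ends have no entry, so no children
        found = dfs(child, goal, path)
        if found:
            return found  # stop as soon as the goal is reached
    return None  # returning here is the "backtrack" step

print(dfs("A", "**"))
```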
Another way of searching a tree is the Breadth-First solution, which searches through trees by depth. It'd search through the above tree in this order:
A (next level) B C (next level) D E F G (next level)
H I J (next level) L M (next level) ** O
Note that, due to the nature of a maze, breadth-first has a much higher average number of nodes it checks. Breadth-first is easily implemented by keeping a queue of paths to search, and on each iteration popping a path out of the queue, "exploding" it by getting all of the paths that it can turn into after one step, and putting those new paths at the end of the queue. There are no explicit "next level" commands to code; those were just there to aid in understanding.
In fact, there is a whole expansive list of ways to search a tree. I've just mentioned the two simplest, most straightforward ways.
If your maze is very, very long and deep, and has loops and crazies, and is complicated, I suggest the A* algorithm, which is the industry standard pathfinding algorithm and combines a Breadth-First search with heuristics... sort of like an "intelligent breadth-first search".
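The queue-of-paths description translates almost directly into code. The sketch below uses the same example tree; the names `TREE` and `bfs` are illustrative, and the function also records the visiting order so it can be compared with the order listed above.

```python
from collections import deque

TREE = {
    "A": ["B", "C"], "B": ["D", "E"], "C": ["F", "G"],
    "E": ["H", "I"], "G": ["J"], "H": ["L", "M"], "M": ["**", "O"],
}

def bfs(tree, start, goal):
    """Return (path to goal, order in which nodes were visited)."""
    queue = deque([[start]])  # queue of paths, each path is a list of nodes
    order = []
    while queue:
        path = queue.popleft()
        node = path[-1]
        order.append(node)
        if node == goal:
            return path, order
        # "explode" the path into every path that is one step longer
        for child in tree.get(node, []):
            queue.append(path + [child])
    return None, order

path, order = bfs(TREE, "A", "**")
print(path)
print(order)
```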
It basically works like this:
1. Put one path in a queue (the path where you only walk one step straight into the maze). A path has a "weight" given by its current length + its straight-line distance from the end (which can be calculated mathematically).
2. Pop the path with the lowest weight from the queue.
3. "Explode" the path into every path that it could be after one step (i.e., if your path is Right Left Left Right, then your exploded paths are R L L R R and R L L R L, not including illegal ones that go through walls).
4. If one of these paths has the goal, then Victory! Otherwise:
5. Calculate the weights of the exploded paths, and put all of them back into the queue (not including the original path).
6. Sort the queue by weight, lowest first. Then repeat from Step #2.
And that's A*, which I present specially highlighted because it is more or less the industry standard pathfinding algorithm for all applications of pathfinding, including moving from one edge of the map to another while avoiding off-road paths or mountains, etc. It works so well because it uses a shortest possible distance heuristic, which gives it its "intelligence". A* is so versatile because, given any problem, if you have a shortest possible distance heuristic available (ours is easy -- the straight line), you can apply it.
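A compact sketch of A* on a grid maze follows. It is a simplification under stated assumptions: the maze is a list of strings where `#` is a wall, moves are up/down/left/right, the heuristic is the straight-line (Euclidean) distance to the goal, and a heap replaces the "sort the queue" step. The names `a_star` and `MAZE` are made up for the example.

```python
import heapq
import math

def a_star(grid, start, goal):
    """Return a shortest list of (row, col) cells from start to goal, or None."""
    def heuristic(cell):
        # straight-line distance to the goal; never overestimates on a 4-connected grid
        return math.hypot(cell[0] - goal[0], cell[1] - goal[1])

    frontier = [(heuristic(start), 0, start, [start])]  # (weight, steps, cell, path)
    best_steps = {start: 0}
    while frontier:
        _, steps, cell, path = heapq.heappop(frontier)  # lowest weight first
        if cell == goal:
            return path
        r, c = cell
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            inside = 0 <= nr < len(grid) and 0 <= nc < len(grid[nr])
            if not inside or grid[nr][nc] == "#":
                continue  # illegal move: off the grid or through a wall
            new_steps = steps + 1
            if new_steps < best_steps.get((nr, nc), math.inf):
                best_steps[(nr, nc)] = new_steps
                weight = new_steps + heuristic((nr, nc))
                heapq.heappush(frontier, (weight, new_steps, (nr, nc), path + [(nr, nc)]))
    return None

MAZE = [
    "#####",
    "#S  #",
    "# # #",
    "#  E#",
    "#####",
]
print(a_star(MAZE, (1, 1), (3, 3)))
```

Tracking the best known step count per cell is what keeps the search from looping when the maze contains cycles.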
BUT it is of great value to note that A* is not your only option.
In fact, the wikipedia category of tree traversal algorithms lists 97 alone! (the best will still be on this page linked earlier)
Sorry for the length =P (I tend to ramble)

Lots of maze-solving algorithms exist:
http://en.wikipedia.org/wiki/Maze_solving_algorithm
http://www.astrolog.org/labyrnth/algrithm.htm#solve
For a robot, Tremaux's algorithm looks promising.

An interesting approach, at least I found it interesting, is to use cellular automata. In short a "space" cell surrounded by 3 "wall" cells turns into a "wall" cell. At the end the only space cells left are the ones on route to the exit.
If you look at the tree Justin put in his answer then you can see that leaf nodes have 3 walls. Prune the tree until you have a path.
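A minimal sketch of that pruning rule (often called dead-end filling) is shown below. The grid encoding, the `keep` set protecting the entrance and exit, and the function name `fill_dead_ends` are assumptions made for the example; cells outside the grid are treated as walls.

```python
def fill_dead_ends(grid, keep):
    """Repeatedly turn open cells with 3 or more wall neighbours into walls.

    grid: list of strings, '#' is a wall, anything else is open space.
    keep: set of (row, col) cells that must never be filled (entrance and exit).
    """
    cells = [list(row) for row in grid]

    def is_wall(r, c):
        inside = 0 <= r < len(cells) and 0 <= c < len(cells[r])
        return not inside or cells[r][c] == "#"

    changed = True
    while changed:
        changed = False
        for r in range(len(cells)):
            for c in range(len(cells[r])):
                if cells[r][c] == "#" or (r, c) in keep:
                    continue
                walls = sum(is_wall(nr, nc) for nr, nc in
                            ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)))
                if walls >= 3:  # dead end: only one way in or out
                    cells[r][c] = "#"
                    changed = True
    return ["".join(row) for row in cells]

MAZE = [
    "#####",
    "#S# #",
    "# # #",
    "#   #",
    "###E#",
]
for row in fill_dead_ends(MAZE, {(1, 1), (4, 3)}):
    print(row)
```

On a maze without loops, the open cells that remain form the route between the two protected cells; loops are never filled, so they would survive as well.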

This is one of my favorite algorithms ever....
1) Move forward
2) Are you at a wall?
2a) If yes, turn left
3) Are you at the finish?
3a) If no, go to 1
3b) If yes, solved

How about building a graph out of your matrix and using Breadth-First Search, Depth-First Search or Dijkstra's algorithm?

I had a similar problem in one of my university Comp. Sci. courses. The solution we came up with was to follow the left-hand wall (the right-hand wall will work just as well). Here is some pseudocode:
While Not At End
    If Square To Left is open,
        Rotate Left
        Go Forward
    Else
        Rotate Right
    End If
Wend
That's basically it. The complex part is keeping track of which direction you're facing, and figuring out which grid position is on your left based on this direction. It worked for every test case I put up against it. Interestingly enough, the professor's solution was something along the lines of:
While Not At End
    If Can Go North
        Go North
    ElseIf Can Go East
        Go East
    ElseIf Can Go South
        Go South
    ElseIf Can Go West
        Go West
    EndIf
Wend
Which will work well for most simple mazes, but fails on a maze that looks like the following:
SXXXXXXXXXXXXX
X X
X X
X X
XXX X
X X X
X XXXXXXXXXXX XXXE
X X
XXXXXXXXXXXXXXXXXXX
With S and E being the start and end.
With anything that doesn't follow the wall, you end up having to keep a list of the places you have been, so that you can backtrack when you fall into a dead end, and so that you don't get caught in a loop. If you follow the wall, there's no need to keep track of where you've been. Although you won't necessarily find the optimal path through the maze, you will always get through it (provided the entrance and exit are both on the maze's outer wall).
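The direction bookkeeping mentioned above can be sketched as follows. This is an illustrative version of the left-hand rule, not the original coursework: the grid uses `X` for walls as in the example maze, headings are stored as an index into a clockwise list of directions, and `follow_left_wall` plus the `max_steps` safety limit are assumptions.

```python
# Headings in clockwise order: north, east, south, west, as (row, col) deltas.
DIRS = [(-1, 0), (0, 1), (1, 0), (0, -1)]

def follow_left_wall(grid, start, goal, facing=1, max_steps=10_000):
    """Walk with the left hand on the wall; return the cells visited, or None."""
    def is_open(r, c):
        return 0 <= r < len(grid) and 0 <= c < len(grid[r]) and grid[r][c] != "X"

    pos, path = start, [start]
    for _ in range(max_steps):
        if pos == goal:
            return path
        left = (facing - 1) % 4  # the heading that is on our left-hand side
        r, c = pos[0] + DIRS[left][0], pos[1] + DIRS[left][1]
        if is_open(r, c):
            facing = left        # rotate left...
            pos = (r, c)         # ...and go forward
            path.append(pos)
        else:
            facing = (facing + 1) % 4  # rotate right and test again
    return path if pos == goal else None

MAZE = [
    "XXXXX",
    "S   X",
    "XXX X",
    "XXX E",
    "XXXXX",
]
print(follow_left_wall(MAZE, (1, 0), (3, 4)))
```

Turning left is a subtraction and turning right an addition modulo 4, which is all the state the walker needs besides its position.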

This is a very simple representation to simulate maze in C++ :)
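#ifndef vAlgorithms_Interview_graph_maze_better_h
#define vAlgorithms_Interview_graph_maze_better_h
#include <fstream>
#include <iostream>
#include <queue>
#include <string>
#include <utility>
using namespace std; //kept for brevity; the code below uses unqualified std names
static const int kMaxRows = 100;
static const int kMaxColumns = 100;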
class MazeSolver
{
private:
char m_matrix[kMaxRows][kMaxColumns]; //matrix representation of graph
int rows, cols; //actual rows and columns
bool m_exit_found;
int m_exit_row, m_exit_col;
int m_entrance_row, m_entrance_col;
struct square //abstraction for data stored in every verex
{
pair<int, int> m_coord; //x and y co-ordinates of the matrix
square* m_parent; //to trace the path backwards
square() : m_parent(0) {}
};
queue<square*> Q;
public:
MazeSolver(const char* filename)
: m_exit_found(false)
, m_exit_row(0)
, m_exit_col(0)
, m_entrance_row(0)
, m_entrance_col(0)
{
ifstream file;
file.open(filename);
if(!file)
{
cout << "could not open the file" << endl << flush;
// in real world, put this in second phase constructor
}
init_matrix(file);
}
~MazeSolver()
{
}
void solve_maze()
{
//we will basically use BFS: keep pushing squares on q, visit all 4 neighbors and see
//which way can we proceed depending on obstacle(wall)
square* s = new square();
s->m_coord = make_pair(m_entrance_row, m_entrance_col);
Q.push(s);
while(!m_exit_found && !Q.empty())
{
s = Q.front();
Q.pop();
int x = s->m_coord.first;
int y = s->m_coord.second;
//check if this square is an exit cell
if(x == m_exit_row && y == m_exit_col)
{
m_matrix[x][y] = '>'; // end of the path
m_exit_found = true;
//todo: try breaking? no= queue wont empty
}
else
{
//try walking all 4 neighbors and select best path
//NOTE: Since we check all 4 neighbors simultaneously,
// the path will be the shortest path
walk_path(x-1, y, s);
walk_path(x+1, y, s);
walk_path(x, y-1, s);
walk_path(x, y+1, s);
}
} /* end while */
clear_maze(); //unset all cells previously marked as visited
//put the traversed path in maze for printing
while(s->m_parent)
{
m_matrix[s->m_coord.first][s->m_coord.second] = '-';
s = s->m_parent;
} /* end while */
}
void print()
{
for(int i=0; i<rows; i++)
{
for(int j=0; j<cols; j++)
cout << m_matrix[i][j];
cout << endl << flush;
}
}
private:
void init_matrix(ifstream& file)
{
//read the contents line-wise
string line;
int row=0;
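//loop on getline itself so a missing trailing newline doesn't drop the last row
while(row < kMaxRows && std::getline(file, line))
{
for(size_t i=0; i<line.size() && i<(size_t)kMaxColumns; i++)
{
m_matrix[row][i] = line[i];
}
row++;
if(line.size() > 0)
{
cols = (int)line.size();
}
} /* end while */
rows = row;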
find_exit_and_entry();
m_exit_found = false;
}
//find and mark ramp and exit points
void find_exit_and_entry()
{
for(int i=0; i<rows; i++)
{
if(m_matrix[i][cols-1] == ' ')
{
m_exit_row = i;
m_exit_col = cols - 1;
}
if(m_matrix[i][0] == ' ')
{
m_entrance_row = i;
m_entrance_col = 0;
}
} /* end for */
//mark entry and exit for testing
m_matrix[m_entrance_row][m_entrance_col] = 's';
m_matrix[m_exit_row][m_exit_col] = 'e';
}
void clear_maze()
{
for(int x=0; x<rows; x++)
for(int y=0; y<cols; y++)
if(m_matrix[x][y] == '-')
m_matrix[x][y] = ' ';
}
// Take a square, see if it's the exit. If not,
// push it onto the queue so its (possible) pathways
// are checked.
void walk_path(int x, int y, square* parent)
{
if(m_exit_found) return;
if(x==m_exit_row && y==m_exit_col)
{
m_matrix[x][y] = '>';
m_exit_found = true;
}
else
{
if(can_walk_at(x, y))
{
//tag this cell as visited
m_matrix[x][y] = '-';
cout << "can walk = " << x << ", " << y << endl << flush;
//add to queue
square* s = new square();
s->m_parent = parent;
s->m_coord = make_pair(x, y);
Q.push(s);
}
}
}
bool can_walk_at(int x, int y)
{
//check bounds first so we never index outside the matrix
if(is_out_of_bounds(x, y)) return false;
bool visited = m_matrix[x][y] == '-';
bool walled = m_matrix[x][y] == '#';
return (!visited && !walled);
}
bool is_out_of_bounds(int x, int y)
{
//valid indices are 0..rows-1 and 0..cols-1
return (x<0 || x>=rows || y<0 || y>=cols);
}
};
void run_test_graph_maze_better()
{
MazeSolver m("/Users/vshakya/Dropbox/private/graph/maze.txt");
m.print();
m.solve_maze();
m.print();
}
#endif

Just an idea: why not throw some bots in there in Monte Carlo fashion?
Let's call the first generation of bots gen0.
We only keep the bots from gen0 that have some continuous roads, in this way:
- from the start to some point
- or from some point to the end
We run a new generation, gen1, of bots starting at new random points, then we try to connect the roads of the gen1 bots with those of gen0 and see if we get a continuous road from start to finish.
So for gen n we try to connect with the bots from gen0, gen1, ..., gen n-1.
Of course, a generation lasts only a feasible, finite amount of time.
I don't know if the complexity of the algorithm will prove to be practical for small data sets.
Also, the algorithm assumes we know the start and finish points.
some good sites for ideas:
http://citeseerx.ist.psu.edu/
http://arxiv.org/

If the robot can keep track of its location, so it knows if it has been to a location before, then depth-first search is the obvious algorithm. You can show by an adversarial argument that it is not possible to get better worst-case performance than depth-first search.
If you have available to you techniques that cannot be implemented by robots, then breadth-first search may perform better for many mazes, as may Dijkstra's algorithm for finding the shortest path in a graph.

Same answer as all questions on stack-overflow ;)
Use vi!
http://www.texteditors.org/cgi-bin/wiki.pl?Vi-Maze
It's truly fascinating to see a text editor solve an ascii-maze, I'm sure the emacs guys have an equivalent ..

there are many algorithms, and many different settings that specify which algorithm is best.
this is just one idea about an interesting setting:
let's assume you have the following properties...
you move a robot and you want to minimize its movement, not its CPU usage.
that robot can either inspect only its neighbouring cells or look along corridors either seeing or not seeing cross-ways.
it has GPS.
it knows the coordinates of its destination.
then you can design an A.I. which...
draws a map – every time it receives new information about the maze.
calculates the minimal known path lengths between all unobserved positions (and itself and the destination).
can prioritize unobserved positions for inspection based upon surrounding structures. (if it is impossible to reach the destination from there anyway...)
can prioritize unobserved positions for inspection based upon direction and distance to destination.
can prioritize unobserved positions for inspection based upon experience about collecting information. (how far can it see on average and how far does it have to walk?)
can prioritize unobserved positions to find possible shortcuts. (experience: are there many loops?)

This azkaban algorithm might also help you,
http://journals.analysisofalgorithms.com/2011/08/efficient-maze-solving-approach-with.html

The best way to solve a maze is to use a connectivity algorithm such as union-find which is a quasi-linear time algorithm assuming path compression is done.
Union-Find is a data structure that tells you whether two elements in a set are transitively connected.
To use a union-find data structure to solve a maze, first the neighbor connectivity data is used to build the union-find data structure. Then the union find is compressed. To determine whether the maze is solvable the entrance and exit values are compared. If they have the same value, then they are connected and the maze is solvable. Finally, to find a solution, you start with the entrance and examine the root associated with each of its neighbors. As soon as you find a previously unvisited neighbor with the same root as the current cell, you visit that cell and repeat the process.
The main disadvantage of this approach is that it will not tell you the shortest route through the maze, if there is more than one path.
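As a rough illustration of the solvability check only (not the route extraction), here is a small union-find sketch. The grid encoding with `#` as walls and the names `find`, `union`, and `is_solvable` are assumptions for the example; path halving is used as a simple form of path compression.

```python
def find(parent, x):
    """Return the root of x, shortening the chain as we go (path halving)."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(parent, a, b):
    """Merge the sets containing a and b."""
    root_a, root_b = find(parent, a), find(parent, b)
    if root_a != root_b:
        parent[root_a] = root_b

def is_solvable(grid, start, goal):
    """True if start and goal are open cells in the same connected region."""
    open_cells = [(r, c) for r, row in enumerate(grid)
                  for c, ch in enumerate(row) if ch != "#"]
    parent = {cell: cell for cell in open_cells}
    for r, c in open_cells:
        # linking to the cell below and to the right covers every adjacent pair once
        for neighbour in ((r + 1, c), (r, c + 1)):
            if neighbour in parent:
                union(parent, (r, c), neighbour)
    return find(parent, start) == find(parent, goal)

print(is_solvable(["S #", "  #", "# E"], (0, 0), (2, 2)))
```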

Not specifically for your case, but I've come across several programming contest questions where I found Lee's algorithm quite handy to code up quickly. It's not the most efficient for all cases, but it is easy to crank out. Here's one I hacked up for a contest.

Preventing overlap of number ranges

I failed at this problem for several hours now and just can't get my head around it. It seems fairly simple from a "human" POV, but somehow I just can't seem able to write it into code.
Situation: Given several number ranges that are defined by a starting number and the current "active" number which are assigned to specific locations (or 0 for generic ones)
startno | actualno | location
100 | 159 | 0
200 | 203 | 1
300 | 341 | 2
400 | 402 | 0
Now, as you can see, there can also be two ranges for one location. In this case, only the range with the highest startno (in this case, 400) is regarded as active, the other one only exists for history purposes.
Every user is assigned to a specific location (the same IDs as in the location column), but never to a generic one (zero).
When a user wants a new number, he will get a number assigned from a range that is assigned to his location, or, if none is found, from the highest generic one (e.g. user.location = 0 would get 403, user.location = 2 would get 342).
Then, the user can select to either use this number or an amount X starting from the assigned number.
Here comes the question: How can I assure that the ranges don't overlap into each other? Say the user (location = 2) gets the next number 342 and decides he needs 100 numbers following that. This would produce the end number to 441, which is inside the generic range, which mustn't happen.
I tried around with several nested SELECTs, using both the starting and ending number, aggregating MAX(), JOINing the table on itself, but I just can't get it 100% right.

As I understand it, I would just create a trigger on the table in the database to do the validation and raise an error if an overlap is found when the application updates the table, so the user simply gets an error saying they can't do it. Say the user wants the range to end with 441: just let them try, update the table with actualno set to 441, and have a simple SELECT compare the new number to all existing startno values; if it's bigger than any startno, raise the error. Something like the following in the update trigger:
IF EXISTS(SELECT 1 FROM
Table1
WHERE @newnumber >= startno AND id <> @currentID)
BEGIN
-- raise the error here
END
Well, maybe I missed something here and in certain cases this won't work; if so, please let me know.
Using a trigger for a data integrity check is totally OK and shouldn't be a problem at all. This would be much easier than validating ahead of time, especially when you consider that multithreading might create some big problems there.
On the other hand, to prevent this from happening too easily, I might just add a couple more zeros to those numbers as initial values:
startno | actualno | location
100000 | 100059 | 0
200000 | 200003 | 1
300000 | 300041 | 2
400000 | 400002 | 0

As so often, I found an approach not long after posting the question. It seems describing a problem so other people understand it is half-way to getting the solution. At least, I got a possible one which so far has proved to be quite resilient.
I query the database with
SELECT nostart FROM numbers
WHERE nostart BETWEEN X AND Y
where X is the start number requested and Y is the end number of the user. (To conform with my introductory example, X = 342 and Y = 441.)
This will then give me a list of all ranges whose starting number is inside the range of the numbers the user requested, in this case the list would be
nostart
400
Now, if the query doesn't find a result, I'm golden and the numbers can be used. If the query finds a single result, and that result is equal to the starting number of the user, I'm also OK because this means it's the first time a user requested something from this range.
If that is not the case, the range cannot be used, because another range is inside it. Also, if the query finds multiple results (e.g. for X = 100 and Y = 350, which would result in 100|200|300), I also deny the request, because several ranges overlap.
If anyone has a better solution or notes on this one, I'll leave this here and use it as long as it works out.
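The decision rule applied to the query result can be written down as a small function. This is only a sketch of the logic described above, with the database query replaced by a list comprehension; `can_reserve` and its parameters are hypothetical names.

```python
def can_reserve(range_starts, first, last):
    """Apply the rule above to the start numbers that fall inside [first, last].

    range_starts: start numbers of all existing ranges (the nostart column).
    first, last:  X and Y, the first and last number the user wants to take.
    """
    # equivalent of: SELECT nostart FROM numbers WHERE nostart BETWEEN X AND Y
    inside = [start for start in range_starts if first <= start <= last]
    # allowed when nothing is hit, or when the only hit is the range's own start
    return not inside or inside == [first]

starts = [100, 200, 300, 400]
print(can_reserve(starts, 342, 441))  # 400 lies inside the request
print(can_reserve(starts, 342, 350))  # no other range is touched
```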

Dynamic Regex generation for predictable repeating string patterns in a data feed

I'm currently trying to process a number of data feeds that I have no control over, where I am using Regular Expressions in C# to extract information.
The originator of the data feed is extracting basic row data from their database (like a product name, price, etc), and then formatting that data within rows of English text. For each row, some of the text is repeated static text and some is the dynamically generated text from the database.
e.g
Panasonic TV with FREE Blu-Ray Player
Sony TV with FREE DVD Player + Box Office DVD
Kenwood Hi-Fi Unit with $20 Amazon MP3 Voucher
So the format in this instance is: PRODUCT with FREEGIFT.
PRODUCT and FREEGIFT are dynamic parts of each row, and the "with" text is static. Each feed has about 2000 rows.
Creating a Regular Expression to extract the dynamic parts is trivial.
The problem is that the marketing bods in control of the data feed keep on changing the structure of the static text, usually once a fortnight, so this week I might have:
Brand new Panasonic TV and a FREE Blu-Ray Player if you order today
Brand new Sony TV and a FREE DVD Player + Box Office DVD if you order today
Brand new Kenwood Hi-Fi unit and a $20 Amazon MP3 Voucher if you order today
And next week it will probably be something different, so I have to keep modifying my Regular Expressions...
How would you handle this?
Is there an algorithm to determine static and variable text within repeating rows of strings? If so, what would be the best way to use the output of such an algorithm to programatically create a dynamic Regular Expression?
Thanks for any help or advice.

This code isn't perfect, it certainly isn't efficient, and it's very likely to be too late to help you, but it does work. If given a set of strings, it will return the common content above a certain length.
However, as others have mentioned, an algorithm can only give you an approximation, as you could hit a bad batch where all products have the same initial word, and then the code would accidentally identify that content as static. It may also produce mismatches when dynamic content shares values with static content, but as the size of samples you feed into it grows, the chance of error will shrink.
I'd recommend running this on a subset of your data (20000 rows would be a bad idea!) with some sort of extra sanity checking (max # of static elements etc)
Final caveat: it may do a perfect job, but even if it does, how do you know which item is the PRODUCT and which one is the FREEGIFT?
The algorithm
If all strings in the set start with the same character, add that character to the "current match" set, then remove the leading character from all strings
If not, remove the first character from all strings whose first x (minimum match length) characters aren't contained in all the other strings
As soon as a mismatch is reached (case 2), yield the current match if it meets the length requirement
Continue until all strings are exhausted
The implementation
private static IEnumerable<string> FindCommonContent(string[] strings, int minimumMatchLength)
{
    string sharedContent = "";
    while (strings.All(x => x.Length > 0))
    {
        var item1FirstCharacter = strings[0][0];
        if (strings.All(x => x[0] == item1FirstCharacter))
        {
            sharedContent += item1FirstCharacter;
            for (int index = 0; index < strings.Length; index++)
                strings[index] = strings[index].Substring(1);
            continue;
        }
        if (sharedContent.Length >= minimumMatchLength)
            yield return sharedContent;
        sharedContent = "";
        // If the first minMatch characters of a string aren't in all the other strings, consume the first character of that string
        for (int index = 0; index < strings.Length; index++)
        {
            string testBlock = strings[index].Substring(0, Math.Min(minimumMatchLength, strings[index].Length));
            if (!strings.All(x => x.Contains(testBlock)))
                strings[index] = strings[index].Substring(1);
        }
    }
    if (sharedContent.Length >= minimumMatchLength)
        yield return sharedContent;
}
Output
Set 1 (from your example):
FindCommonContent(strings, 4);
=> "with "
Set 2 (from your example):
FindCommonContent(strings, 4);
=> "Brand new ", "and a ", "if you order today"
Building the regex
This should be as simple as:
"^(.*)" + string.Join("(.*)", FindCommonContent(strings, 4)) + "(.*)$";
=> "^(.*)Brand new (.*)and a (.*)if you order today(.*)$"
Although you could modify the algorithm to return information about where the matches are (between or outside the static content), this will be fine, as you know some will match zero-length strings anyway.
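To show what extraction with such a generated pattern might look like, here is a small Python sketch. It assumes the static parts have already been found; `build_pattern` is a hypothetical helper, the static text is escaped so characters like `$` or `+` are matched literally, and the inner groups are made lazy so each one stops at the next piece of static text.

```python
import re

def build_pattern(static_parts):
    """Join the static parts with capture groups for the dynamic text between them."""
    body = "(.*?)".join(re.escape(part) for part in static_parts)
    return re.compile("^(.*?)" + body + "(.*)$")

pattern = build_pattern(["Brand new ", "and a ", "if you order today"])
row = "Brand new Sony TV and a FREE DVD Player + Box Office DVD if you order today"
match = pattern.match(row)
fields = [group.strip() for group in match.groups()]
print(fields)
```

The first and last groups are empty for this row because the static text starts and ends the line, which matches the remark above that some groups will match zero-length strings.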

I think it would be possible with an algorithm, but the time it would take you to code it versus simply updating the Regular Expression might not be worth it.
You could however make your changing process faster. If instead of having your Regex String inside your application, you'd put it in a text file somewhere, you wouldn't have to recompile and redeploy everything every time there's a change, you could simply edit the text file.
Depending on your project size and implementation, this could save you a generous amount of time.

Irony: How to give KeyTerm precedence over variable?

Relevant chunk of Irony grammar:
var VARIABLE = new RegexBasedTerminal("variable", @"(?-i)\$?\w+");
variable.Rule = VARIABLE;
tag_blk.Rule = html_tag_kw + attr_args_opt + block;
term_simple.Rule = NUMBER | STRING | variable | boolean | "null" | term_list;
term.Rule = term_simple | term_filter;
block.Rule = statement_list | statement | ";";
statement.Rule = tag_blk | directive_blk | term;
The problem is that both a "tag" and a "variable" can appear in the same place. I want my parser to prefer the tag over the variable, but it always prefers the variable. How can I change that?
I've tried changing tag_blk.Rule to PreferShiftHere() + html_tag_kw + attr_args_opt + block; and ImplyPrecedenceHere(-100) + html_tag_kw + attr_args_opt + block; but it doesn't help any. The parser doesn't even complain of an ambiguity.

Try changing the order of 'tag_blk.Rule' and 'variable.Rule' as tokenisers usually go after first match, and variable is first in your list.

You can increase the Priority of the tag_blk Terminal or decrease the one of variable whichever suits your purpose. Terminal class has a Priority field defaulting to 0. According to the comment right above it
// Priority is used when more than one terminal may match the input char.
// It determines the order in which terminals will try to match input for a given char in the input.
// For a given input char the scanner uses the hash table to look up the collection of terminals that may match this input symbol.
// It is the order in this collection that is determined by Priority property - the higher the priority,
// the earlier the terminal gets a chance to check the input.
Unfortunately I can't test this at the moment as the code fragment provided needs work and lots of assumptions to be made compilable. But from the description above this should be the one you are looking for. Hope it helps someone -even 10 years after the question aired.
