I am not very good at regex, and frankly find it difficult to wrap my head around. Therefore my question may not make any sense.
Could you use regular expressions so that, when someone enters a string, the closest fit from a list is found and the input is made to match one of the entries?
Here is what the list might look like.
QR9456
QR6222
QR9487
QR2311
QR2311 AB
QR2311 A
QR4781
QR4781 A
XX920-009
QR9456 Z
I apologize if this question can't be answered or doesn't make sense.
Nope. Regular expressions only describe exact matches against the patterns you specify. I doubt you could handcraft patterns to match the list above satisfactorily, much less define regular expressions that work for any list.
It sounds like what you are after is a fuzzy search algorithm such as bitap.
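As a rough illustration of the "closest fit from a list" idea, here is a minimal Python sketch. It does not implement bitap; it swaps in the standard library's `difflib.get_close_matches`, which ranks candidates by a sequence-similarity ratio. The names `KNOWN_CODES` and `closest_entry` are made up for this example, and uppercasing the input is an assumption about how the codes are stored.

```python
from difflib import get_close_matches

# The list from the question (hypothetical variable name).
KNOWN_CODES = [
    "QR9456", "QR6222", "QR9487", "QR2311", "QR2311 AB",
    "QR2311 A", "QR4781", "QR4781 A", "XX920-009", "QR9456 Z",
]


def closest_entry(user_input, entries=KNOWN_CODES):
    """Return the entry most similar to user_input, or None if entries is empty."""
    # cutoff=0.0 means "always return the best candidate, however poor";
    # raise it to reject inputs that resemble nothing in the list.
    cleaned = user_input.strip().upper()
    matches = get_close_matches(cleaned, entries, n=1, cutoff=0.0)
    return matches[0] if matches else None


print(closest_entry("xx920009"))
print(closest_entry("QR2311 A"))
```

With a cutoff of 0.0 every input is forced onto some entry, which is what the question asks for; whether that is desirable for inputs that are nowhere near any code is a separate design decision.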
Related
I have a question: can I use String.Contains("-") to search for the specific character "-" followed by one or more digits?
So it should match strings such as:
search -12
t-123est
but should not match:
search-t12
t-est123
Your best option might not be String.Contains; you might be better served by Regex.IsMatch. With that you can define a regular expression that matches exactly what you need. You can use sites like https://www.regex101.com/ to test your expression and make sure it covers your cases. In your case, you can use
Regex.IsMatch(myString, @"-\d+")
This would be enough:
Regex.IsMatch("search -12", @"-\d")
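To check the pattern against the four strings from the question, here is a small Python sketch using the same expression. The helper name `has_dash_then_digit` is made up for illustration; `re.search` plays the role that Regex.IsMatch plays in the answers above, since both look for the pattern anywhere in the string.

```python
import re

# A hyphen immediately followed by a digit. One digit is enough to decide
# whether the string matches, so "-\d" and "-\d+" accept the same strings.
DASH_THEN_DIGIT = re.compile(r"-\d")


def has_dash_then_digit(text):
    """Return True if text contains '-' directly followed by a digit."""
    return DASH_THEN_DIGIT.search(text) is not None


for sample in ["search -12", "t-123est", "search-t12", "t-est123"]:
    print(sample, has_dash_then_digit(sample))
```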
I'm writing a word game. I have access to the dictionary object to validate the words. I need to find all possible words that contain a given word plus letters from a set of additional characters.
For example:
Let's say the word is "MEN" and the set of additional characters is "WALOHTD". I need a way to find words like:
1. MEND
2. WOMEN
3. MENTAL
4. etc.
Basically we are looking for all possible words that contain "MEN" and any of the specific additional characters.
I can certainly write code that loops through the entire dictionary to first find words that contain the subword and then checks for the specific characters' existence, but that is not optimal. It's taking more than a second. Any help towards an optimal solution is greatly appreciated.
The problem is a mixture of a regular-language problem and a data-structure search problem.
Considering the first aspect alone, we'd be inclined to use a regular expression. You don't say if we can repeat the "additional characters". If we can, it's easy enough: [WALOTHD]*MEN[WALOTHD]* for your case, and that's easily adapted.
If we can't repeat, then we can start with [WALOTHD]{0,7}MEN[WALOTHD]{0,7} and filter out any that break the rule ("ALLOTMENT" matches that expression, but repeats L and T).
Or we can try to build a much more complicated regular expression, though I'm not sure the gains from the better expression would outweigh the cost of working it out.
Coming from the other side, searching a dictionary: a DAWG is very space-efficient and makes finding matches that contain substrings relatively efficient. It's not a complete match to this puzzle, as we have quite a few permutations of prefixes and suffixes to worry about. Without testing, I'd guess it'd be reasonably good if we can't repeat from the "additional" characters, and horrible if we can. But that is just a guess. A GADDAG might well be worth looking at; it'd be bigger than a DAWG, but likely faster for this sort of search (GADDAGs are used in Scrabble-solving, which is pretty much the same problem that you have here).
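The regex-then-filter idea from this answer can be sketched in Python as follows. This is only the brute-force version over a plain word list, not a DAWG or GADDAG; the function names and the `allow_repeats` flag are invented for the example, and it assumes `extras` is non-empty and all words are uppercase.

```python
import re
from collections import Counter


def find_words(words, core, extras, allow_repeats=False):
    """Return words made of `core` surrounded only by letters from `extras`."""
    letter_class = "[" + re.escape(extras) + "]*"
    pattern = re.compile(letter_class + re.escape(core) + letter_class)
    budget = Counter(extras)
    found = []
    for word in words:
        # First pass: the regular expression from the answer.
        if not pattern.fullmatch(word):
            continue
        # Second pass: when repeats are not allowed, drop words that use an
        # extra letter more often than it is available.
        if allow_repeats or _fits_budget(word, core, budget):
            found.append(word)
    return found


def _fits_budget(word, core, budget):
    """True if some occurrence of core leaves only letters covered by budget."""
    start = word.find(core)
    while start != -1:
        rest = Counter(word[:start] + word[start + len(core):])
        if not (rest - budget):  # nothing left over after subtracting
            return True
        start = word.find(core, start + 1)
    return False


sample = ["MEND", "WOMEN", "MENTAL", "ALLOTMENT", "MENU"]
print(find_words(sample, "MEN", "WALOHTD"))
print(find_words(sample, "MEN", "WALOHTD", allow_repeats=True))
```

"ALLOTMENT" passes the regular expression but is filtered out in the no-repeat case because it needs two Ls and two Ts, which is the situation the answer describes.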
I need to parse values out of a text that looks like this:
Description. Question?
A. First Answer
B. Second Answer
C. Third Answer
Answer: A, B
Now I need to find out the description, the question, the answers, and which answers are correct. Is that possible with regex? I know it should be possible, but I'm not a regex expert.
Seriously, regex is great, but once the parsing logic becomes advanced, so does the regex needed to solve the problem. I would suggest breaking up the logic into smaller pieces (I take it you have some sort of scripting language available to do some preprocessing?).
Even if you get the whole thing matched with one killer regex - changing it later (by you or some other sorry person) would be a pain.
I would match the answers with something like this (You'd need to strip the commas):
^Answer: (\w,?)+
And then I'd do logic to reparse the text with the answers found with the first regex, with something like this (rebuilding the match, in this case A was an answer):
^A\.\s(.*)
It might not be something to flash your friends with, but it will be easier to maintain, and a heck of a lot easier to understand.
Just about anything you could possibly want to do with parsing text is possible with regular expressions, but you will have to invest some time to learn them. How tricky your particular task is depends on how consistent your body of text is. So in short, yes, but don't ask me for the regex! Good luck.
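The "several small expressions" approach could look roughly like this in Python. It is a sketch under assumptions: option labels are single characters followed by a full stop, and the correct answers sit on one line starting with "Answer:". The name `parse_quiz` and the slightly loosened patterns are mine, not the answerer's.

```python
import re

SAMPLE = """Description. Question?
A. First Answer
B. Second Answer
C. Third Answer
Answer: A, B"""


def parse_quiz(text):
    """Return (options, correct): a label->text dict and a list of labels."""
    # Step 1: find the "Answer:" line and pull the labels out of it.
    answer_line = re.search(r"^Answer:\s*(.+)$", text, re.MULTILINE)
    correct = re.findall(r"\w", answer_line.group(1)) if answer_line else []
    # Step 2: collect every "X. some text" line as an option.
    options = dict(re.findall(r"^(\w)\.\s(.*)$", text, re.MULTILINE))
    return options, correct


options, correct = parse_quiz(SAMPLE)
print(options)
print(correct)
print([options[label] for label in correct])
```

Using `re.findall(r"\w", ...)` on the answer line sidesteps the comma-and-space stripping mentioned above, at the cost of only supporting one-character labels.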
If you could be more specific with your example and show an actual question and description it would be easier to tell for sure, but if I'm reading this right you could find all the text up to the last full stop "." before the question-mark "?", then find the text after it up to the question mark "?", and finally use the letters with full stops "." right after them, so something like this pseudo:
lastFullStopBeforeQ = text.substring(0 to first question mark).lastIndexOf(".")
Description = text.substring(0 to lastFullStopBeforeQ)
Question = text.substring(lastFullStopBeforeQ+1 to first question mark)
Answers[0] = text.substring(first question mark+1 to next "\n") ...
CorrectAnswers[0] = text.substring(next index of "Answer:" to next ",") ...
I know this is possible using C#; if you use something else then I can't give you a clear answer.
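For comparison, the first two steps of that pseudocode (splitting the description from the question) can be written with plain string searches. This Python sketch assumes the question ends at the first question mark and the description ends at the last full stop before it; the function name is hypothetical.

```python
def split_description_and_question(text):
    """Split 'Description. Question?' style text into its two parts."""
    q_mark = text.find("?")
    if q_mark == -1:
        raise ValueError("no question mark found")
    # Last full stop that appears before the first question mark.
    last_stop = text.rfind(".", 0, q_mark)
    # If there is no full stop, treat the description as empty.
    description = text[:last_stop].strip() if last_stop != -1 else ""
    question = text[last_stop + 1:q_mark + 1].strip()
    return description, question


print(split_description_and_question("Description. Question?\nA. First Answer"))
print(split_description_and_question("Why?"))
```

A description that itself contains a question mark would break this, which is one reason to ask for a real sample before settling on the rule.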
We use Lucene.NET to implement a full text search on a client's website. The search itself works already but we now want to implement a modification.
Currently a * is appended to every term, which leads Lucene to perform what I would classify as a StartsWith search.
In the future we would like to have a search that performs something like a Contains rather than a StartsWith.
We use
Lucene.Net 2.9.2.2
StandardAnalyzer
default QueryParser
Samples:
(Title:Orch*) matches: Orchestra
but:
(Title:rch*) does not match: Orchestra
We want the first and the second one to both match Orchestra.
Basically I want the exact opposite of what was asked in this question; I'm not sure why Lucene performed a Contains rather than a StartsWith by default for that person:
Why is this Lucene query a "contains" instead of a "startsWith"?
How can we make this happen?
I have the feeling it has something to do with the Analyzer but I'm not sure.
First off, I assume you're using StandardAnalyzer, or something similar. The linked question fails to recognize that you search for terms; in that case a* matches "Fleet Africa" because it's tokenized into "fleet" and "africa".
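The effect of searching terms rather than whole strings can be mimicked in a few lines of Python. This is a toy model, not Lucene: the tokenizer below just lowercases and splits on non-word characters, which is only a rough stand-in for what StandardAnalyzer does, and both function names are invented.

```python
import re


def tokenize(text):
    """Very rough stand-in for an analyzer: lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def prefix_query_matches(text, prefix):
    """True if any token of text starts with prefix (like a 'prefix*' query)."""
    prefix = prefix.lower()
    return any(token.startswith(prefix) for token in tokenize(text))


print(prefix_query_matches("Fleet Africa", "a"))   # matches the token "africa"
print(prefix_query_matches("Orchestra", "Orch"))
print(prefix_query_matches("Orchestra", "rch"))    # no token starts with "rch"
```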
You need to call QueryParser.SetAllowLeadingWildcard(true) to be able to write queries like field:*value*. Are you actually changing the string that's passed to QueryParser?
You could parse the query as usual, and then implement a QueryVisitor that rewrites all TermQuery into WildcardQuery. That way you still support phrase searches.
I see nothing good in rewriting queries into prefix or wildcard queries. There is very little shared between an orc, or a chest, and an Orchestra, but both words will match. Instead, hook up your customer with an analyzer that supports stemming and synonyms, and provide a spell correction feature to fix simple searching mistakes.
@Simon Svensson probably gave the better answer (i.e. you don't need this), but if you do, you should use a Shingle Filter.
Note that this will make your index massively larger, since instead of just storing "orchestra", you will store "orc", "rch", "che", "hes"... But just having a plain term query with leading wildcards will be massively slow. It will essentially have to look through every single term in your corpus.
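To make the index-size trade-off concrete, here is a small Python sketch of a character-trigram index. It is not Lucene code and does not use any Lucene filter; `char_ngrams`, `build_index` and `contains_search` are names made up for the example. Each word is stored under all of its trigrams, and a "contains" query becomes a lookup of the query's own trigrams followed by a substring check.

```python
from collections import defaultdict


def char_ngrams(term, n=3):
    """All overlapping character n-grams of term, lowercased."""
    term = term.lower()
    return [term[i:i + n] for i in range(len(term) - n + 1)]


def build_index(titles, n=3):
    """Map every n-gram of every word to the titles that contain it."""
    index = defaultdict(set)
    for title in titles:
        for word in title.split():
            for gram in char_ngrams(word, n):
                index[gram].add(title)
    return index


def contains_search(index, fragment, n=3):
    """Titles containing fragment; fragment must be at least n characters."""
    grams = char_ngrams(fragment, n)
    if not grams:
        return set()  # too short to look up in an n-gram index
    candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # The n-gram lookup can over-match, so confirm with a real substring test.
    return {t for t in candidates if fragment.lower() in t.lower()}


index = build_index(["Orchestra", "Orchard Road", "Chess Club"])
print(char_ngrams("orchestra"))
print(contains_search(index, "rch"))
```

A nine-letter word produces seven trigram entries instead of one term, which is where the index growth comes from; the lookup itself no longer has to scan every term.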
I'd like to give users the ability to search through a large list of businesses, but still find near matches.
Does anyone have any recommendations on how best to go about this when you're not targeting simple dictionary words, but instead complex names like ABC Business Name?
Regards.
Check out the Wikipedia article on Levenshtein distance. It's a fairly simple concept to wrap your head around, and the algorithm is pretty easy to implement in whichever language you are using (in your case, C#).
I found an example in C# for you here.
Also, here is an example of a spelling corrector from Peter Norvig of Google. It was said on the SO podcast a few episodes ago that Jon Skeet attempted a rewrite of this same algorithm in C#. Not sure if he completed it and/or made it publicly available though.
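For reference, a compact Python sketch of Levenshtein distance and a naive nearest-name lookup built on it. The business names and the `max_distance` threshold are illustrative assumptions, and comparing the query against every name is only practical for modest list sizes.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def nearest_names(query, names, max_distance=2):
    """Names within max_distance edits of query, closest first."""
    query = query.lower()
    scored = sorted((levenshtein(query, name.lower()), name) for name in names)
    return [name for distance, name in scored if distance <= max_distance]


businesses = ["ABC Business Name", "XYZ Holdings"]  # hypothetical data
print(levenshtein("kitten", "sitting"))
print(nearest_names("ABC Busines Name", businesses))
```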
Consider combining keyword matching with edit-distance-based similarity. You might also combine this with data that maps what users originally searched for to what they actually clicked.
This is probably a crazy solution, but could you split the business name by space and then search on either all the items or maybe just the first couple?
So you might search on 'ABC' and 'Business' but leave out 'Name' as this might take too long.
You might even check to see if the string is of a certain length, then trim and just search on the first say 5 letters.
Have you had a look at "soundex" as a way of searching through your businesses? Again, I think you'd need to split the name by space.
You might check out the SQL Server SOUNDEX and DIFFERENCE functions. SOUNDEX converts a sequence of characters (such as a word) into a 4-character code which will be the same for similar-sounding words. DIFFERENCE gives a number which represents how "different" two strings are based on sound.
You could, for example, create a computed column based on the SOUNDEX function and match on that column later. Or you could use DIFFERENCE in a WHERE clause.
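To show what a 4-character sound code looks like, here is a Python sketch of the commonly described American Soundex rules. It is an illustration only: SQL Server's SOUNDEX may differ from this in edge cases, and the `soundex` function below is not a port of it.

```python
# Letter groups that share a Soundex digit.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word):
    """Return the 4-character Soundex code of word ('' if it has no letters)."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    previous = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = _CODES.get(ch, "")  # vowels (and Y) have no code
        if code and code != previous:
            result += code
        previous = code
    return (result + "000")[:4]  # pad or truncate to four characters


for name in ["Robert", "Rupert", "Tymczak", "Pfister"]:
    print(name, soundex(name))
```

Similar-sounding names such as "Robert" and "Rupert" collapse to the same code, which is what makes matching on a stored code column workable.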