I have a list of different types of images I need to store in a database, they all have a type description such as Indoor or GardenSummer and things like that, but there are a lot of the descriptions that contain repeated words, like GardenSummer and AreaSummer1KM for example both contain "Summer", so is there a way for me to do something like this in c#:
open System
let strs = ["Kitchen"; "GardenSummer"; "GardenWinter"; "AreaSummer1KM"; "PoolIndoors"; "LivingRoom"; "BathRoom"]
let switch (x: string) = match x with
| a when a.Contains "Summer" -> Some "Summer" // here
| b when b.Contains "Winter" -> Some "Winter" // here
| "Exterior" | "ParkFacilities" -> Some "Outdoors"
| "Kitchen" | "Landing" -> Some "Indoors"
| c when c.Contains "Room" -> Some "Indoors" // and here
| _ -> None
let sorted = List.map switch strs
// part from here and down was just added to print the contents, and isn't a part of the issue
let printOption = function
| Some v -> v.ToString () |> Console.WriteLine
| None -> "No Match" |> Console.WriteLine
List.iter printOption sorted
is there a way for me to switch on str.Contains(str2) without making a bunch of else ifs?
The short answer is no.
The slightly longer answer is "no, for a good reason". The switch statement is actually quite a smart statement, that performs better than a chain of if-else if statements in many cases (a good example being the typical switch (MessageType) ...). To do this, however, it requires certain contracts to be held. In the end, it doesn't evaluate every possibility. It performs something similar to a binary search on the possible options.
In the end, your F# code probably does the equivalent of if-else if statements, rather than the equivalent of switch in C#.
Of course, nothing prevents you from creating your own method that would be syntactically similar to F#'s match. Anonymous delegates, generic functions, all those make it rather easy to write such syntax shorteners :)
And of course, there's other options too, like using regular expressions or such. Calling Contains 10 times in a row is going to mean a significant performance penalty if the searched string is long.
Some sample regexes for your data and switches. The common code is as follows:
void Main()
{
var data =
new []
{
"Kitchen", "GardenSummer", "GardenWinter", "AreaSummer1KM",
"PoolIndoors", "LivingRoom", "BathRoom", "Exterior", "ParkFacilities"
};
foreach (var str in data)
{
Matcher(str).Dump();
}
}
Now the thing we're going to change is the Matcher method implementation.
First, to just simplify the whole thing and avoid multiple string matching (comparing strings isn't exactly free):
Regex matcherRegex = new Regex("(Summer)|(Winter)|(^Exterior|ParkFacilities$)",
RegexOptions.Compiled);
string Matcher(string input)
{
var m = matcherRegex.Match(input);
if (m.Groups.Count == 4)
{
if (m.Groups[0].Success) return "Summer";
else if (m.Groups[1].Success) return "Winter";
else if (m.Groups[2].Success) return "Outdoors";
}
return null;
}
So, we still have a if-else chain, but we no longer traverse the strings multiple times. It also allows you to easily specify the conditions you want.
One way to improve this to be more "switchy" is by using LINQ. This is definitely not something you want to do for performance reasons, it's only about aesthetics:
var groupIndex = m.Groups.OfType<Group>()
.Skip(1)
.Select((i, idx) => new { Item = i, Index = idx + 1 })
.Where(i => i.Item.Success)
.Select(i => i.Index)
.FirstOrDefault();
switch (groupIndex)
{
case 0: return null;
case 1: return "Summer";
case 2: return "Winter";
case 3: return "Outdoors";
}
Basically, I get the index of the matched group, and use a switch on that. As I said before, this is probably going to be slower than the first variant, at least due to the LINQ overhead.
You can also use named captures to get the matched groups by name, rather than by index, which is a bit more maintainable. Also, for simple cases, you could use the named group name to avoid the switch altogether:
Regex matcherRegex =
new Regex("(?<Summer>Summer)"
+ "|(?<Winter>Winter)"
+ "|(?<Outdoors>(^Exterior|ParkFacilities$))",
RegexOptions.Compiled | RegexOptions.ExplicitCapture);
string Matcher(string input)
{
return matcherRegex.Match(input)
.Groups
.OfType<Group>()
.Select((i, idx) => new { Item = i, Index = idx })
.Skip(1)
.Where(i => i.Item.Success)
.Select(i => matcherRegex.GroupNameFromNumber(i.Index))
.FirstOrDefault();
}
All of those are just samples, you may want to change those for better edge case or exception handling, and performance, but it shows the ideas.
The last version in particular is handy in that there's nothing preventing you from using this as a common method that handles all the string "switches" that you can explain in regular expressions. Sadly, group names allow a lot of unicode characters, but not whitespaces; it's nothing you couldn't work around, though.
You could even build the pattern matcher automatically, for example by passing Expression<Func<...>> to a helper method, but that's going into complicated territory :)
Related
Personally, I only know that dynamic cannot be used in pattern matching which is considered a pity :(
dynamic foo = 10;
switch(foo) {
case int i:
break;
}
Also, valued tuple/neo-tuples cannot be used in pattern matching:
dynamic foo = (420, 360);
switch(foo) {
case (int, int) i:
break;
}
It was removed in current version of C#7 and was assigned for future usage.
What are the other things I cannot do?
The new pattern matching features in C# 7 consist of the following:
Support for type switching,
Simple use of var patterns,
The addition of when guards to case statements,
the x is T y pattern expression.
Your examples focus on the first of these. And type switching is likely to be the most popular and commonly used of these new features. Whilst there are limitations, such as those that you mention, other features can be used to work around many of them.
For example, your first limitation is easily solved by boxing foo to object:
dynamic foo = 10;
switch ((object)foo)
{
case int i:
Console.WriteLine("int");
break;
default:
Console.WriteLine("other");
break;
}
Will print int as expected.
The var pattern and a guard can be used to work around your second restriction:
dynamic foo = (420, 360);
switch (foo)
{
case var ii when ii.GetType() == typeof((int, int)):
Console.WriteLine("(int,int)");
break;
default:
Console.WriteLine("other");
break;
}
will print (int,int).
Additionally, value tuples can be used for type switching, you just have to use the long-hand syntax:
var foo = (420, 360);
switch (foo)
{
case ValueTuple<int,int> x:
Console.WriteLine($"({x.Item1},{x.Item2})");
break;
default:
Console.WriteLine("other");
break;
}
The above will print (420,360).
For me, personally, the biggest restriction to pattern matching in C# 7 is the lack of pattern matching expressions using the match keyword. Originally, the following was planned for this release, but was pulled due to time constraints:
var x = 1;
var y = x match (
case int _ : "int",
case * : "other"
);
This can be worked around using switch, but it's messy:
var x = 1;
var y = IntOrOther(x);
...
private string IntOrOther(int i)
{
switch (i)
{
case int _ : return "int";
default: return "other";
}
}
But help is at hand here with numerous 3rd party pattern matching libraries, such as my own Succinc<T> library, which let's you write it as:
var x = 1;
var y = x.TypeMatch().To<string>()
.Caseof<int>().Do("int")
.Else("other")
.Result();
It's not as good as having the match keyword, but it's an optional workaround until that feature appears in a later language release.
To really understand the restrictions imposed by C# 7, it's worth referring to the the pattern matching spec on GitHub and comparing that with what will be in the next release of C#. Looking at it though, it becomes apparent that there are work-arounds to all of these.
This question was originally closed because it's open-ended as currently phrased. To give a couple of silly examples, restrictions to C# 7's pattern matching are that it won't make you a perfect cup of coffee, or fly you across the world in seconds ... but I prefer to answer the spirit of the question. And the answer really is that the only restriction is your imagination. If you don't let that restrict you, then one must take into account the fact that the work-arounds have readability and/or performance implications. They are likely the only real-world restrictions.
I have a data interpretation algorithm & actual data. Using this algorithm, I have to interpret the actual data and display it as a report.
For this, Firstly I need to create a form which will accept some variable values from user. The variables are defined in pseudocode as below. (one example given)
AGEYEARS {
Description: Age in Years
Type: Range;
MinVal: 0;
MaxVal: 124;
Default: 0;
ErrorAction: ERT1:=04 GRT4:=960Z;
}
I have several variables like this in my Variables.txt file. I don't wish to use StreamReader, read it line by ine & interpret the variables.
Instead, I am looking for some logic, which can read XXXX { } as one object and Type:Range as Attribute:Value. This way, I can skip one step of reading the file and converting it to a understandable code.
Like this, I also have other files which has conditions to check. For ex,
IF SEX = '9' THEN
SEX:=U
ENDIF
Is there any way to interpret them easily and faster? Can someone help me with it?
I am using C# as my programming language.
So you need a parser for a DSL.
I can advise you ANTLR, which will let you build a grammar easily.
Here's a totally untested simple grammar for it:
grammar ConfigFile;
file: object+;
object: ID '{' property+ '}';
property: ID ':' value ';';
value: (ID|CHAR)+;
ID: [a-zA-Z][a-zA-Z0-9_]*;
WS: [ \t\r\n]+ -> channel(HIDDEN);
CHAR: .;
Alternate solution: You also could use regex:
(?<id>\w+)\s*\{\s*(?:(?<prop>\w+)\s*:\s*(?<value>.+?)\s*;\s*)*\}
Then extract the captured information. For each match, you'll have a group id with the name of the object. The groups prop and value will have multiple captures, each pair defining a property.
In C#:
var text = #"
AGEYEARS {
Description: Age in Years;
Type: Range;
MinVal: 0;
MaxVal: 124;
Default: 0;
ErrorAction: ERT1:=04 GRT4:=960Z;
}
OTHER {
Foo: Bar;
Bar: Baz;
}";
var re = new Regex(#"(?<id>\w+)\s*\{\s*(?:(?<prop>\w+)\s*:\s*(?<value>.+?)\s*;\s*)*\}");
foreach (Match match in re.Matches(text))
{
Console.WriteLine("Object {0}:", match.Groups["id"].Value);
var properties = match.Groups["prop"].Captures.Cast<Capture>();
var values = match.Groups["value"].Captures.Cast<Capture>();
foreach (var property in properties.Zip(values, (prop, value) => new {name = prop.Value, value = value.Value}))
{
Console.WriteLine(" {0} = {1}", property.name, property.value);
}
Console.WriteLine();
}
This solution is not as "pretty" as the parser one, but works without any external lib.
I advice you against using regular expressions. Maybe it will work at start, but if your task will become a bit more complex it might be the case regex won't solve your problem, because it technically cannot do this.
The better choice (for the price of adding library) is using some parser. For C# there might not be as many as for other languages, but there are enough -- just take your pick :-). You have Irony, Coco/R, GOLD, ANTLR, LLLPG, Sprache, or my NLT.
If you sense that you will have mathematical precedence issues (i.e. you will have to work with evaluating of expressions like "5+5*2" which should give 15, not 20) than compare top-down parsers -- ANLTR is one of them -- syntax first against bottom-up parsers -- NLT for example. Usually in the first ones you have to write rules in quirky order (you have to embed the rules) while in the latter ones you have just to set the order of them (stating * goes before +). In other words, rules are separated from precedence.
I'm trying to understand how well C# and F# can play together. I've taken some code from the F# for Fun & Profit blog which performs basic validation returning a discriminated union type:
type Result<'TSuccess,'TFailure> =
| Success of 'TSuccess
| Failure of 'TFailure
type Request = {name:string; email:string}
let TestValidate input =
if input.name = "" then Failure "Name must not be blank"
else Success input
When trying to consume this in C#; the only way I can find to access the values against Success and Failure (failure is a string, success is the request again) is with big nasty casts (which is a lot of typing, and requires typing actual types that I would expect to be inferred or available in the metadata):
var req = new DannyTest.Request("Danny", "fsfs");
var res = FSharpLib.DannyTest.TestValidate(req);
if (res.IsSuccess)
{
Console.WriteLine("Success");
var result = ((DannyTest.Result<DannyTest.Request, string>.Success)res).Item;
// Result is the Request (as returned for Success)
Console.WriteLine(result.email);
Console.WriteLine(result.name);
}
if (res.IsFailure)
{
Console.WriteLine("Failure");
var result = ((DannyTest.Result<DannyTest.Request, string>.Failure)res).Item;
// Result is a string (as returned for Failure)
Console.WriteLine(result);
}
Is there a better way of doing this? Even if I have to manually cast (with the possibility of a runtime error), I would hope to at least shorten access to the types (DannyTest.Result<DannyTest.Request, string>.Failure). Is there a better way?
Working with discriminated unions is never going to be as straightforward in a language that does not support pattern matching. However, your Result<'TSuccess, 'TFailure> type is simple enough that there should be some nice way to use it from C# (if the type was something more complicated, like an expression tree, then I would probably suggest to use the Visitor pattern).
Others already mentioned a few options - both how to access the values directly and how to define Match method (as described in Mauricio's blog post). My favourite method for simple DUs is to define TryGetXyz methods that follow the same style of Int32.TryParse - this also guarantees that C# developers will be familiar with the pattern. The F# definition looks like this:
open System.Runtime.InteropServices
type Result<'TSuccess,'TFailure> =
| Success of 'TSuccess
| Failure of 'TFailure
type Result<'TSuccess, 'TFailure> with
member x.TryGetSuccess([<Out>] success:byref<'TSuccess>) =
match x with
| Success value -> success <- value; true
| _ -> false
member x.TryGetFailure([<Out>] failure:byref<'TFailure>) =
match x with
| Failure value -> failure <- value; true
| _ -> false
This simply adds extensions TryGetSuccess and TryGetFailure that return true when the value matches the case and return (all) parameters of the discriminated union case via out parameters. The C# use is quite straightforward for anyone who has ever used TryParse:
int succ;
string fail;
if (res.TryGetSuccess(out succ)) {
Console.WriteLine("Success: {0}", succ);
}
else if (res.TryGetFailure(out fail)) {
Console.WriteLine("Failuere: {0}", fail);
}
I think the familiarity of this pattern is the most important benefit. When you use F# and expose its type to C# developers, you should expose them in the most direct way (the C# users should not think that the types defined in F# are non-standard in any way).
Also, this gives you reasonable guarantees (when it is used correctly) that you will only access values that are actually available when the DU matches a specific case.
A really nice way to do this with C# 7.0 is using switch pattern matching, it's allllmost like F# match:
var result = someFSharpClass.SomeFSharpResultReturningMethod()
switch (result)
{
case var checkResult when checkResult.IsOk:
HandleOk(checkResult.ResultValue);
break;
case var checkResult when checkResult.IsError:
HandleError(checkResult.ErrorValue);
break;
}
EDIT: C# 8.0 is around the corner and it is bringing switch expressions, so although I haven't tried it yet I am expecting we will be able to do something like this this:
var returnValue = result switch
{
var checkResult when checkResult.IsOk: => HandleOk(checkResult.ResultValue),
var checkResult when checkResult.IsError => HandleError(checkResult.ErrorValue),
_ => throw new UnknownResultException()
};
See https://blogs.msdn.microsoft.com/dotnet/2018/11/12/building-c-8-0/ for more info.
How about this? It's inspired by #Mauricio Scheffer's comment above and the CSharpCompat code in FSharpx.
C#:
MyUnion u = CallIntoFSharpCode();
string s = u.Match(
ifFoo: () => "Foo!",
ifBar: (b) => $"Bar {b}!");
F#:
type MyUnion =
| Foo
| Bar of int
with
member x.Match (ifFoo: System.Func<_>, ifBar: System.Func<_,_>) =
match x with
| Foo -> ifFoo.Invoke()
| Bar b -> ifBar.Invoke(b)
What I like best about this is that it removes the possibility of a runtime error. You no longer have a bogus default case to code, and when the F# type changes (e.g. adding a case) the C# code will fail to compile.
You can use C# type aliasing to simplify referencing the DU Types in a C# File.
using DanyTestResult = DannyTest.Result<DannyTest.Request, string>;
Since C# 8.0 and later have Structural pattern matching, it's easy to do the following:
switch (res) {
case DanyTestResult.Success {Item: var req}:
Console.WriteLine(req.email);
Console.WriteLine(req.name);
break;
case DanyTestResult.Failure {Item: var msg}:
Console.WriteLine("Failure");
Console.WriteLine(msg);
break;
}
This strategy is the simplest as it works with reference type F# DU without modification.
The syntax C# syntax could be reduced more if F# added a Deconstruct method to the codegen for interop. DanyTestResult.Success(var req)
If your F# DU is a struct style, you just need to pattern match on the Tag property without the type. {Tag:DanyTestResult.Tag.Success, SuccessValue:var req}
Probably, one of the simplest ways to accomplish this is by creating a set of extension methods:
public static Result<Request, string>.Success AsSuccess(this Result<Request, string> res) {
return (Result<Request, string>.Success)res;
}
// And then use it
var successData = res.AsSuccess().Item;
This article contains a good insight. Quote:
The advantage of this approach is 2 fold:
Removes the need to explicitly name types in code and hence gets back the advantages of type inference;
I can now use . on any of the values and let Intellisense help me find the appropriate method to use;
The only downfall here is that changed interface would require refactoring the extension methods.
If there are too many such classes in your project(s), consider using tools like ReSharper as it looks not very difficult to set up a code generation for this.
I had this same issue with the Result type. I created a new type of ResultInterop<'TSuccess, 'TFailure> and a helper method to hydrate the type
type ResultInterop<'TSuccess, 'TFailure> = {
IsSuccess : bool
Success : 'TSuccess
Failure : 'TFailure
}
let toResultInterop result =
match result with
| Success s -> { IsSuccess=true; Success=s; Failure=Unchecked.defaultof<_> }
| Failure f -> { IsSuccess=false; Success=Unchecked.defaultof<_>; Failure=f }
Now I have the choice of piping through toResultInterop at the F# boundary or doing so within the C# code.
At the F# boundary
module MyFSharpModule =
let validate request =
if request.isValid then
Success "Woot"
else
Failure "request not valid"
let handleUpdateRequest request =
request
|> validate
|> toResultInterop
public string Get(Request request)
{
var result = MyFSharpModule.handleUpdateRequest(request);
if (result.IsSuccess)
return result.Success;
else
throw new Exception(result.Failure);
}
After the interop in Csharp
module MyFSharpModule =
let validate request =
if request.isValid then
Success "Woot"
else
Failure "request not valid"
let handleUpdateRequest request = request |> validate
public string Get(Request request)
{
var response = MyFSharpModule.handleUpdateRequest(request);
var result = Interop.toResultInterop(response);
if (result.IsSuccess)
return result.Success;
else
throw new Exception(result.Failure);
}
I'm using the next methods to interop unions from F# library to C# host. This may add some execution time due to reflection usage and need to be checked, probably by unit tests, for handling right generic types for each union case.
On F# side
type Command =
| First of FirstCommand
| Second of SecondCommand * int
module Extentions =
let private getFromUnionObj value =
match value.GetType() with
| x when FSharpType.IsUnion x ->
let (_, objects) = FSharpValue.GetUnionFields(value, x)
objects
| _ -> failwithf "Can't parse union"
let getFromUnion<'r> value =
let x = value |> getFromUnionObj
(x.[0] :?> 'r)
let getFromUnion2<'r1,'r2> value =
let x = value |> getFromUnionObj
(x.[0] :?> 'r1, x.[1] :? 'r2)
On C# side
public static void Handle(Command command)
{
switch (command)
{
case var c when c.IsFirstCommand:
var data = Extentions.getFromUnion<FirstCommand>(change);
// Handler for case
break;
case var c when c.IsSecondCommand:
var data2 = Extentions.getFromUnion2<SecondCommand, int>(change);
// Handler for case
break;
}
}
How to write following statement in c using switch statement in c
int i = 10;
int j = 20;
if (i == 10 && j == 20)
{
Mymethod();
}
else if (i == 100 && j == 200)
{
Yourmethod();
}
else if (i == 1000 || j == 2000) // OR
{
Anymethod();
}
EDIT:
I have changed the last case from 'and' to 'or' later. So I appologise from people who answered my question before this edit.
This scenario is for example, I just wanted to know that is it possible or not. I have google this and found it is not possible but I trust gurus on stackoverflow more.
Thanks
You're pressing for answers that will unnaturally force this code into a switch - that's not the right approach in C, C++ or C# for the problem you've described. Live with the if statements, as using a switch in this instance leads to less readable code and the possibility that a slip-up will introduce a bug.
There are languages that will evaluate a switch statement syntax similar to a sequence of if statements, but C, C++, and C# aren't among them.
After Jon Skeet's comment that it can be "interesting to try to make it work", I'm going to go against my initial judgment and play along because it's certainly true that one can learn by trying alternatives to see where they work and where they don't work. Hopefully I won't end up muddling things more than I should...
The targets for a switch statement in the languages under consideration need to be constants - they aren't expressions that are evaluated at runtime. However, you can potentially get a behavior similar to what you're looking for if you can map the conditions that you want to have as switch targets to a hash function that will produce a perfect hash the matches up to the conditions. If that can be done, you can call the hash function and switch on the value it produces.
The C# compiler does something similar to this automatically for you when you want to switch on a string value. In C, I've manually done something similar when I want to switch on a string. I place the target strings in a table along with enumerations that are used to identify the strings, and I switch on the enum:
char* cmdString = "copystuff"; // a string with a command identifier,
// maybe obtained from console input
StrLookupValueStruct CmdStringTable[] = {
{ "liststuff", CMD_LIST },
{ "docalcs", CMD_CALC },
{ "copystuff", CMD_COPY },
{ "delete", CMD_DELETE },
{ NULL, CMD_UNKNOWN },
};
int cmdId = strLookupValue( cmdString, CmdStringTable); // transform the string
// into an enum
switch (cmdId) {
case CMD_LIST:
doList();
break;
case CMD_CALC:
doCalc();
break;
case CMD_COPY:
doCopy();
break;
// etc...
}
Instead of having to use a sequence of if statements:
if (strcmp( cmdString, "liststuff") == 0) {
doList();
}
else if (strcmp( cmdString, "docalcs") == 0) {
doCalc();
}
else if (strcmp( cmdString, "copystuff") == 0) {
doCopy();
}
// etc....
As an aside, for the string to function mapping here I personally find the table lookup/switch statement combination to be a bit more readable, but I imagine there are people who might prefer the more direct approach of the if sequence.
The set of expressions you have in your question don't look particularly simple to transform into a hash - your hash function would almost certainly end up being a sequence of if statements - you would have basically just moved the construct somewhere else. Jon Skeet's original answer was essentially to turn your expressions into a hash, but when the or operation got thrown into the mix of one of the tests, the hash function broke down.
In general you can't. What you are doing already is fine, although you might want to add an else clause at the end to catch unexpected inputs.
In your specific example it seems that j is often twice the value of i. If that is a general rule you could try to take advantage of that by doing something like this instead:
if (i * 2 == j) /* Watch out for overflow here if i could be large! */
{
switch (i)
{
case 10:
// ...
break;
case 100:
// ...
break;
// ...
}
}
(Removed original answer: I'd missed the fact that the condition was an "OR" rather than an "AND". EDIT: Ah, because apparently it wasn't to start with.)
You could still theoretically use something like my original code (combining two 32-bit integers into one 64-bit integer and switching on that), although there would be 2^33 case statements covering the last condition. I doubt that any compiler would actually make it through such code :)
But basically, no: use the if/else structure instead.
Assuming I do not want to use external libraries or more than a dozen or so extra lines of code (i.e. clear code, not code golf code), can I do better than string.Contains to handle a collection of input strings and a collection of keywords to check for?
Obviously one can use objString.Contains(objString2) to do a simple substring check. However, there are many well-known algorithms which are able to do better than this under special circumstances, particularly if one is working with multiple strings. But sticking such an algorithm into my code would probably add length and complexity, so I'd rather use some sort of shortcut based on a built in function.
E.g. an input would be a collection of strings, a collection of positive keywords, and a collection of negative keywords. Output would be a subset of the first collection of keywords, all of which had at least 1 positive keyword but 0 negative keywords.
Oh, and please don't mention regular expressions as a suggested solutions.
It may be that my requirements are mutually exclusive (not much extra code, no external libraries or regex, better than String.Contains), but I thought I'd ask.
Edit:
A lot of people are only offering silly improvements that won't beat an intelligently used call to contains by much, if anything. Some people are trying to call Contains more intelligently, which completely misses the point of my question. So here's an example of a problem to try solving. LBushkin's solution is an example of someone offering a solution that probably is asymptotically better than standard contains:
Suppose you have 10,000 positive keywords of length 5-15 characters, 0 negative keywords (this seems to confuse people), and 1 1,000,000 character string. Check if the 1,000,000 character string contains at least 1 of the positive keywords.
I suppose one solution is to create an FSA. Another is delimit on spaces and use hashes.
Your discussion of "negative and positive" keywords is somewhat confusing - and could use some clarification to get more complete answers.
As with all performance related questions - you should first write the simple version and then profile it to determine where the bottlenecks are - these can be unintuitive and hard to predict. Having said that...
One way to optimize the search may (if you are always searching for "words" - and not phrases that could contains spaces) would be to build a search index of from your string.
The search index could either be a sorted array (for binary search) or a dictionary. A dictionary would likely prove faster - both because dictionaries are hashmaps internally with O(1) lookup, and a dictionary will naturally eliminate duplicate values in the search source - thereby reducing the number of comparions you need to perform.
The general search algorithm is:
For each string you are searching against:
Take the string you are searching within and tokenize it into individual words (delimited by whitespace)
Populate the tokens into a search index (either a sorted array or dictionary)
Search the index for your "negative keywords", if one is found, skip to the next search string
Search the index for your "positive keywords", when one is found, add it to a dictionary as they (you could also track a count of how often the word appears)
Here's an example using a sorted array and binary search in C# 2.0:
NOTE: You could switch from string[] to List<string> easily enough, I leave that to you.
string[] FindKeyWordOccurence( string[] stringsToSearch,
string[] positiveKeywords,
string[] negativeKeywords )
{
Dictionary<string,int> foundKeywords = new Dictionary<string,int>();
foreach( string searchIn in stringsToSearch )
{
// tokenize and sort the input to make searches faster
string[] tokenizedList = searchIn.Split( ' ' );
Array.Sort( tokenizedList );
// if any negative keywords exist, skip to the next search string...
foreach( string negKeyword in negativeKeywords )
if( Array.BinarySearch( tokenizedList, negKeyword ) >= 0 )
continue; // skip to next search string...
// for each positive keyword, add to dictionary to keep track of it
// we could have also used a SortedList, but the dictionary is easier
foreach( string posKeyword in positiveKeyWords )
if( Array.BinarySearch( tokenizedList, posKeyword ) >= 0 )
foundKeywords[posKeyword] = 1;
}
// convert the Keys in the dictionary (our found keywords) to an array...
string[] foundKeywordsArray = new string[foundKeywords.Keys.Count];
foundKeywords.Keys.CopyTo( foundKeywordArray, 0 );
return foundKeywordsArray;
}
Here's a version that uses a dictionary-based index and LINQ in C# 3.0:
NOTE: This is not the most LINQ-y way to do it, I could use Union() and SelectMany() to write the entire algorithm as a single big LINQ statement - but I find this to be easier to understand.
public IEnumerable<string> FindOccurences( IEnumerable<string> searchStrings,
IEnumerable<string> positiveKeywords,
IEnumerable<string> negativeKeywords )
{
var foundKeywordsDict = new Dictionary<string, int>();
foreach( var searchIn in searchStrings )
{
// tokenize the search string...
var tokenizedDictionary = searchIn.Split( ' ' ).ToDictionary( x => x );
// skip if any negative keywords exist...
if( negativeKeywords.Any( tokenizedDictionary.ContainsKey ) )
continue;
// merge found positive keywords into dictionary...
// an example of where Enumerable.ForEach() would be nice...
var found = positiveKeywords.Where(tokenizedDictionary.ContainsKey)
foreach (var keyword in found)
foundKeywordsDict[keyword] = 1;
}
return foundKeywordsDict.Keys;
}
If you add this extension method:
public static bool ContainsAny(this string testString, IEnumerable<string> keywords)
{
foreach (var keyword in keywords)
{
if (testString.Contains(keyword))
return true;
}
return false;
}
Then this becomes a one line statement:
var results = testStrings.Where(t => !t.ContainsAny(badKeywordCollection)).Where(t => t.ContainsAny(goodKeywordCollection));
This isn't necessarily any faster than doing the contains checks, except that it will do them efficiently, due to LINQ's streaming of results preventing any unnecessary contains calls.... Plus, the resulting code being a one liner is nice.
If you're truly just looking for space-delimited words, this code would be a very simple implementation:
static void Main(string[] args)
{
string sIn = "This is a string that isn't nearly as long as it should be " +
"but should still serve to prove an algorithm";
string[] sFor = { "string", "as", "not" };
Console.WriteLine(string.Join(", ", FindAny(sIn, sFor)));
}
private static string[] FindAny(string searchIn, string[] searchFor)
{
HashSet<String> hsIn = new HashSet<string>(searchIn.Split());
HashSet<String> hsFor = new HashSet<string>(searchFor);
return hsIn.Intersect(hsFor).ToArray();
}
If you only wanted a yes/no answer (as I see now may have been the case) there's another method of hashset "Overlaps" that's probably better optimized for that:
private static bool FindAny(string searchIn, string[] searchFor)
{
HashSet<String> hsIn = new HashSet<string>(searchIn.Split());
HashSet<String> hsFor = new HashSet<string>(searchFor);
return hsIn.Overlaps(hsFor);
}
Well, there is the Split() method you can call on a string. You could split your input strings into arrays of words using Split() then do a one-to-one check of words with keywords. I have no idea if or under what circumstances this would be faster than using Contains(), however.
First get rid of all the strings that contain negative words. I would suggest doing this using the Contains method. I would think that Contains() is faster then splitting, sorting, and searching.
Seems to me that the best way to do this is take your match strings (both positive and negative) and compute a hash of them. Then march through your million string computing n hashes (in your case it's 10 for strings of length 5-15) and match against the hashes for your match strings. If you get hash matches, then you do an actual string compare to rule out the false positive. There are a number of good ways to optimize this by bucketing your match strings by length and creating hashes based on the string size for a particular bucket.
So you get something like:
IList<Buckets> buckets = BuildBuckets(matchStrings);
int shortestLength = buckets[0].Length;
for (int i = 0; i < inputString.Length - shortestLength; i++) {
foreach (Bucket b in buckets) {
if (i + b.Length >= inputString.Length)
continue;
string candidate = inputString.Substring(i, b.Length);
int hash = ComputeHash(candidate);
foreach (MatchString match in b.MatchStrings) {
if (hash != match.Hash)
continue;
if (candidate == match.String) {
if (match.IsPositive) {
// positive case
}
else {
// negative case
}
}
}
}
}
To optimize Contains(), you need a tree (or trie) structure of your positive/negative words.
That should speed up everything (O(n) vs O(nm), n=size of string, m=avg word size) and the code is relatively small & easy.