Is there any way to make Search and AddToSearch faster?
I am trying to make it faster. I am not sure whether the regex in AddToSearch is a problem, since it is really small. I am out of ideas for how to optimize it further. I also wonder if there is a more effective way to concatenate the non-empty parts of a name than the one I use.
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System;
namespace AutoComplete
{
public struct FullName
{
public string Name;
public string Surname;
public string Patronymic;
}
public class AutoCompleter
{
private List<string> listOfNames = new List<string>();
private static readonly Regex sWhitespace = new Regex(@"\s+");
public void AddToSearch(List<FullName> fullNames)
{
foreach (FullName i in fullNames)
{
string nameToAdd = "";
if (!string.IsNullOrWhiteSpace(i.Surname))
{
nameToAdd += sWhitespace.Replace(i.Surname, "") + " ";
}
if (!string.IsNullOrWhiteSpace(i.Name))
{
nameToAdd += sWhitespace.Replace(i.Name, "") + " ";
}
if (!string.IsNullOrWhiteSpace(i.Patronymic))
{
nameToAdd += sWhitespace.Replace(i.Patronymic, "") + " ";
}
listOfNames.Add(nameToAdd.Substring(0, nameToAdd.Length - 1));
}
}
public List<string> Search(string prefix)
{
if (prefix.Length > 100 || string.IsNullOrWhiteSpace(prefix))
{
throw new System.Exception();
}
List<string> namesWithPrefix = new List<string>();
foreach (string name in listOfNames)
{
if (IsPrefix(prefix, name))
{
namesWithPrefix.Add(name);
}
}
return namesWithPrefix;
}
private bool IsPrefix(string prefix, string stringToSearch)
{
if (stringToSearch.Length < prefix.Length)
{
return false;
}
for (int i = 0; i < prefix.Length; i++)
{
if (prefix[i] != stringToSearch[i])
{
return false;
}
}
return true;
}
}
}
Regular expressions (regex) are great for their ease of use and flexibility, but most regex engines are actually quite slow, and the C# one is no exception. Moreover, strings can contain Unicode characters, and "\s" has to consider all the (fancy) space characters in the Unicode character set, which makes regex search/replace much slower. If you know your input does not contain such characters (e.g. it is ASCII), you can write a much faster implementation. Alternatively, you can experiment with RegexOptions such as Compiled and CultureInvariant to reduce the run time a bit.
AddToSearch performs many hidden allocations. Indeed, += creates a new string (because C# strings are immutable and not designed to be resized often), and the Replace calls allocate new strings too. You can speed up the computation by writing the whitespace-free characters directly into a preallocated buffer and then copying out the result once, much like the Substring call you currently use.
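A minimal sketch of that idea, assuming a reusable StringBuilder is an acceptable stand-in for the "preallocated buffer". NameBuilder, AppendPart and Build are hypothetical names, and char.IsWhiteSpace is used as a close substitute for the regex class \s:

```csharp
using System.Text;

public static class NameBuilder
{
    // Appends 'part' with all whitespace removed, preceded by a single
    // space if an earlier part has already been written to 'sb'.
    private static void AppendPart(StringBuilder sb, string part)
    {
        if (string.IsNullOrWhiteSpace(part))
            return;

        if (sb.Length > 0)
            sb.Append(' ');

        foreach (char c in part)
        {
            if (!char.IsWhiteSpace(c))
                sb.Append(c);
        }
    }

    // Hypothetical helper: the caller passes one StringBuilder that is
    // reused for every name, so only the final string is allocated.
    public static string Build(StringBuilder sb, string surname, string name, string patronymic)
    {
        sb.Clear();
        AppendPart(sb, surname);
        AppendPart(sb, name);
        AppendPart(sb, patronymic);
        return sb.ToString();
    }
}
```

In AddToSearch you would create the StringBuilder once before the loop and call NameBuilder.Build(sb, i.Surname, i.Name, i.Patronymic) for each entry. Unlike the original Substring call, this returns an empty string instead of throwing when all three parts are empty.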
Search is fine and not easy to optimize. That being said, if listOfNames is big, you can use multiple threads to speed up the computation significantly. Be careful though, because List<T>.Add is not thread-safe. Parallel LINQ (PLINQ) may help you do that easily (I have never tested it though).
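As an untested-for-speed sketch of the PLINQ suggestion: the query below lets PLINQ collect the matches itself, so no shared list is mutated from several threads. ParallelSearch is a hypothetical name, and whether this beats the sequential loop depends on the list size, so it should be measured:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ParallelSearch
{
    public static List<string> Search(List<string> names, string prefix)
    {
        return names
            .AsParallel()
            .AsOrdered() // keep the original order of the names in the result
            .Where(name => name.StartsWith(prefix, StringComparison.Ordinal))
            .ToList();
    }
}
```

StringComparison.Ordinal mirrors the char-by-char comparison done by IsPrefix.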
Another way to speed up Search a bit is to run the loop in IsPrefix backwards, starting from prefix.Length-1. Indeed, if most strings share the beginning of the prefix, a significant portion of the time will be spent comparing equal characters. The probability that prefix[prefix.Length-1] != stringToSearch[prefix.Length-1] is higher than that of prefix[0] != stringToSearch[0]. Additionally, partial loop unrolling may help a bit if the JIT is not able to do it.
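The backwards comparison could look like the sketch below. PrefixCheck is a hypothetical wrapper class, and any gain depends on how similar the stored names are, so treat it as something to benchmark rather than a guaranteed win:

```csharp
public static class PrefixCheck
{
    public static bool IsPrefix(string prefix, string stringToSearch)
    {
        if (stringToSearch.Length < prefix.Length)
            return false;

        // Compare from the last prefix character towards the first, since
        // later characters are more likely to differ between similar names.
        for (int i = prefix.Length - 1; i >= 0; i--)
        {
            if (prefix[i] != stringToSearch[i])
                return false;
        }
        return true;
    }
}
```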
Others have already pointed out that the use of regex can be problematic. I would personally consider using str.Replace(" ", String.Empty), if I understood the regex correctly (note that this removes only plain spaces, whereas \s also matches tabs, line breaks and other Unicode whitespace); I normally try to avoid regex, as I have a hard time reading code that uses it. Note that String.Empty does not allocate a new string.
That said, I think performance could improve if you kept the list of names sorted alphabetically. Then you do not need to iterate over all elements of the list; you can use e.g. binary search to find all elements matching a given prefix, as a contiguous range within the list of names you already have.
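A sketch of the sorted-list idea, assuming the list has been sorted once with StringComparer.Ordinal after all names were added. PrefixSearch and LowerBound are hypothetical names:

```csharp
using System;
using System.Collections.Generic;

public static class PrefixSearch
{
    // Index of the first element that is >= key in ordinal order.
    private static int LowerBound(List<string> sorted, string key)
    {
        int lo = 0, hi = sorted.Count;
        while (lo < hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (string.CompareOrdinal(sorted[mid], key) < 0)
                lo = mid + 1;
            else
                hi = mid;
        }
        return lo;
    }

    // 'sorted' must be sorted with StringComparer.Ordinal. In that order all
    // names sharing a prefix are adjacent and start at the lower bound.
    public static List<string> Search(List<string> sorted, string prefix)
    {
        var result = new List<string>();
        for (int i = LowerBound(sorted, prefix); i < sorted.Count; i++)
        {
            if (!sorted[i].StartsWith(prefix, StringComparison.Ordinal))
                break;
            result.Add(sorted[i]);
        }
        return result;
    }
}
```

The lookup then costs O(log n) for the binary search plus the number of matches, instead of a scan over every name. The price is one listOfNames.Sort(StringComparer.Ordinal) call whenever names are added.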
Related
I'm working on a book encryption program for one of my courses and I've run into a problem. Our professor gave us the example of using, say, Pride and Prejudice as the book used to encrypt, so I chose that one to test my program. The current function I'm using to remove the punctuation from the string takes so long that the program is being forced into break mode. This function works for smaller strings, even ones that are pages long, but when I fed it Pride and Prejudice it took way too long.
public void removePunctuation(ref string s) {
string result = "";
for (int i = 0; i < s.Length; i++) {
if (Char.IsWhiteSpace(s[i])) {
result += ' ';
} else if (!Char.IsLetter(s[i]) && !Char.IsNumber(s[i])) {
// do nothing
} else {
result += s[i];
}
}
s = result;
}
So I think I need a faster way to remove punctuation from this string if anyone has any suggestions? I know looping through every character is horrible, but I'm stumped and I was never taught Regex in depth.
Edit: I was asked how I was storing the string in the dictionary class! This is the constructor for another class that actually uses the formatted string.
public CodeBook(string book)
{
BookMap = new Dictionary<string, List<int>>();
Key = book.Split(null).ToList(); // split string into words
foreach(string s in Key)
{
if (!BookMap.Keys.Contains(s))
{
BookMap.Add(s, Enumerable.Range(0, Key.Count).Where(i => Key[i] == s).ToList());
// add word and add list of occurrances of word
}
}
}
This is slow because you construct the string by concatenation in a loop. There are several more performant approaches:
Use StringBuilder - unlike string concatenation which constructs a new object each time you add a character, this approach expands the string under construction by larger chunks, preventing excessive garbage creation.
Use LINQ's filtering with Where - this approach constructs an array of chars in a single shot, then constructs a single string from it.
Use regular expression's Replace - this method is optimized to deal with strings of virtually unlimited sizes.
Roll your own algorithm - create an array of chars that corresponds to the length of the original string. Walk through the string, and add the characters that you wish to keep to the array. Use string's constructor that takes the array, the initial index, and the length to construct the string at once.
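The last approach in the list (roll your own with a char array) could be sketched like this. TextCleaner is a hypothetical class name; the filtering rules are the ones from the question's removePunctuation:

```csharp
public static class TextCleaner
{
    public static string RemovePunctuation(string s)
    {
        // The result can never be longer than the input.
        char[] buffer = new char[s.Length];
        int count = 0;

        foreach (char c in s)
        {
            if (char.IsWhiteSpace(c))
                buffer[count++] = ' ';   // normalize any whitespace to a space
            else if (char.IsLetter(c) || char.IsNumber(c))
                buffer[count++] = c;     // keep letters and digits
            // everything else (punctuation, symbols) is skipped
        }

        // Build the string once from the filled part of the buffer.
        return new string(buffer, 0, count);
    }
}
```

This makes exactly two allocations regardless of the input size: the buffer and the final string.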
Looping through every character once is not that bad. You're doing it all in one pass, that's not trivial to avoid.
The problem lies in the fact that the framework will need to allocate a new copy of the (partial) string whenever you do something like
result += s[i];
You can avoid that by introducing a StringBuilder to append non-punctuation characters as you go.
public string removePunctuation(string s)
{
var result = new StringBuilder();
for (int i = 0; i < s.Length; i++) {
if (Char.IsWhiteSpace(s[i])) {
result.Append(" ");
} else if (!Char.IsLetter(s[i]) && !Char.IsNumber(s[i])) {
// do nothing
} else {
result.Append(s[i]);
}
}
return result.ToString();
}
You could further reduce the number of necessary Append calls with a refined algorithm, for example by looking ahead to the next punctuation character and appending larger portions at once, or by using an existing string manipulation facility like Regex. But the introduction of StringBuilder above should already give you a noticeable performance gain.
"I was never taught Regex in depth"
Use the search provider of your choice, you may end up with a tested solution which you can just study and use: https://stackoverflow.com/a/5871826/1132334
You can use Regex to remove punctuations as below.
public string removePunctuation(string s)
{
string result = Regex.Replace(s, @"[^\w\s]", "");
return result;
}
^ at the start of a character class means: any character except the ones listed.
\w means: word characters (letters, digits, underscore).
\s means: whitespace characters.
I have a string of dash-separated numbers that I am removing duplicate numbers from
string original = "45-1-3-45-10-3-15";
string new = "45-1-3-10-15";
I have tried two approaches, and used Stopwatch to determine which method is faster, but I am getting inconsistent time elapses so I was hoping for some insight into which method would be more efficient for achieving the new duplicate-free list.
Method 1: While loop
List<string> temp = new List<string>();
bool moreNumbers = true;
while (moreNumbers)
{
if (original.Contains("-"))
{
string number = original.Substring(0, original.IndexOf("-"));
//don't add if the number is already in the list
int index = temp.FindIndex(item => item == number);
if (index < 0)
temp.Add(number);
original = original.Substring(original.IndexOf("-") + 1);
}
else
moreNumbers = false;
}
//add remaining value in
string lastNumber = original;
//don't add if the number is already in the list
int indexLast = temp.FindIndex(item => item == lastNumber);
if (indexLast < 0)
temp.Add(lastNumber);
string new = "";
foreach (string number in temp)
{
new += "-" + number;
}
if (new[0] == '-')
new = new.Substring(1);
Method 2: Split
List<string> temp = original.Split('-').Distinct().ToList();
string new = "";
foreach (string number in temp)
{
new += "-" + number;
}
if (new[0] == '-')
new = new.Substring(1);
I think the second method is more readable, but possibly slower? Which of these methods would be more efficient or a better approach?
This should be well optimized, but you should measure the performance yourself.
string result = string.Join("-", original.Split('-').Distinct());
You have some inefficiencies in both your examples.
Method 1: manipulating a string is never efficient. Strings are immutable.
Method 2: no need to create a List and use a StringBuilder() instead of using string concatenation.
Lastly, new is a C# reserved word so none of your code will compile.
In the first approach, you're using several Substring calls and several IndexOf calls. I don't know exactly the internal implementation, but I guess they are O(n) in time complexity.
Since, for each number in the list, you'll do a full loop in the other list (you're using strings as lists), you'll have an O(n^2) time complexity.
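The second option I first assumed to be O(n^2) too, because producing the distinct items of an IEnumerable has to iterate the list. However, LINQ's Distinct tracks the items it has already seen in a hash-based set, so in practice it runs in roughly O(n).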
I think one optimized approach to the problem is:
1) Loop over the main string and, for each "-" or the end of the string, save the number (this is more economical than Split in terms of space).
2) Put each number in a Dictionary. This isn't economical in terms of space, but it provides an O(1) check for whether the item has already been seen. Hashing small strings shouldn't be too costly.
3) Loop the Dictionary to retrieve the distinct values.
This implementation will be O(n), better than O(n^2).
Note that only using the dictionary can deliver the result string in a different order. If the order is important, use the Dictionary to check if the item is duplicated, but put in an auxiliary list. Again, this will have a space cost.
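A short sketch of the order-preserving variant described above. It swaps the Dictionary for a HashSet<string>, which offers the same O(1) membership check without needing values, and it uses Split for brevity rather than the manual scan from step 1. Dedup and RemoveDuplicates are hypothetical names:

```csharp
using System.Collections.Generic;

public static class Dedup
{
    public static string RemoveDuplicates(string original)
    {
        var seen = new HashSet<string>();   // O(1) "already seen?" check
        var unique = new List<string>();    // auxiliary list keeps first-seen order

        foreach (string number in original.Split('-'))
        {
            // HashSet.Add returns false when the item was already present.
            if (seen.Add(number))
                unique.Add(number);
        }
        return string.Join("-", unique);
    }
}
```

For the input from the question, "45-1-3-45-10-3-15", this yields "45-1-3-10-15".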
Consider the following situation:
public class Employee
{
public string Name { get; set; }
public string Email { get; set; }
}
public class EnployeeGroup
{
//List of employees in marketting
public IList<Employee> MarkettingEmployees{ get; }
//List of employees in sales
public IList<Employee> SalesEmployees{ get; }
}
private EnployeeGroup GroupA;
int MarkettingCount;
string MarkettingNames;
MarkettingCount = GroupA.MarkettingEmployees.Count; //assigns MarkettingCount=5, this will always be 5-10 employees
MarkettingNames = <**how can i join all the GroupA.MarkettingEmployees.Name into a comma separated string?** >
//I tried a loop:
foreach(Employee MktEmployee in GroupA.MarkettingEmployees)
{
MarkettingNames += MktEmployee.Name + ", ";
}
The loop works, but I want to know:
Is looping the most efficient/elegant way of doing this? If not, what are the better alternatives? I tried string.Join but couldn't get it working.
I want to avoid LINQ.
You need a little bit of LINQ whether you like it or not ;)
MarkettingNames = string.Join(", ", GroupA.MarkettingEmployees.Select(e => e.Name));
From a practicality standpoint, there's no reasonable argument for avoiding a loop. Iteration is at the heart of every general-purpose programming language.
Using LINQ is elegant in simple cases. Again, there's no sound reason to avoid it per se.
In case you are looking for a rather obscure, academic solution, there's always tail recursion. However, your data structure would have to be adapted for it. Note that even if you use it, a smart compiler will detect it and optimize it into a loop. The odds are against you!
As an alternative, you could use a StringBuilder with Append instead of creating a new string at each iteration.
This would be much more efficient (see the caveat below):
var stringBuilder = new StringBuilder();
foreach (Employee MktEmployee in GroupA.MarkettingEmployees)
{
// two Append calls avoid building a temporary "Name, " string per employee
stringBuilder.Append(MktEmployee.Name).Append(", ");
}
Than this:
foreach(Employee MktEmployee in GroupA.MarkettingEmployees)
{
MarkettingNames += MktEmployee.Name + ", ";
}
Edit: If you had a large number of employees, this would be much more efficient. However, for a trivial loop of 5-10 items it is actually slightly less efficient.
In small cases - this isn't going to be that large of a hit on performance, but in large cases the pay off will be significant.
Also, if you are to use the explicit loop approach, it's probably best to trim off that last ", " by using something like:
myString = myString.Trim().TrimEnd(',');
The article below explains when you should use StringBuilder to concatenate strings.
In short, in the approach you use, the concatenation creates a new string each time, which obviously eats up a lot of memory. Each time, all the data from the existing MarkettingNames string also has to be copied into the new string before yet another MktEmployee.Name + ", " is appended.
Thank you, Jon Skeet: http://www.yoda.arachsys.com/csharp/stringbuilder.html
Here I am using Split function to get the parts of string.
string[] OrSets = SubLogic.Split('|');
foreach (string OrSet in OrSets)
{
bool OrSetFinalResult = false;
if (OrSet.Contains('&'))
{
OrSetFinalResult = true;
if (OrSet.Contains('0'))
{
OrSetFinalResult = false;
}
//string[] AndSets = OrSet.Split('&');
//foreach (string AndSet in AndSets)
//{
// if (AndSet == "0")
// {
// // A single "false" statement makes the entire And statement FALSE
// OrSetFinalResult = false;
// break;
// }
//}
}
else
{
if (OrSet == "1")
{
OrSetFinalResult = true;
}
}
if (OrSetFinalResult)
{
// A single "true" statement makes the entire OR statement TRUE
FinalResult = true;
break;
}
}
How can I replace the Split operation, along with the foreach constructs, to make this faster?
Hypothesis #1
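Depending on the kind of processing, you can parallelize the work:
// Note: a plain foreach over the result of AsParallel() still runs its body
// sequentially, so ForAll is used to execute the body in parallel.
SubLogic.Split('|').AsParallel().ForAll(OrSet =>
{
...
....
});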
However, this can often lead to the usual problems of multithreaded apps (locking resources, etc.).
You also have to measure the benefits. Switching from one thread to another can be costly. If the job is small, AsParallel will be slower than a simple sequential loop.
This is very efficient when you have latency from network resources or any other kind of I/O.
Hypothesis #2
Your SubLogic variable is very, very big.
In this case, you can walk through the string sequentially:
using System;
using System.IO;
class Program
{
static void Main(string[] args)
{
var SubLogic = "darere|gfgfgg|gfgfg";
using (var sr = new StringReader(SubLogic))
{
var str = string.Empty;
int charValue;
do
{
charValue = sr.Read();
var c = (char)charValue;
if (c == '|' || (charValue == -1 && str.Length > 0))
{
Process(str);
str = string.Empty; // Reset the string
}
else
{
str += c;
}
} while (charValue >= 0);
}
Console.ReadLine();
}
private static void Process(string str)
{
// Your actual Job
Console.WriteLine(str);
}
}
Also, depending on the length of each chunk between | characters, you may want to use a StringBuilder rather than simple string concatenation.
Chances are that if you need to optimize to improve the performance of your application, that the code inside of the foreach loop is what needs to be optimized, not the string.Split method.
[EDIT:]
There are a number of good answers elsewhere on StackOverflow related to optimized string parsing:
Fastest Way to Parse Large Strings (multi threaded)
Fast string parsing in C#
String.Split() likely does more than you can do on your own to actually split the string up in a well-optimized manner. That assumes that you are interested in returning true or false for each split section of your input, of course. Otherwise, you can just focus on searching your string.
As others have mentioned, if you need to search through a huge string (many hundreds of megabytes) and, especially, do so repeatedly and continuously, then look at what .NET 4 gives you with the Task Parallel Library.
For searching through strings, you can look at the MSDN examples for how to use the IndexOf, LastIndexOf, StartsWith, and EndsWith methods. Those should perform better than the Contains method.
Of course, the best solution is dependent upon the facts of your particular situation. You'll want to use the System.Diagnostics.Stopwatch class to see how long your implementations (both current and new) take to see what works best.
You could possibly deal with it by using StringBuilder.
Try reading char-by-char from your source string into a StringBuilder until you find '|', then process what the StringBuilder contains.
That is how you'll avoid the creation of tons of String objects and save a lot of memory.
If you would have used Java, I'd recommend using StringTokenizer and StreamTokenizer classes. It's a pity there are no similar classes in .NET
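The StringBuilder idea above can be sketched as follows. PipeTokenizer and ForEachToken are hypothetical names, and the callback stands in for whatever processing each OR set needs. Note that a string is still created for every token handed to the callback; what is avoided is the per-character concatenation and the array that Split returns:

```csharp
using System;
using System.Text;

public static class PipeTokenizer
{
    // Calls 'process' once for every '|'-separated token in 'input',
    // including empty tokens, in the same order string.Split('|') would.
    public static void ForEachToken(string input, Action<string> process)
    {
        var current = new StringBuilder();

        foreach (char c in input)
        {
            if (c == '|')
            {
                process(current.ToString());
                current.Clear();   // reuse the same builder for the next token
            }
            else
            {
                current.Append(c);
            }
        }

        // The text after the last '|' (or the whole input if there is none).
        process(current.ToString());
    }
}
```

A call such as PipeTokenizer.ForEachToken(SubLogic, orSet => { /* evaluate orSet */ }) would replace the Split plus foreach pair. Unlike the loop in the question it cannot break out early, so an early exit would need an extra flag or a Func<string, bool> callback.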
Assuming I do not want to use external libraries or more than a dozen or so extra lines of code (i.e. clear code, not code golf code), can I do better than string.Contains to handle a collection of input strings and a collection of keywords to check for?
Obviously one can use objString.Contains(objString2) to do a simple substring check. However, there are many well-known algorithms which are able to do better than this under special circumstances, particularly if one is working with multiple strings. But sticking such an algorithm into my code would probably add length and complexity, so I'd rather use some sort of shortcut based on a built in function.
E.g. an input would be a collection of strings, a collection of positive keywords, and a collection of negative keywords. The output would be the subset of the first collection (the input strings) whose members each contain at least 1 positive keyword but 0 negative keywords.
Oh, and please don't mention regular expressions as a suggested solution.
It may be that my requirements are mutually exclusive (not much extra code, no external libraries or regex, better than String.Contains), but I thought I'd ask.
Edit:
A lot of people are only offering silly improvements that won't beat an intelligently used call to contains by much, if anything. Some people are trying to call Contains more intelligently, which completely misses the point of my question. So here's an example of a problem to try solving. LBushkin's solution is an example of someone offering a solution that probably is asymptotically better than standard contains:
Suppose you have 10,000 positive keywords of length 5-15 characters, 0 negative keywords (this seems to confuse people), and one 1,000,000-character string. Check if the 1,000,000-character string contains at least 1 of the positive keywords.
I suppose one solution is to create an FSA. Another is to delimit on spaces and use hashes.
Your discussion of "negative and positive" keywords is somewhat confusing - and could use some clarification to get more complete answers.
As with all performance related questions - you should first write the simple version and then profile it to determine where the bottlenecks are - these can be unintuitive and hard to predict. Having said that...
One way to optimize the search (if you are always searching for "words", and not phrases that could contain spaces) would be to build a search index from your string.
The search index could be either a sorted array (for binary search) or a dictionary. A dictionary would likely prove faster, both because dictionaries are hash maps internally with O(1) lookup and because a dictionary naturally eliminates duplicate values in the search source, thereby reducing the number of comparisons you need to perform.
The general search algorithm is:
For each string you are searching against:
Take the string you are searching within and tokenize it into individual words (delimited by whitespace)
Populate the tokens into a search index (either a sorted array or dictionary)
Search the index for your "negative keywords", if one is found, skip to the next search string
Search the index for your "positive keywords"; when one is found, add it to a dictionary of found keywords (you could also track a count of how often the word appears)
Here's an example using a sorted array and binary search in C# 2.0:
NOTE: You could switch from string[] to List<string> easily enough, I leave that to you.
string[] FindKeyWordOccurence( string[] stringsToSearch,
string[] positiveKeywords,
string[] negativeKeywords )
{
Dictionary<string,int> foundKeywords = new Dictionary<string,int>();
foreach( string searchIn in stringsToSearch )
{
// tokenize and sort the input to make searches faster
string[] tokenizedList = searchIn.Split( ' ' );
Array.Sort( tokenizedList );
// if any negative keyword exists, skip to the next search string...
bool hasNegativeKeyword = false;
foreach( string negKeyword in negativeKeywords )
if( Array.BinarySearch( tokenizedList, negKeyword ) >= 0 )
{
hasNegativeKeyword = true;
break; // a bare 'continue' here would only advance this inner loop
}
if( hasNegativeKeyword )
continue; // skip to next search string...
// for each positive keyword, add to dictionary to keep track of it
// we could have also used a SortedList, but the dictionary is easier
foreach( string posKeyword in positiveKeywords )
if( Array.BinarySearch( tokenizedList, posKeyword ) >= 0 )
foundKeywords[posKeyword] = 1;
}
// convert the Keys in the dictionary (our found keywords) to an array...
string[] foundKeywordsArray = new string[foundKeywords.Keys.Count];
foundKeywords.Keys.CopyTo( foundKeywordsArray, 0 );
return foundKeywordsArray;
}
Here's a version that uses a dictionary-based index and LINQ in C# 3.0:
NOTE: This is not the most LINQ-y way to do it, I could use Union() and SelectMany() to write the entire algorithm as a single big LINQ statement - but I find this to be easier to understand.
public IEnumerable<string> FindOccurences( IEnumerable<string> searchStrings,
IEnumerable<string> positiveKeywords,
IEnumerable<string> negativeKeywords )
{
var foundKeywordsDict = new Dictionary<string, int>();
foreach( var searchIn in searchStrings )
{
// tokenize the search string...
// Distinct() is needed because ToDictionary throws on duplicate keys
var tokenizedDictionary = searchIn.Split( ' ' ).Distinct().ToDictionary( x => x );
// skip if any negative keywords exist...
if( negativeKeywords.Any( tokenizedDictionary.ContainsKey ) )
continue;
// merge found positive keywords into dictionary...
// an example of where Enumerable.ForEach() would be nice...
var found = positiveKeywords.Where(tokenizedDictionary.ContainsKey);
foreach (var keyword in found)
foundKeywordsDict[keyword] = 1;
}
return foundKeywordsDict.Keys;
}
If you add this extension method:
public static bool ContainsAny(this string testString, IEnumerable<string> keywords)
{
foreach (var keyword in keywords)
{
if (testString.Contains(keyword))
return true;
}
return false;
}
Then this becomes a one line statement:
var results = testStrings.Where(t => !t.ContainsAny(badKeywordCollection)).Where(t => t.ContainsAny(goodKeywordCollection));
This isn't necessarily any faster than doing the contains checks, except that it will do them efficiently, due to LINQ's streaming of results preventing any unnecessary contains calls.... Plus, the resulting code being a one liner is nice.
If you're truly just looking for space-delimited words, this code would be a very simple implementation:
static void Main(string[] args)
{
string sIn = "This is a string that isn't nearly as long as it should be " +
"but should still serve to prove an algorithm";
string[] sFor = { "string", "as", "not" };
Console.WriteLine(string.Join(", ", FindAny(sIn, sFor)));
}
private static string[] FindAny(string searchIn, string[] searchFor)
{
HashSet<String> hsIn = new HashSet<string>(searchIn.Split());
HashSet<String> hsFor = new HashSet<string>(searchFor);
return hsIn.Intersect(hsFor).ToArray();
}
If you only wanted a yes/no answer (as I see now may have been the case) there's another method of hashset "Overlaps" that's probably better optimized for that:
private static bool FindAny(string searchIn, string[] searchFor)
{
HashSet<String> hsIn = new HashSet<string>(searchIn.Split());
HashSet<String> hsFor = new HashSet<string>(searchFor);
return hsIn.Overlaps(hsFor);
}
Well, there is the Split() method you can call on a string. You could split your input strings into arrays of words using Split() then do a one-to-one check of words with keywords. I have no idea if or under what circumstances this would be faster than using Contains(), however.
First get rid of all the strings that contain negative words. I would suggest doing this using the Contains method. I would think that Contains() is faster than splitting, sorting, and searching.
It seems to me that the best way to do this is to take your match strings (both positive and negative) and compute a hash of each. Then march through your million-character string, computing one hash per keyword length at each position (in your case that is 11, for lengths 5-15), and match them against the hashes of your match strings. If you get a hash match, do an actual string compare to rule out false positives. There are a number of good ways to optimize this by bucketing your match strings by length and creating hashes based on the string size for a particular bucket.
So you get something like:
IList<Bucket> buckets = BuildBuckets(matchStrings); // assumed sorted by ascending Length
int shortestLength = buckets[0].Length;
for (int i = 0; i <= inputString.Length - shortestLength; i++) {
foreach (Bucket b in buckets) {
if (i + b.Length > inputString.Length)
continue;
string candidate = inputString.Substring(i, b.Length);
int hash = ComputeHash(candidate);
foreach (MatchString match in b.MatchStrings) {
if (hash != match.Hash)
continue;
if (candidate == match.String) {
if (match.IsPositive) {
// positive case
}
else {
// negative case
}
}
}
}
}
To optimize Contains(), you need a tree (or trie) structure of your positive/negative words.
That should speed up everything (O(n) vs O(nm), n=size of string, m=avg word size) and the code is relatively small & easy.
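A minimal sketch of the trie idea, under the assumption that a plain trie restarted at every text position is good enough. KeywordTrie and its members are hypothetical names. This version costs O(n * k), where k is the longest keyword length, rather than the strict O(n) quoted above; getting to O(n) would require adding failure links (the Aho-Corasick algorithm), which is left out to keep the code short:

```csharp
using System.Collections.Generic;

public sealed class KeywordTrie
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
        public bool IsEnd; // true if a keyword ends at this node
    }

    private readonly Node root = new Node();

    public KeywordTrie(IEnumerable<string> keywords)
    {
        foreach (string keyword in keywords)
        {
            if (keyword.Length == 0)
                continue; // ignore empty keywords

            Node node = root;
            foreach (char c in keyword)
            {
                Node next;
                if (!node.Children.TryGetValue(c, out next))
                {
                    next = new Node();
                    node.Children.Add(c, next);
                }
                node = next;
            }
            node.IsEnd = true;
        }
    }

    // Returns true if any keyword occurs anywhere in 'text'.
    public bool ContainsAny(string text)
    {
        for (int start = 0; start < text.Length; start++)
        {
            Node node = root;
            for (int i = start; i < text.Length; i++)
            {
                // Walk down the trie; stop as soon as no keyword can match.
                if (!node.Children.TryGetValue(text[i], out node))
                    break;
                if (node.IsEnd)
                    return true;
            }
        }
        return false;
    }
}
```

You would build one trie for the positive keywords and one for the negative keywords, then keep each input string for which the positive trie reports a hit and the negative trie does not. The per-string cost no longer grows with the number of keywords, only with the text length and the longest keyword.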