How do I optimize schemaDocument.Namespaces code for performance? - c#

I have this code that is called thousands of times and I need to optimize it for performance.
I thought about caching xmlQualifiedNames but it's not good enough.
Any ideas?
private static string GetPrefixForNamespace(string ns, XmlSchema schemaDocument)
{
    string prefix = null;
    XmlQualifiedName[] xmlQualifiedNames = schemaDocument.Namespaces.ToArray();
    foreach (XmlQualifiedName qn in xmlQualifiedNames)
    {
        if (ns == qn.Namespace)
        {
            prefix = qn.Name;
            break;
        }
    }
    return prefix;
}

Since you're looking for strings (Namespace) inside the xmlQualifiedNames, how about caching those?
Or using LINQ to search in them?
Or - depending on the kind of input you get - using memoization to speed up your calls (really just fancy caching) like in this article.

Stuff it in a Dictionary or Hashtable, or even some other caching mechanism.
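For example, a minimal sketch of that kind of caching, keyed per schema (this assumes the namespace declarations don't change once the schema is loaded, and single-threaded access; add locking or a ConcurrentDictionary otherwise):

// Sketch only. Requires: System.Collections.Generic, System.Xml, System.Xml.Schema.
private static readonly Dictionary<XmlSchema, Dictionary<string, string>> prefixCache =
    new Dictionary<XmlSchema, Dictionary<string, string>>();

private static string GetPrefixForNamespace(string ns, XmlSchema schemaDocument)
{
    Dictionary<string, string> map;
    if (!prefixCache.TryGetValue(schemaDocument, out map))
    {
        // Build the namespace -> prefix map once per schema.
        map = new Dictionary<string, string>();
        foreach (XmlQualifiedName qn in schemaDocument.Namespaces.ToArray())
        {
            // Last declaration wins; adjust if duplicates need different handling.
            map[qn.Namespace] = qn.Name;
        }
        prefixCache[schemaDocument] = map;
    }

    string prefix;
    return map.TryGetValue(ns, out prefix) ? prefix : null;
}

After the first call per schema, every lookup is a single hash probe instead of a linear scan.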

Related

Is there any way to make Search and addToSearch faster?
I am trying to make it faster. I am not sure if the regex in AddToSearch can be a problem; it is really small. I am out of ideas on how to optimize it further; right now I am just trying to meet the word count. I also wonder if there is a way to concatenate the non-empty parts of the name more effectively than I do.
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

namespace AutoComplete
{
    public struct FullName
    {
        public string Name;
        public string Surname;
        public string Patronymic;
    }

    public class AutoCompleter
    {
        private List<string> listOfNames = new List<string>();
        private static readonly Regex sWhitespace = new Regex(@"\s+");

        public void AddToSearch(List<FullName> fullNames)
        {
            foreach (FullName i in fullNames)
            {
                string nameToAdd = "";
                if (!string.IsNullOrWhiteSpace(i.Surname))
                {
                    nameToAdd += sWhitespace.Replace(i.Surname, "") + " ";
                }
                if (!string.IsNullOrWhiteSpace(i.Name))
                {
                    nameToAdd += sWhitespace.Replace(i.Name, "") + " ";
                }
                if (!string.IsNullOrWhiteSpace(i.Patronymic))
                {
                    nameToAdd += sWhitespace.Replace(i.Patronymic, "") + " ";
                }
                listOfNames.Add(nameToAdd.Substring(0, nameToAdd.Length - 1));
            }
        }

        public List<string> Search(string prefix)
        {
            if (prefix.Length > 100 || string.IsNullOrWhiteSpace(prefix))
            {
                throw new System.Exception();
            }
            List<string> namesWithPrefix = new List<string>();
            foreach (string name in listOfNames)
            {
                if (IsPrefix(prefix, name))
                {
                    namesWithPrefix.Add(name);
                }
            }
            return namesWithPrefix;
        }

        private bool IsPrefix(string prefix, string stringToSearch)
        {
            if (stringToSearch.Length < prefix.Length)
            {
                return false;
            }
            for (int i = 0; i < prefix.Length; i++)
            {
                if (prefix[i] != stringToSearch[i])
                {
                    return false;
                }
            }
            return true;
        }
    }
}
Regular expressions are great because of their ease of use and flexibility, but most regex engines are actually quite slow, and C#'s is no exception. Moreover, strings can contain Unicode characters, and \s has to consider all the (fancy) space characters in the Unicode character set, which makes regex search/replace much slower. If you know your input does not contain such characters (e.g. it is ASCII), you can write a much faster implementation. Alternatively, you can play with RegexOptions such as Compiled and CultureInvariant to reduce the run time a bit.
AddToSearch performs many hidden allocations. += creates a new string each time (C# strings are immutable and not designed to be resized repeatedly), and every Replace call allocates a new string too. You can speed this up by writing the stripped characters directly into a preallocated buffer and copying out the result once, much as you currently do with Substring.
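A rough sketch of that idea, using a single reusable StringBuilder and a plain character loop in place of the regex (AppendWithoutWhitespace is a helper introduced here, not part of the original code; adjust the whitespace test if exotic Unicode spaces matter):

// Sketch only. Requires: System.Text.
public void AddToSearch(List<FullName> fullNames)
{
    var sb = new StringBuilder();
    foreach (FullName i in fullNames)
    {
        sb.Length = 0;                       // reuse the same buffer for every entry
        AppendWithoutWhitespace(sb, i.Surname);
        AppendWithoutWhitespace(sb, i.Name);
        AppendWithoutWhitespace(sb, i.Patronymic);
        if (sb.Length > 0)
        {
            sb.Length--;                     // drop the trailing separator space
        }
        listOfNames.Add(sb.ToString());
    }
}

private static void AppendWithoutWhitespace(StringBuilder sb, string part)
{
    if (string.IsNullOrWhiteSpace(part)) return;
    foreach (char c in part)
    {
        if (!char.IsWhiteSpace(c)) sb.Append(c);
    }
    sb.Append(' ');                          // separator, trimmed at the end
}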
Search is fine and not easy to optimize. That said, if listOfNames is big, you can use multiple threads to significantly speed up the computation. Be careful though, because Add is not thread-safe. Parallel LINQ (PLINQ) may help you do that easily (I have not tested it, though).
Another way to speed up Search a bit is to start the loop in IsPrefix from prefix.Length - 1. If most strings share the beginning of the prefix, a significant portion of the time is spent comparing nearly equal characters, and the probability that prefix[prefix.Length - 1] != stringToSearch[prefix.Length - 1] is higher than that prefix[0] != stringToSearch[0]. Additionally, partial loop unrolling may help a bit if the JIT is not able to do it for you.
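For example, a small variation on IsPrefix along those lines (ordinal, case-sensitive comparison assumed, as in the original):

// Sketch: check the character most likely to differ (the last one of the prefix)
// before falling back to an ordinal comparison of the rest.
private static bool IsPrefix(string prefix, string stringToSearch)
{
    if (stringToSearch.Length < prefix.Length) return false;
    int last = prefix.Length - 1;
    if (prefix[last] != stringToSearch[last]) return false;
    return string.CompareOrdinal(stringToSearch, 0, prefix, 0, last) == 0;
}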
Others have already pointed out that the use of regex can be problematic. I would personally consider using str.Replace(" ", string.Empty), if I understood the regex correctly; I normally try to avoid regex as I have a hard time reading code that uses it. Note that string.Empty does not allocate a new string.
That said, I think performance could improve if you did not just store the names in a List but kept the list sorted alphabetically. Then you do not need to iterate over all elements of the list; you can, for example, use binary search to find the range of elements matching a given prefix within the list of names you already have, as sketched below.
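A rough sketch of that approach, assuming listOfNames has been sorted ordinally (e.g. listOfNames.Sort(StringComparer.Ordinal) once AddToSearch is done); the prefix validation from the original Search is omitted for brevity:

// Sketch only: binary search finds the first candidate, then we walk forward
// while entries still start with the prefix.
public List<string> Search(string prefix)
{
    var result = new List<string>();
    int index = listOfNames.BinarySearch(prefix, StringComparer.Ordinal);
    if (index < 0) index = ~index;      // first element >= prefix
    for (int i = index; i < listOfNames.Count; i++)
    {
        if (!listOfNames[i].StartsWith(prefix, StringComparison.Ordinal)) break;
        result.Add(listOfNames[i]);
    }
    return result;
}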

Detecting destructive SQL queries with C#

So I am looking for a more effective way to catch all variants of the strings in the array in this C# code I wrote. I could loop over the whole string, compare each character in sqltext to the one before it, and make it overly complicated, or I could try to learn something new. I was thinking there has to be a more efficient way. I showed this to a co-worker and she suggested I use a regular expression. I have looked into regular expressions a little bit, but I can't seem to find the right expression.
What I am looking for is a version that catches every variant of the entries in this array:
public bool securitycheck(String sqltext)
{
    string[] badSqlList = new string[] { "insert", "Insert", "INSERT",
                                         "update", "Update", "UPDATE",
                                         "delete", "Delete", "DELETE",
                                         "drop",   "Drop",   "DROP" };
    for (int i = 0; i < badSqlList.Length; i++)
    {
        if (sqltext.Contains(badSqlList[i]) == true)
        {
            return true;
        }
    }
    return false;
}
one that takes alternate spellings into account. This code, for example, does not catch "iNsert", "UpDate", "dELETE", or "DrOP", but according to my coworker there is a way to handle this with regular expressions.
What is the best way to do this in your opinion?
[Update]
Thank you everyone, there is a lot of really good information here, and it really does open my eyes to handling SQL programmatically. The scope of the tool I am building is very small, and anyone with the permissions to access it who intended to be malicious would be someone with direct access to the database anyway. These checks are in place more or less to prevent laziness. The use case does not permit parameterized queries or I would be doing that. Your insight has been very educational and I appreciate all your help!
You can do:
if (badSqlList.Any(r => sqltext.IndexOf(r, StringComparison.InvariantCultureIgnoreCase) >= 0))
{
    // bad SQL found
}
IndexOf with StringComparison enum value will ensure case insensitive comparison.
Another approach could be:
return sqltext.Split()
              .Intersect(badSqlList, StringComparer.InvariantCultureIgnoreCase)
              .Any();
Split your SQL on whitespace and then compare each word against your blacklist. This can save you in cases where a legitimate table name contains a keyword, like INSERTEDStudents.
Not really sure about your requirements, but generally a better option would be to use parameterized queries in the first place. You can't be 100% sure with your blacklist, and there would still be ways to bypass it.
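For reference, a minimal parameterized-query sketch (System.Data.SqlClient assumed; the table, column, and variable names here are made up for illustration):

// Sketch only: user input goes into a parameter, never into the SQL text itself.
using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand(
    "SELECT Id, Name FROM Students WHERE Name LIKE @name", conn))
{
    cmd.Parameters.AddWithValue("@name", searchText + "%");
    conn.Open();
    using (var reader = cmd.ExecuteReader())
    {
        while (reader.Read())
        {
            // consume reader["Id"], reader["Name"], ...
        }
    }
}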
Do not reinvent the wheel - just use parameterized queries as everyone here tells you (it fixes even more problems than you are currently aware of); you'll thank us all in the future...
But do use this to sanitize any filter strings that go into WHERE clauses:
public static string EscapeSpecial(string s)
{
    Contract.Requires(s != null);
    var sb = new StringBuilder();
    foreach (char c in s)
    {
        switch (c)
        {
            case '[':
            case ']':
            case '%':
            case '*':
            {
                sb.AppendFormat(CultureInfo.InvariantCulture, "[{0}]", c);
                break;
            }
            case '\'':
            {
                sb.Append("''");
                break;
            }
            default:
            {
                sb.Append(c);
                break;
            }
        }
    }
    return sb.ToString();
}

Replace string.Split with other constructs - Optimization

Here I am using the Split function to get the parts of a string.
string[] OrSets = SubLogic.Split('|');
foreach (string OrSet in OrSets)
{
    bool OrSetFinalResult = false;
    if (OrSet.Contains('&'))
    {
        OrSetFinalResult = true;
        if (OrSet.Contains('0'))
        {
            OrSetFinalResult = false;
        }
        //string[] AndSets = OrSet.Split('&');
        //foreach (string AndSet in AndSets)
        //{
        //    if (AndSet == "0")
        //    {
        //        // A single "false" statement makes the entire And statement FALSE
        //        OrSetFinalResult = false;
        //        break;
        //    }
        //}
    }
    else
    {
        if (OrSet == "1")
        {
            OrSetFinalResult = true;
        }
    }
    if (OrSetFinalResult)
    {
        // A single "true" statement makes the entire OR statement TRUE
        FinalResult = true;
        break;
    }
}
How can I replace the Split operation, along with the foreach constructs, with something more efficient?
Hypothesis #1
Depending on the kind of processing you do, you can parallelize the work:
var OrSets = SubLogic.Split('|').AsParallel();
foreach (string OrSet in OrSets)
{
...
....
}
However, this can often lead to problems in multithreaded apps (resource locking, etc.).
You also have to measure the benefits: switching from one thread to another can be costly, and if the job is small, AsParallel will be slower than a simple sequential loop.
It is most effective when you have latency from a network resource or any other kind of I/O.
Hypothesis #2
Your SubLogic variable is very, very big.
In this case, you can walk the string sequentially:
using System;
using System.IO;

class Program
{
    static void Main(string[] args)
    {
        var SubLogic = "darere|gfgfgg|gfgfg";
        using (var sr = new StringReader(SubLogic))
        {
            var str = string.Empty;
            int charValue;
            do
            {
                charValue = sr.Read();
                var c = (char)charValue;
                if (c == '|' || (charValue == -1 && str.Length > 0))
                {
                    Process(str);
                    str = string.Empty; // Reset the string
                }
                else
                {
                    str += c;
                }
            } while (charValue >= 0);
        }
        Console.ReadLine();
    }

    private static void Process(string str)
    {
        // Your actual job
        Console.WriteLine(str);
    }
}
Also, depending on the length of each chunk between the | characters, you may want to use a StringBuilder rather than simple string concatenation.
Chances are that if you need to improve the performance of your application, the code inside the foreach loop is what needs to be optimized, not the string.Split call.
[EDIT:]
There are a number of good answers elsewhere on StackOverflow related to optimized string parsing:
Fastest Way to Parse Large Strings (multi threaded)
Fast string parsing in C#
String.Split() likely does more than you can do on your own to split the string up in a well-optimized manner. That assumes you are interested in returning true or false for each split section of your input, of course. Otherwise, you can just focus on searching your string.
As others have mentioned, if you need to search through a huge string (many hundreds of megabytes) and, especially, do so repeatedly and continuously, then look at what .NET 4 gives you with the Task Parallel Library.
For searching through strings, you can look at this example on MSDN for how to use IndexOf, LastIndexOf, StartsWith, and EndsWith methods. Those should perform better than the Contains method.
Of course, the best solution is dependent upon the facts of your particular situation. You'll want to use the System.Diagnostics.Stopwatch class to see how long your implementations (both current and new) take to see what works best.
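For instance, here is a sketch of walking the original SubLogic string with IndexOf instead of Split; it keeps the same logic as the question's loop but materializes only one segment at a time (FinalResult is declared locally here just for the sketch):

// Sketch only: iterate '|'-separated segments without allocating the whole array.
bool FinalResult = false;
int start = 0;
while (start <= SubLogic.Length)
{
    int end = SubLogic.IndexOf('|', start);
    if (end < 0) end = SubLogic.Length;
    string orSet = SubLogic.Substring(start, end - start);

    bool orSetResult;
    if (orSet.IndexOf('&') >= 0)
    {
        orSetResult = orSet.IndexOf('0') < 0;   // any '0' falsifies the AND group
    }
    else
    {
        orSetResult = orSet == "1";
    }

    if (orSetResult)
    {
        FinalResult = true;                      // one true OR term is enough
        break;
    }
    start = end + 1;
}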
You could possibly deal with it by using StringBuilder.
Try reading char by char from your source string into a StringBuilder until you find '|', then process what the StringBuilder contains.
That way you avoid creating tons of String objects and save a lot of memory.
If you were using Java, I'd recommend the StringTokenizer and StreamTokenizer classes; it's a pity there are no similar classes in .NET.
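A rough sketch of that idea (ProcessOrSet here is a hypothetical stand-in for whatever you do with each chunk; requires System.Text):

// Sketch only: one reusable StringBuilder, flushed at each '|'.
var sb = new StringBuilder();
foreach (char c in SubLogic)
{
    if (c == '|')
    {
        ProcessOrSet(sb.ToString());
        sb.Length = 0;                 // reset without allocating a new builder
    }
    else
    {
        sb.Append(c);
    }
}
if (sb.Length > 0)
{
    ProcessOrSet(sb.ToString());       // last chunk has no trailing '|'
}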

Lucene serializer in C#, need performance advice

I'm trying to build a Lucene serializer class that serializes/de-serializes objects (classes) whose properties are decorated with DataMember and a special attribute carrying instructions on how to store the property/field in a Lucene index.
The class works fine when I need to retrieve a single object by a certain key/value pair.
But I noticed that if I need to retrieve all items, and there are, let's say, 100,000 documents, then MySQL does it about 10 times faster, for some reason.
Could you please review this code (Lucene experts) and suggest any possible performance-related improvements?
public IEnumerable<T> LoadAll()
{
    IndexReader reader = IndexReader.Open(this.PathToLuceneIndex);
    int itemsCount = reader.NumDocs();
    for (int i = 0; i < itemsCount; i++)
    {
        if (!reader.IsDeleted(i))
        {
            Document doc = reader.Document(i);
            if (doc != null)
            {
                T item = Deserialize(doc);
                yield return item;
            }
        }
    }
    if (reader != null) reader.Close();
}

private T Deserialize(Document doc)
{
    T itemInstance = Activator.CreateInstance<T>();
    foreach (string fieldName in fieldTypes.Keys)
    {
        Field myField = doc.GetField(fieldName);
        // Not every document may have the full collection of indexable fields
        if (myField != null)
        {
            object fieldValue = myField.StringValue();
            Type fieldType = fieldTypes[fieldName];
            if (fieldType == typeof(bool))
                fieldValue = fieldValue == "1" ? true : false;
            if (fieldType == typeof(DateTime))
                fieldValue = DateTools.StringToDate((string)fieldValue);
            pF.SetValue(itemInstance, fieldName, fieldValue);
        }
    }
    return itemInstance;
}
Thank you in advance!
Here are some tips:
First, don't use IndexReader.Open(string path). Not only will it be removed in the next major release of Lucene.net, it's generally not your best option. There's actually a ton of unnecessary code called when you let Lucene generate the directory for you. I suggest:
var dir = new SimpleFSDirectory(new DirectoryInfo(path));
var reader = IndexReader.Open(dir, true);
You should also do as I did above, and open the IndexReader as readonly, if you don't absolutely need to write to it, as it will be quicker in multi-threaded environments especially.
If you know the size of your index is not more than you can hold in memory (i.e. less than 500-600 MB and not compressed), you can use a RAMDirectory instead. This will load the entire index into memory, allowing you to bypass most of the costly I/O operations you would incur if you left the index on disk. It should greatly improve your speed, especially if you combine it with the other suggestions below.
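If you go that route, the setup could look roughly like this (assuming the Lucene.Net 2.9.x API, where RAMDirectory can copy an existing directory; pathToLuceneIndex stands in for this.PathToLuceneIndex):

// Rough sketch: copy the on-disk index into RAM once, then open a read-only reader.
var fsDir = new SimpleFSDirectory(new DirectoryInfo(pathToLuceneIndex));
var ramDir = new RAMDirectory(fsDir);          // loads the whole index into memory
var reader = IndexReader.Open(ramDir, true);   // readonly reader, as above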
If the index is too large to fit in memory, you either need to split the index up into chunks (ie an index every n MBs) or just continue to read it from disk.
Also, I know you can't yield return inside a try...catch, but you can inside a try...finally, and I would recommend wrapping the logic in LoadAll() in a try...finally, like:
IndexReader reader = null;
try
{
    // logic here...
}
finally
{
    if (reader != null) reader.Close();
}
Now, when it comes to your actual Deserialize code, you're probably doing it in nearly the fastest way possible, except that you are boxing the string when you don't need to. Lucene only stores the field as a byte[] array or a string. Since you're calling StringValue(), you know it will always be a string, and you should only have to box it if absolutely necessary. Change it to this:
string fieldValue = myField.StringValue();
That will at least sometimes save you a minor boxing cost. (really, not much)
On the topic of boxing, we're working on a branch of Lucene you can pull from SVN that changes the internals of Lucene from using boxing containers (ArrayLists, non-generic Lists and HashTables) to a version that uses generics and more .NET-friendly constructs. This is the 2.9.4g branch; .NET'ified, as we like to say. We haven't officially benchmarked it, but developer tests have shown it to be, in some cases, around 200% faster than older versions.
The other thing to keep in mind: Lucene is great as a search engine, but you may find that in some cases it does not stack up to MySQL. Really, though, the only way to know for sure is to test and look for performance bottlenecks like some of the ones I mentioned above.
Hope that helps! Don't forget about the Lucene.Net mailing list (lucene-net-dev@lucene.apache.org) if you have any questions. The other committers and I are generally quick to answer them.

C# - Is a collection enough, or will combining it with LINQ improve performance?

According to the requirements, we have to return a collection either in reverse order or as it is. We, beginner-level programmers, designed the collection as follows (a sample is given):
using System;
using System.Collections.Generic;

namespace Linqfying
{
    class linqy
    {
        static void Main()
        {
            InvestigationReport rpt = new InvestigationReport();
            // rpt.GetDocuments(true) refers
            // to returning the collection in reverse order
            foreach (EnquiryDocument doc in rpt.GetDocuments(true))
            {
                // printing document title and author name
            }
        }
    }

    class EnquiryDocument
    {
        string _docTitle;
        string _docAuthor;

        // properties to get and set doc title and author name go below

        public EnquiryDocument(string title, string author)
        {
            _docAuthor = author;
            _docTitle = title;
        }

        public EnquiryDocument() { }
    }

    class InvestigationReport
    {
        EnquiryDocument[] docs = new EnquiryDocument[3];

        public IEnumerable<EnquiryDocument> GetDocuments(bool IsReverseOrder)
        {
            /* some business logic to retrieve the documents
            docs[0] = new EnquiryDocument("FundAbuse", "Margon");
            docs[1] = new EnquiryDocument("Sexual Harassment", "Philliphe");
            docs[2] = new EnquiryDocument("Missing Resource", "Goel");
            */

            // if reverse order is preferred
            if (IsReverseOrder)
            {
                for (int i = docs.Length; i != 0; i--)
                    yield return docs[i - 1];
            }
            else
            {
                foreach (EnquiryDocument doc in docs)
                {
                    yield return doc;
                }
            }
        }
    }
}
Questions:
Can we use another collection type to improve efficiency?
Would mixing the collection with LINQ reduce the code? (We are not familiar with LINQ.)
Looks fine to me. Yes, you could use the Reverse extension method... but that won't be as efficient as what you've got.
How much do you care about the efficiency though? I'd go with the most readable solution (namely Reverse) until you know that efficiency is a problem. Unless the collection is large, it's unlikely to be an issue.
If you've got the "raw data" as an array, then your use of an iterator block will be more efficient than calling Reverse. The Reverse method will buffer up all the data before yielding it one item at a time - just like your own code does, really. However, simply calling Reverse would be a lot simpler...
Aside from anything else, I'd say it's well worth you learning LINQ - at least LINQ to Objects. It can make processing data much, much cleaner than before.
Two questions:
Does the code you currently have work?
Have you identified this piece of code as being your performance bottleneck?
If the answer to either of those questions is no, don't worry about it. Just make it work and move on. There's nothing grossly wrong about the code, so no need to fret! Spend your time building new functionality instead. Save LINQ for a new problem you haven't already solved.
Actually this task seems pretty straightforward. I'd actually just use the Reverse method on a Generic List.
This should already be well-optimized.
Your GetDocuments method has a return type of IEnumerable, so there is no need to even loop over your array when IsReverseOrder is false: you can just return the array as-is, since the Array type is IEnumerable.
As for when IsReverseOrder is true, you can use either Array.Reverse or the LINQ Reverse() extension method to reduce the amount of code, as sketched below.
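Roughly, the whole method could shrink to something like this (requires using System.Linq; for the Reverse() extension; the docs field is the same array as in the question):

// Sketch of the simplification described above.
public IEnumerable<EnquiryDocument> GetDocuments(bool isReverseOrder)
{
    // ... business logic that fills docs ...
    if (isReverseOrder)
        return docs.Reverse();   // LINQ buffers the array and yields it backwards
    return docs;                 // the array is already IEnumerable<EnquiryDocument>
}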
