Lucene serializer in C#, need performance advice

I'm trying to build a Lucene serializer class that would serialize/de-serialize objects (classes) with properties decorated with the DataMember attribute and a special attribute with instructions on how to store the property/field in a Lucene index.
The class works fine when I need to retrieve a single object by a certain key/value pair.
But I noticed that sometimes I need to retrieve all items, and when there are, let's say, 100,000 documents, MySQL does it about 10 times faster... for some reason...
Could you please review this code (Lucene experts) and suggest any possible performance-related improvements?
public IEnumerable<T> LoadAll()
{
    IndexReader reader = IndexReader.Open(this.PathToLuceneIndex);
    int itemsCount = reader.NumDocs();
    for (int i = 0; i < itemsCount; i++)
    {
        if (!reader.IsDeleted(i))
        {
            Document doc = reader.Document(i);
            if (doc != null)
            {
                T item = Deserialize(doc);
                yield return item;
            }
        }
    }
    if (reader != null) reader.Close();
}
private T Deserialize(Document doc)
{
    T itemInstance = Activator.CreateInstance<T>();
    foreach (string fieldName in fieldTypes.Keys)
    {
        Field myField = doc.GetField(fieldName);
        //Not every document may have the full collection of indexable fields
        if (myField != null)
        {
            object fieldValue = myField.StringValue();
            Type fieldType = fieldTypes[fieldName];
            if (fieldType == typeof(bool))
                fieldValue = (string)fieldValue == "1";
            if (fieldType == typeof(DateTime))
                fieldValue = DateTools.StringToDate((string)fieldValue);
            // pF is a helper (defined elsewhere in the class) that assigns the named property/field on itemInstance
            pF.SetValue(itemInstance, fieldName, fieldValue);
        }
    }
    return itemInstance;
}
Thank you in advance!

Here are some tips:
First, don't use IndexReader.Open(string path). Not only will it be removed in the next major release of Lucene.net, it's generally not your best option. There's actually a ton of unnecessary code called when you let Lucene generate the directory for you. I suggest:
var dir = new SimpleFSDirectory(new DirectoryInfo(path));
var reader = IndexReader.Open(dir, true);
You should also do as I did above and open the IndexReader as read-only if you don't absolutely need to write to it, as it will be quicker, especially in multi-threaded environments.
If you know the size of your index is not more than you can hold in memory (i.e. less than 500-600 MB and not compressed), you can use a RAMDirectory instead. This will load the entire index into memory, allowing you to bypass most of the costly IO operations you would incur if you left the index on disk. It should greatly improve your speed, especially if you combine it with the other suggestions below.
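For example, a minimal sketch of loading the on-disk index into memory (assuming the same path variable as in the snippet above):
// Copy the on-disk index into RAM, then open a read-only reader over it.
var fsDir = new SimpleFSDirectory(new DirectoryInfo(path));
var ramDir = new RAMDirectory(fsDir); // loads the whole index into memory
var reader = IndexReader.Open(ramDir, true);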
If the index is too large to fit in memory, you either need to split the index up into chunks (ie an index every n MBs) or just continue to read it from disk.
Also, I know you can't yield return in a try...catch, but you can in a try...finally, and I would recommend wrapping the logic in LoadAll() in a try...finally, like:
IndexReader reader = null;
try
{
    //logic here...
}
finally
{
    if (reader != null) reader.Close();
}
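Putting those pieces together, a sketch of what LoadAll() could look like (it reuses the PathToLuceneIndex property and Deserialize method from the question):
public IEnumerable<T> LoadAll()
{
    IndexReader reader = null;
    try
    {
        var dir = new SimpleFSDirectory(new DirectoryInfo(this.PathToLuceneIndex));
        reader = IndexReader.Open(dir, true); // read-only reader
        int itemsCount = reader.NumDocs();
        for (int i = 0; i < itemsCount; i++)
        {
            if (reader.IsDeleted(i)) continue;
            Document doc = reader.Document(i);
            if (doc != null)
                yield return Deserialize(doc);
        }
    }
    finally
    {
        if (reader != null) reader.Close();
    }
}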
Now, when it comes to your actual Deserialize code, you're probably doing it in nearly the fastest way possible, except that you are boxing the string when you don't need to. Lucene only stores the field as a byte[] array or a string. Since you're calling StringValue(), you know it will always be a string, and you should only box it if absolutely necessary. Change it to this:
string fieldValue = myField.StringValue();
That will at least sometimes save you a minor boxing cost. (really, not much)
On the topic of boxing, we're working on a branch of Lucene you can pull from SVN that changes the internals of Lucene from using boxing containers (ArrayLists, non-generic Lists and Hashtables) to a version that uses generics and more .NET-friendly things. This is the 2.9.4g branch, .NET'ified, as we like to say. We haven't officially benchmarked it, but developer tests have shown it, in some cases, to be around 200% faster than older versions.
The other thing to keep in mind: Lucene is great as a search engine, but you may find that in some cases it doesn't stack up to MySQL. Really, though, the only way to know for sure is to test and try to find performance bottlenecks like some of the ones I mentioned above.
Hope that helps! Don't forget about the Lucene.Net mailing list (lucene-net-dev@lucene.apache.org) either, if you have any questions. The other committers and I are generally quick to answer questions.

Related

Looking for better understanding on the coding standards

I installed CodeCracker
This is my original method.
//Add
public bool AddItemToMenu(MenuMapper mapperObj)
{
    using (fb_databaseContext entities = new fb_databaseContext())
    {
        try
        {
            FoodItem newItem = new FoodItem();
            newItem.ItemCategoryID = mapperObj.ItemCategory;
            newItem.ItemName = mapperObj.ItemName;
            newItem.ItemNameInHindi = mapperObj.ItemNameinHindi;
            entities.FoodItems.Add(newItem);
            entities.SaveChanges();
            return true;
        }
        catch (Exception ex)
        {
            //handle exception
            return false;
        }
    }
}
This is the recommended method by CodeCracker.
public static bool AddItemToMenu(MenuMapper mapperObj)
{
    using (fb_databaseContext entities = new fb_databaseContext())
    {
        try
        {
            var newItem = new FoodItem
            {
                ItemCategoryID = mapperObj.ItemCategory,
                ItemName = mapperObj.ItemName,
                ItemNameInHindi = mapperObj.ItemNameinHindi,
            };
            entities.FoodItems.Add(newItem);
            entities.SaveChanges();
            return true;
        }
        catch (Exception ex)
        {
            //handle exception
            return false;
        }
    }
}
As far as I know, static methods occupy memory when the application initializes, irrespective of whether they are called or not.
When I already know the return type, then why should I use the var keyword?
Why is this way of object initialization better?
I am very curious to get these answers, as they can guide me a long way.
Adding one more method:
private string GeneratePaymentHash(OrderDetailMapper order)
{
    var payuBizzString = string.Empty;
    payuBizzString = "hello|" + order.OrderID + "|" + order.TotalAmount + "|FoodToken|" + order.CustomerName + "|myemail@gmail.com|||||||||||10000";
    var sha1 = System.Security.Cryptography.SHA512Managed.Create();
    var inputBytes = Encoding.ASCII.GetBytes(payuBizzString);
    var hash = sha1.ComputeHash(inputBytes);
    var sb = new StringBuilder();
    for (var i = 0; i < hash.Length; i++)
    {
        sb.Append(hash[i].ToString("X2"));
    }
    return sb.ToString().ToLower();
}
As far as I know, static methods occupy memory when the application initializes, irrespective of whether they are called or not.
All methods do that. You are probably confusing this with static fields, which occupy memory even when no instances of the class are created. Generally, if a method can be made static, it should be made static, except when it is an implementation of an interface.
When I already know the return type, then why should I use the var keyword?
To avoid specifying the type twice on the same line of code.
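For example (using the context type from the question; a trivial illustration):
fb_databaseContext entities = new fb_databaseContext(); // type name written twice
var entities2 = new fb_databaseContext();               // type inferred once, still statically typed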
Why is this way of object initialization better?
Because it groups the assignments visually, and reduces the clutter around them, making it easier to read.
Static methods don't occupy any more memory than instance methods. Additionally, your method should be static because it doesn't rely in any way on accessing itself (this) as an instance.
Using var is most likely for readability. var is always only 3 letters while many types are much longer and can force the name of the variable much further along the line.
The object initializer is, again, most likely for readability by not having the variable name prefix all the attributes. It also means all your assignments are done at once.
In most cases, this tool you're using seems to be about making code more readable and clean. There may be certain cases where changes will boost performance by hinting to the compiler about your intentions, but generally, this is about being able to understand the code at a glance.
Only concern yourself with performance if you're actually experiencing performance issues. If you are experiencing performance issues then use some profiling tools to measure your application performance and find out which parts of your code are running slowly.
As far as I know, static methods occupy memory when the application initializes, irrespective of whether they are called or not.
This is true for all kinds of methods, so that's irrelevant.
When I already know the return type, then why should I use the var keyword?
var is a personal preference (it is syntactic sugar). This analyzer might reason that since the return type is already known, there is no need to state the type explicitly, so it recommends using var instead. Personally, I use var as much as possible. For this issue, you might want to read Use of var keyword in C#
Why is this way of object initialization better?
I can't say an object initializer is always better, but an object initializer guarantees that your newItem is either null or fully initialized, since your:
var newItem = new FoodItem
{
    ItemCategoryID = mapperObj.ItemCategory,
    ItemName = mapperObj.ItemName,
    ItemNameInHindi = mapperObj.ItemNameinHindi,
};
is actually equal to
var temp = new FoodItem();
temp.ItemCategoryID = mapperObj.ItemCategory;
temp.ItemName = mapperObj.ItemName;
temp.ItemNameInHindi = mapperObj.ItemNameinHindi;
var newItem = temp;
so, this is not the same as your first one. There is a nice answer on Code Review about this subject. https://codereview.stackexchange.com/a/4330/6136 Also you might wanna check: http://community.bartdesmet.net/blogs/bart/archive/2007/11/22/c-3-0-object-initializers-revisited.aspx
A lot of these are personal preferences, but most coding standards allow other programmers to read your code more easily.
Changing the static method to an instance method takes more advantage of OO concepts; it limits the amount of mixed state and also allows you to add interfaces so you can mock out the class for testing.
The var keyword is still statically typed, but since we should concentrate on giving our objects more meaningful names, explicitly declaring the type becomes redundant.
As for the object initialisation, this just groups everything that is required to set up the object. It makes it a little easier to read.
As far as I know, static methods occupy memory when the application initializes, irrespective of whether they are called or not.
Methods that are never called may or may not be optimized away, depending on the compiler, debug vs. release and such. Static vs. non-static does not matter.
A method that doesn't need a this reference can (and IMO should) be static.
When I already know the return type, then why should I use the var keyword?
No reason. There's no difference; do whatever you prefer.
Why is this way of object initialization better?
The object initializer syntax generates the same code for most practical purposes (see the answer by Soner Gönül for the details). Mostly it's a matter of preference; personally, I find the object initializer syntax easier to read and maintain.

Exception of type 'System.OutOfMemoryException' was thrown

Basically I use Entity Framework to query a huge database. I want to return a string list then log it to a text file.
List<string> logFilePathFileName = new List<string>();
var query = from c in DBContext.MyTable where condition == something select c;
foreach (var result in query)
{
    filePath = result.FilePath;
    fileName = result.FileName;
    string temp = filePath + "." + fileName;
    logFilePathFileName.Add(temp);
    if (logFilePathFileName.Count % 1000 == 0)
        Console.WriteLine(temp + "." + logFilePathFileName.Count);
}
However I got an exception when logFilePathFileName.Count=397000.
The exception is:
Exception of type 'System.OutOfMemoryException' was thrown.
A first chance exception of type 'System.OutOfMemoryException'
occurred in System.Data.Entity.dll
UPDATE:
What I want is to use a different query, say select top 1000, and add the results to the list, but I don't know what to do after the first 1000.
Most probably it's not about RAM as such, so increasing your RAM or even compiling and running your code on a 64-bit machine will not have a positive effect in this case.
I think it's related to the fact that .NET collections are limited to a maximum of 2 GB of memory (no difference between 32-bit and 64-bit).
To resolve this, split your list into much smaller chunks and most probably your problem will be gone.
Just one possible solution:
foreach (var result in query)
{
    ....
    if (logFilePathFileName.Count % 1000 == 0)
    {
        Console.WriteLine(temp + "." + logFilePathFileName.Count);
        //WRITE SOMEWHERE YOU NEED
        logFilePathFileName = new List<string>(); //RESET LIST!
    }
}
EDIT
If you want to fragment a query, you can use Skip(...) and Take(...)
Just an explanatory example:
var first1000 = query.Skip(0).Take(1000);
var second1000 = query.Skip(1000).Take(1000);
...
and so on..
Naturally, put it in your iteration and parametrize it based on the bounds of the data you know or need.
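For instance, a rough paging loop (pageSize and the WriteBatch helper are hypothetical names, just to illustrate the idea):
const int pageSize = 1000;
for (int page = 0; ; page++)
{
    // EF typically requires an explicit OrderBy on the query before Skip/Take will translate to SQL.
    var batch = query.Skip(page * pageSize).Take(pageSize).ToList();
    if (batch.Count == 0)
        break;
    WriteBatch(batch); // hypothetical helper that logs/writes the current chunk and lets it go out of scope
}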
Why are you collecting the data in a List<string> if all you need to do is write it to a text file?
You might as well just:
Open the text file;
Iterate over the records, appending each string to the text file (without storing the strings in memory);
Flush and close the text file.
You will need far less memory than now, because you won't be keeping all those strings unnecessarily in memory.
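A minimal sketch of that approach, reusing the query from the question (the output path is hypothetical):
// requires: using System.IO;
using (var writer = new StreamWriter(@"C:\temp\log.txt")) // hypothetical output path
{
    foreach (var result in query)
    {
        // Each line goes straight to disk; nothing accumulates in memory.
        writer.WriteLine(result.FilePath + "." + result.FileName);
    }
} // writer is flushed and closed when the using block ends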
You probably need to set some vmargs for memory!
Also... look into writing it straight to your file and not holding it in a List
What Roy Dictus says sounds like the best way.
Also, you can try to add a limit to your query so your database result won't be so large.
For info on:
Limiting query size with entity framework
You shouldn't read all records from the database into a list. That requires a lot of memory. You can combine reading records and writing them to the file. For example, read 1000 records from the db into a list and save (append) them to the text file, clear the used memory (list.Clear()) and continue with new records.
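A small sketch of that batching idea, reusing the query from the question (the batch size and output path are assumptions):
// requires: using System.IO;
const int batchSize = 1000;
var buffer = new List<string>();
foreach (var result in query)
{
    buffer.Add(result.FilePath + "." + result.FileName);
    if (buffer.Count == batchSize)
    {
        File.AppendAllLines(@"C:\temp\log.txt", buffer); // hypothetical output path
        buffer.Clear(); // free the memory held by this batch
    }
}
if (buffer.Count > 0)
    File.AppendAllLines(@"C:\temp\log.txt", buffer); // flush the remainder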
From several other topics on Stack Overflow I read that Entity Framework is not designed to handle bulk data like that. EF will cache/track all data in the context, which causes the exception with huge volumes of data. Options are to use SQL directly or to split your records into smaller sets.
I used to use the gc ArrayList in VC++, similar to the gc List that you used; it works fine with small and intermediate data sets, but when using big data the same problem, 'System.OutOfMemoryException', was thrown.
As the size of these gc collections cannot exceed 2 GB, they become inefficient with big data, so I built my own linked list, which gives the same functionality: dynamic growth and access by index. Basically, it is a normal linked-list class with a dynamic array inside to provide access by index. It duplicates the space, but if you do not need the linked list you may delete it after updating the array, keeping only the dynamic array; this solves the problem. See the code:
struct LinkedNode
{
    long data;
    LinkedNode* next;
};

class LinkedList
{
public:
    LinkedList();
    ~LinkedList();
    LinkedNode* head;
    long Count;
    long* Data;
    void add(long data);
    void update();
    //long get(long index);
};

LinkedList::LinkedList()
{
    this->Count = 0;
    this->head = NULL;
    this->Data = NULL; // make the destructor's check well defined even if update() is never called
}

LinkedList::~LinkedList()
{
    LinkedNode* temp;
    while (head)
    {
        temp = this->head;
        head = head->next;
        delete temp;
    }
    if (Data)
        delete[] Data;
    Data = NULL;
}

void LinkedList::add(long data)
{
    LinkedNode* node = new LinkedNode();
    node->data = data;
    node->next = this->head;
    this->head = node;
    this->Count++;
}

void LinkedList::update()
{
    // Copy the linked list into a contiguous array so items can be fetched by index.
    this->Data = new long[this->Count];
    long i = 0;
    LinkedNode* node = this->head;
    while (node)
    {
        this->Data[i] = node->data;
        node = node->next;
        i++;
    }
}
If you use this, please refer to my work https://www.liebertpub.com/doi/10.1089/big.2018.0064

Reflection is too slow while deserialising JSON strings into .NET objects

I'm having some issues with System.Reflection in C#. I'm pulling data from a database and retrieving that data in a JSON string. I've made my own implementation of handling the data from JSON into my self-declared objects using reflection. However, since I usually get a JSON string with an array of about 50 - 100 objects, my program runs really slow because of the loops I'm using with reflection.
I've heard that reflection is slow, but it shouldn't be this slow. I feel something is not right in my implementation, since I have a different project where I use the JSON.NET serializer and instantiate my objects a bit differently with reflection, and it runs just fine on the same output (less than a second), while my slow program takes about 10 seconds for 50 objects.
Below are my classes that I'm using to store data
class DC_Host
{
    public string name;
    public void printProperties()
    {
        //Prints all properties of a class using reflection
        //Doesn't really matter, since I'm not using this for processing
    }
}
class Host : DC_Host
{
    public string asset_tag;
    public string assigned;
    public string assigned_to;
    public string attributes;
    public bool? can_print;
    public string category;
    public bool? cd_rom;
    public int? cd_speed;
    public string change_control;
    public string chassis_type;
    //And some more properties (around 70 - 80 fields in total)
Below you'll find my methods for processing the information into the objects that are stored inside a List. The JSON data is stored inside a dictionary that contains another dictionary for every array object defined in the JSON input. Deserializing the JSON happens in a matter of milliseconds, so there shouldn't be a problem there.
public List<DC_Host> readJSONTtoHost(ref Dictionary<string, dynamic> json)
{
    bool array = isContainer();
    List<DC_Host> hosts = new List<DC_Host>();
    //Do different processing on objects depending on table type (array/single)
    if (array)
    {
        foreach (Dictionary<string, dynamic> obj in json[json.First().Key])
        {
            hosts.Add(reflectToObject(obj));
        }
    }
    else
    {
        hosts.Add(reflectToObject(json[json.First().Key]));
    }
    return hosts;
}
private DC_Host reflectToObject(Dictionary<string, dynamic> obj)
{
    Host h = new Host();
    FieldInfo[] fields = h.GetType().GetFields();
    foreach (FieldInfo f in fields)
    {
        Object value = null;
        /* If there are values that are not in the dictionary, or where the wrong conversion is
         * used, the values will not be processed and therefore not inserted into the
         * host object, or just ignored. At a later stage I might post specific error messages
         * in the catch block. */
        /* TODO : Optimize and find out why this is so slow */
        try
        {
            value = obj[convTable[f.Name]];
        }
        catch { }
        if (value == null)
        {
            f.SetValue(h, null);
            continue;
        }
        // The system works with list containers, BUT then there may not be any loose values,
        // so this depends very heavily on the implementation of Service Now.
        if (f.FieldType == typeof(List<int?>)) //Arrays for strings, ints and bools still need to be defined
        {
            int count = obj[convTable[f.Name]].Count;
            List<int?> temp = new List<int?>();
            for (int i = 0; i < count; i++)
            {
                temp.Add(obj[convTable[f.Name]][i]);
                f.SetValue(h, temp);
            }
        }
        else if (f.FieldType == typeof(int?))
            f.SetValue(h, int.Parse((string)value));
        else if (f.FieldType == typeof(bool?))
            f.SetValue(h, bool.Parse((string)value));
        else
            f.SetValue(h, (string)value);
    }
    Console.WriteLine("Processed " + h.name);
    return h;
}
I'm not sure what JSON.NET's implementation is in the background for using reflection, but I'm assuming they use something I'm missing for optimising their reflection.
Basically, high-performance code like this tends to use meta-programming extensively; lots of ILGenerator etc (or Expression / CodeDom if you find that scary). PetaPoco showed a similar example earlier today: prevent DynamicMethod VerificationException - operation could destabilize the runtime
You could also look at the code of other serialization engines, such as protobuf-net, which has crazy amounts of meta-programming.
If you don't want to go quite that far, you could look at FastMember, which handles the crazy stuff for you, so you just have to worry about object/member-name/value.
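If you go the FastMember route, a rough sketch might look like the following (it assumes the same obj dictionary and convTable from the question, and that convTable maps field names to JSON keys; also check that the FastMember version you use supports public fields as well as properties):
// using FastMember;  (NuGet package FastMember)
var accessor = TypeAccessor.Create(typeof(Host)); // build once, reuse for every object
var h = new Host();
foreach (FieldInfo f in typeof(Host).GetFields())
{
    string jsonKey;
    dynamic value = null;
    if (convTable.TryGetValue(f.Name, out jsonKey) && obj.TryGetValue(jsonKey, out value))
    {
        // The accessor sets the member via generated IL instead of per-call reflection.
        accessor[h, f.Name] = value;
    }
}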
For people that are running into this article, I'll post the solution to my problem here.
The issue wasn't really related to reflection. There are ways to improve the speed using reflection, as CodesInChaos and Marc Gravell mentioned, and Marc even created a very useful library (FastMember) for people without too much experience in low-level reflection.
The solution, however, was not related to reflection itself. I had a try...catch statement to check whether values exist in my dictionary. Using try...catch statements to handle program flow is not a good idea. Handling exceptions is heavy on performance, and especially when you're running the debugger, try...catch statements can drastically kill your performance.
//New implementation: use TryGetValue from Dictionary to check for existing values.
dynamic value = null;
obj.TryGetValue(convTable[f.Name], out value);
My program runs perfectly fine now that I've removed the try...catch statement.

How do I optimize schemaDocument.Namespaces code for performance?

I have this code that is called thousands of times and I need to optimize it for performance.
I thought about caching xmlQualifiedNames but it's not good enough.
Any ideas?
private static string GetPrefixForNamespace(string ns, XmlSchema schemaDocument)
{
    string prefix = null;
    XmlQualifiedName[] xmlQualifiedNames = schemaDocument.Namespaces.ToArray();
    foreach (XmlQualifiedName qn in xmlQualifiedNames)
    {
        if (ns == qn.Namespace)
        {
            prefix = qn.Name;
            break;
        }
    }
    return prefix;
}
Since you're looking for strings (the Namespace) inside the xmlQualifiedNames, how about caching those?
Or using LINQ to search in them?
Or - depending on the kind of input you get - using memoization to speed up your calls (really just fancy caching) like in this article.
Stuff it in a Dictionary or Hashtable, or even some caching mechanism.
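For illustration, a minimal sketch that caches the namespace-to-prefix map per schema so the linear scan happens only once (the cache field and dictionary shape are assumptions, not from the original code):
// Per-schema cache of namespace -> prefix, built once from schemaDocument.Namespaces.
// Not thread-safe; wrap in a lock or use ConcurrentDictionary if called from multiple threads.
private static readonly Dictionary<XmlSchema, Dictionary<string, string>> prefixCache =
    new Dictionary<XmlSchema, Dictionary<string, string>>();

private static string GetPrefixForNamespace(string ns, XmlSchema schemaDocument)
{
    Dictionary<string, string> map;
    if (!prefixCache.TryGetValue(schemaDocument, out map))
    {
        map = new Dictionary<string, string>();
        foreach (XmlQualifiedName qn in schemaDocument.Namespaces.ToArray())
        {
            if (!map.ContainsKey(qn.Namespace))
                map[qn.Namespace] = qn.Name;
        }
        prefixCache[schemaDocument] = map;
    }
    string prefix;
    return map.TryGetValue(ns, out prefix) ? prefix : null;
}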

C# - Collection is enough or comobination of LINQ will improve performance?

According to the requirement, we have to return a collection either in reverse order or as it is. We, beginner-level programmers, designed the collection as follows (a sample is given):
namespace Linqfying
{
    class linqy
    {
        static void Main()
        {
            InvestigationReport rpt = new InvestigationReport();
            // rpt.GetDocuments(true) refers
            // to return the collection in reverse order
            foreach (EnquiryDocument doc in rpt.GetDocuments(true))
            {
                // printing document title and author name
            }
        }
    }

    class EnquiryDocument
    {
        string _docTitle;
        string _docAuthor;

        // properties to get and set doc title and author name go below

        public EnquiryDocument(string title, string author)
        {
            _docAuthor = author;
            _docTitle = title;
        }

        public EnquiryDocument() { }
    }

    class InvestigationReport
    {
        EnquiryDocument[] docs = new EnquiryDocument[3];

        public IEnumerable<EnquiryDocument> GetDocuments(bool IsReverseOrder)
        {
            /* some business logic to retrieve the documents
            docs[0] = new EnquiryDocument("FundAbuse", "Margon");
            docs[1] = new EnquiryDocument("Sexual Harassment", "Philliphe");
            docs[2] = new EnquiryDocument("Missing Resource", "Goel");
            */
            //if reverse order is preferred
            if (IsReverseOrder)
            {
                for (int i = docs.Length; i != 0; i--)
                    yield return docs[i - 1];
            }
            else
            {
                foreach (EnquiryDocument doc in docs)
                {
                    yield return doc;
                }
            }
        }
    }
}
Questions:
Can we use another collection type to improve efficiency?
Would mixing the collection with LINQ reduce the code? (We are not familiar with LINQ.)
Looks fine to me. Yes, you could use the Reverse extension method... but that won't be as efficient as what you've got.
How much do you care about the efficiency though? I'd go with the most readable solution (namely Reverse) until you know that efficiency is a problem. Unless the collection is large, it's unlikely to be an issue.
If you've got the "raw data" as an array, then your use of an iterator block will be more efficient than calling Reverse. The Reverse method will buffer up all the data before yielding it one item at a time - just like your own code does, really. However, simply calling Reverse would be a lot simpler...
Aside from anything else, I'd say it's well worth you learning LINQ - at least LINQ to Objects. It can make processing data much, much cleaner than before.
Two questions:
Does the code you currently have work?
Have you identified this piece of code as being your performance bottleneck?
If the answer to either of those questions is no, don't worry about it. Just make it work and move on. There's nothing grossly wrong about the code, so no need to fret! Spend your time building new functionality instead. Save LINQ for a new problem you haven't already solved.
Actually, this task seems pretty straightforward. I'd just use the Reverse method on a generic List.
This should already be well-optimized.
Your GetDocuments method has a return type of IEnumerable, so there is no need to even loop over your array when IsReverseOrder is false; you can just return it as-is, since the Array type is IEnumerable...
As for when IsReverseOrder is true you can use either Array.Reverse or the Linq Reverse() extension method to reduce the amount of code.
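For illustration, a condensed GetDocuments built on the LINQ Reverse() extension method (a sketch based on the class in the question; note that Reverse() buffers the array before yielding, as mentioned above):
// requires: using System.Linq;
public IEnumerable<EnquiryDocument> GetDocuments(bool IsReverseOrder)
{
    // docs is the already-populated array; Reverse() walks it back to front.
    return IsReverseOrder ? docs.Reverse() : docs;
}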
