I have an Excel COM add-in which reads the CustomDocumentProperties section of a workbook.
This is how I access a particular entry from the CustomDocumentProperties section:
DocumentProperties docProperties = (DocumentProperties)
xlWorkbook.CustomDocumentProperties;
docProperty = docProperties[propName];
The problem is that when the CustomDocumentProperties section contains more than 8000 entries, the performance of this code is really bad. I ran a CPU profiler and it showed that the following line takes more than a minute:
docProperty = docProperties[propName];
Does anyone know how to improve the performance of accessing DocumentProperties?
Thanks!
I doubt that there is anything that you could do to improve the performance of the document properties. I believe that it is implemented as a simple list -- not as a dictionary or hash table. In fact, I don't believe that the list is sorted, so with 8000 entries, on average half of them, or 4000, would have to be accessed in order to find the property that you are looking for.
You might consider not using the CustomDocumentProperties as a dictionary. Instead, you might try putting all 8000 of your entries into a custom dictionary, serializing it, and then adding the entire serialized dictionary to the CustomDocumentProperties as a single entry. So to use it, you would access the CustomDocumentProperties, deserialize the dictionary, and then use it repeatedly. When done, if there were any changes to the dictionary, you would have to re-serialize it and save it back to the CustomDocumentProperties, which you would probably only want to do once -- for example, just before saving your workbook. (You might want to put code to re-serialize and save your custom dictionary to the CustomDocumentProperties within the Workbook.BeforeSave event.)
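A minimal sketch of that idea, assuming the entries are simple string key/value pairs; the property name "PackedProps", the helper names LoadPacked/SavePacked and the '|'-delimited format are all made up for illustration (a real implementation might serialize to JSON or XML instead):

using System.Collections.Generic;
using Microsoft.Office.Core;   // DocumentProperties, DocumentProperty, MsoDocProperties

static Dictionary<string, string> LoadPacked(DocumentProperties docProperties)
{
    // Assumes the "PackedProps" property already exists; guard with a lookup otherwise.
    var dict = new Dictionary<string, string>();
    string raw = (string)docProperties["PackedProps"].Value;
    foreach (string pair in raw.Split('\n'))
    {
        int sep = pair.IndexOf('|');
        if (sep > 0)
            dict[pair.Substring(0, sep)] = pair.Substring(sep + 1);
    }
    return dict;   // look entries up here instead of hitting docProperties[propName]
}

static void SavePacked(DocumentProperties docProperties, Dictionary<string, string> dict)
{
    var lines = new List<string>();
    foreach (var kv in dict)
        lines.Add(kv.Key + "|" + kv.Value);

    // Replace (or create) the single packed property, e.g. from Workbook.BeforeSave.
    try { docProperties["PackedProps"].Delete(); } catch { /* not there yet */ }
    docProperties.Add("PackedProps", false,
        MsoDocProperties.msoPropertyTypeString, string.Join("\n", lines));
}

Be aware that string-valued custom document properties may be length-limited in some Office versions, so for a very large payload you might store the serialized dictionary in a CustomXMLPart or a hidden worksheet instead.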
I have a program that creates a list of objects from a file, and also creates a list of the same type of object from the database, but with fewer (and some different) properties populated, like:
List from FILE: Address ID, Address, City, State, Zip, other important properties
List from DB: Address ID, Address, City, State
I have implemented IEquatable on this CustObj so that it only compares against Address, City, and State, in the hopes of doing easy comparisons between the two lists.
The ultimate goal is to get the address ID from the database and update the address IDs for each address in the list of objects from the file. These two lists could have quite a lot of objects (over 1,000,000) so I want it to be fast.
The alternative is to offload this to the database and have the DB return the info we need. If that would be significantly faster/more resource efficient, I will go that route, but I want to see if it can be done quickly and efficiently in code first.
Anyway, I see there's a Zip method. I was wondering if I could use it to say "if there's a match between the two lists, keep the data in list 1 but update the address ID property of each object in list 1 to the address ID from list 2".
Is that possible?
The answer is, it really depends. There are a lot of parameters you haven't mentioned.
The only way to be sure is to build one solution (preferably using the Zip method, since it involves less work) and, if it works within the parameters of your requirements (time, memory footprint, or any other constraint), you can stop there.
Otherwise you have to offload it to the database. Mind you, you would have to hold the 1 million records from the files and the 1 million records from the DB in memory at the same time if you want to use the Zip method.
The problem with pushing everything to the database is that inserting that many records is resource-consuming (time, space, etc.). Moreover, if you want to do that every day, it is going to be even more difficult, resource-wise.
Your question didn't say whether this is going to be a one-time thing or a daily event in a production environment. Even that is going to make a difference in which approach to choose.
To repeat, you would have to try different approaches to see which works best for you based on your requirements: is this a one-time thing? How many resources does the process have? How much time does it have? And possibly many more.
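For the in-memory route, one caveat: Enumerable.Zip pairs elements strictly by position, so it only gives the result you describe if both lists are sorted identically. A keyed lookup is usually a safer way to express "if there's a match, copy the ID across". A rough sketch, assuming a hypothetical CustObj with AddressId/Address/City/State properties:

using System.Collections.Generic;

// Hypothetical shape of the objects described in the question.
class CustObj
{
    public int AddressId { get; set; }
    public string Address { get; set; }
    public string City { get; set; }
    public string State { get; set; }
}

static void CopyAddressIds(List<CustObj> fromFile, List<CustObj> fromDb)
{
    // Index the DB list once (O(n)); the composite string key stands in for the
    // Address/City/State equality the question implements via IEquatable.
    var dbIds = new Dictionary<string, int>();
    foreach (var d in fromDb)
        dbIds[d.Address + "|" + d.City + "|" + d.State] = d.AddressId;

    foreach (var rec in fromFile)
    {
        int id;
        if (dbIds.TryGetValue(rec.Address + "|" + rec.City + "|" + rec.State, out id))
            rec.AddressId = id;   // keep the file data, take the DB's address ID
    }
}

As the answer above notes, this still holds both lists (plus the lookup) in memory at the same time.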
It also kind of sounds like a job for .Aggregate(), e.g.:
var aggreg = list1.Aggregate(otherListPrefilled, (acc,elemFrom1) =>
{
// some code to create the joined data, using elemFrom1 to find
// and modify the correct element in otherListPrefilled
return acc;
});
Normally I would use an empty otherListPrefilled; not sure how it performs on 100k data items, though.
If it's a one-time thing, it's probably faster to export your file to a CSV, import it into your database as a temporary table and join the data in SQL.
So for example I have a Collection of Documents like this:
{
hotField1 : 0,
hotField2 : "",
coldField1 : 0,
...
coldFieldN : ""
}
In this scope, cold properties are written once and accessed sometimes; hot properties are written and then fairly often accessed/updated (but in different use cases; it is not the same sub-document or part of the same object).
The number of documents is fairly huge (1M and more), and the size of the hot data is at least ten times smaller than the cold data.
Since partial updates are still a most-wanted yet unimplemented feature, the only way to update hotField1 is to:
Request full document
Change either hotField1 or hotField2
Write back whole document
This is costly in terms of RUs, and doesn't scale so well.
So the question is: how to organize such data & calls in DocumentDB to minimize costs?
Discovered alternatives:
Obviously the best: retrieve one property, change it, update it - not available yet.
Split into two collections, and use stored procedures to retrieve from the main collection and then from the dictionary collection?
Put hotField1-2 in a subdocument ({ sub: {hf1:0, hf2:""}}) and somehow update only that? (I'm not sure if that is possible.)
PS: C# is in the tags because of the client library we use. If it lacks something, it's OK to use the REST interface instead.
While there's no exact "best" answer:
Your #2 choice will not work with stored procedures, since stored procedures are scoped to a collection.
Updating a subdocument (#3 choice) is no different than updating top-level properties - you are still retrieving, and re-writing, a document (a subdocument is just another property on the document).
While it may or may not reduce RU (you'd need to benchmark, as Larry pointed out in comments), you may choose to store your hot properties in a separate (smaller) document (or multiple smaller documents). With fewer properties, there would be less bandwidth consumed during updates and less index updating. However, since you'd now be retrieving more than one document (possibly across multiple calls), you may find that this negates any RU savings from storing everything in a single document.
Note: There's nothing stopping you from storing these separate documents in the same collection (which then lets you approach the problem with a stored procedure, as you suggested in your #2 choice). You'll just need to create some type of property to help you identify different document types.
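A rough sketch of that split-document idea with a type discriminator, using the .NET DocumentDB SDK; the class names, the docType property and the database/collection names are placeholders, not anything from the original question:

using System.Threading.Tasks;
using Microsoft.Azure.Documents.Client;
using Newtonsoft.Json;

// Small, frequently rewritten document: only the hot fields.
class HotDoc
{
    [JsonProperty("id")] public string Id { get; set; }           // e.g. "hot_" + entityId
    [JsonProperty("docType")] public string DocType { get; set; } = "hot";
    [JsonProperty("hotField1")] public int HotField1 { get; set; }
    [JsonProperty("hotField2")] public string HotField2 { get; set; }
}

// Large, rarely touched document: the cold fields, written once.
class ColdDoc
{
    [JsonProperty("id")] public string Id { get; set; }           // e.g. "cold_" + entityId
    [JsonProperty("docType")] public string DocType { get; set; } = "cold";
    [JsonProperty("coldField1")] public int ColdField1 { get; set; }
    // ... coldFieldN
}

static async Task UpdateHotAsync(DocumentClient client, HotDoc hot)
{
    // Only the small hot document is replaced; the cold document is untouched,
    // so less bandwidth is consumed and less indexing work is done per update.
    var uri = UriFactory.CreateDocumentUri("myDb", "myCollection", hot.Id);
    await client.ReplaceDocumentAsync(uri, hot);
}

Whether this actually saves RUs still needs to be benchmarked against your real documents, as noted above.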
Document-based NoSQL databases replace the whole document when you change one or all of its properties.
In terms of cost, DocumentDB is billed on a per-collection basis.
So, if you have a DB with two collections in it, each at a performance tier of S1 (i.e. $25/month), you pay:
$25 x 2 = $50
If you need better performance and change one of them to S2, you'll be charged:
$50 + $25 = $75
I have a SharePoint list on a site that I want to update nightly from a SQL Server DB, preferably using C#. Here is the catch: I do not know if any records were removed or added, or if any field in any record was updated. I believe, then, that the simplest thing to do is remove the data from the list and replace it with the new list data. But is there any simple way to do this? I would hate to remove 3000+ items from the list line by line and then add the 3000+ records one at a time.
It's up to your environment. If you don't have that much load on the systems at night, I would prefer one of the following ways:
1) Build a timer job: delete the list (not the items one by one, because that is slow), recreate the list and import the items from the DB. When we are talking about 3,000 - 5,000 elements, that is not that much, and I think it can be done in under 10 minutes.
2) Loop through the SharePoint list items and check field by field whether each was updated in the DB; if yes, update it.
I would prefer to delete the list and import the complete table, because we are not talking about that much data.
Another way, which is a good idea, is to use BCS or BDC. Then you would always have the data in place and synced with the DB. Look at:
https://msdn.microsoft.com/en-us/library/office/jj163782.aspx
https://msdn.microsoft.com/de-de/library/ee231515(v=vs.110).aspx
Unfortunately there is no "easy" and/or elegant way to delete all the items in a list, like the DELETE statement in SQL. You can either delete the entire list and recreate it, if the list can easily be created from a list definition, or, if your concern is performance, use SPWeb's ProcessBatchData method (available since SP 2007). You can use it to batch-process commands and avoid the performance penalty of issuing 6000 separate commands to the server. However, it still requires you to pass an ugly XML string that contains a list of all the items to be deleted or added.
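For illustration, here is roughly what that looks like with ProcessBatchData for a bulk delete; the SetList/SetVar batch format below is the commonly documented one, but treat this as a sketch to verify against your SharePoint version rather than a drop-in solution:

using System.Text;
using Microsoft.SharePoint;

static void DeleteAllItems(SPWeb web, SPList list)
{
    var sb = new StringBuilder();
    sb.Append("<?xml version=\"1.0\" encoding=\"UTF-8\"?><Batch>");

    // One <Method> element per item, identified by its integer ID.
    foreach (SPListItem item in list.Items)
    {
        sb.Append("<Method>");
        sb.Append("<SetList Scope=\"Request\">" + list.ID + "</SetList>");
        sb.Append("<SetVar Name=\"ID\">" + item.ID + "</SetVar>");
        sb.Append("<SetVar Name=\"Cmd\">Delete</SetVar>");
        sb.Append("</Method>");
    }
    sb.Append("</Batch>");

    // All deletes go to the server in one call instead of 3000+ separate ones.
    web.ProcessBatchData(sb.ToString());
}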
The ideal way is to enumerate all the rows from the database and check whether each row already exists in the SharePoint list, using a primary field value. If it already exists, simply update it[1]. Otherwise, add a new item.
[1] - Optionally, while updating, compare the list item field values with the database column values and update only if any field has changed; otherwise skip it.
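A minimal sketch of that compare-and-update loop with the server object model; the field names ("DBKey", "Title") and the DataTable of database rows are assumptions made up for illustration:

using System.Data;
using Microsoft.SharePoint;

static void SyncFromDatabase(SPList list, DataTable dbTable)
{
    foreach (DataRow row in dbTable.Rows)   // rows loaded from SQL Server elsewhere
    {
        // Find the list item whose primary field matches this database row.
        var query = new SPQuery
        {
            Query = "<Where><Eq><FieldRef Name='DBKey'/>" +
                    "<Value Type='Text'>" + row["Id"] + "</Value></Eq></Where>",
            RowLimit = 1
        };
        SPListItemCollection matches = list.GetItems(query);

        SPListItem item = matches.Count > 0 ? matches[0] : list.Items.Add();
        item["DBKey"] = row["Id"].ToString();
        item["Title"] = row["Title"].ToString();
        item.Update();   // per [1], compare old vs. new first and skip unchanged items
    }
}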
I am working on optimizing some code I have been assigned from a previous employee's code base. Beyond the fact that the code is pretty well "spaghettified" I did run into an issue where I'm not sure how to optimize properly.
The below snippet is not an exact replication, but should detail the question fairly well.
He is taking one DataTable from an Excel spreadsheet and placing its rows into a consistently formatted DataTable which later updates the database. This seems logical to me; however, the way he copies the data seems convoluted, and it is a royal pain to modify, maintain or add new formats.
Here is what I'm seeing:
private void VendorFormatOne()
{
//dtSubmit is declared with its column schema elsewhere
for (int i = 0; i < dtFromExcelFile.Rows.Count; i++)
{
dtSubmit.Rows.Add(i);
dtSubmit.Rows[i]["reference_no"] = dtFromExcelFile.Rows[i]["VENDOR REF"];
dtSubmit.Rows[i]["customer_name"] = dtFromExcelFile.Rows[i]["END USER ID"];
//etc etc etc
}
}
To me this is complete overkill for mapping columns to a different schema, but I can't think of a way to do it more gracefully. In the actual solution there are about 20 of these methods, all using different formats from dtFromExcelFile, and the column list is much longer. The column schema of dtSubmit remains the same across the board.
I am looking for a way to avoid having to manually map these columns every time the company needs to load a new file from a vendor. Is there a way to do this more efficiently? I'm sure I'm overlooking something here, but did not find any relevant answers on SO or elsewhere.
This might be overkill, but you could define an XML file that describes which Excel column maps to which database field, then input that along with each new Excel file. You'd want to whip up a class or two for parsing and consuming that file, and perhaps another class for validating the Excel file against the XML file.
Depending on the size of your organization, this may give you the added bonus of being able to offload that tedious mapping to someone less skilled. However, it is quite a bit of setup work and if this happens only sparingly, you might not get a significant return on investment for creating so much infrastructure.
Alternatively, if you're using MS SQL Server, this is basically what SSIS is built for, though in my experience, most programmers find SSIS quite tedious.
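As a sketch of the XML-mapping idea (not the answerer's actual implementation), here is one possible shape: a small mapping file plus a loader that applies it to the incoming DataTable. The file format, element names and method name are all invented for illustration:

// Example mapping file (e.g. VendorOne.mapping.xml):
//   <mappings>
//     <map from="VENDOR REF"  to="reference_no" />
//     <map from="END USER ID" to="customer_name" />
//   </mappings>
using System.Collections.Generic;
using System.Data;
using System.Linq;
using System.Xml.Linq;

static DataTable ApplyMapping(DataTable dtFromExcelFile, DataTable dtSubmitSchema, string mappingPath)
{
    // Excel column name -> dtSubmit column name
    Dictionary<string, string> map = XDocument.Load(mappingPath)
        .Descendants("map")
        .ToDictionary(m => (string)m.Attribute("from"),
                      m => (string)m.Attribute("to"));

    DataTable dtSubmit = dtSubmitSchema.Clone();   // same schema, no rows
    foreach (DataRow src in dtFromExcelFile.Rows)
    {
        DataRow dest = dtSubmit.NewRow();
        foreach (var pair in map)
            dest[pair.Value] = src[pair.Key];
        dtSubmit.Rows.Add(dest);
    }
    return dtSubmit;
}

A new vendor format then needs only a new mapping file rather than another VendorFormatN() method.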
I had originally intended this just as a comment but ran out of space. It's in reply to Micah's answer and your first comment therein.
The biggest problem here is that the amount of XML mapping would equal that of the manual mapping in code
Consider building a small tool that, given an Excel file with two columns, produces the XML mapping file. Now you can offload the mapping work to the vendor, or an intern, or indeed anyone who has a copy of the requirement doc for a particular vendor project.
Since the file is then loaded at runtime in your import app or whatever, you can change the mappings without having to redeploy the app.
Having used exactly this kind of system many, many times in the past, I can tell you this: you will be very glad you took the time to do it - especially the first time you get a call right after deployment along the lines of "oops, we need to add a new column to the data we've given you, and we realised that we've misspelled the 19th column by the way."
About the only thing that can perhaps go wrong is data type conversions, but you can build that into the mapping file (type from/to) and generalise your import routine to perform the conversions for you.
Just my 2c.
A while ago I ran into a similar problem where I had over 400 columns from 30-odd tables to be mapped to about 60 in the actual table in the database. I had the same dilemma of whether to go with a schema or write something custom.
There was so much duplication that I ended up writing a simple helper class with a couple of overridden methods that basically took in a column name from the import table and spat out the database column name. Also, for column names, I built a separate class of the format:
public static class ColumnName
{
public const string FirstName = "FirstName";
public const string LastName = "LastName";
...
}
Same thing goes for TableNames as well.
This made it much simpler to maintain table names and column names. It also handled duplicate columns across different tables really well, avoiding duplicate code.
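The answer doesn't show the helper itself, so the following is only a guess at its shape: a lookup that translates an import-table column name into a database column name via the constants class above.

using System.Collections.Generic;

// Hypothetical helper: translates vendor/import column names to database column names.
public static class ColumnMapper
{
    private static readonly Dictionary<string, string> VendorOneMap =
        new Dictionary<string, string>
        {
            { "FIRST NAME", ColumnName.FirstName },
            { "LAST NAME",  ColumnName.LastName },
            // ...
        };

    // Returns the database column name for an import column, or null if unmapped.
    public static string ToDbColumn(string importColumn)
    {
        string dbColumn;
        return VendorOneMap.TryGetValue(importColumn, out dbColumn) ? dbColumn : null;
    }
}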
I have a text file which always follows exactly the same structure:
**Unique-Key_1**
Value1
Value2
**Unique-Key_2**
Value1
**Unique_Key_3**
Value1
(Please note that the keys and values are not fixed. They might grow over time, but one thing is confirmed: the file will always follow this structure.)
My program needs to search for a key and then retrieve all the values under it.
I have some possible solutions in mind for this:
1) Should I use a dictionary type, and when my app loads, read all the keys and values and populate that dictionary?
2) Can I use file access/search methods at run time and, based upon a key, search for it and then retrieve its values?
3) Which is the optimum method, or are there any other ways to achieve the same?
Things to consider:
Does the application have time to load in and parse the file before data is searched? If so, consider parsing the file into a Dictionary. If not, parse the file as needed.
Will the file be very large? If so, parsing it into a Dictionary may take up too much memory. Consider an LRU Cache like Object cache for C#.
Are the keys in the file sorted? If so, a binary search on the file may be possible to speed up the file parses.
Does the data change frequently? If so, parsing the file would guarantee the data is up to date at the cost of slower data access.
Another alternative is to load the values into database tables or a key/value store. This allows the data to be updated piecemeal or completely, with reasonable access speed if needed, at the expense of maintaining and running the database.
Okay, if the file isn't that large, I would recommend the Dictionary approach because it will make accessing the data easier and more efficient at runtime. However, if the file is too large to hold in memory, you can use the algorithm provided in this answer to search it.
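For the Dictionary approach, a minimal parser for the format shown in the question, assuming keys are the lines wrapped in ** and every line until the next key belongs to the key above it:

using System.Collections.Generic;
using System.IO;

static Dictionary<string, List<string>> ParseKeyValueFile(string path)
{
    var result = new Dictionary<string, List<string>>();
    List<string> current = null;

    foreach (string rawLine in File.ReadLines(path))
    {
        string line = rawLine.Trim();
        if (line.Length == 0)
            continue;

        if (line.StartsWith("**") && line.EndsWith("**"))
        {
            string key = line.Trim('*');   // strip the surrounding ** markers
            current = new List<string>();
            result[key] = current;
        }
        else if (current != null)
        {
            current.Add(line);   // value belongs to the most recent key
        }
    }
    return result;
}

// Usage: List<string> values = ParseKeyValueFile("data.txt")["Unique-Key_1"];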