Getting more than 1000 objects using the SharpGs library - C#

I have many thousands of files in Google Cloud Storage and I'm writing a .NET application to process the list of files. I'm using the SharpGs .NET library (https://github.com/acropolium/SharpGs), which seems simple and easy enough to use. However, I only seem to be getting back 1000 objects.
I am using the following code:
var bucket = GoogleStorageClient.GetBucket(rootBucketName);
var objects = bucket.Objects;
There doesn't seem to be any obvious way to obtain the next 1000 objects so I'm a bit stuck at the moment.
Does anyone have any ideas or suggestions?

I am not familiar with this particular library, but 1000 objects is the current limit for a single list call. Beyond that, you'd need to use paging to access the rest of the objects. If this library has support for paging, I'd recommend using that.

If you look at the Bucket class:
https://github.com/acropolium/SharpGs/blob/master/SharpGs/Internal/Bucket.cs#L33
you'll see that it returns a Query object. The Query object allows you to pass in a Marker parameter:
https://github.com/acropolium/SharpGs/blob/master/SharpGs/Internal/Query.cs#L36
You will have to take the initial Query object, extract its marker, then pass it to a new Query to get the next page of results.
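In other words, loop until you get a short page, carrying the last key forward as the marker. A rough sketch of that loop (FetchPage is a hypothetical helper you would write on top of the Query/Marker support linked above; it is not an existing SharpGs method):
// Pattern sketch only. FetchPage is a hypothetical helper built on the
// Query/Marker support linked above, not an existing SharpGs call.
var allObjects = new List<object>();   // or whatever object type SharpGs returns
string marker = null;
do
{
    // Inside FetchPage: build a Query for the bucket with its Marker set to
    // 'marker' (null for the first page) and return the page of objects plus
    // the key of the last object in that page.
    var page = FetchPage(rootBucketName, marker);
    allObjects.AddRange(page.Objects);

    // A full page (1000 objects) means there may be more, so continue from
    // the last key; a short page means we are done.
    marker = page.Objects.Count == 1000 ? page.LastKey : null;
} while (marker != null);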

Related

Search Box with predictive result

I want to create a search box that will show results relating to the typed text. I am using .NET MVC and I have been stuck on this for a while. I want to use the AlphaVantage API search endpoint to create this.
It would look like this. I just don't know what component to use or how to implement it.
Since we don't know the amount of data or the possible stack/budget of your project, autocomplete/autosuggestion could be implemented in different ways:
In memory (you break your word into all possible prefixes and map them to your entity through a dictionary; this can be optimized, like so - https://github.com/omerfarukz/autocomplete). The limit is around 10 million entries, with a lot of memory consumption. It also supports some storage mechanics, but I don't think it is more powerful than fully fledged Lucene.
In a Lucene index (Lucene.Net (4.8) AutoComplete / AutoSuggestion). Limited to around 2 billion entries, very optimized memory usage, stored on the hard drive or anywhere else. Harder to work with, because it exposes low-level micro-optimizations on indexes and the overall tokenization/indexing pipeline.
In an Elasticsearch cluster (https://www.elastic.co/guide/en/elasticsearch/client/net-api/current/suggest-usage.html). Effectively unlimited, uses Lucene indexes as sharding units. Same as Lucene, but every cloud provider offers it for a pretty penny.
In SQL using a full-text index (SQL Server Full Text Catalog and Autocomplete). Limited by the database provider (SQLite/MSSQL/Oracle/etc.), cheap, easy to use, usually consumes CPU like there is no tomorrow, but hey, it is relational, so you can join any data to the results.
As to how to use it: you basically send a request to the chosen framework instance, retrieve the first N results, and serve them in a REST GET response, as sketched below.
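For example, a minimal sketch of the in-memory option behind a GET action (ASP.NET MVC 5 style; SymbolEntry, the seed data, and the route are hypothetical placeholders):
using System;
using System.Collections.Generic;
using System.Linq;
using System.Web.Mvc;   // ASP.NET MVC 5; adapt the action for ASP.NET Core

// SymbolEntry and the seed data are hypothetical placeholders.
public class SymbolEntry
{
    public string Symbol { get; set; }
    public string Name { get; set; }
}

public class SearchController : Controller
{
    // Prefix -> matching entries; in a real app, build this once at startup
    // from your full symbol list.
    private static readonly Dictionary<string, List<SymbolEntry>> Index =
        BuildIndex(new[]
        {
            new SymbolEntry { Symbol = "MSFT", Name = "Microsoft Corporation" },
            new SymbolEntry { Symbol = "AAPL", Name = "Apple Inc." }
        });

    private static Dictionary<string, List<SymbolEntry>> BuildIndex(
        IEnumerable<SymbolEntry> entries)
    {
        var index = new Dictionary<string, List<SymbolEntry>>(
            StringComparer.OrdinalIgnoreCase);

        foreach (var entry in entries)
        {
            // Map every prefix of the symbol to the entry.
            for (var i = 1; i <= entry.Symbol.Length; i++)
            {
                var prefix = entry.Symbol.Substring(0, i);
                if (!index.TryGetValue(prefix, out var list))
                    index[prefix] = list = new List<SymbolEntry>();
                list.Add(entry);
            }
        }
        return index;
    }

    // GET /Search/Suggest?term=MS&limit=10
    public ActionResult Suggest(string term, int limit = 10)
    {
        var hits = Index.TryGetValue(term ?? string.Empty, out var matches)
            ? matches.Take(limit)
            : Enumerable.Empty<SymbolEntry>();

        return Json(hits, JsonRequestBehavior.AllowGet);
    }
}
A real index would also cover prefixes of the company name, not just the ticker, and would be populated from your actual data source rather than hard-coded entries.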
You'll have to make a request with HttpClient to the API that returns your data. You'll also need to provide all required authorization information (whether headers or keys). The call should be async, or possibly run on a background worker, so that it doesn't block your thread. The requests need to happen whenever the text in your search box changes.
You can probably find details on how to do the request here.
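A minimal sketch of such a call against the Alpha Vantage SYMBOL_SEARCH endpoint (the query parameters follow Alpha Vantage's public documentation; error handling, debouncing, and JSON deserialization are left out, and the exact response shape is something to verify against the docs):
using System;
using System.Net.Http;
using System.Threading.Tasks;

public class SymbolSearchClient
{
    private static readonly HttpClient Http = new HttpClient();

    // Call this from the search box's change event (ideally debounced)
    // so the UI/request thread is never blocked.
    public async Task<string> SearchSymbolsAsync(string keywords, string apiKey)
    {
        var url = "https://www.alphavantage.co/query" +
                  "?function=SYMBOL_SEARCH" +
                  "&keywords=" + Uri.EscapeDataString(keywords) +
                  "&apikey=" + apiKey;

        var response = await Http.GetAsync(url);
        response.EnsureSuccessStatusCode();

        // Raw JSON; deserialize into whatever shape your view needs.
        return await response.Content.ReadAsStringAsync();
    }
}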

DynamoDB .NET SDK Get only Specific Properties

I have an application that I'm trying to store data in Amazon DynamoDB and I'm trying to figure out the best way to structure the tables. A quick description of the app:
It needs to be able to load a large number of elements from the DB based on a search of a small number of properties and display those limited properties to the user. Then the user can browse and select a few elements that they want to look closer at and it needs to show the rest of the properties for those items.
My thought is that, basically, for speed and memory purposes it needs to load a 'summarized' version of the objects for the initial step, then load the full object when the user asks to look into something fully. I can do this easily (and have done so) in my C# code. However, here is what I'm wondering:
If I have a C# object and I use the DynamoDB object persistence SDK to map, say, 5 properties to a DynamoDB table that has, say, 30 properties, will the SDK request only the properties that are on the object? Or will it request all of them and then throw out the 25 that aren't related to the object?
If it only fetches the needed properties, then I think I can store everything in one table, map both the summarized objects and the full objects to the same table, and just pull the properties needed. If it fetches everything, then I'm worried it will use a lot of throughput I don't need 75% of, plus slow down the transfer due to the extra data. If that's the case, I think it may be worth creating a GSI that just has the summarized properties...
Anyway sorry for the long description, any input from those more familiar with DynamoDB than I am would be appreciated :)
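Whether the object persistence model fetches only the mapped attributes is exactly what is being asked here, so no claim on that; purely as a sketch of the two-classes-one-table mapping described above (table and property names are hypothetical):
using Amazon.DynamoDBv2.DataModel;

// Both classes map to the same hypothetical "Elements" table: the summary
// class declares only the handful of properties the list view needs, the
// detail class declares everything.
[DynamoDBTable("Elements")]
public class ElementSummary
{
    [DynamoDBHashKey]
    public string Id { get; set; }

    public string Name { get; set; }
    public string Category { get; set; }
}

[DynamoDBTable("Elements")]
public class ElementDetail
{
    [DynamoDBHashKey]
    public string Id { get; set; }

    public string Name { get; set; }
    public string Category { get; set; }
    public string Description { get; set; }
    public string Manufacturer { get; set; }
    // ... the remaining properties
}

// Usage with DynamoDBContext (object persistence model), roughly:
//   var summaries = await context.ScanAsync<ElementSummary>(conditions).GetRemainingAsync();
//   var detail    = await context.LoadAsync<ElementDetail>(selectedId);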

Mongo DB - fastest way to retrieve 5 Million records from a collection

I am using MongoDB in our project and I'm currently learning how things work.
I have created a collection with 5 million records. When I run the query db.ProductDetails.find() on the console it takes too much time to display all the data.
Also, when I use the following code in C#
var Products = db.GetCollection("ProductDetails").FindAll().Documents.ToList();
the system throws an OutOfMemoryException after some time.
Is there any other faster or more optimized way to achieve this ?
Never try to fetch all entries at the same time. Use filters or get a few rows at a time.
Read this question: MongoDB - paging
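A rough sketch of both ideas with the official 1.x C# driver (adjust the method names if you are on a different driver version; db is the MongoDatabase from the question, and Process/ProcessBatch are placeholders for your own per-document work):
var collection = db.GetCollection("ProductDetails");

// Option 1: stream the cursor. The driver pulls documents in batches, so
// memory stays bounded instead of holding all 5 million in one List.
foreach (var doc in collection.FindAll())
{
    Process(doc);
}

// Option 2: explicit pages via skip/limit (as in the linked paging question).
const int pageSize = 1000;
for (var page = 0; ; page++)
{
    var batch = collection.FindAll()
                          .SetSkip(page * pageSize)
                          .SetLimit(pageSize)
                          .ToList();
    if (batch.Count == 0)
        break;

    ProcessBatch(batch);
}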
Try to get only the subset that is needed. If you try to fetch all the objects, you will certainly need as much RAM as the size of your database collection!
Fetch only the objects that will actually be used in the application.

problem with huge data

I have a WCF service which reads data from an XML file. The data in the XML changes every minute.
This XML is very big; it has about 16k records. Parsing it takes about 7 seconds, which is definitely too long.
Now it works in that way:
ASP.NET call WCF
WCF parse xml
ASP.NET is waiting for WCF callback
WCF gives back data to ASP.NET
Of course there is caching for 1 minute, but after that the WCF service must load the data again.
Is there any way to refresh the data without stalling the site? Something like... I don't know, double buffering? Something that serves the old data while there is no new data yet? Maybe you know a better solution?
best regards
EDIT:
the statement which takes the longest time:
var doc = XDocument.Load(XmlReader.Create(uri)); // takes 7 sec.
Parsing takes 70 ms, which is fine, so that is not the problem. Is there a better way to avoid blocking the website? :)
EDIT2:
OK, I have found a better solution. I simply download the XML to disk and read the data from there. Then another process downloads the new version of the XML and replaces the old one. Thanks for the engagement.
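For reference, an in-process version of that double-buffering idea might look like the sketch below: a background timer loads and parses the feed off the request path and atomically swaps the cached document, so readers always get the previous complete version (the URI and refresh interval are placeholders):
using System;
using System.Threading;
using System.Xml.Linq;

public static class FeedCache
{
    private static XDocument _current;

    // Refresh in the background once a minute, starting immediately.
    private static readonly Timer RefreshTimer =
        new Timer(_ => Refresh(), null, TimeSpan.Zero, TimeSpan.FromMinutes(1));

    // May be null until the first successful load completes.
    public static XDocument Current
    {
        get { return _current; }
    }

    private static void Refresh()
    {
        try
        {
            // The slow load happens here, not in the request path.
            var fresh = XDocument.Load("http://example.com/feed.xml");
            Interlocked.Exchange(ref _current, fresh);
        }
        catch
        {
            // Keep serving the old document if a refresh fails.
        }
    }
}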
You seem to have an XML-to-object tool that creates an object model from the XML.
What usually takes most of the time is not the parsing but creating all these objects to represent the data.
So you might want to extract only the part of the XML data you need, which will be faster, rather than systematically creating a big object tree when you only use part of it.
You could use XPath to extract the pieces you need from the XML file for example.
In the past I have used a nice XML parsing tool that focuses on performance. It is called vtd-xml (see http://vtd-xml.sourceforge.net/).
It supports XPath and other XML technologies.
There is a C# version. I have used the Java version but I am sure that the C# version has the same qualities.
LINQ to XML is also a nice tool and it might do the trick for you.
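As a small sketch of that approach with LINQ to XML and XPath (the element and attribute names are placeholders for whatever your feed actually contains; uri is the same variable as in the question's snippet):
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;   // provides the XPathSelectElements extension

// Pull only the fields you actually need instead of materializing the
// whole 16k-record tree into objects.
var doc = XDocument.Load(uri);

var rows = doc.XPathSelectElements("//record")
              .Select(r => new
              {
                  Id = (string)r.Attribute("id"),
                  Name = (string)r.Element("name")
              })
              .ToList();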
It all depends on your database design. If you designed the database in a way that lets you recognize which data has already been queried, then for each new query you can return only the records that changed between the last query time and the current time.
Maybe you could add a rowstamp to each record and update it on every add/edit/delete action; then you can easily implement the logic described at the beginning of this answer.
Also, if you don't want the first call to take long (when the initial data has to be collected), think about storing that data locally.
Use something other than XML (like JSON). If you have big XML overhead, try replacing long element names with something shorter (like single-character element names).
Take a look at this:
What is the easiest way to add compression to WCF in Silverlight?
Create JSON from C# using JSON Library
If you take a few stackshots, it might tell you that the biggest "bottleneck" is not parsing, but data structure allocation, initialization, and subsequent garbage collection. If so, a way around it is to have a pool of pre-allocated row objects and re-use them.
Also, if each item is appended to the list, you might find it spending a large fraction of time doing the append. It might be faster to simply push each new row on the front, and then reverse the whole list at the end.
(But don't implement these things unless you prove they are problems by stackshots. Until then, they are just guesses.)
It's been my experience that the real cost of XML is not the parsing, but the data structure manipulation.

Joining results from separate API's together

I have 2 separate systems - a document management system and a sharepoint search server.
Both systems have an API that I can use to search the data within them. The same data may exist in both systems, so we must search both.
What is the most efficient way (speed is very important) to search both APIs at the same time and merge the results together?
Is the following idea bad/good/slow/fast:
user enters search terms
the API for each system is called on its own thread
the results from each API are placed in a common IEnumerable of the same type
when both threads have finished, LINQ is used to join the two IEnumerable result objects together
results are passed to view
The application is ASP.NET MVC C#.
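A rough sketch of that flow, using Task.WhenAll instead of raw threads (SearchResult, ISearchClient, and the de-duplication key are hypothetical placeholders for your two adapters):
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public class SearchResult
{
    public string Title { get; set; }
    public string Url { get; set; }
    public string Source { get; set; }
}

public interface ISearchClient
{
    Task<IEnumerable<SearchResult>> SearchAsync(string terms);
}

public class FederatedSearchService
{
    private readonly ISearchClient _documentManagementClient;
    private readonly ISearchClient _sharePointClient;

    public FederatedSearchService(ISearchClient dms, ISearchClient sharePoint)
    {
        _documentManagementClient = dms;
        _sharePointClient = sharePoint;
    }

    public async Task<IEnumerable<SearchResult>> SearchAllAsync(string terms)
    {
        // Fire both searches in parallel, one per backend.
        var dmsTask = _documentManagementClient.SearchAsync(terms);
        var spTask = _sharePointClient.SearchAsync(terms);

        await Task.WhenAll(dmsTask, spTask);

        // Merge and de-duplicate, since the same document can exist in both systems.
        return dmsTask.Result
                      .Concat(spTask.Result)
                      .GroupBy(r => r.Url)
                      .Select(g => g.First())
                      .ToList();
    }
}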
Your solution looks alright - you're using the adapter pattern to convert the two different result feeds into your required format, and the overall design is a facade pattern. From a design point of view, your solution is valid.
If you wanted to make things even better, you could display the results as soon as they arrive and display a notification that results are still loading until all APIs have returned a value. If your document management system was significantly faster or slower than sharepoint, it would give the user information faster that way.
I don't see anything wrong with the way you are doing it. Your algorithm could take forever to produce perfect results, but you need to strike a balance. Any optimisation would have to be done in the search algorithm (or rather document indexing). You would still have to compromise on how many hits are good enough for the user by limiting the duration of your thread execution.
