Speed issues fetching thousands of rows from a SQL CE database - C#

Quick disclaimer first: I am a complete noob when it comes to databases. I know how to form an SQL query and... well, that's pretty much it - I figured that'd be enough to start with. Performance optimizations would come later.
'Later' has arrived. I need your help.
I'm doing NLP processing on news articles. The articles are taken from the Internet and stored in a database. Users give me a period to analyze; I bring up all the articles in that period, analyze them, and show some graphs in return. I currently have a rather naive approach to this - I don't limit the number of articles returned. About 250 articles a day * 6 months is 45,000 records, a rather large number.
I'm experiencing mediocre fetch performance. I'm using C# + SQL CE (an easy DB to start with, with no set-up cost). I tried indexing the database to no avail. I suspect the problem comes from either
asking for so much data in one single query.
using SQLCE
Am I utterly crazy to try and fetch thousands of records all in one call? Was SQL CE a stupid choice to make? I basically need practical advice on this. Also, if you could point me to good alternatives to solve my problem, that's even more awesome.
Your help is of great value to me - thanks in advance!
EDIT - Below is the command I use to get my articles:
using (SqlCeCommand com1 = new SqlCeCommand(mySqlRequestString, myConnectionString))
{
    SqlCeResultSet res = com1.ExecuteResultSet(ResultSetOptions.Scrollable);
    if (res.HasRows)
    {
        // Use the GetOrdinal method so we don't have to worry about remembering
        // what order our SQL put the field names in.
        int ordGuid = res.GetOrdinal("Id");
        int ordUrl = res.GetOrdinal("Url");
        int ordPublicationDate = res.GetOrdinal("PublicationDate");
        int ordTitle = res.GetOrdinal("Title");
        int ordContent = res.GetOrdinal("Content");
        int ordSource = res.GetOrdinal("Source");
        int ordAuthor = res.GetOrdinal("Author");
        int ordComputedKeywords = res.GetOrdinal("ComputedKeywords");
        int ordComputedKeywordsDate = res.GetOrdinal("ComputedKeywordsDate");

        // Get all the articles
        List<Article> articles = new List<Article>();
        if (res.ReadFirst())
        {
            // Read the first record and get its data
            Constants.Sources src = (Constants.Sources)Enum.Parse(typeof(Constants.Sources), res.GetString(ordSource));
            string[] computedKeywords = res.IsDBNull(ordComputedKeywords) ? new string[] { } : res.GetString(ordComputedKeywords).Split(',').ToArray();
            DateTime computedKeywordsDate = res.IsDBNull(ordComputedKeywordsDate) ? new DateTime() : res.GetDateTime(ordComputedKeywordsDate);
            articles.Add(new Article(res.GetGuid(ordGuid), new Uri(res.GetString(ordUrl)), res.GetDateTime(ordPublicationDate), res.GetString(ordTitle), res.GetString(ordContent), src, res.GetString(ordAuthor), computedKeywords, computedKeywordsDate));
        }
        // Read the remaining records
        while (res.Read())
        {
            Constants.Sources src = (Constants.Sources)Enum.Parse(typeof(Constants.Sources), res.GetString(ordSource));
            string[] computedKeywords = res.IsDBNull(ordComputedKeywords) ? new string[] { } : res.GetString(ordComputedKeywords).Split(',').ToArray();
            DateTime computedKeywordsDate = res.IsDBNull(ordComputedKeywordsDate) ? new DateTime() : res.GetDateTime(ordComputedKeywordsDate);
            articles.Add(new Article(res.GetGuid(ordGuid), new Uri(res.GetString(ordUrl)), res.GetDateTime(ordPublicationDate), res.GetString(ordTitle), res.GetString(ordContent), src, res.GetString(ordAuthor), computedKeywords, computedKeywordsDate));
        }
        return articles.ToArray();
    }
}

You should only fetch one page of results at a time.
SQL CE is great for testing or very low-usage applications, but you should really consider SQL Server Express or full-blown SQL Server.
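As a minimal sketch of page-at-a-time fetching (this assumes SQL CE 4.0, which supports OFFSET/FETCH, and reuses the column names from the question; older versions would need a TOP-based workaround):

// Hypothetical sketch: fetch one page of articles instead of the whole period at once.
// Assumes SQL CE 4.0 and the table/column names shown in the question.
public Article[] GetArticlesPage(DateTime from, DateTime to, int pageIndex, int pageSize)
{
    const string sql = @"
        SELECT Id, Url, PublicationDate, Title, Content, Source, Author, ComputedKeywords, ComputedKeywordsDate
        FROM Articles
        WHERE PublicationDate >= @from AND PublicationDate < @to
        ORDER BY PublicationDate
        OFFSET @offset ROWS FETCH NEXT @pageSize ROWS ONLY";

    using (var con = new SqlCeConnection(connectionString))   // connectionString assumed to be the real connection string
    using (var cmd = new SqlCeCommand(sql, con))
    {
        cmd.Parameters.AddWithValue("@from", from);
        cmd.Parameters.AddWithValue("@to", to);
        cmd.Parameters.AddWithValue("@offset", pageIndex * pageSize);
        cmd.Parameters.AddWithValue("@pageSize", pageSize);

        con.Open();
        var articles = new List<Article>();
        using (var reader = cmd.ExecuteReader())   // a forward-only reader is cheaper than a scrollable result set
        {
            while (reader.Read())
            {
                // same per-row mapping as in the question
            }
        }
        return articles.ToArray();
    }
}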

Related

C# EF Query optimisation

I have the following function using Entity Framework that checks a db table for expired product data. The entire query seems extremely slow, and I am wondering if there is a better / more efficient way of writing it, as there is always a bulk amount of data to clean, sometimes up to 3k items.
It is purely the selection that can take 4-10 minutes, dependent on db size on the day.
DateTime format example: 2021-03-31T23:59:59.000
using (ProductContext db = new ProductContext())
{
    log.Info("Checking for expired data in the Product Data db");
    //rule = now - configured hours, default = 2 hours in past.
    var checkWindow = Convert.ToInt32(PRODUCTMAPPING_CONFIG.MinusExpiredWindowHours);
    var dtCheck = Convert.ToDateTime(DateTime.Now.AddHours(-checkWindow).ToString("s"));
    var rowData = db.ProductData.Where(le => Convert.ToDateTime(le.ProductEndDate.Value.ToString().Trim()) < dtCheck).ToList();
    rowData.ForEach(i => { log.Debug($"DB Row ID {i.Id} with Product ID Value: {i.ProductUid} has expired with Product End Date: {i.ProductEndDate}, marked for removal."); });
    log.Info($"Number of expired Products being removed: {rowData.Count()}");
    db.ProductData.RemoveRange(rowData);
    db.SaveChanges();
    log.Info(rowData.Count == 0
        ? "No expired data present."
        : $"Number of expired assets Successfully removed from database = {rowData.Count}");
}
Thanks in advance
EDIT:
Thanks for all the suggestions. I will be looking at the ORM comments made by Panagiotis Kanavos below in regard to direct queries for this type of operation, and I have amended the column datatype based on his comment as well. Finally, the .ToList comment by jdweng removed the lag immediately, so this at least gets me moving faster for now while I look at the suggestions by Panagiotis Kanavos, as I think that is probably the best way forward.
Faster code is now:
using (ProductContext db = new ProductContext())
{
    log.Info("Checking for expired data in the Product Data db");
    //rule = now - configured hours, default = 2 hours in past.
    var checkWindow = Convert.ToInt32(PRODUCTMAPPING_CONFIG.MinusExpiredWindowHours);
    var dtCheck = Convert.ToDateTime(DateTime.Now.AddHours(-checkWindow).ToString("s"));
    // Amended DB column from nvarchar to DateTime to allow direct comparison, based on comment by: Panagiotis Kanavos
    // Removed ToList(), which materialized the results immediately, based on comment by: jdweng
    var rowData = db.ProductData.Where(le => le.ProductEndDate < dtCheck);
    log.Info($"Number of expired Products being removed: {rowData.Count()}");
    // added print out on debug only.
    if (log.IsDebugEnabled)
        rowData.ToList().ForEach(i => { log.Debug($"DB Row ID {i.Id} with Product ID Value: {i.ProductUid} has expired with Product End Date: {i.ProductEndDate}, marked for removal."); });
    var rowCount = rowData.Count();
    db.ProductData.RemoveRange(rowData);
    db.SaveChanges();
    log.Info(rowCount == 0
        ? "No expired data present."
        : $"Number of expired assets Successfully removed from database = {rowCount}");
}
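For reference, the direct-query route suggested in the comments could look something like the minimal sketch below (assuming EF6 and a table named ProductData; adjust names to the real schema). It deletes in one SQL statement instead of loading every expired entity into the change tracker first:

// Minimal sketch, assumes EF6 and a table named ProductData.
using (var db = new ProductContext())
{
    var checkWindow = Convert.ToInt32(PRODUCTMAPPING_CONFIG.MinusExpiredWindowHours);
    var dtCheck = DateTime.Now.AddHours(-checkWindow);

    // Single DELETE executed on the server; returns the number of affected rows.
    int removed = db.Database.ExecuteSqlCommand(
        "DELETE FROM ProductData WHERE ProductEndDate < @p0", dtCheck);

    log.Info(removed == 0
        ? "No expired data present."
        : $"Number of expired assets removed from database = {removed}");
}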
So grateful for all the useful comments below, and for the time you all took to respond and help me learn from this.

AMO: get partitions where data is processed but indexes are not

I am writing a script that returns all unprocessed partitions within a measure group using the following command:
objMeasureGroup.Partitions.Cast<Partition>().Where(x => x.State != AnalysisState.Processed)
After doing some experiments, it looks like this property indicates whether the data is processed and says nothing about the indexes.
After searching for hours, I didn't find any method to list the partitions where data is processed but indexes are not.
Any suggestions?
Environment:
SQL Server 2014
SSAS multidimensional cube
Scripts are written within an SSIS package / Script Task
First, ProcessIndexes is an incremental operation. So if you run it twice the second time will be pretty quick because there is nothing to do. So I would recommend just running it on the cube and not worrying about whether it was previously run. However if you do need to analyze the current state then read on.
The best way (only way I know of) to distinguish whether ProcessIndexes has been run on a partition is to study the DISCOVER_PARTITION_STAT and DISCOVER_PARTITION_DIMENSION_STAT DMVs as seen below.
The DISCOVER_PARTITION_STAT DMV returns one row per aggregation with the rowcount. The first row of that DMV has a blank aggregation name and represents the rowcount of the lowest level data processed in that partition.
The DISCOVER_PARTITION_DIMENSION_STAT DMV can tell you about whether indexes are processed and which range of values by each dimension attribute are in this partition (by internal IDs, so not super easy to interpret). We assume at least one dimension attribute is set to be optimized so it will be indexed.
You will also need to add a reference to Microsoft.AnalysisServices.AdomdClient to simplify running these DMVs:
string sDatabaseName = "YourDatabaseName";
string sCubeName = "YourCubeName";
string sMeasureGroupName = "YourMeasureGroupName";

Microsoft.AnalysisServices.Server s = new Microsoft.AnalysisServices.Server();
s.Connect("Data Source=localhost");
Microsoft.AnalysisServices.Database db = s.Databases.GetByName(sDatabaseName);
Microsoft.AnalysisServices.Cube c = db.Cubes.GetByName(sCubeName);
Microsoft.AnalysisServices.MeasureGroup mg = c.MeasureGroups.GetByName(sMeasureGroupName);

Microsoft.AnalysisServices.AdomdClient.AdomdConnection conn = new Microsoft.AnalysisServices.AdomdClient.AdomdConnection(s.ConnectionString);
conn.Open();

foreach (Microsoft.AnalysisServices.Partition p in mg.Partitions)
{
    Console.Write(p.Name + " - " + p.State + " - ");

    var restrictions = new Microsoft.AnalysisServices.AdomdClient.AdomdRestrictionCollection();
    restrictions.Add("DATABASE_NAME", db.Name);
    restrictions.Add("CUBE_NAME", c.Name);
    restrictions.Add("MEASURE_GROUP_NAME", mg.Name);
    restrictions.Add("PARTITION_NAME", p.Name);

    var dsAggs = conn.GetSchemaDataSet("DISCOVER_PARTITION_STAT", restrictions);
    var dsIndexes = conn.GetSchemaDataSet("DISCOVER_PARTITION_DIMENSION_STAT", restrictions);

    if (dsAggs.Tables[0].Rows.Count == 0)
        Console.WriteLine("ProcessData not run yet");
    else if (dsAggs.Tables[0].Rows.Count > 1)
        Console.WriteLine("aggs processed");
    else if (p.AggregationDesign == null || p.AggregationDesign.Aggregations.Count == 0)
    {
        bool bIndexesBuilt = false;
        foreach (System.Data.DataRow row in dsIndexes.Tables[0].Rows)
        {
            if (Convert.ToBoolean(row["ATTRIBUTE_INDEXED"]))
            {
                bIndexesBuilt = true;
                break;
            }
        }
        if (bIndexesBuilt)
            Console.WriteLine("indexes have been processed. no aggs defined");
        else
            Console.WriteLine("no aggs defined. need to run ProcessIndexes on this partition to build indexes");
    }
    else
        Console.WriteLine("need to run ProcessIndexes on this partition to process aggs and indexes");
}
I am posting this answer as additional information to @GregGalloway's excellent answer.
After searching for a while, the only way I found to know if partitions are processed is using DISCOVER_PARTITION_STAT and DISCOVER_PARTITION_DIMENSION_STAT.
I found an article posted by Darren Gosbell describing the whole process:
SSAS: Are my Aggregations processed?
In the article above, the author provided two methods:
using XMLA
One way you can find out is with an XMLA Discover call to the DISCOVER_PARTITION_STAT rowset, but that returns the results in a big lump of XML, which is not as easy to read as a tabular result set.
example
<Discover xmlns="urn:schemas-microsoft-com:xml-analysis">
  <RequestType>DISCOVER_PARTITION_STAT</RequestType>
  <Restrictions>
    <RestrictionList>
      <DATABASE_NAME>Adventure Works DW</DATABASE_NAME>
      <CUBE_NAME>Adventure Works</CUBE_NAME>
      <MEASURE_GROUP_NAME>Internet Sales</MEASURE_GROUP_NAME>
      <PARTITION_NAME>Internet_Sales_2003</PARTITION_NAME>
    </RestrictionList>
  </Restrictions>
  <Properties>
    <PropertyList>
    </PropertyList>
  </Properties>
</Discover>
using DMV queries
If you have SSAS 2008, you can use the new DMV feature to query this same rowset and return a tabular result.
example
SELECT *
FROM SystemRestrictSchema($system.discover_partition_stat
,DATABASE_NAME = 'Adventure Works DW 2008'
,CUBE_NAME = 'Adventure Works'
,MEASURE_GROUP_NAME = 'Internet Sales'
,PARTITION_NAME = 'Internet_Sales_2003')
Similar posts:
How to find out using AMO if aggregation exists on partition?
Detect aggregation processing state with AMO?

Exception when trying to fetch more than 1,000 customers and invoices from the QBO API

I have more than 1,000 customers and invoices and I am trying to fetch all of them into a drop-down list.
Documentation on the QBO site suggests that we need to use pagination if I want to load all the customers in a grid, but what I want is to load all the customers and invoices in a drop-down list.
I am getting the following exception when I try to fetch more than 1,000 customers and invoices:
Validation Exception was thrown.
Details: QueryValidationError: value 100000 is too large. Max allowed value is 1000.
I am trying to fetch all the customers using the following code
public static List<Customer> GetAllQBOCustomers(ServiceContext context)
{
    return Helper.FindAll<Customer>(context, new Customer(), 1, 100000);
}
I wrote the below code and solved my issue.
1. First I get the count of all the customers
2. Then I get all the customers in chunks and the chunk size is 1000
3. Create a List for customers.
4. Define 3 integer type variables for counting.
5. After that use do-while loop
6. Add all the customers to the main customer list
string strQuery = "Select Count(*) From Customer";
string custCount = qboAccess.GetCutomerCount(qboInz.QboServiceContext, strQuery);
List<qboData.Customer> customers = new List<Customer>();
int maxSize = 0;
int position = 1;
int count = Convert.ToInt32(custCount);
do
{
    var custList = qboAccess.GetAllQBOEntityRecords(qboInz.QboServiceContext, new Customer(), position, 1000);
    customers.AddRange(custList);
    maxSize += custList.Count();
    position += 1000;
} while (count > maxSize);
The straightforward answer is to loop enough times to get the records you need:
public static List<Customer> GetAllQBOCustomers(ServiceContext context)
{
    var list = new List<Customer>();
    for (int i = 0; i <= 10000; i += 1000)
    {
        var results = Helper.FindAll<Customer>(context, new Customer(), i, 1000);
        list.AddRange(results);
    }
    return list;
}
Or if you want to try to do it in parallel (and the API allows concurrent connections):
public static List<Customer> GetAllQBOCustomers(ServiceContext context)
{
    var bag = new ConcurrentBag<Customer>();
    Parallel.ForEach(Enumerable.Range(0, 10), i =>
    {
        var results = Helper.FindAll<Customer>(context, new Customer(), i * 1000, 1000);
        // ConcurrentBag has no AddRange, so add the items one at a time
        foreach (var customer in results)
            bag.Add(customer);
    });
    return bag.ToList();
}
Since the series of calls is likely to be expensive, I suggest you cache the results.
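A minimal caching sketch (assuming System.Runtime.Caching; the cache key and the 10-minute lifetime are arbitrary choices here, not QBO requirements):

// Sketch: cache the expensive "fetch everything" call so repeated page loads don't hit the API.
using System.Runtime.Caching;

public static List<Customer> GetAllQBOCustomersCached(ServiceContext context)
{
    var cache = MemoryCache.Default;
    var customers = cache.Get("AllQBOCustomers") as List<Customer>;
    if (customers == null)
    {
        customers = GetAllQBOCustomers(context);   // the method shown above
        cache.Set("AllQBOCustomers", customers, DateTimeOffset.Now.AddMinutes(10));
    }
    return customers;
}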
You can't download those records all at once. That is what the error is telling you - very clearly. There's no magic way to avoid the server's rules.
However, I really think you should not download them all at once anyway. A drop-down list is not a good way to display that amount of data to users. Consider the user experience - would you want to scroll through a list of thousands of customers to try and find the one you want? Or would it be easier to start typing part of the name and have it pop up a short list of possible matches to choose from?
A more user-friendly way to implement this would be to use an auto-complete box instead of a drop-down list: after the user has typed a few characters, it can use AJAX to search the API for customers whose names or IDs contain those characters. Then you'll only need to return a small number of records each time, and the user will not be stuck scrolling for 10 minutes just to find a customer at the bottom of a list of 10,000 records.
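A rough MVC-style sketch of that idea (the action, the SearchCustomersByName helper, and the 3-character threshold are all hypothetical, not part of the QBO SDK):

// Rough sketch of the auto-complete endpoint. SearchCustomersByName is a hypothetical helper;
// in practice it would issue a filtered QBO query (e.g. on DisplayName) limited to a handful of rows.
public JsonResult CustomerAutoComplete(string term)
{
    if (string.IsNullOrWhiteSpace(term) || term.Length < 3)
        return Json(Enumerable.Empty<object>(), JsonRequestBehavior.AllowGet);

    var matches = SearchCustomersByName(term, maxResults: 10)
        .Select(c => new { c.Id, c.DisplayName });

    return Json(matches, JsonRequestBehavior.AllowGet);
}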

Speeding up a linq query with 40,000 rows

In my service, first I generate 40,000 possible combinations of home and host countries, like so (clientLocations contains 200 records, so 200 x 200 is 40,000):
foreach (var homeLocation in clientLocations)
{
    foreach (var hostLocation in clientLocations)
    {
        allLocationCombinations.Add(new AirShipmentRate
        {
            HomeCountryId = homeLocation.CountryId,
            HomeCountry = homeLocation.CountryName,
            HostCountryId = hostLocation.CountryId,
            HostCountry = hostLocation.CountryName,
            HomeLocationId = homeLocation.LocationId,
            HomeLocation = homeLocation.LocationName,
            HostLocationId = hostLocation.LocationId,
            HostLocation = hostLocation.LocationName,
        });
    }
}
Then, I run the following query to find existing rates for the locations above, but also include empty records for the missing rates, resulting in a complete recordset of 40,000 rows.
var allLocationRates = (from l in allLocationCombinations
                        join r in Db.PaymentRates_AirShipment
                            on new { home = l.HomeLocationId, host = l.HostLocationId }
                            equals new { home = r.HomeLocationId, host = (Guid?)r.HostLocationId }
                            into matches
                        from rate in matches.DefaultIfEmpty(new PaymentRates_AirShipment
                        {
                            Id = Guid.NewGuid()
                        })
                        select new AirShipmentRate
                        {
                            Id = rate.Id,
                            HomeCountry = l.HomeCountry,
                            HomeCountryId = l.HomeCountryId,
                            HomeLocation = l.HomeLocation,
                            HomeLocationId = l.HomeLocationId,
                            HostCountry = l.HostCountry,
                            HostCountryId = l.HostCountryId,
                            HostLocation = l.HostLocation,
                            HostLocationId = l.HostLocationId,
                            AssigneeAirShipmentPlusInsurance = rate.AssigneeAirShipmentPlusInsurance,
                            DependentAirShipmentPlusInsurance = rate.DependentAirShipmentPlusInsurance,
                            SmallContainerPlusInsurance = rate.SmallContainerPlusInsurance,
                            LargeContainerPlusInsurance = rate.LargeContainerPlusInsurance,
                            CurrencyId = rate.RateCurrencyId
                        });
I have tried using .AsEnumerable() and .AsNoTracking() and that has sped things up quite a bit. The following code shaves several seconds off of my query:
var allLocationRates = (from l in allLocationCombinations.AsEnumerable()
join r in Db.PaymentRates_AirShipment.AsNoTracking()
But, I am wondering: How can I speed this up even more?
Edit: I can't replicate the foreach functionality in LINQ.
allLocationCombinations = (from homeLocation in clientLocations
                           from hostLocation in clientLocations
                           select new AirShipmentRate
                           {
                               HomeCountryId = homeLocation.CountryId,
                               HomeCountry = homeLocation.CountryName,
                               HostCountryId = hostLocation.CountryId,
                               HostCountry = hostLocation.CountryName,
                               HomeLocationId = homeLocation.LocationId,
                               HomeLocation = homeLocation.LocationName,
                               HostLocationId = hostLocation.LocationId,
                               HostLocation = hostLocation.LocationName
                           });
I get an error on from hostLocation in clientLocations which says "cannot convert type IEnumerable to Generic.List."
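If that error is simply the query-syntax version returning an IEnumerable<AirShipmentRate> while allLocationCombinations is declared as a List, materializing the query should compile; a sketch under that assumption:

// Sketch: materialize the query so it can be assigned to a List<AirShipmentRate>.
allLocationCombinations = (from homeLocation in clientLocations
                           from hostLocation in clientLocations
                           select new AirShipmentRate
                           {
                               HomeCountryId = homeLocation.CountryId,
                               HomeCountry = homeLocation.CountryName,
                               HostCountryId = hostLocation.CountryId,
                               HostCountry = hostLocation.CountryName,
                               HomeLocationId = homeLocation.LocationId,
                               HomeLocation = homeLocation.LocationName,
                               HostLocationId = hostLocation.LocationId,
                               HostLocation = hostLocation.LocationName
                           }).ToList();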
The fastest way to query a database is to use the power of the database engine itself.
While Linq is a fantastic technology to use, it still generates a select statement out of the Linq query, and runs this query against the database.
Your best bet is to create a database View, or a stored procedure.
Views and stored procedures can easily be integrated into Linq.
Materialized views (indexed views in MS SQL) can further speed up execution, and adding missing indexes is by far the most effective way to speed up database queries.
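As a rough sketch of the view route (the view name vw_AirShipmentRates is hypothetical, and this assumes EF6's SqlQuery with column names matching the DTO's properties; mapping the view as a regular entity would work too):

// Hypothetical sketch: let the database do the cross join + left join inside a view,
// then project the rows straight into the DTO.
var allLocationRates = Db.Database
    .SqlQuery<AirShipmentRate>("SELECT * FROM vw_AirShipmentRates")
    .ToList();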
How can I speed this up even more?
Optimizing is a bitch.
Your code looks fine to me. Make sure to set indexes on your DB schema where appropriate. And as already mentioned: inspect the SQL your LINQ generates and run it directly against the database to get a better idea of the performance.
Well, but how to improve performance anyway?
You may want to have a glance at the following link:
10 tips to improve LINQ to SQL Performance
To me, probably the most important points listed (in the link above):
Retrieve only the number of records you need
Turn off the ObjectTrackingEnabled property of the data context if not necessary
Filter data down to what you need using DataLoadOptions.AssociateWith
Use compiled queries when needed (please be careful with that one; see the sketch below)
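For that last point, a minimal LINQ to SQL-style sketch of a compiled query (the linked tips target LINQ to SQL; EF's DbContext caches query compilation differently, so treat this purely as an illustration, and RatesDataContext is a hypothetical context):

// Illustration only: a LINQ to SQL compiled query. RatesDataContext is hypothetical.
using System.Data.Linq;

static readonly Func<RatesDataContext, Guid, IQueryable<PaymentRates_AirShipment>> ratesByHomeLocation =
    CompiledQuery.Compile((RatesDataContext db, Guid homeLocationId) =>
        db.PaymentRates_AirShipment.Where(r => r.HomeLocationId == homeLocationId));

// Usage: the expression tree is translated to SQL once and reused on every call.
// var rates = ratesByHomeLocation(dataContext, someLocationId).ToList();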

Store search results for later use - Accessing results view

I have a C# ASP.NET MVC project. I am doing a search and want to access the search results on the details page of one of the search results.
This is so that I can have < prev | next > links on the detail pages that will link to the next property in the last search results.
My approach so far is to put the search results object into a session variable, but I can't figure out the code to actually access it. When I do a watch on Session["SearchResults"] below, I can see the records in the Result View, which seems to hold an array.
I know someone is going to tell me I'm thinking about this all wrong, and I can't wait to be enlightened.
Someone suggested I should just store the last search results on the repository as a public property; would that be a better option? Or can someone recommend an altogether better way of doing what I need to do?
This is my controller
public ActionResult Search(int? page, Search search)
{
    search.regusApi = Convert.ToBoolean(ConfigurationManager.AppSettings["regusApiLiveInventory"]);
    Session["SearchResults"] = MeetingRoomRepositoryWG.Search(search).AsPagination(page ?? 1, 7);
    return View(new SearchResultsWG { SearchCriteria = search, Locations = MeetingRoomRepositoryWG.Search(search).AsPagination(page ?? 1, 7) });
}

public ActionResult NiceDetails(String suburb, String locationName, int id)
{
    // Here I want to access the session variable
    return View(MeetingRoomRepositoryWG.RoomDetails(id).First());
}
Here is the code from the repository:
public static List<Location> Search(Search search)
{
    String LongLatString = search.LongLat;
    LongLatString = LongLatString.Substring(1, LongLatString.Length - 2);
    var LonLatSplit = LongLatString.Split(',');
    var latitude = Convert.ToDecimal(LonLatSplit[0]);
    var longitude = Convert.ToDecimal(LonLatSplit[1]);
    using (var context = new MyContext())
    {
        var query = context.Locations.Include("Location_LonLats").ToList();
        // Note: OrderBy returns a new ordered sequence; assign the result or the ordering is discarded.
        query = query.OrderBy(x => (Convert.ToDecimal(x.Location_LonLats.Lat) - latitude) * (Convert.ToDecimal(x.Location_LonLats.Lat) - latitude)
                                 + (Convert.ToDecimal(x.Location_LonLats.Lon) - longitude) * (Convert.ToDecimal(x.Location_LonLats.Lon) - longitude)).ToList();
        return query;
    }
}
Not sure how large the data you search against is, but it's better not to store search results at all. It will scale very poorly and become a resource hog quite easily. Why store e.g. 500 pages of search results if the user only ends up looking at 1 or 3?
How long are you going to hold these potentially large result sets in session storage? For how many users?
Just do a page-based search, essentially redoing the search for each "next" click the client does. A good index or something like Lucene.Net can help if your searches are too slow.
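A rough sketch of that idea, reusing the repository search from the question (the searchIndex parameter, the current.Id property, and the overall shape are assumptions, not the asker's actual API):

// Rough sketch: re-run the search in the details action and pick the neighbouring result,
// instead of holding the whole result set in Session.
public ActionResult NiceDetails(Search search, int searchIndex)
{
    var results = MeetingRoomRepositoryWG.Search(search);      // same repository search as before
    var current = results.ElementAtOrDefault(searchIndex);     // the item the user is viewing
    if (current == null)
        return HttpNotFound();

    ViewBag.PrevIndex = searchIndex > 0 ? (int?)(searchIndex - 1) : null;
    ViewBag.NextIndex = searchIndex < results.Count - 1 ? (int?)(searchIndex + 1) : null;

    return View(MeetingRoomRepositoryWG.RoomDetails(current.Id).First());
}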
