Tools or database tips to show big data information? - c#

I'm having a problem with an application built in .NET Core (C#) and SQL Server 2017, with AngularJS 1.x on the frontend.
The problem is the following: we have some very big tables, with millions of records. Even a simple SELECT COUNT on one of these tables takes too long. We execute the queries directly from the code without going through any ORM library, but even without an ORM the queries take too long.
I was asking myself whether there is a better way to query these giant tables (external tools, another type of database, etc.), since in many cases we need to show reports and statistics graphs.

One possible strategy is to use table partitioning, with a partition function that matches your business needs. With this you can split the data in a table across several partitions/filegroups, so that queries which filter on the partitioning column only scan the relevant partitions.
See the SQL Server documentation on partitioned tables and indexes for detailed info.
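For illustration, a minimal sketch of a date-based partition function and scheme, run as plain ADO.NET batches; the object names, boundary dates, filegroup, and connection string are all hypothetical, and in practice this one-time DDL is usually run by a DBA or a migration script rather than by the application:

using System.Data.SqlClient;

// Hypothetical connection string; this is one-time setup DDL.
var connectionString = "Server=.;Database=MyBigDb;Integrated Security=true";

var ddlBatches = new[]
{
    @"CREATE PARTITION FUNCTION pf_OrdersByYear (datetime2)
          AS RANGE RIGHT FOR VALUES ('2020-01-01', '2021-01-01', '2022-01-01');",

    @"CREATE PARTITION SCHEME ps_OrdersByYear
          AS PARTITION pf_OrdersByYear ALL TO ([PRIMARY]);"
    // The table's clustered index is then created (or rebuilt) on
    // ps_OrdersByYear(OrderDate) so that queries filtering on OrderDate
    // only touch the relevant partitions.
};

using (var conn = new SqlConnection(connectionString))
{
    conn.Open();
    foreach (var batch in ddlBatches)
    {
        using (var cmd = new SqlCommand(batch, conn))
        {
            cmd.ExecuteNonQuery();
        }
    }
}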

OLTP databases like SQL Server are not designed to handle OLAP (aggregate) queries in real time over large datasets. Typical workarounds are:
Limit the number of aggregated rows with extra WHERE conditions, and add indexes for these columns. This is usually possible with historical data like orders, event logs, etc.: show reports only for the last month or year (see the sketch after this list).
Use materialized views (indexed views in SQL Server) for reports that don't need much detail.
Configure a read-only replica of SQL Server, possibly add columnstore indexes, and use it for OLAP queries.
Replicate your SQL Server data to a specialized (possibly distributed) analytical database that can handle OLAP queries in real time (Amazon Redshift, Vertica, MongoDB, Elasticsearch, Yandex ClickHouse, etc.).
If reports can be configured by end users, ensure that your ROLAP-like engine produces efficient SQL GROUP BY queries.
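As a small illustration of the first workaround, here is a parameterized aggregate query restricted to a recent date range, executed with plain ADO.NET as the question describes; the connection string, table, and column names are hypothetical, and OrderDate is assumed to be indexed:

using System;
using System.Data.SqlClient;

// Hypothetical connection string, table, and column names.
var connectionString = "Server=.;Database=MyBigDb;Integrated Security=true";

// Count only the last 30 days instead of scanning the whole multi-million-row table;
// an index on OrderDate makes this cheap.
const string sql = @"
    SELECT COUNT(*)
    FROM dbo.Orders
    WHERE OrderDate >= @from";

using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand(sql, conn))
{
    cmd.Parameters.AddWithValue("@from", DateTime.UtcNow.AddDays(-30));
    conn.Open();
    int recentOrderCount = (int)cmd.ExecuteScalar();
    Console.WriteLine(recentOrderCount);
}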

Related

When to use Temporary SQL Tables vs DataTables

I don't know whether it is better to use temporary tables in SQL Server or DataTables in C# for a report. Here is the scope of the report: it will be copied into a workbook with about 10 worksheets, each containing about 1000 rows and about 30 columns, so it's a lot of data. There is some guidance out there, but I could not find anything specific about how much data is too much for a DataTable. According to https://msdn.microsoft.com/en-us/library/system.data.datatable.aspx, a DataTable can hold about 16M rows, but my data set seems unwieldy considering the number of columns I have. Plus, I will either have to make multiple SQL queries to collect the data for my report or write a stored procedure in SQL to collect that data. How do I resolve this quandary?
My rule of thumb is that if it can be processed on the database server, it probably should be. Keep in mind, no matter how efficient your C# code is, SQL Server will most likely do it faster and more efficiently; after all, it was designed for data manipulation.
There is no shame in using #temp tables. They maintain statistics, and can be indexed and manipulated. In one recent example, a developer created an admittedly elegant query using CTEs; its performance was 12-14 seconds versus 1 second for mine using #temp tables.
Now, one carefully structured stored procedure could produce and return the 10 result sets for your worksheets. If you are using a product like SpreadSheetLight (there are many options available), it becomes a small matter of passing the results and creating the tabs (no cell-level looping... unless you want or need to).
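For illustration, a minimal sketch of reading several result sets from one stored procedure into a DataSet with plain ADO.NET; the procedure name usp_BuildWorkbookData and the connection string are hypothetical, and the procedure is assumed to return one result set per worksheet:

using System.Data;
using System.Data.SqlClient;

// Hypothetical connection string and procedure name.
var connectionString = "Server=.;Database=ReportDb;Integrated Security=true";

var workbookData = new DataSet();

using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand("dbo.usp_BuildWorkbookData", conn))
{
    cmd.CommandType = CommandType.StoredProcedure;
    cmd.CommandTimeout = 120; // large report, give it some room

    using (var adapter = new SqlDataAdapter(cmd))
    {
        // Each result set returned by the procedure becomes
        // workbookData.Tables[0], Tables[1], ... in order.
        adapter.Fill(workbookData);
    }
}

// Hand each DataTable to the spreadsheet library as one worksheet tab.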
I would also like to add that you can dramatically reduce the number of touch points and better enforce the business logic by making SQL Server do the heavy lifting. For example, a client introduced a 6W risk rating, which was essentially a 6.5. HUNDREDS of legacy reports had to be updated, while I only had to add the 6W to my mapping table.
There's a lot of missing context here - how is this report going to be accessed and run? Is this going to run as a scripted event every day?
Have you considered SSRS?
In my opinion it's best to abstract away your business logic by creating Views or Stored Procedures in the database. Stored Procedures would probably be the way to go but it really depends on your specific environment. Then you can point whatever tools you want to use at the database object. This has several advantages:
if you end up having different versions or different formats of the report, and your logic ever changes, you can update the logic in one place rather than many.
your code is simpler and cleaner, typically:
select v.col1, v.col2, v.col3
from MY_VIEW v
where v.date between @startdate and @enddate
I assume your 10 spreadsheets are going to be something like
Summary Page | Department 1 | Department 2 | ...
So you could make a generalized View or SP, create a master spreadsheet linked to the db object that pulls all the relevant data from SQL, and use Pivot Tables or filters or whatever else you want, and use that to generate your copies that get sent out.
But before going to all that trouble, I would make sure that SSRS is not an option, because if you can use that, it has a lot of baked in functionality that would make your life easier (export to Excel, automatic date parameters, scheduled execution, email subscriptions, etc).

Should I do many sql queries or one large query and do the processing on the server?

The situation is as follows: I have a large-ish dataset with a couple thousand entries that I populate from an Excel file. For each entry I have to match it to another field on a certain table in the database (this table contains only a couple hundred entries).
What's the best way to go about doing it? I can make a query for each entry in the dataset, but this seems fairly wasteful; on the other hand, I can just select the fields I need from all the entries in the table, put them in a Dictionary or some other data structure, and match them on IIS, thus effectively making only one query but doing all the processing on the web server.
Dataset : ~1000 to ~3000 entries
Table in the DB: ~300 entries
Using ASP.NET on IIS, but the database is an MS Access file.
Is either of these better than the other? Is there a third, better way I haven't thought of?
Databases are designed to do many things that are useful for data processing. A lot of benefits for transactional processing are contained in the acronym ACID -- atomicity, consistency, isolation, durability. In other words, databases behave the way you would expect when you store something in them. The data is there, relationships are enforced, it will be there tomorrow.
The features that you want are on the querying side. Databases in general (although perhaps not MS Access in particular) allow a relatively standard interface to powerful processing. Database engines know how to optimize queries. Database engines know how to manage memory. Database engines know how to manage hierarchical memory, with disk, RAM, and cache. Databases know how to take advantage of indexes, row partitions, and other optimizations (you can get this functionality by using a free version of a more advanced database, such as SQL Server, Oracle, Postgres, or even MySQL).
You are talking about thousands of rows of data. Databases can easily work with millions of rows. You are talking about two tables. Databases can easily manage many more tables, and queries that use a dozen of them.
So, no, you should not load your data into in-memory structures on the application side. You should do the processing in the database and bring back the results you want. Then, you can format the results on the application side, to take advantage of what applications do best: interface to the user.
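As an illustration of keeping the matching in the database, a minimal sketch assuming the Excel rows have already been loaded into a hypothetical staging table ImportedEntries and matched against a hypothetical LookupTable; the provider name is the usual one for .accdb files but may differ on your machine:

using System;
using System.Data.OleDb;

// Hypothetical database path, staging table, and column names.
var connStr = @"Provider=Microsoft.ACE.OLEDB.12.0;Data Source=C:\data\mydb.accdb;";

// Let Access do the matching in a single JOIN instead of one query per entry.
const string sql = @"
    SELECT i.EntryId, t.MatchField, t.SomeOtherColumn
    FROM ImportedEntries AS i
    INNER JOIN LookupTable AS t ON i.MatchField = t.MatchField";

using (var conn = new OleDbConnection(connStr))
using (var cmd = new OleDbCommand(sql, conn))
{
    conn.Open();
    using (var reader = cmd.ExecuteReader())
    {
        while (reader.Read())
        {
            // Only the matched rows come back to the web server.
            Console.WriteLine(reader["EntryId"]);
        }
    }
}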

Scrollable ODBC cursor in C#

I'm a C++ programmer and I'm not familiar with the .NET database model. I usually use IDataReader (OdbcDataReader, OleDbDataReader or SqlDataReader) to read data from a database. Sometimes, when I need a bulk of data, I use a DataAdapter, but what should I do to achieve the functionality of the scrollable cursors that exist in native libraries like ODBC?
Thanks to all of you for your answers, but I am in a situation where I can't accept them; of course this is my fault for not explaining my problem completely. I explained it in a comment on one of the answers, which has now been removed.
I have to write a program that will act as a proxy between the client-side program and MSSQL. For this library I have the following requirements:
My program should be compatible with MSSQL2000
I don't know all the tables and queries that will be sent by the user; I should simply add some information to them, make a log, ... and then execute them against MSSQL. So it is really hard to use techniques based on ordered field(s) of the query or the primary key of the table. (All my work is in one database, but that database is huge and may change over time.)
Only a part of the data is needed by the client. Most DBMSs support LIMIT/OFFSET; unfortunately MSSQL does not, and ROW_NUMBER does not exist in MSSQL2000. Even if it were supported, I would again need to understand the program logic, and that requires parsing the SQL command. (Actually I wrote a parsing library with boost::spirit, but that's native code, and besides that I'm not yet 100% sure about its functionality.)
I may have multiple clients, but most of the queries they send are one of a few predefined queries (of course users still send custom queries, but those are about 30% of all queries). So I think I can open some scrollable cursors and respond to clients using those cursors and a custom cache.
The server machine and its MSSQL instance will be dedicated to my program, so I really want to use all the power of the server and the DBMS to achieve my functionality.
So now:
What is the problem with using scrollable cursors, and why should I avoid them?
How can I use scrollable cursors in .NET?
In SQL Server you can write paged queries like the ones below. You handle the page number easily from the application; you do not need to create cursors for this task.
For SQL Server 2005 or higher:
SELECT * FROM ( SELECT *, ROW_NUMBER() OVER (ORDER BY ID) AS ROW FROM TABLEA ) AS ALIAS
WHERE ROW > 40
AND ROW <= 50
For SQL Server 2000:
SELECT TOP 10 T.* FROM TABLEA AS T WHERE T.ID NOT IN
( SELECT TOP 40 ID FROM TABLEA ORDER BY ID DESC )
ORDER BY T.ID DESC
PS: edited to include support for SQL Server 2000.
I usually use DataReader.Read() to skip all the rows that I do not want, when doing paging on a DB which does not support paging.
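For illustration, a minimal sketch of that skip-and-read approach; the query, table, and column names are hypothetical, and note that every skipped row still travels over the network:

using System.Collections.Generic;
using System.Data.SqlClient;

public static class ReaderPager
{
    // Skip rows client-side with the firehose reader. This works on any provider,
    // but it only suits modest result sets since skipped rows are still read.
    public static List<string> GetPage(string connectionString, int pageNumber, int pageSize)
    {
        var page = new List<string>();
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand("SELECT Name FROM dbo.Customers ORDER BY Name", conn))
        {
            conn.Open();
            using (var reader = cmd.ExecuteReader())
            {
                int toSkip = pageNumber * pageSize;
                // Discard the rows that belong to earlier pages.
                for (int i = 0; i < toSkip && reader.Read(); i++) { }

                while (page.Count < pageSize && reader.Read())
                {
                    page.Add(reader.GetString(0));
                }
            }
        }
        return page;
    }
}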
If you don't want to build the SQL paged query yourself you are free to use my paging class: https://github.com/jgauffin/Griffin.Data/blob/master/src/Griffin.Data/BasicLayer/Paging/SqlServerPager.cs
When Microsoft designed the ADO.NET API, they made the decision to expose only firehose cursors (IDataReader etc). This may or may not actually pose a problem for you. You say that you want "functionality of scrollable cursors", but that can mean all sorts of things, not just paging, and each particular use case can be tackled in a variety of ways. For example:
Requirement: The user should be able to arbitrarily page up and down the resultset.
Retrieve only one page of data at a time, e.g. using the ROW_NUMBER() function. This is more efficient than scrolling through a cursor.
Requirement: I have an extremely large data set and I only want to process one row at a time to avoid running out of memory.
Use the firehose cursor provided by ADO.NET. Note that this is only practical if (a) you don't need to hit the database at all during the loop, or (b) you have MARS configured in your connection string.
Simulate a keyset cursor by retrieving the set of unique identifiers into an array, then loop through the array and read one row of data at a time (see the sketch after this list).
Requirement: I am doing a complicated calculation that involves moving forwards and backwards through the resultset.
You should be able to re-write your algorithm to eliminate this requirement. For example, read one set of rows, process them, read another set of rows, process them, etc.
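As an illustration of the keyset-cursor simulation mentioned above, a minimal sketch; the table name, key column, and connection handling are hypothetical:

using System.Collections.Generic;
using System.Data.SqlClient;

public static class KeysetSimulation
{
    // 1) Fetch just the keys once (cheap), 2) fetch full rows on demand.
    // The caller can then move forwards and backwards over the result set
    // without holding a server-side scrollable cursor open.
    public static List<int> LoadKeys(SqlConnection conn)
    {
        var keys = new List<int>();
        using (var cmd = new SqlCommand("SELECT OrderId FROM dbo.Orders ORDER BY OrderId", conn))
        using (var reader = cmd.ExecuteReader())
        {
            while (reader.Read())
                keys.Add(reader.GetInt32(0));
        }
        return keys;
    }

    public static void LoadRow(SqlConnection conn, int orderId)
    {
        using (var cmd = new SqlCommand("SELECT * FROM dbo.Orders WHERE OrderId = @id", conn))
        {
            cmd.Parameters.AddWithValue("@id", orderId);
            using (var reader = cmd.ExecuteReader())
            {
                if (reader.Read())
                {
                    // Materialize the row for the caller here.
                }
            }
        }
    }
}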
UPDATE (more information provided in the question)
Your business requirements are asking too much. You have to handle arbitrary queries that assume the presence of scrollable cursors, but you can't provide scrollable cursors, and you can't re-write the client code to not use scrollable cursors. That's an impossible position to be in. I recommend you stick with what you currently have (C++ and ODBC) and don't bother trying to re-write it in .NET.
I don't think cursors will work for your particular case. The main reason is that you have 3 tiers. But let's take two steps back.
Most 3-tier applications have a stateless middle tier (your C++ code). Caching is fine, since it really is just an optimization and does not create any real state in the middle tier. The middle tier normally has a small number of open sessions to the database, because opening a DB session is expensive for the processor, and once the session is open a set amount of RAM is reserved on the database server. When a request is received by the middle tier, the request is processed and handed on to the SQL database. An algorithm may be used to pick any of the open sessions, or it can even be done at random. In this model it is not possible to know which session will receive the next request. Cursors belong to the session that received the original query request, so you can't really expect the receiving session to be the one that has your open cursor.
The 3-tier model I described is used mainly for web applications so they can scale to hundreds or thousands of clients, where SQL Server would never be able to open that many sessions. Microsoft ADO.NET already has many features to support the kind of architecture I described, so it is not very hard to implement. The same model is used even in non-web applications, depending on the circumstances. You could potentially keep track of your sessions so you could open a single session per client; I would first make sure that the use case justifies that. Keep in mind that open cursors can take up a lot of resources as well.
Cursors still have a place within a single transaction; it's just hard to keep them open so that the client application can fetch/update values within the result set.
What I would suggest is that you do the following within the query transaction: store in a separate table the primary key values of the main table in your query, and on that separate table include other values like a session id and a row number. Return the first few rows by joining to the new table in the original query, and in subsequent calls just query the corresponding rows again by joining to your new table. You will need an equivalent of a caching mechanism to purge old data and to refresh the result set according to your needs.
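A minimal sketch of that key-table approach; the table and column names are hypothetical, and the row number comes from an IDENTITY column on the key table so it also works on SQL Server 2000, where ROW_NUMBER() is not available:

using System;
using System.Data.SqlClient;

public static class KeyTablePaging
{
    // Step 1: materialize the keys of the result set once per session.
    // dbo.ResultKeys has an IDENTITY column RowNo, so insertion order
    // (driven by the ORDER BY) supplies the row numbers.
    const string MaterializeKeysSql = @"
        INSERT INTO dbo.ResultKeys (SessionId, OrderId)
        SELECT @sessionId, o.OrderId
        FROM dbo.Orders o
        ORDER BY o.OrderDate;";

    // Step 2: page through the data by joining back to the key table.
    const string FetchPageSql = @"
        SELECT o.*
        FROM dbo.ResultKeys k
        JOIN dbo.Orders o ON o.OrderId = k.OrderId
        WHERE k.SessionId = @sessionId
          AND k.RowNo BETWEEN @first AND @last
        ORDER BY k.RowNo;";

    public static void MaterializeKeys(SqlConnection conn, Guid sessionId)
    {
        using (var cmd = new SqlCommand(MaterializeKeysSql, conn))
        {
            cmd.Parameters.AddWithValue("@sessionId", sessionId);
            cmd.ExecuteNonQuery();
        }
    }

    public static void FetchPage(SqlConnection conn, Guid sessionId, int first, int last)
    {
        using (var cmd = new SqlCommand(FetchPageSql, conn))
        {
            cmd.Parameters.AddWithValue("@sessionId", sessionId);
            cmd.Parameters.AddWithValue("@first", first);
            cmd.Parameters.AddWithValue("@last", last);
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    // Stream the page back to the client.
                }
            }
        }
    }
}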

Generic pagination system

I have to develop a layer to retrieve data from a database (it can be SQL Server, Oracle or IBM DB2). Queries (which are generic) are written by developers, but I can modify them in my layer. The tables can be huge (say > 1,000,000 rows), and there are a lot of joins (for example, I have a query with 35 joins - no way to reduce that).
So, I have to develop a pagination system to retrieve a "page" (say, 50 rows).
The layer (which is in a DLL) is for desktop applications.
Important fact: queries are never ordered by ID.
The only way I found is to generate a unique row number (with the MSSQL ROW_NUMBER() function), but that won't work with Oracle because there are too many joins.
Does anyone know another way?
There are only two ways to do pagination code.
The first is database specific. Each of those databases have very different best practices with regards to paging through result sets. Which means that your layer is going to have to know what the underlying database is.
The second is to execute the query as is then just send the relevant records up the stream. This has obvious performance issues in that it would require your data layer to essentially grab all the records all of the time.
This is, IMHO, the primary reason why people shouldn't try to write database-agnostic code. At the end of the day there are enough differences between RDBMSs that it makes sense to have a pluggable data layer architecture which can take advantage of the specific RDBMS it works with (a sketch of such a pluggable pager follows the list below).
In short, there is no ANSI standard for this. For example:
MySQL uses the LIMIT keyword for paging.
Oracle has ROWNUM, which has to be combined with subqueries. (not sure when it was introduced)
SQL Server 2005 and later have ROW_NUMBER, which should be used with a CTE.
SQL Server 2000 had a different (and very complicated) way entirely of paging in a query, which required several different procs and a function.
IBM DB2 has rownumber() which also must be implemented as a subquery.
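For illustration, a minimal sketch of such a pluggable design; the IPager interface is hypothetical, each implementation simply wraps the caller's SQL as a derived table, and real queries with 35 joins (or queries that already contain ORDER BY) will need more care than this:

// Each database gets its own pager; the data layer picks an implementation
// based on the connection/provider it is configured with.
public interface IPager
{
    string ToPagedQuery(string sql, string orderByColumn, int pageNumber, int pageSize);
}

public sealed class SqlServerPager : IPager
{
    public string ToPagedQuery(string sql, string orderByColumn, int pageNumber, int pageSize)
    {
        int first = pageNumber * pageSize + 1;
        int last = (pageNumber + 1) * pageSize;
        // The caller's query is wrapped as a derived table; validation and
        // parameterization are omitted for brevity.
        return $@"
            SELECT * FROM (
                SELECT q.*, ROW_NUMBER() OVER (ORDER BY {orderByColumn}) AS rn
                FROM ({sql}) q
            ) paged
            WHERE rn BETWEEN {first} AND {last}";
    }
}

public sealed class OraclePager : IPager
{
    public string ToPagedQuery(string sql, string orderByColumn, int pageNumber, int pageSize)
    {
        int first = pageNumber * pageSize + 1;
        int last = (pageNumber + 1) * pageSize;
        // Classic Oracle pattern: order in the inner query, number with ROWNUM,
        // then filter the lower bound in the outer query.
        return $@"
            SELECT * FROM (
                SELECT q.*, ROWNUM AS rn
                FROM ({sql} ORDER BY {orderByColumn}) q
                WHERE ROWNUM <= {last}
            )
            WHERE rn >= {first}";
    }
}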
You can use LINQ on your object collection, if you want to do the paging on the application side.
list.Skip(numPages * itemsPerPage).Take(itemsPerPage)
This lets you skip to the specified page (i.e., numPages = 0 is page 1).

Design Strategy: Query and Update data across 2 different databases

We have a requirement in which we need to query data across 2 different databases (one in SQL Server and the other in Oracle).
Here are the scenarios which need to be implemented:
Query: Get the data from one database and match it against values in the other
Update: Get the data from one database and update the objects in the other
Technology that we are using: ASP.NET, C#
The options that we have thought about:
Staging area in one database
Linked server (can't go with this approach as it is not allowed due to an organization-wide policy)
Create web services
Create 2 different DALs and perform list operations on the data from the 2 sources in the DAL
I would like to know the best design strategy to deal with this kind of scenario, and the pros and cons of that approach.
Is it not possible to use an SSIS package to do the data transformation between the 2 servers and invoke it either from the ASP.NET/C# project or via a scheduled job invoked on demand?
Will the results from one of the databases be small enough to efficiently pass around?
If so, I would suggest treating the databases as two independent datasources.
If the datasets are large, then you may have to consider some form of ETL into a staging area on one of the databases. You may have issues if you need the queries to return up-to-date data from both databases, because then you will need to do real-time ETL.
There is an article here about performing distributed transactions between Microsoft SQL server and Oracle:
https://web.archive.org/web/1/http://articles.techrepublic%2ecom%2ecom/5100-10878_11-1054237.html
I don't know how well this works, however if it does work, this will probably be the best solution for you:
It will almost certainly be the fastest method of querying across multiple database servers.
It should also allow for true transactional support even when writing to both databases.
The best strategy for this would be to use a linked server, as it is designed for querying and writing to heterogeneous databases as you described above. But obviously, due to the policy constraint you mentioned, this is not an option.
Therefore, to achieve the result you want in the most optimal performance, here is what I suggest:
Decide which database contains only the lookup data (the minimal dataset); you will need to execute a query on it to pull that info out.
Insert the lookup data, using bulk copy, into a temp/dummy table in the main database (the one that contains most of the data you will want to retrieve and return to the caller), as sketched after these steps.
Use a stored procedure or query to join the temp table with other tables in your main database to retrieve the desired dataset.
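A minimal sketch of steps 2 and 3, assuming the lookup rows from the other database have already been loaded into a DataTable and that a staging table dbo.LookupStaging exists in the main SQL Server database; all names are hypothetical:

using System.Data;
using System.Data.SqlClient;

public static class CrossDbQuery
{
    public static DataTable JoinAcrossDatabases(string sqlServerConnStr, DataTable oracleLookupRows)
    {
        using (var conn = new SqlConnection(sqlServerConnStr))
        {
            conn.Open();

            // Step 2: push the small lookup set into a staging table with bulk copy.
            using (var bulk = new SqlBulkCopy(conn) { DestinationTableName = "dbo.LookupStaging" })
            {
                bulk.WriteToServer(oracleLookupRows);
            }

            // Step 3: let SQL Server do the join against the large local tables.
            const string sql = @"
                SELECT m.*
                FROM dbo.MainTable m
                JOIN dbo.LookupStaging s ON s.LookupKey = m.LookupKey";

            var result = new DataTable();
            using (var cmd = new SqlCommand(sql, conn))
            using (var adapter = new SqlDataAdapter(cmd))
            {
                adapter.Fill(result);
            }
            return result;
        }
    }
}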
The decision of whether to write this as a web service or not isn't going to change the data retrieval process. But consideration should be given to reducing the data transfer overhead by keeping the process as close to your DB server as possible, either on the same machine or within a LAN/high-speed connection link.
Data update will be quite straightforward. It will just be the standard two-phase operation of pulling data out of one and updating the other.
It's hard to tell what the best solution is. But we have a scenario that's nearly the same.
Real-time data:
For real-time data updates we are using web services, since in our case the two databases belong to distinct projects. Every project offers a web service which can be used for data retrieval and data update. This has the advantage that a project does not need to care about database structure changes in the other project, as long as the web service interface does not change.
Static data:
Static data (e.g. employees) is mirrored for faster access. For that large amount of data we use flat files for the nightly update.
In the case of static data I think it's important to explicitly define data owners. For every piece of data it should be clear which database holds the original and which database only has shadow copies for faster access.
So static data is read-only in the shadow database, or only updatable through designated web services.
The problem with using multiple data sources in your .NET code is that you run the risk of having your CRUD ops fail ACID tests and having data inconsistencies.
I would be most inclined to pursue @Will A's comment on your question...
Set up replication to a remote server, then link the two remote servers.
Have multiple DALs and handle it in the application - thousands is not a big number; you need to worry only if you are into the hundreds of thousands or millions, in which case your application will hang.
Use LINQ to perform data operations on the datasets that are generated, rather than looping through them.
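A minimal sketch of that idea, assuming each DAL has already returned its rows as an in-memory collection; the row shapes and key column are hypothetical:

using System.Collections.Generic;
using System.Linq;

// Hypothetical shapes of the rows returned by the two DALs.
public class SqlServerCustomer { public int CustomerId { get; set; } public string Name { get; set; } }
public class OracleAccount { public int CustomerId { get; set; } public decimal Balance { get; set; } }

public static class CrossDbMatcher
{
    // Join the two in-memory result sets once, instead of looping over one
    // list and querying the other source row by row.
    public static List<(string Name, decimal Balance)> Match(
        IEnumerable<SqlServerCustomer> customers,
        IEnumerable<OracleAccount> accounts)
    {
        return customers
            .Join(accounts,
                  c => c.CustomerId,
                  a => a.CustomerId,
                  (c, a) => (c.Name, a.Balance))
            .ToList();
    }
}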
