Dealing with Massive Graphs - Traveling Salesperson

I'm teaching myself how to program algorithms involving TSPs (Dijkstra, Kruskal) and I'm looking for some startup advice. I am working with C# and SQL. Ideally I'd like to be able to do this strictly in SQL; however, I'm not sure if that is possible (I assume the runtime would be awful after 50 vertices).
So I guess the question is: can I do this in only SQL, and if so, what is the best approach? If not, and I have to get C# involved, what would be the best approach there?

It is only advisable to do simple calculations in SQL, like calculating sums. Sums are faster in SQL because only the sums are returned instead of all the records. Complicated algorithms like the ones you have in mind must be done in your C# code! First, the SQL language is not suited for such problems; second, it is optimized for database access, making it very slow for other types of use.
Read your data from your database with SQL into an appropriate data structure in your C# program. Do all the TSP-related logic there and, if you want, store the result in the database when finished.

I'm going to chime in for SQL. While it would not really be my first choice for working on TSP, it can still easily do this kind of thing, assuming of course that the data model is optimal for your efforts.
The first chore will be to define a data model that holds the information your algorithm needs, then populate it with some sample data, then work out a query that can retrieve the arrays as needed.
Then you can decide if some simple SQL in that query would work for you, or perhaps an extension in the form of a stored procedure.
Finally, you may opt to pull it out to your alternate language of choice.

Well, I am not sure if SQL is the best option to accomplish this, but you could try using an adjacency matrix for the input. Many published algorithms are designed for this kind of input, and after that the only issue is putting the pseudocode into C#. Take a look at this:
http://en.wikipedia.org/wiki/Adjacency_matrix.
You would be using a two dimensional array to represent the matrix.
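As a rough illustration of the two-dimensional array idea, here is a minimal sketch. The names TourSketch, BuildMatrix, NearestNeighbourTour and TourLength are made up for this example, and the nearest-neighbour rule is only a simple greedy heuristic chosen for brevity; it is not an exact TSP solver and not something the answers above prescribe. The edge triples are assumed to be rows you have already read from your database.

```csharp
using System;

public static class TourSketch
{
    // Builds a symmetric adjacency matrix from (from, to, weight) triples,
    // e.g. rows read from a hypothetical Edges table. Missing edges stay at infinity.
    public static double[,] BuildMatrix(int vertexCount, (int From, int To, double Weight)[] edges)
    {
        var matrix = new double[vertexCount, vertexCount];
        for (int i = 0; i < vertexCount; i++)
            for (int j = 0; j < vertexCount; j++)
                matrix[i, j] = i == j ? 0 : double.PositiveInfinity;

        foreach (var edge in edges)
        {
            matrix[edge.From, edge.To] = edge.Weight;
            matrix[edge.To, edge.From] = edge.Weight;
        }
        return matrix;
    }

    // Greedy heuristic: from the current vertex, always move to the closest unvisited one.
    // Assumes a complete graph; the result is a valid tour but usually not the shortest.
    public static int[] NearestNeighbourTour(double[,] matrix, int start)
    {
        int n = matrix.GetLength(0);
        var visited = new bool[n];
        var tour = new int[n];
        int current = start;
        visited[current] = true;
        tour[0] = current;

        for (int step = 1; step < n; step++)
        {
            int next = -1;
            for (int candidate = 0; candidate < n; candidate++)
            {
                if (visited[candidate]) continue;
                if (next == -1 || matrix[current, candidate] < matrix[current, next])
                    next = candidate;
            }
            visited[next] = true;
            tour[step] = next;
            current = next;
        }
        return tour;
    }

    // Sums the edge weights along the tour, including the edge back to the start.
    public static double TourLength(double[,] matrix, int[] tour)
    {
        double total = 0;
        for (int i = 0; i < tour.Length; i++)
            total += matrix[tour[i], tour[(i + 1) % tour.Length]];
        return total;
    }
}
```

The matrix needs memory proportional to the square of the vertex count, so for very large sparse graphs an adjacency list may be the better fit.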

Related

More efficient than a Datatable

We have a reporting tool that is grabbing a large number of records. At times it can be 1 million records. We have been storing this in a DataTable. I wanted to know if there was a better object to store this in. I would need to be able to aggregate the data in various ways.
Update:
Yes. Personally, I believe we should not be getting that many records. This is not the direction I want to go.
Also, I am using Oracle.
Update Update
Sorry for the delay, but there are always fires to put out here. The main issue was they were running out of memory and getting memory errors. They had issues with the DataTable releasing from memory and also binding to a DataGridView. I guess what I was looking for was a lighter-weight object that wouldn't take as much space.
After thinking about it a little more, it really doesn't make any sense to get that much data, as diagonalbatman mentioned. Furthermore, if just a few people are using it and we already have these issues, how is it going to scale?
Unfortunately, I have a boss that doesn't listen and an offshore team that has too much of a "yes sir" attitude. They are serializing the raw data (as an XML file) and releasing the raw data DataTable, which I think is not a good direction at all.
@diagonalbatman - Out of curiosity, do you have an example of this?

Why do you need to draw down 1 million records into your app?
Can you not do your reporting consolidation / aggregation on the DB? This would make better use of the DB's resources (after all, this is what an RDBMS is designed to do), and then you can focus your app on working with smaller consolidated sets.

I would recommend you try several options to verify, especially in light of your needed ability to aggregate the data in various ways.
1) Can it be aggregated by proper queries on the data side? This is likely the best solution.
2) If you use POCOs, does LINQ improve upon your current memory and performance characteristics? Does LINQ allow you to do the aggregation you require?
Measure the characteristics you care about and try different options.
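To make option 2 concrete, here is a small sketch of aggregating POCOs with LINQ to Objects. The SaleRecord class and its Region, Year and Amount properties are invented for illustration; the real report columns are not given in the question, so treat the shape of the data as an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical POCO standing in for one report row.
public sealed class SaleRecord
{
    public string Region { get; set; } = "";
    public int Year { get; set; }
    public decimal Amount { get; set; }
}

public static class ReportSketch
{
    // Groups the rows by region and sums the amounts in each group.
    public static Dictionary<string, decimal> TotalByRegion(IEnumerable<SaleRecord> records) =>
        records.GroupBy(r => r.Region)
               .ToDictionary(g => g.Key, g => g.Sum(r => r.Amount));

    // Groups the same rows a different way: average amount per year.
    public static Dictionary<int, decimal> AverageByYear(IEnumerable<SaleRecord> records) =>
        records.GroupBy(r => r.Year)
               .ToDictionary(g => g.Key, g => g.Average(r => r.Amount));
}
```

Whether this actually uses less memory than a DataTable for a million rows is something to measure rather than assume; pushing the GROUP BY to the database, as in option 1, avoids loading the rows at all.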

What you want are Data Cubes. Depending on the type of database you have, you should look at building some Cubes.

Joins are for lazy people?

I recently had a discussion with another developer who claimed that JOINs (SQL) are useless. This is technically true, but he added that using joins is less efficient than making several requests and linking the tables in code (C# or Java).
For him joins are for lazy people that don't care about performance. Is this true? Should we avoid using joins?

No, we should avoid developers who hold such incredibly wrong opinions.
In many cases, a database join is several orders of magnitude faster than anything done via the client, because it avoids DB roundtrips, and the DB can use indexes to perform the join.
Off the top of my head, I can't even imagine a single scenario where a correctly used join would be slower than the equivalent client-side operation.
Edit: There are some rare cases where custom client code can do things more efficiently than a straightforward DB join (see comment by meriton). But this is very much the exception.

It sounds to me like your colleague would do well with a no-sql document-database or key-value store. Which are themselves very good tools and a good fit for many problems.
However, a relational database is heavily optimised for working with sets. There are many, many ways of querying the data based on joins that are vastly more efficient than lots of round trips. This is where the versatility of an RDBMS comes from. You can achieve the same in a NoSQL store too, but you often end up building a separate structure suited to each different kind of query.
In short: I disagree. In an RDBMS, joins are fundamental. If you aren't using them, you aren't using it as an RDBMS.

Well, he is wrong in the general case.
Databases are able to optimize using a variety of methods, helped by optimizer hints, table indexes, foreign key relationships and possibly other database vendor specific information.

No, you shouldn't.
Databases are specifically designed to manipulate sets of data (obviously....). Therefore they are incredibly efficient at doing this. By doing what is essentially a manual join in his own code, he is attempting to take over the role of something specifically designed for the job. The chances of his code ever being as efficient as that in the database are very remote.
As an aside, without joins, what's the point in using a database? He may as well just use text files.

If "lazy" is defined as people who want to write less code, then I agree. If "lazy" is defined as people who want to have tools do what they are good at doing, I agree. So if he is merely agreeing with Larry Wall (regarding the attributes of good programmers), then I agree with him.

Ummm, joins is how relational databases relate tables to each other. I'm not sure what he's getting at.
How can making several calls to the database be more efficient than one call? Plus sql engines are optimized at doing this sort of thing.
Maybe your coworker is too lazy to learn SQL.

"This is technically true" - similarly, a SQL database is useless: what's the point in using one when you can get the same result by using a bunch of CSV files, and correlating them in code? Heck, any abstraction is for lazy people, let's go back to programming in machine code right on the hardware! ;)
Also, his assertion is untrue in all but the most convoluted cases: RDBMSs are heavily optimized to make JOINs fast. Relational database management systems, right?

Yes, You should.
And you should use C++ instead of C# because of performance. C# is for lazy people.
No, no, no. You should use C instead of C++ because of performance. C++ is for lazy people.
No, no, no. You should use assembly instead of C because of performance. C is for lazy people.
Yes, I am joking. You can make faster programs without joins, and you can make programs that use less memory without joins. BUT in many cases, your development time is more important than CPU time and memory. Give up a little performance and enjoy your life. Don't waste your time on a tiny bit of performance. And tell him "Why don't you make a straight highway from your place to your office?"

The last company I worked for didn't use SQL joins either. Instead they moved this work to the application layer, which is designed to scale horizontally. The rationale for this design is to avoid work at the database layer. It is usually the database that becomes the bottleneck. It's easier to replicate the application layer than the database. There could be other reasons, but this is the one that I can recall now.
Yes, I agree that joins done at the application layer are inefficient compared to joins done by the database. There is more network communication, too.
Please note that I'm not taking a hard stand on avoiding SQL joins.

Without joins how are you going to relate order items to orders?
That is the entire point of a relational database management system.
Without joins there is no relational data and you might as well use text files to process data.
Sounds like he doesn't understand the concept so he's trying to make it seem they are useless. He's the same type of person who thinks Excel is a database application.
Slap him silly and tell him to read more about databases. Making multiple connections and pulling data and merging the data via C# is the wrong way to do things.

I don't understand the logic of the statement "joins in SQL are useless".
Is it useful to filter and limit the data before working on it? As your other respondents have stated, this is what database engines do; it should be what they are good at.
Perhaps a lazy programmer would stick to technologies with which they were familiar and eschew other possibilities for non technical reasons.
I leave it to you to decide.

Let's consider an example: a table with invoice records, and a related table with invoice line item records. Consider the client pseudo code:
for each (invoice in invoices)
    let invoiceLines = FindLinesFor(invoice)
    ...
If you have 100,000 invoices with 10 lines each, this code will look up 10 invoice lines from a table of 1 million, and it will do that 100,000 times. As the table size increases, the number of select operations increases, and the cost of each select operation increases.
Because computers are fast, you may not notice a performance difference between the two approaches if you have several thousand records or fewer. Because the cost increase is more than linear, as the number of records increases (into the millions, say), you'll begin to notice a difference, and the difference will become less tolerable as the size of the data set grows.
The join, however, will use the table's indexes and merge the two data sets. This means that you're effectively scanning the second table once rather than randomly accessing it N times. If there's a foreign key defined, the database already has the links between the related records stored internally.
Imagine doing this yourself. You have an alphabetical list of students and a notebook with all the students' grade reports (one page per class). The notebook is sorted in order by the students' names, in the same order as the list. How would you prefer to proceed?
1) Read a name from the list.
2) Open the notebook.
3) Find the student's name.
4) Read the student's grades, turning pages until you reach the next student or the last page.
5) Close the notebook.
6) Repeat.
Or:
1) Open the notebook to the first page.
2) Read a name from the list.
3) Read any grades for that name from the notebook.
4) Repeat steps 2-3 until you get to the end.
5) Close the notebook.
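The invoice example above can be imitated in memory to show the difference in shape between the two approaches. This is only an analogy using LINQ to Objects: the Invoice and InvoiceLine types and both method names are invented here, and a real database join also saves the network round trips that this sketch cannot show.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory stand-ins for the two tables.
public sealed record Invoice(int Id, string Customer);
public sealed record InvoiceLine(int InvoiceId, decimal Amount);

public static class JoinSketch
{
    // "Open the notebook for every name": scans every line once per invoice,
    // so the work grows with invoices.Count * lines.Count.
    public static Dictionary<int, decimal> TotalsByRepeatedLookup(
        IReadOnlyList<Invoice> invoices, IReadOnlyList<InvoiceLine> lines)
    {
        var totals = new Dictionary<int, decimal>();
        foreach (var invoice in invoices)
        {
            totals[invoice.Id] = lines.Where(l => l.InvoiceId == invoice.Id)
                                      .Sum(l => l.Amount);
        }
        return totals;
    }

    // "Read the notebook once": GroupJoin indexes the lines by key a single time
    // and then matches each invoice against that index.
    public static Dictionary<int, decimal> TotalsBySingleJoin(
        IReadOnlyList<Invoice> invoices, IReadOnlyList<InvoiceLine> lines) =>
        invoices.GroupJoin(lines,
                           invoice => invoice.Id,
                           line => line.InvoiceId,
                           (invoice, matched) => new { invoice.Id, Total = matched.Sum(l => l.Amount) })
                .ToDictionary(x => x.Id, x => x.Total);
}
```

Both methods return the same totals, including zero for invoices without lines; only the amount of repeated scanning differs.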

Sounds like a classic case of "I can write it better." In other words, he's seeing something that he sees as kind of a pain in the neck (writing a bunch of joins in SQL) and saying "I'm sure I can write that better and get better performance." You should ask him if he is a) smarter and b) more educated than the typical person that's knee deep in the Oracle or SQL Server optimization code. Odds are he isn't.

He is most certainly wrong. While there are definite pros to data manipulation within languages like C# or Java, joins are fastest in the database due to the nature of SQL itself.
SQL keeps detailed statistics regarding the data and, if you have created your indexes correctly, can very quickly find one record in a couple of million. Besides, why would you want to drag all your data into C# to do a join when you can just do it right at the database level?
The pros for using C# come into play when you need to do something iteratively. If you need to do some function for each row, it's likely faster to do so within C#, otherwise, joining data is optimized in the DB.

I will say that I have run into a case where it was faster breaking the query down and doing the joins in code. That being said, it was only with one particular version of MySQL that I had to do that. Everything else, the database is probably going to be faster (note that you may have to optimize the queries, but it will still be faster).

I suspect he has a limited view on what databases should be used for. One approach to maximise performance is to read the entire database into memory. In this situation, you may get better performance and you may want to perform joins in memory for efficiency. However, this is not really using a database as a database, IMHO.

No, not only are joins better optimized in database code than in ad-hoc C#/Java, but usually several filtering techniques can be applied, which yield even better performance.

He is wrong; joins are what competent programmers use. There may be a few limited cases where his proposed method is more efficient (and in those I would probably be using a document database), but I can't see it if you have any decent amount of data. For example, take this query:
select t1.field1
from table1 t1
join table2 t2
on t1.id = t2.id
where t1.field2 = 'test'
Assume you have 10 million records in table1 and 1 million records in table2. Assume 9 million of the records in table1 meet the where clause. Assume only 15 of them are in table2 as well. You can run this SQL statement, which, if properly indexed, will take milliseconds and return 15 records across the network with only 1 column of data. Or you can send ten million records with 2 columns of data and separately send another 1 million records with one column of data across the network and combine them on the web server.
Or of course you could keep the entire contents of the database on the web server at all times which is just plain silly if you have more than a trivial amount of data and data that is continually changing. If you don't need the qualities of a relational database then don't use one. But if you do, then use it correctly.

I've heard this argument quite often during my career as a software developer. Almost every time it has been stated, the guy making the claim didn't have much knowledge about relational database systems, the way they work, and the way such systems should be used.
Yes, when used incorrectly, joins seem to be useless or even dangerous. But when used in the correct way, there is a lot of potential for database implementation to perform optimizations and to "help" the developer retrieving the correct result most efficiently.
Don't forget that using a JOIN you tell the database about the way you expect the pieces of data to relate to each other and therefore give the database more information about what you are trying to do and therefore making it able to better fit your needs.
So the answer is definitely: No, JOINs aren't useless at all!

This is "technically true" only in one case which is not used often in applications (when all the rows of all the tables in the join(s) are returned by the query). In most queries only a fraction of the rows of each table is returned. The database engine often uses indexes to eliminate the unwanted rows, sometimes even without reading the actual row as it can use the values stored in indexes. The database engine is itself written in C, C++, etc. and is at least as efficient as code written by a developer.

Unless I've seriously misunderstood, the logic in the question is very flawed.
If there are 20 rows in B for each A, 1000 rows in A implies 20k rows in B.
There can't be just 100 rows in B unless there is a many-to-many table "AB" with 20k rows containing the mapping.
So to get all the information about which 20 of the 100 B rows map to each A row, you need table AB too. So this would be either:
3 result sets of 100, 1000, and 20k rows and a client JOIN
a single JOINed A-AB-B result set with 20k rows
So "JOIN" in the client does add any value when you examine the data. Not that it isn't a bad idea. If I was retrieving one object from the database then maybe it makes more sense to break it down into separate result sets. For a report type call, I'd flatten it out into one almost always.
In any case, I'd say there is almost no use for a cross join of this magnitude. It's a poor example.
You have to JOIN somewhere, and that's what RDBMS are good at. I'd not like to work with any client code monkey who thinks they can do better.
Afterthought:
To join in the client requires persistent objects such as DataTables (in .net). If you have one flattened resultset it can be consumed via something lighter like a DataReader. High volume = lot of client resources used to avoid a database JOIN.

Efficiency of linq to sql vs stored procedure

Hi, I'm writing an app which has a search page and does a search on the database.
I'm wondering whether I should do this in linq or a stored procedure.
Is the performance of a stored procedure much better than that of linq to sql?
I'm thinking it would be because in order to write the linq query you need to use the datacontext to access the table on which to query. I'm imagining this in itself means that if the table is big it might become inefficient.
That is if you were using:
context.GetTable<T>();
Can anyone advise me here?

There is unlikely to be much difference UNLESS you encounter a situation where the TSQL produced by Linq to SQL is not optimal.
If you want absolute control over the TSQL use a stored procedure.
If speed is critical, benchmark both and also examine the TSQL produced by your Linq to SQL solution.
Also, you should be wary of pulling back entire tables (unless they are small, such as frequently accessed lookup data) across the wire in either solution.

If the speed is so critical to you then you should go ahead and benchmark both options on a reasonable set of data. Technically I would expect the SP to be faster but it might not be that much of a difference.

What does "efficient" mean to you?
I'm working on a website where sub-second response times (preferably under 500ms) are the goal. We're using LINQ for search on most of our stuff. The only time we're actually using a SP is when we're using hierarchyid and other SQL Server data types that don't exist in EF.

GetTable probably isn't going to be that different between the two, as fundamentally it's just SELECT * FROM T. You'll see more significant gains from stored procedures in cases where the query isn't being written very optimally by LINQ, or in some very high load situations where caching the execution plan makes a difference.
Benchmarking it is the best answer, but from what it looks like you're doing I don't think the difference is going to amount to much.

LINQ to SQL - Lightweight O/RM?

I've heard from some that LINQ to SQL is good for lightweight apps. But then I see LINQ to SQL being used for Stackoverflow, and a bunch of other .coms I know (from interviewing with them).
OK, so is this true? For an e-commerce site that's bringing in millions, where you're typically only doing basic CRUD most of the time with the exception of an occasional stored proc for something more complex, is LINQ to SQL complete enough and performance-wise good enough, or able to be tweaked enough, to run happily on an e-commerce site? I've heard that you just need to tweak performance on the DB side when using LINQ to SQL for a better approach.
So there are really 2 questions here:
1) Meaning/scope/definition of a "Lightweight" O/RM solution: What the heck does "lightweight" mean when people say LINQ to SQL is a "lightweight O/RM" and is that true??? If this is so lightweight then why do I see a bunch of huge .coms using it?
Is it good enough to run major .coms (obviously it looks like it is) and what determines what the context of "lightweight" is...it's such a generic statement.
2) Performance: I'm working on my own .com and researching different O/RMs. I'm not really looking at the Entity Framework (yet), just want to figure out the LINQ to SQL basics here and determine if it will be efficient enough for me. The problem I think is you can't tweak or control the SQL it generates...

When people describe LINQ to SQL as lightweight, I think they mean it is good enough at what it does, but there is a lot of stuff it doesn't even try to do. I think this is a good thing, because all that extra stuff that other ORMs might try to let you do isn't really even needed if you're just willing to make a few sacrifices.
For example, I think it's a best practice to try to keep all application data in a single database. This is the kind of thing that LINQ to SQL expects if you want to be able to do joins and whatnot. However, if you work in some environment with layers of bureaucracy, you might not be able to convince everyone to move legacy data around, or centralize on a single way of doing things. In the end you need a more complicated ORM and you end up with arguably crappier software. That's just one example of why you might not be able to shape the data as it needs to be.
So yeah, if big .com's are willing or able to do things in a consistent manner and follow best practices there is no reason why the ORM can't be as simple as necessary.

.NET data storage - something between built-in collections and external SQL database?

I will preface the question by saying that I am somewhat new to the .NET world, and might be missing something entirely obvious. If so, I'd love to hear what that is!
I often find myself writing small programs that all do more or less the same thing:
Read in data from one or more files
Store this data in memory, in some sort of container
Crunch data, output analysis results to a text file and quit
I often find myself creating monstrous-looking containers to store said data. E.g.:
Dictionary<DateTime, SortedDictionary<ItemType, List<int>>> allItemTypesAndPropertiesByDate =
new Dictionary<DateTime, SortedDictionary<ItemType, List<int>>>();
This works, in the sense that the data structure describes my intent more or less accurately - I want to be able to access item types and properties by date. Nevertheless, I feel that the storage container is too tightly bound to the output data format (if tomorrow I decide that I'd like to find all dates on which items with certain properties were seen, this data structure becomes a liability). Generally, making input and output changes down the line is time-consuming and error-prone. Plus, I have to keep staring at these ugly-looking declarations - and code to iterate over them is not pretty either.
On the other end of the complexity spectrum, I can create a SQL database with schema that describes input in a more flexible format, and then run queries (using SQL or LINQ to SQL) against the database. This certainly works, but feels like too big of a hammer - I write many programs like these, and don't want to create a database for each one, manage the SQL dependency (even if it is SQL express on local machine), etc. I don't need to actually persist the data - just to read it in, keep it in memory, make a few queries and quit. Even using an in-memory SQLite instance feels like an overkill. I am not overly concerned with runtime performance - these are usually just little local machine experiments - but it just feels wrong.
Ideally, what I would like is to have a low-overhead, in memory row store with a loosely-defined schema that is easily LINQ-queryable, and takes only a few lines of code to set up and use. Does the Microsoft .NET 4 stack include something like this? If you found yourself in a similar predicament, what would you do?
Your thoughts are appreciated - thanks!
Alex

If you find a database structure easier to work with, one option might be to create a DataSet with DataTables representing your schema, which you can then query using LINQ to DataSet.
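A minimal sketch of that idea follows. The table name, the column names (Date, ItemType, Property) and the sample rows are assumptions loosely based on the question; AsEnumerable and Field<T> are the LINQ to DataSet extension methods, which on the .NET Framework need a reference to System.Data.DataSetExtensions.

```csharp
using System;
using System.Data;
using System.Linq;

public static class DataSetSketch
{
    // Builds a small in-memory table; in practice the rows would come from the input files.
    public static DataTable BuildItems()
    {
        var table = new DataTable("Items");
        table.Columns.Add("Date", typeof(DateTime));
        table.Columns.Add("ItemType", typeof(string));
        table.Columns.Add("Property", typeof(int));

        table.Rows.Add(new DateTime(2010, 1, 1), "Widget", 3);
        table.Rows.Add(new DateTime(2010, 1, 2), "Gadget", 5);
        table.Rows.Add(new DateTime(2010, 1, 3), "Widget", 3);
        table.Rows.Add(new DateTime(2010, 1, 1), "Gadget", 3);
        return table;
    }

    // Finds all distinct dates on which an item with the given property was seen.
    public static DateTime[] DatesWithProperty(DataTable items, int property) =>
        items.AsEnumerable()
             .Where(row => row.Field<int>("Property") == property)
             .Select(row => row.Field<DateTime>("Date"))
             .Distinct()
             .OrderBy(date => date)
             .ToArray();
}
```

The schema is defined once, and new questions about the data become new queries rather than new nested dictionaries.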

Or you could try to use object databases like db4o; they store the actual objects you would work with, helping you to program in a more object-oriented manner, and it's quite easy to work with. Also, it's not a database server in the traditional sense of the word - it uses flat files as containers and reads/writes directly from/to them.

Why not just use linq?
You can read the data into flat lists, then chain some linq statements to get the structure you want.
Apologies if I'm missing something, but I don't think you need an intermediate.

Comparing databases and OOP, a table definition corresponds to a class definition, a record is an object, and the table data is any kind of collection of objects.
My approach would be to define classes and properties representing the file contents, parse each file entry into an object, and add these objects into a List<T>.
This List can then be queried using Linq.
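Here is a rough sketch of that approach. The Observation class, its properties and the comma-separated line format ("2010-01-01,Widget,3") are assumptions made up for this example; adjust the parsing to whatever the real files contain.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

// Hypothetical class describing one line of an input file.
public sealed class Observation
{
    public DateTime Date { get; set; }
    public string ItemType { get; set; } = "";
    public int Property { get; set; }
}

public static class FlatListSketch
{
    // Parses lines in the assumed "yyyy-MM-dd,ItemType,Property" format into a flat list.
    public static List<Observation> Parse(IEnumerable<string> lines) =>
        lines.Select(line => line.Split(','))
             .Select(parts => new Observation
             {
                 Date = DateTime.ParseExact(parts[0], "yyyy-MM-dd", CultureInfo.InvariantCulture),
                 ItemType = parts[1],
                 Property = int.Parse(parts[2], CultureInfo.InvariantCulture)
             })
             .ToList();

    // One view of the data: observations grouped by date.
    public static ILookup<DateTime, Observation> ByDate(IEnumerable<Observation> all) =>
        all.ToLookup(o => o.Date);

    // A different view built from the same flat list: dates on which a property was seen.
    public static List<DateTime> DatesWithProperty(IEnumerable<Observation> all, int property) =>
        all.Where(o => o.Property == property)
           .Select(o => o.Date)
           .Distinct()
           .OrderBy(date => date)
           .ToList();
}
```

The lines could be supplied by something like File.ReadLines. Because the list stays flat, each new output format is just another query over it, which addresses the concern about the container being tied to one output shape.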
