Can someone explain map-reduce in C#? - c#

Can anyone please explain the concept of map-reduce, particularly in Mongo?
I also use C# so any specifics in that area would also be useful.

One way to understand Map-Reduce coming from C# and LINQ is to think of it as a SelectMany() followed by a GroupBy() followed by an Aggregate() operation.
In a SelectMany() you are projecting a sequence but each element can become multiple elements. This is equivalent to using multiple emit statements in your map operation. The map operation can also chose not to call emit which is like having a Where() clause inside your SelectMany() operation.
In a GroupBy() you are collecting elements with the same key which is what Map-Reduce does with the key value that you emit from the map operation.
In the Aggregate() or reduce step you are taking the collections associated with each group key and combining them in some way to produce one result for each key. Often this combination is simply adding up a single '1' value output with each key from the map step but sometimes it's more complicated.
One important caveat with MongoDB's map-reduce is that the reduce operation must accept and output the same data type because it may be applied repeatedly to partial sets of the grouped data. If you are passed an array of values, don't simply take the length of it because it might be a partial result from an earlier reduce operation.

Here's a spot to get started with Map Reduce in Mongo. The cookbook has a few examples, I would focus on these two.
I like to think of map-reduces in the context of "data warehousing jobs" or "rollups". You're basically taking detailed data and "rolling up" a smaller version of that data.
In SQL you would normally do this with sum() and avg() and group by. In MongoDB you would do this with a Map Reduce. The basic premise of a Map Reduce is that you have two functions.
The first function (map) is a basically a giant for loop that runs over your data and "emits" certain keys and values. The second function (reduce), is a giant loop over all of the emitted data. The map says "hey this is the data you want to summarize" and the reduce says "hey this array of values reduces to this single value"
The output from a map-reduce can come in many forms (typically flat files). In MongoDB, the output is actually a new collection.
C# Specifics
In MongoDB all of the Map Reduces are performed inside of the javascript engine. So both the map & reduce function are all written in javascript. The various drivers will allow you to build the javascript and issue the command, however, this is not how I normally do it.
The preferred method for running Map Reduce jobs is to compile the JS into a file and then mongo map_reduce.js. Generally you'll do this on the server somewhere as a cron job or a scheduled task.
Why?
Well, map reduce is not a "real-time", especially with a big data set. It's really designed to be used in a batch fashion. Don't get me wrong, you can call it from your code, but generally, you don't want users to initiate map reduce jobs. Instead you want those jobs to be scheduled and you want users to be querying the results :)

Map Reduce is a way to process data where you have a map stage/function that identifies all data to be processed and process it, row by row.
Then you have a reduce step/function that can be run multiple times, for example once per server in a cluster and then once in the client to return a final result.
Here is a Wiki article describing it in more detail:
http://en.wikipedia.org/wiki/MapReduce
And here is the documentation for MongoDB for Mapreduce
http://www.mongodb.org/display/DOCS/MapReduce
Simple example, find the longest string in a list.
The map step will loop over the list calculating the length of each string, the reduce step will loop over the result from map and for each line keep the longest one.
This can of cause be much more complex but that's the essence of it.

Related

Is it better additional query or get all data and filter in the client?

I have a query with EF Core in which I would like to include a property and from this property, that it is a ICollection, I would like filter what data to get.
It is something like that:
myDbContext.MyEntity.Where(x => x.ID == 1).Include(x => x.MyCollection.Where(y => y.isEnabled == true));
However, I get an error because it is not possible to filter the included properties.
In fact, the items in the collection will be few, about 6 or 7, so I was thinking that I could include all and later filter the data in the client.
Another option it would be get the the main entity first and in a second query to get the childs that really I need.
I always read that the connections to the database are expensive, so it is better to do as less queries as possible, but also I read that the best practice it is to get only the data that I need and no filter in the client, but it is better filter in the query.
But in this case, with EF Core, it seems that I can't filter in the query, so I would like to know what is better, 2 queries and get only the data that I need or one query getting all the data and filter later in the client.
But in this case, with EF Core, it seems that I can't filter in the query, so I would like to know what is better, 2 queries and get only the data that I need or one query getting all the data and filter later in the client.
Which is longer? One long piece of string, or two shorter pieces of string?
You don't know, because I haven't told you the actual lengths. You don't know if it's a 1m string versus two 5cm strings or a 10cm string vs two 8cm string.
And your question here is the same. It's better to do fewer queries than many, and it's better to do short queries than long queries. When a choice is on only one of those metrics (e.g. the shorter query from doing a simple Where on the database vs a simple Where in memory on all results) then we can make sound a priori judgements about which is likely to be the more efficient, and choose accordingly.
When though we have competing factors in play we have to:
Decide whether we even care: If they're going to still be pretty fast either way it might not be worth worrying about; find bigger fish to fry.
Measure.
Make sure what we are measuring is realistic.
The third point is important as one can often create data sets that would make one come out the victor, and other data sets that would make the other win. We need to make sure we're correctly modelling what is encountered in real life.
When the difference is small, or if they are both fast either way (and/or the use is so rare that it's still not a big deal), then just go for whichever is easier to code and maintain.

How can I deal with slow performance on Contains query in Entity Framework / MS-SQL?

I'm building a proof of concept data analysis app, using C# & Entity Framework. Part of this app is calculating TF*IDF scores, which means getting a count of documents that contain every word.
I have a SQL query (to a remote database with about 2,000 rows) wrapped in a foreach loop:
idf = db.globalsets.Count(t => t.text.Contains("myword"));
Depending on my dataset, this loop would run 50-1,000+ times for a single report. On a sample set where it only has to run about 50 times, it takes nearly a minute, so about 1 second per query. So I'll need much faster performance to continue.
Is 1 second per query slow for an MSSQL contains query on a remote machine?
What paths could be used to dramatically improve that? Should I look at upgrading the web host the database is on? Running the queries async? Running the queries ahead of time and storing the result in a table (I'm assuming a WHERE = query would be much faster than a CONTAINS query?)
You can do much better than full text search in this case, by making use of your local machine to store the idf scores, and writing back to the database once the calculation is complete. There aren't enough words in all the languages of the world for you to run out of RAM:
Create a dictionary Dictionary<string,int> documentFrequency
Load each document in the database in turn, and split into words, then apply stemming. Then, for each distinct stem in the document, add 1 to the value in the documentFrequency dictionary.
Once all documents are processed this way, write the document frequencies back to the database.
Calculating a tf-idf for a given term in a given document can now be done just by:
Loading the document.
Counting the number of instances of the term.
Loading the correct idf score from the idf table in the database.
Doing the tf-idf calculation.
This should be thousands of times faster than your original, and hundreds of times faster than full-text-search.
As others have recommended, I think you should implement that query on the db side. Take a look at this article about SQL Server Full Text Search, that should be the way to solve your problem.
Applying a contains query in a loop extremely bad idea. It kills the performance and database. You should change your approach and I strongly suggest you to create Full Text Search indexes and perform query over it. You can retrieve the matched record texts with your query strings.
select t.Id, t.SampleColumn from containstable(Student,SampleColumn,'word or sampleword') C
inner join table1 t ON C.[KEY] = t.Id
Perform just one query, put the desired words which are searched by using operators (or, and etc.) and retrieve the matched texts. Then you can calculate TF-IDF scores in memory.
Also, still retrieving the texts from SQL Server into in memory might takes long to stream but it is the best option instead of apply N contains query in the loop.

API - filter big list with word fragment

I have asp.net web api application. In database I have a big list (between 100.000 and 200.000) of pairs like id:name and this list could be changed quite rarely. I need to implement filtering like this /pair/filter?fragment=bla. It should return first 25 pairs where any word in name starts with word fragment. I see two approachs here: 1st approach is to load data into cache (HttpRuntimeCache, redis or smth like this) to increase loading time and filter in linq. But I think there will be problems with time required for serialiazing/deserialiazing. Another approach: for instance I have a pair 22:some title here so I need to provide separate table like this:
ID | FRAGMENT
22 | some
22 | title
22 | here
with primary key on both columns and separate index on FRAGMENT column to make queries faster. Any offers and remarks are welcome.
UPD: now I've refreshed my mind. I don't want to query database because requests happen quite often. So now I see the best solution is
load entire list in memory
build trie structure which keeps hashset of values in each node
in case of one text fragment - just return the hashset from trie node, in case of few fragments - find all hashsets and get their intersection
You could try a full-text index on your current DB (if its supported) and the CONTAINS keyword like so
SELECT * FROM tableName WHERE CONTAINS(name, 'bla*');
This will look for words starting with "bla" in the entire string, and also match the string "Monkeys blabla"
I dont really understand your question but if you want to query any table you can do so since you already have the queryString. You can try this out.
var res = _repository.Table.Where(c => c.Name.StartsWith("bla")).Take(25);
If it doesnt help. Try to to restructure your question a little bit.
Is this a case of premature optimization?
How many users will be hitting this service simultaneously? How many will be hitting your database simultaneously? How efficient is your query? How much data will be returned across the wire?
In most cases, you can't outsmart an efficient database for performance. Your row count is too small to create a truly heavy burden on your application's runtime performance when querying. This assumes, of course, that your query is well written and that you're properly opening, closing, and freeing resources in a timely fashion.
Caching the data in memory has its trade-offs that should be considered. It increases the memory footprint of your application, and requires you to write and maintain additional code to maintain that cache. That is by no means prohibitive, but should be considered in view of your overall architecture.
Consider these things carefully. From what I can tell, keeping this data in the database is fine. Deserialization tends to be fast (as most of the data you return is native types), and shouldn't be cost-prohibitive.

SqlDataReader get specific ResultSet

When having a query that returns multiple results, we are iterating among them using the NextResult() of the SqlDataReader. This way we are accessing results sequentially.
Is there any way to access the result in a random / non sequential way. For example jump first to the third result, then to the first e.t.c
I am searching if there is something like rdr.GetResult(1), or a workaround.
Since I was asked Why I want something like this,
First of all I have no access to the query and so I can not changes, so in my client I will have the Results in the sequence that server writes / produces them.
To process (build collections, entities --> business logic) the first I need the Information from both the second and the third one.
Again since it is not an option to modify some of the code, I can not somehow (without writing a lot of code) 'store' the connection info (eg. ids) in order to connect in a later step the two ResultSets
The most 'elegant' solution (for sure not the only one) is to process the result sets in non sequential way. So that is why I am asking if there is such a way.
Update 13/6
While Jeroen Mostert answer gives a thoughtful explanation on why, Think2ceCode1ce answer shows the right directions for a workaround. The content of the link in the comments how in additional dataset could be utilized to work in an async way. IMHO this would be the way to go if was going to write a general solution. However in my case, I based my solution in the nature of my data and the logic behind them. In short terms, (1) I read the data as they come sequentially using the SqlDataReader; (2) I store some of the data I need in a dictionary and a Collection, when I am reading the first in row but second in logic ResultSet; (3) When I am Reading the third in row, but first in logic ResultSet I am iterating in through the collection I built earlier and based on the dictionary data I am building my final result.
The final code seems more efficient and it is more maintainable than using the async DataAdapter. However this is a very specific solution based on my data.
Provides a way of reading a forward-only stream of rows from a SQL
Server database.
https://msdn.microsoft.com/en-us/library/system.data.sqlclient.sqldatareader(v=vs.110).aspx
You need to use DataAdapter for disconnected and non-sequential access. To use this you just have to change bit of your ADO.NET code.
Instead of
SqlDataReader sqlReader = sqlCmd.ExecuteReader();
You need
DataTable dt = new DataTable();
SqlDataAdapter sqlAdapter = new SqlDataAdapter(sqlCmd);
sqlAdapter.Fill(dt);
If your SQL returns multiple result sets, you would use DataSet instead of DataTable, and then access result sets like ds.Tables[index_or_name].
https://msdn.microsoft.com/en-us/library/bh8kx08z(v=vs.110).aspx
No, this is not possible. The reason why is quite elementary: if a batch returns multiple results, it must return them in order -- the statement that returns result set #2 does not run before the one that returns result set #1, nor does the client have any way of saying "please just skip that first statement entirely" (as that could have dire consequences for the batch as a whole). Indeed, there's not even any way in general to tell how many result sets a batch will produce -- all of this is done at runtime, the server doesn't know in advance what will happen.
Since there's no way, server-side, to skip or index result sets, there's no meaningful way to do it client-side either. You're free to ignore the result sets streamed back to you, but you must still process them in order before you can move on -- and once you've moved on, you can't go back.
There are two possible global workarounds:
If you process all data and cache it locally (with a DataAdapter, for example) you can go back and forth in the data as you please, but this requires keeping all data in memory.
If you enable MARS (Multiple Active Result Sets) you can execute another query even as the first one is still processing. This does require splitting up your existing single batch code into individual statements (which, if you really can't change anything about the SQL at all, is not an option), but you could go through result sets at will (without caching). It would still not be possible for you to "go back" within a single result set, though.

ParseObject queries with relative search parameters

I'm rather new to Parse and Cloud Code, and I'm having trouble writing a certain query script.
I have a table of Salespeople, who have two integers : dailySold and dailyQuota.
The dailySold is reset to 0 each day, and the dailyQuota is defined by upper management.
Now, I'd like to make queries that call out bulks of users. Say, all users which dailySold is below their dailyQuota. In MySQL it would just look like this :
select * from salespeople where dailySold < dailyQuota
But in Parse / CloudCode I have been unable to find something like this. Currently, I'm loading all the entries, and going through them one by one, populating a large array clientside. This feels like the absolutely wrong way of doing it.
And the query.WhereNotEqualTo() function (and their siblings) seem to only be able to compare with static queries.
Does anyone know how to put together a query to optimize this ? I need it to go through thousands of records, and its often only 10-20 results I'm interested in. If nothing else, I'll have to make a cloudcode function that iterates for me serverside, but I still feel like there is some function I should be able to use, to make a more lean query.
You can't compare two columns in a query. You can only compare a key with a provided object. If the dailyQuota is set by upper management, I'm assuming this is the same for all salespeople, or for groups of people. I'd suggest first making a query for the daily quota and then either use
whereKey:matchesKey:inQuery
or just fetch the dailyQuota first and then use that value in the second query.

Categories

Resources