C# Reading and Summarizing Text File with LINQ

I've read many different solutions for the separate LINQ functions that, when put together, would solve my issue. My problem is that I'm still trying to wrap my head around how to put LINQ statements together correctly. I can't seem to get the syntax right, or it comes out as a mishmash of info and not quite what I want.
I apologize ahead of time if half of this seems like a duplicate. My question is more specific than just reading the file. I'd like it all to be in the same query.
To the point though..
I am reading in a text file with semi-colon separated columns of data.
An example would be:
US;Fort Worth;TX;Tarrant;76101
US;Fort Worth;TX;Tarrant;76103
US;Fort Worth;TX;Tarrant;76105
US;Burleson;TX;Tarrant;76097
US;Newark;TX;Tarrant;76071
US;Fort Worth;TX;Tarrant;76103
US;Fort Worth;TX;Tarrant;76105
Here is what I have so far:
var items = (from c in (from line in File.ReadAllLines(myFile)
let columns = line.Split(';')
where columns[0] == "US"
select new
{
City = columns[1].Trim(),
State = columns[2].Trim(),
County = columns[3].Trim(),
ZipCode = columns[4].Trim()
})
select c);
That works fine for reading the file. But my issue after that is I don't want the raw data. I want a summary.
Specifically I need the count of the number of occurrences of the City,State combination, and the count of how many times the ZIP code appears.
I'm eventually going to make a tree view out of it.
My goal is to have it laid out somewhat like this:
- Fort Worth,TX (5)
- 76101 (1)
- 76103 (2)
- 76105 (2)
- Burleson,TX (1)
- 76097 (1)
- Newark,TX (1)
- 76071 (1)
I can do the tree part later because there is other processing to do first.
So my question is: How do I combine the counting of the specific values in the query itself? I know of the GroupBy functions and I've seen Aggregates, but I can't get them to work correctly. How do I go about wrapping all of these functions into one query?
EDIT: I think I asked my question the wrong way. I don't mean that I HAVE to do it all in one query... I'm asking IS THERE a clear, concise, and efficient way to do this with LINQ in one query? If not I'll just go back to looping through.
If I can be pointed in the right direction it would be a huge help.
If someone has an easier idea in mind to do all this, please let me know.
I just wanted to avoid iterating through a huge array of values and using Regex.Split on every line.
Let me know if I need to clarify.
Thanks!
*EDIT 6/15***
I figured it out. Thanks to those who answered; it helped, but it was not quite what I needed. As a side note, I ended up changing it all anyway. LINQ was actually slower than doing it other ways that I won't go into, as it's not relevant. As to those who made multiple comments that "it's silly to have it in one query": that's the decision of the designer. Not all "best practices" work in all places; they are guidelines. Believe me, I do want to keep my code clear and understandable, but I also had very specific reasons for doing it the way I did.
I do appreciate the help and direction.
Below is the prototype that I used but later abandoned.
/* Inner LINQ query Reads the Text File and gets all the Locations.
* The outer query summarizes this by getting the sum of the Zips
* and orders by City/State then ZIP */
var items = from Location in(
//Inner Query Start
(from line in File.ReadAllLines(FilePath)
let columns = line.Split(';')
where columns[0] == "US" && !string.IsNullOrEmpty(columns[4])
select new
{
City = (FM.DecodeSLIC(columns[1].Trim()) + " " + columns[2].Trim()),
County = columns[3].Trim(),
ZipCode = columns[4].Trim()
}
))
//Inner Query End
orderby Location.City, Location.ZipCode
group Location by new { Location.City, Location.ZipCode , Location.County} into grp
select new
{
City = grp.Key.City,
County = grp.Key.County,
ZipCode = grp.Key.ZipCode,
ZipCount = grp.Count()
};

The downside of using File.ReadAllLines is that you have to pull the entire file into memory before operating over it. Also, indexing into columns[] is a bit clunky. You might want to consider my article describing the use of DynamicObject and streaming the file as an alternative implementation. The grouping/counting operation is secondary to that discussion.
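As a minimal sketch of the streaming idea (this swaps in File.ReadLines from the framework rather than the DynamicObject approach from the article, which is not reproduced here), the file can be read lazily so that only one line is held at a time. The class name, method name, and sample rows are hypothetical.

```csharp
using System;
using System.IO;
using System.Linq;

public static class StreamingRead
{
    // File.ReadLines yields lines lazily, so the whole file is never held in memory at once.
    public static int CountUsRows(string path)
    {
        return File.ReadLines(path)
            .Select(line => line.Split(';'))
            // The length check guards against short or blank lines.
            .Count(columns => columns.Length >= 5 && columns[0] == "US");
    }

    public static void Main()
    {
        // Hypothetical sample data written to a temp file for the demo.
        string path = Path.GetTempFileName();
        File.WriteAllLines(path, new[]
        {
            "US;Burleson;TX;Tarrant;76097",
            "CA;Toronto;ON;York;M5H"
        });
        Console.WriteLine(CountUsRows(path)); // prints 1
        File.Delete(path);
    }
}
```

Note that a later GroupBy still has to buffer the grouped rows, so streaming mainly helps when most lines are filtered out or reduced to something smaller.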

var items = (from c in
(from line in File.ReadAllLines(myFile)
let columns = line.Split(';')
where columns[0] == "US"
select new
{
City = columns[1].Trim(),
State = columns[2].Trim(),
County = columns[3].Trim(),
ZipCode = columns[4].Trim()
})
select c);
foreach (var i in items.GroupBy(an => an.City + "," + an.State))
{
Console.WriteLine("{0} ({1})",i.Key, i.Count());
foreach (var j in i.GroupBy(an => an.ZipCode))
{
Console.WriteLine(" - {0} ({1})", j.Key, j.Count());
}
}

There is no point in getting everything into one query. It's better to split the queries so that each one is meaningful. Try this on your results:
var grouped = items
    .GroupBy(a => new { a.City, a.State, a.ZipCode })
    .Select(a => new { City = a.Key.City, State = a.Key.State, ZipCode = a.Key.ZipCode, ZipCount = a.Count() })
    .ToList();
EDIT
Here is the one big long query which gives the same output
var itemsGrouped = File.ReadAllLines(myFile)
    .Select(a => a.Split(';'))
    .Where(a => a[0] == "US")
    .Select(a => new { City = a[1].Trim(), State = a[2].Trim(), County = a[3].Trim(), ZipCode = a[4].Trim() })
    .GroupBy(a => new { a.City, a.State, a.ZipCode })
    .Select(a => new { City = a.Key.City, State = a.Key.State, ZipCode = a.Key.ZipCode, ZipCount = a.Count() })
    .ToList();
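For the tree layout in the question, both counts are needed: the total per City,State and the count per ZIP inside it. One way this could be sketched is a nested GroupBy in a single query. The class and method names are hypothetical, and the input is an in-memory array standing in for the file.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ZipSummary
{
    // Groups by City,State, then by ZIP inside each group, counting at both levels.
    public static List<(string Place, int Total, List<(string Zip, int Count)> Zips)> Summarize(
        IEnumerable<string> lines)
    {
        return lines
            .Select(line => line.Split(';'))
            .Where(c => c.Length >= 5 && c[0] == "US")
            .Select(c => new { City = c[1].Trim(), State = c[2].Trim(), ZipCode = c[4].Trim() })
            .GroupBy(x => new { x.City, x.State })
            .Select(g => (
                Place: g.Key.City + "," + g.Key.State,
                Total: g.Count(),
                Zips: g.GroupBy(x => x.ZipCode)
                       .OrderBy(z => z.Key)
                       .Select(z => (Zip: z.Key, Count: z.Count()))
                       .ToList()))
            .ToList();
    }

    public static void Main()
    {
        // The sample rows from the question.
        var lines = new[]
        {
            "US;Fort Worth;TX;Tarrant;76101",
            "US;Fort Worth;TX;Tarrant;76103",
            "US;Fort Worth;TX;Tarrant;76105",
            "US;Burleson;TX;Tarrant;76097",
            "US;Newark;TX;Tarrant;76071",
            "US;Fort Worth;TX;Tarrant;76103",
            "US;Fort Worth;TX;Tarrant;76105"
        };

        foreach (var place in Summarize(lines))
        {
            Console.WriteLine("- {0} ({1})", place.Place, place.Total);
            foreach (var zip in place.Zips)
                Console.WriteLine("    - {0} ({1})", zip.Zip, zip.Count);
        }
    }
}
```

With the sample rows this should produce the layout asked for: Fort Worth,TX (5) with 76101 (1), 76103 (2), 76105 (2), then Burleson,TX (1) and Newark,TX (1). The nested list maps fairly directly onto tree nodes and their children.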

Related

Speeding up a linq query with 40,000 rows

In my service, first I generate 40,000 possible combinations of home and host countries, like so (clientLocations contains 200 records, so 200 x 200 is 40,000):
foreach (var homeLocation in clientLocations)
{
foreach (var hostLocation in clientLocations)
{
allLocationCombinations.Add(new AirShipmentRate
{
HomeCountryId = homeLocation.CountryId,
HomeCountry = homeLocation.CountryName,
HostCountryId = hostLocation.CountryId,
HostCountry = hostLocation.CountryName,
HomeLocationId = homeLocation.LocationId,
HomeLocation = homeLocation.LocationName,
HostLocationId = hostLocation.LocationId,
HostLocation = hostLocation.LocationName,
});
}
}
Then, I run the following query to find existing rates for the locations above, but also include empty entries for the missing rates, resulting in a complete recordset of 40,000 rows.
var allLocationRates = (from l in allLocationCombinations
join r in Db.PaymentRates_AirShipment
on new { home = l.HomeLocationId, host = l.HostLocationId }
equals new { home = r.HomeLocationId, host = (Guid?)r.HostLocationId }
into matches
from rate in matches.DefaultIfEmpty(new PaymentRates_AirShipment
{
Id = Guid.NewGuid()
})
select new AirShipmentRate
{
Id = rate.Id,
HomeCountry = l.HomeCountry,
HomeCountryId = l.HomeCountryId,
HomeLocation = l.HomeLocation,
HomeLocationId = l.HomeLocationId,
HostCountry = l.HostCountry,
HostCountryId = l.HostCountryId,
HostLocation = l.HostLocation,
HostLocationId = l.HostLocationId,
AssigneeAirShipmentPlusInsurance = rate.AssigneeAirShipmentPlusInsurance,
DependentAirShipmentPlusInsurance = rate.DependentAirShipmentPlusInsurance,
SmallContainerPlusInsurance = rate.SmallContainerPlusInsurance,
LargeContainerPlusInsurance = rate.LargeContainerPlusInsurance,
CurrencyId = rate.RateCurrencyId
});
I have tried using .AsEnumerable() and .AsNoTracking() and that has sped things up quite a bit. The following code shaves several seconds off of my query:
var allLocationRates = (from l in allLocationCombinations.AsEnumerable()
join r in Db.PaymentRates_AirShipment.AsNoTracking()
But, I am wondering: How can I speed this up even more?
Edit: Can't replicate foreach functionality in linq.
allLocationCombinations = (from homeLocation in clientLocations
from hostLocation in clientLocations
select new AirShipmentRate
{
HomeCountryId = homeLocation.CountryId,
HomeCountry = homeLocation.CountryName,
HostCountryId = hostLocation.CountryId,
HostCountry = hostLocation.CountryName,
HomeLocationId = homeLocation.LocationId,
HomeLocation = homeLocation.LocationName,
HostLocationId = hostLocation.LocationId,
HostLocation = hostLocation.LocationName
});
I get an error on from hostLocation in clientLocations which says "cannot convert type IEnumerable to Generic.List."
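The error most likely comes from assigning the query to a variable declared as a List: a query expression produces an IEnumerable, not a List, so it needs a ToList() call at the end. A minimal sketch with hypothetical stand-in types (Location and Pair are not the real classes from the question):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified stand-ins for the question's types.
public class Location
{
    public int LocationId { get; set; }
    public string LocationName { get; set; }
}

public class Pair
{
    public int HomeLocationId { get; set; }
    public int HostLocationId { get; set; }
}

public static class CrossJoin
{
    // Two "from" clauses produce every home/host combination (a cross join).
    public static List<Pair> AllCombinations(List<Location> locations)
    {
        List<Pair> pairs = (from home in locations
                            from host in locations
                            select new Pair
                            {
                                HomeLocationId = home.LocationId,
                                HostLocationId = host.LocationId
                            })
                           .ToList(); // materialize the IEnumerable into a List
        return pairs;
    }

    public static void Main()
    {
        var locations = new List<Location>
        {
            new Location { LocationId = 1, LocationName = "A" },
            new Location { LocationId = 2, LocationName = "B" }
        };
        Console.WriteLine(AllCombinations(locations).Count); // prints 4
    }
}
```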
The fastest way to query a database is to use the power of the database engine itself.
While Linq is a fantastic technology to use, it still generates a select statement out of the Linq query, and runs this query against the database.
Your best bet is to create a database View, or a stored procedure.
Views and stored procedures can easily be integrated into Linq.
Materialized views (called indexed views in MS SQL) can further speed up execution, and adding missing indexes is by far the most effective way to speed up database queries.
How can I speed this up even more?
Optimizing is a bitch.
Your code looks fine to me. Make sure to set the index on your DB schema where it's appropriate. And as already mentioned: Run your Linq against SQL to get a better idea of the performance.
Well, but how to improve performance anyway?
You may want to have a glance at the following link:
10 tips to improve LINQ to SQL Performance
To me, probably the most important points listed (in the link above):
Retrieve Only the Number of Records You Need
Turn off ObjectTrackingEnabled Property of Data Context If Not Necessary
Filter Data Down to What You Need Using DataLoadOptions.AssociateWith
Use compiled queries when it's needed (please be careful with that one...)
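Another option that may be worth measuring, since one side of the join already lives in memory: load the existing rates once and look each pair up in a dictionary, instead of joining the 40,000 in-memory rows against the table. The sketch below is an assumption-heavy illustration, not the original schema: Rate, Fill, and the int IDs are hypothetical (the question uses Guid keys), and in the real code the rates would presumably come from something like Db.PaymentRates_AirShipment.AsNoTracking().ToList().

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified stand-in for a rate row.
public class Rate
{
    public int HomeLocationId { get; set; }
    public int HostLocationId { get; set; }
    public decimal Amount { get; set; }
}

public static class RateLookup
{
    // Left-joins every (home, host) pair to an optional rate via an O(1) dictionary lookup.
    public static List<(int Home, int Host, decimal? Amount)> Fill(
        IEnumerable<int> locationIds, IEnumerable<Rate> existingRates)
    {
        var ids = locationIds.ToList();

        // Assumes at most one rate per (home, host) pair; ToDictionary throws on duplicates.
        var byPair = existingRates.ToDictionary(r => (r.HomeLocationId, r.HostLocationId));

        return ids.SelectMany(home => ids.Select(host =>
        {
            Rate rate;
            bool found = byPair.TryGetValue((home, host), out rate);
            // Missing rates stay null instead of being dropped.
            return (Home: home, Host: host, Amount: found ? rate.Amount : (decimal?)null);
        })).ToList();
    }

    public static void Main()
    {
        var rates = new[] { new Rate { HomeLocationId = 1, HostLocationId = 2, Amount = 5m } };
        foreach (var row in Fill(new[] { 1, 2 }, rates))
            Console.WriteLine("{0} -> {1}: {2}", row.Home, row.Host,
                row.Amount.HasValue ? row.Amount.Value.ToString() : "(none)");
    }
}
```

Whether this beats the database-side options above depends on how many rates exist; it is worth profiling both.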

Iterating through datacontext - most efficient way?

I have started using performance wizard in visual studio 2012 because there was a slow method which is basically used to get all users from the datacontext. I fixed the initial problem but I am now curious if I can make it faster.
Currently I am doing this:
public void GetUsers(UserManagerDashboard UserManagerDashboard)
{
try
{
using (GenesisOnlineEnties = new GenesisOnlineEntities())
{
var q = from u in GenesisOnlineEnties.vw_UserManager_Users
select u;
foreach (var user in q)
{
User u = new User();
u.UserID = user.UserId;
u.ApplicationID = user.ApplicationId;
u.UserName = user.UserName;
u.Salutation = user.Salutation;
u.EmailAddress = user.Email;
u.Password = user.Password;
u.FirstName = user.FirstName;
u.Surname = user.LastName;
u.CompanyID = user.CompanyId;
u.CompanyName = user.CompanyName;
u.GroupID = user.GroupId;
u.GroupName = user.GroupName;
u.IsActive = user.IsActive;
u.RowType = user.UserType;
u.MaximumConcurrentUsers = user.MaxConcurrentUsers;
u.ModuleID = user.ModuleId;
u.ModuleName = user.ModuleName;
UserManagerDashboard.GridUsers.users.Add(u);
}
}
}
catch (Exception ex)
{
}
}
It's a very straightforward method: connect to the database using Entity Framework, get all users from the view "vw_usermanager_users", and populate the object, which is part of a collection.
I was casting int? to int, and I changed the property to int? so no cast is needed. I know that it is going to take longer because I am looping through records, but is it possible to speed this query up?
Ok, first things first, what does your vw_UserManager_Users object look like? If any of those properties you're referencing are navigational properties:-
public partial class UserManager_User
{
public string GroupName { get { return this.Group.Name; } }
// See how the getter traverses across the "Group" relationship
// to get the name?
}
then you're likely running face-first into this issue - basically you'll be querying your database once for the list of users, and then once (or more) for each user to load the relationships. Some people, when faced with a problem, think "I know, I'll use an O/RM". Now they have N+1 problems.
You're better to use query projection:-
var q = from u in GenesisOnlineEnties.vw_UserManager_Users
select new User()
{
UserID = u.UserId,
ApplicationID = u.ApplicationId,
GroupName = u.Group.Name, // Does the join on the database instead
...
};
That way, the data is already in the right shape, and you only send the columns you actually need across the wire.
If you want to get fancy, you can use AutoMapper to do the query projection for you; saves on some verbosity - especially if you're doing the projection in multiple places:-
var q = GenesisOnlineEnties.vw_UserManager_Users.Project().To<User>();
Next up, what grid are you using? Can you use databinding (or simply replace the Grid's collection) rather than populating it one-by-one with the results from your query?:-
UserManagerDashboard.GridUsers.users = q.ToList();
or:-
UserManagerDashboard.GridUsers.DataSource = q.ToList();
or maybe:-
UserManagerDashboard.GridUsers = new MyGrid(q.ToList());
The way you're adding the users to the grid right now is like moving sand from one bucket to another one grain at a time. If you're making a desktop app it's even worse, because adding an item to the grid will probably trigger a redraw of the UI (i.e. one grain at a time, while describing every grain in the bucket to your buddy after each one). Either way you're doing unnecessary work; see what methods your grid gives you to avoid this.
How many users are in the table? If the number is very large, then you'll want to page your results. Make sure that the paging happens on the database rather than after you've got all the data - otherwise it kind of defeats the purpose:-
q = q.Skip(index).Take(pageSize);
though bear in mind that some grids interact with IQueryable to do paging out-of-the-box, in that case you'd just pass q to the grid directly.
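One caveat with Skip/Take: as far as I know, Entity Framework refuses to translate Skip on an unordered query, and even where it is allowed the page contents are not deterministic without an ordering. A minimal sketch of a paging helper that always orders first; GetPage and its parameters are hypothetical names, and an in-memory IQueryable stands in for the database here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Linq.Expressions;

public static class Paging
{
    // Orders first, then skips whole pages; pageIndex is zero-based.
    public static List<T> GetPage<T, TKey>(
        IQueryable<T> source,
        Expression<Func<T, TKey>> orderKey,
        int pageIndex,
        int pageSize)
    {
        return source
            .OrderBy(orderKey)
            .Skip(pageIndex * pageSize)
            .Take(pageSize)
            .ToList();
    }

    public static void Main()
    {
        // In-memory stand-in for a table of ten rows.
        IQueryable<int> numbers = Enumerable.Range(1, 10).AsQueryable();
        List<int> secondPage = GetPage(numbers, n => n, 1, 3);
        Console.WriteLine(string.Join(",", secondPage)); // prints 4,5,6
    }
}
```

Because the helper takes an IQueryable and an expression, the ordering and paging can still be translated to SQL by a query provider instead of running after all rows are loaded.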
Those are the obvious ones. If that doesn't fix your problem, post more code and I'll take a deeper look.
Yes, by turning off change tracking:
var q = from u in GenesisOnlineEnties.vw_UserManager_Users.AsNoTracking()
select u;
Unless you are using all the properties on the entity you can also select only the columns you want.
var q = from u in GenesisOnlineEnties.vw_UserManager_Users.AsNoTracking()
select new User
{
UserID = u.UserId,
...
};

Querying 2 datatables in a dataset

I have 2 datatables named 'dst' and 'dst2'. They are located in the dataset 'urenmat'.
The majority of the data is in 'dst'. It contains a column named 'werknemer', whose value corresponds to a certain row in 'dst2'. The matching column in 'dst2' is named 'nummer'.
What I need is a way to left outer join both datatables where dst.werknemer and dst2.nummer are linked, creating a new datatable that contains 'dst2.naam' linked to 'dst.werknemer' along with all the other columns from 'dst'.
I have looked everywhere and still can't seem to find the right answer to my question. Several sites provide a way to do this using LINQ. I have tried using LINQ, but I am not very skilled at it.
I tried using the 101 LINQ Samples:
http://code.msdn.microsoft.com/101-LINQ-Samples-3fb9811b
urenmat = dataset.
dst = a, b, c, d, werknemer.
dst2 = nummer, naam.
I used the following code from '101'.
var query =
from contact in dst.AsEnumerable()
join order in dst2.AsEnumerable()
on contact.Field<string>("werknemer") equals
order.Field<string>("nummer")
select new
{
a = order.Field<string>("a"),
b = order.Field<string>("b"),
c = order.Field<string>("c"),
d = order.Field<string>("d"),
naam = contact.Field<decimal>("naam")};
However, I don't know what to change 'contact' and 'order' to, and I can't seem to find out how to save the result to a datatable again.
I am very sorry if these are stupid questions; I have tried to solve it myself, but it appears I'm stupid :P. Thanks for the help in advance!
PS. I am using C#, and the dataset and datatables are typed.
If you want to produce a projected result set of dst left outer joined to dst2, you can use this LINQ expression (sorry, I don't really work in LINQ query syntax, so you'll have to use this lambda syntax instead).
var query = dst.AsEnumerable()
.GroupJoin(dst2.AsEnumerable(), x => x.Field<string>("werknemer"), x => x.Field<string>("nummer"), (contact, orders) => new { contact, orders })
.SelectMany(x => x.orders.DefaultIfEmpty(), (x, order) => new
{
// a, b, c and d are columns of dst, so read them from the dst row
a = x.contact.Field<string>("a"),
b = x.contact.Field<string>("b"),
c = x.contact.Field<string>("c"),
d = x.contact.Field<string>("d"),
// order is null when no dst2 row matches (left outer join);
// decimal is the type used in the question, adjust it to the real column type
naam = order == null ? (decimal?)null : order.Field<decimal>("naam")
});
Because this is a projected result set, you cannot simply save it back to the datatable. If saving is desired, you would load the affected row, update the desired fields, then accept the changes.
// untyped (requires a primary key on the table)
var row = dst.Rows.Find(keyValue);
// typed alternative
var row = dst.FindBy...(keyValue);
// update the field
row.SetField("a", "some value");
// commit only this row's pending changes (in memory; nothing is written to a database)
row.AcceptChanges();
// or, after all changes to the table have been made, commit the whole table
dst.AcceptChanges();
Normally, if you need to load and save (projected) data, an ORM (like Entity Framework or LINQ-to-SQL) would be the best tool. However, you are using DataTables in this case, and I'm not sure if you can link an ORM to these (though it seems like it would probably be possible).
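Since the question asks for the joined result as a new datatable, here is one way it could be sketched: clone the schema of dst, add a naam column, and fill the new table row by row from a left outer join. This assumes werknemer and nummer are both string columns (as in the code above); the class name, method name, and sample data are hypothetical. On .NET Framework, AsEnumerable and Field need a reference to System.Data.DataSetExtensions.

```csharp
using System;
using System.Data;
using System.Linq;

public static class LeftJoinTables
{
    // Builds a new table: every column of dst plus "naam" looked up from dst2.
    public static DataTable Join(DataTable dst, DataTable dst2)
    {
        DataTable result = dst.Clone(); // copies the schema only, no rows
        result.Columns.Add("naam", dst2.Columns["naam"].DataType);

        var pairs = from left in dst.AsEnumerable()
                    join right in dst2.AsEnumerable()
                        on left.Field<string>("werknemer") equals right.Field<string>("nummer")
                        into matches
                    from match in matches.DefaultIfEmpty() // null when nothing matches
                    select new { left, match };

        foreach (var pair in pairs)
        {
            DataRow row = result.NewRow();
            foreach (DataColumn col in dst.Columns)
                row[col.ColumnName] = pair.left[col];
            row["naam"] = pair.match == null ? (object)DBNull.Value : pair.match["naam"];
            result.Rows.Add(row);
        }
        return result;
    }

    public static void Main()
    {
        var dst = new DataTable("dst");
        dst.Columns.Add("a", typeof(string));
        dst.Columns.Add("werknemer", typeof(string));
        dst.Rows.Add("first", "1");
        dst.Rows.Add("second", "9"); // no matching nummer in dst2

        var dst2 = new DataTable("dst2");
        dst2.Columns.Add("nummer", typeof(string));
        dst2.Columns.Add("naam", typeof(string));
        dst2.Rows.Add("1", "Jan");

        foreach (DataRow row in Join(dst, dst2).Rows)
            Console.WriteLine("{0} {1} {2}", row["a"], row["werknemer"],
                row.IsNull("naam") ? "(no match)" : row["naam"]);
    }
}
```

Rows of dst without a match keep all their own values and get DBNull in naam, which is the left outer join behaviour asked for.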

Linq-to-SQL select many columns

I'm using Linq-to-SQL and I just started to learn some of the basics. I have a problem selecting many columns from many tables. I store the selected songs in the session (it contains the song IDs) and want to display songname, artistname, and genrename in a data grid.
But it's not working.
ArrayList SelectedSongs = (ArrayList)Session["SelectedSongs"];
string songIds = "";
foreach (int id in SelectedSongs)
songIds += id + ", ";
var query = from s in sa.Songs
from ar in sa.Artists
from g in sa.Genres
where s.SongID in (songIds)
select new { s.SongID, s.SongName, ar.ArtistName, g.GenreName };
dgSongs.DataSource = query;
Can anyone help me solve this problem.
Thanks.
This syntax is not correct Linq:
where s.SongID in (songIds)
The Linq equivalent of SQL's WHERE IN is to use Contains(). You have to turn the statement around and start with the list:
where songIds.Contains(s.SongID)
When using Linq-to-SQL you should generally prefer navigation properties to explicit joins. If you have proper foreign keys between your tables, those properties are created automatically. With navigation properties and songIDs changed into an int[], your query should be something like this:
int[] songIDs = ((ArrayList)Session["SelectedSongs"]).OfType<int>().ToArray();
var query = from s in sa.Songs
where songIDs.Contains(s.SongID)
select new
{
s.SongID,
s.SongName,
s.Artist.ArtistName,
s.Genre.GenreName
};
Seems like you're trying to Join multiple tables. I would recommend to take a look at the Join section of this page. Good luck!
I believe you want songIds to be an int[] instead of a csv of ids.

Error from use of C# Linq SQL CONCAT

I have the following three tables, and need to bring in information from two dissimilar tables.
Table baTable has fields OrderNumber and Position.
Table accessTable has fields OrderNumber and ProcessSequence (among others)
Table historyTable has fields OrderNumber and Time (among others).
var progress = from ba in baTable
from ac in accessTable
where ac.OrderNumber == ba.OrderNumber
select new {
Position = ba.Position.ToString(),
Time = "",
Seq = ac.ProcessSequence.ToString()
};
progress = progress.Concat(from ba in baTable
from hs in historyTable
where hs.OrderNumber == ba.OrderNumber
select new {
Position = ba.Position.ToString(),
Time = String.Format("{0:hh:mm:ss}", hs.Time),
Seq = ""
});
int searchRecs = progress.Count();
The query compiles successfully, but when the SQL executes during the call to Count(), I get an error
All queries combined using a UNION, INTERSECT or EXCEPT operator must have an equal number of expressions in their target lists.
Clearly the two lists each have three items, one of which is a constant. Other help boards suggested that the Visual Studio 2010 C# compiler was optimizing out the constants, and I have experimented with alternatives to the constants.
The most surprising thing is that, if the Time= entry within the select new {...} is commented out in both of the sub-queries, no error occurs when the SQL executes.
I actually think the problem is that Sql won't recognize your String.Format(..) method.
Change your second query to:
progress = progress.Concat(from ba in baTable
from hs in historyTable
where hs.OrderNumber == ba.OrderNumber
select new {
Position = ba.Position.ToString(),
Time = hs.Time.ToString(),
Seq = ""
});
After that you could always loop through the progress results and format the Time to your needs.
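A rough sketch of that last step, assuming Time is a nullable DateTime (the question does not say) and that the raw values have already been pulled into memory, for example by selecting hs.Time unformatted and calling AsEnumerable() before formatting. FormatTimes is a hypothetical helper name.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

public static class TimeFormatting
{
    // Formats values that are already in memory, so no SQL translation is involved.
    public static List<string> FormatTimes(IEnumerable<DateTime?> rawTimes)
    {
        return rawTimes
            .Select(t => t.HasValue
                ? t.Value.ToString("hh:mm:ss", CultureInfo.InvariantCulture)
                : "") // rows without a time keep the empty string from the question
            .ToList();
    }

    public static void Main()
    {
        var raw = new DateTime?[] { new DateTime(2024, 1, 1, 14, 5, 9), null };
        foreach (string s in FormatTimes(raw))
            Console.WriteLine("[" + s + "]");
    }
}
```

The "hh" specifier is the 12-hour clock, as in the question's format string; "HH" would give 24-hour output.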
