Inserting datafiles into a SQL Server database (no separators) - c#

I've tried researching this question, but most of the answers are for .csv files, which doesn't help me much.
I have a couple of large .dat files containing quite a lot of data (each file is around 700MB), and I am trying to develop software in C# that will let me search for a specific string and locate the line where it occurs (duplicates will occur, so a listview / listbox might be a good idea).
Every line follows the exact same data format and the starting index/length of each datatype is well documented.
Example:
Line 1: ZATIXIZ20SWEDENSTACKOVERFLOWCHROME
Documented like this:
Username : 0-6
Age      : 7-8
Country  : 9-14
Website  : 15-27
Browser  : 28-33
My guess is that the best approach would be to do some kind of BULK INSERT on the data files into a database and then index it for faster searching later on. I am not quite sure how to do this though, nor what the best approach would be. (It also needs to search through all of the files so maybe it could be a good idea to insert them all into the same table?)
So far I have only tried to read one of the files into memory and then run a simple Regex over it, which of course was not a good idea. Unfortunately I am a bit inexperienced with SQL queries, which is why I have not tried much yet.
Thanks in advance!

Insert all data of the same type into a table with indexed columns.
If the properties vary between each file, use multiple tables.
If you want to be able to trace the match back to the original file, use a table with columns:
Key - Internal key from a sequence
FileName - So you know where it came from
Line - The line number
Username
Age
Country
Website
Browser
Where FileName, Line is a unique key.
Here is a link to an article on full-text searching in MSSQL, as we don't know which RDBMS you are using: http://msdn.microsoft.com/en-us/library/ms142571.aspx#queries
From your example, the line 'ZATIXIZ20SWEDENSTACKOVERFLOWCHROME' becomes:
Key | FileName    | Line | Username  | Age | CountryKey | Website         | BrowserKey
1   | 'Data1.dat' | 1    | 'ZATIXIZ' | 20  | 46         | 'STACKOVERFLOW' | 4
In this example, you'd need two more tables: Countries and Browsers. These are optional, as you could just include the information directly in the main table.
I must stress, though, that it really depends on how you wish to query this data. The above structure gives you the opportunity to search for all Swedish users between 20 and 25 by performing the following query:
select * from TABLENAME where Age < 25 and Age >= 20 and CountryKey = 46
In regards to how you import a fixed-width file, it depends greatly on your RDBMS. If you're using Oracle, you can use SQL*Loader. Remember that it does not necessarily have to be a single-stage process. You can load the data into the tables and then look up the keys internally after the initial import.
For MSSQL, here is another answer from the stack: Bulk insert fixed width fields
You can also preprocess it in .NET. Again, it depends on your scenario. If you are piping these into your system at a rate of one 900MB file every 10 minutes, you're looking at some serious optimization of the bulk load process (and some serious hardware). But if you only need to load this file once a month, the .NET approach is absolutely fine, even though it may take a few hours.
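If you go the .NET preprocessing route, a minimal sketch of the idea is below, assuming a destination table dbo.Records with the columns suggested above; the file name and connection string are placeholders, and for files this size you would flush in batches rather than build one giant DataTable. It parses each fixed-width line with the documented offsets and pushes the rows to SQL Server with SqlBulkCopy.

using System.Data;
using System.Data.SqlClient;
using System.IO;

class FixedWidthLoader
{
    static void Main()
    {
        // In-memory staging table matching the assumed dbo.Records layout.
        var table = new DataTable();
        table.Columns.Add("FileName", typeof(string));
        table.Columns.Add("Line", typeof(int));
        table.Columns.Add("Username", typeof(string));
        table.Columns.Add("Age", typeof(int));
        table.Columns.Add("Country", typeof(string));
        table.Columns.Add("Website", typeof(string));
        table.Columns.Add("Browser", typeof(string));

        string file = "Data1.dat";          // placeholder file name
        int lineNumber = 0;
        foreach (string line in File.ReadLines(file))
        {
            lineNumber++;
            // Offsets/lengths taken from the documented layout in the question.
            table.Rows.Add(
                file,
                lineNumber,
                line.Substring(0, 7).Trim(),     // Username: 0-6
                int.Parse(line.Substring(7, 2)), // Age: 7-8
                line.Substring(9, 6).Trim(),     // Country: 9-14
                line.Substring(15, 13).Trim(),   // Website: 15-27
                line.Substring(28, 6).Trim());   // Browser: 28-33
        }

        using (var conn = new SqlConnection("connection string here"))
        {
            conn.Open();
            using (var bulk = new SqlBulkCopy(conn) { DestinationTableName = "dbo.Records" })
            {
                bulk.WriteToServer(table);
            }
        }
    }
}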

API - filter big list with word fragment

I have an ASP.NET Web API application. In the database I have a big list (between 100,000 and 200,000) of pairs like id:name, and this list changes quite rarely. I need to implement filtering like this: /pair/filter?fragment=bla. It should return the first 25 pairs where any word in the name starts with the word fragment. I see two approaches here. The first approach is to load the data into a cache (HttpRuntimeCache, Redis or something like this) to improve load times and filter in LINQ. But I think there will be problems with the time required for serializing/deserializing. Another approach: for instance I have a pair 22:some title here, so I would provide a separate table like this:
ID | FRAGMENT
22 | some
22 | title
22 | here
with a primary key on both columns and a separate index on the FRAGMENT column to make queries faster. Any suggestions and remarks are welcome.
UPD: I've now reconsidered. I don't want to query the database because requests happen quite often. So now I see the best solution as (a rough sketch follows below):
load the entire list into memory
build a trie structure which keeps a hashset of values in each node
in the case of one text fragment, just return the hashset from the trie node; in the case of a few fragments, find all the hashsets and take their intersection
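A minimal sketch of that trie idea, assuming the pairs are id:name and that matching is case-insensitive; the class and member names are made up for illustration.

using System.Collections.Generic;
using System.Linq;

class TrieNode
{
    public Dictionary<char, TrieNode> Children = new Dictionary<char, TrieNode>();
    public HashSet<int> Ids = new HashSet<int>();
}

class PairIndex
{
    private readonly TrieNode _root = new TrieNode();

    // Index every word of the name; each prefix node collects the pair id.
    public void Add(int id, string name)
    {
        foreach (var word in name.ToLowerInvariant().Split(' '))
        {
            var node = _root;
            foreach (var ch in word)
            {
                TrieNode next;
                if (!node.Children.TryGetValue(ch, out next))
                {
                    next = new TrieNode();
                    node.Children[ch] = next;
                }
                node = next;
                node.Ids.Add(id);
            }
        }
    }

    // Walk the fragment character by character and return the ids stored at that node.
    public IEnumerable<int> Find(string fragment)
    {
        var node = _root;
        foreach (var ch in fragment.ToLowerInvariant())
        {
            if (!node.Children.TryGetValue(ch, out node))
                return Enumerable.Empty<int>();
        }
        return node.Ids;
    }
}

For several fragments you would call Find for each one, intersect the resulting sets (for example with HashSet<int>.IntersectWith), and then take the first 25 surviving ids.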
You could try a full-text index on your current DB (if it's supported) and the CONTAINS keyword, like so:
SELECT * FROM tableName WHERE CONTAINS(name, 'bla*');
This will look for words starting with "bla" in the entire string, and also match the string "Monkeys blabla"
I don't really understand your question, but if you want to query any table you can do so since you already have the query string. You can try this out.
var res = _repository.Table.Where(c => c.Name.StartsWith("bla")).Take(25);
If it doesn't help, try to restructure your question a little bit.
Is this a case of premature optimization?
How many users will be hitting this service simultaneously? How many will be hitting your database simultaneously? How efficient is your query? How much data will be returned across the wire?
In most cases, you can't outsmart an efficient database for performance. Your row count is too small to create a truly heavy burden on your application's runtime performance when querying. This assumes, of course, that your query is well written and that you're properly opening, closing, and freeing resources in a timely fashion.
Caching the data in memory has its trade-offs that should be considered. It increases the memory footprint of your application, and requires you to write and maintain additional code to maintain that cache. That is by no means prohibitive, but should be considered in view of your overall architecture.
Consider these things carefully. From what I can tell, keeping this data in the database is fine. Deserialization tends to be fast (as most of the data you return is native types), and shouldn't be cost-prohibitive.

NHibernate matching property as a substring

I'm trying to find the "best" way to match, for example, politicians' names in RSS articles. The names will be stored in a database accessed with NHibernate. As an example:
Id Name
--- ---------------
1 David Cameron
2 George Osborne
3 Alistair Darling
At the time of writing, the BBC politics news RSS feed has an item with the description
Backbench Conservative MPs put pressure on Chancellor George Osborne to stop rail firms in England increasing commuter fares by up to 11%.
For this article, I would like to detect that George Osborne is mentioned. I realise that there are several ways that this could be done, e.g. selecting all the politicians' names first, and comparing them in code, or doing the NHibernate equivalent of a LIKE.
The application itself would have a few dozen feeds, which will be queried at most every 15 minutes. Obviously there are speed, memory and scaling concerns, so I would like to ask for a recommended approach (and NHibernate query if relevant).
As we were discussing in the comments, I believe there is a simpler approach to this problem:
Keep a list of the politicians in memory. Because these entities won't be updated often, it's safe to work like this. Just implement some expiration logic to refresh it from the database sooner or later.
For each downloaded feed entry, simply loop over the politicians and check FeedEntry.Content.Contains(Name) (or something like it) before saving the entry to the database, as sketched below.
There you go, no complex query needed and less I/O for your solution.
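A rough sketch of that approach, with placeholder Politician and FeedEntry types and an hourly cache expiration picked arbitrarily; the NHibernate query in the comment is just one way to load the list.

using System;
using System.Collections.Generic;
using System.Linq;

class Politician { public int Id; public string Name; }
class FeedEntry { public string Content; }

class PoliticianMatcher
{
    private List<Politician> _cache;
    private DateTime _loadedAt;

    // Refresh the cached list from the database once it is older than an hour.
    private List<Politician> GetPoliticians(Func<IEnumerable<Politician>> loadFromDb)
    {
        if (_cache == null || DateTime.UtcNow - _loadedAt > TimeSpan.FromHours(1))
        {
            _cache = loadFromDb().ToList();   // e.g. session.Query<Politician>().ToList() with NHibernate
            _loadedAt = DateTime.UtcNow;
        }
        return _cache;
    }

    // Return every politician whose name appears in the feed entry's text.
    public IEnumerable<Politician> Mentioned(FeedEntry entry, Func<IEnumerable<Politician>> loadFromDb)
    {
        return GetPoliticians(loadFromDb)
            .Where(p => entry.Content.IndexOf(p.Name, StringComparison.OrdinalIgnoreCase) >= 0);
    }
}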
Along similar lines, I would use either a regex or a Contains to get the politicians that match the feed. The politician names and ids can be a simple collection in memory.
Then the feed can be saved in memcached or Redis (even a db would do) with a guid. Then save the associated guid in a table that holds politician_id, feed_guid.
For some statistics you can also have a table which is an aggregate of politician_id, num_articles_mentioned where the num_articles_mentioned is incremented by 1.
You can wrap the above in a transaction if needed.

Taking duplicate lines and squeezing them into 1 row on SQL Server 2008 R2?

I'm trying to write a stored procedure in SQL Server that will eliminate some logic in my C# program. Right now the query lives in a view, and then I'm building a list from that view.
List<MyView> listOrdered = new List<MyView>();
Here's where it gets hairy. The query returns rows that are duplicates. I don't want to delete the duplicate rows; I want to combine them into one row. The rows are identical except for one column.
Example:
UID Name Age Child
1 John 50 Sally
1 John 50 Steve
2 Joseph 42 Timmy
2 Joseph 42 Billy
So what I'm doing in C# is writing logic that says: (pseudo code)
foreach (item in list)
{
    if (UID != UIDCurrent)
    {
        Build row
        Append row to list
    }
    else
    {
        Append Child column to current Child column
    }
}
Basically it gives me:
UID Name Age Children
1 John 50 Sally, Steve
But instead of doing this logic in C#, I would like to do it in a stored procedure. So how can I get SQL Server to combine the Child column for each UID into one row instead of multiple rows?
If you need anything else to help you help me I will respond.
Oh, believe me, I don't want to do it this way either. The database I'm using is huge and complex, and doing this with C# was sensible and works, but I've been asked to turn my C# function into a stored procedure. I just want to see if this is even possible.
This demonstrates a poor table design. Fix it at the root and then you don't have this silly logic in either your db or C# code.
instead of
people(UID, Name, Age, Child)
try
people(UID, Name, DateOfBirth)
children(Parent references people.UID, child references people.UID)
You can leave age instead of moving to date of birth but it's really a much better idea to do it this way.
This should work if you want to do this in the database, though you should think about doing it in the front-end.
SELECT DISTINCT
       a.UID,
       a.Name,
       a.Age,
       STUFF(
         (SELECT ',' + b.Child AS [text()]
          FROM parentChildren b
          WHERE a.UID = b.UID
          FOR XML PATH('')), 1, 1, '') AS [ChildConcat]
FROM parentChildren a
This is 100%, without a doubt something to be handled in your application code, NOT in the database!
SQL performs fairly poorly in string manipulation operations, especially compared to iterative languages like C#. You want your database to pass you the actual data, and how you display it should be handled in the application layer.
Any attempt to solve this in SQL will be slower and harder to maintain than a version in your application code.
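If you do keep this in the application layer, a small sketch of combining the duplicate view rows with LINQ is below; the MyView properties mirror the example columns in the question.

using System;
using System.Collections.Generic;
using System.Linq;

class MyView { public int UID; public string Name; public int Age; public string Child; }

class Program
{
    static void Main()
    {
        // Sample data matching the duplicate rows from the question.
        var listOrdered = new List<MyView>
        {
            new MyView { UID = 1, Name = "John",   Age = 50, Child = "Sally" },
            new MyView { UID = 1, Name = "John",   Age = 50, Child = "Steve" },
            new MyView { UID = 2, Name = "Joseph", Age = 42, Child = "Timmy" },
            new MyView { UID = 2, Name = "Joseph", Age = 42, Child = "Billy" },
        };

        // Group on the columns that are identical and join the Child values.
        var combined = listOrdered
            .GroupBy(v => new { v.UID, v.Name, v.Age })
            .Select(g => new
            {
                g.Key.UID,
                g.Key.Name,
                g.Key.Age,
                Children = string.Join(", ", g.Select(v => v.Child))
            });

        foreach (var row in combined)
            Console.WriteLine("{0} {1} {2} {3}", row.UID, row.Name, row.Age, row.Children);
    }
}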

How do I programmatically verify, create, and update SQL table structure?

Scenario:
I have an application (C#) that expects a SQL database and login, which are set by a user. Once connected, it checks for the existence of several tables and creates them if not found.
I'd like to expand on this by having the program be capable of adding columns to those tables if I release a new version of the program which relies upon the new columns.
Question:
What is the best way to programmatically check the structure of an existing SQL table and create or update it to match an expected structure?
I am planning to iterate through the list of required columns and alter the existing table whenever it does not contain the new column. I can't help but wonder if there's an approach that is different or better.
Criteria:
Here are some of my expectations and self-imposed rules:
Newer versions of the program might no longer use certain columns, but they would be retained for data logging purposes. In other words, no columns will be removed.
Existing data in the table must be preserved, so the table cannot simply be dropped and recreated.
In all cases, newly added columns would allow null data, so the population of old records is taken care of by having default null values.
Example:
Here is a sample table (because visual examples help!):
id datetime sensor_name sensor_status x1 x2 x3 x4
1 20100513T151907 na019 OK 0.01 0.21 1.41 1.22
2 20100513T152907 na019 OK 0.02 0.23 1.45 1.52
Then, in a new version, I may want to add the column x5. The "x-columns" are all data-storage columns that accept null.
Edit:
I updated the sample table above. It is more of a log and not a parent table. So the sensors will repeatedly show up in this logging table with the values logged. A separate parent table contains the geographic and other logistical information about the sensor, making the table I wish to modify a child table.
This is a very troublesome feature that you're thinking about implementing. I would advise against it and instead consider scripting changes using a 3rd party tool such as Red Gate's SQL Compare: http://www.red-gate.com/products/SQL_Compare/index.htm
If you're in doubt, consider downloading the trial version of the software and performing a structure diff script on two databases with some non-trivial differences. You'll see from the result that the considerations for such operations are far from simple.
The other way around this type of issue is to redesign your database using the EAV model: http://en.wikipedia.org/wiki/Entity-attribute-value_model (it pivots attributes into rows, so you add rows instead of ever changing the structure; it has its own issues but it's very flexible).
(To utilize a diff tool you would have to keep a copy of all of your db versions and create diff scripts which would go out and get executed with new releases and upgrades. That's a huge mess of its own to maintain. EAV is the way for a thing like this. It wrongfully gets a lot of flak for not being as performant as a traditional db structure, but I've used it a number of times with great success. In fact, I have a HIPAA-compliant EAV db (SQL Server 2000) that's been in production for over six years, with several of the EAV tables containing tens of millions of rows, and it's still going strong with no big slowdown. Of course we don't do heavy reporting against that db. For reports we have an export that flattens the data into a relational structure.)
The common solution I see would be to store version information somewhere in your database. Maybe have a really small table:
CREATE TABLE DB_PROPERTIES ([key] varchar(100), [value] varchar(100));
then you could add a row:
key | value
version | 12
Then you could just create a SQL update script (or set of scripts) which updates the db from version 12 to version 13.
declare @v varchar(100)
select @v = [value] from DB_PROPERTIES where [key] = 'version'
if @v = '12'
    -- do upgrade from 12 to 13
else if @v = '11'
    -- do upgrade from 11 to 13
...and so on
Depending on what upgrade paths you want to support, you could add more cases. You could also move this upgrade logic into C#, or whatever design works for you (a rough sketch follows below). But having the db version information stored in the database will make it much easier to figure out what is already there, rather than querying for all the db structures individually.
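For illustration, a minimal C# sketch of that version-check-and-upgrade idea; the sensor_log table, the x5 column and the connection string are placeholders, and real upgrade steps would live in their own scripts.

using System.Data.SqlClient;

class DbUpgrader
{
    // Read the stored schema version and apply the matching upgrade steps.
    public static void Upgrade(string connectionString)
    {
        using (var conn = new SqlConnection(connectionString))
        {
            conn.Open();

            string version;
            using (var cmd = new SqlCommand(
                "SELECT [value] FROM DB_PROPERTIES WHERE [key] = 'version'", conn))
            {
                version = (string)cmd.ExecuteScalar();
            }

            if (version == "12")
            {
                // Hypothetical upgrade step for 12 -> 13.
                using (var cmd = new SqlCommand(
                    "ALTER TABLE sensor_log ADD x5 float NULL", conn))
                {
                    cmd.ExecuteNonQuery();
                }
                using (var cmd = new SqlCommand(
                    "UPDATE DB_PROPERTIES SET [value] = '13' WHERE [key] = 'version'", conn))
                {
                    cmd.ExecuteNonQuery();
                }
            }
            // ...more cases for other starting versions
        }
    }
}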
If you have to build something in such a way as to rely on the application making table changes, your design is flawed. You should have a related table for the sensor values (x1, x2, etc.), then you can just add another record rather than having to create a new column.
Suggested child table structure
READINGS
ID int
Reading_type varchar (10)
Reading_Value int
Then data in the table would read:
ID Reading_type Reading_value
1 x1 2
1 x2 3
1 x3 1
2 x1 7
Try Microsoft.SqlServer.Management.Smo
These are a set of C# classes that provide an API to SQL Server database objects.
The Microsoft.SqlServer.Management.Smo.Table has a Columns Collection that will allow you to query and manipulate the columns.
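For example, here is a hedged sketch of checking for a missing nullable column and adding it with SMO; the server, database, table and column names are placeholders.

using Microsoft.SqlServer.Management.Common;
using Microsoft.SqlServer.Management.Smo;

class SchemaUpdater
{
    static void EnsureColumn()
    {
        // Connect to a placeholder server and locate the table to check.
        var server = new Server(new ServerConnection("localhost"));
        Database db = server.Databases["SensorDb"];
        Table table = db.Tables["sensor_log"];

        if (!table.Columns.Contains("x5"))
        {
            // Add the new data-storage column as nullable, per the criteria above.
            var column = new Column(table, "x5", DataType.Float)
            {
                Nullable = true
            };
            table.Columns.Add(column);
            table.Alter();   // issues the ALTER TABLE for you
        }
    }
}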
Have fun.

C#: Reading data from an xls document

I am currently working on a project for traversing an excel document and inserting data into a database using C#.
The relevant data for this project is:
The Excel sheet has 14 rows at the top that I do not care about (sometimes 15; see Russia/Siberia below).
The data is grouped by name into 2 columns (date and value), such as:
Sheet 1
USA China Russia
Date Value Date Value Siberia
1/1/09 4.3654 1/1/09 2.7456 Date Value
1/2/09 3.5545 1/3/09 9.3214 2/5/09 0.2454
1/3/09 3.2322 1/21/09 5.2234 2/6/09 0.5557
The name I need to acquire is whichever is listed directly above "Date".
I only care about data from dates we do not have in the database. Before each column set is parsed, I will acquire the max date for any given name from the database, and skip anything at or before it.
There is no guarantee that the columns will be in a constant order or have constant spacing.
I do not want data for all names, rather only those in a list I put together before the file is acquired.
My current plan is this:
For each column, if the date field is at row 16, save the name as the value in row 15 above it, check the database for the last date for that name, only insert data where the date is greater than the acquired date.
If the date field is at row 17, do the same thing, but start the for loop through each row at 18.
If the name is not in the list, skip the column. If it is, make sure to grab the column next to it for the necessary values.
My problem is:
I am currently trying to use the ExcelDataReader from CodePlex (http://www.codeplex.com/ExcelDataReader). This only likes csv-like sheets, which this project does not have.
I do not know of any alternative Excel readers.
To the best of my knowledge, a straight FileStream traversal of this file can only go row-by-row, rather than column-by-column.
To anyone still reading, thank you for your time. Any recommendations on how to proceed? Please ensure that solutions can traverse each column, not each row.
Also, please don't worry about the database stuff, or the list of names that precedes the traversal.
Addendum: What I'd really like to end up with is some type of table that I can just traverse with a nested loop, making column-centric traversal much, much easier. Because there is so much garbage near the top of the sheet (14+ rows), most simple solutions are not feasible.
If you want to read from Excel in C#, I've used this library with great success; it'll give you the flexibility to parse columns/rows however you'd like:
http://sourceforge.net/projects/koogra/ (read-only)
Other open source libraries I haven't used but which could be good:
http://nexcel.sourceforge.net/ (read-only)
http://npoi.codeplex.com/ (can read and write)
http://developer.novell.com/wiki/index.php/Poi.Net (this project is dead)
Alternatively, you can use one of the many good Java libraries, and convert it into a C# assembly using IKVM:
http://jxls.sourceforge.net/
http://www.andykhan.com/jexcelapi/
http://poi.apache.org/ (this one's the grand-daddy of java XLS libraries)
I've covered how to do the IKVM Java -> C# conversion here (it's really not as horrible an option as you think):
http://splinter.com.au/blog/?p=207
Not a straight answer to your question but an alternative idea:
Your data looks like a pivot-ish table. I'd recommend "unpivoting" it into a simple table.
Example:
Russia USA
Q1 123 323
Q2 456 321
Q3 567 843
Becomes:
Quarter Country Value
Q1 Russia 123
Q1 USA 323
Q2 Russia 456
....
If that is the case (not sure if I got this right from your question), then processing the data using an OleDB driver or some CSV-style approach should become much less painful.
You can access Excel directly using ADO.NET via the ODBC driver. See http://www.davidhayden.com/blog/dave/archive/2006/05/26/2973.aspx or Google for more info on how to do that. You may wish to try HDR=No in your connection string, since your first row isn't really proper headers by the looks of it.
I haven't done this for a while, but I remember that it is a bit "temperamental" and takes some playing around with to get the column names right, but it should work. Try SELECT * FROM [Sheet1$] and see what you get.
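A rough sketch of that idea using the OLE DB route; the Jet 4.0 provider, the sheet name Sheet1 and the number of junk rows to skip are assumptions here. Loading the sheet into a DataTable gives you the column-centric traversal asked for.

using System;
using System.Data;
using System.Data.OleDb;

class ExcelReader
{
    static DataTable ReadSheet(string path)
    {
        // HDR=No because the real headers are buried ~15 rows down the sheet.
        string connStr =
            "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" + path +
            ";Extended Properties=\"Excel 8.0;HDR=No;IMEX=1\"";

        using (var conn = new OleDbConnection(connStr))
        using (var adapter = new OleDbDataAdapter("SELECT * FROM [Sheet1$]", conn))
        {
            var table = new DataTable();
            adapter.Fill(table);
            return table;   // iterate table.Columns / table.Rows in any order you like
        }
    }

    static void Main()
    {
        DataTable sheet = ReadSheet("data.xls");
        // Column-centric traversal: outer loop over columns, inner loop over rows.
        for (int col = 0; col < sheet.Columns.Count; col++)
            for (int row = 14; row < sheet.Rows.Count; row++)   // skip the ~14 junk rows
                Console.Write(sheet.Rows[row][col] + " ");
    }
}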
I highly recommend saving this Excel document in a CSV format before doing anything else with it. You can do so using this code.
After you have a CSV, you can either parse it using that library, or write your own parser for it.
As I have done before, I prefer to use an OLEDB connection to connect to an Excel document.
By the way, you can take a look at the following article for more information:
http://www.codeproject.com/KB/office/excel_using_oledb.aspx
SpreadsheetGear for .NET can load workbooks and access any cells on any sheet in any order. You can get the formatted text of the cell (such as "1/1/09") or the underlying value ("1/1/09" is stored as the double 39814.0 in Excel or SpreadsheetGear).
You can see some live ASP.NET samples here and download the free trial here if you want to try it yourself.
Disclaimer: I own SpreadsheetGear LLC
