So, I have a huge query that I need to run on an Access DB. I am attempting to use a for loop to break it down, because I can't run the query all at once (it has an IN with 50k values). The reader is causing all kinds of problems, hanging and such. Most times, when I break the loop up into 50-10,000 values per query, the reader reads exactly 400 values, then hangs for about 3 minutes, then does another hundred or so, hangs again, and so on ad infinitum. If I do over 10k values per query it gets to 2696, hangs, does another 1k or so, hangs, on and on. I have never really worked with ODBC, SQL, or any type of database for that matter, so it must be something stupid, or is this expected? Maybe there's a better way to do something like this? Here's my code that is looped:
//connect to mdb
OdbcConnection mdbConn = new OdbcConnection();
mdbConn.ConnectionString = @"Driver={Microsoft Access Driver (*.mdb)};DBQ=C:\PINAL_IMAGES.mdb;";
mdbConn.Open();
OdbcCommand mdbCmd = mdbConn.CreateCommand();
mdbCmd.CommandText = @"SELECT RAW_NAME,B FROM 026_006_886 WHERE (B='CM1' OR B='CM2') AND MERGEDNAME IN " + imageChunk;
OdbcDataReader mdbReader = mdbCmd.ExecuteReader();
while (mdbReader.Read())
{
    sw.WriteLine(@"for /R %%j in (" + mdbReader[0] + @") do move %%~nj.tif .\" + mdbReader[1] + @"\done");
    linesRead++;
    Console.WriteLine(linesRead);
}
mdbConn.Close();
Here's how I populate the imageChunk variable for the IN clause, by reading up to 5000 lines (one value per line) from a text file with a StreamReader:
string imageChunk = "(";
for (int j = 0; j < 5000; j++)
{
    string image;
    if ((image = sr.ReadLine()) != null)
    {
        imageChunk += @"'" + image + @"',";
    }
    else
    {
        break;
    }
}
imageChunk = imageChunk.Substring(0, imageChunk.Length - 1);
imageChunk += ")";
Your connection to the DB and execution of the queries seem OK to me. I suspect the "hanging" is coming from running the query multiple times. A couple of tips for speed: columns B and MERGEDNAME should have indexes on them. Refactoring your table structure may also improve speed. Are your MergedNames truly random? If so, you are probably stuck with the speed you have. As @Remou suggests, I would also compare the total runtime of uploading your MergedNames list to a table, joining that table to get your results, and deleting the table on completion.
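A rough sketch of that upload-and-join idea against the question's own connection is below; the MERGEDNAME_TMP table name, the TEXT(255) column size, and the reuse of the question's sr StreamReader are all assumptions, and cleanup/error handling is omitted:
// Create a scratch table to hold the 50k names (name and size are assumed).
using (OdbcCommand create = mdbConn.CreateCommand())
{
    create.CommandText = "CREATE TABLE MERGEDNAME_TMP (MERGEDNAME TEXT(255))";
    create.ExecuteNonQuery();
}

// Load the names with a parameterized insert instead of a giant IN list.
using (OdbcCommand insert = mdbConn.CreateCommand())
{
    insert.CommandText = "INSERT INTO MERGEDNAME_TMP (MERGEDNAME) VALUES (?)";
    OdbcParameter p = insert.Parameters.Add("name", OdbcType.VarChar, 255);
    string image;
    while ((image = sr.ReadLine()) != null)
    {
        p.Value = image;
        insert.ExecuteNonQuery();
    }
}

// Join against the scratch table, then drop it when you're done.
mdbCmd.CommandText = @"SELECT t.RAW_NAME, t.B
                       FROM 026_006_886 AS t
                       INNER JOIN MERGEDNAME_TMP AS m ON t.MERGEDNAME = m.MERGEDNAME
                       WHERE t.B = 'CM1' OR t.B = 'CM2'";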
Ended up using a data adapter... Was slow but provided constant feedback instead of freezing up. Never really got a good answer why, but got some advice on smarter ways to perform a large query.
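For reference, a minimal sketch of the data-adapter route, reusing the question's connection string and query chunk; this is an assumption about how it was wired up, not the code that was actually used:
// Requires System.Data and System.Data.Odbc.
// Pull one chunk of results into a DataTable instead of holding a reader open.
DataTable results = new DataTable();
using (OdbcDataAdapter adapter = new OdbcDataAdapter(
    @"SELECT RAW_NAME,B FROM 026_006_886 WHERE (B='CM1' OR B='CM2') AND MERGEDNAME IN " + imageChunk,
    @"Driver={Microsoft Access Driver (*.mdb)};DBQ=C:\PINAL_IMAGES.mdb;"))
{
    adapter.Fill(results); // the adapter opens and closes the connection itself
}

foreach (DataRow row in results.Rows)
{
    sw.WriteLine(@"for /R %%j in (" + row[0] + @") do move %%~nj.tif .\" + row[1] + @"\done");
}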
Related
I'm running a very simple function that reads lines from a text file in batches. Each line contains an SQL query, so the function grabs a specified number of queries, executes them against the SQL database, then grabs the next batch of queries until the entire file is read. The problem is that over time, with very large files, the process starts to slow considerably. I'm guessing there is a memory leak somewhere in the function but can't determine where it may be. There is nothing else going on while this function is running. My programming skills are crude at best, so please go easy on me. :)
for (int x = 0; x <= totalBatchesInt; x++)
{
    var lines = System.IO.File.ReadLines(file).Skip(skipCount).Take(batchSize).ToArray();
    string test = string.Join("\n", lines);
    SqlCommand cmd = new SqlCommand(test.ToString());
    try
    {
        var rowsEffected = qm.ExecuteNonQuery(CommandType.Text, cmd.CommandText, 6000, true);
        totalRowsEffected = totalRowsEffected + rowsEffected;
        globalRecordCounter = globalRecordCounter + rowsEffected;
        fileRecordCounter = fileRecordCounter + rowsEffected;
        skipCount = skipCount + batchSize;
        TraceSource.TraceEvent(TraceEventType.Information, (int)ProcessEvents.Starting,
            "Rows progress for " + folderName + "_" + fileName + " : " + fileRecordCounter.ToString() + " of " + linesCount + " records");
    }
    catch (Exception esql)
    {
        TraceSource.TraceEvent(TraceEventType.Information, (int)ProcessEvents.Cancelling,
            "Error processing file " + folderName + "_" + fileName + " : " + esql.Message.ToString() + ". Aborting file read");
    }
}
There are many things wrong with your code:
1. You never dispose your command. That's a native handle to the underlying database driver; waiting for the garbage collector to release it is very bad practice.
2. You shouldn't be sending those commands individually anyway. Either send them all at once in one command, or use transactions to group them together (see the transaction sketch at the end of this answer).
3. This one is the reason it gets slower over time: File.ReadLines(file).Skip(skipCount).Take(batchSize) re-reads the file from the beginning on every iteration and skips a growing number of already-processed lines each time, so every batch takes longer than the one before.
To fix #3, simply create the enumerator once and iterate it in batches. In pure C#, you can do something like:
using var enumerator = File.ReadLines(file).GetEnumerator();
for (int x = 0; x <= totalBatchesInt; x++)
{
    var lines = new List<string>();
    // check the count first so a line isn't consumed and dropped once the batch is full
    while (lines.Count < batchSize && enumerator.MoveNext())
        lines.Add(enumerator.Current);
    string test = string.Join("\n", lines);
    // your code...
}
Or if you're using Morelinq (which I recommend), something like this:
foreach(var lines in File.ReadLines(file).Batch(batchSize))
{
// your code...
}
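For point 2, a hedged sketch of grouping each batch in a transaction; it assumes you can get at the underlying SqlConnection (the conn variable here is hypothetical) instead of going through qm, and it reuses the test string built in the question:
// Hypothetical: conn is an open SqlConnection; test is the batch of statements joined from the file.
using (SqlTransaction tx = conn.BeginTransaction())
using (SqlCommand cmd = new SqlCommand(test, conn, tx))
{
    cmd.CommandTimeout = 6000;  // mirrors the timeout passed to qm
    var rowsEffected = cmd.ExecuteNonQuery();
    tx.Commit();                // the whole batch commits or none of it does
}
This also takes care of point 1, since the using blocks dispose the command and the transaction.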
I want to insert 40000 rows into Cassandra with a batch. But it always stops at number 32769 and gives me a System.ArgumentOutOfRangeException. What should I do to insert more than 32769 rows into Cassandra?
Here is my code:
// Build the DCS data
DateTime ToDay = DateTime.Today;
string LotStr = ToDay.ToString("yyMMdd");
DateTime NowTime = DateTime.Now;
List<DCS_Model> DCS_list = new List<DCS_Model>();
Random rnd = new Random();
for (int i = 1; i <= 40000; i++)
{
    DCS_list.Add(new DCS_Model(LotStr, String.Format("Tag_{0}", i), rnd.Next(1000) + rnd.NextDouble(), NowTime, NowTime));
}
// Upload to Cassandra
DateTime tt = DateTime.Now;
Cluster cluster = Cluster.Builder().AddContactPoint("192.168.52.182").Build();
ISession session = cluster.Connect("testkeyspace");
//List<PreparedStatement> StatementLs = new List<PreparedStatement>();
var InsertDCS = session.Prepare("INSERT INTO DCS_Test (LOT, NAME, VALUE, CREATETIME, SERVERTIME) VALUES (?, ?, ?, ?, ?)");
var batch = new BatchStatement();
foreach (DCS_Model dcs in DCS_list)
{
    batch.Add(InsertDCS.Bind(dcs.LOT, dcs.NAME, dcs.VALUE, dcs.CREATETIME, dcs.SERVERTIME));
}
session.Execute(batch);
//Row result = session.Execute("select * from TestTable").First();
TimeSpan CassandraTime = DateTime.Now - tt;
//Console.WriteLine(CassandraTime);
It stops at batch.Add(InsertDCS.Bind(dcs.LOT, dcs.NAME, dcs.VALUE, dcs.CREATETIME, dcs.SERVERTIME)) once the batch has had 32768 statements added to it.
Please help me. Thanks!!
Batch functionality in the RDBMS world does not even remotely mirror batch functionality with Cassandra. They might be named the same, but they were designed for different purposes. In fact, Cassandra's should probably be renamed to "atomic" to avoid confusion.
Instead of batching them together all at once, try sending 40k individual requests asynchronously, with futures so that you know when they are all done. In the C# driver, the equivalent of Java's ListenableFuture is the Task returned by ExecuteAsync (or a TaskCompletionSource if you need to build your own). You should look into that.
Sending 40k individual transactions might seem counter-intuitive. But it certainly beats hammering one Cassandra node as a coordinator (along with all the network traffic that the node will generate) to process and ensure atomicity for 40k upserts.
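A minimal sketch of that individual-async-request approach with the DataStax C# driver, reusing the session, InsertDCS, and DCS_list from the question; the group size of 1000 used to throttle the in-flight requests is an arbitrary assumption:
// Requires System.Threading.Tasks. Fire the inserts individually and asynchronously,
// waiting in modest groups so the cluster isn't flooded with 40k simultaneous requests.
var pending = new List<Task>();
foreach (DCS_Model dcs in DCS_list)
{
    pending.Add(session.ExecuteAsync(
        InsertDCS.Bind(dcs.LOT, dcs.NAME, dcs.VALUE, dcs.CREATETIME, dcs.SERVERTIME)));

    if (pending.Count >= 1000)   // assumed throttle size; tune for your cluster
    {
        Task.WaitAll(pending.ToArray());
        pending.Clear();
    }
}
Task.WaitAll(pending.ToArray()); // wait for the remainder to finish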
Also, make sure to use the Token Aware load balancing policy. That will direct your upsert to the exact node that it needs to go (saving you a network hop from using a coordinator).
Cluster cluster = Cluster.Builder()
    .AddContactPoint("192.168.52.182")
    .WithLoadBalancingPolicy(new TokenAwarePolicy(new DCAwareRoundRobinPolicy("westDC")))
    .Build();
I found that the source code of BatchStatement throws an exception when the add count is more than Int16.MaxValue. So I changed the source code, and that solved the problem!!
I'm writing a C# application that runs a number of regular expressions (~10) on a lot (~25 million) of strings. I did try to google this, but any searches for regex with "slows down" are full of tutorials about how backreferencing etc. slows down regexes. I am assuming that this is not my problem because my regexes start out fast and slow down.
For the first million or so strings it takes about 60ms per 1000 strings to run the regular expressions. By the end, it's slowed down to the point where it's taking about 600ms. Does anyone know why?
It was worse, but I improved it by using instances of Regex instead of the cached static methods, and by compiling the expressions that I could.
Some of my regexes need to vary; for example, depending on the user's name it might be
mike said (\w*) or john said (\w*)
My understanding is that it is not possible to compile those regexes and pass in parameters (e.g. saidRegex.Match(inputString, userName)).
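To illustrate what I mean, each user gets their own pattern built from a string; a per-user cache of compiled instances would look roughly like the sketch below (the helper and its names are purely illustrative, not my actual code, and RegexOptions.Compiled can still be used because the name is baked into the pattern string rather than passed as a parameter):
// Requires System.Collections.Generic and System.Text.RegularExpressions.
// Illustrative only: build (and cache) one compiled Regex per user name.
static readonly Dictionary<string, Regex> saidRegexCache = new Dictionary<string, Regex>();

static Regex GetSaidRegex(string userName)
{
    Regex regex;
    if (!saidRegexCache.TryGetValue(userName, out regex))
    {
        regex = new Regex(Regex.Escape(userName) + @" said (\w*)", RegexOptions.Compiled);
        saidRegexCache[userName] = regex;
    }
    return regex;
}

// Usage: Match m = GetSaidRegex("mike").Match(inputString);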
Does anyone have any suggestions?
[Edited to accurately reflect speed - was per 1000 strings, not per string]
This may not be a direct answer to your question about RegEx performance degradation - which is somewhat fascinating. However - after reading all of the commentary and discussion above - I'd suggest the following:
Parse the data once, splitting out the matched data into a database table. It looks like you're trying to capture the following fields:
Player_Name | Monetary_Value
If you were to create a database table containing these values per-row, and then catch each new row as it is being created - parse it - and append to the data table - you could easily do any kind of analysis / calculation against the data - without having to parse 25M rows again and again (which is a waste).
Additionally - on the first run, if you were to break the 25M records down into 100,000 record blocks, then run the algorithm 250 times (100,000 x 250 = 25,000,000) - you could enjoy all the performance you're describing with no slow-down, because you're chunking up the job.
In other words - consider the following:
Create a database table as follows:
CREATE TABLE PlayerActions (
RowID INT PRIMARY KEY IDENTITY,
Player_Name VARCHAR(50) NOT NULL,
Monetary_Value MONEY NOT NULL
)
Create an algorithm that breaks your 25m rows down into 100k chunks. Example using LINQ / EF5 as an assumption.
public void ParseFullDataSet(IEnumerable<String> dataSource) {
    var rowCount = dataSource.Count();
    var setCount = rowCount / 100000;      // integer division floors automatically
    if (rowCount % 100000 != 0)
        setCount++;
    for (int i = 0; i < setCount; i++) {
        var set = dataSource.Skip(i * 100000).Take(100000);
        ParseSet(set);
    }
}
public void ParseSet(IEnumerable<String> dataSource) {
    String playerName = String.Empty;
    decimal monetaryValue = 0.0m;
    // Assume here that the method reflects your RegEx generator.
    String regex = RegexFactory.Generate();
    foreach (String data in dataSource) {
        Match match = Regex.Match(data, regex);
        if (match.Success) {
            playerName = match.Groups[1].Value;
            // Might want to add error handling here.
            monetaryValue = Convert.ToDecimal(match.Groups[2].Value);
            db.PlayerActions.Add(new PlayerAction() {
                // ID = ..., // Set at DB layer using Auto_Increment
                Player_Name = playerName,
                Monetary_Value = monetaryValue
            });
            db.SaveChanges();
            // If not using Entity Framework, use another method to insert
            // a row to your database table.
        }
    }
}
Run the above one time to get all of your pre-existing data loaded up.
Create a hook someplace which allows you to detect the addition of a new row. Every time a new row is created, call:
ParseSet(new List<String>() { newValue });
or if multiples are created at once, call:
ParseSet(newValues); // Where newValues is an IEnumerable<String>
Now you can do whatever computational analysis or data mining you want from the data, without having to worry about performance over 25m rows on-the-fly.
Regex does take time to compute. However, you can make it more compact using some tricks.
You can also use string functions in C# to avoid the regex engine altogether.
The code would be lengthier but might improve performance.
String has several functions to cut and extract characters and do the pattern matching you need,
e.g. IndexOfAny, LastIndexOf, Contains....
string str = "mon";
string[] str2 = new string[] { "mon", "tue", "wed" };
// Array.IndexOf returns the element's position, or -1 if the string is not in the array
if (Array.IndexOf(str2, str) >= 0)
{
    //success code//
}
Possible Duplicate:
How to efficiently write to file from SQL datareader in c#?
I am currently trying to create a web application that uses read-only access to allow users to download large files from our database. The table in question has 400,000 records in it and generates a 50 MB .csv file when exported.
It takes about 7s to run the statement "SELECT * FROM [table]" on SQL server, and about 33s to do so from my web application (hosted on a different server). This is reading all the data into a System.Data.SqlClient.SqlDataReader object.
My problem is that I am at a loss for converting my SqlDataReader to a .csv file. Converting each row of the SqlDataReader to a string and outputting that string to a file line by line takes almost 2 hours, which is unacceptable. Below is the code I'm using to create a file on the web application's server:
while (rdr.Read())
{
    string lineout = "";
    for (int index = 0; index < rdr.FieldCount; index++)
        lineout += rdr[index].ToString().Replace(',', ' ') + ',';
    write(lineout, filename); //uses StreamWriter.WriteLine()
}
There has to be a better way. I've looked around and seen a lot of suggestions that essentially recommend doing the above to create a file. This works great with smaller tables, but not with the two really large ones we use every day. Can anyone give me a push in the right direction?
You could try building your lineout with a StringBuilder rather than manually concatenating strings:
//you can test whether it makes any difference in performance declaring a single
//StringBuilder and clearing, or creating a new one per loop
var sb = new StringBuilder();
while (rdr.Read())
{
    for (int index = 0; index < rdr.FieldCount; index++)
        sb.Append(rdr[index].ToString().Replace(',', ' ')).Append(',');
    write(sb.ToString(), filename); //uses StreamWriter.WriteLine()
    sb.Clear();
}
Alternatively try to just write to the file directly and avoid generating each line in memory first:
//assume a StreamWriter instance has been created called sw...
while (rdr.Read())
{
    for (int index = 0; index < rdr.FieldCount; index++)
    {
        sw.Write(rdr[index].ToString().Replace(',', ' '));
        sw.Write(',');
    }
    sw.WriteLine();
}
//flush and close stream
Here I am storing the elements of a datagrid in a string builder using a for loop, but it takes too much time when there is a large number of rows. Is there another way to copy the data in to a string builder in less time?
for (int a = 0; a < grdMass.RowCount; a++)
{
    if (a == 0)
    {
        _MSISDN.AppendLine("'" + grdMass.Rows[a].Cells[0].Value.ToString() + "'");
    }
    else
    {
        _MSISDN.AppendLine(",'" + grdMass.Rows[a].Cells[0].Value.ToString() + "'");
    }
}
There is no way to improve this code given the information you have provided. This is simply a for loop that appends strings to a StringBuilder - there isn't a whole lot going on here that can be optimized.
This may be one of those cases where something takes a long time simply because you are processing a lot of data. Perhaps there is a way to cache this data so you don't have to generate it as often. Is there anything else you can tell us that would help us find a better way to do this?
Side note: It is very important that you validate your suspicions as to the particular section of code that is causing the slowness. Do this by profiling your code so that you don't spend time trying to fix a problem that exists elsewhere.
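For example, a quick way to validate that suspicion is to time the loop with System.Diagnostics.Stopwatch; the timer variable is illustrative, and the loop body is the one from the question:
// Requires System.Diagnostics for Stopwatch.
var timer = Stopwatch.StartNew();
for (int a = 0; a < grdMass.RowCount; a++)
{
    _MSISDN.AppendLine("'" + grdMass.Rows[a].Cells[0].Value.ToString() + "'");
}
timer.Stop();
Console.WriteLine("Grid loop took {0} ms", timer.ElapsedMilliseconds);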
As others have said, the StringBuilder is about as fast as you're going to get, so assuming this is the only bit of code that could be causing your slow down, there's probably not much you can do... but you could slightly optimise it by removing the small amount of string concatenation you are doing. I.e:
for (int a = 0; a < grdMass.RowCount; a++)
{
    if (a == 0)
    {
        _MSISDN.Append("'");
    }
    else
    {
        _MSISDN.Append(",'");
    }
    _MSISDN.Append(grdMass.Rows[a].Cells[0].Value);
    _MSISDN.AppendLine("'");
}
Edit: You could also clean up the if statement (although I highly doubt it's having a noticeable effect) like so:
//First row
if (grdMass.RowCount > 0)
{
    _MSISDN.Append("'");
    _MSISDN.Append(grdMass.Rows[0].Cells[0].Value);
    _MSISDN.AppendLine("'");
}
//Second row onwards
for (int a = 1; a < grdMass.RowCount; a++)
{
    _MSISDN.Append(",'");
    _MSISDN.Append(grdMass.Rows[a].Cells[0].Value);
    _MSISDN.AppendLine("'");
}
I suspect that it's not the string building that takes a long time; perhaps it's accessing the grid elements that is slow.
You could rewrite your code like this:
var cellValues = grdMass.Rows
    .Cast<DataGridViewRow>()
    .Select(r => "'" + r.Cells[0].Value.ToString() + "'")
    .ToArray();
return String.Join(",", cellValues);
Now you can verify which part takes the most time. Is it building the cellValues array, or is it the String.Join call?
StringBuilder is pretty much as fast as it gets for building up strings -- and that is pretty goshdarned fast. If StringBuilder is too slow, you are probably trying to process too much data in one go. Are you sure it is really the string building which is slow and not some other part of the processing?
One tip that will speed up StringBuilder for very large strings: set the capacity up front. That is, call the StringBuilder(int) constructor instead of the default constructor, passing an estimate of the number of characters you plan to write. It will still expand if you underestimate -- this just saves the initial "well, 1K wasn't enough, time to allocate another 2K... 4K... etc." But this will make only a small difference, and only if your strings are very long.
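For instance, assuming roughly 20 characters per row (an arbitrary estimate), the pre-sized builder would look like:
// 20 characters per row is an assumed estimate; adjust it to your data.
var _MSISDN = new StringBuilder(grdMass.RowCount * 20);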
This would be better....
if (grdMass.RowCount > 0)
{
    _MSISDN.AppendLine("'" + grdMass.Rows[0].Cells[0].Value.ToString() + "'");
    for (int a = 1; a < grdMass.RowCount; a++)
    {
        _MSISDN.AppendLine(",'" + grdMass.Rows[a].Cells[0].Value.ToString() + "'");
    }
}